History of Quantitative Linguistics in France

Studies in Quantitative Linguistics 24: Editorial

"History of Quantitative Linguistics in France". ISBN: 978-3-942303-48-4


Introduction

Jacqueline Léon,

UMR 7597 HTL,

jacqueline.leon@univ-paris-diderot.fr

Sylvain Loiseau,

UMR 8202 Sedyl,

sylvain.loiseau@univ-paris13.fr

Scope of this volume

This volume gives a historical account of the field of quantitative linguistics in France. It focuses on developments initiated in France involving mathematical methods or the use and interpretation of quantitative data. It does not include material about corpus compilation, computational implementation, or formal modelling.

Quantitative linguistics in France

  1. Early quantitative linguistics

Historically, quantitative linguistics has a specific status in France. It could be said that statistical studies, namely statistical studies of vocabulary, opened the way to the reception of formal languages and the computerization of linguistics in France. On the one hand, French statistical works were deeply anchored in the French linguistic tradition, mainly concerned with philology, etymology, dialectology, stylistics and studies of specialized vocabularies. On the other hand, contrary to the USA, the USSR and Great Britain, France lagged significantly behind in computing, logic and formal languages. In fact, the field of vocabulary statistics played a crucial role through the rivalry it introduced between the two approaches, formal and quantitative. This is shown by the multitude of denominations naming the field and subfields that the Americans have referred to as 'computational linguistics'.[1]

The Centre Favard, also named the 'Seminar of Quantitative Linguistics', created in March 1960 at the Henri Poincaré Institute of the Faculté des Sciences de Paris, significantly exemplifies that issue. Under the name 'Seminar of Quantitative Linguistics' were brought together both the formal aspects of linguistics and statistical methods. It was an important place for training linguists in mathematics, logic, information theory, set theory, language theory, statistical linguistics, and more generally statistics and probability theory.

A further example is the classification introduced by Solomon Marcus during the Séminaire International de Linguistique Formelle that took place in Aiguille in 1968. He put forward a classification of the subfields of formal linguistics, the term that subsumed the whole set (see Desclés & Fuchs 1969). Marcus distinguished between algebraic linguistics (for instance Chomsky and Schützenberger's work on monoids) and mathematical linguistics (using Markov chains), the latter involving probabilistic linguistics and quantitative linguistics; automatic, computational and cybernetic linguistics; and finally applied linguistics. It should be noted that Marcus included neither works on formal grammars nor vocabulary statistics within computational linguistics.

Yet another grouping was offered by Bernard Vauquois, a major leader in the field of machine translation in France. He grouped generative grammar and statistical studies of vocabulary on one side, and natural language processing and semantic formalisation on the other. His reasons were probably more political than epistemological, as he aimed to ensure that Natural Language Processing be recognized by the CNRS.

It is worth noting that these various classifications do not reflect the distinctions made by computational linguistics as defined in 1962 by American institutions, such as the Association for Machine Translation and Computational Linguistics (AMTCL), and by the ALPAC report in 1966. Computational linguistics claimed to involve every theoretical aspect of the interaction between formal languages, linguistics and programming on the one hand, and the practical aspects of language engineering on the other. The whole set would be taken over by NLP in the 1970s. As can be seen, for the Americans, and unlike for the French, statistical studies did not pertain to computational linguistics.

  2. Current quantitative linguistics in France

How can the field of quantitative linguistics in France be characterized? Let us stress, to start with, that the term 'quantitative linguistics' (or a translation such as linguistique quantitative) is not really used in French nowadays. The field is mainly termed lexicométrie (cf. infra), linguistique de corpus (corpus linguistics), or statistique linguistique (statistical linguistics). The fact that the term 'quantitative linguistics' is not widely used reflects the fact that research tends to focus not so much on quantitative laws as on the historically situated interpretation of quantitative data in corpora.

In order to characterize these local developments amongst the various possible avenues of research in the field of quantitative linguistics, we can draw a basic typology of the different kinds of quantitative linguistics. A good criterion for such a typology is the object they focus on and its degree of abstraction (Loiseau 2010): quantitative linguistic works can be divided into (i) those that seek universal tendencies, irrespective of any particular language, at a very abstract level; (ii) those that work on quantitative tendencies at the scale of a particular linguistic system; and (iii) those that work on quantitative tendencies at the scale of a genre, a discourse, or a speaker, for instance.

The best-known example of the first type (universal law) is Zipf's law: it is valid for any language and refers to the general economy of the language faculty. Such universalist research is at the heart of the denomination 'quantitative linguistics'.
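As a reminder for the reader (a gloss of ours, not part of the contributions summarized here), Zipf's law states that the frequency $f(r)$ of the word of rank $r$ in the frequency list of a text is approximately

\[
f(r) = \frac{C}{r^{a}}, \qquad a \approx 1,
\]

a law that Mandelbrot later generalized to $f(r) = C/(r+b)^{a}$ (see the chapters by Le Roux and by Léon below).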

An example of the second type could be the notion of functional burdening of a phonological distinction (e.g. Herdan 1958).[2] Another example could be the analysis of morphological productivity (Baayen 2009): the formula proposed for computing the productivity index is supposed to be valid for any (fusional) language, but it nevertheless aims at quantifying the productivity of a given morpheme in a given language. It focuses on the linguistic system.
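For reference, the best known of these productivity indices (our gloss of the formula alluded to above, in Baayen's formulation) is

\[
\mathcal{P} = \frac{n_{1}}{N},
\]

where $N$ is the number of tokens carrying a given affix in a corpus and $n_{1}$ is the number of hapax legomena among them: a high proportion of hapaxes signals a morphological process that is still actively coining new words.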

Examples of the third type are now numerous with the development of corpus linguistics: many works try to characterize, through quantitative properties, a genre, a discourse, a style, or a variety, according to a corpus representative of that socio-/idiolect. Methods of descriptive statistics or statistical modelling applied to corpora most of the time aim at describing such corpora.

With this small typology in mind, we can try to better characterize the developments of quantitative linguistics in France. Universalist quantitative models of linguistic data are represented by Mandelbrot's works on the statistical law of the distribution of words and by information theory (Le Roux; Léon). We can also add to that group works on the morphodynamic paradigm (Petitot, and also Ploux) and Guiraud's work (Bergounioux).

However, the bulk of the works focuses on sociolect/idiolect descriptions and on the analysis of the links between discursive phenomena on the one hand and historical/ideological conditioning on the other. This applies to the joint works by historians and lexicologists (Mayaffre), to the lexicométrie school (Loiseau; Brunet; Longrée and Mellet), and to the works done under the umbrella of the TLF (Trésor de la langue française), a large dictionary based on a corpus of French texts (Candel).

As in several other countries, the field of quantitative linguistics in France arose during the first half of the 20th century and developed mainly during its second half. Today, the major part of the research in this field has been incorporated into an international research field and no longer has any national peculiarity. However, some subfields, such as lexicométrie, are still mainly developed in France, and a focus on texts remains strong among French scholars.

  3. Interviews with actors of the field

In the process of preparing this volume, several interviews were conducted with actors and witnesses of the field: Robert Nicolaï (University of Nice), Micheline Petruszewycs (EHESS), Pierre Lafon (ENS Lyon), Jean Petitot (EHESS), †Maurice Tournier (ENS Lyon), and Évelyne Bourion (UMR Modyco). These interviews aimed at establishing the institutional context of quantitative linguistics in France.

Robert Nicolaï witnessed the development of quantitative linguistics at the University of Nice around the prominent figure of Pierre Guiraud. Nicolaï was Guiraud's student at the University of Nice in the late 1960s and early 1970s. He recalls that Guiraud's lectures focused on lexicology and semantics more than on statistics. Guiraud was especially concerned with morphosemantic roots and the etymological structures of French, which can only be tackled with large lexical data allowing one to deal with semantic universals (see Bergounioux, this volume, for more details).

Guiraud created the department of General Linguistics at the University of Nice and, with Gabriel Manessy, a research group named Ideric (Institut de Recherche Interethnique et Interculturel) [Institute of Interethnic and Intercultural Research], which Nicolaï directed after Guiraud's death.

The interview with Jean Petitot turned into a chapter in this volume (Petitot, Léon, Loiseau, this volume).

Interviews with Pierre Lafon, Maurice Tournier, Évelyne Bourion and Micheline Petruszewycs focused mainly on the development of the lexicométrie school. Micheline Petruszewycs was the assistant of the mathematician Georges-Théodule Guilbaud (1912-2008). Guilbaud was the founder of a laboratory, the 'Center for Analysis and Mathematics for the Social Sciences' (Centre d'analyse et de mathématiques sociales), at the 6th section of the EPHE (École pratique des hautes études). He also ran two seminars for years.[3] The Friday seminar was devoted to mathematics for the social sciences and was very influential. It was attended by various people, such as the mathematicians Pierre Achard, Bernard Jaulin and Simon Reignier, the ethnologist Robert Jaulin, and many psychologists, among them François Bresson and his students. The composer Iannis Xenakis used to attend this seminar. It should be said that statistics, and more generally mathematics, were well regarded by the social scientists of the EPHE (6th section). Guilbaud was even invited to give a course on statistics within Lévi-Strauss's seminar. The Thursday seminar focused on linguistics, more specifically on lexicometry, with the participation of Pierre Lafon, Annie Geoffroy, Maurice Tournier, and André Salem (who studied mathematics in Moscow with Andrej Kolmogorov), as well as other members of the Laboratoire de Lexicométrie de St Cloud. Guilbaud introduced the hypergeometric law, which helped solve some poorly explicated formulations by the St Cloud group.
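For the reader unfamiliar with it, the hypergeometric model referred to here (our gloss, not Guilbaud's original formulation) gives the probability of observing exactly $k$ occurrences of a word in a subcorpus of $n$ tokens drawn from a corpus of $N$ tokens in which the word occurs $K$ times:

\[
P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}},
\]

which underlies the 'specificities' computations later systematized by the St Cloud group.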

Presentation of the contributions

The volume gathers contributions either by people who have been involved in the field of which they give an account, or by people specialized in the history of linguistics. In both cases, however, the same historical focus has been adopted by all contributors.

Some contributions focus on individuals (such as Thom, Benzécri, or Guiraud), while others focus on larger fields of research (chapters by Léon, Loiseau, Longrée and Mellet).

The contributions have been organised in two parts. The first one focuses on vocabulary statistics. The second one gathers contributions presenting mathematical models.

The first part, 'Vocabulary statistics', includes four papers on the pioneering works of the 1950s-60s and three papers on contemporary research. Some emblematic personalities and projects played a significant role in these early years, and it is not surprising that they are addressed in several chapters: Pierre Guiraud and Georges Gougenheim (Bergounioux, Léon); Benoît Mandelbrot (Léon, Le Roux); the TLF (Brunet, Candel).

Jacqueline Léon deals with early French statistical linguistics and its institutionalisation: after presenting the role of the instigators, Mario Roques (1875-1961) and Marcel Cohen (1884-1974), she examines the three paths followed by the pioneers: (i) the teaching track, with Georges Gougenheim (1900-1972) and Le Français Élémentaire; (ii) the stylistic track, with Pierre Guiraud (1912-1983) and later Charles Muller (1909-2015); (iii) finally, the mathematical and information theory track, with Benoît Mandelbrot (1924-2010), René Moreau (1921-2009) and the Centre Favard. She shows that statistical studies of vocabulary contributed greatly to the changes in French linguistics that took place in the 1950s-60s. As statistical studies of vocabulary, simultaneously with the first experiments in machine translation, were the first fields to be computerized, they made possible in France the automation of linguistics that had taken place in the USA ten years earlier.

Gabriel Bergounioux dedicates a whole chapter to Pierre Guiraud (1912-1983), one of the major pioneers of quantitative linguistics in France. He emphasizes the originality of Guiraud's approach and his role in the beginnings of statistical studies of vocabulary. Guiraud published three key works in the domain: a comprehensive bibliography (1954) and two methodological essays (1954 and 1960). At first, Guiraud characterized linguistics as an observational science grounded in statistics, like sociology and economics. Later, he claimed that it was cognitively based. His approach was both stylistic (with statistical studies of Guillaume Apollinaire's and Paul Valéry's vocabulary) and etymological. For that purpose he worked out the concept of 'morpho-semantic field'. Bergounioux shows how Guiraud's position as an outsider sheds light on the conditions in which quantitative linguistics emerged in France.

Danielle Candel's chapter is a testimony to the building of the Trésor de la Langue Française, a major dictionary project (1971-1994). This project aimed at building a reference corpus providing the basis and the data for the lexicographic analyses. The corpus built for that project is thus one of the early examples of a corpus designed to be representative of a language, to assist the lexicographer in deriving meanings from observed usage in context, and large enough to extract the frequencies of lexical items. Danielle Candel shows how the quantitative approach and the use of a large-scale database are linked with many steps in the building of the dictionary.

In the chapter devoted to the 'lexicometric' school, Sylvain Loiseau sets out the theoretical assumptions and the institutional settings that led to the development of this very influential line of research. Quantitative analysis, according to lexicometry, is aimed at providing a scientific tool for the analysis of the ideological content of texts. The main methods developed in the field of lexicometry are presented, focusing on the quantitative assumptions of the method of 'specificities'. Some characteristics of lexicometry are still influential in contemporary research in corpus linguistics in France: the focus on texts, the search for an ideological 'backstage' beneath the words, and the idea that quantitative textual analysis can help provide objectivity in the analysis of such a backstage.

The chapter by Damon Mayaffre focuses on the historical study of corpora of political texts. This avenue of research originates in the development of the 'lexicometry' approach to the quantitative analysis of vocabularies in 1970s France and has always been associated with that field of research. The author shows that the lexicometry approach, aiming at unravelling social positions and ideological content, and owing to its focus on political texts, has interested historians from the beginning. Damon Mayaffre then offers several examples of the methods elaborated and of their use for the historical interpretation of political texts.

Longrée and Mellet's chapter deals with the statistical handling of a corpus of Latin texts. The authors, as Latinists and proponents of statistical studies, address the specific issue of the variability of word order in Latin, which constitutes a guiding thread for quantitative linguistics in Latin. This issue raised Latinists' interest as early as the 1970s, leading them to work up counts of various configurations in Latin texts. Taking over that type of work, Longrée and Mellet identify 'motifs' with the aim of establishing a typology of texts. Motifs, which associate lexical and grammatical constraints, subsume the notions of repeated segments, collocations and colligations, and have led to new software developments for the treatment of Latin.

Etienne Brunet deals with the history of large computerized corpora and databases of written texts in France. The first French corpus was the TLF (Trésor de la Langue Française) which, in fact, was the first computerized corpus in the world, the Brown Corpus being thought out a little later. Brunet recalls how the making of the dictionary was computer-aided, with co-occurrences and frequencies at the editors' disposal. Examining the TLF's successor, Frantext, he makes a distinction between a corpus (Frantext) and a base (TLF). Contrary to a corpus, a base has a fixed frame with ordered sections that items must fill with a number, a code or text. He compares these two French projects with American bases such as the Encarta Encyclopedia (1993-2009) and Wikipedia, and with corpora of the French language made outside France, such as the German Wortschatz, the English Sketch Engine, and the American Google Books.

The second part of the book, 'Mathematical models', focuses on seminal and influential contributions in the field of mathematical models of language. It includes three chapters on pioneering works and one chapter, by Sabine Ploux, that illustrates contemporary research. Three lines of research are represented: research on distribution laws by Benoît Mandelbrot (Ronan Le Roux); the development by Jean-Paul Benzécri of a family of factorial analysis methods, correspondence analysis, for the distributional analysis of language (Valérie Beaudouin); and the development by René Thom of catastrophe theory, a model aiming at providing mathematical tools for language modelling, reminiscent of neural networks (Jean Petitot, Jacqueline Léon, Sylvain Loiseau).

Ronan Le Roux devotes his chapter to Benoît Mandelbrot (1924-2010), another key pioneer of quantitative linguistics in France. The author shows that Mandelbrot's study of language was mostly limited to Zipf's law. He questions the sources of the mathematician's significant work on Zipf's law, pointing out the discrepancy between the horizon of retrospection (Auroux 2007) he claimed and the real background of his works. In particular, the scientific environment of the California Institute of Technology, where he stayed in the late 1940s, the inspiring figures of Wiener and von Neumann, and cybernetics played a major role in the way he tackled Zipf's law. Le Roux shows that Mandelbrot's later works on the fractal paradigm were consistent with his early work on Zipf's law, exemplifying what Le Roux identifies as 'the transversal regime of scientific modelling', a typical mode of scientific activity.

The chapter by Valérie Beaudouin accounts for the elaboration of 'correspondence analysis', a family of factorial analysis methods, by Jean-Paul Benzécri. From the middle of the 1960s, Jean-Paul Benzécri (1932-) introduced and developed a series of methods called 'Analyse des Données' (Data Analysis), whose heart is Correspondence Analysis, a method for the analysis of multidimensional data. Valérie Beaudouin traces the intellectual project behind these methods, showing that linguistic data played a major role in the elaboration of Correspondence Analysis: it aims at supporting an inductive approach to language, based on the exploration of corpora and the synthesis of large distributional patterns.
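To give the reader a concrete idea of the method, here is a minimal sketch of Correspondence Analysis as it is standardly computed today, via the singular value decomposition of standardized residuals (our illustration on toy data, not Benzécri's original programs):

```python
# Minimal sketch of Correspondence Analysis on a word-by-text
# contingency table, via SVD of the standardized residuals.
import numpy as np

# Toy contingency table: rows = words, columns = texts.
N = np.array([[20.0,  5.0,  2.0],
              [ 3.0, 15.0,  4.0],
              [ 1.0,  6.0, 18.0],
              [ 8.0,  7.0,  9.0]])

P = N / N.sum()        # correspondence matrix
r = P.sum(axis=1)      # row masses
c = P.sum(axis=0)      # column masses

# Standardized residuals: departures from row/column independence.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of rows (words) and columns (texts);
# plotting the first two dimensions yields the familiar factorial maps.
rows = (U * sv) / np.sqrt(r)[:, None]
cols = (Vt.T * sv) / np.sqrt(c)[:, None]
print(rows[:, :2])
print(cols[:, :2])
```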

The next chapter is an interview with Jean Petitot by Jacqueline Léon and Sylvain Loiseau. Jean Petitot presents in great detail the mathematical work of René Thom (1923-2002) and the application of this work to linguistics, of which Jean Petitot is one of the best specialists. René Thom defined an array of concepts (singularity, structural stability, catastrophe, bifurcation) for the mathematical modelling of morphogenesis, a field pioneered by Alan Turing that studies the formation processes of complex forms, in particular those of life. One central issue of morphogenesis is the mereological problem, i.e. how totalities can be organized with constituents, relations and transformation rules between constituents, and how totalities can show an organization which is more than the sum of their constituents. This constituency problem is of course central for linguistics. The article shows how the mathematical tools built by Thom address these issues and how these propositions are related to other theoretical frameworks or approaches, such as cognitive linguistics or neural networks. Petitot (2011) shows how the theory of dynamical systems can account for categorical perception in phonology and also in syntax, where he stresses the fundamental links between vision and syntactic structures.
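To make the flavour of these concepts concrete (our gloss, not part of the interview), consider the cusp, the best known of Thom's elementary catastrophes. It is given by the potential

\[
V_{a,b}(x) = \tfrac{1}{4}x^{4} + \tfrac{1}{2}a x^{2} + b x ,
\]

whose equilibria satisfy $x^{3} + a x + b = 0$. On the set of control parameters $(a,b)$ where the discriminant $4a^{3} + 27b^{2}$ vanishes, equilibria appear or disappear: a continuous change of $(a,b)$ can produce a discontinuous jump of the state $x$, which is the kind of qualitative, categorical behaviour Thom proposed as a model of linguistic structures.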

Sabine Ploux presents an approach aiming at modelling the polysemy and the contextual variation of the meanings of lexical items using graph theory. She first shows the limits of two other paradigms: the dynamic approach, illustrated by René Thom (cf. the chapter by Petitot, Léon & Loiseau), and the linear model, illustrated by the factorial correspondence analysis elaborated by Benzécri (cf. the chapter by Beaudouin). Both use the context of lexical units to model their meaning. The former (as well as connectionist approaches) adequately models the structural stability of a concept or a category despite its 'deformations' in context; however, it can be applied to only a few lexical units. The latter is based on the whole lexicon but gives a static core meaning: this approach, today called vector space modelling, does not give access to the processes by which lexical meanings are built, but provides a static representation of them. Ploux presents an alternative approach based on graph theory. In graphs built from large corpora, lexemes are represented as vertices and cooccurrences as edges. In such graphs, the systematic structure of the lexicon can be observed, while the various senses of a word can be accessed through the cliques or communities (dense groups of vertices) of the graph.
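As an illustration of the general idea (a toy sketch of ours, not Ploux's actual software or data), the following shows how the senses of a polysemous word can surface as separate cliques in a cooccurrence graph:

```python
# Toy cooccurrence graph in which word senses surface as cliques.
import itertools
import networkx as nx

# Toy corpus: each sentence is one cooccurrence window.
sentences = [
    ["bank", "river", "water"],
    ["bank", "money", "loan"],
    ["river", "water", "fish"],
    ["money", "loan", "interest"],
]

G = nx.Graph()
for sent in sentences:
    # Add an edge for every pair of words occurring in the same window.
    for u, v in itertools.combinations(set(sent), 2):
        G.add_edge(u, v)

# Restrict the graph to the neighbourhood of the polysemous node "bank".
context = G.subgraph(G.neighbors("bank"))

# Maximal cliques in the context graph approximate the senses of "bank":
# {river, water} (riverbank) vs. {money, loan} (financial institution).
for clique in nx.find_cliques(context):
    print(clique)
```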

References

Auroux, Sylvain (2007) La question de l'origine des langues, suivi de L'historicité des sciences. Paris: PUF.

Baayen, R. Harald (2009) "Corpus linguistics in morphology: morphological productivity". In: A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook. Berlin: Mouton de Gruyter, 900-919.

Cori, Marcel; Léon, Jacqueline (2002) "La constitution du TAL. Étude historique des dénominations et des concepts". Traitement Automatique des Langues 43(3), 21-55.

Herdan, Gustav (1958) "The Relation Between the Functional Burdening of Phonemes and the Frequency of Occurrence". Language and Speech 1(1), 8-13.

Loiseau, Sylvain (2010) "Paradoxes de la fréquence". Energeia 2, 20-55.

[1] See Cori & Léon (2002).

[2] If a phonemic distinction contrasts a large number of minimal pairs, it cannot be withdrawn without producing many homonymies, whereas if it contrasts few pairs of lexemes, it can be lost without producing much lexical ambiguity; the 'functional burdening' is a measure of this importance of a phonemic distinction.

[3] Thanks to Micheline Petruszewycs, we consulted the sign-in sheets of these seminars, which show the diversity of the people who attended them: the mathematicians Pierre Achard, Bernard Jaulin and Simon Reignier, the ethnologist Robert Jaulin, the composer Iannis Xenakis, and many others.