About Glottometrics:
Glottometrics is a scientific journal for the quantitative research of language and text, published 2-3 times a year. All issues of Glottometrics have the ISSN 1617-8351 and are available as:
– printed edition: 30.00 EUR
– CD-ROM edition: 15.00 EUR
– PDF files: free download from the Internet
Abstracts: free of charge
Glottometrics 36, 2017
Contents/ Abstracts Glottometrics 36 (free of charge)
You can buy the download link for Glottometrics 36, 2017 or order the print version here
Glottometrics is a scientific journal for quantitative research on language and text, published at irregular intervals (2-3 times a year).
Contributions in English or German, written with a common text processing system (preferably WORD), should be sent to one of the editors.
Glottometrics can be downloaded from the Internet, obtained on CD-ROM (in PDF format) or ordered as printed copies.
Aims and Scope/ Editorial Board of Glottometrics
Current external academic peer reviewers for Glottometrics
Complete bibliography of all publications of the first 30 issues
Glottometrics 36, 2017 (ISSN 1617-8351)
Published by: RAM-Verlag
Glottometrics 36, 2017 is available as:
Printed edition: 30.00 EUR plus PP
CD-ROM edition: 15.00 EUR plus PP
Internet download (PDF file): 7.50 EUR
Studies in Quantitative Linguistics 24: Editorial
“History of Quantitative Linguistics in France”
Contents Studies 24 (free of charge)
Jacqueline Leon,
UMR 7597 HTL, jacqueline.leon@univ-paris-diderot.fr

Sylvain Loiseau,
UMR 8202 Sedyl, sylvain.loiseau@univ-paris13.fr

This volume gives a historical account of the field of quantitative linguistics in France. It focuses on developments initiated in France involving mathematical methods or the use and interpretation of quantitative data. It does not include material on corpus compilation, computational implementation, or formal modeling.
Historically, quantitative linguistics has a specific status in France. It could be said that statistical studies, namely statistical studies of vocabulary, opened the way to the reception of formal languages and the computerization of linguistics in France. On the one hand, French statistical works were deeply anchored in the French linguistic tradition, mainly concerned with philology, etymology, dialectology, stylistics and studies of specialized vocabularies. On the other hand, unlike the USA, the USSR and Great Britain, France significantly lagged behind in computing, logic and formal languages. In fact, the field of vocabulary statistics played a crucial role through the rivalry it introduced between the two approaches, formal and quantitative. This is shown by the multitude of names given to the field and subfields that the Americans have referred to as ‘computational linguistics’.[1]
The Centre Favard, also named the ‘Seminar of Quantitative Linguistics’, created in March 1960 at the Henri Poincaré Institute of the Faculté des Sciences de Paris, significantly exemplifies that issue. Under the name ‘Seminar of Quantitative Linguistics’ were brought together both the formal aspects of linguistics and statistical methods. It was an important venue for training linguists in mathematics, logic, information theory, set theory, language theory, statistical linguistics, and more generally statistics and probability.
A further example is the classification introduced by Solomon Marcus during the Séminaire International de Linguistique Formelle, held in Aiguille in 1968. He put forward a classification of the subfields of formal linguistics, the term subsuming the whole set (see Desclés et Fuchs 1969). Marcus distinguished between algebraic linguistics (for instance Chomsky and Schützenberger's work on monoids) and mathematical linguistics (using Markov chains), the latter involving probabilistic and quantitative linguistics; automatic, computational and cybernetic linguistics; and finally applied linguistics. It should be noted that Marcus included neither work on formal grammars nor vocabulary statistics within computational linguistics.
Yet another grouping was offered by Bernard Vauquois, a major leader in the field of machine translation in France. He grouped generative grammar and statistical studies of vocabulary on one side, and natural language processing and semantic formalisation on the other side. His reasons were probably more political than epistemological, as he aimed to ensure that Natural Language Processing be recognized by the CNRS.
It is worth noting that these various classifications do not reflect the distinctions made by Computational Linguistics as defined in 1962 by American institutions, such as the Association for Machine Translation and Computational Linguistics (AMTCL) and the ALPAC report of 1966. Computational Linguistics claimed to involve every theoretical aspect of the interaction between formal languages, linguistics and programming on the one hand, and the practical aspects of language engineering on the other. The whole set would later be carried out by NLP in the 1970s. As can be seen, for the Americans, unlike the French, statistical studies did not pertain to computational linguistics.
How should the field of quantitative linguistics in France be characterized? Let us stress, to start with, that the term “quantitative linguistics” (or a translation such as linguistique quantitative) is not really used in French nowadays. The field is mainly termed lexicométrie (cf. infra), linguistique de corpus (corpus linguistics), or statistique linguistique (statistical linguistics). The fact that the term “quantitative linguistics” is not widely used reflects the fact that research tends to focus not so much on quantitative laws as on the historically situated interpretation of quantitative data in corpora.
In order to characterize these local developments among the various possible avenues of research in the field of quantitative linguistics, we can draw up a basic typology of the different kinds of quantitative linguistics. A good criterion for such a typology is the object they focus on and its degree of abstraction (Loiseau 2010): quantitative linguistic works can be divided into (i) those that seek universal tendencies, irrespective of any particular language, at a very abstract level; (ii) those that work on quantitative tendencies at the scale of a particular linguistic system; and (iii) those that work on quantitative tendencies at the scale of, for instance, a genre, a discourse, or a speaker.
The best known example of the first type (universal laws) is Zipf's law: it is valid for any language and relates to the general economy of the language faculty. Such universalist research is at the heart of the label “quantitative linguistics”.
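As a minimal numerical illustration of the law's form (with hypothetical constants, not data from any study cited here), Zipf's law predicts that the frequency f(r) of the word of rank r behaves approximately as f(r) = C / r^a, with a close to 1:

```python
# Zipf's law: frequency of the rank-r word is approximately f(r) = C / r**a.
# C and a are hypothetical constants chosen purely for illustration.
C, a = 1000, 1.0

# Predicted frequencies for the five most frequent words (ranks 1..5).
predicted = [round(C / r ** a) for r in range(1, 6)]
print(predicted)  # → [1000, 500, 333, 250, 200]
```

The second-ranked word is predicted to be half as frequent as the first, the third a third as frequent, and so on, independently of the language.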
An example of the second type could be the notion of the functional burdening of a phonological distinction (e.g. Herdan 1958).[2] Another example could be the analysis of morphological productivity (Baayen 2009): the formula proposed for computing the productivity index is supposed to be valid for any (fusional) language, but it nevertheless aims at quantifying the productivity of a given morpheme in a given language. It focuses on the linguistic system.
Examples of the third type are now numerous with the development of corpus linguistics: many works try to characterize, through quantitative properties, a genre, a discourse, a style, or a variety, according to a corpus representative of that socio-/idiolect. Methods of descriptive statistics or statistical modelling applied to corpora most often aim at describing such corpora.
With this small typology in mind, we can try to characterize the developments of quantitative linguistics in France more precisely. Universalist quantitative models of linguistic data are represented by Mandelbrot's work on the statistical law of word distribution and by information theory (Le Roux; Léon). To this group we can add works in the morphodynamic paradigm (Petitot, and also Ploux) and Guiraud's work (Bergounioux).
However, the bulk of the works focuses on sociolect/idiolect descriptions and on the analysis of the links between discursive phenomena on the one hand and historical/ideological conditioning on the other. This applies to the joint works of historians and lexicologists (Mayaffre), to the lexicométrie school (Loiseau; Brunet; Longrée and Mellet), and to the works done under the umbrella of the TLF (Trésor de la langue française), a large dictionary based on a corpus of French texts (Candel).
As in several other countries, the field of quantitative linguistics in France arose during the first half of the 20th century and developed mainly during its second half. Today, most research in this field has been incorporated into an international research field and no longer has any national peculiarity. However, some subfields such as lexicométrie are still mainly developed in France, and a focus on texts remains strong among French scholars.
3. Interviews with actors of the field
In the process of preparing this volume, several interviews were conducted with actors and witnesses of the field: Robert Nicolaï (University of Nice), Micheline Petruszewycs (EHESS), Pierre Lafon (ENS Lyon), Jean Petitot (EHESS), †Maurice Tournier (ENS Lyon), and Évelyne Bourion (UMR Modyco). These interviews aimed at establishing the institutional context of quantitative linguistics in France.
Robert Nicolaï witnessed the development of quantitative linguistics at the University of Nice around the prominent figure of Pierre Guiraud. Nicolaï was Guiraud's student there in the late 1960s and early 1970s. He recalls that Guiraud's lectures focused on lexicology and semantics more than on statistics. Guiraud was especially concerned with morphosemantic roots and the etymological structures of French, which can only be tackled with large lexical data that make it possible to deal with semantic universals (see Bergounioux, this volume, for more details).
Guiraud created the department of General Linguistics at the University of Nice and, with Gabriel Manessy, a research group named Ideric (Institut de Recherche Interethnique et Interculturel [Institute of Interethnic and Intercultural Research]), which Nicolaï directed after Guiraud's death.
The interview with Jean Petitot turned into a chapter in this volume (Petitot, Léon, Loiseau, this volume).
Interviews with Pierre Lafon, Maurice Tournier, Évelyne Bourion and Micheline Petruszewycs focused mainly on the development of the lexicométrie school. Micheline Petruszewycs was the assistant of the mathematician Georges-Théodule Guilbaud (1912–2008). Guilbaud founded a laboratory, the “Center for Analysis and Mathematics for the Social Sciences” (Centre d'analyse et de mathématiques sociales), at the 6th section of the EPHE (École pratique des hautes études). He also ran two seminars for years.[3] The Friday seminar was devoted to mathematics for the social sciences and was very influential. It was attended by various people such as the mathematicians Pierre Achard, Bernard Jaulin and Simon Reignier, the ethnologist Robert Jaulin, and many psychologists, among them François Bresson and his students. The composer Iannis Xenakis used to attend this seminar. It should be said that statistics, and more generally mathematics, were well regarded by the social scientists of the EPHE (6th section); Guilbaud was even invited to give a course on statistics within Lévi-Strauss's seminar. The Thursday seminar focused on linguistics, more specifically on lexicometry, with the participation of Pierre Lafon, Annie Geoffroy, Maurice Tournier and André Salem (who studied mathematics in Moscow with Andrej Kolmogorov), as well as other members of the Laboratoire de Lexicométrie de St Cloud. Guilbaud introduced the hypergeometric law, which helped solve some insufficiently explicit formulations by the St Cloud group.
The volume gathers contributions either by people who were involved in the field of which they give an account, or by people specialized in the history of linguistics. In both cases, however, the same historical focus has been adopted by all contributors.
Some contributions focus on individuals (such as Thom, Benzécri, or Guiraud), while others focus on larger fields of research (chapters by Léon, Loiseau, and Longrée and Mellet).
The contributions have been organised in two parts. The first one focuses on vocabulary statistics. The second one gathers contributions presenting mathematical models.
The first part, ‘Vocabulary statistics’, includes four papers on the pioneering works of the 1950s–60s and three papers on contemporary research. Some emblematic personalities and projects played a significant role in these early years, and it is not surprising that they are addressed in several chapters: Pierre Guiraud and Georges Gougenheim (Bergounioux, Léon); Benoît Mandelbrot (Léon, Le Roux); the TLF (Brunet, Candel).
Jacqueline Léon deals with early French statistical linguistics and its institutionalisation: after presenting the role of the instigators, Mario Roques (1875–1961) and Marcel Cohen (1884–1974), she examines the three paths followed by the pioneers: (i) the teaching track, with Georges Gougenheim (1900–1972) and Le Français Élémentaire; (ii) the stylistic track, with Pierre Guiraud (1912–1983) and later Charles Muller (1909–2015); and (iii) the mathematical and information theory track, with Benoît Mandelbrot (1924–2010), René Moreau (1921–2009) and the Centre Favard. She shows that statistical studies of vocabulary contributed greatly to the changes in French linguistics that took place in the 1950s–60s. As statistical studies of vocabulary, simultaneously with the first experiments in machine translation, were the first fields to be computerized, they made possible the automation of French linguistics, which had taken place in the USA ten years earlier.
Gabriel Bergounioux dedicates a whole chapter to Pierre Guiraud (1912–1983), one of the major pioneers of quantitative linguistics in France. He emphasizes the originality of Guiraud's approach and his role in the beginnings of statistical studies of vocabulary. Guiraud published three key works in the domain: a comprehensive bibliography (1954) and two methodological essays (1954 and 1960). At first, Guiraud characterized linguistics as an observational science grounded in statistics, like sociology and economics. Later, he claimed that it was cognitively based. His approach was both stylistic (with statistical studies of Guillaume Apollinaire's and Paul Valéry's vocabulary) and etymological. For that purpose he worked out the concept of the “morphosemantic field”. Bergounioux shows how Guiraud's position as an outsider sheds light on the conditions in which quantitative linguistics emerged in France.
Danielle Candel’s chapter is a testimony on the building of the Trésor de la Langue Française, a major dictionary project (1971–1994). This project aimed at building a reference corpus providing the basis and the data for lexicographic analyses. The corpus built for the project is thus one of the early examples of a corpus designed to be representative of a language, to assist the lexicographer in deriving meanings from observed usages in context, and large enough to extract the frequencies of lexical items. Candel shows how the quantitative approach and the use of a large-scale database were linked with many steps in the building of the dictionary.
In the chapter devoted to the “lexicometric” school, Sylvain Loiseau sets out the theoretical assumptions and the institutional settings that led to the development of this very influential line of research. Quantitative analysis, according to lexicometry, is aimed at providing a scientific tool for analyzing the ideological content of texts. The main methods developed in the field of lexicometry are presented, focusing on the quantitative assumptions of the ‘specificities’ method. Some characteristics of lexicometry are still influential in contemporary corpus linguistics research in France: the focus on text, the search for an ideological “backstage” beneath the words, and the idea that quantitative textual analysis can help provide objectivity in the analysis of such a backstage.
The chapter by Damon Mayaffre focuses on historical studies of corpora of political texts. This avenue of research originates in the development of the “lexicometry” approach to the quantitative analysis of vocabularies in France in the 1970s and has always been associated with that field of research. The author shows that the lexicometry approach, aiming at unravelling social position and ideological content, and owing to its focus on political texts, has interested historians from the beginning. Mayaffre then offers several examples of the methods elaborated and of their use for the historical interpretation of political texts.
Longrée and Mellet’s chapter contributes to the statistical handling of a corpus of Latin texts. The authors, as Latinists and proponents of statistical studies, address the specific issue of the variability of word order in Latin, which constitutes a guiding thread for quantitative linguistics in Latin. This issue raised Latinists’ interest as early as the 1970s, leading them to work up counts of various configurations in Latin texts. Taking over that type of work, Longrée and Mellet identify ‘motifs’ with the aim of establishing a typology of texts. Motifs, associating lexical and grammatical constraints, subsume the notions of repeated segments, collocations and colligations, and have led to new software developments for the treatment of Latin.
Etienne Brunet deals with the history of large computerized corpora and databases of written texts in France. The first French corpus was the TLF (Trésor de la Langue Française), which, in fact, was the first computerized corpus in the world, the Brown Corpus being conceived a little later. Brunet recalls how the making of the dictionary was computer-aided, with cooccurrences and frequencies at the editors’ disposal. Examining the TLF’s successor, Frantext, he draws a distinction between a corpus (Frantext) and a base (TLF): contrary to a corpus, a base has a fixed frame with ordered sections whose items must be filled in with a number, a code or text. He compares these two French projects with American bases such as the Encarta Encyclopedia (1993–2009) and Wikipedia, and with corpora of the French language made outside of France, such as the German Wortschatz, the English Sketch Engine, and the American Google Books.
The second part of the book, ‘Mathematical models’, focuses on seminal and influential contributions in the field of mathematical models of language. It includes three chapters on pioneering works and one chapter, by Sabine Ploux, that illustrates contemporary research. Three lines of research are represented: research on distribution laws by Benoît Mandelbrot (Ronan Le Roux); the development of a family of factorial analysis methods, correspondence analysis, for the distributional analysis of a language, by Jean-Paul Benzécri (Valérie Beaudouin); and the development of catastrophe theory by René Thom, a model intended as a mathematical tool for language modelling, reminiscent of neural networks (Jean Petitot, Jacqueline Léon, Sylvain Loiseau).
Ronan Le Roux devotes his chapter to Benoît Mandelbrot (1924–2010), another key pioneer of quantitative linguistics in France. The author shows that Mandelbrot’s study of language was mostly limited to Zipf’s law. He questions the sources of the mathematician’s significant work on Zipf’s law, pointing out the discrepancy between the horizon of retrospection (Auroux 2007) he claimed and the real background of his works. In particular, the scientific environment of the California Institute of Technology, where he stayed in the late 1940s, the inspiring figures of Wiener and von Neumann, and cybernetics played a major role in the way he tackled Zipf’s law. Le Roux shows that Mandelbrot’s later works on the fractal paradigm were consistent with his early work on Zipf’s law, exemplifying what Le Roux identifies as ‘the transversal regime of scientific modelling’, a typical mode of scientific activity.
The chapter by Valérie Beaudouin accounts for the elaboration of “correspondence analysis”, a family of factorial analysis methods, by Jean-Paul Benzécri. From the mid-1960s onward, Benzécri (b. 1932) introduced and developed a series of methods called “Analyse des Données” (Data Analysis), at whose heart is correspondence analysis, a method for the analysis of multidimensional data. Beaudouin traces the intellectual project behind these methods, showing that linguistic data played a major role in the elaboration of correspondence analysis: it aimed at supporting an inductive approach to language, based on the exploration of corpora and the synthesis of large distributional patterns.
The next chapter is an interview with Jean Petitot, conducted by Jacqueline Léon and Sylvain Loiseau. Petitot presents in great detail the mathematical work of René Thom (1923–2002) and its application to linguistics, of which he is one of the best specialists. Thom defined an array of concepts (singularity, structural stability, catastrophe, bifurcation) for the mathematical modelling of morphogenesis, a field pioneered by Alan Turing that studies the formation processes of complex forms, in particular those of life. One central issue of morphogenesis is the mereological problem, i.e. how totalities can be organized into constituents, with relations and transformation rules between constituents, and how totalities can show an organization which is more than the sum of their constituents. This constituency problem is of course central to linguistics. The article shows how the mathematical tools built by Thom address these issues and how these propositions are related to other theoretical frameworks or approaches, such as cognitive linguistics or neural networks. Petitot (2011) shows how the theory of dynamical systems can account for categorical perception in phonology and also in syntax, where he stresses the fundamental links between vision and syntactic structures.
Sabine Ploux presents an approach aiming at modelling the polysemy and the contextual variation of the meanings of lexical items using graph theory. She first shows the limits of two other paradigms: the dynamic approach, illustrated by René Thom (cf. the chapter by Petitot, Léon and Loiseau), and the linear model, illustrated by the correspondence factorial analysis elaborated by Benzécri (cf. the chapter by Beaudouin). Both use the context of lexical units to model their meaning. The former (like connectionist approaches) adequately models the structural stability of a concept or a category despite its “deformations” in context; however, it can be applied to only a few lexical units. The latter is based on the whole lexicon but yields a static core meaning. This approach, today called vector space models, does not give access to the processes by which lexical meanings are built, but gives a static representation of them. Ploux proposes an alternative approach based on graph theory. In graphs built from large corpora, lexemes are represented as vertices and cooccurrences as edges. In such graphs, the systematic structure of the lexicon can be observed, while the various senses can be accessed through the cliques or communities (dense groups of vertices) of the graph.
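The graph idea can be sketched in a few lines (a hypothetical toy cooccurrence graph, not Ploux's actual data or software): lexemes are vertices, cooccurrence links are edges, and the maximal cliques separate the senses of a polysemous word such as "bank".

```python
from itertools import combinations

# Invented toy cooccurrence graph around the polysemous word "bank":
# vertices are lexemes, edges are cooccurrence links.
edges = {("bank", "river"), ("bank", "shore"), ("river", "shore"),
         ("bank", "money"), ("bank", "loan"), ("money", "loan")}

# Build an adjacency map (undirected graph).
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def maximal_cliques(adj):
    """Naive maximal-clique enumeration, fine for toy graphs:
    test every vertex subset (largest first) for mutual adjacency,
    keeping only subsets not contained in an already-found clique."""
    nodes = sorted(adj)
    cliques = []
    for r in range(len(nodes), 1, -1):
        for subset in combinations(nodes, r):
            if all(v in adj[u] for u, v in combinations(subset, 2)):
                if not any(set(subset) <= set(c) for c in cliques):
                    cliques.append(subset)
    return cliques

senses = maximal_cliques(adj)
# → [('bank', 'loan', 'money'), ('bank', 'river', 'shore')]
```

Each clique gathers the cooccurrents of one sense of "bank" (the financial sense vs. the riverside sense); on real corpora one would use an efficient algorithm such as Bron–Kerbosch instead of this exhaustive sketch.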
Auroux, Sylvain (2007). La question de l’origine des langues suivi de L’historicité des sciences. Paris: PUF.
Baayen, R. Harald (2009). “Corpus linguistics in morphology: morphological productivity”. In: A. Luedeling and M. Kyto (eds.), Corpus Linguistics. An International Handbook. Berlin: Mouton de Gruyter, 900–919.
Cori, Marcel; Léon, Jacqueline (2002). “La constitution du TAL. Étude historique des dénominations et des concepts”. Traitement Automatique des Langues 43(3), 21–55.
Herdan, Gustav (1958). “The Relation Between the Functional Burdening of Phonemes and the Frequency of Occurrence”. Language and Speech 1(1), 8–13.
Loiseau, Sylvain (2010). “Paradoxes de la fréquence”. Energeia 2, 20–55.
[1] See Cori & Léon (2002).
[2] If a phonemic distinction contrasts a large number of minimal pairs, it cannot be withdrawn without producing many homonymies; whereas if it contrasts only a few pairs of lexemes, it can be lost without producing much lexical ambiguity. The “functional burdening” is a measure of this importance of a phonemic distinction.
[3] Thanks to Micheline Petruszewycs, we consulted the sign-in sheets of these seminars, which show the diversity of the people who attended them: mathematicians like Pierre Achard, Bernard Jaulin, Robert Jaulin and Simon Reignier, the musician Iannis Xenakis, …
Glottometrics 35, 2016
Abstracts Glottometrics 35 (free of charge)
You can buy the download link for Glottometrics 35, 2016 or order the print version here
Glottometrics 35, 2016 (ISSN 1617-8351)
Published by: RAM-Verlag
Glottometrics 35, 2016 is available as:
Printed edition: 30.00 EUR plus PP
CD-ROM edition: 15.00 EUR plus PP
Internet download (PDF file): 7.50 EUR
Glottometrics 34, 2016
Abstracts Glottometrics 34 (free of charge)
You can buy the download link for Glottometrics 34, 2016 or order the print version here
Glottometrics 34, 2016 (ISSN 1617-8351)
Published by: RAM-Verlag
Glottometrics 34, 2016 is available as:
Printed edition: 30.00 EUR plus PP
CD-ROM edition: 15.00 EUR plus PP
Glottometrics 33, 2016
Abstracts Glottometrics 33 (free of charge)
You can buy the download link for Glottometrics 33, 2016 or order the print version here
Glottometrics 33, 2016 (ISSN 1617-8351)
Published by: RAM-Verlag
Glottometrics 33, 2016 is available as:
Printed edition: 30.00 EUR plus PP
CD-ROM edition: 15.00 EUR plus PP
Internet download (PDF file): 7.50 EUR
Studies in Quantitative Linguistics 22: Editorial
“Positional Occurrences in Texts: Weighted Consensus Strings”
Contents Studies 22 (free of charge)
Written texts contain punctuation, which allows us to mechanically determine units larger than the clause. Mostly such units represent some kind of grammatically determined sentence; others represent the verses of a poem, each written on one line. But poems may be written in such a way that sentences exceed the boundary of the verse; in that case one can analyze the poem in two ways. Spoken texts, e.g. telephone conversations, do not have any punctuation; one must determine “the sentence” either authoritatively, or by considering the intonation, the change of speaker in a stage play, or some other signals.
If one analyzes the text taking into account only special entities which occur in a predetermined frame, one can perform a “consensus” analysis, to be specified below, either for the text as a whole, in which case one can also compare texts; or, if the sentences or verses are too short, for larger parts of the text, e.g. Frumkina’s sections containing 100, 200, … words, or strophes, chapters, blocks of 10 sentences, etc.; but the purpose of the analysis must bear some relation to this type of segmentation. There is, a priori, no prescription or fixed way of constructing the wholes/frames which should be sequentially analyzed.
Nevertheless, the text can always be transcribed as a sequence: either symbolically, as a sequence of abbreviations of entities classified in some known way, e.g. parts of speech (abbreviated as Art, Pn, Aj, Av, N, V, Pp, I, C, etc.), or as degrees of properties of the individual entities, e.g. length. Once we have a sequence, a full palette of statistical methods can help us establish its properties: distances, transition frequencies, positional aspects, runs, etc. (see e.g. Zörnig et al. 2015).
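For instance, the transition frequencies of such a symbolic transcription can be tallied directly. A minimal sketch, using an invented part-of-speech sequence rather than a real text:

```python
from collections import Counter

# A toy sentence transcribed as a part-of-speech sequence
# (invented example; Art = article, Aj = adjective, N = noun,
# V = verb, Pp = preposition).
pos = ["Art", "Aj", "N", "V", "Pp", "Art", "N"]

# Transition frequencies: how often tag a is immediately followed by tag b.
transitions = Counter(zip(pos, pos[1:]))
print(transitions[("Art", "N")])   # → 1
print(sum(transitions.values()))   # → 6 (transitions in a 7-element sequence)
```

The same Counter-over-pairs pattern applies to any of the transcriptions mentioned above, e.g. to sequences of word lengths.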
In the present work we study some other aspects of a text which can be considered as a set of sequences written in separate lines, i.e. a text is an array

s^{1} = (s^{1}_{1}, s^{1}_{2}, …)
s^{2} = (s^{2}_{1}, s^{2}_{2}, …)
…                                        (1.1)
s^{n} = (s^{n}_{1}, s^{n}_{2}, …)
where the sequences (lines) s^{i} = (s^{i}_{1}, s^{i}_{2},…) may have different lengths. We study the distribution of the elements in the columns of (1.1); in particular, the most frequent element of a column is of great interest. In a certain sense we study a text “vertically”, which is a novel approach in quantitative linguistics. We may compare and evaluate the columns and test whether there are some positional regularities. In some languages these are given already by syntactic rules; in poetry they may be prescribed by the rhythm or by positional assonances; in scientific texts one expects a certain ductus; and in stage plays there is a sequence of speech acts, etc. In order to capture the positional occurrences we extend the concept of the “consensus string”, a term that has recently been transferred from computational biology to linguistics (Zörnig, Altmann 2016). A consensus string is a sequence t = (t_{1},…,t_{n}) which is, in a sense to be made concrete, as close as possible to the strings given in (1.1). One possibility for defining t = (t_{1},…,t_{n}), which we adopt in linguistic applications, is to set t_{j} equal to one of the most frequent elements of the j-th column of (1.1).
Definition 1: Let S = {s^{1},…, s^{n}} be a set of sequences as in (1.1). Let F(j) be the largest frequency of an element in column j and let N(j) denote the number of elements in column j. The latter is equal to the number of sequences having length at least j. Then the weighted consensus string (WCS) of the set of sequences is defined as the sequence

WCS = (F(1)/N(1), F(2)/N(2), …, F(m)/N(m)),

where m is the length of the longest sequence.
Irrespective of the type of the sequences s^{i} (which may be e.g. symbolic or numerical sequences), the WCS is always a uniquely determined numeric sequence.
Example 1: Consider the following 4 sequences:
s^{1} = (1, 3, 2, 2, 4, 1),
s^{2} = (1, 4, 3, 3, 2, 2, 2),
s^{3} = (4, 1, 2, 3, 3, 3, 4, 2, 1),
s^{4} = (2, 1, 2, 2, 1).
Since the strings are not equally long, one can bring them to the same lengths by adding zeroes, yielding
s^{1} = (1, 3, 2, 2, 4, 1, 0, 0, 0),
s^{2} = (1, 4, 3, 3, 2, 2, 2, 0, 0),
s^{3} = (4, 1, 2, 3, 3, 3, 4, 2, 1),
s^{4} = (2, 1, 2, 2, 1, 0, 0, 0, 0).
Thereby the strings have been made comparable, i.e. the Hamming distance or another distance (Zörnig, Altmann 2016, section 2) is now defined between any two of these strings.
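The padding and the Hamming distance can be sketched in a few lines of Python (function names are ours, chosen for illustration):

```python
# Pad strings with zeroes to a common length, then compute the Hamming
# distance, i.e. the number of positions in which two equally long
# strings differ.

def pad(strings, filler=0):
    m = max(len(s) for s in strings)
    return [list(s) + [filler] * (m - len(s)) for s in strings]

def hamming(s, t):
    return sum(1 for x, y in zip(s, t) if x != y)

# The four strings of Example 1
strings = [
    (1, 3, 2, 2, 4, 1),
    (1, 4, 3, 3, 2, 2, 2),
    (4, 1, 2, 3, 3, 3, 4, 2, 1),
    (2, 1, 2, 2, 1),
]
padded = pad(strings)
print(hamming(padded[0], padded[1]))  # 6: the padded s^1 and s^2 differ in six positions
```
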
One possible consensus string CS of the above example has the form
CS = (1, 1, 2, 3, 4, 1, 0, 0, 0).
This string is in general not uniquely determined: for a string matrix (1.1), any string having at position j a most frequent element of column j is a consensus string. For example, in column 4 of the above example there are two most frequent elements, namely 2 and 3. Either of them could be the fourth element of CS. Clearly, different consensus strings have different distances to the given (observed) sequences.
Such a string CS = (t_{1},…,t_{m}) minimizes the average distance to the given strings (Zörnig, Altmann 2016). The (uniquely defined) weighted consensus string is
WCS = (2/4, 2/4, 3/4, 2/4, 1/4, 1/3, 1/2, 1/1, 1/1).
In the following we will confine ourselves to weighted consensus strings.
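Definition 1 translates directly into code. The sketch below (function and variable names are ours) keeps exact fractions with Python's fractions module; padding zeroes never arise because only strings of length at least j contribute to column j. Applied to Example 1 it reproduces the WCS given above.

```python
from collections import Counter
from fractions import Fraction

def wcs(strings):
    """Weighted consensus string of Definition 1: F(j)/N(j) for every column j."""
    m = max(len(s) for s in strings)
    result = []
    for j in range(m):
        column = [s[j] for s in strings if len(s) > j]  # the N(j) entries of column j+1
        F = Counter(column).most_common(1)[0][1]        # largest frequency F(j)
        result.append(Fraction(F, len(column)))
    return result

# The four strings of Example 1
strings = [
    (1, 3, 2, 2, 4, 1),
    (1, 4, 3, 3, 2, 2, 2),
    (4, 1, 2, 3, 3, 3, 4, 2, 1),
    (2, 1, 2, 2, 1),
]
print([str(f) for f in wcs(strings)])
# ['1/2', '1/2', '3/4', '1/2', '1/4', '1/3', '1/2', '1', '1']
```

Note that Fraction reduces the values to lowest terms, so 2/4 appears as 1/2 and 1/1 as 1.
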
Consider now some symbolic sequences:
Example 2: Given the five sequences
s^{1} = (a, a, b, b, b, a, c, d, a, b, b)
s^{2} = (a, b, d, a, a, c, d)
s^{3} = (b, a, c, c, d, a, c, b, a, b)
s^{4} = (d, c, b, a, b, a, c, d)
s^{5} = (a, c, b, b, b, a, c, d, c, b, b)
over the alphabet {a, b, c, d}. For example, the first column contains 5 elements, of which the most frequent one, a, occurs three times. Thus F(1) = 3, N(1) = 5. Column 4 contains 5 elements; the most frequent ones are a and b, occurring two times each. Thus F(4) = 2 (largest frequency) and N(4) = 5. Column 9 contains three elements, and the most frequent one, a, occurs 2 times. Thus F(9) = 2, N(9) = 3. The WCS is therefore
WCS = (3/5, 2/5, 3/5, 2/5, 3/5, 4/5, 4/5, 3/4, 2/3, 3/3, 2/2).
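Because only frequencies matter, the same computation works unchanged for symbolic sequences; a sketch for Example 2 (again with names of our own choosing):

```python
from collections import Counter
from fractions import Fraction

def wcs(strings):
    """Weighted consensus string F(j)/N(j); works for numeric and symbolic sequences."""
    m = max(len(s) for s in strings)
    result = []
    for j in range(m):
        column = [s[j] for s in strings if len(s) > j]
        result.append(Fraction(Counter(column).most_common(1)[0][1], len(column)))
    return result

# The five symbolic sequences of Example 2, written as character strings
strings = ["aabbbacdabb", "abdaacd", "baccdacbab", "dcbabacd", "acbbbacdcbb"]
print([str(f) for f in wcs(strings)])
# ['3/5', '2/5', '3/5', '2/5', '3/5', '4/5', '4/5', '3/4', '2/3', '1', '1']
```

Fraction reduces 3/3 and 2/2 to 1, which is why the last two values print as '1'.
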
For the purposes of the present book we do not need the complete strings of (1.1). It is sufficient to have a table of the following form.
Definition 2: Given the string matrix (1.1) where the elements of the strings are chosen from the alphabet A = {a_{1},…,a_{k}}. Then the table
            Columns
            1         2         …         m
a_{1}       f_{1,1}   f_{1,2}   …         f_{1,m}
…           …         …         …         …
a_{k}       f_{k,1}   f_{k,2}   …         f_{k,m}
sum         N(1)      N(2)      …         N(m)
where f_{i,j} denotes the frequency of the element a_{i} in the jth column of (1.1), is called the frequency table of the string matrix (1.1). With the notations in Definition 1 it holds that F(j) is the maximum value of column j and N(j) is the sum of values in column j.
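The frequency table of Definition 2 can be built in one pass over the string matrix; F(j) and N(j) then fall out as column maximum and column sum. A sketch (names are ours), checked against Example 1:

```python
def frequency_table(strings, alphabet):
    """f[i][j]: frequency of alphabet[i] in column j of the string matrix (1.1)."""
    m = max(len(s) for s in strings)
    index = {a: i for i, a in enumerate(alphabet)}
    f = [[0] * m for _ in alphabet]
    for s in strings:
        for j, x in enumerate(s):
            f[index[x]][j] += 1
    return f

strings = [(1, 3, 2, 2, 4, 1), (1, 4, 3, 3, 2, 2, 2),
           (4, 1, 2, 3, 3, 3, 4, 2, 1), (2, 1, 2, 2, 1)]
f = frequency_table(strings, alphabet=(1, 2, 3, 4))
F = [max(row[j] for row in f) for j in range(9)]  # F(j): column maxima
N = [sum(row[j] for row in f) for j in range(9)]  # N(j): column sums
print(f[0])  # frequencies of element 1 per column: [2, 2, 0, 0, 1, 1, 0, 0, 1]
print(N)     # [4, 4, 4, 4, 4, 3, 2, 1, 1]
```
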
In the following chapters we express the information about a set of strings in form of its frequency table. For example, the frequency tables of the string matrices in Examples 1 and 2 are given by
Frequency table of Example 1

        Columns
        1   2   3   4   5   6   7   8   9
1       2   2   0   0   1   1   0   0   1
2       1   0   3   2   1   1   1   1   0
3       0   1   1   2   1   1   0   0   0
4       1   1   0   0   1   0   1   0   0
sum     4   4   4   4   4   3   2   1   1
Frequency table of Example 2

        Columns
        1   2   3   4   5   6   7   8   9   10  11
a       3   2   0   2   1   4   0   0   2   0   0
b       1   1   3   2   3   0   0   1   0   3   2
c       0   2   1   1   0   1   4   0   1   0   0
d       1   0   1   0   1   0   1   3   0   0   0
sum     5   5   5   5   5   5   5   4   3   3   2
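As a consistency check, the WCS of Example 2 can be read off its frequency table alone, since F(j) is the column maximum and N(j) the column sum (the rows below are transcribed from the table):

```python
from fractions import Fraction

# Frequency table of Example 2, one row per alphabet symbol
rows = {
    'a': [3, 2, 0, 2, 1, 4, 0, 0, 2, 0, 0],
    'b': [1, 1, 3, 2, 3, 0, 0, 1, 0, 3, 2],
    'c': [0, 2, 1, 1, 0, 1, 4, 0, 1, 0, 0],
    'd': [1, 0, 1, 0, 1, 0, 1, 3, 0, 0, 0],
}
wcs = [Fraction(max(r[j] for r in rows.values()),   # F(j)
                sum(r[j] for r in rows.values()))   # N(j)
       for j in range(11)]
print([str(f) for f in wcs])
# ['3/5', '2/5', '3/5', '2/5', '3/5', '4/5', '4/5', '3/4', '2/3', '1', '1']
```
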
Another way to characterize the columns is to consider
(a) the rank-frequency distributions of the individual columns and the parameters of the fitted theoretical distribution;
(b) a function of the corresponding moments that shows a concentration on a certain structure. One can also use Ord’s criterion, etc. These indicators compare different moments of the distribution, but the testing of differences becomes more complex.
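For (a), the rank-frequency distribution of a single column is obtained by sorting its element frequencies in decreasing order; a minimal sketch for column 3 of Example 2:

```python
from collections import Counter

# Column 3 of Example 2 contains the elements b, d, c, b, b
column = ['b', 'd', 'c', 'b', 'b']
ranked = sorted(Counter(column).values(), reverse=True)
print(ranked)  # [3, 1, 1]: rank 1 is b (3 occurrences), ranks 2 and 3 are d and c
```

A theoretical distribution (e.g. from the unified theory mentioned below) would then be fitted to such rank-frequency data column by column.
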
Though the above cases are merely examples, we may formulate three hypotheses:
(1) Each column in the string matrix has its specific frequency distribution. The confirmation of this conjecture implies that something like a vertical structure of texts exists. Studying millions of sentences, we could find a distribution of sentence types viewed from the grammatical perspective. Any grammatical analysis is merely a statement of facts, not the finding of background laws. Laws can be established only deductively, but one cannot do so before performing a great deal of inductive work in many languages.
(2) The form of the consensus string depends on the stylistic homogeneity of the text.
(3) If the sentences do not have the same length, then the consensus string frequently increases, beginning at a position corresponding approximately to the median of the string lengths, i.e. a position that 50% of the strings end before reaching.
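Hypothesis (3) can be made concrete with the string lengths of Example 1: N(j), the number of strings still contributing at position j, drops off with j, and the position from which fewer than half of the strings survive marks roughly where the WCS begins to rise. The sketch below is ours; the threshold n/2 is our reading of the median condition.

```python
lengths = [6, 7, 9, 5]  # string lengths in Example 1
n = len(lengths)
# N(j): number of strings with length >= j, for j = 1, ..., max length
N = [sum(1 for L in lengths if L >= j) for j in range(1, max(lengths) + 1)]
print(N)  # [4, 4, 4, 4, 4, 3, 2, 1, 1]
# first position reached by fewer than half of the strings
j_half = next(j for j, Nj in enumerate(N, start=1) if Nj < n / 2)
print(j_half)  # 8
```

Indeed, the WCS of Example 1 reaches its maximal values 1/1 exactly from position 8 on.
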
The properties of the WCS can be measured by various indicators, e.g. using the Hurst exponent, the Minkowski sausage, the V indicator defined below, etc.
The hypotheses may have various boundary conditions (exceptions to the rule) according to the level of entities, kind of measurement, text type, spoken or written text, age of the author, etc. Defining them more exactly must be preceded by empirical investigations, because up to now this behavior of texts has not been sufficiently studied (cf. Hřebíček 2000; Zörnig, Altmann 2016).
The first hypothesis may take two forms: if the elements of the string are numbers, one obtains a frequency distribution; if they are symbols, one obtains a rank-frequency distribution. It is conjectured that both types of distributions can be derived from the unified theory (cf. Wimmer, Altmann 2005).
The second hypothesis merely says that there is variation in the frame units/strings. They mostly begin with an entity which is most frequent in the given language but afterwards they begin to vary. This need not be the case in poetry if it follows a special meter. Our aim is to find the kind of positional dependence of the function in the consensus string. In general, we may conjecture that some positions in the sentence are preferred by some type of entities while other ones are neglected. The WCS may be different according to the type of text.
The third hypothesis is quite evident: From a given point the number of zeroes increases, hence their proportion increases. For the sake of simplicity, we omit the zero positions.
In quantitative linguistics one strives to set up a hypothesis and to test it. The hypothesis should be set up in such a way that it is statistically testable. Non-testable hypotheses are dogmas that cannot be used in science.
Here we test whether the weighted consensus string follows a certain law and expresses a text characteristic, a text type, a language, the development of a person, etc. Our aim is to find the laws which may hold in different forms depending on the character of the considered entities, to express them mathematically and to use them for comparisons.
Emerging Sources Citation Index
Glottometrics can now be found in the Web of Science.
A book review of our publication appeared in vol. 6, issue 2 (2015):
Best, Karl-Heinz & Kelih, Emmerich (eds.) (2014): Entlehnungen und Fremdwörter: Quantitative Aspekte. Lüdenscheid: RAM-Verlag (Studies in Quantitative Linguistics 15). ISBN 9783942303231, IV, 163 pp.
Reviewed by Thorsten Roelcke, Fachgebiet Deutsch als Fremdsprache, Technische Universität Berlin, Sekr. HBS 2, Hardenbergstr. 16-18, 10623 Berlin, Germany. Email: roelcke@tu-berlin.de
Emmerich Kelih (2015);
Australian Journal of Linguistics, 27.10.2015 (Routledge, Taylor & Francis Group).
Link to this review: http://www.tandfonline.com/doi/full/10.1080/07268602.2015.1099179
Link to the book itself: http://www.ramverlag.eu/booksebooks/ (Studies in Quantitative Linguistics 14)
RAM-Verlag
Natural Language Processing, Corpus Linguistics, Lexicography
Eighth International Conference, Bratislava, Slovakia, 21–22 October 2015 Proceedings
ISBN 9783942303323
Preface
Slovko 2015 – this year’s edition, entitled NLP, Corpus Linguistics, Lexicography – represents a follow-up to the previous autumn meetings in Bratislava. The organisers, from the Slovak National Corpus of the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, and from the Slovak Centre of Scientific and Technical Information, are honoured to welcome participants from five countries: Austria, the Czech Republic, France, Slovakia and Slovenia.
The two conference days offer 18 presentations, including two plenary talks. Not all papers registered for presentation were also published: the current programme comprises two presentations that cannot be found in the proceedings. The members of the programme committee carefully reviewed every paper submitted with the registration (two reviewers for each text) and thus contributed to the overall quality of the scientific event and of this publication, for which we would like to express our sincere gratitude.
The 8th edition of the biennial conference Slovko has seen an increase in papers dealing with corpus linguistics, including lexicography; computationally oriented papers, on the other hand, are in a minority. There is a significant shift from presenting new written corpora and their analyses to issues concerning the building and study of spoken and even dialect corpora. We believe that this focus will become a source of inspiration for conference participants and readers of the proceedings alike in their further work in the area of NLP, corpus linguistics and related research in Slovakia and the neighbouring countries.
We wish all participants of Slovko 2015 an enjoyable stay in the Slovak Centre of Scientific and Technical Information and, particularly for those who came from abroad, in Bratislava. We would also like to invite you to Slovko 2017, which will focus, besides NLP and corpus linguistics, on computational terminology and terminography.
Mária Šimková
Translated by Jana Levická
Introducing the Emerging Sources Citation Index
This year, Thomson Reuters is launching the Emerging Sources Citation Index (ESCI), which will extend the universe of publications in Web of Science to include high-quality, peer-reviewed publications of regional importance and in emerging scientific fields. ESCI will also make content important to funders, key opinion leaders, and evaluators visible in Web of Science even if it has not yet impacted an international audience.
About the Emerging Sources Citation Index (ESCI)
Studies in Quantitative Linguistics 21: (Editorial)
„Problems in Quantitative Linguistics 5“
Contents Studies 21 (free of charge)
Preface
The present volume is a continuation of the series dedicated to all linguists who want to solve linguistic problems in a non-classical way. Elementary knowledge of statistics is a necessary condition; however, even a collection of data in the prescribed way can be helpful for solving some problems. The comparisons, tests, and the finding of a function or distribution can be done by a statistician, but the linguistic background knowledge must be furnished by the linguist.
The volume is appropriate especially for those who try to enter the field of quantitative linguistics and seek the door leading to its elementary problems.
The present volume contains 90 problems. For each problem some references are recommended, but the reader can solve it in his own way. Unfortunately, qualitative linguistics contains many concepts and classifications rooted in opinions and leading to different descriptions. In the present volume the reader is forced to perform tests which corroborate or reject the primary concept formation, and to create new data based on different definitions, concepts, criteria, etc. The basic requirement is the testing of everything one says.
It is recommended to publish the results in a quantitative linguistics journal. In any case, all numbers should be presented in order to give other linguists the possibility of testing other hypotheses or to subsume the accepted results in a deeper theory.
Gabriel Altmann
Studies in Quantitative Linguistics 20: (Editorial)
„Descriptiveness, Activity and Nominality in Formalized Text Sequences“
Contents Studies 20 (free of charge)
Preface

In the present book we study characteristics of language based on formalized text sequences. The study of text as a sequence of various entities is rapidly developing in the form of articles, omnibus volumes and monographs. In fact, our linguistic study can be considered as part of a very fertile interdisciplinary research activity devoted to the analysis of information sequences. Such sequences occur also in computational biology (e.g. in the form of DNA strings), in coding theory and in data compression.

While qualitative linguistic analysis searches for rules which are important for language learning, quantitative analysis tries to capture hidden mechanisms which are not necessary for the understanding of language. Except for certain poetic phenomena, e.g. rhythm, which can be produced consciously, these mechanisms cannot be learned and do not represent the core of standard linguistics. In the present book, a group consisting of mathematicians and linguists – specialists for a certain language – attempts to discover textual phenomena which may seem strange to “normal” linguistics but whose deciphering may help to reveal candidates for laws.

Laws are the highest aim of science because without them no theories and no explanations are possible. Unfortunately, in linguistics the testing of a hypothesis is never finished; one can at most validate it to a certain degree. In practice, this validation will never terminate because one would be forced to analyze all languages and, in the case of text laws, as many texts as possible. Here no corpora can help because none of them contains the complete history of a language, the evolution of an individual speaker or a complete collection of text sorts. Hence our attempts merely reveal a few of the infinite number of facets of a text.
We try to collect data, find models of their behavior in form of hypotheses, test them, compare the results in texts of eleven languages available to us and try to create a research domain which will never be satisfactorily explored. We present all observed data in order to enable other researchers to analyze them applying other methods or other characterizations, and to formulate and test other hypotheses. We reduced the whole field to specific phenomena of description, activity and specifying, otherwise the study would be too extensive. Nevertheless, we show in some places the possibility of going deeper into the hierarchy of phenomena.

Peter Zörnig
Bibliography of quantitative studies in Chinese language
Wei HUANG
In the early 1980s, Zipf’s and Herdan’s work was introduced into China. A few Chinese researchers in linguistics and information science then drew attention to quantitative studies of languages, including theoretical studies of Zipf’s law and its application to the frequency distribution of Chinese characters and words.
In recent years, Chinese linguists in increasing numbers have been focusing on quantitative linguistics. Both the theories and the methods of modern quantitative linguistics have been comprehensively introduced into China. Meanwhile, Chinese researchers are carrying out quantitative studies in lexicology, syntax, discourse analysis and other branches of linguistics. They have published quantitative findings on Chinese (Mandarin), English, Russian and other languages.
The following are 32 articles published in Chinese. Each item consists of 4 parts: the translated bibliographical information in English, the Romanized transliteration according to ISO 7098, the original Chinese information and a brief summary of the article. The bibliography is sorted by year of publication in ascending order.
Beijing Language and Culture University: hwstudio@263.net
The RAM-Verlag publishing house offers 4 books containing linguistic problems to be solved (including instructions and literature).
These books are volumes 1, 4, 12 and 14 of our series “Studies in Quantitative Linguistics” by RAM-Verlag (http://www.ramverlag.eu/booksebooks/studiesinquantitativelinguistics/).
They present problems for solution which have not yet been scrutinized in linguistics. They are aimed especially at those who want to enter the domain of quantitative linguistics and seek an orientation, but they also present advanced problems yielding the possibility to develop quantitative linguistics further. For beginning scientists there is no greater problem than to find a problem at all. Every book contains about 100 of them. Each problem is presented in three parts. The first part describes either a ready-made hypothesis or a problem for which there is still no explicit hypothesis. The second part, called “Procedure”, gives instructions for preparing data, setting up a hypothesis, testing it and finding a theoretical model. If a link between two properties is sought – the most popular activity in quantitative linguistics – then one finds some hints concerning its inclusion in the synergetic control cycle.
The volumes comprise five disciplines of linguistics and a mixed chapter consisting of general problems (e.g. laws, scaling, typology, positional problems, etc.). The problems concern: syntax, semantics, textology, pragmatics, synergetics, and various issues. For every problem one finds proposals for solving an old problem with new methods, but also new problems which have not been scrutinized so far.
The authors want to evoke interest in the quantitative solution of linguistic problems. The solutions may be used for dissertations, for writing articles and publishing them in quantitative linguistics journals (Journal of Quantitative Linguistics; Glottotheory; Glottometrics), for writing books, or the individual problems may be used for planning projects. As a matter of fact, each of the problems is a project if performed in several languages. The authors are at the same time members of the editorial boards of the above-mentioned journals and automatically help to improve the presentations.
The third part of each “Problem” is a survey of literature, both new and older, both theoretical and concerning individual languages.
The whole series of “Problems” comprises a very broad spectrum of linguistics. A complete solution of the problems in one of the books would advance the given discipline and give rise to quite new domains.
The books are to be recommended both to learners and researchers.
Our publication „Quantitative Linguistics Computing with Perl“ (Studies in Quantitative Linguistics 7; ISBN: 9783942303019) was reviewed by:
Haoda Feng (2015);
Australian Journal of Linguistics, 35:2, 195-196 (Routledge, Taylor & Francis Group).
Link to this article: http://dx.doi.org/10.1080/07268602.2015.1004657
Link to the book itself: http://www.ramverlag.eu/booksebooks/ (Studies in Quantitative Linguistics 7)
RAM-Verlag
Just published:
Glottometrics 31, 2015
Abstracts Glottometrics 31 (free of charge)
You can buy the download link for Glottometrics 31, 2015 or order the print version here
Editors of Glottometrics
G. Altmann ramverlag@t-online.de
K.-H. Best kbest@gwdg.de
G. Djuraš Gordana.Djuras@joanneum.at
F. Fan fanfengxiang@yahoo.com
P. Grzybek peter.grzybek@uni-graz.at
L. Hřebíček ludek.hrebicek@seznam.cz
R. Köhler koehler@uni-trier.de
H. Liu lhtzju@gmail.com
J. Mačutek jmacutek@yahoo.com
G. Wimmer wimmer@mat.savba.sk
Glottometrics 31, 2015 (ISSN 1617-8351)
Published by: RAM-Verlag
Glottometrics 31, 2015 is available as:
Printed edition: 30.00 EUR plus PP
CD-ROM edition: 15.00 EUR plus PP
Internet download (PDF file): 7.50 EUR
On the occasion of the jubilee of Glottometrics, we are glad to present a complete bibliography of all publications of the first 30 issues. The contributions are ordered in 5 sections: (1) General articles, (2) History, (3) Reviews, (4) Bibliographies, and (5) Miscellanea. Within each of these sections, the contributions are ordered according to authors’ names and year of publication. The bibliography can be downloaded as a PDF file from: www.ramverlag.eu.
Peter Grzybek, Emmerich Kelih
Complete bibliography of all publications of the first 30 issues (link to PDF file)
General Software
ALTMANN-Fitter – Iterative Fitting of Univariate Discrete Probability Distributions to Frequency Data

Software for Linguists
QUITA – Quantitative Index Text Analyzer (free software download)

Exercises in Programming Languages
Perl – Quantitative Linguistics Computing with Perl
Foxpro – Data Processing and Management for Quantitative Linguists with Foxpro