Positional Occurrences in Texts: Weighted Consensus Strings (= Studies in Quantitative Linguistics 22)

Introduction

Written texts contain punctuation which allows us to mechanically determine units larger than the clause. Mostly such units represent some kind of grammatically determined sentence; others represent verses of a poem, each written in one line. But poems may be written in such a way that sentences exceed the boundary of the verse. In that case one can analyze the poem in two ways, by verse or by sentence. Spoken texts, e.g. telephone conversations, do not have any punctuation; one must determine “the sentence” either authoritatively or by considering the intonation, the change of speaker in a stage play, or some other signals.

If one analyzes the text taking into account only special entities which occur in predetermined frame-entities, one can perform a “consensus” analysis, to be specified below, for the text as a whole and, in turn, compare texts. If the sentences or verses are too short, one can instead analyze larger parts of the text, e.g. Frumkina’s sections containing 100, 200, … words, or strophes, chapters, blocks of 10 sentences, etc., but the purpose of the analysis must bear some relation to the type of segmentation. For the time being, there is no prescription or fixed way of constructing the wholes/frames which are to be analyzed sequentially.

Nevertheless, the text can always be transcribed as a sequence, either symbolically, as a sequence of abbreviations of entities classified in some known way, e.g. parts of speech (abbreviated as Art, Pn, Aj, Av, N, V, Pp, I, C, etc.), or numerically, as degrees of properties of the individual entities, e.g. length. Once a sequence is available, there is a full palette of statistical methods that can help us to state its properties: distances, transition frequencies, positional aspects, runs, etc. (see e.g. Zörnig et al. 2015).

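In the present work we study some other aspects of a text which can be considered as a set of sequences written in separate lines, i.e. a text is an array

$$
\begin{array}{l}
s^1 = (s^1_1, s^1_2, \ldots) \\
s^2 = (s^2_1, s^2_2, \ldots) \\
\quad\vdots \\
s^n = (s^n_1, s^n_2, \ldots)
\end{array}
\qquad (1.1)
$$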

where the sequences (lines) sⁱ = (sⁱ₁, sⁱ₂,…) may have different lengths. We study the distribution of the elements in the columns of (1.1); the most frequent element of a column is of particular interest. In a certain sense we study a text “vertically”, which is a novel approach in quantitative linguistics. We may compare and evaluate the columns and test whether there are some positional regularities. In some languages these are given already by syntactic rules. In poetry they may be prescribed by the rhythm or by positional assonances, in scientific texts one expects a certain ductus, and in stage plays there is a sequence of speech acts, etc. In order to capture the positional occurrences we extend the concept of “consensus string”, a term that has recently been transferred from computational biology to linguistics (Zörnig, Altmann 2016). A consensus string is a sequence t = (t₁,…,t_m), where m is the number of columns of (1.1), which is, in a sense to be made precise, as close as possible to the strings given in (1.1). One possibility to define t = (t₁,…,t_m), which we adopt in linguistic applications, is to set t_j equal to one of the most frequent elements of the j-th column of (1.1).

Definition 1: Let {s¹,…, sⁿ} be a set of sequences as in (1.1). Let F(j) be the largest frequency of an element in column j and let N(j) denote the number of elements in column j. The latter is equal to the number of sequences having length at least j. Then the weighted consensus string (WCS) of the set of sequences is defined as the sequence

$$
\mathrm{WCS} = \left( \frac{F(1)}{N(1)}, \frac{F(2)}{N(2)}, \ldots, \frac{F(m)}{N(m)} \right),
$$

where m is the length of the longest sequence.

Irrespective of the type of the sequences sⁱ (which may be e.g. symbolic or numerical sequences), the WCS is always a uniquely determined numeric sequence.

Example 1: Consider the following 4 sentences:

s¹ = (1, 3, 2, 2, 4, 1),

s² = (1, 4, 3, 3, 2, 2, 2),

s³ = (4, 1, 2, 3, 3, 3, 4, 2, 1),

s⁴ = (2, 1, 2, 2, 1).

Since the strings are not equally long, one can bring them to the same lengths by adding zeroes, yielding

s¹ = (1, 3, 2, 2, 4, 1, 0, 0, 0),

s² = (1, 4, 3, 3, 2, 2, 2, 0, 0),

s³ = (4, 1, 2, 3, 3, 3, 4, 2, 1),

s⁴ = (2, 1, 2, 2, 1, 0, 0, 0, 0).

Thereby the strings have been made comparable, i.e. the Hamming distance or another distance (Zörnig, Altmann 2016, section 2) is now defined between any two of these strings.
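The padding step and the resulting Hamming distances can be illustrated with a minimal Python sketch. The function names `pad` and `hamming` are chosen here for illustration only and do not come from the cited literature; the data are the four sequences of Example 1.

```python
from itertools import combinations

# The four sequences of Example 1.
sequences = [
    (1, 3, 2, 2, 4, 1),
    (1, 4, 3, 3, 2, 2, 2),
    (4, 1, 2, 3, 3, 3, 4, 2, 1),
    (2, 1, 2, 2, 1),
]


def pad(seqs, fill=0):
    """Right-pad every sequence with `fill` up to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [tuple(s) + (fill,) * (width - len(s)) for s in seqs]


def hamming(u, v):
    """Number of positions at which two equally long strings differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(u, v))


padded = pad(sequences)

# Pairwise distances, keyed by the (1-based) indices of the two strings.
distances = {
    (i + 1, j + 1): hamming(padded[i], padded[j])
    for i, j in combinations(range(len(padded)), 2)
}

for pair, d in distances.items():
    print(pair, d)
```

Without padding, `hamming` would raise an error for every pair here, since no two of the original sequences have the same length.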

One possible consensus string CS of the above example has the form

CS = (1, 1, 2, 3, 4, 1, 0, 0, 0).

This string is in general not uniquely determined. For a string matrix (1.1), any string having at position j a most frequent element of column j is a consensus string. For example, in column 4 of the above example there exist two most frequent elements, namely 2 and 3. Each of them could be the fourth element of CS. Clearly, different consensus strings have different distances to the given (observed) sequences.

Such a string CS = (t₁,…,t_m) minimizes the average distance to the given strings (Zörnig, Altmann 2016). The (uniquely defined) weighted consensus string is

WCS = (2/4, 2/4, 3/4, 2/4, 1/4, 1/3, 1/2, 1/1, 1/1).

In the following we will confine ourselves to weighted consensus strings.
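The difference between the ambiguous consensus string and the uniquely determined WCS can be sketched in Python as follows. This is only an illustration under the assumptions stated in the comments: the helper names are hypothetical, the consensus candidates are computed on the zero-padded strings, and the WCS is computed without padding, as in Definition 1. The WCS is kept as pairs (F(j), N(j)) so that fractions such as 2/4 are not reduced.

```python
from collections import Counter
from itertools import zip_longest

# The four sequences of Example 1.
sequences = [
    (1, 3, 2, 2, 4, 1),
    (1, 4, 3, 3, 2, 2, 2),
    (4, 1, 2, 3, 3, 3, 4, 2, 1),
    (2, 1, 2, 2, 1),
]


def column_counts(seqs, fill=None):
    """Count the elements of each column of the string matrix.

    With fill=None, positions beyond the end of a sequence are skipped
    (as in Definition 1); otherwise they are counted as `fill`.
    """
    missing = object()  # sentinel that cannot collide with a real element
    counts = []
    for col in zip_longest(*seqs, fillvalue=missing):
        if fill is None:
            items = [x for x in col if x is not missing]
        else:
            items = [fill if x is missing else x for x in col]
        counts.append(Counter(items))
    return counts


def weighted_consensus(seqs):
    """Return the WCS as a list of (F(j), N(j)) pairs, one per column."""
    return [(max(c.values()), sum(c.values())) for c in column_counts(seqs)]


def consensus_candidates(seqs, fill=0):
    """For each column of the padded matrix, the set of most frequent elements."""
    result = []
    for c in column_counts(seqs, fill=fill):
        top = max(c.values())
        result.append({x for x, f in c.items() if f == top})
    return result


wcs = weighted_consensus(sequences)
candidates = consensus_candidates(sequences)

print(" ".join(f"{f}/{n}" for f, n in wcs))
# 2/4 2/4 3/4 2/4 1/4 1/3 1/2 1/1 1/1
print(candidates[3])  # the two admissible fourth elements, 2 and 3
```

Any string that picks one element from each set in `candidates` is a consensus string, whereas `wcs` does not depend on such a choice.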

Consider now some symbolic sequences:

Example 2: Given the five sequences

s¹ = (a, a, b, b, b, a, c, d, a, b, b)

s² = (a, b, d, a, a, c, d)

s³ = (b, a, c, c, d, a, c, b, a, b)

s⁴ = (d, c, b, a, b, a, c, d)

s⁵ = (a, c, b, b, b, a, c, d, c, b, b)

over the alphabet {a, b, c, d}. For example, the first column contains 5 elements, of which the most frequent one, a, occurs three times. Thus F(1) = 3, N(1) = 5. Column 4 contains 5 elements; the most frequent ones are a and b, occurring two times each. Thus F(4) = 2 (largest frequency) and N(4) = 5. Column 9 contains three elements and the most frequent is a, occurring 2 times. Thus F(9) = 2, N(9) = 3. The WCS is therefore

(3/5, 2/5, 3/5, 2/5, 3/5, 4/5, 4/5, 3/4, 2/3, 3/3, 2/2)

For the purposes of the present book we do not need all complete strings of (1.1). It is sufficient to have a table of the following form.

Definition 2: Given the string matrix (1.1) where the elements of the strings are chosen from the alphabet A = {a₁,…,a_k}. Then the table

         Columns
         1       2       …    m
a₁       f_1,1   f_1,2   …    f_1,m
…        …       …            …
a_k      f_k,1   f_k,2   …    f_k,m
sum      N(1)    N(2)    …    N(m)

where f_i,j denotes the frequency of the element a_i in the j-th column of (1.1), is called the frequency table of the string matrix (1.1). With the notations in Definition 1 it holds that F(j) is the maximum value of column j and N(j) is the sum of values in column j.

In the following chapters we express the information about a set of strings in form of its frequency table. For example, the frequency tables of the string matrices in Examples 1 and 2 are given by

Frequency table of Example 1

Element  Columns
         1  2  3  4  5  6  7  8  9
1        2  2  0  0  1  1  0  0  1
2        1  0  3  2  1  1  1  1  0
3        0  1  1  2  1  1  0  0  0
4        1  1  0  0  1  0  1  0  0
sum      4  4  4  4  4  3  2  1  1

Frequency table of Example 2

Element  Columns
         1  2  3  4  5  6  7  8  9  10  11
a        3  2  0  2  1  4  0  0  2  0   0
b        1  1  3  2  3  0  0  1  0  3   2
c        0  2  1  1  0  1  4  0  1  0   0
d        1  0  1  0  1  0  1  3  0  0   0
sum      5  5  5  5  5  5  5  4  3  3   2
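How such a frequency table can be obtained from the strings, and how F(j) and N(j) follow from it, is sketched below in Python for Example 2. The function name `frequency_table` and the representation of the table as a dictionary of rows are assumptions made for this illustration.

```python
from collections import Counter
from itertools import zip_longest

# The five sequences of Example 2, written as character strings.
sequences = [
    "aabbbacdabb",
    "abdaacd",
    "baccdacbab",
    "dcbabacd",
    "acbbbacdcbb",
]
alphabet = "abcd"


def frequency_table(seqs, alphabet):
    """Map each symbol a_i to its list of column frequencies f_i,1 ... f_i,m."""
    table = {a: [] for a in alphabet}
    for col in zip_longest(*seqs):  # missing positions are filled with None
        counts = Counter(x for x in col if x is not None)
        for a in alphabet:
            table[a].append(counts[a])  # a Counter returns 0 for absent symbols
    return table


table = frequency_table(sequences, alphabet)
m = max(len(s) for s in sequences)

# N(j) is the column sum, F(j) the column maximum (Definition 2).
N = [sum(table[a][j] for a in alphabet) for j in range(m)]
F = [max(table[a][j] for a in alphabet) for j in range(m)]

for a in alphabet:
    print(a, table[a])
print("sum", N)
print("WCS", " ".join(f"{f}/{n}" for f, n in zip(F, N)))
```

The last line reproduces the WCS of Example 2 from the table alone, which is why the complete strings are not needed afterwards.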

Another way to characterize the columns is to consider

(a) The rank-frequency distributions of the individual columns and the parameters of the theoretical distribution.

(b) A function of the corresponding moments that shows a concentration to a certain structure. One can use also Ord’s criterion, etc. These indicators compare different moments of the distribution but the testing of differences becomes more complex.
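For point (a), the rank-frequency distribution of a single column can be read off the frequency table by sorting the non-zero frequencies in descending order. The following Python sketch does this for the frequency table of Example 2; the function name `rank_frequency` is hypothetical, and fitting a theoretical distribution to the result is not attempted here.

```python
# Frequency table of Example 2: one row of column frequencies per symbol.
freq = {
    "a": [3, 2, 0, 2, 1, 4, 0, 0, 2, 0, 0],
    "b": [1, 1, 3, 2, 3, 0, 0, 1, 0, 3, 2],
    "c": [0, 2, 1, 1, 0, 1, 4, 0, 1, 0, 0],
    "d": [1, 0, 1, 0, 1, 0, 1, 3, 0, 0, 0],
}


def rank_frequency(table, j):
    """Frequencies of column j (1-based), sorted descending, zeros dropped.

    The value at index r - 1 is the frequency of the element of rank r.
    """
    column = [row[j - 1] for row in table.values()]
    return sorted((f for f in column if f > 0), reverse=True)


for j in (1, 4, 11):
    print(j, rank_frequency(freq, j))
```

The first value of each rank-frequency list is F(j), and the sum of the list is N(j).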

Though the above cases are merely examples, we may conjecture three hypotheses:

(1) Each column in the string matrix has its specific frequency distribution. The confirmation of this conjecture implies that something like a vertical structure of texts exists. Studying millions of sentences, we could find a distribution of sentence types viewed from the grammatical perspective. Any grammatical analysis is merely the stating of facts, not the finding of background laws. These can be established only deductively, but one cannot do so before a lot of inductive work has been performed in many languages.

(2) The form of the consensus string depends on the stylistic homogeneity of the text.

(3) If the sentences do not have the same length, then the consensus string frequently increases, beginning at a position corresponding approximately to the median of the sentence lengths, i.e. a position for which 50% of the sentences end before reaching that position.

The properties of the WCS can be measured by various indicators, e.g. using the Hurst exponent, the Minkowski sausage, the V indicator defined below, etc.

The hypotheses may have various boundary conditions (exceptions to the rule) according to the level of entities, kind of measurement, text type, spoken or written text, age of the author, etc. They can be defined more exactly only after empirical investigations, because up to now this behavior of texts has not been sufficiently investigated (cf. Hřebíček 2000; Zörnig, Altmann 2016).

The first hypothesis may have two forms: if the elements of the string are numbers or symbols, one may obtain the rank-frequency distribution. It is conjectured that both types of distributions can be derived from the unified theory (cf. Wimmer, Altmann 2005).

The second hypothesis merely says that there is variation in the frame units/strings. They mostly begin with an entity which is most frequent in the given language but afterwards they begin to vary. This need not be the case in poetry if it follows a special meter. Our aim is to find the kind of positional dependence of the function in the consensus string. In general, we may conjecture that some positions in the sentence are preferred by some type of entities while other ones are neglected. The WCS may be different according to the type of text.

The third hypothesis is quite evident: From a given point the number of zeroes increases, hence their proportion increases. For the sake of simplicity, we omit the zero positions.
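The growth of the proportion of zeroes can be made concrete for Example 1. In the zero-padded matrix with n strings, the share of zeroes at position j is (n − N(j))/n, i.e. the share of strings that have already ended. The Python sketch below computes this share together with the median length; the name `zero_share` is an assumption made for this illustration.

```python
from fractions import Fraction
from statistics import median

# Lengths of the four sequences of Example 1.
lengths = [6, 7, 9, 5]


def zero_share(lengths):
    """Share of padded zeroes at each position j = 1, ..., max(lengths).

    A string contributes a zero at position j exactly when its length is below j.
    """
    n = len(lengths)
    return [
        Fraction(sum(1 for length in lengths if length < j), n)
        for j in range(1, max(lengths) + 1)
    ]


shares = zero_share(lengths)
print([str(s) for s in shares])
print("median length:", median(lengths))
```

In this small example the share of zeroes is 0 up to position 5 and reaches one half at position 7, just after the median length of 6.5.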

In quantitative linguistics one strives for setting up a hypothesis and testing it. The hypothesis should be set up in such a way that it is statistically testable. Non-testable hypotheses are dogmas that cannot be used in science.

Here we test whether the weighted consensus string follows a certain law and expresses a text characteristic, a text type, a language, a development of a person etc. Our aim is to find the laws which may hold in different forms depending on the character of the considered entities, express them mathematically and use them for comparisons.