The Algorithms of Tajik Speech Synthesis by Syllable

This article is devoted to the development of a prototype computer text-to-speech synthesizer for the Tajik language. The need for such a synthesizer arises because its analogues for other languages not only help people with visual and speech impairments but also find increasing application in communication technology and in information and reference systems. In the future, such programs will take their proper place in broad acoustic dialogue between humans and automatic machines and robots in various fields of human activity. The article describes the prototype Tajik text-to-speech synthesizer developed by the author. It is built on the principle of concatenative synthesis with the syllable chosen as the speech unit, which in turn calls for the most complete possible description of the variety of Tajik syllables. To study the patterns of the Tajik language associated with the syllable, the concept of the “syllabic structure of the word” is introduced. The statistical distribution of structures is obtained, i.e. a correspondence is established between the syllabic structures of words and the frequencies of their occurrence in Tajik texts. An algorithm for breaking Tajik words into syllables is proposed and implemented as a computer program. A solution to the problem of Tajik speech synthesis from arbitrary text is proposed. The article describes the computer implementation of the algorithm for synchronization of words, numbers, characters and text. For each syllable, the corresponding sound realization is extracted from the “syllable-sound” database, and the sound of the word is then synthesized from the extracted elements.


Introduction
Today speech synthesis is implemented by various methods, each with its own advantages and disadvantages. Speech synthesis is evaluated according to two characteristics: the naturalness of the sound and the intelligibility of the speech it reproduces. Some speech synthesizers convey naturalness better, others intelligibility. Depending on their intended purpose, synthesizers are built on different methods of speech synthesis. These methods are usually divided into three groups.
1. Articulatory synthesis is considered one of the most difficult methods. Its proponents [1][2][3] try to simulate numerically, as accurately as possible, the work of the human larynx and the articulatory processes occurring in it in order to reproduce high-quality synthetic speech. Until recently, articulatory synthesis was developed mainly for scientific purposes and did not attract much attention from commercial organizations. Only recently have some of the developed models begun to appear in speech synthesis systems. An idea of earlier and later models of articulatory synthesis can be obtained from [4][5].
2. Formant synthesis imitates human speech without using any samples of it, producing artificial spectrograms. The synthesized speech message is created using an acoustic model. Parameters such as fundamental frequency, voicing and noise levels are varied over time to create a waveform of artificial speech. Many systems based on formant synthesis generate artificial speech with a "robot-like" sound, so the synthesized message cannot be confused with natural human speech. Formant synthesis systems nevertheless have some advantages over concatenative systems. First, formant-synthesized speech can be highly intelligible because it is free of the acoustic glitches inherent in concatenative systems. Second, formant synthesizers are often smaller programs than concatenative systems, since they do not need a database of speech samples. They can therefore be used in embedded computer systems with minimal memory and processor power. Finally, since formant synthesis has full control over all aspects of the generated speech message, it can produce a wide variety of prosody (the system of stressed and unstressed, long and short syllables in speech) and intonation, conveying not only questions and statements but also a spectrum of emotions and tones of voice. The best-known formant synthesizers are associated with the name of Klatt (D. H. Klatt [6-10]).
3. Concatenative synthesis uses pre-recorded segments of natural speech. It is probably the easiest way to produce intelligible and natural-sounding synthetic speech. One of its most important points is the selection of sound units of suitable length. The choice is between shorter and longer units. Longer units give good articulation and a high degree of naturalness, and they reduce the number of joins required between sound units. The drawback is an inevitable increase in the computer memory that has to be reserved. Working with shorter sound units (fragments) requires less memory, but the process of automatically joining them becomes more difficult and complex. Existing concatenative synthesizers use phonemes, diphones, syllables, morphemes, words, phrases and even sentences as sound units. At first glance it might seem that the word should be preferred over the others. However, every language has an immense set of different words and proper names, and a word sounds different in continuous speech than in isolation, so this choice cannot be considered acceptable.
The ideas underlying concatenative synthesis were apparently first expressed by Harris (S.M. Harris) in his article on the building blocks of colloquial speech, see [11]. An overview of the current state of the issue can be found in the works of R.K. Potapova [12-13].
The most common variants of concatenative synthesis are parametric synthesis and synthesis according to rules. The first of them is more flexible owing to parameterization based on small phonetic units (allophones, diphones, syllables ...). It allows manipulation of the parameters responsible for speech quality (formant values, bandwidth, fundamental frequency, signal amplitude). This makes it possible to join the signals so that the transitions at the boundaries become imperceptible. Varying parameters such as the fundamental frequency throughout the message makes it possible to change the intonation and temporal characteristics of the message significantly. Speech units of various lengths are used for the synthesis: paragraphs, sentences, phrases, words, syllables, half-syllables, diphones. The smaller the unit of synthesis, the fewer units are required. However, smaller units require more computation, and coarticulation difficulties arise at the joins. The advantages of this method are flexibility, a small amount of memory for storing the source material, and preservation of the individual characteristics of the speaker.

ITM Web of Conferences 35, 07003 (2020), ITEE-2019, https://doi.org/10.1051/itmconf/20203507003
Synthesis according to rules works with a so-called "unlimited dictionary". Its elements are phonemes or syllables, which are connected according to well-defined rules. It was found that high-quality speech synthesis requires several different pronunciations of the synthesis unit (for example, a syllable), which enlarges the dictionary of original units without adding any information about the context. For this reason, the synthesis process becomes more abstract and moves from a parametric representation to a set of rules by which the necessary parameters are calculated from an input phonetic description. This input representation contains little information in itself. It usually consists of the names of phonetic segments (for example, vowels and consonants) with stress marks, tone designations and temporal characteristics. This method gives freedom in modeling the parameters, although the modeling rules themselves remain imperfect. The synthesized speech is worse than natural speech, but it passes tests of intelligibility and comprehensibility.
It should be noted that, among the methods mentioned, formant and concatenative synthesis have found widespread use. The first dominated for a long time in the past, while concatenative synthesis is becoming more popular today. Against this background, articulatory synthesis seems too complicated for high-quality reproduction, but it may turn out to be a particularly promising method in the near future.
Other, less popular methods of speech synthesis are hybrid synthesis and synthesis based on hidden Markov models (HMM). Hybrid synthesis combines features of formant and concatenative synthesis in order to minimize acoustic glitches when speech segments are joined. In an HMM-based synthesis system, the frequency spectrum (vocal tract), the fundamental frequency (voice source) and the duration (prosody) of speech are modeled simultaneously using hidden Markov models. Speech waveforms are generated from the hidden Markov models themselves on the basis of the maximum likelihood criterion.
In Russia, the most notable achievements in the field of automatic speech synthesis are associated with the Computing Center of the Russian Academy of Sciences (Yu. I. Zhuravlev [14], V. Ya. Chuchupal [15]); Institute of Information Transmission Problems of the Russian Academy of Sciences (V. N. Sorokin [16]) [20-22].
Various methods of speech synthesis form the basis of computer programs called speech synthesizers. At the user's request, such programs, which belong to the "text-to-speech" category, can read texts stored in electronic memory in a male or female voice, make intonation pauses, change the tone and timbre of speech during listening, and transmit voiced texts over a network. Here is a list of the best-known computer speech synthesizers: Reader TTS, Govorilka, ToM Reader, Sakrament, Talk. Some programs, such as Sakrament Talker, Govorilka, Talk-To-Me, Text Aloud, Speech2, are reportedly able to read texts aloud in any language. However, direct work with them shows that this claimed ability is not confirmed in practice: the quality of synthesized speech depends directly on the specifics of the spoken language, so a software system developed for one particular language cannot perform its functions equally successfully for any other language. Both this limitation and other significant shortcomings, namely unnatural sound or insufficient intelligibility of messages, determine the relevance of further research on the design of speech synthesizers for natural languages.

The syllabic structure of the words of the Tajik language
By definition, a syllable is a minimal pronunciation unit of speech consisting of one or more sounds that form a close phonetic unity. According to a slightly different but equivalent interpretation, a syllable is a sound or a combination of sounds in a word pronounced with one push of exhaled air.
To study the patterns of the Tajik language associated with the concept of a syllable, we introduce an additional concept of the syllable structure of a word.
Let $W$ be a word, i.e. a certain sequence of letters. Replacing each vowel in it with the digit 1 and each consonant with the digit 0 (we consider the letter "й" to be a consonant), we transform the word $W$ into an ordered collection $W^{*}_{0,1}$ of zeros and ones. We call this transformation the encoding of the word $W$, and the result obtained, i.e. $W^{*}_{0,1}$, the syllabic structure of $W$. In essentially any natural language, several words $W$ correspond to the same $W^{*}_{0,1}$. This means that different words with the same number of letters can have the same syllabic structure. For example, the words "дилшод", "кардам", etc. correspond to the same structure "010010".
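The encoding just described can be sketched in a few lines of Python. This is a minimal illustration, not the author's program: the function name `encode_word` and the constant `TAJIK_VOWELS` are hypothetical, and the vowel set (including the letters е, ё, ю, я) is an assumption, since the article does not list the vowels explicitly.

```python
import unicodedata

# Assumption: the vowel letters of the Tajik Cyrillic alphabet.
# The article does not enumerate them; every other letter, including "й",
# is treated as a consonant.
TAJIK_VOWELS = frozenset(unicodedata.normalize("NFC", "аеёиӣоуӯэюя"))


def encode_word(word: str) -> str:
    """Map a word to its syllabic structure: vowel -> '1', consonant -> '0'."""
    # NFC normalization makes letters such as "ӣ", "ӯ" and "й" single code
    # points even if the input stores them as base letter + combining mark.
    normalized = unicodedata.normalize("NFC", word).lower()
    return "".join(
        "1" if ch in TAJIK_VOWELS else "0"
        for ch in normalized
        if ch.isalpha()  # skip hyphens, digits and other non-letters
    )


print(encode_word("дилшод"))
print(encode_word("кардам"))
```

Both example words from the text map to the same structure, which illustrates that the encoding is many-to-one.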
The results formulated below are based on the statistical processing of a representative sample of 1,800,000 words composed of fragments of works. Hereafter we consider the images of these words, i.e. the corresponding syllabic structures represented as sequences of zeros and ones. The statistical distribution of structures is obtained, that is, a correspondence is established between the syllabic structures of words and the frequencies of their occurrence in Tajik texts.
These data are presented as follows: the first column gives the number of the structure in decreasing order of frequency of occurrence, the second gives the structure itself, and the third gives the percentage of its occurrence in the texts.

Each of the 274 discovered syllabic structures of Tajik words was divided into syllables "manually" (in accordance with the syllable division of the Tajik words that fell under each structure). As a result, only 9 different syllable structures were found: 1, 10, 01, 010, 100, 0100 and 001, 0010, 00100.
Of these, the first six are inherent in the nature of the Tajik language, and the last three are borrowed from other languages.
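The three-column frequency table described above (rank, structure, percentage) could be produced along the following lines. This is a sketch under stated assumptions: the function name `structure_distribution` is hypothetical, the input is assumed to be a list of already encoded words, and ties are broken alphabetically only to make the output deterministic.

```python
from collections import Counter


def structure_distribution(structures):
    """Return rows (rank, structure, percent) sorted by decreasing frequency.

    `structures` is an iterable of 0/1 strings, one per word of the corpus.
    """
    counts = Counter(structures)
    total = sum(counts.values())
    # Sort by decreasing count; ties are ordered by the structure string.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, structure, 100.0 * count / total)
        for rank, (structure, count) in enumerate(ranked, start=1)
    ]


# Toy input, not data from the article.
for row in structure_distribution(["010010", "010010", "01", "010"]):
    print(row)
```

Applied to the full sample, a table of this shape would contain one row for each of the 274 structures.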

Automatic word decomposition
This section gives a conceptual description of a sequence of procedures whose implementation as a computer program allows an arbitrary Tajik word to be split into syllables automatically. The splitting process is based on the concept of the syllabic structure of a word and makes essential use of 6 syllable structures.
Let $W$ be a Tajik word, i.e. a certain sequence of letters of the Tajik alphabet.

[...] is recognized in the same way, because it is currently known that the Tajik language does not contain words of more than 8 syllables.
9. The end.

The word "хуршед", chosen by us as an example, is encoded with zeros and ones as 010010 and is identified with the 9th record of Table 1. Therefore, in encoded form this word gets the syllable representation $010 \oplus 010$, where $\oplus$ is the sign of agglutination, i.e. joining (gluing) one syllable structure to another without a space.
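The decomposition of an encoded word into syllable structures can be illustrated with a short sketch. It is not the author's table-driven procedure, whose steps are not reproduced here. Instead it uses a simple assumed rule: every syllable contains exactly one vowel, and when consonants stand between two vowels, the last of them opens the next syllable. The names `split_structure`, `NATIVE` and `BORROWED` are hypothetical; the two sets list the nine syllable structures named in the text.

```python
# The nine syllable structures listed in the article.
NATIVE = {"1", "10", "01", "010", "100", "0100"}
BORROWED = {"001", "0010", "00100"}


def split_structure(structure: str) -> list[str]:
    """Split a 0/1 word structure into syllable structures.

    Assumed rule (not taken from the article): one vowel per syllable, and
    the last consonant before a vowel belongs to that vowel's syllable.
    """
    if not structure or set(structure) - {"0", "1"}:
        raise ValueError("structure must be a non-empty string of 0s and 1s")
    vowels = [i for i, ch in enumerate(structure) if ch == "1"]
    if not vowels:
        raise ValueError("a word structure must contain at least one vowel")

    boundaries = [0]
    for prev, nxt in zip(vowels, vowels[1:]):
        # With consonants in between, the syllable starts one position
        # before the vowel; with adjacent vowels it starts at the vowel.
        boundaries.append(nxt - 1 if nxt - prev > 1 else nxt)
    boundaries.append(len(structure))

    parts = [structure[a:b] for a, b in zip(boundaries, boundaries[1:])]
    for part in parts:
        if part not in NATIVE and part not in BORROWED:
            raise ValueError(f"unexpected syllable structure: {part}")
    return parts


print(split_structure("010010"))
```

For the structure of "хуршед" the rule reproduces the decomposition given in the text. A structure such as 010010 also admits the split 0100 + 10, which this rule never produces; resolving such cases is the role of the manually divided table in the article.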

Part 2.
After the decomposition of $W^{*}_{0,1}$ into syllable structures, splitting the source word $W$ is very simple. From the first part of the algorithm, it suffices to store in memory the number of letters that make up the 1st syllable, the 2nd syllable, etc. These numbers are then used to extract the syllables from the original word $W$.
So, in the above example, when separating
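Part 2 amounts to cutting the original word at the stored syllable lengths. A minimal sketch, assuming the lengths come from the decomposition of the encoded word and that the word is stored with one code point per letter (the function name `split_word` is hypothetical):

```python
def split_word(word: str, syllable_lengths: list[int]) -> list[str]:
    """Cut `word` into syllables with the given numbers of letters."""
    if any(n <= 0 for n in syllable_lengths):
        raise ValueError("syllable lengths must be positive")
    if sum(syllable_lengths) != len(word):
        raise ValueError("syllable lengths must add up to the word length")

    syllables = []
    start = 0
    for n in syllable_lengths:
        syllables.append(word[start:start + n])
        start += n
    return syllables


# "хуршед" has the structure 010 + 010, i.e. two syllables of 3 letters each.
print(split_word("хуршед", [3, 3]))
```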

The variety of the syllables of the Tajik language
Based on the author's algorithm and a computer program developed on its basis, statistical studies on the variety of syllables of the Tajik language were carried out.
1. 3259 different syllables were extracted.
2. The statistical distribution of syllables in Tajik texts was obtained, i.e. an empirical correspondence ν = ν(n) was established between the number n of each of the 3259 different syllables, arranged in decreasing order of frequency of occurrence, and the frequency ν (in percent) of occurrence of the corresponding syllable.
3. It was established that 41 syllables cover 50% of Tajik text.

To evaluate the synthesizer's performance, experiments were organized to voice a variety of textual information (fragments from novels, scientific articles, textbooks, newspapers, magazines, Internet sites). The completeness of the set of syllables used to form synthetic speech was assessed by the percentage of voiced words relative to the total number of words in the selected text fragments. The results of the experiment showed that the Tajik Text-to-Speech software package voices Tajik text with quite satisfactory quality. The block diagram of the software package is presented in Figure 2.
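The coverage figure in item 3 and the completeness measure used in the experiments can both be computed from simple counts. The sketch below is illustrative only: the function names are hypothetical, the sample numbers are toy values rather than data from the article, and a word is assumed to count as voiced when every one of its syllables is present in the syllable inventory.

```python
def syllables_needed_for_coverage(frequencies_percent, target_percent):
    """Smallest number of top-ranked syllables whose frequencies reach the target."""
    covered = 0.0
    ranked = sorted(frequencies_percent, reverse=True)
    for count, frequency in enumerate(ranked, start=1):
        covered += frequency
        if covered >= target_percent:
            return count
    raise ValueError("the target coverage cannot be reached")


def voiced_word_percentage(words_as_syllables, inventory):
    """Percentage of words whose syllables are all available in the inventory."""
    if not words_as_syllables:
        raise ValueError("at least one word is required")
    voiced = sum(
        1 for syllables in words_as_syllables
        if all(s in inventory for s in syllables)
    )
    return 100.0 * voiced / len(words_as_syllables)


# Toy values for illustration.
print(syllables_needed_for_coverage([30.0, 25.0, 20.0, 15.0, 10.0], 50.0))
print(voiced_word_percentage([["хур", "шед"], ["кар", "дам"]], {"хур", "шед", "кар"}))
```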
The first block, "User Interface", consists of two components, "Text Entry" and "Speech", which have one-way communication: the user enters text and receives a spoken version of it as a result. To produce the results, block 1 is connected with block 2 in two directions: it passes information for linguistic analysis and receives the results of voicing. Block 1 also interacts directly with block 3 to use the system settings (male or female voice selection, volume and speed of voicing).
The second block, "Analytical subsystem", consists of two parts, "Linguistic analysis" and "Sound module". The first of them consists of the submodules "Text Validation", "Text Encoding" and "Separating Words into Syllables". "Text Validation" validates the input, which includes text elements such as words, integers, characters and punctuation marks. This submodule checks the text elements, converts integers and characters into text form, and then passes them on for encoding. Encoding is performed by the submodule of the same name, which converts each word of the input text into an ordered set of zeros and ones, i.e. all words are represented by their syllabic structures. The encoded text is passed to the submodule "Separating Words into Syllables". The words, now split into syllables, pass from linguistic analysis to the Sound Module. In this module, the sound information is formed using the "syllable-sound" database of the information subsystem, taking into account stressed syllables, inter-syllable and inter-word pauses, and pauses marking punctuation such as the comma and the period. The sound module is the final stage of the analytical subsystem, and the audio version of the text is sent to the user interface.
The third block, "Information Subsystem", contains two databases, "System Settings" and "Syllable-Sound Base". The first is used to store temporary system setup data. The second, the "syllable-sound" base, stores statistical data on the sound files of the 3259 Tajik syllables. A dedicated module provides access to this database and checks and selects the necessary data.
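The lookup-and-concatenate step performed by the Sound Module with the "syllable-sound" base can be sketched as follows. This is a simplified illustration, not the package's implementation: the base is modeled as an in-memory dictionary from syllables to lists of audio samples, pauses are modeled as runs of zero samples, and the names `synthesize_word`, `syllable_sounds` and `pause_samples` are hypothetical.

```python
def synthesize_word(syllables, syllable_sounds, pause_samples=0):
    """Concatenate the stored sound of each syllable into one sample list.

    `syllable_sounds` maps a syllable to its recorded samples (assumed to be
    a list of integers); `pause_samples` zero samples are inserted between
    consecutive syllables to model the inter-syllable pause.
    """
    samples = []
    for index, syllable in enumerate(syllables):
        if syllable not in syllable_sounds:
            raise KeyError(f"no sound stored for syllable {syllable!r}")
        if index > 0:
            samples.extend([0] * pause_samples)
        samples.extend(syllable_sounds[syllable])
    return samples


# Toy database with made-up sample values.
toy_base = {"хур": [1, 2], "шед": [3]}
print(synthesize_word(["хур", "шед"], toy_base, pause_samples=2))
```

A real system would read the samples from sound files and would also handle stress and inter-word pauses, which this sketch leaves out.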

Conclusion
Thus, the software package for computer voicing of Tajik text, Tajik Text-to-Speech [24], and the narrator of Tajik text, Tajik Text-Narrator [25], although they do not completely solve the problem of synthesizing Tajik speech, are nevertheless the first software products to perform computer voicing of Tajik texts satisfactorily. At this level of development, the package can already be used by people with impaired vision. The experiments were carried out at scientific seminars of the Khujand Polytechnic Institute of the Tajik Technical University named after Academician M.S. Osimi. The participants entered Tajik texts of their own choosing into the computer and then evaluated the naturalness and intelligibility of the synthetic speech. The general opinion of the seminar was that a computer synthesizer built on the principle of concatenation of 3259 Tajik syllables performs the function of voicing Tajik texts quite successfully. The synthesizer implements such elements of prosodic synthesis as the placement of stresses and intonation pauses between paragraphs, after a comma inside a sentence and after the period at the end of a sentence. Computational experiments have established the prospect of further developing the Tajik Text-to-Speech and Tajik Text-Narrator software systems into a Tajik speech synthesizer with Russian language support.