Probabilistic-Statistical Method for Recognition of Written Signs Using a Measure of Proximity

The paper describes ways to recognize written signs when the nature of the source is completely unclear and the seemingly obvious approaches to the problem fail as well. The article deals with methods of recognizing binary images in order to compare them and select the best one. The document images were obtained with a camera and are of low quality. The images of the collection were segmented and binarized. A control sample for testing the recognition methods was selected from the resulting collection. The paper describes the image-comparison methods and their advantages and disadvantages for recognizing handwritten shorthand characters. The results obtained by comparing the characters of the control sample allowed determining the best method, the "method of comparison of forms".


Introduction
At present, great attention is paid to methodological approaches to complex experiments in handwritten symbol recognition. Modern publications note the advantages and disadvantages of treating the recognized object, either as a whole or through a characteristic part, as a multidimensional vector in a space of features. The disadvantage is that the recognized object is thereby, explicitly or implicitly, endowed with the properties of a vector as a mathematical object, which in real identification problems is apparently always wrong, or at least never sufficiently justified [1][2][3]. To call the recognition object a tuple of n characteristics is certainly more correct, but the clarity and the possibility of freely using Euclidean metrics as a measure of similarity among classified objects are then completely lost. The impossibility of a vector interpretation of handwritten signs under the usual methods of their description is obvious. Therefore the scheme for recognizing handwritten signs should from the very beginning be adapted to the features of the problem. The most important features are:
- the absence of any hint of what to consider an adequate system of descriptions and features in this outwardly simple task;
- the complete absence of objective data on the similarity criterion a person uses to describe written signs;
- very strict requirements on the result, determined by a person's capabilities in the problem;
- the admissibility and presence of "teacher's mistakes" caused by inevitable errors in encoding the initial information and the impossibility of controlling the content of the samples due to their volume.
It is these features that led to understanding the necessity of studying the problem of written sign recognition by such heterogeneous methods [2][3].
For problems close in complexity to written sign recognition, when the nature of the source is completely unclear and the seemingly obvious ways of solving it unexpectedly prove untenable, one of the best options is a cascaded solution [4][5][6]. In this case, by dividing the initial problem into a series of essentially simpler subtasks within the framework of the assumed model of the situation generating the problem, one can count on success through a sequence of local but mutually agreed decisions. The most natural of these subtasks, or stages, within the accepted model are: a) choosing or working out a system of initial descriptions for the recognition objects together with their measure of similarity, coordinated where possible with the similarity estimates a person gives in this problem, if indeed a person is capable of giving them; and b) on the basis of the found descriptions and similarity measure, searching for meaningful attributes and decision procedures over them, which in turn can allow the initial descriptions and similarity measure to be refined, and so on (see Fig. 1). The model of handwritten character recognition was based on representing the flat line of a symbol by a symbol portrait: a multidimensional line in an extended space of natural descriptions that includes the symbol plane as a subspace [7][8]. At the first stage of checking this model it was necessary to clarify the system of initial descriptions, the coordinates and the measure of similarity of portraits, and then to undertake a formalized search for features within the model under consideration, corresponding to the areas of local grouping of portraits in the space of natural descriptions.
Consideration of the main features of the problem of handwritten symbol recognition from the point of view of a step-by-step methodology shows that hypotheses about the suitability and adequacy of the original descriptions and the similarity criterion should be evaluated by the degree to which the sample breaks up into clumps [9][10][11][12]. To carry out this part of the experiments, a program was developed. It allowed specifying the descriptions and the nature of their processing, and formulating the similarity criterion for handwritten figures given by their portraits in the space of natural descriptions.
The check of the clump etalons achieved its best result in automatic character recognition [13][14][15][16]. This allowed the conclusion that, at least at the first stage, the model and the decisions based on it are sound. Possibilities for further improvement of the results are connected with taking the handwriting restrictions, the manner of writing, into account and with the transition to recognition by local features, since it was the whole-symbol comparison used at the first stage that accounted for about 3/4 of the errors received and 1/2 of the recognition refusals in seemingly simple cases. Experiments with the manner of writing have not yet yielded a noticeable result; as for local features, they still have to be searched for, as do refinements of the original descriptions and of their similarity measure in the space of natural descriptions. For this, the following method was developed.
The technique belongs to the widely used methods of cluster analysis and implements one of the modifications of gradient search. To estimate the gradient of a point set, the shift of the centre of gravity of the points caught in a test sphere of relatively small radius from the centre of the sphere is used (see Fig. 1). The centre of the sphere at the next step is placed at the centre of gravity determined at the previous sphere position. It is easy to see that the value of such a shift does not exceed the radius of the sphere, so the movement towards the mode will be stable. The sphere radius changes during the search, adapting to the number of points falling into the sphere, as there are usually few of them: with too small a radius, individual points away from clumps would have to be considered independent clumps, while with a large radius close clumps would merge. Therefore, if there are few points in the sphere, its radius increases slightly at the next step; if there are many, it decreases (see Fig. 1). This kind of adaptation allows obtaining high resolution near the clumps with a low probability of extracting false clumps even in small samples, where restoration of smooth distribution functions is a serious problem. An extremely important point in the work of search procedures is finding out whether all modes have been detected when their number is unknown in advance and can fluctuate within a very wide range. To solve this problem, the technique provides for marking points during the trial sphere's movement to and from a mode, and the movement to the next mode can begin only from an unmarked point. Since the overwhelming number of points is marked on the modes and in their nearest surroundings, the number of unmarked points decreases very quickly; when none remain, the procedure stops.
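The search procedure described above can be sketched as follows. This is a minimal illustrative implementation in the spirit of mean-shift mode seeking: the adaptation factors (1.1 and 0.9), the target point count, the starting rule, and all names are assumptions, not the authors' exact settings.

```python
import math

def find_modes(points, radius=1.0, target_count=3, max_iter=100, tol=1e-6):
    """Gradient mode search with a test sphere: the sphere moves to the centre
    of gravity of the points it captures, its radius adapting to the number of
    captured points; points captured at a mode are marked, and each new search
    starts from an unmarked point."""
    unmarked = set(range(len(points)))
    modes = []
    while unmarked:
        start = next(iter(unmarked))          # start only from an unmarked point
        centre = points[start]
        r = radius
        for _ in range(max_iter):
            inside = [i for i, p in enumerate(points) if math.dist(p, centre) <= r]
            if not inside:
                r *= 1.5                      # empty sphere: grow it and retry
                continue
            # next sphere centre = centre of gravity of the captured points
            new_centre = tuple(sum(points[i][d] for i in inside) / len(inside)
                               for d in range(len(centre)))
            # adaptation: few points -> grow slightly, many points -> shrink
            r *= 1.1 if len(inside) < target_count else 0.9
            shift = math.dist(new_centre, centre)
            centre = new_centre
            if shift < tol:                   # shift never exceeds r, so stable
                break
        # mark the mode's surroundings and the starting point
        unmarked -= {i for i, p in enumerate(points) if math.dist(p, centre) <= r}
        unmarked.discard(start)
        modes.append(centre)
    return modes
```

On a sample with two well-separated clumps, the procedure returns one mode per clump and stops as soon as no unmarked points remain.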
In the observations made, the number of outputs onto modes did not exceed 8% of the sample volume on average, while repeated outputs onto already selected modes amounted to 50-70% on average, which indicates the high efficiency of such a search procedure with "coloring".
There are two marking modes. In the first, adequate to the peak-type situation, one label is used; in the second, adequate to ridges, the implementations near the selected mode are additionally marked. The radius of the additional marking is somewhat larger than the radius of the sphere established at the mode, and all implementations marked this way are excluded from further consideration. Thus, in the first marking mode only the conditions for starting the search for modes change, while in the second the sample volume changes as well. The second mode is therefore much faster (repeated outputs onto already selected modes are excluded and, although their share is relatively small, each extraction of a mode dramatically reduces the sample volume) and more suitable for the ridge-type situation, as it is a gradient version of the method of coverings. In addition to true modes, the procedure also extracts quasi-modes on the slopes of true modes. This increases the role of choosing the sphere radius for excluding implementations in the vicinity of a mode from the sample once the mode is found.
If there are no prior indications, the choice of the marking mode more adequate to the approximated distribution is made on the basis of exam results for the two marking modes, comparing the numbers of errors and refusals, the number of modes, and their characteristics. In particular, in the observations on recognition of handwritten figures in the space of natural descriptions, the situation corresponded to the second marking mode, which may be explained by the non-use of normalizing transformations at the input in these observations [16][17][18][19][20].
The method makes the decision using the two references (etalon modes) closest to the recognized implementation in the sense of the similarity measure. If the difference between the two minimum distances is less than the allowable gap while the two closest references have different meanings, or if the distance to the closest reference is greater than the threshold distance, no decision is taken. Otherwise, the decision is made by the nearest reference, and if its meaning does not coincide with the meaning of the implementation specified in its header, a recognition error is recorded. In case of an error or a refusal of recognition, the header of the second-nearest reference and the content of the counter accumulating the examination result are additionally printed [12]. In dialogue mode the program receives all the necessary information about the characteristics of the task being solved (in addition to the parameters describing the organization of the arrays of processed implementations, this includes the number of classes and several parameters specific to this program), and its mode of operation is also set by keys. The time spent on an iteration depends mostly on the dimension of the description space: to a first approximation it is proportional to the sample volume, the dimension of the space, and the number of features in one cell.
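The decision rule described above admits a compact sketch; the parameter names (`gap`, `threshold`) and the input layout are illustrative assumptions.

```python
def classify(distances_and_labels, gap, threshold):
    """Decision by the two nearest references: refuse if the nearest reference
    is farther than `threshold`, or if the two minimum distances differ by
    less than `gap` while their labels disagree; otherwise decide by the
    nearest reference.  Assumes at least two (distance, label) pairs."""
    ranked = sorted(distances_and_labels)
    (d1, l1), (d2, l2) = ranked[0], ranked[1]
    if d1 > threshold:
        return None                 # refusal: too far from every reference
    if d2 - d1 < gap and l1 != l2:
        return None                 # refusal: ambiguous between two meanings
    return l1                       # decision by the nearest reference
```

Returning `None` here models a refusal of recognition; a caller would log the second-nearest header in that case, as the text describes.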
Finally, the natural application of the learning outcomes is control. The control can be carried out both on the full list of training results and on an edited list, for example if the complexity of the decision rule obtained in training seems excessive. In the latter case, the "control" block performs, on the training material, a consecutive rearrangement of the separating planes over the attributes retained at editing, realizing a "relaxation search", and sets them so that the minimum total classification error is obtained. The thresholds corresponding to the separating planes are then fixed, and the examination of the control sample is subsequently carried out with them. The third type of control is examination by more than one decision rule, assigning to the class "unknown" the objects on which the decisions do not coincide. In particular, this allows the region of intersection of classes, not determined in advance, to be identified.

Methodological problems of binarization
In today's scientific community, much attention is paid to the problem of automated text recognition. The methods of handwritten text recognition applied here to verbatim documents belong to the offline methods. In offline recognition, the problem of preliminary image processing must be solved first; it consists of two parts: binarization and segmentation.
In our case, binarization was carried out by the threshold method, with optimization of the threshold choice [3][4][5]. Then the noise was removed from the image [5][6]. Segmentation was performed automatically. As a result of the processing, a collection of symbols was formed. Special characters were selected from the total set of symbols, and a control sample was created.
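A minimal sketch of threshold binarization on a grayscale image given as rows of 0-255 intensities. The mean intensity stands in here for the optimized threshold choice; the paper's actual optimization is not reproduced.

```python
def binarize(gray, threshold=None):
    """Threshold binarization: pixels darker than the threshold become ink (1),
    the rest background (0).  If no threshold is supplied, the global mean
    intensity is used as a simple stand-in for an optimized choice."""
    pixels = [v for row in gray for v in row]
    if threshold is None:
        threshold = sum(pixels) / len(pixels)   # assumption: mean as threshold
    return [[1 if v < threshold else 0 for v in row] for row in gray]
```

After this step the segmented, binarized symbols can be compared by any of the measures described below.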
The method of segment-length comparison uses, as a measure of similarity, the total difference between the lengths of segments built according to predetermined rules. For comparing transcripts the following rules were selected: the segments are built from the corner points, as well as from the midpoints of the sides located on the borders of the image, towards its centre, until the first touch with the points of the image [3] (see Fig. 2). The differences between the lengths of the corresponding segments are summed and serve as a measure of proximity: the smaller it is, the more similar the images are. This method has a high recognition speed. Its significant disadvantage is a decrease in accuracy as the alphabet (the set of original characters) grows; note also that the method is sensitive to distortions and breaks. The working principle of the projection-comparison method is as follows. For the compared images, the graphs of the projections of the image points onto the horizontal and vertical axes are constructed. The distance between the images is defined as the total difference between the graphs for the vertical and horizontal axes [4] (see Fig. 3). The calculations made when comparing images by the projection method are simple enough and provide high recognition speed. This method, like the previous one, is sensitive to distortions and becomes less effective as the alphabet grows. The third method, the method of comparison of forms, works as follows. As the measure of similarity of the images we take the total shift of N points of one image relative to N points of the other. The points of the images are matched by solving the assignment problem. The cost of connecting points i and j is determined from the distribution of the points over bins ("baskets") using the $\chi^2$ criterion:
$$c_{ij} = \frac{1}{2}\sum_{k=1}^{K}\frac{[h_i(k) - h_j(k)]^2}{h_i(k) + h_j(k)},$$
where $h_i(k)$ is the number of points in the k-th basket for the i-th point, $i = 1 \ldots N$, $k = 1 \ldots K$. As the initial data of the assignment problem we thus obtain the matrix C with values $c_{ij}$, $i, j = 1 \ldots N$.
The assignment problem was solved by the Hungarian method [4,8]. As a result we obtain a matching of the selected N points of the two images; the total Euclidean distance between the matched points is taken as the measure of similarity. This method is resistant to breaks, but requires a lot of computation.
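The matching step can be sketched as follows. A brute-force search over permutations stands in for the Hungarian method (adequate only for small N; for real N one would use, e.g., `scipy.optimize.linear_sum_assignment`), and the one-half factor in the chi-square cost is a conventional assumption.

```python
from itertools import permutations

def chi2_cost(h_i, h_j):
    """Chi-square distance between two point histograms ('baskets'); bins
    empty in both histograms contribute nothing."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h_i, h_j) if a + b)

def match_points(hists_a, hists_b):
    """Build the N x N cost matrix C from chi-square distances between the
    per-point histograms of two images, then solve the assignment problem by
    exhaustive search, returning matched pairs (i, j)."""
    n = len(hists_a)
    cost = [[chi2_cost(hists_a[i], hists_b[j]) for j in range(n)]
            for i in range(n)]
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(enumerate(best))
```

Given the matched pairs, the total Euclidean distance between the corresponding points would then serve as the similarity measure, as described above.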
As a result of the analysis of the above methods for recognizing shorthand characters, the method of comparison of forms was found to be the most suitable. The quality of the recognition methods was determined on the control collection. An example of the results of the methods is given in Fig. 3; the images are shown in order of increasing distance from the standard. The method of forms comparison is 1, segment-length comparison is 2, projection comparison is 3. Using the control collection, estimates of precision, recall, and F-measure were obtained; they are shown in Fig. 4. The best recognition results were obtained with the form-comparison method; its precision was 54%.

Prediction as pattern recognition (one-dimensional case)
The considered prediction task, approached with the apparatus of statistical classification, assumes that the k parameters $\pi_1, \pi_2, \ldots, \pi_k$ characterizing the state of a system will form an identical set $\Pi$ for systems with an equal or approximately equal performance reserve. In other words, a group of devices (systems) having the same durability will differ from the devices not included in this group by a vector, or state function, described by the coordinate parameters $\pi_s$, $s = 1, 2, \ldots, k$.
In such cases, elements of extrapolation are embedded in predictive methods based on a statistical classification of the stable relationships found between a class $R$ with the appropriate margin of performance and the combination of parameters $\pi_s$. The process of establishing the extrapolation relationships is based on a priori information and is called training on the extrapolation relationships. When forecasting by the results of current control, detection and recognition of the extrapolation bonds is carried out by means of a mathematical model.
Thus, of the whole range of questions solved in statistical classification, two basic ones must be singled out: the quantitative estimation of training on the extrapolation connections and the formation of an optimal recognition model [6].
Let there be an a priori sample of letters or texts of the same type as the diagnosed ones. The state of each letter (word combination) is characterized by a set of k parameters, which are the coordinates of the state vector $\Pi$. In the course of training it is established which vectors form the class with performance reserve $R_1$, which the class with performance reserve $R_2$, and so on; that is, probabilistic and statistical processing of the vectors $\Pi_1, \Pi_2, \ldots$ is performed. The metric used should have properties common to the objects of a class; the best metric is then the one that maximally approximates (compresses) the vectors of one class. As already noted, the task of image recognition has a dual nature: on the one hand, it is necessary to build the characteristics (description) of the class on the basis of a priori information (training); on the other, it is necessary to make a decision on the assignment of the object, or a refusal, on the basis of the current information (recognition, the exam).
There are different ways to describe the classes, but the most suitable in practice is the one connected with the calculation of statistical parameters characterizing the centre of scattering of the random variables (the values of the object parameters from the a priori information).
Let $\bar\pi_s$ statistically characterize the centre of the class in the s-th coordinate (parameter). The relation between the examined object and the class centre can then be expressed through scalar products under different methods of normalization. In general, normalization is desirable in all cases, since it makes clearer the degree of closeness of the examined object to this or that class. Expressions (1) to (5) are quite suitable for this purpose, but they in no way take into account the significance and weight of a given parameter; introducing weighting coefficients makes it possible to improve the accuracy of recognition (prediction) [3]. Thus, the measure of proximity in the linear space of features is defined as
$$\rho = \sum_{s=1}^{k} \lambda_s \lvert \pi_s - \bar\pi_s \rvert,$$
where $\lambda_s$ is the weight factor of the s-th parameter-sign, determined at the training stage for the best prediction of the diagnosed object; the condition
$$\sum_{s=1}^{k} \lambda_s = 1$$
should be observed. A measure of proximity calculated in a nonlinear feature space in many cases gives increased prediction accuracy:
$$\rho_p = \sum_{s=1}^{k} \lambda_s \lvert \pi_s - \bar\pi_s \rvert^{p},$$
where p is the degree of nonlinearity, calculated at the model training stage and designed to reduce the classification error; in practice p is usually 2 or 3. After the training stage it usually turns out to be necessary to solve the problem of building a separating function, i.e. to determine the equation of the surface separating the classes in the space of images (letter combinations). The separating function is built from the similarity functions (measures of proximity) of the diagnosed image to the a priori subsets of images.
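The weighted proximity measure can be sketched as follows. The nonlinear variant is reconstructed from the text's description (weighted deviations raised to the degree p), so its exact form is an assumption.

```python
def proximity(obj, centre, weights, p=1):
    """Weighted proximity of an examined object to the statistical centre of a
    class: a linear sum of weighted coordinate deviations for p = 1, or the
    nonlinear variant with degree p (typically 2 or 3).  The weights are
    required to sum to 1, as the text demands."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * abs(x - c) ** p
               for x, c, w in zip(obj, centre, weights))
```

The smaller the value, the closer the object lies to the class centre; the class with the smallest proximity measure would be chosen at the exam stage.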
The zone method allows, in some cases, the prediction to be performed quickly and easily. The general scheme of the method is as follows.
All parameters of the object are divided into three zones (areas) in the case of two classes. If the value of a parameter falls into the uncertainty zone, no decision on this parameter is taken and it is reasonable to apply a more accurate method. Such an analysis must be carried out for all parameters. The mathematical basis of the method can be presented as follows.
We assume that the closeness measure (similarity function) $\mu_s$ for the s-th parameter is discrete and takes three values: one value if the parameter falls into the zone of the first class, another if it falls into the zone of the second class, and a third in the uncertainty zone. The total closeness measure (over the k parameters) is calculated as a linear sum of the similarity functions:
$$\rho = \sum_{s=1}^{k} \mu_s. \qquad (15)$$
Whether the diagnosed image belongs to one class or the other is determined by the sign of the proximity measure (15): a positive or negative sign tells of the letter combination belonging to the corresponding class, and a zero value of the necessity to refuse recognition. The accuracy of the recognition will obviously depend largely on how well the generalized expression is selected. One of the main criteria for creating such expressions is, of course, how far the classes can be moved apart and how strongly the objects can be compressed within a class. In this case it is done by dividing the set of k parameters of the letter combinations assigned to the first class into two parts. However, the methods of sums and products of probabilities, being means of quick on-line verification of hypotheses about the belonging of objects, do not take into account the a priori probability of the appearance of each class in the general population of classes, which is indispensable in some difficult recognition tasks [5]. The Bayesian recognition scheme, which introduces weighting factors and a comparison threshold depending on the required recognition accuracy, is free of this disadvantage.
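The zone method's three-valued similarity and sign-based decision can be sketched as follows; the zone boundaries, the choice of +1/-1/0 as the three values, and all names are illustrative assumptions.

```python
def zone_similarity(value, zone_1, zone_2):
    """Discrete three-valued similarity for one parameter: +1 if the value
    falls in class 1's zone, -1 in class 2's zone, 0 in the uncertainty zone.
    Zones are (low, high) intervals assumed not to overlap."""
    lo1, hi1 = zone_1
    lo2, hi2 = zone_2
    if lo1 <= value <= hi1:
        return 1
    if lo2 <= value <= hi2:
        return -1
    return 0                        # uncertainty zone: this parameter abstains

def zone_decision(values, zones_1, zones_2):
    """Total proximity measure as the linear sum of per-parameter similarities;
    a positive sign decides class 1, negative class 2, zero is a refusal."""
    total = sum(zone_similarity(v, z1, z2)
                for v, z1, z2 in zip(values, zones_1, zones_2))
    if total > 0:
        return 1
    if total < 0:
        return 2
    return None                     # refuse recognition
```

A zero total, with the per-parameter votes cancelling out, models the case where the zone method is inconclusive and a more accurate method should be applied.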

Conclusion
The paper notes the inexpediency (and error) of drawing a fundamental distinction between recognition methods in terms of their rigor and objectivity. The criteria for choosing a method can be the simplicity of determining the measure of proximity, the complexity of describing the boundaries of classes and images, the resolution, and so on. At the same time it is very important to know the individual physical features of the recognizable objects, the informativeness of the chosen features, the quantity and quality of the a priori and current information, the possibility of introducing adaptation (weight) coefficients, etc.