Evaluation of Filtering Methods Applied to the Unstructured Datasets in the Predictive Learning Services

Predictive learning services aggregate and homogenize open data from public sources, in particular from online recruitment agencies. However, a sample of vacancies may contain a varying percentage of noise due to the frequent occurrence of homonyms. This article considers two approaches to noise reduction: the first is based on cosine similarity and the second on contextual words.


Introduction
Many major online services provide an application programming interface (API) for the integration of third-party applications, in particular APIs adhering to the REST principles [1]. Some online recruitment agencies (e.g. Indeed.com [2], HeadHunter [3]) provide an API for fetching a list of vacancies matching the specified search criteria.
During the development of the predictive learning service [4], vacancy data for the IT sector were obtained from these agencies in order to analyze which skills are in the highest demand and which are growing in demand the fastest. However, in the course of the experiments it was found that the data from Indeed.com may contain a significant percentage of noise. One of the reasons is that vacancy descriptions are unstructured. For example, a sample of vacancies requiring the Haskell programming language had a noise percentage of about 54%, which makes the data unsuitable for assessing the real market demand for the corresponding skill.

Table 1. Constraints of the data sources.

                                            Indeed            HeadHunter
Maximal number of documents in the sample   1025              2000
Maximal document size                       ~200 characters   ~400 characters
Historical data                             No                No

Binary text classification is well studied [5,6], but it has not yet been applied to the analysis of the demand for skills. In addition, integrating the filtering into the predictive learning service under the existing constraints (Table 1) is a relevant issue.

Materials
Samples of vacancies taken from Indeed and HeadHunter (Tables 2 and 3 respectively) were processed manually and classified according to the following rule: the job title or description must clearly specify that the skill is required. In other words, vacancies that contain technically oriented text but no explicit reference to the skill or technology are considered noise. The use of a keyword in a meaning other than the skill is considered noise as well.
The possible error in the table values is about 5 vacancies, which does not affect the result. The Haskell, Boo and Clarion skills were excluded from the data of the Russian online recruitment agency because the number of results was lower than the possible error.
Figures 1 and 2 visualise the vacancy to noise ratio.
The noise percentage in the second sample is significantly lower because in the HeadHunter service, where the main language is Russian, professional English terms are less likely to be used with a different meaning than in Indeed, where the main language is English.

Methods
Due to the sample limitations, word2vec [7] and doc2vec

Fig. 3. Filtering algorithm written in pseudocode
The second approach is a keyword search that includes keywords from the contextual domain. The procedure can be described as follows:
1. Preparation of a list of keywords for the contextual domain.
2. Indexation of the unfiltered vacancies. The index includes both the vacancy description and its title.
3. Search for vacancies matching the following condition: the text must contain the keyword and at least one of the contextual keywords.
The list of contextual keywords can be prepared once for each category and then reused. The list was formed from the keywords that appear significantly more often in the domain than in common text. After the initial automated selection, the contextual keywords were additionally selected manually to eliminate ambiguous words such as "basic", which is both a commonly used adjective and the name of a programming language.
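The indexing and search steps of this procedure can be illustrated with a minimal Python sketch. It is not the implementation used in the service: the function names build_index and search, the dictionary layout of a vacancy and the sample data are assumptions made for illustration, and only single-word keywords are handled (a phrase such as "Microsoft Access" would need phrase matching).

```python
import re
from collections import defaultdict


def build_index(vacancies):
    """Map each lower-cased token to the ids of the vacancies containing it.

    Both the title and the description are indexed, as in step 2.
    """
    index = defaultdict(set)
    for vid, vacancy in enumerate(vacancies):
        text = f"{vacancy['title']} {vacancy['description']}"
        for token in re.findall(r"[a-z0-9+#]+", text.lower()):
            index[token].add(vid)
    return index


def search(index, keyword, contextual_keywords):
    """Return ids of vacancies with the keyword AND at least one contextual keyword."""
    with_keyword = index.get(keyword.lower(), set())
    with_context = set()
    for word in contextual_keywords:
        with_context |= index.get(word.lower(), set())
    return sorted(with_keyword & with_context)


# Hypothetical sample: the second vacancy mentions "Haskell" as a proper name.
vacancies = [
    {"title": "Haskell developer",
     "description": "Build compilers in a functional programming team"},
    {"title": "Cook",
     "description": "Cafeteria at Haskell Indian Nations University"},
    {"title": "Java developer", "description": "Backend programming"},
]
contextual = ["programming", "developer", "software"]  # assumed list

index = build_index(vacancies)
matches = search(index, "Haskell", contextual)
print(matches)
```

In this sample only the first vacancy satisfies the condition: the second contains the keyword without any contextual keyword, and the third contains contextual keywords without the keyword.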

Results
The experiment shows that the filter based on cosine similarity reduces the percentage of noise in the vacancy list (Fig. 4). The values in the chart represent the difference between the raw number of vacancies and the number of vacancies after the filtering algorithm is applied. The mean decrease in noise is about 15.34%. However, in some cases the filter returns fewer vacancies than the actual number, which may affect the representativeness of the skill assessment.
As is evident from Fig. 5, the approach based on contextual keywords shows higher accuracy. The number of noise vacancies decreased by 91.7%. The miscalculation for the keywords "MySQL" and "Microsoft Access" might be related to their wide use outside the software development sector: contextual keywords from the software development sector do not always occur in the vacancies containing these keywords, which leads to an underestimation of the number of results.

Discussion
Both solutions have their advantages and disadvantages.
The performance of the filter based on cosine similarity is acceptable for problems with a limited data set, as the complexity of calculating the cosine similarity between all the texts is O(n^2). Because the algorithm can underestimate the actual number of vacancies, further studies are reasonable in order to improve its accuracy.
The filter based on contextual keywords has a lower computational complexity, O(n log(n)), and shows higher accuracy. However, it requires a list of contextual keywords to be prepared manually prior to the filtering. As a result, filtering can be performed only for a limited number of professional areas. An algorithm for the automatic selection of contextual keywords could be implemented to extend the set of professional areas; it would eliminate the manual preparation stage and make it possible to apply the method to any area of job search.
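To give a sense of scale for the quadratic cost, the short sketch below counts the distinct vacancy pairs that a full similarity matrix has to cover at the maximal sample sizes listed in Table 1. The helper name pair_count is an assumption made for illustration.

```python
def pair_count(n):
    """Number of distinct unordered pairs among n documents: n(n-1)/2."""
    return n * (n - 1) // 2


# Maximal sample sizes from Table 1 (Indeed and HeadHunter respectively).
for service, n in (("Indeed", 1025), ("HeadHunter", 2000)):
    print(service, n, pair_count(n))
# Indeed 1025 524800
# HeadHunter 2000 1999000
```

Roughly doubling the sample size about quadruples the number of comparisons, which is why the approach suits limited samples.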
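One possible starting point for such automatic selection is the frequency criterion described in the Methods section: keep the words whose relative frequency in domain texts is much higher than in common text. The sketch below only illustrates that idea. The function names, the min_ratio threshold, the smoothing constant and the sample texts are all assumptions, and a manual review would still be needed to remove ambiguous words such as "basic".

```python
import re
from collections import Counter


def relative_frequencies(texts):
    """Relative frequency of every lower-cased alphabetic token in the texts."""
    counts = Counter(t for text in texts
                     for t in re.findall(r"[a-z]+", text.lower()))
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {term: count / total for term, count in counts.items()}


def candidate_contextual_keywords(domain_texts, common_texts,
                                  min_ratio=3.0, smoothing=1e-6):
    """Words at least min_ratio times more frequent in the domain corpus.

    min_ratio and smoothing are illustrative values, not taken from the article.
    """
    domain = relative_frequencies(domain_texts)
    common = relative_frequencies(common_texts)
    ratio = {term: freq / (common.get(term, 0.0) + smoothing)
             for term, freq in domain.items()}
    selected = [term for term, r in ratio.items() if r >= min_ratio]
    return sorted(selected, key=lambda term: -ratio[term])


# Hypothetical miniature corpora.
domain_texts = ["developer writes software and tests the software",
                "the developer reviews code"]
common_texts = ["the cat and the dog", "the weather and the news"]

candidates = candidate_contextual_keywords(domain_texts, common_texts)
print(candidates)
```

In this toy example the function words "the" and "and" are rejected because they are at least as frequent in the common corpus, while domain words such as "software" and "developer" are kept as candidates.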

Conclusion
It is reasonable to integrate both solutions into the predictive learning service and to monitor how robust both approaches are to changes in the labor market.

Fig. 1. Vacancy to noise ratio in the data sample from the Indeed service.

Fig. 2. Vacancy to noise ratio in the data sample from the HeadHunter service.

[8] models perform with low efficiency. Therefore, two different approaches were chosen as possible filtering methods:
1. Unsupervised, based on the use of cosine similarity.
2. Supervised, based on the use of contextual words.
The first approach is based on the calculation of the cosine similarity between the vector representations of the vacancy texts. The algorithm consists of the following steps:
1. Tokenization.
2. Stemming.
3. Removal of the stop-words.
4. Calculation of the cosine similarity matrix between all the vacancies.
5. Calculation of an ordered list of factors.
6. Selection of the reference factor.
7. Filtering based on the reference factor and standard deviation.
The algorithm is shown in the form of pseudocode in Figure 3.
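The seven steps listed above can be sketched in Python as follows. The text does not define the factor, the reference factor or the exact use of the standard deviation, so the choices here are assumptions: the factor of a vacancy is its mean similarity to all other vacancies, the reference factor is the middle element of the ordered list, and a vacancy is kept when its factor is not lower than the reference minus k standard deviations. The stemmer is a crude stand-in for a real one, and the stop-word list and sample texts are illustrative.

```python
import math
import re
from collections import Counter
from statistics import pstdev

STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "with", "is", "on", "or"}


def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ers", "er", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vectorize(text):
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())          # 1. tokenization
    stems = [stem(t) for t in tokens]                          # 2. stemming
    return Counter(s for s in stems if s not in STOP_WORDS)    # 3. stop-word removal


def cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def filter_vacancies(texts, k=1.0):
    n = len(texts)
    if n < 2:
        return list(texts)
    vectors = [vectorize(t) for t in texts]
    # 4. cosine similarity matrix between all the vacancies
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    # 5. factors (assumption: mean similarity to the other vacancies), ordered
    factors = [sum(sim[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]
    ordered = sorted(factors, reverse=True)
    # 6. reference factor (assumption: middle element of the ordered list)
    reference = ordered[n // 2]
    # 7. keep vacancies within k standard deviations below the reference
    threshold = reference - k * pstdev(factors)
    return [t for t, f in zip(texts, factors) if f >= threshold]


# Hypothetical sample: the last text uses "Haskell" as a proper name.
sample = [
    "Haskell developer with functional programming experience",
    "Senior Haskell developer, functional programming, GHC",
    "Functional programming engineer: Haskell, GHC, Cabal",
    "Haskell Indian Nations University seeks cafeteria cook",
]
kept = filter_vacancies(sample)
print(kept)
```

On this sample the last text shares only the word "Haskell" with the others, so its factor falls below the threshold and it is removed. With other definitions of the factor or the reference, the boundary between kept and removed vacancies would shift.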

Fig. 4. Results of the algorithm based on the cosine similarity.
Fig. 5.

Table 2. Vacancy to noise ratio in the data sample from the Indeed service.

Table 3. Vacancy to noise ratio in the data sample from the HeadHunter service.