A Data Flow Model to Solve the Data Distribution Changing Problem in Machine Learning

Bo-Wen Shang; Ke Wang

doi:10.1051/itmconf/20160705012

Issue		ITM Web Conf. Volume 7, 2016: 3rd Annual International Conference on Information Technology and Applications (ITA 2016)


Article Number		05012
Number of page(s)		5
Section		Session 5: Algorithms and Simulation
DOI		https://doi.org/10.1051/itmconf/20160705012
Published online		21 November 2016

ITM Web of Conferences 7, 05012 (2016)

Bo-Wen Shang^a and Ke Wang

College of Computer Science, National University of Defense Technology, Changsha, 410072, China

^a Corresponding author: 819089115@qq.com

Abstract

Continuous prediction is widely used in fields ranging from social applications to business, and machine learning is an important approach to it. When we use machine learning to make a prediction, we fit the model on the data in the training set and use it to estimate the distribution of the data in the test set. In continuous prediction, however, new data arrive over time and are used to predict future data, which can cause a problem: as the data set grows, its distribution changes, and the training set accumulates a large amount of garbage data. This garbage data should be removed because it reduces prediction accuracy. The main contribution of this article is to use the new data to assess the timeliness of the historical data and to remove the garbage data. We build a data flow model that describes how data flow among the test set, training set, validation set, and garbage set, and that improves prediction accuracy. As the data set changes, the best machine learning model changes as well. To fit the data set better, we design a hybrid voting algorithm in which seven machine learning models predict the same problem and the validation set is used to assign a weight to each model, so that better models receive more weight. Experimental results show that, when the distribution of the data set changes over time, our data flow model removes most of the garbage data and achieves better results than the traditional method of adding all the data to the data set, and that our hybrid voting algorithm predicts better than the average accuracy of the other prediction models.
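
The hybrid voting idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the weights are computed or which seven models are used, so everything here is an assumption made for illustration: each model's weight is taken to be its accuracy on the validation set, three toy threshold classifiers stand in for the seven models, and the names `validation_weights` and `weighted_vote` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence

# Hypothetical model type: any callable mapping a feature vector to a label.
Model = Callable[[Sequence[float]], Hashable]


def validation_weights(models: Sequence[Model],
                       X_val: Sequence[Sequence[float]],
                       y_val: Sequence[Hashable]) -> List[float]:
    """Assumed weighting rule: a model's weight is its validation accuracy."""
    weights = []
    for model in models:
        correct = sum(1 for x, y in zip(X_val, y_val) if model(x) == y)
        weights.append(correct / len(y_val))
    return weights


def weighted_vote(models: Sequence[Model],
                  weights: Sequence[float],
                  x: Sequence[float]) -> Hashable:
    """Return the label whose supporting models have the largest total weight."""
    totals = defaultdict(float)
    for model, w in zip(models, weights):
        totals[model(x)] += w
    return max(totals, key=totals.get)


# Toy stand-ins for the learning models (assumption: not the paper's models).
models = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[0] > 0.2),
    lambda x: 1,
]

# Small validation set used only to derive the weights.
X_val = [[0.1], [0.3], [0.6], [0.9]]
y_val = [0, 0, 1, 1]

weights = validation_weights(models, X_val, y_val)
prediction = weighted_vote(models, weights, [0.4])
```

In this sketch the weights would be recomputed whenever the validation set is refreshed with new data, so a model that fits the current distribution better gains influence over the combined prediction.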

This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
