Classifying Informatory Tweets during Disaster Using Deep Learning

Microblogging platforms like Twitter generate a wealth of information during a disaster. Data arrives in the form of sound, images, text, video, etc. by way of tweets. However, tweets produced during a disaster are not always informative. Informative tweets can provide useful information about affected people, infrastructure damage, relief organizations, etc. Studies show that when it comes to sharing emergency information during a natural disaster, time is everything. Research on Twitter use during hurricanes and floods provides potentially lifesaving data on how information is disseminated in emergencies. The proposed system outlines how to distinguish informative from non-informative tweets during a disaster. The proposed method is based on Word2Vec and a Convolutional Neural Network (CNN): Word2Vec provides the feature vectors and the CNN classifies the tweets.


Introduction
With the emergence of new technologies, our ability to communicate with others has increased dramatically. Communities no longer rely solely on traditional media: newspapers, radio, and TV news. With the emergence of smartphone technology, communities are simply an 'app' away from being able to deliver or receive information within milliseconds. Recently, Twitter has been used to spread updates and news all over the world and has been effective in emergencies during natural calamities like floods, earthquakes and wildfires. Twitter has shown the potential to increase a person's chances of survival during a tornado-related disaster. Such communication platforms help officials during a disaster to compile lists of victims and the deceased, and to keep the families and friends of victims informed during response and planning. Twitter has become an invaluable tool throughout the world, especially for spreading and discussing issues in everyday life. There are an estimated 500 million Twitter users worldwide. But the tweets generated are not always informative in nature; thus there should be some method to classify them.
Deep learning models have found great success in text processing tasks such as text classification and natural language processing [1]. A major challenge is to classify the tweets generated during a disaster based on their informativeness. Combining a deep learning model like a CNN with a word embedding model like Word2Vec yields tweet classification with strong accuracy. Extracting features from the informative tweets generated during a disaster helps in creating a model for classifying tweets.
The rest of the paper is organized as follows. Section 2 describes related work. Section 3 explains the proposed system. Section 4 gives the analysis and experimental results. The conclusion is given in Section 5, and Section 6 outlines future work.

Literature Survey
Several researchers have proposed strategies for classifying tweets based on their informativeness. These strategies are described below.
Deep learning has found great success in image processing and text processing. When searching for similar images, some articles contain images whose titles are unrelated to the image itself. To address this problem, paper [2] proposed a convolutional neural network model that uses image and text features to predict the similarity between them. In the model, the features of the image and the text are extracted and used to score the relationship between images and text. This score can be added to a search strategy as a measure to improve search quality.
Spammers use varied and advanced strategies to evade traditional security mechanisms, creating a need for strong solutions that keep pace with these tactics. Paper [3] proposed a method using low-level n-gram features, which avoids the need for tokenizers or other language-dependent tools. Using publicly available datasets, the authors analysed the performance of several machine learning methods trained on n-grams drawn directly from the tweet text. They also showed that the technique can detect spam tweets with low latency, which is very important in a real Twitter scenario.
Twitter has been an important data source for many applications, which has raised the popularity of tweet sentiment analysis. Paper [4] proposed a method for sentiment analysis of tweets by examining classification strategies. A large number of tweets contain phrases of ambiguous sentiment; thus, successfully resolving such phrases can significantly improve the effectiveness of sentiment analysis techniques. The study establishes a method that focuses on the behaviour of sentiment analysis on a specific type of tweet containing phrases of positive emotion. The results indicate that the proposed method works well for both tasks.
Social media data analysis presents many challenges, such as noisy, short and informal messages, and learning information categories from incoming message streams and classifying messages into them. One of the most basic requirements for these tasks is the availability of data, especially human-annotated data. To demonstrate the utility of such annotations, paper [5] trained machine learning classifiers. In addition, the authors published the first large word2vec embeddings trained on 52 million disaster-related tweets. To address the linguistic peculiarities of tweets, the authors also provide lexical resources mapping common lexical variants to their standard forms.
Paper [6] proposed a method based on a Convolutional Neural Network (CNN) for identifying informative tweets, together with a real-time event-detection algorithm for finding the time at which a given event occurred. In this study, a CNN model trained on recent earthquake-related tweets with crowdsourced labels learns to predict whether a tweet containing an earthquake-related keyword is informative or not. These informative tweets are then used as input for the event-detection phase. With the help of the CNN module, this system can detect earthquakes shortly after they occur, within a tolerable delay, and confirm them before the announcement from the official disaster website.
From the literature survey, we noted that the existing systems either use a small dataset or depend on handcrafted, language-specific features.

Problem Definition
Twitter connects a significant number of active people across the globe, leading to the generation of a tremendous number of tweets. During a disaster, the proposed system can use the responses these active users generate on Twitter. Since many spam tweets are also generated, there must be an automated system for classifying the tweets. To alleviate this problem, various deep neural networks are currently being used for classification. Such a system should produce output with very low latency and should be accessible to officials and local NGOs responding to the catastrophic effects of the disaster.

Proposed System
After analysing the existing systems, the proposed system was designed to overcome their drawbacks. This project aims to classify tweets from preprocessed tweets using a Convolutional Neural Network and the Word2Vec model [7], and the system outputs the informativeness of each classified tweet. The proposed system considers the correlation between the features provided by Word2Vec and the associated tweet to improve the performance of tweet classification. Figure 1 describes the flow of the proposed system, which can be stated as:
• The disaster tweets dataset is provided to the system.
• Text preprocessing is performed to extract useful text from the tweets.

Tweet Preprocessing:
Classification focuses on the textual data of a tweet. A raw tweet may contain text, images, GIFs, emojis, HTML links, etc. Preprocessing removes this non-text data, using regular expressions to strip images and other non-text content along with URLs and hashtags if present. Since creating feature vectors and classifying tweets rely on correlations between content words, words with POS tags such as personal pronoun, determiner, coordinating conjunction, preposition or subordinating conjunction, possessive pronoun, verb (non-3rd person singular present), proper noun (singular), verb (3rd person singular present), verb (base form), modal, adverb, verb (past tense), wh-pronoun, cardinal number, wh-adverb and wh-determiner provide little useful signal and can be removed. Stop words are also removed. The output of this module is text data containing only the targeted words.

Word2vec:
Word2Vec is a two-layer neural network that is trained to reconstruct the linguistic contexts of words. It takes as input a large corpus of words and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus assigned a corresponding vector in that space. The word vectors are positioned in the vector space in such a way that words which share common contexts in the corpus lie close to each other in the space. Word2Vec is a very efficient predictive model for learning word embeddings from raw text. It comes in two flavours: the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model. Figure 2 shows the working of the CBOW and Skip-Gram models.
• Continuous Bag-of-Words (CBOW): CBOW predicts a target word (e.g. 'flood') from a related set of context words ('heavy flood prediction'). CBOW smooths over much of the distributional information by treating the whole context as a single observation, which in many cases makes it a useful choice for smaller datasets.
• Skip-Gram: The skip-gram model predicts the surrounding context words from the target word. Because it treats each (target, context) pair as a new observation, it tends to do better when dealing with larger datasets.
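The difference between the two flavours is easiest to see in how they form training examples from the same sentence. The sketch below (our own illustration, with a window size of 1; the function names are not from the paper) shows that CBOW produces one (context, target) example per word, while skip-gram expands each word into several (target, context) pairs:

```python
def cbow_pairs(tokens, window=1):
    """CBOW: predict the target word from its surrounding context words."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((tuple(context), target))
    return pairs

def skipgram_pairs(tokens, window=1):
    """Skip-gram: predict each context word from the target word."""
    pairs = []
    for i, target in enumerate(tokens):
        for c in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((target, c))
    return pairs

sentence = ["heavy", "flood", "prediction"]
print(cbow_pairs(sentence))      # context -> target, one example per word
print(skipgram_pairs(sentence))  # target -> each context word separately
```

In practice a library such as gensim handles this internally; the point here is only that skip-gram generates more, finer-grained observations, which is why it scales better to large corpora.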

Text to sequences:
As the CNN model requires numbers or vectors to perform convolution, the textual data is converted into sequences of numbers. A tokenizer creates a unique index for each word in a tweet, and these indices form the sequences. Further, since the CNN model requires every tweet to have the same number of elements, padding is performed by appending 0s to the end of each sequence, as shown in Figure 3 below.
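The indexing and padding step can be sketched in a few lines. This is a simplified stand-in for what a library tokenizer (e.g. Keras's Tokenizer and pad_sequences) does; the helper names are our own, and word indices start at 1 so that 0 is reserved for padding:

```python
def build_index(tweets):
    """Assign each distinct word a unique integer index, starting from 1."""
    index = {}
    for tweet in tweets:
        for word in tweet.split():
            if word not in index:
                index[word] = len(index) + 1
    return index

def to_padded_sequences(tweets, index, maxlen):
    """Map words to their indices, then post-pad every sequence with 0s to maxlen."""
    seqs = []
    for tweet in tweets:
        seq = [index[w] for w in tweet.split() if w in index][:maxlen]
        seqs.append(seq + [0] * (maxlen - len(seq)))
    return seqs

tweets = ["heavy flood city", "flood alert"]
index = build_index(tweets)
print(to_padded_sequences(tweets, index, maxlen=4))
```

For the two sample tweets this yields [[1, 2, 3, 0], [2, 4, 0, 0]]: equal-length integer sequences ready to feed into the embedding layer of the CNN.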

Convolution Neural Network:
The Convolutional Neural Network [9] is a deep learning algorithm that takes an input image, assigns importance (learnable weights and biases) to various elements/objects in the image, and is able to differentiate one from another. The preprocessing required for a ConvNet is much lower than for other classification algorithms. While in classic methods filters are engineered by hand, with adequate training ConvNets have the ability to learn these filters/features. The architecture of a ConvNet is notably similar to the connectivity pattern of neurons in the human brain and was inspired by the organization of the visual cortex. Individual neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. A collection of such fields overlaps to cover the whole visual field. In our proposed system, we constructed a CNN model that consists of the following layers:
• Input layer: It consists of the input vectors that need to be classified.

• Embedding layer: It contains the feature vectors provided by the Word2Vec model.
• Dense layer: A regular, fully connected neural network layer; the neurons are densely connected, as every neuron receives input from all neurons of the previous layer.
• Dropout layer: The dropout layer works by probabilistically removing inputs to a layer, which may be input variables in the data sample or activations from the previous layer.
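At the heart of the model is a 1-D convolution sliding a learned filter over the embedded word sequence. The toy sketch below (our own illustration with hand-picked numbers, not the trained model, which would stack the embedding, convolution, dense and dropout layers in a framework such as Keras) shows that core operation:

```python
def conv1d(seq, kernel):
    """Valid 1-D convolution: slide the kernel over a sequence of word vectors."""
    k = len(kernel)
    out = []
    for i in range(len(seq) - k + 1):
        # Dot product of the kernel with a window of k consecutive word vectors
        s = sum(kernel[j][d] * seq[i + j][d]
                for j in range(k) for d in range(len(seq[0])))
        out.append(s)
    return out

# Four 2-dimensional word vectors (stand-ins for Word2Vec embeddings)
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, 1.0], [1.0, 1.0]]  # one filter spanning 2 words

print(conv1d(embedded, kernel))
```

Each output value summarizes one 2-word window of the tweet; the dense and dropout layers then map these window features to the informative / non-informative decision.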

Dataset: For this system, the Social Media Disaster Tweets dataset provided by Kaggle [10] has been used. It contains 10k tweets. The reason for using this dataset is that it contains tweets labelled as relevant or non-relevant based on the keyword or type of disaster. The dataset is split into 8k tweets for training the model and 2k tweets for validation and testing of the model.

Performance Analysis:
The model has been analysed by comparing its score against CNN + Custom Weight Vocabulary, CNN + Google News Vocab and CNN + Twitter GloVe Vocab. Table 1 shows the comparison of the models. From the table, we can observe that, due to the large corpus of vocabulary present in Google News Vocab, it achieved an evaluation accuracy of 86%. Thus, by increasing the size of our custom weight vocabulary, we can increase the accuracy of the model.

Confusion Matrix:
After evaluation, the model was tested on 200 flood tweets generated during the Colorado flood in 2013. Based on the model's predictions, a confusion matrix of actual versus predicted informativeness was created for these 200 tweets. Table 2 shows the confusion matrix.
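A confusion matrix of this kind can be tallied directly from the label lists. The sketch below is a generic illustration with made-up counts, not the paper's Table 2 results:

```python
def confusion_matrix(actual, predicted):
    """2x2 matrix: rows are actual class, columns are predicted class
    (0 = non-informative, 1 = informative)."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Toy labels standing in for the 200 Colorado flood tweets
actual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]

m = confusion_matrix(actual, predicted)
print(m)  # [[TN, FP], [FN, TP]]
accuracy = (m[0][0] + m[1][1]) / len(actual)
```

The diagonal holds the correctly classified tweets, so accuracy falls out as the diagonal sum divided by the total number of tweets.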

Conclusion
Considering the very large number of tweets on the internet, classifying them is a demanding task. Because of the presence of non-textual data in tweets and the unique features of each tweet, text classification is challenging. This system proposes a tweet classifier framework that takes a raw tweet as input and outputs a classification of the tweet as informative or non-informative. Our model has outperformed an existing system that used a combination of CNN + ANN. Using the CBOW and Skip-Gram models of Word2Vec along with a convolutional neural network, we achieved an evaluation accuracy of 84%.

Future Work
The dataset was limited to 10k tweets with particular keywords such as earthquakes, floods, fire, etc. By increasing the size of the dataset and considering all kinds of disasters, the proposed system will aim to improve the accuracy of the model so that it can classify live-streamed Twitter data with better performance.