CAN EMOTICONS BE USED TO PREDICT SENTIMENT?

Abstract: Getting a machine to understand the meaning of language is an important goal in a wide variety of fields, from advertising to entertainment. In this work, we focus on YouTube comments from the top two hundred trending videos as a source of user text data. Previous sentiment analysis models focus on using hand-labelled data or predetermined lexicons. Our goal is to train a model to label comment sentiment with emoticons by training on other user-generated comments containing emoticons. Naive Bayes and Recurrent Neural Network models are both investigated and implemented in this study, and their validation accuracies are found to be .548 and .812, respectively.


Objectives
Our focus is on another major social platform, YouTube, which garners hundreds of thousands of comments and other user-generated statistics. User data yields important results in the social sciences. In particular, we are interested in the top trending YouTube videos, and aim to identify the sentiment of commenters by suggesting what emoticon a user might use with their comments. We suggest that emoticons give insight into the sentiment of the user, and that their pictographic nature gives us a better language for indicating emotion. Using the subset of comments with emoticons, we engineered a labelled dataset of comments and emoticons. Our models take advantage of this labelling to model the emoticon lexicon, which is then used to suggest what emoticons might accompany a comment (Hogenboom 2013). Using this dataset and the models we have created, we hope to answer whether or not we can accurately predict what emoticon a user might use.

Literature Review
Sentiment analysis drives many industries, and being able to correctly identify sentiment in a YouTube comment would allow automated systems to moderate comments or correctly recommend media or advertisements to users. In general, there are two methods that Natural Language Processing researchers use for sentiment analysis: lexicon based and machine learning based. Sentiment analysis is a fairly robust field, and has consistently seen interest since its conception. Interest in the field has grown sharply with the surge in data brought by the rise of the internet; in many cases the amount of data is intractable. Social platforms such as YouTube generate, by themselves, more data than any one human could analyze. Therefore a system of Natural Language Processing (NLP) is required to deal with the sheer volume of data.
Natural Language Processing can be considered a subset of cognitive science or computer science. The concept of natural language processing originally came about in the mid-20th century. The initial motivation was language translation (Salas-Zárate 2017).
Natural Language Processing naturally lends itself to the field of Artificial Intelligence, as there is a strong desire for agents that can understand human language, for example a chat bot. Sentiment analysis did not attract much attention until the early 2000s. The natural language processing systems that were developed at first were only applicable to narrow subject areas, such as answering questions with information from a database about moon rocks, or answering questions from a manual on airplane maintenance (Liu 2012). The explosion of social data quickly created a need to understand language sentiment automatically. With the ubiquity of social media in recent years, sentiment analysis has become more and more applicable to many fields. It has been one of the most active areas of research in natural language processing since the turn of the century (Pozzi 2017).
There are many commercial applications. Sentiment analysis may have significant effects on management, political science, economics, and other social sciences (Liu 2012). Sentiment analysis, also known as opinion mining, refers to the process of creating automatic tools or systems that derive subjective information from text in natural (human) languages, as opposed to computer code. The subjective information most commonly desired by researchers is opinions and sentiments, hence the name. Sentiment analysis, while originally practiced only by computer scientists, has become widely used in the management sciences and the social sciences. Microsoft, Google, Hewlett-Packard, IBM, and others have created their own systems for sentiment analysis.
Before the turn of the century, there were developments in what would later become the field of sentiment analysis. Anderson and McMaster (1982) provided a way to model the affective tone of an entire document based on the "semantic differential scores" of each of the words in the document. The semantic meanings and scores were derived from a 1965 study by Heise. According to Lee and Pang, 2002 marked an explosion of research in sentiment analysis. This increase in the study of the topic was partially attributed to the increasing popularity of machine learning models and the availability of training sets with which they could be trained. Turney (2002) used an algorithm based on parts-of-speech tagging and semantic orientation in order to classify online reviews as recommended or not recommended. Pang, Lee, and Vaithyanathan (2002) used machine learning techniques such as Support Vector Machines and Naive Bayes in order to classify the sentiment of movie reviews. Dave, Lawrence, and Pennock (2003) classified the polarity of web reviews based on several n-gram methods. Their method was not as accurate when applied to individual sentences because it was developed for classifying reviews, which normally contain multiple sentences. Hu and Liu (2004) used a method that could predict the sentiment orientation of opinion words and therefore the opinion orientation of a sentence. It was an unsupervised method that did not require a corpus, and was loosely based on the work of Dave, Lawrence, and Pennock. It returned sentiments at the sentence level rather than for the entire review at once, and then combined the sentence-level sentiments to give a summary of the entire review. Moraes, Valiati, and Neto (2013) showed the effectiveness of machine learning processes as opposed to lexicon-based models. They empirically compared the Support Vector Machine and Artificial Neural Network machine learning methods for sentiment analysis and found that the Artificial Neural Networks performed better. In 2015, Wang, Liu, Sun, B. Wang, and X. Wang showed the effectiveness of long short-term memory (LSTM) recurrent neural networks for sentiment analysis by predicting the sentiments of tweets.

Sentiment Lexicon
The lexicon method splits input text into many individual words or phrases called tokens. Then, it creates a table of these tokens and records the number of times each token shows up in the text. The resulting tally is called a "Bag of Words" model. Once this process is done, another tool called a "Sentiment Lexicon" is used to compute the classification of the bag of tokens. The Sentiment Lexicon holds a pre-recorded sentiment value for each token, which can be a simple positive or negative number or some other value representation, such as a vector. These values can be assigned either manually or by machine learning techniques. Once we have the input text tokenized and a suitable Sentiment Lexicon, the final task is to design a function that computes the final sentiment. The simplest way to do so is to sum the sentiment values of all tokens. The lexicon method is a traditional way to deal with natural language processing problems, and it has a good theoretical basis. Many people still use and study this method even though it dates back to the 1960s. However, it does have some drawbacks, such as ignoring the integrality and continuity of the text. The meaning of a sentence depends heavily on word order and context; these should not be ignored if we want a truly intelligent sentiment processing system (Taboada 2011).
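As an illustration of the lexicon method described above, the following is a minimal sketch in Python. The toy lexicon, its values, and the function names are hypothetical and chosen only for this example; a real sentiment lexicon would be far larger and might store vectors rather than single numbers.

```python
from collections import Counter

# Hypothetical toy lexicon (an assumption for illustration only).
TOY_LEXICON = {"great": 1.0, "love": 1.0, "boring": -1.0, "hate": -1.0}


def bag_of_words(text):
    """Split text on whitespace and count how often each token occurs."""
    return Counter(text.lower().split())


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon value of every token occurrence; unknown tokens score 0."""
    counts = bag_of_words(text)
    return sum(lexicon.get(token, 0.0) * n for token, n in counts.items())
```

Because the score is computed from counts alone, reordering the words of a comment never changes it, which is the drawback noted above.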

Machine Learning
In the machine learning approach to sentiment analysis, the classification algorithm uses a training set to learn a model based on features in the set. This makes a more nuanced classification possible and can help with ambiguous words or interpretations that vary by context. A method of feature extraction must be chosen. Some of these methods use n-grams, which are sequences of n consecutive words. Others use parts-of-speech information or emotional, affective, or semantic data. One of the disadvantages of the machine learning method is that it requires a large set of labelled data to be used as the training set. It is simpler to use the lexicon-based method unless a suitable training set is available (Salas-Zárate 2017).
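To make the n-gram features mentioned above concrete, here is a minimal sketch, assuming the text has already been split into a list of tokens. The function name is hypothetical.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

With n = 2 the comment "have a great day" yields the pairs (have, a), (a, great), and (great, day), so some word order is retained that single-word features discard.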
We will need to classify the sentiments of the emoticons manually in order to prepare them for use in our analysis. Once that is done, we can compile our training set from the comments in the data that already contain emoticons, using the sentiment of each emoticon. Our model will then be able to classify and assign an emoticon to each comment in the data set that does not already contain one. Recurrent Neural Networks (RNNs) have had a great deal of success in Natural Language Processing. The reason is that text data is highly sequential; for example, the word "day" does not mean much unless you know the words that came before it, e.g. "Have a great day." RNNs have advanced the state of the art over previous architectures on short text data (Lee and Dernoncourt 2016).
Given that previous attempts to model sentiment have not thoroughly explored emoticons, we hope to answer the question of whether or not we can accurately recommend emoticons that might accompany a piece of text. Once we have answered this, further research can attempt to analyze sentiment with emoticons on a machine.

Data
To get our data, we used the data science competition website Kaggle. On this website, people share datasets, competitions, and tutorials. We found a dataset containing comments from the top 200 trending YouTube videos. The author of this dataset obtained the data through YouTube's publicly available API, which allows developers to easily query for data on YouTube. The data contains profanity and nonsensical text, and is in general noisy. Some of it could have been generated by bots, and we do no vetting to determine whether a comment actually comes from a human. The noisiness of the data might prevent us from training a successful model; however, we assume that the large amount of data will help our models perform well in spite of its low quality.
In order to answer the question of whether or not a model could recommend emoticons, we created two models that attempt to perform this recommendation. We also created a simple dummy model for purposes of comparison. We have roughly three hundred thousand comments with emoticons, and use them to bootstrap a dataset of comments with labels. More data is desirable, but this is a fairly large corpus for initial research.
In total, there are 691,388 rows in the dataset. A large proportion of them (more than 200,000) contain emoticons, so there is quite a bit of data, and it would be fairly straightforward to access the YouTube API and get more if needed. As for features, we use only the text; likes, reply threads, and so on are ignored in this phase of the project. On average, each text is 15 words long. Figure 1 shows some examples of how the data looks.

Evaluation Metrics
The models are evaluated using a holdout set of data, on which each model recommends five emoticons that might accompany a text. If at least one recommendation is an emoticon that occurs in the validation comment, then we consider it to be a "correct" guess. Accuracy is then the number of correct guesses divided by the total number of guesses.
Keras calls this accuracy "top k categorical accuracy", and we implement it for our models. Mathematically, for each matching pair $x \in \text{Comments}$ and $y \in \text{Labels}$, let $\text{score}(x) = 1$ if any $p \in \operatorname{argmax}_{k=5}(\text{predict\_labels}(x))$ is in $y$, and $\text{score}(x) = 0$ otherwise. Here $\text{predict\_labels}(x)$ returns the probability of each output class occurring. The accuracy of the model is then $\frac{1}{N}\sum_{i=1}^{N} \text{score}(x_i)$, where $x_i \in \text{Comments}$ and $N = |\text{Comments}|$.
One consideration is that the distribution of emoticons occurring in the corpus is highly skewed (see Figure 2); this is a good reason to consider F1 scores, which might be better for future analysis. However, we chose this evaluation metric because it more closely resembles the question we are asking.
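The top-five accuracy defined in this section can be sketched in plain Python as follows. This is an illustrative implementation under the stated definition, not the library metric itself; the function names are hypothetical, and each prediction is assumed to be a list of class probabilities indexed by class number.

```python
def top_k(probabilities, k=5):
    """Return the indices of the k largest probabilities, best first."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    return ranked[:k]


def score(probabilities, true_labels, k=5):
    """1 if any of the top-k predicted classes is among the true labels, else 0."""
    return 1 if set(top_k(probabilities, k)) & set(true_labels) else 0


def top_k_accuracy(all_probabilities, all_true_labels, k=5):
    """Mean score over all comments: correct guesses divided by total guesses."""
    scores = [score(p, y, k) for p, y in zip(all_probabilities, all_true_labels)]
    return sum(scores) / len(scores)
```

Note that a larger k can only raise this accuracy, so scores are comparable only between models evaluated with the same k.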

Analysis Plan
In order to compare the performance of our models, we created a holdout set of data meant only for validating accuracy. We also defined what a prediction would be for each model: each model outputs its five highest-ranked predictions. If any of those predictions appear among the labels of the validation example, then we consider it an accurate prediction.
Then in order to analyze the dataset, we will compute the prediction accuracy of each model and compare those scores. One might also consider looking at the training accuracy of each model; however, these scores are not directly comparable, so we ignore them except for the purposes of optimizing the model.

Approach
In our approach, we had to make a few crucial assumptions and simplifications to contextualize our problem. Firstly, our dataset involved input data with multiple output classifications. For example, a user can add hundreds of the same emoticon or many different emoticons. As a preprocessing step, we narrowed down these classes to the unique emoticons that show up in a comment, and unrolled the data set so that each example has a single label. The other assumption exists only for our Naive Bayes model, and it is that all words in the comments are independent. This assumption is difficult to back up, and it is not clear whether there is mutual dependence or mutual exclusivity between words. However, our Recurrent Neural Network does not have this limitation because it can model the entire sequence.
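The unrolling step described above might look like the following sketch. It assumes the emoticons of a comment have already been extracted into a list; the function name is hypothetical.

```python
def unroll(comment, emoticons):
    """Turn one comment with many emoticons into one (comment, label)
    pair per unique emoticon, keeping first-seen order."""
    unique = list(dict.fromkeys(emoticons))  # drops repeats, preserves order
    return [(comment, emoticon) for emoticon in unique]
```

A comment carrying the same emoticon a hundred times therefore contributes a single training example, while a comment with three distinct emoticons contributes three.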

Preprocessing
One of the most important steps is the preprocessing stage, which is done before any model is trained. We first separate the data into comments with emoticons and comments without emoticons. We then make all comments lowercase and afterwards normalize both sets of comments by creating a dictionary that maps punctuation to tokens and a dictionary of word counts over all comments, which uses the rank of each word in the count ordering as its embedding. Table 2 shows an example of how the dictionaries are used to tokenize a comment. A similar process is used to encode the emoticons: we use a dictionary to encode them as integers.
Preprocessing the comments in this way gives us a normalized integer sequence, which handles comments that might use different capitalizations of the same words.
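A minimal sketch of this tokenization is given below. The punctuation-to-token map and all function names are assumptions made for illustration, since the actual dictionaries are not listed in the text; words are numbered by their rank in the frequency ordering, starting from 1.

```python
from collections import Counter

# Hypothetical punctuation map; the dictionary actually used is not given.
PUNCTUATION_TOKENS = {
    ".": " ||period|| ",
    ",": " ||comma|| ",
    "!": " ||exclamation|| ",
    "?": " ||question|| ",
}


def normalize(comment):
    """Lowercase the comment and replace punctuation marks with word-like tokens."""
    comment = comment.lower()
    for mark, token in PUNCTUATION_TOKENS.items():
        comment = comment.replace(mark, token)
    return comment.split()


def build_vocabulary(comments):
    """Map each word to its rank by frequency over all comments (1 = most frequent)."""
    counts = Counter(word for c in comments for word in normalize(c))
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(comment, vocabulary):
    """Convert a comment to a list of integers, skipping out-of-vocabulary words."""
    return [vocabulary[w] for w in normalize(comment) if w in vocabulary]
```

Skipping unknown words in `encode` is one possible choice; reserving a dedicated integer for them would be an equally reasonable alternative.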

Dummy Model
For purposes of comparison, we created a very simple model that always predicts the emoticon with the largest prior probability. The motivation behind this is that it gives us a baseline score to beat. If we can do significantly better than this, then we know that the models have potential.
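Such a baseline takes only a few lines. The sketch below assumes the training labels are available as a flat list; the function names are hypothetical.

```python
from collections import Counter


def fit_prior_baseline(labels):
    """Return the label with the largest empirical prior probability."""
    return Counter(labels).most_common(1)[0][0]


def predict_baseline(comments, majority_label):
    """Predict the same label for every comment, ignoring its text."""
    return [majority_label for _ in comments]
```

Because the emoticon distribution is skewed, this baseline can score well above chance, which is what makes it a useful reference point.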

Naive Bayes Model
Our second model uses Bayesian statistics: it builds tables of posterior probabilities for each class given a word using Bayes' rule. Naive Bayes is a conditional probability model. Given some instance to be classified, represented by a vector of features

$$\mathbf{x} = (x_1, \ldots, x_n),$$

we compute the probability of each output class $C_k$ using conditional probability:

$$p(C_k \mid x_1, \ldots, x_n) = \frac{p(C_k)\, p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}.$$

We can then rewrite the numerator using the chain rule for repeated applications of conditional probability; the derivation is in Appendix 1. We then add the naive assumption of conditional independence, which allows us to further simplify our model:

$$p(C_k \mid x_1, \ldots, x_n) = \frac{1}{Z}\, p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k),$$

where $Z$ is

$$Z = p(\mathbf{x}) = \sum_{k} p(C_k)\, p(\mathbf{x} \mid C_k),$$

a scaling factor that depends only on the instance. The derivation is in Appendix 2.
In order to make a classifier, we would generally take the argmax of the simplified model without Z, which is the usual definition of the Naive Bayes classifier. In our case, however, we take the top five arguments, since our program recommends multiple emoticons that might be appropriate.
We implement this model in Python, and the model follows Figure 3.
Another problem is that we have to deal with words that never show up in our corpus of texts. To handle this, we smooth the probabilities: any word or class that does not show up is given a very small probability that is close to, but not equal to, zero. Otherwise, the whole product would zero out whenever a word is not in the corpus.
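The classifier and smoothing described in this section can be sketched as follows. This is an illustrative version under stated assumptions, not the authors' implementation: the floor value `EPSILON`, the use of log probabilities to avoid numerical underflow, and all names are choices made for this example.

```python
import math
from collections import Counter, defaultdict

EPSILON = 1e-6  # assumed floor probability for unseen (word, class) pairs


def train(pairs):
    """pairs: iterable of (list_of_words, label).
    Returns class priors and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for words, label in pairs:
        class_counts[label] += 1
        word_counts[label].update(words)
    total = sum(class_counts.values())
    priors = {c: n / total for c, n in class_counts.items()}
    return priors, word_counts


def log_score(words, label, priors, word_counts):
    """log p(C_k) + sum_i log p(x_i | C_k), flooring unseen words at EPSILON."""
    total_words = sum(word_counts[label].values())
    result = math.log(priors[label])
    for w in words:
        count = word_counts[label][w]
        p = count / total_words if count else EPSILON
        result += math.log(p)
    return result


def recommend(words, priors, word_counts, k=5):
    """Return the k classes with the highest unnormalized posterior, best first."""
    ranked = sorted(priors,
                    key=lambda c: log_score(words, c, priors, word_counts),
                    reverse=True)
    return ranked[:k]
```

The scaling factor Z is omitted because it is the same for every class and so does not affect the ranking.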

Recurrent Neural Network
Our third and final model is a recurrent neural network (RNN). Recurrent neural networks are a class of neural networks whose connections form a directed cycle, which gives them a notion of memory and allows them to take time into account. This makes the RNN suited to predicting arbitrary sequences by taking advantage of that memory.
The label data also undergoes another transformation before the RNN begins the learning process. Since the emoticons are encoded as ordinal numbers, the integer representation does not quite make sense, as one emoticon is not greater than another. To rectify this, we represent each integer as a one-hot vector: we take a fixed-length vector whose size is the total number of output classes, and use the integer as the index of the "hot" class. Table 4 gives a small example of encoding a small class space.
In addition to our baseline architecture, we also perform dropout on each layer, which helps guard against training bias because the network probabilistically "drops" some of the weights, which forces it to build redundancies. For the training metric, we implemented the top k categorical accuracy metric described under Evaluation Metrics. For the objective function, we found that categorical cross entropy worked best; it typically works well in multi-class, single-label scenarios.
Using TFLearn, a deep learning library for Python, we implemented the architecture we decided on with relative ease. TFLearn builds on top of TensorFlow, abstracting away many of the more intimate computational components and allowing the programmer to think about the layers and the interactions between layers rather than how to build a well-known type of layer or cell.
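The one-hot encoding and the categorical cross-entropy objective mentioned above can be illustrated without any deep learning library. The sketch below is plain Python written for this example; it is not TFLearn code, and the function names are hypothetical.

```python
import math


def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at the given class index."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector


def categorical_cross_entropy(target, predicted, eps=1e-12):
    """-sum_c target_c * log(predicted_c); eps guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))
```

With a one-hot target, only the predicted probability of the true class contributes to the loss, which is why this objective pairs naturally with single-label data.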

Programming Language and Libraries
• Python 3
• TFLearn, a deep learning library featuring a higher-level API for TensorFlow.
• TensorFlow, a deep learning library.
As mentioned throughout the text, the models were implemented using the listed libraries. We did our coding on the website FloydHub via IPython notebooks, which abstracted away much of the setup. We split our code into three notebooks: one each for preprocessing, the Bayesian model, and the RNN. We ran into very few problems implementing our solution; however, some are outlined below.

Problems
• Bayes Smoothing
We ran into a small hitch with the Bayesian model when querying prior probabilities for values that did not exist in the data. We handled this by "smoothing" the values, assigning a small probability to them.

• Skin Tone Modifiers
Some emoticons exist only to modify other emoticons, e.g. by allowing one to change the skin tone of the smiley face. We found that these confounded our predictions, so we removed them as possible predictions.

• Finding loss, activation, and metrics
We had to experiment many times to find the best loss, activation, and metric functions for our RNN. In our experience, this process may come down to simple trial and error.
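As an illustration of the skin tone issue listed above, the sketch below filters the five Unicode skin tone modifier characters (code points U+1F3FB through U+1F3FF) out of a list of candidate labels. The function name is hypothetical, and this is only one way the removal could be done.

```python
# The five Unicode emoji skin tone modifiers, U+1F3FB through U+1F3FF.
SKIN_TONE_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}


def drop_modifier_labels(emoticons):
    """Remove skin tone modifiers so they are never offered as predictions."""
    return [e for e in emoticons if e not in SKIN_TONE_MODIFIERS]
```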

Refinement
Originally, our RNN model did not perform as well as we had hoped; however, a few optimizations greatly improved its performance. The first model we used was a multi-class, multi-label classifier, which performed very poorly: its accuracy was .508, which left much to be desired. We believe the reason is that instead of a one-hot encoded vector, we had a many-hot encoded vector. This means that the label space would be of order $2^{\#\text{ of emoticons}}$. Since this space is extremely large, the model would have trouble representing any reasonable portion of it. For this reason, we needed to unroll data points to perform multi-class, single-label classification. After adjusting our loss function, metric function, and activation function, we ended up with much better performance. We believe this to be because of the reduction in potential labels to just the number of emoticons. In addition, hyperparameters such as learning rate and batch size were adjusted to find out what settings worked best. The best we found was a learning rate of .001 and a batch size of 128.

Results
In order to validate the models, we created a holdout set of labelled data that none of the models got to use for training or testing. The accuracy of each model using top k categorical accuracy is given in Tables 5 and 6. Table 6 measures how well our recommendation engine produces accurate emoticons to represent our text. Our results do not inspire strong confidence in the Naive Bayes model's ability to recommend emoticons; however, there are some potential improvements to the model, such as n-gram modelling. Notably, the Bayesian model performs decently on the training data, but generalizes quite poorly and shows signs of over-fitting. The RNN, on the other hand, surprisingly performs slightly worse on training, but performs much better on the validation set. Whatever the reason for this phenomenon, it is clear that the RNN generalizes much better.

Visualization of Model Functionality
We have a model that could be incorporated into a wide variety of applications; for example, a browser plugin that predicts what emoticons you might put with a comment and assists the user much like an auto-complete feature. One issue to consider is the nature of YouTube comments themselves, which might prevent the generalization of this model to other applications. However, the models do show that this sort of functionality is possible. For example, we have pulled some examples from the data and run them through our models to produce the tables below; the comments themselves seem to be quite different from more formal forms of language. While the machine learning back-end may not be the most sophisticated, the model does a good job in practice of giving recommendations, and we think it would be good enough for applications to be built on top of.

Limitations
One limitation of our models is that words that do not show up in the YouTube comment corpus cause issues, as our models have trouble predicting outputs for words that they have never seen. One way to fix this might be to mine for more comment data. One drawback of the Naive Bayes model is that we may not be able to model longer-term trends in comments; however, with the short length of the comments, this may be a non-issue. We are also limited in our choice of language modelling because we work at the word level. We would likely see large improvement by expanding our level of modelling to some type of n-gram. The RNN has limitations in multi-class classification, and this may be hindering its ability to learn. Another limitation might be that the training time is cost prohibitive. The model would likely continue to learn and perform better with more training time and data, meaning ultimately a higher cost for the model. Naive Bayes is easy to program and fast to run, with no need to train for hours upon hours.
Another major consideration is that an RNN might be a bad fit. We originally thought long-term sequential modelling would be important, but it turns out the average comment is only 15 words long. Since the texts are so short, we might have to thoroughly rethink our strategy if this sequential modelling is unimportant.

Future Work
In order to eliminate the assumption of independence in the Bayesian model, we can add complexity by changing the level at which we model the data. To do so, we would need to employ a skip-gram or n-gram model that contains larger parts of the sequence data. One might also explore alternative Bayesian models such as Hidden Markov Models. The same improvements to the data modelling using n-grams would likely improve the quality of the RNN results. The RNN model likely has a great deal of room for improvement; one might experiment with hyperparameter tuning or modifying the architecture. There are even more powerful models, such as CRNNs and GANs, that push the state of the art in deep learning. These models would be worth exploring; however, we pushed our newfound deep learning knowledge as far as we could in the time allotted.
Another important consideration is the unrolling of the data. Future work should further explore how to deal with multi-class classification, which would likely involve writing new validation and loss functions for the neural network model. However, the Naive Bayes Model does not suffer from this limitation.
Future work might also try to further connect emoticons and sentiment. We hypothesize that emoticons will naturally lend themselves to easy conversion into sentiment classes. However, our current models predict only what emoticon might be used, and the user of the model would have to infer what sentiment the emoticon might convey depending on context.
One might also find more optimizations by adding further preprocessing steps, for example, eliminating common English words that add very little information.
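The stop-word filtering suggested above could be added to preprocessing along the following lines. The word list here is a small hypothetical sample chosen for illustration; a practical list would be longer and would need checking against the corpus.

```python
# Hypothetical sample of common English words (an assumption for illustration).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set, keeping the original order."""
    return [t for t in tokens if t not in stop_words]
```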

Reflection
Looking back at the process, here are the steps we took to get to the current models:
• Literature Review: We made sure to have a rough idea of what people in this field have tried, and what the state of the art is.
• Deciding on a Model: After reviewing the field, we decided which models we wanted to implement, which set the tone for preprocessing and implementation.
• FloydHub: Next we set up our programming environment with cloud computing in mind. It is important to set up an environment such as FloydHub or AWS to minimize training time by using a fast GPU. At this step we also made sure to download all the libraries we would need.
• Preprocessing: A large majority of our time was spent learning how to deal with the data, and exploring the data itself. We had to go through multiple iterations of embedding and tokenization to find the method that made sense.
• Model Implementation: After preprocessing our data, this step was fairly straightforward. Most of the time at this step went to dealing with edge cases or optimizing the models rather than to the actual implementation.
• Refinement: Refinement may have been the hardest part, because we had to make inferences about why our model was not performing as well as we wanted. It is hard to say what the potential of each model was, so we kept iterating until we had something that seemed substantial.

Conclusion
Overall, there are many areas for potential improvement, and our work serves as a baseline for recommending emoticons. However, we have begun to answer our original question: it seems plausible that emoticons can be assigned with accuracy to comments as noisy as YouTube comments, making it easy for a casual observer to understand the sentiment of a text.

Appendix 1. Chain rule for repeated applications of conditional probability.
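The derivation this appendix refers to is not reproduced in the text. A standard form of the chain-rule expansion is sketched below, writing $C_k$ for an emoticon class and $x_1, \ldots, x_n$ for the words of a comment; this notation is an assumption chosen to match the usual presentation of Naive Bayes, and may differ from the authors' own.

```latex
\begin{align*}
p(C_k, x_1, \ldots, x_n)
  &= p(x_1 \mid x_2, \ldots, x_n, C_k)\, p(x_2, \ldots, x_n, C_k) \\
  &= p(x_1 \mid x_2, \ldots, x_n, C_k)\, p(x_2 \mid x_3, \ldots, x_n, C_k)\,
     p(x_3, \ldots, x_n, C_k) \\
  &= \cdots \\
  &= p(x_1 \mid x_2, \ldots, x_n, C_k) \cdots p(x_{n-1} \mid x_n, C_k)\,
     p(x_n \mid C_k)\, p(C_k)
\end{align*}
```

Under the naive assumption that the words are conditionally independent given the class, each factor $p(x_i \mid x_{i+1}, \ldots, x_n, C_k)$ reduces to $p(x_i \mid C_k)$, so the joint probability becomes $p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$.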