YASA (Yet Another Sentiment Analysis)

6/16/2015

So, I bought an Amazon Echo. (Yes, I am an early adopter. lol)
As of now, all this expensive gadget does for me is play music. (The very first thing I say when I wake up in the morning is, "Alexa, play iHeartRadio.")

I know some people find Siri/Google Now to be great interactive tools, and some never use them. (I don't, because the search accuracy hasn't satisfied me yet, but I am fairly certain I will use them more often.)

NLP is a hot topic, and there are tons of resources trying to improve human-computer interactions.
(Hey, we all grew up watching Knight Rider, hoping someday we would own a KITT.)

As a personal project, I challenged myself to take a peek into Sentiment Analysis.
I am not an expert by any means; I am just curious to know and learn.

UCI has a Machine Learning Repository that offers free datasets.
(https://archive.ics.uci.edu/ml/datasets.html)

The data I downloaded is the Sentiment Labelled Sentences Data Set from this UCI Machine Learning Repository.
(https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences)
I should mention the donor of the data: Dimitrios Kotzias -> dkotzias '@' ics.uci.edu

The dataset contains sentences gathered from IMDb, Amazon, and Yelp, which I will use as training data.

I removed stop-words from each sentence; a rough sketch of this filtering step follows the list. (As of now, my stop-words are:
a, able, about, across, after, all, almost, also, am, among, an, and, any, are, as, at, be, because, been, but, by, can, could, dear, did, do, does, either, else, ever, every, for, from, get, got, had, has, have, he, her, hers, him, his, how, however, i, if, in, into, is, it, its, just, least, let, like, likely, may, me, might, most, must, my, of, off, often, on, only, or, other, our, own, rather, said, say, says, she, should, since, so, some, than, that, the, their, them, then, there, these, they, this, tis, to, too, twas, us, wants, was, we, were, what, when, where, which, while, who, whom, why, will, with, would, yet, you, your.)
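
Here is a minimal sketch of that filtering step (my own illustration; the lowercase-and-split tokenization is an assumption about the simplest approach):

# Minimal sketch of the stop-word filter (illustrative).
# STOP_WORDS is abbreviated here; the full list is above.
STOP_WORDS = {"a", "an", "and", "the", "was", "it", "just", "to", "is"}

def remove_stop_words(sentence):
    # Lowercase, split on whitespace, and drop stop-word tokens.
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("It was just not a fun experience."))
# ['not', 'fun', 'experience.']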


Here is some of the output from the mapper.
(I will use MapReduce.)
Each sentence is printed out as a list of words, and the value is whether it is positive or negative. (I am still working out how to structure the key-value pairs, so I am certain this may change.)
Some outputs make sense with the stop-words deleted, but the very last output, ['not', 'fun', 'experience.'], shows as positive. Hmm. So I looked it up, and the original sentence is "It was just not a fun experience" and it has a positive rating. So, I guess it is time for data cleaning and validation! From skimming through the output, the length of the sentence does not seem to be a relevant measure, but I'll dig more into that.
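
For reference, here is roughly what the mapper looks like (a sketch in the Hadoop Streaming style; the UCI files are one "sentence<TAB>label" line each, with 1 = positive and 0 = negative):

#!/usr/bin/env python
# mapper.py (sketch) -- read "sentence<TAB>label" lines from stdin and
# emit the stop-word-filtered word list with its sentiment label.
import sys

STOP_WORDS = {"a", "an", "and", "the", "was", "it", "just", "to", "is"}  # abbreviated

for line in sys.stdin:
    parts = line.rstrip("\n").rsplit("\t", 1)
    if len(parts) != 2:
        continue  # skip malformed lines
    sentence, label = parts
    words = [w for w in sentence.lower().split() if w not in STOP_WORDS]
    # For now the word list is the key and the 0/1 label is the value.
    print("%s\t%s" % (words, label))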


Some adjustments to my mapper: I paired up the words in each sentence to figure out which two-word pairs occur the most, since I read that "Pairs and Stripes" is a common pattern in Natural Language Processing. Maybe I can apply association rules down the road.
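
Here is that adjusted mapper (a sketch; emitting each adjacent pair with a count of 1 is the basic Pairs pattern):

#!/usr/bin/env python
# pairs_mapper.py (sketch) -- emit every adjacent word pair with a count
# of 1; the reducer then sums the counts per pair.
import sys

STOP_WORDS = {"a", "an", "and", "the", "was", "it", "just", "to", "is"}  # abbreviated

for line in sys.stdin:
    parts = line.rstrip("\n").rsplit("\t", 1)
    if len(parts) != 2:
        continue
    sentence, label = parts
    words = [w for w in sentence.lower().split() if w not in STOP_WORDS]
    for w1, w2 in zip(words, words[1:]):  # adjacent word pairs
        print("%s,%s\t1" % (w1, w2))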

Hmmm... a somewhat disappointing output from the reducer (sketched below). I wish the training data were bigger, since only 5 reviews contain "really good", 9 reviews contain "very good", and 11 reviews contain "not good" :( I guess it's time for a different approach.
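
The reducer itself is the standard streaming sum (a sketch; it relies on Hadoop sorting the mapper output by key, so all counts for one pair arrive consecutively):

#!/usr/bin/env python
# reducer.py (sketch) -- sum the counts for each word pair.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").rsplit("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, total))
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, total))  # flush the last key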

So I was able to get more data from Yelp via its Data Challenge (http://www.yelp.com/dataset_challenge).

1,569,264 rows of reviews! Yay!
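
The reviews come as one JSON object per line, so loading them is straightforward (a sketch; the file name is from my local copy and may differ, and "text" and "stars" are the two fields I use):

import json

count = 0
with open("yelp_academic_dataset_review.json") as f:
    for line in f:
        review = json.loads(line)
        text, stars = review["text"], review["stars"]  # review body + star rating
        count += 1
print(count)  # 1,569,264 in the snapshot I downloaded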
Not only does the Yelp Challenge offer the dataset, it also asks very informative questions:
"Natural Language Processing (NLP): How well can you guess a review's rating from its text alone? What are the most common positive and negative words used in our reviews? Are Yelpers a sarcastic bunch? And what kinds of correlations do you see between tips and reviews: could you extract tips from reviews?"


Chopping begins. Hmm... 'and' should be a stop-word, yet it still shows up... check the code again. The stop-word filter works OK.
(Ah... it's because of the comma :) I need to strip commas and other punctuation.)
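
The fix is small (a sketch): strip punctuation off each token before the stop-word check, so "and," is recognized as the stop-word "and".

import string

def clean(token):
    # Strip leading/trailing punctuation so "and," matches the stop-word "and"
    # and "good," and "good" count as the same token.
    return token.strip(string.punctuation).lower()

print(clean("and,"))   # and
print(clean("Good!"))  # good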

Ahhh... the wait time while Hadoop MapReduce is running is so pleasant. It's time to meditate.
I should've run this on HPC; running the task locally, it starts spilling to the hard drive after 10 minutes.

So, I made some changes and ran it both locally and on HPC.
Some of the results are below; a sketch of the reducer that computes these follows the list.
(word pair, then the number of occurrences of the pair, then the average star rating)
['very', 'good'] 77114 3.87
['first', 'time'] 60827 3.72 
['las', 'vegas'] 54478 3.93
['customer', 'service'] 54360 3.08 
['pretty', 'good'] 53200 3.48 
['really', 'good'] 52922 3.95
['come', 'back'] 50622 3.56 
['happy', 'hour'] 45753 3.79
['great', 'place'] 40049 4.24
['very', 'friendly'] 39055 4.19
['ice', 'cream'] 37710 3.95   
['service', 'great'] 26867 4.24
['mexican', 'food'] 17679 3.68  
['very', 'tasty'] 15269 4.01   
['quality', 'food'] 14360 3.57  
['prime', 'rib'] 13241 3.68
['super', 'friendly'] 12987 4.45
['saturday', 'night'] 12468 3.52
['very', 'helpful'] 11603 4.28
['always', 'good'] 11529 4.11
['friendly', 'helpful'] 11285 4.28
['mac', 'cheese'] 11181 3.78
['chinese', 'food'] 10798 3.57
['chips', 'salsa'] 9490 3.49
['very', 'happy'] 9189 4.32
['peanut', 'butter'] 6560 4.02
['delicious', 'food'] 6405 4.46
['very', 'disappointed'] 6086 1.83  
['absolutely', 'love'] 5995 4.57
['24', 'hours'] 5897 3.64
['never', 'disappointed'] 4974 4.52 
['very', 'pleased'] 4922 4.40
['portion', 'size'] 4769 3.63
['very', 'knowledgeable'] 4686 4.53
['over', 'cooked'] 4417 2.48
['always', 'fresh'] 4247 4.49
['free', 'wifi'] 4096 3.91  --> Listen, restaurant owners: free wifi will get you bonus points!!
['perfect', 'place'] 3705 4.31
['waste', 'money'] 3167 1.72
..... the list goes on and on.
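
Here is the reducer behind these numbers (a sketch, assuming the mapper now emits "w1,w2<TAB>stars" for each adjacent pair, so the reducer can count pairs and average the star ratings):

#!/usr/bin/env python
# avg_reducer.py (sketch) -- count each word pair and average its star ratings.
import sys

current, count, star_sum = None, 0, 0.0
for line in sys.stdin:
    key, stars = line.rstrip("\n").rsplit("\t", 1)
    if key != current:
        if current is not None:
            w1, w2 = current.split(",")
            print("['%s', '%s'] %d %.2f" % (w1, w2, count, star_sum / count))
        current, count, star_sum = key, 0, 0.0
    count += 1
    star_sum += float(stars)
if current is not None:
    w1, w2 = current.split(",")  # flush the last pair
    print("['%s', '%s'] %d %.2f" % (w1, w2, count, star_sum / count))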

Funny that this captured the fact that longer waits mean worse ratings.
['few', 'minutes'] 10373 3.26
['5', 'minutes'] 11669 2.91
['10', 'minutes'] 18188 2.71
['15', 'minutes'] 15919 2.69
['20', 'minutes'] 14952 2.60
['25', 'minutes'] 2799 2.43
['40', 'minutes'] 3371 2.45
['45', 'minutes'] 7742 2.60
['over', 'hour'] 4179 2.40

Retention?
['first', 'time'] 60827 3.72
['second', 'time'] 12280 3.32
['3', 'times'] 7504 3.17
['4', 'times'] 3103 3.29
['5', 'times'] 2267 3.33

Stars
['2', 'stars'] 7071 2.31
['3', 'stars'] 10977 3.07
['4', 'stars'] 15022 3.68
['5', 'stars'] 25409 4.17

We don't want to be disappointed!
['very', 'disappointed'] 6086 1.83  --> low
['never', 'disappointed'] 4974 4.52  -->high 
["won't", 'disappointed'] 4776 4.58 -->higher

To be continued