bigdata.nyc
  • Home
  • About me
  • Resume
  • Portfolio
  • Blog

NYC Restaurant Evaluator

11/26/2016


 
I was able to publish a web version of NYC Restaurant Evaluator. 
https://www.linkedin.com/pulse/nyc-restaurant-evaluator-tae-h-kim?trk=hp-feed-article-title-publish

www.plot.nyc


YASA (Yet Another Sentiment Analysis)

6/16/2015


 
Picture
So, I bought an Amazon Echo. (Yes, I am an early adopter. lol)
As of now, all this expensive gadget does for me is play music. (The very first thing I say when I wake up in the morning is, "Alexa, play iHeartRadio.")

I know some people find Siri and Google Now to be great interactive tools, and some never use them. (I don't, because the search accuracy has not satisfied me yet, but I am quite certain I will use them more often in the future.)

NLP is a hot topic, and there are tons of resources devoted to improving human-computer interaction.
(Hey, we all grew up watching Knight Rider, hoping someday we would own a KITT.)

As a personal project, I challenged myself to take a peek into sentiment analysis.
I am not an expert by any means; I am just curious to know and learn.

UCI hosts a Machine Learning Repository, which offers free datasets.
(https://archive.ics.uci.edu/ml/datasets.html)

The data that I downloaded is the Sentiment Labelled Sentences Data Set from the UCI Machine Learning Repository.
(https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences)
I should credit the donor of the data: Dimitrios Kotzias -> dkotzias '@' ics.uci.edu

The dataset contains sentences gathered from IMDb, Amazon, and Yelp, which I will use as training data.

I removed stop-words from each sentence. (As of now, my stop-words are:
a, able, about, across, after, all, almost, also, am, among, an, and, any, are, as, at, be, because, been, but, by, can, could, dear, did, do, does, either, else, ever, every, for, from, get, got, had, has, have, he, her, hers, him, his, how, however, i, if, in, into, is, it, its, just, least, let, like, likely, may, me, might, most, must, my, of, off, often, on, only, or, other, our, own, rather, said, say, says, she, should, since, so, some, than, that, the, their, them, then, there, these, they, this, tis, to, too, twas, us, wants, was, we, were, what, when, where, which, while, who, whom, why, will, with, would, yet, you, your). 
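A minimal sketch of this filtering step. The post does not show the actual code, so the function name and the choice to lowercase and split on whitespace are assumptions made for illustration; only the stop-word list itself comes from the post.

```python
# Stop-word list copied from the post.
STOP_WORDS = set("""
a able about across after all almost also am among an and any are as at be
because been but by can could dear did do does either else ever every for
from get got had has have he her hers him his how however i if in into is it
its just least let like likely may me might most must my of off often on
only or other our own rather said say says she should since so some than
that the their them then there these they this tis to too twas us wants was
we were what when where which while who whom why will with would yet you your
""".split())


def remove_stop_words(sentence):
    """Lowercase, split on whitespace, and drop stop-words.

    Punctuation stays attached to the words (an assumption of this sketch).
    """
    return [word for word in sentence.lower().split() if word not in STOP_WORDS]


print(remove_stop_words("It was just not a fun experience."))
# ['not', 'fun', 'experience.']
```

Note that "not" is not in the stop-word list, so negations survive the filter.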


Here is some of the output from the mapper.
(I will use MapReduce.)
Each sentence is printed as a list of words, and the value is whether it is positive or negative. (I am still working on how to structure the key-value pairs, so this may well change.)
Some of the output makes sense with the stop-words deleted, but the very last output, ['not', 'fun', 'experience.'], shows as positive. Hmm. So I looked it up: the original sentence is "It was just not a fun experience" and it has a positive rating. So, I guess it is time for data cleaning and validation! From skimming through the output, the length of the sentence does not seem to be a relevant measure, but I'll dig more into that.
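A rough sketch of what such a mapper could look like in Hadoop Streaming style (read lines, write key and value separated by a tab). The input layout, one "sentence<TAB>label" per line with 1 for positive and 0 for negative, is an assumption about the data file, and map_line, run_mapper, and SAMPLE_STOP_WORDS are hypothetical names, not the code used in the post.

```python
import sys

# Abbreviated stop-word set for illustration only; the post's list is longer.
SAMPLE_STOP_WORDS = {"a", "is", "it", "just", "the", "this", "was"}


def map_line(line, stop_words=SAMPLE_STOP_WORDS):
    """Turn one 'sentence<TAB>label' line into a (words, label) tuple.

    Returns None when the line does not match the assumed layout.
    """
    sentence, sep, label = line.rstrip("\n").rpartition("\t")
    if not sep or label not in ("0", "1"):
        return None
    words = [w for w in sentence.lower().split() if w not in stop_words]
    return words, int(label)


def run_mapper(stream=sys.stdin, out=sys.stdout):
    """Read raw lines and write 'word list<TAB>label' records."""
    for line in stream:
        result = map_line(line)
        if result is not None:
            words, label = result
            out.write(f"{words}\t{label}\n")
```

To use it as a streaming mapper, call run_mapper() at the bottom of the script. Skipping malformed lines rather than crashing is one way to start on the data cleaning mentioned above.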


Picture
I made some adjustments to my mapper. I paired up the words in each sentence to figure out which two-word pairs occur the most, since I read that "Pairs and Stripes" are common patterns in Natural Language Processing. Maybe I can apply association rules down the road.
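One simple way to build such pairs is to join each word with the word that follows it. This is only a sketch: the post does not say whether the pairs were adjacent words or every co-occurring combination, so adjacency is an assumption here, and adjacent_pairs is a made-up name.

```python
def adjacent_pairs(words):
    """Pair each word with the word that immediately follows it."""
    return list(zip(words, words[1:]))


print(adjacent_pairs(["not", "fun", "experience"]))
# [('not', 'fun'), ('fun', 'experience')]
```

A list with fewer than two words produces no pairs, so very short sentences simply drop out.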

Picture
Hmmm...the reducer output is a little bit disappointing. I wish the training data were bigger, since only 5 reviews contain "really good", 9 reviews contain "very good", and 11 reviews contain "not good" :(  I guess it's time for a different approach.

Picture
So I was able to get more data from Yelp via its Data Challenge (http://www.yelp.com/dataset_challenge). 









1,569,264 rows of reviews! Yay!
Not only does the Yelp Challenge offer the dataset, it also asks very informative questions.
"Natural Language Processing (NLP): How well can you guess a review's rating from its text alone? What are the most common positive and negative words used in our reviews? Are Yelpers a sarcastic bunch? And what kinds of correlations do you see between tips and reviews: could you extract tips from reviews?"


Picture
Chopping begins. Hmm...'and' should be a stop-word, yet it shows up....time to check the code again. The stop-words filter out OK.
(Ah...it's because of the comma :) I need to strip commas and other punctuation.)
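A small sketch of that fix: strip punctuation from both ends of each token before checking it against the stop-words, so that "and," is recognized as "and". The function name and the sample stop-word sets are hypothetical.

```python
import string


def clean_tokens(text, stop_words):
    """Lowercase, strip leading/trailing punctuation, then drop stop-words."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)  # only the ends are stripped
        if word and word not in stop_words:
            tokens.append(word)
    return tokens


print(clean_tokens("Great food, and, friendly service!", {"and"}))
# ['great', 'food', 'friendly', 'service']
```

Because only the ends of each token are stripped, apostrophes inside words such as "won't" are preserved.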

Picture

Picture
Ahhh...the wait time while Hadoop MapReduce is running is so pleasant. It's time to meditate.
I should've run this on HPC. Running this task locally starts to spill to the hard drive after 10 minutes.

So, I made some changes, and ran it both locally and on HPC.
Some of the results are:
(Each line shows a word pair; the first number is the number of occurrences of the pair, and the second number is the average rating.)
['very', 'good'] 77114 3.87
['first', 'time'] 60827 3.72 
['las', 'vegas'] 54478 3.93
['customer', 'service'] 54360 3.08 
['pretty', 'good'] 53200 3.48 
['really', 'good'] 52922 3.95
['come', 'back'] 50622 3.56 
['happy', 'hour'] 45753 3.79
['great', 'place'] 40049 4.24
['very', 'friendly'] 39055 4.19
['ice', 'cream'] 37710 3.95   
['service', 'great'] 26867 4.24
['mexican', 'food'] 17679 3.68  
['very', 'tasty'] 15269 4.01   
['quality', 'food'] 14360 3.57  
['prime', 'rib'] 13241 3.68
['super', 'friendly'] 12987 4.45
['saturday', 'night'] 12468 3.52
['very', 'helpful'] 11603 4.28
['always', 'good'] 11529 4.11
['friendly', 'helpful'] 11285 4.28
['mac', 'cheese'] 11181 3.78
['chinese', 'food'] 10798 3.57
['chips', 'salsa'] 9490 3.49
['very', 'happy'] 9189 4.32
['peanut', 'butter'] 6560 4.02
['delicious', 'food'] 6405 4.46
['very', 'disappointed'] 6086 1.83  
['absolutely', 'love'] 5995 4.57
['24', 'hours'] 5897 3.64
['never', 'disappointed'] 4974 4.52 
['very', 'pleased'] 4922 4.40
['portion', 'size'] 4769 3.63
['very', 'knowledgeable'] 4686 4.53
['over', 'cooked'] 4417 2.48
['always', 'fresh'] 4247 4.49
['free', 'wifi'] 4096 3.91  -->Listen, restaurant owners. Free wifi will get you bonus points!!
['perfect', 'place'] 3705 4.31
['waste', 'money'] 3167 1.72
.....List goes on and on
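The count and average shown for each pair above can be computed by a reducer along these lines. This is an in-memory sketch, not the Hadoop job from the post: it assumes the mapper emits one (pair, stars) record per occurrence, the name reduce_pairs is hypothetical, and the sample records are made up rather than taken from the Yelp data.

```python
from collections import defaultdict


def reduce_pairs(records):
    """Aggregate (pair, stars) records into {pair: (count, average_stars)}."""
    totals = defaultdict(lambda: [0, 0.0])  # pair -> [count, sum of stars]
    for pair, stars in records:
        totals[pair][0] += 1
        totals[pair][1] += stars
    return {pair: (n, round(total / n, 2)) for pair, (n, total) in totals.items()}


# Made-up records for illustration only.
sample = [
    (("very", "good"), 4),
    (("very", "good"), 5),
    (("waste", "money"), 1),
    (("very", "good"), 3),
]

for pair, (count, average) in sorted(reduce_pairs(sample).items()):
    print(list(pair), count, average)
# ['very', 'good'] 3 4.0
# ['waste', 'money'] 1 1.0
```

Keeping a running count and sum per pair means the reducer never has to hold the individual ratings, which matters at 1.5 million reviews.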

Funny that this captured the fact that the longer the wait, the worse the rating.
['few', 'minutes'] 10373 3.26
['5', 'minutes'] 11669 2.91
['10', 'minutes'] 18188 2.71
['15', 'minutes'] 15919 2.69
['20', 'minutes'] 14952 2.60
['25', 'minutes'] 2799 2.43
['40', 'minutes'] 3371 2.45
['45', 'minutes'] 7742 2.60
['over', 'hour'] 4179 2.40

Retention?
['first', 'time'] 60827 3.72
['second', 'time'] 12280 3.32
['3', 'times'] 7504 3.17
['4', 'times'] 3103 3.29
['5', 'times'] 2267 3.33

Stars
['2', 'stars'] 7071 2.31
['3', 'stars'] 10977 3.07
['4', 'stars'] 15022 3.68
['5', 'stars'] 25409 4.17

We don't want to be disappointed!
['very', 'disappointed'] 6086 1.83  --> low
['never', 'disappointed'] 4974 4.52  -->high 
["won't", 'disappointed'] 4776 4.58 -->higher

To be continued

Restaurant Evaluator 

4/5/2015


 
While searching for a restaurant on Yelp, I was curious to know what the letter grades from the restaurant inspections indicate.
So, I started by searching for the top restaurants in NYC on ZAGAT, because they are supposed to be the cleanest.
Picture
Pretty expensive, and I am sure they offer very delicious dishes! (28-29 out of 30 must be good, especially if anyone is willing to pay that much!)

So, I parsed the Yelp links, and merged them with the inspection data from the New York City Department of Health and Mental Hygiene.
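The merge could be sketched as a lookup on normalized restaurant names. Everything here is an assumption for illustration: the post does not show how the data was joined, the field names "name", "date", and "violation" are hypothetical stand-ins for the real inspection columns, and the sample rows are made up.

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Uppercase, drop punctuation, and collapse runs of whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]", "", name.upper())
    return " ".join(cleaned.split())


def merge_violations(restaurant_names, inspection_rows):
    """Map each restaurant name to its list of (date, violation) tuples."""
    by_name = defaultdict(list)
    for row in inspection_rows:
        by_name[normalize_name(row["name"])].append((row["date"], row["violation"]))
    return {name: by_name.get(normalize_name(name), []) for name in restaurant_names}


# Made-up rows for illustration only.
rows = [
    {"name": "SAMPLE BAR & GRILL", "date": "01/02/2014", "violation": "Example violation text"},
]
merged = merge_violations(["Sample Bar & Grill", "Example Tavern"], rows)
print(merged["Sample Bar & Grill"])  # one matching record
print(merged["Example Tavern"])      # no match
```

A restaurant that finds no match comes back with an empty list, which makes failed joins easy to spot. Exact matching on names is fragile (a hyphen or an accent in one source is enough to break it), so a shared identifier or address would be a sturdier join key where one is available.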
Picture
Picture
I added the links of the top 10 restaurants on ZAGAT. 
Picture
Picture
Please excuse the ugly user interface, since it is only used by me. The purpose of this program is to keep a list of restaurants that I like, so that I can easily look up phone numbers and addresses.

The following violations were found at these top 10 restaurants:
1. Le Bernardin
  • Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit. - 01/31/2013
  • Sanitized equipment or utensil, including in-use food dispensing utensil, improperly used or stored. - 04/10/2014
  • Food worker does not use proper utensil to eliminate bare hand contact with food that will not receive adequate additional heat treatment. - 08/27/2013

2. Bouley
  • Evidence of mice or live mice present in facility's food and/or non-food areas. - 03/14/2014
  • Shellfish not from approved source, improperly tagged/labeled; tags not retained for 90 days. - 09/23/2014
  • Evidence of rats or live rats present in facility's food and/or non-food areas. - 12/04/2014
            (I will probably stop here)

3. Jean-Georges
  • Food not protected from potential source of contamination during storage, preparation, transportation, display or service. - 04/05/2013
  • Nuisance created or allowed to exist. Facility not free from unsafe, hazardous, offensive or annoying conditions. - 04/11/2014

4. Gotham Bar and Grill
         I could not parse the data for this restaurant. Maybe I could have, but I did not attempt to.

5. Eleven Madison Park
  • Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred - 05/07/2013
  • Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit. - 05/07/2013

6. Daniel
        Again, I could not parse the data.
        If I see more results like this, then maybe I should find a different way to merge the data.

7. Sushi Yasuda
  • Single service item reused, improperly stored, dispensed; not used when required. - 07/14/2014
  • Cold food item held above 41 F (smoked fish and reduced oxygen packaged foods above 38F) except during necessary preparation. - 05/20/2014
  • Food contact surface not properly maintained. - 05/20/2014

8. Gramercy Tavern
  • Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies. - 06/02/2014
  • Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist. - 06/02/2014
  • Live roaches present in facility's food and/or non-food areas. - 07/07/2014

9. Peter Luger Steak House
  • Proper sanitization not provided for utensil ware washing operation. - 02/28/2013
  • Evidence of mice or live mice present in facility's food and/or non-food areas. - 03/13/2015
  • Live roaches present in facility's food and/or non-food areas. - 02/28/2013
  • Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist - 03/13/2015

10. La Grenouille
  • Sanitized equipment or utensil, including in-use food dispensing utensil, improperly used or stored. - 04/09/2014
  • Cold food item held above 41F (smoked fish and reduced oxygen packaged foods above 38F) except during necessary preparation. - 03/05/2013
  • Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit. - 04/09/2014

This concludes my findings. Thanks for reading!

*References: 
Top 100 data: 
https://www.zagat.com/best-restaurants/new-york

Inspection data (Jan 2013 - Apr 2015)
https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59

* Please note that most, if not all, of these violations may/should have been resolved by now. This is a personal project using publicly available data without any intention to harm or damage. Request for deletion: kimtaehoon@post.harvard.edu




    Author

    Tae
