(Github: https://github.com/pinhao1994/best-at-times-square)

1. Introduction

Our project explores the relationships between hotels and restaurants located in Times Square, based on analyzing data from “NYC Open Data Source” and Yelp. Current applications in the market might focus solely on hotels or restaurants, but we would like to see if there exists interactive correlations between hotels and restaurants surrounded in terms of price and rating. The results might have an influence on travelers’ hotel choice when they travel, especially for the top NYC attraction and landmark-Times Square.

Group responsibilities are listed as follow: Pin-Hao Chen—data preprocessing and interactive component designing; Youyang Liu—data exploration and data visualization; Jolie Tan—data preprocessing, data exploration, and data visualization; Anna Zhou—data exploration and interactive component implementing.

2. Description of Data

We download Times Square Food & Beverage Locations and Times Square Hotels datasets from NYC Open Data (https://opendata.cityofnewyork.us/) and fetch Yelp data through Yelp API. The data binding and data wrangle processes are described in following sections.

2.1. Data Cleaning, Yelp Combining, and Calculating

(R script: https://github.com/pinhao1994/best-at-times-square/blob/master/Data_Preprocess.Rmd)

2.1.1. Data Cleaning

The first step would be formating the phone numbers to be a sequence of integer numbers by replacing the special characters, like “(”, “)”, and “-”.

Secondly, removing “Website”, “Location.1”, and “Borough” columns in both datasets due to following reasons.

  1. Website column: there are many missing values in the column, and all those website url links give us pretty limited information. For example, all the Starbucks have the same official website in our original dataset. This is not helping to separate or to compare which Starbucks will be better according to the link.

  2. Location.1 column: this column contains tuple of latitude and longitude which is duplicated since there are already two other columns containing the same information.

  3. Borough column: Since all of our data points are around Times Square, they have same borough – Manhattan – which is not informative.

Because Food and Beverage Data is more complicated, there are more following steps would be working on. Third, replacing special unicode character such as “Café / Deli” with “Deli” in the “Subindustry” column. Fourthly, substituting the value in “Subindustry” with the value in “Sub.Subindustry” column, if the former value is “Quick Serve” or “Full Serve”. Finally, removing “Sub.Subindustry” column from the dataset since we have combined it with the Subindustry column.

2.1.2. Yelp Combining

Yelp API allows us to grab info to fetch the “Yelp Category”, “Yelp Rating”, and “Review Count on Yelp”. The program we setup would search stores (hotels and restaurants) by the Phone Number; if there was no matches by searching phone numbers, the program would search hotels and restaurants by the combination of other features, such as Address, category, etc.; after the previous two steps, if the program cannot find the corresponding objects, it would recommend 5 candidates for users to manually decide the Yelp store for preprocessing.

2.1.3. Caculating

We deployed levenshtein distance to find the the neighbor of each hotel or restaurant within \(100^2\pi\) square meters.

Moreover, averaged the Yelp ratings, prices, and review counts of all neighborhood restaurants for each hotels.

We also add an index to calculate the score. If a restaurant is rated with 4 stars but only having 2 persons rate it, the rating of the restaurant is not reliable. The formula would be (60% x review counts)*(40% x rating).

2.2. Web Interactive: GeoJson Processing

(R Script: https://github.com/pinhao1994/best-at-times-square/blob/master/Data_Preprocess_Map.Rmd)

In order to perform the web interactive part, we pre-calculate and format the data into self-defined form.

For Hotel data, we add two more columns top5cat and top3restaurant to store further information as follow:

  1. top5cat: Within \(100^2 \pi\) square meters, counting what would be the top 5 categories with respect to each hotel. Store the information in the form of “\(Category_A:Count_A\) | \(Category_B:Count_B\) |…”.

  2. top3restaurant: Within \(100^2 \pi\) square meters, counting what would be the top 3 restaurants with respect to each hotel by our designed importance weight. Store the information in the form of “\(food\_id_A\) | \(food\_in_B\) | \(food\_id_C\)”.

For Restaurant data, we add one more column called Category2ndlevel. We would like to have our upper level categories to be classified according to continents. So we have Asian, Mid East, North American, South American. Also, we keep other categories, Steakhouse, Coffee, Pizza, Deli, and Other. Also, as for those extra categories, it is ambiguous to classify them to the continental categories, but if we group them into Other, that would cause Other overweighted.

Last, we transformed csv to geojson by online transformer (https://mygeodata.cloud/converter/csv-to-geojson). Geojson type of data is prepared for the interactive web.

3. Data Analaysis Quality

3.1. General Distribution

We want to explore the restuarant data gerenally in rating, review, postcode, price, category and missing data.

library(tidyverse)
library(ggplot2)
food<-read.csv('data/food_processed.csv')

# plot rating distribution
## from the histgram we know that most of our resturant are rated as 3.0~4.0 and there is no resturant rated as 5. And it's a left-skewed data
ggplot(data = food, aes(food$Rating)) + geom_histogram(binwidth = 0.5, color = "lightblue") + labs(x="Rating")

From the histgram we know that most of our resturant are rated as 3.0~4.0 and there is no resturant rated as 5. And it’s left-skewed.

# plot postcode distribution

ggplot(data = food, aes(factor(food$zip_code))) + geom_bar() + labs(x="Zip Code")

From the histgram we can see that most of our resturants are located in 10036 and 10019 district. The restuarant in 10011, 10012, 10033 might be outliers or out of boarder.

# plot price distribution

ggplot(data = food, aes(food$Price)) + geom_bar(color = "lightblue") + labs(x="Price")

We can see that the majority resturants have price on level 2 and no resturant on level 5. Since our price data are shown as dollar sign from yelp API, the data point of 2.5 might be an error.

# plot review distribution 

ggplot(data = food, aes(food$Review_Count)) + geom_histogram(color = "lightblue") + labs(x="Review Count")

From the plot we can see that most of our restuarants review are below 1,000. The higher the review is, the fewer resturants there are. And there are some outliers that have more than 1,000 ratings.

## from this plot we can see that most common category are American, Deli, Italian, Pizza, and steakhouse.
ggplot(data = food,aes(food$Category_data)) +
  geom_bar()+coord_flip() + labs(x="Original Category")

From this plot we can see that, there are 35 categories in total. The most common category are American, Deli, Italian, Pizza, and steakhouse.

ggplot(data = food,aes(food$Category_2nd_Level)) +
  geom_bar()+coord_flip() + labs(x="Self-defined Category")

According to the second level category, we can see that North American, Deli and European are most common 3 category.

3.2. Missing Pattern

#library(tidyverse)
#library(ggplot2)
#library(dplyr)
#library(tidyr)
#library(DAAG)
food <- read.csv('data/food_processed.csv', header=T, na.strings=c("", "NA"))

row.names(food) <- food$ID

tidyfood <- food %>%
  rownames_to_column("id") %>%
  gather(key, value, -id) %>%
  mutate(missing = ifelse(is.na(value), "yes", "no"))

ggplot(tidyfood, aes(x = key, y = fct_rev(id), fill = missing)) +
  geom_tile(color = "white") + 
  ggtitle("food data with missing values") +
  #scale_fill_viridis_d() + # discrete scale
  theme_bw() + labs(x="Columns", y="ID")

It’s obvious that there are 3 missing pattern in our data. The most common ones are Price, Rating, Review count and Score (the Score is calculated from Price and Rating). And the second common feature is missing category data. The third pattern is missing them both. The first situation is because the data in yelp are missing.

4. Main Analysis

# plot rating distribution fill by district(postcode)

ggplot(data = food, aes(food$Rating, fill=factor(food$zip_code))) + 
  geom_histogram(aes(x= food$Rating,y = (..count..)/sum(..count..)), 
                 breaks=c(0.0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0),
                 position = 'dodge') + 
  labs(x="Rating", y="Density", fill="Zip Code")

We can see that the resturant’s rating in district 10036 has a peak around 3.5. And the resturant’s rating in distrct 10019 has a peak around 3. Both of the 10036 and 10019 distributions are left skewed.

ggplot(data = food, aes(food$Price, fill=factor(food$zip_code))) + 
  geom_histogram(aes(x= food$Price,y = ..density..),binwidth = 1,position = 'dodge') +
  labs(x="Price", fill="Zip Code")

The above graph shows that most of the restaurants in the area have Price level 2. District 10019 and 10036 seem to have similar price distribution pattern.

And we can see that 10011, 10012, 10033 only have 1 retuarant in that district, so the density will always be 1.

## plot the review based on postcode.
ggplot(data = food, aes(food$Review_Count, fill=factor(food$zip_code))) + 
  geom_histogram(position = 'dodge',binwidth = 50) + 
  xlim(0, 800) +
  labs(x="Review Count", fill="Zip Code")

## we draw some boxplot to explore the data.
## we find that restuarnts in 10036 are more likely to have more reviews. 

ggplot(food, aes(factor(food$zip_code), food$Review_Count)) +
  geom_boxplot() + 
  coord_flip() +
  labs(x="Zip Code", y="Review Count")

For review count, we find that restuarnts in 10036 are more likely to have more reviews. Restaurants located at 10019 comes the second. The restuarant in 10018 are more sparse.

## hex plot or scatter plot

ggplot(food,aes(x = Rating,y = Review_Count))+stat_bin_hex()+labs(y="Review Count")

ggplot(food,aes(x = Rating,y = Review_Count))+geom_point(position = 'jitter', alpha = .3)+labs(y="Review Count")

From the plot we can see that,

  1. there is a cluster around rating 3~4 and review less that 500.

  2. the more review, there is more likely to be high ratings.

  3. There is no resturant have more than 600 review that has low ratings.

  4. The resturant having over 1,000 reviews are outliers and they are all rating 3.5 or 4.0

# we explore the relationship between price and review

ggplot(food,aes(x = Price,y = Review_Count))+stat_bin_hex()+labs(y="Review Count")

ggplot(food,aes(x = Price,y = Review_Count))+geom_point(position = 'jitter', alpha = .3)+labs(y="Review Count")

  1. There is a cluster around Price 1-2 and review below 500.

  2. Outliers are the points over 1000 reviews and over 3 ratings.

data_new<-subset(food,food$zip_code %in% c(10036,10019,10018,10020))
data_new<- subset(data_new,data_new$Category_2nd_Level %in% c('North American','Deli','European','Asian'))


counts3 <- data_new %>% drop_na(`zip_code`,`Category_2nd_Level`)%>%group_by(data_new$`Category_2nd_Level`,  data_new$`zip_code`) %>% summarize(Freq =n())

colnames(counts3)<-c('Category_data','zip_code','Freq')
factor_cat <- factor(counts3$Category_data)

vcd::mosaic(factor_cat~zip_code, direction = c('v','h'), counts3, rot_labels=c(0,90,0,0),
            offset_varnames = c(left = 5), offset_labels = c(left = 0),
            just_labels = c("center","right"), margins = c(left = 5), set_varname=c(zip_code="Zip Code", factor_cat="Category"))

From the plot we can see that the majority data lies in 10019 and 10036. And district will influence the retuarants category. We can see that there are more North American restuarants in 10036, and relatively less other types retuarants.

Parallel Coordinate

food_pl <- as.data.frame(food[, c(2,4,5,6,7,8,9,10,11)])
food_pl$Score <- round(food_pl$Score)
food_pl$Rating <- round(food_pl$Rating)
food_pl[, 1:9] <- lapply(food_pl[, 1:9], factor)
food_plna <- na.omit(food_pl, cols=c("Score"))
#colnames(food_plna)
food_al <- food_plna %>% 
  #drop_na(Score) %>%
  group_by(Category_2nd_Level, Rating, Price, Street_Num, zip_code, Review_Count,Category_data) %>% #
  summarise(Freq = n())

library(alluvial)
pal <- RColorBrewer::brewer.pal(10, "Set3")
alluvial(food_al[, c("Category_2nd_Level", "Rating", "Price", "zip_code")], freq = food_al$Freq, 
         blocks = TRUE,
         alpha = 0.8,
         col = pal[match(food_al$Category_2nd_Level,
                       unique(food_al$Category_2nd_Level)) ])  

The majority of the restaurants belongs to North American food, and then comes European style restaurants and Deli. Most of the North American and European restaurants have ratings at 4 and above, and most of those restaurants are not too expensive having ratings at level 2 (around 20 to 30 dollars per person per meal). Asian restaurants comes to the fourth biggest group in our data. Their ratings range from 2 to 4. More than half of the Asian are ranked at level 4. Approximately, 95% of the Asian restaurants set the price levels within two dollar signs. Coffee shops are usually with low ratings and low cost. Steakhouse are all ranked high and the price level are also high. More than 50% of the restaurants are located in zip code 10036 area. Their categories are mainly North American, European, coffee shop, Asian food, and steakhouse.

library(tidyverse)
library(ggplot2)

hf100 <- read.csv("data/hotel_food_100.csv")
hf <- na.omit(hf100)
hf$hotel_price_title <- paste("Hotel rating is ", hf$hotel_price, "")
ggplot(hf, aes(food_price)) +
  geom_histogram(binwidth = 1, color ="blue", fill = "lightblue") +
  facet_wrap(~hotel_price_title) +
  ggtitle("Price distribution at hotel price level and neighborhood restaurants") +
  xlab("Price level of restaurants") + 
  ylab("Counts of restaurants within 100 square meters") +
  stat_bin(aes(y=..count.., label=..count..), geom="text", vjust=-.5, binwidth = 1) +
  theme_grey(16)

hf$hotel_rating2_title <- paste("Hotel rating level is ", hf$hotel_rating2, "")
ggplot(hf, aes(food_rating)) +
  geom_histogram(binwidth = 1, color ="blue", fill = "lightblue") +
  facet_wrap(~hotel_rating2_title, nrow = 2) +
  ggtitle("Rating distribution at hotel rating levle and neighborhood restaurants") +
  xlab("Rating level of restaurants") + 
  ylab("Counts of restaurants within 100 sqaure meters") +
  stat_bin(aes(y=..count.., label=..count..), geom="text", vjust=-.5, binwidth = 1) +
  theme_grey(16)

ggplot(hf, aes(x = food_price,y = food_review_count)) +
  geom_col(fill = "lightblue") +
  facet_wrap(~hotel_rating2_title, nrow = 2) +
  ggtitle("Review Counts of hotel Ratings and neighborhood restaurants") +
  xlab("Restaurant Price") + 
  ylab("Review Counts") 

We could find some interesting facts here.

  1. The higher the rating of the restaurant it is, it is not necessary that the ratings of the neighborhood restaurants are higher. Taking the rating level 5 as an example, there are not that many reviews nor high ratings. One of the reasons might be not that many restaurants rated as 5 or 1. A lot of the hotels are rated as level 3.

  2. There are a lot of restaurants rated as level 3, and the majority of neighborhood restaurants around level 3 hotels are also on level 3.

  3. The restaurants in Time Square are not too expensive. The most expensive ones are with three dollar signs.

  4. The distributions of each price level of hotels are very similar. 4 dollar sign restaurants are very rare, and the proportion of 2 dollar sign restaurants is nearly 50% of the data at each level.

  5. Hotels with 2 dollar signs have the most counts in Times Square.

5. Executive Summary

When we travel, how would we choose hotels? Location and convenience come to the first place. The hotels in Times Square are usually the first option when tourists visit New York City and would like to experience the New Yorker life at Midtown, such as watching Broadway show, shopping, drinking at a bar, enjoying a meal at a steakhouse, and etc. There is nowhere else in the world like New York City presents all kinds of best gourmets from everywhere of the world. If we search the vocabulary “restaurants” in Yelp, there are Italian, American, Mexican, Portuguese, Asian fusion… When we choose a hotel, if we take the foods surrounding the hotel into considerations, what kind of hotel we should choose? Also, gourmets play significant role during travelling. The food quality, restaurant environment, and price are the essential elements for us to judge if a restaurant is “good” or “bad”. If the rating of the hotel is high, are the ratings of the neighborhood restaurants also high? Does hotel affect the development of restaurants?

We would answer the above inquires by analyzing datasets provided by NYC OpenData and Yelp. There are 332 restaurants and 41 hotels chosen located in Times Square. The first graph (“Histogram of Rating on Different District”) gives us a general view of the restaurant data we collected. According to Yelp rating system, there are 5 levels—1 is the lowest and 5 is the highest. Most of the restaurants are rated from level 2 to 4; there are only a few are rated less than 2 or above 4.5. The shape is left skewed, so we can say the majority of the restaurants provide high quality foods. The 10036 zip code area owe nearly half of the restaurants, and 10018 zip code area have the second largest restaurant number. One reason is 10036 area itself occupies half of the Times Square area; the other one would be those area are usually popular tourists attractions, such as Madame Tussauds and theaters.

Here comes another question. What kind of food dominates Times Square food industry? If we take a look at the graph “Mosaic of Category vs. Zip Code”, it demonstrates how the food categories scatters in Times Square area. North American restaurants have the largest portion in the food industry in Times Square, and Delis have the second largest, while European and Asian become the third and the fourth largest category. Majority of the restaurants are located within 10019 and 10036 zip code area, while there are more North American style restaurants in 10036 than 10019 zip code area. Inside 10036 area, Asian restaurant and Deli have smaller portion than 10019 zip code area.

Alluvial diagram, or called Parallel coordinate plot, presents us the overall information we would be interested in and how the data we collected after we combine food categories, ratings, price levels, and locations together. If we are not sure what to eat tonight, we could look at the Alluvial diagram and make a decision. Most of the restaurants in Times Square are North American, and the majority of them are rated at level 4 but with low cost (2 or 3 dollar-sign according to Yelp), located in zip code area 10036. So, we might go to a North American restaurant with ratings at level 4 and price level at 2 and located at 10036 areas after we finish watching a show in the Theater District. If we would like to pick a high-quality steakhouse, we could just walk into any steakhouse we come across in Times Square, but before we do that, we might want to yelp it first; although 90% of them are rated very high, only few are served with low quality. Likewise, if we want something quick and not too expensive, we can stop by a Deli, because as the top 3 largest categories in Times Square, Delis offer low-cost meals, and spread over all the area, including all zip codes.

After we watch the Broadway show, fantastic North American dinner, it is time to take a rest. Are there any high rating hotels near the fancy restaurant we just have our dinner? Let’s take a look at the histogram of “Price distribution at each hotel price level and neighborhood restaurants”. As for each of hotel, we find out its surrounded restaurants within a circle area with 100 meters as radius; and then we make a comparison to see how the restaurants spread around each hotel through grouping by hotel price levels. Generally speaking, the shape of how expensive restaurants distributed are nearly identical. No matter how the price level of the hotels is, restaurants with price level 2 are nearly half of the neighborhood restaurants; restaurants with price level 1 are the second largest for each hotel cluster. At each price level, there are only few restaurants are 4 dollar-sign. Overall, the price of hotels is not too expensive in Times Square.

Let’s talk about the ratings of hotels and restaurants (“Rating distribution at each hotel rating level and neighborhood restaurants”). Likewise, the restaurant rating distribution has almost the same shape at each hotel rating level. Obviously, most of hotels and restaurants are rated as level 3, and more than half of the restaurants at level 3 hotels are also rated as level 3.

To conclude, North American and European restaurants prefer to locate in 10036 zip code area. Most restaurants and hotels provide high quality service with 3 to 4 stars, but of course, there are some low rating hotels and restaurants rated only 1 or 2 stars, and we might want to avoid those businesses when choosing a place to eat or accommodate. Speaking of the correlation between hotels and their neighborhood restaurants, there is no strong evidence showing the relationship exists. The ratings and price levels of restaurants and hotels are evenly spreading. What we have done so far is analyzing hotels as a group, such as in the group of price level at 1, 2, 3, which might average out some special and individual cases.

6. Interactive Component: D3, jQuery, and Bootstrap

Interactive Web System: https://pinhao1994.github.io/EDAV/

We built our website by D3, jQuery, and Google Map API. In order to project our data points onto the map, we built a projector transforming longitude and latitude into svg path object. Also, Bootstrap model (https://getbootstrap.com/) was utilized to implenment responsive web design which allowed our website to become mobile friendly. There are three features that have been designed: Mouseover, Clicking, and Category-Choosing.

When users mouse over each data point, our system will show the information of that restaurant or hotel; if the data point is a hotel location (in blue color), there will also be a histogram to show the top 5 restaurant categories surround the hotel. When users click the hotel data points (blue points), the system will draw out a circle area with 100 meters as radius, and the top 3 restaurants near the hotel will be shown in yellow color. Last but not the least, users can choose restaurant categories, the restaurant with chosen category will be marked in red. For more details, please feel free to visit our website https://pinhao1994.github.io/EDAV/.

7. Conclusion

In this project, the main limitations come from the nature of our combined dataset. Since we combined NYC open data on Times Square’s hotels and restaurants with information scrapped from Yelp API and they have different recording strategies, matching each item perfectly is challenging. For the same item, its name and placed category in two resources might deviate. As a result, Yelp API does not have research results for about 5% of rows in NYC open data, even address and phone number are given. Although we have a high precision rate, there is a bottleneck for the recall rate. In addition, the restaurants’ categories defined by two resources are not uniform. For example, some are categorized by political regions, like Italian food or Japanese food, others are labeled by food type, such as café and steakhouse. Preprocessing it manually might lead to information leakage or misinterpretation, which undermines our analysis accuracy. Besides, most of our data are categorical, which limits our graphing choice and exploratory approaches. In the future, we wish to try incorporate more numeric variables in order to explore at more angles. It would also be interesting if we can extend this project to the Manhattan area or even the entire New York City. In that case, we can look at broader distance circle for each hotel and examine clustering pattern for different attributes.