R Project
The following datasets come directly from Yelp, though they have been truncated in size to make them manageable for processing on your computer. The data sets are located on D2L.
- OA 11.7 - yelp_academic_dataset_business.json.csv
- OA 11.6 - yelp_academic_dataset_user.json.csv
Open both of these datasets in RStudio and make sure they import properly. Note that each of these datasets is quite large, so some of the tasks below may take a while to run.
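One way to check an import is sketched below. It writes a tiny stand-in CSV to a temporary file so the example runs anywhere; the three rows are made up for illustration, and `toy_path` is a hypothetical name. With the real files you would pass the path of the CSV downloaded from D2L to `read.csv()` instead.

```r
# Toy stand-in for one of the CSV files (made-up rows, not Yelp data).
toy_path <- tempfile(fileext = ".csv")
writeLines(c("state,stars,review_count",
             "AZ,4.5,12",
             "NV,3.0,7",
             "AZ,5.0,40"), toy_path)

# Read the CSV into a data frame; keep text columns as character.
business <- read.csv(toy_path, stringsAsFactors = FALSE)

# Quick import checks: dimensions, column types, and the first rows.
dim(business)
str(business)
head(business, 5)
```

If `str()` shows a numeric column imported as character, the file probably contains stray text in that column and is worth inspecting before moving on.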
Complete the following tasks for each of the data sets. You may have to look up the syntax for some of the tasks as we have not explicitly covered it in class.
For the Business Data:
- Load the file and print the first five rows of the dataframe.
- Create a histogram or bar chart (decide which one makes the most sense) of the “state” column to get an idea of where the businesses are located.
- Generate a pie chart of the star ratings of all of the businesses.
- Explore the relationship between review count and stars by making a box plot of the reviews for each star rating.
- Perform a chi-squared test between the stars = 1.0 and the stars = 5.0 data. What does the result of the test tell you?
- Hint: you will need to create two sub-data frames here to accomplish this task.
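The box plot and chi-squared tasks above could be approached roughly as in the following sketch. It uses randomly generated toy data rather than the Yelp file, and it assumes the chi-squared test compares how the two star groups are distributed across the "state" column. The assignment does not say which variable to compare, so that choice is only an illustration.

```r
# Toy data standing in for the business file (random, not Yelp data).
set.seed(1)
n <- 400
business <- data.frame(
  state        = sample(c("AZ", "NV", "PA"), n, replace = TRUE),
  stars        = sample(c(1.0, 3.0, 5.0), n, replace = TRUE),
  review_count = rpois(n, lambda = 20)
)

# Box plot of review counts, one box per star rating.
boxplot(review_count ~ stars, data = business,
        xlab = "Stars", ylab = "Review count")

# Two sub-data frames, as the hint suggests.
one_star  <- business[business$stars == 1.0, ]
five_star <- business[business$stars == 5.0, ]

# Contingency table: star group (rows) by state (columns).
# Assumption: "state" is the variable being compared.
both   <- rbind(one_star, five_star)
counts <- table(both$stars, both$state)

# Chi-squared test of independence between star group and state.
result <- chisq.test(counts)
result$p.value
```

A small p-value would suggest that the 1.0-star and 5.0-star businesses are distributed differently across the compared variable; a large one gives no evidence of a difference. Because the toy data are random, no relationship is expected here.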
For the User Data:
- Open the file and print the column names.
- Use Pearson r correlation to evaluate the relationship between cool_votes, funny_votes, and useful_votes.
- Do a linear regression analysis of 2 of the 3 columns from the last step, including finding the equation of a fit line and plotting/labeling it.
- Does writing reviews (i.e. review_count) bring a user more fans? Support your answer using linear regression techniques.
- Find a different variable (besides review_count) that would bring a user more fans. Again, support your answer using quantitative techniques.
- Use the kmeans machine learning algorithm to see if you can organize the data into usable clusters. Do this first for review_count and fans, and then for the different variables that you chose in the previous part. How many clusters did you choose? Why? Be sure to provide a justification/interpretation.
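The correlation, regression, and clustering tasks above could be sketched as follows. The data are randomly generated toy values, not the Yelp user file, and the built-in relationships (for example, fans rising with review_count) are assumptions made only so the plots show something. The choice of 3 clusters at the end is likewise illustrative; on real data it should be justified, for example from the elbow plot.

```r
# Toy data standing in for the user file (random, not Yelp data).
set.seed(42)
n <- 300
users <- data.frame(
  review_count = rpois(n, lambda = 30),
  useful_votes = rpois(n, lambda = 50)
)
users$fans        <- pmax(0, round(0.2 * users$review_count + rnorm(n, sd = 2)))
users$funny_votes <- pmax(0, round(0.5 * users$useful_votes + rnorm(n, sd = 3)))
users$cool_votes  <- pmax(0, round(0.6 * users$useful_votes + rnorm(n, sd = 3)))

# Pearson correlation matrix for the three vote columns.
votes_cor <- cor(users[, c("cool_votes", "funny_votes", "useful_votes")],
                 method = "pearson")
votes_cor

# Linear regression: fans as a function of review_count.
fit   <- lm(fans ~ review_count, data = users)
coefs <- coef(fit)   # intercept and slope of the fit line
summary(fit)         # slope p-value and R-squared support the answer

# Scatter plot with the fit line drawn and labeled with its equation.
plot(users$review_count, users$fans,
     xlab = "review_count", ylab = "fans")
abline(fit, col = "red")
legend("topleft", bty = "n", lty = 1, col = "red",
       legend = sprintf("fans = %.2f + %.2f * review_count",
                        coefs[1], coefs[2]))

# k-means on standardized columns so neither variable dominates.
xy <- scale(users[, c("review_count", "fans")])

# Elbow plot: total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) kmeans(xy, centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")

# Fit with an illustrative k = 3 and color the points by cluster.
km <- kmeans(xy, centers = 3, nstart = 10)
plot(users$review_count, users$fans, col = km$cluster,
     xlab = "review_count", ylab = "fans")
table(km$cluster)
```

Swapping other columns into the `lm()` formula and the `xy` matrix applies the same steps to the other variable pairs the tasks ask about.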