DSC 140: Fundamentals of Data Science

Sample Final 1

Data Analysis in R

Start by writing the line library(datasets). This will give you access to the data sets referenced below (no need to download or open any files from D2L or whatever).

[10 pts] Consider the “sleep” data set, which shows the result of a drug trial where students (identified in column “ID”) took one of two drugs (column “group”) and the change in how much sleep they got (in hours, column “extra”). Hint: the name of the dataframe is sleep.

Make a figure that shows two box plots (one for each drug) of the “extra” column. Which appears to be more effective as a sleep aid? Or is it unclear if there is any meaningful difference?
What is the average amount of extra sleep for patients who took drug 1? What about drug 2?
A chi-squared can be used to check if the means of two distributions are significantly different from one another. Run the test and report the p value. What do you conclude? Hint: We’ve used the R command subset to select only certain rows in some dataframe based on a specified criteria.

[10 pts] Now consider the “USJudgeRatings” data set (also from the datasets library you used in problem 1). This data set is based on observations of US Judges; each row is a different judge and the columns are numerical scores on different qualities:

CONT    Number of contacts of lawyer with judge.
INTG    Judicial integrity.
DMNR    Demeanor.
DILG    Diligence.
CFMG    Case flow managing.
DECI    Prompt decisions.
PREP    Preparation for trial.
FAMI    Familiarity with law.
ORAL    Sound oral rulings.
WRIT    Sound written rulings.
PHYS    Physical ability.
RTEN    Worthy of retention.

Determine the Pearson correlation coefficient between the “RTEN” column and every other column. Which column has the highest correlation? Use an appropriate loop rather than copy-pasting the same block of code for each pair of columns! Do not make a correlation matrix, use a loop to complete this section.
Make a scatter plot and linear regression with RTEN and the most highly correlated column you found in (a). What are the coefficients for the line of best fit?

[10 pts] Consider the “mtcars” data set, which shows data on 32 cars from model years 1973 and 1974. Use the k-nearest neighbors algorithm to try and predict the number of cylinders (column “cyl”) based on the miles per gallon (“mpg”) and horsepower (“hp”). Interpret your results.

Data Analysis with MySQL

Connect to the “orders” database. The following code may be of use:

connection = pymysql.connect(
host = "dsc140.cmgtlybtlnmh.us-east-2.rds.amazonaws.com",
user = "guest",
password = "UMUdsc140",
db="orders")

Do the following:

[2 pts] Identify and print the names of all the tables in the database. Look at the columns within some of the tables and briefly describe what this database appears to be all about.
[5 pts] Find the sale amount for every order. To do this you’ll need to find how many units were sold in each order and the price per unit for the corresponding item that was sold, then multiply the values together.
[3 pts] Make a histogram of the sale amounts you found in #2. (If you can’t get #2 to work, make up some fake data and make a histogram of that instead.)
[5 pts] Generate a pandas dataframe that lists the customerID for all customers served by the employee with EmployeeID “8”. Display the dataframe.
[5 pts] Generate a pandas dataframe that lists the Customer’s Name and their corresponding country of residence, again only for customers served by the employee with EmployeeID “8”. Display the dataframe.