DSC 140: Fundamentals of Data Science
Sample Final 1
Data Analysis in R
Start by writing the line library(datasets)
. This will give you access to the data sets referenced below (no need to download or open any files from D2L or whatever).
- [10 pts] Consider the “sleep” data set, which shows the result of a drug trial where students (identified in column “ID”) took one of two drugs (column “group”) and the change in how much sleep they got (in hours, column “extra”). Hint: the name of the dataframe is
sleep
.
- Make a figure that shows two box plots (one for each drug) of the “extra” column. Which appears to be more effective as a sleep aid? Or is it unclear if there is any meaningful difference?
- What is the average amount of extra sleep for patients who took drug 1? What about drug 2?
- A chi-squared can be used to check if the means of two distributions are significantly different from one another. Run the test and report the p value. What do you conclude? Hint: We’ve used the R command
subset
to select only certain rows in some dataframe based on a specified criteria.
- [10 pts] Now consider the “USJudgeRatings” data set (also from the datasets library you used in problem 1). This data set is based on observations of US Judges; each row is a different judge and the columns are numerical scores on different qualities:
CONT Number of contacts of lawyer with judge.
INTG Judicial integrity.
DMNR Demeanor.
DILG Diligence.
CFMG Case flow managing.
DECI Prompt decisions.
PREP Preparation for trial.
FAMI Familiarity with law.
ORAL Sound oral rulings.
WRIT Sound written rulings.
PHYS Physical ability.
RTEN Worthy of retention.
- Determine the Pearson correlation coefficient between the “RTEN” column and every other column. Which column has the highest correlation? Use an appropriate loop rather than copy-pasting the same block of code for each pair of columns! Do not make a correlation matrix, use a loop to complete this section.
- Make a scatter plot and linear regression with RTEN and the most highly correlated column you found in (a). What are the coefficients for the line of best fit?
- [10 pts] Consider the “mtcars” data set, which shows data on 32 cars from model years 1973 and 1974. Use the k-nearest neighbors algorithm to try and predict the number of cylinders (column “cyl”) based on the miles per gallon (“mpg”) and horsepower (“hp”). Interpret your results.
Data Analysis with MySQL
Connect to the “orders” database. The following code may be of use:
= pymysql.connect(
connection = "dsc140.cmgtlybtlnmh.us-east-2.rds.amazonaws.com",
host = "guest",
user = "UMUdsc140",
password ="orders") db
Do the following:
- [2 pts] Identify and print the names of all the tables in the database. Look at the columns within some of the tables and briefly describe what this database appears to be all about.
- [5 pts] Find the sale amount for every order. To do this you’ll need to find how many units were sold in each order and the price per unit for the corresponding item that was sold, then multiply the values together.
- [3 pts] Make a histogram of the sale amounts you found in #2. (If you can’t get #2 to work, make up some fake data and make a histogram of that instead.)
- [5 pts] Generate a pandas dataframe that lists the customerID for all customers served by the employee with EmployeeID “8”. Display the dataframe.
- [5 pts] Generate a pandas dataframe that lists the Customer’s Name and their corresponding country of residence, again only for customers served by the employee with EmployeeID “8”. Display the dataframe.