Sample Midterm Exam 2

Part 1: Data Analysis in Excel

Answer each question below in a separate tab in an Excel workbook. You may create multiple copies of the data if needed. Download the file and perform the following tasks:

  • Make a scatter plot of age on the x axis and avg_glucose_level on the y axis. Insert a linear line of best fit and display the equation and \(R^2\) value on the chart. Label the axes, include a title, and customize the style of the chart from the default.
  • In a cell near your figure, respond to the question “How does glucose level tend to change as patients age?” Justify your answer.
  • Repeat the first part only showing data for patients who have had a stroke (they have a “1” in the stroke column).
  • Calculate the average age, average BMI, and average glucose level for (a) stroke patients and (b) non-stroke patients
  • What is the correlation coefficient between the avg_glucose_level and bmi columns?
  • Make a histogram of the avg_glucose_level column. Customize the bin width to be 12.
  • Each patient can be characterized as Male or Female, with or without hypertension, and with or without heart disease.
    • How many stroke patients fall into each of these 8 categories (e.g., how many are male/with hypertension/with heart disease)? Use a pivot table to answer this question.
    • In a cell near the table, indicate the number of male patients with hypertension and without heart disease (this just requires you to interpret the values in your pivot table correctly).

Part 2: Python Basics

As this is a timed exam you do not have to provide detailed comments for your code, but you must use a comment to label the answers to each question.

  • Write a function that takes three arguments (x,y,z) that are floating point numbers. The function should return the “Euclidean norm” of the values, which is given by \((x^2+y^2+z^2)^{1/2}\). Call the function for x=5.0, y=7.0, z=9.0. Print the answer to the console.
  • The Fibonacci sequence starts with [1,1] and every following value is the sum of the previous two: [1,1,2,3,5,8,…]. Use an appropriate loop to determine the 50th number in the Fibonacci sequence.

Part 3: Data Analysis in Python

As this is a timed exam you do not have to provide detailed comments for your code, but you must use a comment to label the answers to each question.

  • Download the file. This contains information on some games in the video game sales platform Steam. Load the file as a dataframe and print the columns to the console.
  • Determine the minimum, maximum, and all three quartiles of the “price” column. Make a boxplot of the price distribution. Hint: use import matplotlib.pyplot as plt.
  • Make a scatter plot of positive_ratings (on the x axis) vs. negative_ratings (on the y axis).
  • Generate a line of best fit and include it on the plot generated in (a). Print the value of the slope to the console and interpret: do most games have more positive or more negative ratings?
  • Make a contingency table of developer and genres but only for games that have a price above 100.0. Display a list of the names of each of the games with a price above 100.0. Hint: df2 = df.query(“A > 0”) will select rows in df where the column named A has rows with values above 0.