Python Homework Part 2

The code you submit for this homework must follow the coding guidelines.

Problem 1 (50 pts.)

Refer to the file titanic_train.csv.

This file was pulled from this website. There, you’ll see an explanation for the columns and the entries in some of the columns.

In a Jupyter Notebook file, perform some exploratory data analysis. This is a broad term that translates to “look at the data, summarize important bits, and point out anything that strikes you as noteworthy.” Some specific ideas include:

  • Look at summary statistics of numerical data (mean, median, maximum, minimum…)
  • Generate histograms or boxplots of columns of numerical data (this is straightforward in matplotlib; check out its documentation for details)
  • Generate pie charts for columns of categorical data
  • Check for columns of numerical values that are highly correlated or anti-correlated
  • Perform a chi-square test for contingency tables formed from columns of categorical data, and check for significance (generally speaking, p<0.05).
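
As a rough illustration of the numerical ideas in the list above, here is a minimal sketch using pandas and SciPy. The data frame is invented toy data, not values from titanic_train.csv, and the column names ("Sex", "Survived", "Age", "Fare") are assumptions; check them against the column explanation for the real file before reusing any of this.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data invented for illustration only; NOT taken from titanic_train.csv.
# The column names are assumptions about what the real file might contain.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female",
            "male", "female", "male", "female"],
    "Survived": [0, 1, 0, 1, 0, 1, 1, 0],
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07],
})

# Summary statistics (count, mean, std, min, quartiles, max) for numerical columns.
summary = toy[["Age", "Fare"]].describe()
print("Summary statistics for Age and Fare:")
print(summary)

# Pearson correlation between two numerical columns (a value in [-1, 1]).
corr = toy["Age"].corr(toy["Fare"])
print(f"Correlation between Age and Fare: {corr:.3f}")

# Contingency table of two categorical columns, then a chi-square test on it.
table = pd.crosstab(toy["Sex"], toy["Survived"])
chi2, p_value, dof, expected = chi2_contingency(table)
print("Contingency table (Sex vs. Survived):")
print(table)
print(f"Chi-square statistic: {chi2:.3f}, p-value: {p_value:.3f}, dof: {dof}")
```

With only eight invented rows, the resulting p-value says nothing about the real passengers; the point is only the shape of the workflow (build the table with `pd.crosstab`, then pass it to `chi2_contingency` and compare the p-value to 0.05).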

For full credit:

  • Your Notebook should include at least one interesting example of each kind of analysis listed above. You are welcome to do more if you like.
  • Be clear in your output: use print statements that identify the values you’re displaying, and generate figures with labeled axes.
  • Provide an interpretation for any statistics you calculate and any graphs you create.
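
To make the "clear output" requirement concrete, here is a small sketch of a labeled histogram and an identifying print statement. The ages are made-up values for illustration, and the axis labels and title are only one possible wording.

```python
import matplotlib.pyplot as plt

# Made-up ages for illustration only; NOT taken from titanic_train.csv.
ages = [22, 38, 26, 35, 54, 2, 27, 14]

# Histogram with labeled axes and a title.
fig, ax = plt.subplots()
ax.hist(ages, bins=5)
ax.set_xlabel("Age (years)")
ax.set_ylabel("Number of passengers")
ax.set_title("Distribution of passenger age (toy data)")

# A print statement that says what the value is, not just the bare number.
mean_age = sum(ages) / len(ages)
print(f"Mean age: {mean_age:.1f} years")
```

In a Jupyter Notebook the figure is normally rendered when the cell finishes; a sentence or two of interpretation would then follow in a Markdown cell.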

This is intentionally a somewhat open-ended assignment. Data scientists sometimes follow “recipes” where they perform a specific analytical task – but the more interesting (and rewarding!) work involves creativity and exploration.