Project Reports

Project Report 1: Due Friday November 1st by midnight

Consider the Titanic data or, if you prefer, another data set we’ve analyzed in the course that includes both numerical and categorical data.

Prepare a 2-page technical report that includes/does the following (feel free to use/modify your work from Python Homework 2 and elsewhere in the course, as appropriate):

  • A histogram of a column of numerical data.
  • A pie chart of a categorical column of data.
  • A scatter plot of two numerical columns of data, including a linear regression line. Include the equation and R value on the plot.
  • At least one other figure of your choice that conveys something interesting about the data.
  • A contingency table comparing two categorical columns. Include a chi-square test and interpret your findings.
  • Identify at least two sets of numerical data that it makes sense to compare against one another (e.g., the fares of passengers who embarked from each of the three ports in the Titanic data). Answer the question “Are these distributions statistically different from one another?” by performing a “two-sample Kolmogorov-Smirnov” test on each pair of data sets. Report and interpret each of the p-values.
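For the contingency table item, one possible approach is sketched below. It assumes pandas and SciPy are installed and uses a small synthetic stand-in for the data; the column names `Sex` and `Survived` and all values are made up for illustration, so substitute the columns from your own file.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the real data (values are invented for illustration).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Sex": rng.choice(["male", "female"], n),
    "Survived": rng.integers(0, 2, n),
})

# Contingency table: counts for every combination of the two categories.
table = pd.crosstab(df["Sex"], df["Survived"])
print(table)

# Chi-square test of independence on the table of counts.
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi-square = {chi2:.3f}, p = {p:.3f}, degrees of freedom = {dof}")
```

A small p-value (commonly below 0.05) suggests the two categorical columns are not independent; a large one means the data give no evidence of a relationship.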

The syntax to perform the test is available here. To isolate data from a column, remember you can do something like:

fares_Q = df.query("Embarked=='Q'")['Fare']
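As a rough sketch of how the pairwise comparisons might be organized, the example below assumes the test is run with SciPy's `ks_2samp` and uses synthetic fares in place of the real file. The data values are invented; only the column names `Embarked` and `Fare` come from the snippet above.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Synthetic stand-in for the real data (values are invented for illustration).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Embarked": rng.choice(["C", "Q", "S"], 300),
    "Fare": rng.exponential(30, 300),
})

# Isolate the fares for each port, dropping missing values.
fares = {
    port: df.loc[df["Embarked"] == port, "Fare"].dropna()
    for port in ["C", "Q", "S"]
}

# Run the two-sample KS test on every pair of ports.
p_values = {}
for a, b in combinations(fares, 2):
    result = ks_2samp(fares[a], fares[b])
    p_values[(a, b)] = result.pvalue
    print(f"{a} vs {b}: D = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

Each p-value answers the question for one pair: a small value indicates the two fare distributions differ, while a large value means the test found no detectable difference.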

One handy way to save a figure so you can insert it into a document is with

plt.savefig("figure.png")

which will save the figure in the same folder as the Jupyter Notebook you’re working on. You can also specify a complete file path in the usual way.
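The sketch below ties this together with the scatter plot requirement: it fits a regression line, puts the equation and R value in the legend, and saves the result. It assumes SciPy's `linregress` for the fit and uses synthetic numbers; the axis labels, figure size, and the `dpi` and `bbox_inches` settings are illustrative choices, not requirements.

```python
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic stand-in for two numerical columns (values are invented).
rng = np.random.default_rng(2)
x = rng.uniform(1, 80, 100)
y = 10 + 0.5 * x + rng.normal(0, 5, 100)

# Least-squares fit; rvalue is the correlation coefficient R.
fit = stats.linregress(x, y)
label = f"y = {fit.slope:.2f}x + {fit.intercept:.2f}, R = {fit.rvalue:.2f}"

fig, ax = plt.subplots(figsize=(4, 3))
ax.scatter(x, y, s=10, alpha=0.6)
xs = np.array([x.min(), x.max()])
ax.plot(xs, fit.intercept + fit.slope * xs, color="red", label=label)
ax.set_xlabel("Age")
ax.set_ylabel("Fare")
ax.legend(fontsize=7)

# Save at print resolution and trim the surrounding whitespace.
out_path = Path("figure.png")
fig.savefig(out_path, dpi=300, bbox_inches="tight")
plt.close(fig)
```

A higher `dpi` keeps the figure sharp when it is scaled down to fit a two-page layout.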

Your audience is someone familiar with the data set (so, there is no need to explain what is in the file) but not particularly proficient in data analysis (so, interpret each figure/result). Include a brief title that identifies the data being analyzed but don’t bother inserting your name, my name, the course, date, etc.; mimic a document for an employer and just get to the analysis. The total amount of text shouldn’t exceed 400 words. Be creative in laying everything out in an efficient and appealing way. For instance, you should consider wrapping text around the figures, and by no means should your text be double spaced.

In the Dropbox, submit both your Python code and the report.

The report and code will be assessed for:

  • The correctness and completeness of the analysis (40%)
  • The quality of the writing (spelling, grammar, flow, etc.) (30%)
  • The visual appeal/professionalism of the figures and the overall layout (20%)
  • GitHub submission with all files and an appropriate README (10%). See the bottom of this page.

The overall grade will be reduced if the report is more than two pages, so keep it brief!

Project Report 2: Due Tuesday November 26th by midnight

Refer back to the R in-class project. Prepare a short (5-minute maximum) oral report summarizing your findings from this in-class assignment. Focus specifically on the “User Data”.

Your report should:

  • Explain the data file: What information is included? What columns are in the Yelp user data file? How many entries are there? What type of data is stored in each column?
  • Summarize your analysis for parts 1-6. Describe the data (simple things like “review_count ranged from <some number> to <some number>”) but also describe any relationships (or lack thereof) that you found. If you perform a statistical test, describe it (assume your reader is a college student who isn’t in this class).
  • Conclude your report by answering this specific question: What can a Yelp user do to enhance their popularity and impact on the platform? Be clear about how you are defining “popularity” and “impact”, and support your conclusion by referencing the quantitative evidence you provided in your analysis.
  • Include graphics and tables in the report, where appropriate.
  • Include your complete code file (.ipynb or .r) for both sets of analysis from the project with your submission.
  • Follow the same structure and format requirements as the first technical report.

The report will be assessed for the appropriateness of the analysis, quality of the presentation, and the extent to which you respond to the items above.

Adding to GitHub

You are required to upload any Excel workbooks, Python files, or R files you use for your analysis for both projects to a separate, public GitHub repository and share the link to the repository when you turn in your report. You are also required to either upload your report to the repository or create a separate README file that explains what the repository contains and why it matters. Follow the guidelines provided in the GitHub tutorial and use the format we have used for the other projects in this course.