Final Project

The final project is your opportunity to find an interesting data set, learn from it, and tell a meaningful story about it. You will perform an analysis, write a report, and give a presentation to your peers.

If you want me to review your slides or final report before submitting them, come by office hours or make an appointment!

You may do your analysis with any of the tools learned in this course (Excel, Python, or R), but you must use at least one of them.

Finding a Data Set

Your data set should have enough entries to support robust statistical analysis. As a rough guide, look for data with more than 100 rows, more than 10 columns, and a mix of categorical and continuous data. You are encouraged to show the data set to Dr. Butler for approval before going too far in your analysis. The data set can be in any format that you have learned to access this semester.
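As a quick way to compare a candidate data set against the rough guide above, here is a minimal Python sketch that uses only the standard library. The function names (is_number, summarize_shape) and the tiny sample table are invented for illustration and are not part of the course materials. The rule "a column is continuous if every non-empty value parses as a number" is a simplifying assumption, so numeric codes such as ZIP codes would be misclassified and should be checked by hand.

```python
import csv
import io


def is_number(text):
    """Return True if the string can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def summarize_shape(rows):
    """Count rows and columns and split columns into numeric vs. categorical.

    `rows` is a list of dicts, such as the output of csv.DictReader.
    Assumption: a column is numeric if every non-empty value parses as a float.
    """
    columns = list(rows[0].keys()) if rows else []
    numeric = [
        col for col in columns
        if all(is_number(row[col]) for row in rows if row[col] != "")
    ]
    categorical = [col for col in columns if col not in numeric]
    return {
        "rows": len(rows),
        "columns": len(columns),
        "numeric": numeric,
        "categorical": categorical,
        # Rough guide from the assignment: >100 rows, >10 columns, mixed types.
        "meets_rough_guide": (
            len(rows) > 100
            and len(columns) > 10
            and bool(numeric)
            and bool(categorical)
        ),
    }


# Invented sample data, far too small for a real project.
sample_csv = """store,region,price,units
A,North,9.99,12
B,South,14.50,7
C,North,11.25,
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
summary = summarize_shape(rows)
print(summary)
```

To try this on a real file, replace io.StringIO(sample_csv) with an opened CSV file. The result is only a screening aid; whether a data set is suitable still depends on your research questions.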

In the context of your data set, determine at least two realistic and meaningful research questions that might come up in an applied context. Generally, these should involve making substantive predictions from the data.

Choose a topic and data set that interests you for this project! All topics and research questions are allowed, from business applications, to sports, to Pokemon. A good place to start looking for data sets is kaggle.com, but data sets are available across the internet.

By November 4th at 11:59pm, you should have your data set, research questions, and a short description of the analysis you plan on performing ready to submit. I will look over these and give you some feedback that you can use to guide the remainder of your project.

Required Analysis

First, perform some exploratory data analysis by applying the analytical tools you have learned in this course. At a minimum, you should:

  • Provide descriptive statistics, including quantities like the average, range, and standard deviation of columns of continuous data, and the possible values in columns of categorical data.
  • Search for and describe interesting relationships between columns of continuous data. This should include scatter plots and, where appropriate, linear regression. Spearman and Pearson R values should also be reported. A 2-sample KS test may also be appropriate.
  • Search for and describe interesting relationships between columns of categorical data. This should be performed with a Chi-Squared test.
  • Apply at least one machine learning algorithm (be sure to choose an approach appropriate for your data); report and interpret your findings.
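To make the correlation and Chi-Squared items above concrete, here is a minimal sketch in plain Python that computes each quantity from its definition. The function names and the toy numbers are invented for illustration. In practice a library such as SciPy (scipy.stats.pearsonr, spearmanr, chi2_contingency) or the equivalent tools in R or Excel would normally be used, and those also report p-values, which this sketch does not.

```python
import statistics
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation from its definition (fails if a column is constant)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_r(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def chi_squared(pairs):
    """Chi-squared statistic and degrees of freedom for two categorical columns.

    `pairs` is a list of (category_a, category_b) tuples, one per row.
    No continuity correction is applied.
    """
    n = len(pairs)
    observed = Counter(pairs)
    row_totals = Counter(a for a, _ in pairs)
    col_totals = Counter(b for _, b in pairs)
    stat = 0.0
    for a, row_total in row_totals.items():
        for b, col_total in col_totals.items():
            expected = row_total * col_total / n
            stat += (observed[(a, b)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Toy continuous data: y rises with x, but not along a straight line.
x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]
print("mean:", statistics.fmean(y), "stdev:", statistics.stdev(y),
      "range:", (min(y), max(y)))
print("Pearson r:", pearson_r(x, y))
print("Spearman r:", spearman_r(x, y))

# Toy categorical data: the two columns are perfectly associated.
pairs = [("A", "yes")] * 10 + [("B", "no")] * 10
stat, dof = chi_squared(pairs)
print("chi-squared:", stat, "degrees of freedom:", dof)
```

In the toy data the Spearman value is higher than the Pearson value because the relationship is monotonic but not linear, which is one reason to report both. Note that SciPy's chi2_contingency applies a continuity correction to 2x2 tables by default, so its statistic can differ from the uncorrected value computed here.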

Then, answer your research questions with additional analysis as needed.

You should include at least two methods of analysis (chart types, statistical tests, …) that we did not explicitly cover in this class. Demonstrate some level of self-sufficiency; employers care what you can do, but they care more about your ability to figure out things you have never seen before.

Report

You should create a typed report that summarizes your analysis in the context of a clear narrative flow. There is no specific length requirement, but (with figures included) you should expect a report in the 8-10 page range. Use MLA formatting for any references (see the ethics report for details on format and citations). Follow the same structure and format requirements as for the other reports assigned this semester. Your audience is a fellow student at Mount Union who does not know much about data science or your data set. Explain what you did and what it means in an accessible style. Also, explain why you chose your data set and why your findings are interesting. The report will be assessed for appropriateness of the analysis, quality of the writing (spelling, grammar, flow, etc.), and the extent to which you respond to the items above.

Your report is due by 11:59pm on December 6th, which is the last day of classes for the semester.

Long Presentation

You will present your data and analysis in a recorded video due by 11:59pm on December 6th:

  • Your presentation will be 5-8 minutes in length.
  • Include visuals (PowerPoint slides or an alternative presentation format).
  • It is suggested that you have no more than 7-8 slides.
  • Provide an overview of the data set, state your research questions, then describe the analysis and your interpretations/answers.

Presentation dates will be assigned randomly, but (to be fair) everyone must turn in their slides by 10am on December 2nd.

Lightning Presentation

You will present your project to the class during the last two class periods of the semester (December 4th and 6th for MWF and December 3rd and 5th for TR):

  • Your presentation should be no more than 3 minutes long and should show the main result(s) of your analysis.
  • You may have one PowerPoint slide to accompany your lightning presentation. This slide is due by 11:59pm on December 2nd.
  • Your classmates will have 2 minutes to ask you any questions about your work.

Adding to GitHub

You are required to upload any Excel workbooks, Python files, or R files you use for your analysis to a public GitHub repository and share the link to the repository when you turn in your final report. You are also required to either upload your final report to the repository or create a separate README file to explain what is contained in your repository and its importance. Follow the guidelines provided in the GitHub tutorial and use the format we have used for the other projects in this course.

Grading Breakdown

Your final project will be graded out of 100 points.

  • You can gain up to 5 points for your Data Set, Research Questions, and Description assignment.
  • You can gain up to 5 points by submitting a completed rough draft of your final presentation by the deadline.
  • You can gain up to 20 points for your lightning presentation. Up to 5 points can be gained based on your slide and the other 15 points can be gained based on your presentation.
  • You can gain up to 35 points for your recorded presentation. Up to 15 points can be earned for a completed project and thorough analysis, up to 10 points can be earned based on the format and content of your slides, and up to 10 points can be earned based on how you present the information.
  • Finally, you can gain up to 35 points from your project report. You can gain up to 5 points for a completed and correctly formatted GitHub repository. 5 points are available for a correctly formatted report, including a sufficient number of correctly formatted references. Up to 15 points can be earned for a completed project and thorough analysis, and the final 10 points come from the quality of the writing you submit.