The final project for the course is a technical blog post related to a data analysis project you will work on piecemeal over the course of the semester.
The project is very open-ended. The objective is to demonstrate your skill in asking meaningful questions of your data and answering them with the results of a data analysis in R / R Markdown, as well as your proficiency in interpreting and presenting those results. The goal is not to conduct an exhaustive data analysis. The data analysis part should meet the following criteria:
1. Perform exploratory data analysis summarizing your data using descriptive statistics / summary statistics and visualizations relevant to your questions or ones that highlight some interesting insight.
2. Demonstrate at least two of the following techniques from class that help answer your question: PCA, hypothesis testing / confidence intervals, regression analysis (linear / logistic).
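As a rough illustration of the two criteria above, the sketch below runs a short exploratory summary and then two of the listed techniques in R. It assumes the built-in `mtcars` data set as a stand-in for your own data, and the question (does fuel economy differ by transmission type?) is chosen only for illustration, not as a required approach.

```r
# Assumption: the built-in `mtcars` data set stands in for your project data.
dat <- mtcars
dat$am <- factor(dat$am, labels = c("automatic", "manual"))

# Exploratory step: summary statistics and one plot tied to the question.
print(summary(dat$mpg))
boxplot(mpg ~ am, data = dat, ylab = "Miles per gallon")

# Technique 1: Welch two-sample t-test, which also reports a 95%
# confidence interval for the difference in mean mpg between groups.
t_res <- t.test(mpg ~ am, data = dat)

# Technique 2: linear regression of mpg on weight and transmission type.
fit <- lm(mpg ~ wt + am, data = dat)

print(t_res$conf.int)
print(coef(summary(fit)))
```

In a report, each result would be followed by a sentence or two interpreting it in terms of the original question.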
The first task is the project proposal: identify a dataset, understand the data, and write down the questions you plan to answer with it. You may pick a dataset from one of the resources mentioned on this page. The proposal should meet the following criteria:
1. Perform checks to determine the quality of the data (missing values, outliers, etc.).
2. Propose the questions you are interested in answering with the data.
3. Produce initial visualizations and, if required, transform the data to get it ready for analysis.
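The checks in the list above could start out along these lines. This is a minimal sketch that assumes the built-in `airquality` data set in place of your own; the 1.5 * IQR rule and the log transform are common defaults rather than requirements, and the variable names are specific to this example.

```r
# Assumption: the built-in `airquality` data set stands in for your own data.
dat <- airquality

# Count missing values in each variable.
missing_per_column <- colSums(is.na(dat))

# Flag potential outliers in one numeric variable with the 1.5 * IQR rule.
ozone <- dat$Ozone[!is.na(dat$Ozone)]
q <- quantile(ozone, c(0.25, 0.75))
fence <- 1.5 * IQR(ozone)
outliers <- ozone[ozone < q[1] - fence | ozone > q[2] + fence]

# A log transform is one common way to reduce right skew before plotting.
dat$log_ozone <- log(dat$Ozone)

print(missing_per_column)
print(outliers)
hist(dat$log_ozone, main = "log(Ozone)", xlab = "log(Ozone)")
```

Whether a flagged value is a genuine error or just an unusual observation is a judgment call that belongs in the proposal text.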
A good reference for ideas on questions and EDA in general:
More information on the format:
The proposal should be about 2 or more pages long, not exceeding 10 pages including the appendix. It should include roughly the following sections:
1. Background or the context of data selected – sources, description of how it was collected, time period it represents, context in which it was collected if available, perhaps why you selected it
2. Description of the data – how big is it (number of observations, variables), how many numeric variables, how many categorical variables, description of the variables
3. Goal – the questions you plan to answer with the data
4. Analysis – descriptive statistics and visualization of key variables
5. Summary of findings from the analysis and further questions for future analysis
6. References – links to the data or analysis sources you have referenced for the report
7. Appendix – any visualizations that do not directly support your questions can go here
The project should include:
1. Introduction: What is your research question? Why do you care? Why should others care? If you know of any other related work done by others, please include a brief description.
2. Data: Include context about the data covering:
a. Data source: Include the citation for your data, and provide a link to the source.
b. Data collection: Describe how the data was collected.
c. Cases: What are the cases (units of observation or experiment)? What do the rows represent in your dataset?
d. Variables: What are the variables you will be studying?
e. Type of study: Was it an observational study or an experiment?
f. Data clean-up: (Optional) If you had to do any data clean up (missing values, outliers, transformation), include a very brief description of your steps.
3. Exploratory Data Analysis: Summarize your data using descriptive statistics / summary statistics and visualizations relevant to your questions or ones that highlight some interesting insight. Additional plots not relevant to your research question can be included in the appendix.
4. Data Analysis: Pick and perform two of the following techniques from class that help answer your question about the dataset: PCA, hypothesis testing / confidence intervals, regression analysis (linear / logistic).
5. Conclusion: Summarize your findings and include a discussion of what you have learned about your data through this project. You may also want to include limitations of your approach and include ideas for possible future work.
6. References: Include links that you have referenced for this project.
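For the Data Analysis section, PCA and logistic regression might look roughly like the sketch below. It again assumes the built-in `mtcars` data set as a placeholder for your own data, and the choice of variables is illustrative only.

```r
# Assumption: the built-in `mtcars` data set stands in for your project data.
num_vars <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# PCA on centered, standardized variables, so that variables measured on
# large scales (such as disp) do not dominate the components.
pca <- prcomp(num_vars, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Logistic regression: transmission type (0 = automatic, 1 = manual)
# modeled as a function of car weight.
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)

print(round(var_explained, 3))
print(coef(summary(logit_fit)))
```

The proportion of variance explained helps decide how many components to discuss, and the sign of the `wt` coefficient indicates the direction of the association with transmission type.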