Modeling Data in the Tidyverse

Overview

Developing insights about your organization, business, or research project depends on effective modeling and analysis of the data you collect. Building effective models requires understanding the different types of questions you can ask and how to map those questions to your data. Different modeling approaches can be chosen to detect interesting patterns in the data and identify hidden relationships. This course covers the types of questions you can ask of data and the various modeling approaches that you can apply. Topics covered include hypothesis testing, linear regression, nonlinear modeling, and machine learning. With this collection of tools at your disposal, as well as the techniques learned in the other courses in this specialization, you will be able to make key discoveries from your data for improving decision-making throughout your organization.

In this specialization we assume familiarity with the R programming language. If you are not yet familiar with R, we suggest you first complete R Programming before returning to complete this course.

Syllabus

Modeling Data Basics

This module introduces the different types of questions you can ask of data and how to map those questions to the data you collect. It also surveys the modeling approaches that can be chosen to detect interesting patterns in the data and identify hidden relationships.

Inference

Inferential analysis is what analysts carry out after they have described and explored their dataset. Having understood the dataset better, analysts often try to infer something from the data, typically by using statistical tests. We have briefly discussed how models can be used for inference and prediction. This lesson looks at what that means.

Linear Modeling

Linear models are among the most commonly used models in data analysis because they are computationally efficient and easy to interpret. A solid understanding of linear models and how they work is critical for any work in data science. The tidyverse provides a set of tools that make linear modeling more efficient and streamlined.

Multiple Linear Regression

Multiple linear regression is needed when you want to account for confounding factors or include other predictors in your model of the response. R provides a straightforward way to do this via the formula interface to the lm() function.
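As a minimal sketch of the formula interface, the example below uses R's built-in mtcars dataset. The choice of variables (mpg as the response, wt and hp as predictors) is an assumption made for illustration and is not taken from the course materials.

```r
# Simple linear regression: fuel efficiency (mpg) as a function of weight (wt).
fit_simple <- lm(mpg ~ wt, data = mtcars)

# Multiple linear regression: add horsepower (hp) as a second predictor by
# extending the right-hand side of the formula with `+`.
fit_multiple <- lm(mpg ~ wt + hp, data = mtcars)

# Compare the estimated coefficients of the two models. The wt coefficient
# changes once hp is included in the model.
coef(fit_simple)
coef(fit_multiple)

# Full summary with standard errors, t statistics, and p-values.
summary(fit_multiple)
```

If the broom package is installed, broom::tidy(fit_multiple) returns the same coefficient table as a tibble, which fits more naturally into a tidyverse workflow.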

Beyond Linear Regression

While this lesson on inference has focused on linear regression, it is not the only analytical approach available. It is, however, arguably the most commonly used. Beyond that, many statistical tests and approaches are slight variations on linear regression, so a solid foundation in linear regression makes these other tests and approaches much simpler to understand. For example, what if you did not want to measure the linear relationship between two variables, but instead wanted to know whether an observed average differs from an expected value?
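One way to see the connection is to compare a one-sample t-test with an intercept-only linear model. The sketch below uses the built-in mtcars dataset, and the expected value of 20 is a hypothetical number chosen purely for illustration.

```r
# Hypothetical expected value for average fuel efficiency (an assumption
# made for illustration only).
expected_mpg <- 20

# One-sample t-test: does the mean of mpg differ from the expected value?
tt <- t.test(mtcars$mpg, mu = expected_mpg)

# The same question as a linear model: regress the shifted response on an
# intercept only. The intercept estimates mean(mpg) - expected_mpg.
fit <- lm(I(mpg - expected_mpg) ~ 1, data = mtcars)
lm_t <- summary(fit)$coefficients["(Intercept)", "t value"]

# The two approaches give the same t statistic.
all.equal(unname(tt$statistic), lm_t)
```

The intercept's t statistic and p-value match those reported by t.test(), which illustrates why this test can be viewed as a special case of linear regression.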

Hypothesis Testing

Hypothesis testing describes a family of statistical techniques for determining whether the data you collect provide evidence about the value of an unknown parameter of interest. The goal of hypothesis tests is to make inferences while accounting for variability in the data that can lead to spurious results.
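As a small, hedged illustration, the sketch below runs a two-sample t-test on the built-in mtcars dataset. Here the unknown parameter of interest is the difference in mean fuel efficiency between cars with automatic and manual transmissions (the am variable). The dataset and the question are assumptions chosen for illustration, not part of the course materials.

```r
# Null hypothesis: mean mpg is the same for automatic (am = 0) and
# manual (am = 1) cars. t.test() uses Welch's test by default, which does
# not assume equal variances in the two groups.
res <- t.test(mpg ~ am, data = mtcars)

# The p-value measures how surprising the observed difference would be if
# the null hypothesis were true.
res$p.value

# The confidence interval expresses the uncertainty in the estimated
# difference in means that comes from variability in the data.
res$conf.int
```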

Prediction Modeling

Prediction modeling is an essential activity in data science and involves building systems for making predictions based on previously observed data. These models are typically very flexible and can capture a range of different relationships.

The tidymodels Ecosystem

R has many helpful modeling packages thanks to the work of RStudio. There are hundreds of different machine learning algorithms, and the tidymodels R packages bring many of them into a single framework, allowing you to use many different machine learning models easily.
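The sketch below gives a rough idea of what a tidymodels workflow can look like. It assumes the tidymodels package is installed, and the dataset (mtcars), the predictors, and the 80/20 split are illustrative assumptions rather than anything prescribed by the course.

```r
library(tidymodels)

# Split the data into training and testing sets (rsample).
set.seed(123)
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test  <- testing(split)

# Specify the model and the engine that fits it (parsnip). Other model
# types follow the same pattern with a different specification function.
lm_spec <- linear_reg() %>%
  set_engine("lm")

# Fit the specification to the training data.
lm_fit <- lm_spec %>%
  fit(mpg ~ wt + hp, data = train)

# Predict on the held-out data and attach the observed values.
preds <- predict(lm_fit, new_data = test) %>%
  bind_cols(test %>% select(mpg))

# Evaluate the predictions with root mean squared error (yardstick).
rmse(preds, truth = mpg, estimate = .pred)
```

Because the model specification is separate from the fitting step, trying a different algorithm mostly means replacing the specification while the splitting, prediction, and evaluation code stays the same.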

Case Studies

This case study demonstrates an approach to building a model for predicting outdoor air pollution concentrations in the United States.

Summary of tidymodels

The tidymodels collection of packages can be overwhelming at first glance. Here, we provide a quick summary chart to help navigate all of the packages and when they should be used.

Project: Modeling Data in the Tidyverse

In this project, you will practice building models with the tidyverse for classifying consumer complaint data from the Consumer Financial Protection Bureau (CFPB). This project includes both a Peer Review step, in which you'll upload R Markdown and knitted HTML files, and a Quiz step, in which you'll answer questions about the predictions made by your classification algorithm.