Overview
This course gives you a comprehensive introduction to both the theory and practice of machine learning. You will learn to use Python along with industry-standard libraries and tools, including Pandas, scikit-learn, and TensorFlow, to ingest, explore, and prepare data for modeling and then train and evaluate models using a wide variety of techniques. Those techniques include linear regression with ordinary least squares, logistic regression, support vector machines, decision trees and ensembles, clustering, principal component analysis, hidden Markov models, and deep learning.
A key feature of this course is that you not only learn how to apply these techniques but also the conceptual basis underlying them, so that you understand how they work, why you are doing what you are doing, and what your results mean. The course also features real-world datasets, drawn primarily from the realm of public policy. It is based on an introductory machine learning course offered to graduate students at the University of Chicago and will serve as a strong foundation for deeper and more specialized study.
Syllabus
- Machine Learning and the Machine Learning Pipeline
- In this module, you will be introduced to the machine-learning pipeline and learn about the initial work on your data that you need to do prior to modeling. You will learn how to ingest data using Pandas, a standard Python library for data exploration and preparation. Next, we turn to the first modeling approach that we explore in this class, linear regression with ordinary least squares.
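  For instance, a minimal sketch of the ingestion and exploration step with Pandas might look like the following (the file name and columns are hypothetical, not the course's actual data):

  ```python
  import pandas as pd

  # Load a CSV file into a DataFrame ("data.csv" is a hypothetical file name).
  df = pd.read_csv("data.csv")

  # A first look at the data before any modeling.
  print(df.shape)       # (rows, columns)
  print(df.dtypes)      # data type of each column
  print(df.head())      # first five rows
  print(df.describe())  # summary statistics for numeric columns

  # Count missing values per column, a common early check.
  print(df.isna().sum())
  ```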
- Least Squares and Maximum Likelihood Estimation
- In this module, you continue the work with linear regression that we began in the last module. You will learn more about how to evaluate such models and how to select the important features and exclude those that are not statistically significant. You will also learn about maximum likelihood estimation, a probabilistic approach to estimating your models.
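  One common way to inspect coefficient significance is with the statsmodels library (one possible tool, not necessarily the course's choice); a minimal sketch on synthetic data:

  ```python
  import numpy as np
  import statsmodels.api as sm

  # Synthetic data: y depends on x1 but not on x2 (illustrative only).
  rng = np.random.default_rng(0)
  n = 200
  x1 = rng.normal(size=n)
  x2 = rng.normal(size=n)            # irrelevant feature
  y = 3.0 * x1 + rng.normal(size=n)

  # Fit ordinary least squares with an intercept term.
  X = sm.add_constant(np.column_stack([x1, x2]))
  model = sm.OLS(y, X).fit()

  # The summary reports R-squared and a p-value per coefficient;
  # a large p-value for x2 suggests it can be dropped.
  print(model.summary())
  ```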
- Basis Functions and Regularization
- This module introduces you to basis functions, and polynomial expansions in particular, which allow you to use the same linear regression techniques that we have been studying so far to model non-linear relationships. Then, we learn about the bias-variance tradeoff, a key relationship in machine learning. Methods like polynomial expansion may help you train models that capture the relationships in your training data quite well, but those same models may perform badly on new data. You will learn about regularization methods that can help balance this tradeoff and create models that avoid overfitting.
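  A minimal sketch of this idea with scikit-learn, combining a polynomial basis expansion with ridge regularization on synthetic data (the degree and penalty strength are arbitrary choices for illustration):

  ```python
  import numpy as np
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import PolynomialFeatures
  from sklearn.linear_model import Ridge

  # Synthetic non-linear data (illustrative only).
  rng = np.random.default_rng(0)
  X = rng.uniform(-3, 3, size=(100, 1))
  y = np.sin(X).ravel() + rng.normal(scale=0.2, size=100)

  # Expand the single feature into polynomial basis functions, then fit
  # a ridge-regularized linear model on the expanded features.
  model = make_pipeline(PolynomialFeatures(degree=5), Ridge(alpha=1.0))
  model.fit(X, y)
  print(model.predict([[1.5]]))
  ```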
- Model Selection and Logistic Regression
- In this module, you first learn more about evaluating and tuning your models. We look at cross-validation techniques that will help you get more accurate measurements of your model's performance, and then you see how to use them along with pipelines and GridSearch to tune your models. Finally, we look at the theory and practice of our first technique for classification, logistic regression.
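  As an illustration, a minimal sketch of tuning a logistic regression inside a scikit-learn Pipeline with GridSearchCV (a built-in dataset stands in for the course's data):

  ```python
  from sklearn.datasets import load_breast_cancer
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import GridSearchCV

  X, y = load_breast_cancer(return_X_y=True)

  # Chain preprocessing and the classifier so cross-validation
  # refits the scaler on each training fold (avoiding leakage).
  pipe = Pipeline([
      ("scale", StandardScaler()),
      ("clf", LogisticRegression(max_iter=1000)),
  ])

  # Search over the regularization strength C with 5-fold cross-validation.
  grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5)
  grid.fit(X, y)
  print(grid.best_params_, grid.best_score_)
  ```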
- More Classifiers: SVMs and Naive Bayes
- You will learn about two more classification techniques in this module: first, Support Vector Machines (SVMs) and then Naive Bayes, a quick and highly interpretable approach that uses Bayes' theorem.
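  A minimal sketch of both classifiers in scikit-learn, using a built-in dataset rather than the course's data:

  ```python
  from sklearn.datasets import load_iris
  from sklearn.model_selection import train_test_split
  from sklearn.svm import SVC
  from sklearn.naive_bayes import GaussianNB

  X, y = load_iris(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.3, random_state=0)

  # A kernelized support vector machine and a Gaussian Naive Bayes
  # classifier trained on the same split.
  svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
  nb = GaussianNB().fit(X_train, y_train)

  print("SVM accuracy:        ", svm.score(X_test, y_test))
  print("Naive Bayes accuracy:", nb.score(X_test, y_test))
  ```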
- Tree-Based Models, Ensemble Methods, and Evaluation
- In this module, you will first learn about classification using decision trees. We will see how to create models that use individual decision trees, and then ensemble models, which use many trees, such as bagging, boosting, and random forests. Then, we learn more about how to evaluate the performance of classifiers.
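  For illustration, a minimal sketch comparing a single decision tree with a random forest and evaluating the result (a built-in dataset stands in for the course's data):

  ```python
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import classification_report

  X, y = load_breast_cancer(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.3, random_state=0)

  # A single decision tree versus a random forest (an ensemble of trees
  # trained on bootstrapped samples with random feature subsets).
  tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
  forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(
      X_train, y_train)

  print("Tree accuracy:  ", tree.score(X_test, y_test))
  # Precision, recall, and F1 per class give a richer view than accuracy alone.
  print(classification_report(y_test, forest.predict(X_test)))
  ```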
- Clustering Methods
- To this point, we have been focusing on supervised learning, training models that estimate a target variable that you have specified. In this module, we take our first look at unsupervised learning, a domain of machine learning that finds patterns and relationships in data without you ever defining a target. In particular, we look at a variety of clustering techniques, beginning with k-means and hierarchical clustering, and then distribution- and density-based clustering.
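  A minimal sketch of k-means and hierarchical (agglomerative) clustering with scikit-learn on synthetic data; note that no target variable is passed to either model:

  ```python
  import numpy as np
  from sklearn.datasets import make_blobs
  from sklearn.cluster import KMeans, AgglomerativeClustering

  # Synthetic data with three blobs; the labels are discarded.
  X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

  # k-means partitions points around k learned centroids.
  kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
  print(kmeans.cluster_centers_)

  # Agglomerative clustering merges points bottom-up into a hierarchy.
  agg = AgglomerativeClustering(n_clusters=3).fit(X)
  print(np.bincount(agg.labels_))  # cluster sizes
  ```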
- Dimensionality Reduction and Temporal Models
- You will look at two new techniques in this module. The first is Principal Component Analysis, a powerful dimensionality reduction technique that you can use to project high-dimensional features into lower-dimensional spaces. This can serve a range of purposes, including feature selection, preventing overfitting, visualizing high-dimensional data in two or three dimensions, and more. Then, you will study hidden Markov models, a technique that you can use to model sequences of states, where each state depends on the one that came before.
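  A minimal sketch of PCA with scikit-learn (the built-in digits dataset stands in for the course's data; the hidden Markov model portion is not sketched here):

  ```python
  from sklearn.datasets import load_digits
  from sklearn.decomposition import PCA

  # 64-dimensional image features projected onto 2 principal components,
  # e.g. for visualization in a two-dimensional space.
  X, _ = load_digits(return_X_y=True)
  pca = PCA(n_components=2)
  X_2d = pca.fit_transform(X)

  print(X_2d.shape)                     # (1797, 2)
  print(pca.explained_variance_ratio_)  # variance captured per component
  ```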
- Deep Learning
- This module introduces you to one of the most hyped topics in machine learning: deep learning with feed-forward neural networks and convolutional neural networks. You will learn how these techniques work and where they might be very effective, or very ineffective. We explore how to design, implement, and evaluate such models using Python and Keras.
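  As a rough illustration, a minimal sketch of a small feed-forward network in Keras on synthetic data (the layer sizes and training settings are arbitrary, not the course's design):

  ```python
  import numpy as np
  from tensorflow import keras

  # Synthetic binary classification data (illustrative only).
  rng = np.random.default_rng(0)
  X = rng.normal(size=(500, 20))
  y = (X[:, 0] + X[:, 1] > 0).astype("float32")

  # A small feed-forward network: two hidden layers, sigmoid output.
  model = keras.Sequential([
      keras.layers.Input(shape=(20,)),
      keras.layers.Dense(32, activation="relu"),
      keras.layers.Dense(32, activation="relu"),
      keras.layers.Dense(1, activation="sigmoid"),
  ])
  model.compile(optimizer="adam", loss="binary_crossentropy",
                metrics=["accuracy"])

  # Train briefly, holding out 20% of the data for validation.
  model.fit(X, y, epochs=5, validation_split=0.2, verbose=0)
  print(model.evaluate(X, y, verbose=0))  # [loss, accuracy]
  ```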
Taught by
Dr. Nick Feamster