STAT 508: Applied Data Mining & Statistical Learning

Overview

Data mining and statistical learning methods use a variety of computational tools for understanding large, complex datasets. In some cases, the focus is on building models to predict a quantitative or qualitative output based on a collection of inputs. In others, the goal is simply to find relationships and structure in data with no specific output variable. This course takes an applied approach to understanding the methodology, motivation, assumptions, strengths, and weaknesses of the most widely applicable methods in this field.

Syllabus

This course covers methodology, major software tools, and applications in data mining. By introducing the principal ideas in statistical learning, the course helps students understand the conceptual underpinnings of data mining methods. It emphasizes using existing software packages (mainly in R) over having students develop the algorithms themselves. Students are required to work on projects to practice applying this software. Topics include statistical learning; resampling methods; linear regression; variable selection; regression shrinkage; dimension reduction; non-linear methods; logistic regression; discriminant analysis; nearest neighbors; decision trees; bagging; boosting; support vector machines; principal components analysis; and clustering.

Prerequisites

STAT 501 (Regression Methods) or a similar course that covers analysis of research data through simple and multiple regression and correlation; polynomial models; indicator variables; step-wise, piece-wise, and logistic regression.
Basics of probability, expectation, and conditional distributions. Review the Basic Statistical Concepts notes on the STAT online site.
Matrix algebra and multivariate calculus are beneficial but not required. Review the Matrix Algebra Review notes on the STAT online site.
The examples in the course use R, and students will complete weekly R Labs that apply statistical learning methods to real-world data. Extensive guidance in using R will be provided, but basic programming skills in R or exposure to a programming language such as MATLAB or Python will be useful. Introductions to R are available at Statistical R Tutorial and CRAN R Project Intro Manual.

Textbooks

Required: An Introduction to Statistical Learning, with applications in R (2013), G. James, D. Witten, T. Hastie, R. Tibshirani (Springer).

Other Resources

Learning Online Orientation
Obtaining Statistical Software

Example datasets

Datasets were taken from the UCI machine learning database repository:
1. Iris: iris.data, source (including data set information)
Datasets were taken from An Introduction to Statistical Learning:
1. Auto.data
2. Advertising.data
3. Credit.data
Other datasets:
1. smsa.data
2. prostate.data