This course introduces the necessary concepts and common techniques for analyzing data. The primary emphasis is on the process of data analysis, including data preparation, descriptive analytics, model training, and result interpretation. The process starts with removing distractions and anomalies, followed by discovering insights, formulating propositions, validating evidence, and finally building professional-grade solutions. Following the process properly, regularly, and transparently brings credibility and increases the impact of the results.
This course covers topics including Exploratory Data Analysis, Feature Screening, Segmentation, Association Rules, Nearest Neighbors, Clustering, Decision Trees, Linear Regression, Logistic Regression, and Performance Evaluation. It also reviews statistical theory, matrix algebra, and computational techniques as necessary.
This course prepares students to carry out the data preparation and analysis process. Besides developing Python code for carrying out the process, students will learn to tune the software tools for the most efficient implementation and optimal performance. At the end of this course, students will have built their own inventory of data analysis code and the confidence to advocate their propositions to business stakeholders.
Required Textbook: This course does not mandate any textbooks because the lecture notes are self-contained.
Optional Materials: A Practitioner's Guide to Machine Learning (abbreviated as PGML in the reading assignments)
Software Requirements: Python version 3.11 or above with the latest compatible versions of NumPy, SciPy, Pandas, Scikit-learn, and Statsmodels libraries.
To succeed in this course, learners should have a basic knowledge of linear algebra, statistics, set theory, and probability theory, as well as basic Python and SQL skills. A few courses that can help equip you with the database knowledge needed for this course are: Introduction to Relational Databases, Relational Database Design, and Relational Database Implementation and Applications.
Syllabus
- Module 1: Process of Preparing and Analyzing Data
- Welcome to Data Preparation and Analysis! Module 1 guides students through crafting informative and visually appealing histograms, a fundamental aspect of data visualization. Students will learn techniques for measuring the location and scale of data and for understanding the origins and impacts of noise and missing values in datasets. This module also introduces the CRISP-DM Process, a structured approach to data mining, along with Gartner's Analytics Ascendancy Model for advanced data analysis. Additionally, students will explore the distinction between raw data and processed information, a key concept for effective data interpretation and decision-making.
- Module 2: Measure and Visualize Correlation
- Module 2 delves into the intricacies of statistical analysis, beginning with a thorough understanding of the p-value concept and its significance as a Type I Error indicator. Students will learn to apply statistical tests in Python to identify significantly correlated features, exploring various correlation metrics tailored for categorical, mixed-type, and continuous features. This module emphasizes practical application, equipping students with the skills to calculate and interpret these metrics using Python, thereby enhancing their ability to conduct sophisticated data analysis and draw meaningful conclusions from complex datasets.
- Module 3: Market Basket Analysis
- Module 3 offers a deep dive into Association Rules, teaching students how to adapt these rules to identify valuable feature combinations that generate specific label values. Learners will learn to set appropriate thresholds for Support and Confidence and gain a comprehensive understanding of the Apriori Algorithm and the role of Frequent Itemsets within it. This module covers the calculation of common metrics for Association Rules and familiarizes students with the relevant terminology. Additionally, learners will explore the practical application of Association Rules in Market Basket Analysis, including strategies for cross-selling, up-selling, and product bundling, equipping them with valuable skills for data-driven decision-making in business contexts.
- Module 4: Partitioning, Segmenting, and Clustering of Observations
- In Module 4, students will learn how to describe and interpret profiles of clusters, gaining proficiency in deploying the K-Means and K-Modes clustering algorithms. They will explore the application of Recency, Frequency, and Monetary (RFM) Analysis to identify the most valuable customers in retail business settings. The module also covers the technique of Simple Random Sampling with the option of incorporating stratification variables, enhancing the precision of data analysis. Furthermore, it emphasizes the importance of objectively validating models using a testing partition, ensuring the reliability and effectiveness of the analytical models in real-world scenarios.
- Module 5: Linear Regression
- This module delves into feature importance analysis in machine learning, covering Shapley Values, feature selection methods, statistical evaluation, feature interaction, aliasing, and the Least Squares Algorithm. Students will master these concepts to build robust and interpretable models.
- Module 6: Binary Logistic Regression
- In Module 6, students will master feature selection in machine learning by exploring the Forward and Backward Selection Methods, the All-Possible Subsets Method, and the concepts of complete and quasi-complete separation. Students will also discover association rules for identifying separations, interpret model parameters and predicted probabilities, and delve into the concepts of maximum likelihood estimation, odds, and odds ratios.
- Module 7: Decision Trees - The CART Algorithm
- Module 7 will equip students with the ability to harness the power of tree-based models to uncover hidden patterns in their data. Students will be able to describe clusters effectively, intelligently set algorithm parameters, construct business rules from tree results, and utilize variance metrics, entropy values, and Gini indices for optimal tree construction.
- Module 8: Evaluating the Performance of Models
- Module 8 delves into evaluation metrics for machine learning models. Students will master precision-recall curves, lift curves, and receiver operating characteristic (ROC) curves. Additionally, students will learn methods for calculating probability thresholds using Kolmogorov-Smirnov statistics and F1 scores. They will also explore metrics such as misclassification rate, area under the curve (AUC), and root mean squared error (RMSE), along with techniques for computing RMSE and detecting severely misfitted observations using model-specific residuals.
- Summative Course Assessment
- This module contains the summative course assessment that has been designed to evaluate your understanding of the course material and assess your ability to apply the knowledge you have acquired throughout the course. Be sure to review the course material thoroughly before taking the assessment.
Taught by
Ming-Long Lam