Overview
Explore the critical importance of data quality in machine learning with this conference talk from NDC Oslo 2022. Learn how to select the right data for robust machine learning models, avoiding common pitfalls in dataset creation. Discover techniques for ensuring field measurements align with expectations, selecting data conducive to good model performance, and designing experiments that yield desired results. Gain insights into generalizing models to new data, handling missing values, detecting outliers, and addressing class imbalance. Delve into topics such as the curse of dimensionality, feature engineering, and avoiding selection and survivorship biases. Understand the significance of appropriate metrics and model interpretability. Apply these language-agnostic techniques to enhance your machine learning projects, regardless of your preferred programming language or framework.
Syllabus
Intro
Learning machine learning
Basic model types
The danger of assumptions
What do our features mean?
How do we see how good a model is?
Reserve data for testing
Missing values
Detecting outliers
Adjusting for class imbalance
Curse of dimensionality
Feature engineering
Creating generalisable models
Selection bias
Survivorship bias
How does your model make predictions?
Are you using the right metrics?
Is newer and shinier better?
Taught by
NDC Conferences