Overview
This conference talk from ODSC West 2015 covers data preparation for analysis in R. It introduces the fundamentals of data quality and shows how to automate routine preparation steps in a principled manner. Interactive demonstrations in the open-source R analysis environment illustrate common data preparation pitfalls and how to detect and fix them. Materials are available from the provided GitHub repository for following along or practicing later. Topics include faulty sensor situations, missing variables, novel categorical levels, and compact coding, along with treatment plans, user interfaces, and operational issues in data preparation. The talk is led by John Mount and Nina Zumel, experienced data scientists and authors, and also covers linear regression, calibration, interpretation, and avoiding overfitting.
Syllabus
Intro
Workshop Outline
Workshop Agenda
Workshop Goals
Data Preparation
Faulty Sensor Situation
Systematically missing variables
Building missing variables
Missing values
Pragmatic solution
Novel categorical levels
New data
Wyoming
Chemical categorical variables
Dealing with new levels
VTreat solution
Categorical variables
Compact coding
Indicator vs numerical variables
Treatment Plan
User Interface
Treatment Example
Linear Regression
Calibration
Interpretation
Operational Issues
Overfitting
Data fussing
John Mount
Taught by
Open Data Science