Explore natural language processing (NLP) concepts, review advanced data cleaning and vectorization techniques, and learn how to build machine learning classifiers.
Syllabus
Introduction
- Welcome
- What you should know
- What tools do you need?
- Using the exercise files
- What are NLP and NLTK?
- NLTK setup and overview
- Reading in text data
- Exploring the dataset
- What are regular expressions?
- Learning how to use regular expressions
- Regular expression replacements
- Machine learning pipeline
- Implementation: Removing punctuation
- Implementation: Tokenization
- Implementation: Removing stop words
- Introducing stemming
- Using stemming
- Introducing lemmatizing
- Using lemmatizing
- Introducing vectorizing
- Count vectorization
- N-gram vectorizing
- Inverse document frequency weighting
- Introducing feature engineering
- Feature creation
- Feature evaluation
- Identifying features for transformation
- Box-Cox power transformation
- What is machine learning?
- Cross-validation and evaluation metrics
- Introducing random forest
- Building a random forest model
- Random forest with holdout test set
- Random forest model with grid search
- Evaluate random forest model performance
- Introducing gradient boosting
- Gradient-boosting grid search
- Evaluate gradient-boosting model performance
- Model selection: Data prep
- Model selection: Results
- Next steps
Taught by
Derek Jedamski