Overview
Syllabus
Intro
Blueprint for Supervised Machine Learning
Goal: Maximize Manual Annotation Efficiency. 1. Deduplicate: minimize manual effort. Find unique subjects in our data sets so that humans annotate each subject only once, and so that duplicate data does not leak across the training and test sets.
Part 1 - Textual Deduplication: Measuring Similarity. How can we find
Time Complexity of Pairwise Comparisons
Textual Deduplication: LSH Bitwise Rotations
Locality Sensitive Hashing: 32 Bit Simhash
Part 2 - Text Normalization Machine Representations
Text Normalization: Unicode Examples. What's the difference?
Text Normalization: Halfwidth & Fullwidth Katakana
Text Normalization: Katakana Code Block
Text Normalization: Halfwidth & Fullwidth Forms
Text Normalization: Hebrew Presentation Forms
Text Normalization: Unicode Normalization Forms
Text Normalization: Composing Marks Normalization
Text Normalization: Katakana Normalization
Text Normalization: Hebrew Normalization
Additional Normalization Resources
Conclusion
Attributions: The Noun Project
Taught by
BasisTech