Overview
Explore the challenges and solutions for multilingual Natural Language Processing (NLP) models in this 45-minute PyCon US talk by Shreya Khurana. Dive into the complexities of language identification, transliterated and code-switched text, and the use of multilingual BERT models. Learn about existing Python frameworks for language identification and their limitations. Discover how web crawlers and self-generated datasets can compensate for the lack of annotated data for transliterated and code-switched text. Examine, through practical examples, the performance of Google's multilingual BERT model, which was trained on 104 languages. Gain insights into evaluating NLP models on various tasks in a multilingual context. Additional resources and code examples are available on GitHub.
Syllabus
Introduction
About me
Outline
Why multilingual data
Tasks associated with language systems
Syntax mixing
Transliterated text
Language identification
Language identification in practice
Other examples
langid
Blanked
Python
Limitations
Data augmentation
Simple example
The Transformer
Multi-headed attention
State of the art (SOTA)
Why is it special
WordPiece Processing
Statistics of Languages
BERT Masked Language Model
Prediction Function
Code-Switched Example
Lyrics Example
Task Evaluation
Generation Evaluation
Summary
Taught by
PyCon US