Efficient OCR for Building a Diverse Digital History

Overview

In this 46-minute talk from Harvard's Big Data Conference, Professor Melissa Dell presents an approach to Optical Character Recognition (OCR) aimed at making digital archives more representative of diverse historical documents. The approach treats OCR as a character-level image retrieval problem trained with contrastive learning, which Dell presents as more efficient and extensible than traditional sequence-to-sequence architectures. It requires fewer labeled examples and less computation while maintaining accuracy, which is particularly beneficial for low-resource document collections. Dell demonstrates practical applications, from supply chain network analysis to knowledge graph construction, and discusses how the technology enables greater community participation in digital preservation. Technical topics include the CRNN architecture, object detection, data augmentation, and zero-shot performance across multiple languages, with special attention to Japanese character recognition results.
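To make the "OCR as character-level image retrieval" idea concrete, here is a minimal sketch, not the method shown in the talk. It assumes character crops have already been localized, and it substitutes a toy encoder (flatten and L2-normalize the pixels) for a contrastively trained vision encoder. The names `embed`, `recognize`, and the 3x3 glyphs are hypothetical and exist only for illustration.

```python
import math

def embed(crop):
    """Toy stand-in for a contrastively trained encoder (an assumption):
    flatten the crop and L2-normalize it so dot products are cosine similarities."""
    flat = [float(p) for row in crop for p in row]
    norm = math.sqrt(sum(p * p for p in flat))
    return [p / norm for p in flat] if norm > 0 else flat

def recognize(crop, index):
    """Return the label whose reference embedding is most similar to the crop."""
    query = embed(crop)

    def similarity(label):
        return sum(q * r for q, r in zip(query, index[label]))

    return max(index, key=similarity)

# Hypothetical 3x3 reference glyphs; a real index would hold rendered or
# labeled character images, one or more per character.
REFERENCE_GLYPHS = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

# The index maps each label to its embedding.
index = {label: embed(glyph) for label, glyph in REFERENCE_GLYPHS.items()}

# A degraded "L" with one missing pixel is still closest to the "L" reference.
noisy_l = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]
print(recognize(noisy_l, index))

# Supporting a new character means adding an entry to the index; in this
# sketch, nothing is retrained.
index["-"] = embed([[0, 0, 0], [1, 1, 1], [0, 0, 0]])
```

In this framing, recognition is a nearest-neighbor lookup, so extending coverage to a new character or script amounts to adding reference embeddings to the index. That is one way to read the extensibility claim above.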

Syllabus

Introduction
Digital Texts
Mass Digitization
Poor Performance
Sequence to Sequence Architecture
Efficient OCR
Digitization Tools
Modern OCR
EffOCR vs. Seq2Seq
CRNN Architecture
OCR Architecture
Word Recognition
Models
Object Detection
Hard Negative Mining
Data Augmentation
OCR Benchmarks
Document Collections
Zero-Shot Performance
Character Error Rate
Comparisons
Baseline Results
Japanese Results
Open Source OCR
Zero-Shot Performance
Sample Efficiency
Ablations
Different Encoders
OCR at Scale
Custom Layout Model
Nonword Rate
Applications
Overall Data
Knowledge Graph
Supply Chain Network
Community Engagement
Training and Deployment
OCR Encourages Community Engagement
Characters and Words
Language Extensibility
Omitting the Language Model
Decoupling Localization and Recognition
Limitations
Extensions
Fun Example
Conclusion

Taught by

Harvard CMSA
