Overview
Explore the open-source Frontera framework for large-scale web crawling in this EuroPython 2015 conference talk. Discover how to build both real-time distributed web crawlers and crawlers focused on specific websites, using Frontera's customizable URL metadata storage, crawling strategy management, and transport layer abstraction. Learn how Frontera integrates with Scrapy, Kafka, and HBase to form a distributed crawler. Gain insights into the framework's architecture, features, and use cases, including a demonstration of collecting statistics from the Spanish internet. Understand the motivation behind Frontera, its single-threaded and real-time capabilities, and future development plans. Suited to developers interested in advanced web crawling techniques and large-scale data collection.
Syllabus
About me
What is Frontera
What is Terra
Motivation
Single threaded
Single integration
Real time
Unique content
Metadata storage
Architecture
Scraping
Simple spider
Use cases
Architecture distributed
Features
Requirements
Quick start
Spanish crawl
Future plans
Questions
Taught by
EuroPython Conference