Koalas - Scaling Pandas API on Apache Spark

Databricks via YouTube

Overview

Explore Koalas, an open-source Python package that implements the pandas API on Apache Spark, in this 58-minute hands-on tutorial. Learn how to scale pandas workloads to big data, moving from single-machine to distributed computing without learning a new framework. The tutorial covers Koalas' latest functionality, including Apache Spark 3.0 integration, and its potential as a standard API for large-scale data science. Get started with Koalas, compare the pandas and Koalas APIs for DataFrame transformation and feature engineering, and understand the differences between single-machine pandas and distributed Koalas environments. Topics include indexing, data visualization, analysis techniques, and machine learning integration using MLflow, progressing from basic operations to time series analysis, outlier detection, and forecasting.
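
As a rough illustration of the kind of DataFrame transformation the tutorial compares, here is a minimal sketch in plain pandas, the single-machine side of that comparison. The sample data and column names are invented for this example and do not come from the tutorial. Because Koalas implements the pandas API, the same calls would be expected to carry over after switching to `import databricks.koalas as ks` and building the frame with `ks.DataFrame`, though that variant needs a Spark environment and is not exercised here.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
pdf = pd.DataFrame(
    {
        "year_month": ["2020-01", "2020-01", "2020-02", "2020-02"],
        "amount": [10.0, 30.0, 20.0, 40.0],
    }
)

# Convert the year-month strings to datetimes.
pdf["month"] = pd.to_datetime(pdf["year_month"], format="%Y-%m")

# Aggregate per month: total and average of the amounts.
monthly = pdf.groupby("month")["amount"].agg(["sum", "mean"]).sort_index()

print(monthly)
```

The sketch touches three syllabus topics (converting year-month values to datetimes, monthly totals, and monthly averages) without claiming to reproduce the tutorial's own notebook.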

Syllabus

Introduction
Agenda
Data Scientist Journey
Pandas Origins
Spark vs Pandas
Koalas
Pandas vs Spark
Koalas vs Spark
Internal Frame
Indexing
Default Index Types
SQL query
Values
Data Visualization
Class
Plotting
Analysis
Date Range
Histogram
Outliers
Merge
Sort Values
Plot Koalas Numbers
Convert Year Month to DateTime
Plot Monthly Totals
Plot Monthly Average
Paralysis and Model Training
Filters
MLflow
Import MLflow
pandas DataFrame
Query runs
Forecast
Roadmap

Taught by

Databricks
