
Versioning, Syncing & Streaming Large Datasets Using DAT + Node

JSConf via YouTube

Overview

Explore versioning, syncing, and streaming large datasets using DAT and Node.js in this JSConf.Asia 2014 conference talk. Dive into the open-source DAT tool, funded by the Sloan Foundation, designed to facilitate collaboration on datasets of any size. Learn how DAT aims to streamline work with large scientific datasets, enhancing automation and reproducibility. Discover the tool's streaming dataset versioning and replication system, built with a Unix philosophy for modularity and third-party application support. Gain insights into using Node.js and Docker for creating cross-platform data pipelines. Understand how to leverage Node and LevelDB for managing extensive datasets. The speaker, Max Ogden, an open-source software developer working on the DAT project at the United States Open Data Institute, shares his expertise and demonstrates practical examples throughout the 43-minute presentation.

Syllabus

Intro
dat is an open source tool for sharing and collaborating on data
analogy time: let's talk about source control
life before git
1. somehow get a zip of cool-project 2. unpack and edit a file 3. email the file back 4. ????
maintainer creates new zip of cool-project that might contain my fix
claim: currently data sharing is a mess
email csv files
database dumps in git
we want to do for data what git did for source code
npm install -g dat
max, import your genome into dat
data is stored locally in LevelDB; blobs are stored in blob stores
choose the blob store that fits your use case: S3, local-fs
auto schema generation - free REST API - *all* APIs are streaming
a dataset we can all relate to
calculate how big npm is using dat
dat cat | transform
dat cat | docker run -i transform
transform the npm data using bulk-markdown-to-png
use case: trillian astronomical
1. full sky scans 2. detect objects
problems: huge files, weird format
1TB gzipped CSVs, 600 million objects, 300 columns, 40TB imagery
data pipelines dependency management data streaming
gasket is a cross platform pipeline manager
datscript is an experimental pipeline config language
the future
branches, dat checkout 3b2d98V3, multi-master replication, sync to databases, registry
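The syllabus describes gasket as a cross-platform pipeline manager and datscript as an experimental pipeline config language. As a rough sketch only (the pipeline name and config layout here are assumptions, not taken from the talk), a gasket-style config declaring the npm-data transform as a named pipeline might look like this, with each command's stdout piped into the next command's stdin:

```json
{
  "npm-transform": [
    "dat cat",
    "docker run -i transform"
  ]
}
```

Declaring the pipeline in a config file rather than a shell one-liner is what makes it cross-platform: the pipeline manager, not the shell, wires the processes together.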

Taught by

JSConf
