Class Central Classrooms (beta)
YouTube videos curated by Class Central.
Versioning, Syncing & Streaming Large Datasets Using DAT + Node

Classroom Contents
- 1 Intro
- 2 dat is an open source tool for sharing and collaborating on data
- 3 analogy time: let's talk about source control
- 4 life before git
- 5 1. somehow get a zip of cool-project 2. unpack and edit a file 3. email the file back 4. ????
- 6 maintainer creates new zip of cool-project that might contain my fix
- 7 claim: currently data sharing is a mess
- 8 email csv files
- 9 database dumps in git
- 10 we want to do for data what git did for source code
- 11 npm install -g dat (install and import sketch after this list)
- 12 max, import your genome into dat
- 13 data is stored locally in leveldb; blobs are stored in blob-stores
- 14 choose the blob store that fits your use case: s3, local-fs
- 15 auto schema generation, free REST API, *all* APIs are streaming (REST sketch after this list)
- 16 a data set we can all relate to
- 17 calculate how big npm is using dat
- 18 dat cat | transform (pipeline sketch after this list)
- 19 dat cat | docker run -i transform
- 20 transform the npm data using bulk-markdown-to-png
- 21 use case: trillian astronomical
- 22 1. full sky scans 2. detect objects
- 23 problems: huge files, weird format
- 24 1TB gzipped CSVs, 600 million objects, 300 columns, 40TB imagery
- 25 data pipelines, dependency management, data streaming
- 26 gasket is a cross-platform pipeline manager (config sketch after this list)
- 27 datscript is an experimental pipeline config language
- 28 the future
- 29 branches, dat checkout 3b2d98V3, multi-master replication, sync to databases, registry
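
The items above are slide titles from the talk; the sketches below flesh out the commands they refer to. First, the install and import flow from slides 11–14. This is a minimal sketch assuming the 2014 beta-era dat CLI; subcommand names and flags may have changed in later releases, and `genome.csv` is a placeholder filename, not a file from the talk.

```sh
# Install the dat CLI globally (slide 11).
npm install -g dat

# Create a new dat store in an empty directory. Tabular rows live in
# leveldb; large binary attachments go to a pluggable blob store
# (slide 13), e.g. modules in the abstract-blob-store family such as
# fs-blob-store or s3-blob-store (slide 14).
mkdir genome-dat && cd genome-dat
dat init

# Import tabular data, e.g. Max's genome from slide 12. The --csv flag
# and the filename are assumptions for illustration.
dat import --csv < genome.csv
```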
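Slide 15's claim (auto schema generation, a free REST API, streaming everywhere) can be poked at over HTTP once the dat is being served. The subcommand, port, and path below are assumptions for illustration rather than confirmed endpoints of the beta CLI.

```sh
# Serve the dat over HTTP (assumed subcommand and default port).
dat listen &

# Stream rows back out; "/api/rows" is a placeholder path for the
# generated REST API. The response streams rather than buffering the
# whole dataset in memory.
curl http://localhost:6461/api/rows
```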
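Slides 17–20 build the "how big is npm" demo out of Unix-style pipelines over dat's streaming row output. A hedged sketch: `transform` stands in for any stdin-to-stdout program, the `wc -c` line is one illustrative way to total things up rather than the talk's exact command, and bulk-markdown-to-png is assumed to follow the same stdin/stdout pattern.

```sh
# Stream every row out of the dat as newline-delimited JSON and run it
# through a transform program (slide 18).
dat cat | transform

# Same pipeline, but the transform runs inside a Docker container that
# reads stdin and writes stdout (slide 19).
dat cat | docker run -i transform

# One rough way to answer "how big is npm?": total the bytes streaming
# out of the registry dat (illustrative only).
dat cat | wc -c

# Slide 20: render each package readme to an image, assuming
# bulk-markdown-to-png consumes the row stream on stdin.
dat cat | bulk-markdown-to-png
```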
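Slide 26's gasket manages pipelines like the ones above so they can be declared once and re-run anywhere. The sketch assumes pipelines are named entries in package.json and are run with `gasket run <name>`; the exact config schema is an assumption from memory, so treat it as illustrative.

```sh
# Assumed config shape: a "gasket" key mapping pipeline names to arrays
# of commands, where each command's stdout feeds the next one's stdin.
cat > package.json <<'EOF'
{
  "name": "npm-size-pipeline",
  "gasket": {
    "size": [
      "dat cat",
      "wc -c"
    ]
  }
}
EOF

# Run the named pipeline (assumed invocation).
gasket run size
```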