Overview
Syllabus
Intro
dat is an open source tool for sharing and collaborating on data
analogy time: let's talk about source control
life before git
1. somehow get a zip of cool-project
2. unpack and edit a file
3. email the file back
4. ????
maintainer creates new zip of cool-project that might contain my fix
claim: currently data sharing is a mess
email csv files
database dumps in git
we want to do for data what git did for source code
npm install -g dat
max, import your genome into dat
data is stored locally in leveldb; blobs are stored in blob-stores
choose the blob store that fits your use case: s3, local-fs
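The split between row storage and a pluggable blob store can be sketched roughly as below. This is only an illustration: `MemoryBlobStore`, `Dataset`, `putRow`, and `readAttachment` are hypothetical names, not dat's API, and keying blobs by their SHA-256 hash is an assumption made to keep the example small.

```javascript
const crypto = require('crypto');

// Hypothetical blob store: anything with put(data) -> key and get(key) -> Buffer.
// An s3 or local-fs store would expose the same two methods.
class MemoryBlobStore {
  constructor() {
    this.blobs = new Map();
  }

  // Store the bytes and return a key derived from their content (assumption).
  put(data) {
    const buf = Buffer.from(data);
    const key = crypto.createHash('sha256').update(buf).digest('hex');
    this.blobs.set(key, buf);
    return key;
  }

  get(key) {
    if (!this.blobs.has(key)) throw new Error('blob not found: ' + key);
    return this.blobs.get(key);
  }
}

// Hypothetical dataset: small rows live in one map (standing in for leveldb),
// large attachments go to whichever blob store was passed in.
class Dataset {
  constructor(blobStore) {
    this.rows = new Map();
    this.blobStore = blobStore;
  }

  putRow(id, row, attachment) {
    const record = { ...row };
    if (attachment !== undefined) {
      record.blobKey = this.blobStore.put(attachment);
    }
    this.rows.set(id, record);
    return record;
  }

  readAttachment(id) {
    const record = this.rows.get(id);
    if (!record || !record.blobKey) throw new Error('no attachment for: ' + id);
    return this.blobStore.get(record.blobKey);
  }
}
```

Because `Dataset` only calls `put` and `get`, swapping the in-memory store for another backend would not change the row-handling code.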
auto schema generation - free REST API - *all* APIs are streaming
a data set we can all relate to
calculate how big npm is using dat
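A size calculation like this one can be done without loading the whole data set into memory. The sketch below assumes the rows arrive as newline-delimited JSON text and that each row carries a numeric field; the helper name `createFieldSummer` and the field name `size` are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper: sum one numeric field across newline-delimited JSON rows,
// accepting the text in arbitrary chunks (a chunk may end in the middle of a row).
function createFieldSummer(field) {
  let buffered = ''; // trailing partial line from the previous chunk
  let total = 0;

  function addLine(line) {
    if (line.trim() === '') return;
    const row = JSON.parse(line);
    // Rows without the field (or with a non-numeric value) are skipped.
    if (typeof row[field] === 'number') total += row[field];
  }

  return {
    // Feed the next chunk of text; only complete lines are parsed.
    write(chunk) {
      buffered += chunk;
      const lines = buffered.split('\n');
      buffered = lines.pop(); // keep the unfinished last line for later
      lines.forEach(addLine);
    },
    // Parse whatever is left and return the total.
    end() {
      addLine(buffered);
      buffered = '';
      return total;
    },
  };
}
```

To use it as a command-line filter, one option is to call `write` from a `data` handler on `process.stdin` (after `setEncoding('utf8')`) and print the result of `end()` in the `end` handler.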
dat cat | transform
dat cat | docker run -i transform
transform the npm data using bulk-markdown-to-png
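A transform step in a pipeline like this is just a program that reads rows on stdin and writes rows on stdout. The sketch below is a minimal example of that shape, assuming newline-delimited JSON on both sides; `transformRow`, `run`, and the `name` and `readme` fields are hypothetical and stand in for whatever the real transform does.

```javascript
const readline = require('readline');

// Hypothetical per-row transform: keep the name and record the README length.
function transformRow(row) {
  return { name: row.name, readmeLength: (row.readme || '').length };
}

// Read newline-delimited JSON from `input`, write transformed rows to `output`.
function run(input, output) {
  const lines = readline.createInterface({ input, crlfDelay: Infinity });
  lines.on('line', (line) => {
    if (line.trim() === '') return; // ignore blank lines
    output.write(JSON.stringify(transformRow(JSON.parse(line))) + '\n');
  });
}

// To use this file as a filter, call: run(process.stdin, process.stdout);
```

Because the program only touches stdin and stdout, it behaves the same whether it runs directly or inside a container started with `docker run -i`.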
use case: trillian astronomical
1. full sky scans
2. detect objects
problems: huge files, weird format
1 TB of gzipped CSVs
600 million objects, 300 columns
40 TB of imagery
data pipelines, dependency management, data streaming
gasket is a cross-platform pipeline manager
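To make "pipeline manager" concrete, here is a rough sketch of what running a declared pipeline involves. It is not gasket's API or config format: `runPipeline` is a hypothetical helper, and unlike a real streaming pipeline it buffers each step's full output before starting the next step.

```javascript
const { execSync } = require('child_process');

// Hypothetical helper (not gasket's API): run shell commands in order,
// feeding each command's stdout to the next command's stdin.
function runPipeline(commands) {
  let data = ''; // the first command starts with empty input
  for (const command of commands) {
    // execSync throws if the command exits with a non-zero status,
    // which stops the pipeline at the failing step.
    data = execSync(command, { input: data, encoding: 'utf8' });
  }
  return data; // stdout of the last command
}
```

For example, `runPipeline(["printf 'b\\na\\n'", 'sort'])` runs two steps on a POSIX shell and returns the sorted lines.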
datscript is an experimental pipeline config language
the future
branches, dat checkout 3b2d98V3, multi master replication, sync to databases, registry
Taught by
JSConf