Sharing data like source code

a distributed approach to data management

Michael Hanke, Yaroslav Halchenko

Psychoinformatics lab, Institute of Psychology II, University of Magdeburg
Center for Behavioral Brain Sciences, Magdeburg
Dept. of Psychological and Brain Sciences
Center for Cognitive Neuroscience, Dartmouth College

Why do we share data?

  • Because it is the right thing to do
  • Because we promised we would
  • Because we hope to get something in return

With whom do we share data?

Fictional figure — reality is likely worse

What should data-sharing
technology do for us?
(at the very least)

Make it trivial to consume public data.

Streamline sharing with peers by facilitating bilateral/multilateral exchange.

… and minimize the extra effort to do the right thing® and share

Introducing: DataLad

your humble data management and sharing helper

  • manage your own data locally — with industrial-grade version control
  • collaborate with peers effectively, even on continuously evolving datasets — with proven workflows from open-source software development
  • facilitate public sharing, while minimizing the cost of integrating 3rd-party contributions
  • require no central service or service provider

How does it work?

Part 1: Consume data

Install a dataset handle

$ datalad install-handle DEMOHANDLEURL ds1

Obtain handle content

ds1/ $ datalad get file2
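The split between installing a handle and getting content is the core idea here: `install-handle` transfers only the file listing and metadata, while `get` fetches actual content on demand. A minimal illustrative sketch in Python (not DataLad's implementation; the `REMOTE_STORE` dict stands in for a data provider):

```python
# Illustration of the install/get split: a handle knows *about* all files,
# but file content is only transferred when explicitly requested.

REMOTE_STORE = {  # stands in for a remote data provider (e.g. a web server)
    "file1": b"contents of file 1",
    "file2": b"contents of file 2",
}

class Handle:
    def __init__(self, listing):
        # installing a handle transfers only the file listing, no content
        self.listing = list(listing)
        self.content = {}          # content that is locally available

    def get(self, name):
        # fetch content on demand, like `datalad get file2`
        if name not in self.content:
            self.content[name] = REMOTE_STORE[name]
        return self.content[name]

ds1 = Handle(REMOTE_STORE)         # cheap: names only
assert "file2" in ds1.listing      # we know the file exists...
assert "file2" not in ds1.content  # ...but have not downloaded it yet
data = ds1.get("file2")            # now the content is retrieved
```

This is why installing a handle is trivial even for very large datasets: the expensive transfer is deferred to the files actually needed.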

Part 2: Version control

For data?

  • conversion errors
  • preprocessing updates
  • algorithm changes
  • longitudinal acquisitions
  • proper attribution of contributions
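Each of the situations above is exactly what version control captures: every state of the dataset becomes a recorded, comparable, attributable snapshot. A toy sketch of the idea — content checksums identifying file states, with named revisions like git tags (illustrative only, not DataLad code; file names are made up):

```python
import hashlib

def snapshot(files):
    """Record a dataset state as a mapping of file name -> content checksum."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in files.items()}

def changed(old, new):
    """Names of files whose content differs between two recorded states."""
    return sorted(n for n in new if old.get(n) != new[n])

# two states of an evolving dataset, e.g. before/after a preprocessing update
v1 = snapshot({"sub01.nii": b"raw scan", "events.tsv": b"onset\n1.0"})
v2 = snapshot({"sub01.nii": b"motion-corrected scan",
               "events.tsv": b"onset\n1.0",
               "mask.nii": b"brain mask"})

versions = {"v1": v1, "v2": v2}    # named revisions, like git tags
assert changed(v1, v2) == ["mask.nii", "sub01.nii"]
assert versions["v1"] == v1        # "the exact data I used for this submission"
```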

Tracking data evolution

Fetch updates

ds1/ $ datalad update-handle

Upgrade dataset

ds1/ $ datalad upgrade-handle

Obtain new content

ds1/ $ datalad get file2

Obtain new content

ds1/ $ datalad upgrade-handle --upgrade-data

Multi-revision data

“I need the exact same data I used for this submission”
ds1/ $ datalad checkout v1

Part 3: Multi-dataset management

Search for handles based on meta data


$ datalad search-handle "made with love"
DB/ds1
DB/ds3
# SPARQL

Obtain a new handle

~ $ datalad install-handle DB/ds1 projectX/ds1

Show where all the known handles are


$ datalad whereis ds1
/home/me/projectX/ds1

Dataset collections

Part 4: Collections

Curated (meta-)data for dataset handles

Multiple data sources

Register

$ datalad register-collection URL
$ datalad register-collection URL

Query


$ datalad search-handle "made with love"
OWN/ds1
LAB/ds2
BIG/ds2
BIG/ds4

Query


$ datalad search-collections fMRI
OWN
LAB

Part 5: Collaboration

Create handle

~ $ datalad create-handle prjX/pilot

Specify basic meta data

~/prjX/pilot $ datalad describe \
    --author "Dr. Ex Ample" --author-orcid "..." \
    --author-email ex.ample@example.com \
    --description README.txt --license CC0 \
    ...

Populate with data

~/prjX/pilot $ git annex add file1 file2

Publish

~/prjX/pilot $ datalad publish serverA

Peer collaboration workflow

$ datalad register-collection URL
$ datalad install-handle prjX/pilot projectX/pilot
~/projectX/pilot $ datalad get file2
~/projectX/pilot $ # modify/add content
~/projectX/pilot $ datalad publish serverB

Part 6: datalad.org

Middle-man

  • handle auto-generation for public data
  • authorized restricted meta data export
  • actual data provider does not change
  • access control remains with provider
  • accepts user-submitted handles
  • data access remains under user control

datalad.org: like datalad, just bigger

Curate collections from 3rd-party data

~ $ datalad create-collection visualsearch/collection vs_collection
...
~/visualsearch/collection $ datalad describe ... # same as for handle
$ datalad add-handle ds1 vs_collection
$ datalad add-handle ds2 vs_collection
$ datalad publish-collection vs_collection

Inside datalad

  • written in Python
  • MIT-licensed free and open source software
  • on GitHub: https://github.com/datalad
  • Python module and command-line tools

Foundation: git

  • built on and compatible with git
  • all version-control and (distributed) workflow features are supported
  • a datalad collection is a plain git repository filled with meta data
  • use GitHub or any other git server for collaboration

Workhorse: git-annex

  • datalad handles are plain git-annex repositories
  • compatibility with all git-annex backends/methods: Amazon S3, Google Drive, ownCloud, rsync, WebDAV, torrent, ...
  • optional Dropbox-like auto-synchronization
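The trick that makes git workable for large files is content-addressed storage: git tracks only a small pointer keyed by a content checksum, while the bulky content lives in the annex and can be present or absent locally. A schematic Python sketch of that idea (illustrative only; this is not git-annex's on-disk format or key scheme):

```python
import hashlib

annex = {}   # content store: checksum key -> file content

def annex_add(data):
    """Store content under its checksum; return the small pointer git tracks."""
    key = "SHA256-" + hashlib.sha256(data).hexdigest()
    annex[key] = data
    return key

# git only ever sees the tiny key, never the (possibly huge) content
key = annex_add(b"a very large imaging file ...")
assert key.startswith("SHA256-")
assert annex[key] == b"a very large imaging file ..."
# identical content always maps to the same key -> automatic deduplication
assert annex_add(b"a very large imaging file ...") == key
```

Because a key is valid no matter where the content lives, the same pointer can be resolved from S3, a web server, or a peer's drive — which is what the backend list above amounts to.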

Speaks: linked data

  • RDF (turtle) for meta data storage
  • no constraints on terms or ontologies
  • full support of SPARQL for meta data queries
  • meta data homogenization for supported standards
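The metadata model behind the queries shown earlier can be pictured as a set of subject–predicate–object triples that a query pattern-matches against; that is what SPARQL does over RDF. A minimal Python illustration (terms and dataset names are hypothetical; real DataLad metadata is stored as Turtle and queried with an actual SPARQL engine):

```python
# metadata as (subject, predicate, object) triples, as in RDF
triples = {
    ("DB/ds1", "dc:description", "made with love"),
    ("DB/ds1", "dc:license", "CC0"),
    ("DB/ds3", "dc:description", "made with love"),
    ("DB/ds2", "dc:description", "assembled in a hurry"),
}

def match(pattern):
    """Return subjects matching a (predicate, object) pattern — roughly what
    SELECT ?s WHERE { ?s dc:description "made with love" } would do."""
    pred, obj = pattern
    return sorted({s for s, p, o in triples if p == pred and o == obj})

# this is the shape of result that `datalad search-handle` reports
assert match(("dc:description", "made with love")) == ["DB/ds1", "DB/ds3"]
```

Since there are no constraints on terms or ontologies, any community vocabulary can appear in the predicate position and still be queried the same way.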

Coming spring 2016

support for all major platforms

up-to-date collections of handles for openfmri.org, crcns.org (and growing...)

extensible meta-data standard support (e.g. BIDS, NIDM)

Acknowledgements

Joey Hess (git-annex)

Benjamin Poldrack
Jason Gors

INCF task force “Standards for Data Sharing”

http://datalad.org
https://github.com/datalad/datalad