Sharing data like source code

a distributed approach to data management

Michael Hanke, Yaroslav Halchenko

Psychoinformatics lab, Institute of Psychology II, University of Magdeburg
Center for Behavioral Brain Sciences, Magdeburg
Dept. of Psychological and Brain Sciences
Center for Cognitive Neuroscience, Dartmouth College

Why do we share data?

  • Because it is the right thing to do
  • Because we promised we would
  • Because we hope to get something in return

With whom do we share data?

Fictional figure — reality is likely worse

What should data-sharing
technology do for us?
(at the very least)

Make it trivial to consume public data.

Streamline sharing with peers by facilitating bilateral/multilateral exchange.

… and minimize the extra effort to do the right thing® and share

Introducing: DataLad

your humble data management and sharing helper

  • manage your own data locally — with industrial-grade version control
  • collaborate with peers effectively, even on continuously evolving datasets — with proven workflows from open-source software development
  • facilitate public sharing, while minimizing the cost of integrating 3rd-party contributions
  • require no central service or service provider

How does it work?

Part 1: Consume data

Install a dataset handle

$ datalad install-handle DEMOHANDLEURL ds1

Obtain handle content

ds1/ $ datalad get file2
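The split between installing a handle and getting content is the core idea here: `install-handle` transfers only the file listing and metadata, while `get` fetches actual content on demand. A minimal illustrative sketch in Python (not DataLad's implementation; the `REMOTE_STORE` dict stands in for a data provider):

```python
# Illustration of the install/get split: a handle knows *about* all files,
# but file content is only transferred when explicitly requested.

REMOTE_STORE = {  # stands in for a remote data provider (e.g. a web server)
    "file1": b"contents of file 1",
    "file2": b"contents of file 2",
}

class Handle:
    def __init__(self, listing):
        # installing a handle transfers only the file listing, no content
        self.listing = list(listing)
        self.content = {}          # content that is locally available

    def get(self, name):
        # fetch content on demand, like `datalad get file2`
        if name not in self.content:
            self.content[name] = REMOTE_STORE[name]
        return self.content[name]

ds1 = Handle(REMOTE_STORE)         # cheap: names only
assert "file2" in ds1.listing      # we know the file exists...
assert "file2" not in ds1.content  # ...but have not downloaded it yet
data = ds1.get("file2")            # now the content is retrieved
```

This is why installing a handle is trivial even for very large datasets: the expensive transfer is deferred to the files actually needed.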

Part 2: Version control

For data?

  • conversion errors
  • preprocessing updates
  • algorithm changes
  • longitudinal acquisitions
  • proper attribution of contributions
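Each of the situations above is exactly what version control captures: every state of the dataset becomes a recorded, comparable, attributable snapshot. A toy sketch of the idea — content checksums identifying file states, with named revisions like git tags (illustrative only, not DataLad code; file names are made up):

```python
import hashlib

def snapshot(files):
    """Record a dataset state as a mapping of file name -> content checksum."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in files.items()}

def changed(old, new):
    """Names of files whose content differs between two recorded states."""
    return sorted(n for n in new if old.get(n) != new[n])

# two states of an evolving dataset, e.g. before/after a preprocessing update
v1 = snapshot({"sub01.nii": b"raw scan", "events.tsv": b"onset\n1.0"})
v2 = snapshot({"sub01.nii": b"motion-corrected scan",
               "events.tsv": b"onset\n1.0",
               "mask.nii": b"brain mask"})

versions = {"v1": v1, "v2": v2}    # named revisions, like git tags
assert changed(v1, v2) == ["mask.nii", "sub01.nii"]
assert versions["v1"] == v1        # "the exact data I used for this submission"
```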

Tracking data evolution

Fetch updates

ds1/ $ datalad update-handle

Upgrade dataset

ds1/ $ datalad upgrade-handle

Obtain new content

ds1/ $ datalad get file2

Obtain new content

ds1/ $ datalad upgrade-handle --upgrade-data

Multi-revision data

“I need the exact same data I used for this submission”
ds1/ $ datalad checkout v1

Part 3: Multi-dataset management

Search for handles based on meta data


$ datalad search-handle "made with love"
DB/ds1
DB/ds3
# SPARQL

Obtain a new handle

~ $ datalad install-handle DB/ds1 projectX/ds1

Show where all the known handles are


$ datalad whereis ds1
/home/me/projectX/ds1

Dataset collections

Part 4: Collections

Curated (meta-)data for dataset handles

Multiple data sources

Register

$ datalad register-collection URL
$ datalad register-collection URL

Query


$ datalad search-handle "made with love"
OWN/ds1
LAB/ds2
BIG/ds2
BIG/ds4

Query


$ datalad search-collections fMRI
OWN
LAB

Part 5: Collaboration

Create handle

~ $ datalad create-handle prjX/pilot

Specify basic meta data

~/prjX/pilot $ datalad describe \
    --author "Dr. Ex Ample" --author-orcid "..." \
    --author-email ex.ample@example.com \
    --description README.txt --license CC0 \
    ...

Populate with data

~/prjX/pilot $ git annex add file1 file2

Publish

~/prjX/pilot $ datalad publish serverA

Peer collaboration workflow

$ datalad register-collection URL
$ datalad install-handle prjX/pilot projectX/pilot
~/projectX/pilot $ datalad get file2
~/projectX/pilot $ # modify/add content
~/projectX/pilot $ datalad publish serverB

Part 6: datalad.org

Middle-man

  • handle auto-generation for public data
  • authorized restricted meta data export
  • actual data provider does not change
  • access control remains with provider
  • accepts user-submitted handles
  • data access remains under user control

datalad.org: like datalad, just bigger

Curate collections from 3rd-party data

~ $ datalad create-collection visualsearch/collection vs_collection
...
~/visualsearch/collection $ datalad describe ... # same as for handle
$ datalad add-handle ds1 vs_collection
$ datalad add-handle ds2 vs_collection
$ datalad publish-collection vs_collection

Inside datalad

  • written in Python
  • MIT-licensed free and open source software
  • on GitHub: https://github.com/datalad
  • Python module and command-line tools

Foundation: git

  • built on and compatible with git
  • all version-control and (distributed) workflow features are supported
  • a datalad collection is a plain git repository filled with meta data
  • use GitHub or any other git server for collaboration

Workhorse: git-annex

  • datalad handles are plain git-annex repositories
  • compatibility with all git-annex backends/methods: Amazon S3, Google Drive, ownCloud, rsync, WebDAV, torrent, ...
  • optional Dropbox-like auto-synchronization
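The trick that makes git workable for large files is content-addressed storage: git tracks only a small pointer keyed by a content checksum, while the bulky content lives in the annex and can be present or absent locally. A schematic Python sketch of that idea (illustrative only; this is not git-annex's on-disk format or key scheme):

```python
import hashlib

annex = {}   # content store: checksum key -> file content

def annex_add(data):
    """Store content under its checksum; return the small pointer git tracks."""
    key = "SHA256-" + hashlib.sha256(data).hexdigest()
    annex[key] = data
    return key

# git only ever sees the tiny key, never the (possibly huge) content
key = annex_add(b"a very large imaging file ...")
assert key.startswith("SHA256-")
assert annex[key] == b"a very large imaging file ..."
# identical content always maps to the same key -> automatic deduplication
assert annex_add(b"a very large imaging file ...") == key
```

Because a key is valid no matter where the content lives, the same pointer can be resolved from S3, a web server, or a peer's drive — which is what the backend list above amounts to.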

Speaks: linked data

  • RDF (turtle) for meta data storage
  • no constraints on terms or ontologies
  • full support of SPARQL for meta data queries
  • meta data homogenization for supported standards
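The metadata model behind the queries shown earlier can be pictured as a set of subject–predicate–object triples that a query pattern-matches against; that is what SPARQL does over RDF. A minimal Python illustration (terms and dataset names are hypothetical; real DataLad metadata is stored as Turtle and queried with an actual SPARQL engine):

```python
# metadata as (subject, predicate, object) triples, as in RDF
triples = {
    ("DB/ds1", "dc:description", "made with love"),
    ("DB/ds1", "dc:license", "CC0"),
    ("DB/ds3", "dc:description", "made with love"),
    ("DB/ds2", "dc:description", "assembled in a hurry"),
}

def match(pattern):
    """Return subjects matching a (predicate, object) pattern — roughly what
    SELECT ?s WHERE { ?s dc:description "made with love" } would do."""
    pred, obj = pattern
    return sorted({s for s, p, o in triples if p == pred and o == obj})

# this is the shape of result that `datalad search-handle` reports
assert match(("dc:description", "made with love")) == ["DB/ds1", "DB/ds3"]
```

Since there are no constraints on terms or ontologies, any community vocabulary can appear in the predicate position and still be queried the same way.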

Coming spring 2016

support for all major platforms

up-to-date collections of handles for openfmri.org, crcns.org (and growing...)

extensible meta-data standard support (e.g. BIDS, NIDM)

Acknowledgements

Joey Hess (git-annex)

Benjamin Poldrack
Jason Gors

INCF task force “Standards for Data Sharing”

http://datalad.org
https://github.com/datalad/datalad