Perpetual decentralized management of digital objects for collaborative open science

Michael Hanke

Institute of Neuroscience and Medicine, Brain & Behavior (INM-7), Research Center Jülich
Institute of Systems Neuroscience, Medical Faculty, Heinrich Heine University Düsseldorf
http://psychoinformatics.de

“The task of neural science is to explain behavior in terms of the activities of the brain.”
-- Eric Kandel, Principles of Neural Science.

Source: 20th Century Fox

Inter-individual variability? Non-compliance? Noise?

Three individual brains in a reference space defined by brain structure (e.g., MNI)
"diagnostic" voxels for distinguishing perception of tools and dwellings

Is brain structure alone an optimal reference
for inter-individual analysis of brain function?

Mitchell et al., PLoS ONE, 2008

Common alignment algorithms

Compute a transformation of
2D/3D space
based on a single feature (e.g. image gray value),
or a low-dimensional feature vector (e.g. sulcal depth, curvature, myelin)

from Yeo, et al., Journal of Neurophysiology, 2011 and Robinson et al., NeuroImage, 2014

Are we comparing the right things?

from https://stnava.github.io/ANTsTalk

Functional hyperalignment

Compute a transformation of
a high-dimensional (representational) space
based on a high-dimensional feature vector, such as the functional response to watching a movie (>1000 time points)

Haxby, Guntupalli, Connolly, Halchenko, Conroy, Gobbini, Hanke & Ramadge (2011) A common high-dimensional model of the representational space in human ventral temporal cortex. Neuron, 72, 404-416.

Predictive modeling of brain organization

Guntupalli, Hanke, Halchenko, Connolly, Ramadge & Haxby (2016). A model of representational spaces in human cortex. Cerebral Cortex, 26, 2919-2934. (suppl.)

Toy example

Four individual brains performing the same task
(time-locked activity assumed)

Four voxels, 4D representational space

Voxels selected independently for each brain
— no structural constraints necessary, albeit often desired

Hyperalignment: Normalization

Independent Z-scoring of time series (centered data)
only deviation from mean is considered
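
A minimal sketch of this normalization step, using only the Python standard library. The function name `zscore` and the sample values are illustrative assumptions, not taken from the original analysis code.

```python
from statistics import fmean, pstdev


def zscore(series):
    """Center a time series on zero and scale it to unit variance."""
    mean = fmean(series)
    sd = pstdev(series, mu=mean)
    if sd == 0:
        raise ValueError("a constant time series cannot be z-scored")
    return [(value - mean) / sd for value in series]


# Hypothetical responses of one voxel over six time points
voxel = [101.0, 99.0, 104.0, 96.0, 100.0, 100.0]
normalized = zscore(voxel)
print([round(v, 3) for v in normalized])
```

Each voxel is normalized on its own, so differences in baseline signal or amplitude between voxels and subjects do not enter the later alignment step.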

Hyperalignment: Toy example

Pick a reference subject/space:
first acquired, random, least motion, most variance, largest brain, ...

Hyperalignment: Toy example

Compute orthonormal transformation that minimizes distance in 4D between subjects; average of transformed data defines response profiles in common space
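
The pairwise alignment step can be sketched as an orthogonal Procrustes problem solved with a singular value decomposition. This is a simplified illustration under stated assumptions: only two simulated subjects, NumPy available, and hypothetical names (`procrustes_rotation`, `hidden`, `subject`); it is not the full iterative procedure of the cited papers.

```python
import numpy as np


def procrustes_rotation(source, reference):
    """Return the orthonormal R that minimizes ||source @ R - reference||.

    Both inputs are (time points x voxels) arrays of z-scored data.
    """
    u, _, vt = np.linalg.svd(source.T @ reference)
    return u @ vt


rng = np.random.default_rng(0)
reference = rng.standard_normal((200, 4))               # simulated reference subject
hidden, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # unknown orthonormal mixing
subject = reference @ hidden.T                          # same responses, other voxel axes

rotation = procrustes_rotation(subject, reference)
aligned = subject @ rotation
common = (aligned + reference) / 2                      # average defines the common space

print(np.allclose(rotation.T @ rotation, np.eye(4)))
print(np.allclose(aligned, reference))
```

In this noise-free simulation the rotation recovers the hidden mixing exactly; with real data the alignment is only a best fit in the least-squares sense.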

Hyperalignment: Toy example

Individual transformation matrices align a subject's voxel space with the common space; can be applied to other data (see later slide for requirements)

Hyperalignment: Toy example

Orthonormal transformation is reversible: easily project data from common space into an individual voxel space
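
Reversibility follows from the fact that the inverse of an orthonormal matrix is its transpose. The sketch below demonstrates the round trip with a hand-built 4x4 matrix; `block_rotation` and the data values are hypothetical and stand in for a transformation estimated from real data.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def block_rotation(alpha, beta):
    """Build a 4x4 orthonormal matrix from two independent plane rotations."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    return [
        [ca, -sa, 0.0, 0.0],
        [sa, ca, 0.0, 0.0],
        [0.0, 0.0, cb, -sb],
        [0.0, 0.0, sb, cb],
    ]


rotation = block_rotation(0.3, 1.1)          # voxel space -> common space
voxel_data = [[1.0, -0.5, 0.2, 2.0],         # rows: time points, columns: voxels
              [0.0, 1.5, -1.0, 0.3]]

common = matmul(voxel_data, rotation)                # forward projection
recovered = matmul(common, transpose(rotation))      # back-projection via the transpose

print(all(abs(r - v) < 1e-12
          for rec_row, vox_row in zip(recovered, voxel_data)
          for r, v in zip(rec_row, vox_row)))
```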

Common reference of brain function

Common pattern of involvement of brain networks in particular brain functions in real-life cognition.

Reconceptualization of inter-individual differences.

Potential to facilitate reliable clinical diagnostics.

Psychological approach

Record data from lots of sensors/questionnaires
Determine key markers
Acquire normative samples
Describe individual sample relative to the norm

Guesstimate magnitude of complexity

Too big, too risky, too expensive — for an individual lab/center

from Swaroop Guntupalli (unpublished feasibility study)

Role model for community potential

Concept: Give interested parties something to work on using their own resources, and reintegrate their contributions for another cycle

Open, high-quality, well-described "naturalistic" data

Hanke, Baumgartner, Ibe, Kaule, Pollmann, Speck, Zinke, & Stadler (2014) A high-resolution 7-Tesla fMRI dataset from complex natural stimulation with an audio movie. Scientific Data, 1:140003. http://www.nature.com/articles/sdata20143

Resources and results towards building a functional brain atlas

studyforrest.org

open data resource
versatile structural imaging data
10+ hours of fMRI per subject, various paradigms
simultaneous physio data, eyetracking, auxiliary datasets
versatile movie stimulus descriptions (every spoken word (grammar, semantics); music played; emotions; body contact; eye movements, saccade targets, fixations; visible facial features; semantic conflict, space/time discontinuities)
fantastic 3rd-party data hidden elsewhere
DuPre, Hanke & Poline (2019). Nature abhors a paywall: How open science can realize the potential of naturalistic stimuli. PsyArXiv.

Interim conclusion after five years

Was it worth being open? ABSOLUTELY!

18 additional, independent, published studies use these data (virtually all of them would not have been attempted by our lab)
not a single "scoop"
substantial boost in return on investment for the taxpayer
inspired similar work by others

Did we make the most out of it? ABSOLUTELY NOT!

dozens of promises to contribute original data, none fulfilled yet
starting point for users today is practically identical to 4 years ago

Why is the open-science magic so weak?

Lessons from open-science

Isolated efforts are futile

Reporting standards: Nichols, Das, Eickhoff, Evans, Glatard, Hanke, Kriegeskorte, Milham, Poldrack, Poline, Proal, Thirion, Van Essen, White, Yeo (2017). Best Practices in Data Analysis and Sharing in Neuroimaging using MRI. Nature Neuroscience. http://www.humanbrainmapping.org/cobidas
Standard data structures: Gorgolewski, Auer, Calhoun, Craddock, Duff, Flandin, Ghosh, Halchenko, Handwerker, Hanke, Keator, Li, Maumet, Michael, Nichols, Nichols, Poline, Rokem, Schaefer, Sochat, Turner, Varoquaux, Poldrack (2016). The Brain Imaging Data Structure: a protocol for standardizing and describing outputs of neuroimaging experiments. Scientific Data. http://bids.neuroimaging.io
Code review/release necessity: Eglen, Marwick, Halchenko, Hanke, Sufi, Gleeson, Silver, Davison, Lanyon, Abrams, Wachtler, Willshaw, Pouzat, Poline (2017). Towards standard practices for sharing computer code and programs in neuroscience. Nature Neuroscience.
Open-science strategy: Halchenko & Hanke (2015). Four aspects to make science open “by design” and not as an after-thought. GigaScience.

Make your scientific output...

Findable
Accessible
Interoperable
Reusable

https://www.go-fair.org/fair-principles

FAIR principles

F1: (Meta)data are assigned a globally unique and persistent identifier
F2: Data are described with rich metadata
F3: Metadata clearly and explicitly include the identifier of the data they describe
F4: (Meta)data are registered or indexed in a searchable resource

A1: (Meta)data are retrievable by their identifier using a standardised ... protocol
A1.1: The protocol is open, free, and universally implementable
A1.2: The protocol allows for an authentication and authorisation procedure
A2: Metadata are accessible, even when the data are no longer available

I1: (Meta)data use a formal, accessible ... language for knowledge representation.
I2: (Meta)data use vocabularies that follow FAIR principles
I3: (Meta)data include qualified references to other (meta)data

R1: (Meta)data are richly described with a plurality of accurate and relevant attributes
R1.1: (Meta)data are released with a clear and accessible data usage license
R1.2: (Meta)data are associated with detailed provenance
R1.3: (Meta)data meet domain-relevant community standards

https://www.go-fair.org/fair-principles

An open-science project is never really finished

The utility of a resource declines in the absence of continued investment.

FAIR today is not FAIR forever.

what worked yesterday will eventually need updating to remain useful (especially analysis code)
data can be "broken" too!
sticking to "old" standards will ultimately make things special, and too expensive to work with

DataLad principles

There are only two things in the world: datasets and files.
A dataset is a Git repository.
A dataset can have an optional annex for (large) file content tracking (transport to and from the annex managed with Git-annex, https://git-annex.branchable.com).

Minimization of custom procedures and data structures: Users must not lose data or data access if DataLad were to vanish.
Complete decentralization, no required central server or service.
Maximize use of existing 3rd-party infrastructure.

Install an existing dataset

request via standard URL,
(each dataset has a UUID, and each dataset location another UUID)

$ datalad install http://example.com/ds1

Obtain dataset content

request via user-friendly local file path, not internal ID,
regardless of the properties of the actual remote storage solution

ds1/ $ datalad get file2

Tracking "remote" data evolution

ability to track any number of dataset "siblings",
in Git or non-Git data stores

ds1/ $ datalad update

Keep up-to-date

apply changes from default or selected sibling
while maintaining local data availability status

ds1/ $ datalad update --merge --reobtain-data

Dataset linkage

$ datalad install --dataset . --source http://example.com/importantds inputs/rawdata

$ git diff HEAD~1
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..c3370ba
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "inputs/rawdata"]
+       path = inputs/rawdata
+       url = http://example.com/importantds
diff --git a/inputs/rawdata b/inputs/rawdata
new file mode 160000
index 0000000..fabf852
--- /dev/null
+++ b/inputs/rawdata
@@ -0,0 +1 @@
+Subproject commit fabf8521130a13986bd6493cb33a70e580ce8572
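
Because the linkage is recorded as a standard Git submodule entry, it can be read with generic tools and without DataLad. The following sketch parses the `.gitmodules` content shown above using Python's `configparser`; the helper name `list_subdatasets` is an assumption made for illustration.

```python
import configparser

GITMODULES = '''
[submodule "inputs/rawdata"]
        path = inputs/rawdata
        url = http://example.com/importantds
'''


def list_subdatasets(gitmodules_text):
    """Map each registered subdataset path to its source URL."""
    # Strip indentation so that no line is mistaken for a continuation line
    cleaned = "\n".join(line.strip() for line in gitmodules_text.splitlines())
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cleaned)
    return {
        parser[section]["path"]: parser[section]["url"]
        for section in parser.sections()
        if section.startswith("submodule ")
    }


print(list_subdatasets(GITMODULES))
```

The exact version of the linked dataset is not stored in `.gitmodules` but in the commit recorded for `inputs/rawdata`, as the second half of the diff shows.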

Combine modular dataset units

"actionable" links to subdatasets/files, seamless handling of dataset trees, each dataset can be individually managed by a different curator

"Complete" provenance capture

- for any local command

$ datalad run -m "Perform eye movement event detection" \
  --input 'inputs/raw_eyegaze/sub-*/beh/sub-*...tsv.gz' \
  --output 'sub-*' \
  bash code/compute_all.sh

- for any containerized app (can be tracked in the dataset too)

$ datalad containers-run -n nilearn \
  --input 'inputs/mri_aligned/sub-*/in_bold3Tp2/sub-*_task-avmovie_run-*_bold*' \
  --output 'sub-*/LC_timeseries_run-*.csv' \
  "bash -c 'for sub in sub-*; do for run in run-1 ... run-8;
     do python3 code/extract_lc_timeseries.py \$sub \$run; done; done'"

Complete capture of any input data, computational environment, code, parameters, and outputs possible — without sacrificing modularity
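
To illustrate the idea behind provenance capture, the sketch below runs a command and records which inputs it read and which outputs it produced, each identified by a checksum. This is a conceptual toy and not DataLad's implementation or record format; `run_with_record`, the file names, and the record layout are all hypothetical.

```python
import hashlib
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def sha256(path):
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_with_record(cmd, inputs, outputs, cwd):
    """Run cmd in cwd and return a provenance record as a dict."""
    cwd = Path(cwd)
    record = {
        "cmd": cmd,
        "inputs": {name: sha256(cwd / name) for name in inputs},
    }
    completed = subprocess.run(cmd, cwd=cwd, check=True)
    record["exit"] = completed.returncode
    record["outputs"] = {name: sha256(cwd / name) for name in outputs}
    return record


with tempfile.TemporaryDirectory() as workdir:
    Path(workdir, "raw.txt").write_text("3\n1\n2\n")
    # Hypothetical analysis step: sort the lines of raw.txt into sorted.txt
    code = ("from pathlib import Path; "
            "Path('sorted.txt').write_text("
            "''.join(sorted(Path('raw.txt').read_text().splitlines(True))))")
    record = run_with_record([sys.executable, "-c", code],
                             inputs=["raw.txt"], outputs=["sorted.txt"],
                             cwd=workdir)
    print(json.dumps(record, indent=2))
```

A record of this kind is machine-readable, so the same command can later be re-executed and its outputs compared against the stored checksums.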

Enables ENIGMA-style computing: analyze data that you don't have!

Scalable and actionable (meta)data representations

(meta)data logistics solution with built-in provenance capture
automatic metadata updates to track evolution of standards
integrates well with storage/hosting technology
facilitates building metadata-driven applications
viable system to bring computation to data

Metadata-based search for individual files

across datasets, without a DB (server)

$ datalad \
  -c datalad.search.index-egrep-documenttype=files \
  -f json_pp \
  search \
    bids.subject.sex:female \
    bids.type:t1 \
    bids.subject.age:24
{
  "dsid": "4842e188-7df5-11e6-8e6b-002590f97d84",
  "metadata": {
    "@context": {...},
    "bids": {...},
    "datalad_core": {
      "url": [
        "http://openneuro.s3.amazonaws.com/ds000008/ds000008_R1.1.0/...MZ92g",
        "http://openneuro.s3.amazonaws.com/ds000008/ds000008_R1.1.1/...UyanK",
        "http://openneuro.s3.amazonaws.com/ds000008/ds000008_R2.0.0/..._flBz"
      ]
    },
    "nifti1": {...}
  },
  "parentds": "/tmp/mega/openfmri/ds000008",
  "path": "/tmp/mega/openfmri/ds000008/sub-15/anat/sub-15_T1w.nii.gz",
  "query_matched": {
    "bids.subject.age(years)": "24",
    "bids.subject.sex": "female",
    "bids.type": "T1"
  },
  "refcommit": "b18692ef1beefd88055bc0578b7567a8f4fdf8f9",
  "type": "file"
}
...

alternative output formats: JSON stream, custom, ...
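
The principle of a database-free search can be sketched as a filter over per-file metadata records that all `key:value` queries must match. The functions `flatten` and `search`, the sample records, and the case-insensitive comparison are assumptions made for illustration; they do not reproduce how `datalad search` is implemented.

```python
def flatten(metadata, prefix=""):
    """Turn nested dicts into dotted keys, e.g. 'bids.subject.sex'."""
    flat = {}
    for key, value in metadata.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{dotted}."))
        else:
            flat[dotted] = value
    return flat


def search(records, *queries):
    """Return the records whose metadata matches every 'key:value' query."""
    wanted = [query.split(":", 1) for query in queries]
    hits = []
    for record in records:
        flat = flatten(record["metadata"])
        # Assumption: values are compared as case-insensitive strings
        if all(str(flat.get(key, "")).lower() == value.lower()
               for key, value in wanted):
            hits.append(record)
    return hits


# Hypothetical per-file metadata records
records = [
    {"path": "sub-15/anat/sub-15_T1w.nii.gz",
     "metadata": {"bids": {"type": "T1",
                           "subject": {"sex": "female", "age": "24"}}}},
    {"path": "sub-02/anat/sub-02_T1w.nii.gz",
     "metadata": {"bids": {"type": "T1",
                           "subject": {"sex": "male", "age": "31"}}}},
]

for hit in search(records, "bids.subject.sex:female", "bids.type:t1",
                  "bids.subject.age:24"):
    print(hit["path"])
```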

Publish

Supports a variety of consumer storage solutions (SSH servers, GIN, Dropbox, Box.com, Google, WebDAV, BitTorrent, IPFS, ...) via git-annex
Built-in support for strong data encryption
Per-target configuration of accepted content, with configurable permissions and authorization mechanisms
Export of dataset to FigShare and similar storage solutions
Multiple redundant synchronized publication targets are supported (seemingly "publish 2TB on GitHub")

Datasets

are lightweight (typically <<10MB, even when tracking TBs)
can be attached to a traditional paper to enable direct access to original data, analysis code, computational environments and results
have machine-readable metadata attached
support redundant storage
insure utility against failure of career, institutions, publishers

Extend DataLad

Separate Python packages, anyone can develop their own
https://github.com/datalad/datalad-extension-template
Means for tailored solutions with narrower scope or specific audiences
Extensions can provide additional commands, procedures, metadata extractors, webapps
Available extensions
- containers: support for containerized computational environments
- crawler: track web resources in automated data distributions
- neuroimaging: neuroimaging research data and workflow
- hirni: imaging raw data management/entry, automatic BIDS-conversion
- htcondor: cluster/cloud/grid-based remote code execution
- webapp: REST API for querying/manipulating datasets

http://docs.datalad.org/en/latest/customization.html#extension-packages

Too much?

All this is possible, but not necessary!

You just need to use two commands:

# 1. start something
myproject/ % datalad create
# 2. do something
# ...
# 3. save state
myproject/ % datalad save
# go to (2)

The result is a dataset that
- captures the full history of a project (all data, all code, all changes ever made to them)
- is compatible with everything that was shown previously in this talk