Register data in the data catalog

The CRC data catalog (https://data.sfb1451.de) showcases datasets that have been acquired or assembled by CRC members, and demonstrates the progress made towards building a versatile, curated, cross-species data resource for motor research.

The CRC data catalog is comparable to a library’s register of books: it does not host data centrally, but it presents all CRC datasets uniformly, regardless of their nature, origin, and location.

Catalog records

Catalog records are essentially a collection of tables (spreadsheets) with metadata on a dataset. The level of detail of a particular record is variable and can range from minimal (largely bibliographic) information to a comprehensive description that enables metadata-driven processing of the underlying data.

Catalog records employ the DataLad tabby format (see https://docs.datalad.org/projects/tabby for the specification). A key advantage of this format is that authors can provide straightforward metadata in spreadsheet form, edited via convenient (online) editors. At the same time, data curators can enrich those metadata to build a precise and detailed knowledge base on all CRC data.

Submitting a record

Catalog records can be submitted in two ways:

email

A manually composed record is emailed to m.szczepanik at fz-juelich.de. This method is adequate for individual datasets or for minimally described datasets.

DataLad-based deposition

A catalog record is generated in a semi-automatic fashion. The record is then deposited in the DataLad (super)dataset grouping the CRC datasets. This method is geared towards projects that generate large numbers of (homogeneous) datasets (e.g., the Z projects).

If you are interested in this method, please follow the instructions in the dedicated section or contact INF by email (m.szczepanik at fz-juelich.de).

Submitted records are processed by INF and included in the CRC superdataset, which is published to GitHub as sfb1451/all_datasets. Direct contributions are also welcome.

Creating a record

A catalog record is ultimately a collection of plain-text files, so it can be created in many ways. Here we describe the manual creation of a record with any of the widely available spreadsheet editors.

Catalog records can be converted to and from the popular XLSX format. An XLSX record can be edited collaboratively online (Sciebo's OnlyOffice, Google Sheets, etc.), or downloaded and edited offline (LibreOffice, MS Office, etc.).

To create a record we provide:

a template in XLSX format: https://fz-juelich.sciebo.de/s/XOzaKNrGboVbJGm
a populated example record: https://fz-juelich.sciebo.de/s/qlrTRVyeyC4Sfdl

A record comprises multiple components that are described in the following sections. Each component takes the form of an individual table (or sheet).

Record components

This section provides a comprehensive overview of all pre-defined record components, and their individual items. Importantly, the present set of components is not fixed. A record can be extended with additional information as considered necessary or useful by individual authors.

Each individual table listed below describes one or more entities that have the same relationship with another entity (and are often also of the same type).

Table dataset (required)

This table contains direct properties of the dataset the catalog record is about.

Property names are given in column 1, and values in column 2 (and possibly in the following columns). Recognized properties are:

name (required)

Identifies the dataset uniquely within the scope of a CRC project, i.e. the respective project must not have two different datasets of the same name. The name should be suitable for a directory/folder name. Spaces and special characters should be avoided.
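The naming advice above can be checked mechanically. The following is a minimal sketch, not an official validation rule: the allowed character set (letters, digits, dot, underscore, hyphen) and the helper name is_safe_dataset_name are assumptions chosen for illustration, as are the example names.

```python
import re

# Assumed rule (not part of any specification): letters, digits, dots,
# underscores, and hyphens only, starting with a letter or digit.
SAFE_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def is_safe_dataset_name(name: str) -> bool:
    """Return True if `name` looks suitable as a directory/folder name."""
    return SAFE_NAME.fullmatch(name) is not None


# Hypothetical example names
print(is_safe_dataset_name("resting-state_fmri"))  # True
print(is_safe_dataset_name("my dataset (final)"))  # False
```

Uniqueness within a CRC project cannot be checked this way; it still has to be verified against the project's existing records.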

title (required)

Title to be displayed on the catalog landing page for the dataset. Language must be English.

description (required)

General description of the dataset. It may summarize its purpose, scope, content, and potential applications. If a long description needs to be split into paragraphs, each paragraph can be put into a dedicated column in this row. Language must be English.

crc-project (required)

One or more CRC projects this dataset is attributed to (typically the project responsible for acquisition). A project is identified by its CRC project code (e.g., Z02). If multiple projects have been involved, additional project codes can be given in subsequent columns, one per column.

version (required)

A label that identifies the version of the dataset the catalog record is describing. If a dataset is unversioned, it is acceptable to state latest. Otherwise any numerical label (e.g., 1.2), or text label (e.g., GITSHA 7db210fb5) can be provided here. The version should change when the content of the dataset changes.

sample[organism] (required)

Classification of the organism(s) associated with, or studied for, the dataset. One or more organisms can be given, one per column.

Organisms must be identified by their ID in the NCBI organismal taxonomy, which can be searched at https://www.ebi.ac.uk/ols4/ontologies/ncbitaxon.

For example, the identifier for human (Homo sapiens) is NCBITaxon:9606, so the column value should be NCBITaxon:9606.

sample[organism-part] (required)

Classification of the organism part(s) associated with, or studied for, the dataset. One or more organism parts can be given, one per column.

Organism parts must be identified by their ID in the Uber-anatomy ontology (UBERON), which can be searched at https://www.ebi.ac.uk/ols4/ontologies/uberon.

For example, the identifier for upper limb segment is UBERON:0008785, so the column value should be UBERON:0008785.

The identifier for the brain is UBERON:0000955, but more precise definitions for individual brain structures are available.
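A quick format check can catch free-text entries such as "human" or "brain" before a record is submitted. The sketch below is illustrative only: the patterns (a prefix followed by digits for NCBITaxon, a prefix followed by seven digits for UBERON) are assumptions derived from the examples above, and the helper name check_sample_value is hypothetical. It verifies the shape of an identifier, not that the term exists in the ontology.

```python
import re

# Assumed identifier shapes, inferred from the examples in this section.
ID_PATTERNS = {
    "sample[organism]": re.compile(r"NCBITaxon:\d+"),
    "sample[organism-part]": re.compile(r"UBERON:\d{7}"),
}


def check_sample_value(prop: str, value: str) -> bool:
    """Return True if `value` has the expected identifier shape for `prop`."""
    return ID_PATTERNS[prop].fullmatch(value.strip()) is not None


print(check_sample_value("sample[organism]", "NCBITaxon:9606"))       # True
print(check_sample_value("sample[organism-part]", "UBERON:0000955"))  # True
print(check_sample_value("sample[organism]", "human"))                # False
```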

keywords

Keywords describing the major topical themes of the dataset. Any number of keywords can be given, one keyword per column. Keywords aid the discoverability of a dataset.

license

A license document (URL) that applies to the dataset and defines the terms and conditions for use.

doi

A DOI assigned to the dataset (e.g., by a data portal it was published in). The DOI should preferably point to the dataset version described in the catalog record. URL format (starting with https://doi.org) is preferred.
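DOIs are often copied in different notations (bare, with a doi: prefix, or as a URL). The following sketch shows one way to bring them into the preferred URL form. The helper name doi_to_url and the list of recognized prefixes are assumptions for illustration, and 10.1234/example is a placeholder, not a real DOI.

```python
# Prefixes that are stripped before the DOI is re-emitted in URL form
# (an assumed, non-exhaustive list).
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Return `doi` in URL form, starting with https://doi.org/."""
    value = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if value.lower().startswith(prefix):
            value = value[len(prefix):]
            break
    if not value.startswith("10."):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    return "https://doi.org/" + value


print(doi_to_url("doi:10.1234/example"))  # https://doi.org/10.1234/example
```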

homepage

A URL the catalog should advertise as the primary source of information/data on this dataset. This could be a dataset page in a data portal.

last-updated

Date of the last modification of the described dataset (version), for example a release date. Must be given in ISO 8601 format (i.e., YYYY-MM-DD).
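To make the layout of the dataset table concrete, the sketch below assembles a small table with property names in column 1 and values in the following columns, checks that all required properties are present, and serializes it as tab-separated text. All values are made up for illustration. The tab-separated serialization and the helper names missing_required and write_table are assumptions; consult the tabby specification for the authoritative file layout and naming.

```python
import csv
import io
from datetime import date

# Required properties of the dataset table, as listed in this section.
REQUIRED = (
    "name", "title", "description", "crc-project", "version",
    "sample[organism]", "sample[organism-part]",
)

# Hypothetical example content: property name first, then one value per column.
rows = [
    ["name", "example-dataset"],
    ["title", "Example dataset title"],
    ["description", "First paragraph.", "Second paragraph."],
    ["crc-project", "Z02"],
    ["version", "latest"],
    ["sample[organism]", "NCBITaxon:9606"],
    ["sample[organism-part]", "UBERON:0000955"],
    ["keywords", "motor control", "example"],
    ["last-updated", date(2024, 1, 31).isoformat()],  # yields YYYY-MM-DD
]


def missing_required(table):
    """Return required properties that are absent or have no value."""
    present = {row[0] for row in table if len(row) > 1 and row[1].strip()}
    return [prop for prop in REQUIRED if prop not in present]


def write_table(table, stream):
    """Write the table as tab-separated text, one property per line."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    writer.writerows(table)


buffer = io.StringIO()
write_table(rows, buffer)
print(missing_required(rows))  # []
print(buffer.getvalue())
```

In practice the same content would usually be entered in the XLSX template; the sketch only illustrates how rows and columns map to properties and values.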

Table data-controller (required)

This table lists one or more entities (natural persons or organizations) that are (legally) responsible for a dataset, and serve as an official contact point regarding collaboration inquiries. For datasets involving personal data (as defined in the European General Data Protection Regulation; GDPR) this table lists data controllers. For CRC datasets, these are typically the PIs of the involved CRC project(s).