Register data in the data catalog
The CRC data catalog (https://data.sfb1451.de) showcases datasets that have been acquired or assembled by CRC members, and demonstrates the progress made towards building a versatile, curated, cross-species data resource for motor research.
The CRC data catalog is equivalent to a library’s register of books. It does not provide central data hosting, but it offers a uniform presentation of all CRC datasets, regardless of their nature, origin, and location.
Catalog records
Catalog records are essentially a collection of tables (spreadsheets) with metadata on a dataset. The level of detail of a particular record is variable and can range from minimal (largely bibliographic) information to a comprehensive description that enables metadata-driven processing of the underlying data.
Catalog records employ the DataLad tabby format (see https://docs.datalad.org/projects/tabby for the specification). A key advantage of this format is that authors can provide straightforward metadata in spreadsheet form, edited via convenient (online) editors. At the same time, data curators can enrich those metadata to build a precise and detailed knowledge base on all CRC data.
Submitting a record
Catalog records can be submitted in two ways:
- A manually composed record is emailed to m.szczepanik at fz-juelich.de. This is an adequate method for individual datasets, or a minimally described dataset.
- DataLad-based deposition
A catalog record is generated in a semi-automatic fashion. The record is then deposited in the DataLad (super)dataset grouping the CRC datasets. This method is geared towards projects that generate large numbers of (homogeneous) datasets (e.g., the Z projects).
If you are interested in this method, please follow the instructions in the dedicated section or contact INF by email to m.szczepanik at fz-juelich.de.
Submitted records are processed by INF and included in the CRC superdataset, which is published to GitHub as sfb1451/all_datasets. Direct contributions are also welcome.
Creating a record
A catalog record is ultimately a collection of plain-text files. Catalog records can therefore be created in many ways. Here we describe the manual creation of a record with any of the universally available software solutions for editing spreadsheets.
Catalog records can be converted to and from the popular XLSX format. An XLSX file can be edited collaboratively online (Sciebo's OnlyOffice, Google Sheets, etc.), or downloaded and edited offline (LibreOffice, MS Office, etc.).
To create a record we provide:
- a template in XLSX format: https://fz-juelich.sciebo.de/s/XOzaKNrGboVbJGm
- a populated example record: https://fz-juelich.sciebo.de/s/qlrTRVyeyC4Sfdl
A record comprises multiple components that are described in the following sections. Each component takes the form of an individual table (or sheet).
Record components
This section provides a comprehensive overview of all pre-defined record components, and their individual items. Importantly, the present set of components is not fixed. A record can be extended with additional information as considered necessary or useful by individual authors.
Each individual table listed below describes one or more entities that have the same relationship with another entity (and are often also of the same type).
Table dataset (required)
This table contains direct properties of the dataset the catalog record is about.
Property names are given in column 1, and values in column 2 (and possibly in the following columns). Recognized properties are:
- name (required)
- Identifies the dataset uniquely within the scope of a CRC project, i.e. the respective project must not have two different datasets of the same name. The name should be suitable for a directory/folder name. Spaces and special characters should be avoided.
- title (required)
- Title to be displayed on the catalog landing page for the dataset. Language must be English.
- description (required)
- General description of the dataset. It may summarize its purpose, scope, content, and potential applications. If a long description needs to be split into paragraphs, each paragraph can be put into a dedicated column in this row. Language must be English.
- crc-project (required)
- One or more CRC projects this dataset is attributed to (typically the project responsible for acquisition). Each project is identified by its CRC project code (e.g., Z02). If multiple projects have been involved, additional project codes can be given in subsequent columns, one per column.
- version (required)
- A label that identifies the version of the dataset the catalog record is describing. If a dataset is unversioned, it is acceptable to state latest. Otherwise any numerical label (e.g., 1.2), or text label (e.g., GITSHA 7db210fb5) can be provided here. The version should change when the content of the dataset changes.
- sample[organism] (required)
Classification of organism(s) associated with, or studied for the dataset. One or more organisms can be given, one per column.
Organisms must be identified by their ID in the NCBI organismal taxonomy, which can be searched at https://www.ebi.ac.uk/ols4/ontologies/ncbitaxon.
For example, the identifier for human or homo sapiens is NCBITaxon:9606. The column value should be NCBITaxon:9606 in this case.
- sample[organism-part] (required)
Classification of organism part(s) associated with, or studied for the dataset. One or more organism parts can be given, one per column.
Organism parts must be identified by their ID in the Uber-anatomy ontology (UBERON), which can be searched at https://www.ebi.ac.uk/ols4/ontologies/uberon.
For example, the identifier for upper limb segment is UBERON:0008785. The column value should be UBERON:0008785 in this case.
The identifier for the brain is UBERON:0000955, but more precise definitions for individual brain structures are available.
- keywords
- Keywords describing the major topical themes of the dataset. Any number of keywords can be given, one keyword per column. Keywords aid the discoverability of a dataset.
- license
- A license document (URL) that applies to the dataset and defines the terms and conditions for use.
- doi
- A DOI assigned to the dataset (e.g., by a data portal it was published in). The DOI should preferably point to the dataset version described in the catalog record. URL format (starting with https://doi.org/) is preferred.
- homepage
- A URL the catalog should advertise as the primary source of information/data on this dataset. This could be a dataset page in a data portal.
- last-updated
- Date of the last modification of the described dataset (version), for example a release date. Must be given in ISO 8601 format (i.e., YYYY-MM-DD).
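The layout of the dataset table can be sketched in a few lines of Python. This is only an illustration, not part of the catalog tooling: all property values are made-up placeholders, tab-separated text is assumed as the plain-text form (the tabby specification is the authoritative reference for the file layout), and the identifier patterns are a guess based on the two examples given above.

```python
import csv
import io
import re
from datetime import date

# Assumed identifier shapes, derived from the examples NCBITaxon:9606 and
# UBERON:0008785. They are not taken from the ontologies' own documentation.
ID_PATTERNS = {
    "sample[organism]": re.compile(r"NCBITaxon:\d+"),
    "sample[organism-part]": re.compile(r"UBERON:\d{7}"),
}

# Hypothetical example values; multi-valued properties are given as lists.
DATASET = {
    "name": "example-dataset",
    "title": "An example dataset",
    "description": ["First paragraph.", "Second paragraph."],
    "crc-project": ["Z02"],
    "version": "latest",
    "sample[organism]": ["NCBITaxon:9606"],
    "sample[organism-part]": ["UBERON:0008785", "UBERON:0000955"],
    "keywords": ["motor control", "upper limb"],
    "last-updated": "2024-01-31",
}


def check(properties):
    """Return a list of problems found in identifier and date values."""
    problems = []
    for key, pattern in ID_PATTERNS.items():
        for value in properties.get(key, []):
            if not pattern.fullmatch(value):
                problems.append(f"{key}: unexpected identifier {value!r}")
    updated = properties.get("last-updated")
    if updated is not None:
        try:
            if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", updated):
                raise ValueError(updated)
            date.fromisoformat(updated)  # rejects impossible dates
        except ValueError:
            problems.append(f"last-updated: not YYYY-MM-DD: {updated!r}")
    return problems


def to_tsv(properties):
    """Render one property per row: name in column 1, values after it."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for key, value in properties.items():
        values = value if isinstance(value, list) else [value]
        writer.writerow([key, *values])
    return buffer.getvalue()


if __name__ == "__main__":
    print(check(DATASET) or "no problems found")
    print(to_tsv(DATASET))
```

Each list entry ends up in its own column, which mirrors the "one per column" rule for project codes, organisms, organism parts, keywords, and description paragraphs.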
Table data-controller (required)
This table lists one or more entities (natural persons or organizations) that are (legally) responsible for a dataset, and serve as an official contact point regarding collaboration inquiries. For datasets involving personal data (as defined in the European General Data Protection Regulation; GDPR) this table lists data controllers. For CRC datasets, these are typically the PIs of the involved CRC project(s).
Property names are given in row 1, and values for each entity are given in subsequent rows (columns corresponding to the header row specification). Recognized properties are:
- name (required)
- The full name of the responsible entity. For example, the name of a CRC project PI.
- email (required)
- An email address with which the entity can be contacted. For example, the institutional email address of a CRC project PI.
- type
- The type of entity described in a row. Either Person or Organization.
- address
- A (postal) address for the responsible entity.
Table authors (required)
This table lists one or more entities (natural persons or organizations) that are considered authors of the dataset. These authors need not be identical to an author list of an associated publication. Any entity listed in this table will be credited on the catalog page of the dataset.
Property names are given in row 1, and values for each entity are given in subsequent rows (columns corresponding to the header row specification). Recognized properties are:
- name (required)
- The full name of the author. For example, the name of a CRC member.
- email
- An email address with which the author can be contacted.
- orcid
- ORCID of this author, to uniquely identify a researcher.
- affiliation
- One or more names of organizations or institutions an author is affiliated with. Affiliations are free-form. Multiple affiliations can be given by repeating the column as often as necessary.
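The row-oriented layout used by this and the following tables, including the repeated affiliation column, might look as follows when produced programmatically. This is a minimal sketch: the authors, email address, and ORCID value are placeholders, and tab-separated text is assumed as the plain-text form.

```python
import csv
import io

# Hypothetical authors; every value here is a placeholder.
AUTHORS = [
    {"name": "Jane Doe", "email": "jane.doe@example.org",
     "orcid": "0000-0000-0000-0000",
     "affiliation": ["Institute A", "Clinic B"]},
    {"name": "John Roe", "email": "", "orcid": "",
     "affiliation": ["Institute A"]},
]


def authors_to_tsv(authors):
    """Header in row 1, one author per row, 'affiliation' repeated as needed."""
    width = max(len(author["affiliation"]) for author in authors)
    header = ["name", "email", "orcid"] + ["affiliation"] * width
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for author in authors:
        # Pad shorter affiliation lists so every row has the same width.
        padding = [""] * (width - len(author["affiliation"]))
        writer.writerow([author["name"], author["email"], author["orcid"],
                         *author["affiliation"], *padding])
    return buffer.getvalue()


if __name__ == "__main__":
    print(authors_to_tsv(AUTHORS))
```

The header repeats the affiliation column as often as the author with the most affiliations requires; optional properties that are unknown are simply left empty.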
Table funding (required)
This table lists one or more funding sources that are associated with the dataset and shall be credited on the dataset's catalog page.
Property names are given in row 1, and values for each entity are given in subsequent rows (columns corresponding to the header row specification). Recognized properties are:
- funder (required)
- Code or URL that identifies the entity providing the funding. For the required acknowledgment of the DFG funding of the CRC1451 use the code DFG.
- grant (required)
- Grant identifier. This is typically a funder-specific project code. For the required acknowledgment of the DFG funding of the CRC1451 use the code 431549029-<CRC-project-code>.
Some additional grant information can be retrieved online during processing of records. This is currently done for projects available in GEPRIS (DFG) and CORDIS (EU). For DFG grants other than CRC1451 and its subprojects, enter "DFG" in the funder field and only the numeric identifier in the grant field. For European Commission grants, enter the URL to their CORDIS page.
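As a small worked example, the funder and grant values for the required CRC1451 acknowledgment can be derived from a project code. The function name is hypothetical, and the normalization to upper case is an assumption of this sketch, not a documented rule.

```python
CRC_GRANT_PREFIX = "431549029"  # prefix of the grant code given above


def crc_funding_row(project_code):
    """Return (funder, grant) for the DFG acknowledgment of CRC1451."""
    # Assumption: project codes are short alphanumeric labels such as Z02.
    code = project_code.strip().upper()
    if not code.isalnum():
        raise ValueError(f"unexpected CRC project code: {project_code!r}")
    return ("DFG", f"{CRC_GRANT_PREFIX}-{code}")


if __name__ == "__main__":
    print("\t".join(crc_funding_row("Z02")))
```

For project Z02 this yields the funder DFG and the grant 431549029-Z02.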
Table publications
This table lists one or more publications which are associated with the dataset and shall be credited on the dataset's catalog page.
Property names are given in row 1, and values for each entity are given in subsequent rows (columns corresponding to the header row specification). Recognized properties are:
- citation (required, but optional with doi specified)
- Free-form citation for the publication. Enables publication record display on a catalog page. All citations in a metadata record should use a common, homogeneous format.
- doi (optional)
- A Digital Object Identifier (URL, starting with https://doi.org/) for a publication. Enables publication DOI display on a catalog page, persistently identifies the publication, and enables metadata retrieval from bibliographic databases.
- url (optional)
- A URL pointing to the publication. A corresponding link is placed on the dataset page in the catalog. This need not be given when a publication DOI is specified.
- date (optional)
- Date of publication. Enables publication date display on a catalog page, and association of publication with the DFG reporting timeframe. ISO 8601 format; year alone is sufficient.
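DOIs are often copied in different spellings (bare, with a doi: prefix, or as an older resolver URL). The following sketch brings them into the URL form requested for the doi property. The helper name and the list of accepted spellings are assumptions made for illustration; the example DOI is a placeholder.

```python
DOI_RESOLVER = "https://doi.org/"

# Spellings this sketch accepts; the list is an assumption, not a standard.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_as_url(doi):
    """Return a DOI in URL form, starting with https://doi.org/."""
    value = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if value.lower().startswith(prefix):
            value = value[len(prefix):]
            break
    # DOI names start with the directory indicator "10.".
    if not value.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return DOI_RESOLVER + value


if __name__ == "__main__":
    print(doi_as_url("doi:10.1000/xyz123"))
```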
Table files
This table lists one or more files that form the dataset.
Property names are given in row 1, and values for each entity are given in subsequent rows (columns corresponding to the header row specification). Recognized properties are:
- path[POSIX] (required)
- Relative path (within the dataset) in POSIX/UNIX notation (i.e. forward slashes). This enables display of a file tree on the dataset catalog page. Tip: do not include a top-level directory that matches the dataset name, because the files are already understood as being in the dataset.
- size[bytes]
- File size in bytes. Enables (total) size information in the catalog record and the file tree on the dataset catalog page. Required for auto-generation of a DataLad dataset from a catalog record.
- checksum[md5]
- MD5 checksum ("fingerprint") of the file. Enables content/download verification. Required for auto-generation of a DataLad dataset from a catalog record.
- url
- File content URL, i.e. URL allowing (possibly access-protected) download of that particular file (not leading to e.g. a landing page). Enables display of download button on the dataset catalog page. Required for auto-generation of a DataLad dataset from a catalog record.
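For datasets with more than a handful of files, the files table is easier to generate than to type. The sketch below walks a local directory and collects the relative POSIX path, size, and MD5 checksum for each file. It is an illustration under stated assumptions: the function names and the optional base URL are hypothetical, tab-separated text is assumed as the output form, and URLs are built by plain concatenation without any escaping.

```python
import csv
import hashlib
import sys
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file without loading it whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_rows(root, base_url=""):
    """Yield [path, size, checksum, url] for every file below root."""
    root = Path(root)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Relative to the dataset root, with forward slashes, so no
        # top-level directory named after the dataset is included.
        relative = path.relative_to(root).as_posix()
        url = f"{base_url.rstrip('/')}/{relative}" if base_url else ""
        yield [relative, path.stat().st_size, md5sum(path), url]


def write_files_table(root, out, base_url=""):
    """Write the header row and one row per file to an open text stream."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["path[POSIX]", "size[bytes]", "checksum[md5]", "url"])
    writer.writerows(file_rows(root, base_url))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python files_table.py <dataset-directory> [base-url]
    write_files_table(sys.argv[1], sys.stdout, *sys.argv[2:3])
```

The walk includes every regular file it finds, including hidden ones, so the result may need filtering before submission. The url column stays empty unless a base URL is given, in which case each file is assumed to be downloadable below that URL under its relative path.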
Table used-for
This table lists one or more activities/projects that the dataset has been or is presently being used for.
Property names are given in row 1, and values for each entity are given in subsequent rows (columns corresponding to the header row specification). Recognized properties are:
- title (required)
- A title for the activity/project. This will be displayed on the dataset page in the catalog.
- url
- A URL pointing to a web page representing the activity/project, or providing information on it. A corresponding link is shown on the dataset page in the catalog.
- description
- A description of the activity/project, possibly focused on the role/association of the dataset in it. If a long description needs to be split into paragraphs, each paragraph can be put into a dedicated column. Language must be English.
Q&A
- What is the preferred granularity of a dataset record in the catalog?
There is no technical limit on the maximum number of dataset records in the data catalog.
For datasets resulting from a (semi-)automated process, the preferred granularity is finer-is-better. For example, it is preferred to have dedicated records for the cross-sectional slices of a longitudinal dataset.
However, when datasets and their catalog metadata records have to be assembled manually, it is more important to have one high-level/umbrella record for an entire dataset (first), rather than only having fine-grained records for individual components later.
- Can dataset records be linked to yield larger datasets made from smaller components?
- Yes, this is possible. The DataLad software can generate this information programmatically. If this feature is needed for a manually assembled catalog record, please contact INF.