DataLad-based deposition to the catalog

For projects that have already adopted DataLad as data management tool for CRC1451 data, certain automated functionality can ease the load of generating and depositing metadata for use in the CRC1451 data catalog.

The previous section has covered in detail the generation of metadata records using the DataLad tabby format and spreadsheet editing tools. Here we focus on generating and depositing such records in two specific DataLad-based scenarios:

  A. A shareable DataLad dataset
  B. A NON-shareable DataLad dataset

Scenario A: shareable DataLad dataset

In this scenario the project has DataLad datasets that can be shared. The main question is then:

How should such a dataset be annotated with a tabby record, and how should this addition be deposited, such that the resulting metadata is sufficient for publication in the data catalog?

1. Create a tabby record in a dedicated location

First, a tabby record should be created by following the instructions provided in the previous Creating a record section, with two exceptions. First, the version property of the dataset table is no longer required (the version can be derived from git). Second, the files table should not be included (all relevant information can be obtained from the git repository).

Next, the tabby record (including at least the required dataset, data-controller, authors, and funding tables) should be placed as separate tsv files (ideally using UTF-8 encoding) in the following directory relative to the root of the DataLad dataset that is being described:

.datalad/tabby/self/

For converting tabby files between Excel and tsv formats, see the datalad_tabby.io.xlsx module. Finally, please include the @tby-crc1451v0 convention name in the file names (this allows us to attach definitions to the terms used):

.datalad/tabby/self
├── authors@tby-crc1451v0.tsv
├── data-controller@tby-crc1451v0.tsv
├── dataset@tby-crc1451v0.tsv
└── funding@tby-crc1451v0.tsv
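The directory and files can also be prepared programmatically. Below is a minimal Python sketch that creates the dedicated location and writes one tab-separated, UTF-8 encoded table; the property names and values shown are placeholders only — the real rows must follow the tabby templates from the previous Creating a record section:

```python
import csv
from pathlib import Path

# Dedicated location for the tabby record, relative to the dataset root
tabby_dir = Path(".datalad/tabby/self")
tabby_dir.mkdir(parents=True, exist_ok=True)

# Placeholder rows -- substitute the actual properties required by the
# tabby templates described in the previous section
dataset_rows = [
    ["title", "My example dataset"],
    ["description", "A placeholder description"],
]

# Write a tab-separated file with UTF-8 encoding, as recommended above
with open(tabby_dir / "dataset@tby-crc1451v0.tsv", "w",
          encoding="utf-8", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(dataset_rows)
```

The same pattern applies to the authors, data-controller, and funding tables.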

2. [Optional] Create extra tabby records

Any other tabby records that should be reported but do not relate directly to the DataLad dataset being described can be placed in the following directory relative to the root of the DataLad dataset:

.datalad/tabby/collection/
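Before saving, it can help to verify that all required tables are in place and to list any extra collection records. A small sketch in Python (the file names mirror the listing above; the set of required tables is taken from step 1):

```python
from pathlib import Path

# Required tables for the dataset's own record (see step 1)
required = ["dataset", "data-controller", "authors", "funding"]
self_dir = Path(".datalad/tabby/self")
collection_dir = Path(".datalad/tabby/collection")

# Report any required table that is not yet present
missing = [t for t in required
           if not (self_dir / f"{t}@tby-crc1451v0.tsv").exists()]
if missing:
    print("missing required tables:", ", ".join(missing))

# List any extra records placed in the collection directory
extras = []
if collection_dir.exists():
    extras = sorted(p.name for p in collection_dir.glob("*.tsv"))
print("collection records:", extras or "none")
```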

3. Save the dataset

Save the dataset using DataLad:

datalad save --to-git -m "<save message>"

This ensures that the latest state of the dataset includes the added metadata records.

4. Publish the update

Note

This step assumes that the project has already been published (has a configured sibling).

Push the new state of the dataset to an existing sibling:

datalad push --to <sibling>

You can use any repository hosting solution (e.g. GitHub, GitLab, GIN) or follow the procedure described in the WebDAV walkthrough, as long as the published dataset is reachable by INF.

Once this has been done, please notify INF by email (m.szczepanik at fz-juelich.de), and we will clone the dataset and parse the relevant information, adding it to the data catalog.

Scenario B: NON-shareable DataLad dataset

In this scenario the project has DataLad datasets that can NOT be shared, and the main question becomes:

How can a tabby record (including file metadata) be generated in the most efficient way, and how should this addition be deposited, such that these records can populate the data catalog without having to share or publish the DataLad dataset?

1. Save tabby records to dedicated locations

To maintain a clear structure and keep the complete dataset, including its metadata, under version control, it is highly recommended to create and place tabby records in the same locations as described in Scenario A, steps 1 through 3.

2. [Optional] Create the files table

The files table is an optional addition to the tabby record and lists one or more files that form part of the DataLad dataset. Such a file listing can be generated easily from the output of datalad status. See the status2tabby.py script, which uses the datalad-next extension, for an example implementation that can also be used directly:

usage: status2tabby.py [-h] dataset outfile

positional arguments:
  dataset     Dataset for which files will be listed
  outfile     Name of tsv file to write

options:
  -h, --help  show this help message and exit
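For illustration, here is a much simplified sketch of what such a conversion could look like. This is not the actual status2tabby.py implementation; the record fields ("path", "bytesize") and the column headers are assumptions modeled loosely on datalad status output:

```python
import csv

def records_to_files_table(records, outfile):
    """Write a minimal files table (tab-separated, UTF-8) from
    status-style records. A simplified sketch only, not the real
    status2tabby.py implementation."""
    with open(outfile, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["path", "size"])  # assumed column headers
        for rec in records:
            writer.writerow([rec["path"], rec.get("bytesize", "")])

# Hypothetical records, shaped like datalad status results
demo = [
    {"path": "data/sub-01.nii.gz", "bytesize": 104857600},
    {"path": "README.md", "bytesize": 1024},
]
records_to_files_table(demo, "files@tby-crc1451v0.tsv")
```

In practice, use the provided status2tabby.py script, which handles the real datalad status output.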

3. Publish the metadata

For manual deposition, copy and send all tabby records to the INF project coordinator via email.

For deposition using Git / DataLad, open a GitHub pull request against the sfb1451/all-datasets superdataset: fork and clone the superdataset, add your tabby file collection under .datalad/tabby, push the changes to your GitHub account, and open a PR in GitHub's interface.