DataLad-Based Deposition to the Catalog
For projects that have already adopted DataLad as a data management tool for CRC1451 data, certain automated functionality can ease the load of generating and depositing metadata for use in the CRC1451 data catalog.
The previous section covered in detail the generation of metadata records using the DataLad tabby format and spreadsheet editing tools. Here we focus on generating and depositing such records for two specific DataLad-based scenarios:
- A shareable DataLad dataset
- A NON-shareable DataLad dataset
Scenario A: shareable DataLad dataset
In this scenario the project has DataLad datasets that can be shared. The main question is then:
How should such a dataset be annotated with a tabby record, and how should this addition be deposited, such that the resulting metadata would be sufficient for publication in the data catalog?
1. Create a tabby record in a dedicated location
To begin, create a tabby record by following the instructions provided in the previous Creating a record section, with two exceptions. First, the version property of the dataset table is no longer required (as the version can come from git). Second, the files table should not be included (as all file information can be obtained from the git repository).
Next, the tabby record (including at least the required dataset, data-controller, authors, and funding tables) should be placed as separate tsv files (ideally using UTF-8 encoding) in the following directory relative to the root of the DataLad dataset being described:
.datalad/tabby/self/
For converting tabby files between Excel and tsv formats, see the datalad_tabby.io.xlsx module. Finally, please include the @tby-crc1451v0 convention name in the file names (this allows us to attach definitions to the terms used):
.datalad/tabby/self
├── authors@tby-crc1451v0.tsv
├── data-controller@tby-crc1451v0.tsv
├── dataset@tby-crc1451v0.tsv
└── funding@tby-crc1451v0.tsv
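The layout above can be scaffolded with a few lines of plain Python. This is a convenience sketch only: it creates empty, UTF-8 encoded tsv files with the expected names, and the helper name `scaffold_tabby_self` is hypothetical; the actual table contents must still be filled in as described in the Creating a record section.

```python
from pathlib import Path

# Required tables and convention name, as stated in the text above
TABLES = ("dataset", "data-controller", "authors", "funding")
CONVENTION = "tby-crc1451v0"

def scaffold_tabby_self(dataset_root):
    """Create one empty tsv file per required table; return the paths."""
    target = Path(dataset_root) / ".datalad" / "tabby" / "self"
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for table in TABLES:
        path = target / f"{table}@{CONVENTION}.tsv"
        # tab-separated, UTF-8 encoded, as recommended above; rows are
        # left for you to fill in (e.g. with csv.writer(..., delimiter="\t"))
        path.write_text("", encoding="utf-8")
        written.append(path)
    return written
```

Running this once inside a dataset clone produces the four placeholder files shown in the tree above, ready to be filled and saved.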
2. [Optional] Create extra tabby records
Any other tabby records that should be reported but do not relate directly to the DataLad dataset being described can be placed in the following directory relative to the root of the DataLad dataset:
.datalad/tabby/collection/
3. Save the dataset
Save the dataset using DataLad:
datalad save --to-git -m "<save message>"
This will ensure that the new and latest state of the dataset includes the added metadata records.
4. Publish the update
Important
This step assumes that the project has already been published (has a configured sibling).
Then, push the new state of the dataset to an existing sibling:
datalad push --to <sibling>
You can use any repository hosting solution (e.g. GitHub, GitLab, GIN) or follow the procedure described in the WebDAV walkthrough, as long as the published dataset is reachable by INF.
Once this has been done, please notify INF by email (m.szczepanik at fz-juelich.de), and we will clone the dataset, parse the relevant information, and add it to the data catalog.
Scenario B: NON-shareable DataLad dataset
In this scenario the project has DataLad datasets that can NOT be shared, and the main question becomes:
How can a tabby record (including file metadata) be generated in the most efficient way, and how should this addition be deposited, such that these records can populate the data catalog without having to share or publish the DataLad dataset?
1. Save tabby records to dedicated locations
To keep the structure maintainable and the complete dataset, including its metadata, under version control, it is highly recommended to create and place tabby records in the same locations described in Scenario A, steps 1 through 3.
2. [Optional] Create the files table
The files table is an optional addition to the tabby record and lists one or more files that form part of the DataLad dataset. Such a file listing can be easily generated based on the output of datalad status - see the status2tabby.py script, which uses the datalad-next extension, for an example implementation that you can also use directly:
usage: status2tabby.py [-h] dataset outfile
positional arguments:
dataset Dataset for which files will be listed
outfile Name of tsv file to write
options:
-h, --help show this help message and exit
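The core transformation such a script performs can be sketched in a few lines. This is an illustration, not the status2tabby.py implementation: it assumes status results were captured as JSON lines (e.g. with `-f json`, one result object per line), and the output column headers are assumptions, not the authoritative tby-crc1451v0 files schema.

```python
import csv
import io
import json
from pathlib import PurePosixPath

def status_to_files_table(json_lines, dataset_root):
    """Return tsv text with one row per file reported by datalad status."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t")
    # column headers are illustrative assumptions
    writer.writerow(["path[POSIX]", "state"])
    for line in json_lines.splitlines():
        if not line.strip():
            continue
        result = json.loads(line)
        if result.get("type") != "file":
            continue  # skip directories, subdatasets, etc.
        # report paths relative to the dataset root
        rel = PurePosixPath(result["path"]).relative_to(dataset_root)
        writer.writerow([str(rel), result.get("state", "")])
    return out.getvalue()
```

The function is pure (text in, text out), so it can be fed from a captured status log or wired directly to a subprocess call.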
Info
Another (more performant) alternative is to use the ls_file_collection function from DataLad-next, and transform its output.
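If neither DataLad helper is at hand, a files table can also be produced by walking the worktree with the standard library alone. This stdlib-only fallback sketch is not equivalent to ls_file_collection, and its columns (path, size, md5) are illustrative assumptions rather than the authoritative tby-crc1451v0 schema.

```python
import csv
import hashlib
import io
from pathlib import Path

def walk_to_files_table(dataset_root):
    """Return tsv text listing path, size, and md5 for each worktree file."""
    root = Path(dataset_root)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t")
    # column headers are illustrative assumptions
    writer.writerow(["path[POSIX]", "size[bytes]", "checksum[md5]"])
    for path in sorted(root.rglob("*")):
        # skip non-files and git/datalad bookkeeping directories
        if not path.is_file() or ".git" in path.parts or ".datalad" in path.parts:
            continue
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        writer.writerow([path.relative_to(root).as_posix(),
                         path.stat().st_size, digest])
    return out.getvalue()
```

Note that this hashes file content directly, so it is slower than the DataLad-based alternatives on large annexed datasets and only sees files actually present in the worktree.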
3. Publish the metadata
For manual deposition, copy and send all tabby records to the INF project coordinator via email.
For deposition using Git / DataLad, open a GitHub pull request for the sfb1451/all-datasets superdataset (fork and clone the superdataset, add your tabby files collection under .datalad/tabby, push the changes to your GitHub account, and open a PR in GitHub's interface).