DataLad-Based Deposition to the Catalog

For projects that have already adopted DataLad as a data management tool for CRC1451 data, certain automated functionality can ease the load of generating and depositing metadata for use in the CRC1451 data catalog.

The previous section has covered in detail the generation of metadata records using the DataLad tabby format and spreadsheet editing tools. Here we focus on generating and depositing such records in two specific DataLad-based scenarios:

  1. A shareable DataLad dataset
  2. A NON-shareable DataLad dataset

Scenario A: shareable DataLad dataset

In this scenario the project has DataLad datasets that can be shared. The main question is then:

How should such a dataset be annotated with a tabby record, and how should this addition be deposited, such that the resulting metadata is sufficient for publication in the data catalog?

1. Create a tabby record in a dedicated location

First, a tabby record should be created by following the instructions provided in the previous Creating a record section, with two exceptions. First, the version property of the dataset table is no longer required (as the version can come from git). Second, the files table should not be included (as all information can be obtained from the git repository).

Next, the tabby record (including at least the required dataset, data-controller, authors, and funding tables) should be placed as separate tsv files (ideally using UTF-8 encoding) in the following directory relative to the root of the DataLad dataset that is being described:

.datalad/tabby/self/

For converting tabby files between Excel and tsv formats, see the datalad_tabby.io.xlsx module (a conversion sketch follows the listing below). Finally, please include the @tby-crc1451v0 convention name in the file names (this allows us to attach definitions to the terms used):

.datalad/tabby/self
├── authors@tby-crc1451v0.tsv
├── data-controller@tby-crc1451v0.tsv
├── dataset@tby-crc1451v0.tsv
└── funding@tby-crc1451v0.tsv
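
If the record is maintained as a single Excel workbook, the following sketch shows one way to convert it to the tsv layout above. It assumes the xlsx2tabby() helper exposed by datalad_tabby.io.xlsx (verify against your installed datalad-tabby version) and a hypothetical workbook name; adjust both to your setup.

from pathlib import Path
from datalad_tabby.io.xlsx import xlsx2tabby  # assumed helper; check your datalad-tabby version

# Hypothetical workbook name; the conversion is expected to write one tsv
# file per sheet into the destination directory
xlsx2tabby(
    Path("dataset@tby-crc1451v0.xlsx"),
    Path(".datalad/tabby/self"),
)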

2. [Optional] Create extra tabby records

Any other tabby records that should be reported but do not relate directly to the DataLad dataset being described can be placed in the following directory relative to the root of the DataLad dataset:

.datalad/tabby/collection/
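
For example, a record-name prefix can be used so that several records coexist in this directory (the layout and file names below are hypothetical; follow the tabby conventions used in your project):

.datalad/tabby/collection
├── extra-record_authors@tby-crc1451v0.tsv
├── extra-record_data-controller@tby-crc1451v0.tsv
├── extra-record_dataset@tby-crc1451v0.tsv
└── extra-record_funding@tby-crc1451v0.tsv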

3. Save the dataset

Save the dataset using DataLad:

datalad save --to-git -m "<save message>"

This will ensure that the latest state of the dataset includes the added metadata records.
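
If the dataset contains other, unrelated changes that should not be saved yet, the save can be limited to the metadata directory, for example (the commit message is illustrative):

datalad save --to-git -m "Add tabby metadata records" .datalad/tabby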

4. Publish the update

Important

This step assumes that the project has already been published (has a configured sibling).

Then, push the new state of the dataset to an existing sibling:

datalad push --to <sibling>

You can use any repository hosting solution (e.g. GitHub, GitLab, GIN) or follow the procedure described in the WebDAV walkthrough, as long as the published dataset is reachable by INF.
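
As an illustration only (repository and sibling names are hypothetical, and GIN credentials are assumed to be configured already), publishing to GIN could look like this:

datalad create-sibling-gin my-crc1451-dataset -s gin
datalad push --to gin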

Once this has been done, please notify INF by email (m.szczepanik at fz-juelich.de), and we will clone the dataset and parse the relevant information, adding it to the data catalog.

Scenario B: NON-shareable DataLad dataset

In this scenario the project has DataLad datasets that can NOT be shared, and the main question becomes:

How can a tabby record (including file metadata) be generated in the most efficient way, and how should this addition be deposited, such that these records can populate the data catalog without having to share or publish the DataLad dataset?

1. Save tabby records to dedicated locations

To keep the structure maintainable and the complete dataset, including its metadata, under version control, it is highly recommended to create and place the tabby records in the same locations as described in Scenario A, steps 1 through 3.

2. [Optional] Create the files table

The files table is an optional addition to the tabby record and lists one or more files that form part of the DataLad dataset. Such a file listing can be easily generated from the output of ls_file_collection, provided by DataLad-next. Below, we present an example implementation, which you can also use directly.

Note that files which are present will be checksummed; for annexed files which are not present locally, the checksum will be taken from the annex key. This assumes that DataLad’s default backend (MD5E) is used; adjust the code if a different backend is in use.

"""Generate a TSV file listing the files in a dataset

Usage:

    python this_file.py dataset outfile

"""

from argparse import ArgumentParser
from csv import DictWriter
from pathlib import Path
import re
from datalad.api import ls_file_collection


def match_md5(key):
    """Report MD5 checksum contained in annex key

    Assumes MD5E backend (DataLad's default).
    https://git-annex.branchable.com/internals/key_format/

    """
    m = re.match(r"MD5E-s\d+--([a-f0-9]{32})", key)
    return m.group(1) if m is not None else None


def transform_result(res):
    """Transform results to get required information

    Processes the outputs of ls_file_collection to get relative path,
    size in bytes, and md5 checksum.  For annex keys which are not
    present, the checksum will be taken from the annex key, if
    possible.  Keys match the sfb1451 tabby specification.

    """
    path = res["item"].relative_to(res["collection"]).as_posix()
    size = res["annexsize"] if res["type"] == "annexed file" else res["size"]
    checksum = res.get("hash-md5")

    if checksum is None and res["type"] == "annexed file":
        checksum = match_md5(res["annexkey"])

    return {"path[POSIX]": path, "size[bytes]": size, "checksum[md5]": checksum}


parser = ArgumentParser()
parser.add_argument("dataset", type=Path, help="Dataset for which files will be listed")
parser.add_argument("outfile", type=Path, help="Name of the tsv file to write")
args = parser.parse_args()

fc = ls_file_collection(
    "annexworktree",
    args.dataset,
    hash="md5",
    result_renderer="disabled",
    result_xfm=transform_result,
    return_type="generator",
)

with args.outfile.open("w", encoding="utf-8", newline="") as tsvfile:
    fieldnames = ["path[POSIX]", "size[bytes]", "checksum[md5]"]
    writer = DictWriter(tsvfile, delimiter="\t", fieldnames=fieldnames)
    writer.writeheader()
    for row in fc:
        writer.writerow(row)
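
Assuming the script has been saved as list_files.py (name hypothetical), it can be run as shown below; the output file name here follows the @tby-crc1451v0 convention used for the other tables:

python list_files.py <dataset> files@tby-crc1451v0.tsv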

3. Publish the metadata

For manual deposition, copy and send all tabby records to the INF project coordinator via email.

For deposition using Git / DataLad, open a GitHub pull request against the sfb1451/all-datasets superdataset (fork and clone the superdataset, add your tabby file collection under .datalad/tabby, push the changes to your GitHub account, and open a PR in GitHub's interface), as sketched below.
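
A minimal sketch of that workflow (placeholders in angle brackets need to be replaced with your GitHub user name and project details):

datalad clone git@github.com:<user>/all-datasets.git
cd all-datasets
# copy your tabby records under .datalad/tabby
datalad save --to-git -m "Add tabby metadata for <project>"
datalad push --to origin
# then open a pull request against sfb1451/all-datasets in GitHub's interface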