Forgejo with encryption
This walkthrough demonstrates how to encrypt annexed file content in Forgejo, a bare Git repository, or another Git hosting platform with annex support.
Introduction
When storing DataLad datasets on third-party infrastructure, Git remotes and git-annex special remotes are often configured to point to different services. However, some repository hosting services (e.g. G-Node GIN, or some instances of Forgejo), as well as bare Git repositories[1] on systems with git-annex installed, support storing annex objects alongside the Git repository, which makes them very convenient to configure and use. Throughout the article, we refer to remotes supporting both Git and annex content as “git+annex”. For background, see the DataLad Handbook’s chapter Beyond shared infrastructure.
Some of the SFB1451 datasets may require encryption at rest to guarantee data privacy or access control. Git-annex supports encryption, and it can be very easy to set up. However, encryption is configured for special remotes, and as such it does not apply to “git+annex” remotes[2].
This has been addressed in git-annex v.10.20250520 by the addition of the mask special remote.
With the mask special remote, setting up encrypted storage for git+annex comes down to the following steps:
- set up a Git remote,
- set up a mask special remote with encryption enabled,
- make sure that annexed contents are pushed only to the latter.
The model use case for this workflow is publishing datasets on Forgejo instances with git-annex support. Forgejo is a “self-hosted lightweight software forge”, which can be used to set up and maintain a platform similar to GitHub or GitLab. Forgejo-aneksajo is a lightly modified version of Forgejo which adds git-annex support. While no public instances are readily available, and the SFB1451 does not (currently) maintain its own instance, it is nevertheless very relevant for the SFB1451 community because Forgejo-aneksajo can be deployed by individual groups (see Collaborative infrastructure for a lab: Forgejo for more background information).
Prerequisites
- git-annex v.10.20250520 or newer
- gpg, and a generated private/public key
- DataLad-next (optional but highly recommended; it provides the datalad download command used below)
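The version requirement can be checked from the command line. The snippet below is a minimal sketch: the helper name version_at_least is made up for illustration, and it assumes a sort that supports version ordering (-V), as GNU coreutils does.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

required="10.20250520"
if command -v git-annex >/dev/null 2>&1; then
    # `git annex version --raw` prints only the version string.
    installed="$(git annex version --raw)"
    if version_at_least "$installed" "$required"; then
        echo "git-annex $installed is recent enough"
    else
        echo "git-annex $installed is older than $required"
    fi
else
    echo "git-annex not found"
fi
```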
Walkthrough
Initial set-up and publishing
Publishing a dataset requires having a dataset in the first place. In this example we create a dataset with one file in Git (README) and one file annexed (an image).
datalad create foo
cd foo
echo "Hello world" > README
datalad download https://images.pexels.com/photos/21533358/pexels-photo-21533358.jpeg
datalad save --to-git -m "Add a readme" README
datalad save -m "Add a photo of a penguin" pexels-photo-21533358.jpeg
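Add a Git remote pointing to a valid address: for example, an empty Forgejo-aneksajo repository, or (for easier testing) a local bare Git repository. Use one of the two alternatives below, not both.
# Alternative 1: an empty repository on a Forgejo-aneksajo instance
git remote add origin https://hub.example.com/user/dataset.git
# Alternative 2: a local bare Git repository
git init --bare /tmp/foo.git
git remote add origin /tmp/foo.git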
Push (only the Git part, since we do not want to send unencrypted data) to ensure that the remote repository is initialized with an annex UUID.
datalad push --to origin --data nothing
At this point, the remote should be recognized by git-annex, and have an annex UUID.
You can check this with git annex info origin.
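For scripted checks, the UUID can also be read from the local Git configuration, where git-annex caches it as remote.<name>.annex-uuid. The following is a minimal sketch; the helper name annex_uuid_of is made up for illustration, and the cached value only reflects what git-annex learned the last time it talked to the remote.

```shell
# Print the annex UUID that git-annex has recorded for a remote (empty if none).
# git-annex caches it in the local Git configuration as remote.<name>.annex-uuid.
annex_uuid_of() {
    git config --get "remote.$1.annex-uuid"
}

# Example use inside the dataset (hypothetical helper, defined above):
#   if [ -n "$(annex_uuid_of origin)" ]; then
#       echo "origin has an annex UUID"
#   else
#       echo "no annex UUID recorded for origin yet"
#   fi
```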
If the UUID is present, the mask remote can be set up[3].
Replace the value of the keyid parameter with the ID (either e-mail or the numeric key ID) of your GPG key.
git annex initremote encrypted-origin type=mask remote=origin encryption=hybrid keyid=me@example.com
With both remotes configured, the annexed content and the Git content can be pushed separately, each to its own remote.
datalad push --to encrypted-origin
datalad push --data nothing --to origin
That’s it! The repository is a regular repository, but the annexed content cannot be accessed. The web interface will show the file tree, but report that the files are unavailable. Cloning the repository is possible, but getting the annexed content requires possession of a private GPG key for which the encryption has been enabled[4].
Automating
Remembering to push to the mask remote first, and to use --data nothing with the unencrypted remote, can be tedious and error-prone.
However, both of these behaviors can be automated.
The git-annex wanted mechanism (preferred content) can be used to state that no data is wanted by the unencrypted remote.
DataLad’s publication dependency can be used to say that the encrypted remote should be pushed to automatically when a push to origin is requested.
git annex wanted origin "exclude=*"
datalad siblings configure -s origin --publish-depends encrypted-origin
With this, it is sufficient to run datalad push --to origin[5].
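Both settings can be inspected afterwards. DataLad records publication dependencies in the local Git configuration under remote.<name>.datalad-publish-depends, and git annex wanted with only a remote name prints the current expression. A minimal sketch, with a helper name (publish_depends_of) made up for illustration:

```shell
# Print the publication dependencies that DataLad recorded for a sibling.
# They live in the local Git configuration under
# remote.<name>.datalad-publish-depends (one value per dependency).
publish_depends_of() {
    git config --get-all "remote.$1.datalad-publish-depends"
}

# Example use inside the dataset (hypothetical helper, defined above):
#   publish_depends_of origin    # should list encrypted-origin
#   git annex wanted origin      # prints the preferred content expression
```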
This can be taken even further. For example, it is possible to push only some annexed files in encrypted form. It is advisable to configure the encrypted and regular remotes with opposite settings to avoid storing the same content twice. For example, consider a situation where raw images are stored encrypted, but jpegs are not:
# Quote the expressions so that the shell does not try to expand the globs
git annex wanted origin "include=*jpeg"
git annex wanted encrypted-origin "include=*arw"
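To get a feeling for how such a split behaves, the two expressions can be mirrored with plain shell patterns. This is only a rough sketch: the helper name wanted_by and the file raw/IMG_0001.arw are made up for illustration, and shell pattern matching merely approximates how git-annex evaluates preferred content.

```shell
# Rough preview of which remote would want a file's content, using shell
# patterns that mirror the two expressions above (include=*jpeg, include=*arw).
# This only approximates git-annex's own matching.
wanted_by() {
    case "$1" in
        *jpeg) echo "origin" ;;
        *arw)  echo "encrypted-origin" ;;
        *)     echo "neither" ;;
    esac
}

# The .arw path is a hypothetical example file.
for f in pexels-photo-21533358.jpeg raw/IMG_0001.arw README; do
    printf '%s -> %s\n' "$f" "$(wanted_by "$f")"
done
```

Because the two patterns do not overlap, no file is wanted by both remotes, which is what avoids duplicated storage.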
Looking under the hood
Although we used Forgejo as an example (due to its potential relevance), the walkthrough is valid for any “git+annex” repository. Using a local bare Git repository is a good way to “look under the hood” and see that the annex keys in the bare repository are indeed stored encrypted, with HMAC hashing of the filenames.
The example below shows the annex objects directory with three files, two encrypted and one unencrypted.
❱ tree ../foo.git/annex/objects
../foo.git/annex/objects
├── 39e
│   └── bcf
│       └── GPGHMACSHA1--2c4b73cff92c9d186065f338551f8c5c8ec94b5b
│           └── GPGHMACSHA1--2c4b73cff92c9d186065f338551f8c5c8ec94b5b
├── 727
│   └── f52
│       └── GPGHMACSHA1--fbcb8f7931cb90c8fa9930368bfae671f36fec6b
│           └── GPGHMACSHA1--fbcb8f7931cb90c8fa9930368bfae671f36fec6b
└── a45
    └── ab5
        └── SHA256E-s2273030--fbeaece1f05264c39781adeac159ba33fbc1f4d2aea273106600bc4e3e3aa6c7.jpeg
            └── SHA256E-s2273030--fbeaece1f05264c39781adeac159ba33fbc1f4d2aea273106600bc4e3e3aa6c7.jpeg
In Forgejo, the layout would be identical.
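For a larger repository, the same inspection can be summarized instead of read by eye. The sketch below counts object files by name prefix; the helper name summarize_annex_objects is made up for illustration, and it assumes that encrypted keys always start with GPGHMAC, as they do in the listing above.

```shell
# Count annex object files under a directory, split by key name prefix.
# Assumption: keys written through an encrypted remote start with "GPGHMAC".
summarize_annex_objects() {
    objects_dir="$1"
    encrypted=$(find "$objects_dir" -type f -name 'GPGHMAC*' | wc -l)
    plain=$(find "$objects_dir" -type f ! -name 'GPGHMAC*' | wc -l)
    # Arithmetic expansion strips any padding that wc adds on some systems.
    echo "encrypted: $((encrypted)), unencrypted: $((plain))"
}

# Example use (path as in the listing above):
#   summarize_annex_objects ../foo.git/annex/objects
```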
[1] A bare Git repository is a repository which does not contain a working directory. It can be cloned from and pushed to. See Git on the server in the Pro Git book.
[2] Git-annex does have a git special remote, but its purpose is to record a Git remote URL in the repository (normally, a Git remote URL is a local configuration) and optionally autoenable it. It does not provide a way to set up encryption.
[3] In Forgejo-aneksajo versions prior to 11.0-0-git-annex2, the UUID was not always assigned immediately; see discussion in forgejo-aneksajo/issues/22. A workaround was to perform any action (including viewing the page) in the web interface. If the UUID is empty, visit the repository page in a browser (or refresh it) and try git annex info origin again.
[4] In the hybrid encryption scheme, data is encrypted using a symmetric cipher, and that cipher is asymmetrically encrypted (to selected GPG keys) and stored in the Git repository. This means that new recipient keys can be added without having to re-encrypt the data. See the documentation for git annex encryption.
[5] Or git annex push, which by default pushes to all available remotes.