Contributing#
Contributing to the repository beyond the instructions above is currently a dark art. A start is the description below. We hope to improve these docs over time.
Updating the version#
We use the bump
GitHub action to control the updating of our repository's version.
As a result, you shouldn't need to update the repository's version information by hand.
Under the hood, this action uses the command-line tool in
python-packages/input4MIPs-CVs/src/input4MIPs_CVs/cli/version.py
.
This command-line tool ensures that the version is applied to all relevant places in the repository
and also provides an interface to bump the version.
Creating a Python virtual environment (venv
)#
Many of the steps described below require that a custom Python virtual environment is available. Rather than repeat these steps, we will document how to create this as an initial step, and thereafter, it can be activated and used repeatedly.
- Make a virtual environment (e.g.
python3 -m venv venv
) - Activate the virtual environment
(e.g.
source venv/bin/activate
- be careful not to activate multiple envs at once!) - Install the requirements into the environment
(e.g.
pip install -r dev-requirements.txt
)
There may be a need to update this environment ocassionally, however, in practice this should work fine once it is generated.
Follow the prompts to update pip
or other dependencies if prompted during install or execution.
The database#
In Database/input-data
there are three components:
- The file
esgf-input4MIPs.json
- The directory
Database/input-data/pmount
- The file
Database/input-data/supplementary-source-id-info.yaml
Database/input-data/esgf-input4MIPs.json
is a scrape of information from the ESGF index.
This captures the latest set of information we have queried from the ESGF index database.
It is generated via a GitHub action that automatically creates pull requests if new files have been published.
If you need to run it manually, you can run it with the steps below:
- If the
venv
virtual environment doesn't exist, or isn't activated - create/activate it (see "Creating a Python virtual environment (venv
)") - Run the script e.g.
python python-packages/input4MIPs-CVs/src/input4MIPs_CVs/cli/update-esgf-scrape.py --out-file Database/input-data/esgf-input4MIPs.json --n-threads 4
Database/input-data/pmount
contains a number of JSON files.
Each file contains information about one file
from the raw netCDF files in the input4MIPs project.
The raw netCDF files are stored elsewhere.
The database entries are managed using the scripts in
scripts/pmount-database-generation
.
See scripts/pmount-database-generation/README.md
for further details.
Database/input-data/supplementary-source-id-info.yaml
contains supplementary information about our data.
This is data that cannot be scraped from ESGF or the files,
usually because it is only known after publication of the data
(e.g. reasons for later deprecation of the data).
At the moment, the fixes are applied at the source ID level.
If you need finer-grained control, add in a new source.
At present, we are tracking all of these inputs as part of this repository. This is ok for now, as the data is relatively small. This may not scale, so if we get to a certain size, we may have to pick a different approach.
The data from these three inputs,
plus information from the Controlled Vocabularies (CVs),
gets combined to create Database/input4MIPs_db_file_entries.json
.
This combination is done using python-packages/input4MIPs-CVs/src/input4MIPs_CVs/cli/update-database.py
.
In order to run this script, you should:
- If the
venv
virtual environment doesn't exist, or isn't activated - create/activate it (see "Creating a Python virtual environment (venv
)") - Run the script e.g.
python python-packages/input4MIPs-CVs/src/input4MIPs_CVs/cli/update-database.py --repo-root-dir .
Generating the HTML pages#
Having generated the database, we can then generate the HTML views of it.
Currently, the HTML pages are generated using
python-packages/input4MIPs-CVs/src/input4MIPs_CVs/cli/update-html-pages.py
.
In order to run this script, you should:
- If the
venv
virtual environment doesn't exist, or isn't activated - create/activate it (see "Creating a Python virtual environment (venv
)") - Run the script e.g.
python python-packages/input4MIPs-CVs/src/input4MIPs_CVs/cli/update-html-pages.py --repo-root-dir .
The version is automatically read out of the VERSION
file if it is not directly specified.
Summary of steps required to update this repository when a new data source is published on ESGF#
This is based on the sections above. The paths assume you are working on perlmutter. If you are working elsewhere, you may need to modify the paths slightly.
- Checkout a new branch from main
- Update the ESGF scrape:
python scripts/pollESGF.py Database/input-data/esgf-input4MIPs.json
- you may need to have an environment activated with needed requirements (e.g.
requests
before you can run this) - alternatively on perlmutter copy the existing file:
cp /PATH-TO-DATA-ROOT/input4MIPs/esgf-input4MIPs.json Database/input-data/esgf-input4MIPs.json
- you may need to have an environment activated with needed requirements (e.g.
- Activate an environment in which
input4mips-validation
is installed - Update the database by adding the tree you're interested in.
Do this by running the following command from the root of this repository:
bash scripts/pmount-database-generation/db-add-tree.sh <root-of-tree-to-add>
e.g.bash scripts/pmount-database-generation/db-add-tree.sh /PATH-TO-DATA-ROOT/input4MIPs/CMIP6Plus/CMIP/UofMD/
. On perlmutter, this is something likebash scripts/pmount-database-generation/db-add-tree.sh /global/cfs/projectdirs/m4931/gsharing/user_pub_work/input4MIPs/...
- (Not compulsory, but recommended because it makes it easier to see changes later) Commit the changes to the database
- If needed, add the source ID entry for the new files to
CVs/input4MIPs_source_id.json
- Activate an environment which has the local requirements installed
- Make a virtual environment (e.g.
python3 -m venv venv
) - Activate the virtual environment (e.g.
source venv/bin/activate
- be careful not to activate multiple envs at once!) - Install the requirements into the environment (e.g.
pip install -r dev-requirements.txt
)
- Make a virtual environment (e.g.
- Update the database:
python python-packages/input4MIPs-CVs/src/input4MIPs_CVs/cli/update-database.py --repo-root-dir .
- If needed, add a reason for the retraction/deprecation of the previous data set in
Database/input-data/supplementary-source-id-info.yaml
- If needed, add a reason for the retraction/deprecation of the previous data set in
- Update the HTML pages:
python python-packages/input4MIPs-CVs/src/input4MIPs_CVs/cli/update-html-pages.py --repo-root-dir .
- If you get an error about a retracted publication status,
you'll need to edit the latest source ID being used for a given dataset
in
docs/dataset-info/delivery-summary.json
.
- If you get an error about a retracted publication status,
you'll need to edit the latest source ID being used for a given dataset
in
- Check that the HTML has updated as expected (e.g. the summary view has updated as expected, new datasets are in the datasets view, new files are in the files view)
- Commit everything
- Build the docs:
mkdocs build --strict
-
Check that the docs updated as expected. A few of the auto-generated components are worth checking here:
- are the source IDs for the dataset up to date?
E.g. do we need to update the source IDs to be used for the various CMIP7 phases in
docs/dataset-info/cmip7-phases-source-ids.json
or the delivery summary indocs/dataset-info/delivery-summary.json
? - did the relevant documentation page (e.g.
docs/dataset-overviews/population.md
) update correctly? If yes, there is an issue. Check the page carefully e.g. thesource_id_stub
at the top of the page (figuring out the logic here will likely require stepping through the python as it is still an evolving process). - did the revision history come through correctly? If not, there is an issue.
- are the source IDs for the dataset up to date?
E.g. do we need to update the source IDs to be used for the various CMIP7 phases in
-
Commit everything
- Push
- Make a pull request
- Request a review from @znichollscr and/or @durack1
- Update based on review
- Merge
- Tag @eleanororourke and @vnaik60 so they know a new update is live (e.g. "Hi @eleanororourke @vnaik60 just making sure you've seen this, thanks!")
- Celebrate
- Begin work on your update on v-next
Relationship to input4MIPs validation#
This repository contains the database and Controlled Vocabularies (CVs). The input4MIPs validation package implements the logic for validating data, based on the CVs. The two are deliberately decoupled, to allow the logic captured within input4IMPs validation to potentially be reused in other parts of the CMIP universe in future. We have a CI job which checks that the CVs in this repository can be loaded using input4MIPs validation. If this job fails, it is ok to still merge the merge requests. It is just a reminder to us that we have to update input4MIPs validation to support whatever changes to the CVs have been made.