Contributing

If you want to contribute to DBnomics by adding support for a new data provider, or by enhancing an existing one, please read the dedicated sections below.

The DBnomics core team is open to external contributions, and we are making ongoing efforts to ease the contribution path. Thank you for your interest in DBnomics!

Preliminary advice

Avoid redundant work

Suppose you want to add a new dataset of a specific provider to DBnomics. To avoid redundant work, first check that the dataset is not already available in DBnomics.

When adding a completely new fetcher, the question obviously does not arise. But it is not always as straightforward as it seems: a provider may publish a particular dataset in several places on its website, possibly under different names.

In case of doubt, just ask the DBnomics community on the forum.

Respect the license and terms of use

Only open data or data under a permissive license can be contributed to DBnomics.

Please double-check that the license of the source data is permissive enough when writing a fetcher.

See also: Can I have my private data on DBnomics?

Should I write a new fetcher or contribute to an existing one?

When adding support for a completely new provider, create a new fetcher.

When fixing errors in existing data, submit a patch to the corresponding fetcher via a GitLab merge-request.

When adding new datasets to an existing provider, it depends. If you feel confident with the source code of the existing fetcher and your changes fit well, go ahead and submit a merge-request. If you feel more comfortable working in your own project, create a personal Git repository instead.

When several fetchers write data for the same provider, they should each have their own source-data repository but share the same json-data repository. This case has not been encountered so far.

In case of doubt, just ask the DBnomics community on the forum.

Contributing to an existing fetcher

When contributing to an existing fetcher, you should respect the code style and use code quality tools defined by its maintainers.

In case of error, the maintainer of the fetcher will have to fix it, so the contributed changes should be crystal clear to them.

And of course, the data produced by your changes must be valid (cf. the acceptance process section below).

Acceptance process

When a fetcher is ready to be submitted, its author can follow this process in order for the fetcher to be accepted, deployed to production, and its data made visible on the DBnomics website.

One of the main conditions for a fetcher to be accepted is that it produces valid data. Follow the task: Validate data produced by a fetcher. Once the data is valid, the fetcher can be submitted to the DBnomics core team.

Criteria to meet

Technically, the fetcher must meet some criteria in order to run as an automated job.

Mandatory:

  • the fetcher must be installable from a fresh virtual env: commit dependencies (requirements.in) and locked versions (requirements.txt) as explained in this section
  • download.py and convert.py must be executable with the common script arguments (like all fetchers do) as explained in the download and convert sections
  • convert.py must produce data that is valid towards DBnomics data model, as explained in this section
  • the fetcher must define a license as explained in this section

Optionally, to ease contribution:

  • comments, variable and function names in source files should be written in English
  • the Python source files should be formatted automatically as explained in this section
  • the Python source files should be linted as explained in this section

Submit the fetcher

The contributor can create an account on the GitLab instance of DBnomics (click on register).

By default, to avoid spam, new accounts are created as external users, who cannot create repositories. The contributor can send an email to contact@db.nomics.world to ask for the external status to be removed.

The contributor can then create a personal repository and open a new issue on the project issue board to tell the DBnomics core team about the new repository.

Developer review

At first, the source code is simply executed by the DBnomics core team, not truly audited: it is treated as a black box. Only the validity of the data produced by convert.py is checked.

If the data is valid, the fetcher is deployed manually to the DBnomics pre-production instance by a developer of the core team.

Economist review

An economist of the core team will check the fetcher's data on the pre-production instance and compare it to the source data available on the provider's website.

If there are problems, they will be discussed in the issue opened earlier.

If everything is OK, then a core team developer will deploy the new fetcher to production.

Production and maintenance

Deploying a fetcher to production consists of configuring a pipeline of jobs scheduled to run the fetcher every day. This is done by a developer of the core team.

After running in production for a while, fetchers often break, and this can hardly be avoided. In order to keep data up to date on DBnomics, it is recommended to perform fetcher maintenance as quickly as possible.

Each fetcher has a maintainer, who by default is its author. The dashboard lists them.

In case of a problem with a fetcher, an issue is created and assigned to its maintainer, who is responsible for solving it. DBnomics core members will look at the issue as a second priority.

Also, if questions are asked on the forum about a fetcher, its maintainer is expected to participate in the discussion. DBnomics core members will participate as well.

See also: Why do fetchers break after a while?

Writing a new fetcher

Install and configure environment

Here we use a classic Python workflow with pip and virtualenv. See also: Virtualenv section of The Hitchhiker’s Guide to Python!

The recommended Python version is the latest stable.

Initialize a new project from dbnomics-fetcher-cookiecutter.

pip install virtualenv
virtualenv my-fetcher
source my-fetcher/bin/activate

pip install cookiecutter
git clone https://git.nomics.world/dbnomics/dbnomics-fetcher-cookiecutter.git
cookiecutter dbnomics-fetcher-cookiecutter

# Prepare directories to write data to.
mkdir source-data json-data

Download

The download script, named download.py, downloads data from the provider and writes it to a target directory named source-data. It expects the target directory to be empty.
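A minimal sketch of such a script might look like the following. Only the command-line contract (a single target-directory argument, which must be an existing empty directory) follows the conventions described here; the provider URL and the output file name are placeholders, not a real provider.

```python
# Hypothetical minimal download.py (the URL and file name are placeholders).
import argparse
import sys
import urllib.request
from pathlib import Path

SOURCE_URL = "https://example.org/data.csv"  # placeholder, not a real provider

def check_target_dir(target_dir: Path) -> None:
    """Fail early if the target directory is missing or not empty."""
    if not target_dir.is_dir():
        sys.exit(f"{target_dir} is not a directory")
    if any(target_dir.iterdir()):
        sys.exit(f"{target_dir} is not empty")

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("target_dir", type=Path,
                        help="directory where source data is written")
    args = parser.parse_args()
    check_target_dir(args.target_dir)
    # Download the provider file into the target directory.
    urllib.request.urlretrieve(SOURCE_URL, args.target_dir / "data.csv")

if __name__ == "__main__":
    main()
```

Real fetchers usually download many files and handle retries, but the target-directory contract stays the same.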

Good practices:

  • Your download script is a bot, similarly to search engine bots that index webpages. Respect the directives exposed by the robots.txt file of the provider website.
  • Write your script in a resilient way, so that it is less likely to break when the source data evolves. Of course, one cannot anticipate every possible change.
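Checking robots.txt can be done with the standard library's urllib.robotparser module. In this sketch, the user agent name, the rules and the URLs are illustrative examples:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Example rules: everything is allowed except the /private/ section.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"
allowed = is_allowed(ROBOTS_TXT, "my-fetcher-bot", "https://example.org/data.csv")
blocked = is_allowed(ROBOTS_TXT, "my-fetcher-bot", "https://example.org/private/report.csv")
```

In a real fetcher you would load the provider's actual robots.txt (RobotFileParser can also fetch it by URL via set_url and read) before downloading.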

Run the download script:

python download.py source-data

Convert

The convert script, named convert.py, converts downloaded data from source-data to DBnomics data model and writes it to a target directory named json-data. It expects the target directory to be empty.

Run the convert script:

python convert.py source-data json-data
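A hypothetical minimal convert.py is sketched below. Only the command-line contract (source and target directory arguments, with an empty target) follows the conventions above; the output file name and JSON fields are illustrative placeholders, not an authoritative description of the DBnomics data model, which is documented separately.

```python
# Hypothetical minimal convert.py (output layout is illustrative only).
import argparse
import json
import sys
from pathlib import Path

def convert(source_dir: Path, target_dir: Path) -> None:
    raw = (source_dir / "data.csv").read_text()  # whatever download.py produced
    # ... transform `raw` into files following the DBnomics data model here ...
    (target_dir / "provider.json").write_text(json.dumps({"code": "EXAMPLE"}))

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("source_dir", type=Path)
    parser.add_argument("target_dir", type=Path)
    args = parser.parse_args()
    if any(args.target_dir.iterdir()):
        sys.exit(f"{args.target_dir} is not empty")
    convert(args.source_dir, args.target_dir)

if __name__ == "__main__":
    main()
```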

Validate data

Follow the task: Validate data produced by a fetcher.

Define a license

All source code that is published publicly should include a license file named LICENSE.

For example, for the AGPL-3.0:

wget https://www.gnu.org/licenses/agpl-3.0.txt -O LICENSE

The license file should be committed.

Code style and quality

In order to improve the maintainability of a fetcher, it is highly recommended to follow the code style and quality guidelines recommended by the DBnomics project. Indeed, once a fetcher fails in production, it is very likely that a member of the DBnomics maintenance team, rather than its original author, will handle the problem.

The recommended guidelines mainly follow Python good practices, with additional DBnomics-specific ones.

Declare dependencies with versions

Your fetcher may use external Python packages that are installed in your virtualenv with pip install.

It is highly recommended to pin the version numbers of those dependencies in requirements.txt. Merely listing package names is not sufficient: someone installing them later would get whatever versions are available at that time, and those may behave differently from the ones you worked with. It is also important to pin versions recursively, including the dependencies of your dependencies.

There are several solutions in the Python community to achieve version pinning. DBnomics fetchers usually use pip-tools like this:

# requirements.in
python-slugify
ujson

# Then, in a shell:
pip install pip-tools
pip-compile

The following file is generated:

# requirements.txt
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile
#
python-slugify==4.0.0     # via -r requirements.in
text-unidecode==1.3       # via python-slugify
ujson==2.0.3              # via -r requirements.in

Both requirements.in and requirements.txt must be committed.

Format your code automatically

Format Python source code with an opinionated formatter and ship the configuration you used along with the source code.

This almost completely avoids commits whose only changes are formatting, and a clean source code history makes it easier to find bugs.

Configure your source code editor to use that formatter.

For example, use Black, which is supported by the VSCode Python extension, and enable its format-on-save feature.
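For instance, a Black configuration can be committed in pyproject.toml so that every contributor formats the code the same way. The line length shown here is an arbitrary project choice, not a DBnomics requirement:

```toml
# pyproject.toml (fragment)
[tool.black]
line-length = 100
```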

Use a linter

A linter is a tool that catches common errors in the source code.

In Python, one of the most widely used is flake8. It supports plugins that catch almost every imaginable problem. For example, it can detect unused variables or unused imports.

Source code editors can take advantage of linters to highlight errors directly under the lines of the source code.

Use Python types

It is highly recommended to use Python type annotations in the source code of a fetcher. This will improve source-code editor instrumentation such as auto-completion and tooltips, and help catch errors.

This mainly consists of using mypy in your source code editor.
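As a small illustration of annotated fetcher code, the Series dataclass and function below are hypothetical examples, not part of any DBnomics API; mypy (run as `mypy convert.py` or through editor integration) can then verify that callers respect these types.

```python
from dataclasses import dataclass

@dataclass
class Series:
    code: str
    observations: list[tuple[str, float]]  # (period, value) pairs

def max_value(series: Series) -> float:
    """Return the largest observation value of a series."""
    return max(value for _, value in series.observations)

s = Series(code="AFROMOTD", observations=[("2013-11-11", 1.5), ("2013-11-12", 2.0)])
```

With these annotations, passing e.g. a plain dict instead of a Series is flagged by mypy before the fetcher ever runs.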

Follow common conventions

Some well-known Python libraries like Pandas propose a de-facto standard way to name variables, like df for data-frames.

DBnomics recommends following those conventions.

Examples:

  • use df instead of dataframe
  • use prices_df instead of prices

Do not abbreviate data model concepts

DBnomics data model defines concepts such as provider, dataset, time series, dimension or observation. Do not abbreviate those terms.

Examples:

  • use dataset_info instead of ds_info
  • use dimensions instead of dim_dict
  • use current_observation instead of current_obs

Plural of time series

In English, the word "series" is invariable: it has the same form in the singular and the plural.

In order to distinguish a single series from a list of series:

  • name a single series series
  • name a list of series series_list

Submit your fetcher

Follow the acceptance process.

Tasks

Report problems with data

If you notice wrong data on the website, you can help by contributing at different levels.

First of all, you can tell the DBnomics core team about the problem by creating a new issue and filling in the template named "Problem with data". This template contains placeholders that you can replace with real values. The idea is to give as many details as possible to help the DBnomics team investigate!

Then you can try to solve the issue yourself if you'd like to. Once you have identified the source code repository of the fetcher, you can fork it and submit a merge-request. We recommend doing that after a discussion with the DBnomics core team in the issue you created.

In any case: thank you for your contribution!

Validate data produced by a fetcher

Suppose you have just finished writing or fixing a fetcher and would like to check the validity of the data produced by convert.py. Run your fetcher if you have not already done so:

mkdir source-data json-data
python download.py source-data
python convert.py source-data json-data

Now install the validation script and run it:

pip install dbnomics-data-model
dbnomics-validate --all-series --all-observations --developer-mode json-data

Example output:

- Series "RBA/A3-4/AFROMOTD" at location AFROMOTD.tsv (line 3)
  Error code: duplicated-observations-period
  Message: Duplicated period
  Context:
    period: '2013-11-11'

- Series "RBA/A3-4/AFROMOTD" at location AFROMOTD.tsv (line 5)
  Error code: duplicated-observations-period
  Message: Duplicated period
  Context:
    period: '2013-11-12'

[...]

Encountered errors codes:
    - duplicated-observations-period: 12448

At the end of the output you'll find a summary of the count of errors by type.

The --developer-mode option displays all errors, in particular non-fatal ones, in order to improve the quality of your fetcher. In production, this option is omitted to speed up validation.

If your fetcher writes a huge quantity of data, you can remove the --all-series option to validate only a randomly chosen sample of series per dataset. You can also remove the --all-observations option to validate only a few observations per series.

View data produced by a fetcher in a local instance of DBnomics

See the dbnomics-docker project.