Design goals

Redistribute data from providers as-is

We want our users to be confident that the data found on DBnomics is the provider's data, kept as-is. At the same time, we want to spare them from dealing with each provider's data representation specificities.

As a consequence, DBnomics distinguishes data from its format, and simplifies format only.

If DBnomics simplified or harmonized provider data, this would require manual work (i.e. data curation), which is incompatible with DBnomics automatic data fetching (see next section). It would also make it impossible for the user to know what the provider data originally was. Data curation is therefore left to the user.

The following items are kept as-is from the provider:

  • time series and their observations
  • dataset dimensions: DBnomics does not harmonize dimension names and values.
  • NA (non-available) values usage: DBnomics does not add or remove them. If a provider distributes a time series with an incomplete calendar (with some missing periods), DBnomics does not try to complete it.

However some data formatting is harmonized:

  • periods: providers use different codes to represent them (e.g. 202001 or 2020M01 for January 2020). DBnomics always uses 2020-01. See below for all period formats.
  • NA (non-available) values: some providers use NaN, others use -9999, etc. DBnomics always uses NA.
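The NA harmonization described above can be sketched as follows. This is only an illustration, not the actual fetcher code: the names PROVIDER_NA_MARKERS, normalize_value and normalize_observations are hypothetical, and the set of markers is assumed to be known per provider.

```python
# Hypothetical list of NA markers for one provider (an assumption for this sketch).
# A real fetcher would define its own list from its knowledge of the provider data.
PROVIDER_NA_MARKERS = {"NaN", "-9999"}


def normalize_value(raw, na_markers=PROVIDER_NA_MARKERS):
    """Return "NA" for a provider NA marker, otherwise the value parsed as a float."""
    text = raw.strip()
    if text in na_markers:
        return "NA"
    return float(text)


def normalize_observations(observations, na_markers=PROVIDER_NA_MARKERS):
    """Normalize the values of (period, raw_value) pairs.

    Only the representation of NA changes: no period is added or removed,
    so an incomplete calendar stays incomplete.
    """
    return [(period, normalize_value(value, na_markers)) for period, value in observations]
```

Note that the sketch only rewrites the NA marker: a series with a missing period (e.g. 2020-02 absent between 2020-01 and 2020-03) comes out with the same missing period.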

Some providers distribute time series with no observation, or with only NA values, and DBnomics keeps them as-is as well. Here are some examples:

Update data regularly

We want up-to-date data on DBnomics, so data has to be updated automatically.

Data acquisition is done by small programs called DBnomics fetchers which are run automatically by the DBnomics platform.

Any manual data acquisition (e.g. copy-pasting values from a spreadsheet) would lead to outdated data.

We also want to keep track of the execution of fetchers, and that's why we have a dashboard.

Keep versions of provider data

Access data from programming languages

Access data from external software

Harmonized data model

Period format

Dimensions are provided as-is from provider data.

Period format is normalized:

  • YYYY for years
  • YYYY-MM for months (e.g. 2000-01, 2000-11)
  • YYYY-MM-DD for days (MUST be padded for MM and DD)
  • YYYY-Q[1-4] for year quarters (e.g. 2018-Q1 represents January to March 2018, and 2018-Q4 represents October to December 2018)
  • YYYY-S[1-2] for year semesters, aka bi-annual, semi-annual (e.g. 2018-S1 represents January to June 2018, and 2018-S2 represents July to December 2018)
  • YYYY-B[1-6] for pairs of months, aka bi-monthly (e.g. 2018-B1 represents January + February 2018, and 2018-B6 represents November + December 2018)
  • YYYY-W[01-53] for year weeks (MUST be padded)
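The formats listed above can be checked with regular expressions. The sketch below is one possible way to do it, not the validation used by DBnomics itself; the names PERIOD_PATTERNS and detect_period_format, and the labels such as "bimester", are hypothetical.

```python
import re

# One pattern per normalized period format listed above.
# The "day" pattern checks padding and ranges only, not calendar validity
# (e.g. 2000-02-31 is accepted by this sketch).
PERIOD_PATTERNS = {
    "year": re.compile(r"[0-9]{4}"),
    "month": re.compile(r"[0-9]{4}-(0[1-9]|1[0-2])"),
    "day": re.compile(r"[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])"),
    "quarter": re.compile(r"[0-9]{4}-Q[1-4]"),
    "semester": re.compile(r"[0-9]{4}-S[12]"),
    "bimester": re.compile(r"[0-9]{4}-B[1-6]"),
    "week": re.compile(r"[0-9]{4}-W(0[1-9]|[1-4][0-9]|5[0-3])"),
}


def detect_period_format(period):
    """Return the name of the matching format, or None if the period is not normalized."""
    for name, pattern in PERIOD_PATTERNS.items():
        if pattern.fullmatch(period):
            return name
    return None
```

With these patterns, unpadded values such as 2000-1 or 2018-W1 are rejected, which matches the padding requirements stated in the list.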

Normalization is done by each fetcher based on the knowledge of the provider data. For example, a period like 2000-qII would be normalized as 2000-Q2 by the conversion script of the fetcher.
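As an illustration of such a conversion, here is a minimal sketch covering only the provider codes mentioned in this document (202001, 2020M01 and 2000-qII). The function name normalize_period and the ROMAN_QUARTERS table are hypothetical; a real fetcher handles whatever codes its provider actually uses.

```python
import re

# Roman numerals used by the hypothetical provider for quarters.
ROMAN_QUARTERS = {"I": 1, "II": 2, "III": 3, "IV": 4}


def normalize_period(raw):
    """Convert a provider period code to the DBnomics normalized format."""
    # Monthly codes: 202001 or 2020M01 become 2020-01.
    match = re.fullmatch(r"([0-9]{4})M?(0[1-9]|1[0-2])", raw)
    if match:
        return f"{match.group(1)}-{match.group(2)}"
    # Quarterly codes with Roman numerals: 2000-qII becomes 2000-Q2.
    match = re.fullmatch(r"([0-9]{4})-q(I|II|III|IV)", raw)
    if match:
        return f"{match.group(1)}-Q{ROMAN_QUARTERS[match.group(2)]}"
    # Fail loudly rather than guess: an unknown code needs a decision by the fetcher author.
    raise ValueError(f"Unsupported period code: {raw!r}")
```

Raising an error on unknown codes is a design choice of this sketch: it avoids silently publishing a period that was guessed incorrectly.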

Note: when the periods of a time series are written in a daily format but the series has a lower frequency (e.g. monthly), the period format is simplified to match the frequency. For example, periods like 2000-01-01, 2000-02-01, 2000-03-01 are simplified as 2000-01, 2000-02, 2000-03, but periods like 2000-01-15, 2000-02-15, 2000-03-15 can't be simplified because we would lose the day information they convey.
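The rule in this note can be sketched as follows, assuming the caller already knows that the series is monthly. The function name simplify_daily_to_monthly is hypothetical, and the sketch treats "all periods fall on the first day of the month" as the condition under which no information is lost.

```python
from datetime import date


def simplify_daily_to_monthly(periods):
    """Simplify YYYY-MM-DD periods of a monthly series to YYYY-MM when safe.

    Assumption of this sketch: the series is known to be monthly.
    The periods are simplified only if every one falls on the first day
    of its month; otherwise they are returned unchanged, because the day
    carries information.
    """
    days = [date.fromisoformat(period) for period in periods]
    if all(day.day == 1 for day in days):
        return [f"{day.year:04d}-{day.month:02d}" for day in days]
    return list(periods)
```

Applied to the two examples of the note, the first list (2000-01-01, 2000-02-01, 2000-03-01) is shortened to months, while the second (2000-01-15, 2000-02-15, 2000-03-15) is returned as-is.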

Support different data models

DBnomics defines a data model inspired by SDMX, which has to be compatible with all supported providers, even if their own data model is not SDMX-compliant.

As a consequence, the DBnomics data model defines some hard constraints, while other constraints have to be soft (cf. data model).