
Quality Assurance

Quality is not optional at OpenPlanetData: every dataset goes through rigorous validation before release. This page documents our QA processes and how we measure data quality. Five principles guide the work:

  1. Accuracy over speed - We don’t rush releases at the expense of quality
  2. Measurable quality - We define and track quality metrics
  3. Automated testing - Most checks run automatically in CI
  4. Human oversight - Critical changes require manual review
  5. Continuous improvement - We learn from errors and improve processes

Every dataset update goes through these validation stages:

Schema validation ensures the data structure is correct:

# Example schema check
- All required fields are present
- Field types are correct (string, number, etc.)
- Field values match expected patterns (ISO codes, etc.)
- No unexpected/extra fields
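The checklist above can be sketched as a small validator. The field names, types, and patterns here are illustrative assumptions, not the actual OpenPlanetData schema:

```python
import re

# Hypothetical schema: required fields mapped to simple validators
# (illustrative only; the real schema has many more fields).
SCHEMA = {
    "code": lambda v: isinstance(v, str) and re.fullmatch(r"[A-Z]{2}", v),
    "name": lambda v: isinstance(v, str) and len(v) > 0,
    "population": lambda v: isinstance(v, int) and v >= 0,
}

def validate_record(record):
    """Return a list of schema violations for one record."""
    errors = []
    for field, check in SCHEMA.items():
        if field not in record:
            errors.append(f"missing required field: {field}")
        elif not check(record[field]):
            errors.append(f"invalid value for {field}: {record[field]!r}")
    # No unexpected/extra fields
    for field in record:
        if field not in SCHEMA:
            errors.append(f"unexpected field: {field}")
    return errors
```

A clean record returns an empty error list; anything else is reported field by field, which makes CI failures easy to read.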

Completeness validation verifies data coverage:

  • All expected entries are present (e.g., all 249 countries)
  • Required related data is linked correctly
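A completeness check can be as simple as comparing the set of codes present against a canonical list. `EXPECTED_CODES` below is a stand-in for the full 249-entry ISO 3166-1 list:

```python
# Stand-in for the full 249-code ISO 3166-1 alpha-2 list (illustrative subset).
EXPECTED_CODES = {"FR", "JP", "DE", "BR"}

def check_completeness(records):
    """Report expected codes that are absent from the dataset."""
    present = {r["code"] for r in records}
    return sorted(EXPECTED_CODES - present)
```

An empty result means full coverage; otherwise the missing codes are listed for the CI report.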

Consistency validation checks that the data agrees with itself:

  • Cross-references validate (country codes match between datasets)
  • Derived fields calculate correctly
  • No duplicate entries
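The cross-reference and duplicate checks might look like the following sketch. The record shapes (countries with a `code`, cities with a `country` reference) are assumptions for illustration:

```python
from collections import Counter

def check_consistency(countries, cities):
    """Duplicate and cross-reference checks (record shapes are illustrative)."""
    errors = []
    country_codes = {c["code"] for c in countries}
    # No duplicate country entries
    for code, n in Counter(c["code"] for c in countries).items():
        if n > 1:
            errors.append(f"duplicate country code: {code}")
    # Every city must reference a known country code
    for city in cities:
        if city["country"] not in country_codes:
            errors.append(
                f"city {city['name']} references unknown country {city['country']}"
            )
    return errors
```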

Accuracy testing compares entries against known ground truth:

# Example accuracy test
known_countries = {
    "FR": {"name": "France", "capital": "Paris"},
    "JP": {"name": "Japan", "capital": "Tokyo"},
    # ... more test cases
}
for code, expected in known_countries.items():
    result = lookup(code)
    assert result.name == expected["name"]
    assert result.capital == expected["capital"]

Regression testing compares each release with previous versions:

  • Detects unexpected changes in stable data
  • Flags large changes for review
  • Tracks accuracy metrics over time
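A regression diff over stable fields can be sketched as below. The assumption here (not from the source) is that each release is a mapping from entry code to record:

```python
def regression_diff(previous, current, stable_fields=("code", "name")):
    """Flag changes to fields expected to be stable between releases.

    `previous` and `current` map entry code -> record (illustrative shape).
    """
    flagged = []
    for code, old in previous.items():
        new = current.get(code)
        if new is None:
            flagged.append(f"{code}: entry disappeared")
            continue
        for field in stable_fields:
            if old.get(field) != new.get(field):
                flagged.append(
                    f"{code}: {field} changed {old.get(field)!r} -> {new.get(field)!r}"
                )
    return flagged
```

Anything flagged here is routed to manual review rather than failing the build outright, since some changes (renamed territories, revised boundaries) are legitimate.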

We track and publish quality metrics for each dataset:

Metric                Target   Current
Completeness          100%     100%
Code Accuracy         100%     100%
Name Accuracy         > 99%    99.6%
Coordinate Accuracy   > 95%    97.8%
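For concreteness, the completeness figure is just the fraction of expected entries that are present; a minimal sketch (the helper name and signature are assumptions):

```python
def completeness_pct(records, expected_total=249):
    """Percent of expected entries present (249 per ISO 3166-1)."""
    present = {r["code"] for r in records}
    return round(100 * len(present) / expected_total, 1)
```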

We maintain test datasets for accuracy validation:

All 249 countries/territories with verified:

  • Official codes from ISO
  • Names from UN
  • Capitals from official sources

When issues are detected:

  1. Automated alerts - CI fails and notifies maintainers
  2. Issue creation - Tracking issue created automatically
  3. Root cause analysis - Investigate source of error
  4. Fix and verify - Correct the issue and add test case
  5. Post-mortem - Document lessons learned

Found an error in our data? Help us improve:

For a specific incorrect value, open an issue with:

  • Dataset and version affected
  • Specific incorrect data
  • Correct value with source/evidence

For a systematic issue affecting many entries, open an issue with:

  • Description of the pattern
  • Examples of affected data
  • Suggested fix or investigation approach

We maintain a public changelog of quality improvements:

2024-01 - Improved city-level accuracy by 3% via new source
2024-02 - Fixed timezone data for 12 edge-case regions
2024-03 - Added validation for boundary dataset consistency

See the CHANGELOG.md in each repository for full history.

We’re always working to improve data quality:

  • Monthly reviews - Analyze error reports and metrics
  • Source evaluation - Assess new potential sources
  • Test expansion - Add new test cases from error reports
  • Process refinement - Improve pipeline based on lessons learned