How We Build Datasets
At OpenPlanetData, transparency is a core principle. This page explains exactly how we build, validate, and maintain our datasets so you can trust the data you’re using.
Pipeline Overview
Our data pipeline follows a consistent process across all datasets:
┌─────────────┐   ┌──────────────┐   ┌────────────┐   ┌──────────────┐
│   Collect   │ → │   Process    │ → │  Validate  │ → │   Publish    │
│   Sources   │   │   & Clean    │   │   & Test   │   │   Release    │
└─────────────┘   └──────────────┘   └────────────┘   └──────────────┘
1. Source Collection
We aggregate data from multiple authoritative sources. Each source is:
- Documented - We record which source provides which data.
- Versioned - We track which version of the source data we’re using.
- Validated - We verify that the source is still authoritative and up to date.
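As an illustration of the three properties above, a source could be tracked with a small record like the one below. This is only a sketch: the `SourceRecord` class, its field names, and the 90-day re-verification window are hypothetical and are not taken from the actual source configurations.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceRecord:
    """One upstream source (hypothetical structure, for illustration only)."""
    name: str                  # documented: which source this is
    provides: tuple[str, ...]  # documented: which fields it supplies
    version: str               # versioned: snapshot or release label in use
    last_verified: date        # validated: when it was last confirmed authoritative


def is_stale(source: SourceRecord, today: date, max_age_days: int = 90) -> bool:
    """Return True if the source has not been re-verified within max_age_days.

    The 90-day default is an assumed value, not a documented policy.
    """
    return (today - source.last_verified).days > max_age_days
```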
2. Processing & Cleaning
Raw source data goes through several processing steps:
- Normalization - Convert data to consistent formats and encodings.
- Deduplication - Remove duplicate entries.
- Enrichment - Add derived fields where appropriate.
- Conflict Resolution - Handle cases where sources disagree.
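The normalization and deduplication steps might look roughly like the following. This is a minimal sketch, assuming records are flat dictionaries of strings and that duplicates are detected by a case-insensitive key; the function names, the `code` key, and the sample rows are made up for illustration and do not come from the real pipeline.

```python
import unicodedata


def normalize_record(record: dict[str, str]) -> dict[str, str]:
    """Apply Unicode NFC normalization and trim whitespace on every value."""
    return {
        key: unicodedata.normalize("NFC", value).strip()
        for key, value in record.items()
    }


def deduplicate(records: list[dict[str, str]], key_field: str) -> list[dict[str, str]]:
    """Keep the first record seen for each case-insensitive key value."""
    seen: set[str] = set()
    unique = []
    for record in records:
        key = record[key_field].casefold()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Illustrative input only: two spellings of the same entry plus one distinct entry.
raw = [
    {"code": " FR ", "name": "France"},
    {"code": "fr", "name": "France"},
    {"code": "DE", "name": "Germany"},
]
cleaned = deduplicate([normalize_record(r) for r in raw], key_field="code")
```

Normalizing before deduplicating matters here: without trimming, `" FR "` and `"fr"` would be treated as different keys.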
3. Validation & Testing
Every dataset update is validated before release:
- Schema Validation - Ensure data matches expected structure.
- Completeness Checks - Verify all expected entries are present.
- Accuracy Sampling - Test random samples against known ground truth.
- Regression Testing - Compare with previous versions for unexpected changes.
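Three of the checks above (schema, completeness, regression) can be sketched in a few lines. The schema, the `code` key, and the function names below are assumptions chosen for the example; the real validation tests live in the pipeline repositories and may work differently.

```python
# Hypothetical schema: field name -> expected Python type.
EXPECTED_SCHEMA = {"code": str, "name": str, "population": int}


def schema_errors(records: list[dict]) -> list[str]:
    """Return one message per missing field or wrongly typed value."""
    errors = []
    for index, record in enumerate(records):
        for field_name, expected_type in EXPECTED_SCHEMA.items():
            if field_name not in record:
                errors.append(f"record {index}: missing field '{field_name}'")
            elif not isinstance(record[field_name], expected_type):
                errors.append(
                    f"record {index}: '{field_name}' is not {expected_type.__name__}"
                )
    return errors


def missing_entries(records: list[dict], expected_codes: set[str]) -> set[str]:
    """Completeness check: expected keys that are absent from the dataset."""
    return expected_codes - {r["code"] for r in records if "code" in r}


def regression_report(previous: list[dict], current: list[dict]) -> dict[str, list[str]]:
    """Compare two versions keyed by 'code': added, removed, and changed entries."""
    old = {r["code"]: r for r in previous}
    new = {r["code"]: r for r in current}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(c for c in old.keys() & new.keys() if old[c] != new[c]),
    }
```

A regression report like this does not decide whether a change is wrong; it only surfaces differences so that unexpected ones can be reviewed before release.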
4. Publishing
Validated datasets are published as releases:
- Multiple Formats - JSON, CSV, Parquet as appropriate.
- Versioned Releases - Semantic versioning for tracking changes.
- Checksums - SHA256 hashes for integrity verification.
- Changelogs - Document what changed in each release.
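For the checksum step, a SHA256 hash of a release file can be computed with Python's standard `hashlib` module. The helper below is a sketch; the function name and chunk size are arbitrary choices, not part of any published tooling.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use constant, which helps with large Parquet or CSV files.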
Conflict Resolution
When different sources provide conflicting information, we follow a documented resolution process:
- Source Priority - Some sources are considered more authoritative for specific data types.
- Recency - More recent data typically takes precedence.
- Consensus - When multiple sources agree, that value is preferred.
- Manual Review - Edge cases are flagged for human review.
All conflict resolutions are logged and can be reviewed in our source repositories.
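One way the four rules could combine is shown below. The order in which the rules are tried (priority, then recency, then consensus, then manual review) is an assumption based on the order of the list above; the page does not state how the rules are actually ranked. The `Claim` class and `resolve` function are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """A value reported by one source (hypothetical structure)."""
    source: str
    value: str
    priority: int   # higher means more authoritative for this data type
    observed: date  # when the source reported the value


def resolve(claims: list[Claim]) -> tuple[Optional[str], str]:
    """Return (value, rule_used); value is None when flagged for manual review."""
    if not claims:
        raise ValueError("no claims to resolve")

    # 1. Source priority: the most authoritative sources agree with each other.
    top_priority = max(c.priority for c in claims)
    top = [c for c in claims if c.priority == top_priority]
    if len({c.value for c in top}) == 1:
        return top[0].value, "source_priority"

    # 2. Recency: among equally authoritative sources, the newest report wins.
    newest = max(c.observed for c in top)
    latest = [c for c in top if c.observed == newest]
    if len({c.value for c in latest}) == 1:
        return latest[0].value, "recency"

    # 3. Consensus: a strict majority of all sources agrees on one value.
    value, votes = Counter(c.value for c in claims).most_common(1)[0]
    if votes * 2 > len(claims):
        return value, "consensus"

    # 4. Nothing settled it: flag for human review.
    return None, "manual_review"
```

Returning the rule name alongside the value makes it straightforward to log each resolution for later review.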
Automation
Our pipelines are fully automated using GitHub Actions:
- Scheduled Runs - Pipelines run on fixed schedules (weekly/monthly).
- Source Monitoring - We detect when sources update.
- Automatic PRs - Updates create pull requests for review.
- CI/CD - All validation tests run automatically.
Open Source
All our pipeline code is open source:
- Pipeline Code - See exactly how data is processed.
- Validation Tests - Review our testing methodology.
- Source Configurations - Know which sources we use.
Visit our GitHub organization to explore the code.
Reproducibility
Anyone can reproduce our datasets by:
- Cloning the repository
- Running the pipeline scripts
- Verifying the output matches our releases
This ensures our data can be independently verified and audited.
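The final verification step could be scripted along these lines. The sketch assumes the release ships a checksum manifest in the common `sha256sum` text format (one `digest  filename` pair per line); whether releases actually include such a manifest, and what it is called, is an assumption. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def parse_manifest(text: str) -> dict[str, str]:
    """Parse 'digest  filename' lines into a {filename: digest} mapping."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            digest, name = line.split(maxsplit=1)
            # sha256sum marks binary-mode entries with a leading '*'.
            entries[name.strip().lstrip("*")] = digest.lower()
    return entries


def mismatched_files(output_dir: Path, manifest_text: str) -> list[str]:
    """Return names of files that are missing or whose SHA256 differs."""
    problems = []
    for name, expected in parse_manifest(manifest_text).items():
        path = output_dir / name
        # read_bytes() loads the whole file; fine for a sketch, not for huge files.
        if not path.is_file() or hashlib.sha256(path.read_bytes()).hexdigest() != expected:
            problems.append(name)
    return sorted(problems)
```

An empty result means every file listed in the manifest was rebuilt byte for byte; any name returned points to an output that is missing or differs from the release.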
Contributing
We welcome contributions to improve our pipelines:
- Bug Reports - Found an issue? Open a GitHub issue.
- Data Corrections - Know of an error? Submit a PR with evidence.
- Source Suggestions - Know of a better source? Let us know.
- Code Improvements - Help us improve our processing logic.
See our contribution guidelines in each repository for details.