FACTS Common Attributes: national ingest → public-facts (#299) #319

Open
kpdavi wants to merge 2 commits into boettiger-lab:main from kpdavi:facts-common-attributes-2026-06

kpdavi (Contributor) commented Jun 26, 2026

Closes #299.

Ingests the USFS FACTS Common Attributes layer as a single national dataset merged from all 9 USFS regions (01–06, 08, 09, 10), downloaded from the FS EDW on 2026-06-24. Data, STAC, and README are already published to NRP S3 (public-facts) and registered in the root catalog; this PR commits the pipeline manifests + sync wiring.

Result (live on public-facts)

  • GeoParquet common-attributes-2026-06.parquet: 7,324,720 rows, EPSG:4326 (reprojected from NAD83), ADMIN_REGION 01–10
  • PMTiles common-attributes-2026-06.pmtiles: source-layer common-attributes-2026-06, 21 curated tile fields (tile-accurate per #283, "STAC: PMTiles vector assets missing tile-level column schema + nodata hints (SVI, conservation-almanac)")
  • H3 hex res 10 (parents 9, 8, 0): 96.9M rows covering 6,057,664 features, exactly the number of features that have a geometry
  • STAC: license public-domain; passes lint-stac-categorical and lint-stac-pmtiles-fields on the live URL; root → public-facts → dataset traversal verified
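
The counts above can be cross-checked against the aspatial share reported under the key decisions (~17%, 1.27M rows with NULL geometry). A minimal sketch, assuming the hexed feature count is exactly the set of rows with non-NULL geometry, as the hex bullet states; the variable names are illustrative only:

```python
# Counts quoted in the result list of this PR.
total_rows = 7_324_720              # rows in the national GeoParquet
features_with_geometry = 6_057_664  # features present in the H3 hex output

# Rows that have no geometry and therefore cannot be hexed or tiled.
aspatial_rows = total_rows - features_with_geometry
aspatial_share = aspatial_rows / total_rows

print(f"rows without geometry: {aspatial_rows:,}")      # 1,267,056
print(f"share of all rows:     {aspatial_share:.1%}")   # 17.3%
```

The difference of 1,267,056 rows is consistent with the rounded "1.27M / ~17%" figure.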

Key decisions (recorded on #299)

  • Region field: native ADMIN_REGION (verified one value per source file, all within 01–10; there is no region 07), so no separate region column is stamped
  • Hex coverage: chunk-size 10000 × 733 completions for full coverage of the 7.32M rows; the generator default (1000) would have silently dropped ~5.3M features
  • Aspatial records: ~17% (1.27M) have NULL geometry (FACTS activities without a mapped unit); they are present in the GeoParquet but absent from hex/PMTiles; documented
  • Upstream data quality: ~35 features with out-of-US coords and ~1,271 with implausible fiscal years are preserved as-is and documented as caveats
  • Serialization: --row-group-size 2000 to avoid the DuckDB httpfs stoi crash
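
The hex coverage arithmetic can be sketched as follows. This assumes each job completion processes one contiguous chunk of `chunk_size` rows, which is an assumption about how the generator splits work; `completions_needed` and `dropped_rows` are hypothetical helpers, not part of the pipeline:

```python
TOTAL_ROWS = 7_324_720  # rows in the merged national GeoParquet


def completions_needed(rows: int, chunk_size: int) -> int:
    """Smallest number of completions whose chunks cover every row (ceiling division)."""
    return -(-rows // chunk_size)


def dropped_rows(rows: int, chunk_size: int, completions: int) -> int:
    """Rows left unprocessed when chunk_size * completions falls short of rows."""
    return max(0, rows - chunk_size * completions)


completions = completions_needed(TOTAL_ROWS, 10_000)
print(completions)                                     # 733
print(dropped_rows(TOTAL_ROWS, 10_000, completions))   # 0
```

How many rows a smaller chunk size would drop depends on the completions count paired with it, which is why the two settings have to be chosen together.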

Pipeline (catalog/facts/k8s/common-attributes-2026-06/)

stage-raw → schema-check → merge → hex → repartition (+ pmtiles), plus gen_stac.py/_codes.json (STAC generator).

Sync

  • sync-public-facts.yaml: MinIO private mirror (new bucket)
  • source-sync-facts.yaml + scope/cron-config: source.coop (public-domain, license-clear). The cboettig/facts repo must be created in the source.coop web UI before the weekly cron can mirror it (see new-repos.md).

🤖 Generated with Claude Code

Default User and others added 2 commits June 25, 2026 21:30
…b#299)

Ingest the USFS FACTS Common Attributes layer (Forest Service Activity
Tracking System) as a single national dataset merged from all 9 regions
(01-06, 08, 09, 10), downloaded from the FS EDW on 2026-06-24.

Pipeline (catalog/facts/k8s/common-attributes-2026-06/):
- stage-raw.yaml      setup public-facts bucket + mirror 9 GeoPackage zips to raw/
- schema-check.yaml   verify schema consistency across regions (all 9 identical,
                      109 fields) + ADMIN_REGION reliability (1 value/file, 01..10)
- merge.yaml          convert+merge → national GeoParquet (7,324,720 rows),
                      reproject NAD83→EPSG:4326, row-group-size 2000 (stoi-safe)
- *-hex.yaml          H3 res 10 (parents 9,8,0); chunk-size 10000 × 733 completions
                      = full coverage of 7.32M rows (NOT the generator default 1000,
                      which would have dropped ~5.3M features)
- *-repartition.yaml  merge chunks → hex/ by h0 (120Gi RAM + PVC scratch)
- *-pmtiles.yaml      PVC-backed; curated 21-field tile subset (tile-accurate boettiger-lab#283)
- gen_stac.py/_codes.json  STAC generator (passes lint-stac-categorical &
                      lint-stac-pmtiles-fields)

Region queryability uses the native ADMIN_REGION (verified 01..10, one value per
source file) — no separate region column stamped.

STAC/README published to NRP S3 (public-facts); registered public-facts as a
top-level child of the root catalog. license = public-domain.

Sync:
- sync-public-facts.yaml         MinIO private mirror (new bucket)
- source-sync-facts.yaml + scope source.coop (public-domain, license-clear);
  cboettig/facts repo creation pending (see new-repos.md)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…value sets

verify-stac CI flagged ingested values '0' and 'U' missing from the declared
values arrays (HARD values-incomplete). Declared sets now match ingested data.
Republished to s3://public-facts/common-attributes-2026-06/stac-collection.json.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Ingest USFS FACTS Common Attributes (national, all regions) → public-facts
