feat(schemas): tier-1 prior-art corpus for gohai.schema.json by retr0h · Pull Request #107 · osapi-io/gohai

retr0h · 2026-04-16T17:40:31Z

Summary

Phase 1 of the schema work: land the prior-art corpus that every field-naming decision for `gohai.schema.json` will be grounded in.

No OCSF / OTel / osquery / ECS / any other single source covers gohai's scope (we measured — OCSF gets us ~10-30% of fields, osquery ~62% of collectors with inconsistent naming). So instead of forcing gohai into an ill-fitting standard, we're designing our own clean schema informed by comprehensive prior-art analysis, then shipping it as the canonical contract: `gohai.schema.json` at repo root.

This PR lands the corpus only. The schema itself, the field-by-field analysis table, and the Go-type refactor to match are follow-ups.

What's in it

Eight tier-1 sources with direct scope match to gohai:

| Source | Files | Role |
|---|---|---|
| OCSF | 266 | Open Cybersecurity Schema Framework (AWS/Splunk, LF-hosted) |
| OpenTelemetry | 249 | Resource semantic conventions |
| osquery | 289 | Host-inventory table specs |
| ECS | 59 | Elastic Common Schema field YAMLs |
| DMTF Redfish | 283 | Hardware / BMC REST schemas (latest unversioned only) |
| Kubernetes | 3 | `NodeStatus` / `NodeInfo` types |
| Chef Ohai | 279 | Our primary methodology reference |
| Puppet Facter | 946 | Peer fact collector |

Total: 2,374 files, ~13 MB.

Each source preserves its upstream `LICENSE` and carries a `PROVENANCE` file recording source URL, fetched commit SHA, and fetch timestamp.

`scripts/corpus-fetch.sh` is re-runnable — clones each upstream at `depth=1`, cherry-picks the schema-relevant paths, strips Redfish's 6,700 historical-version files down to the 280 unversioned heads, and writes `PROVENANCE` metadata.

`schemas/README.md` documents the corpus, refresh workflow, and the broader schema roadmap.

Why committed (not gitignored)

Reproducibility — anyone cloning gets the exact corpus the schema was designed against.
Auditability — every naming decision will cite a corpus source; the source is right there to verify.
Diff-review on refresh — re-running `corpus-fetch.sh` shows upstream schema churn that may change naming decisions.

Licensing

All tier-1 sources are permissively licensed (Apache-2, BSD-3, MIT). Each subdirectory preserves its upstream `LICENSE` file for attribution.

What's next (not in this PR)

Tier 2+ corpus — cloud IMDS shapes (AWS Config, GCP Asset Inventory, Azure ARM), SIEM vocabularies (ASIM, UDM, Splunk CIM, OSSEM), software identifiers (CPE, SWID, SPDX, CycloneDX, PURL), DMTF CIM, and assorted vendor configs.
Design-principles doc — top-level shape, units, versioning, extension mechanism. Needs alignment before drafting schema JSON.
Field-by-field analysis — per-field table across sources, chosen name, rationale.
`gohai.schema.json` draft — hand-written from the analysis, the spec.
Conformance test — Go reflection vs. hand-written schema, fail on drift.
Go-type refactor — rename fields to match schema. Breaking change on JSON output + Go API; pre-1.0 acceptable.
Publish — schemastore.org + versioned stable URL.

Test plan

`scripts/corpus-fetch.sh` runs cleanly, re-runnable, idempotent.
Every corpus subdirectory has a `PROVENANCE` file and (where upstream provides one) a `LICENSE`.
No binaries or generated artifacts in the corpus.

🤖 Generated with Claude Code

Set up schemas/ as the home for the canonical gohai schema and its supporting corpus. This PR lands Phase 1 of the schema work: the prior-art corpus that field-naming decisions will be grounded in.

Corpus sources (tier 1, direct scope match):
- OCSF: objects + events + dictionary (266 files)
- OTel: resource semantic conventions (249 files)
- osquery: all 280+ table specs (289 files)
- ECS: Elastic Common Schema field YAMLs (59 files)
- Redfish: DMTF Redfish JSON schemas, latest only (283 files)
- k8s: NodeStatus / NodeInfo from core/v1/types.go (3 files)
- Ohai: plugins + mixins + specs (279 files)
- Facter: Puppet Facter fact schema (946 files)

Total: 2,374 files, ~13 MB.

Each source preserves its upstream LICENSE and carries a PROVENANCE file recording source URL, fetched commit SHA, and fetch timestamp.

scripts/corpus-fetch.sh is re-runnable: it clones each upstream at depth=1, cherry-picks the schema-relevant paths, strips Redfish's 6,700 historical-version files down to the 280 unversioned heads, and writes PROVENANCE metadata.

schemas/README.md documents the corpus structure, refresh workflow, and the schema roadmap.

`gohai.schema.json` itself (the canonical output contract) lands in follow-up PRs once the field-by-field analysis is done.

Tier 2+ (cloud IMDS shapes, SIEM vocabularies like ASIM/UDM/Splunk CIM/OSSEM, software identifiers like CPE/SPDX/CycloneDX/PURL, DMTF CIM) will be added as separate follow-up PRs.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

Expand the corpus with SIEM vocabularies, software identifiers, cloud provider schemas, and hardware identifier databases. 10 new sources, 6,600 files, bringing total corpus to 74 MB across 17 sources and ~9,000 schema files.

New tier-2 sources:
- ASIM: Azure Sentinel Advanced Security Information Model
- OSSEM: Open Source Security Events Metadata (CDM + DD + DM)
- Sigma: SigmaHQ detection rules (field-name sampling)
- CycloneDX: SBOM schema (component / vuln / pedigree)
- SPDX: SPDX 3 model (document / package / file / snippet)
- PURL: Package URL spec (universal package identifier)
- Wazuh: syscollector inventory schemas
- AWS CFN: CloudFormation resource schemas (via cfn-lint data)
- Azure ARM: Resource Manager common JSON schemas
- hwids: pci.ids + usb.ids vendor/product databases

Script fixes:
- OSSEM path was wrong (common_information_model → OSSEM-CDM)
- SPDX now points at the active spdx-3-model repo
- CloudFormation replaced with cfn-lint-data (has actual per-resource schemas, unlike the original cloudformation-template-schema repo, which didn't ship schemas)

README updated with tier-2 catalog + tier-3+ wishlist.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

Documents the decisions that have to be made before field-by-field analysis can start: top-level shape (flat vs nested), naming conventions + precedence, unit conventions (seconds / bytes / percent / booleans / enums), optionality rules, versioning (semver + $schema + $id + embedded meta), extension mechanism (_vendor, _raw), and scope boundaries (inventory only: no events, metrics, findings, or remediation).

Recommendation in the doc is Option A (flat-by-collector, what we have today) over Option B (nested-by-domain like OCSF's device object). Reasons: consumer compat, collector toggle semantics, peer-tool alignment, and B's wins being theoretical.

Three open questions flagged for decision:
1. Top-level shape: A or B
2. Schema URL: gohai.dev / osapi.io / github raw
3. schemastore.org submission

Field-by-field analysis is blocked on these.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

First batch of the field-naming analysis covering the 8 highest-coverage collectors: hostname, platform, kernel, cpu, memory, dmi, network, process. 496 fields analyzed across the full corpus.

Results:
- 20% already canonical (match industry consensus)
- 69% unique to gohai (no prior schema covers them)
- Only 4 evidence-based renames found:
  - network: hardware_addr → mac (4 sources agree)
  - network: addr → ip (OCSF)
  - process: cmd_line → command_line (OTel + ECS)
  - memory: size → total (Facter + OTel, clarity)

For the 69% unique fields, gohai IS defining the standard. The analysis documents every decision with provenance so future schema consumers can trace why each name was chosen.

Remaining collectors (cloud providers, filesystem, disk, gpu, pci, scsi, hardware, virtualization, users, sessions, all system collectors) will be added in follow-up commits.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

1,747-line JSON Schema (draft 2020-12) covering all 39 shipping collectors with 739 field descriptions. This is the canonical output contract for gohai: the schema IS the spec, Go types conform to it, consumers validate against it.

43 top-level properties (one per collector + _meta + timings). 117 $defs for nested types (network interfaces, cloud sub-structs, DMI sections, filesystem mounts, etc.). Every field has a consumer-facing description explaining what the value means: not "the X field" but actual semantic documentation.

4 evidence-based renames applied per corpus analysis:
- network: hardware_addr → mac (OCSF + OTel + ECS + osquery)
- network: addr → ip (OCSF)
- process: cmd_line → command_line (OTel + ECS)
- memory: size → total (already matched in code)

Schema informed by the 17-source corpus (OCSF, OTel, osquery, ECS, Redfish, k8s, Ohai, Facter, ASIM, OSSEM, Sigma, CycloneDX, SPDX, PURL, Wazuh, AWS CFN, Azure ARM) committed in schemas/corpus/.

Next steps (not in this commit):
- Conformance test: reflect Go Facts → assert matches schema
- Apply the 3 Go field renames (mac, ip, command_line)
- Publish to schemastore.org
- Semver the schema independently from gohai

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

16 cross-provider renames so the same concept uses the same field name regardless of cloud provider:
- instance_id: vm_id (Azure), droplet_id (DO), id (OCI, Scaleway) all become instance_id
- region: location (Azure) becomes region
- availability_zone: zone (GCE, Scaleway, Alibaba), availability_domain (OCI) all become availability_zone
- private_ip: local_ipv4 (EC2, OpenStack), private_ipv4 (Alibaba) all become private_ip
- public_ip: public_ipv4 (EC2, OpenStack, Alibaba) becomes public_ip

Naming follows OTel + ECS + OCSF consensus (all three agree on region, availability_zone; ECS uses instance.id).

Provider-specific extras (iam_info, network_interfaces, user_data, etc.) keep their vendor-native names; only the cross-provider concepts that exist on every cloud get normalized.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

Comprehensive pass over every field in gohai.schema.json:

Field renames (11):
- AzureInterface: mac_address → mac
- OCIVNIC: mac_addr → mac
- Memory: s_unreclaim → s_unreclaimable
- CPU: numa_nodes_count → numa_node_count, cpu_opmodes → op_modes
- NetworkCounters: errin/errout/dropin/dropout → errors_in/errors_out/drops_in/drops_out
- PCIDevice: sdevice_id/sdevice_name → subsystem_device_id/subsystem_device_name

All 906 descriptions rewritten to publication quality:
- Never starts with "The"
- 1-2 sentences max
- Units in parentheses: (bytes), (seconds), (MHz), (percent)
- Example values for enum-like fields
- Booleans describe what true means
- Consumer-facing: WHAT not HOW
- No "This field contains..." or "Represents..."

Cross-collector consistency fixes:
- mac: now consistently named everywhere (was mac_address in Azure, mac_addr in OCI)
- hostname/region/instance_id/availability_zone/private_ip/public_ip/serial_number/tags: consistent description style across all cloud providers
- Previously undescribed fields in Azure, OCI, Alibaba, OpenStack, Scaleway, DMI, Hardware sub-structs now all have descriptions

$defs sorted alphabetically for navigation.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

github-actions Bot added kind/go kind/yaml labels Apr 16, 2026

retr0h and others added 7 commits April 16, 2026 11:27

move gohai.schema.json to schemas/

4cc20bd

retr0h wants to merge 8 commits into main from feat/schema-corpus-tier1