
feat(schemas): tier-1 prior-art corpus for gohai.schema.json #107

Open
retr0h wants to merge 8 commits into main from feat/schema-corpus-tier1

retr0h (Contributor) commented Apr 16, 2026

Summary

Phase 1 of the schema work: land the prior-art corpus that every field-naming decision for `gohai.schema.json` will be grounded in.

No OCSF / OTel / osquery / ECS / any other single source covers gohai's scope (we measured — OCSF gets us ~10-30% of fields, osquery ~62% of collectors with inconsistent naming). So instead of forcing gohai into an ill-fitting standard, we're designing our own clean schema informed by comprehensive prior-art analysis, then shipping it as the canonical contract: `gohai.schema.json` at repo root.

This PR lands the corpus only. The schema itself, the field-by-field analysis table, and the Go-type refactor to match are follow-ups.

What's in it

Eight tier-1 sources with direct scope match to gohai:

| Source | Files | Role |
| --- | --- | --- |
| OCSF | 266 | Open Cybersecurity Schema Framework (AWS/Splunk, LF-hosted) |
| OpenTelemetry | 249 | Resource semantic conventions |
| osquery | 289 | Host-inventory table specs |
| ECS | 59 | Elastic Common Schema field YAMLs |
| DMTF Redfish | 283 | Hardware / BMC REST schemas (latest unversioned only) |
| Kubernetes | 3 | `NodeStatus` / `NodeInfo` types |
| Chef Ohai | 279 | Our primary methodology reference |
| Puppet Facter | 946 | Peer fact collector |

Total: 2,374 files, ~13 MB.

Each source preserves its upstream `LICENSE` and carries a `PROVENANCE` file recording source URL, fetched commit SHA, and fetch timestamp.

`scripts/corpus-fetch.sh` is re-runnable — clones each upstream at `depth=1`, cherry-picks the schema-relevant paths, strips Redfish's 6,700 historical-version files down to the 280 unversioned heads, and writes `PROVENANCE` metadata.
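
The Redfish pruning step described above can be sketched as a filename filter. This is a minimal illustration in Python, not the logic of `corpus-fetch.sh` itself; it assumes versioned Redfish schemas carry a `.vMAJOR_MINOR_PATCH.json` suffix (for example `Chassis.v1_0_0.json`) while the unversioned head is plain `Chassis.json`. The function name and sample filenames are hypothetical.

```python
import re

# Assumption: versioned schema files end in ".v<major>_<minor>_<patch>.json".
VERSIONED = re.compile(r"\.v\d+_\d+_\d+\.json$")


def unversioned_heads(filenames):
    """Return only the JSON schema files that carry no version suffix."""
    return sorted(
        name
        for name in filenames
        if name.endswith(".json") and not VERSIONED.search(name)
    )


# Hypothetical sample of an upstream directory listing.
files = ["Chassis.json", "Chassis.v1_0_0.json", "Chassis.v1_25_0.json", "Power.json"]
print(unversioned_heads(files))  # ['Chassis.json', 'Power.json']
```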

`schemas/README.md` documents the corpus, refresh workflow, and the broader schema roadmap.

Why committed (not gitignored)

  • Reproducibility — anyone cloning gets the exact corpus the schema was designed against.
  • Auditability — every naming decision will cite a corpus source; the source is right there to verify.
  • Diff-review on refresh — re-running `corpus-fetch.sh` shows upstream schema churn that may change naming decisions.

Licensing

All tier-1 sources are permissively licensed (Apache-2, BSD-3, MIT). Each subdirectory preserves its upstream `LICENSE` file for attribution.

What's next (not in this PR)

  • Tier 2+ corpus — cloud IMDS shapes (AWS Config, GCP Asset Inventory, Azure ARM), SIEM vocabularies (ASIM, UDM, Splunk CIM, OSSEM), software identifiers (CPE, SWID, SPDX, CycloneDX, PURL), DMTF CIM, and assorted vendor configs.
  • Design-principles doc — top-level shape, units, versioning, extension mechanism. Needs alignment before drafting schema JSON.
  • Field-by-field analysis — per-field table across sources, chosen name, rationale.
  • `gohai.schema.json` draft — hand-written from the analysis; this file is the spec.
  • Conformance test — Go reflection vs. hand-written schema, fail on drift.
  • Go-type refactor — rename fields to match schema. Breaking change on JSON output + Go API; pre-1.0 acceptable.
  • Publish — schemastore.org + versioned stable URL.
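
The "fail on drift" idea behind the conformance test can be sketched independently of Go reflection. The Python below is only an illustration under assumptions: it compares a schema's `properties` keys with the field names a collector emits, and the schema fragment, field names, and `drift` helper are all hypothetical.

```python
def drift(schema, emitted_fields):
    """Report field names that differ between a schema and emitted output."""
    declared = set(schema.get("properties", {}))
    emitted = set(emitted_fields)
    return {
        # Fields in the output that the schema does not declare.
        "missing_from_schema": sorted(emitted - declared),
        # Fields the schema declares that the output never produces.
        "missing_from_output": sorted(declared - emitted),
    }


# Hypothetical schema fragment and collector output.
schema = {"properties": {"total": {"type": "integer"}, "free": {"type": "integer"}}}
report = drift(schema, ["total", "size"])
print(report)
# {'missing_from_schema': ['size'], 'missing_from_output': ['free']}
```

A real test would fail whenever either list is non-empty.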

Test plan

  • `scripts/corpus-fetch.sh` runs cleanly, re-runnable, idempotent.
  • Every corpus subdirectory has a `PROVENANCE` file and (where upstream provides one) a `LICENSE`.
  • No binaries or generated artifacts in the corpus.
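
The `PROVENANCE` check in the test plan could be automated along these lines. This is a sketch only: it assumes each corpus source is an immediate subdirectory of one root, and the directory names and placeholder file content in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def missing_provenance(corpus_root):
    """Return names of corpus subdirectories that lack a PROVENANCE file."""
    root = Path(corpus_root)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and not (entry / "PROVENANCE").is_file()
    )


# Demo against a throwaway directory tree (names are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ocsf").mkdir()
    (Path(tmp) / "ocsf" / "PROVENANCE").write_text("placeholder\n")
    (Path(tmp) / "ecs").mkdir()
    print(missing_provenance(tmp))  # ['ecs']
```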

🤖 Generated with Claude Code

Set up schemas/ as the home for the canonical gohai schema and its
supporting corpus. This PR lands Phase 1 of the schema work: the
prior-art corpus that field-naming decisions will be grounded in.

Corpus sources (tier 1 — direct scope match):

- OCSF      objects + events + dictionary                 (266 files)
- OTel      resource semantic conventions                 (249 files)
- osquery   all 280+ table specs                          (289 files)
- ECS       Elastic Common Schema field YAMLs             ( 59 files)
- Redfish   DMTF Redfish JSON schemas (latest only)       (283 files)
- k8s       NodeStatus / NodeInfo from core/v1/types.go   (  3 files)
- Ohai      plugins + mixins + specs                      (279 files)
- Facter    Puppet Facter fact schema                     (946 files)

Total: 2,374 files, ~13 MB.

Each source preserves its upstream LICENSE and carries a PROVENANCE
file recording source URL, fetched commit SHA, and fetch timestamp.

scripts/corpus-fetch.sh is re-runnable — clones each upstream at
depth=1, cherry-picks the schema-relevant paths, strips Redfish's
6,700 historical-version files down to the 280 unversioned heads,
and writes PROVENANCE metadata.

schemas/README.md documents the corpus structure, refresh workflow,
and the schema roadmap. `gohai.schema.json` itself (the canonical
output contract) lands in follow-up PRs once the field-by-field
analysis is done.

Tier 2+ (cloud IMDS shapes, SIEM vocabularies like ASIM/UDM/Splunk
CIM/OSSEM, software identifiers like CPE/SPDX/CycloneDX/PURL, DMTF
CIM) will be added as separate follow-up PRs.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
retr0h and others added 7 commits April 16, 2026 11:27
Expand the corpus with SIEM vocabularies, software identifiers, cloud
provider schemas, and hardware identifier databases. 10 new sources,
6,600 files, bringing total corpus to 74 MB across 17 sources and
~9,000 schema files.

New tier-2 sources:

- ASIM       Azure Sentinel Advanced Security Information Model
- OSSEM      Open Source Security Events Metadata (CDM + DD + DM)
- Sigma      SigmaHQ detection rules — field-name sampling
- CycloneDX  SBOM schema — component / vuln / pedigree
- SPDX       SPDX 3 model — document / package / file / snippet
- PURL       Package URL spec — universal package identifier
- Wazuh      syscollector inventory schemas
- AWS CFN    CloudFormation resource schemas (via cfn-lint data)
- Azure ARM  Resource Manager common JSON schemas
- hwids      pci.ids + usb.ids vendor/product databases

Script fixes:

- OSSEM path was wrong (common_information_model → OSSEM-CDM)
- SPDX now points at the active spdx-3-model repo
- CloudFormation replaced with cfn-lint-data (has actual per-resource
  schemas, unlike the original cloudformation-template-schema repo
  which didn't ship schemas)

README updated with tier-2 catalog + tier-3+ wishlist.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Documents the decisions that have to be made before field-by-field
analysis can start: top-level shape (flat vs nested), naming
conventions + precedence, unit conventions (seconds / bytes /
percent / booleans / enums), optionality rules, versioning (semver
+ $schema + $id + embedded meta), extension mechanism (_vendor,
_raw), and scope boundaries (inventory only — no events, metrics,
findings, or remediation).

Recommendation in the doc is Option A (flat-by-collector, what we
have today) over Option B (nested-by-domain like OCSF's device
object). Reasons: consumer compat, collector toggle semantics,
peer-tool alignment, and B's wins being theoretical.

Three open questions flagged for decision:

1. Top-level shape — A or B
2. Schema URL — gohai.dev / osapi.io / github raw
3. schemastore.org submission

Field-by-field analysis is blocked on these.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
First batch of the field-naming analysis covering the 8 highest-
coverage collectors: hostname, platform, kernel, cpu, memory, dmi,
network, process.

496 fields analyzed across the full corpus. Results:
- 20% already canonical (match industry consensus)
- 69% unique to gohai (no prior schema covers them)
- Only 4 evidence-based renames found:
  - network: hardware_addr → mac (4 sources agree)
  - network: addr → ip (OCSF)
  - process: cmd_line → command_line (OTel + ECS)
  - memory: size → total (Facter + OTel, clarity)
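
As an illustration, the renames listed above can be written as a per-collector mapping applied to a record. This is a Python sketch, not gohai code; the flat record shape and the `apply_renames` helper are assumptions.

```python
# Old name -> new name, scoped by collector, taken from the list above.
RENAMES = {
    "network": {"hardware_addr": "mac", "addr": "ip"},
    "process": {"cmd_line": "command_line"},
    "memory": {"size": "total"},
}


def apply_renames(collector, record):
    """Return a copy of record with that collector's renames applied."""
    mapping = RENAMES.get(collector, {})
    return {mapping.get(key, key): value for key, value in record.items()}


# Hypothetical network record; keys without a rename pass through unchanged.
print(apply_renames("network", {"hardware_addr": "aa:bb", "addr": "10.0.0.1", "mtu": 1500}))
# {'mac': 'aa:bb', 'ip': '10.0.0.1', 'mtu': 1500}
```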

For the 69% unique fields, gohai IS defining the standard. The
analysis documents every decision with provenance so future schema
consumers can trace why each name was chosen.

Remaining collectors (cloud providers, filesystem, disk, gpu, pci,
scsi, hardware, virtualization, users, sessions, all system
collectors) will be added in follow-up commits.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
1,747-line JSON Schema (draft 2020-12) covering all 39 shipping
collectors with 739 field descriptions. This is the canonical
output contract for gohai — the schema IS the spec, Go types
conform to it, consumers validate against it.

43 top-level properties (one per collector + _meta + timings).
117 $defs for nested types (network interfaces, cloud sub-structs,
DMI sections, filesystem mounts, etc.).

Every field has a consumer-facing description explaining what the
value means — not "the X field" but actual semantic documentation.

4 evidence-based renames applied per corpus analysis:
- network: hardware_addr → mac (OCSF + OTel + ECS + osquery)
- network: addr → ip (OCSF)
- process: cmd_line → command_line (OTel + ECS)
- memory: size → total (already matched in code)

Schema informed by the 17-source corpus (OCSF, OTel, osquery, ECS,
Redfish, k8s, Ohai, Facter, ASIM, OSSEM, Sigma, CycloneDX, SPDX,
PURL, Wazuh, AWS CFN, Azure ARM) committed in schemas/corpus/.

Next steps (not in this commit):
- Conformance test: reflect Go Facts → assert matches schema
- Apply the 3 Go field renames (mac, ip, command_line)
- Publish to schemastore.org
- Semver the schema independently from gohai

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
16 cross-provider renames so the same concept uses the same field
name regardless of cloud provider:

- instance_id: vm_id (Azure), droplet_id (DO), id (OCI, Scaleway)
  all become instance_id
- region: location (Azure) becomes region
- availability_zone: zone (GCE, Scaleway, Alibaba),
  availability_domain (OCI) all become availability_zone
- private_ip: local_ipv4 (EC2, OpenStack), private_ipv4 (Alibaba)
  all become private_ip
- public_ip: public_ipv4 (EC2, OpenStack, Alibaba) becomes public_ip

Naming follows OTel + ECS + OCSF consensus (all three agree on
region, availability_zone; ECS uses instance.id).

Provider-specific extras (iam_info, network_interfaces, user_data,
etc.) keep their vendor-native names — only the cross-provider
concepts that exist on every cloud get normalized.
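
A sketch of the effect, in Python: after normalization, records from different providers expose the same keys for the shared concepts, while provider-specific extras pass through untouched. The provider keys (`azure`, `ec2`, and so on), the record shapes, and the `normalize` helper are hypothetical; only the old and new field names come from the list above.

```python
# Vendor-native name -> normalized name, per provider.
CLOUD_RENAMES = {
    "azure": {"vm_id": "instance_id", "location": "region"},
    "digitalocean": {"droplet_id": "instance_id"},
    "oci": {"id": "instance_id", "availability_domain": "availability_zone"},
    "scaleway": {"id": "instance_id", "zone": "availability_zone"},
    "gce": {"zone": "availability_zone"},
    "alibaba": {
        "zone": "availability_zone",
        "private_ipv4": "private_ip",
        "public_ipv4": "public_ip",
    },
    "ec2": {"local_ipv4": "private_ip", "public_ipv4": "public_ip"},
    "openstack": {"local_ipv4": "private_ip", "public_ipv4": "public_ip"},
}


def normalize(provider, record):
    """Rename cross-provider concepts; leave vendor-specific keys as they are."""
    mapping = CLOUD_RENAMES.get(provider, {})
    return {mapping.get(key, key): value for key, value in record.items()}


azure = normalize("azure", {"vm_id": "abc", "location": "westus"})
ec2 = normalize("ec2", {"local_ipv4": "10.0.0.5", "iam_info": {}})
print(sorted(azure))  # ['instance_id', 'region']
print(sorted(ec2))    # ['iam_info', 'private_ip']
```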

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Comprehensive pass over every field in gohai.schema.json:

Field renames (11):
- AzureInterface: mac_address → mac
- OCIVNIC: mac_addr → mac
- Memory: s_unreclaim → s_unreclaimable
- CPU: numa_nodes_count → numa_node_count, cpu_opmodes → op_modes
- NetworkCounters: errin/errout/dropin/dropout → errors_in/errors_out/drops_in/drops_out
- PCIDevice: sdevice_id/sdevice_name → subsystem_device_id/subsystem_device_name

All 906 descriptions rewritten to publication quality:
- Never starts with "The"
- 1-2 sentences max
- Units in parentheses: (bytes), (seconds), (MHz), (percent)
- Example values for enum-like fields
- Booleans describe what true means
- Consumer-facing: WHAT not HOW
- No "This field contains..." or "Represents..."
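
Some of these style rules are mechanical enough to lint. The Python below is a sketch covering only the opener and sentence-count rules; the sentence splitter is deliberately naive (it would miscount abbreviations such as "e.g."), and the function name and sample descriptions are hypothetical.

```python
import re

# Openers ruled out by the style rules above.
BANNED_OPENERS = ("The ", "This field contains", "Represents")


def lint_description(text):
    """Return a list of style problems found in one field description."""
    problems = []
    stripped = text.strip()
    if stripped.startswith(BANNED_OPENERS):
        problems.append("banned opener")
    # Naive sentence split: terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", stripped) if s]
    if len(sentences) > 2:
        problems.append("more than two sentences")
    return problems


print(lint_description("Total physical memory (bytes)."))  # []
print(lint_description("The hostname of the machine."))    # ['banned opener']
```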

Cross-collector consistency fixes:
- mac: now consistently named everywhere (was mac_address in Azure, mac_addr in OCI)
- hostname/region/instance_id/availability_zone/private_ip/public_ip/serial_number/tags: consistent description style across all cloud providers
- Previously undescribed fields in Azure, OCI, Alibaba, OpenStack, Scaleway, DMI, Hardware sub-structs now all have descriptions

$defs sorted alphabetically for navigation.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>