feat(schemas): tier-1 prior-art corpus for gohai.schema.json#107
Set up `schemas/` as the home for the canonical gohai schema and its supporting corpus. This PR lands Phase 1 of the schema work: the prior-art corpus that field-naming decisions will be grounded in.

Corpus sources (tier 1 — direct scope match):

- OCSF — objects + events + dictionary (266 files)
- OTel — resource semantic conventions (249 files)
- osquery — all 280+ table specs (289 files)
- ECS — Elastic Common Schema field YAMLs (59 files)
- Redfish — DMTF Redfish JSON schemas, latest only (283 files)
- k8s — NodeStatus / NodeInfo from core/v1/types.go (3 files)
- Ohai — plugins + mixins + specs (279 files)
- Facter — Puppet Facter fact schema (946 files)

Total: 2,374 files, ~13 MB.

Each source preserves its upstream LICENSE and carries a PROVENANCE file recording the source URL, fetched commit SHA, and fetch timestamp. `scripts/corpus-fetch.sh` is re-runnable: it clones each upstream at `depth=1`, cherry-picks the schema-relevant paths, strips Redfish's 6,700 historical-version files down to the 280 unversioned heads, and writes PROVENANCE metadata.

`schemas/README.md` documents the corpus structure, refresh workflow, and the schema roadmap. `gohai.schema.json` itself (the canonical output contract) lands in follow-up PRs once the field-by-field analysis is done. Tier 2+ (cloud IMDS shapes; SIEM vocabularies like ASIM, UDM, Splunk CIM, OSSEM; software identifiers like CPE, SPDX, CycloneDX, PURL; DMTF CIM) will be added in separate follow-up PRs.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Expand the corpus with SIEM vocabularies, software identifiers, cloud provider schemas, and hardware identifier databases: 10 new sources and 6,600 files, bringing the total corpus to 74 MB across 17 sources and ~9,000 schema files.

New tier-2 sources:

- ASIM — Azure Sentinel Advanced Security Information Model
- OSSEM — Open Source Security Events Metadata (CDM + DD + DM)
- Sigma — SigmaHQ detection rules, for field-name sampling
- CycloneDX — SBOM schema: component / vuln / pedigree
- SPDX — SPDX 3 model: document / package / file / snippet
- PURL — Package URL spec, the universal package identifier
- Wazuh — syscollector inventory schemas
- AWS CFN — CloudFormation resource schemas (via cfn-lint data)
- Azure ARM — Resource Manager common JSON schemas
- hwids — pci.ids + usb.ids vendor/product databases

Script fixes:

- The OSSEM path was wrong (common_information_model → OSSEM-CDM)
- SPDX now points at the active spdx-3-model repo
- CloudFormation replaced with cfn-lint-data, which ships actual per-resource schemas (the original cloudformation-template-schema repo didn't)

README updated with the tier-2 catalog and a tier-3+ wishlist.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Documents the decisions that have to be made before field-by-field analysis can start: top-level shape (flat vs. nested), naming conventions + precedence, unit conventions (seconds / bytes / percent / booleans / enums), optionality rules, versioning (semver + $schema + $id + embedded meta), the extension mechanism (_vendor, _raw), and scope boundaries (inventory only — no events, metrics, findings, or remediation).

The doc recommends Option A (flat-by-collector, what we have today) over Option B (nested-by-domain, like OCSF's device object). Reasons: consumer compatibility, collector toggle semantics, peer-tool alignment, and B's wins being theoretical.

Three open questions flagged for decision:

1. Top-level shape — A or B
2. Schema URL — gohai.dev / osapi.io / GitHub raw
3. schemastore.org submission

Field-by-field analysis is blocked on these.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
First batch of the field-naming analysis, covering the 8 highest-coverage collectors: hostname, platform, kernel, cpu, memory, dmi, network, process. 496 fields analyzed across the full corpus.

Results:

- 20% already canonical (match industry consensus)
- 69% unique to gohai (no prior schema covers them)
- Only 4 evidence-based renames found:
  - network: hardware_addr → mac (4 sources agree)
  - network: addr → ip (OCSF)
  - process: cmd_line → command_line (OTel + ECS)
  - memory: size → total (Facter + OTel, clarity)

For the 69% of unique fields, gohai IS defining the standard. The analysis documents every decision with provenance so future schema consumers can trace why each name was chosen.

Remaining collectors (cloud providers, filesystem, disk, gpu, pci, scsi, hardware, virtualization, users, sessions, and all system collectors) will be added in follow-up commits.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
A 1,747-line JSON Schema (draft 2020-12) covering all 39 shipping collectors with 739 field descriptions. This is the canonical output contract for gohai — the schema IS the spec: Go types conform to it, and consumers validate against it.

43 top-level properties (one per collector, plus _meta and timings). 117 $defs for nested types (network interfaces, cloud sub-structs, DMI sections, filesystem mounts, etc.). Every field has a consumer-facing description explaining what the value means — not "the X field" but actual semantic documentation.

4 evidence-based renames applied per the corpus analysis:

- network: hardware_addr → mac (OCSF + OTel + ECS + osquery)
- network: addr → ip (OCSF)
- process: cmd_line → command_line (OTel + ECS)
- memory: size → total (already matched in code)

The schema is informed by the 17-source corpus (OCSF, OTel, osquery, ECS, Redfish, k8s, Ohai, Facter, ASIM, OSSEM, Sigma, CycloneDX, SPDX, PURL, Wazuh, AWS CFN, Azure ARM) committed in schemas/corpus/.

Next steps (not in this commit):

- Conformance test: reflect Go Facts → assert it matches the schema
- Apply the 3 Go field renames (mac, ip, command_line)
- Publish to schemastore.org
- Semver the schema independently from gohai

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
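The shape described above — one top-level property per collector, $defs for nested types, descriptions on every field — can be sketched roughly as follows. This is an illustrative fragment only: the `$id` URL, the specific property subset, and the field shown under `CPU` are assumptions, not copied from the actual schema.

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/gohai.schema.json",
  "type": "object",
  "properties": {
    "_meta":   { "$ref": "#/$defs/Meta" },
    "timings": { "$ref": "#/$defs/Timings" },
    "cpu":     { "$ref": "#/$defs/CPU" },
    "memory":  { "$ref": "#/$defs/Memory" },
    "network": { "$ref": "#/$defs/Network" }
  },
  "$defs": {
    "CPU": {
      "type": "object",
      "properties": {
        "total": {
          "type": "integer",
          "description": "Logical CPU count visible to the OS."
        }
      }
    }
  }
}
```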
16 cross-provider renames so the same concept uses the same field name regardless of cloud provider:

- instance_id: vm_id (Azure), droplet_id (DO), and id (OCI, Scaleway) all become instance_id
- region: location (Azure) becomes region
- availability_zone: zone (GCE, Scaleway, Alibaba) and availability_domain (OCI) all become availability_zone
- private_ip: local_ipv4 (EC2, OpenStack) and private_ipv4 (Alibaba) all become private_ip
- public_ip: public_ipv4 (EC2, OpenStack, Alibaba) becomes public_ip

Naming follows the OTel + ECS + OCSF consensus (all three agree on region and availability_zone; ECS uses instance.id). Provider-specific extras (iam_info, network_interfaces, user_data, etc.) keep their vendor-native names — only the cross-provider concepts that exist on every cloud get normalized.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
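The normalization above can be sketched as a per-provider rename table. The field names come from the rename list in this commit, but the `RENAMES` table and `normalize` helper are illustrative — the actual renames live in the Go collectors, not in a lookup like this.

```python
# Per-provider rename tables: vendor-native key -> canonical key.
# Providers and vendor-native names taken from the rename list above.
RENAMES = {
    "azure": {"vm_id": "instance_id", "location": "region"},
    "digitalocean": {"droplet_id": "instance_id"},
    "oci": {"id": "instance_id", "availability_domain": "availability_zone"},
    "scaleway": {"id": "instance_id", "zone": "availability_zone"},
    "gce": {"zone": "availability_zone"},
    "alibaba": {"zone": "availability_zone", "private_ipv4": "private_ip",
                "public_ipv4": "public_ip"},
    "ec2": {"local_ipv4": "private_ip", "public_ipv4": "public_ip"},
    "openstack": {"local_ipv4": "private_ip", "public_ipv4": "public_ip"},
}

def normalize(provider: str, fields: dict) -> dict:
    """Rename cross-provider concepts to canonical names; leave
    provider-specific extras (iam_info, user_data, ...) untouched."""
    renames = RENAMES.get(provider, {})
    return {renames.get(k, k): v for k, v in fields.items()}
```

For example, `normalize("azure", {"vm_id": "abc", "location": "eastus"})` yields `{"instance_id": "abc", "region": "eastus"}`, while unknown keys and unknown providers pass through unchanged.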
Comprehensive pass over every field in gohai.schema.json.

Field renames (11):

- AzureInterface: mac_address → mac
- OCIVNIC: mac_addr → mac
- Memory: s_unreclaim → s_unreclaimable
- CPU: numa_nodes_count → numa_node_count, cpu_opmodes → op_modes
- NetworkCounters: errin/errout/dropin/dropout → errors_in/errors_out/drops_in/drops_out
- PCIDevice: sdevice_id/sdevice_name → subsystem_device_id/subsystem_device_name

All 906 descriptions rewritten to publication quality:

- Never starts with "The"
- 1-2 sentences max
- Units in parentheses: (bytes), (seconds), (MHz), (percent)
- Example values for enum-like fields
- Booleans describe what true means
- Consumer-facing: WHAT, not HOW
- No "This field contains..." or "Represents..."

Cross-collector consistency fixes:

- mac is now consistently named everywhere (was mac_address in Azure, mac_addr in OCI)
- hostname/region/instance_id/availability_zone/private_ip/public_ip/serial_number/tags: consistent description style across all cloud providers
- Previously undescribed fields in the Azure, OCI, Alibaba, OpenStack, Scaleway, DMI, and Hardware sub-structs now all have descriptions

$defs sorted alphabetically for navigation.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
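The description rules above lend themselves to a mechanical lint pass. A minimal sketch — the function name and the exact heuristics are assumptions, not something this PR ships:

```python
import re

def lint_description(desc: str) -> list[str]:
    """Return style violations for a schema field description,
    per the rules listed above."""
    problems = []
    if desc.startswith("The "):
        problems.append('starts with "The"')
    # Rough sentence count: terminal punctuation followed by a space or EOL.
    if len(re.findall(r"[.!?](?:\s|$)", desc)) > 2:
        problems.append("more than 2 sentences")
    if re.match(r"(This field contains|Represents)\b", desc):
        problems.append("boilerplate opener")
    return problems
```

Running such a check over every `description` in the schema would keep the 906 rewritten descriptions from regressing in later edits.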
Summary
Phase 1 of the schema work: land the prior-art corpus that every field-naming decision for `gohai.schema.json` will be grounded in.
No single source — OCSF, OTel, osquery, ECS, or anything else — covers gohai's scope (we measured: OCSF gets us ~10-30% of fields; osquery covers ~62% of collectors, with inconsistent naming). So instead of forcing gohai into an ill-fitting standard, we're designing our own clean schema, informed by comprehensive prior-art analysis, then shipping it as the canonical contract: `gohai.schema.json` at repo root.
This PR lands the corpus only. The schema itself, the field-by-field analysis table, and the Go-type refactor to match are follow-ups.
What's in it
Eight tier-1 sources with a direct scope match to gohai:

- OCSF — objects + events + dictionary (266 files)
- OTel — resource semantic conventions (249 files)
- osquery — all 280+ table specs (289 files)
- ECS — Elastic Common Schema field YAMLs (59 files)
- Redfish — DMTF Redfish JSON schemas, latest only (283 files)
- k8s — NodeStatus / NodeInfo from core/v1/types.go (3 files)
- Ohai — plugins + mixins + specs (279 files)
- Facter — Puppet Facter fact schema (946 files)
Total: 2,374 files, ~13 MB.
Each source preserves its upstream `LICENSE` and carries a `PROVENANCE` file recording source URL, fetched commit SHA, and fetch timestamp.
`scripts/corpus-fetch.sh` is re-runnable — clones each upstream at `depth=1`, cherry-picks the schema-relevant paths, strips Redfish's 6,700 historical-version files down to the 280 unversioned heads, and writes `PROVENANCE` metadata.
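A PROVENANCE file of the shape described above could be written with something like the following. This is an illustrative sketch only — the actual script is shell, and the key names and file format here are assumptions:

```python
from datetime import datetime, timezone
from pathlib import Path

def write_provenance(dest: Path, source_url: str, commit_sha: str) -> None:
    """Record where a corpus subdirectory came from: the source URL,
    the commit SHA that was fetched, and the fetch timestamp (UTC)."""
    fetched_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
    dest.joinpath("PROVENANCE").write_text(
        f"source: {source_url}\n"
        f"commit: {commit_sha}\n"
        f"fetched: {fetched_at}\n"
    )
```

Because the timestamp is regenerated on each run, a corpus refresh rewrites PROVENANCE alongside the fetched files, keeping the metadata in sync with the content.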
`schemas/README.md` documents the corpus, refresh workflow, and the broader schema roadmap.
Why committed (not gitignored)
Licensing
All tier-1 sources are permissively licensed (Apache-2, BSD-3, MIT). Each subdirectory preserves its upstream `LICENSE` file for attribution.
What's next (not in this PR)
Test plan
🤖 Generated with Claude Code