hcg-kg

hcg-kg builds a local, queryable biomedical knowledge graph from parsed clinical guideline JSON files, with an initial focus on AHA guideline content for downstream use in HeartBioPortal.

For HBP 3.0, HCG-KG is the clinical guideline knowledge graph resource. HCG prepares and extracts structured guideline JSON, HCG-KG normalizes that content into graph nodes and edges, and HeartBioPortal uses the resulting guideline context in gene search dossiers through guideline summary/detail layers.

This repository is not about training an LLM on PDFs. The parsed guideline JSON files are treated as the source corpus for ingestion, normalization, structured extraction, graph construction, and source-grounded retrieval. The vendored PDFs are included only as source references for provenance attachment and downstream inspection. Optional local LLMs can assist extraction or summarization offline, but the runtime system is designed to answer from a graph plus provenance-bearing snippets.

Proposed repository architecture and rationale

src/hcg_kg: typed Python package for ingestion, normalization, extraction, graph persistence, and querying.
configs/: YAML profiles, including local-dev, local-medium, and the default hpc-large.
data/: vendored AHA parsed JSON inputs in raw/, vendored source PDFs in source_pdfs/, empty processed/, and a representative sample guideline JSON for tests and demo runs.
docs/: schema, architecture, query contract, and HPC execution notes.
examples/: short CLI examples.
slurm/: batch scripts for HPC execution.
docker/: container assets, including a local Neo4j compose file.
tests/: normalization, extraction, graph, and CLI coverage over sample data.

This layout keeps the repository open-source friendly, reproducible, and ready for both laptop iteration and large offline runs on a cluster.

Chosen stack

Python 3.11+
Typer CLI for a clean command surface
Pydantic models for typed schemas and configuration validation
YAML profiles for reproducible environment-specific configuration
NetworkX backend for local development and tests
Neo4j backend for larger graph persistence workloads
Optional TF-IDF snippet index for lightweight hybrid retrieval
Optional LlamaIndex / Hugging Face / Ollama extras for future local-model extraction

Why this stack

The first version prioritizes robust, source-grounded extraction from heterogeneous parsed JSON. That makes a defensive normalization layer and explicit schema control more important than coupling the core pipeline to any single orchestration library. The repository still exposes clear extension points for LlamaIndex or local-model extractors, while keeping the default path fully open-source and runnable without a finetuning workflow.

Repository tree

hcg-kg/
├── .github/workflows/ci.yml
├── configs/
│   ├── profiles/
│   └── schema/kg_schema.yaml
├── data/
│   ├── processed/
│   ├── raw/
│   └── sample/
├── docker/
├── docs/
├── examples/
├── scripts/
├── slurm/
├── src/hcg_kg/
└── tests/

What the project does

Given parsed guideline JSON files, the pipeline:

normalizes heterogeneous raw structures into a stable internal document model
preserves provenance for guideline title, section path, page, snippet text, and source paths
extracts gene-centric biomedical entities and relations
builds a local knowledge graph
optionally builds a snippet index for hybrid retrieval
exposes a query interface for gene-first lookup and grounded question answering
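
As a rough illustration of the optional snippet index step, the sketch below ranks snippets against a query with TF-IDF weights and cosine similarity. It uses only the standard library; the function names (tokenize, build_index, search) and the smoothed IDF formula are assumptions made for this example and are not taken from the hcg_kg package.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(snippets):
    """Return per-snippet TF-IDF vectors plus the IDF table used to build them."""
    token_lists = [tokenize(s) for s in snippets]
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    n_docs = len(snippets)
    # Smoothed IDF (an assumption for this sketch): rarer tokens weigh more.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        vectors.append({tok: count * idf[tok] for tok, count in counts.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, snippets, vectors, idf, top_k=3):
    """Return up to top_k (snippet, score) pairs with a non-zero score."""
    counts = Counter(tok for tok in tokenize(query) if tok in idf)
    query_vec = {tok: count * idf[tok] for tok, count in counts.items()}
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(snippets[i], score) for score, i in scored[:top_k] if score > 0]
```

In a hybrid setup, a ranker of this kind would typically complement graph lookup by surfacing snippets that mention a term even when no explicit relation was extracted for it.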

Example downstream questions:

What does this guideline say about gene LDLR?
What recommendations, evidence classes, conditions, biomarkers, drugs, or related entities are associated with APOE?
Which exact snippets and page references support those statements?

Why a graph is better than raw JSON search

Raw JSON search can recover text, but it does not resolve entity identity, relation structure, or cross-document traversal. A graph supports:

gene-first lookup over heterogeneous guideline structure
explicit relations between genes, recommendations, conditions, drugs, and biomarkers
easier downstream API integration for HeartBioPortal
provenance-preserving traversal from an answer back to the source snippet
future extension across additional guideline families such as ESC
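
To make the traversal idea concrete, here is a minimal sketch of gene-first lookup over provenance-bearing edges. It deliberately uses plain dictionaries instead of the NetworkX or Neo4j backends so that it runs without dependencies. The node identifiers, relation names, and provenance values are placeholders invented for illustration, not content from any guideline or from the project's schema.

```python
from collections import defaultdict

# Each edge is (source, relation, target, provenance). All values are placeholders.
EDGES = [
    ("gene:LDLR", "ASSOCIATED_WITH", "condition:example-condition",
     {"guideline": "sample guideline", "page": 12, "snippet": "illustrative snippet A"}),
    ("gene:LDLR", "MENTIONED_IN", "recommendation:R1",
     {"guideline": "sample guideline", "page": 14, "snippet": "illustrative snippet B"}),
    ("recommendation:R1", "HAS_EVIDENCE_CLASS", "evidence_class:I",
     {"guideline": "sample guideline", "page": 14, "snippet": "illustrative snippet B"}),
]


def build_adjacency(edges):
    """Group outgoing edges by source node."""
    adjacency = defaultdict(list)
    for source, relation, target, provenance in edges:
        adjacency[source].append((relation, target, provenance))
    return adjacency


def gene_context(adjacency, gene_symbol, max_hops=2):
    """Breadth-first walk from a gene node, keeping the provenance of every edge."""
    start = f"gene:{gene_symbol}"
    seen = {start}
    frontier = [start]
    results = []
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for relation, target, provenance in adjacency.get(node, []):
                results.append((node, relation, target, provenance))
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return results
```

Every returned tuple carries its own provenance record, so a caller can move from an answer back to the page and snippet that support it, which a flat text search over raw JSON does not give directly.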

Configuration profiles

The repository ships with five profiles:

local-dev: smallest settings, defaults to data/sample/*.json
local-medium: larger local run without assuming a graph server
hpc-large: default profile, tuned for cluster-scale offline extraction and Neo4j persistence
hpc-networkx: HPC-oriented extraction settings with a file-backed graph for clusters without Neo4j
hpc-llm: HPC-oriented extraction settings that use a local Hugging Face model through LlamaIndex

hpc-large is the default unless you pass --profile or set HCG_KG_PROFILE.
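
The selection rule can be sketched as follows. This is an illustration only: the function name resolve_profile is hypothetical, and the precedence shown (the --profile flag wins over HCG_KG_PROFILE) is an assumption, since the text above states only that either one overrides the default.

```python
import os
from typing import Mapping, Optional

DEFAULT_PROFILE = "hpc-large"
KNOWN_PROFILES = ("local-dev", "local-medium", "hpc-large", "hpc-networkx", "hpc-llm")


def resolve_profile(cli_profile: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a profile: explicit flag first, then HCG_KG_PROFILE, then the default."""
    env = os.environ if env is None else env
    # Assumed precedence: the CLI flag overrides the environment variable.
    name = cli_profile or env.get("HCG_KG_PROFILE") or DEFAULT_PROFILE
    if name not in KNOWN_PROFILES:
        raise ValueError(f"unknown profile {name!r}; expected one of {KNOWN_PROFILES}")
    return name
```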

Setup

Laptop setup

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pre-commit install

Run the demo pipeline:

hcg-kg run-pipeline --profile local-dev --input-glob "data/sample/*.json"
hcg-kg query --profile local-dev --gene LDLR --pretty

Neo4j setup

For a local property graph service:

cp .env.example .env
docker compose -f docker/docker-compose.neo4j.yml up -d

Set NEO4J_PASSWORD and, if needed, override NEO4J_URI.

HPC setup

Clone the repository onto the cluster.
Create or activate a Python 3.11+ environment.
Export:

export HCG_KG_PROFILE=hpc-large
export NEO4J_PASSWORD="..."

Because the parsed AHA JSONs and source PDFs are vendored in data/raw/*.json and data/source_pdfs/, you can use the repo defaults and skip both HCG_KG_INPUT_GLOB and HCG_KG_SOURCE_PDF_DIR unless you want to override them.
Submit the stage-specific SLURM jobs from slurm/, or run the CLI directly in batch jobs.

If Neo4j is not available on the cluster, use:

export HCG_KG_PROFILE=hpc-networkx

If you want LLM-based extraction with a local Hugging Face model through LlamaIndex:

pip install -e ".[llm]"
pip install llama-index-llms-huggingface
export HCG_KG_PROFILE=hpc-llm

The default hpc-llm profile uses Qwen/Qwen2.5-7B-Instruct. If you have already cached a different local model on the cluster, override it with models.model_name in a profile or by editing configs/profiles/hpc-llm.yaml. Do not run hpc-llm on the login node for the full corpus. Submit it through SLURM instead, for example:

sbatch -A <RT_PROJECT> slurm/run_pipeline_llm.slurm

The script targets the Big Red 200 gpu partition by default.

CLI overview

hcg-kg ingest
hcg-kg normalize
hcg-kg build-graph
hcg-kg build-embeddings
hcg-kg query --gene LDLR
hcg-kg inspect-document --path data/sample/aha_sample_guideline.json
hcg-kg inspect-gene --gene APOE
hcg-kg export-subgraph --gene LDLR --output /tmp/ldlr_subgraph.json
hcg-kg validate
hcg-kg run-pipeline
hcg-kg resume

On a fresh clone, the shortest end-to-end path is:

hcg-kg run-pipeline --profile hpc-large
hcg-kg query --profile hpc-large --gene LDLR --pretty

If you do not have a reachable Neo4j service, run:

hcg-kg run-pipeline --profile hpc-networkx
hcg-kg query --profile hpc-networkx --gene LDLR --pretty

For LLM-based extraction:

hcg-kg run-pipeline --profile hpc-llm
hcg-kg query --profile hpc-llm --gene LDLR --pretty

Source grounding and provenance

Every extracted statement should remain traceable to:

source guideline
section path
page number, if available
snippet text
source JSON path
source PDF path, when resolvable
JSON pointer into the parsed source structure

This repository is explicitly designed to avoid a black-box chatbot workflow.
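
One way to picture the traceability fields listed above is a small record type plus a check that the stored JSON pointer still leads to the stored snippet. The class and function names below are hypothetical and the field names simply mirror the list; the pointer resolution follows the JSON Pointer syntax from RFC 6901.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Provenance:
    """Illustrative provenance record; field names are assumptions."""
    guideline_title: str
    section_path: tuple
    snippet_text: str
    source_json_path: str
    json_pointer: str
    page: Optional[int] = None             # only when the parser exposes it
    source_pdf_path: Optional[str] = None  # only when resolvable


def resolve_pointer(document: Any, pointer: str) -> Any:
    """Resolve an RFC 6901 JSON pointer against an already-parsed JSON value."""
    if pointer == "":
        return document
    if not pointer.startswith("/"):
        raise ValueError(f"invalid JSON pointer: {pointer!r}")
    current = document
    for raw in pointer[1:].split("/"):
        # Unescape in the order the RFC requires: "~1" first, then "~0".
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


def is_grounded(document: Any, prov: Provenance) -> bool:
    """True when the stored snippet still appears at the recorded pointer."""
    target = resolve_pointer(document, prov.json_pointer)
    return isinstance(target, str) and prov.snippet_text in target
```

A check like is_grounded could run during validation to catch statements whose pointer no longer matches the parsed source after a re-parse.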

Incremental and resumable processing

ingest discovers inputs and writes a manifest
normalize writes one normalized document file per input
build-graph reads normalized files and persists graph state
build-embeddings writes a reusable snippet index
resume reuses the manifest and skips finished work unless --force is passed

This supports long-running cluster jobs where retries should not rebuild the world.
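
The skip-finished-work behaviour can be sketched with a JSON manifest that records which stage has completed for each input. The manifest layout, the function names, and the per-item checkpointing are assumptions for illustration; the project's actual manifest format is not described here.

```python
import json
from pathlib import Path


def load_manifest(path: Path) -> dict:
    """Return the stored manifest, or an empty one on a first run."""
    return json.loads(path.read_text()) if path.exists() else {}


def save_manifest(path: Path, manifest: dict) -> None:
    # Write to a temporary file first so an interrupted job
    # never leaves a half-written manifest behind.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    tmp.replace(path)


def run_stage(inputs, manifest_path: Path, stage: str, work, force: bool = False):
    """Run `work` on each input not yet marked done for `stage`; return what ran."""
    manifest = load_manifest(manifest_path)
    ran = []
    for item in inputs:
        status = manifest.setdefault(item, {})
        if status.get(stage) == "done" and not force:
            continue  # finished in an earlier run
        work(item)
        status[stage] = "done"
        save_manifest(manifest_path, manifest)  # checkpoint after every item
        ran.append(item)
    return ran
```

Checkpointing after each item trades a little extra I/O for the guarantee that a job killed at its wall-time limit loses at most one input's worth of work.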

Open-source license choice

The repository uses Apache 2.0. It is permissive, contributor-friendly, and includes a patent grant, which is useful for biomedical and translational informatics projects that may later integrate into larger research or production systems.

Plug-in points for exact AHA JSON schema details

The normalization layer is intentionally defensive because the current AHA parsed JSON files are heterogeneous. The main places to tighten once the exact schema is fully characterized are:

src/hcg_kg/ingest/normalizer.py: add explicit handlers for stable page, table, citation, and recommendation objects once known
src/hcg_kg/extract/heuristic.py: replace or augment heuristics with schema-aware or local-model extraction
configs/profiles/*.yaml: tune chunk sizes, worker counts, and retrieval settings for Big Red 200
docs/schema.md: extend relation types as additional downstream requirements emerge
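
For readers unfamiliar with what a defensive normalizer looks like, the sketch below walks arbitrarily nested parsed JSON and emits text snippets with section path, page, and JSON pointer, without assuming a fixed schema. The candidate key names (text, content, body, page, page_number, title, heading) are guesses for illustration and are not the actual AHA parsed JSON fields, and iter_snippets is a hypothetical name rather than the function in normalizer.py.

```python
from typing import Any, Iterator

TEXT_KEYS = ("text", "content", "body")         # hypothetical field names
PAGE_KEYS = ("page", "page_number", "page_no")  # hypothetical field names
TITLE_KEYS = ("title", "heading")               # hypothetical field names


def _first(mapping: dict, keys) -> Any:
    """Return the first non-empty value found under any of the candidate keys."""
    for key in keys:
        if key in mapping and mapping[key] not in (None, ""):
            return mapping[key]
    return None


def iter_snippets(node: Any, pointer: str = "", section_path: tuple = (),
                  page: Any = None) -> Iterator[dict]:
    """Walk parsed JSON and yield keyed text fields with their context."""
    if isinstance(node, dict):
        title = _first(node, TITLE_KEYS)
        if isinstance(title, str):
            section_path = section_path + (title.strip(),)
        found_page = _first(node, PAGE_KEYS)
        if isinstance(found_page, int):
            page = found_page  # children inherit the nearest known page
        for key, value in node.items():
            escaped = key.replace("~", "~0").replace("/", "~1")
            child = f"{pointer}/{escaped}"
            if key in TEXT_KEYS and isinstance(value, str) and value.strip():
                yield {"text": value.strip(), "section_path": section_path,
                       "page": page, "json_pointer": child}
            elif isinstance(value, (dict, list)):
                yield from iter_snippets(value, child, section_path, page)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_snippets(value, f"{pointer}/{index}", section_path, page)
```

Once the real schema is characterized, explicit handlers for known object types would replace most of this key guessing, which is the tightening described in the list above.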

The vendored PDF copies do not change the ingestion model. They are used only for provenance path resolution.

Limitations in v0.1

Extraction is heuristic-first and intentionally conservative.
Variant extraction is scaffolded but not deeply implemented yet.
Citation graphing is minimal.
Vector retrieval is TF-IDF based by default; neural embeddings are optional future work.
The Neo4j backend is implemented as an optional runtime dependency.

Future work

stronger entity normalization against HGNC and biomedical ontologies
richer recommendation and evidence parsing from known AHA section layouts
better citation extraction and reference linking
hybrid retrieval with local embedding models
ESC and other guideline-family adapters
HeartBioPortal-facing REST service layer

How this repository supports HBP 3.0

HCG-KG connects cardiovascular guideline source evidence to genes, variants, conditions, biomarkers, recommendations, evidence classes, evidence levels, and drugs/interventions in a provenance-bearing graph. HBP can query this graph or exported graph artifacts to show guideline context alongside gene, variant, protein, association, and drug-discovery evidence.

Related HBP 3.0 repositories:

HeartBioPortal organization: https://github.com/HeartBioPortal
Live site: https://heartbioportal.org/
HCG guideline extraction resource: https://github.com/HeartBioPortal/HCG
DataHub: https://github.com/HeartBioPortal/DataHub

Manuscript release

This repository supports the HeartBioPortal 3.0 NAR Database Issue manuscript release (v3.0.0-nar). Release-support files include graph schema documentation, graph/output manifests, examples, provenance documentation, citation metadata, Zenodo metadata, and checksum tooling.

Guideline graph outputs expose source-grounded context only. They are not medical advice, automated clinical recommendations, or direct clinical actionability.

Security and privacy

No controlled individual-level human data should be committed. Do not commit API keys, credentials, protected data, tokens, or restricted source data. Guideline PDFs, snippets, and parsed source JSON remain subject to source-specific licensing and publisher/society terms.
