hcg-kg builds a local, queryable biomedical knowledge graph from parsed clinical guideline JSON files, with an initial focus on AHA guideline content for downstream use in HeartBioPortal.
For HBP 3.0, HCG-KG is the clinical guideline knowledge graph resource. HCG prepares and extracts structured guideline JSON, HCG-KG normalizes that content into graph nodes and edges, and HeartBioPortal uses the resulting guideline context in gene search dossiers through guideline summary/detail layers.
This repository is not about training an LLM on PDFs. The parsed guideline JSON files are treated as the source corpus for ingestion, normalization, structured extraction, graph construction, and source-grounded retrieval. The vendored PDFs are included only as source references for provenance attachment and downstream inspection. Optional local LLMs can assist extraction or summarization offline, but the runtime system is designed to answer from a graph plus provenance-bearing snippets.
- src/hcg_kg: typed Python package for ingestion, normalization, extraction, graph persistence, and querying.
- configs/: YAML profiles for local-dev, local-medium, and the default hpc-large.
- data/: vendored AHA parsed JSON inputs in raw/, vendored source PDFs in source_pdfs/, an empty processed/, and a representative sample guideline JSON for tests and demo runs.
- docs/: schema, architecture, query contract, and HPC execution notes.
- examples/: short CLI examples.
- slurm/: batch scripts for HPC execution.
- docker/: container assets, including a local Neo4j compose file.
- tests/: normalization, extraction, graph, and CLI coverage over sample data.
This layout keeps the repository open-source friendly, reproducible, and ready for both laptop iteration and large offline runs on a cluster.
- Python 3.11+
- Typer CLI for a clean command surface
- Pydantic models for typed schemas and configuration validation
- YAML profiles for reproducible environment-specific configuration
- NetworkX backend for local development and tests
- Neo4j backend for larger graph persistence workloads
- Optional TF-IDF snippet index for lightweight hybrid retrieval
- Optional LlamaIndex / Hugging Face / Ollama extras for future local-model extraction
The first version prioritizes robust, source-grounded extraction from heterogeneous parsed JSON. That makes a defensive normalization layer and explicit schema control more important than coupling the core pipeline to any single orchestration library. The repository still exposes clear extension points for LlamaIndex or local-model extractors, while keeping the default path fully open-source and runnable without a finetuning workflow.
hcg-kg/
├── .github/workflows/ci.yml
├── configs/
│ ├── profiles/
│ └── schema/kg_schema.yaml
├── data/
│ ├── processed/
│ ├── raw/
│ └── sample/
├── docker/
├── docs/
├── examples/
├── scripts/
├── slurm/
├── src/hcg_kg/
└── tests/
Given parsed guideline JSON files, the pipeline:
- normalizes heterogeneous raw structures into a stable internal document model
- preserves provenance for guideline title, section path, page, snippet text, and source paths
- extracts gene-centric biomedical entities and relations
- builds a local knowledge graph
- optionally builds a snippet index for hybrid retrieval
- exposes a query interface for gene-first lookup and grounded question answering
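The optional snippet index in the list above is described elsewhere in this README as TF-IDF based. The following is a minimal, standard-library-only sketch of that idea, not the package's actual implementation: the function names (`tokenize`, `build_index`, `search`), the smoothed IDF formula, and the placeholder snippets are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(snippets: list[str]) -> tuple[list[dict[str, float]], dict[str, float]]:
    """Return one TF-IDF vector per snippet plus the shared IDF table."""
    tokenized = [tokenize(s) for s in snippets]
    n_docs = len(tokenized)
    # Document frequency: in how many snippets each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant; other smoothing choices work too).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, snippets, vectors, idf, top_k=3):
    """Rank snippets against the query; terms unknown to the index are ignored."""
    q_counts = Counter(t for t in tokenize(query) if t in idf)
    q_vec = {t: c * idf[t] for t, c in q_counts.items()}
    scored = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(snippets[i], score) for score, i in scored[:top_k] if score > 0.0]


# Placeholder text only; these are not real guideline snippets.
SNIPPETS = [
    "Placeholder snippet that mentions LDLR and familial testing.",
    "Placeholder snippet about lipid panels and follow-up.",
    "Placeholder snippet that mentions APOE.",
]
vectors, idf = build_index(SNIPPETS)
hits = search("LDLR familial", SNIPPETS, vectors, idf)
```

In a hybrid setup, scores like these would rank candidate snippets, while the graph supplies the entities and relations that the snippets support.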
Example downstream questions:
- What does this guideline say about gene LDLR?
- What recommendations, evidence classes, conditions, biomarkers, drugs, or related entities are associated with APOE?
- Which exact snippets and page references support those statements?
Raw JSON search can recover text, but it does not resolve entity identity, relation structure, or cross-document traversal. A graph supports:
- gene-first lookup over heterogeneous guideline structure
- explicit relations between genes, recommendations, conditions, drugs, and biomarkers
- easier downstream API integration for HeartBioPortal
- provenance-preserving traversal from an answer back to the source snippet
- future extension across additional guideline families such as ESC
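To make the gene-first lookup and provenance-preserving traversal above concrete, here is a small sketch. The repository's backends are NetworkX and Neo4j; this example deliberately uses plain dictionaries so it runs without dependencies. The `Edge` and `TinyGraph` names, the node identifiers, and the relation labels are hypothetical and are not taken from the project's schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    """A typed relation that carries its own supporting evidence."""
    source: str
    relation: str
    target: str
    snippet: str
    page: Optional[int]


class TinyGraph:
    """Adjacency-list graph keyed by source node (illustrative only)."""

    def __init__(self) -> None:
        self._out: dict[str, list[Edge]] = defaultdict(list)

    def add_edge(self, edge: Edge) -> None:
        self._out[edge.source].append(edge)

    def neighbors(self, node: str, relation: Optional[str] = None) -> list[Edge]:
        # .get() avoids creating empty entries for unknown nodes.
        return [
            e for e in self._out.get(node, [])
            if relation is None or e.relation == relation
        ]


graph = TinyGraph()
# Hypothetical nodes and placeholder snippets, not real guideline content.
graph.add_edge(Edge("gene:LDLR", "MENTIONED_IN", "recommendation:placeholder-1",
                    "Placeholder snippet text.", 12))
graph.add_edge(Edge("gene:LDLR", "ASSOCIATED_WITH", "condition:placeholder",
                    "Another placeholder snippet.", None))

ldlr_edges = graph.neighbors("gene:LDLR")
ldlr_mentions = graph.neighbors("gene:LDLR", relation="MENTIONED_IN")
```

Because each edge stores its snippet and page, a caller can go from an answer about a gene straight back to the supporting text without a second lookup.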
The repository ships with five profiles:
- local-dev: smallest settings, defaults to data/sample/*.json
- local-medium: larger local run without assuming a graph server
- hpc-large: default profile, tuned for cluster-scale offline extraction and Neo4j persistence
- hpc-networkx: HPC-oriented extraction settings with a file-backed graph for clusters without Neo4j
- hpc-llm: HPC-oriented extraction settings that use a local Hugging Face model through LlamaIndex
hpc-large is the default unless you pass --profile or set HCG_KG_PROFILE.
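The selection rule above can be sketched as a small resolver. This is an illustration rather than the CLI's actual code: `resolve_profile` is a hypothetical helper, and the precedence shown (an explicit `--profile` value wins over `HCG_KG_PROFILE`) is an assumption, since the README does not say which takes priority when both are set.

```python
import os
from typing import Mapping, Optional

DEFAULT_PROFILE = "hpc-large"
KNOWN_PROFILES = ("local-dev", "local-medium", "hpc-large", "hpc-networkx", "hpc-llm")


def resolve_profile(
    cli_profile: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick a profile name: CLI value, then environment, then the default."""
    env = os.environ if env is None else env
    # Assumed precedence: explicit flag > HCG_KG_PROFILE > built-in default.
    name = cli_profile or env.get("HCG_KG_PROFILE") or DEFAULT_PROFILE
    if name not in KNOWN_PROFILES:
        raise ValueError(f"unknown profile {name!r}; expected one of {KNOWN_PROFILES}")
    return name


default_choice = resolve_profile(None, {})
env_choice = resolve_profile(None, {"HCG_KG_PROFILE": "hpc-networkx"})
flag_choice = resolve_profile("local-dev", {"HCG_KG_PROFILE": "hpc-llm"})
```

Rejecting unknown names early means a typo fails immediately, which matters most when the job is queued on a cluster.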
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pre-commit install
Run the demo pipeline:
hcg-kg run-pipeline --profile local-dev --input-glob "data/sample/*.json"
hcg-kg query --profile local-dev --gene LDLR --pretty
For a local property graph service:
cp .env.example .env
docker compose -f docker/docker-compose.neo4j.yml up -d
Set NEO4J_PASSWORD and, if needed, override NEO4J_URI.
- Clone the repository onto the cluster.
- Create or activate a Python 3.11+ environment.
- Export:
export HCG_KG_PROFILE=hpc-large
export NEO4J_PASSWORD="..."
- Because the parsed AHA JSONs and source PDFs are vendored in data/raw/*.json and data/source_pdfs/, you can use the repo defaults and skip both HCG_KG_INPUT_GLOB and HCG_KG_SOURCE_PDF_DIR unless you want to override them.
- Submit the stage-specific SLURM jobs from slurm/, or run the CLI directly in batch jobs.
If Neo4j is not available on the cluster, use:
export HCG_KG_PROFILE=hpc-networkx
If you want LLM-based extraction with a local Hugging Face model through LlamaIndex:
pip install -e ".[llm]"
pip install llama-index-llms-huggingface
export HCG_KG_PROFILE=hpc-llm
The default hpc-llm profile uses Qwen/Qwen2.5-7B-Instruct. If you have already cached a different local model on the cluster, override it with models.model_name in a profile or by editing configs/profiles/hpc-llm.yaml.
Do not run hpc-llm on the login node for the full corpus. Submit it through SLURM instead, for example:
sbatch -A <RT_PROJECT> slurm/run_pipeline_llm.slurm
The script targets the Big Red 200 gpu partition by default.
hcg-kg ingest
hcg-kg normalize
hcg-kg build-graph
hcg-kg build-embeddings
hcg-kg query --gene LDLR
hcg-kg inspect-document --path data/sample/aha_sample_guideline.json
hcg-kg inspect-gene --gene APOE
hcg-kg export-subgraph --gene LDLR --output /tmp/ldlr_subgraph.json
hcg-kg validate
hcg-kg run-pipeline
hcg-kg resume
On a fresh clone, the shortest end-to-end path is:
hcg-kg run-pipeline --profile hpc-large
hcg-kg query --profile hpc-large --gene LDLR --pretty
If you do not have a reachable Neo4j service, run:
hcg-kg run-pipeline --profile hpc-networkx
hcg-kg query --profile hpc-networkx --gene LDLR --pretty
For LLM-based extraction:
hcg-kg run-pipeline --profile hpc-llm
hcg-kg query --profile hpc-llm --gene LDLR --pretty
Every extracted statement should remain traceable to:
- source guideline
- section path
- page number, if available
- snippet text
- source JSON path
- source PDF path, when resolvable
- JSON pointer into the parsed source structure
This repository is explicitly designed to avoid a black-box chatbot workflow.
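One way to picture the traceability fields listed above is as a single record attached to every extracted statement. The sketch below uses a standard-library dataclass so it has no dependencies, although the package itself uses Pydantic models. The class name `Provenance`, the field names, and the validation rules are assumptions for illustration and do not reproduce the project's schema.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    """Where an extracted statement came from (hypothetical field names)."""
    guideline_title: str
    section_path: tuple[str, ...]
    snippet_text: str
    source_json_path: str
    json_pointer: str
    page: Optional[int] = None             # "if available"
    source_pdf_path: Optional[str] = None  # "when resolvable"

    def __post_init__(self) -> None:
        if not self.snippet_text.strip():
            raise ValueError("snippet_text must not be empty")
        # RFC 6901: a JSON Pointer is either empty or starts with '/'.
        if self.json_pointer and not self.json_pointer.startswith("/"):
            raise ValueError("json_pointer must be empty or start with '/'")


# Placeholder values only; not real guideline content.
example = Provenance(
    guideline_title="Placeholder guideline title",
    section_path=("Section A", "Subsection 1"),
    snippet_text="Placeholder snippet text.",
    source_json_path="data/sample/aha_sample_guideline.json",
    json_pointer="/sections/0/text",
)
example_dict = asdict(example)
```

Making the page and PDF path optional mirrors the "if available" and "when resolvable" wording, while the remaining fields are required so no statement can be stored without a source.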
- ingest discovers inputs and writes a manifest
- normalize writes one normalized document file per input
- build-graph reads normalized files and persists graph state
- build-embeddings writes a reusable snippet index
- resume reuses the manifest and skips finished work unless --force is passed
This supports long-running cluster jobs, where a retry should not redo work that has already finished.
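The skip-unless-forced behavior can be sketched with a JSON manifest that records which inputs are done. This is a simplified illustration under stated assumptions: `run_stage`, the manifest layout (a flat mapping from input path to status), and the per-item write are hypothetical and are not the repository's actual manifest format.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def run_stage(
    inputs: Iterable[Path],
    manifest_path: Path,
    work: Callable[[Path], None],
    force: bool = False,
) -> list[str]:
    """Run `work` on each input, skipping entries already marked done."""
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    processed: list[str] = []
    for path in inputs:
        key = str(path)
        if manifest.get(key) == "done" and not force:
            continue  # finished in an earlier run
        work(path)
        manifest[key] = "done"
        # Persist after every item so an interrupted job keeps its progress.
        manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
        processed.append(key)
    return processed


with tempfile.TemporaryDirectory() as tmp:
    manifest_file = Path(tmp) / "manifest.json"
    files = [Path("a.json"), Path("b.json")]
    seen: list[Path] = []
    first = run_stage(files, manifest_file, seen.append)               # processes both
    second = run_stage(files, manifest_file, seen.append)              # skips both
    forced = run_stage(files, manifest_file, seen.append, force=True)  # redoes both
```

Writing the manifest after each item costs some extra I/O, but a job that is killed partway through then loses at most the item in flight.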
The repository uses Apache 2.0. It is permissive, contributor-friendly, and includes a patent grant, which is useful for biomedical and translational informatics projects that may later integrate into larger research or production systems.
The normalization layer is intentionally defensive because the current AHA parsed JSON files are heterogeneous. The main places to tighten once the exact schema is fully characterized are:
- src/hcg_kg/ingest/normalizer.py: add explicit handlers for stable page, table, citation, and recommendation objects once known
- src/hcg_kg/extract/heuristic.py: replace or augment heuristics with schema-aware or local-model extraction
- configs/profiles/*.yaml: tune chunk sizes, worker counts, and retrieval settings for Big Red 200
- docs/schema.md: extend relation types as additional downstream requirements emerge
The vendored PDF copies do not change the ingestion model. They are used only for provenance path resolution.
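As an illustration of what "defensive" normalization can mean for heterogeneous parsed JSON, the sketch below walks an arbitrary structure, collects text wherever it appears, and records a JSON pointer and the nearest known page number for each snippet. It is not the code in normalizer.py: the key names in `TEXT_KEYS` and `PAGE_KEYS` are guesses, and the sample input is invented.

```python
from typing import Any, Iterator, Optional

TEXT_KEYS = ("text", "content", "body")  # hypothetical key names
PAGE_KEYS = ("page", "page_number")      # hypothetical key names


def escape(token: str) -> str:
    """Escape a reference token as required by RFC 6901."""
    return token.replace("~", "~0").replace("/", "~1")


def iter_snippets(
    node: Any, pointer: str = "", page: Optional[int] = None
) -> Iterator[dict]:
    """Yield text snippets with their JSON pointer and inherited page."""
    if isinstance(node, dict):
        for key in PAGE_KEYS:
            value = node.get(key)
            if isinstance(value, int) and not isinstance(value, bool):
                page = value  # children inherit the closest page seen so far
                break
        for key, value in node.items():
            child = f"{pointer}/{escape(str(key))}"
            if key in TEXT_KEYS and isinstance(value, str) and value.strip():
                yield {"text": value.strip(), "json_pointer": child, "page": page}
            else:
                yield from iter_snippets(value, child, page)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from iter_snippets(item, f"{pointer}/{index}", page)
    # Scalars outside a recognized text key are ignored.


# Invented input showing two different shapes for the same kind of content.
raw = {
    "title": "Placeholder guideline",
    "sections": [
        {"page": 3, "text": "First placeholder paragraph."},
        {"page_number": 4, "content": [{"text": "Nested placeholder paragraph."}]},
    ],
}
snippets = list(iter_snippets(raw))
```

Once the real object shapes are characterized, a generic walk like this would give way to the explicit handlers listed above, and the pointer it records is what lets a snippet be traced back into the source JSON.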
- Extraction is heuristic-first and intentionally conservative.
- Variant extraction is scaffolded but not deeply implemented yet.
- Citation graphing is minimal.
- Vector retrieval is TF-IDF based by default; neural embeddings are optional future work.
- The Neo4j backend is implemented as an optional runtime dependency.
- stronger entity normalization against HGNC and biomedical ontologies
- richer recommendation and evidence parsing from known AHA section layouts
- better citation extraction and reference linking
- hybrid retrieval with local embedding models
- ESC and other guideline-family adapters
- HeartBioPortal-facing REST service layer
HCG-KG connects cardiovascular guideline source evidence to genes, variants, conditions, biomarkers, recommendations, evidence classes, evidence levels, and drugs/interventions in a provenance-bearing graph. HBP can query this graph or exported graph artifacts to show guideline context alongside gene, variant, protein, association, and drug-discovery evidence.
Related HBP 3.0 repositories:
- HeartBioPortal organization: https://github.com/HeartBioPortal
- Live site: https://heartbioportal.org/
- HCG guideline extraction resource: https://github.com/HeartBioPortal/HCG
- DataHub: https://github.com/HeartBioPortal/DataHub
This repository supports the HeartBioPortal 3.0 NAR Database Issue manuscript release (v3.0.0-nar). Release-support files include graph schema documentation, graph/output manifests, examples, provenance documentation, citation metadata, Zenodo metadata, and checksum tooling.
Guideline graph outputs expose source-grounded context only. They are not medical advice, automated clinical recommendations, or direct clinical actionability.
No controlled individual-level human data should be committed. Do not commit API keys, credentials, protected data, tokens, or restricted source data. Guideline PDFs, snippets, and parsed source JSON remain subject to source-specific licensing and publisher/society terms.