               input-association-files
                     │         │
                     │    [phenolist]
                     │         │
                     │         v
                     │  pheno-list.json
                     │   │           │
                    [parse]          │
                     │   │           │
                     v   v           │
                    parsed/*         │
                       │ └──────┐    │
                    [sites]     │    │
    rsids.tsv.gz--[add-rsids]   │    │
       genes.bed--[add-genes]   │    │
                       │        │    │
                       v        │    │
                   sites.tsv    │    │
                   │       │    └──[augment-phenos]
              [make-...]   │              │
                   │       │              v
                   v       │          pheno_gz/*
    cpras-rsids-sqlite3    └─[matrix]─┘  │  │  └─[best-of-pheno]─> best_of_pheno/*
                                │        │  └─[qq]-> qq/*
                                v        └─[manhattan]-> manhattan/*
                          matrix.tsv.gz                  │         │
                                │                 [top-hits]  [phenotypes]
                 [gather-pvalues-for-each-gene]          │         │
                                │                        v         v
                                v              top_hits.json  phenotypes.json
                   best-phenos-by-gene.sqlite3

Square brackets show pheweb <step> subcommands.
Filenames are in generated-by-pheweb/ or its subdirectories (except pheno-list.json which is its sibling).
Reference this diagram against the filepaths listed in file_utils.py and the steps in pheweb process -h.
You can see all of the per-variant fields, per-association fields, and per-phenotype fields in parse_utils.py.

- parsed/* files have the per-variant and per-association fields from the input files.
- sites.tsv has every variant in the dataset, with the per-variant fields from the parsed/* files plus rsids and nearest_genes and (optionally) consequence.
- pheno_gz/* files are like parsed/* plus rsids and nearest_genes and (optionally) consequence.
  - Every line in these files must begin with a line from sites.tsv in order for pheweb matrix to work, i.e., they must have the same per-variant fields.
- matrix.tsv.gz contains all the per-variant fields (i.e., an exact copy of sites.tsv in its left few columns), and all per-assoc fields (with header format <fieldname>@<phenocode>, e.g. maf@a1c).
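
To make the <fieldname>@<phenocode> header convention concrete, here is a minimal Python sketch that separates the columns of a matrix.tsv.gz-style header into per-variant fields and per-assoc fields grouped by phenocode. It is not taken from PheWeb's code: the function name split_matrix_header and the example column names are hypothetical, and it assumes that field names never contain "@" and that the header is a single tab-separated line.

```python
from collections import defaultdict


def split_matrix_header(header_line):
    """Split a tab-separated header into per-variant and per-assoc fields.

    Hypothetical helper, not part of PheWeb. Columns shaped like
    <fieldname>@<phenocode> are treated as per-assoc fields; every
    other column is treated as a per-variant field.
    """
    per_variant_fields = []
    per_assoc_fields = defaultdict(list)  # phenocode -> [fieldname, ...]
    for column in header_line.rstrip("\n").split("\t"):
        if "@" in column:
            # Assumption: field names contain no "@", so split on the first one.
            fieldname, phenocode = column.split("@", 1)
            per_assoc_fields[phenocode].append(fieldname)
        else:
            per_variant_fields.append(column)
    return per_variant_fields, dict(per_assoc_fields)


# Illustrative header only; the real column names depend on the dataset.
example_header = "chrom\tpos\tref\talt\trsids\tnearest_genes\tpval@a1c\tmaf@a1c\tpval@bmi"
variant_fields, assoc_fields = split_matrix_header(example_header)
print(variant_fields)
print(assoc_fields)
```

Under these assumptions, the per-variant list should correspond to the columns of sites.tsv, and each phenocode key should collect the per-assoc fields of one phenotype.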