CWAS (Category-Wide Association Study) is a data analytic tool to perform stringent association tests to find non-coding loci associated with autism spectrum disorder (ASD). CWAS runs category-based burden tests using de novo variants from whole genome sequencing data and diverse annotation data sets.
CWAS was used in the following papers.
- An analytical framework for whole genome sequence association studies and its implications for autism spectrum disorder (Werling et al., 2018)
- Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder (An et al., 2018)
- CWAS-Plus: Estimating category-wide association of rare noncoding variation from whole-genome sequencing data with cell-type-specific functional data (Kim et al., 2024)
Here is the original CWAS repository: sanderslab/cwas
Users must prepare the following data themselves, because these inputs are essential for CWAS but cannot be generated automatically. The details are below.
```
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 3747728 . T C . . SAMPLE=11000.p1
chr1 38338861 . C A . . SAMPLE=11000.p1
chr1 117942118 . T G . . SAMPLE=11000.p1
```

- The input VCF data must follow the VCF specification.
- The INFO field must contain the sample ID of each variant in the format `SAMPLE={sample_id}`.
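As an illustration of this convention, the per-variant sample ID can be pulled out of an INFO field as below. This is a minimal sketch, not CWAS's own parser:

```python
# Minimal sketch of the SAMPLE={sample_id} convention described above.
# Illustrative only; not CWAS code.
def sample_id_from_info(info: str):
    """Return the value of the SAMPLE key in a VCF INFO field, or None."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "SAMPLE":
            return value
    return None

fields = "chr1\t3747728\t.\tT\tC\t.\t.\tSAMPLE=11000.p1".split("\t")
print(sample_id_from_info(fields[7]))  # prints: 11000.p1
```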
| SAMPLE | PHENOTYPE |
|---|---|
| 11000.p1 | case |
| 11000.s1 | ctrl |
| 11002.p1 | case |
| 11002.s1 | ctrl |
- CWAS requires a file like the one above, listing each sample ID and its phenotype (Case=`case`, Control=`ctrl`).
- Details of the required format:
  - Tab-separated
  - 2 essential columns: SAMPLE and PHENOTYPE
  - Each value in the PHENOTYPE column must be `case` or `ctrl`.
  - The values in the SAMPLE column must match the sample IDs of the variants in the input VCF file.
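Before running CWAS, it can help to sanity-check this file. The sketch below assumes only the format described above (tab-separated, SAMPLE and PHENOTYPE columns); the helper name is ours, not part of CWAS:

```python
import csv

def check_sample_info(path):
    """Return a list of format problems found in a sample info file.
    Checks only the rules stated above: tab separation, required
    columns, and case/ctrl phenotype values."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if not {"SAMPLE", "PHENOTYPE"} <= set(reader.fieldnames or []):
            return ["missing SAMPLE or PHENOTYPE column"]
        for line_no, row in enumerate(reader, start=2):
            if row["PHENOTYPE"] not in ("case", "ctrl"):
                problems.append(f"line {line_no}: PHENOTYPE must be case or ctrl")
    return problems
```

If the returned list is non-empty, fix the file before running the burden test steps.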
| SAMPLE | AdjustFactor |
|---|---|
| 11000.p1 | 0.932 |
| 11000.s1 | 1.082 |
| 11002.p1 | 0.895 |
| 11002.s1 | 1.113 |
- A file like the one above is required if you want to adjust the number of variants for each sample in CWAS.
- Details of the required format:
  - Tab-separated
  - 2 essential columns: SAMPLE and AdjustFactor
  - Each value in the AdjustFactor column must be a float.
  - The values in the SAMPLE column must match the sample IDs of the variants in the input VCF file.
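Conceptually, an adjustment factor scales each sample's contribution to the counts. A toy illustration follows; the exact arithmetic CWAS applies internally may differ:

```python
# Toy illustration of adjustment factors: scale each sample's raw
# variant count by its factor. CWAS's internal arithmetic may differ.
raw_counts = {"11000.p1": 120, "11000.s1": 100}  # made-up counts
factors = {"11000.p1": 0.932, "11000.s1": 1.082}  # from the table above

adjusted = {sample: raw_counts[sample] * factors[sample] for sample in raw_counts}
```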
Example files for the above data requirements are available from joonan-lab/cwas-input-example.

You can install the annotation files from this repository: joonan-lab/cwas-dataset

```
git clone https://github.com/joonan-lab/cwas-dataset.git
```

Due to the sizes of the BigWig files for conservation scores, you must install them manually. Please follow this instruction.
We recommend using a conda virtual environment to build the environment for CWAS. Installing CWAS-Plus2 within a conda environment prevents it from being installed into the global environment. When creating the conda environment, also install Python so that pip installs stay local to the environment. Run the following statements in your shell.
```
conda create -n cwas python=3.11
conda activate cwas
pip install cwas-plus2
```

Or install from source:
```
conda create -n cwas python=3.11
conda activate cwas
git clone https://github.com/joonan-lab/cwas-plus2.git
cd cwas-plus2
pip install .
```

- Python >=3.9, <3.13 is required.
- A Rust compiler is required for building the high-performance categorization engine. Install it with:
```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

- In addition, you must install Variant Effect Predictor (VEP) unless you plan to use the `--docker-mode` option in the annotation step.
Run this command.
```
cwas start
```

By default, this command creates the CWAS workspace at $HOME/.cwas. $HOME/.cwas/configuration.txt is also generated.

Alternatively, the CWAS workspace can be specified with the -w option. The configuration.txt will then be generated in the specified workspace.
```
.cwas
└── configuration.txt
```

Write the following information in the $HOME/.cwas/configuration.txt.
```
ANNOTATION_DATA_DIR=/path/to/your/dir
GENE_MATRIX=/path/to/your/file
ANNOTATION_KEY_CONFIG=/path/to/your/file
VEP=/path/to/your/vep
VEP_CACHE_DIR=/path/to/your/vep/cache/dir
VEP_CONSERVATION_FILE=/path/to/your/vep/resource
VEP_LOFTEE=/path/to/your/vep/resource
VEP_HUMAN_ANCESTOR_FA=/path/to/your/vep/resource
VEP_GERP_BIGWIG=/path/to/your/vep/resource
VEP_MIS_DB=/path/to/your/missense/database
VEP_MIS_INFO_KEY=
VEP_MIS_THRES=
```

The ANNOTATION_DATA_DIR is a directory that contains all the BED files from joonan-lab/cwas-dataset.

The VEP_MIS_DB is a database used to define damaging missense variants. The VEP_MIS_INFO_KEY is a user-defined name for that database, used when annotating variants. The VEP_MIS_THRES is the threshold applied to that database's scores: missense variants with value >= threshold are defined as damaging missense variants.
After writing the above file, run this command.
```
cwas configuration [-f]
```

| Argument | Description |
|---|---|
| `-f, --force_overwrite` | Force to overwrite the result |
By default, the following files will be generated in your home directory. If you specified a CWAS workspace, the files will be located in the same directory as the configuration.txt.
```
.cwas
├── annotation-data
├── annotation_keys.yaml
├── category_domain.yaml
├── configuration.txt
├── gene_matrix.txt
└── redundant_category.txt
.cwas_env
```

This step merges the BED files used to annotate variants. Run the following command.
```
cwas preparation [-p 4] [-f]
```

| Argument | Description | Default |
|---|---|---|
| `-p, --num_proc` | Max number of worker processes | 1 |
| `-f, --force_overwrite` | Force to overwrite the result | - |
After running this, a merged BED file and its index will be generated in your CWAS workspace.
```
.cwas
...
├── merged_annotation.bed.gz
├── merged_annotation.bed.gz.tbi
...
```

This step annotates your VCF file using VEP. Run this command.
```
cwas annotation -v /path/to/your/vcf [-p 4] [-o_dir /path/to/output] [--docker-mode] [--vep-version 115.0]
```

| Argument | Description | Default |
|---|---|---|
| `-v, --vcf_file` | (Required) Target VCF file | - |
| `-p, --num_proc` | Number of processes for the annotation | 1 |
| `-o_dir, --output_directory` | Directory where the output file will be saved | $CWAS_WORKSPACE |
| `--docker-mode` | Run VEP using Docker (`ensemblorg/ensembl-vep`) instead of a local binary | False |
| `--vep-version` | VEP version for the Docker image tag (e.g. 115.0). Only used with `--docker-mode` | latest |
Here is the result file.
```
.cwas
...
├── {Your VCF filename}.annotated.vcf.gz
...
```

This step categorizes your variants using the annotation datasets. Run this command.
```
cwas categorization -i /path/to/your/annotated/vcf [-p 4] [-o_dir /path/to/output]
```

| Argument | Description | Default |
|---|---|---|
| `-i, --input_file` | (Required) Annotated VCF file | - |
| `-p, --num_proc` | Number of worker processes | 1 |
| `-o_dir, --output_directory` | Directory where the output file will be saved | $CWAS_WORKSPACE |
After running this, you will get...
```
.cwas
...
├── {Your VCF filename}.categorization_result.zarr
...
```

Categorization results are written in Zarr format, which makes them compact to store and easy to load.
This step calculates relative risks and p-values for each category. By default, these tests operate at the variant level; the --use_n_carrier option switches to sample-level analysis.

This step runs a category-based burden test on the categorization result, using a binomial test. Run this command.
```
cwas binomial_test -i /path/to/your/categorization/result -s /path/to/your/samples \
    [-a /path/to/your/adj_factors] [-u] [-t TAG] [-num_ef N]
```

| Argument | Description | Default |
|---|---|---|
| `-i, --input_file` | (Required) Categorized file (*.zarr) from the categorization step | - |
| `-s, --sample_info` | (Required) File listing information of your samples | - |
| `-a, --adjustment_factor` | File listing adjustment factors of each sample | None |
| `-u, --use_n_carrier` | Use the number of samples with variants instead of variant counts | False |
| `-t, --tag` | Tags for highlighting points on the volcano plot (e.g. `CRE,CHD8`) | None |
| `-num_ef, --num_effective_test` | Number of effective tests | None |
| `-o_dir, --output_directory` | Directory where the output file will be saved | $CWAS_WORKSPACE |
| `-ms, --marker_size` | Marker size of the volcano plot (pt) | 15 |
| `-fs, --font_size` | Font size of the volcano plot (pt) | 15 |
| `-ps, --plot_size` | Plot size of the volcano plot (inch, square) | 7 |
| `-pt, --plot_title` | Title of the volcano plot | "Binomial test result" |
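As a rough sketch of the statistic behind this step: a two-sided binomial test asks whether the case share of variants in a category deviates from the expected fraction. The pure-Python version below is illustrative only and is not the CWAS implementation, which may construct the test differently:

```python
from math import comb

def binom_two_sided(n_case, n_ctrl, p=0.5):
    """Two-sided exact binomial p-value for observing n_case 'case'
    variants out of n_case + n_ctrl, with expected case probability p.
    Illustration only; not the CWAS implementation."""
    n = n_case + n_ctrl
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[n_case]
    # Sum the probability of every outcome at most as likely as the observed one.
    return min(1.0, sum(q for q in pmf if q <= observed + 1e-12))
```

With, say, 30 case and 10 control variants and p = 0.5, this gives p ≈ 0.002, flagging a case-enriched category.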
After running this, you will get...
```
.cwas
...
├── {Your VCF filename}.burden_test.txt
├── {Your VCF filename}.burden_test.volcano_plot.pdf
├── {Your VCF filename}.category_counts.txt
├── {Your VCF filename}.category_info.txt
...
```

This step runs category-based permutations on the categorization result. Run this command.
```
cwas permutation_test -i /path/to/your/categorization/result -s /path/to/your/samples \
    [-a /path/to/your/adj_factors] [-n 10000] [-p 4] [-b] [-u]
```

| Argument | Description | Default |
|---|---|---|
| `-i, --input_file` | (Required) Categorized file (*.zarr) from the categorization step | - |
| `-s, --sample_info` | (Required) File listing information of your samples | - |
| `-a, --adjustment_factor` | File listing adjustment factors of each sample | None |
| `-n, --num_perm` | Number of label-swapping permutations | 10000 |
| `-p, --num_proc` | Number of worker processes | 1 |
| `-b, --burden_shift` | Generate binomial p-values for each permutation (for burden shift and DAWN) | False |
| `-u, --use_n_carrier` | Use the number of samples with variants instead of variant counts | False |
| `-o_dir, --output_directory` | Directory where the output file will be saved | $CWAS_WORKSPACE |
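The label-swapping idea can be sketched in a few lines of pure Python. The toy version below permutes per-sample counts between the case and control groups; it is a conceptual illustration, not the CWAS implementation:

```python
import random

def permutation_pvalue(case_counts, ctrl_counts, n_perm=1000, seed=42):
    """Empirical two-sided p-value for the case-minus-control count
    difference, by repeatedly swapping case/control labels.
    Conceptual sketch only."""
    observed = sum(case_counts) - sum(ctrl_counts)
    pooled = list(case_counts) + list(ctrl_counts)
    n_case = len(case_counts)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_case]) - sum(pooled[n_case:])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the p-value away from exactly zero.
    return (hits + 1) / (n_perm + 1)
```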
After running this, you will get...
```
.cwas
...
├── {Your VCF filename}.permutation_test.txt
├── {Your VCF filename}.binom_pvals.parquet
...
```

This step generates a correlation matrix between categories.
```
cwas correlation -i /path/to/your/categorization/result -cm variant [-v /path/to/annotated/vcf] \
    [-im] [-p 4] [-d all]
```

| Argument | Description | Default |
|---|---|---|
| `-i, --input_file` | (Required) Categorized file (*.zarr) from the categorization step | - |
| `-cm, --corr_matrix` | (Required) Correlation matrix type: `variant` or `sample` | - |
| `-v, --annotated_vcf` | Annotated VCF file. Required for variant-level correlation (`-cm variant`) | None |
| `-im, --intersection_matrix` | Also generate a matrix of intersected variant/sample counts between categories | False |
| `-p, --num_proc` | Number of worker processes (recommend using 1/3 of available cores) | 1 |
| `-d, --domain_list` | Domain list to filter categories (see Domain options) | all |
| `-c_info, --category_info` | Path to category information file (*.category_info.txt) | None |
| `-o_dir, --output_directory` | Directory where the output file will be saved | $CWAS_WORKSPACE |
This step calculates the effective number of tests via eigen decomposition of the correlation or intersection matrix.
```
cwas effective_num_test -i /path/to/your/matrix -c_count /path/to/category_counts \
    [-if corr] [-n 10000] [-thr N] [-s /path/to/samples] [-d all] [-ef]
```

| Argument | Description | Default |
|---|---|---|
| `-i, --input_file` | (Required) Correlation matrix or intersection matrix from categorization | - |
| `-c_count, --cat_count` | (Required) Category counts file from burden test or sign test | - |
| `-if, --input_format` | Input format: `corr` (correlation matrix) or `inter` (intersection matrix) | corr |
| `-n, --num_eig` | Number of eigenvalues to use | 10000 |
| `-thr, --threshold` | Minimum variant/sample count to filter categories | None |
| `-s, --sample_info` | Sample info file (required when `-thr` is not given) | None |
| `-c_set, --category_set` | Text file containing categories for eigen decomposition | None |
| `-d, --domain_list` | Domain list to filter categories (see Domain options) | all |
| `-ef, --eff_num_test` | Only calculate the effective number of tests (eigenvalues only, skip eigenvectors) | False |
| `-t, --tag` | Tag used for the name of output files | None |
| `-o_dir, --output_directory` | Directory where the output file will be saved | $CWAS_WORKSPACE |
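The general shape of an "effective number of tests" computation: take the eigenvalues of the category correlation matrix and count how many of the largest ones are needed to explain most (here 99%) of the total variance. CWAS's exact criterion may differ; this sketch only illustrates the idea:

```python
def effective_num_tests(eigenvalues, variance_explained=0.99):
    """Smallest number of top eigenvalues whose cumulative sum reaches
    the given fraction of the total. Sketch of the general idea only;
    the criterion CWAS uses may differ."""
    total = sum(eigenvalues)
    running = 0.0
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running >= variance_explained * total:
            return count
    return len(eigenvalues)
```

When categories are strongly correlated, a few large eigenvalues dominate the spectrum, so the effective number of tests is far below the raw category count.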
This step performs burden shift analysis using permutation results.
```
cwas burden_shift -i /path/to/burden_test.txt -b /path/to/binom_pvals.parquet \
    -c_info /path/to/category_info.txt -c_count /path/to/category_counts.txt \
    [-c_cutoff 7] [-pval 0.05] [-N 10]
```

| Argument | Description | Default |
|---|---|---|
| `-i, --input_file` | (Required) Burden test result (*.burden_test.txt) | - |
| `-b, --burden_res` | (Required) Burden shift result from permutation (*.binom_pvals.parquet or *.binom_pvals.txt.gz) | - |
| `-c_info, --category_info` | (Required) Category information file (*.category_info.txt) | - |
| `-c_count, --cat_count` | (Required) Category counts file (*.category_counts.txt) | - |
| `-c_cutoff, --count_cutoff` | Minimum count cutoff for categories (must be positive) | 7 |
| `-pval, --pval` | P-value threshold | 0.05 |
| `-c_set, --category_set` | List of interest category sets for the main output plot | None |
| `-N, --n_cat_sets` | Number of top category sets in the main output plot | 10 |
| `-t, --tag` | Tag used for the name of output files | None |
| `-pt, --plot_title` | Title of the summarized plot | "Burdenshift: Overrepresented terms" |
| `-fs, --fontsize` | Font size of the main output plot | 10 |
| `-o_dir, --output_directory` | Directory where the output file will be saved | $CWAS_WORKSPACE |
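The burden-shift logic can be sketched as: count how many categories are nominally significant in the observed data, then ask how often the permuted data produce at least as many. A toy version of that comparison (not the CWAS implementation):

```python
def burden_shift_pvalue(observed_pvals, permuted_pvals, alpha=0.05):
    """Fraction of permutations with at least as many nominally
    significant categories as observed (add-one corrected).
    Conceptual sketch only."""
    observed_hits = sum(p < alpha for p in observed_pvals)
    perm_hits = [sum(p < alpha for p in row) for row in permuted_pvals]
    more_extreme = sum(h >= observed_hits for h in perm_hits)
    return (more_extreme + 1) / (len(perm_hits) + 1)
```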
This step performs risk score analysis using categorization results.
```
cwas risk_score -i /path/to/categorization_result.zarr -s /path/to/samples \
    [-a /path/to/adj_factors] [-d all] [-thr 3] [-n 1000] [-p 4]
```

| Argument | Description | Default |
|---|---|---|
| `-i, --input_file` | (Required) Categorization result file (*.zarr) | - |
| `-s, --sample_info` | (Required) File listing sample IDs with families and phenotypes | - |
| `-a, --adjustment_factor` | File listing adjustment factors of each sample | None |
| `-c_info, --category_info` | Path to category information file (*.category_info.txt) | None |
| `-d, --domain_list` | Domain list to filter categories (see Domain options) | all |
| `-t, --tag` | Tag used for the name of output files | None |
| `--do_each_one` | Use each annotation individually to calculate the risk score | False |
| `--leave_one_out` | Calculate the risk score excluding one annotation at a time | False |
| `-fs_group, --feature_selection_group` | Groups for feature selection | "gene_set,functional_score,functional_annotation" |
| `-u, --use_n_carrier` | Use the number of samples with variants instead of variant counts | False |
| `-thr, --threshold` | Minimum variant count in controls to select rare categories | 3 |
| `-tf, --train_set_fraction` | Fraction of the training set | 0.7 |
| `-n_reg, --num_regression` | Number of regression trials for the mean R-squared | 10 |
| `-f, --fold` | Number of folds in (Stratified)KFold | 5 |
| `-n, --n_permute` | Number of permutations for p-value calculation | 1000 |
| `--predict_only` | Only predict the risk score, skipping the permutation test | False |
| `-p, --num_proc` | Number of worker processes for permutation | 1 |
| `-S, --seed` | Seed of the random state | 42 |
| `-pt, --plotsize` | Plot size as "width,height" in inches | "7,7" |
| `-fs, --fontsize` | Font size of the main plot | 10 |
| `-o_dir, --output_directory` | Directory where the output file will be saved | $CWAS_WORKSPACE |
This step performs DAWN (Detecting Association With Networks) analysis.
```
cwas dawn -e /path/to/eig_vecs -c /path/to/corr_matrix -P /path/to/permutation_test \
    -c_count /path/to/category_counts [-k K] [-p 4]
```

| Argument | Description | Default |
|---|---|---|
| `-e, --eig_vector` | (Required) Eigenvectors file from the effective number test | - |
| `-c, --corr_mat` | (Required) Category correlation matrix file | - |
| `-P, --permut_test` | (Required) Permutation test result file | - |
| `-c_count, --cat_count` | (Required) Category counts file from the burden test | - |
| `--leiden` | Perform Leiden clustering using: `eigen_vector` or `corr_mat` | None |
| `-res, --resolution` | Resolution for Leiden clustering | 1 |
| `-r, --range` | Range (start,end) to find the optimal K for k-means (start > 1) | "2,100" |
| `-k, --k_val` | Fixed K for k-means clustering (mutually exclusive with `-r`) | None |
| `-s, --seed` | Seed value for t-SNE | 42 |
| `-T, --tsen_method` | t-SNE gradient algorithm: `barnes_hut` or `exact` | exact |
| `-t, --tag` | Tag used for the name of output files | None |
| `-l, --lambda` | Lambda value for parameter tuning | 5.25 |
| `-C, --count_threshold` | Minimum variant/sample count per category | 20 |
| `-R, --corr_threshold` | Correlation threshold between clusters | 0.12 |
| `-S, --size_threshold` | Minimum number of categories per cluster | 2 |
| `--no-parsimonious` | Disable parsimonious K selection (use the absolute silhouette maximum) | Enabled |
| `-p, --num_proc` | Number of worker processes | 1 |
| `-o_dir, --output_directory` | Directory where the output file will be saved | $CWAS_WORKSPACE |
This step extracts variants of interest from the annotated VCF file.
```
cwas extract_variant -i /path/to/annotated/vcf [-c_set /path/to/category_set] [-t TAG] [-ai]
```

| Argument | Description | Default |
|---|---|---|
| `-i, --input_file` | (Required) Annotated VCF file | - |
| `-c_set, --category_set` | Text file containing categories for extracting variants | None |
| `-t, --tag` | Tag used for the output file name (`output.<tag>.extracted_variants.txt.gz`) | None |
| `-ai, --annotation_info` | Save with annotation information attached (gene list, functional annotations, etc.) | False |
| `-o_dir, --output_directory` | Directory where the output file will be saved | $CWAS_WORKSPACE |
Several steps (correlation, effective_num_test, risk_score, dawn) support the -d / --domain_list argument to filter categories by GENCODE domain. Available options:
| Option | Description |
|---|---|
| `run_all` | Test all available domain options |
| `all` | Include all domains (default) |
| `coding` | Coding regions |
| `noncoding` | Non-coding regions |
| `ptv` | Protein-truncating variants |
| `missense` | Missense variants |
| `damaging_missense` | Damaging missense variants |
| `promoter` | Promoter regions |
| `noncoding_wo_promoter` | Non-coding regions without promoters |
| `splice` | Splice site regions |
| `intron` | Intronic regions |
| `intergenic` | Intergenic regions |
| `3primeutr` | 3' UTR |
| `5primeutr` | 5' UTR |
| `utr` | UTR regions |
| `lincRNA` | Long intergenic non-coding RNA |