Commit fb9aa56

update README
looks OK? anything else to add @epbrenner @AbhirupaGhosh
1 parent dc4868a commit fb9aa56

2 files changed: +408 -68 lines changed

README.Rmd

Lines changed: 185 additions & 7 deletions
@@ -17,7 +17,6 @@ knitr::opts_chunk$set(
1717

1818
<!-- badges: start -->
1919
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
20-
[![CRAN status](https://www.r-pkg.org/badges/version/amRml)](https://CRAN.R-project.org/package=amRml)
2120
<!-- badges: end -->
2221

2322
amRdata is the first package in the [amR suite](https://github.com/JRaviLab/amR)
@@ -29,6 +28,22 @@ molecular scales, genes, proteins, protein domains, and structure
2928
associated AMR phenotypes and isolate metadata for downstream ML modeling
3029
in package 2, amRml.
3130

31+
## Overview
32+
33+
amRdata provides functions to:
34+
35+
- Query and download bacterial genome data from BV-BRC
36+
- Access paired antimicrobial susceptibility testing (AST) results
37+
- Extract molecular features across four scales:
38+
- Gene clusters (pangenome analysis)
39+
- Protein clusters (sequence similarity)
40+
- Protein domains (functional annotations)
41+
- Structural variants (genome rearrangements)
42+
- Generate summary visualizations of AMR trends
43+
- Store data efficiently in Parquet and DuckDB formats
44+
45+
See the [package vignette](https://jravilab.github.io/amR_data/articles/intro.html) for detailed usage.
46+
3247
## Installation
3348

3449
```r
@@ -44,17 +59,180 @@ remotes::install_github("JRaviLab/amR_data")
4459
```r
4560
library(amRdata)
4661

62+
# Download genome data for a species
63+
cje_data <- download_bvbrc_data(
64+
taxon = "Campylobacter jejuni",
65+
drugs = c("ciprofloxacin", "tetracycline"),
66+
output_dir = "cje_genomes"
67+
)
68+
69+
# Extract multi-scale features
70+
features <- extract_features(
71+
genome_dir = "cje_genomes",
72+
scales = c("gene", "protein", "domain", "struct"),
73+
taxon = "Campylobacter jejuni",
74+
parallel = TRUE,
75+
n_cores = 4
76+
)
77+
78+
# Explore metadata summaries
79+
summarize_metadata(features$metadata, by = "country")
80+
81+
# Visualize geographic distribution
82+
plot_geographic_distribution(features$metadata)
4783
```
4884

4985
## Features
5086

51-
- **Data download from BV-BRC**
52-
- **Processing genomic sequences, metadata, and AMR phenotypes**
53-
- **Pangenome construction**
54-
- **Featurization into multiple molecular scales (genes --> struct, proteins, domains)**
55-
- **Processing data across molecular scales for ML modeling**
87+
### Data curation
5688

57-
See the [package vignette](https://jravilab.github.io/amR_data/articles/intro.html) for detailed usage.
89+
The package interfaces with BV-BRC (Bacterial and Viral Bioinformatics Resource Center) to access bacterial genome sequences and antimicrobial susceptibility testing data, either over FTP or through the BV-BRC CLI wrapped in a Docker container for reproducible access:
90+
91+
- Query isolate metadata with flexible filtering
92+
- Download genome files (.fna, .faa, .gff)
93+
- Retrieve AST results linking genotypes to phenotypes
94+
- Apply quality control filters (assembly quality, metadata completeness)
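
The quality control step can be pictured with a minimal base R sketch. The column names (`genome_quality`, `country`, `collection_year`) and the toy values are assumptions for illustration only, not the package's actual schema or filtering code:

```r
# Hypothetical isolate metadata; column names and values are illustrative.
metadata <- data.frame(
  genome_id       = c("g1", "g2", "g3", "g4"),
  genome_quality  = c("Good", "Poor", "Good", "Good"),
  country         = c("USA", NA, "India", "Kenya"),
  collection_year = c(2015, 2018, NA, 2020)
)

# Fields that must be present for an isolate to count as "complete" (assumed).
required <- c("country", "collection_year")

# Keep isolates with a good assembly and no missing required metadata.
keep <- metadata$genome_quality == "Good" &
  complete.cases(metadata[, required])
filtered <- metadata[keep, ]

filtered$genome_id
```

Only `g1` and `g4` pass both filters here: `g2` fails on assembly quality and `g3` on a missing collection year.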
95+
96+
### Feature extraction
97+
98+
Features are extracted at four complementary molecular scales (structural variants are derived from the pangenome and are described under gene clusters):
99+
100+
#### 1. Gene clusters (pangenome)
101+
102+
Uses Panaroo to construct graph-based pangenomes:
103+
- Handles large isolate sets (>5,000 genomes) through parallelized subset analysis
104+
- Merges multiple pangenome runs into a single unified pangenome
105+
- Generates gene presence/absence matrix per isolate
106+
- Identifies structural variants (gene triplets indicating genome rearrangements)
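
As a rough illustration of the subset idea, a large set of genome IDs can be split into fixed-size batches before each batch is passed to a separate pangenome run. The batch size of 5,000 and the ID format are assumptions; how the package actually partitions and merges runs is not shown here:

```r
# Hypothetical genome identifiers for a large species collection.
genome_ids <- sprintf("genome_%05d", 1:12000)

# Assumed upper bound on genomes per pangenome run.
batch_size <- 5000

# Assign consecutive IDs to batches 1, 2, 3, ...
batches <- split(genome_ids, ceiling(seq_along(genome_ids) / batch_size))

# Number of genomes in each batch.
lengths(batches)
```

Each batch could then be processed in parallel and the per-batch results merged into one pangenome.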
107+
108+
#### 2. Protein clusters
109+
110+
CD-HIT-based clustering of protein sequences:
111+
- Clusters proteins across all isolates
112+
- Creates protein presence/absence matrices
113+
- Complements gene-level features with protein-level resolution
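
A presence/absence matrix of this kind can be sketched in base R from a long table of cluster memberships. The `membership` table below is toy data with assumed column names, not the package's internal format:

```r
# Hypothetical membership table: one row per protein, with its isolate and cluster.
membership <- data.frame(
  isolate = c("iso1", "iso1", "iso2", "iso3", "iso3", "iso3"),
  cluster = c("c1", "c2", "c1", "c2", "c3", "c3")
)

# Cross-tabulate isolates against clusters (cells hold protein counts).
counts <- table(membership$isolate, membership$cluster)

# Collapse counts to 0/1: an isolate either carries a cluster or it does not.
presence <- (unclass(counts) > 0) * 1L

presence
```

Note that `iso3` has two proteins in cluster `c3`, but the matrix records a single 1 for that cell.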
114+
115+
#### 3. Protein domains
116+
117+
InterProScan analysis of representative protein sequences:
118+
- Identifies Pfam domains across the protein space
119+
- Maps domains to isolates through protein cluster membership
120+
- Provides functional annotation layer
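
Mapping domains to isolates through cluster membership amounts to a join: domains found on a cluster's representative sequence are assigned to every isolate that carries that cluster. The sketch below uses base R with toy data; the table layouts and the placeholder domain labels (`domA`, `domB`, `domC`) are assumptions:

```r
# Hypothetical cluster membership per isolate.
membership <- data.frame(
  isolate = c("iso1", "iso1", "iso2", "iso3"),
  cluster = c("c1", "c2", "c1", "c3")
)

# Hypothetical domain hits on each cluster's representative sequence.
rep_domains <- data.frame(
  cluster = c("c1", "c1", "c2", "c3"),
  domain  = c("domA", "domB", "domB", "domC")
)

# Join on cluster, then keep one row per isolate/domain pair.
joined <- merge(membership, rep_domains, by = "cluster")
isolate_domains <- unique(joined[, c("isolate", "domain")])

# Domain presence/absence matrix per isolate.
domain_presence <- (unclass(table(isolate_domains$isolate,
                                  isolate_domains$domain)) > 0) * 1L

domain_presence
```

Because only representatives are annotated, the expensive domain search runs once per cluster rather than once per protein.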
121+
122+
123+
### Data storage
124+
125+
All feature matrices and metadata are stored in efficient formats:
126+
- **Parquet**: Columnar storage for large matrices
127+
- **DuckDB**: SQL-queryable database for rapid filtering
128+
- **Metadata**: Geographic, temporal, and host information per isolate
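
To give a feel for how the two formats work together, the sketch below writes a toy feature table to Parquet and filters it with SQL through DuckDB. It assumes the `DBI` and `duckdb` R packages are installed; the table name, column names, and file name are hypothetical and do not reflect the files amRdata writes:

```r
library(DBI)

# Toy presence/absence table (hypothetical columns).
features <- data.frame(
  isolate = c("iso1", "iso2", "iso3"),
  gene_a  = c(1L, 0L, 1L),
  gene_b  = c(1L, 1L, 0L)
)

# In-memory DuckDB database.
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "features", features)

# Export the table to a Parquet file in a temporary directory.
parquet_path <- file.path(tempdir(), "toy_features.parquet")
parquet_path <- gsub("\\", "/", parquet_path, fixed = TRUE)  # forward slashes for SQL
dbExecute(con, sprintf("COPY features TO '%s' (FORMAT PARQUET)", parquet_path))

# Query the Parquet file directly, without loading it into R first.
hits <- dbGetQuery(con, sprintf(
  "SELECT isolate FROM read_parquet('%s') WHERE gene_a = 1 ORDER BY isolate",
  parquet_path
))

dbDisconnect(con, shutdown = TRUE)

hits$isolate
```

Filtering in SQL before pulling rows into R keeps memory use low when matrices have thousands of columns.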
129+
130+
### Visualization
131+
132+
Built-in functions for exploring data:
133+
- Geographic distributions (maps, treemaps)
134+
- Temporal trends in AMR
135+
- Host organism analyses
136+
- Feature distribution summaries
137+
138+
## Workflow example
139+
140+
Complete workflow for processing *Staphylococcus epidermidis*:
141+
142+
```r
143+
library(amRdata)
144+
145+
# 1. Download data
146+
sepi_data <- download_bvbrc_data(
147+
taxon = "Staphylococcus epidermidis",
148+
drugs = c("oxacillin", "vancomycin", "ciprofloxacin"),
149+
min_assembly_quality = "Good",
150+
output_dir = "sepi_data"
151+
)
152+
153+
# 2. Extract features (all scales)
154+
features <- extract_features(
155+
genome_dir = "sepi_data",
156+
scales = c("gene", "protein", "domain", "struct"),
157+
taxon = "Staphylococcus epidermidis",
158+
parallel = TRUE,
159+
n_cores = 8
160+
)
161+
162+
# 3. Explore metadata
163+
summary(features$metadata)
164+
165+
plot_temporal_trends(
166+
features$metadata,
167+
drug = "oxacillin",
168+
by_year = TRUE
169+
)
170+
171+
# 4. Save processed data
172+
save_features(
173+
features,
174+
output_file = "sepi_features.parquet"
175+
)
176+
```
177+
178+
## Data requirements
179+
180+
### Input
181+
182+
The package requires:
183+
- Internet connection to access BV-BRC
184+
- Docker for BV-BRC CLI (automatically managed)
185+
- Sufficient storage for genome files (varies by species; typically 1-10 GB)
186+
187+
### Output
188+
189+
Feature matrix dimensions depend on the species:
190+
- Rows: Number of isolates (typically 1,000-10,000)
191+
- Columns: Number of features
192+
- Genes: 2,000-20,000
193+
- Proteins: 2,000-15,000
194+
- Domains: 500-5,000
195+
- Structural variants: 100-5,000
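
For a sense of scale, the in-memory size of a dense matrix at the upper end of these ranges can be estimated with simple arithmetic. This is a back-of-the-envelope sketch that assumes a dense integer matrix in R (4 bytes per cell), not a measurement of the package's output:

```r
# Upper end of the ranges listed above.
n_isolates <- 10000
n_genes    <- 20000

# R stores integer (and logical) values in 4 bytes each.
bytes_per_cell <- 4

# Approximate size of a dense matrix, in gigabytes (10^9 bytes).
dense_gb <- n_isolates * n_genes * bytes_per_cell / 1e9

dense_gb
```

Under these assumptions the gene matrix alone is about 0.8 GB in memory, which is one reason to prefer columnar storage and on-disk querying.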
196+
197+
## External dependencies
198+
199+
The package uses established bioinformatics tools:
200+
- **Panaroo** (≥1.3.0): Pangenome analysis
201+
- **CD-HIT** (≥4.8.1): Protein clustering
202+
- **InterProScan** (≥5.0): Domain annotation
203+
- **Docker**: For BV-BRC CLI container
204+
205+
These are automatically managed through the Docker container.
206+
207+
## Performance
208+
209+
Processing times vary by species and isolate count:
210+
- Data download: 10-30 minutes for 1,000 isolates
211+
- Pangenome construction: 1-6 hours for 5,000 isolates (parallelized)
212+
- Protein clustering: 30-90 minutes for 5,000 isolates
213+
- Domain annotation: 2-4 hours for 5,000 isolates
214+
- Total: 4-12 hours for a complete species analysis
215+
216+
Parallelization significantly reduces processing time when multiple cores are available.
217+
218+
## Integration with amR suite
219+
220+
amRdata is designed to work seamlessly with other amR packages:
221+
222+
```r
223+
library(amRdata)
224+
library(amRml)
225+
library(amRshiny)
226+
227+
# 1. Curate data
228+
features <- extract_features(...)
229+
230+
# 2. Train models
231+
models <- train_amr_models(features, drug = "ciprofloxacin")
232+
233+
# 3. Visualize
234+
launch_dashboard()
235+
```
58236

59237
## Related packages
60238
