@@ -17,7 +17,6 @@ knitr::opts_chunk$set(

 <!-- badges: start -->
 [![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
-[![CRAN status](https://www.r-pkg.org/badges/version/amRml)](https://CRAN.R-project.org/package=amRml)
 <!-- badges: end -->

 amRdata is the first package in the [amR suite](https://github.com/JRaviLab/amR)
@@ -29,6 +28,22 @@ molecular scales, genes, proteins, protein domains, and structure
 associated AMR phenotypes and isolate metadata for downstream ML modeling
 in package 2, amRml.

+## Overview
+
+amRdata provides functions to:
+
+- Query and download bacterial genome data from BV-BRC
+- Access paired antimicrobial susceptibility testing (AST) results
+- Extract molecular features across four scales:
+  - Gene clusters (pangenome analysis)
+  - Protein clusters (sequence similarity)
+  - Protein domains (functional annotations)
+  - Structural variants (genome rearrangements)
+- Generate summary visualizations of AMR trends
+- Store data efficiently in Parquet and DuckDB formats
+
+See the [package vignette](https://jravilab.github.io/amR_data/articles/intro.html) for detailed usage.
+
 ## Installation

 ``` r
@@ -44,17 +59,180 @@ remotes::install_github("JRaviLab/amR_data")
 ``` r
 library(amRdata)

+# Download genome data for a species
+cje_data <- download_bvbrc_data(
+  taxon = "Campylobacter jejuni",
+  drugs = c("ciprofloxacin", "tetracycline"),
+  output_dir = "cje_genomes"
+)
+
+# Extract multi-scale features
+features <- extract_features(
+  genome_dir = "cje_genomes",
+  scales = c("gene", "protein", "domain", "struct"),
+  taxon = "Campylobacter jejuni",
+  parallel = TRUE,
+  n_cores = 4
+)
+
+# Explore metadata summaries
+summarize_metadata(features$metadata, by = "country")
+
+# Visualize geographic distribution
+plot_geographic_distribution(features$metadata)
 ```

 ## Features

-- **Data download from BV-BRC**
-- **Processing genomic sequences, metadata, and AMR phenotypes**
-- **Pangenome construction**
-- **Featurization into multiple molecular scales (genes --> struct, proteins, domains)**
-- **Processing data across molecular scales for ML modeling**
+### Data curation

-See the [package vignette](https://jravilab.github.io/amR_data/articles/intro.html) for detailed usage.
+The package interfaces with BV-BRC (Bacterial and Viral Bioinformatics Resource Center) to access bacterial genome sequences and antimicrobial susceptibility testing (AST) data. Data can be retrieved over FTP or through the BV-BRC CLI, which is wrapped in a Docker container for reproducible access. The package can:
+
+- Query isolate metadata with flexible filtering
+- Download genome files (.fna, .faa, .gff)
+- Retrieve AST results linking genotypes to phenotypes
+- Apply quality control filters (assembly quality, metadata completeness)
+
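The quality control step can be pictured as a filter over the isolate metadata table. The sketch below uses base R on a toy data frame. The column names (`genome_id`, `genome_quality`, `country`, `collection_year`, `host`) are assumptions for illustration and may not match the columns that amRdata or BV-BRC actually return.

```r
# Toy isolate metadata; all column names and values are hypothetical.
metadata <- data.frame(
  genome_id       = c("g1", "g2", "g3", "g4"),
  genome_quality  = c("Good", "Poor", "Good", "Good"),
  country         = c("USA", "India", NA, "Kenya"),
  collection_year = c(2015, 2018, 2020, NA),
  host            = c("Human", "Human", "Chicken", "Human"),
  stringsAsFactors = FALSE
)

# Keep isolates with a good assembly and no missing key metadata fields.
required <- c("country", "collection_year", "host")
keep <- metadata$genome_quality == "Good" &
  complete.cases(metadata[, required])
filtered <- metadata[keep, ]

filtered$genome_id
```

Only `g1` passes both checks here: `g2` fails on assembly quality, and `g3` and `g4` each have a missing metadata field.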
+### Feature extraction
+
+Features are extracted at four complementary molecular scales. Structural variants, the fourth scale, are identified during pangenome construction (step 1):
+
+#### 1. Gene clusters (pangenome)
+
+amRdata uses Panaroo to construct graph-based pangenomes:
+
+- Handles large isolate sets (>5,000 genomes) through parallelized subset analysis
+- Merges multiple pangenome runs into a single unified pangenome
+- Generates gene presence/absence matrix per isolate
+- Identifies structural variants (gene triplets indicating genome rearrangements)
+
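To make the two outputs of this step concrete, here is a minimal base R sketch of a gene presence/absence matrix and of consecutive gene triplets. It is an illustration only, not how Panaroo or amRdata compute them; the `annotations` table, the isolate names, and the `triplets_for()` helper are all hypothetical.

```r
# Toy annotations: one row per (isolate, gene cluster), listed in genome order.
annotations <- data.frame(
  isolate = c(rep("iso1", 4), rep("iso2", 3)),
  gene    = c("gyrA", "tetO", "cmeB", "recA", "gyrA", "cmeB", "tetO"),
  stringsAsFactors = FALSE
)

# Gene presence/absence matrix: rows are isolates, columns are gene clusters.
gene_pa <- 1L * (unclass(table(annotations$isolate, annotations$gene)) > 0)

# Consecutive gene triplets per isolate (hypothetical helper).
triplets_for <- function(genes) {
  if (length(genes) < 3) return(character(0))
  idx <- seq_len(length(genes) - 2)
  paste(genes[idx], genes[idx + 1], genes[idx + 2], sep = "|")
}
triplets <- lapply(split(annotations$gene, annotations$isolate), triplets_for)

# Triplets seen in iso2 but not in iso1 point to a local difference in gene order.
setdiff(triplets$iso2, triplets$iso1)
```

In this toy data `iso2` lacks `recA` and carries `cmeB` and `tetO` in the opposite order, so its single triplet does not occur in `iso1`.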
+#### 2. Protein clusters
+
+CD-HIT-based clustering of protein sequences:
+
+- Clusters proteins across all isolates
+- Creates protein presence/absence matrices
+- Complements gene-level features with protein-level resolution
+
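As a rough sketch of how a CD-HIT cluster file could become a protein presence/absence matrix, the base R example below parses a few lines in the `.clstr` layout. The `isolate|protein` naming of sequence IDs is an assumption made for this example, and the parsing shown is not necessarily what amRdata does internally.

```r
# A small excerpt in CD-HIT's .clstr layout.
# Sequence IDs follow a hypothetical "isolate|protein" convention.
clstr_lines <- c(
  ">Cluster 0",
  "0\t310aa, >iso1|p001... *",
  "1\t308aa, >iso2|p014... at 98.70%",
  ">Cluster 1",
  "0\t122aa, >iso1|p002... *"
)

is_header  <- grepl("^>Cluster", clstr_lines)
cluster_id <- cumsum(is_header) - 1L   # running cluster index for every line

members <- data.frame(
  cluster = paste0("cluster_", cluster_id[!is_header]),
  protein = sub("^.*>(.+?)\\.\\.\\..*$", "\\1",
                clstr_lines[!is_header], perl = TRUE),
  stringsAsFactors = FALSE
)
members$isolate <- sub("\\|.*$", "", members$protein)

# Protein presence/absence: rows are isolates, columns are protein clusters.
protein_pa <- 1L * (unclass(table(members$isolate, members$cluster)) > 0)
```

Each member line contributes one (isolate, cluster) pair, so an isolate with several proteins in the same cluster still gets a single 1 in that column.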
+#### 3. Protein domains
+
+InterProScan analysis of representative protein sequences:
+
+- Identifies Pfam domains across the protein space
+- Maps domains to isolates through protein cluster membership
+- Provides functional annotation layer
+
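Mapping domains to isolates through cluster membership amounts to combining two binary matrices. The base R sketch below shows one way to express that with a matrix product; the matrices and their values are toy data, and this is an illustration of the idea, not a description of the package's internals.

```r
# Isolate-by-cluster presence/absence (toy values; names are hypothetical).
protein_pa <- matrix(
  c(1L, 1L, 0L,
    1L, 0L, 1L),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("iso1", "iso2"),
                  c("cluster_0", "cluster_1", "cluster_2"))
)

# Cluster-by-domain annotations for each cluster's representative protein.
cluster_domain <- matrix(
  c(1L, 0L,
    0L, 1L,
    1L, 0L),
  nrow = 3, byrow = TRUE,
  dimnames = list(colnames(protein_pa), c("PF00005", "PF00072"))
)

# An isolate carries a domain if at least one of its clusters carries it.
domain_pa <- 1L * ((protein_pa %*% cluster_domain) > 0)
domain_pa
```

Because only representative sequences are annotated, the domain matrix inherits its isolate coverage from the protein cluster matrix.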
+
+### Data storage
+
+All feature matrices and metadata are stored in efficient formats:
+
+- **Parquet**: Columnar storage for large matrices
+- **DuckDB**: SQL-queryable database for rapid filtering
+- **Metadata**: Geographic, temporal, and host information per isolate
+
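For readers new to this storage pairing, the sketch below writes a toy feature table to Parquet and filters it with SQL through DuckDB. It assumes the `DBI` and `duckdb` R packages are installed, uses a hypothetical table and column names, and does not reflect how `save_features()` is implemented.

```r
library(DBI)

# Toy feature table; isolate and feature names are hypothetical.
features <- data.frame(
  isolate = c("iso1", "iso2", "iso3"),
  gene_a  = c(1L, 0L, 1L),
  gene_b  = c(0L, 1L, 1L)
)

parquet_path <- tempfile(fileext = ".parquet")

con <- dbConnect(duckdb::duckdb())   # in-memory DuckDB database
dbWriteTable(con, "features", features)

# Write the table to Parquet, then query the file directly with SQL.
dbExecute(con, sprintf("COPY features TO '%s' (FORMAT PARQUET)", parquet_path))
hits <- dbGetQuery(con, sprintf(
  "SELECT isolate FROM read_parquet('%s') WHERE gene_a = 1 ORDER BY isolate",
  parquet_path
))

dbDisconnect(con, shutdown = TRUE)
hits$isolate
```

Querying the Parquet file in place means a filter like this can run without first loading the full matrix into R.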
+### Visualization
+
+Built-in functions for exploring data:
+
+- Geographic distributions (maps, treemaps)
+- Temporal trends in AMR
+- Host organism analyses
+- Feature distribution summaries
+
+## Workflow example
+
+Complete workflow for processing *Staphylococcus epidermidis*:
+
+``` r
+library(amRdata)
+
+# 1. Download data
+sepi_data <- download_bvbrc_data(
+  taxon = "Staphylococcus epidermidis",
+  drugs = c("oxacillin", "vancomycin", "ciprofloxacin"),
+  min_assembly_quality = "Good",
+  output_dir = "sepi_data"
+)
+
+# 2. Extract features (all scales)
+features <- extract_features(
+  genome_dir = "sepi_data",
+  scales = c("gene", "protein", "domain", "struct"),
+  taxon = "Staphylococcus epidermidis",
+  parallel = TRUE,
+  n_cores = 8
+)
+
+# 3. Explore metadata
+summary(features$metadata)
+
+plot_temporal_trends(
+  features$metadata,
+  drug = "oxacillin",
+  by_year = TRUE
+)
+
+# 4. Save processed data
+save_features(
+  features,
+  output_file = "sepi_features.parquet"
+)
+```
+
+## Data requirements
+
+### Input
+
+The package requires:
+
+- Internet connection to access BV-BRC
+- Docker for BV-BRC CLI (automatically managed)
+- Sufficient storage for genome files (varies by species; typically 1-10 GB)
+
+### Output
+
+Feature matrix dimensions depend on the species:
+
+- Rows: Number of isolates (typically 1,000-10,000)
+- Columns: Number of features
+  - Genes: 2,000-20,000
+  - Proteins: 2,000-15,000
+  - Domains: 500-5,000
+  - Structural variants: 100-5,000
+
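A quick back-of-the-envelope calculation shows why these dimensions matter for storage. The base R snippet below estimates the in-memory size of a dense matrix at the upper end of the ranges listed above; it is a rough estimate that ignores object overhead and any sparse or compressed representation.

```r
# Upper-end dimensions from the ranges above: 10,000 isolates x 20,000 genes.
n_isolates <- 10000
n_features <- 20000

# Bytes per value for R's integer and double storage modes.
bytes_per_cell <- c(integer = 4, double = 8)

# Approximate size of a dense matrix in GiB.
dense_gib <- n_isolates * n_features * bytes_per_cell / 1024^3
round(dense_gib, 2)
```

A dense integer matrix of this size takes roughly 0.75 GiB and a double matrix about twice that, before any copies made during analysis.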
+## External dependencies
+
+The package uses established bioinformatics tools:
+
+- **Panaroo** (≥1.3.0): Pangenome analysis
+- **CD-HIT** (≥4.8.1): Protein clustering
+- **InterProScan** (≥5.0): Domain annotation
+- **Docker**: For BV-BRC CLI container
+
+Apart from Docker itself, these tools are managed automatically through the Docker container.
+
+## Performance
+
+Processing times vary by species and isolate count:
+
+- Data download: 10-30 minutes for 1,000 isolates
+- Pangenome construction: 1-6 hours for 5,000 isolates (parallelized)
+- Protein clustering: 30-90 minutes for 5,000 isolates
+- Domain annotation: 2-4 hours for 5,000 isolates
+- Total: 4-12 hours for a complete species analysis
+
+Setting `parallel = TRUE` with a larger `n_cores` in `extract_features()` reduces processing time when multiple cores are available.
+
+## Integration with amR suite
+
+amRdata is designed to work with the other amR packages:
+
+``` r
+library(amRdata)
+library(amRml)
+library(amRshiny)
+
+# 1. Curate data
+features <- extract_features(...)
+
+# 2. Train models
+models <- train_amr_models(features, drug = "ciprofloxacin")
+
+# 3. Visualize
+launch_dashboard()
+```

 ## Related packages
