JRaviLab
diff --git a/README.Rmd b/README.Rmd
Lines changed: 185 additions & 7 deletions
@@ -17,7 +17,6 @@ knitr::opts_chunk$set(
 
 <!-- badges: start -->
 [![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
-[![CRAN status](https://www.r-pkg.org/badges/version/amRml)](https://CRAN.R-project.org/package=amRml)
 <!-- badges: end -->
 
 amRdata is the first package in the [amR suite](https://github.com/JRaviLab/amR) 
@@ -29,6 +28,22 @@ molecular scales, genes, proteins, protein domains, and structure
 associated AMR phenotypes and isolate metadata for downstream ML modeling 
 in package 2, amRml.
 
+## Overview
+
+amRdata provides functions to:
+
+- Query and download bacterial genome data from BV-BRC
+- Access paired antimicrobial susceptibility testing (AST) results
+- Extract molecular features across four scales:
+  - Gene clusters (pangenome analysis)
+  - Protein clusters (sequence similarity)
+  - Protein domains (functional annotations)
+  - Structural variants (genome rearrangements)
+- Generate summary visualizations of AMR trends
+- Store data efficiently in Parquet and DuckDB formats
+
+See the [package vignette](https://jravilab.github.io/amR_data/articles/intro.html) for detailed usage.
+
 ## Installation
 
 ```r
@@ -44,17 +59,180 @@ remotes::install_github("JRaviLab/amR_data")
 ```r
 library(amRdata)
 
+# Download genome data for a species
+cje_data <- download_bvbrc_data(
+  taxon = "Campylobacter jejuni",
+  drugs = c("ciprofloxacin", "tetracycline"),
+  output_dir = "cje_genomes"
+)
+
+# Extract multi-scale features
+features <- extract_features(
+  genome_dir = "cje_genomes",
+  scales = c("gene", "protein", "domain", "struct"),
+  taxon = "Campylobacter jejuni",
+  parallel = TRUE,
+  n_cores = 4
+)
+
+# Explore metadata summaries
+summarize_metadata(features$metadata, by = "country")
+
+# Visualize geographic distribution
+plot_geographic_distribution(features$metadata)
 ```
 
 ## Features
 
-- **Data download from BV-BRC**
-- **Processing genomic sequences, metadata, and AMR phenotypes**
-- **Pangenome construction**
-- **Featurization into multiple molecular scales (genes --> struct, proteins, domains)**
-- **Processing data across molecular scales for ML modeling**
+### Data curation
 
-See the [package vignette](https://jravilab.github.io/amR_data/articles/intro.html) for detailed usage.
+The package interfaces with BV-BRC (Bacterial and Viral Bioinformatics Resource Center) to access bacterial genome sequences and antimicrobial susceptibility testing (AST) data. Access goes through either FTP or the BV-BRC CLI, which is wrapped in a Docker container for reproducibility:
+
+- Query isolate metadata with flexible filtering
+- Download genome files (.fna, .faa, .gff)
+- Retrieve AST results linking genotypes to phenotypes
+- Apply quality control filters (assembly quality, metadata completeness)
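+
+As a minimal sketch of the quality-control step above, the following base R snippet filters a toy isolate table. The column names (`genome_quality`, `country`, `collection_year`) and the filtering rule are assumptions for illustration, not necessarily the schema or logic amRdata uses:
+
+```r
+# Toy isolate metadata; column names are hypothetical.
+isolates <- data.frame(
+  genome_id       = c("g1", "g2", "g3", "g4"),
+  genome_quality  = c("Good", "Poor", "Good", "Good"),
+  country         = c("USA", "India", NA, "Brazil"),
+  collection_year = c(2015, 2018, 2020, NA),
+  stringsAsFactors = FALSE
+)
+
+# Keep isolates with good assemblies and complete key metadata fields.
+required_fields <- c("country", "collection_year")
+is_complete <- stats::complete.cases(isolates[, required_fields])
+passed_qc <- isolates[isolates$genome_quality == "Good" & is_complete, ]
+
+passed_qc$genome_id
+```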
+
+### Feature extraction
+
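+Features are extracted at four complementary molecular scales: gene clusters, protein clusters, protein domains, and structural variants. Structural variants are identified during pangenome construction, so they are described under gene clusters.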
+
+#### 1. Gene clusters (pangenome)
+
+Uses Panaroo to construct graph-based pangenomes:
+- Handles large isolate sets (>5,000 genomes) through parallelized subset analysis
+- Merges multiple pangenome runs into a single unified pangenome
+- Generates gene presence/absence matrix per isolate
+- Identifies structural variants (gene triplets indicating genome rearrangements)
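+
+The sketch below shows one way to load a gene presence/absence matrix into R. It assumes the tab-delimited `.Rtab` layout that Panaroo typically writes (genes in rows, isolates in columns); the file contents here are toy data, and how amRdata reads these files internally is not shown in this README:
+
+```r
+# Write a toy file in the assumed .Rtab layout: genes in rows, isolates in columns.
+rtab_path <- tempfile(fileext = ".Rtab")
+writeLines(c(
+  "Gene\tiso1\tiso2\tiso3",
+  "gyrA\t1\t1\t1",
+  "tetO\t0\t1\t0",
+  "group_42\t1\t0\t1"
+), rtab_path)
+
+# Read it with gene names as row names, then transpose to isolates x genes.
+gene_by_isolate <- read.delim(rtab_path, row.names = 1, check.names = FALSE)
+isolate_by_gene <- t(as.matrix(gene_by_isolate))
+
+dim(isolate_by_gene)             # isolates x genes
+isolate_by_gene["iso2", "tetO"]  # presence call for one isolate-gene pair
+```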
+
+#### 2. Protein clusters
+
+CD-HIT-based clustering of protein sequences:
+- Clusters proteins across all isolates
+- Creates protein presence/absence matrices
+- Complements gene-level features with protein-level resolution
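+
+To illustrate what a protein presence/absence matrix looks like, here is a base R sketch that builds one from a long table of cluster memberships. The table layout and the names `isolate` and `cluster` are assumptions; CD-HIT's own output would need to be parsed into this shape first:
+
+```r
+# Hypothetical long-format membership table: one row per protein,
+# recording its isolate and the cluster it was assigned to.
+membership <- data.frame(
+  isolate = c("iso1", "iso1", "iso2", "iso3", "iso3", "iso3"),
+  cluster = c("c1", "c2", "c1", "c2", "c3", "c3")
+)
+
+# Count proteins per isolate and cluster, then collapse counts to 0/1.
+counts <- unclass(table(membership$isolate, membership$cluster))
+presence <- (counts > 0) * 1L
+
+presence
+```
+
+Collapsing counts to 0/1 discards copy number, which matches a presence/absence representation but may not suit every downstream model.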
+
+#### 3. Protein domains
+
+InterProScan analysis of representative protein sequences:
+- Identifies Pfam domains across the protein space
+- Maps domains to isolates through protein cluster membership
+- Provides functional annotation layer
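+
+The mapping from domains to isolates through cluster membership can be sketched as a join. The two input tables and the domain labels (`domA`, `domB`) are hypothetical placeholders, not real Pfam accessions or amRdata objects:
+
+```r
+# Hypothetical inputs: which clusters each isolate has proteins in, and
+# which domains were found on each cluster's representative sequence.
+cluster_membership <- data.frame(
+  isolate = c("iso1", "iso1", "iso2", "iso3"),
+  cluster = c("c1", "c2", "c2", "c3")
+)
+representative_domains <- data.frame(
+  cluster = c("c1", "c2", "c2"),
+  domain  = c("domA", "domA", "domB")
+)
+
+# Join on cluster, then keep one row per isolate-domain pair.
+joined <- merge(cluster_membership, representative_domains, by = "cluster")
+isolate_domains <- unique(joined[, c("isolate", "domain")])
+
+# Isolates x domains presence/absence matrix.
+domain_matrix <- (unclass(table(isolate_domains$isolate, isolate_domains$domain)) > 0) * 1L
+domain_matrix
+```
+
+In this toy example `iso3` drops out because its only cluster has no annotated domain; a real pipeline would need to decide whether such isolates get an all-zero row instead.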
+
+### Data storage
+
+Feature matrices and metadata are stored in compact, queryable formats:
+- **Parquet**: Columnar storage for large matrices
+- **DuckDB**: SQL-queryable database for rapid filtering
+
+The stored metadata covers geographic, temporal, and host information per isolate.
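+
+As a sketch of how Parquet and DuckDB can work together, the snippet below writes a toy feature table to Parquet and filters it with SQL. It assumes the `arrow`, `DBI`, and `duckdb` packages are installed; the table and column names are invented for illustration and do not reflect amRdata's actual storage schema:
+
+```r
+library(arrow)
+library(DBI)
+library(duckdb)
+
+# Toy table mixing metadata and feature columns (hypothetical names).
+features <- data.frame(
+  isolate = c("iso1", "iso2", "iso3"),
+  country = c("USA", "India", "USA"),
+  gene_c1 = c(1L, 0L, 1L),
+  gene_c2 = c(0L, 1L, 1L)
+)
+
+# Write to Parquet; normalize the path so the SQL string uses forward slashes.
+parquet_path <- tempfile(fileext = ".parquet")
+write_parquet(features, parquet_path)
+parquet_path <- normalizePath(parquet_path, winslash = "/")
+
+# Query the Parquet file directly from an in-memory DuckDB connection.
+con <- dbConnect(duckdb())
+query <- sprintf(
+  "SELECT isolate, gene_c1 FROM read_parquet('%s') WHERE country = 'USA' ORDER BY isolate",
+  parquet_path
+)
+usa_isolates <- dbGetQuery(con, query)
+dbDisconnect(con, shutdown = TRUE)
+
+usa_isolates
+```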
+
+### Visualization
+
+Built-in functions for exploring data:
+- Geographic distributions (maps, treemaps)
+- Temporal trends in AMR
+- Host organism analyses
+- Feature distribution summaries
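+
+For a sense of what a temporal trend summary involves, here is a base R sketch that computes the fraction of resistant isolates per year from a toy AST table. The column names and phenotype labels are assumptions, and this is not the implementation behind the package's plotting functions:
+
+```r
+# Toy AST results for a single drug (hypothetical columns and labels).
+ast <- data.frame(
+  year      = c(2015, 2015, 2016, 2016, 2016, 2017),
+  phenotype = c("Resistant", "Susceptible", "Resistant",
+                "Resistant", "Susceptible", "Resistant")
+)
+
+# Fraction of resistant isolates per year.
+ast$resistant <- ast$phenotype == "Resistant"
+trend <- aggregate(resistant ~ year, data = ast, FUN = mean)
+names(trend)[2] <- "fraction_resistant"
+
+# Simple line plot of the yearly fraction.
+plot(trend$year, trend$fraction_resistant, type = "b",
+     xlab = "Year", ylab = "Fraction resistant", ylim = c(0, 1))
+```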
+
+## Workflow example
+
+Complete workflow for processing *Staphylococcus epidermidis*:
+
+```r
+library(amRdata)
+
+# 1. Download data
+sepi_data <- download_bvbrc_data(
+  taxon = "Staphylococcus epidermidis",
+  drugs = c("oxacillin", "vancomycin", "ciprofloxacin"),
+  min_assembly_quality = "Good",
+  output_dir = "sepi_data"
+)
+
+# 2. Extract features (all scales)
+features <- extract_features(
+  genome_dir = "sepi_data",
+  scales = c("gene", "protein", "domain", "struct"),
+  taxon = "Staphylococcus epidermidis",
+  parallel = TRUE,
+  n_cores = 8
+)
+
+# 3. Explore metadata
+summary(features$metadata)
+
+plot_temporal_trends(
+  features$metadata,
+  drug = "oxacillin",
+  by_year = TRUE
+)
+
+# 4. Save processed data
+save_features(
+  features,
+  output_file = "sepi_features.parquet"
+)
+```
+
+## Data requirements
+
+### Input
+
+The package requires:
+- Internet connection to access BV-BRC
+- Docker for BV-BRC CLI (automatically managed)
+- Sufficient storage for genome files (varies by species; typically 1-10 GB)
+
+### Output
+
+Feature matrix dimensions depend on the species:
+- Rows: Number of isolates (typically 1,000-10,000)
+- Columns: Number of features
+  - Genes: 2,000-20,000
+  - Proteins: 2,000-15,000
+  - Domains: 500-5,000
+  - Structural variants: 100-5,000
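+
+A rough back-of-the-envelope calculation shows why storage format matters at the upper end of these ranges. It assumes a dense integer matrix held in R, where each integer takes 4 bytes; actual on-disk sizes in Parquet will differ because of compression:
+
+```r
+# Upper-end dimensions taken from the ranges listed above.
+n_isolates <- 10000
+n_features <- 20000
+bytes_per_cell <- 4  # R stores each integer in 4 bytes
+
+# Approximate in-memory size of a dense integer matrix, in gigabytes.
+dense_gb <- n_isolates * n_features * bytes_per_cell / 1e9
+dense_gb
+```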
+
+## External dependencies
+
+The package uses established bioinformatics tools:
+- **Panaroo** (≥1.3.0): Pangenome analysis
+- **CD-HIT** (≥4.8.1): Protein clustering
+- **InterProScan** (≥5.0): Domain annotation
+- **Docker**: For BV-BRC CLI container
+
+Panaroo, CD-HIT, and InterProScan are managed automatically through the Docker container.
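+
+Because the container relies on a local Docker installation, it can be useful to confirm that the `docker` executable is on the PATH before starting a long run. A minimal sketch using base R's `Sys.which()`; this check is an illustration and not something the package is documented to perform:
+
+```r
+# Look up the docker executable; Sys.which() returns "" when it is not found.
+docker_path <- unname(Sys.which("docker"))
+docker_available <- nzchar(docker_path)
+
+if (!docker_available) {
+  message("Docker was not found on the PATH; install or start it before running the pipeline.")
+}
+
+docker_available
+```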
+
+## Performance
+
+Processing times vary by species and isolate count:
+- Data download: 10-30 minutes for 1,000 isolates
+- Pangenome construction: 1-6 hours for 5,000 isolates (parallelized)
+- Protein clustering: 30-90 minutes for 5,000 isolates
+- Domain annotation: 2-4 hours for 5,000 isolates
+- Total: 4-12 hours for a complete species analysis
+
+Parallelization significantly reduces processing time when multiple cores are available.
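+
+The general pattern of splitting isolates into subsets and processing them in parallel can be sketched with the base `parallel` package. The chunk size and the `process_chunk()` placeholder are hypothetical; the package's own subset logic for Panaroo runs is not shown here:
+
+```r
+library(parallel)
+
+# Toy genome identifiers split into chunks of at most 5.
+genome_ids <- sprintf("genome_%02d", 1:12)
+chunk_size <- 5
+chunks <- split(genome_ids, ceiling(seq_along(genome_ids) / chunk_size))
+
+# Placeholder for real per-subset work (e.g. one pangenome run per chunk).
+process_chunk <- function(ids) length(ids)
+
+# Run the chunks on two worker processes, then release the workers.
+cl <- makeCluster(2)
+chunk_sizes <- parLapply(cl, chunks, process_chunk)
+stopCluster(cl)
+
+unlist(chunk_sizes)
+```
+
+`makeCluster()` with `parLapply()` works on all major platforms, whereas `mclapply()` forks and therefore cannot use multiple cores on Windows.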
+
+## Integration with amR suite
+
+amRdata is designed to work seamlessly with other amR packages:
+
+```r
+library(amRdata)
+library(amRml)
+library(amRshiny)
+
+# 1. Curate data
+features <- extract_features(...)
+
+# 2. Train models
+models <- train_amr_models(features, drug = "ciprofloxacin")
+
+# 3. Visualize
+launch_dashboard()
+```
 
 ## Related packages