LangeLab/pepVet

pepVet

Evaluate proteolytic digests for LC-MS/MS proteomics. Score peptides, compare enzymes, triage proteins, and plan your workflow before any sample reaches the instrument.

Version 0.1.6 | R >= 4.6 | Bioconductor 3.23 | R CMD check | License: MIT | pkgdown docs


What pepVet does

Enzyme choice is the first and most consequential decision in a bottom-up proteomics experiment. Cut too aggressively and you drown in tiny fragments below the detection threshold. Cut too conservatively and overlong peptides fail to fly, fragment, or resolve on the column. Most tools stop after listing which peptides an enzyme could produce. pepVet goes further: it scores each peptide for LC-MS/MS suitability, ranks enzymes by digest quality, and triages proteins by expected difficulty, all before any sample touches the instrument.

Quick start

library(pepVet)

bsa <- system.file("extdata", "P02769.fasta", package = "pepVet")

# One-call evaluation with styled console report
result <- pepvet_check(bsa, enzyme = "trypsin", missed_cleavages = 1L)
result$scores

# Multi-enzyme comparison
comp <- compare_digests(bsa,
  enzymes = c("trypsin", "lysc", "glutamyl endopeptidase", "asp-n endopeptidase")
)
digest_report(comp)
recommend_enzyme(bsa, enzymes = c("trypsin", "lysc"))

Visualization

pepVet provides 12 ggplot2-based plot functions for digest diagnostics, enzyme comparison, physicochemical distributions, and proteome-scale overviews. Every function returns a ggplot or patchwork object that can be customized further.

Single-protein diagnostic: plot_digest_profile() called on BSA (P02769.fasta) digested with trypsin at one missed cleavage gives a four-panel figure showing length distribution, GRAVY hydrophobicity, sequence coverage, and component scores:

[Figure: BSA trypsin digest profile]

Proteome-scale enzyme comparison: plot_batch_comparison() called on the 50-protein fixture (small_proteome_50_proteins.fasta) evaluated against 10 enzymes (trypsin, Lys-C, chymotrypsin, Asp-N, Glu-C, Arg-C, thermolysin, pepsin, Staphylococcal peptidase I, proteinase K) gives verdict summaries, score distributions, component heatmaps, and per-protein win rates:

[Figure: 10-enzyme batch comparison on 50 proteins]

See the Visualising Digest Quality article for a full walkthrough of all plot functions.

Features

Digest simulation

  • digest_protein() cleaves any protein sequence with any of 40 cleaver-compatible enzyme rules and returns a peptide tibble with coordinates and missed-cleavage counts.
  • annotate_cleavage_sites() labels each trypsin-family cleavage site as high, medium, or low efficiency using local P1-P1' sequence context.
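To make the idea of an in-silico digest concrete, here is a minimal base-R sketch of a simplified trypsin rule (cut after K or R unless the next residue is P) with missed cleavages. It is an illustration only: `simple_tryptic_digest()` is a hypothetical helper, not part of pepVet, and it does not reproduce the cleaver rules that digest_protein() relies on.

```r
# Hypothetical helper for illustration; not the pepVet or cleaver implementation.
simple_tryptic_digest <- function(sequence, missed_cleavages = 0L) {
  aa <- strsplit(sequence, "", fixed = TRUE)[[1]]
  n <- length(aa)

  # A site at position i means "cut between residue i and residue i + 1".
  idx <- seq_len(n - 1L)
  sites <- idx[aa[idx] %in% c("K", "R") & aa[idx + 1L] != "P"]

  # Boundaries of the fully cleaved fragments.
  starts <- c(1L, sites + 1L)
  ends <- c(sites, n)
  k <- length(starts)

  # Join mc + 1 neighbouring fragments to model mc missed cleavages.
  out <- list()
  for (mc in 0:missed_cleavages) {
    if (mc >= k) break
    i <- seq_len(k - mc)
    out[[length(out) + 1L]] <- data.frame(
      start = starts[i],
      end = ends[i + mc],
      missed_cleavages = mc
    )
  }

  peptides <- do.call(rbind, out)
  peptides$sequence <- substring(sequence, peptides$start, peptides$end)
  peptides
}

# Toy sequence: the R at position 3 is followed by P, so it is not cleaved.
toy_digest <- simple_tryptic_digest("AKRPGKC", missed_cleavages = 1L)
print(toy_digest)
```

The returned data frame carries start and end coordinates plus a missed-cleavage count per peptide, which mirrors the kind of information the README says digest_protein() returns, though the real column names may differ.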

Scoring

  • score_peptides() summarises a peptide set into five orthogonal component scores (S_length, S_coverage, S_count, S_hydro, S_charge) plus an optional sixth (S_unique) when a background proteome digest is supplied.
  • pepvet_preset() returns workflow-specific parameter sets for DDA, DIA, targeted, membrane, FFPE/degraded, and fractionated workflows.

Evaluation and comparison

  • evaluate_digest() wraps digest and scoring into one call and returns a named list with scores, peptides, and resolved parameters.
  • compare_digests() runs across a vector of enzymes for a single protein and returns a ranked tibble.
  • recommend_enzyme() returns the name of the best-scoring enzyme.

Batch workflows

  • batch_evaluate() evaluates every protein in a multi-FASTA independently and returns a flat tibble with one row per protein, including all score columns, verdicts, and four difficulty flags.
  • summarize_batch() computes proteome-level verdict distribution, composite score statistics, per-component means, and heuristic enzyme-switch candidates.
  • triage_proteins() appends an action column (proceed, consider_alternative, try_other_enzyme, skip) to the batch tibble.

Reporting and export

  • digest_report() renders a colour-coded console summary for single-protein or multi-enzyme results.
  • export_peptide_list() filters valid peptides and exports as Skyline-compatible CSV, generic annotated CSV, or FASTA.

Peptide properties

  • calculate_peptide_mass() computes monoisotopic neutral mass and m/z.
  • calculate_pI() computes isoelectric point using a Lehninger-style pKa set.
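As a rough illustration of the mass calculation, the sketch below sums standard monoisotopic residue masses, adds one water, and converts to m/z for a given charge. It assumes unmodified peptides and uses hypothetical helper names (`peptide_mass()`, `peptide_mz()`); it is not the code behind calculate_peptide_mass().

```r
# Standard monoisotopic residue masses in Da (unmodified amino acids).
residue_mass <- c(
  G = 57.02146, A = 71.03711, S = 87.03203, P = 97.05276, V = 99.06841,
  T = 101.04768, C = 103.00919, L = 113.08406, I = 113.08406, N = 114.04293,
  D = 115.02694, Q = 128.05858, K = 128.09496, E = 129.04259, M = 131.04049,
  H = 137.05891, F = 147.06841, R = 156.10111, Y = 163.06333, W = 186.07931
)
water_mass <- 18.010565   # H2O added for the free N- and C-termini
proton_mass <- 1.007276   # mass of one charge-carrying proton

# Hypothetical helper: neutral monoisotopic mass of an unmodified peptide.
peptide_mass <- function(peptide) {
  aa <- strsplit(peptide, "", fixed = TRUE)[[1]]
  stopifnot(all(aa %in% names(residue_mass)))
  sum(residue_mass[aa]) + water_mass
}

# Hypothetical helper: m/z of the [M + zH]z+ ion.
peptide_mz <- function(peptide, charge = 2L) {
  (peptide_mass(peptide) + charge * proton_mass) / charge
}

example_mass <- peptide_mass("PEPTIDE")
example_mz <- peptide_mz("PEPTIDE", charge = 2L)
print(c(mass = example_mass, mz = example_mz))
```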

Scoring model

Five core components plus an optional sixth (S_unique), one weighted composite, one advisory verdict.

| Score | What it measures | Why it matters |
| --- | --- | --- |
| S_length | Fraction of peptides in the active length window [7, 25] aa | Short and long peptides lower identification rates |
| S_coverage | Fraction of the protein covered by valid peptides | Dark regions weaken protein-level inference |
| S_count | Valid count relative to enzyme-aware expected density | Too few weakens evidence; too many signals over-digestion |
| S_hydro | Fraction of valid peptides in the active GRAVY window [-1.0, 0.6] | Extreme hydrophobicity or hydrophilicity hurts LC retention |
| S_charge | Valid peptides with non-terminal K/R/H | Proxy for multi-charge potential and fragment ion richness |
| S_unique | Fraction of valid peptides unique in a supplied proteome | Shared peptides cannot distinguish isoforms or paralogs |
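Two of these components are simple window fractions, which the following base-R sketch illustrates on a toy peptide set. It assumes that "valid" means inside the length window and that GRAVY is the mean Kyte-Doolittle hydropathy; the helper names are hypothetical, and pepVet's own implementation may differ in detail.

```r
# Kyte-Doolittle hydropathy scale (standard published values).
kd <- c(
  A = 1.8, R = -4.5, N = -3.5, D = -3.5, C = 2.5, Q = -3.5, E = -3.5,
  G = -0.4, H = -3.2, I = 4.5, L = 3.8, K = -3.9, M = 1.9, F = 2.8,
  P = -1.6, S = -0.8, T = -0.7, W = -0.9, Y = -1.3, V = 4.2
)

# Hypothetical helper: GRAVY as the mean hydropathy over all residues.
gravy <- function(peptide) {
  aa <- strsplit(peptide, "", fixed = TRUE)[[1]]
  mean(kd[aa])
}

# Toy peptide set: one too-short peptide and one very hydrophobic peptide.
toy_peptides <- c("AK", "LVNELTEFAK", "YLYEIAR", "LLVVIIFAGK")

# Length component: share of all peptides inside the [7, 25] aa window.
len <- nchar(toy_peptides)
is_valid <- len >= 7 & len <= 25
s_length <- mean(is_valid)

# Hydrophobicity component: share of valid peptides inside [-1.0, 0.6].
gravy_valid <- vapply(toy_peptides[is_valid], gravy, numeric(1))
s_hydro <- mean(gravy_valid >= -1.0 & gravy_valid <= 0.6)

print(round(c(S_length = s_length, S_hydro = s_hydro), 3))
```

In this toy set the two-residue peptide fails the length window and the hydrophobic peptide fails the GRAVY window, so both components fall below 1.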

Default weights (derived with the analytic hierarchy process, AHP; consistency ratio 0.028): S_length 0.200, S_coverage 0.348, S_count 0.226, S_hydro 0.138, S_charge 0.088.

Verdict thresholds: Good >= 0.65, Moderate >= 0.40, Poor < 0.40. These are heuristic ranking labels, not calibrated probabilities.
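The weights and thresholds above can be combined in a few lines of base R. This sketch assumes the composite is a plain weighted mean of the five default components, which the README implies but does not spell out; `composite_score()` and `verdict_label()` are hypothetical names, and the example component values are made up for illustration.

```r
# Default weights as listed in the README.
default_weights <- c(
  S_length = 0.200, S_coverage = 0.348, S_count = 0.226,
  S_hydro = 0.138, S_charge = 0.088
)

# Hypothetical helper: weighted mean of component scores (an assumption,
# not necessarily pepVet's exact formula).
composite_score <- function(scores, weights = default_weights) {
  stopifnot(all(names(weights) %in% names(scores)))
  sum(weights * scores[names(weights)]) / sum(weights)
}

# Hypothetical helper: map a composite to the README's verdict thresholds.
verdict_label <- function(composite) {
  if (composite >= 0.65) {
    "Good"
  } else if (composite >= 0.40) {
    "Moderate"
  } else {
    "Poor"
  }
}

# Made-up component scores for one protein and enzyme.
example_scores <- c(
  S_length = 0.75, S_coverage = 0.90, S_count = 0.60,
  S_hydro = 0.80, S_charge = 0.50
)

example_composite <- composite_score(example_scores)
example_verdict <- verdict_label(example_composite)
print(example_composite)
print(example_verdict)
```

Dividing by the sum of the weights keeps the composite on a 0 to 1 scale even if a custom weight vector does not sum exactly to 1.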

Workflow presets

Each preset adjusts the valid-length window, GRAVY range, and component weights together.

preset <- pepvet_preset("targeted")
do.call(evaluate_digest, c(list(sequence = bsa, enzyme = "trypsin"), preset))

| Preset | Best fit | Key shift | Source |
| --- | --- | --- | --- |
| standard | Routine DDA | [7,25] aa, GRAVY [-1,0.6], AHP defaults | Tabb 2008 |
| dia | DIA and SWATH | [7,30] aa, GRAVY [-1,0.8], high coverage weight | Ludwig 2018 |
| targeted | SRM, PRM, MRM | [8,20] aa, GRAVY [-0.8,0.4], S_unique 30% | Lange 2008, Picotti 2012 |
| membrane | Hydrophobic proteins | GRAVY [-1.0,2.0], S_hydro 5% | Vit & Petrak 2017 |
| ffpe_degraded | Degraded samples | [6,30] aa, high S_count weight | Coscia 2020, Buczak 2023 |
| fractionated | SCX / high-pH RP | Same as standard, include_pI = TRUE | - |

Installation

pepVet depends on Bioconductor packages. Install them first:

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c("Biostrings", "IRanges", "cleaver"))

if (!requireNamespace("remotes", quietly = TRUE))
  install.packages("remotes")
remotes::install_github("LangeLab/pepVet", dependencies = TRUE)

Reference FASTA fixtures

The package ships pinned FASTA files for reproducible examples and regression tests.

| File | Protein | Use |
| --- | --- | --- |
| P02769.fasta | BSA (607 aa) | Canonical positive-control digest |
| P68431.fasta | Histone H3.1 (136 aa) | Exposes trypsin over-digestion on basic proteins |
| P56817.fasta | BACE1 (501 aa) | Membrane protein with mixed hydrophobicity |
| P00698.fasta | Lysozyme C (147 aa) | Small protein, well-characterised digest |
| Q8WZ42.fasta | Titin (34350 aa) | Very large protein for scale testing |
| P0CG48.fasta | Ubiquitin (685 aa) | Short protein edge case |
| P37840_isoforms.fasta | Alpha-synuclein isoforms (3 seqs) | Proteome-aware uniqueness example |
| small_proteome_50_proteins.fasta | 50 human proteins | Batch workflow fixture |

Scope

pepVet is not a peptide detectability predictor. It is a rule-based, multi-criteria digest-ranking model for pre-acquisition planning. Scores are interpretable rankings within a given enzyme-workflow combination, not calibrated probabilities. The model does not account for PTMs, chromatographic gradients, or instrument-specific fragmentation parameters.

Documentation

Citation

citation("pepVet")

License

MIT. See LICENSE.md.

Contributing

Pull requests, bug reports, and documentation fixes are welcome. See CONTRIBUTING.md for the review workflow and CODE_OF_CONDUCT.md for community standards.

About

pepVet is an R package for proteolytic digest simulation, peptide-set scoring, enzyme comparison, and workflow preset selection for bottom-up proteomics.
