Update regex patterns for feature ID extraction by AbhirupaGhosh · Pull Request #9 · JRaviLab/amRdata

AbhirupaGhosh · 2026-01-28T23:32:35Z

Description

Fixes #8.

What kind of change(s) are included?

Feature (adds or updates new capabilities)
Bug fix (fixes an issue).
Enhancement (adds functionality).
Breaking change (these changes would cause existing functionality to not work as expected).

Checklist

Please ensure that all boxes are checked before indicating that this pull request is ready for review.

I have read and followed the CONTRIBUTING.md guidelines.
I have searched for existing content to ensure this is not a duplicate.
I have performed a self-review of these additions (including spelling, grammar, and related).
I have added comments to my code to help provide understanding.
I have added a test which covers the code changes found within this PR.
I have deleted all non-relevant text in this pull request template.
Reviewer assignment: Tag a relevant team member to review and approve the changes.

epbrenner

Lots of reformatting, so this is going to take me a little longer than I expected, but the regex parse fix looks appropriate. Will test tomorrow to confirm and then approve or fix as needed. Thanks for getting to the bottom of this latest quirk.

epbrenner · 2026-01-28T23:46:11Z

R/data_curation.R

  DBI::dbExecute(
    con,
-    glue::glue('CREATE TABLE IF NOT EXISTS {meta_table} (
+    glue::glue("CREATE TABLE IF NOT EXISTS {meta_table} (


I can run this in the morning, but I think DuckDB requires single quotes for string literals, so if we're gluing this into a command, double quotes might not work properly. Will check.
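For reference, a minimal base-R sketch of the quoting distinction raised here, with `sprintf` standing in for `glue::glue`. The table name, column name, and the two `SELECT` statements are made up for illustration. The delimiter around the R string never reaches the database; only the quotes inside the SQL text do. In standard SQL, which DuckDB follows on this point, single quotes mark string literals and double quotes mark identifiers.

```r
meta_table <- "metadata"  # hypothetical table name, for illustration only

# The outer delimiter is R syntax: both calls produce the same SQL text.
sql_single <- sprintf('CREATE TABLE IF NOT EXISTS %s (genome_id VARCHAR)', meta_table)
sql_double <- sprintf("CREATE TABLE IF NOT EXISTS %s (genome_id VARCHAR)", meta_table)
same_sql <- identical(sql_single, sql_double)

# The inner quotes are what the database sees.
literal_sql    <- "SELECT 'resistant' AS phenotype"  # 'resistant' is a string literal
identifier_sql <- 'SELECT "resistant" AS phenotype'  # "resistant" is read as a column name

print(same_sql)
print(c(literal_sql, identifier_sql))
```

So the change in the diff is only a concern if quotes inside the SQL text were swapped along with the outer delimiter, which the excerpt above does not show.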

epbrenner · 2026-01-28T23:50:47Z

R/data_processing.R

+      proteinID = stringr::str_extract(value, "^fig\\|[0-9]+\\.[0-9]+\\.peg(?:sc)?\\.[0-9]+"),
+      locus_tag = stringr::str_match(value, "peg(?:sc)?\\.[0-9]+\\|([^\\s]+)")[, 2],


Makes sense as a fix. Again, will run ASAP and confirm before approving.
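A quick way to sanity-check the two patterns before running the full pipeline, sketched in base R so it runs without stringr. The input strings are made up, and the helpers `extract_first` and `extract_group1` are hypothetical stand-ins for `stringr::str_extract` and `stringr::str_match(...)[, 2]`; the patterns themselves are copied from the diff.

```r
# Made-up header strings; the real contents of `value` are an assumption.
value <- c(
  "fig|83332.12.peg.101|Rv0001 chromosomal replication initiator",
  "fig|1280.500.pegsc.7|SAOUHSC_00001 hypothetical protein",
  "not_a_feature_id"
)

id_pattern  <- "^fig\\|[0-9]+\\.[0-9]+\\.peg(?:sc)?\\.[0-9]+"
tag_pattern <- "peg(?:sc)?\\.[0-9]+\\|([^\\s]+)"

# Base-R stand-in for stringr::str_extract(): whole first match, or NA.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[which(m != -1)] <- regmatches(x, m)
  out
}

# Base-R stand-in for stringr::str_match(...)[, 2]: first capture group, or NA.
extract_group1 <- function(x, pattern) {
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(
    hits,
    function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
    character(1)
  )
}

proteinID <- extract_first(value, id_pattern)
locus_tag <- extract_group1(value, tag_pattern)
print(data.frame(proteinID, locus_tag))
```

Both the `peg` and `pegsc` forms should yield an ID and a locus tag, and the non-matching string should give `NA` in both columns.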

Rename the original cleanData to cleanMetaData and add a roxygen skeleton. Introduce a writeCompressedParquet helper and export cleaned metadata, AMR phenotype, genome data, and original metadata to compressed Parquet files, then create a separate Parquet-backed DuckDB database with views for metadata, amr_phenotype, genome_data, and original_metadata. Add a new cleanData function focused on feature matrices (genes, proteins, domains, etc.) that writes feature tables to Parquet and creates corresponding views; remove the duplicated metadata Parquet exports from the feature-matrix flow. Make minor whitespace and path-handling adjustments to normalize paths and ensure output directories exist.
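As a rough illustration of the Parquet-backed views described in this commit, the sketch below only builds the SQL text in base R; it does not write Parquet files or open a database. The directory, the one-file-per-view layout, and the helper name `build_view_sql` are assumptions, not the package's actual code. `read_parquet` is DuckDB's table function for reading Parquet files.

```r
# View names come from the commit message; the directory is hypothetical.
view_names  <- c("metadata", "amr_phenotype", "genome_data", "original_metadata")
parquet_dir <- "out/parquet"

# Hypothetical helper: one CREATE VIEW statement per Parquet file.
build_view_sql <- function(view, dir) {
  path <- file.path(dir, paste0(view, ".parquet"))
  # The path is a SQL string literal, so it goes in single quotes;
  # any single quote inside the path is doubled to escape it.
  path <- gsub("'", "''", path, fixed = TRUE)
  sprintf("CREATE OR REPLACE VIEW %s AS SELECT * FROM read_parquet('%s')", view, path)
}

view_sql <- vapply(view_names, build_view_sql, character(1),
                   dir = parquet_dir, USE.NAMES = FALSE)
print(view_sql)

# With an open DuckDB connection `con`, each statement could then be run as:
# for (sql in view_sql) DBI::dbExecute(con, sql)
```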

Update regex patterns for feature ID extraction

73f7b52

AbhirupaGhosh requested review from epbrenner and jananiravi January 28, 2026 23:32

AbhirupaGhosh self-assigned this Jan 28, 2026

Style code (GHA)

8d494a5

epbrenner reviewed Jan 28, 2026

AbhirupaGhosh and others added 7 commits February 19, 2026 08:54

Style code (GHA)

c52aaa7

Updating download logic

4edc964

Fixed trailing zero bug, fixed FTP timeout bug (?), fixed empty files hanging downloads, fixed imbalanced genome data sets (e.g., no .fna, yes .faa, yes .gff)

Style code (GHA)

5287d1b

Add CD-HIT parsing function and update DB logic

f04d3dc

Added a function to parse CD-HIT .clstr output into a long-format mapping of clusters to member feature ids. Updated database writing logic to include the new protein members table.
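For readers unfamiliar with the `.clstr` layout, here is a minimal base-R sketch of turning it into a long-format cluster-to-member table. It is not the function added in this commit: the name `parse_clstr_lines`, the column names, and the sample lines are all hypothetical. It assumes the usual CD-HIT layout, where each cluster starts with a `>Cluster N` line and each member line holds the sequence ID between `>` and `...`, with `*` marking the representative.

```r
# Hypothetical parser: character vector of .clstr lines -> long-format data frame.
parse_clstr_lines <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]           # drop blank lines
  is_header <- grepl("^>Cluster ", lines)
  stopifnot(length(lines) > 0, is_header[1])      # file must start with a cluster header

  # Carry each header's cluster number down to the member lines below it.
  cluster_of_line <- sub("^>Cluster ", "", lines[is_header])[cumsum(is_header)]
  member <- !is_header

  data.frame(
    cluster = as.integer(cluster_of_line[member]),
    # Member ID sits between ">" and the "..." that CD-HIT appends.
    feature_id = sub("^.*>(.+?)\\.\\.\\..*$", "\\1", lines[member], perl = TRUE),
    is_representative = grepl("\\*\\s*$", lines[member]),
    stringsAsFactors = FALSE
  )
}

# Made-up example input in .clstr layout.
clstr_lines <- c(
  ">Cluster 0",
  "0\t350aa, >fig|83332.12.peg.101... *",
  "1\t348aa, >fig|1280.500.pegsc.7... at 98.50%",
  ">Cluster 1",
  "0\t120aa, >fig|83332.12.peg.202... *"
)

members <- parse_clstr_lines(clstr_lines)
print(members)
```

In practice the lines would come from `readLines()` on the `.clstr` file, and the resulting table could be written as the protein members table mentioned above.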

Style code (GHA)

6bcddb6

Merge branch 'cleanData' into minor_regex_change

394cfd7

AbhirupaGhosh mentioned this pull request Mar 12, 2026

I can run this in the morning, but I think DuckDB requires single quotes for these strings to be interpreted as strings, and if we're gluing this into a command then double quotes might not work properly. Will check. #16

Open

AbhirupaGhosh closed this Mar 12, 2026

AbhirupaGhosh wants to merge 9 commits into main from minor_regex_change
