
Refined logic for how ends of an alignment are handled #59

Merged
jonas-fuchs merged 13 commits into master from 5_3_prime_ends
Mar 23, 2026


@jonas-fuchs
Owner

This PR refines how the ends of an alignment are handled.

Previously, gaps at the ends of an alignment were masked like internal gaps whenever they reached a frequency above 1-threshold. This logic did not take into account that alignment ends (particularly for viruses) are often incomplete for a proportion of the sequences. As a result, a lot of viable information was excluded as soon as the alignment contained a substantial number of shorter sequences. This PR introduces a separate parameter, TERMINAL_MASKING_THRESHOLD, that defines how much information is needed at an end for that region to be included.

Example:


# -------------------XXXXXXXXXXXXXXXXXXX---XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# ------------------XXXXXXXXXXXXXXXXXXXX---XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# ------------------XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

# would result in a consensus that looks like this with a terminal threshold of 0.5 and a threshold of 0.9:
# XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX N XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

# Previously, the consensus would have looked like this:
# NN                XXXXXXXXXXXXXXXXXXXX N XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
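
The masking rule illustrated above can be sketched roughly as follows. This is a minimal illustration and not the code in varvamp/scripts/alignment.py: the function names `terminal_bounds` and `columns_to_mask` are hypothetical, and it assumes that a terminal column is kept when the fraction of sequences covering it is at least the terminal threshold, while internal gaps are judged only among the covering sequences.

```python
def terminal_bounds(seq, gap="-"):
    """Return (start, end) of the non-gap span of a sequence; end is exclusive."""
    start = len(seq) - len(seq.lstrip(gap))
    end = len(seq.rstrip(gap))
    return start, end


def columns_to_mask(alignment, threshold, terminal_threshold, gap="-"):
    """Return the column indices that would be masked (hypothetical helper).

    Assumption: gaps outside a sequence's non-gap span are terminal gaps and
    are compared against terminal_threshold; all other gaps are internal and
    are compared against 1 - threshold.
    """
    n_seqs = len(alignment)
    bounds = [terminal_bounds(seq, gap) for seq in alignment]
    masked = []
    for col in range(len(alignment[0])):
        # sequences that actually span this column (not in a terminal gap)
        covering = [
            seq for seq, (start, end) in zip(alignment, bounds)
            if start <= col < end
        ]
        if not covering or len(covering) / n_seqs < terminal_threshold:
            masked.append(col)  # too little information at this end
            continue
        internal_gaps = sum(1 for seq in covering if seq[col] == gap)
        if internal_gaps / len(covering) > 1 - threshold:
            masked.append(col)  # internal gap is too frequent
    return masked


alignment = [
    "---ACG-TAC",
    "---ACG-TAC",
    "ACGACGTTAC",
    "ACGACGTTAC",
    "ACGACGTTAC",
]
print(columns_to_mask(alignment, threshold=0.9, terminal_threshold=0.5))
print(columns_to_mask(alignment, threshold=0.9, terminal_threshold=0.7))
```

In this toy alignment, three of five sequences cover the first three columns, so those columns survive a terminal threshold of 0.5 but not one of 0.7, while the internal gap column is masked in both cases.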

Importantly, for this to work, the consensus building had to be changed. Previously, the consensus threshold for the number of sequences was simply n_sequences * threshold. Now it is calculated for each position as number_of_nucleotides * threshold, thereby excluding gaps from the frequency calculation. Similarly, gaps are now also excluded from the entropy calculation for proper visualization.
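
A small sketch of the difference between the two cutoffs, assuming a simplified consensus rule in which bases are added in order of decreasing count until the cutoff is reached. The function name `bases_for_consensus` is hypothetical and this is not the code in varvamp/scripts/consensus.py.

```python
from collections import Counter


def bases_for_consensus(column, threshold, gap="-"):
    """Return the bases needed to reach the per-position cutoff.

    The cutoff is number_of_nucleotides * threshold, so gaps do not count
    toward the total (simplified, hypothetical consensus rule).
    """
    counts = Counter(base for base in column if base != gap)
    n_nucleotides = sum(counts.values())
    if n_nucleotides == 0:
        return []  # gap-only column, nothing to call
    cutoff = n_nucleotides * threshold
    chosen, covered = [], 0
    for base, count in counts.most_common():
        chosen.append(base)
        covered += count
        if covered >= cutoff:
            break
    return sorted(chosen)


# 4 nucleotides and 6 gaps: the per-position cutoff is 4 * 0.9 = 3.6, so "A"
# alone is enough. With the old cutoff of n_sequences * threshold = 9, no
# combination of observed bases could have reached it.
print(bases_for_consensus("AAAA------", threshold=0.9))
print(bases_for_consensus("AAAG", threshold=0.9))
```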

Primers may change slightly compared with prior calculations for regions that span gaps with a frequency lower than 1-threshold, because those gaps no longer count toward the consensus frequency.

@jonas-fuchs jonas-fuchs added the enhancement New feature or request label Mar 5, 2026
@jonas-fuchs
Owner Author

@wm75 If we merge this, the Galaxy Tool would need to be updated to include the new parameter.

@jonas-fuchs jonas-fuchs self-assigned this Mar 5, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR updates varVAMP’s alignment post-processing to treat terminal gaps (incomplete sequence ends) differently from internal gaps, so that informative ends aren’t overly masked when many sequences are shorter (common in viral datasets).

Changes:

  • Add a new config parameter TERMINAL_MASKING_THRESHOLD and use it in alignment gap-masking logic to decide whether terminal regions should be masked.
  • Update consensus calling to compute the consensus cutoff per-position based on non-gap information (excluding gaps from frequency calculations).
  • Exclude gaps from entropy calculations to better reflect variability where sequence information exists.
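
As an illustration of the last point, a gap-excluding per-column entropy could look roughly like this. It is a sketch that assumes plain Shannon entropy in bits over the non-gap characters; the function name `column_entropy` is hypothetical, and the normalization actually used in varvamp/scripts/reporting.py may differ.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (bits) of one alignment column, ignoring gaps."""
    counts = Counter(base for base in column if base != gap)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # gap-only column carries no nucleotide information
    return sum(
        -(count / total) * math.log2(count / total)
        for count in counts.values()
    )


# Gaps no longer inflate the variability of an otherwise conserved column.
print(column_entropy("AAAA--"))
print(column_entropy("AATT--"))
```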

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| varvamp/scripts/alignment.py | Implements terminal-gap detection and separate masking logic via TERMINAL_MASKING_THRESHOLD. |
| varvamp/scripts/consensus.py | Changes consensus cutoff logic to be based on non-gap counts per position. |
| varvamp/scripts/reporting.py | Updates entropy calculation to exclude gaps. |
| varvamp/scripts/primers.py | Excludes gap-only observations from per-base mismatch statistics. |
| varvamp/scripts/default_config.py | Introduces TERMINAL_MASKING_THRESHOLD with defaults and explanatory comments. |
| varvamp/scripts/logging.py | Adds config presence/validation for TERMINAL_MASKING_THRESHOLD; removes alignment-length check helper. |
| varvamp/command.py | Removes the call to the removed alignment-length check. |
| pyproject.toml | Bumps version from 1.3.1 to 1.3.2. |


jonas-fuchs and others added 2 commits March 7, 2026 14:48
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

@UdoGi UdoGi left a comment


Only looked at the alignment.py as I'm missing too much context for the other changes

@jonas-fuchs jonas-fuchs merged commit 85d870b into master Mar 23, 2026
8 checks passed
@jonas-fuchs jonas-fuchs deleted the 5_3_prime_ends branch March 23, 2026 07:46
3 participants