refined logic how ends of an alignment are handeled#59
Conversation
|
@wm75 If we merge this, the Galaxy Tool would need to be updated to include the new parameter. |
…s with less information
There was a problem hiding this comment.
Pull request overview
This PR updates varVAMP’s alignment post-processing to treat terminal gaps (incomplete sequence ends) differently from internal gaps, so that informative ends aren’t overly masked when many sequences are shorter (common in viral datasets).
Changes:
- Add a new config parameter
TERMINAL_MASKING_THRESHOLDand use it in alignment gap-masking logic to decide whether terminal regions should be masked. - Update consensus calling to compute the consensus cutoff per-position based on non-gap information (excluding gaps from frequency calculations).
- Exclude gaps from entropy calculations to better reflect variability where sequence information exists.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
varvamp/scripts/alignment.py |
Implements terminal-gap detection and separate masking logic via TERMINAL_MASKING_THRESHOLD. |
varvamp/scripts/consensus.py |
Changes consensus cutoff logic to be based on non-gap counts per position. |
varvamp/scripts/reporting.py |
Updates entropy calculation to exclude gaps. |
varvamp/scripts/primers.py |
Excludes gap-only observations from per-base mismatch statistics. |
varvamp/scripts/default_config.py |
Introduces TERMINAL_MASKING_THRESHOLD with defaults and explanatory comments. |
varvamp/scripts/logging.py |
Adds config presence/validation for TERMINAL_MASKING_THRESHOLD; removes alignment-length check helper. |
varvamp/command.py |
Removes the call to the removed alignment-length check. |
pyproject.toml |
Bumps version from 1.3.1 to 1.3.2. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
UdoGi
left a comment
There was a problem hiding this comment.
Only looked at the alignment.py as I'm missing too much context for the other changes
This PR refines how ends of an alignment are handeled.
Previously gaps at the end of an alignment were masked like the internal gaps if they reached a frequency above
1-threshold. This logic does not take into account that often alignment ends (particular for viruses) are incomplete for a proportion of sequences. This leads to the exclusion of a lot of viable information as soon as there is a substantial amount of shorter sequences. This now introduces a seperate parameterTERMINAL_MASKING_THRESHOLDthat defines how much information at an end is needed to be included.Example:
Importantly, for this to work, the consensus building had to be changed. Previously, the consensus therehold for the number of sequences was just
n_sequences * thresholdnow this is calculated for each position bynumber_of_nucleotides * threshold, thereby excluding gaps in the the frequency calculation. Similar this was also applied to the entropy calculation for proper visualization.Primers could potentially change slightly compared to prior calculations for regions which span gaps with a frequency lower than
1-threshold.