Implement DNA-to-Protein Pipeline (Work Order 03) #7
- Added `predict_isoforms` to assemble mature mRNA from reference and mutant sequences.
- Added `translate` and `compare_translation` for codon-to-AA translation and frameshift detection.
- Added a `ProteinFolder` API wrapper that predicts protein structure and returns mock pLDDT scores.
- Implemented `compute_structure_impact` to assess structural collapse.
- Added tests to verify frameshift and structural collapse detection.

Co-authored-by: AkeBoss-tech <69588353+AkeBoss-tech@users.noreply.github.com>
Pull request overview
Implements a basic DNA→mRNA→protein pipeline under src/hg_dt/, plus tests covering a frameshift deletion and a mocked “structural collapse” signal using pLDDT deltas.
Changes:
- Add transcript isoform assembly from exon annotations and edited sequence windows.
- Add ORF detection + translation and a simple translation comparison (frameshift heuristic).
- Add a mock protein folding wrapper and a pLDDT-based structural impact metric, with tests for key edge cases.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/hg_dt/translate/transcript.py | Builds ref/mut mRNA sequences per transcript by concatenating exon segments and handling strand/RNA conversion. |
| src/hg_dt/translate/translator.py | Finds longest ORF and translates to AA; compares ref vs mut translation for frameshift/truncation detection. |
| src/hg_dt/models/protein.py | Introduces a mocked protein folding API wrapper that returns synthetic pLDDT scores. |
| src/hg_dt/analyze/deltas.py | Adds compute_structure_impact to summarize pLDDT deltas and flag potential fold collapse. |
| tests/hg_dt/test_pipeline.py | Adds unit tests for frameshift deletion translation and structural collapse/no-collapse scenarios. |
| pip_install.sh | Adds a pip install step for pyranges. |
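
For readers unfamiliar with the "longest ORF" step mentioned for translator.py, the following is a minimal sketch of one common way to do it, not the PR's actual implementation. The function name `longest_orf`, the forward-strand-only scan, and the tie-breaking rule (first ORF found wins) are all assumptions made for illustration.

```python
# Hypothetical sketch: scan the three forward reading frames for the
# longest ATG...stop open reading frame. The PR's real logic may differ.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> str:
    """Return the longest ORF (start codon through stop codon), or ''."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                candidate = seq[start:i + 3]
                if len(candidate) > len(best):  # ties keep the earlier ORF
                    best = candidate
                start = None  # look for the next ATG in this frame
    return best


print(longest_orf("CCATGGCCATTGTAATGGGCCGCTGACC"))
# ATGGCCATTGTAATGGGCCGCTGA
```

An ORF with no in-frame stop codon is ignored in this sketch; whether the pipeline should report such open-ended ORFs is a design choice the PR would need to state.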
    def compare_translation(ref_mrna: str, mut_mrna: str) -> Dict[str, any]:
        """
compare_translation is annotated as Dict[str, any], but any is the built-in function, not a typing type. Use typing.Any (and import it) or a more precise TypedDict/return dataclass so type checkers don’t misinterpret this annotation.
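
As an illustration of the two options named in this comment, here is a small sketch. The `TranslationComparison` keys and the `summarize_translation` helper are hypothetical; the actual keys returned by `compare_translation` are not shown in the review and may differ.

```python
from typing import Any, Dict, Optional, TypedDict


class TranslationComparison(TypedDict):
    """Hypothetical result shape, for illustration only."""
    ref_aa: str
    mut_aa: str
    identical: bool
    first_diff: Optional[int]  # 0-based index of the first differing residue


def summarize_translation(ref_aa: str, mut_aa: str) -> TranslationComparison:
    """Compare two amino-acid strings and report where they first diverge."""
    first_diff = next(
        (i for i, (a, b) in enumerate(zip(ref_aa, mut_aa)) if a != b),
        None,
    )
    if first_diff is None and len(ref_aa) != len(mut_aa):
        # One sequence is a strict prefix of the other (truncation/extension).
        first_diff = min(len(ref_aa), len(mut_aa))
    return {
        "ref_aa": ref_aa,
        "mut_aa": mut_aa,
        "identical": ref_aa == mut_aa,
        "first_diff": first_diff,
    }


# The looser alternative: typing.Any (capital A), not the built-in any().
def summarize_loose(ref_aa: str, mut_aa: str) -> Dict[str, Any]:
    return dict(summarize_translation(ref_aa, mut_aa))
```

The `TypedDict` form lets a type checker verify key names and value types at call sites, while `Dict[str, Any]` only fixes the annotation without adding that checking.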
    def compute_structure_impact(ref_plddt: List[float], mut_plddt: List[float]) -> Dict[str, float]:
        """
        Computes structural impact metrics between reference and mutant pLDDT scores.
compute_structure_impact is annotated as returning Dict[str, float], but the returned dict includes a boolean (collapse). Adjust the return type (e.g., Dict[str, Any] or a TypedDict) to match the actual values.
        Dictionary of metrics, including average pLDDT drop and whether structural collapse occurred.
        """
        if not ref_plddt and not mut_plddt:
            return {'avg_drop': 0.0, 'collapse': False}
When both inputs are empty, this function returns only avg_drop and collapse, but in the non-empty case it returns additional keys (avg_ref_plddt, avg_mut_plddt). Returning inconsistent shapes makes the API harder to consume; consider always returning the full key set (with 0.0 defaults) even in the empty-input case.
Suggested change:

    -    return {'avg_drop': 0.0, 'collapse': False}
    +    return {
    +        'avg_ref_plddt': 0.0,
    +        'avg_mut_plddt': 0.0,
    +        'avg_drop': 0.0,
    +        'collapse': False
    +    }
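
One way to guarantee a consistent return shape, and to address the earlier return-type comment at the same time, is to build every result through a single typed path. The sketch below is an assumption-laden illustration, not the PR's code: the name `structure_impact_sketch`, the `collapse_threshold` parameter, and its default of 20.0 pLDDT points are all hypothetical, since the review does not show how collapse is decided.

```python
from typing import List, TypedDict


class StructureImpact(TypedDict):
    avg_ref_plddt: float
    avg_mut_plddt: float
    avg_drop: float
    collapse: bool


def _mean(values: List[float]) -> float:
    """Mean that falls back to 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def structure_impact_sketch(
    ref_plddt: List[float],
    mut_plddt: List[float],
    collapse_threshold: float = 20.0,  # hypothetical cutoff, not from the PR
) -> StructureImpact:
    """Summarize pLDDT change; always returns the same four keys."""
    avg_ref = _mean(ref_plddt)
    avg_mut = _mean(mut_plddt)
    avg_drop = avg_ref - avg_mut
    return {
        "avg_ref_plddt": avg_ref,
        "avg_mut_plddt": avg_mut,
        "avg_drop": avg_drop,
        "collapse": avg_drop >= collapse_threshold,
    }
```

Because the empty-input case flows through the same return statement, there is no separate early-return branch that could drift out of sync with the full key set.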
    import os
    import requests
    from typing import Dict, Optional, Tuple, List


    class ProteinFolder:
        """Wrapper for a protein folding API (e.g. BioNeMo AlphaFold/ESMFold)."""

        def __init__(self, api_key: Optional[str] = None):
            # We simulate the API call. In a real environment, we'd use BioNeMo's REST API.
            self.api_key = api_key or os.environ.get("BIONEMO_API_KEY", "mock_key")
requests, Tuple, and List are imported but unused in this module. If the folding API call is intentionally mocked for now, consider removing unused imports (and potentially the unused api_key field) to avoid implying external network behavior and to keep the module’s dependency surface minimal.
Suggested change:

    - import os
    - import requests
    - from typing import Dict, Optional, Tuple, List
    -
    - class ProteinFolder:
    -     """Wrapper for a protein folding API (e.g. BioNeMo AlphaFold/ESMFold)."""
    -
    -     def __init__(self, api_key: Optional[str] = None):
    -         # We simulate the API call. In a real environment, we'd use BioNeMo's REST API.
    -         self.api_key = api_key or os.environ.get("BIONEMO_API_KEY", "mock_key")
    + from typing import Dict
    +
    + class ProteinFolder:
    +     """Wrapper for a protein folding API (e.g. BioNeMo AlphaFold/ESMFold)."""
    +
    +     def __init__(self):
    +         # We simulate the API call. In a real environment, we'd use BioNeMo's REST API.
    +         pass
        def predict_structure(self, aa_seq: str) -> Dict[str, any]:
            """
predict_structure is annotated as returning Dict[str, any], but any is the built-in function, not a typing type. Use typing.Any (or a TypedDict) for correct type annotations.
    @@ -0,0 +1 @@
    + pip install pyranges
This install script adds pyranges, but the new pipeline code imports Biopython (Bio.Seq) and the mock folding wrapper imports requests. Neither biopython nor requests appear in environment.yml, so a fresh environment following repo setup will likely fail at import time. Consider adding the missing deps either here (pip) or to environment.yml so CI/users get a reproducible install.
Suggested change:

    - pip install pyranges
    + pip install pyranges biopython requests
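
A quick way to catch this class of problem before tests run is a preflight import check. The sketch below is illustrative only: `missing_modules` is a hypothetical helper, and the module names `pyranges`, `Bio`, and `requests` are taken from the imports this comment mentions, not from a full audit of the repository.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    # Import names (not PyPI names): Biopython installs as the `Bio` package.
    required = ["pyranges", "Bio", "requests"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Running such a check in CI would surface a missing dependency as one clear message instead of an import error partway through the test run.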
    import pytest
    import pandas as pd
    from src.hg_dt.translate.transcript import predict_isoforms
    from src.hg_dt.translate.translator import compare_translation, translate
pytest and translate are imported but not used in this test module. Removing unused imports will keep the tests clean and avoid misleading readers about what’s being exercised.
Suggested change:

    - import pytest
    - import pandas as pd
    - from src.hg_dt.translate.transcript import predict_isoforms
    - from src.hg_dt.translate.translator import compare_translation, translate
    + import pandas as pd
    + from src.hg_dt.translate.transcript import predict_isoforms
    + from src.hg_dt.translate.translator import compare_translation
    # 1 bp deletion at position 4 (G -> deleted)
    # Ref: ATG GCC ATT GTA ATG GGC CGC TGA (start -> GCC -> ATT -> GTA -> ATG -> GGC -> CGC -> TGA stop)
    # Ref AA: M A I V M G R *

    # Mut: ATG CCA TTG TAA (start -> CCA -> TTG -> TAA stop)
    # Mut AA: M P L *
The comment says “1 bp deletion at position 4” but the test parameters use edit_start=3, edit_end=4 (0-based half-open semantics). Consider clarifying the coordinate convention in the comment (0-based vs 1-based) to prevent future confusion when modifying this test.
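
To make the two conventions concrete, the sketch below applies the deletion from the test comment using 0-based half-open coordinates, so 1-based position 4 corresponds to `start=3, end=4`. The helpers `delete_span` and `translate_until_stop` and the partial codon table are hypothetical and cover only the codons in this example; they are not the PR's `translate` implementation.

```python
# Partial codon table: only the codons that occur in this example.
CODONS = {
    "ATG": "M", "GCC": "A", "ATT": "I", "GTA": "V", "GGC": "G",
    "CGC": "R", "CCA": "P", "TTG": "L", "TAA": "*", "TGA": "*",
}

REF = "ATGGCCATTGTAATGGGCCGCTGA"


def delete_span(seq: str, start: int, end: int) -> str:
    """Delete seq[start:end] using 0-based, half-open coordinates."""
    return seq[:start] + seq[end:]


def translate_until_stop(seq: str) -> str:
    """Translate from the first base, stopping at the first stop codon."""
    aas = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODONS[seq[i:i + 3]]
        if aa == "*":
            break
        aas.append(aa)
    return "".join(aas)


# 1-based position 4 is the G of the second codon, i.e. 0-based index 3.
assert REF[3] == "G"
mut = delete_span(REF, start=3, end=4)

print(translate_until_stop(REF))  # MAIVMGR
print(translate_until_stop(mut))  # MPL
```

Stating the convention once in the test comment, for example "1-based position 4, i.e. edit_start=3, edit_end=4 (0-based, half-open)", would keep the prose and the parameters visibly in agreement.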
This submission addresses Work Order 03: The DNA-to-Protein Pipeline. It implements transcription, alternative splicing simulation, rule-based translation, and protein structure impact analysis (mock API). It was tested on frameshift-deletion and pLDDT-drop (structural collapse) edge cases, and the tests pass.
PR created automatically by Jules for task 13697527110041319258 started by @AkeBoss-tech