feat(pt): make se_a and dpa1 compatible with torch.export #5193
wanghan-iapcm wants to merge 3 commits into deepmodeling:master from
Conversation
Summary of Changes

Hello @wanghan-iapcm, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request focuses on making the se_a descriptor compatible with torch.export.

Highlights
Changelog
Activity
Code Review
This pull request successfully makes the se_a descriptor compatible with torch.export. The changes include refactoring imports and removing obsolete code in se_a.py. Additionally, new tests for torch.export are introduced, and existing JIT tests are updated to use torch.export. The overall implementation is solid and aligns with the PR's goal. I have one suggestion to improve code maintainability by addressing duplicated code in the new test file.
```python
class ForwardLowerWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, extended_coord, extended_atype, nlist):
        return self.model.forward_lower(extended_coord, extended_atype, nlist)
```
This ForwardLowerWrapper class is a duplicate of the one defined in test_export_energy_model_se_a starting at line 108. To avoid code duplication and improve maintainability, this class should be defined only once at the module level and then reused in both test methods. I suggest moving the other definition to the top of the file (e.g., after imports) and removing this one.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a4729b0b6c
```python
model = DescrptSeA(
    rcut=self.rcut,
    rcut_smth=self.rcut_smth,
    sel=self.sel,
    neuron=self.neuron,
    axis_neuron=self.axis_neuron,
    precision="float32",
    trainable=False,
    type_one_side=type_one_side,
)
model.eval()
```
Move descriptor to DEVICE before export
When CUDA is available, env.DEVICE becomes a GPU device, but the DescrptSeA module stays on CPU. The test builds inputs on env.DEVICE and then calls torch.export.export(model, ...), which will raise a device-mismatch error (CPU parameters vs. CUDA inputs). This means the new export test will fail on GPU runners. Moving the model to env.DEVICE (or forcing CPU inputs) avoids the mismatch.
```python
    precision=prec,
    resnet_dt=idt,
    seed=GLOBAL_SEED,
    type_one_side=type_one_side,
)
```
Export test mixes CPU model with DEVICE inputs
This test constructs coord_ext, atype_ext, and nlist on env.DEVICE, but the DescrptSeA instance remains on CPU. On GPU-capable environments, torch.export.export(dd0, ...) will error because module parameters and buffers are on CPU while inputs are on CUDA. This makes the test non-portable across CPU/GPU CI. Ensure the model is moved to env.DEVICE (or create CPU inputs) before exporting.
📝 Walkthrough

The changes refactor import organization in a descriptor module, introduce comprehensive torch.export testing, remove an outdated test class, and migrate existing tests from torch.jit.script to torch.export.export with added type_one_side parameter support.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: 2 passed, 1 failed (warning)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
source/tests/pt/model/test_se_e2_a.py (1)
124-140: ⚠️ Potential issue | 🟡 Minor — Model not moved to device, potential device mismatch.

The model is created but not moved to env.DEVICE, while the input tensors are created on env.DEVICE. This could cause device mismatch errors when running on GPU. Compare with test_consistency in this same file (line 63), which calls .to(env.DEVICE) on the model, and with test_export.py line 92.

Proposed fix

```diff
 dd0 = DescrptSeA(
     self.rcut,
     self.rcut_smth,
     self.sel,
     precision=prec,
     resnet_dt=idt,
     seed=GLOBAL_SEED,
     type_one_side=type_one_side,
-)
+).to(env.DEVICE)
```
🤖 Fix all issues with AI agents
In `@source/tests/pt/model/test_export.py`:
- Around line 45-74: The test_export_descriptor_se_a creates a DescrptSeA
instance but never moves it to env.DEVICE while inputs are created on
env.DEVICE, causing device-mismatch errors on GPU; fix by sending the model to
the same device (call model.to(env.DEVICE) or model.to(device)) before calling
model.eval() and before export (so adjust around the DescrptSeA instantiation
and before torch.export.export) to ensure model, coord_ext, atype_ext, and nlist
are all on env.DEVICE.
- Around line 158-167: The tearDown method is attempting to remove the
"checkpoint" directory with os.remove, which fails for directories; update the
cleanup logic in tearDown (the method in test_export.py) to check whether
"checkpoint" is a directory and use shutil.rmtree to remove it (mirroring how
"stat_files" is handled), otherwise use os.remove for files — ensure the branch
targets the symbol "checkpoint" and uses shutil.rmtree when
os.path.isdir("checkpoint").
🧹 Nitpick comments (1)
source/tests/pt/model/test_export.py (1)
108-114: Consider extracting ForwardLowerWrapper to reduce duplication.

The ForwardLowerWrapper class is defined identically in two places. Consider moving it to module level to avoid duplication.

Proposed refactor

```diff
+class ForwardLowerWrapper(torch.nn.Module):
+    """Wrapper to expose forward_lower for export compatibility."""
+
+    def __init__(self, model):
+        super().__init__()
+        self.model = model
+
+    def forward(self, extended_coord, extended_atype, nlist):
+        return self.model.forward_lower(extended_coord, extended_atype, nlist)
+
+
 class TestExport(unittest.TestCase):
     # ... in test_export_energy_model_se_a:
-        class ForwardLowerWrapper(torch.nn.Module):
-            def __init__(self, model):
-                super().__init__()
-                self.model = model
-
-            def forward(self, extended_coord, extended_atype, nlist):
-                return self.model.forward_lower(extended_coord, extended_atype, nlist)
-
         wrapper = ForwardLowerWrapper(model)
```

Apply a similar change to ExportIntegrationTest.test_export.

Also applies to: 146-152
```python
def test_export_descriptor_se_a(self):
    """Test DescrptSeA descriptor export."""
    for type_one_side in [True, False]:
        model = DescrptSeA(
            rcut=self.rcut,
            rcut_smth=self.rcut_smth,
            sel=self.sel,
            neuron=self.neuron,
            axis_neuron=self.axis_neuron,
            precision="float32",
            trainable=False,
            type_one_side=type_one_side,
        )
        model.eval()

        nf = 2
        nloc = 5
        nnei = sum(self.sel)
        nall = nloc + 10

        coord_ext = torch.randn(nf, nall * 3, device=env.DEVICE)
        atype_ext = torch.randint(
            0, 2, (nf, nall), dtype=torch.int32, device=env.DEVICE
        )
        nlist = torch.randint(
            0, nall, (nf, nloc, nnei), dtype=torch.int32, device=env.DEVICE
        )

        exported = torch.export.export(model, (coord_ext, atype_ext, nlist))
        self.assertIsNotNone(exported)
```
Model not moved to device, potential device mismatch.
Same issue as in test_se_e2_a.py: the DescrptSeA model is created but not moved to env.DEVICE, while tensors are on env.DEVICE. This will fail on GPU.
Proposed fix

```diff
 model = DescrptSeA(
     rcut=self.rcut,
     rcut_smth=self.rcut_smth,
     sel=self.sel,
     neuron=self.neuron,
     axis_neuron=self.axis_neuron,
     precision="float32",
     trainable=False,
     type_one_side=type_one_side,
-)
+).to(env.DEVICE)
```
```python
def tearDown(self) -> None:
    for f in os.listdir("."):
        if f.startswith("model") and f.endswith("pt"):
            os.remove(f)
        if f in ["lcurve.out", "frozen_model.pth"]:
            os.remove(f)
        if f in ["stat_files"]:
            shutil.rmtree(f)
        if f in ["checkpoint"]:
            os.remove(f)
```
Use shutil.rmtree() for the checkpoint directory in cleanup.
Line 167 uses os.remove("checkpoint"), which only works for files. During training execution (line 136), the trainer creates a checkpoint directory to save model state. Attempting to remove a directory with os.remove() will raise IsADirectoryError. The fix should check the path type before removal, consistent with how stat_files is handled on line 165.
Proposed fix

```diff
     if f in ["stat_files"]:
         shutil.rmtree(f)
-    if f in ["checkpoint"]:
-        os.remove(f)
+    if f == "checkpoint":
+        if os.path.isdir(f):
+            shutil.rmtree(f)
+        else:
+            os.remove(f)
```
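The file-versus-directory distinction behind this fix can be demonstrated with the standard library alone. The `remove_path` helper below is a hypothetical illustration, not part of the test suite:

```python
import os
import shutil
import tempfile


def remove_path(path: str) -> None:
    """Remove a file or a directory tree, whichever `path` is."""
    if os.path.isdir(path):
        shutil.rmtree(path)
    elif os.path.exists(path):
        os.remove(path)


# Demonstrate in a throwaway sandbox so nothing real is touched.
root = tempfile.mkdtemp()
ckpt = os.path.join(root, "checkpoint")
os.makedirs(ckpt)

try:
    os.remove(ckpt)  # os.remove only handles files
except OSError:
    print("os.remove raised on a directory")  # IsADirectoryError on POSIX

remove_path(ckpt)  # handles both cases
print(os.path.exists(ckpt))  # → False
shutil.rmtree(root, ignore_errors=True)
```

Catching `OSError` covers both POSIX (`IsADirectoryError`) and Windows (`PermissionError`), which is why checking `os.path.isdir` up front is the more portable cleanup pattern.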
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff            @@
##           master    #5193   +/-   ##
=======================================
  Coverage   81.95%   81.95%
=======================================
  Files         714      714
  Lines       73441    73437       -4
  Branches     3616     3616
=======================================
- Hits        60187    60185       -2
+ Misses      12091    12089       -2
  Partials     1163     1163
```

☔ View full report in Codecov by Sentry.
```python
    self.fail(f"Full model compilation failed: {e}")

# 3. Execute
program = Runtime.get().load_program(executorch_program.buffer)
```
Check failure — Code scanning / CodeQL: Potentially uninitialized local variable (Error, test)
```python
# We need to wrap it to return specific tensor(s) or handle dict
# if export supports it (export usually flattens).

# Pristine output (dict)
pristine_dict = model.forward_lower(coord_ext, atype_ext, nlist)
```
Check notice — Code scanning / CodeQL: Unused local variable (Note, test)
Summary by CodeRabbit

- New Features
- Tests
- Refactor