Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18199
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
❌ 4 New Failures, 3 Unrelated Failures as of commit 22922df with merge base 425c519.
NEW FAILURES: the following jobs have failed.
BROKEN TRUNK: the following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label ciflow/trunk

@pytorchbot label "partner: arm"

@pytorchbot label "release notes: arm"
Pull request overview
This PR reintroduces Arm backend model evaluation as a dedicated CLI (evaluate_model.py), replacing the previously embedded evaluation flow from aot_arm_compiler.py, and adds tests to exercise common invocation modes.
Changes:
- Add backends/arm/scripts/evaluate_model.py to compile and (optionally) quantize and/or delegate a model, then evaluate it via Arm evaluator utilities.
- Add pytest coverage for running evaluate_model.py against TOSA INT/FP targets and validating the emitted metrics JSON.
- Update examples/arm/aot_arm_compiler.py messaging to point users to the new evaluation script.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| examples/arm/aot_arm_compiler.py | Updates the deprecation/error message to redirect evaluation usage to evaluate_model.py. |
| backends/arm/scripts/evaluate_model.py | Introduces the new evaluation CLI: argument parsing, compile/quantize/delegate pipeline, evaluator execution, and JSON metrics output. |
| backends/arm/test/misc/test_evaluate_model.py | Adds integration-style tests invoking the new script with representative CLI flags and checking output structure. |
```python
# Add evaluator for compression ratio of TOSA file
intermediates_path = Path(args.intermediates)
tosa_paths = list(intermediates_path.glob("*.tosa"))
```

```python
# Add Executorch root to path so this script can be run from anywhere
_EXECUTORCH_DIR = Path(__file__).resolve().parents[3]
_EXECUTORCH_DIR_STR = str(_EXECUTORCH_DIR)
if _EXECUTORCH_DIR_STR not in sys.path:
    sys.path.insert(0, _EXECUTORCH_DIR_STR)
```
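For reference, `Path.parents` indexes ancestors outward from the file, so a `parents[3]` lookup like the one above climbs four levels to the repository root. A minimal illustration with a hypothetical path:

```python
from pathlib import PurePosixPath

# parents[0] is the containing directory; each further index climbs one level
p = PurePosixPath("/repo/backends/arm/scripts/evaluate_model.py")
assert p.parents[0] == PurePosixPath("/repo/backends/arm/scripts")
assert p.parents[3] == PurePosixPath("/repo")
```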
```python
"provided, up to 1000 samples are used for calibration. "
"Supported files: Common image formats (e.g., .png or .jpg) if "
"using imagenet evaluator, otherwise .pt/.pth files. If not provided,"
"quantized models are calibrated on their example inputs."
```
zingo
left a comment
OK to merge. This adds a new file, but as the file is its own test script, buck2 files should not need updates.
87a2dfd to
d7219c7
Compare
This patch reimplements the evaluation feature that used to be in aot_arm_compiler.py while introducing a few improvements. The program is evaluate_model.py; it imports functions from aot_arm_compiler.py to compile a model in a similar manner, but runs its own code focused on evaluating a model using the evaluator classes in backends/arm/util/arm_model_evaluator.py.

The following is supported in evaluate_model.py:
- TOSA reference models (INT, FP).
- Evaluating a model that is quantized and/or lowered. I.e., it is possible to evaluate a model that is quantized but not lowered, lowered but not quantized, or both at the same time.
- Casting the model with the --dtype flag to evaluate it in e.g., bf16 or fp16 format.

Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Change-Id: I85f731633364da1eb71abe602a0335f531ec7e46
Add two tests that exercise evaluate_model.py with different command line arguments.

Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Change-Id: I47304ea270518703dc4c826c4c6672c7aca95228
d7219c7 to
22922df
Compare
Pull request overview
Adds a dedicated Arm backend evaluation entrypoint (evaluate_model.py) to replace the deprecated evaluation flow previously embedded in aot_arm_compiler.py, and wires in tests to exercise the new CLI.
Changes:
- Introduce backends/arm/scripts/evaluate_model.py to compile (optionally quantize and/or delegate) and evaluate models via arm_model_evaluator evaluators.
- Add pytest coverage for TOSA INT/FP evaluation flows and register the new test in Arm test targets.
- Update aot_arm_compiler.py deprecation messaging to direct users to evaluate_model.py.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| backends/arm/scripts/evaluate_model.py | New evaluation CLI script that exports/quantizes/delegates and then runs evaluator(s), writing metrics JSON. |
| backends/arm/test/misc/test_evaluate_model.py | New tests covering basic TOSA INT/FP evaluation and metrics output. |
| backends/arm/test/targets.bzl | Registers the new test in the Arm test suite. |
| backends/arm/scripts/aot_arm_compiler.py | Improves deprecation log message to reference the new script. |
```python
"Evaluate a model quantized and/or delegated for the Arm backend."
" Evaluations include numerical comparison to the original model"
"and/or top-1/top-5 accuracy if applicable."
```

argparse.ArgumentParser is being constructed with the descriptive text as the first positional argument, which sets prog (the program name) rather than description. This results in confusing --help/usage output. Pass this text via the description= keyword (and optionally set prog= separately).

Suggested change:

```python
description=(
    "Evaluate a model quantized and/or delegated for the Arm backend."
    " Evaluations include numerical comparison to the original model"
    "and/or top-1/top-5 accuracy if applicable."
)
```
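The distinction this comment describes can be checked directly: the first positional parameter of `argparse.ArgumentParser` is `prog`, not `description`.

```python
import argparse

# Positional text becomes prog (the displayed program name), not the help description
p_wrong = argparse.ArgumentParser("Evaluate a model for the Arm backend.")
assert p_wrong.prog == "Evaluate a model for the Arm backend."
assert p_wrong.description is None

# The description= keyword sets the help text as intended
p_right = argparse.ArgumentParser(description="Evaluate a model for the Arm backend.")
assert p_right.description == "Evaluate a model for the Arm backend."
```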
```python
def _get_args():
    parser = argparse.ArgumentParser(
        "Evaluate a model quantized and/or delegated for the Arm backend."
        " Evaluations include numerical comparison to the original model"
```

The parser description concatenates two string literals without a separating space, producing "modeland/or" in the help text. Add a space at the join boundary so the help output reads correctly.

Suggested change:

```python
" Evaluations include numerical comparison to the original model "
```
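The underlying pitfall is Python's implicit concatenation of adjacent string literals, which inserts no separator:

```python
# Adjacent string literals fuse with no separator, so a missing space
# at the join boundary runs two words together:
broken = (
    " Evaluations include numerical comparison to the original model"
    "and/or top-1/top-5 accuracy if applicable."
)
assert "modeland/or" in broken

fixed = (
    " Evaluations include numerical comparison to the original model "
    "and/or top-1/top-5 accuracy if applicable."
)
assert "model and/or" in fixed
```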
```python
if args.quant_mode is None and not args.delegate:
    raise ValueError(
        "The model to test must be either quantized or delegated (--quant_mode or --delegate)."
```

This error message tells users to pass --delegate, but the script doesn't provide a --delegate flag (delegation is enabled by default, and --no_delegate disables it). Update the message to reflect the actual CLI behavior (e.g., "enable delegation (default) or specify --quant_mode").

Suggested change:

```python
"The model to test must be either quantized (--quant_mode) or run with delegation enabled (default)."
```
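A default-on flag with a `--no_delegate` opt-out, as the comment describes, can be sketched like this (flag names follow the comment; the actual definitions in evaluate_model.py may differ):

```python
import argparse

parser = argparse.ArgumentParser()
# Delegation is enabled by default; --no_delegate turns it off
parser.add_argument("--no_delegate", dest="delegate", action="store_false")
parser.set_defaults(delegate=True)

assert parser.parse_args([]).delegate is True
assert parser.parse_args(["--no_delegate"]).delegate is False
```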
```python
# Get the model and its example inputs
original_model, example_inputs = get_model_and_inputs_from_name(
    args.model_name, None
```

--model_name help claims .pt/.pth serialized models are supported, but get_model_and_inputs_from_name(args.model_name, None) hard-codes model_input=None. This breaks loading serialized models that require --model_input (see error in aot_arm_compiler.get_model_and_inputs_from_name). Add a --model_input argument and pass it through here.

Suggested change:

```python
args.model_name, getattr(args, "model_input", None)
```
```python
# Export the model
exported_program = torch.export.export(eval_model, eval_inputs)

model_name = os.path.basename(os.path.splitext(args.model_name)[0])
```

model_name = os.path.basename(os.path.splitext(args.model_name)[0]) treats dots as file extensions. If --model_name is a Python module path (supported by get_model_and_inputs_from_name), this will truncate the name incorrectly (e.g., "foo.bar" -> "foo"). Consider deriving the display name based on the resolved loader type (file path vs internal key vs module path) or using a safer split (e.g., last module segment after '.').

Suggested change:

```python
model_name_arg = args.model_name
if (
    os.path.sep in model_name_arg
    or (os.path.altsep is not None and os.path.altsep in model_name_arg)
    or model_name_arg.endswith(".py")
):
    model_name = os.path.basename(os.path.splitext(model_name_arg)[0])
elif "." in model_name_arg:
    model_name = model_name_arg.rsplit(".", 1)[-1]
else:
    model_name = model_name_arg
```
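The truncation the comment describes comes from `os.path.splitext` treating the last dot as an extension separator:

```python
import os

# A dotted module path loses its final segment under splitext + basename
assert os.path.splitext("foo.bar") == ("foo", ".bar")
assert os.path.basename(os.path.splitext("foo.bar")[0]) == "foo"

# Splitting on the last dot instead keeps the intended module segment
assert "foo.bar".rsplit(".", 1)[-1] == "bar"

# Real file paths still resolve to the expected display name
assert os.path.basename(os.path.splitext("models/mobilenet.pt")[0]) == "mobilenet"
```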
```python
if args.quant_mode is not None and args.dtype is not None:
    raise ValueError("Cannot specify --dtype when --quant_mode is enabled.")

evaluators: list[Evaluator] = [
```

Type annotation is incorrect here: args.evaluators is parsed as a list of strings, but this variable is annotated as list[Evaluator]. Fix the annotation to list[str] (or similar) to avoid misleading type information and static analysis errors.

Suggested change:

```python
evaluators: list[str] = [
```
Arm backend: Add evaluate_model.py
This patch reimplements the evaluation feature that used to be in
aot_arm_compiler.py while introducing a few improvements. The program is
evaluate_model.py and it imports functions from aot_arm_compiler.py to
compile a model in a similar manner, but runs its own code that is
focused on evaluating a model using the evaluators classes in
backends/arm/util/arm_model_evaluator.py.
The following is supported in evaluate_model.py:
- TOSA reference models (INT, FP).
- Evaluating a model that is quantized and/or lowered. I.e., it is possible to evaluate a model that is quantized but not lowered, lowered but not quantized, or both at the same time.
- The program can cast the model with the --dtype flag to evaluate a model in e.g., bf16 or fp16 format.
Also add tests that exercise evaluate_model.py with different command
line arguments.
cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell