Arm backend: Add evaluate_model.py #18199

Open
martinlsm wants to merge 2 commits into pytorch:main from martinlsm:marlin-evaluate-model

Conversation

@martinlsm
Collaborator

@martinlsm martinlsm commented Mar 16, 2026

Arm backend: Add evaluate_model.py

This patch reimplements the evaluation feature that used to be in
aot_arm_compiler.py while introducing a few improvements. The new program,
evaluate_model.py, imports functions from aot_arm_compiler.py to compile
a model in a similar manner, but runs its own code focused on evaluating
the model using the evaluator classes in
backends/arm/util/arm_model_evaluator.py.

The following is supported in evaluate_model.py:

  • TOSA reference models (INT, FP).
  • Evaluating a model that is quantized and/or lowered.
    I.e., it is possible to evaluate a model that is quantized but not
    lowered, lowered but not quantized, or both at the same time.
  • The program can cast the model with the --dtype flag so it is
    evaluated in, e.g., bf16 or fp16 format.

Also add tests that exercise evaluate_model.py with different command
line arguments.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell

@martinlsm martinlsm requested a review from digantdesai as a code owner March 16, 2026 15:18
Copilot AI review requested due to automatic review settings March 16, 2026 15:18
@pytorch-bot

pytorch-bot bot commented Mar 16, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18199

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 4 New Failures, 3 Unrelated Failures

As of commit 22922df with merge base 425c519:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 16, 2026
@martinlsm martinlsm changed the title Marlin evaluate model Arm backend: Add evaluate_model.py Mar 16, 2026
@martinlsm
Collaborator Author

@pytorchbot label ciflow/trunk

@martinlsm
Collaborator Author

@pytorchbot label "partner: arm"

@pytorch-bot pytorch-bot bot added the partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm label Mar 16, 2026
@martinlsm
Collaborator Author

@pytorchbot label "release notes: arm"

@pytorch-bot pytorch-bot bot added the release notes: arm Changes to the ARM backend delegate label Mar 16, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR reintroduces Arm backend model evaluation as a dedicated CLI (evaluate_model.py), replacing the previously embedded evaluation flow from aot_arm_compiler.py, and adds tests to exercise common invocation modes.

Changes:

  • Add backends/arm/scripts/evaluate_model.py to compile + (optionally) quantize and/or delegate a model, then evaluate it via Arm evaluator utilities.
  • Add pytest coverage for running evaluate_model.py against TOSA INT/FP targets and validating the emitted metrics JSON.
  • Update examples/arm/aot_arm_compiler.py messaging to point users to the new evaluation script.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

File Description
examples/arm/aot_arm_compiler.py Updates the deprecation/error message to redirect evaluation usage to evaluate_model.py.
backends/arm/scripts/evaluate_model.py Introduces the new evaluation CLI: argument parsing, compile/quantize/delegate pipeline, evaluator execution, and JSON metrics output.
backends/arm/test/misc/test_evaluate_model.py Adds integration-style tests invoking the new script with representative CLI flags and checking output structure.


Collaborator

@zingo zingo left a comment


OK to merge. This adds a new file, but since the file is its own test script, the buck2 files should not need updates.

@martinlsm martinlsm force-pushed the marlin-evaluate-model branch from 87a2dfd to d7219c7 Compare March 17, 2026 09:55
Martin Lindström added 2 commits April 13, 2026 16:24
This patch reimplements the evaluation feature that used to be in
aot_arm_compiler.py while introducing a few improvements. The program is
evaluate_model.py and it imports functions from aot_arm_compiler.py to
compile a model in a similar manner, but runs its own code that is
focused on evaluating the model using the evaluator classes in
backends/arm/util/arm_model_evaluator.py.

The following is supported in evaluate_model.py:
- TOSA reference models (INT, FP).
- Evaluating a model that is quantized and/or lowered.
  I.e., it is possible to evaluate a model that is quantized but not
  lowered, lowered but not quantized, or both at the same time.
- The program can cast the model with the --dtype flag to evaluate a
  model in e.g., bf16 or fp16 format.

Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Change-Id: I85f731633364da1eb71abe602a0335f531ec7e46
Add two tests that exercise evaluate_model.py with different command
line arguments.

Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Change-Id: I47304ea270518703dc4c826c4c6672c7aca95228
Copilot AI review requested due to automatic review settings April 13, 2026 14:25
@martinlsm martinlsm force-pushed the marlin-evaluate-model branch from d7219c7 to 22922df Compare April 13, 2026 14:25
Contributor

Copilot AI left a comment


Pull request overview

Adds a dedicated Arm backend evaluation entrypoint (evaluate_model.py) to replace the deprecated evaluation flow previously embedded in aot_arm_compiler.py, and wires in tests to exercise the new CLI.

Changes:

  • Introduce backends/arm/scripts/evaluate_model.py to compile (optionally quantize and/or delegate) and evaluate models via arm_model_evaluator evaluators.
  • Add pytest coverage for TOSA INT/FP evaluation flows and register the new test in Arm test targets.
  • Update aot_arm_compiler.py deprecation messaging to direct users to evaluate_model.py.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
backends/arm/scripts/evaluate_model.py New evaluation CLI script that exports/quantizes/delegates and then runs evaluator(s), writing metrics JSON.
backends/arm/test/misc/test_evaluate_model.py New tests covering basic TOSA INT/FP evaluation and metrics output.
backends/arm/test/targets.bzl Registers the new test in the Arm test suite.
backends/arm/scripts/aot_arm_compiler.py Improves deprecation log message to reference the new script.


Comment on lines +72 to +74
"Evaluate a model quantized and/or delegated for the Arm backend."
" Evaluations include numerical comparison to the original model"
"and/or top-1/top-5 accuracy if applicable."

Copilot AI Apr 13, 2026


argparse.ArgumentParser is being constructed with the descriptive text as the first positional argument, which sets prog (the program name) rather than description. This results in confusing --help/usage output. Pass this text via the description= keyword (and optionally set prog= separately).

Suggested change
"Evaluate a model quantized and/or delegated for the Arm backend."
" Evaluations include numerical comparison to the original model"
"and/or top-1/top-5 accuracy if applicable."
description=(
"Evaluate a model quantized and/or delegated for the Arm backend."
" Evaluations include numerical comparison to the original model"
"and/or top-1/top-5 accuracy if applicable."
)

Copilot uses AI. Check for mistakes.
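The pitfall flagged here is easy to reproduce in isolation. This minimal sketch (standalone, not the script's actual code; the description text is shortened) shows that the first positional argument of ArgumentParser sets prog, not description:

```python
import argparse

# The first positional argument of ArgumentParser is `prog`, the program
# name shown in usage lines -- not `description`.
bad = argparse.ArgumentParser("Evaluate a model for the Arm backend.")
print(bad.prog)         # the descriptive text leaks into the usage line
print(bad.description)  # None -- no description was actually set

# Passing the text via the keyword puts it where --help expects it.
good = argparse.ArgumentParser(
    prog="evaluate_model.py",
    description="Evaluate a model for the Arm backend.",
)
print(good.description)
```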
def _get_args():
    parser = argparse.ArgumentParser(
        "Evaluate a model quantized and/or delegated for the Arm backend."
        " Evaluations include numerical comparison to the original model"

Copilot AI Apr 13, 2026


The parser description concatenates two string literals without a separating space, producing "modeland/or" in the help text. Add a space at the join boundary so the help output reads correctly.

Suggested change
" Evaluations include numerical comparison to the original model"
" Evaluations include numerical comparison to the original model "


if args.quant_mode is None and not args.delegate:
    raise ValueError(
        "The model to test must be either quantized or delegated (--quant_mode or --delegate)."

Copilot AI Apr 13, 2026


This error message tells users to pass --delegate, but the script doesn't provide a --delegate flag (delegation is enabled by default, and --no_delegate disables it). Update the message to reflect the actual CLI behavior (e.g., "enable delegation (default) or specify --quant_mode").

Suggested change
"The model to test must be either quantized or delegated (--quant_mode or --delegate)."
"The model to test must be either quantized (--quant_mode) or run with delegation enabled (default)."


# Get the model and its example inputs
original_model, example_inputs = get_model_and_inputs_from_name(
    args.model_name, None

Copilot AI Apr 13, 2026


--model_name help claims .pt/.pth serialized models are supported, but get_model_and_inputs_from_name(args.model_name, None) hard-codes model_input=None. This breaks loading serialized models that require --model_input (see error in aot_arm_compiler.get_model_and_inputs_from_name). Add a --model_input argument and pass it through here.

Suggested change
-args.model_name, None
+args.model_name, getattr(args, "model_input", None)

# Export the model
exported_program = torch.export.export(eval_model, eval_inputs)

model_name = os.path.basename(os.path.splitext(args.model_name)[0])

Copilot AI Apr 13, 2026


model_name = os.path.basename(os.path.splitext(args.model_name)[0]) treats dots as file extensions. If --model_name is a Python module path (supported by get_model_and_inputs_from_name), this will truncate the name incorrectly (e.g., "foo.bar" -> "foo"). Consider deriving the display name based on the resolved loader type (file path vs internal key vs module path) or using a safer split (e.g., last module segment after '.').

Suggested change
-model_name = os.path.basename(os.path.splitext(args.model_name)[0])
+model_name_arg = args.model_name
+if (
+    os.path.sep in model_name_arg
+    or (os.path.altsep is not None and os.path.altsep in model_name_arg)
+    or model_name_arg.endswith(".py")
+):
+    model_name = os.path.basename(os.path.splitext(model_name_arg)[0])
+elif "." in model_name_arg:
+    model_name = model_name_arg.rsplit(".", 1)[-1]
+else:
+    model_name = model_name_arg
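The truncation behavior described above is easy to verify in isolation (the names "foo.bar" and "models/mv2.py" are illustrative, not from the PR):

```python
import os

# os.path.splitext treats the last dot as an extension separator, so a
# dotted module path loses its final segment instead of a file suffix.
module_name = os.path.basename(os.path.splitext("foo.bar")[0])
print(module_name)  # "foo" -- ".bar" was dropped as a bogus extension

# For a real file path the same expression behaves as intended.
file_name = os.path.basename(os.path.splitext("models/mv2.py")[0])
print(file_name)  # "mv2"

# A dotted module path can instead keep its last segment explicitly.
last_segment = "foo.bar".rsplit(".", 1)[-1]
print(last_segment)  # "bar"
```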

if args.quant_mode is not None and args.dtype is not None:
    raise ValueError("Cannot specify --dtype when --quant_mode is enabled.")

evaluators: list[Evaluator] = [

Copilot AI Apr 13, 2026


Type annotation is incorrect here: args.evaluators is parsed as a list of strings, but this variable is annotated as list[Evaluator]. Fix the annotation to list[str] (or similar) to avoid misleading type information and static analysis errors.

Suggested change
-evaluators: list[Evaluator] = [
+evaluators: list[str] = [

