Arm backend: Add evaluate_model.py #18199

Open
martinlsm wants to merge 2 commits into pytorch:main from martinlsm:marlin-evaluate-model

Conversation

@martinlsm
Collaborator

@martinlsm martinlsm commented Mar 16, 2026

Arm backend: Add evaluate_model.py

This patch reimplements the evaluation feature that used to be in
aot_arm_compiler.py while introducing a few improvements. The new program,
evaluate_model.py, imports functions from aot_arm_compiler.py to compile
a model in a similar manner, but runs its own code focused on evaluating
the model using the evaluator classes in
backends/arm/util/arm_model_evaluator.py.

The following is supported in evaluate_model.py:

  • TOSA reference models (INT, FP).
  • Evaluating a model that is quantized and/or lowered.
    I.e., it is possible to evaluate a model that is quantized but not
    lowered, lowered but not quantized, or both at the same time.
  • The program can cast the model with the --dtype flag so it is
    evaluated in, e.g., bf16 or fp16 format.

Also add tests that exercise evaluate_model.py with different command
line arguments.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell

@martinlsm martinlsm requested a review from digantdesai as a code owner March 16, 2026 15:18
Copilot AI review requested due to automatic review settings March 16, 2026 15:18
@pytorch-bot

pytorch-bot bot commented Mar 16, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18199

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 4 New Failures, 3 Unrelated Failures

As of commit 22922df with merge base 425c519:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 16, 2026
@martinlsm martinlsm changed the title Marlin evaluate model Arm backend: Add evaluate_model.py Mar 16, 2026
@martinlsm
Collaborator Author

@pytorchbot label ciflow/trunk

@martinlsm
Collaborator Author

@pytorchbot label "partner: arm"

@pytorch-bot pytorch-bot bot added the partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm label Mar 16, 2026
@martinlsm
Collaborator Author

@pytorchbot label "release notes: arm"

@pytorch-bot pytorch-bot bot added the release notes: arm Changes to the ARM backend delegate label Mar 16, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR reintroduces Arm backend model evaluation as a dedicated CLI (evaluate_model.py), replacing the previously embedded evaluation flow from aot_arm_compiler.py, and adds tests to exercise common invocation modes.

Changes:

  • Add backends/arm/scripts/evaluate_model.py to compile + (optionally) quantize and/or delegate a model, then evaluate it via Arm evaluator utilities.
  • Add pytest coverage for running evaluate_model.py against TOSA INT/FP targets and validating the emitted metrics JSON.
  • Update examples/arm/aot_arm_compiler.py messaging to point users to the new evaluation script.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

File Description
examples/arm/aot_arm_compiler.py Updates the deprecation/error message to redirect evaluation usage to evaluate_model.py.
backends/arm/scripts/evaluate_model.py Introduces the new evaluation CLI: argument parsing, compile/quantize/delegate pipeline, evaluator execution, and JSON metrics output.
backends/arm/test/misc/test_evaluate_model.py Adds integration-style tests invoking the new script with representative CLI flags and checking output structure.


Collaborator

@zingo zingo left a comment


OK to merge. This adds a new file, but since the file is its own test script, the buck2 files should not need updates.

@martinlsm martinlsm force-pushed the marlin-evaluate-model branch from 87a2dfd to d7219c7 Compare March 17, 2026 09:55
Martin Lindström added 2 commits April 13, 2026 16:24
This patch reimplements the evaluation feature that used to be in
aot_arm_compiler.py while introducing a few improvements. The program is
evaluate_model.py and it imports functions from aot_arm_compiler.py to
compile a model in a similar manner, but runs its own code that is
focused on evaluating the model using the evaluator classes in
backends/arm/util/arm_model_evaluator.py.

The following is supported in evaluate_model.py:
- TOSA reference models (INT, FP).
- Evaluating a model that is quantized and/or lowered.
  I.e., it is possible to evaluate a model that is quantized but not
  lowered, lowered but not quantized, or both at the same time.
- The program can cast the model with the --dtype flag to evaluate a
  model in e.g., bf16 or fp16 format.

Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Change-Id: I85f731633364da1eb71abe602a0335f531ec7e46
Add two tests that exercise evaluate_model.py with different command
line arguments.

Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Change-Id: I47304ea270518703dc4c826c4c6672c7aca95228
Copilot AI review requested due to automatic review settings April 13, 2026 14:25
@martinlsm martinlsm force-pushed the marlin-evaluate-model branch from d7219c7 to 22922df Compare April 13, 2026 14:25
Contributor

Copilot AI left a comment


Pull request overview

Adds a dedicated Arm backend evaluation entrypoint (evaluate_model.py) to replace the deprecated evaluation flow previously embedded in aot_arm_compiler.py, and wires in tests to exercise the new CLI.

Changes:

  • Introduce backends/arm/scripts/evaluate_model.py to compile (optionally quantize and/or delegate) and evaluate models via arm_model_evaluator evaluators.
  • Add pytest coverage for TOSA INT/FP evaluation flows and register the new test in Arm test targets.
  • Update aot_arm_compiler.py deprecation messaging to direct users to evaluate_model.py.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
backends/arm/scripts/evaluate_model.py New evaluation CLI script that exports/quantizes/delegates and then runs evaluator(s), writing metrics JSON.
backends/arm/test/misc/test_evaluate_model.py New tests covering basic TOSA INT/FP evaluation and metrics output.
backends/arm/test/targets.bzl Registers the new test in the Arm test suite.
backends/arm/scripts/aot_arm_compiler.py Improves deprecation log message to reference the new script.


Comment on lines +72 to +74
"Evaluate a model quantized and/or delegated for the Arm backend."
" Evaluations include numerical comparison to the original model"
"and/or top-1/top-5 accuracy if applicable."

Copilot AI Apr 13, 2026


argparse.ArgumentParser is being constructed with the descriptive text as the first positional argument, which sets prog (the program name) rather than description. This results in confusing --help/usage output. Pass this text via the description= keyword (and optionally set prog= separately).

Suggested change
"Evaluate a model quantized and/or delegated for the Arm backend."
" Evaluations include numerical comparison to the original model"
"and/or top-1/top-5 accuracy if applicable."
description=(
"Evaluate a model quantized and/or delegated for the Arm backend."
" Evaluations include numerical comparison to the original model"
"and/or top-1/top-5 accuracy if applicable."
)

Copilot uses AI. Check for mistakes.
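The pitfall flagged here is easy to reproduce in isolation. This minimal sketch (standalone, not the script's actual code; the description text is shortened) shows that the first positional argument of ArgumentParser sets prog, not description:

```python
import argparse

# The first positional argument of ArgumentParser is `prog`, the program
# name shown in usage lines -- not `description`.
bad = argparse.ArgumentParser("Evaluate a model for the Arm backend.")
print(bad.prog)         # the descriptive text leaks into the usage line
print(bad.description)  # None -- no description was actually set

# Passing the text via the keyword puts it where --help expects it.
good = argparse.ArgumentParser(
    prog="evaluate_model.py",
    description="Evaluate a model for the Arm backend.",
)
print(good.description)
```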
def _get_args():
    parser = argparse.ArgumentParser(
        "Evaluate a model quantized and/or delegated for the Arm backend."
        " Evaluations include numerical comparison to the original model"

Copilot AI Apr 13, 2026


The parser description concatenates two string literals without a separating space, producing "modeland/or" in the help text. Add a space at the join boundary so the help output reads correctly.

Suggested change
" Evaluations include numerical comparison to the original model"
" Evaluations include numerical comparison to the original model "


if args.quant_mode is None and not args.delegate:
    raise ValueError(
        "The model to test must be either quantized or delegated (--quant_mode or --delegate)."

Copilot AI Apr 13, 2026


This error message tells users to pass --delegate, but the script doesn't provide a --delegate flag (delegation is enabled by default, and --no_delegate disables it). Update the message to reflect the actual CLI behavior (e.g., "enable delegation (default) or specify --quant_mode").

Suggested change
"The model to test must be either quantized or delegated (--quant_mode or --delegate)."
"The model to test must be either quantized (--quant_mode) or run with delegation enabled (default)."


# Get the model and its example inputs
original_model, example_inputs = get_model_and_inputs_from_name(
    args.model_name, None

Copilot AI Apr 13, 2026


--model_name help claims .pt/.pth serialized models are supported, but get_model_and_inputs_from_name(args.model_name, None) hard-codes model_input=None. This breaks loading serialized models that require --model_input (see error in aot_arm_compiler.get_model_and_inputs_from_name). Add a --model_input argument and pass it through here.

Suggested change
-args.model_name, None
+args.model_name, getattr(args, "model_input", None)

# Export the model
exported_program = torch.export.export(eval_model, eval_inputs)

model_name = os.path.basename(os.path.splitext(args.model_name)[0])

Copilot AI Apr 13, 2026


model_name = os.path.basename(os.path.splitext(args.model_name)[0]) treats dots as file extensions. If --model_name is a Python module path (supported by get_model_and_inputs_from_name), this will truncate the name incorrectly (e.g., "foo.bar" -> "foo"). Consider deriving the display name based on the resolved loader type (file path vs internal key vs module path) or using a safer split (e.g., last module segment after '.').

Suggested change
-model_name = os.path.basename(os.path.splitext(args.model_name)[0])
+model_name_arg = args.model_name
+if (
+    os.path.sep in model_name_arg
+    or (os.path.altsep is not None and os.path.altsep in model_name_arg)
+    or model_name_arg.endswith(".py")
+):
+    model_name = os.path.basename(os.path.splitext(model_name_arg)[0])
+elif "." in model_name_arg:
+    model_name = model_name_arg.rsplit(".", 1)[-1]
+else:
+    model_name = model_name_arg
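The truncation behavior described above is easy to verify in isolation (the names "foo.bar" and "models/mv2.py" are illustrative, not from the PR):

```python
import os

# os.path.splitext treats the last dot as an extension separator, so a
# dotted module path loses its final segment instead of a file suffix.
module_name = os.path.basename(os.path.splitext("foo.bar")[0])
print(module_name)  # "foo" -- ".bar" was dropped as a bogus extension

# For a real file path the same expression behaves as intended.
file_name = os.path.basename(os.path.splitext("models/mv2.py")[0])
print(file_name)  # "mv2"

# A dotted module path can instead keep its last segment explicitly.
last_segment = "foo.bar".rsplit(".", 1)[-1]
print(last_segment)  # "bar"
```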

if args.quant_mode is not None and args.dtype is not None:
    raise ValueError("Cannot specify --dtype when --quant_mode is enabled.")

evaluators: list[Evaluator] = [

Copilot AI Apr 13, 2026


Type annotation is incorrect here: args.evaluators is parsed as a list of strings, but this variable is annotated as list[Evaluator]. Fix the annotation to list[str] (or similar) to avoid misleading type information and static analysis errors.

Suggested change
-evaluators: list[Evaluator] = [
+evaluators: list[str] = [

