Commit b4ba8e4

pytorchbot, rascani, and claude authored

Cortex-M: Add backend documentation to docs site (#18420)

### Summary

Adds the Cortex-M backend overview page to the ExecuTorch documentation website, making it discoverable alongside other embedded backends. The page covers target support, the CMSIS-NN operator table, quantization, and a tutorial walking through export, quantization, edge lowering, and cross-compilation.

Co-authored-by: RJ Ascani <rja@meta.com>
Co-authored-by: Claude <noreply@anthropic.com>

1 parent a81977d commit b4ba8e4

File tree: 4 files changed, +174 −0 lines changed

docs/source/backends-overview.md

Lines changed: 2 additions & 0 deletions

````diff
@@ -28,6 +28,7 @@ Backends are the bridge between your exported model and the hardware it runs on.
 | [Qualcomm](backends-qualcomm) | Android | NPU | Qualcomm SoCs |
 | [MediaTek](backends-mediatek) | Android | NPU | MediaTek SoCs |
 | [Arm Ethos-U](/backends/arm-ethos-u/arm-ethos-u-overview.md) | Embedded | NPU | Arm MCUs |
+| [Arm Cortex-M](/backends/arm-cortex-m/arm-cortex-m-overview.md) | Embedded | CPU | Arm Cortex-M MCUs |
 | [Arm VGF](/backends/arm-vgf/arm-vgf-overview.md) | Android | GPU | Arm platforms |
 | [OpenVINO](build-run-openvino) | Embedded | CPU/GPU/NPU | Intel SoCs |
 | [NXP](backends/nxp/nxp-overview.md) | Embedded | NPU | NXP SoCs |
@@ -59,6 +60,7 @@ backends/vulkan/vulkan-overview
 backends-qualcomm
 backends-mediatek
 backends/arm-ethos-u/arm-ethos-u-overview
+backends/arm-cortex-m/arm-cortex-m-overview
 backends/arm-vgf/arm-vgf-overview
 build-run-openvino
 backends/nxp/nxp-overview
````
docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md

Lines changed: 166 additions & 0 deletions (new file)

# Arm Cortex-M Backend

:::{note}
This backend is a work-in-progress proof of concept. It is not intended for production use, and APIs may change without notice.
:::

The Arm&reg; Cortex&reg;-M backend accelerates quantized model execution on Arm Cortex-M CPUs using [CMSIS-NN](https://arm-software.github.io/CMSIS-NN/latest/) optimized kernels. Unlike delegate-based backends, it operates as an operator library: quantized subgraphs are replaced with CMSIS-NN-accelerated kernels during the pass-lowering stage, while unsupported operators fall back to portable fp32 kernels.
## Target Support

The backend targets Arm Cortex-M CPUs via CMSIS-NN, which provides optimized kernel implementations for three instruction set variants:

| Variant      | Description                 | Example CPUs       | Supported |
|--------------|-----------------------------|--------------------|-----------|
| MVE (Helium) | M-Profile Vector Extension  | Cortex-M55, M85    | Yes       |
| DSP          | DSP extension instructions  | Cortex-M4, M7, M33 | Untested  |
| Pure C       | Reference C implementation  | Any Cortex-M       | Untested  |

DSP and pure C variants use the same CMSIS-NN API and may work, but have not been tested.
## CMSIS-NN Supported Operators

The backend pass pipeline replaces quantized ATen operators with [CMSIS-NN](https://arm-software.github.io/CMSIS-NN/latest/) kernel calls. See the [CMSIS-NN API documentation](https://arm-software.github.io/CMSIS-NN/latest/modules.html) for the full list of available kernels.

| ATen Op                        | CMSIS-NN Kernel        | 8w8a | 8w16a | 4w8a |
|--------------------------------|------------------------|------|-------|------|
| `aten.convolution`             | `arm_convolve`         | ✅   | ❌    | ❌   |
| `aten.convolution` (depthwise) | `arm_depthwise_conv`   | ✅   | ❌    | ❌   |
| `aten.convolution` (transposed)| `arm_transpose_conv`   | ✅   | ❌    | ❌   |
| `aten.linear`                  | `arm_fully_connected`  | ✅   | ❌    | ❌   |
| `aten.bmm`                     | `arm_batch_matmul`     | ✅   | ❌    | ❌   |
| `aten.add`                     | `arm_elementwise_add`  | ✅   | ❌    | N/A  |
| `aten.mul`                     | `arm_elementwise_mul`  | ✅   | ❌    | N/A  |
| `aten.max_pool2d`              | `arm_max_pool`         | ✅   | ❌    | N/A  |
| `aten.avg_pool2d`              | `arm_avgpool`          | ✅   | ❌    | N/A  |
| `aten._softmax`                | `arm_softmax`          | ✅   | ❌    | N/A  |
| `aten.minimum`                 | `arm_minimum`          | ✅   | ❌    | N/A  |
| `aten.maximum`                 | `arm_maximum`          | ✅   | ❌    | N/A  |
| `aten.permute_copy`            | `arm_transpose`        | ✅   | ❌    | N/A  |
| `aten.constant_pad_nd`         | `arm_pad`              | ✅   | ❌    | N/A  |
| —                              | LSTM                   | ❌   | ❌    | ❌   |
| —                              | SVDF                   | ❌   | ❌    | ❌   |

✅ = supported, ❌ = not yet integrated, N/A = no corresponding CMSIS-NN kernel.
## Quantization Support

The Cortex-M backend currently implements **symmetric INT8 (8w8a)** quantization:

- **Per-channel** quantization for convolution operators.
- **Per-tensor** quantization for all other supported operators.
- **Shared quantization parameters** for data-movement operators (e.g. reshape, permute) to avoid unnecessary requantization.

CMSIS-NN also supports INT4 weights with INT8 activations (4w8a), INT8 weights with INT16 activations (8w16a), and per-channel quantization for fully connected layers, but the corresponding quantizer configurations and operator implementations are not yet integrated.
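The symmetric scheme can be sketched in plain Python: the zero-point is fixed at 0, and the scale is derived from the largest absolute value, either over the whole tensor (per-tensor) or per output channel (per-channel). This is an illustrative sketch of the arithmetic only, not the backend's actual implementation.

```python
def quantize_symmetric_int8(values):
    """Symmetric per-tensor INT8: zero-point is 0, scale covers max |x|."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid scale == 0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def quantize_per_channel_int8(channels):
    """Per-channel variant: one independent scale per output channel (row)."""
    return [quantize_symmetric_int8(row) for row in channels]

# A per-channel scale tracks each channel's own range, so a small-magnitude
# channel keeps more precision than it would under one shared per-tensor scale.
weights = [[0.02, -0.01, 0.03], [5.0, -4.0, 2.5]]
per_channel = quantize_per_channel_int8(weights)
```

Data-movement operators like reshape and permute only move values around, which is why the backend lets them share the input's scale and zero-point instead of requantizing.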
## Tutorial

### Prerequisites

Install the ExecuTorch pip package:

```bash
./install_executorch.sh
```

For cross-compilation and running on simulated hardware:

- [Arm GNU Toolchain](https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain) for cross-compilation.
- [Arm&reg; Corstone&trade; SSE-300 FVP](https://developer.arm.com/documentation/100966/1128/Arm--Corstone-SSE-300-FVP) or [SSE-320 FVP](https://developer.arm.com/documentation/109760/0000/SSE-320-FVP) for simulation.

:::{tip}
All cross-compilation tools can be downloaded and added to the `PATH` with:

```bash
examples/arm/setup.sh --i-agree-to-the-contained-eula
source examples/arm/arm-scratch/setup_path.sh
```
:::
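Before cross-compiling, it can help to confirm the required tools are discoverable on the `PATH`. This helper is purely illustrative and not part of ExecuTorch:

```python
import shutil

def check_cross_compile_tools():
    """Report which cross-compilation tools are discoverable on PATH."""
    required = ["arm-none-eabi-gcc", "cmake"]
    return {tool: shutil.which(tool) is not None for tool in required}

status = check_cross_compile_tools()
missing = [tool for tool, found in status.items() if not found]
if missing:
    print("Missing tools:", ", ".join(missing))
```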
### 1. Export and quantize

Export the model, then quantize it using `CortexMQuantizer` with the PT2E quantization flow:

```python
import torch
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
from executorch.backends.cortex_m.quantizer.quantizer import CortexMQuantizer
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()

example_input = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)
exported_program = torch.export.export(model, (example_input,))
graph_module = exported_program.module()

quantizer = CortexMQuantizer()
prepared = prepare_pt2e(graph_module, quantizer)

# Calibrate with representative data; example_input stands in for a real
# calibration set here.
calibration_data = [example_input]
for calibration_input in calibration_data:
    prepared(calibration_input)

quantized = convert_pt2e(prepared)
quantized_exported_program = torch.export.export(quantized, (example_input,))
```
### 2. Lower to edge and apply Cortex-M passes

Lower to the edge dialect with a custom `EdgeCompileConfig`, then run the `CortexMPassManager` to replace quantized subgraphs with CMSIS-NN operator implementations:

```python
from executorch.exir import EdgeCompileConfig, ExecutorchBackendConfig, to_edge
from executorch.backends.cortex_m.passes.cortex_m_pass_manager import CortexMPassManager

config = EdgeCompileConfig(
    preserve_ops=[
        torch.ops.aten.linear.default,
        torch.ops.aten.hardsigmoid.default,
        torch.ops.aten.hardsigmoid_.default,
        torch.ops.aten.hardswish.default,
        torch.ops.aten.hardswish_.default,
    ],
    _check_ir_validity=False,
    _core_aten_ops_exception_list=[torch.ops.aten.max_pool2d.default],
)

edge_program_manager = to_edge(quantized_exported_program, compile_config=config)

pass_manager = CortexMPassManager(edge_program_manager.exported_program())
edge_program_manager._edge_programs["forward"] = pass_manager.transform()
```
### 3. Serialize to .pte

```python
executorch_program = edge_program_manager.to_executorch(
    config=ExecutorchBackendConfig(extract_delegate_segments=False)
)

with open("model.pte", "wb") as f:
    f.write(executorch_program.buffer)
```
### 4. Cross-compile and run

Cross-compile the ExecuTorch runtime, Cortex-M kernels, and the example runner application. The first CMake invocation builds the ExecuTorch libraries for Arm bare-metal targets. The second builds the [arm_executor_runner](https://github.com/pytorch/executorch/blob/main/examples/arm/executor_runner/) and links it against those libraries with the `.pte` model baked in.

```bash
# Build ExecuTorch libraries for Arm baremetal
cmake --preset arm-baremetal \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_DEVTOOLS=ON \
    -Bcmake-out-arm
cmake --build cmake-out-arm --target install -j$(nproc)

# Build the executor runner, linking the .pte into the binary
cmake -DCMAKE_TOOLCHAIN_FILE=$(pwd)/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DET_PTE_FILE_PATH=$(pwd)/model.pte \
    -DTARGET_CPU=cortex-m55 \
    -Bbuild \
    examples/arm/executor_runner
cmake --build build -j$(nproc) -- arm_executor_runner
```

Run on a simulated Cortex-M target:

```bash
backends/arm/scripts/run_fvp.sh --elf=build/arm_executor_runner --target=ethos-u55-128
```

For a complete end-to-end walkthrough including dataset setup, calibration, and result validation, see the [Cortex-M MobileNetV2 notebook](https://github.com/pytorch/executorch/blob/main/examples/arm/cortex_m_mv2_example.ipynb).
docs/source/embedded-arm-cortex-m.md

Lines changed: 1 addition & 0 deletions (new file)

```{include} backends/arm-cortex-m/arm-cortex-m-overview.md
```

docs/source/embedded-backends.md

Lines changed: 5 additions & 0 deletions

````diff
@@ -7,6 +7,10 @@ Available hardware acceleration backends for embedded systems.
 - {doc}`embedded-cadence` — Cadence Xtensa DSP processors
 
+## CPU Acceleration
+
+- {doc}`embedded-arm-cortex-m` — Arm Cortex-M CMSIS-NN acceleration
+
 ## NPU Acceleration
 
 - {doc}`embedded-arm-ethos-u` — ARM Ethos-U NPU acceleration
@@ -15,6 +19,7 @@ Available hardware acceleration backends for embedded systems.
 ```{toctree}
 :hidden:
+embedded-arm-cortex-m
 embedded-cadence
 embedded-arm-ethos-u
 embedded-nxp
````
