# Arm Cortex-M Backend

:::{note}
This backend is a work-in-progress proof of concept. It is not intended for production use, and APIs may change without notice.
:::

The Arm® Cortex®-M backend accelerates quantized model execution on Arm Cortex-M CPUs using [CMSIS-NN](https://arm-software.github.io/CMSIS-NN/latest/) optimized kernels. Unlike delegate-based backends, it operates as an operator library: quantized subgraphs are replaced with CMSIS-NN accelerated kernels during the pass-lowering stage, while unsupported operators fall back to portable fp32 kernels.

## Target Support

The backend targets Arm Cortex-M CPUs via CMSIS-NN, which provides optimized kernel implementations for three instruction set variants:

| Variant | Description | Example CPUs | Supported |
|--------------|-----------------------------|--------------------|-----------|
| MVE (Helium) | M-Profile Vector Extension | Cortex-M55, M85 | ✅ |
| DSP | DSP extension instructions | Cortex-M4, M7, M33 | ⬜ |
| Pure C | Reference C implementation | Any Cortex-M | ⬜ |

The DSP and pure C variants use the same CMSIS-NN API and may work, but they have not been tested.

## CMSIS-NN Supported Operators

The backend pass pipeline replaces quantized ATen operators with [CMSIS-NN](https://arm-software.github.io/CMSIS-NN/latest/) kernel calls. The *8w8a*, *8w16a*, and *4w8a* columns denote weight/activation bit widths (for example, 8w16a is INT8 weights with INT16 activations). See the [CMSIS-NN API documentation](https://arm-software.github.io/CMSIS-NN/latest/modules.html) for the full list of available kernels.

| ATen Op | CMSIS-NN Kernel | 8w8a | 8w16a | 4w8a |
|--------------------------------|------------------------|------|-------|------|
| `aten.convolution` | `arm_convolve` | ✅ | ⬜ | ⬜ |
| `aten.convolution` (depthwise) | `arm_depthwise_conv` | ✅ | ⬜ | ⬜ |
| `aten.convolution` (transposed)| `arm_transpose_conv` | ✅ | ⬜ | ⬜ |
| `aten.linear` | `arm_fully_connected` | ✅ | ⬜ | ⬜ |
| `aten.bmm` | `arm_batch_matmul` | ✅ | ⬜ | ⬜ |
| `aten.add` | `arm_elementwise_add` | ✅ | ⬜ | N/A |
| `aten.mul` | `arm_elementwise_mul` | ✅ | ⬜ | N/A |
| `aten.max_pool2d` | `arm_max_pool` | ✅ | ⬜ | N/A |
| `aten.avg_pool2d` | `arm_avgpool` | ✅ | ⬜ | N/A |
| `aten._softmax` | `arm_softmax` | ✅ | ⬜ | N/A |
| `aten.minimum` | `arm_minimum` | ✅ | ⬜ | N/A |
| `aten.maximum` | `arm_maximum` | ✅ | ⬜ | N/A |
| `aten.permute_copy` | `arm_transpose` | ✅ | ⬜ | N/A |
| `aten.constant_pad_nd` | `arm_pad` | ✅ | ⬜ | N/A |
| — | LSTM | ⬜ | ⬜ | ⬜ |
| — | SVDF | ⬜ | ⬜ | ⬜ |

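To see why elementwise kernels such as `arm_elementwise_add` take quantization parameters for both operands and the output, consider what quantized addition has to do: each int8 operand lives in its own scale, so both must be rescaled into the output's scale before the result can be stored as int8. The pure-Python sketch below is illustrative only; the real CMSIS-NN kernels rescale with fixed-point integer multiplier/shift pairs rather than floats.

```python
def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a real value to int8 using an affine scale/zero-point pair."""
    return max(-128, min(127, round(x / scale) + zero_point))

def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Recover the real value represented by a quantized int8 value."""
    return (q - zero_point) * scale

def quantized_add(qa, a_scale, a_zp, qb, b_scale, b_zp, out_scale, out_zp):
    """Add two quantized values, requantizing into the output's parameters.

    Illustrative float-based rescale; CMSIS-NN performs the same rescale in
    pure integer arithmetic so it runs without an FPU.
    """
    acc = dequantize(qa, a_scale, a_zp) + dequantize(qb, b_scale, b_zp)
    return quantize(acc, out_scale, out_zp)
```

With all three tensors sharing scale 0.05 and zero point 0, adding the quantized representations of 1.0 and 2.0 yields the quantized representation of 3.0.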
## Quantization Support

The Cortex-M backend currently implements **symmetric INT8 (8w8a)** quantization:
- **Per-channel** quantization for convolution operators.
- **Per-tensor** quantization for all other supported operators.
- **Shared quantization parameters** for data-movement operators (e.g. reshape, permute) to avoid unnecessary requantization.

CMSIS-NN also supports INT4 weights with INT8 activations (4w8a), INT8 weights with INT16 activations (8w16a), and per-channel quantization for fully connected layers, but the corresponding quantizer configurations and operator implementations are not yet integrated.

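The difference between the per-tensor and per-channel schemes is only in how many scale factors are stored. A symmetric scheme uses zero point 0 and chooses each scale so the largest absolute value maps onto the int8 range; per-channel quantization simply computes one such scale per output channel instead of one for the whole tensor. A minimal illustrative sketch (not backend code):

```python
def symmetric_int8_scale(values) -> float:
    """Per-tensor symmetric scale: map the largest |value| to 127."""
    return max(abs(v) for v in values) / 127.0

def per_channel_scales(weight_rows):
    """Per-channel variant: one scale per output channel (row), as used
    for convolution weights, which preserves precision when channel
    magnitudes differ widely."""
    return [symmetric_int8_scale(row) for row in weight_rows]

def quantize_int8(x: float, scale: float, zero_point: int = 0) -> int:
    """Symmetric quantization keeps zero_point == 0, so int8 0 is real 0."""
    return max(-128, min(127, round(x / scale) + zero_point))
```

With a symmetric scale, the extreme value of each channel lands exactly on ±127, and real 0.0 is always represented exactly, which is what makes the scheme attractive for accumulating convolutions in integer arithmetic.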
## Tutorial

### Prerequisites

Install the ExecuTorch pip package:
```bash
./install_executorch.sh
```

For cross-compilation and running on simulated hardware, you will need:
- [Arm GNU Toolchain](https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain) for cross compilation.
- [Arm® Corstone™ SSE-300 FVP](https://developer.arm.com/documentation/100966/1128/Arm--Corstone-SSE-300-FVP) or [SSE-320 FVP](https://developer.arm.com/documentation/109760/0000/SSE-320-FVP) for simulation.

:::{tip}
All cross-compilation tools can be downloaded and added to your `PATH` with:
```bash
examples/arm/setup.sh --i-agree-to-the-contained-eula
source examples/arm/arm-scratch/setup_path.sh
```
:::

### 1. Export and quantize

Export the model, then quantize it using `CortexMQuantizer` with the PT2E quantization flow:

```python
import torch
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
from executorch.backends.cortex_m.quantizer.quantizer import CortexMQuantizer
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()

example_input = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)
exported_program = torch.export.export(model, (example_input,))
graph_module = exported_program.module()

quantizer = CortexMQuantizer()
prepared = prepare_pt2e(graph_module, quantizer)

# Calibrate with representative data. `calibration_data` is an iterable of
# input tensors you provide; the inserted observers record activation
# ranges during these forward passes.
for calibration_input in calibration_data:
    prepared(calibration_input)

quantized = convert_pt2e(prepared)
quantized_exported_program = torch.export.export(quantized, (example_input,))
```

### 2. Lower to edge and apply Cortex-M passes

Lower to the edge dialect with a custom `EdgeCompileConfig`, then run the `CortexMPassManager` to replace quantized subgraphs with CMSIS-NN operator implementations:

```python
from executorch.exir import EdgeCompileConfig, ExecutorchBackendConfig, to_edge
from executorch.backends.cortex_m.passes.cortex_m_pass_manager import CortexMPassManager

config = EdgeCompileConfig(
    preserve_ops=[
        torch.ops.aten.linear.default,
        torch.ops.aten.hardsigmoid.default,
        torch.ops.aten.hardsigmoid_.default,
        torch.ops.aten.hardswish.default,
        torch.ops.aten.hardswish_.default,
    ],
    _check_ir_validity=False,
    _core_aten_ops_exception_list=[torch.ops.aten.max_pool2d.default],
)

edge_program_manager = to_edge(quantized_exported_program, compile_config=config)

# Run the Cortex-M passes and write the transformed program back into the
# manager (note: this assigns to a private field of EdgeProgramManager).
pass_manager = CortexMPassManager(edge_program_manager.exported_program())
edge_program_manager._edge_programs["forward"] = pass_manager.transform()
```

### 3. Serialize to .pte

```python
executorch_program = edge_program_manager.to_executorch(
    config=ExecutorchBackendConfig(extract_delegate_segments=False)
)

with open("model.pte", "wb") as f:
    f.write(executorch_program.buffer)
```

### 4. Cross-compile and run

Cross-compile the ExecuTorch runtime, Cortex-M kernels, and the example runner application. The first CMake invocation builds the ExecuTorch libraries for Arm baremetal. The second builds the [arm_executor_runner](https://github.com/pytorch/executorch/blob/main/examples/arm/executor_runner/) and links it against those libraries with the `.pte` model baked in.

```bash
# Build ExecuTorch libraries for Arm baremetal
cmake --preset arm-baremetal \
      -DCMAKE_BUILD_TYPE=Release \
      -DEXECUTORCH_BUILD_DEVTOOLS=ON \
      -Bcmake-out-arm
cmake --build cmake-out-arm --target install -j$(nproc)

# Build the executor runner, linking the .pte into the binary
cmake -DCMAKE_TOOLCHAIN_FILE=$(pwd)/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake \
      -DCMAKE_BUILD_TYPE=Release \
      -DET_PTE_FILE_PATH=$(pwd)/model.pte \
      -DTARGET_CPU=cortex-m55 \
      -Bbuild \
      examples/arm/executor_runner
cmake --build build -j$(nproc) -- arm_executor_runner
```

Run on a simulated Cortex-M target:

```bash
backends/arm/scripts/run_fvp.sh --elf=build/arm_executor_runner --target=ethos-u55-128
```

For a complete end-to-end walkthrough including dataset setup, calibration, and result validation, see the [Cortex-M MobileNetV2 notebook](https://github.com/pytorch/executorch/blob/main/examples/arm/cortex_m_mv2_example.ipynb).