GSoC 2026: Project#5 Investigation on Quantized Models on M4 Max #34735
Passavee-Losripat
started this conversation in
Google Summer of Code
Replies: 2 comments 2 replies
- @Passavee-Losripat thank you for reviewing this topic!
- I added a new section to README, please take a look: https://github.com/alvoron/gsoc-2026-openvino/tree/main?tab=readme-ov-file#6-start-addressing-a-technical-gap-optional
-
Hi @alvoron and @v-Golubev!
My name is Passavee Losripat, a third-year Computer Science student at KAIST, and I'm interested in contributing to Project#5 Optimize Quantized Model Inference Performance on ARM Devices with OpenVINO for GSoC 2026. I've been following your initial repository as preparation for my GSoC proposal and wanted to share my investigation findings.
Contributions to OpenVINO
- Adding `vectorize()` and fixing `power()` dtype promotion in the Keras OpenVINO backend

Setup Details:
I have built OpenVINO 2026.1 from source with `ENABLE_DEBUG_CAPS=ON` on an Apple M4 Max. The full analysis script and results can be found in `scripts/analyze_graph.py`.

My Investigation
After inspecting the exec graphs of both models, I can confirm that INT8 ACL kernels are not used: `acl_i8` does not appear in either model, which shows that both models run identically to FP32.

First, I confirmed that the original quantized model genuinely contains INT8 weights (`I8: 156` port precisions), so the problem is inside OpenVINO, not the model itself. I then traced the Conv port precisions through all transformation stages.
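For reproducibility, the kind of graph check I ran can be sketched as a small standalone script. The XML snippet and attribute names below (`primitiveType`, `outputPrecisions`) are illustrative stand-ins for the serialized exec-graph dump format, not a guaranteed exact match to what `ENABLE_DEBUG_CAPS` emits:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy stand-in for a serialized exec-graph dump: two Convolution nodes,
# neither of which selected an INT8 ACL implementation (no "acl_i8").
EXEC_GRAPH_XML = """
<net>
  <layers>
    <layer name="conv1" type="Convolution">
      <data primitiveType="jit_f16" outputPrecisions="FP16"/>
    </layer>
    <layer name="conv2" type="Convolution">
      <data primitiveType="jit_f16" outputPrecisions="FP16"/>
    </layer>
  </layers>
</net>
"""

def impl_type_counts(xml_text: str) -> Counter:
    """Count primitiveType values across all layers of the dump."""
    counts = Counter()
    for layer in ET.fromstring(xml_text).iter("layer"):
        data = layer.find("data")
        if data is not None and "primitiveType" in data.attrib:
            counts[data.attrib["primitiveType"]] += 1
    return counts

def precision_counts(xml_text: str) -> Counter:
    """Count output precisions across all layers of the dump."""
    counts = Counter()
    for layer in ET.fromstring(xml_text).iter("layer"):
        data = layer.find("data")
        if data is not None and "outputPrecisions" in data.attrib:
            counts[data.attrib["outputPrecisions"]] += 1
    return counts

print(impl_type_counts(EXEC_GRAPH_XML))                          # Counter({'jit_f16': 2})
print(any("i8" in t for t in impl_type_counts(EXEC_GRAPH_XML)))  # False -> no INT8 kernels
print(precision_counts(EXEC_GRAPH_XML))                          # Counter({'FP16': 2})
```

Running the same counters over the dump of the quantized model is how a summary like `I8: 156` falls out.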
Comparing op counts between `ir_0` and `ir_1`, I found that the single Subtract node disappears in the same step where the Conv precision changes from FP32 to FP16. This is consistent with the described technical gap: the zero-point Subtract on the activation path is not folded into the ACL `QuantizationInfo`, causing the pre-LPT pass to fall back to FP16 instead.

After checking `originalLayersNames` across all 101 Conv nodes in the exec graph, I found that FakeQuantize never appears in any of them. This means `hasQuantizationPostOp = false` in `ACLConvolutionExecutor::supports()`, which causes it to return false and fall back to FP16. This is likely due to the described pattern gap: `Conv → Multiply → Add → Swish → FakeQuantize` is not recognized because Swish sits between Add and FakeQuantize.

Questions
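The `originalLayersNames` scan can be sketched like this. The dict below is a toy stand-in for the per-node runtime info of exec-graph Convolution nodes; in the real script the data comes from the compiled model's runtime graph, where fused original ops are recorded as a comma-separated list:

```python
# Toy stand-in for exec-graph Conv nodes and their fused original ops.
# Real values would be read from each node's runtime info under a key
# like "originalLayersNames".
conv_rt_info = {
    "conv_a": {"originalLayersNames": "Conv_12,Multiply_13,Add_14"},
    "conv_b": {"originalLayersNames": "Conv_27,Swish_28"},
}

def convs_with_fused_fq(rt_info_by_node: dict) -> list:
    """Return the Conv nodes whose fused-op chain contains a FakeQuantize."""
    hits = []
    for name, info in rt_info_by_node.items():
        fused = info.get("originalLayersNames", "").split(",")
        if any(op.startswith("FakeQuantize") for op in fused):
            hits.append(name)
    return hits

print(convs_with_fused_fq(conv_rt_info))  # [] -> no Conv ever fused a FakeQuantize
```

An empty result across all 101 Conv nodes is what points at `hasQuantizationPostOp` staying false.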
After going through the code in `acl_conv.cpp`, I found that `activationLayerInfo` and the FakeQuantize scales are set in mutually exclusive branches of the constructor, so currently you can fuse either an activation or a FakeQuantize post-op, but not both. For the `Conv -> Swish -> FakeQuantize` pattern, if Swish is handled via `activationLayerInfo`, the FQ scales (`fqInputScale` and `fqOutputScale`) are never set, and `getDstQuantizationInfo` in `validateTensorsInfo` gets empty scales. So I wonder: should we handle both simultaneously, by passing Swish to `activationLayerInfo` while also setting the FQ scales, or does this require ACL's `NEConvolutionLayer` to natively support fused activation + requantization in a single INT8 kernel?
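To make the mutual exclusion concrete, here is a toy Python model of the branch structure as I read it; the names and values are simplified stand-ins for my understanding of the constructor logic, not the actual OpenVINO API:

```python
from dataclasses import dataclass, field

@dataclass
class ConvConfig:
    """Toy model of post-op selection: activation vs. FQ requantization."""
    activation: str = None                                 # stand-in for activationLayerInfo
    fq_scales: list = field(default_factory=list)          # stand-in for fqInputScale/fqOutputScale

def configure_post_ops(post_ops: list) -> ConvConfig:
    """Mutually exclusive branches: first match wins, the other never runs."""
    cfg = ConvConfig()
    for op in post_ops:
        if op == "Swish":
            cfg.activation = "Swish"       # activation branch taken here...
        elif op == "FakeQuantize" and cfg.activation is None:
            cfg.fq_scales = [0.02, 0.5]    # ...so this branch is skipped (hypothetical scales)
    return cfg

cfg = configure_post_ops(["Swish", "FakeQuantize"])
print(cfg.activation)  # Swish
print(cfg.fq_scales)   # [] -> downstream quantization info sees empty scales
```

Under this model, whichever post-op is consumed by the activation branch starves the requantization path of its scales, which matches the empty-scales symptom described above.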