GSoC 2026: Project#5 Investigation on Quantized Models on M4 Max #34735
Passavee-Losripat
started this conversation in
Google Summer of Code
Replies: 2 comments 2 replies
- @Passavee-Losripat thank you for reviewing this topic!
- I added a new section to README, please take a look: https://github.com/alvoron/gsoc-2026-openvino/tree/main?tab=readme-ov-file#6-start-addressing-a-technical-gap-optional
-
Hi @alvoron and @v-Golubev!
My name is Passavee Losripat, a third-year Computer Science student at KAIST, and I'm interested in contributing to Project#5 Optimize Quantized Model Inference Performance on ARM Devices with OpenVINO for GSoC 2026. I've been following your initial repository as preparation for my GSoC proposal and wanted to share my investigation findings.
Contributions to OpenVINO
- Adding `vectorize()` and fixing `power()` dtype promotion in the Keras OpenVINO backend

Setup Details:
I have built OpenVINO 2026.1 from source with `ENABLE_DEBUG_CAPS=ON` on an Apple M4 Max. The full analysis script and results can be found in `scripts/analyze_graph.py`.

My Investigation
After inspecting the exec graphs of both models, I can confirm that INT8 ACL kernels are not used: `acl_i8` does not appear in either model, which shows that both models run identically to FP32.

First, I confirmed that the original quantized model genuinely contains INT8 weights (`I8: 156` port precisions), so the problem is inside OpenVINO, not the model itself. I then traced the Conv port precisions through all transformation stages.
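For reproducibility, the kind of graph check I ran can be sketched as a small standalone script. The XML snippet and attribute names below (`primitiveType`, `outputPrecisions`) are illustrative stand-ins for the serialized exec-graph dump format, not a guaranteed exact match to what `ENABLE_DEBUG_CAPS` emits:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy stand-in for a serialized exec-graph dump: two Convolution nodes,
# neither of which selected an INT8 ACL implementation (no "acl_i8").
EXEC_GRAPH_XML = """
<net>
  <layers>
    <layer name="conv1" type="Convolution">
      <data primitiveType="jit_f16" outputPrecisions="FP16"/>
    </layer>
    <layer name="conv2" type="Convolution">
      <data primitiveType="jit_f16" outputPrecisions="FP16"/>
    </layer>
  </layers>
</net>
"""

def impl_type_counts(xml_text: str) -> Counter:
    """Count primitiveType values across all layers of the dump."""
    counts = Counter()
    for layer in ET.fromstring(xml_text).iter("layer"):
        data = layer.find("data")
        if data is not None and "primitiveType" in data.attrib:
            counts[data.attrib["primitiveType"]] += 1
    return counts

def precision_counts(xml_text: str) -> Counter:
    """Count output precisions across all layers of the dump."""
    counts = Counter()
    for layer in ET.fromstring(xml_text).iter("layer"):
        data = layer.find("data")
        if data is not None and "outputPrecisions" in data.attrib:
            counts[data.attrib["outputPrecisions"]] += 1
    return counts

print(impl_type_counts(EXEC_GRAPH_XML))                          # Counter({'jit_f16': 2})
print(any("i8" in t for t in impl_type_counts(EXEC_GRAPH_XML)))  # False -> no INT8 kernels
print(precision_counts(EXEC_GRAPH_XML))                          # Counter({'FP16': 2})
```

Running the same counters over the dump of the quantized model is how a summary like `I8: 156` falls out.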
Comparing op counts between `ir_0` and `ir_1`, I found that the single Subtract node disappears in the same step where the Conv precision changes from FP32 to FP16. This is consistent with the described technical gap: the zero-point Subtract on the activation path is not folded into the ACL `QuantizationInfo`, causing the pre-LPT pass to fall back to FP16 instead.

After checking `originalLayersNames` across all 101 Conv nodes in the exec graph, I found that FakeQuantize never appears in any of them. This means `hasQuantizationPostOp = false` in `ACLConvolutionExecutor::supports()`, which causes it to return false and fall back to FP16. This is likely due to the described pattern gap: `Conv → Multiply → Add → Swish → FakeQuantize` is not recognized because Swish sits between Add and FakeQuantize.

Questions
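The `originalLayersNames` scan can be sketched like this. The dict below is a toy stand-in for the per-node runtime info of exec-graph Convolution nodes; in the real script the data comes from the compiled model's runtime graph, where fused original ops are recorded as a comma-separated list:

```python
# Toy stand-in for exec-graph Conv nodes and their fused original ops.
# Real values would be read from each node's runtime info under a key
# like "originalLayersNames".
conv_rt_info = {
    "conv_a": {"originalLayersNames": "Conv_12,Multiply_13,Add_14"},
    "conv_b": {"originalLayersNames": "Conv_27,Swish_28"},
}

def convs_with_fused_fq(rt_info_by_node: dict) -> list:
    """Return the Conv nodes whose fused-op chain contains a FakeQuantize."""
    hits = []
    for name, info in rt_info_by_node.items():
        fused = info.get("originalLayersNames", "").split(",")
        if any(op.startswith("FakeQuantize") for op in fused):
            hits.append(name)
    return hits

print(convs_with_fused_fq(conv_rt_info))  # [] -> no Conv ever fused a FakeQuantize
```

An empty result across all 101 Conv nodes is what points at `hasQuantizationPostOp` staying false.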
After going through the code in `acl_conv.cpp`, I found that `activationLayerInfo` and the FakeQuantize scales are set in mutually exclusive branches of the constructor, so currently you can fuse either an activation or a FakeQuantize post-op, but not both. For the `Conv -> Swish -> FakeQuantize` pattern, if Swish is handled via `activationLayerInfo`, the FQ scales (`fqInputScale` and `fqOutputScale`) are never set, and `getDstQuantizationInfo` in `validateTensorsInfo` gets empty scales. So I wonder: should we handle both simultaneously, by passing Swish to `activationLayerInfo` while also setting the FQ scales, or does this require ACL's `NEConvolutionLayer` to natively support fused activation + requantization in a single INT8 kernel?
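To make the mutual exclusion concrete, here is a toy Python model of the branch structure as I read it; the names and values are simplified stand-ins for my understanding of the constructor logic, not the actual OpenVINO API:

```python
from dataclasses import dataclass, field

@dataclass
class ConvConfig:
    """Toy model of post-op selection: activation vs. FQ requantization."""
    activation: str = None                                 # stand-in for activationLayerInfo
    fq_scales: list = field(default_factory=list)          # stand-in for fqInputScale/fqOutputScale

def configure_post_ops(post_ops: list) -> ConvConfig:
    """Mutually exclusive branches: first match wins, the other never runs."""
    cfg = ConvConfig()
    for op in post_ops:
        if op == "Swish":
            cfg.activation = "Swish"       # activation branch taken here...
        elif op == "FakeQuantize" and cfg.activation is None:
            cfg.fq_scales = [0.02, 0.5]    # ...so this branch is skipped (hypothetical scales)
    return cfg

cfg = configure_post_ops(["Swish", "FakeQuantize"])
print(cfg.activation)  # Swish
print(cfg.fq_scales)   # [] -> downstream quantization info sees empty scales
```

Under this model, whichever post-op is consumed by the activation branch starves the requantization path of its scales, which matches the empty-scales symptom described above.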