Based on our comprehensive testing, here are the key performance insights:
- Hybrid (OpenMP+SIMD+Blocked): Up to 66.69x speedup
  - Best overall performance across all matrix sizes
  - Combines parallel processing, vectorization, and cache optimization
  - Peak performance: 52.88 GFLOPS (transposed hybrid on 4096×4096)
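A minimal sketch of how the three techniques can compose, assuming row-major `float` matrices and a caller-zeroed `C` (the function name and loop structure are illustrative, not the benchmark's exact code): OpenMP splits the outer tile loop across threads, 64×64 blocking keeps the working set cache-resident, and `omp simd` hints the compiler to vectorize the unit-stride inner loop.

```c
#include <stddef.h>

/* Hypothetical hybrid kernel: OpenMP across row tiles, 64x64 blocking
 * for cache reuse, and `omp simd` to encourage vectorization of the
 * inner loop. Matrices are row-major; caller must zero-initialize C.
 * Both pragmas degrade gracefully to serial/scalar code if ignored. */
#define TILE 64

static void matmul_hybrid(const float *A, const float *B, float *C,
                          size_t M, size_t N, size_t K) {
    #pragma omp parallel for schedule(static)
    for (size_t ii = 0; ii < M; ii += TILE)
        for (size_t kk = 0; kk < K; kk += TILE)
            for (size_t jj = 0; jj < N; jj += TILE) {
                size_t iMax = ii + TILE < M ? ii + TILE : M;
                size_t kMax = kk + TILE < K ? kk + TILE : K;
                size_t jMax = jj + TILE < N ? jj + TILE : N;
                for (size_t i = ii; i < iMax; ++i)
                    for (size_t k = kk; k < kMax; ++k) {
                        float a = A[i * K + k];
                        /* unit-stride over B and C: vectorizable */
                        #pragma omp simd
                        for (size_t j = jj; j < jMax; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
            }
}
```

Iterating `i, k, j` (rather than the textbook `i, j, k`) is what makes the innermost loop touch `B` and `C` with stride 1, which both the vectorizer and the prefetcher reward.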
- SIMD (AVX2): Up to 8.18x speedup
  - Excellent for small to medium matrices
  - Processes 8 floats simultaneously
  - Consistent performance across different matrix sizes
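The "8 floats simultaneously" comes from AVX2's 256-bit registers (256 / 32 bits = 8 lanes). A hedged sketch of the core idea, as a saxpy-style helper (`saxpy8` is a hypothetical name, not from the benchmark) with a scalar fallback so it compiles even without AVX2 support:

```c
#include <stddef.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Illustrative AVX2 kernel: y[0..n) += a * x[0..n). One 256-bit
 * register holds 8 floats, so each iteration advances 8 columns.
 * Falls back to scalar code when AVX2 is unavailable at compile time. */
static void saxpy8(float a, const float *x, float *y, size_t n) {
    size_t i = 0;
#ifdef __AVX2__
    __m256 va = _mm256_set1_ps(a);            /* broadcast a to 8 lanes */
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);   /* load 8 floats of x */
        __m256 vy = _mm256_loadu_ps(y + i);   /* load 8 floats of y */
        vy = _mm256_add_ps(vy, _mm256_mul_ps(va, vx));
        _mm256_storeu_ps(y + i, vy);          /* store 8 results */
    }
#endif
    for (; i < n; ++i)                        /* remainder / fallback */
        y[i] += a * x[i];
}
```

In a full matmul this helper would be called once per `(i, k)` pair to accumulate `a = A[i][k]` times row `k` of `B` into row `i` of `C`.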
- OpenMP: Up to 4.72x speedup
  - Effectiveness increases with matrix size
  - Multi-core parallelization with 6 threads
  - Better performance with transposed implementations
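Rows of `C` are independent, so the OpenMP variant can simply divide the outer row loop among threads. A sketch under that assumption (the benchmark's 6-thread count is shown via `omp_set_num_threads`, though `OMP_NUM_THREADS=6` in the environment would do the same):

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Illustrative OpenMP kernel: each thread computes a contiguous band
 * of C's rows. Builds as plain serial code when OpenMP is disabled. */
static void matmul_omp(const float *A, const float *B, float *C,
                       size_t M, size_t N, size_t K) {
#ifdef _OPENMP
    omp_set_num_threads(6);   /* matches the benchmark configuration */
#endif
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < K; ++k)
                sum += A[i * K + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}
```

This also explains why effectiveness grows with matrix size: each thread's band must be large enough to amortize the fork/join overhead of the parallel region.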
- Blocked: Up to 2.00x speedup
  - Mixed results when used alone
  - Critical component for cache optimization in hybrid approaches
  - Uses 64×64 tiles for cache efficiency
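The blocking idea in isolation, as a hedged sketch (row-major layout, caller-zeroed `C` assumed): a 64×64 `float` tile is 16 KiB, so one tile each of `A`, `B`, and `C` fits comfortably in L1/L2 cache and every loaded value is reused across the whole tile instead of being evicted between iterations.

```c
#include <stddef.h>

/* Illustrative 64x64 tiled kernel. Each (ii, kk, jj) triple processes
 * one tile combination; accumulation across kk tiles means C must be
 * zero-initialized by the caller. */
#define BLOCK 64

static void matmul_blocked(const float *A, const float *B, float *C,
                           size_t M, size_t N, size_t K) {
    for (size_t ii = 0; ii < M; ii += BLOCK)
        for (size_t kk = 0; kk < K; kk += BLOCK)
            for (size_t jj = 0; jj < N; jj += BLOCK) {
                size_t iMax = ii + BLOCK < M ? ii + BLOCK : M;
                size_t kMax = kk + BLOCK < K ? kk + BLOCK : K;
                size_t jMax = jj + BLOCK < N ? jj + BLOCK : N;
                for (size_t i = ii; i < iMax; ++i)
                    for (size_t k = kk; k < kMax; ++k) {
                        float a = A[i * K + k];
                        for (size_t j = jj; j < jMax; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
            }
}
```

The "mixed results when used alone" are consistent with this structure: the tiling adds loop overhead, which only pays off once the matrices are too large for the cache to hold whole rows anyway.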
- Naive: Baseline reference (1.00x)
  - Disabled automatically for matrices >512 in any dimension
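For completeness, the baseline and its skip guard can be sketched as follows (`run_naive` is a hypothetical helper name; the >512 cutoff is the one stated above, used because the 1.00x kernel only inflates total benchmark time on large inputs):

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical guard: skip the naive baseline when any dimension
 * exceeds 512, as the benchmark harness does. */
static bool run_naive(size_t M, size_t N, size_t K) {
    return M <= 512 && N <= 512 && K <= 512;
}

/* Textbook i-j-k triple loop: the 1.00x reference implementation. */
static void matmul_naive(const float *A, const float *B, float *C,
                         size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < K; ++k)
                sum += A[i * K + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}
```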
| Matrix Size | Standard Best | Transposed Best | Winner | Performance Gap |
|---|---|---|---|---|
| 128×128 | Hybrid (∞x) | Hybrid (∞x) | Tie | Both sub-millisecond |
| 256×256 | Hybrid (33.55 GFLOPS) | Hybrid (16.78 GFLOPS) | Standard | 2.0x advantage |
| 512×256 | Hybrid (16.78 GFLOPS) | Hybrid (16.78 GFLOPS) | Tie | Equal performance |
| 512×512 | Hybrid (20.65 GFLOPS) | Hybrid (33.56 GFLOPS) | Transposed | 1.6x advantage |
| 1024×512 | Hybrid (22.37 GFLOPS) | Hybrid (20.65 GFLOPS) | Standard | 1.1x advantage |
| 1024×1024 | Hybrid (21.05 GFLOPS) | Hybrid (41.30 GFLOPS) | Transposed | 2.0x advantage |
| 4096×1024 | Hybrid (22.41 GFLOPS) | Hybrid (47.55 GFLOPS) | Transposed | 2.1x advantage |
| 4096×4096 | Hybrid (17.66 GFLOPS) | Hybrid (51.73 GFLOPS) | Transposed | 2.9x advantage |
- Transposed implementations win in 5 out of 8 test cases, with dominant performance on large matrices
- Standard implementations excel on the smaller and mid-sized cases (256×256 and 1024×512), though only by modest margins
- The cache-locality benefit of transposition grows with matrix size, peaking at the 2.9x advantage on 4096×4096
- Correctness issues appear in some highly optimized implementations at large matrix sizes (small floating-point errors, below 1e-5)
- Performance scaling is non-linear, with sweet spots around 1024×1024 matrices
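The transposed advantage in the table comes down to memory access patterns. A minimal sketch, assuming `BT` holds `B` transposed (`BT[j][k] = B[k][j]`): the inner product then reads both operands with unit stride, replacing `B`'s strided column walk, and the payoff grows with matrix size because that is when strided walks start missing cache on every access.

```c
#include <stddef.h>

/* Illustrative transposed-operand kernel: BT is B transposed, so both
 * A's row and BT's row are read sequentially in the inner loop. */
static void matmul_transposed(const float *A, const float *BT, float *C,
                              size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < K; ++k)   /* both unit-stride */
                sum += A[i * K + k] * BT[j * K + k];
            C[i * N + j] = sum;
        }
}
```

The one-time O(K·N) cost of producing `BT` is negligible next to the O(M·N·K) multiply, which is why the transposed variants can afford it even when they only break even on small inputs.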
Accuracy Warnings: Some optimized implementations show correctness issues (marked as "NO") at very large matrix sizes, particularly:
- SIMD implementations on matrices >= 1024×1024
- Hybrid transposed implementations on matrices >= 512×512
Reliability: OpenMP and basic blocked implementations maintain correctness across all tested sizes.
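A sketch of what such a correctness check might look like, assuming elementwise comparison against the naive reference with an absolute tolerance (the 1e-5 figure matches the error magnitude reported above): reordered and vectorized float additions legitimately round differently from the naive left-to-right sum, so a small tolerance is needed to separate rounding drift from real bugs.

```c
#include <stddef.h>
#include <stdbool.h>
#include <math.h>

/* Illustrative verification step: returns true when every element of
 * the optimized result is within `tol` of the naive reference. */
static bool verify(const float *ref, const float *out, size_t n,
                   float tol) {
    for (size_t i = 0; i < n; ++i)
        if (fabsf(ref[i] - out[i]) > tol)
            return false;
    return true;
}
```

An absolute tolerance becomes harder to meet as K grows (more terms, more accumulated rounding), which is consistent with the failures clustering at the larger sizes; a relative or K-scaled tolerance would be one way to keep those variants marked correct.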
The benchmark suite includes 8 comprehensive test cases:
- 128×128 - Small square matrices
- 256×256 - Medium square matrices
- 512×256 - Rectangular matrices
- 512×512 - Large square matrices
- 1024×512 - Large rectangular matrices
- 1024×1024 - Very large square matrices
- 4096×1024 - Massive rectangular matrices
- 4096×4096 - Massive square matrices
Each test case evaluates:
- All implementation variants
- Execution time and speedup calculations
- GFLOPS performance metrics
- Correctness verification
- Standard vs transposed comparisons
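The GFLOPS metric follows from the operation count: an M×K by K×N multiply performs 2·M·N·K floating-point operations (one multiply and one add per term of each inner product). A small helper makes the conversion explicit:

```c
#include <stddef.h>

/* GFLOPS = total floating-point ops / elapsed seconds / 1e9.
 * An MxK by KxN multiply does 2*M*N*K ops (one mul + one add each). */
static double gflops(size_t M, size_t N, size_t K, double seconds) {
    return 2.0 * (double)M * (double)N * (double)K / seconds / 1e9;
}
```

For scale: at 4096×4096×4096 the multiply is about 1.37×10^11 operations, so the table's 51.73 GFLOPS entry implies roughly 2.7 seconds of wall time.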
Acknowledgments:
- OpenMP specification contributors
- Intel AVX2 instruction set documentation
- Matrix multiplication optimization research community
- Performance benchmarking best practices