Based on our comprehensive testing, here are the key performance insights:
- Hybrid (OpenMP+SIMD+Blocked): Up to 66.69x speedup
  - Best overall performance across all matrix sizes
  - Combines parallel processing, vectorization, and cache optimization
  - Peak performance: 52.88 GFLOPS (transposed hybrid on 4096×4096)
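A minimal sketch of how the three techniques can compose, assuming row-major `float` matrices and a caller-zeroed `C` (the function name and loop structure are illustrative, not the benchmark's exact code): OpenMP splits the outer tile loop across threads, 64×64 blocking keeps the working set cache-resident, and `omp simd` hints the compiler to vectorize the unit-stride inner loop.

```c
#include <stddef.h>

/* Hypothetical hybrid kernel: OpenMP across row tiles, 64x64 blocking
 * for cache reuse, and `omp simd` to encourage vectorization of the
 * inner loop. Matrices are row-major; caller must zero-initialize C.
 * Both pragmas degrade gracefully to serial/scalar code if ignored. */
#define TILE 64

static void matmul_hybrid(const float *A, const float *B, float *C,
                          size_t M, size_t N, size_t K) {
    #pragma omp parallel for schedule(static)
    for (size_t ii = 0; ii < M; ii += TILE)
        for (size_t kk = 0; kk < K; kk += TILE)
            for (size_t jj = 0; jj < N; jj += TILE) {
                size_t iMax = ii + TILE < M ? ii + TILE : M;
                size_t kMax = kk + TILE < K ? kk + TILE : K;
                size_t jMax = jj + TILE < N ? jj + TILE : N;
                for (size_t i = ii; i < iMax; ++i)
                    for (size_t k = kk; k < kMax; ++k) {
                        float a = A[i * K + k];
                        /* unit-stride over B and C: vectorizable */
                        #pragma omp simd
                        for (size_t j = jj; j < jMax; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
            }
}
```

Iterating `i, k, j` (rather than the textbook `i, j, k`) is what makes the innermost loop touch `B` and `C` with stride 1, which both the vectorizer and the prefetcher reward.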
- SIMD (AVX2): Up to 8.18x speedup
  - Excellent for small to medium matrices
  - Processes 8 floats simultaneously
  - Consistent performance across different matrix sizes
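The "8 floats simultaneously" comes from AVX2's 256-bit registers (256 / 32 bits = 8 lanes). A hedged sketch of the core idea, as a saxpy-style helper (`saxpy8` is a hypothetical name, not from the benchmark) with a scalar fallback so it compiles even without AVX2 support:

```c
#include <stddef.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Illustrative AVX2 kernel: y[0..n) += a * x[0..n). One 256-bit
 * register holds 8 floats, so each iteration advances 8 columns.
 * Falls back to scalar code when AVX2 is unavailable at compile time. */
static void saxpy8(float a, const float *x, float *y, size_t n) {
    size_t i = 0;
#ifdef __AVX2__
    __m256 va = _mm256_set1_ps(a);            /* broadcast a to 8 lanes */
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);   /* load 8 floats of x */
        __m256 vy = _mm256_loadu_ps(y + i);   /* load 8 floats of y */
        vy = _mm256_add_ps(vy, _mm256_mul_ps(va, vx));
        _mm256_storeu_ps(y + i, vy);          /* store 8 results */
    }
#endif
    for (; i < n; ++i)                        /* remainder / fallback */
        y[i] += a * x[i];
}
```

In a full matmul this helper would be called once per `(i, k)` pair to accumulate `a = A[i][k]` times row `k` of `B` into row `i` of `C`.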
- OpenMP: Up to 4.72x speedup
  - Effectiveness increases with matrix size
  - Multi-core parallelization with 6 threads
  - Better performance with transposed implementations
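Rows of `C` are independent, so the OpenMP variant can simply divide the outer row loop among threads. A sketch under that assumption (the benchmark's 6-thread count is shown via `omp_set_num_threads`, though `OMP_NUM_THREADS=6` in the environment would do the same):

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Illustrative OpenMP kernel: each thread computes a contiguous band
 * of C's rows. Builds as plain serial code when OpenMP is disabled. */
static void matmul_omp(const float *A, const float *B, float *C,
                       size_t M, size_t N, size_t K) {
#ifdef _OPENMP
    omp_set_num_threads(6);   /* matches the benchmark configuration */
#endif
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < K; ++k)
                sum += A[i * K + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}
```

This also explains why effectiveness grows with matrix size: each thread's band must be large enough to amortize the fork/join overhead of the parallel region.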
- Blocked: Up to 2.00x speedup
  - Mixed results when used alone
  - Critical component for cache optimization in hybrid approaches
  - Uses 64×64 tiles for cache efficiency
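The blocking idea in isolation, as a hedged sketch (row-major layout, caller-zeroed `C` assumed): a 64×64 `float` tile is 16 KiB, so one tile each of `A`, `B`, and `C` fits comfortably in L1/L2 cache and every loaded value is reused across the whole tile instead of being evicted between iterations.

```c
#include <stddef.h>

/* Illustrative 64x64 tiled kernel. Each (ii, kk, jj) triple processes
 * one tile combination; accumulation across kk tiles means C must be
 * zero-initialized by the caller. */
#define BLOCK 64

static void matmul_blocked(const float *A, const float *B, float *C,
                           size_t M, size_t N, size_t K) {
    for (size_t ii = 0; ii < M; ii += BLOCK)
        for (size_t kk = 0; kk < K; kk += BLOCK)
            for (size_t jj = 0; jj < N; jj += BLOCK) {
                size_t iMax = ii + BLOCK < M ? ii + BLOCK : M;
                size_t kMax = kk + BLOCK < K ? kk + BLOCK : K;
                size_t jMax = jj + BLOCK < N ? jj + BLOCK : N;
                for (size_t i = ii; i < iMax; ++i)
                    for (size_t k = kk; k < kMax; ++k) {
                        float a = A[i * K + k];
                        for (size_t j = jj; j < jMax; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
            }
}
```

The "mixed results when used alone" are consistent with this structure: the tiling adds loop overhead, which only pays off once the matrices are too large for the cache to hold whole rows anyway.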
- Naive: Baseline reference (1.00x)
  - Disabled automatically for matrices >512 in any dimension
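For completeness, the baseline and its skip guard can be sketched as follows (`run_naive` is a hypothetical helper name; the >512 cutoff is the one stated above, used because the 1.00x kernel only inflates total benchmark time on large inputs):

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical guard: skip the naive baseline when any dimension
 * exceeds 512, as the benchmark harness does. */
static bool run_naive(size_t M, size_t N, size_t K) {
    return M <= 512 && N <= 512 && K <= 512;
}

/* Textbook i-j-k triple loop: the 1.00x reference implementation. */
static void matmul_naive(const float *A, const float *B, float *C,
                         size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < K; ++k)
                sum += A[i * K + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}
```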
| Matrix Size | Standard Best | Transposed Best | Winner | Performance Gap |
|---|---|---|---|---|
| 128×128 | Hybrid (∞x) | Hybrid (∞x) | Tie | Both sub-millisecond |
| 256×256 | Hybrid (33.55 GFLOPS) | Hybrid (16.78 GFLOPS) | Standard | 2.0x advantage |
| 512×256 | Hybrid (16.78 GFLOPS) | Hybrid (16.78 GFLOPS) | Tie | Equal performance |
| 512×512 | Hybrid (20.65 GFLOPS) | Hybrid (33.56 GFLOPS) | Transposed | 1.6x advantage |
| 1024×512 | Hybrid (22.37 GFLOPS) | Hybrid (20.65 GFLOPS) | Standard | 1.1x advantage |
| 1024×1024 | Hybrid (21.05 GFLOPS) | Hybrid (41.30 GFLOPS) | Transposed | 2.0x advantage |
| 4096×1024 | Hybrid (22.41 GFLOPS) | Hybrid (47.55 GFLOPS) | Transposed | 2.1x advantage |
| 4096×4096 | Hybrid (17.66 GFLOPS) | Hybrid (51.73 GFLOPS) | Transposed | 2.9x advantage |
- Transposed implementations win in 5 out of 8 test cases, with dominant performance on large matrices
- Standard implementations excel on the smaller and mid-sized cases (256×256 and 1024×512), though only by modest margins
- The cache-locality benefit of transposition grows with matrix size, peaking at the 2.9x advantage on 4096×4096
- Correctness issues appear in some highly optimized implementations at large matrix sizes (small floating-point errors, below 1e-5)
- Performance scaling is non-linear, with sweet spots around 1024×1024 matrices
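The transposed advantage in the table comes down to memory access patterns. A minimal sketch, assuming `BT` holds `B` transposed (`BT[j][k] = B[k][j]`): the inner product then reads both operands with unit stride, replacing `B`'s strided column walk, and the payoff grows with matrix size because that is when strided walks start missing cache on every access.

```c
#include <stddef.h>

/* Illustrative transposed-operand kernel: BT is B transposed, so both
 * A's row and BT's row are read sequentially in the inner loop. */
static void matmul_transposed(const float *A, const float *BT, float *C,
                              size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < K; ++k)   /* both unit-stride */
                sum += A[i * K + k] * BT[j * K + k];
            C[i * N + j] = sum;
        }
}
```

The one-time O(K·N) cost of producing `BT` is negligible next to the O(M·N·K) multiply, which is why the transposed variants can afford it even when they only break even on small inputs.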
Accuracy Warnings: Some optimized implementations show correctness issues (marked as "NO") at very large matrix sizes, particularly:
- SIMD implementations on matrices >= 1024×1024
- Hybrid transposed implementations on matrices >= 512×512
Reliability: OpenMP and basic blocked implementations maintain correctness across all tested sizes.
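A sketch of what such a correctness check might look like, assuming elementwise comparison against the naive reference with an absolute tolerance (the 1e-5 figure matches the error magnitude reported above): reordered and vectorized float additions legitimately round differently from the naive left-to-right sum, so a small tolerance is needed to separate rounding drift from real bugs.

```c
#include <stddef.h>
#include <stdbool.h>
#include <math.h>

/* Illustrative verification step: returns true when every element of
 * the optimized result is within `tol` of the naive reference. */
static bool verify(const float *ref, const float *out, size_t n,
                   float tol) {
    for (size_t i = 0; i < n; ++i)
        if (fabsf(ref[i] - out[i]) > tol)
            return false;
    return true;
}
```

An absolute tolerance becomes harder to meet as K grows (more terms, more accumulated rounding), which is consistent with the failures clustering at the larger sizes; a relative or K-scaled tolerance would be one way to keep those variants marked correct.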
The benchmark suite includes 8 comprehensive test cases:
- 128×128 - Small square matrices
- 256×256 - Medium square matrices
- 512×256 - Rectangular matrices
- 512×512 - Large square matrices
- 1024×512 - Large rectangular matrices
- 1024×1024 - Very large square matrices
- 4096×1024 - Massive rectangular matrices
- 4096×4096 - Massive square matrices
Each test case evaluates:
- All implementation variants
- Execution time and speedup calculations
- GFLOPS performance metrics
- Correctness verification
- Standard vs transposed comparisons
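The GFLOPS metric follows from the operation count: an M×K by K×N multiply performs 2·M·N·K floating-point operations (one multiply and one add per term of each inner product). A small helper makes the conversion explicit:

```c
#include <stddef.h>

/* GFLOPS = total floating-point ops / elapsed seconds / 1e9.
 * An MxK by KxN multiply does 2*M*N*K ops (one mul + one add each). */
static double gflops(size_t M, size_t N, size_t K, double seconds) {
    return 2.0 * (double)M * (double)N * (double)K / seconds / 1e9;
}
```

For scale: at 4096×4096×4096 the multiply is about 1.37×10^11 operations, so the table's 51.73 GFLOPS entry implies roughly 2.7 seconds of wall time.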
Acknowledgments:
- OpenMP specification contributors
- Intel AVX2 instruction set documentation
- Matrix multiplication optimization research community
- Performance benchmarking best practices