Hi all!
I'm in the process of running some measurements on my software with likwid, and I was hoping to ask for some help with interpreting the results likwid is showing me, because the way I understand them, they can't be right.
Here’s what I’ve been up to so far.
My Software
- Parallelised using pthreads (plus an “in-house” task scheduler built on pthreads); uses CUDA for GPU acceleration.
- My code pins the threads it runs on itself (using pthread_setaffinity_np), so in the measurement runs with likwid-perfctr I'll be using the -c flag instead of the -C flag.
- Source, in the unexpected case it’s needed: https://github.com/abouzied-nasar/SWIFT/tree/likwid-markers
- The cluster I’m using has likwid 5.4.1 installed.
What I want to achieve
- I want to measure the memory bandwidth behaviour in specific regions of my code
- I’m interested in CPU behaviour of a code that is GPU accelerated. The GPU bit is not interesting to me at this point, only what’s going on on the CPU side.
- The MEM event group and likwid's Marker API seem ideal for this.
My Hardware
--------------------------------------------------------------------------------
CPU name: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
CPU type: Intel Cascadelake SP processor
CPU stepping: 7
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets: 2
CPU dies: 2
Cores per socket: 16
Threads per core: 2
What I’ve done so far
To begin with, I ran some likwid-bench benchmarks to assess the peak memory bandwidth of my system:
# Triad benchmark: A[i] = B[i] * C[i] + D[i]
# 32 threads
likwid-bench -t triad -w N:4GB:32
MByte/s: 78734.97
# single thread
likwid-bench -t triad -w N:4GB:1
MByte/s: 12036.72
# copy_mem benchmark: Copy with non-temporal store
# 32 threads
likwid-bench -t copy_mem -w N:4GB:32
MByte/s: 83644.83
# single thread
likwid-bench -t copy_mem -w N:4GB:1
MByte/s: 8881.45
# load benchmark
# 32 threads
likwid-bench -t load -w N:4GB:32
MByte/s: 99647.42
# single thread
likwid-bench -t load -w N:4GB:1
MByte/s: 10869.33
# store benchmark
# 32 threads
likwid-bench -t store -w N:4GB:32
MByte/s: 47517.67
# single thread
likwid-bench -t store -w N:4GB:1
MByte/s: 9423.61
I made sure I was the only one using the node while benchmarking. So far, so good, I guess.
Next, I ran a tiny (non-representative) example with my software to verify that everything works as it should.
First, a run on a single thread:
likwid-perfctr -f -c N:1 -g MEM ../swift_cuda --hydro --threads=1 --steps=2 --pin greshoGPU64.yml
gives:
--------------------------------------------------------------------------------
Group 1: MEM
+-----------------------+---------+------------+
| Event | Counter | HWThread 2 |
+-----------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 6864 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 30469 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 25944 |
| CAS_COUNT_RD | MBOX0C0 | 14515348 |
| CAS_COUNT_WR | MBOX0C1 | 7941874 |
| CAS_COUNT_RD | MBOX1C0 | 14314478 |
| CAS_COUNT_WR | MBOX1C1 | 7980721 |
| CAS_COUNT_RD | MBOX2C0 | 14274722 |
| CAS_COUNT_WR | MBOX2C1 | 7958660 |
| CAS_COUNT_RD | MBOX3C0 | 14112250 |
| CAS_COUNT_WR | MBOX3C1 | 7972464 |
| CAS_COUNT_RD | MBOX4C0 | 14103420 |
| CAS_COUNT_WR | MBOX4C1 | 7950516 |
| CAS_COUNT_RD | MBOX5C0 | 14039041 |
| CAS_COUNT_WR | MBOX5C1 | 7934033 |
+-----------------------+---------+------------+
+-----------------------------------+--------------+
| Metric | HWThread 2 |
+-----------------------------------+--------------+
| Runtime (RDTSC) [s] | 12.9292 |
| Runtime unhalted [s] | 1.327911e-05 |
| Clock [MHz] | 2694.7005 |
| CPI | 4.4390 |
| Memory read bandwidth [MBytes/s] | 422.5309 |
| Memory read data volume [GBytes] | 5.4630 |
| Memory write bandwidth [MBytes/s] | 236.3059 |
| Memory write data volume [GBytes] | 3.0552 |
| Memory bandwidth [MBytes/s] | 658.8368 |
| Memory data volume [GBytes] | 8.5182 |
+-----------------------------------+--------------+
which seems fine to me.
Next, using the full node (32 threads):
$ likwid-perfctr -f -c N:0-31 -g MEM ../swift_cuda --hydro --threads=32 --steps=2 --pin greshoGPU64.yml
I get the following result:
--------------------------------------------------------------------------------
Group 1: MEM
+-----------------------+---------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
| Event | Counter | HWThread 0 | HWThread 2 | HWThread 4 | HWThread 6 | HWThread 8 | HWThread 10 | HWThread 12 | HWThread 14 | HWThread 16 | HWThread 18 | HWThread 20 | HWThread 22 | HWThread 24 | HWThread 26 | HWThread 28 | HWThread 30 | HWThread 1 | HWThread 3 | HWThread 5 | HWThread 7 | HWThread 9 | HWThread 11 | HWThread 13 | HWThread 15 | HWThread 17 | HWThread 19 | HWThread 21 | HWThread 23 | HWThread 25 | HWThread 27 | HWThread 29 | HWThread 31 |
+-----------------------+---------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
| INSTR_RETIRED_ANY | FIXC0 | 3133822573 | 2813486090 | 3043728189 | 2930692486 | 2722090250 | 3126264246 | 2548880161 | 3018866070 | 3300903006 | 3360986726 | 2687726446 | 2669957110 | 2760906756 | 2552515195 | 3063501993 | 2917143245 | 3197656772 | 2801757639 | 2887380234 | 2864979720 | 2814551018 | 3008879024 | 2734287716 | 2509797720 | 3216438776 | 3118694950 | 3036079433 | 2657812274 | 3121342400 | 2937343518 | 3113016086 | 3179790190 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 3563069606 | 3369129026 | 3415539366 | 3408741114 | 3249072324 | 3389731617 | 3231691812 | 3429669079 | 3557693499 | 3491897883 | 3291795213 | 3195020413 | 3237587084 | 3197102545 | 3458301541 | 3361217145 | 3368475391 | 3154197548 | 3214583426 | 3176180589 | 3182672830 | 3228687378 | 3127536670 | 3093572505 | 3307864643 | 3261535145 | 3235589514 | 3174752605 | 3243541576 | 3257180131 | 3255170846 | 3336653887 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 2929354060 | 2768032612 | 2806316848 | 2800689944 | 2669804120 | 2784883240 | 2655251744 | 2818180616 | 2923486852 | 2868811252 | 2705133224 | 2625390292 | 2660995488 | 2627231856 | 2841621664 | 2762731940 | 2770029564 | 2591788304 | 2641191752 | 2609574388 | 2614745064 | 2652721284 | 2569715572 | 2541390060 | 2718158216 | 2679830280 | 2658276336 | 2608362012 | 2665197220 | 2676587556 | 2674313960 | 2742247864 |
| CAS_COUNT_RD | MBOX0C0 | 48962304 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 44916250 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_WR | MBOX0C1 | 60148475 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 58904072 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_RD | MBOX1C0 | 47741198 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 44173297 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_WR | MBOX1C1 | 59112460 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 57916540 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_RD | MBOX2C0 | 48105100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 46236395 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_WR | MBOX2C1 | 59278825 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 59745346 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_RD | MBOX3C0 | 45992964 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42878464 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_WR | MBOX3C1 | 58695026 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 57316165 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_RD | MBOX4C0 | 47163558 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42345235 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_WR | MBOX4C1 | 59299798 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56955623 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_RD | MBOX5C0 | 45281466 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42076025 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_WR | MBOX5C1 | 57684251 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56757595 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+-----------------------+---------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
+----------------------------+---------+--------------+------------+------------+--------------+
| Event | Counter | Sum | Min | Max | Avg |
+----------------------------+---------+--------------+------------+------------+--------------+
| INSTR_RETIRED_ANY STAT | FIXC0 | 93851278012 | 2509797720 | 3360986726 | 2.932852e+09 |
| CPU_CLK_UNHALTED_CORE STAT | FIXC1 | 105465453951 | 3093572505 | 3563069606 | 3.295795e+09 |
| CPU_CLK_UNHALTED_REF STAT | FIXC2 | 86662045184 | 2541390060 | 2929354060 | 2708188912 |
| CAS_COUNT_RD STAT | MBOX0C0 | 93878554 | 0 | 48962304 | 2.933705e+06 |
| CAS_COUNT_WR STAT | MBOX0C1 | 119052547 | 0 | 60148475 | 3.720392e+06 |
| CAS_COUNT_RD STAT | MBOX1C0 | 91914495 | 0 | 47741198 | 2.872328e+06 |
| CAS_COUNT_WR STAT | MBOX1C1 | 117029000 | 0 | 59112460 | 3.657156e+06 |
| CAS_COUNT_RD STAT | MBOX2C0 | 94341495 | 0 | 48105100 | 2.948172e+06 |
| CAS_COUNT_WR STAT | MBOX2C1 | 119024171 | 0 | 59745346 | 3.719505e+06 |
| CAS_COUNT_RD STAT | MBOX3C0 | 88871428 | 0 | 45992964 | 2.777232e+06 |
| CAS_COUNT_WR STAT | MBOX3C1 | 116011191 | 0 | 58695026 | 3.625350e+06 |
| CAS_COUNT_RD STAT | MBOX4C0 | 89508793 | 0 | 47163558 | 2.797150e+06 |
| CAS_COUNT_WR STAT | MBOX4C1 | 116255421 | 0 | 59299798 | 3.632982e+06 |
| CAS_COUNT_RD STAT | MBOX5C0 | 87357491 | 0 | 45281466 | 2.729922e+06 |
| CAS_COUNT_WR STAT | MBOX5C1 | 114441846 | 0 | 57684251 | 3.576308e+06 |
+----------------------------+---------+--------------+------------+------------+--------------+
+-----------------------------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
| Metric | HWThread 0 | HWThread 2 | HWThread 4 | HWThread 6 | HWThread 8 | HWThread 10 | HWThread 12 | HWThread 14 | HWThread 16 | HWThread 18 | HWThread 20 | HWThread 22 | HWThread 24 | HWThread 26 | HWThread 28 | HWThread 30 | HWThread 1 | HWThread 3 | HWThread 5 | HWThread 7 | HWThread 9 | HWThread 11 | HWThread 13 | HWThread 15 | HWThread 17 | HWThread 19 | HWThread 21 | HWThread 23 | HWThread 25 | HWThread 27 | HWThread 29 | HWThread 31 |
+-----------------------------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
| Runtime (RDTSC) [s] | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 | 10.2164 |
| Runtime unhalted [s] | 1.5529 | 1.4684 | 1.4886 | 1.4856 | 1.4160 | 1.4773 | 1.4085 | 1.4947 | 1.5505 | 1.5219 | 1.4347 | 1.3925 | 1.4110 | 1.3934 | 1.5072 | 1.4649 | 1.4681 | 1.3747 | 1.4010 | 1.3843 | 1.3871 | 1.4072 | 1.3631 | 1.3483 | 1.4417 | 1.4215 | 1.4102 | 1.3836 | 1.4136 | 1.4196 | 1.4187 | 1.4542 |
| Clock [MHz] | 2790.8562 | 2792.7461 | 2792.5928 | 2792.6339 | 2792.3188 | 2792.8225 | 2792.6033 | 2792.3407 | 2792.2378 | 2792.8305 | 2792.0884 | 2792.3174 | 2791.6588 | 2792.1785 | 2792.4257 | 2791.5335 | 2790.1914 | 2792.3791 | 2792.6068 | 2792.6758 | 2792.8503 | 2792.6685 | 2792.5591 | 2793.0196 | 2792.2739 | 2792.5425 | 2792.7902 | 2792.7177 | 2792.3840 | 2792.1924 | 2792.8423 | 2791.8332 |
| CPI | 1.1370 | 1.1975 | 1.1222 | 1.1631 | 1.1936 | 1.0843 | 1.2679 | 1.1361 | 1.0778 | 1.0390 | 1.2248 | 1.1967 | 1.1727 | 1.2525 | 1.1289 | 1.1522 | 1.0534 | 1.1258 | 1.1133 | 1.1086 | 1.1308 | 1.0731 | 1.1438 | 1.2326 | 1.0284 | 1.0458 | 1.0657 | 1.1945 | 1.0391 | 1.1089 | 1.0457 | 1.0493 |
| Memory read bandwidth [MBytes/s] | 1774.3858 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1645.2070 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Memory read data volume [GBytes] | 18.1278 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16.8080 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Memory write bandwidth [MBytes/s] | 2218.9884 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2177.4958 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Memory write data volume [GBytes] | 22.6700 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 22.2461 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Memory bandwidth [MBytes/s] | 3993.3742 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3822.7027 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Memory data volume [GBytes] | 40.7978 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 39.0541 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+-----------------------------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
+----------------------------------------+------------+-----------+-----------+-----------+
| Metric | Sum | Min | Max | Avg |
+----------------------------------------+------------+-----------+-----------+-----------+
| Runtime (RDTSC) [s] STAT | 326.9248 | 10.2164 | 10.2164 | 10.2164 |
| Runtime unhalted [s] STAT | 45.9650 | 1.3483 | 1.5529 | 1.4364 |
| Clock [MHz] STAT | 89354.7117 | 2790.1914 | 2793.0196 | 2792.3347 |
| CPI STAT | 36.1051 | 1.0284 | 1.2679 | 1.1283 |
| Memory read bandwidth [MBytes/s] STAT | 3419.5928 | 0 | 1774.3858 | 106.8623 |
| Memory read data volume [GBytes] STAT | 34.9358 | 0 | 18.1278 | 1.0917 |
| Memory write bandwidth [MBytes/s] STAT | 4396.4842 | 0 | 2218.9884 | 137.3901 |
| Memory write data volume [GBytes] STAT | 44.9161 | 0 | 22.6700 | 1.4036 |
| Memory bandwidth [MBytes/s] STAT | 7816.0769 | 0 | 3993.3742 | 244.2524 |
| Memory data volume [GBytes] STAT | 79.8519 | 0 | 40.7978 | 2.4954 |
+----------------------------------------+------------+-----------+-----------+-----------+
Here, I’m not entirely sure how to read and interpret the final output table correctly.
I take it Sum is the sum of the individual measurements over all threads? And the memory bandwidth numbers are per-thread values, i.e. the memory moved by each thread divided by the runtime? Is that correct?
Next, I added the likwid Marker API macros to my code. (I made sure to call LIKWID_MARKER_INIT and LIKWID_MARKER_CLOSE in serial regions of the code, and to place the LIKWID_MARKER_REGISTER() calls in the parallel region, followed by a barrier, before calling LIKWID_MARKER_START() and LIKWID_MARKER_STOP().)
First a run on a single thread:
likwid-perfctr -m -f -c N:0 -g MEM ../swift_cuda_likwid_individual --hydro --threads=1 --steps=2 --pin greshoGPU64.yml
I’m measuring 6 different parts of my code with different markers, so I’ll only post one of them here:
Region pack_force_pair, Group 1: MEM
+-------------------+------------+
| Region Info | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] | 0.336539 |
| call count | 26624 |
+-------------------+------------+
+-----------------------+---------+------------+
| Event | Counter | HWThread 0 |
+-----------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 1244383000 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 1164683000 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 956697300 |
| CAS_COUNT_RD | MBOX0C0 | 2357508 |
| CAS_COUNT_WR | MBOX0C1 | 914567 |
| CAS_COUNT_RD | MBOX1C0 | 2269415 |
| CAS_COUNT_WR | MBOX1C1 | 962587 |
| CAS_COUNT_RD | MBOX2C0 | 2369541 |
| CAS_COUNT_WR | MBOX2C1 | 935900 |
| CAS_COUNT_RD | MBOX3C0 | 2426628 |
| CAS_COUNT_WR | MBOX3C1 | 948586 |
| CAS_COUNT_RD | MBOX4C0 | 2362531 |
| CAS_COUNT_WR | MBOX4C1 | 955147 |
| CAS_COUNT_RD | MBOX5C0 | 2234895 |
| CAS_COUNT_WR | MBOX5C1 | 932785 |
+-----------------------+---------+------------+
+-----------------------------------+------------+
| Metric | HWThread 0 |
+-----------------------------------+------------+
| Runtime (RDTSC) [s] | 0.3365 |
| Runtime unhalted [s] | 0.5076 |
| Clock [MHz] | 2793.3552 |
| CPI | 0.9360 |
| Memory read bandwidth [MBytes/s] | 2666.2937 |
| Memory read data volume [GBytes] | 0.8973 |
| Memory write bandwidth [MBytes/s] | 1074.3839 |
| Memory write data volume [GBytes] | 0.3616 |
| Memory bandwidth [MBytes/s] | 3740.6776 |
| Memory data volume [GBytes] | 1.2589 |
+-----------------------------------+------------+
which still looks sensible so far.
Now the same on 32 threads:
$ likwid-perfctr -m -f -c N:0-31 -g MEM ../swift_cuda_likwid_individual --hydro --threads=32 --steps=2 --pin greshoGPU64.yml
Gives:
Region pack_force_pair, Group 1: MEM
+-------------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
| Region Info | HWThread 0 | HWThread 1 | HWThread 2 | HWThread 3 | HWThread 4 | HWThread 5 | HWThread 6 | HWThread 7 | HWThread 8 | HWThread 9 | HWThread 10 | HWThread 11 | HWThread 12 | HWThread 13 | HWThread 14 | HWThread 15 | HWThread 16 | HWThread 17 | HWThread 18 | HWThread 19 | HWThread 20 | HWThread 21 | HWThread 22 | HWThread 23 | HWThread 24 | HWThread 25 | HWThread 26 | HWThread 27 | HWThread 28 | HWThread 29 | HWThread 30 | HWThread 31 |
+-------------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
| RDTSC Runtime [s] | 0.010442 | 0.011560 | 0.016923 | 0.014710 | 0.015549 | 0.015303 | 0.015466 | 0.016514 | 0.016758 | 0.015745 | 0.011916 | 0.016441 | 0.017599 | 0.016243 | 0.015920 | 0.014819 | 0.017088 | 0.014930 | 0.015209 | 0.015394 | 0.016010 | 0.016345 | 0.016773 | 0.011244 | 0.015535 | 0.015468 | 0.014643 | 0.015012 | 0.014299 | 0.015546 | 0.013667 | 0.015316 |
| call count | 563 | 575 | 863 | 854 | 809 | 870 | 823 | 950 | 880 | 877 | 693 | 948 | 900 | 922 | 854 | 866 | 887 | 857 | 811 | 874 | 853 | 936 | 857 | 658 | 860 | 892 | 772 | 858 | 769 | 888 | 748 | 857 |
+-------------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
+-----------------------+---------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
| Event | Counter | HWThread 0 | HWThread 1 | HWThread 2 | HWThread 3 | HWThread 4 | HWThread 5 | HWThread 6 | HWThread 7 | HWThread 8 | HWThread 9 | HWThread 10 | HWThread 11 | HWThread 12 | HWThread 13 | HWThread 14 | HWThread 15 | HWThread 16 | HWThread 17 | HWThread 18 | HWThread 19 | HWThread 20 | HWThread 21 | HWThread 22 | HWThread 23 | HWThread 24 | HWThread 25 | HWThread 26 | HWThread 27 | HWThread 28 | HWThread 29 | HWThread 30 | HWThread 31 |
+-----------------------+---------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
| INSTR_RETIRED_ANY | FIXC0 | 26391780 | 27555630 | 40398940 | 39661550 | 37776000 | 41807650 | 38587430 | 44776580 | 41759230 | 42211240 | 31988210 | 45117470 | 43678630 | 44675960 | 40679420 | 40229980 | 42016820 | 40494310 | 37820350 | 40984130 | 40460380 | 44748260 | 41132770 | 31308840 | 40851330 | 42629310 | 37221040 | 40382140 | 37025220 | 42376680 | 35505310 | 41337410 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 35996690 | 38588760 | 59669280 | 51223730 | 53052250 | 52840090 | 53087420 | 57386460 | 56885390 | 54128200 | 41335000 | 57534850 | 59159950 | 55996910 | 54084730 | 51440720 | 58282140 | 51746050 | 51742580 | 53229480 | 54850200 | 55878340 | 56908520 | 39288500 | 53259380 | 53675150 | 50637850 | 52139160 | 49117240 | 53294570 | 47102640 | 53487450 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 29570550 | 31699240 | 49016130 | 42078780 | 43576440 | 43390140 | 43601560 | 47144660 | 46738210 | 44462040 | 33950760 | 47259760 | 48595600 | 45992550 | 44420910 | 42252660 | 47869160 | 42506120 | 42502800 | 43715550 | 45054240 | 45880680 | 46745750 | 32274980 | 43752160 | 44094130 | 41599270 | 42818270 | 40341450 | 43771940 | 38686090 | 43940670 |
| CAS_COUNT_RD | MBOX0C0 | 3730206 | 7418379 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_WR | MBOX0C1 | 3021638 | 7785870 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_RD | MBOX1C0 | 2975139 | 7080849 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_WR | MBOX1C1 | 2258303 | 7193349 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_RD | MBOX2C0 | 2930728 | 7184927 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_WR | MBOX2C1 | 2227178 | 7489519 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_RD | MBOX3C0 | 3253827 | 7024592 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_WR | MBOX3C1 | 2603471 | 7411129 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_RD | MBOX4C0 | 2834863 | 7318420 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_WR | MBOX4C1 | 2161552 | 7713646 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_RD | MBOX5C0 | 2699138 | 7035477 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAS_COUNT_WR | MBOX5C1 | 2055074 | 7071022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+-----------------------+---------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
+----------------------------+---------+------------+----------+----------+--------------+
| Event | Counter | Sum | Min | Max | Avg |
+----------------------------+---------+------------+----------+----------+--------------+
| INSTR_RETIRED_ANY STAT | FIXC0 | 1263590000 | 26391780 | 45117470 | 3.948719e+07 |
| CPU_CLK_UNHALTED_CORE STAT | FIXC1 | 1667049680 | 35996690 | 59669280 | 5.209530e+07 |
| CPU_CLK_UNHALTED_REF STAT | FIXC2 | 1369303250 | 29570550 | 49016130 | 4.279073e+07 |
| CAS_COUNT_RD STAT | MBOX0C0 | 11148585 | 0 | 7418379 | 348393.2812 |
| CAS_COUNT_WR STAT | MBOX0C1 | 10807508 | 0 | 7785870 | 337734.6250 |
| CAS_COUNT_RD STAT | MBOX1C0 | 10055988 | 0 | 7080849 | 314249.6250 |
| CAS_COUNT_WR STAT | MBOX1C1 | 9451652 | 0 | 7193349 | 295364.1250 |
| CAS_COUNT_RD STAT | MBOX2C0 | 10115655 | 0 | 7184927 | 316114.2188 |
| CAS_COUNT_WR STAT | MBOX2C1 | 9716697 | 0 | 7489519 | 303646.7812 |
| CAS_COUNT_RD STAT | MBOX3C0 | 10278419 | 0 | 7024592 | 321200.5938 |
| CAS_COUNT_WR STAT | MBOX3C1 | 10014600 | 0 | 7411129 | 312956.2500 |
| CAS_COUNT_RD STAT | MBOX4C0 | 10153283 | 0 | 7318420 | 317290.0938 |
| CAS_COUNT_WR STAT | MBOX4C1 | 9875198 | 0 | 7713646 | 308599.9375 |
| CAS_COUNT_RD STAT | MBOX5C0 | 9734615 | 0 | 7035477 | 304206.7188 |
| CAS_COUNT_WR STAT | MBOX5C1 | 9126096 | 0 | 7071022 | 285190.5000 |
+----------------------------+---------+------------+----------+----------+--------------+
+-----------------------------------+-------------+-------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
| Metric | HWThread 0 | HWThread 1 | HWThread 2 | HWThread 3 | HWThread 4 | HWThread 5 | HWThread 6 | HWThread 7 | HWThread 8 | HWThread 9 | HWThread 10 | HWThread 11 | HWThread 12 | HWThread 13 | HWThread 14 | HWThread 15 | HWThread 16 | HWThread 17 | HWThread 18 | HWThread 19 | HWThread 20 | HWThread 21 | HWThread 22 | HWThread 23 | HWThread 24 | HWThread 25 | HWThread 26 | HWThread 27 | HWThread 28 | HWThread 29 | HWThread 30 | HWThread 31 |
+-----------------------------------+-------------+-------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
| Runtime (RDTSC) [s] | 0.0104 | 0.0116 | 0.0169 | 0.0147 | 0.0155 | 0.0153 | 0.0155 | 0.0165 | 0.0168 | 0.0157 | 0.0119 | 0.0164 | 0.0176 | 0.0162 | 0.0159 | 0.0148 | 0.0171 | 0.0149 | 0.0152 | 0.0154 | 0.0160 | 0.0163 | 0.0168 | 0.0112 | 0.0155 | 0.0155 | 0.0146 | 0.0150 | 0.0143 | 0.0155 | 0.0137 | 0.0153 |
| Runtime unhalted [s] | 0.0157 | 0.0168 | 0.0260 | 0.0223 | 0.0231 | 0.0230 | 0.0231 | 0.0250 | 0.0248 | 0.0236 | 0.0180 | 0.0251 | 0.0258 | 0.0244 | 0.0236 | 0.0224 | 0.0254 | 0.0226 | 0.0226 | 0.0232 | 0.0239 | 0.0244 | 0.0248 | 0.0171 | 0.0232 | 0.0234 | 0.0221 | 0.0227 | 0.0214 | 0.0232 | 0.0205 | 0.0233 |
| Clock [MHz] | 2793.0634 | 2793.1200 | 2793.1188 | 2793.0949 | 2793.3779 | 2794.1526 | 2793.6193 | 2792.8947 | 2792.5843 | 2793.2630 | 2793.4832 | 2793.2969 | 2793.2414 | 2793.5355 | 2793.6040 | 2793.3838 | 2793.5555 | 2793.2086 | 2793.2394 | 2793.7909 | 2793.3167 | 2794.4174 | 2793.2692 | 2793.0396 | 2793.0212 | 2792.9947 | 2792.9758 | 2793.9108 | 2793.5733 | 2793.6037 | 2793.6245 | 2792.9480 |
| CPI | 1.3639 | 1.4004 | 1.4770 | 1.2915 | 1.4044 | 1.2639 | 1.3758 | 1.2816 | 1.3622 | 1.2823 | 1.2922 | 1.2752 | 1.3544 | 1.2534 | 1.3295 | 1.2787 | 1.3871 | 1.2779 | 1.3681 | 1.2988 | 1.3557 | 1.2487 | 1.3835 | 1.2549 | 1.3037 | 1.2591 | 1.3605 | 1.2911 | 1.3266 | 1.2576 | 1.3266 | 1.2939 |
| Memory read bandwidth [MBytes/s] | 112925.0662 | 238401.0588 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Memory read data volume [GBytes] | 1.1791 | 2.7560 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Memory write bandwidth [MBytes/s] | 87815.3772 | 247269.3603 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Memory write data volume [GBytes] | 0.9169 | 2.8585 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Memory bandwidth [MBytes/s] | 200740.4434 | 485670.4191 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Memory data volume [GBytes] | 2.0961 | 5.6145 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+-----------------------------------+-------------+-------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
+----------------------------------------+-------------+-----------+-------------+------------+
| Metric | Sum | Min | Max | Avg |
+----------------------------------------+-------------+-----------+-------------+------------+
| Runtime (RDTSC) [s] STAT | 0.4840 | 0.0104 | 0.0176 | 0.0151 |
| Runtime unhalted [s] STAT | 0.7265 | 0.0157 | 0.0260 | 0.0227 |
| Clock [MHz] STAT | 89387.3230 | 2792.5843 | 2794.4174 | 2793.3538 |
| CPI STAT | 42.2802 | 1.2487 | 1.4770 | 1.3213 |
| Memory read bandwidth [MBytes/s] STAT | 351326.1250 | 0 | 238401.0588 | 10978.9414 |
| Memory read data volume [GBytes] STAT | 3.9351 | 0 | 2.7560 | 0.1230 |
| Memory write bandwidth [MBytes/s] STAT | 335084.7375 | 0 | 247269.3603 | 10471.3980 |
| Memory write data volume [GBytes] STAT | 3.7754 | 0 | 2.8585 | 0.1180 |
| Memory bandwidth [MBytes/s] STAT | 686410.8625 | 0 | 485670.4191 | 21450.3395 |
| Memory data volume [GBytes] STAT | 7.7106 | 0 | 5.6145 | 0.2410 |
+----------------------------------------+-------------+-----------+-------------+------------+
And here is where I am getting lost. The Max (and Sum) memory bandwidths are much bigger than anything my likwid-bench measurements have shown (under ~100 GByte/s). What is going on here? How am I to understand these numbers?