
Help understanding output of MEM group with Marker API - reported memory bandwidth measurement is much higher than what likwid-bench suggests #696

@mladenivkovic


Hi all!

I'm in the process of running some measurements on my software with likwid and I was hoping to ask for some help with interpreting the results likwid is showing me, as the way I understand them, they can't be right.

Here’s what I’ve been up to so far.

My Software

  • Parallelised using pthreads (and an “in-house” task scheduler based on pthreads); uses CUDA for GPU acceleration.
  • My code can pin the threads it runs on (using pthread_setaffinity_np), so in the measurement runs with likwid-perfctr, I’ll be using the -c instead of the -C flag
  • https://github.com/abouzied-nasar/SWIFT/tree/likwid-markers in the unlikely case it’s needed
  • The cluster I’m using has likwid 5.4.1 installed

What I want to achieve

  • I want to measure the memory bandwidth behaviour in specific regions of my code
    • I’m interested in CPU behaviour of a code that is GPU accelerated. The GPU bit is not interesting to me at this point, only what’s going on on the CPU side.
    • The MEM event group and likwid's Marker API seem ideal for this.

My Hardware

--------------------------------------------------------------------------------
CPU name:	Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
CPU type:	Intel Cascadelake SP processor
CPU stepping:	7
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets:		2
CPU dies:		2
Cores per socket:	16
Threads per core:	2

What I’ve done so far

To begin with, I ran some likwid-bench benchmarks to assess the peak memory bandwidth of my system:

# Triad benchmark: A[i] = B[i] * C[i] + D[i]
# 32 threads
likwid-bench -t triad -w N:4GB:32
MByte/s:		78734.97

# single thread
MByte/s:		12036.72

# copy_mem benchmark: Copy with non-temporal store
# 32 threads
likwid-bench -t copy_mem -w N:4GB:32
MByte/s:		83644.83

likwid-bench -t copy_mem -w N:4GB:1
MByte/s:		8881.45

# load benchmark
# 32 threads
likwid-bench -t load -w N:4GB:32
MByte/s:		99647.42

# single thread
likwid-bench -t load -w N:4GB:1
MByte/s:		10869.33

# store benchmark
# 32 threads
likwid-bench -t store -w N:4GB:32
MByte/s:	47517.67

# single thread
likwid-bench -t store -w N:4GB:1 
MByte/s:	9423.61

I made sure I was the only one using the node while benchmarking. So far, so good, I guess.

Next, I ran a tiny (non-representative) example with my software to verify that everything works as it should.

First, a run on a single thread:

likwid-perfctr -f -c N:1 -g MEM ../swift_cuda --hydro --threads=1 --steps=2 --pin greshoGPU64.yml

gives:

--------------------------------------------------------------------------------
Group 1: MEM
+-----------------------+---------+------------+
|         Event         | Counter | HWThread 2 |
+-----------------------+---------+------------+
|   INSTR_RETIRED_ANY   |  FIXC0  |       6864 |
| CPU_CLK_UNHALTED_CORE |  FIXC1  |      30469 |
|  CPU_CLK_UNHALTED_REF |  FIXC2  |      25944 |
|      CAS_COUNT_RD     | MBOX0C0 |   14515348 |
|      CAS_COUNT_WR     | MBOX0C1 |    7941874 |
|      CAS_COUNT_RD     | MBOX1C0 |   14314478 |
|      CAS_COUNT_WR     | MBOX1C1 |    7980721 |
|      CAS_COUNT_RD     | MBOX2C0 |   14274722 |
|      CAS_COUNT_WR     | MBOX2C1 |    7958660 |
|      CAS_COUNT_RD     | MBOX3C0 |   14112250 |
|      CAS_COUNT_WR     | MBOX3C1 |    7972464 |
|      CAS_COUNT_RD     | MBOX4C0 |   14103420 |
|      CAS_COUNT_WR     | MBOX4C1 |    7950516 |
|      CAS_COUNT_RD     | MBOX5C0 |   14039041 |
|      CAS_COUNT_WR     | MBOX5C1 |    7934033 |
+-----------------------+---------+------------+

+-----------------------------------+--------------+
|               Metric              |  HWThread 2  |
+-----------------------------------+--------------+
|        Runtime (RDTSC) [s]        |      12.9292 |
|        Runtime unhalted [s]       | 1.327911e-05 |
|            Clock [MHz]            |    2694.7005 |
|                CPI                |       4.4390 |
|  Memory read bandwidth [MBytes/s] |     422.5309 |
|  Memory read data volume [GBytes] |       5.4630 |
| Memory write bandwidth [MBytes/s] |     236.3059 |
| Memory write data volume [GBytes] |       3.0552 |
|    Memory bandwidth [MBytes/s]    |     658.8368 |
|    Memory data volume [GBytes]    |       8.5182 |
+-----------------------------------+--------------+

which seems fine to me.
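
As a cross-check, the metric table above can be reproduced from the raw CAS counts. The sketch below assumes that each CAS event moves one 64-byte cache line and that the bandwidth is the data volume divided by the RDTSC runtime; both assumptions are consistent with the printed numbers, but the variable and function names are only illustrative.

```python
# Raw counts copied from the single-thread MEM run (HWThread 2),
# one entry per memory channel (MBOX0..MBOX5).
cas_rd = [14515348, 14314478, 14274722, 14112250, 14103420, 14039041]
cas_wr = [7941874, 7980721, 7958660, 7972464, 7950516, 7934033]
runtime_s = 12.9292  # "Runtime (RDTSC) [s]" from the metric table

BYTES_PER_CAS = 64  # assumption: one cache line per CAS event


def volume_gbytes(counts):
    """Data volume in GBytes (1e9 bytes) for a list of CAS counts."""
    return sum(counts) * BYTES_PER_CAS / 1e9


def bandwidth_mbytes_s(counts, seconds):
    """Bandwidth in MBytes/s (1e6 bytes/s) over the given runtime."""
    return sum(counts) * BYTES_PER_CAS / 1e6 / seconds


read_volume_gb = volume_gbytes(cas_rd)
write_volume_gb = volume_gbytes(cas_wr)
read_bw = bandwidth_mbytes_s(cas_rd, runtime_s)
write_bw = bandwidth_mbytes_s(cas_wr, runtime_s)
total_bw = read_bw + write_bw

print(f"read : {read_volume_gb:.4f} GBytes, {read_bw:.1f} MBytes/s")
print(f"write: {write_volume_gb:.4f} GBytes, {write_bw:.1f} MBytes/s")
print(f"total: {total_bw:.1f} MBytes/s")
```

The results agree with the table (5.4630 and 3.0552 GBytes, roughly 422.5 and 236.3 MBytes/s) up to the rounding of the printed runtime.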

Next, using the full node (32 threads):

$ likwid-perfctr -f -c N:0-31 -g MEM ../swift_cuda --hydro --threads=32 --steps=2 --pin greshoGPU64.yml 

I get the following result:

--------------------------------------------------------------------------------
Group 1: MEM
+-----------------------+---------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
|         Event         | Counter | HWThread 0 | HWThread 2 | HWThread 4 | HWThread 6 | HWThread 8 | HWThread 10 | HWThread 12 | HWThread 14 | HWThread 16 | HWThread 18 | HWThread 20 | HWThread 22 | HWThread 24 | HWThread 26 | HWThread 28 | HWThread 30 | HWThread 1 | HWThread 3 | HWThread 5 | HWThread 7 | HWThread 9 | HWThread 11 | HWThread 13 | HWThread 15 | HWThread 17 | HWThread 19 | HWThread 21 | HWThread 23 | HWThread 25 | HWThread 27 | HWThread 29 | HWThread 31 |
+-----------------------+---------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
|   INSTR_RETIRED_ANY   |  FIXC0  | 3133822573 | 2813486090 | 3043728189 | 2930692486 | 2722090250 |  3126264246 |  2548880161 |  3018866070 |  3300903006 |  3360986726 |  2687726446 |  2669957110 |  2760906756 |  2552515195 |  3063501993 |  2917143245 | 3197656772 | 2801757639 | 2887380234 | 2864979720 | 2814551018 |  3008879024 |  2734287716 |  2509797720 |  3216438776 |  3118694950 |  3036079433 |  2657812274 |  3121342400 |  2937343518 |  3113016086 |  3179790190 |
| CPU_CLK_UNHALTED_CORE |  FIXC1  | 3563069606 | 3369129026 | 3415539366 | 3408741114 | 3249072324 |  3389731617 |  3231691812 |  3429669079 |  3557693499 |  3491897883 |  3291795213 |  3195020413 |  3237587084 |  3197102545 |  3458301541 |  3361217145 | 3368475391 | 3154197548 | 3214583426 | 3176180589 | 3182672830 |  3228687378 |  3127536670 |  3093572505 |  3307864643 |  3261535145 |  3235589514 |  3174752605 |  3243541576 |  3257180131 |  3255170846 |  3336653887 |
|  CPU_CLK_UNHALTED_REF |  FIXC2  | 2929354060 | 2768032612 | 2806316848 | 2800689944 | 2669804120 |  2784883240 |  2655251744 |  2818180616 |  2923486852 |  2868811252 |  2705133224 |  2625390292 |  2660995488 |  2627231856 |  2841621664 |  2762731940 | 2770029564 | 2591788304 | 2641191752 | 2609574388 | 2614745064 |  2652721284 |  2569715572 |  2541390060 |  2718158216 |  2679830280 |  2658276336 |  2608362012 |  2665197220 |  2676587556 |  2674313960 |  2742247864 |
|      CAS_COUNT_RD     | MBOX0C0 |   48962304 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |   44916250 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_WR     | MBOX0C1 |   60148475 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |   58904072 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_RD     | MBOX1C0 |   47741198 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |   44173297 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_WR     | MBOX1C1 |   59112460 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |   57916540 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_RD     | MBOX2C0 |   48105100 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |   46236395 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_WR     | MBOX2C1 |   59278825 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |   59745346 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_RD     | MBOX3C0 |   45992964 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |   42878464 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_WR     | MBOX3C1 |   58695026 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |   57316165 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_RD     | MBOX4C0 |   47163558 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |   42345235 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_WR     | MBOX4C1 |   59299798 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |   56955623 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_RD     | MBOX5C0 |   45281466 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |   42076025 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_WR     | MBOX5C1 |   57684251 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |   56757595 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
+-----------------------+---------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+

+----------------------------+---------+--------------+------------+------------+--------------+
|            Event           | Counter |      Sum     |     Min    |     Max    |      Avg     |
+----------------------------+---------+--------------+------------+------------+--------------+
|   INSTR_RETIRED_ANY STAT   |  FIXC0  |  93851278012 | 2509797720 | 3360986726 | 2.932852e+09 |
| CPU_CLK_UNHALTED_CORE STAT |  FIXC1  | 105465453951 | 3093572505 | 3563069606 | 3.295795e+09 |
|  CPU_CLK_UNHALTED_REF STAT |  FIXC2  |  86662045184 | 2541390060 | 2929354060 |   2708188912 |
|      CAS_COUNT_RD STAT     | MBOX0C0 |     93878554 |          0 |   48962304 | 2.933705e+06 |
|      CAS_COUNT_WR STAT     | MBOX0C1 |    119052547 |          0 |   60148475 | 3.720392e+06 |
|      CAS_COUNT_RD STAT     | MBOX1C0 |     91914495 |          0 |   47741198 | 2.872328e+06 |
|      CAS_COUNT_WR STAT     | MBOX1C1 |    117029000 |          0 |   59112460 | 3.657156e+06 |
|      CAS_COUNT_RD STAT     | MBOX2C0 |     94341495 |          0 |   48105100 | 2.948172e+06 |
|      CAS_COUNT_WR STAT     | MBOX2C1 |    119024171 |          0 |   59745346 | 3.719505e+06 |
|      CAS_COUNT_RD STAT     | MBOX3C0 |     88871428 |          0 |   45992964 | 2.777232e+06 |
|      CAS_COUNT_WR STAT     | MBOX3C1 |    116011191 |          0 |   58695026 | 3.625350e+06 |
|      CAS_COUNT_RD STAT     | MBOX4C0 |     89508793 |          0 |   47163558 | 2.797150e+06 |
|      CAS_COUNT_WR STAT     | MBOX4C1 |    116255421 |          0 |   59299798 | 3.632982e+06 |
|      CAS_COUNT_RD STAT     | MBOX5C0 |     87357491 |          0 |   45281466 | 2.729922e+06 |
|      CAS_COUNT_WR STAT     | MBOX5C1 |    114441846 |          0 |   57684251 | 3.576308e+06 |
+----------------------------+---------+--------------+------------+------------+--------------+

+-----------------------------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
|               Metric              | HWThread 0 | HWThread 2 | HWThread 4 | HWThread 6 | HWThread 8 | HWThread 10 | HWThread 12 | HWThread 14 | HWThread 16 | HWThread 18 | HWThread 20 | HWThread 22 | HWThread 24 | HWThread 26 | HWThread 28 | HWThread 30 | HWThread 1 | HWThread 3 | HWThread 5 | HWThread 7 | HWThread 9 | HWThread 11 | HWThread 13 | HWThread 15 | HWThread 17 | HWThread 19 | HWThread 21 | HWThread 23 | HWThread 25 | HWThread 27 | HWThread 29 | HWThread 31 |
+-----------------------------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
|        Runtime (RDTSC) [s]        |    10.2164 |    10.2164 |    10.2164 |    10.2164 |    10.2164 |     10.2164 |     10.2164 |     10.2164 |     10.2164 |     10.2164 |     10.2164 |     10.2164 |     10.2164 |     10.2164 |     10.2164 |     10.2164 |    10.2164 |    10.2164 |    10.2164 |    10.2164 |    10.2164 |     10.2164 |     10.2164 |     10.2164 |     10.2164 |     10.2164 |     10.2164 |     10.2164 |     10.2164 |     10.2164 |     10.2164 |     10.2164 |
|        Runtime unhalted [s]       |     1.5529 |     1.4684 |     1.4886 |     1.4856 |     1.4160 |      1.4773 |      1.4085 |      1.4947 |      1.5505 |      1.5219 |      1.4347 |      1.3925 |      1.4110 |      1.3934 |      1.5072 |      1.4649 |     1.4681 |     1.3747 |     1.4010 |     1.3843 |     1.3871 |      1.4072 |      1.3631 |      1.3483 |      1.4417 |      1.4215 |      1.4102 |      1.3836 |      1.4136 |      1.4196 |      1.4187 |      1.4542 |
|            Clock [MHz]            |  2790.8562 |  2792.7461 |  2792.5928 |  2792.6339 |  2792.3188 |   2792.8225 |   2792.6033 |   2792.3407 |   2792.2378 |   2792.8305 |   2792.0884 |   2792.3174 |   2791.6588 |   2792.1785 |   2792.4257 |   2791.5335 |  2790.1914 |  2792.3791 |  2792.6068 |  2792.6758 |  2792.8503 |   2792.6685 |   2792.5591 |   2793.0196 |   2792.2739 |   2792.5425 |   2792.7902 |   2792.7177 |   2792.3840 |   2792.1924 |   2792.8423 |   2791.8332 |
|                CPI                |     1.1370 |     1.1975 |     1.1222 |     1.1631 |     1.1936 |      1.0843 |      1.2679 |      1.1361 |      1.0778 |      1.0390 |      1.2248 |      1.1967 |      1.1727 |      1.2525 |      1.1289 |      1.1522 |     1.0534 |     1.1258 |     1.1133 |     1.1086 |     1.1308 |      1.0731 |      1.1438 |      1.2326 |      1.0284 |      1.0458 |      1.0657 |      1.1945 |      1.0391 |      1.1089 |      1.0457 |      1.0493 |
|  Memory read bandwidth [MBytes/s] |  1774.3858 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |  1645.2070 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|  Memory read data volume [GBytes] |    18.1278 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |    16.8080 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
| Memory write bandwidth [MBytes/s] |  2218.9884 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |  2177.4958 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
| Memory write data volume [GBytes] |    22.6700 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |    22.2461 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|    Memory bandwidth [MBytes/s]    |  3993.3742 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |  3822.7027 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|    Memory data volume [GBytes]    |    40.7978 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |    39.0541 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
+-----------------------------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+

+----------------------------------------+------------+-----------+-----------+-----------+
|                 Metric                 |     Sum    |    Min    |    Max    |    Avg    |
+----------------------------------------+------------+-----------+-----------+-----------+
|        Runtime (RDTSC) [s] STAT        |   326.9248 |   10.2164 |   10.2164 |   10.2164 |
|        Runtime unhalted [s] STAT       |    45.9650 |    1.3483 |    1.5529 |    1.4364 |
|            Clock [MHz] STAT            | 89354.7117 | 2790.1914 | 2793.0196 | 2792.3347 |
|                CPI STAT                |    36.1051 |    1.0284 |    1.2679 |    1.1283 |
|  Memory read bandwidth [MBytes/s] STAT |  3419.5928 |         0 | 1774.3858 |  106.8623 |
|  Memory read data volume [GBytes] STAT |    34.9358 |         0 |   18.1278 |    1.0917 |
| Memory write bandwidth [MBytes/s] STAT |  4396.4842 |         0 | 2218.9884 |  137.3901 |
| Memory write data volume [GBytes] STAT |    44.9161 |         0 |   22.6700 |    1.4036 |
|    Memory bandwidth [MBytes/s] STAT    |  7816.0769 |         0 | 3993.3742 |  244.2524 |
|    Memory data volume [GBytes] STAT    |    79.8519 |         0 |   40.7978 |    2.4954 |
+----------------------------------------+------------+-----------+-----------+-----------+

Here, I’m not entirely sure how to read and interpret the final output table correctly.

I take it SUM is the sum of the individual measurements for each thread? And the memory bandwidth numbers are averaged per thread as total memory moved divided by the runtime? Is that correct?
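
A quick check of that reading against the tables above: only HWThread 0 and HWThread 1 carry non-zero CAS counts, which would fit socket-wide memory counters being reported on one hardware thread per socket. The sketch assumes 64 bytes per CAS event and that the STAT columns are plain sums and averages over all 32 per-thread values; the names are illustrative.

```python
# CAS_COUNT_RD values of HWThread 0 (MBOX0..MBOX5) from the 32-thread run.
socket0_cas_rd = [48962304, 47741198, 48105100, 45992964, 47163558, 45281466]
runtime_s = 10.2164  # "Runtime (RDTSC) [s]", identical for every thread

BYTES_PER_CAS = 64  # assumption: one cache line per CAS event

# Read bandwidth seen by HWThread 0: data volume divided by the runtime.
socket0_read_bw = sum(socket0_cas_rd) * BYTES_PER_CAS / 1e6 / runtime_s

# "Memory bandwidth [MBytes/s]" per hardware thread; the 30 threads
# not listed here report 0 in the metric table.
n_threads = 32
per_thread_bw = {0: 3993.3742, 1: 3822.7027}
values = [per_thread_bw.get(t, 0.0) for t in range(n_threads)]

stat_sum = sum(values)            # STAT column "Sum"
stat_avg = stat_sum / n_threads   # STAT column "Avg", zeros included

print(f"HWThread 0 read bandwidth: {socket0_read_bw:.1f} MBytes/s")
print(f"Sum: {stat_sum:.4f} MBytes/s, Avg: {stat_avg:.4f} MBytes/s")
```

Under these assumptions the Sum (7816.0769 MBytes/s) would be the bandwidth of both sockets together, while the Avg (244.2524 MBytes/s) is diluted by the 30 zero entries and has no direct physical meaning.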

Next, I added the LIKWID Marker API macros to my code. I made sure to call LIKWID_MARKER_INIT and LIKWID_MARKER_CLOSE in serial regions of the code, and the LIKWID_MARKER_REGISTER() calls in the parallel region, followed by a barrier before calling LIKWID_MARKER_START() and LIKWID_MARKER_STOP().

First a run on a single thread:

likwid-perfctr -m -f -c N:0 -g MEM ../swift_cuda_likwid_individual --hydro --threads=1 --steps=2 --pin greshoGPU64.yml

I’m measuring 6 different parts of my code with different markers, so I’ll only post one of them here:

Region pack_force_pair, Group 1: MEM
+-------------------+------------+
|    Region Info    | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] |   0.336539 |
|     call count    |      26624 |
+-------------------+------------+

+-----------------------+---------+------------+
|         Event         | Counter | HWThread 0 |
+-----------------------+---------+------------+
|   INSTR_RETIRED_ANY   |  FIXC0  | 1244383000 |
| CPU_CLK_UNHALTED_CORE |  FIXC1  | 1164683000 |
|  CPU_CLK_UNHALTED_REF |  FIXC2  |  956697300 |
|      CAS_COUNT_RD     | MBOX0C0 |    2357508 |
|      CAS_COUNT_WR     | MBOX0C1 |     914567 |
|      CAS_COUNT_RD     | MBOX1C0 |    2269415 |
|      CAS_COUNT_WR     | MBOX1C1 |     962587 |
|      CAS_COUNT_RD     | MBOX2C0 |    2369541 |
|      CAS_COUNT_WR     | MBOX2C1 |     935900 |
|      CAS_COUNT_RD     | MBOX3C0 |    2426628 |
|      CAS_COUNT_WR     | MBOX3C1 |     948586 |
|      CAS_COUNT_RD     | MBOX4C0 |    2362531 |
|      CAS_COUNT_WR     | MBOX4C1 |     955147 |
|      CAS_COUNT_RD     | MBOX5C0 |    2234895 |
|      CAS_COUNT_WR     | MBOX5C1 |     932785 |
+-----------------------+---------+------------+

+-----------------------------------+------------+
|               Metric              | HWThread 0 |
+-----------------------------------+------------+
|        Runtime (RDTSC) [s]        |     0.3365 |
|        Runtime unhalted [s]       |     0.5076 |
|            Clock [MHz]            |  2793.3552 |
|                CPI                |     0.9360 |
|  Memory read bandwidth [MBytes/s] |  2666.2937 |
|  Memory read data volume [GBytes] |     0.8973 |
| Memory write bandwidth [MBytes/s] |  1074.3839 |
| Memory write data volume [GBytes] |     0.3616 |
|    Memory bandwidth [MBytes/s]    |  3740.6776 |
|    Memory data volume [GBytes]    |     1.2589 |
+-----------------------------------+------------+

which still looks sensible so far.
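
One way to make "sensible" concrete is to compare the region's bandwidth with the single-thread likwid-bench results from above. This is only a rough sanity check: the region mixes reads and writes, so none of the benchmarks is an exact ceiling for it. All numbers are copied from the outputs in this post; the names are illustrative.

```python
# Single-thread likwid-bench results in MBytes/s, copied from above.
single_thread_peaks = {
    "triad": 12036.72,
    "copy_mem": 8881.45,
    "load": 10869.33,
    "store": 9423.61,
}

# "Memory bandwidth [MBytes/s]" of region pack_force_pair on one thread.
region_bw = 3740.6776

# Fraction of each benchmark result that the region reaches.
fractions = {name: region_bw / peak for name, peak in single_thread_peaks.items()}

for name, frac in fractions.items():
    print(f"{name:>8}: {100 * frac:5.1f} % of benchmark bandwidth")

# A value above 1 would hint at a problem with the measurement or its interpretation.
suspicious = [name for name, frac in fractions.items() if frac > 1.0]
print("suspicious:", suspicious)
```

For the single-thread run every fraction stays well below 1, so `suspicious` is empty.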

Now the same on 32 threads:

$ likwid-perfctr -m -f -c N:0-31 -g MEM ../swift_cuda_likwid_individual --hydro --threads=32 --steps=2 --pin greshoGPU64.yml 

Gives:

Region pack_force_pair, Group 1: MEM
+-------------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
|    Region Info    | HWThread 0 | HWThread 1 | HWThread 2 | HWThread 3 | HWThread 4 | HWThread 5 | HWThread 6 | HWThread 7 | HWThread 8 | HWThread 9 | HWThread 10 | HWThread 11 | HWThread 12 | HWThread 13 | HWThread 14 | HWThread 15 | HWThread 16 | HWThread 17 | HWThread 18 | HWThread 19 | HWThread 20 | HWThread 21 | HWThread 22 | HWThread 23 | HWThread 24 | HWThread 25 | HWThread 26 | HWThread 27 | HWThread 28 | HWThread 29 | HWThread 30 | HWThread 31 |
+-------------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
| RDTSC Runtime [s] |   0.010442 |   0.011560 |   0.016923 |   0.014710 |   0.015549 |   0.015303 |   0.015466 |   0.016514 |   0.016758 |   0.015745 |    0.011916 |    0.016441 |    0.017599 |    0.016243 |    0.015920 |    0.014819 |    0.017088 |    0.014930 |    0.015209 |    0.015394 |    0.016010 |    0.016345 |    0.016773 |    0.011244 |    0.015535 |    0.015468 |    0.014643 |    0.015012 |    0.014299 |    0.015546 |    0.013667 |    0.015316 |
|     call count    |        563 |        575 |        863 |        854 |        809 |        870 |        823 |        950 |        880 |        877 |         693 |         948 |         900 |         922 |         854 |         866 |         887 |         857 |         811 |         874 |         853 |         936 |         857 |         658 |         860 |         892 |         772 |         858 |         769 |         888 |         748 |         857 |
+-------------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+

+-----------------------+---------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
|         Event         | Counter | HWThread 0 | HWThread 1 | HWThread 2 | HWThread 3 | HWThread 4 | HWThread 5 | HWThread 6 | HWThread 7 | HWThread 8 | HWThread 9 | HWThread 10 | HWThread 11 | HWThread 12 | HWThread 13 | HWThread 14 | HWThread 15 | HWThread 16 | HWThread 17 | HWThread 18 | HWThread 19 | HWThread 20 | HWThread 21 | HWThread 22 | HWThread 23 | HWThread 24 | HWThread 25 | HWThread 26 | HWThread 27 | HWThread 28 | HWThread 29 | HWThread 30 | HWThread 31 |
+-----------------------+---------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
|   INSTR_RETIRED_ANY   |  FIXC0  |   26391780 |   27555630 |   40398940 |   39661550 |   37776000 |   41807650 |   38587430 |   44776580 |   41759230 |   42211240 |    31988210 |    45117470 |    43678630 |    44675960 |    40679420 |    40229980 |    42016820 |    40494310 |    37820350 |    40984130 |    40460380 |    44748260 |    41132770 |    31308840 |    40851330 |    42629310 |    37221040 |    40382140 |    37025220 |    42376680 |    35505310 |    41337410 |
| CPU_CLK_UNHALTED_CORE |  FIXC1  |   35996690 |   38588760 |   59669280 |   51223730 |   53052250 |   52840090 |   53087420 |   57386460 |   56885390 |   54128200 |    41335000 |    57534850 |    59159950 |    55996910 |    54084730 |    51440720 |    58282140 |    51746050 |    51742580 |    53229480 |    54850200 |    55878340 |    56908520 |    39288500 |    53259380 |    53675150 |    50637850 |    52139160 |    49117240 |    53294570 |    47102640 |    53487450 |
|  CPU_CLK_UNHALTED_REF |  FIXC2  |   29570550 |   31699240 |   49016130 |   42078780 |   43576440 |   43390140 |   43601560 |   47144660 |   46738210 |   44462040 |    33950760 |    47259760 |    48595600 |    45992550 |    44420910 |    42252660 |    47869160 |    42506120 |    42502800 |    43715550 |    45054240 |    45880680 |    46745750 |    32274980 |    43752160 |    44094130 |    41599270 |    42818270 |    40341450 |    43771940 |    38686090 |    43940670 |
|      CAS_COUNT_RD     | MBOX0C0 |    3730206 |    7418379 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_WR     | MBOX0C1 |    3021638 |    7785870 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_RD     | MBOX1C0 |    2975139 |    7080849 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_WR     | MBOX1C1 |    2258303 |    7193349 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_RD     | MBOX2C0 |    2930728 |    7184927 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_WR     | MBOX2C1 |    2227178 |    7489519 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_RD     | MBOX3C0 |    3253827 |    7024592 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_WR     | MBOX3C1 |    2603471 |    7411129 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_RD     | MBOX4C0 |    2834863 |    7318420 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_WR     | MBOX4C1 |    2161552 |    7713646 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_RD     | MBOX5C0 |    2699138 |    7035477 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|      CAS_COUNT_WR     | MBOX5C1 |    2055074 |    7071022 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
+-----------------------+---------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+

+----------------------------+---------+------------+----------+----------+--------------+
|            Event           | Counter |     Sum    |    Min   |    Max   |      Avg     |
+----------------------------+---------+------------+----------+----------+--------------+
|   INSTR_RETIRED_ANY STAT   |  FIXC0  | 1263590000 | 26391780 | 45117470 | 3.948719e+07 |
| CPU_CLK_UNHALTED_CORE STAT |  FIXC1  | 1667049680 | 35996690 | 59669280 | 5.209530e+07 |
|  CPU_CLK_UNHALTED_REF STAT |  FIXC2  | 1369303250 | 29570550 | 49016130 | 4.279073e+07 |
|      CAS_COUNT_RD STAT     | MBOX0C0 |   11148585 |        0 |  7418379 |  348393.2812 |
|      CAS_COUNT_WR STAT     | MBOX0C1 |   10807508 |        0 |  7785870 |  337734.6250 |
|      CAS_COUNT_RD STAT     | MBOX1C0 |   10055988 |        0 |  7080849 |  314249.6250 |
|      CAS_COUNT_WR STAT     | MBOX1C1 |    9451652 |        0 |  7193349 |  295364.1250 |
|      CAS_COUNT_RD STAT     | MBOX2C0 |   10115655 |        0 |  7184927 |  316114.2188 |
|      CAS_COUNT_WR STAT     | MBOX2C1 |    9716697 |        0 |  7489519 |  303646.7812 |
|      CAS_COUNT_RD STAT     | MBOX3C0 |   10278419 |        0 |  7024592 |  321200.5938 |
|      CAS_COUNT_WR STAT     | MBOX3C1 |   10014600 |        0 |  7411129 |  312956.2500 |
|      CAS_COUNT_RD STAT     | MBOX4C0 |   10153283 |        0 |  7318420 |  317290.0938 |
|      CAS_COUNT_WR STAT     | MBOX4C1 |    9875198 |        0 |  7713646 |  308599.9375 |
|      CAS_COUNT_RD STAT     | MBOX5C0 |    9734615 |        0 |  7035477 |  304206.7188 |
|      CAS_COUNT_WR STAT     | MBOX5C1 |    9126096 |        0 |  7071022 |  285190.5000 |
+----------------------------+---------+------------+----------+----------+--------------+

+-----------------------------------+-------------+-------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
|               Metric              |  HWThread 0 |  HWThread 1 | HWThread 2 | HWThread 3 | HWThread 4 | HWThread 5 | HWThread 6 | HWThread 7 | HWThread 8 | HWThread 9 | HWThread 10 | HWThread 11 | HWThread 12 | HWThread 13 | HWThread 14 | HWThread 15 | HWThread 16 | HWThread 17 | HWThread 18 | HWThread 19 | HWThread 20 | HWThread 21 | HWThread 22 | HWThread 23 | HWThread 24 | HWThread 25 | HWThread 26 | HWThread 27 | HWThread 28 | HWThread 29 | HWThread 30 | HWThread 31 |
+-----------------------------------+-------------+-------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
|        Runtime (RDTSC) [s]        |      0.0104 |      0.0116 |     0.0169 |     0.0147 |     0.0155 |     0.0153 |     0.0155 |     0.0165 |     0.0168 |     0.0157 |      0.0119 |      0.0164 |      0.0176 |      0.0162 |      0.0159 |      0.0148 |      0.0171 |      0.0149 |      0.0152 |      0.0154 |      0.0160 |      0.0163 |      0.0168 |      0.0112 |      0.0155 |      0.0155 |      0.0146 |      0.0150 |      0.0143 |      0.0155 |      0.0137 |      0.0153 |
|        Runtime unhalted [s]       |      0.0157 |      0.0168 |     0.0260 |     0.0223 |     0.0231 |     0.0230 |     0.0231 |     0.0250 |     0.0248 |     0.0236 |      0.0180 |      0.0251 |      0.0258 |      0.0244 |      0.0236 |      0.0224 |      0.0254 |      0.0226 |      0.0226 |      0.0232 |      0.0239 |      0.0244 |      0.0248 |      0.0171 |      0.0232 |      0.0234 |      0.0221 |      0.0227 |      0.0214 |      0.0232 |      0.0205 |      0.0233 |
|            Clock [MHz]            |   2793.0634 |   2793.1200 |  2793.1188 |  2793.0949 |  2793.3779 |  2794.1526 |  2793.6193 |  2792.8947 |  2792.5843 |  2793.2630 |   2793.4832 |   2793.2969 |   2793.2414 |   2793.5355 |   2793.6040 |   2793.3838 |   2793.5555 |   2793.2086 |   2793.2394 |   2793.7909 |   2793.3167 |   2794.4174 |   2793.2692 |   2793.0396 |   2793.0212 |   2792.9947 |   2792.9758 |   2793.9108 |   2793.5733 |   2793.6037 |   2793.6245 |   2792.9480 |
|                CPI                |      1.3639 |      1.4004 |     1.4770 |     1.2915 |     1.4044 |     1.2639 |     1.3758 |     1.2816 |     1.3622 |     1.2823 |      1.2922 |      1.2752 |      1.3544 |      1.2534 |      1.3295 |      1.2787 |      1.3871 |      1.2779 |      1.3681 |      1.2988 |      1.3557 |      1.2487 |      1.3835 |      1.2549 |      1.3037 |      1.2591 |      1.3605 |      1.2911 |      1.3266 |      1.2576 |      1.3266 |      1.2939 |
|  Memory read bandwidth [MBytes/s] | 112925.0662 | 238401.0588 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|  Memory read data volume [GBytes] |      1.1791 |      2.7560 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
| Memory write bandwidth [MBytes/s] |  87815.3772 | 247269.3603 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
| Memory write data volume [GBytes] |      0.9169 |      2.8585 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|    Memory bandwidth [MBytes/s]    | 200740.4434 | 485670.4191 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
|    Memory data volume [GBytes]    |      2.0961 |      5.6145 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |           0 |
+-----------------------------------+-------------+-------------+------------+------------+------------+------------+------------+------------+------------+------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+

+----------------------------------------+-------------+-----------+-------------+------------+
|                 Metric                 |     Sum     |    Min    |     Max     |     Avg    |
+----------------------------------------+-------------+-----------+-------------+------------+
|        Runtime (RDTSC) [s] STAT        |      0.4840 |    0.0104 |      0.0176 |     0.0151 |
|        Runtime unhalted [s] STAT       |      0.7265 |    0.0157 |      0.0260 |     0.0227 |
|            Clock [MHz] STAT            |  89387.3230 | 2792.5843 |   2794.4174 |  2793.3538 |
|                CPI STAT                |     42.2802 |    1.2487 |      1.4770 |     1.3213 |
|  Memory read bandwidth [MBytes/s] STAT | 351326.1250 |         0 | 238401.0588 | 10978.9414 |
|  Memory read data volume [GBytes] STAT |      3.9351 |         0 |      2.7560 |     0.1230 |
| Memory write bandwidth [MBytes/s] STAT | 335084.7375 |         0 | 247269.3603 | 10471.3980 |
| Memory write data volume [GBytes] STAT |      3.7754 |         0 |      2.8585 |     0.1180 |
|    Memory bandwidth [MBytes/s] STAT    | 686410.8625 |         0 | 485670.4191 | 21450.3395 |
|    Memory data volume [GBytes] STAT    |      7.7106 |         0 |      5.6145 |     0.2410 |
+----------------------------------------+-------------+-----------+-------------+------------+

This is where I get lost. The Max (about 486 GByte/s) and Sum (about 686 GByte/s) values of the total memory bandwidth are far larger than the peak I measured with likwid-bench (under 100 GByte/s; roughly 79 GByte/s with 32 threads). What is going on here, and how should I read these numbers?
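To make the question concrete, here is a minimal sketch of how the HWThread 0 figures appear to be derived from the raw counters. It assumes that each CAS event corresponds to one 64-byte cache line and that the bandwidth is simply the data volume divided by the region runtime of that thread; both are assumptions based on my reading of the output, not something I have confirmed in the group definition. The variable and function names are my own.

```python
# Assumption: one CAS event transfers one 64-byte cache line.
CACHE_LINE_BYTES = 64

# CAS_COUNT_RD / CAS_COUNT_WR for MBOX0..MBOX5, copied from the HWThread 0 column.
cas_rd = [3730206, 2975139, 2930728, 3253827, 2834863, 2699138]
cas_wr = [3021638, 2258303, 2227178, 2603471, 2161552, 2055074]

# Runtime (RDTSC) [s] for HWThread 0, as printed (rounded to 4 digits).
runtime_s = 0.0104


def volume_gbytes(counts):
    """Data volume in GBytes for a list of CAS counts."""
    return sum(counts) * CACHE_LINE_BYTES * 1e-9


def bandwidth_mbytes_per_s(counts, seconds):
    """Bandwidth in MBytes/s, assuming volume / runtime."""
    return sum(counts) * CACHE_LINE_BYTES * 1e-6 / seconds


read_gb = volume_gbytes(cas_rd)
write_gb = volume_gbytes(cas_wr)
read_bw = bandwidth_mbytes_per_s(cas_rd, runtime_s)
write_bw = bandwidth_mbytes_per_s(cas_wr, runtime_s)

print(f"read volume  [GBytes]:   {read_gb:.4f}")
print(f"write volume [GBytes]:   {write_gb:.4f}")
print(f"read bandwidth  [MBytes/s]: {read_bw:.1f}")
print(f"write bandwidth [MBytes/s]: {write_bw:.1f}")
```

The volumes reproduce the reported 1.1791 and 0.9169 GBytes, and the bandwidths land within about half a percent of the reported values (the small difference comes from the rounded runtime). So the large figure is about 2 GBytes divided by a region runtime of roughly 10 ms; what I don't understand is whether those two quantities actually cover the same time window and the same set of threads.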
