Profiling Rust code: CPU bound? L1 data cache miss? L1 instruction cache miss?

I have a piece of Rust code running about 10x slower than I expect it to.

I am trying to figure out what is killing performance:

  • CPU bound?

  • L1 data cache miss?

  • L1 instruction cache miss?

  • ???

I'm 99.9% sure there are no memory allocations involved (no Vec / HashMap growing in size). This is blas/sgemm with pre-allocated inputs/outputs (though there are some const-sized objects allocated on the stack and moved around).
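For context, the setup is roughly of this shape (a minimal sketch with names and layout of my choosing, not the actual kernel): pre-allocated, row-major f32 slices, with no allocation inside the hot loop.

```rust
// Hypothetical sketch of the setup described above: a naive
// C += A * B over pre-allocated, row-major f32 slices.
// Nothing in the kernel allocates.
fn sgemm_naive(n: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    assert_eq!(a.len(), n * n);
    assert_eq!(b.len(), n * n);
    assert_eq!(c.len(), n * n);
    for i in 0..n {
        for k in 0..n {
            let aik = a[i * n + k];
            for j in 0..n {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}
```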

I'm looking at the Profiling chapter of The Rust Performance Book.

I have run

perf record cargo test -- blahblah
perf report

I can see the names of the Rust functions. I can see some are red / green / white.

I can see the generated x86_64 assembly code for the functions.

I even see some Rust code (which matches my source code).

=====

Is there a way to figure out if the problem is cpu-bound, l1 data cache miss, l1 instr cache miss, moving around too much data, or something else ?


Use --release with cargo test to get the bench profile instead of the test profile, similar to what you do with cargo build or cargo run.

More details in the Cargo book.

Good point. I tested under --release as well, same issue. (Not mentioned in the original post, but I had opt-level = 3 in profile.test.) Also, --release appears to strip out debug info, so perf report no longer shows source code (only asm).

However, this doesn't get us closer to figuring out whether the bottleneck is CPU bound, an L1 data miss, or an L1 instruction miss.

OpenBLAS is hitting ~20 GFLOPS. I'm only hitting ~3.5 GFLOPS. There's a factor of ~6 I'm somehow missing out on.
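On the missing source view in perf report: debug info is stripped from optimized builds by default, but it can be re-enabled without giving up optimization (a Cargo.toml sketch; `debug = true` is a standard profile setting):

```toml
# Cargo.toml: keep debug info in optimized builds so `perf report`
# can still interleave Rust source with the assembly.
[profile.release]
debug = true

[profile.bench]
debug = true
```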


What events are you measuring with perf? perf stat records a variety of interesting metrics (The perf tutorial has some more examples). I tend to like to make a small example benchmark program and mostly run it via perf stat rather than record/report but that’s just me.

When I’m trying to figure out what’s limiting my code, I look first at instructions per cycle (IPC), checking for a number higher than 2 or so; if it’s lower, I’m probably bound by IO or memory accesses (or maybe branch mispredictions?), and other perf events like cache misses may be helpful. If IPC is high (like 4), I may need to find a new algorithm (or possibly micro-optimizations may help).
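A sketch of what that looks like in practice (event names vary by CPU and kernel; these are common perf aliases, and `my_bench` is a stand-in for your benchmark binary, not a real name from this thread):

```shell
# Build once, then run the binary itself under perf stat,
# so compilation doesn't pollute the counters.
cargo build --release
perf stat -d ./target/release/my_bench

# IPC plus the specific cache events discussed above:
perf stat -e cycles,instructions,L1-dcache-load-misses,L1-icache-load-misses,LLC-load-misses \
    ./target/release/my_bench
```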

perf report can help by showing call stacks and assembly related to each type of event, which can also help show missed chances for vectorization or other weird situations. Seeing which functions are responsible for cache misses can be illuminating.

This is all pretty general advice. Brendan Gregg has a lot of specific examples.

  1. @JosephOsborn : The Brendan Gregg / perf examples list was amazingly helpful. Thanks!

  2. This is my first time using perf seriously. In the dump below, if anyone has insights (no matter how "obvious"), please feel free to share them -- I'm sure there's something I can learn.

  3. Here is a comparison of my sgemm (16.85 s) vs openblas sgemm (2.267 s).

  4. my rust sgemm 3000 3000 3000

 Performance counter stats for 'cargo test --release -- --nocapture test_001':

     2,676,118,088      branch-instructions:u     #  152.471 M/sec                    (12.93%)
        68,832,359      branch-misses:u           #    2.57% of all branches          (12.89%)
        26,915,211      cache-misses:u            #   16.029 % of all cache refs      (12.87%)
       167,913,344      cache-references:u        #    9.567 M/sec                    (12.85%)
    55,656,768,315      cpu-cycles:u              #    3.171 GHz                      (12.80%)
   128,684,270,433      instructions:u            #    2.31  insn per cycle         
                                                  #    0.15  stalled cycles per insn  (15.98%)
    51,565,563,715      ref-cycles:u              # 2937.923 M/sec                    (19.15%)
     6,408,095,506      stalled-cycles-backend:u  #   11.51% backend cycles idle      (19.11%)
    19,357,650,829      stalled-cycles-frontend:u #   34.78% frontend cycles idle     (19.05%)
                 0      alignment-faults:u        #    0.000 K/sec                  
                 0      bpf-output:u              #    0.000 K/sec                  
                 0      context-switches:u        #    0.000 K/sec                  
         17,551.44 msec cpu-clock:u               #    1.038 CPUs utilized          
               125      major-faults:u            #    0.007 K/sec                  
           196,300      minor-faults:u            #    0.011 M/sec                  
           196,425      page-faults:u             #    0.011 M/sec                  
         17,551.97 msec task-clock:u              #    1.038 CPUs utilized          
    16,903,769,167 ns   duration_time:u           #  963.084 M/sec                  
       237,838,397      L1-dcache-load-misses:u   #    0.72% of all L1-dcache accesses  (18.97%)
    33,059,252,823      L1-dcache-loads:u         # 1883.535 M/sec                    (18.94%)
       314,264,081      L1-dcache-prefetch-misses:u #   17.905 M/sec                    (12.61%)
         6,494,260      L1-dcache-prefetches:u    #    0.370 M/sec                    (12.69%)
        26,928,494      L1-dcache-store-misses:u  #    1.534 M/sec                    (12.78%)
    11,260,260,394      L1-dcache-stores:u        #  641.548 M/sec                    (12.86%)
       442,675,603      L1-icache-load-misses:u   #    1.15% of all L1-icache accesses  (12.94%)
    38,344,274,837      L1-icache-loads:u         # 2184.646 M/sec                    (13.01%)
        23,233,631      LLC-load-misses:u         #   31.13% of all LL-cache accesses  (13.04%)
        74,625,823      LLC-loads:u               #    4.252 M/sec                    (13.08%)
        39,541,636      LLC-prefetch-misses:u     #    2.253 M/sec                    (6.53%)
       103,147,580      LLC-prefetches:u          #    5.877 M/sec                    (6.53%)
         2,126,088      LLC-store-misses:u        #    0.121 M/sec                    (6.53%)
        45,444,874      LLC-stores:u              #    2.589 M/sec                    (6.52%)
     2,275,102,048      branch-load-misses:u      #  129.623 M/sec                    (9.78%)
     2,764,060,437      branch-loads:u            #  157.481 M/sec                    (13.05%)
        11,918,687      dTLB-load-misses:u        #    0.04% of all dTLB cache accesses  (13.04%)
    32,166,823,211      dTLB-loads:u              # 1832.689 M/sec                    (13.04%)
           938,857      dTLB-store-misses:u       #    0.053 M/sec                    (13.02%)
    11,103,654,921      dTLB-stores:u             #  632.625 M/sec                    (12.99%)
         7,878,654      iTLB-load-misses:u        #    0.01% of all iTLB cache accesses  (12.97%)
   127,967,701,133      iTLB-loads:u              # 7290.897 M/sec                    (12.94%)

      16.903769167 seconds time elapsed

      16.856543000 seconds user
       0.495616000 seconds sys
  5. openblas sgemm 3000 3000 3000
      218,887,666      branch-instructions:u     #   93.421 M/sec                    (12.89%)
         1,787,440      branch-misses:u           #    0.82% of all branches          (12.93%)
         1,025,840      cache-misses:u            #    7.519 % of all cache refs      (12.98%)
        13,642,814      cache-references:u        #    5.823 M/sec                    (13.02%)
     7,042,103,906      cpu-cycles:u              #    3.006 GHz                      (12.93%)
    22,466,034,229      instructions:u            #    3.19  insn per cycle         
                                                  #    0.04  stalled cycles per insn  (16.14%)
     6,976,605,870      ref-cycles:u              # 2977.604 M/sec                    (19.38%)
       208,143,339      stalled-cycles-backend:u  #    2.96% backend cycles idle      (19.38%)
       912,195,010      stalled-cycles-frontend:u #   12.95% frontend cycles idle     (19.38%)
                 0      alignment-faults:u        #    0.000 K/sec                  
                 0      bpf-output:u              #    0.000 K/sec                  
                 0      context-switches:u        #    0.000 K/sec                  
          2,343.02 msec cpu-clock:u               #    0.995 CPUs utilized          
                16      major-faults:u            #    0.007 K/sec                  
            37,063      minor-faults:u            #    0.016 M/sec                  
            37,079      page-faults:u             #    0.016 M/sec                  
          2,343.03 msec task-clock:u              #    0.995 CPUs utilized          
     2,354,273,462 ns   duration_time:u           # 1004.800 M/sec                  
       231,607,629      L1-dcache-load-misses:u   #    8.27% of all L1-dcache accesses  (19.38%)
     2,801,235,210      L1-dcache-loads:u         # 1195.563 M/sec                    (19.41%)
       238,285,642      L1-dcache-prefetch-misses:u #  101.700 M/sec                    (12.97%)
           437,527      L1-dcache-prefetches:u    #    0.187 M/sec                    (12.97%)
         1,224,641      L1-dcache-store-misses:u  #    0.523 M/sec                    (12.97%)
        44,403,556      L1-dcache-stores:u        #   18.951 M/sec                    (12.98%)
           146,924      L1-icache-load-misses:u   #    0.00% of all L1-icache accesses  (12.96%)
     5,544,982,817      L1-icache-loads:u         # 2366.589 M/sec                    (12.96%)
           871,468      LLC-load-misses:u         #    7.23% of all LL-cache accesses  (12.96%)
        12,053,057      LLC-loads:u               #    5.144 M/sec                    (12.94%)
         5,734,009      LLC-prefetch-misses:u     #    2.447 M/sec                    (6.43%)
       212,056,120      LLC-prefetches:u          #   90.505 M/sec                    (6.40%)
            52,498      LLC-store-misses:u        #    0.022 M/sec                    (6.44%)
         6,529,303      LLC-stores:u              #    2.787 M/sec                    (6.44%)
       217,806,036      branch-load-misses:u      #   92.959 M/sec                    (9.64%)
       218,876,597      branch-loads:u            #   93.416 M/sec                    (12.84%)
           630,355      dTLB-load-misses:u        #    0.02% of all dTLB cache accesses  (12.80%)
     2,815,727,344      dTLB-loads:u              # 1201.748 M/sec                    (12.80%)
            25,132      dTLB-store-misses:u       #    0.011 M/sec                    (12.80%)
        38,527,967      dTLB-stores:u             #   16.444 M/sec                    (12.80%)
            21,032      iTLB-load-misses:u        #    0.00% of all iTLB cache accesses  (12.79%)
    22,632,395,475      iTLB-loads:u              # 9659.469 M/sec                    (12.80%)

       2.354273462 seconds time elapsed

       2.267348000 seconds user
       0.049242000 seconds sys

If I'm reading this correctly:

128,684,270,433      instructions:u            #    2.31  insn per cycle    (mine)
 22,466,034,229      instructions:u            #    3.19  insn per cycle    (openblas)

If I'm executing >5x the instruction count, that alone should explain a large part of why my code is slower?

Assuming 4-wide SSE registers, a lower bound on the # of instructions should be:

3000 * 3000 * 3000 * 2 / 1000 / 1000 / 1000 / 4
13.5 billion ?

They do 22.4 billion, which is pretty close; I'm at 128.7 billion, so I'm issuing way too many instructions?
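That lower bound can be sanity-checked in a couple of lines (illustrative arithmetic only; the function names are mine): 2 FLOPs (one multiply + one add) per inner-product step, and at most 4 FLOPs per 4-wide packed SSE instruction.

```rust
// Lower bound on packed-SSE instructions for an n x n x n sgemm:
// total FLOPs = 2 * n^3, and each mulps/addps retires `lanes` FLOPs.
fn simd_instr_lower_bound(n: u64, lanes: u64) -> u64 {
    let flops = 2 * n * n * n;
    flops / lanes
}

// Throughput in GFLOPS given a wall-clock time for the same workload.
fn gflops(n: u64, seconds: f64) -> f64 {
    (2 * n * n * n) as f64 / seconds / 1e9
}
```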

Does your code have anything which may hinder auto-vectorisation? It could be that you are missing out on SIMD because there is a bounds check that can't be pulled out of a loop or something.
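A common shape of that problem (an illustrative sketch, not a claim about this particular code): indexing two slices by a shared counter can leave a bounds check inside the hot loop, while an iterator-based form lets the compiler discharge the bound once and vectorize freely.

```rust
// Indexed form: x[i] may carry a per-iteration bounds check unless
// LLVM can prove x.len() >= y.len(), which can block vectorization.
fn axpy_indexed(alpha: f32, x: &[f32], y: &mut [f32]) {
    for i in 0..y.len() {
        y[i] += alpha * x[i];
    }
}

// Iterator form: zip eliminates the per-element bounds checks,
// which usually makes auto-vectorization straightforward.
fn axpy_zipped(alpha: f32, x: &[f32], y: &mut [f32]) {
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi += alpha * xi;
    }
}
```

Comparing the two in `cargo asm` or perf's annotate view is a quick way to see whether the checks (and the scalar fallback loop) are present.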

I got it down to 37.6B instructions:


     1,446,052,975      branch-instructions:u     #  235.778 M/sec                    (12.86%)
        11,787,247      branch-misses:u           #    0.82% of all branches          (12.88%)
        17,822,449      cache-misses:u            #   42.093 % of all cache refs      (12.88%)
        42,341,125      cache-references:u        #    6.904 M/sec                    (12.91%)
    18,469,630,318      cpu-cycles:u              #    3.011 GHz                      (12.99%)
    37,559,336,652      instructions:u            #    2.03  insn per cycle         
                                                  #    0.25  stalled cycles per insn  (16.27%)
    17,350,946,134      ref-cycles:u              # 2829.060 M/sec                    (19.56%)
     4,650,037,949      stalled-cycles-backend:u  #   25.18% backend cycles idle      (19.59%)
     9,450,246,769      stalled-cycles-frontend:u #   51.17% frontend cycles idle     (19.60%)
                 0      alignment-faults:u        #    0.000 K/sec                  
                 0      bpf-output:u              #    0.000 K/sec                  
                 0      context-switches:u        #    0.000 K/sec                  
          6,133.10 msec cpu-clock:u               #    0.999 CPUs utilized          
                 0      major-faults:u            #    0.000 K/sec                  
           122,844      minor-faults:u            #    0.020 M/sec                  
           122,844      page-faults:u             #    0.020 M/sec                  
          6,133.13 msec task-clock:u              #    0.999 CPUs utilized          
     6,137,839,308 ns   duration_time:u           # 1000.770 M/sec                  
       134,377,834      L1-dcache-load-misses:u   #    1.10% of all L1-dcache accesses  (19.59%)
    12,174,827,662      L1-dcache-loads:u         # 1985.097 M/sec                    (19.51%)
       316,753,907      L1-dcache-prefetch-misses:u #   51.647 M/sec                    (12.91%)
           915,558      L1-dcache-prefetches:u    #    0.149 M/sec                    (12.88%)
         9,018,038      L1-dcache-store-misses:u  #    1.470 M/sec                    (12.87%)
     3,206,034,971      L1-dcache-stores:u        #  522.742 M/sec                    (12.86%)
        60,996,287      L1-icache-load-misses:u   #    0.59% of all L1-icache accesses  (12.85%)
    10,371,690,334      L1-icache-loads:u         # 1691.097 M/sec                    (12.85%)
        15,493,301      LLC-load-misses:u         #   58.12% of all LL-cache accesses  (12.85%)
        26,655,844      LLC-loads:u               #    4.346 M/sec                    (12.85%)
        42,925,831      LLC-prefetch-misses:u     #    6.999 M/sec                    (6.42%)
        98,418,713      LLC-prefetches:u          #   16.047 M/sec                    (6.42%)
         1,580,115      LLC-store-misses:u        #    0.258 M/sec                    (6.42%)
        18,946,611      LLC-stores:u              #    3.089 M/sec                    (6.42%)
     1,250,061,895      branch-load-misses:u      #  203.822 M/sec                    (9.64%)
     1,514,701,822      branch-loads:u            #  246.971 M/sec                    (12.86%)
         2,767,282      dTLB-load-misses:u        #    0.02% of all dTLB cache accesses  (12.86%)
    12,334,413,397      dTLB-loads:u              # 2011.118 M/sec                    (12.85%)
           145,344      dTLB-store-misses:u       #    0.024 M/sec                    (12.86%)
     3,103,502,436      dTLB-stores:u             #  506.024 M/sec                    (12.85%)
           530,157      iTLB-load-misses:u        #    0.00% of all iTLB cache accesses  (12.87%)
    38,267,760,941      iTLB-loads:u              # 6239.532 M/sec                    (12.87%)

       6.137839308 seconds time elapsed

       5.753397000 seconds user

At this point, I think we are within 3x of openblas.

Here is the code it currently generates for C = A * B, where A, B, C are all 4x4 f32 matrices:

       │     Disassembly of section .text:
       │
       │     00000000000135a0 <x_tensor::blas::mat4f::Mat4f::mat4_sgemm>:
       │     x_tensor::fixed::pt4::Pt4f32::mut_s_add:
       │
       │     impl Pt4f32 {
       │     #[inline(always)]
       │     pub fn mut_s_add(&mut self, alpha: f32, rhs: &Pt4f32) {
       │     let d = &mut self.0;
       │     d[0] = d[0] + rhs.0[0] * alpha;
  0.74 │       movss  (%rsi),%xmm4
  0.03 │       movss  0x4(%rsi),%xmm5
  0.05 │       movss  0x8(%rsi),%xmm6
  2.96 │       movss  0xc(%rsi),%xmm7
  0.01 │       shufps $0x0,%xmm4,%xmm4
  9.84 │       movaps (%rdx),%xmm3
  0.25 │       movaps 0x10(%rdx),%xmm2
  3.08 │       movaps 0x20(%rdx),%xmm1
  1.94 │       movaps 0x30(%rdx),%xmm0
  6.12 │       mulps  %xmm3,%xmm4
  7.32 │       addps  (%rdi),%xmm4
  0.37 │       shufps $0x0,%xmm5,%xmm5
  0.02 │       mulps  %xmm2,%xmm5
  7.30 │       addps  %xmm4,%xmm5
       │       shufps $0x0,%xmm6,%xmm6
  0.45 │       mulps  %xmm1,%xmm6
  7.43 │       addps  %xmm5,%xmm6
       │       shufps $0x0,%xmm7,%xmm7
  0.02 │       mulps  %xmm0,%xmm7
  8.21 │       addps  %xmm6,%xmm7
  2.88 │       movaps %xmm7,(%rdi)
       │       movss  0x10(%rsi),%xmm4
       │       movss  0x14(%rsi),%xmm5
  0.04 │       movss  0x18(%rsi),%xmm6
  2.76 │       movss  0x1c(%rsi),%xmm7
       │       shufps $0x0,%xmm4,%xmm4
  0.04 │       mulps  %xmm3,%xmm4
  0.17 │       addps  0x10(%rdi),%xmm4
  2.63 │       shufps $0x0,%xmm5,%xmm5
  0.01 │       mulps  %xmm2,%xmm5
  0.16 │       addps  %xmm4,%xmm5
  0.04 │       shufps $0x0,%xmm6,%xmm6
  2.73 │       mulps  %xmm1,%xmm6
  0.22 │       addps  %xmm5,%xmm6
       │       shufps $0x0,%xmm7,%xmm7
  0.03 │       mulps  %xmm0,%xmm7
  3.63 │       addps  %xmm6,%xmm7
  0.41 │       movaps %xmm7,0x10(%rdi)
  0.01 │       movss  0x20(%rsi),%xmm4
  0.01 │       movss  0x24(%rsi),%xmm5
  2.30 │       movss  0x28(%rsi),%xmm6
  0.40 │       movss  0x2c(%rsi),%xmm7
       │       shufps $0x0,%xmm4,%xmm4
  0.02 │       mulps  %xmm3,%xmm4
  2.28 │       addps  0x20(%rdi),%xmm4
  0.43 │       shufps $0x0,%xmm5,%xmm5
  0.02 │       mulps  %xmm2,%xmm5
  0.06 │       addps  %xmm4,%xmm5
  2.32 │       shufps $0x0,%xmm6,%xmm6
  0.38 │       mulps  %xmm1,%xmm6
  0.20 │       addps  %xmm5,%xmm6
  0.02 │       shufps $0x0,%xmm7,%xmm7
  2.47 │       mulps  %xmm0,%xmm7
  0.80 │       addps  %xmm6,%xmm7
  0.34 │       movaps %xmm7,0x20(%rdi)
       │       movss  0x30(%rsi),%xmm4
  2.46 │       movss  0x34(%rsi),%xmm5
  0.24 │       movss  0x38(%rsi),%xmm6
  0.33 │       movss  0x3c(%rsi),%xmm7
  0.01 │       shufps $0x0,%xmm4,%xmm4
  2.37 │       mulps  %xmm3,%xmm4
  0.29 │       addps  0x30(%rdi),%xmm4
  0.39 │       shufps $0x0,%xmm5,%xmm5
  0.01 │       mulps  %xmm2,%xmm5
  2.32 │       addps  %xmm4,%xmm5
  0.22 │       shufps $0x0,%xmm6,%xmm6
  0.39 │       mulps  %xmm1,%xmm6                                                                                                                                                                                                                ▒
  0.16 │       addps  %xmm5,%xmm6                                                                                                                                                                                                                ▒
  2.26 │       shufps $0x0,%xmm7,%xmm7                                                                                                                                                                                                           ▒
  0.17 │       mulps  %xmm0,%xmm7                                                                                                                                                                                                                ▒
  0.97 │       addps  %xmm6,%xmm7                                                                                                                                                                                                                ▒
  0.42 │       movaps %xmm7,0x30(%rdi)                                                                                                                                                                                                           ▒
       │     x_tensor::blas::mat4f::Mat4f::mat4_sgemm:                                                                                                                                                                                           ▒
       │     out[3].mut_s_add(d3[2], &rr2);                                                                                                                                                                                                      ▒
       │     out[3].mut_s_add(d3[3], &rr3);                                                                                                                                                                                                      ▒
       │                                                                                                                                                                                                                                         ▒
       │     /*                                                                                                                                                                                                                                  ▒
       │                                                                                                                                                                                                                                         ▒
       │     */}                                                                                                                                                                                                                                 ▒
  3.08 │     ← retq                                                                                                                                                                                                                              ▒
                                                                                                                                                                                                                                                 ▒
                              

I believe the xmm registers imply SSE. This CPU does not have AVX instructions, so AFAIK the best we can do is 4 f32s at once.
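The asm above is the classic broadcast-multiply-accumulate pattern. In plain Rust, one step of it corresponds to something like the sketch below (the function name and 4-wide fixed-size rows are my own illustration, not the actual kernel); LLVM typically lowers this loop to exactly the shufps (broadcast) + mulps + addps sequence shown, 4 f32 lanes per instruction:

```rust
// One step of a 4-wide sgemm micro-kernel: broadcast one scalar from B,
// multiply it into a 4-lane row of A, accumulate into the output row.
// On an SSE-only target this compiles to shufps + mulps + addps.
fn fma_row(out: &mut [f32; 4], a_row: &[f32; 4], b_scalar: f32) {
    for i in 0..4 {
        out[i] += a_row[i] * b_scalar; // 4 lanes per vector op
    }
}

fn main() {
    let mut out = [1.0f32; 4];
    let a_row = [1.0, 2.0, 3.0, 4.0];
    fma_row(&mut out, &a_row, 2.0);
    assert_eq!(out, [3.0, 5.0, 7.0, 9.0]);
    println!("{:?}", out);
}
```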

You seem to have many more cache misses than OpenBLAS, maybe investigate that next? That could be keeping your IPC too low. Does it use a different data layout for differently sized matrices or something?

openblas:

        1,025,840      cache-misses:u            #    7.519 % of all cache refs      (12.98%)
        13,642,814      cache-references:u        #    5.823 M/sec                    (13.02%)
       231,607,629      L1-dcache-load-misses:u   #    8.27% of all L1-dcache accesses  (19.38%)
     2,801,235,210      L1-dcache-loads:u         # 1195.563 M/sec                    (19.41%)
       238,285,642      L1-dcache-prefetch-misses:u #  101.700 M/sec                    (12.97%)
           437,527      L1-dcache-prefetches:u    #    0.187 M/sec                    (12.97%)
         1,224,641      L1-dcache-store-misses:u  #    0.523 M/sec                    (12.97%)
        44,403,556      L1-dcache-stores:u        #   18.951 M/sec                    (12.98%)
           146,924      L1-icache-load-misses:u   #    0.00% of all L1-icache accesses  (12.96%)
     5,544,982,817      L1-icache-loads:u         # 2366.589 M/sec                    (12.96%)
           871,468      LLC-load-misses:u         #    7.23% of all LL-cache accesses  (12.96%)
        12,053,057      LLC-loads:u               #    5.144 M/sec                    (12.94%)
         5,734,009      LLC-prefetch-misses:u     #    2.447 M/sec                    (6.43%)
       212,056,120      LLC-prefetches:u          #   90.505 M/sec                    (6.40%)
            52,498      LLC-store-misses:u        #    0.022 M/sec                    (6.44%)
         6,529,303      LLC-stores:u              #    2.787 M/sec                    (6.44%)

mine:

        17,822,449      cache-misses:u            #   42.093 % of all cache refs      (12.88%)
        42,341,125      cache-references:u        #    6.904 M/sec                    (12.91%)
    18,469,630,318      cpu-cycles:u              #    3.011 GHz                      (12.99%)
       134,377,834      L1-dcache-load-misses:u   #    1.10% of all L1-dcache accesses  (19.59%)
    12,174,827,662      L1-dcache-loads:u         # 1985.097 M/sec                    (19.51%)
       316,753,907      L1-dcache-prefetch-misses:u #   51.647 M/sec                    (12.91%)
           915,558      L1-dcache-prefetches:u    #    0.149 M/sec                    (12.88%)
         9,018,038      L1-dcache-store-misses:u  #    1.470 M/sec                    (12.87%)
     3,206,034,971      L1-dcache-stores:u        #  522.742 M/sec                    (12.86%)
        60,996,287      L1-icache-load-misses:u   #    0.59% of all L1-icache accesses  (12.85%)
    10,371,690,334      L1-icache-loads:u         # 1691.097 M/sec                    (12.85%)
        15,493,301      LLC-load-misses:u         #   58.12% of all LL-cache accesses  (12.85%)
        26,655,844      LLC-loads:u               #    4.346 M/sec                    (12.85%)
        42,925,831      LLC-prefetch-misses:u     #    6.999 M/sec                    (6.42%)
        98,418,713      LLC-prefetches:u          #   16.047 M/sec                    (6.42%)
         1,580,115      LLC-store-misses:u        #    0.258 M/sec                    (6.42%)
        18,946,611      LLC-stores:u              #    3.089 M/sec                    (6.42%)
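As a sanity check on how to read these listings, the percentages perf prints are just ratios of the raw counts on the same line. A quick sketch using the "mine" numbers above:

```rust
fn main() {
    // cache-misses as a fraction of cache-references (the reported 42.093%)
    let llc_pct = 17_822_449f64 / 42_341_125f64 * 100.0;
    assert!((llc_pct - 42.093).abs() < 0.01);

    // L1-dcache-load-misses as a fraction of L1-dcache-loads (the 1.10%)
    let l1_pct = 134_377_834f64 / 12_174_827_662f64 * 100.0;
    assert!((l1_pct - 1.10).abs() < 0.01);

    println!("cache-miss rate: {:.2}%, L1 load miss rate: {:.2}%", llc_pct, l1_pct);
}
```

Note the two percentages have very different denominators: 42 billion L1 loads/stores vs. 42 million cache-references.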
 
  1. The data layout is almost certainly different. I skimmed some of the openblas papers. They have a "template" which they then optimize. I don't know what their actual layout is.

  2. I don't know how to read cache misses. "cache-misses" claims I'm missing 42% of cache refs? (wtf???), while L1-dcache-load-misses is only 1.10% (which sounds reasonable)

Consider C = A * B where all matrices are 32x32

Each matrix takes 32 * 32 * 4 bytes = 4kb. Storing all 3 matrices takes 12 kb.

The CPU has a cache size of 32kb.

When doing a 32x32 matrix multiply, we should have near-zero cache misses when reading from L1 into registers. I am willing to accept the 1.10% rate (paid when reading in a new matrix).

I have no idea what the "42% cache-miss" measures.
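The working-set arithmetic above can be spelled out as a quick check (32 KB L1 is the figure quoted, which may vary by CPU):

```rust
fn main() {
    let mat_bytes = 32 * 32 * 4;     // one 32x32 f32 matrix
    let working_set = 3 * mat_bytes; // A, B, C tiles together
    let l1 = 32 * 1024;

    assert_eq!(mat_bytes, 4096);          // 4 KB per matrix
    assert_eq!(working_set, 12 * 1024);   // 12 KB total
    assert!(working_set < l1);            // all three tiles fit in L1
    println!("{} B per tile, {} B working set", mat_bytes, working_set);
}
```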

It looks like misses from your lower level cache, i.e. cases where you can’t avoid hitting ram for whatever reason. This could mean the predictor is not doing a good job (what’s your access pattern like?) or that you are pulling in too much data from too many places and blowing your cache lines?

Here is a place where perf record/report can help, since it will show roughly where the misses are happening.

The gist of it is as follows:

Suppose we have a special kernel for doing 32x32 matrix multiply.

Suppose our input matrices are all 3000x3000.


for y in 0 .. floor((3000 + 31) / 32) {
  for x in 0 .. floor((3000 + 31) / 32) {
    let mut out: Mat32 = ...; // EDIT: added mut
    for k in 0 .. floor((3000 + 31) / 32) {
      let a_32: &Mat32 = ... ;
      let b_32: &Mat32 = ... ;
      out.sgemm(a_32, b_32); // KERNEL
    }
    merge 'out' into C, via alpha, beta
  }
}

Mat32 is a 32x32 array of f32s, aligned to 4096 bytes.

Each of A, B, C are 3000x3000 * 4 bytes = 36 MB.

L1 cache is 32kb.
L3 cache is 12MB.

My expectation is that each KERNEL execution pays a small price to pull the 4 KB 32x32 matrices into cache (I am NOT using any prefetch hints), but once that price is paid, the KERNEL execution itself should be cache-miss free.

I still struggle to understand this 42% cache miss.
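For reference, here is a minimal runnable version of this blocking scheme, with toy dimensions and a naive triple loop standing in for the optimized 32x32 kernel (`Mat32`, alpha/beta handling, and alignment are elided; this is a sketch of the loop structure, not the actual code):

```rust
// Blocked matrix multiply: C += A * B, all n x n row-major, n % bs == 0.
// Each (y, x) output tile is accumulated locally across all k tiles,
// then merged into C exactly once -- the structure of the loop above.
fn blocked_matmul(c: &mut [f32], a: &[f32], b: &[f32], n: usize, bs: usize) {
    for y in (0..n).step_by(bs) {
        for x in (0..n).step_by(bs) {
            let mut out = vec![0.0f32; bs * bs]; // per-tile accumulator
            for k in (0..n).step_by(bs) {
                // naive bs x bs kernel standing in for the tuned sgemm
                for i in 0..bs {
                    for kk in 0..bs {
                        let aval = a[(y + i) * n + (k + kk)];
                        for j in 0..bs {
                            out[i * bs + j] += aval * b[(k + kk) * n + (x + j)];
                        }
                    }
                }
            }
            // merge the tile into C exactly once per (y, x)
            for i in 0..bs {
                for j in 0..bs {
                    c[(y + i) * n + (x + j)] += out[i * bs + j];
                }
            }
        }
    }
}

fn main() {
    let n = 8;
    let a: Vec<f32> = (0..n * n).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..n * n).map(|i| (i % 7) as f32).collect();
    let mut c_blocked = vec![0.0f32; n * n];
    blocked_matmul(&mut c_blocked, &a, &b, n, 4);

    // check against the straightforward triple loop
    let mut c_naive = vec![0.0f32; n * n];
    for i in 0..n {
        for k in 0..n {
            for j in 0..n {
                c_naive[i * n + j] += a[i * n + k] * b[k * n + j];
            }
        }
    }
    assert_eq!(c_blocked, c_naive);
    println!("blocked result matches naive");
}
```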

Where does perf report say your misses are happening? Is it happening in the kernel or in the loops? You can cycle through the different events to get callstacks/hot spots for each type of event.

I can’t guess where the misses are happening and I don’t know this application, but merging operations can often be unpleasantly random.

Aside: out is declared with let, but then a method is called on it for side effects. Does it use interior mutability?

Finally, if you have a 32x32 slice of a larger matrix, each 32*4 byte row will probably use its own cache line, even if your lines are larger than 128 bytes. So most of the loaded data to construct that small matrix will be discarded if your cache line is say 512 bytes. I hear that you’re working on 32x32 matrices with their own storage so that may be why your L1 misses are staying nice and low. The percentages here are as a proportion of L1 accesses vs as a proportion of LLC accesses, so that’s why the figures are on way different scales.

1 Like
  1. You're right, 'out' now has 'mut' added to it.

  2. Let me get back to you on 'perf / cache misses'; this is clearly a very powerful tool, and I need to spend a bit more time understanding the GUI / computational model of perf.

You seem to be doing an order of magnitude more stores than openblas - this is likely the source of your issues.

Most of the other numeric differences don't make a notable difference in the rate of operations.

1 Like

I don't suppose OpenBLAS does anything tricky like using alternative algorithms for matrix multiplication? One of the earliest lessons I learnt about optimising code is that you'll get better gains from finding a new, better way to achieve something; you can only go so far by tightening the screws on a single algorithm.

That said, I hear that OpenBLAS does that too. Architecture-specific assembly optimisations tuned by hand apparently. No small bar to get over, if true.

I got my_rust_sgemm time down to 4.11s, which is:

(1) within 2x of openblas
(2) a 4x improvement over the original

@squeakyneb : "algorithms" is an interesting word in the context of sgemm -- I think the practical ones are all O(n^3) (ignoring Strassen-style approaches), so it ends up being "constant-size tweaks": block/band size, loop order, optimizing for L1 cache / registers, and so on. Clearly openblas is doing something more clever than me here, but I think it's more on the 'tactic' level than the 'strategic' level.
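The 4.11 s figure is consistent with the earlier GFLOPS numbers, assuming (my assumption, not stated above) that it covers a single 3000x3000 multiply at 2n^3 flops:

```rust
fn main() {
    // Assuming 4.11 s is one 3000x3000 sgemm: 2 * n^3 flops.
    let n = 3000f64;
    let flops = 2.0 * n * n * n;     // 5.4e10
    let gflops = flops / 4.11 / 1e9; // ~13.1 GFLOP/s

    assert!((gflops - 13.14).abs() < 0.05);
    assert!(20.0 / gflops < 2.0); // within 2x of openblas' ~20 GFLOP/s
    println!("{:.1} GFLOP/s", gflops);
}
```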

@riking: Even in my latest 4.11s, within 2x run, I do:

     8,859,470,068      L1-dcache-loads:u // mine
       973,136,351      L1-dcache-stores:u // mine

     2,801,235,210      L1-dcache-loads:u // openblas
        44,403,556      L1-dcache-stores:u // openblas

So I'm somehow doing 3x the reads and 20x the writes.

Considering that these are 3000x3000 matrices, just writing out C should take 9,000,000 stores. Openblas is within a factor of ~5 of theoretical opt, whereas I am doing ~100x.

I think the issue here is that I decided to do an optimized 32x32 = 32x32 by 32x32 sgemm kernel. As a result of this 'optimized 32x32 block', I end up writing to every 32x32 block a total of (3000 / 32 ~= 100) times. This might explain why I'm doing 100x the writes of the theoretical opt, or maybe the numbers just happen to match up.
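The estimate does roughly match the measured counter (a sketch; assumes one f32 store per element each time a block is written back):

```rust
fn main() {
    let n = 3000usize;
    let bs = 32usize;
    let k_blocks = (n + bs - 1) / bs; // passes over each C block
    let ideal_stores = n * n;         // write C exactly once
    let mine_est = n * n * k_blocks;  // write every block once per k pass

    assert_eq!(ideal_stores, 9_000_000);
    assert_eq!(k_blocks, 94);
    // measured ~973M stores vs ~846M estimated: same order of magnitude
    assert!((mine_est as f64 / 973_136_351.0 - 1.0).abs() < 0.2);
    println!("k_blocks = {}, estimated stores = {}", k_blocks, mine_est);
}
```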

@JosephOsborn : Thanks for all your help with learning to use perf. This was probably the single most valuable part of the experiment. Nearly all gains came from running perf report, looking at the generated assembly, and figuring out wtf was wrong.

For those who want to get within 23% (instead of merely 2x) of openblas, checkout gemm: a rabbit hole by @bluss which was tremendously helpful.

2 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.