Unable to exploit all the cores, suggestions?

The locks are likely from mpsc. The profile is probably showing all threads, many of them inactive, which makes it hard to split apart.
"start_thread" suggests you're not using a thread pool.
"std::sys_common::backtrace::__rust_begin_short_backtrace" may be of concern.
"std::panicking::try::do_call::" is nothing to worry about; it's just how a thread starts, with the closure inlined.

It's best to view this in a GUI so you can see the call tree. Unfortunately, the Hotspot one isn't that good at splitting threads.
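
On the thread-pool point: if each request currently spawns its own thread, a small fixed pool of workers fed by a channel is the usual alternative. A rough sketch (all names here are made up, not taken from your module):

```rust
// Minimal fixed-size worker pool fed by an mpsc channel, instead of spawning
// a new thread per request. std::sync::mpsc receivers are single-consumer,
// so the receiver is shared behind a Mutex (the pattern used in the Rust book);
// a multi-consumer channel crate would also work.
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<String>();
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..4)
        .map(|id| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Take one job at a time; the guard is dropped before handling.
                let job = match rx.lock().unwrap().recv() {
                    Ok(job) => job,
                    Err(_) => break, // all senders dropped: shut down
                };
                println!("worker {} handling {}", id, job);
            })
        })
        .collect();

    for i in 0..10 {
        tx.send(format!("request {}", i)).unwrap();
    }
    drop(tx); // close the channel so the workers exit

    for w in workers {
        w.join().unwrap();
    }
}
```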


The futex usage can also be from Redis module blocking-client management. From a cursory glance at https://github.com/antirez/redis/blob/unstable/src/module.c (towards the bottom of the file), there's a single mutex that Redis uses to unblock clients. That will be a point of contention if the module threads respond quickly (i.e. handling a request is quick).
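
As a toy illustration of that bottleneck (this is not Redis' code, just a model of the shape of the problem): if a handful of workers each do only trivial work and then all funnel through one shared lock, they spend most of their time parked on that lock's futex.

```rust
// Toy model of the contention pattern described above (illustrative only):
// many worker threads finish their request quickly and then all serialize on
// a single shared lock, a stand-in for the one mutex guarding Redis' list of
// clients waiting to be unblocked.
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let unblocked: Arc<Mutex<Vec<u64>>> = Arc::new(Mutex::new(Vec::new()));

    let handles: Vec<_> = (0..8u64)
        .map(|worker| {
            let unblocked = Arc::clone(&unblocked);
            thread::spawn(move || {
                for req in 0..100_000u64 {
                    // "Handling the request" is cheap...
                    let result = worker * 1_000_000 + req;
                    // ...so most of the wall time ends up being spent waiting
                    // here, which the kernel reports as futex time.
                    unblocked.lock().unwrap().push(result);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    println!("total unblocked: {}", unblocked.lock().unwrap().len());
}
```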

@siscia, see if you can get perf to record the call graphs - that would show who’s calling into these hot functions.


Hi All,

Some more updates.

I wasn't quite able to get understandable call graphs as @vitalyd suggested. Instead, I used strace to get a pulse of where I was spending time, at least in kernel mode, and it turns out, quite unsurprisingly, that most of it was spent in futex calls:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.93  170.146626         262    648425     76214 futex
  0.05    0.088029           0    305498           write
  0.02    0.027358           0    169171       764 read
  0.00    0.001294           1      1053           mprotect
  0.00    0.000293           0       764           epoll_wait
  0.00    0.000167           0       381           open
  0.00    0.000029           0       381           close
  0.00    0.000000           0         4           mmap
  0.00    0.000000           0         4           sched_yield
------ ----------- ----------- --------- --------- ----------------
100.00  170.263796               1125681     76978 total

If my understanding of strace is correct, it is telling us that practically all (99.93%) of the time spent in kernel mode is due to futex-related calls.

The next step was to understand where those calls happen.

I tried a standard perf run, which should sample CPU cycles, but I wasn't really able to understand what was going on.

Then, instead of the classic cycles event, I tried sampling context switches, under the assumption that a context switch is more likely if the thread is waiting (correct me if I am wrong!).

The results are as follows:

Samples: 538K of event 'cs', Event count (approx.): 1817778
  Children      Self       Samples  Shared Object       Symbol
-   99,98%    99,98%        128842  [kernel]            [k] schedule
   - schedule
      - 414,41% futex_wait_queue_me
           futex_wait
           do_futex
           sys_futex
         - entry_SYSCALL_64_fastpath
            - 233,79% __lll_lock_wait
               - pthread_mutex_lock
                  + 93,61% sqlite3Malloc.part.6
                  + 79,27% sqlite3_free
                  + 58,12% RM_UnblockClient
                  + 2,65% moduleHandleBlockedClients
              173,92% pthread_cond_wait@@GLIBC_2.3.2
      + 2,70% exit_to_usermode_loop
+   11,69%     0,00%             0  [kernel]            [k] entry_SYSCALL_64_fastpath
+   11,69%     0,00%             0  [kernel]            [k] sys_futex
+   11,69%     0,00%             0  [kernel]            [k] do_futex
+   11,69%     0,00%             0  [kernel]            [k] futex_wait
+   11,69%     0,00%             0  [kernel]            [k] futex_wait_queue_me
+    6,70%     0,00%             0  libpthread-2.23.so  [.] __lll_lock_wait
+    6,69%     0,00%             0  libpthread-2.23.so  [.] pthread_mutex_lock
+    4,97%     0,00%             0  libredis_sql.so     [.] _$LT$redis_sql..community_statement..Statement$LT$$u27$a$GT$$u20$as$u20$redis_sql..sqlite
+    4,97%     0,00%             0  libredis_sql.so     [.] sqlite3_step
+    4,97%     0,00%             0  libredis_sql.so     [.] sqlite3VdbeExec
+    4,81%     0,00%             0  libpthread-2.23.so  [.] pthread_cond_wait@@GLIBC_2.3.2
+    2,67%     0,00%             0  libredis_sql.so     [.] sqlite3BtreeInsert
+    2,66%     0,00%             0  libredis_sql.so     [.] pager_write
+    2,66%     0,00%             0  libredis_sql.so     [.] sqlite3Malloc.part.6
+    2,63%     0,00%             0  libredis_sql.so     [.] insertCell
+    2,29%     0,00%             0  libredis_sql.so     [.] sqlite3VdbeHalt
+    2,29%     0,00%             0  libredis_sql.so     [.] sqlite3BtreeCommitPhaseTwo.part.456
+    2,29%     0,00%             0  libredis_sql.so     [.] pager_end_transaction
+    2,29%     0,00%             0  libredis_sql.so     [.] sqlite3_free
+    2,02%     0,00%             0  libredis_sql.so     [.] memjrnlWrite
+    1,94%     0,00%             0  libredis_sql.so     [.] memjrnlClose
+    1,69%     0,00%             0  libredis_sql.so     [.] redis_sql::redis::listen_and_execute::h1ab708507bc22d52

It seems to me, but again, please correct me if you see anything wrong, that most of the time is spent waiting on the locks taken inside sqlite3Malloc, sqlite3_free and RM_UnblockClient (RM = RedisModule).

If this is really the case, I don't know what to do.

If malloc is the serialization point, I don't believe I can do much more than this.
(Maybe use a different process? But that seems quite extreme...)

Thoughts?


I figured out how to remove the locks that wrap the SQLite malloc/free functions.
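
For reference, one way to remove that lock is to disable SQLite's memory-status tracking before SQLite is initialized (the same effect is available at compile time with -DSQLITE_DEFAULT_MEMSTATUS=0). A rough sketch with a hand-written binding, assuming you link against SQLite yourself; a real module would go through its sqlite3 bindings instead:

```rust
// Sketch: turn off SQLite's memory statistics, which removes the global mutex
// taken inside sqlite3Malloc/sqlite3_free. This must run before any other
// SQLite call (before sqlite3_initialize / the first connection is opened),
// otherwise sqlite3_config returns SQLITE_MISUSE.
use std::os::raw::c_int;

const SQLITE_CONFIG_MEMSTATUS: c_int = 9; // value from sqlite3.h
const SQLITE_OK: c_int = 0;

extern "C" {
    // Variadic in the real header: int sqlite3_config(int, ...);
    fn sqlite3_config(op: c_int, ...) -> c_int;
}

pub fn disable_sqlite_memstatus() -> Result<(), c_int> {
    let rc = unsafe { sqlite3_config(SQLITE_CONFIG_MEMSTATUS, 0 as c_int) };
    if rc == SQLITE_OK { Ok(()) } else { Err(rc) }
}
```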

I was able to get ~4k ops/sec, but nothing more.

There is still something that I am missing...


So what if you don't block the clients? Keep the handoff work (i.e. the mpsc channel to the worker thread), but don't block the client (return some dummy data or something) for testing purposes.

That global blocked-client mutex in Redis seems like it won't scale well if the worker threads complete quickly. If a thread goes to sleep because it's contending on the lock, performance will really suffer.


I need to block the clients in order to free the main Redis thread to receive other commands.

Without blocking I would be completely single-threaded, which does yield better performance overall, but it is not what I am trying to do...

It could be worth trying a PR that changes the internals of Redis to use some data structure that does not block so much.

However, that seems like quite a lot of work, in raw C, that I am not so confident writing...

An idea could be to use only one thread to enqueue the data and call RM_UnblockClient; however, that still limits me to a single thread for enqueuing everything.

The data would flow from the main thread to the worker threads and finally to the "enqueuer thread"...

Not sure if it is a great idea, but all the work on the lock would happen on the same thread, and this should provide quite an interesting performance gain... I guess...
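
To make the idea concrete, a rough sketch of the shape I have in mind (BlockedClient, QueryResult and unblock_client are placeholders; the real code would call RM_UnblockClient through FFI):

```rust
// Worker threads never touch the Redis unblock lock; they only send
// (client handle, result) over a channel, and a single dedicated "enqueuer"
// thread calls the unblock function for all of them.
use std::sync::mpsc;
use std::thread;

struct BlockedClient(usize); // stand-in for the real blocked-client handle
struct QueryResult(String);  // stand-in for the real result type

fn unblock_client(client: BlockedClient, result: QueryResult) {
    // Placeholder: the real code would call RM_UnblockClient here,
    // passing the result along as privdata.
    println!("unblocking client {} with {}", client.0, result.0);
}

fn main() {
    let (done_tx, done_rx) = mpsc::channel::<(BlockedClient, QueryResult)>();

    // The only thread that ever reaches Redis' unblock path.
    let unblocker = thread::spawn(move || {
        for (client, result) in done_rx {
            unblock_client(client, result);
        }
    });

    // Worker threads do the SQLite work and just hand the result off.
    let workers: Vec<_> = (0..4usize)
        .map(|id| {
            let done_tx = done_tx.clone();
            thread::spawn(move || {
                for req in 0..3usize {
                    let result = QueryResult(format!("worker {} request {}", id, req));
                    done_tx.send((BlockedClient(id * 100 + req), result)).unwrap();
                }
            })
        })
        .collect();

    drop(done_tx); // let the unblocker exit once all workers are done
    for w in workers {
        w.join().unwrap();
    }
    unblocker.join().unwrap();
}
```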

Thoughts?


Yeah, that's what I'd try - you need to eliminate contention on that Redis mutex, so it has to be used by only a single thread. I wouldn't worry about enqueueing from a single thread - that should be pretty quick.

It'd be nice to try an experiment which doesn't do what you actually want, but that would indicate whether the above is even worth pursuing.

This is a classic concurrency issue - you need to ensure that concurrency costs (i.e. cross-thread communication and synchronization overhead) are amortized if you cannot eliminate them entirely. That usually means some form of batching. Can you use pipelining to shove more work to the module workers?
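
As a sketch of what that amortization could look like on the consuming side (placeholder names again, not your actual API): block for the first finished item, drain whatever else is already queued, and pay the expensive serialized step once per batch instead of once per item.

```rust
// Batching consumer: wait for one item, then opportunistically drain the
// channel and handle the whole batch in one go. `finish_batch` stands in for
// whatever costly serialized step exists (e.g. taking the unblock lock once
// for N clients instead of N times).
use std::sync::mpsc::{channel, Receiver, TryRecvError};
use std::thread;
use std::time::Duration;

fn finish_batch(batch: &[u64]) {
    println!("finishing batch of {} items", batch.len());
}

fn drain_loop(rx: Receiver<u64>) {
    loop {
        // Wait for at least one item; exit once all senders are gone.
        let first = match rx.recv() {
            Ok(item) => item,
            Err(_) => return,
        };
        let mut batch = vec![first];
        // Grab everything else that is already queued, without blocking.
        loop {
            match rx.try_recv() {
                Ok(item) => batch.push(item),
                Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
            }
        }
        finish_batch(&batch);
    }
}

fn main() {
    let (tx, rx) = channel();
    let consumer = thread::spawn(move || drain_loop(rx));

    for i in 0..1_000u64 {
        tx.send(i).unwrap();
        if i % 100 == 0 {
            thread::sleep(Duration::from_millis(1)); // simulate bursty arrivals
        }
    }
    drop(tx);
    consumer.join().unwrap();
}
```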
