Unable to exploit all the cores, suggestions?

Hi All,

I am asking for help with an issue that is not directly related to Rust, but I'm posting here since this is one of the most technical communities I know.

The software we are discussing is a Redis module written in Rust that embeds SQLite.

When a new database is created, it is backed by SQLite and a dedicated thread is started for it.
A single Redis instance is thus able to manage multiple SQLite databases, each one with its own thread.

Now I am stress-testing the system, and the optimistic assumption is that, with up to ~$number_of_cores databases running, I should see an almost linear increase in performance.
E.g. if a single database on a single thread executes 10K ops/sec, with two databases I expect something close to 20K ops/sec, since I am using two threads that share nothing and never wait on each other.

Clearly, this does not happen: throughput is capped at roughly 10K ops/sec, no matter the number of databases and threads used.

I don't believe that Redis is the bottleneck (I may be wrong here), since if I send invalid commands it manages to answer at 50K req/sec.

I am starting to believe that I am losing a lot of performance to context switches, and that this may be why throughput does not improve even when I add non-contending threads.
However, I have no idea how to actually verify that this is the issue; maybe using perf?

Also, I would welcome any suggestions from the community; I suspect that one thread per database is not the best architecture, but I would love some feedback.

Thanks,

Are you writing to the same SQLite database file? If so, that is the bottleneck, since SQLite allows only one writer at a time. Also look at its durability guarantees (fsync), since those may also impose a hard limit on the number of writes per second.

If you make each thread write to a different SQLite db file, that may work. Also test with a :memory: SQLite path to see whether your throughput is limited by CPU or by disk.
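A quick way to get a baseline (a minimal sketch, assuming the rusqlite crate; your module may use different bindings) is a standalone harness that measures raw insert throughput against an in-memory database, with Redis out of the picture entirely:

```rust
// Hypothetical micro-benchmark: raw insert throughput of an in-memory
// SQLite database, via the `rusqlite` crate (an assumption; any SQLite
// binding would do).
use rusqlite::Connection;
use std::time::Instant;

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open_in_memory()?;
    conn.execute("CREATE TABLE log (data TEXT)", [])?;

    let mut stmt = conn.prepare("INSERT INTO log (data) VALUES (?1)")?;
    let n = 100_000;
    let start = Instant::now();
    for i in 0..n {
        stmt.execute([format!("row {i}")])?;
    }
    let secs = start.elapsed().as_secs_f64();
    println!("{:.0} inserts/sec", n as f64 / secs);
    Ok(())
}
```

If that number is far above what you see through Redis, the bottleneck is in the plumbing rather than in SQLite itself.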

No, I am sorry that I wasn't clear.

I am writing to different SQLite databases, each one opened in memory via the :memory: path.

The CPUs never reach 100%; they sit at roughly 50%.

Isn't Redis itself single-threaded? How are the incoming requests offloaded to your worker threads? It would help if you described the internal architecture and threading model in more depth.

Hi Vitaly :slight_smile:

Yes, Redis is single-threaded; however, it gives modules the opportunity to work in a multi-threaded fashion.

Command dispatch is managed by Redis: a single-threaded loop reads the sockets and decides which command to call.

Working in the standard way (think of SET), it runs on my machine at around 50K ops/sec.

However, modules have the capability to block a client, do the work in a different thread, and finally unblock the client and return the data.
While the client is blocked, the main Redis thread is free to accept new connections or do anything else it pleases.

To recap: the interaction with the I/O sockets happens on a single thread, while the work on the several SQLite databases is executed on different threads, one and only one per database.

Was this clearer?

Thanks, that's a bit clearer.

So your module is called on the main Redis thread, but it in turn dispatches the request to the appropriate worker thread? If so, how are you handing off work to the worker thread, and how are you receiving data back from it to send to the client?

Have you tried taking SQLite out of the equation and just having your worker threads return some dummy data? Do you see an increase in requests/sec when adding more threads (up to the number of CPUs)?

I'm not too familiar with SQLite internals. But what kind of performance can you get if you write a simple test harness that exercises in-memory SQLite?

Yes, your understanding is correct.

The work is sent to the workers using a simple FIFO queue, the one from the Rust standard library: std::sync::mpsc.

The result is sent back to Redis using internal Redis mechanisms that I am quite confident are efficient.
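In case it is useful, here is roughly the shape of the hand-off as a self-contained sketch (all names here are hypothetical, not the module's actual code; the reply travels over a channel, whereas in the real module it goes back through Redis' unblock mechanism):

```rust
// Sketch of the one-thread-per-database pattern. Each worker owns its
// database; requests arrive over an mpsc channel and each request
// carries its own reply channel.
use std::sync::mpsc::{channel, Sender};
use std::thread;

struct Job {
    statement: String,
    reply: Sender<String>,
}

// Spawns a worker thread and returns the sending half of its queue.
fn spawn_db_worker() -> Sender<Job> {
    let (tx, rx) = channel::<Job>();
    thread::spawn(move || {
        // The loop ends when every Sender has been dropped.
        for job in rx {
            // The real worker would execute `job.statement` on its
            // private SQLite connection here.
            let result = format!("executed: {}", job.statement);
            let _ = job.reply.send(result);
        }
    });
    tx
}

fn main() {
    let worker = spawn_db_worker();
    let (reply_tx, reply_rx) = channel();
    let job = Job {
        statement: "INSERT INTO t VALUES (1)".into(),
        reply: reply_tx,
    };
    worker.send(job).unwrap();
    println!("{}", reply_rx.recv().unwrap());
}
```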

No, I didn't try to remove SQLite from the equation, but that may be a great suggestion.

No, as I said, I don't see any increase in req/sec when increasing the number of running threads; everything tops out at roughly 10K req/sec.

SQLite in memory, for the kind of workload I was testing, easily reaches 200K req/sec.

And just to check the obvious - your test harness creates multiple parallel connections to Redis? If that's the case, then it would appear that Redis is not actually multiplexing across the clients but is instead waiting for a client's response to be generated before servicing the event loop (and other clients' requests). The fact that you're stuck at 10K/sec implies that it is either working with a single client only (and thus cannot multiplex its requests) or that multiplexing across clients somehow isn't working properly.

So, how are you generating the load from the client(s)? Also, how does the module decide which worker thread to route the request to? Are the database instances equivalent? Is it round-robin or something else?

With all due respect: WHY?!
Why wrap a lightweight, file-based and file-optimised database if you clearly need memory-based performance?

If you really need SQL features and good caching, why not use a full-scale relational database that has been optimised for literal decades to solve exactly this problem?

(E.g. PostgreSQL, which in the latest versions even supports using multiple cores within a single query, and has excellent Diesel Rust bindings.)

By wrapping a lightweight DB in a caching layer, you are basically (poorly) reinventing the wheel: you need to keep reopening your SQLite file(s) and repeating all the parsing, and SQLite, awesome as it is, cannot hold a candle to the incredible optimisation and (integrated, content-aware) caching powers of PostgreSQL, MySQL or Microsoft SQL Server.

Don't solve your performance problems, solve your architecture problems!

(My apologies for going on a rant here :flushed:, but I have a very hard time imagining why you would build something this way; I am also very curious about the constraints/reasoning that brought you to this solution.
Please do explain; I am willing to learn if/why this could make sense!)

edit 1: links to postgres+diesel
edit 2: rant apology
edit 3: I found your original topic about this project, sorry for not noticing it back then.


The load is generated by the Redis tool redis-benchmark. By default it spawns 50 independent clients; the throughput increases as more clients are added, up to ~200 clients, where it tops out at the 10K ops/sec I mentioned earlier.

The test is set up by creating n equivalent databases using the :memory: path.
Then I create the test table and a prepared statement, and finally I execute the benchmark.

The benchmark has this shape:

```
./redis-benchmark -r 5 -c 200 REDISQL.EXEC_STATEMENT DB__rand_int__ insert_log mk
```

This executes the statement insert_log against the databases DB00000000000x, where 0 ≤ x < 5.
The 5 is set via the -r parameter, while -c sets the number of clients, in this case 200.

It runs this command 100K times, and the databases are chosen uniformly at random by the redis-benchmark tool.

I will definitely follow your suggestion of running an empty command in order to see what is slowing things down. However, I was looking for a more "engineering" approach to the problem, in order to have actual measurements and not just (educated) guesses.

Maybe I should look more closely into perf.


I recently wrote an introduction to perf for my scientific computing colleagues; maybe it will be useful to you? It is a bit focused on compute performance, though, whereas for your use case I think you may want to focus more on system-wide analysis (IO, syscalls, mutexes...). For that, I can recommend Brendan Gregg's awesome website.

EDIT: Also, for instructions on using perf with Rust projects, see my posts in Profilers and how to interpret results on recursive functions - #2 by HadrienG .


My guess is that the commands the single thread is sending are executed so fast that the mpsc overhead is killing any gain.

Hotspot is a GUI for perf, though you might need a bit more than it offers:
https://github.com/KDAB/hotspot

Any reason you don't use pipelining (redis-benchmark's -P option)?

Also, are Redis and the clients all on the same machine? Even if not, I'm not sure how 200 clients on the same machine avoid saturating it with context switches. I think you'd want pipelining to overcome at least some of that.

It's typically hard to benchmark high-perf servers from a single client machine - often you saturate the client itself before the server. But I'm just speculating here.

Running perf is the right way to attack this scientifically. But it's important to make sure the benchmark setup itself isn't flawed - not saying it is, but it's something to double-check.

Let us know what you find.

Hi @juleskers

don't worry about the rant :wink:

Instead of answering just in this forum, I decided to write down the motivations for the project as a documentation page; you can read more details here: Motivations - RediSQL

Anyway, the main reason is basically simplicity of operation while retaining SQL capabilities in a micro-service architecture.

Happy to continue the conversation, even if I am not sure that this is the best place :slight_smile:

Let me also tag @HadrienG and @scottmcm, who I saw liked Jules' comment and so might be interested in the topic; if not, sorry about the noise :frowning:


If mpsc overhead does appear to be an issue, you might want to check out the crossbeam-channel crate (rfc and benchmarks).
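A minimal self-contained sketch of the swap (assuming crossbeam-channel as a dependency; the payload is a plain String for brevity): crossbeam's API closely mirrors std::sync::mpsc, so it is nearly a drop-in replacement, and its Receiver can additionally be cloned for multiple consumers.

```rust
// Sketch: replacing std::sync::mpsc with crossbeam-channel.
// `unbounded()` stands in for `std::sync::mpsc::channel()`.
use crossbeam_channel::unbounded;
use std::thread;

fn main() {
    let (tx, rx) = unbounded::<String>();
    let worker = thread::spawn(move || {
        for msg in rx {
            println!("worker got: {msg}");
        }
    });
    tx.send("INSERT INTO t VALUES (1)".to_string()).unwrap();
    drop(tx); // closing the channel ends the worker's loop
    worker.join().unwrap();
}
```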


Thanks for taking the time to explain! I've read through your post and have a little better sense of what led you to this decision.
I find it great to see how Rust empowers people to write "low-level" solutions like this. Am I correct in assuming that you wouldn't even have considered this if C had been the only option? So, kudos to Rust for making new solutions possible :smiley:

I can see how a second "source of truth" would be overhead and require code changes..
But didn't you have to make those code changes anyway? You still had to write code for SQL handling (either in RediSQL or in the microservices themselves), and that code would have been basically identical for Postgres/MySQL/MS-SQL. That's one of the beauties of standardised ANSI SQL.

My concern is that you have now invested quite a lot of time into a custom solution that, as you've seen, doesn't scale. Even if you solve the current problem, SQLite will quickly plateau (it's awesome, but intentionally focused on the lower, lightweight end). At that point, you will still have to switch to a "proper" SQL database.
If you had invested the same effort/hours you put into RediSQL into Postgres, you could have had a very nice Postgres server running already, with far more "headroom" than you currently have.

Then again, working for a large organisation with some inertia myself, I can see how it would be a lot of effort to convince everyone that you really need a second server. Setting that up would be a large up-front effort with lots of communication with others, whereas your current module has been developed incrementally by you alone -> less hassle!
Also, this way you could re-use your existing coding skills instead of having to learn SQL server administration.

Of course, I'm still ignoring the architectural/organisational issues of a second server; but I'd argue that you're already running a second infrastructure, you're just hiding it in your first one like a matryoshka doll.

But, having said that, I'm turning this into a "future costs vs today's costs" argument, which is always a personal/situational preference, and I don't know anything about your situation. I'll stop now, and accept that you probably know what you are doing :grinning:
Thanks for sharing with us!


Hi,

those are just the motivations that got me started; nowadays it is the only Open Source project that I have made money from directly, so I keep improving it and am starting to sell a "PRO" version.

Anyhow, your bar for "not scaling" is quite high: the optimized version, running in memory, with only writes, does well over 30K inserts/sec, which means more than 2 billion inserted records in a single day (30,000 × 86,400 seconds ≈ 2.6 × 10^9).
(Clearly, knock off two orders of magnitude to approximate a real-world workload and to ignore the possibility of using master-slave replication for read operations, and you still get more than 20M inserts per day, which doesn't seem so bad to me...)
(Also, I said insert, but the correct term is transaction, and a single transaction can mean multiple inserts.)

Honestly, if you need more than 30K inserts/second you definitely have a huge engineering problem on your hands, and this is not the solution you are looking for.

Most projects really need orders of magnitude less performance and, I believe, are more than willing to sacrifice a "real SQL database" for easier operations.

I definitely don't want to turn this topic into a discussion of the real-world value of RediSQL.

Thank you for your feedback, which helped me improve the documentation, and for your really interesting objections.

If you have any more questions, would like to keep the conversation going, or are interested in how the project may help you in your field, don't hesitate to write me an email.
(You should find it in my GitHub profile.)

Cheers :beers:


As I've said, I'll stop; you have been more than indulgent of my concerns :heart: Cheers :beers:

To bring this back on topic: have the performance ideas others have given you been of use so far?

A little update.

Hey @juleskers :heart:

I finally ran perf, sampling cycles for a good 10 minutes, and got these results:

```
Samples: 1M of event 'cycles', Event count (approx.): 354283516692
  Children      Self  Command       Shared Object           Symbol
+   37,12%     0,26%  redis-server  [kernel.kallsyms]       [k] entry_SYSCALL_64_fastpath
+   21,22%     0,34%  redis-server  [kernel.kallsyms]       [k] sys_futex
+   20,84%     0,39%  redis-server  [kernel.kallsyms]       [k] do_futex
+   15,09%     0,00%  redis-server  libpthread-2.23.so      [.] 0xffff801d5934d4bd
+   14,69%     0,15%  redis-server  [kernel.kallsyms]       [k] sys_write
+   13,93%     0,24%  redis-server  [kernel.kallsyms]       [k] vfs_write
+   12,77%     0,07%  redis-server  [kernel.kallsyms]       [k] __vfs_write
+   12,65%     0,13%  redis-server  [kernel.kallsyms]       [k] new_sync_write
+   11,40%     0,00%  redis-server  [unknown]               [k] 0x0000000000000200
+   10,77%     0,09%  redis-server  [kernel.kallsyms]       [k] sock_write_iter
+   10,67%     0,05%  redis-server  [kernel.kallsyms]       [k] sock_sendmsg
+   10,55%     0,20%  redis-server  libpthread-2.23.so      [.] __lll_unlock_wake
+   10,43%     0,07%  redis-server  [kernel.kallsyms]       [k] inet_sendmsg
+   10,17%     0,31%  redis-server  [kernel.kallsyms]       [k] tcp_sendmsg
+    9,43%     1,48%  redis-server  [kernel.kallsyms]       [k] try_to_wake_up
+    9,38%     0,53%  redis-server  [kernel.kallsyms]       [k] futex_wake
+    9,35%     0,05%  redis-server  [kernel.kallsyms]       [k] wake_up_q
+    9,34%     0,30%  redis-server  [kernel.kallsyms]       [k] futex_wait
+    8,83%     0,06%  redis-server  [kernel.kallsyms]       [k] tcp_push
+    8,78%     0,05%  redis-server  [kernel.kallsyms]       [k] __tcp_push_pending_frames
+    8,70%     0,20%  redis-server  [kernel.kallsyms]       [k] tcp_write_xmit
+    8,34%     0,00%  redis-server  libredis_sql.so         [.] std::sys_common::backtrace::__rust_begin_short_backtrace::h668c06fb2cddbaa9
+    8,34%     0,00%  redis-server  libredis_sql.so         [.] std::panicking::try::do_call::h720cfbcbaefe40ad
+    8,34%     0,00%  redis-server  libredis_sql.so         [.] __rust_maybe_catch_panic
+    8,34%     0,00%  redis-server  libredis_sql.so         [.] _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::h68dc0072d51b9610
+    8,34%     0,00%  redis-server  libredis_sql.so         [.] std::sys::imp::thread::Thread::new::thread_start::hbaf1b5aa1ca8e3ea
+    8,34%     0,00%  redis-server  libpthread-2.23.so      [.] start_thread
+    8,10%     0,18%  redis-server  [kernel.kallsyms]       [k] futex_wait_queue_me
+    7,98%     0,25%  redis-server  [kernel.kallsyms]       [k] tcp_transmit_skb
+    7,95%     0,12%  redis-server  [kernel.kallsyms]       [k] schedule
+    7,93%     0,00%  redis-server  [unknown]               [k] 0000000000000000
+    7,70%     0,38%  redis-server  [kernel.kallsyms]       [k] __schedule
+    7,32%     0,12%  redis-server  [kernel.kallsyms]       [k] ip_queue_xmit
+    7,17%     0,32%  redis-server  libpthread-2.23.so      [.] __lll_lock_wait
+    7,15%     0,03%  redis-server  [kernel.kallsyms]       [k] ip_local_out
+    6,44%     0,00%  redis-server  [unknown]               [k] 0x0000000001010401
+    5,82%     1,25%  redis-server  libredis_sql.so         [.] redis_sql::redis::listen_and_execute::h1ab708507bc22d52
+    5,32%     5,24%  redis-server  [vdso]                  [.] __vdso_gettimeofday
```

The main problem is that I have basically no idea what this means.

It seems to spend most of its time in syscalls, and most of those system calls are futex-related (futex stands for "fast userspace mutex").

Also, it seems most of the time is spent outside the Rust module, so I don't believe the channel implementation is of any concern; thanks anyway to @Ophirr33 for pointing out crossbeam.

I also sampled context switches (which, I believe, counts "where is the stack pointer when the context-switch event happens?"), with the expected results:

```
Samples: 254K of event 'cs', Event count (approx.): 3428106
  Children      Self  Command       Shared Object       Symbol
+   99,97%    99,97%  redis-server  [kernel.kallsyms]   [k] schedule
+   99,09%     0,00%  redis-server  [kernel.kallsyms]   [k] entry_SYSCALL_64_fastpath
+   99,05%     0,00%  redis-server  [kernel.kallsyms]   [k] futex_wait_queue_me
+   99,05%     0,00%  redis-server  [kernel.kallsyms]   [k] futex_wait
+   99,05%     0,00%  redis-server  [kernel.kallsyms]   [k] do_futex
+   99,05%     0,00%  redis-server  [kernel.kallsyms]   [k] sys_futex
+   67,36%     0,00%  redis-server  libpthread-2.23.so  [.] __lll_lock_wait
+   31,71%     0,00%  redis-server  libpthread-2.23.so  [.] pthread_cond_wait@@GLIBC_2.3.2
+   18,33%     0,00%  redis-server  [unknown]           [k] 0x0000000000000200
+   11,77%     0,00%  redis-server  libredis_sql.so     [.] std::collections::hash::map::RandomState::new::KEYS::__init::h2413584b0b846c97
+   11,77%     0,00%  redis-server  libredis_sql.so     [.] _$LT$alloc..vec..Vec$LT$T$GT$$GT$::extend_from_slice::hfa47afa150e7d64a
+   11,77%     0,00%  redis-server  libredis_sql.so     [.] _$LT$core..fmt..Write..write_fmt..Adapter$LT$$u27$a$C$$u20$T$GT$$u20$as$u20$core..fmt..Wr
+   11,77%     0,00%  redis-server  libredis_sql.so     [.] _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::h70a0fbbf4fe5a4fe
+   11,77%     0,00%  redis-server  libredis_sql.so     [.] std::panicking::default_hook::hf425c768c5ffbbad
+   11,77%     0,00%  redis-server  libpthread-2.23.so  [.] start_thread
+   10,06%     0,00%  redis-server  [unknown]           [k] 0x0000000001010401
+    9,40%     0,00%  redis-server  [unknown]           [k] 0000000000000000
+    5,56%     0,00%  redis-server  libredis_sql.so     [.] redis_sql::redis::listen_and_execute::h1ab708507bc22d52
+    4,32%     0,00%  redis-server  [unknown]           [k] 0xffffffff000003f8
+    2,07%     0,00%  redis-server  [unknown]           [k] 0x0000001300010080
```

Again, this happens inside futex calls.

So it seems there is a lot of locking going on that may be bounding the performance of the application.

Any thoughts? How do I figure out where those locks are?


Also, just to make the problem clear:

This picture makes it clear that I am unable to fully use all the cores of the machine.
Moreover, a single-threaded application achieves better throughput (50K ops/sec vs 30K ops/sec) while showing no evident form of locking.