Actix Web request failure rate all the way to 30%

I made a web service that loads an embedding model on startup, processes users' text input, and returns the vectorized outputs. However, when I stress tested it with 100 concurrent users, the failure rate was as high as 30%, which basically renders the service unusable. By comparison, an equivalent Python service handled 30 RPS with zero failures.
If I lower the concurrency to 40, the RPS stays at around 30 and the failures disappear. But in that case, the Rust service performs about the same as the Python one.
The web framework I am using is Actix Web, with the ort crate for running inference on the embedding model and Hugging Face's tokenizers crate for tokenization.

Does anyone have experience with this?

The obligatory question: How are you running this? Are you running it in release mode?
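For reference, the difference is just a flag on the Cargo command:

cargo run --release

Debug builds skip most optimizations, so for CPU-bound work like model inference the release build is often an order of magnitude faster.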


Good point. I am going to give it a try.

I built the project with this configuration in Cargo.toml:

[package]
name = "neural-network-services"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
actix-web = "4.8.0"
serde = { version = "1.0.203", features = ["derive"] }
ort = "2.0.0-rc.2"
tokenizers = "0.19.1"
ndarray = "0.15.6"
diesel = { version = "2.2.0", features = ["mysql", "serde_json"] }
dotenvy = "0.15"
serde_json = "1.0.117"

[profile.release]
opt-level = 3  # Maximize optimizations
lto = true     # Enable Link Time Optimizations
debug = false  # Omit debug info (debug assertions are already off in release)

The responses are significantly faster, but the failure rate hasn't dropped; many of the requests end up as "reset".

Actix Web by default sets the maximum backlog (the number of TCP connections the kernel allows to be established while userspace has not yet accepted them) to 1024. This means that if you block while processing a request (e.g. by evaluating your relatively slow neural-network model) and the backlog of pending connections exceeds 1024, Actix Web will intentionally start rejecting connections.

The reason it does this is that the remaining requests would eventually time out on the client side anyway, since the server cannot process them fast enough. Rejecting them immediately lets the client return an error to the user right away rather than waiting something like 30 seconds for a timeout. It also lets a cloud load balancer spin up a couple more instances of the web server, and if you get DDoSed it lets, for example, Cloudflare notice that your server cannot keep up and start serving a CAPTCHA instead. Quoting fasterthanlime's article "I won free load testing":

[...]

That's certainly more connections than there should EVER be between Cloudflare and my server. But Cloudflare expects origin servers to signal when they're struggling.

Origins should return 429, or 503, or start refusing connections; they should do something, anything other than just accept a ridiculous number of concurrent connections and let them just sit there — that's just a bad deal for everybody.
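For what it's worth, the backlog size is tunable when building the server, although raising it only hides the symptom if requests are genuinely processed too slowly. A minimal sketch (the address and the value 2048 are arbitrary):

use actix_web::{App, HttpServer};

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| App::new())
        // Allow up to 2048 pending connections before new ones
        // are refused (the default is 1024).
        .backlog(2048)
        .bind(("127.0.0.1", 8080))?
        .run()
        .await
}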


In addition to that, I had set the number of workers to 17 in my configuration. Although the threads are there, only one of them is actually handling requests. I think I might have missed something, but I have no idea what.

Just to be sure, you're not running blocking code inside an async function, or locking a global Mutex while you perform your task, right?
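That is, something shaped like this, where synchronous, CPU-bound work runs directly on the async worker (a hypothetical sketch; run_model stands in for whatever the handler actually does):

use actix_web::{HttpResponse, Responder};
use std::time::Duration;

// Stand-in for tokenization + inference: any synchronous,
// CPU-heavy call has the same effect.
fn run_model(text: &str) -> Vec<f32> {
    std::thread::sleep(Duration::from_millis(200)); // pretend to compute
    vec![text.len() as f32]
}

async fn embed(body: String) -> impl Responder {
    // This call blocks the worker thread: while it runs, no other
    // request scheduled on this worker can make progress.
    let vector = run_model(&body);
    HttpResponse::Ok().json(vector)
}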

Are you creating a separate Session for each thread? Based on some quick searching it seems like all accesses within a single Session may be synchronized by default.

I think you are correct. While I am not using a Mutex, the embedding async function has a couple of synchronous, CPU-bound operations in it, for example tokenization and inference. Does that mean I should offload them to a separate thread pool?

Thanks a lot for the reference.
A backlog of 1024 connections seems fair. What I don't quite get is that each request takes at most 4 seconds and at minimum 35 ms, according to the statistics from the stress-test tool, yet the number of pending connections can exceed 1024 within a 4-second window? 1024 connections is already a huge number.

Okay, I don't think I had considered a Session for each thread. But according to Actix Web's documentation, it starts an independent application instance for each worker thread:

Once the workers are created, they each receive a separate application instance to handle requests. Application state is not shared between the threads, and handlers are free to manipulate their copy of the state with no concurrency concerns.
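If I read that right, then constructing the state inside the factory closure should give each worker its own Session. Something like this (a sketch; ModelSession and load_session are placeholders for however the ort Session actually gets built):

use actix_web::{web, App, HttpServer};

struct ModelSession; // placeholder for ort's Session

fn load_session() -> ModelSession {
    // ... build the ort Session from the model file here ...
    ModelSession
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| {
        // This closure runs once per worker thread, so every worker
        // ends up with its own Session rather than sharing one.
        let session = load_session();
        App::new().app_data(web::Data::new(session))
    })
    .bind(("127.0.0.1", 8080))?
    .run()
    .await
}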

Okay, so I have reached a semi-conclusion: the number of workers is better left alone.

After I removed the call that sets the number of workers when initializing HttpServer, the RPS soared to an amazing 200+ without raising any errors. For my company's needs, I think this is plenty for now.

Although it is still unclear why the threads sat idle even though I had specified 17 workers, for now I will probably not add a .workers() call. I'd welcome an explanation from anyone who has one.

Yes, if those operations can take a non-negligible amount of time, then consider running them on a blocking thread using spawn_blocking. Not doing so means you may prevent other tasks from running.
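Roughly like this (run_model again stands in for the tokenize + inference steps; actix-web's web::block is a thin wrapper around the same idea):

use actix_web::{HttpResponse, Responder};

// Stand-in for the synchronous tokenize + inference pipeline.
fn run_model(text: &str) -> Vec<f32> {
    vec![text.len() as f32]
}

async fn embed(body: String) -> impl Responder {
    // Move the CPU-bound work onto the dedicated blocking thread pool,
    // leaving this worker free to poll other requests in the meantime.
    let result = actix_web::rt::task::spawn_blocking(move || run_model(&body)).await;
    match result {
        Ok(vector) => HttpResponse::Ok().json(vector),
        Err(_) => HttpResponse::InternalServerError().finish(),
    }
}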

