Lost data with actix-web

I've written a simulator for a distributed system that generates a gazillion trace records. An actix-web client in the simulator sends a subset of those records to an actix-web server. The server munges that data into a form that I can display in a browser.

I wanted to run larger models, so I re-compiled with --release. It took about 5x longer to compile, but it also ran about 5x faster. Great! Unfortunately, the display in the browser shows that I'm missing a bunch of trace records. I don't know enough about HTTP to know if that's the problem or if the issue is with actix-web. I also write the trace records to a file, and they are all there. I can replay from the file, but I'd like to see an animation as the simulation runs.

The server only needs a subset of the trace records, but I didn't want the client to have to know that subset. Instead, I send a record of a given type and don't send any more of that type again if I get a 404. I've verified that I'm getting the expected number of 404s and no other errors.

Sometimes I get a connect timeout error from the server, but I just ignore those runs. I asked about that here, but nobody could help.

Out of curiosity, is the code something that you are able to share, so that some willing good samaritan could debug it locally?

Unfortunately, it's the family (corporate) jewels. I'll be happy to take ideas on how to debug it, though. My best guess is that the HTTP protocol allows dropping POST requests when they come in too fast.

Can you configure a reverse proxy, say Nginx, between the client and the actix-web server? If Nginx accepts the requests but the actix-web server drops them, you'll have validated a hypothesis and have a direction to look into. You can also configure Nginx to log detailed debug info per request and compare it with the actix logs.

AFAIK, the HTTP protocol itself does not define a request-dropping mechanism, but it does provide a status code: HTTP 503 Service Unavailable. When and how each server triggers it is implementation dependent.

Thanks. I'm not getting any 503 errors, so the problem must be something else. I'll see if I can narrow down the possibilities.

Can you debug what the browser gets? As in, run the same thing through curl, for example. If the simulation runs 5x faster, maybe the browser isn't fast enough to keep up?

Maybe the problem is in your "async" code; maybe you run "blocking" code inside an "async" handler, so you get timeouts and missed data because the thread that should handle many sockets at once is stuck on I/O, a mutex, or something like that?

Good thought, @lwojtow. That's something I'll need to look out for when I animate. Right now, I wait for the run to complete and then do a GET of the munged data.

I can also create the munged data by having the server read the trace records from a file, which works. That's why I'm suspicious about the client-server connection rather than the server-browser one.

That's a good idea, @Dushistov, but my code uses threads, not async. (I started work on this simulator long before async was available.) That means any async problem must be in the actix-web code. Is it likely that this crate has that kind of problem?

Not sure that we understand each other. There was a point when the "async/await" keywords were introduced into the Rust language, but actix was async long before that. So if you use actix-web, your code is async, even if there are no "async/await" keywords in your program or in the version of "actix-web" that you use.

Something like this:

actix_web::HttpServer::new(move || {
    actix_web::App::new()
        .data(State {
            unit: Arc::new(Mutex::new(Unit)),
            slow_stuff: SlowStuff::new(),
        })
        .service(actix_web::web::resource("/").to(index))
})

if you do some long-running work inside fn index(), you can potentially block other HTTP requests.
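
For illustration only (this handler is made up, not from the thread): anything that holds the thread inside the handler, like a sleep, a big computation, or waiting on a lock, keeps that worker from picking up its other requests.

use actix_web::{HttpResponse, Responder};

// Hypothetical blocking handler: while the sleep runs, the worker thread that
// owns this request cannot make progress on any other request assigned to it.
fn index() -> impl Responder {
    std::thread::sleep(std::time::Duration::from_secs(5));
    HttpResponse::Ok().body("done")
}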

I did indeed misunderstand. I now have something to look at.

You gave me the clue that led to the solution, @Dushistov. According to the documentation

Since each worker thread processes its requests sequentially, handlers which block the current thread will cause the current worker to stop processing new requests:

Increasing the number of worker threads didn't help; it just slowed things down. It looks like I need to make the function doing the work async, even though it's not doing any long-running operations, such as I/O. It must be that requests are arriving faster than the workers can handle them.

Now, do you have any idea what I can do about the "Error from server Connect(Timeout)" I'm getting?

If the server is blocked from making any progress, that could explain the intermittent timeouts just as easily as the missing records.

Actix has a ThreadPool where you can run blocking calls, but using a ThreadPool will not eliminate all blocking behavior. For example, the ThreadPool is bounded to some maximum number of threads (by default num_cpus * 5). Spawning more tasks on a pool which is already at capacity will simply block those new tasks.

The thread-based I/O model is known to scale poorly. Futures/async/await can improve scalability by using the idle time spent waiting on I/O to do other work, like handling incoming requests. Actix itself does support Futures, so this is probably worth investing in. Version 2.0 even supports async/await! This async example shows the basics of using an async HTTP client with actix-web.

All told, I suspect that moving to a Futures-based I/O model will address most of the issues related to blocking.
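
To make that concrete, here is a minimal sketch (not the original code; it assumes actix-web 2.x with async/await, and the handler name is made up) of an async handler that pushes any blocking work onto the blocking pool with web::block, so the worker thread stays free for other requests:

use actix_web::{web, App, HttpResponse, HttpServer, Responder};
use serde_json::Value;

// Hypothetical async handler: the heavy or lock-holding part runs on actix's
// blocking thread pool, and the worker just awaits the result.
async fn process(record: web::Json<Value>) -> impl Responder {
    let result = web::block(move || {
        // ... CPU-heavy or lock-holding work on the record goes here ...
        Ok::<_, ()>(record.into_inner())
    })
    .await;

    match result {
        Ok(v) => HttpResponse::Ok().json(v),
        Err(_) => HttpResponse::InternalServerError().finish(),
    }
}

#[actix_rt::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| App::new().route("/", web::post().to(process)))
        .bind("127.0.0.1:8080")?
        .run()
        .await
}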

You are all diving into details and doing guesswork. That won’t help too much. And unless the topic starter hands out more information (code!), it will likely stay that way.

One thing to begin with: HTTP is a transactional protocol. You should first establish whether all calls actually arrived at the server and were processed by it. There are status codes for this, and if they are OK and the response body was properly closed, then the data was not "lost" on the transmission path.
From there you can check whether it's a client or server issue, and then try to figure out what the issue actually is.
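
A minimal sketch of that check on the sending side (illustrative only; reqwest stands in for whatever HTTP client is actually in use, and the names are hypothetical): count every POST whose status is not a success, or whose response body can't be read to completion.

use reqwest::blocking::Client;
use serde_json::Value;

// Returns true only if the request was sent, the status was 2xx, and the
// response body was read to completion (i.e. the response was closed cleanly).
fn post_and_verify(client: &Client, url: &str, record: &Value) -> bool {
    match client.post(url).json(record).send() {
        Ok(resp) => {
            let ok_status = resp.status().is_success();
            let body_ok = resp.text().is_ok();
            ok_status && body_ok
        }
        Err(err) => {
            eprintln!("send failed: {}", err);
            false
        }
    }
}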

Am I misreading the statement that a blocked handler will stop processing new requests? I read that to mean that requests received while the handler is blocked will be dropped.

Everything is running on my Mac laptop, so there's no network to drop messages. The simulator uses way more threads than I have cores, but I don't see how that can result in dropped messages. Plus, I thought I ruled that out by checking the status codes in the client.

I did find the time to create an example. The example uses "expect" on the first two lines just in case something would swallow "?" errors.

fn process_msg(elements: web::Data<AppElements>, record: web::Json<Value>)
               -> Result<impl Responder, Error> {
    let trace_body = record.get("body").expect("Msg: bad trace record");
    let body: Body = serde_json::from_value(trace_body.to_owned()).expect("Bad Body format");
    if body.msg.payload.msg_type == "Known" {
        let that_name = body.msg.payload.id.name;
        let mut elements = elements
            .get_ref()
            .appelements.lock().unwrap();
        let other = update(&mut elements, &body.id.name, &that_name, body.rcvr, ElementType::Unknown)
                .get(&body.rcvr).expect("DiscoverDMsg: missing neighbor").clone();
        update(&mut elements, &other.name, &that_name, other.rcvr, ElementType::Known);
    }
    Ok(HttpResponse::Ok().body("process_msg".to_owned()))
}
fn update<'a>(elements: &'a mut MutexGuard<HashMap<String, AppElement>>, this_name: &String,
              that_name: &String, rcvr: usize, element_type: ElementType) -> &'a HashMap<usize, Other> {
    let element = elements
        .entry(this_name.to_string())
        .or_insert(Default::default());
    element
        .fields
        .entry(that_name.to_string())
        .or_insert(Default::default())
        .element
        .insert(rcvr, element_type);
    &element.theirs.others
}

As you can see, I only do a few hashmap operations while holding the lock. The other reason I don't think the locking is causing the problem is that only a few percent of the trace records are of the "Known" message type.

If nobody comes up with a better idea in the next couple of hours (I have a meeting.), I'll follow @Matthias247's advice and try to find out where the messages are disappearing.

Is it obvious? How can it process a request on the same thread if that thread is blocked?
By default there should be N threads processing your requests; if you change nothing, N is the number of logical cores of your CPU. So if you have 4 cores, and two requests (handled on different threads) try to lock the mutex, fail, and have their threads suspended until the corresponding locks are free, you have only 2 free threads left to process requests.
And all sockets scheduled to be handled by the blocked threads have to wait to be processed (read) until the mutex unlocks and the request handler returns control to actix.

No, it will be queued, and there are also OS queues. The main problem you will likely have is that the HTTP client's timeout will expire because the server handles requests too slowly.

Maybe? The only way the server can respond with a status code indicating an error is if it is not currently blocked. Otherwise the request will be held in a queue; for HTTP requests, this is the socket read buffer. For new incoming connections, it is the TCP listen backlog.

In the latter case, actix has a thread dedicated to accepting incoming connections, so new connections should only be dropped if (a) all workers are busy and (b) the total number of waiting connections exceeds the listen backlog (defaults to 1024). In the former case, blocking in a service handler will block everything on that HTTP worker thread. See: Server | Actix

I wouldn't presume that the lock is entirely to blame for timeouts. Even in cases with extreme contention, the worst case latency should remain reasonable (compared to generous network timeouts). I would direct your attention to any blocking operations in your actix-web service. This includes JoinHandle::join(), mpsc::Receiver::recv(), and TcpStream::read(), just to name a few.

Another thought I had: this conversation has so far used vague language, like comparing the "client-server connection" to "the server-browser one". IMHO, a browser is a client, so there is some non-obvious network model in play. How is "server-browser" distinct from "client-server"? Perhaps a diagram would help? It doesn't have to give away intellectual property, but it may help to get everyone on the same page, architecturally speaking.

Curiouser and curiouser.

I took @Matthias247's advice and verified that all the messages are received and that they all get to the end of the processing function. Lesson learned. Don't assume; measure.

Actually, I used an AtomicU64 to count the messages received and calls to the update function as a proxy for matching up the exact messages with the log file. There still could be an error in processing some messages multiple times and skipping others. I guess I can try to figure out how to check it, but I ain't doing it by hand; it's thousands of records.
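
Roughly, the counting looks like this (a sketch with made-up names, not the actual code): bump one counter at the top of the handler, another at the top of update, and compare both totals with the trace file's record count after the run.

use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical counters shared by all worker threads.
static MSGS_RECEIVED: AtomicU64 = AtomicU64::new(0);
static UPDATE_CALLS: AtomicU64 = AtomicU64::new(0);

fn note_msg_received() {
    MSGS_RECEIVED.fetch_add(1, Ordering::Relaxed);
}

fn note_update_call() {
    UPDATE_CALLS.fetch_add(1, Ordering::Relaxed);
}

// Called once after the run completes.
fn report() {
    println!(
        "received: {}, update calls: {}",
        MSGS_RECEIVED.load(Ordering::Relaxed),
        UPDATE_CALLS.load(Ordering::Relaxed)
    );
}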

The problem is NOT lost messages, even though it happens only for large cases or when running in release mode on modest-sized ones. (I exclude cases where I get a server timeout error.) Small cases never fail. Replaying from the trace files is always correct. Both the POST requests and the writes to the trace file are done in the same function on the same variable. The only difference is converting to a string for file output and to a JSON Value for the POST request. Hence the mystery.
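
For reference, the two output paths amount to something like this (hypothetical sketch; the record type and function are made up):

use serde::Serialize;
use serde_json::Value;

// Same record, two serializations: one line for the trace file,
// one Value for the body of the POST request.
fn emit<T: Serialize>(record: &T) -> serde_json::Result<(String, Value)> {
    let line = serde_json::to_string(record)?;
    let value = serde_json::to_value(record)?;
    Ok((line, value))
}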

To answer @parasyte's question, I am calling the simulator the client, because it's running the actix-web client to POST the trace records. The server is the server, obviously. The browser only does a GET after the run is complete. Eventually, I will use web sockets, so I can animate the picture.

Things are starting to come together. How is the actix-web client used? Is it running in blocking mode on the main thread (or in a dedicated thread pool)? Or is it creating a future that needs to be awaited on to initiate the request? And if it's a future being awaited, on which thread does this happen?

A modest size run of the simulator has about 1,000 threads, about 250 of them running an actix-web client in blocking mode.

Another issue is that something like 9 out of 10 runs large enough to show the problem trigger a server error. I've tried

        .keep_alive(100)      // keep-alive time for connections, in seconds
        .workers(200)         // number of worker threads
        .backlog(4096)        // maximum number of pending connections in the accept queue
        .client_timeout(0)    // 0 disables the timeout for reading the client request head
        .client_shutdown(0)   // 0 disables the connection shutdown timeout

to no avail. Most often it's a Connect(Timeout), but sometimes it's AddrNotAvailable.

Are there other settings I should try? Will making the server async help?