Before we do anything else, are you compiling this in release mode (i.e. cargo build --release)? The docs say to run the server with cargo run --example echo, which is a debug build by default, and the code will be awfully slow because it won't have any optimisations enabled (optimisations make debugging harder). I wouldn't be surprised if you got a 10-50x speed increase just by compiling in release mode.
This isn't necessarily related to performance, but your code is unsound and has a massive race condition. By using a static mutable variable for your counter, every read and write to the variable is done without any synchronisation, meaning multiple handlers running at the same time will "step on each other's toes", so to speak (see this StackOverflow answer for a more precise explanation). You'd only start to notice this once you do serious benchmarking and the tokio runtime starts handling requests on multiple threads.
You should be using an integer from the std::sync::atomic module, because these manage thread safety correctly. That way you can create an AtomicUsize (or AtomicI32, or whatever) and pass references to it into your handler function.
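For illustration, here's a minimal sketch of a counter shared across threads using std::sync::atomic (this is hypothetical example code, not the poster's actual handler):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// A plain `static` AtomicUsize: no `unsafe`, no `static mut`, and
// every increment is properly synchronised.
static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let handles: Vec<_> = (0..8)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..1_000 {
                    // fetch_add is an atomic read-modify-write; Relaxed
                    // ordering is enough for a standalone counter.
                    COUNTER.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // With `static mut usize` this could print any number at all;
    // with an atomic it is always exactly 8 * 1000.
    assert_eq!(COUNTER.load(Ordering::Relaxed), 8_000);
    println!("{}", COUNTER.load(Ordering::Relaxed));
}
```

With 8 threads each doing 1000 increments, the final value is always 8000; the `static mut` version can lose updates and print less.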
Especially if you're new to the language, always try to write safe code and avoid the unsafe keyword. Usually when unsafe is involved there are tricky invariants that need to be upheld (e.g. no data races, or that you are using pointers correctly)... To be fair, if you wrote the same thing in C or Go it'd be just as broken, you just wouldn't know it.
Very useful tips, thank you very much, both of you.
I thought Tokio was a single-threaded event loop; that's why I didn't use any synchronization mechanism. I'll change my implementation as you suggested and try again.
Thanks to your help, my code got a lot better.
I'd like to know where my bottleneck is.
I've been working on this code for a few weeks, read a lot of the HTTP RFCs and did a lot of experiments, but it's still not good enough.
Do you know any good tool that can help me profile my code at runtime?
I'm puzzled as to what the counter and loop are there for. It looks like you're reading the first body, and then counting how many other reads succeed and multiplying by that. Surely that isn't what you want to do? The code would be way simpler without the extra looping and breaking that shouldn't ever happen if I'm understanding it right.
Thank you very much, David @droundy, very good points; I'll apply them.
Well, this is a micro-optimization I did based on how they run tests: the test-runner calls the API many times concurrently, but all the requests are the same, so I thought I could keep the first one and multiply it by the number of requests, and it worked.
Since the test-runner uses the same TCP connection to make all the HTTP requests, I added the loop; it was a major improvement over my previous version, where I closed the connection after every HTTP response.
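For anyone curious, the keep-alive idea can be sketched roughly like this (a deliberately minimal, hypothetical version, not the actual code: `handle` is a made-up name and the request parsing just reads the request line and skips headers):

```rust
use std::io::{BufRead, BufReader, Read, Write};

// Serve every request that arrives on one connection, instead of
// closing the connection after the first response.
fn handle<R: Read, W: Write>(input: R, mut out: W) -> std::io::Result<usize> {
    let mut reader = BufReader::new(input);
    let mut served = 0;
    // The keep-alive loop: keep reading requests from the same stream
    // until the peer closes it (read_line returns 0 bytes).
    loop {
        let mut request_line = String::new();
        if reader.read_line(&mut request_line)? == 0 {
            break; // client closed the connection
        }
        // Skip headers up to the blank line that terminates them.
        loop {
            let mut header = String::new();
            if reader.read_line(&mut header)? == 0 || header == "\r\n" {
                break;
            }
        }
        let body = "ok";
        write!(
            out,
            "HTTP/1.1 200 OK\r\nContent-Length: {}\r\n\r\n{}",
            body.len(),
            body
        )?;
        served += 1;
    }
    Ok(served)
}

fn main() {
    // Two pipelined GET requests on one "connection".
    let input = "GET / HTTP/1.1\r\nHost: x\r\n\r\nGET / HTTP/1.1\r\nHost: x\r\n\r\n";
    let mut out = Vec::new();
    let served = handle(input.as_bytes(), &mut out).unwrap();
    println!("served {} requests", served); // served 2 requests
}
```

Without the outer loop the second request on the same socket would never be read, which matches the big improvement described above.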
I am fairly new to rust, so take this with a grain of salt, but looking at the code it feels like there are two things that could be happening (or I might be completely wrong):
You starve the CPU and hence there are just too many threads - though if those were scheduled asynchronously on Tokio this should not be happening. (But I see no async in your functions, so why do you think you are using Tokio?)
No new incoming requests can be served while the spawned threads execute
Basically, you would normally schedule a number of workers asynchronously, but that would mean making the listener async as well; since there is no await anywhere, your listener is synchronous.
I think it might be worthwhile to see if async can improve the speed.
Good points. Actually, I removed Tokio in my latest version, and I'm not using the async I/O API either.
But you're right, it would probably perform much better if I used async I/O with a work-stealing runtime like Tokio.