What are some important reasons why `.text()` is `async`?

I am learning the basics of async in the TRPL. If I understand correctly, the HTTP Response contains the HTTP Headers and the body (the HTML content). They may be retrieved as a stream arriving in chunks, or maybe in some other way, and it makes sense for get to be async.

Then text() may strip the BOM and patch malformed sequences (I'm unsure what the patching is exactly; maybe closing tags, or non-UTF-8 bytes).

  • Is it that web pages can be quite large, so the steps in the paragraph above take a while to carry out?
  • Or is there some more important reason why .text() needs to be async?
  • Or is it rather that it must decode the binary data according to some character encoding, and that this process is the one actually taking most of the time?

The snippet I tested is:

use reqwest;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
  let content = reqwest::get("https://play.rust-lang.org/").await?;
  println!("\n\n\n content: {:?}", content);
  let text = content.text().await?;
  println!("\n\n\n text: {:?}", text);
  Ok(())
}

Normally, a get request comes back with a body, and even though it’s not printed there, I assume the body is there, just ignored.

I did ask ChatGPT, but the answer is strange, and suggests that text() is called while the body hasn't arrived yet, which seems odd to me since it is awaited.

To quote:

✅ 2. Why is .text() async?

Because .text() must read the entire response body, and the body:

  • Might still be arriving over the network.

  • May be large, so it’s read in chunks from the stream.

  • Is then decoded as UTF-8, possibly handling BOMs or replacing invalid sequences.

All of that may involve waiting — which is why .text() is async.

An HTTP request normally comes back with a status code, some headers, and then the body. The response arrives as a stream of bytes, so you get to read the status code and the headers sooner than the body. Moreover, they are generally much smaller than the body, so you can read them even faster.

All of this to say that .awaiting an HTTP request and getting a Response only waits to receive the status code and the headers, and gives them to you along with a stream of bytes, possibly not yet fully received, representing the body. This stream is the one you're .awaiting when you call .text(), .json(), .bytes(), etc.
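The chunked arrival described above can be pictured with plain std::io, no network needed: a reader hands out bytes in pieces, and "awaiting the full body" amounts to looping until the source is exhausted. This is only a synchronous stand-in for the async stream reqwest gives you, not reqwest's actual machinery.

use std::io::{Cursor, Read};

// Stand-in for a network stream: the body arrives in pieces, and draining
// the stream means looping until read() reports end-of-stream.
fn read_in_chunks(mut source: impl Read) -> std::io::Result<Vec<u8>> {
    let mut buf = [0u8; 4]; // tiny chunk size, just to make the chunking visible
    let mut body = Vec::new();
    loop {
        let n = source.read(&mut buf)?;
        if n == 0 {
            break; // stream ended: the whole "body" has been received
        }
        body.extend_from_slice(&buf[..n]);
    }
    Ok(body)
}

fn main() -> std::io::Result<()> {
    let body = read_in_chunks(Cursor::new(b"hello world".to_vec()))?;
    assert_eq!(body, b"hello world");
    Ok(())
}

In the real async case, each read is a point where the task may be suspended while more bytes travel over the network, which is exactly why .text() and friends must be async.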

11 Likes

reqwest could offer an interface that only allows waiting for the entire response at once, headers and body together. This would make it less flexible. Here are some reasons why your application might care about being able to receive the data in a streaming fashion rather than waiting for the full response body:

  • Internet connections can sometimes be extremely slow. It is valuable to present to the user as much information as has been received so far, when that is a thing that makes sense.
  • When downloading large files to disk, writing the data as it comes in avoids wasting memory on storing the whole data, and allows you to resume a partial download later if it is interrupted. (Internet connections can be intermittent!)
  • In the event of an erroneous response of some kind, you can detect this earlier and cancel the request faster than waiting for the whole thing. For example, if a mis-configured server starts sending you a large web page instead of the small API response you expected, you can stop as soon as you see that the response Content-Type is text/html. (When making an HTTP request, you should always check the response status and Content-Type before making any assumptions about the format or meaning of the body.)
  • By parsing the response as it comes in, you can avoid needing to parse it in a hurry at the end of the request. This slightly reduces the overall time taken (because download and parsing are happening in parallel) and can improve energy efficiency on some machines (because the CPU can run in a low-power mode doing the work slowly as data arrives, rather than quickly at the end).
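The early-abort idea in the third bullet can be sketched as a small check run as soon as the headers arrive, before any body bytes are awaited. The helper below is hypothetical, not a reqwest API; with reqwest you would inspect res.status() and res.headers() the same way before deciding to call a body method:

// Hypothetical header check (not a reqwest API): decide from the status code
// and the Content-Type whether the body is worth downloading at all.
fn should_read_body(status: u16, content_type: Option<&str>) -> bool {
    // Only proceed for a successful response that claims to be JSON.
    status == 200
        && content_type
            .map(|ct| ct.starts_with("application/json"))
            .unwrap_or(false)
}

fn main() {
    assert!(should_read_body(200, Some("application/json; charset=utf-8")));
    assert!(!should_read_body(200, Some("text/html"))); // mis-configured server
    assert!(!should_read_body(500, Some("application/json")));
}

Because the headers are available before the body, a check like this lets you drop the connection without ever paying for the (possibly huge) body transfer.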
10 Likes

Though be aware that, for an API, you will often get a decent chunk of body data synchronously with the headers: something around 14 kB can be sent even in the initial server handshake response. You could even get multiple responses in one packet if the stars align!

There's no real performance issue in making the body async regardless, though: it's pretty much just checking a flag and returning the data, which you would need to do anyway for some other design.

1 Like

This will wait for the headers to be received, but it does not wait for the body. After all, if the body is large, the end user may wish to use methods to read it incrementally without reading the entire body into memory.

Using the .text().await method tells reqwest to read the entire body into memory and then convert it to a String. Since it must wait for the entire body to be read into memory, it needs to be async.

To clarify, the async part is waiting for the body to be received. Converting it to a string is not async.
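The synchronous half of .text() can be sketched with the stdlib alone: once the whole body is in memory, strip a UTF-8 BOM if present and decode lossily, replacing invalid sequences with U+FFFD. This is only the idea, not reqwest's actual implementation (which also honours the charset advertised in the Content-Type header):

// Rough sketch of what happens after the body bytes have all been received.
// decode_body is a hypothetical helper, not part of reqwest's API.
fn decode_body(bytes: &[u8]) -> String {
    // Drop a leading UTF-8 byte-order mark, if any.
    let bytes = bytes.strip_prefix(b"\xEF\xBB\xBF").unwrap_or(bytes);
    // Invalid UTF-8 sequences become U+FFFD REPLACEMENT CHARACTER.
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    assert_eq!(decode_body(b"\xEF\xBB\xBFhello"), "hello"); // BOM stripped
    assert_eq!(decode_body(b"caf\xFF"), "caf\u{FFFD}");     // invalid byte replaced
}

Note that none of this needs to await anything: it is plain CPU work on a buffer. The await in .text() is spent entirely on receiving that buffer.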

1 Like

So after the .get() request, the response's body keeps being received in parallel; finally, .text(), .json() and similar body-parsing methods await the full body and parse it into the output type.

I guess this is a detail worth describing or mentioning in the docs; as it is, one needs to check the source.

I did not suspect this would be written to disk; I assumed it would stay in memory (though indeed, that would prevent "restarting" a download from where it left off). Is that a file one could spot, say in /tmp, while it downloads?

I agree. I have seen other people confused by Response not being “just a data structure”.

I don’t mean that reqwest will write the data to disk. I mean that if you want to retrieve a large file, usually to disk, then being able to keep the part you’ve downloaded so far (rather than discarding it when a disconnection happens) is useful.

No, using methods like .text() and .json() requires the response to be completely buffered in memory (well, .json() could in theory decode into the result type as the data arrives, but I don't believe that's currently supported by the parser it uses). The result type gives you the value in memory, so there's not really anything clever to do here.

Those methods are for when you are dealing with a reasonably small amount of data, up to a few megabytes really, such as when you are using an API. If you want to read very large resources, such as when downloading files, or when you are doing tricky things like streaming the display of a page as it downloads to reduce latency, you can read the bytes as they arrive with methods like .chunk() (in a loop) or .bytes_stream().

A trivial implementation of downloading an arbitrarily large file may look like:

use std::fs::File;
use std::io::Write;

let mut res = reqwest::get(url).await?.error_for_status()?;
let mut output = File::create(path)?;

while let Some(chunk) = res.chunk().await? {
  output.write_all(&chunk)?;
}

Implementing resume would first check whether the file is already there, set its length in a bytes Range request header, and then, if the response includes the Content-Range response header, open the file in append mode.
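The resume bookkeeping just described can be sketched with two stdlib-only helpers. Both names are hypothetical, not reqwest APIs: you would attach the Range header to your request, and only append to the file if the server answered 206 Partial Content with a matching Content-Range.

// Build the Range header value to skip the bytes we already have on disk,
// e.g. 1024 bytes on disk -> "bytes=1024-".
fn resume_range_header(bytes_on_disk: u64) -> String {
    format!("bytes={}-", bytes_on_disk)
}

// Parse the start offset out of a Content-Range like "bytes 1024-2047/4096",
// so we can verify the server resumed where we asked it to.
fn content_range_start(value: &str) -> Option<u64> {
    value.strip_prefix("bytes ")?.split('-').next()?.parse().ok()
}

fn main() {
    assert_eq!(resume_range_header(1024), "bytes=1024-");
    assert_eq!(content_range_start("bytes 1024-2047/4096"), Some(1024));
    assert_eq!(content_range_start("garbage"), None); // server ignored the Range
}

If the offsets don't match (or the server answered 200 with the full body), the safe fallback is to truncate the file and start the download from scratch.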

2 Likes
