I have a Python application that lists the objects in a Google Cloud Storage bucket, downloads each file, reads it line by line, attempts to parse each line as a JSON dictionary, transforms that dictionary, and finally sends it to BigQuery.
I'm using Python green threads for this. The main program thread iterates over all filenames in the bucket, submitting each file's name to a queue. On the other side of the queue, a thread pool picks up each filename and streams the file back, line by line. The worker attempts to deserialize each line into a JSON dictionary and submits the result to a secondary queue. On the other side of this second queue is the final stage of the pipeline, which transforms the JSON and uploads it into BigQuery.
So in this example, there are three types of threads:
- the main thread, submitting filenames within a bucket to queue A
- workers picking up filenames from queue A, reading these files line by line, attempting to deserialize each line as a JSON dictionary, and submitting the JSON dictionaries to queue B.
- workers picking up JSON dictionaries from queue B, transforming the payload, and then sending to BigQuery.
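Roughly, the three roles above look like the sketch below. It is a simplified illustration under some assumptions, not the real application: it uses the standard library's `threading` and `queue` modules, `FAKE_BUCKET` is a hypothetical in-memory stand-in for the GCS bucket, a plain list stands in for BigQuery, and the "transform" just adds a field.

```python
import json
import queue
import threading

# Hypothetical stand-in for the bucket: file name -> lines of text.
FAKE_BUCKET = {
    "a.jsonl": ['{"id": 1}', "not json", '{"id": 2}'],
    "b.jsonl": ['{"id": 3}'],
}

_STOP = object()  # sentinel that tells a worker to shut down


def read_worker(queue_a, queue_b):
    """Take file names from queue A, parse each line, push dicts to queue B."""
    while True:
        name = queue_a.get()
        if name is _STOP:
            break
        for line in FAKE_BUCKET[name]:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that are not valid JSON
            if isinstance(record, dict):
                queue_b.put(record)


def upload_worker(queue_b, sink, sink_lock):
    """Take dicts from queue B, transform them, and 'upload' (append to a list)."""
    while True:
        record = queue_b.get()
        if record is _STOP:
            break
        transformed = {**record, "processed": True}  # placeholder transform
        with sink_lock:
            sink.append(transformed)


def run_pipeline(n_readers=2, n_uploaders=2):
    queue_a, queue_b = queue.Queue(), queue.Queue()  # unbounded, as described
    sink, sink_lock = [], threading.Lock()
    readers = [
        threading.Thread(target=read_worker, args=(queue_a, queue_b))
        for _ in range(n_readers)
    ]
    uploaders = [
        threading.Thread(target=upload_worker, args=(queue_b, sink, sink_lock))
        for _ in range(n_uploaders)
    ]
    for t in readers + uploaders:
        t.start()
    # Main thread: submit every file name to queue A.
    for name in FAKE_BUCKET:
        queue_a.put(name)
    # Shut down stage by stage: one sentinel per worker.
    for _ in readers:
        queue_a.put(_STOP)
    for t in readers:
        t.join()
    for _ in uploaders:
        queue_b.put(_STOP)
    for t in uploaders:
        t.join()
    return sink
```

Calling `run_pipeline()` returns the transformed records in whatever order the workers finished them; the invalid line is silently dropped.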
Obviously, this is not ideal for a number of reasons:
- It's Python, so there are countless levels of optimization that can be easily had simply by switching to Rust.
- It's CPython, so threads don't give real parallelism: because of the GIL, only one thread runs Python code at a time, and other threads effectively get scheduled when I/O blocks. This means that the CPU load of JSON deserialization and transformation only happens on one CPU at any time, maximum.
- Queues are unbounded so memory is unbounded. This can be easily remedied.
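On the last point, the remedy I have in mind is simply giving each queue a maximum size so that producers block when consumers fall behind. A minimal sketch, assuming the standard library's `queue.Queue` (the function names and the `maxsize` of 8 are illustrative only):

```python
import queue
import threading


def produce(q, n_items):
    """Push n_items integers, then a None sentinel."""
    for i in range(n_items):
        q.put(i)  # blocks while the queue already holds maxsize items
    q.put(None)


def run_bounded(n_items=1000, maxsize=8):
    """Drain a bounded queue and report the largest size it ever reached."""
    q = queue.Queue(maxsize=maxsize)
    producer = threading.Thread(target=produce, args=(q, n_items))
    producer.start()
    out, peak = [], 0
    while True:
        peak = max(peak, q.qsize())
        item = q.get()
        if item is None:
            break
        out.append(item)
    producer.join()
    return out, peak
```

With this, memory held by the queue is capped at `maxsize` items no matter how far ahead the producer could otherwise run.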
This is going to be a long-running job, so performance is very important. Python's throughput is effectively limited to what a single core can do, so it depends only on CPU frequency and idleness, whereas a multi-threaded solution can scale roughly linearly with the number of CPU cores in addition to frequency and idleness.
I've been thinking lately about how to port this into Rust efficiently. I could use crossbeam's multi-producer, multi-receiver channels between the thread pools. I could also throw in rayon perhaps, but it's probably not the most efficient thing for what I'm doing. Obviously parking_lot would be used for any Mutexes or RwLocks to utilize atomic integers instead of costly syscalls.
I then began thinking of Tokio. I have a lot of I/O bound operations occurring, such as paging through all of the filenames in the bucket, buffered reads against the files over TLS, and HTTP uploads into BigQuery. The CPU-bound tasks are largely around JSON serialization/deserialization and transformation. Using multiple thread-pools or even work-stealing threads such as in crossbeam-utils and rayon would be inefficient and would have a lot of OS overhead. This is why I'm thinking of Tokio: it would likely give me a work-stealing thread-pool and would be able to do a lot of concurrent I/O, yielding control back to the pool whenever I/O blocks.
I've used all of the other primitives listed above, but I haven't directly used Tokio yet. Since Tokio seems to be entirely built around running a server, I'm not sure how to set it up for an ETL-like job like this.
First, is this a good idea? Is my thinking on this matter sound? Second, is there documentation on using Tokio in such a fashion?
As an aside, I'd like to keep thanking the amazing Rust community for all of these incredible tools. Rust's concurrency features can't be beat, and the fact that I can think of like 5 different ways to solve this problem is awesome. Thank you to the Rust team and to the community for the language and for all of these incredible libraries!