There's some old stuff around, and some glue code, but no widely used text-to-speech system for Rust yet.
Piper is quite a good open-source text-to-speech system. Listen to some of the examples. It's a modern neural-net system, and it's small enough to run on a Raspberry Pi, so it probably won't take over the whole machine on a desktop. Has anyone connected it to Rust?
It uses ONNX Runtime, Microsoft's open-source inference library. Like most modern AI programs, Piper itself isn't much code; ONNX Runtime is doing the work. There are some Rust crates that bind to ONNX Runtime, but I have no idea which ones are any good.
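For anyone searching, two crates that come up in this space are `ort` (Rust bindings to ONNX Runtime) and `rten` (a pure-Rust inference engine, mentioned below). I can't vouch for either beyond what's in this thread; pulling one in is the usual Cargo dependency, something like:

```toml
# Sketch of a Cargo.toml fragment; versions are placeholders,
# check crates.io for the current releases.
[dependencies]
ort = "*"     # Rust bindings to Microsoft's ONNX Runtime
# or, for a pure-Rust inference engine:
# rten = "*"
```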
This is not a production system, but Daniel McKenna (xd009642) gave a talk on text-to-speech in Rust at RustNation UK 2024. He showed a demo project using Rust plus an ONNX Runtime wrapper. I converted it to use RTen for inference and achieved similar inference performance on an Intel i5 CPU. Note that RTen doesn't run the .onnx model directly (at present); it requires the model to be run through a conversion tool that produces a format similar to ONNX Runtime's .ort format.
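For reference, the conversion step looks something like this. This is a sketch based on the rten project's tooling; the `rten-convert` package name and its default behaviour are taken from that project's documentation and may change:

```shell
# Install the converter (a Python tool shipped with the rten project)
pip install rten-convert

# Convert an ONNX model to RTen's own format
# (by default this writes model.rten next to the input)
rten-convert model.onnx
```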
The demo project mentioned above uses a non-ML method to generate the final audio samples from the model outputs, as discussed in the talk. This produces "underwater"-sounding output.
I haven't tried running Piper's models yet, but I would be interested in giving them a go. (Update: I had a look; unfortunately the voice models need a number of operators that are not yet implemented.)
An update: I now have a pure-Rust demo that runs Piper models using RTen. It is part of the examples crate here. Technically the current demo is "phoneme to speech" rather than "text to speech", as it doesn't yet handle the initial preprocessing step of translating text to phonemes ("This is a text to speech system" => "ðɪs ɪz ɐ tˈɛkst tə spˈiːtʃ sˈɪstəm."); instead it takes phonemes as input. This required adding some new capabilities to the inference engine, so using it requires a git checkout of the rten crate.
On my Intel i5 laptop, generation takes ~120 ms for 1.9 seconds of output. On a Raspberry Pi Zero 2, generation takes about as long as the output itself. On a Raspberry Pi 4 or 5 it will fall somewhere in between.
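To put those numbers in terms of a real-time factor (generation time divided by audio duration; below 1.0 means faster than real time), using only the figures quoted above:

```rust
/// Real-time factor: generation time divided by output audio duration.
/// Below 1.0 means speech is generated faster than it plays back.
fn real_time_factor(generation_secs: f64, audio_secs: f64) -> f64 {
    generation_secs / audio_secs
}

fn main() {
    // ~120 ms to generate 1.9 s of audio on the Intel i5 laptop
    let i5 = real_time_factor(0.120, 1.9);
    assert!(i5 < 0.1); // comfortably faster than real time

    // On the Pi Zero 2, generation takes about as long as the output,
    // i.e. a real-time factor of roughly 1.0.
    let pi_zero = real_time_factor(1.9, 1.9);
    assert!((pi_zero - 1.0).abs() < f64::EPSILON);

    println!("i5 RTF ~ {:.3}, Pi Zero 2 RTF ~ {:.1}", i5, pi_zero);
}
```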
For the text-to-phoneme translation, the original Piper inference tools use a C++ library called piper-phonemize, which is unfortunately a) C++ and b) GPL-licensed. From discussions in the Piper repo, I gather there is work planned to let the models do direct text-to-speech rather than going through a text-to-phoneme step, but no good models have been released yet. I'm not sure how hard it would be to recreate the useful parts of piper-phonemize in Rust, but in the meantime you can use the C++ library through a wrapper.