There's some old stuff around, and some glue code, but no widely used text-to-speech system for Rust yet.
Piper is quite a good open-source text-to-speech system. Listen to some of the examples. It's a modern neural-net system, and it's small enough to run on a Raspberry Pi, so it probably won't take over the whole machine on a desktop. Has anyone connected it to Rust?
It uses ONNX Runtime, Microsoft's open-source inference library. Like most modern AI programs, Piper itself isn't much code; ONNX Runtime is doing the work. There are some Rust crates that bind to ONNX Runtime, but I have no idea which ones are any good.
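For anyone searching, two crates that come up in this space are `ort` (Rust bindings to ONNX Runtime) and `rten` (a pure-Rust inference engine, mentioned below). I can't vouch for either beyond what's in this thread; pulling one in is the usual Cargo dependency, something like:

```toml
# Sketch of a Cargo.toml fragment; versions are placeholders,
# check crates.io for the current releases.
[dependencies]
ort = "*"     # Rust bindings to Microsoft's ONNX Runtime
# or, for a pure-Rust inference engine:
# rten = "*"
```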
This is not a production system, but Daniel McKenna (xd009642) gave a talk on text-to-speech in Rust at RustNation UK 2024. He showed a demo project using Rust plus an ONNX Runtime wrapper. I converted it to use RTen for inference and achieved similar inference performance on an Intel i5 CPU. Note that RTen doesn't run the .onnx model directly (at present); it requires the model to be run through a conversion tool that produces a format similar to ONNX Runtime's .ort format.
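For reference, the conversion step looks something like this. This is a sketch based on the rten project's tooling; the `rten-convert` package name and its default behaviour are taken from that project's documentation and may change:

```shell
# Install the converter (a Python tool shipped with the rten project)
pip install rten-convert

# Convert an ONNX model to RTen's own format
# (by default this writes model.rten next to the input)
rten-convert model.onnx
```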
The demo project mentioned above uses a non-ML method to generate the final audio samples from the model outputs, as discussed in the talk. This produces "underwater"-sounding output.
I haven't tried running Piper's models yet, but I would be interested in giving them a go. (Update: I had a look; unfortunately the voice models need a number of operators that are not yet implemented.)
An update: I now have a pure-Rust demo that runs Piper models using RTen. It is part of the examples crate here. Technically the current demo is "phoneme to speech" rather than "text to speech", as it doesn't yet handle the initial preprocessing step of translating text to phonemes ("This is a text to speech system" => "ðɪs ɪz ɐ tˈɛkst tə spˈiːtʃ sˈɪstəm."); instead it takes phonemes as input. This required adding some new capabilities to the inference engine, so using it requires a git checkout of the rten crate.
On my Intel i5 laptop, generation takes ~120 ms for 1.9 seconds of output. On a Raspberry Pi Zero 2, generation takes about as long as the output itself. On a Raspberry Pi 4 or 5 it will fall somewhere in between.
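To put those numbers in terms of a real-time factor (generation time divided by audio duration; below 1.0 means faster than real time), using only the figures quoted above:

```rust
/// Real-time factor: generation time divided by output audio duration.
/// Below 1.0 means speech is generated faster than it plays back.
fn real_time_factor(generation_secs: f64, audio_secs: f64) -> f64 {
    generation_secs / audio_secs
}

fn main() {
    // ~120 ms to generate 1.9 s of audio on the Intel i5 laptop
    let i5 = real_time_factor(0.120, 1.9);
    assert!(i5 < 0.1); // comfortably faster than real time

    // On the Pi Zero 2, generation takes about as long as the output,
    // i.e. a real-time factor of roughly 1.0.
    let pi_zero = real_time_factor(1.9, 1.9);
    assert!((pi_zero - 1.0).abs() < f64::EPSILON);

    println!("i5 RTF ~ {:.3}, Pi Zero 2 RTF ~ {:.1}", i5, pi_zero);
}
```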
For the text-to-phoneme translation, the original Piper inference tools use a C++ library called piper-phonemize, which is unfortunately a) C++ and b) GPL-licensed. From discussions in the Piper repo, I gather there is work planned to let the models do direct text-to-speech rather than going through a text-to-phoneme step, but no good models have been released yet. I'm not sure how hard it would be to recreate the useful parts of piper-phonemize in Rust, but in the meantime you can use the C++ library through a wrapper.