How to load and run local LLMs (e.g., Llama) in Rust?

Hi everyone,

I'm exploring ways to run local large language models (like Meta's Llama, Llama 2, or Llama 3) directly from a Rust application. I understand that Rust doesn’t have a built-in LLM module, but I’ve seen crates like llm, llama-rs, and mistralrs in the ecosystem.

My goal is to:

  • Load a quantized GGUF model file (e.g., llama-3-8b.Q4_K_M.gguf) from disk,
  • Perform text completion or chat-style inference,
  • Ideally support CPU (and optionally GPU via Metal/CUDA).

However, I’m a bit overwhelmed by the options and their documentation. Could someone share:

  1. Which crate is currently the most actively maintained and beginner-friendly for this use case?
  2. A minimal working example of loading a GGUF model and generating text?
  3. Any gotchas or performance tips (e.g., model format requirements, threading, memory usage)?

I’ve tried snippets from mistralrs and llm, but ran into issues with model compatibility or unclear API usage. Any guidance or pointers to up-to-date tutorials would be greatly appreciated!

Thanks in advance 🙏

I used Kalosm. Considering I don't know the first thing about LLMs and still managed to figure it out, I think you'll have no problem with it.
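
From memory, my first test looked roughly like the sketch below. Treat the exact method names (stream_text, with_max_length, to_std_out) as assumptions on my part; they may differ between Kalosm versions, so check the current docs.

```rust
// Cargo.toml needs kalosm (with its language feature) and tokio; versions omitted here.
use kalosm::language::*;

#[tokio::main]
async fn main() {
    // Downloads and loads a default Llama model on first run; there is also a
    // builder for pointing it at a specific model source.
    let mut model = Llama::new().await.unwrap();

    let prompt = "Write a short poem about Rust: ";
    print!("{prompt}");

    // Stream the completion to stdout as tokens are generated.
    let stream = model
        .stream_text(prompt)
        .with_max_length(256)
        .await
        .unwrap();
    stream.to_std_out().await.unwrap();
}
```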

Thanks so much for the suggestion — I hadn’t heard of Kalosm before, but it looks really promising!

To be honest, as someone completely new to LLMs in Rust, I’ve been a bit overwhelmed by all the options (llm, mistralrs, llama-rs, etc.) and wasn’t sure which one is the most beginner-friendly or well-maintained.

I’ll go ahead and give Kalosm a try for loading my local Llama model (I have a GGUF file ready). It seems like it handles a lot of the complexity under the hood, which is exactly what I need right now.

Appreciate the pointer — I’ll start setting up a test environment with it! 🙌

I have used both candle and mistralrs for model inference. I can't tell what kinds of compatibility issues you ran into. On macOS, the biggest problem I had to address was getting the SDK configured correctly (see Function 'cast_bf16_f16' does not exist · Issue #2660 · huggingface/candle · GitHub).

You are right, though: documentation is generally inadequate (or outright bad) [1], and tutorials are hard to come by. Given how actively these AI frameworks are developed, tutorials go stale quickly anyway. Your best bet is to play around with things until you get them working. You may have to do some debugging and make patches (see below, for instance).

The goals you listed are mostly already implemented in several examples available for both candle and mistralrs. Loading a model from disk is API specific, for example: mistral.rs/mistralrs/examples/gguf_locally/main.rs at master · EricLBuehler/mistral.rs · GitHub and candle/candle-examples/examples/quantized at main · huggingface/candle · GitHub
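
To give you a feel for it, the mistralrs path boils down to roughly the following. This is a sketch paraphrased from the linked gguf_locally example: the directory is a placeholder, the file name is the one from your post, and the builder details may differ between versions.

```rust
use anyhow::Result;
use mistralrs::{GgufModelBuilder, TextMessageRole, TextMessages};

#[tokio::main]
async fn main() -> Result<()> {
    // Point the builder at a local directory and the GGUF file(s) inside it.
    // Depending on the GGUF metadata, you may also need to supply a chat
    // template (e.g. via with_chat_template) for chat-style inference.
    let model = GgufModelBuilder::new(
        "path/to/model/dir",            // placeholder directory
        vec!["llama-3-8b.Q4_K_M.gguf"], // the file name from your post
    )
    .with_logging()
    .build()
    .await?;

    let messages = TextMessages::new()
        .add_message(TextMessageRole::User, "Explain borrowing in one paragraph.");

    let response = model.send_chat_request(messages).await?;
    println!(
        "{}",
        response.choices[0].message.content.as_deref().unwrap_or_default()
    );
    Ok(())
}
```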

Alternatively, you can use huggingface-cli [2] to populate your $HOME/.cache/huggingface/hub/ directory, or set one of the cache environment variables (HF_HOME or HF_HUB_CACHE) to your preferred location.
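
If you would rather stay in Rust, the hf-hub crate (the one candle's examples use for downloads) can populate the same cache. A minimal sketch, with placeholder repo and file names:

```rust
use hf_hub::api::sync::Api; // hf-hub crate, blocking API

fn main() -> anyhow::Result<()> {
    // Files land in the standard $HOME/.cache/huggingface/hub/ layout,
    // or wherever HF_HOME / HF_HUB_CACHE points.
    let api = Api::new()?;
    let repo = api.model("your-org/your-model-GGUF".to_string()); // placeholder repo id
    let path = repo.get("model.Q4_K_M.gguf")?;                    // placeholder file name
    println!("cached at {}", path.display());
    Ok(())
}
```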

I will warn you that CPU inference is slow with large models, even the 8B parameter model you want to use. I get around 12 output tokens per second [3] on the Mistral 7B GGUF model with CPU inference on an M3 Max (12 performance cores/4 efficiency cores, 64 GB RAM), and 60 output tokens per second with --features metal on the same chipset (40 GPU cores). It's up to you, but I wouldn't bother with CPU inference.

Note that I wasn't able to get the most recent mistralrs git revision working with Metal until I tracked down the bug: Fix hang and performance drop with Metal by parasyte · Pull Request #1662 · EricLBuehler/mistral.rs · GitHub

Some general tips:

  • Quantization really matters for inference performance. And the best quantization largely depends on hardware and model architectures.
    • Requantizing a normal 7B parameter model [4] with --isq AFQ8 gives me around 43 output tokens per second (see the sketch after this list).
    • The same model with AFQ4 gives me around 70 output tokens per second.
    • Smaller quantizations usually perform better but have lower inference quality. AFQ2 completely falls apart on the model I'm running, causing it to get stuck endlessly repeating itself.
    • Is GGUF a hard requirement? You won't be able to requantize GGUF files, which means you can't tune the quantization for your own hardware; you're limited to whatever quantizations the model provider gives you.
  • Use a server instead of linking the library to save yourself a great deal of recompiling/linking time.
  • I recommend using the HuggingFace tools and so on. It is much easier to go with the flow than against the grain. Especially when you are in unfamiliar territory.
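
To make the ISQ bullet above concrete: with the mistralrs crate, in-place requantization at load time looks roughly like the sketch below. I'm assuming the IsqType variant names mirror the CLI's --isq values; the model id is the one from footnote [4].

```rust
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Load the full-precision weights and requantize them at load time (ISQ),
    // so the quantization level can be tuned to the local hardware.
    let model = TextModelBuilder::new("mistralai/Mistral-7B-Instruct-v0.3")
        .with_isq(IsqType::AFQ8) // the numbers above used AFQ8/AFQ4 on Metal; Q8_0, Q4K, etc. are alternatives
        .with_logging()
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(TextMessageRole::User, "Summarize the borrow checker in two sentences.");

    let response = model.send_chat_request(messages).await?;
    println!(
        "{}",
        response.choices[0].message.content.as_deref().unwrap_or_default()
    );
    Ok(())
}
```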

  1. The ML domain is still largely research projects with little effort put toward robustness. ↩︎

  2. I use micromamba to manage my Python virtualenvs. That lets me install hf and all of its dependencies cleanly without breaking the system Python. micromamba itself is installed with Homebrew. You can also use it to install other ML tools that you might need along the way, like PyTorch. ↩︎

  3. It used to be 3 output tokens per second, the last time I used mistralrs. CPU performance has improved by about 4x! ↩︎

  4. mistralai/Mistral-7B-Instruct-v0.3 ↩︎


Thank you so much for such a precise and thoughtful analysis of the exact pain points I’ve been struggling with! Your insights not only hit the nail on the head but also offered genuinely practical guidance; this will save me a lot of trial and error.

I especially appreciated your closing remark:
“It is much easier to go with the flow than against the grain. Especially when you are in unfamiliar territory.”

It reminded me of a Chinese saying:
“顺势而为,事半功倍。”
("Go with the trend, and you’ll achieve twice the result with half the effort.")

That really resonates with me—and I’ll keep it in mind as I navigate this space.

By the way, I’ve recently been exploring Kalosm as a local-first Rust framework for LLMs. Have you had any experience with it, or do you have thoughts on how it compares to Candle or Mistral.rs?

Thanks again for your generosity and wisdom! 🙏


kalosm crossed my path a while back, but I did not use it, so I have no direct experience with it. It appears to be similar in scope to mistralrs: both are built on top of candle and provide "higher level" tools.
