Truncating a string

I just want to truncate a string after N glyphs. I understand UTF-8 and why "truncate" won't work. But what does?

I've found

I must have missed something. There has to be a standard way to do this short of writing it myself.

Have a look at the unicode-segmentation crate. It contains an extension trait called UnicodeSegmentation which can be used to iterate over the graphemes in a string. With that in hand you can just .take(N).

the_string.char_indices().nth(n) should give you the appropriate index (if you want N chars – I'm not sure if that's what you are after, or you want grapheme clusters, or all of that plus Unicode width, etc.)
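
As a minimal sketch of that suggestion, assuming "N chars" means N Unicode scalar values (not grapheme clusters or display columns), the index from char_indices().nth(n) can be used to slice the string. The helper name truncate_chars is hypothetical, not a standard library function.

```rust
/// Returns the prefix of `s` holding at most `n` chars (Unicode scalar values).
/// `truncate_chars` is a hypothetical helper name, not a std API.
fn truncate_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // `idx` is the byte offset at which the char with index `n` starts,
        // so everything before it is exactly `n` chars.
        Some((idx, _)) => &s[..idx],
        // The string has `n` chars or fewer: nothing to cut.
        None => s,
    }
}
```

Because it returns a borrowed slice, this does not allocate, and a string that is already short enough comes back unchanged.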

Can you describe precisely what you mean here? When I look at https://www.unicode.org/glossary/#glyph, I see something font-/rendering-dependent, which it sounds like you don't want.

Are you ok if "noël" truncates to "noe" sometimes, for example?
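
To make the "noël" case concrete, here is a small sketch. It assumes the string may arrive in either of two common encodings: precomposed (ë as the single char U+00EB) or decomposed (e followed by the combining diaeresis U+0308). The names take_chars, PRECOMPOSED, and DECOMPOSED are made up for illustration.

```rust
/// "noël" with ë as one precomposed char (U+00EB): 4 chars.
const PRECOMPOSED: &str = "no\u{00EB}l";
/// "noël" with ë as 'e' plus combining diaeresis (U+0308): 5 chars.
const DECOMPOSED: &str = "noe\u{0308}l";

/// Keeps the first `n` chars of `s`, with no regard for grapheme clusters.
/// Hypothetical helper, used only to show the difference between the two forms.
fn take_chars(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}
```

With n = 3, the precomposed form keeps the ë, while the decomposed form is cut between the e and its combining mark and comes out as plain "noe". Counting by chars is only safe if that outcome is acceptable.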

(The two posts above mine are both good answers, but I don't know from the question which does what you want.)

Yes, I know you can take a string apart with UnicodeSegmentation and put it back together. But that's a lot of work to just keep log lines from overflowing.

I've written grapheme-oriented word wrap, so I know how, but that's overkill when all I want is truncation. Especially since most of the time, the string will be short enough it doesn't need to be truncated.

I'm just amazed that this isn't a standard library function.

There are actually good reasons that a bunch of otherwise desirable functionality wasn't included by default: it's so that the APIs can mature, evolve separately from stdlib, and if such a lib ever needs replacing, it can be done without gathering more and more deprecated stuff in stdlib over time.

I would be amazed if it were.

But what do you even mean by that?

Suppose the string I want to cut is “Hello, World!”. And I want to cut 8 characters. Is it “Hello, ” or “Hell”? What if the original had included only one space?

I strongly suspect that if you want to keep log lines from overflowing, then you want “Hell”. But then you not only need to write console-aware code, you also need a full-blown terminal library to know which characters your terminal treats as single-width and which as double-width!

For line wrapping, wouldn't you want something like ridiculousfish/widecharwidth (a public domain wcwidth implementation) rather than grapheme clusters?

As far as I can tell, neither Go nor Python ("batteries included") has support for Unicode segmentation by grapheme cluster in its standard library.

Truncation itself is a standard library function. It's called String::truncate.

It operates on byte indices. So if you give it a byte index that isn't on a valid char boundary, then it panics. If you know your log lines are all ASCII, then assume that every byte is a character and use String::truncate however you like.
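
If the limit is a byte budget and the input might not be pure ASCII, one way to avoid the panic is to back up to the nearest char boundary before truncating. This is only a sketch: truncate_at_boundary is a hypothetical helper name, and it uses nothing beyond String::len, str::is_char_boundary, and String::truncate.

```rust
/// Truncates `s` to at most `max_bytes` bytes, backing up to the nearest
/// char boundary so that `String::truncate` cannot panic.
/// Hypothetical helper, not part of the standard library.
fn truncate_at_boundary(s: &mut String, max_bytes: usize) {
    // Already short enough: leave the string untouched.
    if max_bytes >= s.len() {
        return;
    }
    let mut idx = max_bytes;
    // Index 0 is always a char boundary, so this loop terminates.
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    s.truncate(idx);
}
```

The result can be shorter than max_bytes by up to three bytes, since a UTF-8 char occupies at most four bytes.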

If you want to treat each codepoint as a letter, then write one line of code to compute where you want to truncate:

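// `s` is a mutable String. `upto` is the byte offset where the 11th char starts,
// or the full length if the string has 10 chars or fewer.
let upto = s.char_indices().map(|(i, _)| i).nth(10).unwrap_or(s.len());
s.truncate(upto);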

If you want to get it as correct as possible, then use graphemes via the unicode-segmentation crate. You can use the same code as for chars, but with grapheme_indices(true) in place of char_indices() (the argument selects extended grapheme clusters).

No need to take anything apart and put it back together. Just find the index of the last "glyph" you want to show, find its byte offset, and then do your truncation.

If you want to truncate a string to a maximum length approximately, you can use the unicode-width library to figure out how wide candidate pieces are.

If you want to truncate a string to a maximum length exactly, then you need feedback from your text renderer, because widths of arbitrary strings are affected by both the renderer per se and the font chosen. (It's actually possible to do this with terminals — you can write a string and then ask the terminal where the cursor ended up, and memoize the results so you get it right every time after the first. I've implemented this strategy, though not as a separable crate.)
