Format string into fixed space

I am trying to format a &str into a field that is exactly 7 "characters" wide. I could do the following:

fn fixed_size(inp: &str) -> String {
   // `truncate` takes a byte index and panics if it is not on a char boundary.
   let mut result = inp.to_string();
   result.truncate(7);
   format!("{result:<7}")
}

or

use std::iter;

fn fixed_size(inp: &str) -> String {
   // `repeat` must yield `char` (not `&str`) to chain with `chars()`.
   inp.chars().chain(iter::repeat(' ')).take(7).collect()
}

but (1) that's not exactly right with unicode and (2) seems way more convoluted than it needs to be

EDIT:

  • I want exactly 7 not at most 7
  • use-case is structured output for terminal, so 7 visual columns
  • I can tolerate if it breaks for "weird" characters but ideally it would be "not my fault" and fixable by updating some crate

Do you want 7 chars or 7 bytes? If the former, then it is almost trivial using iterators: s.chars().take(7).collect(). But that doesn't account for grapheme clusters (when a single user-perceived symbol spans multiple Unicode code points; Go's "runes" are code points, like Rust's char). If you want 7 bytes, it gets more involved. The example using fold is not ideal because it lacks an early return for when you pass really long strings, but you can change that to a for loop instead :)

If you want no more than 7 "visual columns", like for use in the terminal, then you'll want to use unicode-width to calculate how many columns a given char takes up (most emoji will take 2, for example, &nbsp; and ZWJ will take zero, etc.).

"character" is ambiguous. what's your use case?

if you want to set the max limit of memory, then the byte length is the correct measurement, and truncate() already does that. if you are not sure 7 is at code point boundary, use floor_char_boundary()/ceil_char_boundary() to round it down or up first.
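
A minimal sketch of the byte-limited case, using only is_char_boundary() from the standard library so it does not depend on which Rust version has floor_char_boundary() available. The function name truncate_bytes is made up for illustration:

```rust
/// Return the longest prefix of `inp` that is at most `max_bytes` long
/// and still ends on a char boundary (i.e. rounds the cut point down).
fn truncate_bytes(inp: &str, max_bytes: usize) -> &str {
    if inp.len() <= max_bytes {
        return inp;
    }
    let mut end = max_bytes;
    // Index 0 is always a char boundary, so this loop terminates.
    while !inp.is_char_boundary(end) {
        end -= 1;
    }
    &inp[..end]
}
```

Unlike String::truncate(), this never panics on a multi-byte character; it just returns a slightly shorter prefix.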

if you want the first 7 codepoints (the char type in rust), use the chars() iterator [1]:

let result = inp.chars().take(7).collect::<String>();

if you want 7 human perceived "characters" as in natural languages [2], this is a really hard problem (just like all unicode problems), and such features are not available in the standard library, you'll have to use third party crate, such as unicode-width, unicode-segmentation, etc. read their documentation for details.

I want to add: unicode grapheme clusters in theory can have unbounded number of codepoints, so even if you are processing unicode, you still need to set an upper limit of the byte length.
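
To make that concrete, here is a small sketch (the helper name stacked_accents is hypothetical) that builds one base letter followed by n combining acute accents. Under the usual grapheme segmentation rules this is still a single cluster, yet it grows by two bytes per accent:

```rust
/// Build "e" followed by `n` combining acute accents (U+0301).
/// Combining marks attach to the preceding base character, so the
/// result is perceived as one "character" no matter how large `n` is.
fn stacked_accents(n: usize) -> String {
    let mut s = String::from("e");
    for _ in 0..n {
        s.push('\u{0301}'); // 2 bytes in UTF-8
    }
    s
}
```

So "take 7 grapheme clusters" alone does not bound memory; a byte cap on top of it does.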


  1. although codepoint count isn't a very useful metric if you are dealing with unicode ↩︎

  2. the unicode term is "grapheme cluster" ↩︎

Tangentially, shouldn't &nbsp; take one column rather than zero? It's still a space.

I guess I still need to either .chain() the iterator or format to get it to exactly 7 (as opposed to at most 7)
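
For what it's worth, if counting chars (code points) is good enough, the standard formatter can truncate and pad in one go: for strings, the precision acts as a maximum length and the width as a minimum, both counted in chars. A minimal sketch, which does not address visual columns:

```rust
/// Truncate to at most 7 `char`s, then pad with spaces to at least 7.
/// Both limits count Unicode scalar values, not terminal columns.
fn fixed_size(inp: &str) -> String {
    format!("{inp:<7.7}")
}
```

fixed_size("ab") gives "ab" followed by five spaces, and longer input is cut after its seventh char. Wide or zero-width characters will still misalign in a terminal.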

Use-case is structured formatting of terminal output

so I guess the unicode-segmentation way would be along the lines of

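use unicode_segmentation::UnicodeSegmentation;

fn fixed_size(inp: &str) -> String {
    // `graphemes(true)` iterates over extended grapheme clusters as `&str`.
    inp.graphemes(true).take(7).collect()
}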

Something like this?

Yes, my bad. :)

Yes: unicode-width is what rustc uses for this and it works "well enough". Some terminals have issues with emoji being rendered 1.5 columns wide. Most terminals have no support for "compound emoji" (like yours does), i.e. grapheme clusters that are meant to be shown as a single emoji, like the ZWJ family above, so rustc simply removes all ZWJ from the output so that underlines are more likely to align properly with their intended text (cue "rustc separates families" sub-thread).

Given the updates, you can write something like the following, but could optimize it further to avoid a few allocations:

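fn visual(s: &str) -> String {
    let mut x = String::with_capacity(7); // Capacity is in bytes, so this may still grow; it only limits the number of reallocations.
    let mut w = 0; // columns used so far
    for c in s.chars() {
        // `width` returns None for control characters; count those as 1 column.
        let c_w = unicode_width::UnicodeWidthChar::width(c).unwrap_or(1);
        if w + c_w > 7 {
            break;
        }
        w += c_w;
        x.push(c);
    }
    // Pad on the left with spaces up to 7 columns (this right-aligns the text).
    for _ in w..7 {
        x.insert(0, ' ');
    }
    x
}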
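
A possible variation, sketched here with only the standard library: pad on the right (matching the {:<7} in the question) and take the width function as a parameter, so the column table can be swapped out. The names fit_columns and toy_width are made up for illustration, and toy_width is a deliberately crude stand-in; in real use you would presumably pass something like |c| unicode_width::UnicodeWidthChar::width(c).unwrap_or(1) instead.

```rust
/// Truncate or pad `s` so that it occupies exactly `cols` columns,
/// as measured by the caller-supplied `width_of` function.
fn fit_columns(s: &str, cols: usize, width_of: impl Fn(char) -> usize) -> String {
    let mut out = String::with_capacity(cols);
    let mut used = 0;
    for c in s.chars() {
        let w = width_of(c);
        if used + w > cols {
            break;
        }
        used += w;
        out.push(c);
    }
    // `used <= cols` holds here, so the subtraction cannot underflow.
    // Appending spaces pads on the right (left-aligned text).
    out.extend(std::iter::repeat(' ').take(cols - used));
    out
}

/// Hypothetical stand-in for a real width table:
/// ASCII counts as 1 column, everything else as 2.
fn toy_width(c: char) -> usize {
    if c.is_ascii() { 1 } else { 2 }
}
```

When a 2-column character would straddle the limit, it is dropped and a space fills the leftover column, so the result is always exactly cols wide according to width_of.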

Terminal width for non-ASCII text is even more cursed than all the complications of what is a "character" in Unicode, because terminal implementations have their own opinions on which code points are "wide" and aren't, and this even varies by fonts installed.

There are crates that contain some tables/heuristics for simple cases, and there are crates that perform ANSI hacking black magic to measure actual rendered width of text in the terminal.
