Capturing stdout on Windows - allocation failure

I'm running the following function in a test:

use std::fs::OpenOptions;
use std::io;
use std::path::Path;
use std::process::{Command, Stdio};

fn spellcheck(dir: &Path) -> Option<String> {
  let pdf2text_status = Command::new("pdftotext")
    .current_dir(dir)
    .arg("Schüler_Name.pdf")
    .status()
    .unwrap();

  assert!(pdf2text_status.success());

  let mut txt = OpenOptions::new()
    .read(true)
    .open(&dir.join("Schüler_Name.txt"))
    .unwrap();

  let mut aspell = Command::new("aspell")
    .arg("--lang=de")
    .arg(&format!("--add-extra-dicts={}", ASPELL_PWS))
    .arg("list")
    .stdin(Stdio::piped())
    .stderr(Stdio::piped())
    .stdout(Stdio::piped())
    .spawn()
    .unwrap();

  let aspell_stdin = aspell.stdin.as_mut().unwrap();
  io::copy(&mut txt, aspell_stdin).unwrap();

  let output = aspell.wait_with_output().unwrap();

  assert!(output.status.success());
  assert_eq!(String::from_utf8(output.stderr).unwrap(), String::new());

  if !output.stdout.is_empty() {
    let s = String::from_utf8_lossy(&output.stdout); // problem here
    Some(s.to_string())
  } else {
    None
  }
}

Basically, I'm capturing the output of aspell, after arranging things properly (producing the text file, feeding it into aspell via stdin).

The problematic line is marked with a comment. I first tried String::from_utf8, which panicked, so I settled on String::from_utf8_lossy, since I certainly don't need the non-UTF-8 characters. But that failed with:

test correct_pdf_names_when_error ... ok
test render_all_demo ... ok
test tilde_in_dir ... ok
memory allocation of 139225504 bytes failed
error: test failed, to rerun pass --test full_runs

Caused by:
process didn't exit successfully: D:\a\zrs\zrs\target\debug\deps\full_runs-57c67b03501e848c.exe --nocapture (exit code: 0xc0000409, STATUS_STACK_BUFFER_OVERRUN)
Error: Process completed with exit code 127.

Did I do something wrong? Seems to me this shouldn't really happen without unsafe, right?

Relatedly: what can I do about this? I do need the output from aspell. Note: on Linux the test runs fine (using String::from_utf8).

Read this https://devblogs.microsoft.com/oldnewthing/20190108-00/?p=100655


In any case, I would use a debugger to find out which part of the code actually triggered the OOM error, ideally with a stack trace. Also check what from_utf8_lossy returns: is it Borrowed or Owned?

Also, can you give a reproducible example? Like the txt which produces this error?

Pretty sure it's owned, since from_utf8 returns an error. Could check...
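The Borrowed/Owned question can be checked without a debugger. Here is a minimal sketch; the helper name lossy_is_owned and the sample bytes are made up for illustration and are not the real aspell output:

```rust
use std::borrow::Cow;

/// Returns true if `from_utf8_lossy` had to build a new String,
/// i.e. the input contained at least one invalid UTF-8 sequence.
fn lossy_is_owned(bytes: &[u8]) -> bool {
    matches!(String::from_utf8_lossy(bytes), Cow::Owned(_))
}

fn main() {
    // Hypothetical sample data, not the actual aspell output.
    let valid = "Schüler\n".as_bytes();
    // 0xE2 (226) starts a multi-byte sequence, but a newline follows it.
    let invalid = b"Sch\xE2\nName\n";

    println!("valid   -> owned: {}", lossy_is_owned(valid));
    println!("invalid -> owned: {}", lossy_is_owned(invalid));
}
```

Valid input is returned as Cow::Borrowed with no allocation; only input with invalid sequences forces an owned copy with U+FFFD replacement characters.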

This is all on CI for now, so I can't easily reproduce. There's quite some setup involved, too (dictionary, installing this stuff on windows...), and I can't share the file... but I might be able to find a reproducer I can share.

(e) I can however of course easily reproduce the bytewise stdout of aspell, but I wouldn't know what to do with it really...
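One thing that can be done with the raw bytes is to list exactly which ones are not valid UTF-8. A rough sketch, assuming the bytes are available as a slice; the function name invalid_utf8_bytes is hypothetical and the input in main is invented:

```rust
/// Collects (offset, byte) for every byte that starts an invalid UTF-8 sequence.
fn invalid_utf8_bytes(mut bytes: &[u8]) -> Vec<(usize, u8)> {
    let mut found = Vec::new();
    let mut base = 0; // offset of `bytes` within the original input
    while let Err(e) = std::str::from_utf8(bytes) {
        let start = e.valid_up_to();
        found.push((base + start, bytes[start]));
        // error_len() is None when the input ends in the middle of a sequence.
        let skip = match e.error_len() {
            Some(n) => n,
            None => break,
        };
        bytes = &bytes[start + skip..];
        base += start + skip;
    }
    found
}

fn main() {
    // Made-up input: two stray 0xE2 bytes, each right before a newline.
    let sample = b"ab\xE2\ncd\xE2\n";
    for (offset, byte) in invalid_utf8_bytes(sample) {
        println!("offset {}: byte {} (0x{:02X})", offset, byte, byte);
    }
}
```

This relies only on Utf8Error::valid_up_to and Utf8Error::error_len from the standard library, so it can run on the playground against a pasted byte array.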

(ee) Oh yeah, and did you see the line "memory allocation of 139225504 bytes failed"? Maybe the STATUS_STACK_BUFFER_OVERRUN is a fluke, but isn't that line printed by Rust on allocation failure?

Is your Windows CI limited to 256M or something? It could literally be an allocation failure.

One step beyond that is asking why there's non-UTF-8 in the file. Is it using UCS-2 or some other non-UTF-8 encoding, for example?

Note, it's not a file containing non-UTF-8, it's the stdout of aspell. But that link sounds like I should try that LANG setting.

Not sure how much my CI is limited, but it's enough to compile a rather hefty set of Rust dependencies. It's a standard GitHub Actions runner, so I guess that makes the RAM 7 GB.

(e) Alas, the env var didn't help. I could however put the bytes onto the playground, run from_utf8_lossy on them, and print the result. The offending bytes all seem to be 226, each one right before a newline or something. That should be googleable. Well, it was, and it seems to me it's â in the Western European code pages. Not sure why aspell would use that encoding or put out that char, but... well, I could just filter it out.
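If the output really is in a Western European single-byte encoding, decoding it is an alternative to filtering. A small sketch, assuming ISO-8859-1 (Latin-1), where every byte maps to the Unicode code point with the same value; Windows-1252 differs in the 0x80..0x9F range, so this is only an approximation for that code page. The function name decode_latin1 is made up:

```rust
/// Decodes ISO-8859-1 (Latin-1) bytes: each byte becomes the Unicode
/// code point with the same numeric value.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

fn main() {
    // Byte 226 (0xE2) is the one that showed up in the aspell output.
    println!("{}", decode_latin1(&[226]));
    // Invented example word, encoded as Latin-1 (0xFC is ü).
    println!("{}", decode_latin1(b"Sch\xFCler"));
}
```

Unlike from_utf8_lossy, this never produces replacement characters, but it gives wrong text if the bytes are in some other encoding.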

So the overall solution here seemed to be to add --encoding=utf-8 to aspell's args, although LANG=de_DE.UTF-8 was supposed to do the same thing.
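For reference, the adjusted invocation could look roughly like this. The sketch only builds the command and prints its arguments, so it runs without aspell installed; the function name aspell_command and the dictionary path are placeholders standing in for the ASPELL_PWS constant above:

```rust
use std::ffi::OsStr;
use std::process::{Command, Stdio};

/// Builds (but does not spawn) the aspell invocation with an explicit encoding.
/// `extra_dict` stands in for the ASPELL_PWS constant from the question.
fn aspell_command(extra_dict: &str) -> Command {
    let mut cmd = Command::new("aspell");
    cmd.arg("--lang=de")
        .arg("--encoding=utf-8") // the flag that fixed the output
        .arg(format!("--add-extra-dicts={}", extra_dict))
        .arg("list")
        .stdin(Stdio::piped())
        .stderr(Stdio::piped())
        .stdout(Stdio::piped());
    cmd
}

fn main() {
    let cmd = aspell_command("custom.pws"); // hypothetical dictionary path
    let args: Vec<&OsStr> = cmd.get_args().collect();
    println!("{:?}", args);
}
```

With the encoding set explicitly, String::from_utf8 on the captured stdout should succeed on Windows as it does on Linux, and the lossy conversion is no longer needed.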
