Compiling pywheel for tokenizers==0.7.0 (pip package)

I have been trying to

pip install tokenizers==0.7.0

but have been running into issues building the wheel (a pyproject.toml-based build). Specifically, the Rust compiler reports that the "const_fn" feature has been removed:

  error[E0557]: feature has been removed
    --> /home/user/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lock_api-0.3.4/src/lib.rs:91:42
     |
  91 | #![cfg_attr(feature = "nightly", feature(const_fn))]
     |                                          ^^^^^^^^ feature has been removed
     |
     = note: split into finer-grained feature gates
  
       Running `/tmp/pip-install-g3gaovs1/tokenizers_9c083c18f7d44582b4bf2b519512d3b3/target/release/build/libc-ea0b5313de561e2c/build-script-build`
  For more information about this error, try `rustc --explain E0557`.
  error: could not compile `lock_api` (lib) due to previous error

The only instance I could find through Google of a similar issue was here, but this solution does not seem to work here.

I could not figure out how to build this wheel or how to revert the Rust compiler version so that it compiles with "const_fn" working.

Thanks in advance!

Is there a reason against installing the latest version of tokenizers? I'm not familiar with this program or library at all, but I could imagine if they were relying on nightly rust features, the later versions would have addressed these issues already. Seems like 0.7.0 is more than 3 years old.

I'm also not really understanding the installation procedure. Did it involve you installing a Rust compiler yourself? If so, you could select an older version like 1.42.0. Edit: actually, you'd need an old nightly from around April 2020, e.g. via the rustup default command. But with dependency resolution, it's possible that this runs into issues, too. It's hard to help you further without an exact reproduction of your underlying issue, assuming there's any good reason why you cannot use a newer version of tokenizers.

This issue seems relevant, though it doesn't seem to offer solutions

Never mind. I realized I was being silly and could just enable the feature that I wanted. :man_facepalming:

Thank you so much for your help and fast response! I hope you have a good day!

Thank you so much for the fast response!

Yes, there is a reason: it has been suggested that the newer versions of tokenizers prevent multithreading (for LLM training), since I get this warning:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
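For reference, the second option in that warning could look roughly like this in Python. This is only a minimal sketch: the variable name and its two values come from the warning text itself, while the helper function `tokenizers_parallelism` is a hypothetical name used for illustration. It assumes the variable has to be set before tokenizers is first used and before any worker processes are forked.

```python
import os

# Set the variable early, before the tokenizers library is first used and
# before any worker processes are forked, so child processes inherit it.
# The value "false" is one of the two values named in the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"


def tokenizers_parallelism() -> str:
    """Hypothetical helper: return the current setting, or 'unset'."""
    return os.environ.get("TOKENIZERS_PARALLELISM", "unset")


print(tokenizers_parallelism())
```

Whether "true" or "false" is the right value depends on how the training script forks its workers, so treat the value above as a placeholder.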

Also, I did have to manually install a rust compiler since the binary for this version is not cached (or is configured in some way that forces compilation). Running rustup default gives 1.20.0-x86_64-unknown-linux-gnu (default). How do I set this to the nightly version that you suggested? I am having trouble finding any specific version number for nightly.

I know this is of limited relevance, as your issue seems to be solved, but FYI, to refer to specific nightly versions in rustup, they are called nightly-YYYY-MM-DD for any date. So you could e.g. rustup install nightly-2020-04-30 and similarly rustup default nightly-2020-04-30, just as you can do with stable, or nightly, or specific version numbers like 1.60.0.
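To make the naming scheme concrete, here is a small Python sketch that builds the dated toolchain name and the two rustup command lines mentioned above. The function names `nightly_toolchain` and `rustup_commands` are made up for this example, and the commands are only executed when `run=True` is passed, which assumes rustup is installed and on the PATH.

```python
import datetime
import subprocess


def nightly_toolchain(day: datetime.date) -> str:
    """Format a rustup toolchain name of the form nightly-YYYY-MM-DD."""
    return f"nightly-{day.isoformat()}"


def rustup_commands(day: datetime.date, run: bool = False) -> list[list[str]]:
    """Build the install/default commands; execute them only if run=True."""
    name = nightly_toolchain(day)
    commands = [
        ["rustup", "install", name],   # download the dated nightly
        ["rustup", "default", name],   # make it the default toolchain
    ]
    if run:
        for cmd in commands:
            # check=True raises CalledProcessError if rustup fails
            subprocess.run(cmd, check=True)
    return commands


for cmd in rustup_commands(datetime.date(2020, 4, 30)):
    print(" ".join(cmd))
```

The date here is just the one from the example above; whether that particular nightly actually builds tokenizers 0.7.0 is not something this sketch checks.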

On that note, having 1.20 reported as your default is a weird choice; in case you didn’t do that deliberately for some purpose, you might want to set it to rustup default stable, which is the most common choice.
