Check if file is a text file

How can I check whether a file is a text file, that is, not a binary file such as AVI or MP4?
File types like CSS and HTML, as well as regular plain text, would qualify as text files.

Thank you

I think there might be Rust crates for the libmagic library, which allows you to find out the file type based on the file contents. Not sure if this is overkill though.

I guess a simpler approach would be to check if the file contains control characters that aren't supposed to be in text files (bytes from 0 to 31, and 127, with the exception of 9, 10 and 13 which are used for tabs and line breaks).
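A minimal sketch of that check, assuming the bytes are already in memory. The function names are made up for illustration, and bytes above 127 are simply let through, so this says nothing about the encoding.

```rust
/// Returns true for control bytes that are unusual in text files:
/// 0..=31 and 127, except tab (9), line feed (10) and carriage return (13).
fn is_suspicious_control(byte: u8) -> bool {
    matches!(byte, 0..=8 | 11 | 12 | 14..=31 | 127)
}

/// Returns true if `bytes` contains none of the suspicious control bytes.
/// Bytes >= 128 are accepted without further checks.
fn looks_like_text(bytes: &[u8]) -> bool {
    !bytes.iter().copied().any(is_suspicious_control)
}
```

For a file, one could feed it the output of `std::fs::read(path)`, or read fixed-size chunks in a loop and stop at the first chunk that fails.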

That 'kinda' wouldn't work for files that are over 1 GB in size. I say 'kinda' because, sure, it would eventually read the whole file and perform the check, but it would simply take too long.
As for libmagic, I've read the skimpy docs but didn't get how to use it. Could you provide a minimal example?

I think it's not possible to determine if a file is a text file unless you process it completely.

However, you could use a heuristic (with false positives) by scanning only the first and the last megabyte, for example.

As for libmagic, I have not used it myself, and I'm not sure it does what you need: it might report HTML as "HTML" rather than "text".

I would likely scan the beginning and end of the file and check whether they contain unexpected control characters.
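Roughly, scanning only the beginning and end could look like the sketch below. It is written against any `Read + Seek` source so it works for a `std::fs::File` as well as an in-memory cursor. The names and the `sample_len` parameter are hypothetical, and anything between the two samples is never looked at, so false positives remain possible.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Control bytes that are unusual in text: 0..=31 and 127,
/// except tab (9), line feed (10) and carriage return (13).
fn is_suspicious_control(byte: u8) -> bool {
    matches!(byte, 0..=8 | 11 | 12 | 14..=31 | 127)
}

/// Reads up to `len` bytes starting at absolute position `offset`.
fn read_chunk<R: Read + Seek>(src: &mut R, offset: u64, len: u64) -> io::Result<Vec<u8>> {
    src.seek(SeekFrom::Start(offset))?;
    let mut buf = Vec::new();
    src.by_ref().take(len).read_to_end(&mut buf)?;
    Ok(buf)
}

/// Heuristic: checks only the first and last `sample_len` bytes.
fn head_and_tail_look_like_text<R: Read + Seek>(
    src: &mut R,
    sample_len: u64,
) -> io::Result<bool> {
    // Seeking to the end returns the total length of the source.
    let total = src.seek(SeekFrom::End(0))?;

    let head = read_chunk(src, 0, sample_len)?;
    if head.iter().copied().any(is_suspicious_control) {
        return Ok(false);
    }
    if total <= sample_len {
        // The head sample already covered the whole source.
        return Ok(true);
    }

    let tail = read_chunk(src, total - sample_len, sample_len)?;
    Ok(!tail.iter().copied().any(is_suspicious_control))
}
```

With a real file this would be called as something like `head_and_tail_look_like_text(&mut File::open(path)?, 1024 * 1024)`, which reads at most 2 MiB regardless of the file size.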

If you know that the file uses UTF-8, you could also require that it is valid UTF-8, which reduces the risk of false-positives.
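As a sketch of the stricter variant: `std::str::from_utf8` does the validation, and the control character check then runs on decoded characters instead of raw bytes. The function name is made up. Note that `char::is_control` also rejects the C1 range (U+0080 to U+009F), which is slightly stricter than the byte ranges listed earlier.

```rust
/// Returns true if `bytes` is valid UTF-8 and contains no control
/// characters other than tab, line feed and carriage return.
fn is_utf8_text(bytes: &[u8]) -> bool {
    match std::str::from_utf8(bytes) {
        Ok(text) => !text
            .chars()
            .any(|c| c.is_control() && !matches!(c, '\t' | '\n' | '\r')),
        // Any invalid or truncated sequence counts as "not text" here.
        Err(_) => false,
    }
}
```

This version expects complete input. A sample cut out of a larger file can end in the middle of a multi-byte character, which this function would reject.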

The first thing to do, before potentially more expensive checks, is to check the start of the file for the most common "magic numbers" which is presumably what libmagic will do for you. If the file starts with one, then it’s probably not text – except if it’s a BOM (byte order mark), then the file is likely UTF-16 text, which can be further verified by trying to decode a part of it.
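To make this concrete, here is a hand-rolled sketch of such a prefix check. The signature table is a small illustrative selection, not what libmagic actually ships, and the type and function names are hypothetical. The UTF-16 check decodes the bytes after the BOM; a sample that ends inside a surrogate pair would be rejected, which a real implementation might want to tolerate.

```rust
#[derive(Debug, PartialEq)]
enum Prefix {
    Utf8Bom,
    Utf16LeBom,
    Utf16BeBom,
    KnownBinary(&'static str),
    Unknown,
}

/// Classifies the first few bytes of a file.
fn classify_prefix(head: &[u8]) -> Prefix {
    // Byte order marks first: these hint at text, not binary.
    if head.starts_with(&[0xEF, 0xBB, 0xBF]) {
        return Prefix::Utf8Bom;
    }
    if head.starts_with(&[0xFF, 0xFE]) {
        return Prefix::Utf16LeBom;
    }
    if head.starts_with(&[0xFE, 0xFF]) {
        return Prefix::Utf16BeBom;
    }

    // A few well-known signatures located at offset 0.
    const SIGNATURES: &[(&[u8], &str)] = &[
        (b"\x89PNG\r\n\x1a\n", "PNG"),
        (b"\xFF\xD8\xFF", "JPEG"),
        (b"GIF8", "GIF"),
        (b"PK\x03\x04", "ZIP"),
        (b"\x1F\x8B", "gzip"),
        (b"\x7FELF", "ELF"),
        (b"RIFF", "RIFF container (AVI, WAV, ...)"),
    ];
    for &(magic, name) in SIGNATURES {
        if head.starts_with(magic) {
            return Prefix::KnownBinary(name);
        }
    }

    // MP4 and related formats usually carry "ftyp" at offset 4.
    if head.get(4..8) == Some(&b"ftyp"[..]) {
        return Prefix::KnownBinary("MP4 / ISO media");
    }
    Prefix::Unknown
}

/// Tries to decode `body` (the bytes after the BOM) as UTF-16.
/// A trailing odd byte is ignored.
fn decodes_as_utf16(body: &[u8], little_endian: bool) -> bool {
    let units = body.chunks_exact(2).map(|pair| {
        let pair = [pair[0], pair[1]];
        if little_endian {
            u16::from_le_bytes(pair)
        } else {
            u16::from_be_bytes(pair)
        }
    });
    std::char::decode_utf16(units).all(|unit| unit.is_ok())
}
```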

If the file doesn’t have a known magic number and is made exclusively of bytes 0x20..0x7F plus tab, line feed and carriage return, then it’s almost certainly ASCII text. Of course, it could be e.g. Rust or HTML or whatever code rather than plain text in a natural language, though given that you said "gigabytes" that may not be relevant. Anyway, what types of text files can be gigabytes in size without you knowing anything else about their structure that could help identify them? Logs maybe, but they do have structure?

UTF-8 text is also pretty easy to identify and, as with ASCII, sampling e.g. 100 kB of data from the file is likely to give you pretty good confidence that the heuristic is correct.

How would you go about it?

UTF-8 is self-synchronizing, so wherever you are in the middle of a stream of UTF-8, you are guaranteed to find the next (or previous) code point boundary by checking at most four bytes, because the start byte of a code point is distinguishable from any continuation byte with a simple bitmask. The contrapositive is that if you don't find a start byte (or the following bytes aren't valid continuation bytes), then the file is not valid UTF-8. If you do, that doesn't guarantee that it is UTF-8; it could have been just a coincidence. So read forward however many bytes you want and check that it all decodes as valid UTF-8.
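A rough sketch of that idea for a sample cut from an arbitrary position. The function names are made up. It skips at most three leading continuation bytes to reach a code point boundary, then lets `std::str::from_utf8` validate the rest; `Utf8Error::error_len()` returning `None` means the input merely ended in the middle of a code point, which is expected when the sample is cut at an arbitrary place.

```rust
/// True for UTF-8 continuation bytes, which have the form 0b10xxxxxx.
fn is_continuation(byte: u8) -> bool {
    byte & 0b1100_0000 == 0b1000_0000
}

/// Checks whether `sample`, cut from anywhere in a file, is consistent
/// with UTF-8. Passing this check does not prove the file is UTF-8.
fn sample_is_utf8(sample: &[u8]) -> bool {
    // Resynchronize: a code point has at most three continuation bytes,
    // so a boundary must appear within the first four bytes.
    let skip = sample
        .iter()
        .take(3)
        .take_while(|&&b| is_continuation(b))
        .count();

    match std::str::from_utf8(&sample[skip..]) {
        Ok(_) => true,
        // `None` means "unexpected end of input": the sample stopped
        // inside a code point, which is fine for a cut-off sample.
        Err(err) => err.error_len().is_none(),
    }
}
```

The sample itself could come from seeking to some offset and reading e.g. 100 kB, then running the usual control character check on the part that decoded.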
