Dealing with large text files

What's the best way to deal with (that is, read) very large (>2GB) text files?

You begin at the beginning and go on till you come to the end; then stop.


But in all seriousness, what's your question here? If you are reading a file sequentially, there's not much more you can do.

I deal with "large text files" when doing text processing in Rust, but I just can't imagine a >2GB book, not even a trilogy. :wink:

Does your use case have something to do with brute-forcing passwords? What are those >2GB files?


Sorry, in my original post I typed "sequentially". This is incorrect: the reads are random. Also, the files are bigger than available RAM.

That is your personal problem really. I am not sure why you are talking about your limitations here. :wink:

File objects implement the Seek trait, which will let you jump around arbitrarily without reading the entire file into memory.
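As a rough illustration of the Seek approach, here is a minimal sketch that reads a slice of a file at an arbitrary byte offset using only the standard library. The helper name `read_at` and its parameters are made up for this example; it assumes you already know the byte offset you want.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical helper: read up to `len` bytes starting at byte `offset`,
/// without loading the rest of the file into memory.
fn read_at(file: &mut File, offset: u64, len: usize) -> io::Result<Vec<u8>> {
    // Jump to the requested position, counted from the start of the file.
    file.seek(SeekFrom::Start(offset))?;

    let mut buf = Vec::with_capacity(len);
    // `take` caps the read at `len` bytes; fewer bytes come back if the
    // requested range runs past the end of the file.
    file.by_ref().take(len as u64).read_to_end(&mut buf)?;
    Ok(buf)
}
```

You would open the file once with `File::open` and then call `read_at(&mut file, offset, len)` for each lookup. Only the requested bytes end up in your program's memory; whatever else the operating system keeps in its page cache is its own business.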

If your access pattern is indeed random and your files do not fit into RAM, I think the best thing you can do is rely on your operating system's disk caching when jumping around inside the file using seek.

Things might change if you can somehow predict where you are going to access the file next, though.

Edit: one alternative to seek might be mapping the file into the address space of your program using something like mmap. Memory mapping files is inherently unsafe, however, due to the possibility of changes to the file by other programs.


Thank you for that info. Indeed helpful.

Thank you for your reply. Yes, a memory-mapped file is something that I can try using.

In genomics, much of the data is stored in text files >2GB, usually generic TSV for generic data and dedicated formats for genomic data (FASTQ, SAM, VCF, etc.).