Did any of the suggestions in the Performance part of their docs help?
In particular, you don't need to deserialize into a BimRecord struct every time. You should be able to copy the "amortized allocations" suggestion and reuse the same StringRecord for every row, using indexing to get at the snp, a1, and a2 fields directly. Deserializing a strongly typed value with serde is always going to be more expensive than string indexing.
I do enjoy tweaking code to get the most out of a processor. But, reading from disk is going to be one to three orders of magnitude slower than parsing. I suspect your time is better spent tuning the read buffer size (2 MiB in my experience is a good starting choice) and introducing async, overlapped, or threaded I/O.
It looks like the default buffer size for csv::Reader is rather small. Increasing it even just to the disk cluster size (64 KiB is common) should have a noticeable impact.