Is this sane / worthwhile (Brainstorming: fast bulk file copy approach)

Hi. As I'm sitting here, a few TB of data (some large files, but also a butt-load of tiny ones) are slowly being pumped to another (MS-Win, sadly) fileserver; we hope it will be done in the next few days, so that we can even begin to hope to migrate the rest of the server over the weekend.

Besides Win/NTFS being terrible with lots of small files, I expect latency also plays some part in this (create/open file -- ok -- write this -- ok -- write that -- ok -- ... -- close file -- ok -- <rinse&repeat>). Suggestions to run multiple multi-threaded robocopy jobs (if you can subdivide the files somehow, or are lucky enough to already have an even-enough split into sub-dirs) seem to confirm my suspicions.
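
For context, the multi-threaded robocopy variant people suggest would look something like this (paths and server name are placeholders; `/MT:32` runs 32 copy threads, and the retry/wait settings keep a single locked file from stalling the whole job):

```shell
robocopy D:\share \\newserver\share /E /COPY:DATSO /MT:32 /R:1 /W:1
```

`/COPY:DATSO` carries data, attributes, timestamps, NTFS security and owner info along with the files.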

So I started thinking:

  • what if we side-stepped the Windows SMB protocol
    • let's use some fast messaging layer, so as not to bother with TCP or UDP low-level drudgery
    • scan the directories in advance / in another thread, to divide files into large / medium / tiny categories
      • large files will be transferred in multiple chunks
      • medium (maximum size to be determined by benchmarking) files will be transferred in one chunk / message
      • tiny files will be combined into one multi-file message
  • file attributes / permissions are sent in the message header
  • listening process on the destination side disassembles messages, writing out tiny / medium files, and facilitates correct assembly / write order of large file fragments
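
As a rough sketch of the size-partitioning idea (hypothetical paths and host, with 1 MiB standing in for the benchmarked cutoff), tar over ssh can already approximate the "batch the tiny files, stream the big ones in parallel" plan:

```shell
#!/bin/sh
# Sketch: partition files by size and push two parallel tar streams.
# SRC, DEST and DESTHOST are placeholders; byte counts ("c" suffix)
# avoid find's unit rounding on -size.
SRC=/data/share
DEST=/data/share
DESTHOST=newserver

# Tiny/medium files (< 1 MiB): batched together into one stream.
( cd "$SRC" && find . -type f -size -1048576c -print0 \
    | tar --null -T - -cf - ) \
  | ssh "$DESTHOST" "mkdir -p '$DEST' && tar -C '$DEST' -xf -" &

# Large files (>= 1 MiB): a second stream running in parallel.
( cd "$SRC" && find . -type f ! -size -1048576c -print0 \
    | tar --null -T - -cf - ) \
  | ssh "$DESTHOST" "tar -C '$DEST' -xf -" &

wait
```

A real implementation would batch the tiny files into fixed-size messages instead of leaning on tar, but this already gets you two concurrent pipelined streams.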

==> Should (probably?) be faster than SMB?

How does this sound ?

The fastest way may be to physically move a drive from one computer to another.

You're unlikely to see any benefit from trying to replace TCP. It's suboptimal in high-latency networks with high packet loss, but over a local network these problems are not relevant.

File access on Windows is surprisingly slow, so having a client there that has weird workarounds like a thread pool for closing file handles will help.

Apart from that, no cleverness is needed. You could probably just install rsync and use that. It supports chunking, compression, and extended attributes.

And the metadata message parsing, pipelined send, and disassembly are quite easy to implement:

`tar c | ssh | tar x`
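
Spelled out with hypothetical paths and host, that shorthand becomes a single pipelined stream: tar packs metadata and file data on the source, ssh carries it, and tar unpacks on the destination.

```shell
# -C picks the directory on each side; -p preserves permissions
# on extraction (Unix permissions, not Windows ACLs).
tar -C /data/share -cf - . | ssh backup@newserver 'tar -C /data/share -xpf -'
```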

> The fastest way may be to physically move a drive from one computer to another.

A single drive -- perhaps. A RAID array that still holds a whole lotta other volumes -- nope.
For a whole volume, a disk image / network cloning would probably be fast enough.

But sadly, we need to copy just one directory / share out of a humongous volume / drive.

> so having a client there that has weird workarounds like a thread pool for closing file handles will help.

thanks for the link

> `tar c | ssh | tar x`

Wut??? tar supports Windows ACLs? I thought about using tar, but found it so unlikely that it would support this that I haven't really checked... It would still be single-threaded & totally un-resilient (one failure => start over), but it would be a step up :wink:

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.