I'm switching hosting providers and need to transfer millions of uploaded files to a new server. All of the files are in the same directory. Yes. You read that correctly. ;)
In the past I've done this:
scp the zip to the new server
The last time I did this it took about 4-5 days to complete, and that was about 60% of what I have now.
I'm hoping for a better way. What do you suggest?
The file names are hashes. Something like this: AAAAAAAAAA.jpg - ZZZZZZZZZZ.txt
Here's one idea we're tossing around:
Split the zips into tons of mini-zips based on 3 letter prefixes. Something like:
AAAAAAAAAA.jpg - AAAZZZZZZZ.gif => AAA.zip
Theoretical Pros:
Theoretical Cons:
AAA*), perhaps offset by running many zip threads at once, using all CPUs instead of only one.
We've also thought about rsync and scp but worry about the expense of transferring each file manually. And since the remote server is empty, I don't need to worry about what's already there.
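The prefix-splitting idea above could be sketched roughly as follows. This is only an illustration under some assumptions: it uses uncompressed tar instead of zip, relies on GNU find and GNU tar (`-print0`, `--null`, `-T`), and the function names `archive_prefix` and `archive_many` are made up for this example. Feeding names through find avoids expanding `AAA*` on a command line, which can overflow the argument limit in a directory this large.

```shell
# Hypothetical helper: pack every file in SRC whose name starts with PREFIX
# into OUT/PREFIX.tar. Assumes GNU find and GNU tar.
archive_prefix() {
  ap_src=$1
  ap_out=$2
  ap_prefix=$3
  # The redirection is resolved before the cd, so OUT may be a relative path.
  ( cd "$ap_src" &&
    find . -maxdepth 1 -type f -name "${ap_prefix}*" -print0 |
      tar --null -T - -cf - ) > "$ap_out/$ap_prefix.tar"
}

# Hypothetical driver: one background job per prefix, then wait for all.
# There is no cap on concurrency here, so pass only a handful of prefixes
# per call (roughly one per CPU).
archive_many() {
  am_src=$1
  am_out=$2
  shift 2
  for am_prefix in "$@"; do
    archive_prefix "$am_src" "$am_out" "$am_prefix" &
  done
  wait
}
```

For example, `archive_many /var/www /tmp/chunks AAA AAB AAC AAD` would build four archives concurrently. Note that each call to find rescans the whole directory, so with thousands of prefixes a single sorted listing split by prefix would probably be cheaper.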
What do you think? How would you do it?
(Yes, I'll be moving these to Amazon S3 eventually, and I'll just ship them a disk, but in the meantime, I need them up yesterday!)
You actually have multiple options. My favorite would be rsync.
rsync [dir1] [dir2]
This command will actually compare the directories, and sync only the differences between them.
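With this, I would be most likely to use the following:
rsync -az -e ssh [email protected]:/var/www/ /var/www/
-a Archive mode: recurse into directories and preserve permissions and timestamps (without -a or -r, rsync skips directories)
-z Compress data during the transfer
-e Specify the remote shell to use (here, ssh)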
You could also use SFTP (file transfer over SSH).
Or even wget, provided the files are reachable over HTTP or FTP, since wget does not speak SSH:
wget -rc ftp://[email protected]/var/www/
I'm from the Linux/Unix world. I'd use tar to make a number of tar files each of a set size. E.g.:
tar -cML $MAXIMUM_FILE_SIZE_IN_KILOBYTES --file=${FILENAME}_{0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}.tar ${THE_FILES}
I'd skip recompression unless your .txt files are huge. You won't get much mileage out of recompressing .jpeg files, and it will eat up a lot of CPU (and real) time.
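If you skip compression anyway, one common variation (not necessarily what the poster above had in mind) is to pipe an uncompressed tar stream straight into a receiving command, so no intermediate archive ever touches the disk. A minimal sketch, where `stream_tree` is a made-up helper name and `user@newhost` is a placeholder:

```shell
# Hypothetical helper: stream the contents of SRC as an uncompressed tar
# archive into whatever command follows on the argument list.
stream_tree() {
  st_src=$1
  shift
  # Note: the pipeline's exit status is that of the receiving command,
  # so a failure of the sending tar can go unnoticed.
  tar -C "$st_src" -cf - . | "$@"
}

# Over the network it might look like this (placeholder host and paths):
#   stream_tree /var/www ssh user@newhost 'tar -C /var/www -xf -'
```

This sends everything over a single connection, so it avoids per-file overhead, but an interrupted transfer has to start over unless you split the work into chunks.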
I'd look into how your traffic shaping works. How many concurrent connections can you have? How much bandwidth per connection? How much total?
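Once you know the numbers, a back-of-the-envelope estimate tells you whether the link or the tooling is the bottleneck. A small sketch, where `estimate_hours` is a made-up helper and the figures in the usage line are purely hypothetical, not taken from the question:

```shell
# Hypothetical helper: hours needed to move SIZE_GB gigabytes (decimal GB)
# over a link that sustains MBIT megabits per second.
estimate_hours() {
  awk -v gb="$1" -v mbit="$2" \
    'BEGIN { printf "%.1f\n", gb * 8 * 1000 / mbit / 3600 }'
}

# Made-up figures: 1000 GB over a sustained 100 Mbit/s link
estimate_hours 1000 100   # prints 22.2
```

If the real transfer takes far longer than the estimate, the time is going into per-file overhead or throttling rather than raw bandwidth.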
I've seen some interesting things with scp. Testing out a home network, scp gave much lower throughput than copying over a mounted shared smbfs filesystem. I'm not entirely clear why. Though that may be desirable if scp is verifying the copy and requesting retransmission on errors. (There is a very small probability of an error making it through in a packet transmitted over the internet. Without some sort of subsequent verification stage that's a real problem with large data sets. You might want to run md5 hashes...)
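The md5 check could look roughly like this. It is a sketch that assumes GNU coreutils and findutils (`md5sum`, `sort -z`, `xargs -0 -r`) on both machines; the function names and the manifest file name are invented for the example. Going through find and xargs sidesteps the argument-length limit you would hit with `md5sum *` in a directory holding millions of files.

```shell
# Hypothetical helper: print an md5 manifest for every file under DIR,
# with paths relative to DIR. Run this on the old server.
make_manifest() {
  ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r md5sum )
}

# Hypothetical helper: check the files under DIR against MANIFEST.
# Run this on the new server. Prints only mismatches and returns
# non-zero if any file is missing or differs.
verify_manifest() {
  ( cd "$1" && md5sum --quiet -c - ) < "$2"
}

# Example (placeholder paths):
#   make_manifest /var/www > manifest.md5     # old server
#   verify_manifest /var/www manifest.md5     # new server, after copying
```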
If this is a webserver, you could always just use wget. Though that seems highly inefficient...