Which is the best tool for copying a large directory tree locally?
Recently we had to move a full Cassandra backup to another cluster of machines (another datacenter, in Cassandra’s jargon). Although this can be achieved with DC replication, we opted for a more conservative approach that neither changes production configuration nor increases production load through data streaming. This post is a quick comparison to find out which tool performs best for copying a large directory tree locally.
## The Data
One of our Cassandra clusters contains 12 nodes; each node holds 532 GB of data spread across 1,753,200 files (the /var/lib/cassandra folder).
Spending some time on a benchmark paid off, since every copy operation is repeated several times during our tests.
## The Setup
Why locally, you may be thinking, when I just described copying files to other servers? It turns out that one of the best ways to copy bulk data from one server to another in the cloud is to attach a storage volume to the VM, copy everything locally, detach the volume, and then attach it to the destination VM.
We are hosted on Microsoft Azure; our VMs are Standard_D13_v2 with Premium Storage attached as Cassandra’s data volume.
After attaching a second volume, formatting it, and mounting it, we tried these tools:
# rsync
find "${SOURCE}" \
    -type d \
    -exec rsync --owner --group \
    --archive --copy-links --whole-file \
    --relative --no-compress --progress {} "${DESTINATION}/" \;
# tar
For tar I had a small script, mainly to avoid escaping in find -exec:
#!/usr/bin/env bash
ORIGEN="$1"
DESTINATION=/datadrive2
NEW_DIR="${DESTINATION}/${ORIGEN}"
mkdir -p "${NEW_DIR}"
(cd "${ORIGEN}" && tar cf - .) | (cd "${NEW_DIR}" && tar xpf -)
echo "${ORIGEN}"

The script is then run once per directory:

find "${SOURCE}" -type d \
    -exec /root/transfer-with-tar.sh {} \;
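
To see what the pipe in the script does, here is a minimal sketch that runs the same tar-to-tar pattern on a throwaway tree. The function name copy_tree_with_tar, the temporary directories, and the sample file are hypothetical and exist only for illustration; this is not the script used in the benchmark.

```shell
#!/usr/bin/env bash
# Sketch only: copy_tree_with_tar is a hypothetical helper for illustration.
copy_tree_with_tar() {
    local src="$1" dst="$2"
    mkdir -p "${dst}" || return 1
    # The first tar writes the archive to stdout, the second reads it from stdin;
    # p asks tar to keep the original permissions on extraction.
    (cd "${src}" && tar cf - .) | (cd "${dst}" && tar xpf -)
}

# Build a tiny throwaway tree and copy it.
demo_src="$(mktemp -d)"
demo_dst="$(mktemp -d)"
mkdir -p "${demo_src}/keyspace/table"
echo "sstable data" > "${demo_src}/keyspace/table/file.db"
chmod 640 "${demo_src}/keyspace/table/file.db"

copy_tree_with_tar "${demo_src}" "${demo_dst}/copy"

# diff -r exits with 0 when both trees have identical contents.
diff -r "${demo_src}" "${demo_dst}/copy" && echo "trees match"
```

Because nothing is written to an intermediate archive file, the copy needs no extra disk space beyond the destination itself.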
# cpio
time (find "${SOURCE}" -type f -print0 2>/dev/null | \
    cpio -0admp "${DESTINATION}" &>/dev/null)
# rsync + parallel
find "${SOURCE}" -type f > /tmp/backup.txt
time (cat /tmp/backup.txt | parallel -j 8 \
    rsync --owner --group \
    --archive --copy-links --whole-file \
    --relative --no-compress --progress {} "${DESTINATION}")
## Results
Method | Time spent (min) | Time spent (hh:mm:ss) |
---|---|---|
rsync | 232 | 03:52:00 |
tar | 206 | 03:26:00 |
cpio | 225 | 03:45:00 |
parallel rsync | 209 | 03:29:00 |
Using GNU parallel to improve rsync performance.
We had a very good experience using parallel with rsync for network transfers and were curious to see whether it would improve rsync locally as well. It did: the copy took 209 minutes instead of 232, about 10% faster.
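
To put the timings in perspective, the sketch below converts each result into an average throughput. It assumes that 532 GB were copied in every run and takes 1 GB as 1024 MB, which is one reading of the figure given in The Data; the function names throughput_mb_s and to_hhmm are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: derive average throughput from the benchmark timings.
# Assumption: 532 GB copied per run, with 1 GB = 1024 MB.
DATA_MB=$((532 * 1024))

# throughput_mb_s MINUTES -> average MB/s, one decimal place
throughput_mb_s() {
    awk -v mb="${DATA_MB}" -v min="$1" 'BEGIN { printf "%.1f", mb / (min * 60) }'
}

# to_hhmm MINUTES -> hh:mm
to_hhmm() {
    printf '%02d:%02d' $(( $1 / 60 )) $(( $1 % 60 ))
}

for entry in "rsync:232" "tar:206" "cpio:225" "parallel rsync:209"; do
    method="${entry%%:*}"
    minutes="${entry##*:}"
    printf '%-15s %s  %s MB/s\n' "${method}" \
        "$(to_hhmm "${minutes}")" "$(throughput_mb_s "${minutes}")"
done
```

Under these assumptions tar and parallel rsync both average roughly 43 to 44 MB/s, against about 39 MB/s for plain rsync.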
THE WINNER IS: for our use case we decided to use parallel with rsync. Even though it is a little slower than tar, the ability to resume the copy and to checksum the copied files are features we value for safe transfers.
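
As an illustration of those two features, here is a minimal sketch of how a finished or interrupted copy could be resumed and then checked with rsync. The helper names resume_copy and verify_copy are hypothetical, and these are not necessarily the commands used in the benchmark.

```shell
#!/usr/bin/env bash
# Sketch only: resume_copy and verify_copy are hypothetical helper names.

# Re-running rsync skips files that are already up to date in the
# destination, so an interrupted copy continues where it stopped.
resume_copy() {
    rsync --archive --whole-file "$1/" "$2/"
}

# List files that differ between source and destination.
# --checksum compares file checksums instead of size and modification time,
# --dry-run reports what would be transferred without copying anything,
# --itemize-changes prints one line per item that differs.
verify_copy() {
    rsync --archive --checksum --dry-run --itemize-changes "$1/" "$2/"
}
```

Calling resume_copy followed by verify_copy with the same source and destination should print nothing when the two trees match. Note that --checksum reads every file on both sides, so the verification pass is itself I/O heavy.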
## Acknowledgement
This question on Server Fault inspired us to try this comparison: Copying a large directory tree locally? cp or rsync?