
Mutable Ideas

Which is the best tool for copying a large directory tree locally?

Recently we had to move a full Cassandra backup to another cluster of machines (another datacenter, in Cassandra's jargon). Although this can be achieved with DC replication, we opted for a more conservative approach, so as not to change production configurations or increase their load with data streaming. This post is a quick comparison to find out which tool performs best for copying a large directory tree locally.

*rsync, tar, cpio: what's best?*

## The Data

One of our Cassandra clusters contains 12 nodes; each node holds 532 GB of data distributed across 1,753,200 files (the /var/lib/cassandra folder). Spending some time on a benchmark promised a good payoff, since the copy operation would be performed several times during our tests.
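Numbers like these come straight from standard tools. The sketch below shows the commands we mean; the real path would be /var/lib/cassandra, but here a tiny throwaway tree stands in for it so the commands run anywhere:

```shell
#!/usr/bin/env bash
# Sizing the job. On a real node DATA_DIR would be /var/lib/cassandra;
# here we build a small throwaway tree so this is runnable anywhere.
set -euo pipefail

DATA_DIR=$(mktemp -d)
mkdir -p "$DATA_DIR/keyspace1/table1"
head -c 1024 /dev/zero > "$DATA_DIR/keyspace1/table1/a.db"
head -c 1024 /dev/zero > "$DATA_DIR/keyspace1/table1/b.db"

FILE_COUNT=$(find "$DATA_DIR" -type f | wc -l)   # 1,753,200 files on a real node
TOTAL_SIZE=$(du -sh "$DATA_DIR" | cut -f1)       # ~532 GB on a real node
echo "$FILE_COUNT files, $TOTAL_SIZE total"
```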

## The Setup

Why locally, you may be thinking, when I just described copying files to other servers? It turns out that one of the best ways to move bulk data between servers in the cloud is to attach a storage volume to the source VM, copy everything as a local operation, detach the volume, and then attach it to the destination VM.

We are hosted on Microsoft Azure; our VMs are Standard_D13_v2, with a Premium Storage disk attached as Cassandra's data volume.

After attaching a second volume, formatting it, and mounting it, we tried the following tools:
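Concretely, preparing the second volume looked roughly like the sketch below. The device name and mount point are assumptions (a newly attached Azure data disk often appears as /dev/sdc, but verify with `lsblk` first):

```shell
# Assumed device and mount point — check lsblk before running.
# WARNING: mkfs erases everything on the target device.
sudo mkfs.ext4 /dev/sdc
sudo mkdir -p /mnt/copydisk
sudo mount /dev/sdc /mnt/copydisk

# Variables used by the commands below
SOURCE=/var/lib/cassandra
DESTINATION=/mnt/copydisk
```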

### rsync

# Copy each directory with rsync, preserving ownership and relative paths
find "${SOURCE}" \
  -type d \
  -exec rsync --owner --group \
    --archive --copy-links --whole-file \
    --relative --no-compress --progress {} "${DESTINATION}/" \;

### tar

For tar I used a small script, mainly to avoid escaping issues with find -exec:

#!/usr/bin/env bash
# Invoked by find with one directory per call; SOURCE and DESTINATION are
# exported by the calling shell. ORIGEN ("origin") is the directory passed
# in; NEW_DIR mirrors it under the destination. (The original post omitted
# these definitions.)
ORIGEN="$1"
NEW_DIR="${DESTINATION}${ORIGEN#"${SOURCE}"}"

mkdir -p "${NEW_DIR}"
(cd "${ORIGEN}"; tar cf - .) | (cd "${NEW_DIR}"; tar xpf -)

echo "${ORIGEN}"

find "${SOURCE}" -type d \
  -exec /root/ {} \;
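The tar pipe itself can be exercised end to end on a throwaway tree. This is a minimal sketch (directory and file names are illustrative) of the same `cd` + `tar cf - | tar xpf -` pattern:

```shell
#!/usr/bin/env bash
# Minimal demo of the tar copy pipe on a throwaway tree.
set -euo pipefail

ORIGEN=$(mktemp -d)    # source ("origen" = origin, as in the script above)
NEW_DIR=$(mktemp -d)   # destination

mkdir -p "$ORIGEN/keyspace1"
echo "sstable bytes" > "$ORIGEN/keyspace1/data.db"

# Pack on one side of the pipe, unpack on the other; -p preserves permissions.
(cd "$ORIGEN"; tar cf - .) | (cd "$NEW_DIR"; tar xpf -)

cat "$NEW_DIR/keyspace1/data.db"
```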

### cpio

# -0: NUL-delimited input; -a: reset access times; -d: create directories as
# needed; -m: preserve modification times; -p: pass-through (copy) mode
find "${SOURCE}" -type f -print0 2>/dev/null | \
  cpio -0admp "${DESTINATION}" &>/dev/null

### rsync + parallel

find "${SOURCE}" -type f > /tmp/backup.txt
time (cat /tmp/backup.txt | parallel -j 8 \
  rsync --owner --group \
    --archive --copy-links --whole-file \
    --relative --no-compress --progress {} "${DESTINATION}")

## Results

| Method | Time spent (min) | Time spent (hh:mm:ss) |
| --- | ---: | ---: |
| rsync | 232 | 03:52:00 |
| tar | 206 | 03:26:00 |
| cpio | 225 | 03:45:00 |
| parallel rsync | 209 | 03:29:00 |

We used GNU parallel to try to improve rsync's performance. We had had a very good experience combining parallel with rsync for network transfers and were curious whether it would speed up rsync locally as well. It turns out it did: the time dropped from 232 to 209 minutes.
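The improvement is easy to put a number on with a one-liner over the minutes in the results table:

```shell
#!/usr/bin/env bash
# Improvement of parallel rsync over plain rsync, from the results table.
set -euo pipefail

PLAIN=232      # plain rsync, minutes
PARALLEL=209   # parallel + rsync, minutes
SPEEDUP=$(awk -v a="$PLAIN" -v b="$PARALLEL" 'BEGIN { printf "%.1f", (a - b) / a * 100 }')
echo "parallel rsync was ${SPEEDUP}% faster"
```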

THE WINNER IS: for our use case we decided to use parallel with rsync. Even though it is a little slower than tar, the ability to resume an interrupted copy and to checksum the copied files are features we appreciate for safe transfers.

## Acknowledgment

This question on Server Fault inspired us to try this comparison: "Copying a large directory tree locally? cp or rsync?"