Data Transfer Guide

This page gives an overview of the mechanisms for transferring data between the UK-RDF and remote machines over JANET.

Overview

Data transfer speed may be limited by many different factors so the best data transfer mechanism to use depends on the type of data being transferred and where the data is going.

  • Disk speed - The RDF file-systems are highly parallel, consisting of a very large number of high performance disk drives. This allows them to support a very high data bandwidth. Unless the remote system has a similar parallel file-system you may find your transfer speed limited by disk performance.
  • Meta-data performance - Meta-data operations such as opening and closing files or listing the owner or size of a file are much less parallel than read/write operations. If your data consists of a very large number of small files you may find your transfer speed is limited by meta-data operations. Meta-data operations performed by other users of the system will interact strongly with those you perform, so reducing the number of such operations you use may reduce variability in your IO timings.
  • Network speed - Data transfer performance can be limited by network speed. More importantly it is limited by the slowest section of the network between source and destination.
  • Fire-wall speed - Most modern networks are protected by some form of fire-wall that filters out malicious traffic. This filtering has some overhead and can result in a reduction in data transfer performance. The needs of a general purpose network that hosts email/web-servers and desktop machines are quite different from a research network that needs to support high volume data transfers. If you are trying to transfer data to or from a host on a general purpose network you may find the fire-wall for that network will limit the transfer rate you can achieve.

Using the RDF

The RDF has 3 filesystems:

/general
/epsrc
/nerc

The file-system a user has access to depends on their funding body.

Archiving

If you have related data that consists of a large number of small files it is strongly recommended to pack the files into a larger "archive" file for long term storage. A single large file makes more efficient use of the file-system and is easier to move, copy and transfer because significantly fewer meta-data operations are required. Archive files can be created using tools like tar, cpio and zip. When using these commands to prepare a file for the RDF, it is good practice to forgo compression as this will slow the archiving process.
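
As a rough way to judge whether a directory is worth packing before transfer, the following sketch counts its files and reports the average file size. The directory name mydata and the sample files are placeholders created only so that the example runs on its own, and no particular threshold is implied here.

```shell
# Create a small throwaway directory so the sketch is self-contained.
# "mydata" and its contents are placeholders, not real RDF data.
workdir=$(mktemp -d)
mkdir -p "$workdir/mydata"
for i in 1 2 3 4 5; do
    printf 'sample %s\n' "$i" > "$workdir/mydata/file_$i.txt"
done

# Count regular files and total their sizes in bytes.
# The $(( )) wrapper strips any padding that wc adds on some systems.
nfiles=$(( $(find "$workdir/mydata" -type f | wc -l) ))
total_bytes=$(( $(find "$workdir/mydata" -type f -exec cat {} + | wc -c) ))

echo "files: $nfiles"
echo "total bytes: $total_bytes"
echo "average bytes per file: $(( total_bytes / nfiles ))"
```

A very large file count combined with a small average size suggests that meta-data operations are likely to dominate, which is the case where a single archive file helps most.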

tar command

The tar command packs files into a "tape archive" format intended for backup purposes. The command has general form:

tar [options] [file(s)]

Common options include -c "create a new archive", -v "verbosely list files processed", -W "verify the archive after writing", -l "confirm all file hard links are included in the archive", and -f "use an archive file" (for historical reasons, tar writes its output to stdout by default rather than a file). Putting these together:

tar -cvWlf mydata.tar mydata

will create and verify an archive ready for the RDF. Further information on the hard link check can be found in the tar manual.

To extract files from a tar file, the option -x is used. For example:

tar -xf mydata.tar

will recover the contents of "mydata.tar" to the current working directory.
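
The create and extract steps can be tried end to end on a small test directory before committing real data. The sketch below is illustrative only: the directory and file names are placeholders, and it uses the standard tar options -t (list contents) and -C (extract into a given directory) in addition to those described above.

```shell
# Build a throwaway directory tree; "mydata" is a placeholder name.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p mydata/sub
echo "alpha" > mydata/a.txt
echo "beta"  > mydata/sub/b.txt

# Create an uncompressed archive, then list its contents without extracting.
tar -cf mydata.tar mydata
tar -tf mydata.tar

# Extract into a separate directory (-C) rather than the current one.
mkdir restore
tar -xf mydata.tar -C restore

# Compare the restored tree with the original; diff prints nothing on a match.
diff -r mydata restore/mydata && echo "restore matches original"
```

Extracting into a separate directory avoids overwriting the original files while you check that the archive is complete.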

To verify an existing tar file against a set of data, the -d "diff" option can be used. By default, no output is given if the verification succeeds. An example of a failed verification follows:

$> tar -df mydata.tar mydata
mydata/damaged_file: Mod time differs
mydata/damaged_file: Size differs

Note that tar files do not store checksums with their data, requiring the original data to be present during verification.
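
One way to work around this is to record a separate checksum manifest alongside the archive, so that extracted data can be checked later without the original being present. The sketch below assumes the sha256sum utility from GNU coreutils is available (other systems may provide a different tool); the file names, including mydata.sha256, are placeholders.

```shell
# Build a throwaway directory; names are placeholders for illustration.
workdir=$(mktemp -d)
cd "$workdir"
mkdir mydata
echo "alpha" > mydata/a.txt
echo "beta"  > mydata/b.txt

# Record a checksum for every file before archiving.
# Assumption: sha256sum (GNU coreutils) is installed.
find mydata -type f -exec sha256sum {} + > mydata.sha256

tar -cf mydata.tar mydata

# Later, possibly on another machine: extract, then check every file
# against the manifest. sha256sum -c exits non-zero on any mismatch.
mkdir restore
tar -xf mydata.tar -C restore
(cd restore && sha256sum -c ../mydata.sha256)
status=$?
echo "verification exit status: $status"
```

The manifest is a small text file, so it can be transferred and stored next to the archive at negligible cost.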

cpio command

The cpio utility is a common file archiver and is provided by most Linux distributions. The command has form:

cpio [options] < in > out

Note cpio uses stdin and stdout for its input and output functionality. The utility does not provide a "recursive" flag like tar and zip and is hence often used with the find command when working with directories.

Common options include -o "create an archive (copy-out mode)", -v "verbose mode", and -H "use the given archive format". The recommended format is crc as this provides checksum support at the cost of compatibility with older versions of cpio. Together:

find mydata/ | cpio -ovH crc > mydata.cpio

will create an archive ready for the RDF.

Extraction is performed via the -i "copy-in" flag usually paired with -d to ensure directories are created as needed. For example:

cpio -id < mydata.cpio

recovers the contents of the archive to the working directory.

Archive verification can be performed in -i mode with the --only-verify-crc flag set. As the name implies, this skips the file extraction and only verifies the checksum for each file in the archive. An example of this on a damaged archive follows:

$> cpio -i --only-verify-crc < mydata.cpio
cpio: mydata/file: checksum error (0x1cd3cee8, should be 0x1cd3cf8f)
204801 blocks

zip command

The zip file format is widely used for archiving files and is supported by most major operating systems. The utility to create zip files can be run from the command line as:

zip [options] mydata.zip [file(s)] 

Common options are -r used to zip up a directory and -# where "#" represents a digit ranging from 0 to 9 to specify compression level, 0 being the least and 9 the most. Default compression is -6 but we recommend using -0 to speed up the archiving process. Together:

zip -0r mydata.zip mydata

will create an archive ready for the RDF. Note: Unlike tar and cpio, zip files do not preserve hard links. File data will be copied on archive creation, e.g. an uncompressed zip archive of a 100MB file and a hard link to that file will be approximately 200MB in size. This makes zip an unsuitable format if you wish to precisely reproduce the file system.

The corresponding unzip command is used to extract data from the archive. The simplest use case is:

unzip mydata.zip

which recovers the contents of the archive to the working directory.

Files in a zip archive are stored with a CRC checksum to help detect data loss. unzip provides options for verifying this checksum against the stored files. The relevant flag is -t and is used as follows:

$> unzip -t mydata.zip
Archive:  mydata.zip
    testing: mydata/                 OK
    testing: mydata/file             OK
No errors detected in compressed data of mydata.zip.

Transfer Nodes

The RDF has its own data transfer nodes (dtn01.rdf.ac.uk, dtn02.rdf.ac.uk) that are specifically intended to support import and export of remote data. You should use these nodes when importing/exporting data to/from the RDF disks from remote machines. These are also the nodes where we support specialised data transfer software and additional network connections (such as the dedicated PRACE network). If you have specialised data transfer requirements, you may need to use these nodes.

Data Transfer via SSH

The easiest way of transferring data to or from ARCHER is to use one of the standard programs based on the SSH protocol such as scp, sftp or rsync. These all use the same underlying mechanism (ssh) as you normally use to log in to ARCHER. So, once the command has been executed via the command line, you will be prompted for your password for the specified account on the remote machine. To avoid having to type in your password multiple times you can set up an ssh key as documented in the user-guide.

The ssh command encrypts all traffic it sends. This means that file transfer using ssh consumes a relatively large amount of CPU time at both ends of the transfer. The login nodes for ARCHER and RDF have fairly fast processors that can sustain about 100 MB/s transfer but you may have to consider alternative file transfer mechanisms if you want to support very high data rates. The encryption algorithm used is negotiated between the ssh-client and the ssh-server. There are command line flags that allow you to specify a preference for which encryption algorithm should be used. You may be able to improve transfer speeds by requesting a different algorithm than the default. The arcfour algorithm is usually quite fast if both hosts support it.
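
To put the 100 MB/s figure in context, the following sketch estimates how long a transfer would take at a given sustained rate. The function name estimate_seconds and the 1 TB data size are illustrative choices, and real transfers will usually be slower than this idealised figure because of the other limits described in the overview.

```shell
# Estimate transfer time in whole seconds for a given size and sustained rate.
# Arguments: size in bytes, rate in MB/s (1 MB taken as 10^6 bytes).
# estimate_seconds is a hypothetical helper name used for this sketch only.
estimate_seconds() {
    size_bytes=$1
    rate_mb_per_s=$2
    echo $(( size_bytes / (rate_mb_per_s * 1000000) ))
}

# 1 TB (10^12 bytes) at a sustained 100 MB/s.
secs=$(estimate_seconds 1000000000000 100)
echo "about $secs seconds ($(( secs / 60 )) minutes)"
```

At this rate a terabyte takes close to three hours, which is why very large transfers may justify the parallel or unencrypted mechanisms described later.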

A single ssh based transfer will usually not be able to saturate the available network bandwidth or the available disk bandwidth so you may see an overall improvement by running several data transfer operations in parallel. To reduce meta-data interactions it is a good idea to overlap transfers of files from different directories.
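
A minimal sketch of running several transfers in parallel, one per source directory, is shown below. The directory names and the incoming/ destination are placeholders, and the DRY_RUN variable is an assumption of this sketch: by default the commands are only printed, so the example can be run without touching the network.

```shell
# Placeholder directory names and remote target; replace with real values.
remote="user@login.archer.ac.uk:incoming/"
dirs="run_a run_b run_c run_d"

# With DRY_RUN=1 (the default here) each command is printed rather than run.
DRY_RUN=${DRY_RUN:-1}
transfer() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "rsync -a -e ssh $1 $remote"
    else
        rsync -a -e ssh "$1" "$remote"
    fi
}

# Start one transfer per directory in the background, then wait for all
# of them to finish before reporting.
log=$(mktemp)
for d in $dirs; do
    transfer "$d" >> "$log" &
done
wait
cat "$log"
```

This starts every transfer at once, which is reasonable for a handful of directories; for longer lists you would want to limit how many run at the same time.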

scp command

The scp command creates a copy of a file, or if given the -r flag a directory, on a remote machine. The command to transfer files to ARCHER has the form:

scp [options] source user@login.archer.ac.uk:[destination]

In the above example, the [destination] is optional, as when left out scp will simply copy the source into the user's home directory. Also the 'source' should be the absolute path of the file/directory being copied or the command should be executed in the directory containing the source file/directory.

If you want to request a different encryption algorithm add the -c algorithm-name flag to the scp options.

If you need to run scp from within a batch job, see the special instructions on how to use ssh-keys from batch jobs.

rsync command

The rsync command can also transfer data between hosts using an ssh connection. It creates a copy of a file, or if given the -r flag a directory, at the given destination, similar to scp above. However, given the -a option rsync can also make exact copies (including permissions); this is referred to as 'mirroring'. In this case the rsync command is executed with ssh to create the copy on a remote machine. To transfer files to ARCHER the command should have the form:

rsync [options] -e ssh source user@login.archer.ac.uk:[destination]

In the above example, the [destination] is optional, as when left out rsync will simply copy the source into the user's home directory. Also the 'source' should be the absolute path of the file/directory being copied or the command should be executed in the directory containing the source file/directory.

Additional flags can be specified for the underlying ssh command by using a quoted string as the argument of the -e flag. For example:

rsync [options] -e "ssh -c arcfour" source user@login.archer.ac.uk:[destination]

Other Data Transfer protocols

For very large data transfers it may be necessary to use more specialised tools. For performance reasons these use (multiple) non-encrypted socket connections. As a result, it is usually necessary to have a range of TCP/IP ports open in the fire-walls before these tools can be used. On the RDF data-transfer nodes we support a port range of 50000,52000.
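
On the client side, Globus tools are commonly told which ports they may use through the GLOBUS_TCP_PORT_RANGE environment variable. The sketch below assumes that this variable applies to your client installation and that your local fire-wall permits the same range; both should be confirmed with your local administrators.

```shell
# Assumption: the local Globus client honours GLOBUS_TCP_PORT_RANGE (min,max).
# The range matches the one quoted above for the RDF data-transfer nodes.
GLOBUS_TCP_PORT_RANGE=50000,52000
export GLOBUS_TCP_PORT_RANGE
echo "Globus TCP port range: $GLOBUS_TCP_PORT_RANGE"
```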

Grid-FTP

The RDF data-transfer nodes support Grid-FTP. If you have a personal grid-certificate you can register the certificate DN via the SAFE and then access the Grid-FTP servers using globus-url-copy.

-bash-4.1$ grid-proxy-init
Your identity: /C=UK/O=eScience/OU=Edinburgh/L=NeSC/CN=stephen booth
Enter GRID pass phrase for this identity:
Creating proxy ............................................ Done
Your proxy is valid until: Sat Feb  7 01:43:08 2015

-bash-4.1$ globus-url-copy -vb file:///general/z01/z01/spb/random_4G.dat gsiftp://dtn02.rdf.ac.uk/general/z01/z01/spb/copy.dat
Source: file:///general/z01/z01/spb/
Dest:   gsiftp://dtn02.rdf.ac.uk/general/z01/z01/spb/
  random_4G.dat  ->  copy.dat

   3129999360 bytes       687.05 MB/sec avg       789.00 MB/sec inst

In the above example the gsiftp protocol tells globus-url-copy to connect to the grid-ftp daemon running on dtn02. You can also use the globus-online web-based portal at https://www.globus.org to manage grid-ftp transfers.

If you do not have a personal certificate the data-transfer nodes also support grid-ftp initiated via ssh.

[spbooth@jasmin-xfer1 ~]$ globus-url-copy -vb sshftp://spb@dtn01.rdf.ac.uk/general/z01/z01/spb/random_4G.dat file:///home/users/spbooth/random_4G.dat
Source: sshftp://spb@dtn01.rdf.ac.uk/general/z01/z01/spb/
Dest:   file:///home/users/spbooth/
  random_4G.dat

   3157262336 bytes        30.72 MB/sec avg        13.50 MB/sec inst

This uses your normal ssh credentials to authenticate the connection but the data is sent over separate sockets so is not encrypted.

The -p flag to globus-url-copy controls how many parallel sockets to use to transfer data. The best value to use depends on network conditions but 4-8 streams is usually fairly good.

bbcp

The bbcp tool allows you to transfer large amounts of data using parallel unencrypted streams, with authentication provided by ssh credentials.

Note: bbcp needs to be installed on both the source and destination hosts.

bbcp downloads and full documentation can be found at:

To use bbcp on the RDF you must first load the bbcp module:

module load bbcp

When copying data from the RDF DTNs you can use the following syntax:

module load bbcp
bbcp -z -s 2 -T 'ssh user@login.archer.ac.uk /usr/local/packages/cse/bbcp/13.05.03/bbcp' my_data.tar.gz user@login.archer.ac.uk:from_rdf.tar.gz

This copies data from the RDF to ARCHER (note that the RDF filesystems are mounted on ARCHER so this is just for illustration purposes; you would not use this mechanism to move data from the RDF to ARCHER).

  • The -s 2 option specifies that two parallel transfer streams should be used
  • The -T option specifies the command to launch bbcp on the remote site
  • The -z option tells bbcp to use reverse connection protocol (useful for avoiding firewall problems)

If you wish to transfer data to the RDF from a remote host, the command on the remote host would look something like:

bbcp -s 4 -T "ssh user@dtn02.rdf.ac.uk module load bbcp; bbcp" my_data.tar.gz user@dtn02.rdf.ac.uk:to_rdf.tar.gz

If you have any questions about copying your data to/from the RDF, please contact the ARCHER helpdesk via support@rdf.ac.uk.