Data Protection for Hadoop Environments

One of the questions I am often asked is: do we need data protection for Hadoop environments?

It is an unusual question, because most of my customers never ask whether they need data protection for Oracle, DB2, SAP, Teradata or SQL environments. I think it is safe to say the majority of these environments are always protected.

So I started thinking. What motivates people to question whether they need data protection for Hadoop?

I boiled it down to three themes:

  1. Very few (if any) of the software backup vendors have solutions for Hadoop
  2. Hadoop's size and scalability are daunting when you consider that most of us come from a background of protecting monolithic environments in the tens of TB, rather than the hundreds or thousands of TB of a single Hadoop system
  3. Hadoop has some inbuilt data protection properties

So if we play it back: we can't turn to our traditional backup vendors as they don't have integrated Hadoop solutions, we are overwhelmed by the size and scale of the problem, and so we are left with the hope that Hadoop's inbuilt data protection properties will be good enough.

Before we dive into answering whether Hadoop's inbuilt data protection properties are good enough, I want to share some fundamental differences between Hadoop and traditional Enterprise systems. This will help us navigate how we should think about data protection in a Hadoop environment.

Distributed versus Monolithic

Hadoop is a distributed system that spans many servers. Not only is the computing distributed but for the standard Hadoop architecture, storage is distributed as well. Contrast this to a traditional Enterprise application that spans only a few servers and is usually reliant on centralised intelligent storage systems.

From a data protection perspective, which is going to be more challenging?

Protecting a centralised data repository that is addressable by a few servers is arguably simpler than supporting a large collection of servers each owning a portion of the data repository that is captive to the server.

When we are challenged to protect large volumes of data frequently, we are forced to resort to infrastructure-centric approaches to data protection. This is where we rely on the infrastructure to optimise the flow of data between the source undergoing protection and the protection storage system. In many cases we have to rely on intelligent storage or, in the case of virtualisation, the hypervisor to provide these optimisations. However, in a standard Hadoop architecture, where physical servers with direct attached storage are used, neither of these approaches is available.

To comply with the standard Hadoop architecture, the protection methodology needs to be software-driven and ideally should exploit the distributed properties of the Hadoop architecture so that data protection processing can scale with the environment.

Mixed Workloads

Hadoop systems typically support different applications and business processes, many of which will be unrelated. Contrast this with a traditional Enterprise application, which consists of a collection of related systems working together to deliver an application (e.g. Customer Relationship Management) that supports one or more business processes. For these Enterprise systems we apply consistent data protection policies across the systems that support the application. We could go one step further and treat all of the application's upstream and downstream interfaces and dependencies with the same levels of protection.

Hadoop environments can support many Enterprise applications and business processes. Each application will vary in importance, much like production and non-production system classifications. The importance of Hadoop workloads is a characteristic we should exploit when devising a data protection strategy. If we fall into the trap of treating everything as equal, we are back to being overwhelmed by the size and scope of the problem ahead of us.

So, when thinking about data protection for Hadoop environments, we should focus our efforts on the applications and corresponding data sets that matter. If we do this, we may find the profile of the challenge changes from a single 500TB protection problem into a 100TB problem addressed with method A, a 200TB problem addressed with method B, and 200TB that does not require additional levels of protection.

Hadoop Inbuilt Data Protection

There are two styles of data protection provided by Hadoop.

Onboard protection is built into the Hadoop Distributed File System (HDFS). The level of protection provided is confined to an individual Hadoop cluster; if the Hadoop cluster or file system suffers a major malfunction then these methods will likely be compromised.

Offboard protection is primarily concerned with creating copies of files from one Hadoop cluster to another Hadoop cluster or independent storage system.

Onboard Data Protection Features

In a typical Enterprise application we usually rely on the storage system to provide data redundancy both from an access layer (storage controllers) and persistence layer (storage media).

In Hadoop, redundancy at the persistence layer is provided by maintaining N-way copies (3 by default) of each data block across data nodes. This is called the system-wide replication factor and is applied by default when files are created. The replication factor can also be defined explicitly when files are created and can be modified retrospectively. This flexibility enables redundancy to be aligned to the importance and value of data.
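
To make this concrete, here is a minimal sketch of how the replication factor can be set per file with the standard HDFS client. The paths and values are placeholders; substitute your own.

# Upload a file with a non-default replication factor of 2
hdfs dfs -D dfs.replication=2 -put /tmp/report.csv /important/report.csv

# Raise the replication factor of an existing file to 5 and wait for re-replication to complete
hdfs dfs -setrep -w 5 /important/report.csv

# Confirm the replication factor (shown in the second column of the listing)
hdfs dfs -ls /important/report.csv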

Redundancy at the access layer is provided by the NameNode. The NameNode tracks the location and availability of data blocks and instructs Hadoop clients which data nodes should be used when reading and writing data blocks.

Hadoop provides proactive protection. Many filesystems assume data stays correct after it is written. This is a naive assumption that has been researched extensively. HDFS protects against data corruption by verifying the integrity of data blocks and repairing them from replica copies. This process occurs as a background task periodically and when Hadoop clients read data from HDFS.
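
A quick way to see this machinery at work is hdfs fsck, which reports the health of the blocks and replicas under a given path. This is a read-only check and the path below is only an example.

# Report file, block and replica health for a directory
hdfs fsck /mydir -files -blocks -locations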

HDFS supports snapshots of the entire file system or of individual directories (nested snapshot directories are not allowed). HDFS snapshots are read-only logical copies of directories and files that preserve the state of persistent data. One unexpected behaviour of HDFS snapshots is that they are not atomic. In our testing, HDFS snapshots only preserve consistency on file close: if a snapshot is taken of an open file, the resulting snapshot will represent the file data as it stands once the open file is closed. This behaviour is at odds with traditional storage and filesystem based snapshot implementations, which preserve data at the time of the snapshot regardless of state.

To illustrate this point we created a snapshot (s1) of a closed file, then started appending to the same file and, while the append was running, created another snapshot (s2). We then compared the size of the original file with the two snapshot versions. The first snapshot did not grow and represented the file data before the append operation. The second snapshot grew after it was created and represented the file data after the append operation completed (i.e. on file close).

Here is the experiment if you want to try it yourself.

Create a directory and enable snapshots.

[root@hadoop ~]# hdfs dfs -mkdir /mydir
[root@hadoop ~]# hdfs dfsadmin -allowSnapshot /mydir
Allowing snaphot on /mydir succeeded

Upload a file to the directory.

[root@hadoop ~]# hdfs dfs -put /boot/vmlinuz–3.10.0–123.el7.x86_64 /mydir
[root@hadoop ~]# hdfs dfs -ls /mydir
Found 1 items
-rw-r–r– 3 root supergroup 4902656 2015–01–07 04:23 /mydir/vmlinuz–3.10.0–123.el7.x86_64

Create a snapshot of the directory. In this case the file in the directory is idle.

[root@hadoop ~]# hdfs dfs -ls /mydir
Found 1 items
-rw-r–r– 3 root supergroup 2630536960 2015–01–07 04:27 /mydir/vmlinuz–3.10.0–123.el7.x86_64
[root@hadoop ~]# hdfs dfs -createSnapshot /mydir s1
Created snapshot /mydir/.snapshot/s1
[root@hadoop ~]# hdfs dfs -ls /mydir/.snapshot/s1
Found 1 items
-rw-r–r– 3 root supergroup 2630536960 2015–01–07 04:27 /mydir/.snapshot/s1/vmlinuz–3.10.0–123.el7.x86_64

The size of the original file and snapshot version are the same. This is what we expected.

Append to the original file in the background and, while the append is running, create a new snapshot (s2).

[root@hadoop ~]# hdfs dfs -appendToFile /vmlinuz–3.10.0–123.el7.x86_64 /mydir/vmlinuz–3.10.0–123.el7.x86_64 &
[1] 20263
[root@hadoop ~]# hdfs dfs -ls /mydir/vmlinuz–3.10.0–123.el7.x86_64
-rw-r–r– 3 root supergroup 2684354560 2015–01–07 04:27 /mydir/vmlinuz–3.10.0–123.el7.x86_64
[root@hadoop ~]# hdfs dfs -createSnapshot /mydir s2
Created snapshot /mydir/.snapshot/s2
[root@hadoop ~]# hdfs dfs -ls /mydir/.snapshot/s2/vmlinuz–3.10.0–123.el7.x86_64
-rw-r–r– 3 root supergroup 3489660928 2015–01–07 04:27 /mydir/.snapshot/s2/vmlinuz–3.10.0–123.el7.x86_64
[root@hadoop ~]# hdfs dfs -ls /mydir/.snapshot/s2/vmlinuz–3.10.0–123.el7.x86_64
-rw-r–r– 3 root supergroup 3892314112 2015–01–07 04:27 /mydir/.snapshot/s2/vmlinuz–3.10.0–123.el7.x86_64
[root@hadoop ~]# hdfs dfs -ls /mydir/.snapshot/s1
Found 1 items
-rw-r–r– 3 root supergroup 2630536960 2015–01–07 04:27 /mydir/.snapshot/s1/vmlinuz–3.10.0–123.el7.x86_64

Notice the second snapshot (s2) continues to grow as the file append runs in the background. The first snapshot (s1) does not change.

This demonstrates that HDFS snapshots provide consistency only on file close. This may well be how the designers intended them to work, but coming from a storage background it is important to understand this behaviour, particularly if HDFS snapshots form part of your data protection strategy.
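
If HDFS snapshots do form part of your strategy, the snapshot diff report is a useful way to see what changed between two snapshots of the same directory. Using the s1 and s2 snapshots from the experiment above:

# List the paths created, deleted, renamed or modified between snapshots s1 and s2
hdfs snapshotDiff /mydir s1 s2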

Hadoop supports the concept of a recycle bin, implemented by the HDFS Trash feature. When files are deleted using the HDFS client they are moved to a user-level trash folder, where they persist for a predefined amount of time. Once the time elapses the file is deleted from the HDFS namespace and the referenced blocks are reclaimed as part of the space reclamation process. Files that have not yet expired can be moved back into a directory, which in effect undoes the delete operation.
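
The trash retention period is controlled through core-site.xml. Below is a minimal example assuming a 24 hour retention window; the values are in minutes and should be adjusted to suit your environment.

<property>
 <name>fs.trash.interval</name>
 <value>1440</value>
</property>
<property>
 <name>fs.trash.checkpoint.interval</name>
 <value>60</value>
</property>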

Below is an example of a file that was deleted and subsequently moved to trash.

[hdfs@hdp–1 ~]$ hdfs dfs -rm /tmp/myfile
14/12/01 16:00:55 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 21600000 minutes, Emptier interval = 0 minutes.
Moved: ‘hdfs://hdp–1.mlab.local:8020/tmp/myfile’ to trash at: hdfs://hdp–1.mlab.local:8020/user/hdfs/.Trash/Current

If we want to access a file in trash we can use regular HDFS commands referencing the file's trash path.

[hdfs@hdp–1 ~]$ hdfs dfs -ls /user/hdfs/.Trash/Current/tmp
Found 1 items
-rw-r–r– 2 hdfs hdfs 13403947008 2014–11–25 15:39 /user/hdfs/.Trash/Current/tmp/myfile
[hdfs@hdp–1 ~]$ hdfs dfs -mv /user/hdfs/.Trash/Current/tmp/myfile /tmp

There are a few caveats to be aware of:

  • Trash is implemented in the HDFS client only. If we are interfacing with HDFS using a programmatic API the trash function is not applied.
  • Use of trash cannot be enforced. A user can bypass trash by specifying the -skipTrash argument to the HDFS client.

Offboard Data Protection Features

Hadoop supports copying files in and out of Hadoop clusters using Hadoop Distributed Copy (distcp). This technology is designed to copy large volumes of data within the same Hadoop cluster, to another Hadoop cluster, or to Amazon S3 (or an S3-compliant object store), OpenStack Swift, FTP, or NAS storage visible to all Hadoop data nodes. Support for Microsoft Azure was added in the 2.7.0 release (see HADOOP-9629).

A key benefit of distcp is that it uses Hadoop's parallel processing model (MapReduce) to carry out the work. This is important for large Hadoop clusters, as a protection method that does not leverage the distributed nature of Hadoop is bound to hit scalability limits.
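
For readers who have not used distcp before, here is a sketch of a basic cluster-to-cluster copy. The host names are placeholders, -m caps the number of map tasks used, and -update only copies files that are missing or have changed at the target.

# Copy /data from one cluster to another using up to 20 map tasks
hadoop distcp -m 20 -update hdfs://prodcluster:8020/data hdfs://drcluster:8020/data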

One downside of the current distcp implementation is that the replication of a single file is bound to one map task, and therefore to one data node.

Why is this a problem?

Each file requiring replication can only consume the networking resources available to the data node running the corresponding map task. For many small files this does not represent a problem, as many map tasks can be used to distribute the workload. However, for the very large files that are common in Hadoop environments, the performance of an individual file copy is limited by the bandwidth available to an individual data node. For example, if data nodes are connected via 1 GbE and the average file size in the cluster is 10TB, then the minimum time required to copy one average file is around 22 hours.
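
The 22 hour figure is simple arithmetic: 10TB is roughly 8 x 10^13 bits and a single 1 Gb/s link moves at most 10^9 bits per second. Here is a back-of-the-envelope check, assuming the link runs at full line rate (which in practice it will not):

# (10 TB in bits) / (1 Gb/s) / (seconds per hour) = ~22.2 hours
echo "scale=1; (10 * 1000^4 * 8) / 10^9 / 3600" | bc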

To address this situation distcp would need to evolve to support distributed block-level copies, which to my knowledge is not currently available. Another option would be to compress files transparently prior to transmission, however that is not available either (see HADOOP-8065).

Hadoop Distributed Copy to EMC ViPR (S3 interface)

We previously mentioned that it is possible to copy files from Hadoop to S3-compliant object stores. This section describes how to use distributed copy with the EMC ViPR object store, which is S3 compliant to the extent required.

S3 support is provided by the JetS3t API (s3 and s3n drivers) or the AWS S3 SDK (s3a driver). The JetS3t API allows the S3 endpoint to be configured to a target other than Amazon Web Services (AWS). To do this we need to set some properties that will get passed to the JetS3t API.

Create a file called jets3t.properties in your Hadoop configuration directory. In my case it is /usr/local/hadoop/etc/hadoop.

Set some properties as follows:

s3service.s3-endpoint=vipr-ds1.mlab.local
s3service.s3-endpoint-http-port=9020
s3service.s3-endpoint-https-port=9021
s3service.disable-dns-buckets=true
s3service.https-only=false

Here we set the S3 endpoint to our ViPR data store (if you have multiple data stores, point this at your load balancer) and set the HTTP and HTTPS ports that ViPR listens on. We disabled the use of DNS to resolve bucket names since we did not set up wildcard DNS for S3 bucket naming. We also disabled HTTPS-only access, as HTTPS requires real certificates rather than self-signed ones to function.

Also in the same directory is the Hadoop core-site.xml. Set the following properties in this file:

<property>
 <name>fs.s3.awsAccessKeyId</name>
 <value>root</value>
</property>
<property>
 <name>fs.s3n.awsAccessKeyId</name>
 <value>root</value>
</property>
<property>
 <name>fs.s3n.awsSecretAccessKey</name>
 <value>6U08BuxfB6DTWPMzUU7tgfXDGBzXMEu/DLBG++5m</value>
</property>
<property>
 <name>fs.s3.awsSecretAccessKey</name>
 <value>6U08BuxfB6DTWPMzUU7tgfXDGBzXMEu/DLBG++5m</value>
</property>

The access key id and secret access key are obtained from ViPR and defined when the ViPR tenant and bucket are setup.

Here are a few screenshots of the ViPR bucket setup and object access key.
[Screenshot: ViPR bucket configuration]
[Screenshot: ViPR object access key]

We also need to define the Java classes the Hadoop client should use when S3 URIs are referenced. Add the following properties to core-site.xml:

<property>
 <name>fs.s3.impl</name>
 <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>
<property>
 <name>fs.s3n.impl</name>
 <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
<property>
 <name>fs.AbstractFileSystem.s3n.impl</name>
 <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
<property>
 <name>fs.AbstractFileSystem.s3.impl</name>
 <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>

Ensure both the core-site.xml and jets3t.properties files are copied to all Hadoop data nodes.

We now need to add the S3 jar files to the Hadoop classpath. One way to do this is to link the jar files into a directory included in the Hadoop classpath.

To view the classpath use:

[root@hadoop ~]# hadoop classpath
/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar

Symbolically link the jar files into one of these directories.

[root@hadoop ~]# cd /usr/local/hadoop/share/hadoop/hdfs/
[root@hadoop hdfs]# ln -s ../tools/lib/hadoop-aws-2.6.0.jar
[root@hadoop hdfs]# ln -s ../tools/lib/aws-java-sdk-1.7.4.jar
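
An alternative to creating symbolic links, which we did not use here, is to extend the classpath through the HADOOP_CLASSPATH environment variable (for example in hadoop-env.sh). The path below assumes the same installation layout as above.

# Make the S3 filesystem and AWS SDK jars visible to Hadoop commands
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/usr/local/hadoop/share/hadoop/tools/lib/*"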

Now we can start using the ViPR object store as both a source for Hadoop processing and a target for distcp.

Here is how we copy a file from HDFS to the ViPR object store bucket named hadoopbackups.

[root@hadoop ~]# hdfs dfs -put /boot/vmlinuz–3.10.0–123.el7.x86_64 s3n://hadoopbackups/
14/12/30 00:55:05 INFO s3native.NativeS3FileSystem: OutputStream for key ‘vmlinuz–3.10.0–123.el7.x86_64.COPYING’ writing to tempfile ‘/tmp/hadoop-root/s3/output–4123538761754760460.tmp’
14/12/30 00:55:06 INFO s3native.NativeS3FileSystem: OutputStream for key ‘vmlinuz–3.10.0–123.el7.x86_64.COPYING’ closed. Now beginning upload
14/12/30 00:55:06 INFO s3native.NativeS3FileSystem: OutputStream for key ‘vmlinuz–3.10.0–123.el7.x86_64.COPYING’ upload complete

Here is how to list the contents of the ViPR object store bucket.

[root@hadoop ~]# hdfs dfs -ls s3n://hadoopbackups/
Found 1 items

Here is how to copy a directory using distcp from HDFS to the ViPR object store bucket.

[root@hadoop ~]# hadoop distcp -pt -update -delete -append hdfs://hadoop:9000/files/ s3n://hadoopbackups/files
14/12/30 23:17:56 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=true, ignoreFailures=false, maxMaps=20, sslConfigurationFile=’null‘, copyStrategy=’uniformsize’, sourceFileListing=null, sourcePaths=[hdfs://hadoop:9000/files], targetPath=s3n://hadoopbackups/files, targetPathExists=false, preserveRawXattrs=false}
14/12/30 23:17:56 INFO client.RMProxy: Connecting to ResourceManager at hadoop/192.168.0.94:8032
14/12/30 23:17:57 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
14/12/30 23:17:57 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
14/12/30 23:17:57 INFO client.RMProxy: Connecting to ResourceManager at hadoop/192.168.0.94:8032
14/12/30 23:17:58 INFO mapreduce.JobSubmitter: number of splits:10
14/12/30 23:17:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1419937768824_0002
14/12/30 23:17:58 INFO impl.YarnClientImpl: Submitted application application_1419937768824_0002
14/12/30 23:17:58 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1419937768824_0002/
14/12/30 23:17:58 INFO tools.DistCp: DistCp job-id: job_1419937768824_0002
14/12/30 23:17:58 INFO mapreduce.Job: Running job: job_1419937768824_0002
14/12/30 23:18:05 INFO mapreduce.Job: Job job_1419937768824_0002 running in uber mode : false
14/12/30 23:18:05 INFO mapreduce.Job: map 0% reduce 0%
14/12/30 23:18:20 INFO mapreduce.Job: map 50% reduce 0%
14/12/30 23:18:21 INFO mapreduce.Job: map 60% reduce 0%
14/12/30 23:18:22 INFO mapreduce.Job: map 70% reduce 0%
14/12/30 23:18:24 INFO mapreduce.Job: map 100% reduce 0%
14/12/30 23:18:25 INFO mapreduce.Job: Job job_1419937768824_0002 completed successfully
14/12/30 23:18:25 INFO mapreduce.Job: Counters: 38
 File System Counters
 FILE: Number of bytes read=0
 FILE: Number of bytes written=1110540
 FILE: Number of read operations=0
 FILE: Number of large read operations=0
 FILE: Number of write operations=0
 HDFS: Number of bytes read=44129667
 HDFS: Number of bytes written=0
 HDFS: Number of read operations=97
 HDFS: Number of large read operations=0
 HDFS: Number of write operations=20
 S3N: Number of bytes read=0
 S3N: Number of bytes written=44123904
 S3N: Number of read operations=0
 S3N: Number of large read operations=0
 S3N: Number of write operations=0
 Job Counters
 Launched map tasks=10
 Other local map tasks=10
 Total time spent by all maps in occupied slots (ms)=131490
 Total time spent by all reduces in occupied slots (ms)=0
 Total time spent by all map tasks (ms)=131490
 Total vcore-seconds taken by all map tasks=131490
 Total megabyte-seconds taken by all map tasks=134645760
 Map-Reduce Framework
 Map input records=10
 Map output records=0
 Input split bytes=1340
 Spilled Records=0
 Failed Shuffles=0
 Merged Map outputs=0
 GC time elapsed (ms)=6263
 CPU time spent (ms)=23840
 Physical memory (bytes) snapshot=1677352960
 Virtual memory (bytes) snapshot=21159137280
 Total committed heap usage (bytes)=1164443648
 File Input Format Counters
 Bytes Read=4423
 File Output Format Counters
 Bytes Written=0
 org.apache.hadoop.tools.mapred.CopyMapper$Counter
 BYTESCOPIED=44123904
 BYTESEXPECTED=44123904
 COPY=10

We can analyse the distcp job in more detail by browsing the job history. Here we can see the job was split into 10 map tasks (one file per map task).

[Screenshot: distcp job history showing the job split into 10 map tasks]

We can see the duration of each map task.

[Screenshot: duration of each map task]

We can review various counters. The most interesting are the number of bytes copied and bytes expected.

[Screenshot: distcp job counters, including bytes copied and bytes expected]

These counters can also be viewed from the command line.

[root@hadoop ~]# mapred job -counter job_1419937768824_0003 ‘org.apache.hadoop.tools.mapred.CopyMapper$Counter’ BYTESCOPIED
14/12/30 23:21:07 INFO client.RMProxy: Connecting to ResourceManager at hadoop/192.168.0.94:8032
14/12/30 23:21:08 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
44123904
[root@hadoop ~]# mapred job -counter job_1419937768824_0003 ‘org.apache.hadoop.tools.mapred.CopyMapper$Counter’ BYTESEXPECTED
14/12/30 23:21:16 INFO client.RMProxy: Connecting to ResourceManager at hadoop/192.168.0.94:8032
14/12/30 23:21:17 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
44123904

We can also copy files from the ViPR object store bucket to HDFS.

[root@hadoop ~]# hadoop distcp -pt -update -delete -append s3n://hadoopbackups/files/ hdfs://hadoop:9000/files2
14/12/30 23:18:54 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=true, ignoreFailures=false, maxMaps=20, sslConfigurationFile=’null‘, copyStrategy=’uniformsize’, sourceFileListing=null, sourcePaths=[s3n://hadoopbackups/files], targetPath=hdfs://hadoop:9000/files2, targetPathExists=false, preserveRawXattrs=false}
14/12/30 23:18:54 INFO client.RMProxy: Connecting to ResourceManager at hadoop/192.168.0.94:8032
14/12/30 23:18:56 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
14/12/30 23:18:56 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
14/12/30 23:18:56 INFO client.RMProxy: Connecting to ResourceManager at hadoop/192.168.0.94:8032
14/12/30 23:18:57 INFO mapreduce.JobSubmitter: number of splits:10
14/12/30 23:18:57 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1419937768824_0003
14/12/30 23:18:57 INFO impl.YarnClientImpl: Submitted application application_1419937768824_0003
14/12/30 23:18:57 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1419937768824_0003/
14/12/30 23:18:57 INFO tools.DistCp: DistCp job-id: job_1419937768824_0003
14/12/30 23:18:57 INFO mapreduce.Job: Running job: job_1419937768824_0003
14/12/30 23:19:03 INFO mapreduce.Job: Job job_1419937768824_0003 running in uber mode : false
14/12/30 23:19:04 INFO mapreduce.Job: map 0% reduce 0%
14/12/30 23:19:11 INFO mapreduce.Job: map 10% reduce 0%
14/12/30 23:19:14 INFO mapreduce.Job: map 50% reduce 0%
14/12/30 23:19:15 INFO mapreduce.Job: map 60% reduce 0%
14/12/30 23:19:16 INFO mapreduce.Job: map 70% reduce 0%
14/12/30 23:19:17 INFO mapreduce.Job: map 80% reduce 0%
14/12/30 23:19:19 INFO mapreduce.Job: map 90% reduce 0%
14/12/30 23:19:20 INFO mapreduce.Job: map 100% reduce 0%
14/12/30 23:19:20 INFO mapreduce.Job: Job job_1419937768824_0003 completed successfully
14/12/30 23:19:20 INFO mapreduce.Job: Counters: 38
 File System Counters
 FILE: Number of bytes read=0
 FILE: Number of bytes written=1110540
 FILE: Number of read operations=0
 FILE: Number of large read operations=0
 FILE: Number of write operations=0
 HDFS: Number of bytes read=5516
 HDFS: Number of bytes written=44123904
 HDFS: Number of read operations=143
 HDFS: Number of large read operations=0
 HDFS: Number of write operations=39
 S3N: Number of bytes read=44123904
 S3N: Number of bytes written=0
 S3N: Number of read operations=0
 S3N: Number of large read operations=0
 S3N: Number of write operations=0
 Job Counters
 Launched map tasks=10
 Other local map tasks=10
 Total time spent by all maps in occupied slots (ms)=88546
 Total time spent by all reduces in occupied slots (ms)=0
 Total time spent by all map tasks (ms)=88546
 Total vcore-seconds taken by all map tasks=88546
 Total megabyte-seconds taken by all map tasks=90671104
 Map-Reduce Framework
 Map input records=10
 Map output records=0
 Input split bytes=1340
 Spilled Records=0
 Failed Shuffles=0
 Merged Map outputs=0
 GC time elapsed (ms)=3352
 CPU time spent (ms)=21580
 Physical memory (bytes) snapshot=1673371648
 Virtual memory (bytes) snapshot=21154611200
 Total committed heap usage (bytes)=1137180672
 File Input Format Counters
 Bytes Read=4176
 File Output Format Counters
 Bytes Written=0
 org.apache.hadoop.tools.mapred.CopyMapper$Counter
 BYTESCOPIED=44123904
 BYTESEXPECTED=44123904
 COPY=10

This job is also distributed with multiple map tasks copying files from the ViPR object store to HDFS.

We can verify this by looking at the successful map attempts and in particular the status and node columns. These show the copy tasks were distributed amongst Hadoop data nodes.

[Screenshot: successful map attempts showing copy tasks distributed across the Hadoop data nodes]

We can also list the files in the ViPR object store bucket. Keep in mind the command does not represent S3 folders the way you might expect (look at the files directory below).

[root@hadoop ~]# hdfs dfs -ls s3n://hadoopbackups/
Found 10 items
drwxrwxrwx - 0 1970–01–01 10:00 s3n://hadoopbackups/files
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_1
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_2
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_3
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_4
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_5
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_6
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_7
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_8
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_9

If we copy the files directory again, this time to files2, things start to look confusing.

[root@hadoop ~]# hdfs dfs -ls s3n://hadoopbackups/
Found 20 items
drwxrwxrwx - 0 1970–01–01 10:00 s3n://hadoopbackups/files
drwxrwxrwx - 0 1970–01–01 10:00 s3n://hadoopbackups/files2
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_1
-rw-rw-rw- 1 4902656 2014–12–30 23:26 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_1
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_2
-rw-rw-rw- 1 4902656 2014–12–30 23:26 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_2
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_3
-rw-rw-rw- 1 4902656 2014–12–30 23:26 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_3
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_4
-rw-rw-rw- 1 4902656 2014–12–30 23:26 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_4
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_5
-rw-rw-rw- 1 4902656 2014–12–30 23:26 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_5
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_6
-rw-rw-rw- 1 4902656 2014–12–30 23:26 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_6
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_7
-rw-rw-rw- 1 4902656 2014–12–30 23:26 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_7
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_8
-rw-rw-rw- 1 4902656 2014–12–30 23:26 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_8
-rw-rw-rw- 1 4902656 2014–12–30 23:19 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_9
-rw-rw-rw- 1 4902656 2014–12–30 23:26 s3n://hadoopbackups/vmlinuz–3.10.0–123.el7.x86_64_9

However, if we browse the bucket using s3fs (which I installed previously), all is well.

[root@hadoop ~]# s3fs -o url=http://vipr-ds1.mlab.local:9020 -o passwd_file=/usr/local/etc/passwd-s3fs hadoopbackups /mnt
[root@hadoop ~]# ls /mnt
files files2
[root@hadoop ~]# ls /mnt/files
vmlinuz–3.10.0–123.el7.x86_64_1 vmlinuz–3.10.0–123.el7.x86_64_3 vmlinuz–3.10.0–123.el7.x86_64_5 vmlinuz–3.10.0–123.el7.x86_64_7 vmlinuz–3.10.0–123.el7.x86_64_9
vmlinuz–3.10.0–123.el7.x86_64_2 vmlinuz–3.10.0–123.el7.x86_64_4 vmlinuz–3.10.0–123.el7.x86_64_6 vmlinuz–3.10.0–123.el7.x86_64_8
[root@hadoop ~]# ls /mnt/files2
vmlinuz–3.10.0–123.el7.x86_64_1 vmlinuz–3.10.0–123.el7.x86_64_3 vmlinuz–3.10.0–123.el7.x86_64_5 vmlinuz–3.10.0–123.el7.x86_64_7 vmlinuz–3.10.0–123.el7.x86_64_9
vmlinuz–3.10.0–123.el7.x86_64_2 vmlinuz–3.10.0–123.el7.x86_64_4 vmlinuz–3.10.0–123.el7.x86_64_6 vmlinuz–3.10.0–123.el7.x86_64_8

Note 1: keep this in mind when using hdfs dfs -ls against an S3 bucket.
Note 2: when copying directories from HDFS to S3, ensure there are no nested directories; we observed that distcp fails to copy such a structure back into HDFS from the parent S3 folder.

We can also delete files from the ViPR object store bucket.

[root@hadoop ~]# hdfs dfs -rm -r -skipTrash s3n://hadoopbackups/files
Deleted s3n://hadoopbackups/files

Note: we need to specify the -skipTrash flag, otherwise the HDFS client will create a trash folder on the object store and move the files there instead of deleting them.

Hadoop Distributed Copy to EMC Data Domain (as NFS target)

In addition to S3, we can just as easily create copies of HDFS files on Data Domain Protection Storage for Backup and Archive.

To accomplish this we must first export an NFS share from the Data Domain system. For this experiment we created a Data Domain mtree called hadoop and exported it to the three Hadoop data nodes.

[Screenshot: Data Domain NFS export for the hadoop mtree]

We then mounted the share on the three data nodes under /dd.

[root@hadoop ~]# ssh hadoop mount ddve–03.mlab.local:/data/col1/hadoop /dd
[root@hadoop ~]# ssh hadoop–1 mount ddve–03.mlab.local:/data/col1/hadoop /dd
[root@hadoop ~]# ssh hadoop–2 mount ddve–03.mlab.local:/data/col1/hadoop /dd

Now we can copy files from HDFS to Data Domain as follows:

[root@hadoop ~]# hadoop distcp /files file:///dd/
14/12/31 03:45:47 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile=’null‘, copyStrategy=’uniformsize’, sourceFileListing=null, sourcePaths=[/files], targetPath=file:/dd, targetPathExists=true, preserveRawXattrs=false}
14/12/31 03:45:47 INFO client.RMProxy: Connecting to ResourceManager at hadoop/192.168.0.94:8032
14/12/31 03:45:52 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
14/12/31 03:45:52 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
14/12/31 03:45:52 INFO client.RMProxy: Connecting to ResourceManager at hadoop/192.168.0.94:8032
14/12/31 03:45:53 INFO mapreduce.JobSubmitter: number of splits:10
14/12/31 03:45:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1419937768824_0007
14/12/31 03:45:54 INFO impl.YarnClientImpl: Submitted application application_1419937768824_0007
14/12/31 03:45:55 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1419937768824_0007/
14/12/31 03:45:55 INFO tools.DistCp: DistCp job-id: job_1419937768824_0007
14/12/31 03:45:55 INFO mapreduce.Job: Running job: job_1419937768824_0007
14/12/31 03:46:03 INFO mapreduce.Job: Job job_1419937768824_0007 running in uber mode : false
14/12/31 03:46:03 INFO mapreduce.Job: map 0% reduce 0%
14/12/31 03:46:16 INFO mapreduce.Job: map 20% reduce 0%
14/12/31 03:46:17 INFO mapreduce.Job: map 30% reduce 0%
14/12/31 03:46:20 INFO mapreduce.Job: map 50% reduce 0%
14/12/31 03:46:21 INFO mapreduce.Job: map 60% reduce 0%
14/12/31 03:46:22 INFO mapreduce.Job: map 70% reduce 0%
14/12/31 03:46:36 INFO mapreduce.Job: map 100% reduce 0%
14/12/31 03:46:37 INFO mapreduce.Job: Job job_1419937768824_0007 completed successfully
14/12/31 03:46:37 INFO mapreduce.Job: Counters: 33
 File System Counters
 FILE: Number of bytes read=0
 FILE: Number of bytes written=45215734
 FILE: Number of read operations=0
 FILE: Number of large read operations=0
 FILE: Number of write operations=0
 HDFS: Number of bytes read=44129771
 HDFS: Number of bytes written=0
 HDFS: Number of read operations=97
 HDFS: Number of large read operations=0
 HDFS: Number of write operations=20
 Job Counters
 Launched map tasks=10
 Other local map tasks=10
 Total time spent by all maps in occupied slots (ms)=182909
 Total time spent by all reduces in occupied slots (ms)=0
 Total time spent by all map tasks (ms)=182909
 Total vcore-seconds taken by all map tasks=182909
 Total megabyte-seconds taken by all map tasks=187298816
 Map-Reduce Framework
 Map input records=10
 Map output records=0
 Input split bytes=1330
 Spilled Records=0
 Failed Shuffles=0
 Merged Map outputs=0
 GC time elapsed (ms)=5525
 CPU time spent (ms)=16300
 Physical memory (bytes) snapshot=1595691008
 Virtual memory (bytes) snapshot=21133045760
 Total committed heap usage (bytes)=1116733440
 File Input Format Counters
 Bytes Read=4537
 File Output Format Counters
 Bytes Written=0
 org.apache.hadoop.tools.mapred.CopyMapper$Counter
 BYTESCOPIED=44123904
 BYTESEXPECTED=44123904
 COPY=10

We can list the Data Domain contents.

[root@hadoop ~]# hdfs dfs -ls file:///dd/files
Found 9 items
-rw-r–r– 1 root root 4902656 2014–12–31 03:46 file:///dd/files/vmlinuz–3.10.0–123.el7.x86_64_1
-rw-r–r– 1 root root 4902656 2014–12–31 03:46 file:///dd/files/vmlinuz–3.10.0–123.el7.x86_64_2
-rw-r–r– 1 root root 4902656 2014–12–31 03:46 file:///dd/files/vmlinuz–3.10.0–123.el7.x86_64_3
-rw-r–r– 1 root root 4902656 2014–12–31 03:46 file:///dd/files/vmlinuz–3.10.0–123.el7.x86_64_4
-rw-r–r– 1 root root 4902656 2014–12–31 03:46 file:///dd/files/vmlinuz–3.10.0–123.el7.x86_64_5
-rw-r–r– 1 root root 4902656 2014–12–31 03:46 file:///dd/files/vmlinuz–3.10.0–123.el7.x86_64_6
-rw-r–r– 1 root root 4902656 2014–12–31 03:46 file:///dd/files/vmlinuz–3.10.0–123.el7.x86_64_7
-rw-r–r– 1 root root 4902656 2014–12–31 03:46 file:///dd/files/vmlinuz–3.10.0–123.el7.x86_64_8
-rw-r–r– 1 root root 4902656 2014–12–31 03:46 file:///dd/files/vmlinuz–3.10.0–123.el7.x86_64_9

We can also protect the file copies on Data Domain using the retention lock feature. Retention lock prevents files from being deleted until a specified date and time. This technique is commonly used to comply with data retention regulations, but it can also provide enhanced protection against human error and malware, which is important for this use case because the protection copies are directly exposed to the system being protected.

Here is an example of using retention lock. Set the access time of the file to a date in the future.

[root@hadoop ~]# date
Wed Dec 31 12:02:34 EST 2014
[root@hadoop ~]# touch -a -t 201501011000 /dd/files/vmlinuz–3.10.0–123.el7.x86_64_1
[root@hadoop ~]# ls -lu /dd/files/vmlinuz–3.10.0–123.el7.x86_64_1
-rw-r–r– 1 root root 4902656 Jan 1 2015 /dd/files/vmlinuz–3.10.0–123.el7.x86_64_1

Then try and delete it.

[root@hadoop ~]# rm -f /dd/files/vmlinuz–3.10.0–123.el7.x86_64_1
rm: cannot remove ‘/dd/files/vmlinuz–3.10.0–123.el7.x86_64_1’: Permission denied

Retention lock prevents the file from being deleted.

Note: retention lock must be licensed and enabled on the Data Domain mtree that we are using to store protection copies.

Data Domain also supports a feature called fast copy. This enables the Data Domain file system namespace to be duplicated to different areas on the same Data Domain system. The duplication process is almost instantaneous and creates a copy of the pointers representing the directories and files we are interested in copying. The copy consumes no additional space as the data remains deduplicated against the original.

The fast copy technology can be used to support an incremental forever protection strategy for Hadoop data sets with versioning enabled. Let’s demonstrate how this works in practice.

Let’s say we want to create weekly versioned copies of the HDFS /files directory on a Data Domain system.

First we create a copy of the initial version using distcp to the Data Domain mtree that we have mounted under /dd.

The first version will be copied to a directory named after the current date in YYYYMMDD format. In this example we use the date command with backticks to evaluate the current system date on the command line.

[root@hadoop ~]# hadoop distcp -pt -update -delete -append /files file:///dd/`date +%Y%m%d`

We can see the files were copied as expected.

[root@hadoop ~]# hdfs dfs -ls file:///dd/`date +%Y%m%d`
Found 9 items
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_1
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_2
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_3
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_4
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_5
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_6
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_7
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_8
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_9

Immediately after creating the copy on Data Domain we want to protect it further by taking a fast copy to an mtree that is not visible to Hadoop. To do this we need a second mtree set up on the Data Domain system (we created /data/col1/hadoop_fc ahead of time) that we fast copy into.

To run the fast copy command we need SSH keys set up between the controlling host and the Data Domain system. This was set up ahead of time for a user called hadoop. Keep in mind this entire process does not need to run from a Hadoop node; it should be run from an independent, trusted and secure host that has access to both Hadoop and Data Domain.

Let's fast copy the current version from the hadoop mtree to the hadoop_fc mtree for added protection.

[root@hadoop ~]# ssh hadoop@ddve–03 filesys fastcopy source /data/col1/hadoop/`date +%Y%m%d` destination /data/col1/hadoop_fc/
Data Domain OS
(00:00) Waiting for fastcopy to complete…
Fastcopy status: fastcopy /data/col1/hadoop/20141231 to /data/col1/hadoop_fc/20141231: copied 9 files, 1 directory in 0.10 seconds

Now a week passes and we want to preserve today's version of the HDFS /files directory. To demonstrate the differences between the two versions, let's first delete a file from HDFS to show how this approach enables versioning.

[root@hadoop ~]# hdfs dfs -rm /files/vmlinuz–3.10.0–123.el7.x86_64_1
14/12/31 13:09:40 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 3600 minutes, Emptier interval = 3600 minutes.
Moved: ‘hdfs://hadoop:9000/files/vmlinuz–3.10.0–123.el7.x86_64_1’ to trash at: hdfs://hadoop:9000/user/root/.Trash/Current

Now we have 8 files in HDFS /files.

Before we create our weekly protection copy we want to fast copy the previous week's copy into today's. This way, when we distcp today's version of HDFS /files, distcp should recognise that the 8 remaining files already exist, and all it needs to do is delete the one file that no longer exists (the vmlinuz–3.10.0–123.el7.x86_64_1 file we deleted from HDFS).

Note: for this demonstration to work I set the system clock forward 7 days.

First, fast copy last week's copy into today's copy.

[root@hadoop ~]# ssh hadoop@ddve–03 filesys fastcopy source /data/col1/hadoop/`date +%Y%m%d -d '7 days ago'` destination /data/col1/hadoop/`date +%Y%m%d`
Data Domain OS
Fastcopy status: fastcopy /data/col1/hadoop/20141231 to /data/col1/hadoop/20150107: copied 9 files, 1 directory in 0.03 seconds

Now this week's copy looks identical to last week's. That is what we want as the starting position for each subsequent version.

[root@hadoop ~]# hdfs dfs -ls file:///dd/`date +%Y%m%d -d '7 days ago'` file:///dd/`date +%Y%m%d`
Found 9 items
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_1
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_2
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_3
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_4
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_5
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_6
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_7
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_8
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20141231/vmlinuz–3.10.0–123.el7.x86_64_9
Found 9 items
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_1
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_2
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_3
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_4
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_5
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_6
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_7
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_8
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_9

Now we need to synchronise the current HDFS version of /files with this week's copy on the Data Domain. First let's look at our HDFS version.

[root@hadoop ~]# hdfs dfs -ls /files
Found 8 items
-rw-r–r– 3 root supergroup 4902656 2014–12–30 23:16 /files/vmlinuz–3.10.0–123.el7.x86_64_2
-rw-r–r– 3 root supergroup 4902656 2014–12–30 23:16 /files/vmlinuz–3.10.0–123.el7.x86_64_3
-rw-r–r– 3 root supergroup 4902656 2014–12–30 23:16 /files/vmlinuz–3.10.0–123.el7.x86_64_4
-rw-r–r– 3 root supergroup 4902656 2014–12–30 23:16 /files/vmlinuz–3.10.0–123.el7.x86_64_5
-rw-r–r– 3 root supergroup 4902656 2014–12–30 23:16 /files/vmlinuz–3.10.0–123.el7.x86_64_6
-rw-r–r– 3 root supergroup 4902656 2014–12–30 23:16 /files/vmlinuz–3.10.0–123.el7.x86_64_7
-rw-r–r– 3 root supergroup 4902656 2014–12–30 23:16 /files/vmlinuz–3.10.0–123.el7.x86_64_8
-rw-r–r– 3 root supergroup 4902656 2014–12–30 23:16 /files/vmlinuz–3.10.0–123.el7.x86_64_9

Now let's run distcp to synchronise this week's copy.

[root@hadoop ~]# hadoop distcp -pt -update -delete -append /files file:///dd/`date +%Y%m%d`
15/01/07 00:04:31 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=true, ignoreFailures=false, maxMaps=20, sslConfigurationFile=’null‘, copyStrategy=’uniformsize’, sourceFileListing=null, sourcePaths=[/files], targetPath=file:/dd/20150107, targetPathExists=true, preserveRawXattrs=false}
15/01/07 00:04:31 INFO client.RMProxy: Connecting to ResourceManager at hadoop/192.168.0.94:8032
15/01/07 00:04:32 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
15/01/07 00:04:32 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
15/01/07 00:04:32 INFO client.RMProxy: Connecting to ResourceManager at hadoop/192.168.0.94:8032
15/01/07 00:04:33 INFO mapreduce.JobSubmitter: number of splits:8
15/01/07 00:04:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1419937768824_0027
15/01/07 00:04:33 INFO impl.YarnClientImpl: Submitted application application_1419937768824_0027
15/01/07 00:04:33 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1419937768824_0027/
15/01/07 00:04:33 INFO tools.DistCp: DistCp job-id: job_1419937768824_0027
15/01/07 00:04:33 INFO mapreduce.Job: Running job: job_1419937768824_0027
15/01/07 00:04:39 INFO mapreduce.Job: Job job_1419937768824_0027 running in uber mode : false
15/01/07 00:04:39 INFO mapreduce.Job: map 0% reduce 0%
15/01/07 00:04:46 INFO mapreduce.Job: map 50% reduce 0%
15/01/07 00:04:47 INFO mapreduce.Job: map 75% reduce 0%
15/01/07 00:04:48 INFO mapreduce.Job: map 100% reduce 0%
15/01/07 00:04:49 INFO mapreduce.Job: Job job_1419937768824_0027 completed successfully
15/01/07 00:04:50 INFO mapreduce.Job: Counters: 32
 File System Counters
 FILE: Number of bytes read=0
 FILE: Number of bytes written=874648
 FILE: Number of read operations=0
 FILE: Number of large read operations=0
 FILE: Number of write operations=0
 HDFS: Number of bytes read=4631
 HDFS: Number of bytes written=504
 HDFS: Number of read operations=64
 HDFS: Number of large read operations=0
 HDFS: Number of write operations=16
 Job Counters
 Launched map tasks=8
 Other local map tasks=8
 Total time spent by all maps in occupied slots (ms)=39324
 Total time spent by all reduces in occupied slots (ms)=0
 Total time spent by all map tasks (ms)=39324
 Total vcore-seconds taken by all map tasks=39324
 Total megabyte-seconds taken by all map tasks=40267776
 Map-Reduce Framework
 Map input records=8
 Map output records=8
 Input split bytes=1080
 Spilled Records=0
 Failed Shuffles=0
 Merged Map outputs=0
 GC time elapsed (ms)=935
 CPU time spent (ms)=8660
 Physical memory (bytes) snapshot=1314115584
 Virtual memory (bytes) snapshot=16911384576
 Total committed heap usage (bytes)=931659776
 File Input Format Counters
 Bytes Read=3551
 File Output Format Counters
 Bytes Written=504
 org.apache.hadoop.tools.mapred.CopyMapper$Counter
 BYTESSKIPPED=39221248
 SKIP=8

As expected, distcp synchronised the current HDFS version of /files with this week's protection copy. It did very little work to achieve this, since the distcp target was pre-populated from the previous version using Data Domain fast copy.

We can verify by comparing HDFS /files with the current version on the Data Domain.

[root@hadoop ~]# hdfs dfs -ls /files file:///dd/`date +%Y%m%d`
Found 8 items
-rw-r–r– 3 root supergroup 4902656 2014–12–30 23:16 /files/vmlinuz–3.10.0–123.el7.x86_64_2
-rw-r–r– 3 root supergroup 4902656 2014–12–30 23:16 /files/vmlinuz–3.10.0–123.el7.x86_64_3
-rw-r–r– 3 root supergroup 4902656 2014–12–30 23:16 /files/vmlinuz–3.10.0–123.el7.x86_64_4
-rw-r–r– 3 root supergroup 4902656 2014–12–30 23:16 /files/vmlinuz–3.10.0–123.el7.x86_64_5
-rw-r–r– 3 root supergroup 4902656 2014–12–30 23:16 /files/vmlinuz–3.10.0–123.el7.x86_64_6
-rw-r–r– 3 root supergroup 4902656 2014–12–30 23:16 /files/vmlinuz–3.10.0–123.el7.x86_64_7
-rw-r–r– 3 root supergroup 4902656 2014–12–30 23:16 /files/vmlinuz–3.10.0–123.el7.x86_64_8
-rw-r–r– 3 root supergroup 4902656 2014–12–30 23:16 /files/vmlinuz–3.10.0–123.el7.x86_64_9
Found 8 items
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_2
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_3
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_4
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_5
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_6
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_7
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_8
-rw-r–r– 1 root root 4902656 2014–12–30 23:16 file:///dd/20150107/vmlinuz–3.10.0–123.el7.x86_64_9

We can continue this process indefinitely and, as a background task, clean up the versions we no longer require, remembering to remove them from both the hadoop and hadoop_fc mtrees.
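
To tie the steps together, here is a minimal sketch of the weekly workflow as a script, built only from the commands shown above. It assumes the same layout: the hadoop mtree mounted at /dd on every data node, a hadoop_fc mtree for the offline fast copies, and passwordless SSH to the Data Domain system for the hadoop user. The very first run has no previous version to seed from, so the initial fast copy step should be skipped on that run.

#!/bin/bash
# Weekly versioned protection copy of HDFS /files to Data Domain (sketch only)
SRC=/files
DD=hadoop@ddve-03
TODAY=$(date +%Y%m%d)
LASTWEEK=$(date +%Y%m%d -d '7 days ago')

# 1. Seed today's version from last week's copy so distcp only has to move the changes
ssh "$DD" filesys fastcopy source /data/col1/hadoop/"$LASTWEEK" destination /data/col1/hadoop/"$TODAY"

# 2. Synchronise the current HDFS contents into today's version
hadoop distcp -pt -update -delete -append "$SRC" file:///dd/"$TODAY"

# 3. Preserve an additional copy in the mtree that is not visible to Hadoop
ssh "$DD" filesys fastcopy source /data/col1/hadoop/"$TODAY" destination /data/col1/hadoop_fc/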

This approach to Hadoop data protection has some interesting benefits.

  • Produces an incremental forever approach (at the file level) that should scale quite well.
  • Leverages distributed processing (distcp) to drive the creation of protection copies.
  • The protection process can be independent of Hadoop so that human error and malware are less likely to propagate to the protection process and copies i.e. delineation of controls.
  • All the protection copies are deduplicated against each other and stored efficiently for versioning of large data sets.
  • The protection copies are stored in their native format and readily accessible directly from Hadoop for processing without requiring special access, tools or having to restore them back into HDFS first.
  • The protection copies can be replicated between Data Domain systems for efficient offsite protection.

One other possibility that we have yet to exploit is distributing the deduplication process. In the current strategy Data Domain is required to perform the deduplication processing on behalf of the Hadoop data nodes.

[Diagram: deduplication processing performed by the Data Domain system]

What would make a great deal of sense is if we could distribute the deduplication processing to the Hadoop data nodes. That would allow the protection strategy to scale uniformly. Data Domain Boost is designed to do exactly that. It is a software development kit (SDK) that allows deduplication processing to be embedded in the data sources (applications).

[Diagram: deduplication processing distributed to the Hadoop data nodes via DD Boost]

As of this time the SDK is not in the public domain and there is no integration with distcp. If you think this would be a worthwhile development please leave a comment.

Vendor specific Data Protection

Thus far we have touched on the data protection concepts and methods implemented in the standard Apache Hadoop distribution. Now let's look at how each vendor's Hadoop distribution provides protection capabilities unique to that distribution.

The Cloudera CDH distribution provides a Backup and Disaster Recovery solution based on a hardened version of Hadoop distributed copy (distcp). This solution requires a second Hadoop cluster to be established as a replication target.

The Hortonworks distribution comes with Apache Falcon, which provides a data management and processing framework that incorporates governance and regulatory compliance features.

The Pivotal HD distribution provides specific data protection tools for HAWQ, the SQL query engine running on top of Hadoop.

The MapR distribution has taken a different approach to addressing protection. They developed a proprietary filesystem which is API-compatible with HDFS. This is a major shift from the standard Apache distribution and provides a rich set of data services that can be used for protection including support for atomic snapshots.

There are also a number of tool vendors that can be used for Hadoop protection.

WANdisco provides an extension built on top of Apache Hadoop (or any of its supported distribution partners) that is designed to stretch a Hadoop cluster across sites, providing a non-stop architecture. The WANdisco solution does not appear to address versioning, which I consider a core requirement of a data protection solution.

A number of other tools exist from Syncsort and Attunity to move data in and out of Hadoop.

If I have omitted any others feel free to leave a comment.

Apart from software-derived solutions there are converged solutions, such as those offered by EMC, where Pivotal HD, Cloudera CDH or Hortonworks runs on Isilon OneFS, which supports HDFS natively. In these architectures protection can be provided by the Isilon intelligent storage infrastructure: Isilon OneFS supports snapshots, which can be replicated and/or rolled over (at a file or directory level) to alternative storage for data protection and long-term preservation.

Application Protection rather than File Protection

There is a significant caveat with all of the data protection methods we have discussed to date. We have discussed how to protect data stored in HDFS. What we have not considered are the applications layered on top of Hadoop.

HDFS file recovery does not constitute application recovery. In order to provide application recovery we need to consider which files require protection and whether the application can be put into a consistent state at the time the protection copy is created. If we cannot guarantee consistency then we need to consider whether the application can be brought into a consistent state after recovery from an inconsistent copy.

Application consistency is usually a property of the application. For example, transaction processing applications supported by relational databases provide consistency through ACID (Atomicity, Consistency, Isolation, Durability) properties. As part of the protection method we would call on the ACID properties of the application to create a consistent view in preparation for the copy process. If the application does not provide ACID properties then we rely on a volume, file system or storage-based function that allows IO to be momentarily frozen so that an atomic point-in-time view of an object (or group of objects) can be taken.

In the case of the standard Apache Hadoop architecture we would expect HDFS snapshots to provide consistent views of data sets, however, as we demonstrated, in practice this is not the case. And since we don't have the luxury of intelligent storage, we are at the mercy of the application to provide hooks that allow us to take consistent point-in-time views of application data sets. Without this functionality we would need to resort to archaic practices analogous to cold backups (i.e. taking the application offline during the HDFS snapshot operation), which is simply not feasible at scale.
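
Where an application does expose such hooks, the protection workflow can wrap them around the copy operation. The sketch below is purely illustrative: the quiesce and resume commands are hypothetical stand-ins for whatever mechanism your application provides (for example flushing and pausing writers).

#!/bin/bash
# Hypothetical application-consistent snapshot wrapper (illustrative only)
myapp quiesce    # hypothetical: flush in-flight writes and pause new ones
hdfs dfs -createSnapshot /myapp/data backup_$(date +%Y%m%d)
myapp resume     # hypothetical: resume normal operation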

HBase is one example of an application layered on top of Hadoop. This technology introduces the concept of tables to Hadoop. HBase includes its own snapshot implementation called HBase snapshots. This is necessary as only HBase understands the underlying structure, relationships and states of the data sets it layers on top of HDFS. However, even HBase snapshots are not perfect. The Apache HBase Operational Management guide includes the following warning:

There is no way to determine or predict whether a very concurrent insert or update will be included in a given snapshot, whether flushing is enabled or disabled. A snapshot is only a representation of a table during a window of time. The amount of time the snapshot operation will take to reach each Region Server may vary from a few seconds to a minute, depending on the resource load and speed of the hardware or network, among other factors. There is also no way to know whether a given insert or update is in memory or has been flushed.

While HBase understands how to snapshot a table, it cannot guarantee consistency. Whether this presents a problem for you will depend on your workflow. If atomicity and consistency are important then some synchronisation with application workflows will be necessary to provide consistent recovery positions and versioning of the application workload.
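
For completeness, here is a sketch of working with HBase snapshots. The table and snapshot names are placeholders, and the export step uses the ExportSnapshot MapReduce tool to copy the snapshot to another cluster.

# From the HBase shell: create and list a snapshot of table 'mytable'
#   hbase> snapshot 'mytable', 'mytable_snap1'
#   hbase> list_snapshots

# Export the snapshot (data and metadata) to a second cluster using MapReduce
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot mytable_snap1 -copy-to hdfs://backupcluster:8020/hbase -mappers 8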

What are we Protecting against anyway?

I think we have now exhausted the data protection options available to Hadoop environments. What I would like to end with is helping you answer the question: are Hadoop's inbuilt protection properties good enough?

When we think about data protection we usually think in terms of recovery point objective and recovery time objective. What we rarely consider are the events we are protecting against and whether the protection methods and controls we employ are effective against those events.

Whilst it is not my intention to raise fear and doubt, it is important to acknowledge there is no such thing as software that doesn't fail unexpectedly. HDFS has had a very good track record of avoiding data loss. However, despite this record, software bugs that can lead to data loss still exist (see HDFS-5042). The research community has also weighed in on the topic with a paper titled HARDFS: Hardening HDFS with Selective and Lightweight Versioning. It is worth reading if you want to understand what can go wrong.

To help you answer the question of whether Hadoop's inbuilt protection properties are good enough, we need to explore the events that can lead to data loss and then assess how Hadoop's inbuilt protection properties would fare in each case.

Below is a list of the events I come across in the Enterprise that can result in data loss.

  • Data corruption – usually a by-product of hardware and/or software failure. An example may be physical decay of the underlying storage media, sometimes referred to as bit rot.
  • Site failure – analogous to a disaster event that renders the site inoperable. An example may be a building fire, systematic software failure or malware event.
  • System failure – individual server, storage or network component failure. An example may be a motherboard or disk failure.
  • Operational failure – human error resulting from the day-to-day operations of the environment. An example may be an operator re-initialising the wrong system and erasing data.
  • Application software failure – failures of the application software layered on top of the infrastructure. An example may be a coding bug.
  • Infrastructure software failure – software failures relating to the infrastructure supporting the applications. An example may be a disk firmware or controller software bug.
  • Accidental user event – accidental user-driven acts that require data recovery. An example may be a user deleting the wrong file.
  • Malicious user event – intentional user-driven acts designed to harm the system. An example may be a user re-initialising a file system.
  • Malware event – automated systems designed to penetrate and harm the system. An example is the recent Synology worm that encrypted the data stored on Synology NAS devices and demanded payment in exchange for the encryption key.

Associated with each event is the probability of occurrence. Probability is influenced by many factors: environment, technology, people, processes, policies and experience. Each of these may help or hinder our ability to counter the risk of these events occurring. It would be unfair to try and generalise the probability of these events occurring in your environment (that is left up to the reader), so for the purpose of comparison I have assumed all events are equally probable and rated Hadoop's tolerance to these events using a simple low, medium and high scale. It is important to acknowledge these are my ratings based on my judgement and experience. By all means, if you have a different view, leave a comment.

Event, rating and rationale:

  • Data corruption (High) – Support for N-way replication, data scanning and verify on read provides significant scope to respond to data corruption events.
  • Site failure (Medium) – A Hadoop cluster by design is isolated to one site. Hadoop distributed copy (distcp) can be used to create offsite copies (see the sketch after this list), however there are challenges relating to application consistency and security that need careful consideration in order to maintain recovery positions.
  • System failure (Medium) – The Hadoop architecture is fault tolerant and designed to tolerate server and network failures. However, system reliability provided by N-way replication is relative to data node count. As node count increases, reliability decreases. To maintain a constant probability of data loss the replication factor must be adjusted as data node counts increase.
  • Operational failure (Medium) – Hadoop lacks role-based access controls for Hadoop functions. This makes it difficult to enforce system-wide data protection policies and standards. For example, a data owner can change the replication factor on files they have write access to. This can lead to compromises in data protection policies which cannot be prevented, but can be audited by enabling HDFS audit logging.
  • Application software failure (Low) – Data protection strategies that share the same software and controls with the application being protected are more vulnerable to software (and operational) failures that can propagate to the protection strategy. To address this situation it is common practice to apply defence-in-depth principles. This introduces a layered approach to data protection that mitigates the risk of a failure (e.g. software) compromising both the environment under protection and the protection strategies.
  • Infrastructure software failure (High) – One can argue the Hadoop architecture has less reliance on infrastructure software compared to Enterprise architectures that rely on centralised storage systems. The nodes can be made up of a diverse set of building blocks, which mitigates against common, widespread infrastructure software failures.
  • Accidental user event (Medium) – HDFS trash allows a user to recover from user-driven actions such as accidental file deletion. However, it can be bypassed and is only implemented in the HDFS client. Furthermore, if the event goes unnoticed and trash expires, data loss occurs. There are documented cases of this happening.
  • Malicious user event (Low) – If an authorised malicious user (i.e. the insider threat) wanted to delete their data sets they could do so using standard Hadoop privileges. HDFS trash can be bypassed or expunged, and HDFS snapshots of a user's directories can be removed by the user even if the snapshots were created by the super user. Furthermore, in a secure relationship between two HDFS clusters the distcp end points are implicitly trusted and vulnerable by association.
  • Malware event (Low) – A Hadoop client with super user privileges has access to the entire Hadoop cluster including all inbuilt protection methods.
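
For the site failure scenario referenced above, the distcp sketch below copies a directory tree to a second cluster. The cluster addresses and paths are placeholders, and as noted in the list this copy is at best crash-consistent unless it is coordinated with the application.

# hadoop distcp -update hdfs://prodcluster:8020/data/warehouse hdfs://drcluster:8020/backup/warehouse

The -update flag makes subsequent runs incremental by skipping files that already match at the target.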

 

Hadoop is very reliable when it comes to protecting against data corruption and physical component failures. Where Hadoop's inbuilt protection strategies fall short is in protecting against application software failures and human failures (intentional or otherwise).

To address these events we need to consider taking a copy of the data and placing it under the control of a different (and diverse) system. Separation of duties and system diversification are important properties of a data protection strategy that prevent events from propagating to secondary data sources (i.e. protection copies).

The idea is analogous to diversifying investments across fund managers and asset classes. When funds are split between different fund managers and asset classes, the likelihood of one fund manager or asset class affecting the other is reduced. In other words, we hedge our risk.

The same principle applies to data protection. When there is separation of duties and diversification between primary and protection storage, the likelihood of cross-contamination from either human events (accidental or otherwise) or software failures is minimised.

Taking copies of Hadoop data sets may sound daunting to begin with; however, we must remember to apply the hot, warm and cold data concepts to reduce the frequency (how often) and volume (how much) of data that needs to be processed to support our data protection requirements.

To achieve that we need to understand our data. We need to understand the value of the data and what the business impact would be if the data was not available or worse lost. Secondly, based on the business impact analysis (BIA) we need to determine how often and how much data needs to be protected, and/or whether there are other ways to reconstitute the data from alternative sources. Once we understand these classifications we can devise policies aligned with methods to protect the data.

If the data has low business value then it may be appropriate to utilise the method with the lowest cost. For example, HDFS trash with two replicas. If the data has high availability and business value then it would be appropriate to utilise multiple methods that together provide the fastest time to recovery and broadest event coverage (i.e. lowest risk).
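
As a sketch of the low cost end of the spectrum, the following (the path is a placeholder) drops a low value data set to two replicas and leans on HDFS trash for accidental deletions. The trash retention period is governed by the fs.trash.interval property (in minutes) in core-site.xml.

# hdfs dfs -setrep -w 2 /data/scratch

The -w flag simply waits for the replication change to complete before returning.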

Methods providing fast time to recovery and broad event coverage rely on opposing technical strategies.

Fast time to recovery methods rely on technology that is tightly coupled with the primary data source and structure e.g. a snapshot that is not diversified from the primary data source or structure. Methods that yield the broadest event coverage and lowest risk profile rely on technologies that are loosely coupled (or decoupled) from the primary data source and structure e.g. a full independent copy on a diverse storage device and file system that is independently verified.

To achieve both fast time to recovery and the broadest event coverage requires the adoption of multiple complementary protection methods with opposing technical and risk profiles.

Summary

It is certainly clear in my mind that there is no universal approach to providing data protection for large scale Hadoop environments. The options available to you will ultimately depend on:

  • The Hadoop architecture you choose to adopt
  • The Hadoop distribution and the options it offers relative to your requirements
  • The value of data stored in Hadoop and your tolerance for risk
  • The applications layered on top of Hadoop
  • The application workflows and your ability to integrate with them for the purpose of protection and versioning

When I first thought about writing this article I never expected it would reach over 11,000 words. Looking back I think if you have reached this point you would agree data protection for Hadoop is a broad topic that warrants special attention.

I hope you found this useful and I look forward to the continued adoption of Hadoop in the Enterprise. This will drive data protection vendors to embrace the technology and mature our approach to protecting Hadoop environments.

Data Domain Boost over WAN is here!

Last week we made Avamar 7.1 Generally Available to the public. For more information see here. One of the features we introduced was support for Data Domain Boost over WAN. Naturally, I wanted to try it out by backing up my new MacBook (that I am still learning) from my EMC work office to my home lab over an SSL VPN tunnel.

The first thing I did was upgrade the lab to Avamar 7.1 and seed the first backup to the Data Domain over the LAN. The total size of the seed backup was 195GB. I stumbled a little getting the first seed backup as I didn't realise that when the MacBook sleeps, the backup shuts down. Once I figured out how to use my MacBook 🙂 I had my seed.

I kicked off the backup from work and after a few cycles it averaged 53 minutes. What’s really cool is it only sent 123MB as illustrated by the Activity Monitor.

In throughput terms that's 216 GB/hour. Not too bad for a full recovery point.

Mac_Avtar

Naturally, my colleagues had a bunch of questions for me.

What is the latency to your home lab?

At first glance I thought it was reasonable. However, when I started looking into it I realised it was quite poor, with a moderate amount of packet loss. At 64 bytes it was averaging 163ms with 2% packet loss. That's not healthy. Why so bad? It turns out that even though the distance between the EMC office and my home is only ~25 km, the IP traffic is routed via the US (Massachusetts specifically) and back. This is far from optimal, but it shows even under poor conditions DD Boost over WAN is rock solid.

So this had me thinking. How do I perform a WAN test and keep the traffic in country? How about over my mobile's 4G network? I did the same backup again, this time tethered via USB cable to my mobile phone's network.

Network quality was better at 109ms with no packet loss. This time around the backup took 36 minutes (325 GB/hour).

The next question was how does DD Boost over WAN compare to the Avamar client-side protocol, which writes to Avamar Data Store nodes?

After reconfiguring the Avamar client and seeding the first backup to Avamar storage I ran the subsequent backup over the 4G network to the Avamar storage. This time around the backup took 17 minutes (688 GB/hour) halving the time relative to DD Boost over WAN.

I expected this to be the case. One of the differences between the Avamar algorithm and DD Boost over WAN is the degree of client-side caching.

The Avamar algorithm maintains two levels of caching on the client. The file metadata hash cache and the file data hash cache. The file metadata hash cache is used to avoid exchanges with the Avamar storage when the file it is backing up has not changed. If the file has changed the file blocks get hashed and these hashes are compared to the local data hash cache. If the local data hash cache returns a hit we avoid an exchange with the Avamar storage. If we get a data hash cache miss we must ask the Avamar storage if the hash is present at the other end.

In the case of DD Boost over WAN with Avamar software, we only have one level of client-side caching at our disposal: the file metadata hash cache. In the event a file's contents have changed between backups, DD Boost over WAN relies on more exchanges with the Data Domain appliance to determine if a file data hash is present at the other end. As such, in the case where the round trip time is elevated, we should expect backups to take longer in the DD Boost over WAN use case compared to the Avamar algorithm.

For comparison I also ran DD Boost over LAN backups against the Avamar algorithm. As expected, they both took the same time (11 minutes), as the benefit of multiple levels of caching only materialises when the round trip time is elevated. The time it took is essentially how long it takes to traverse the file system and hash the file metadata on my MacBook. Had the MacBook not been a bottleneck I would expect DD Boost to be more efficient on the client (because the client performs less work for the same outcome) in LAN use cases.

It’s always wise to know the differences and plan accordingly. Of course, your conditions and results will vary. Hope this helps. Peter..

EMC Data Protection User Group coming to a city near you

dpug

Come listen to my colleagues and me at the inaugural EMC Data Protection User Group series. This series kicks off at the end of May and will run throughout the rest of the year in 68 cities around the world. The agenda is structured in such a way that we can address both global and local data protection topics. We gather at a local location to share EMC product information, to enable you to exchange experiences and best practices, and to allow users to network with EMC experts and fellow peers.

Want to learn more? Click here to register for a local user group in your city (I will be in Melbourne).  You can also join the EMC Community Network by clicking here.  Finally, you can follow on Twitter @EMCProtectData and Facebook.

PS: We will be covering the present and future of Data Protection. To that end, please keep in mind you will be required to complete a Non Disclosure Agreement.

Archiving Avamar Backups to AWS Glacier – Part 1

Last night I set myself an engineering challenge. Is it possible to archive Avamar backups to AWS Glacier?

Now, before you jump off your chair in excitement, first some disclaimers. What I am about to demonstrate is not supported by me or EMC. Please don’t call EMC if you end up:

  • losing your backups
  • receiving a large bill from AWS

This is being shared in the spirit of experimentation. OK so first let me describe what is required.

We need a host to act as our Avamar to Glacier gateway. This is analogous to a Cloud Gateway. For this I am going to spin up a Linux virtual machine running CentOS.

Next we need a way to extract backups from Avamar into a flat file. It turns out the Avamar Extended Retention feature introduced a method to export and import Avamar backups to/from PAX streams. In this case we are going to turn the backup stream into a flat file and use that as the basis for archiving to Glacier.

We also need a way of shipping these flat file backup archives to Glacier. There are a few options available. I am going to use mtglacier. This tool allows a local file system to be kept in sync with one or more Glacier vaults. In our case we do not want to keep the files around, so once they have been uploaded they will be deleted.

The architecture looks like this:

AvamarGlacierArch

We have our Avamar server (in this case Avamar Virtual Edition running 7.0SP1) with an Avamar Data Store (file system). We have our Cloud Gateway Server with Avamar client and mtglacier installed and we have Glacier supporting the vault.

I am not going to discuss the process of installing Avamar client or mtglacier. These are already well documented.
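
That said, every mtglacier command that follows references a glacier.cfg file. As a rough guide, a minimal config looks something like the lines below; the credentials are obviously placeholders and the region shown is an assumption, so check the mt-aws-glacier documentation for the full option list.

key=<AWS access key id>
secret=<AWS secret access key>
region=us-east-1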

Exporting and Uploading Avamar Archived Backups to Glacier

To archive an Avamar backup into Glacier we need to go through 3 steps:

  1. Identify the Avamar backup that needs to be exported
  2. Export the Avamar backup to a flat file on the Cloud gateway
  3. Upload the flat file into a Glacier vault

The workflow looks like this:

Upload

For this experiment I created a small backup that is 176588 KB in size from a Windows client with the name winco.mlab.local.  This client is registered under the Avamar domain /mlab. To export this backup I need to determine its unique identifier (sequence number).

Below is the list of backups available for this client. Run this command from the Cloud gateway using the Avamar administrator (MCUser) account. If you want to lock down access, substitute this user with an Avamar domain account.

# avtar --id=MCUser --password=<password> --path=/mlab/winco.mlab.local --quiet --backups
 Date Time Seq Label Size Plugin Working directory Targets
 ---------- -------- ----- ----------------- ---------- -------- --------------------- -------------------
 2014-03-27 23:06:11 62 MOD-1395921964905 176588K Windows C:\Program Files\avs\var C:/Users/administrator/Downloads/jdk-7u51-windows-x64.exe,C:/Use
 2014-03-27 22:38:47 61 MOD-1395920320594 972293K Windows C:\Program Files\avs\var C:/Users/administrator/Downloads/AvamarVMwareCombined-linux-x86-
 2014-03-27 20:48:37 59 MOD-1395900367200 2473651K Windows C:\Program Files\avs\var C:\Users
 2014-03-27 17:57:48 40 Default Schedule-Windows-1395802800168 47635164K Windows C:\Program Files\avs\var
 2014-03-26 14:06:54 33 Default Schedule-Windows-1395802800168 47635164K Windows C:\Program Files\avs\var
 2014-03-25 14:12:30 32 Default Schedule-Windows-1395716400363 47575481K Windows C:\Program Files\avs\var
 2014-03-24 14:04:05 31 Default Schedule-Windows-1395630000186 47561819K Windows C:\Program Files\avs\var
 2014-03-23 14:04:04 30 Default Schedule-Windows-1395543600164 47560966K Windows C:\Program Files\avs\var
 2014-03-22 14:04:01 29 Default Schedule-Windows-1395457200109 47560054K Windows C:\Program Files\avs\var
 2014-03-21 14:04:21 28 Default Schedule-Windows-1395370800153 47560220K Windows C:\Program Files\avs\var
 2014-03-20 14:04:04 27 Default Schedule-Windows-1395284400105 47559427K Windows C:\Program Files\avs\var
 2014-03-19 14:04:32 26 Default Schedule-Windows-1395198000192 47558551K Windows C:\Program Files\avs\var
 2014-03-18 14:04:28 25 Default Schedule-Windows-1395111600179 47546065K Windows C:\Program Files\avs\var
 2014-03-17 14:04:08 24 Default Schedule-Windows-1395025200242 47546535K Windows C:\Program Files\avs\var
 2014-03-16 14:04:09 23 Default Schedule-Windows-1394938800339 47527684K Windows C:\Program Files\avs\var
 2014-03-15 14:04:45 22 Default Schedule-Windows-1394852400289 47598941K Windows C:\Program Files\avs\var
 2014-03-14 14:05:04 21 Default Schedule-Windows-1394766000197 47533802K Windows C:\Program Files\avs\var
 2014-02-23 19:39:28 1 MOD-1393144755033 557K Windows C:\Program Files\avs\var C:\WinDump.exe

The backup we are interested in is highlighted above with sequence #62 and backup label MOD-1395921964905.

To archive we need to export a copy of the backup to the Cloud gateway. Before we do that create an archive directory tree structure on the Cloud gateway that mirrors the Avamar domain structure. This will provide a mapping between archived backups in Glacier and the Avamar domain and client it originated from. If we were archiving backups from multiple Avamar servers then we may choose to prefix the structure with the Avamar server name. This would avoid conflicts.

For each backup archive create the following directory structure:

/archives/<Avamar domain path>/<Avamar client>/<Backup Sequence #>

In this example we create it as follows:

# mkdir -p /archives/mlab/winco.mlab.local/62

Now we can begin the export process. To do this we instruct Avamar's avtar command to extract a copy of the backup using the PAX archive format. PAX is short for Portable Archive Exchange and has similarities to tar and cpio. This is written to a file named data.avpax under the directory structure we previously created.

# avtar --id=MCUser --password=<password> --path=/mlab/winco.mlab.local -x --dto-exportstream --streamformat=avpax --labelnumber=62 --stream=/archives/mlab/winco.mlab.local/62/data.avpax

From here we can see the exported backup has been created and is now represented by a file on the Cloud gateway.

# ls -la /archives/mlab/winco.mlab.local/62/data.avpax
-rw-r--r-- 1 root root 181325312 Mar 27 23:09 /archives/mlab/winco.mlab.local/62/data.avpax

If we look at the exported file it contains some header and XML content followed by the backup data itself. The XML content is used to describe the backup if we ever wanted to bring it back into Avamar.

# head -20 /archives/mlab/winco.mlab.local/62/data.avpax
$global$paxrecs0000644000000000000000000000013412315012405010743 gustar0029 AVAMAR.sort_directories=0
27 AVAMAR.globalflags=3841
9 size=0
27 AVAMAR.enable_extents=0
backupexport_metainfo$paxrecs0000700000000000000000000000017312315012405014045 xustar0043 AVAMAR.objectname=backupexport_metainfo
41 AVAMAR.metadata=backupexport_metainfo
14 size=73033
25 AVAMAR.headflags=2241
backupexport_metainfo0000700000000000000000000021651112315012405012310 0ustar00<exportstream_metainfo>
 <archive_info>
 <flag order="1" type="textbox" value="7.0.101-56" desc="version of client" name="appversion" id="appversion" />
 <flag order="2" type="checkbox" value="true" desc="does the celerra/vnx support i18n" name="celerrai18n" id="celerrai18n" />
 <flag order="3" type="textbox" value="avtar --sysdir=&quot;C:\Program Files\avs\etc&quot; --bindir=&quot;C:\Program Files\avs\bin&quot; --vardir=&quot;C:\Program Files\avs\var&quot; --ctlcallport=49157 --ctlinterface=3001-MOD-1395921964905 --logfile=&quot;C:\Program Files\avs\var\clientlogs\MOD-1395921964905-3001-Windows.log&quot; --encrypt=tls --encrypt-strength=high --expires=1401109565 --retention-type=none --server=ave7-01.mlab.local --hfsport=27000 --id=backuponly --password=**************** --account=/mlab/winco.mlab.local --backup_mounted_vhds=true --backupsystem=false --checkcache=false --ddr=false --ddr-index=0 --debug=false --detect-acl-changes=false --filecachemax=-8 --force=false --freezecachesize=-50 --freezemethod=best --freezetimeout=300 --freezewait=4 --hashcachemax=-16 --informationals=2 --one-file-system=false --protect-profile=disabled --repaircache=false --run-after-freeze-exit=true --run-at-end-exit=true --run-at-start-exit=true --statistics=false --verbose=0 --windows-optimized-backup=false" desc="command line" name="command_line" id="command_line" />
 <flag order="4" type="textbox" value="0000h:00m:04s" desc="length of backup" name="elaptime" id="elaptime" />
 <flag order="5" type="integer" desc="number of errors in session" name="errors" id="errors" />
 <flag order="6" type="stringlist" value="C:/Users/administrator/Downloads/jdk-7u51-windows-x64.exe,C:/Users/administrator/Downloads/VMware-ClientIntegrationPlugin-5.5.0.exe" desc="list of top-level dirs/files in backup" name="files" id="files" />
 <flag order="7" type="textbox" desc="Specify a file that contains a list of options." name="flagfile" id="flagfile" />
 <flag order="8" type="checkbox" desc="Print this message." name="help" id="help" />
 <flag order="9" type="checkbox" desc="Print help including extended flags." name="helpx" id="helpx" />
 <flag order="10" type="checkbox" desc="Print this message in XML." name="helpxml" id="helpxml" />

Before we archive this backup we also want to extract file lists and backup job metadata in order to service additional use cases such as a search archive service.

For example, a full text search engine could be introduced to index these files to support the process of identifying long term archives for retrieval. There are many free search engines available. One to consider is Elasticsearch. I may get to this in a subsequent blog post.

To extract the backup job information  use the following command:

# avtar --id=MCUser --password=<password> --path=/mlab/winco.mlab.local -x --labelnumber=62 --quiet --internal --target=/archives/mlab/winco.mlab.local/62 .system_info

This information is internal to Avamar, hence the --internal flag.

As this is a Windows file system backup, we also want to extract the file list into an ASCII file.

# avtar --id=MCUser --password=<password> --path=/mlab/winco.mlab.local -t -v --labelnumber=62 --quiet > /archives/mlab/winco.mlab.local/62/file.lst

Now we have our metadata captured as follows:

# ls -la /archives/mlab/winco.mlab.local/62
total 177212
drwxr-xr-x 2 root root 4096 Mar 27 23:15 .
drwxr-xr-x 3 root root 4096 Mar 27 23:07 ..
-rw-r--r-- 1 root root 1818 Mar 27 22:38 archive_info
-rw-r--r-- 1 root root 4760 Mar 27 22:38 archive_info.xml
-rw-r--r-- 1 root root 181325312 Mar 27 23:09 data.avpax
-rw-r--r-- 1 root root 302 Mar 27 22:38 encodings.xml
-rw-r--r-- 1 root root 99 Mar 27 22:38 errors
-rw-r--r-- 1 root root 638 Mar 27 23:15 file.lst
-rw-r--r-- 1 root root 98 Mar 27 22:38 filestats
-rw-r--r-- 1 root root 109 Mar 27 22:38 groups
-rw-r--r-- 1 root root 274 Mar 27 22:38 locale.xml
-rw-r--r-- 1 root root 1492 Mar 27 22:38 machine.xml
-rw-r--r-- 1 root root 262144 Mar 27 22:38 mbr-d0.bin
-rw-r--r-- 1 root root 0 Mar 27 22:38 mounts
-rw-r--r-- 1 root root 1061 Mar 27 22:38 partitiontables.xml
-rw-r--r-- 1 root root 17520 Mar 27 22:38 sessionlog
-rw-r--r-- 1 root root 2787 Mar 27 22:38 statsfile
-rw-r--r-- 1 root root 461 Mar 27 22:38 userinfo.xml
-rw-r--r-- 1 root root 88 Mar 27 22:38 users
-rw-r--r-- 1 root root 8192 Mar 27 22:38 vbr-d0-p0.bin
-rw-r--r-- 1 root root 8192 Mar 27 22:38 vbr-d0-p1.bin
-rw-r--r-- 1 root root 8192 Mar 27 22:38 vbr-d0-p2.bin
-rw-r--r-- 1 root root 8192 Mar 27 22:38 vbr-d0-p3.bin
-rw-r--r-- 1 root root 1751 Mar 27 22:38 volumes.xml
-rw-r--r-- 1 root root 5420 Mar 27 22:38 workorder

If we like we can also generate and store a hash of the exported backup so that we can confirm its integrity if we recall it from Glacier. In this case we will use md5sum.

# md5sum /archives/mlab/winco.mlab.local/62/data.avpax | tee /archives/mlab/winco.mlab.local/62/data.avpax.md5sum
2b861454d43b317506cbe1ca3ad0751d /archives/mlab/winco.mlab.local/62/data.avpax

We must also create a vault in Glacier, which is a one-time operation. We can have multiple vaults (I believe up to 1,000 per account). In this experiment we will create one vault called avarchive.

# mtglacier create-vault avarchive --config glacier.cfg
MT-AWS-Glacier, Copyright 2012-2014 Victor Efimov http://mt-aws.com/ Version 1.114

PID 35082 Started worker
PID 35082 Created vault avarchive
OK DONE

Now we are ready to upload the archive to Glacier. Let's instruct mtglacier to perform a dry run of the sync process to confirm what it will upload to Glacier. In this case we want to filter the criteria to files called data.avpax.

# mtglacier sync --config glacier.cfg --vault avarchive --dir /archives --filter '+data.avpax -' --dry-run --journal /archives/avarchive.log
MT-AWS-Glacier, Copyright 2012-2014 Victor Efimov http://mt-aws.com/ Version 1.114

Will UPLOAD /archives/mlab/winco.mlab.local/62/data.avpax
OK DONE

As expected mtglacier has identified it needs to upload data.avpax for the archive backup with sequence #62. Now for the real thing.

# mtglacier sync --config glacier.cfg --vault avarchive --dir /archives --filter '+data.avpax -' --journal /archives/avarchive.log
MT-AWS-Glacier, Copyright 2012-2014 Victor Efimov http://mt-aws.com/ Version 1.114

PID 37255 Started worker
PID 37256 Started worker
PID 37257 Started worker
PID 37258 Started worker
PID 37255 Created an upload_id JdlfolL1PaPpsAcj1a_7w1UUO92PGwsFmqo51jfRPLUexyRwMvHpqpUfBi-QV6Jzb-VoUytGRF4YKx2Z5atj7FRhsVrF
PID 37256 Uploaded part for mlab/winco.mlab.local/62/data.avpax at offset [0]
PID 37258 Uploaded part for mlab/winco.mlab.local/62/data.avpax at offset [33554432]
PID 37257 Uploaded part for mlab/winco.mlab.local/62/data.avpax at offset [16777216]
PID 37255 Uploaded part for mlab/winco.mlab.local/62/data.avpax at offset [50331648]
PID 37256 Uploaded part for mlab/winco.mlab.local/62/data.avpax at offset [67108864]
PID 37258 Uploaded part for mlab/winco.mlab.local/62/data.avpax at offset [83886080]
PID 37255 Uploaded part for mlab/winco.mlab.local/62/data.avpax at offset [117440512]
PID 37257 Uploaded part for mlab/winco.mlab.local/62/data.avpax at offset [100663296]
PID 37256 Uploaded part for mlab/winco.mlab.local/62/data.avpax at offset [134217728]
PID 37255 Uploaded part for mlab/winco.mlab.local/62/data.avpax at offset [167772160]
PID 37258 Uploaded part for mlab/winco.mlab.local/62/data.avpax at offset [150994944]
PID 37257 Finished mlab/winco.mlab.local/62/data.avpax hash [5c6210976a6885cdf2da2af8b8ffd590eeb9b57eaa6d3a0d0f5acfcc983831bf] archive_id [XGZh128TUmUmYTC1M_KTZayQDfwtycldFCLNftu0TWjx3Kzeu_JbMftMtZLguDBT_iI6LpOd7kzH9Mts62TG74uS0NgRXy07bRhpWGgUnHO4VmMl30GriVaZwfvtf0XNuZQSV7Gfrg]
OK DONE

What we see here is mtglacier performed a multipart upload using 4 workers and 16MB chunks. This is necessary to drive parallelism and saturate bandwidth. About 9 minutes later the 181MB file upload completed. Here is what it looked like from the internet gateway.

aws_bandwidth_graph

Before we delete the archived backup from Avamar we should take a backup of the metadata we created on the Cloud gateway. This is necessary to ensure we can always maintain the relationship between Avamar archived backups and Glacier archives.

# avtar --id=MCUser --password=<password> --path=/mlab/fileserver --label=avarchive --exclude=data.avpax -c /archives

We can safely delete the exported backups and the backups they represent in Avamar.

First delete the backup in Avamar.

# avtar --id=MCUser --password=<password> --path=/mlab/winco.mlab.local --label=MOD-1395921964905 --labelnumber=62 --delete --force

Then delete the exported backup in the archive folder. This can be done automatically by searching the mtglacier journal log and deleting files that have been inserted with a CREATED record (CREATED means uploaded to Glacier). In this case we want to limit the search to files called data.avpax

# grep CREATED avarchive.log | awk '{ print $8 }' | grep data.avpax | xargs -i rm -f {}

At this point we have successfully demonstrated how to archive an Avamar backup to Glacier.

Retrieving and Importing Avamar Archived Backups from Glacier

To import an archived backup into Avamar we need to go through 4 steps:

  1. Identify the Glacier archive that needs to be retrieved
  2. Request Glacier to retrieve the archive
  3. Download the archive when it is ready to be retrieved
  4. Import the archived backup into Avamar

The workflow looks like this:

Retrieve

[UPDATE: 20140801 – mtglacier now supports restoring individual files using the new --include and --exclude options. Using grep to extract the file from the journal and creating a new journal log is no longer required]

Unfortunately mtglacier does not support restoring individual files. Rather, it restores any files referenced in the journal log that do not exist in the local archive file system.

To work around this limitation we extract the record entry we want to restore from the journal log and create a new one. We then use the new log to initiate the restore. In this case we want to retrieve the backup that was recently archived for winco.mlab.local with sequence #62.

# grep mlab/winco.mlab.local/62/data.avpax avarchive.log | tee retrieve.log
B 1395923307 CREATED XGZh128TUmUmYTC1M_KTZayQDfwtycldFCLNftu0TWjx3Kzeu_JbMftMtZLguDBT_iI6LpOd7kzH9Mts62TG74uS0NgRXy07bRhpWGgUnHO4VmMl30GriVaZwfvtf0XNuZQSV7Gfrg 181325312 1395922186 5c6210976a6885cdf2da2af8b8ffd590eeb9b57eaa6d3a0d0f5acfcc983831bf mlab/winco.mlab.local/62/data.avpax

Now initiate the restore request from Glacier using the retrieve.log journal we just created.

# mtglacier restore --config glacier.cfg --vault avarchive --dir /archives --max-number-of-files=1 --journal retrieve.log
MT-AWS-Glacier, Copyright 2012-2014 Victor Efimov http://mt-aws.com/ Version 1.114

PID 37884 Started worker
PID 37885 Started worker
PID 37886 Started worker
PID 37887 Started worker
PID 37884 Retrieved Archive XGZh128TUmUmYTC1M_KTZayQDfwtycldFCLNftu0TWjx3Kzeu_JbMftMtZLguDBT_iI6LpOd7kzH9Mts62TG74uS0NgRXy07bRhpWGgUnHO4VmMl30GriVaZwfvtf0XNuZQSV7Gfrg
OK DONE

After the retrieve request is issued Glacier takes several hours before the archive becomes available for download. If you try and restore an archive that is not available this is what mtglacier returns:

# mtglacier restore-completed --config glacier.cfg --vault avarchive --dir /archives --journal retrieve.log
MT-AWS-Glacier, Copyright 2012-2014 Victor Efimov http://mt-aws.com/ Version 1.114

PID 37983 Started worker
PID 37984 Started worker
PID 37985 Started worker
PID 37986 Started worker
PID 37984 Retrieved Job List
OK DONE

I waited a few hours and it still wasn’t available. I tried the next morning and the restore completed.

# mtglacier restore-completed --config glacier.cfg --vault avarchive --dir /archives --journal retrieve.log
MT-AWS-Glacier, Copyright 2012-2014 Victor Efimov http://mt-aws.com/ Version 1.114

PID 48289 Started worker
PID 48290 Started worker
PID 48291 Started worker
PID 48292 Started worker
PID 48289 Retrieved Job List
PID 48292 Downloaded archive /archives/mlab/winco.mlab.local/62/data.avpax
OK DONE

Here is what the internet gateway reported.

aws_bandwidth_restore

This graph looks significantly higher and narrower than the previous one. Can anyone guess why?

My broadband connection is asymmetric. That is, the download line rate is significantly higher than the upload. This is very common for home broadband connections. However, for this use case, it is not ideal. We would generate significantly more upload traffic than download. To make this feasible requires a symmetric link.

In this experiment the download was quick and ran @ 3 MB/s. My link is capable of 10 MB/s.

Let's check the download against the MD5 hash we created earlier.

# md5sum -c /archives/mlab/winco.mlab.local/62/data.avpax.md5sum
/archives/mlab/winco.mlab.local/62/data.avpax: OK

Now that we have downloaded the archive we need to import it back into Avamar. In this case we want to import it back into the original Avamar client. We specify the domain path as defined when it was exported.

The command is as follows:

# avtar --id=MCUser --password=<password> --path=/mlab/winco.mlab.local --dto-exportstream --streamformat=avpax --stream=/archives/mlab/winco.mlab.local/62/data.avpax -c

Now let's list the backups to make sure it was imported correctly.

# avtar --id=MCUser --password=<password> --path=/mlab/winco.mlab.local --quiet --backups
 Date Time Seq Label Size Plugin Working directory Targets
 ---------- -------- ----- ----------------- ---------- -------- --------------------- -------------------
 2014-03-28 00:31:55 63 MOD-1395921964905 176588K Windows C:\Program Files\avs\var C:/Users/administrator/Downloads/jdk-7u51-windows-x64.exe,C:/Use
 2014-03-27 22:38:47 61 MOD-1395920320594 972293K Windows C:\Program Files\avs\var C:/Users/administrator/Downloads/AvamarVMwareCombined-linux-x86-
 2014-03-27 20:48:37 59 MOD-1395900367200 2473651K Windows C:\Program Files\avs\var C:\Users
 2014-03-27 17:57:48 40 Default Schedule-Windows-1395802800168 47635164K Windows C:\Program Files\avs\var
 2014-03-26 14:06:54 33 Default Schedule-Windows-1395802800168 47635164K Windows C:\Program Files\avs\var
 2014-03-25 14:12:30 32 Default Schedule-Windows-1395716400363 47575481K Windows C:\Program Files\avs\var
 2014-03-24 14:04:05 31 Default Schedule-Windows-1395630000186 47561819K Windows C:\Program Files\avs\var
 2014-03-23 14:04:04 30 Default Schedule-Windows-1395543600164 47560966K Windows C:\Program Files\avs\var
 2014-03-22 14:04:01 29 Default Schedule-Windows-1395457200109 47560054K Windows C:\Program Files\avs\var
 2014-03-21 14:04:21 28 Default Schedule-Windows-1395370800153 47560220K Windows C:\Program Files\avs\var
 2014-03-20 14:04:04 27 Default Schedule-Windows-1395284400105 47559427K Windows C:\Program Files\avs\var
 2014-03-19 14:04:32 26 Default Schedule-Windows-1395198000192 47558551K Windows C:\Program Files\avs\var
 2014-03-18 14:04:28 25 Default Schedule-Windows-1395111600179 47546065K Windows C:\Program Files\avs\var
 2014-03-17 14:04:08 24 Default Schedule-Windows-1395025200242 47546535K Windows C:\Program Files\avs\var
 2014-03-16 14:04:09 23 Default Schedule-Windows-1394938800339 47527684K Windows C:\Program Files\avs\var
 2014-03-15 14:04:45 22 Default Schedule-Windows-1394852400289 47598941K Windows C:\Program Files\avs\var
 2014-03-14 14:05:04 21 Default Schedule-Windows-1394766000197 47533802K Windows C:\Program Files\avs\var
 2014-02-23 19:39:28 1 MOD-1393144755033 557K Windows C:\Program Files\avs\var C:\WinDump.exe

We can see a new backup with sequence #63 was imported. Avamar did not use the original sequence number; it increments these when backups are created. The imported backup shares the same label as the original backup we exported, which is OK.

We can now browse this backup using Avamar Administrator and restore it.

browse

We should point out the imported backup has no expiration date. If we want to set this we would use the --expires argument to avtar during the import.

What about compression and encryption?

If you would prefer to compress and subsequently encrypt the backups before they are sent to Glacier, we can replace --stream with --to-stdout in the case of exports and --from-stdin in the case of imports.

The process would look something like this for exports:

# avtar ... --to-stdout | compression_command_pipe | encryption_command_pipe | dd of=archived_backup_file

And this for imports:

# dd if=archived_backup_file | decryption_command_pipe | decompression_command_pipe | avtar ... --from-stdin

You could use gzip or bzip2 for compression and ccrypt for encryption. Make sure to compress before encrypting. For the encryption key we could use a combination of the Avamar backup label and sequence number.
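
As a concrete sketch using the backup exported earlier, the pipelines might look like the following. The gzip and ccrypt invocations are my own assumptions (ccrypt's -k option reads the key from a file), and the key file naming simply follows the label and sequence number suggestion above.

# avtar --id=MCUser --password=<password> --path=/mlab/winco.mlab.local -x --dto-exportstream --streamformat=avpax --labelnumber=62 --to-stdout | gzip -c | ccrypt -e -k /root/keys/MOD-1395921964905-62.key | dd of=/archives/mlab/winco.mlab.local/62/data.avpax.gz.cpt

# dd if=/archives/mlab/winco.mlab.local/62/data.avpax.gz.cpt | ccrypt -d -k /root/keys/MOD-1395921964905-62.key | gzip -dc | avtar --id=MCUser --password=<password> --path=/mlab/winco.mlab.local --dto-exportstream --streamformat=avpax --from-stdin -c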

What about alternative archive targets?

Although Glacier was used as the target for this experiment, the options are endless. The same approach can be used to archive backups to many popular cloud and object stores including S3, Swift, Atmos, EVault, Azure, Google and Ceph, either through similar tools like mtglacier or alternatives such as FUSE modules.

Alternatively if you want to keep your archives on premise then traditional block or file  storage systems could be consumed by the Cloud gateway. Ideally, these would implement erasure coding schemes to keep costs down.

So it can be done… But is it practical?

We have proven it is possible to archive Avamar backups, but does that mean it is practical?

Let's put things into perspective. If we wanted to archive monthly backups for long term retention to Glacier, what would we need?

Avamar comes in many flavours; virtual, physical, with and without Data Domain. The sizes range from 500GB (before dedup) to 124TB (Avamar 16 node grid). With Data Domain we can store 570TB in the active tier and have several attached to one Avamar server.

Now, let's assume we stopped storing backups older than 1 month in Avamar and instead used Glacier. To work out the size of our monthly backup we need to understand the ratio of front-end protected storage to backend consumed for a 30 day retention profile.

There are many factors that impact this ratio (data type, change rate, growth rate, etc) however for the purpose of this experiment we will use 1:1.

For a 500GB Avamar instance we would need to archive 500GB a month to Glacier. We have 30 days to complete the archive before the next cycle starts. Realistically we don't want to consume the entire 30 days; we need to give ourselves some tolerance. Therefore, let's say we want to complete each monthly cycle using no more than 50% of the window before the next cycle starts. How much upload bandwidth would we need for 500GB?

We would need roughly a 3.2 Mbps upload link: 500GB is about 4,000 gigabits, and moving that in 15 days (half of the 30 day window) works out to just over 3 Mbps sustained. What about larger volumes of data?

Below is a chart of volumes relative to the time occupied between cycles. In the 100% case the archiving process is running 24×7.

ubchart

What we can infer from this chart is that the upload bandwidth requirements are very high.
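
For reference, the arithmetic behind the chart is simple. Using decimal units and a 30 day month, the required upload rate is (GB per month x 8,000) divided by (seconds in the month x utilisation). The quick awk sketches below reproduce (near enough) the 3.2 Mbps figure for 500GB at 50% utilisation, and the figure behind the 2TB at 75% example discussed further down.

# awk 'BEGIN { gb=500; util=0.50; printf "%.1f Mbps\n", (gb*8000)/(30*86400*util) }'
3.1 Mbps
# awk 'BEGIN { gb=2000; util=0.75; printf "%.1f Mbps\n", (gb*8000)/(30*86400*util) }'
8.2 Mbps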

For example, my broadband can only accommodate 2 Mbps. Even then home broadband plans are not appropriate as most of them have upload GB caps and throttle bandwidth to impractical levels when the cap is reached. My cap is 200 GB for upload and download combined and costs $80 AUD/month.

What we need is a symmetric link which is often reserved for businesses.

For example, a leading telco offers 10 Mbps business plans. That would support a 2TB monthly archive use case to Glacier at 75% busy. However, this type of connectivity is very costly at $7931/month. Compare this to Glacier’s cost of $0.01/GB/month or in this case $20/month (first month) scaling to $240/month (12 months) to store monthly archived backups for 1 year.

In this example, the cost of networking is 33x more than storage. This makes any cloud storage look expensive even at $0.01/GB/month.

The blended cost is roughly $0.34/GB/month after year 1: in month 12 that is about $7,931 for the link plus $240 for Glacier storage, divided by the ~24,000GB stored by that point (excluding AWS get/put request and restore costs).

Summary

Glacier is a very cost effective cold storage service. However, the cost of networking in this country makes it impractical to consume Glacier over the Internet for long term backup archives. To address this issue AWS offers alternative connectivity options that bypass traditional Internet connections and provide direct connectivity.

The product is called AWS Direct Connect and is designed to be more cost effective for large scale requirements. However, in addition to AWS Direct Connect usage costs there are line costs associated with AWS Direct Connect network partners. These prices are not in the public domain (that I could find), which makes them difficult to evaluate.

In part 2 we will explore if it is possible to minimise the networking requirements between Avamar and Glacier.

Backup Deduplication Efficiencies and Capacity Planning Demystified

Seasons Greetings to All..

I recently wrote about a topic dear to my heart, backup deduplication and capacity planning. The paper was published by EMC Proven Professional Knowledge Sharing program and is available from the EMC Community Network. Access to the site requires a login. A link to the paper is here.

The paper has also been published @ www.books24x7.com here.

Abstract below..

Many organisations have embraced disk-centric backup architectures by adopting purpose-built backup appliances to overcome the reliability and performance challenges associated with tape-centric architectures.
 
The market for purpose-built backup appliances reached $2.4 billion in 2011 and continues to experience growth. This has resulted in many new vendors releasing solutions. The market is now saturated with a variety of solutions, from software-only to purpose-built backup appliances and combinations thereof. Each implementation has its strengths and weaknesses. This article will attempt to provide an objective comparison of the functional and architectural properties associated with deduplicated disk-centric backup implementations.
 
Furthermore, for those that have already adopted deduplicated disk for backup, we discuss capacity planning and why traditional planning models that we apply to primary storage do not work well for deduplicated disk backup systems. To support this discussion, we will provide a generic overview of deduplicated disk backup sizing and how backup requirements and data profiles affect storage consumption. Equipped with this knowledge, the reader will be in a better position to understand and forecast deduplication storage consumption.
 

This will probably be my last post for the year. Hope you enjoyed the content in 2013. Look out for some new experiments in the new year.

Peter

Pushing NetWorker Software Client Upgrades from your local NetWorker Software Repository

Today EMC announced the GA release of NetWorker 8.1. For every backup administrator that means we need to start planning software upgrades!

Fortunately, NetWorker includes the ability to push software updates to hosts from the NetWorker server.

To setup this feature the software you would like to distribute must first be added to the NetWorker server software repository. This can be done via the Software Administration GUI available from NetWorker Administrator or the command line. For the purpose of this demonstration we will use the nsrpush command from the NetWorker server but either way will work.

This particular NetWorker server runs on CentOS, which means we need to use a Windows system to help with adding the Windows software distribution to the repository. This system needs to be running a NetWorker client and be accessible from the NetWorker server. It is used one time to ensure Windows file semantics are preserved as the repository is populated on the UNIX NetWorker server. The inverse applies if we were using a Windows NetWorker server.

First extract the Windows software distribution to a directory on the UNIX NetWorker server and the Windows cross-platform client. In this case I have used /nw on UNIX and c:\nw81_win_x64\win_x64 on Windows. Then tell NetWorker to add the software to the repository by specifying the location on the UNIX server and Windows cross-platform client (server.mlab.local).

Here is an example below:

[root@nws nw]# nsrpush -a -p NetWorker -v 8.1.0.1 -P win_x64 -W -m /nw/win_x64 -c server.mlab.local -C 'C:\\nw81_win_x64\win_x64'
Hostname and mount point recorded.
Success adding product from:
/nw/win_x64

Add to repository status: succeeded

For UNIX software distributions (e.g. Linux, Solaris, AIX, HPUX) there is no need for a cross-platform client. We extract the bundle to the UNIX NetWorker server and add to the software repository.

Here is how it's done for the Linux software distribution.

[root@nws nw]# nsrpush -a -p NetWorker -v 8.1.0.1 -P linux_x86_64 -U -m /nw/linux_x86_64/
Success adding product from:
/nw/linux_x86_64

Add to repository status: succeeded

For comparison's sake we have also populated the repository with the Solaris x86 distribution.

[root@nws nw]# nsrpush -a -U -p NetWorker -P solaris_x86 -v 8.1.0.1 -m /nw/solaris_x86
Success adding product from:
/nw/solaris_x86

Add to repository status: succeeded

Now we can query the software repository to see what’s available:

[root@nws nw]# nsrpush -l

Products in the repository

================================

NetWorker 8.1.0.1
linux_x86_64
Japanese Language Pack
Chinese Language Pack
Korean Language Pack
Management Console
Client
Man Pages
Storage Node
License Manager
Server
French Language Pack
solaris_x86
Korean Language Pack
Chinese Language Pack
Japanese Language Pack
French Language Pack
Man Pages
Client
win_x64
Server
License Manager
Language Packs
English Language Pack
French Language Pack
Japanese Language Pack
Korean Language Pack
Chinese Language Pack
Client
Management Console
Storage Node

Before we can update our clients we need to perform an inventory of them. Let's run it across all clients.

In this case we have 3 clients with nws.mlab.local playing the role of the NetWorker console, server and storage node.

[root@nws nw]# nsrpush -i -all
Starting Inventory Operation on selected clients
Creating client type job for client maral-laptop.mlab.local.
Copying inventory scripts to client maral-laptop.mlab.local.
Successfully copied inventory scripts to client maral-laptop.mlab.local.
Starting inventory of client maral-laptop.mlab.local.
Successfully ran inventory scripts on client maral-laptop.mlab.local.
Cleaning up client maral-laptop.mlab.local.
Successfully cleaned up client maral-laptop.mlab.local.
Creating client type job for client nws.mlab.local.
Copying inventory scripts to client nws.mlab.local.
Successfully copied inventory scripts to client nws.mlab.local.
Starting inventory of client nws.mlab.local.
Successfully ran inventory scripts on client nws.mlab.local.
Cleaning up client nws.mlab.local.
Successfully cleaned up client nws.mlab.local.
Creating client type job for client server.mlab.local.
Copying inventory scripts to client server.mlab.local.
Successfully copied inventory scripts to client server.mlab.local.
Starting inventory of client server.mlab.local.
Successfully ran inventory scripts on client server.mlab.local.
Cleaning up client server.mlab.local.
Successfully cleaned up client server.mlab.local.

Now let's query our hosts to see what version of NetWorker and which packages are installed.

[root@nws nw]# nsrpush -s -all

server.mlab.local win_x64
NetWorker 8.0.2.0
Language Packs
English Language Pack
Client

nws.mlab.local linux_x86_64
NetWorker 8.1.0.1
Storage Node
Client
Server
Man Pages
Management Console

maral-laptop.mlab.local win_x64
NetWorker 8.0.2.0
Language Packs
Client
English Language Pack

No surprise there. Our NetWorker server is already running 8.1.0.1 as it was upgraded using the NetWorker server upgrade process (i.e. rpm).

Now we are in a position to upgrade all hosts or target specific hosts.

Let's for the moment upgrade our desktop server (server.mlab.local).

[root@nws nw]# nsrpush -u -p NetWorker -v 8.1.0.1 server.mlab.local
Starting upgrade operation on selected clients.
Creating client type job for client server.mlab.local.
Copying upgrade scripts to client server.mlab.local.
Successfully copied upgrade scripts to client server.mlab.local.
Checking free space on the temporaty path for client server.mlab.local.
Successfully issued a space check command to client server.mlab.local.
Copying upgrade packages to client server.mlab.local.
Successfully copied upgrade packages to client server.mlab.local.
Starting upgrade of client server.mlab.local.
Successfully issued an upgrade command to client server.mlab.local.
Waiting for upgrade status for client server.mlab.local.
Successfully retrieved and processed the package upgrade status data for client server.mlab.local
Starting cleanup phase on client server.mlab.local.
Cleaning up client server.mlab.local.
Successfully issued the cleanup command to client server.mlab.local.

Upgrade status: succeeded

And it's done. That took about a minute. To verify, we can query the client again.

[root@nws nw]# nsrpush -s server.mlab.local

 server.mlab.local  win_x64
     NetWorker 8.1.0.1
                English Language Pack
                Language Packs
                Client

All done remotely from the NetWorker server command prompt. Just the way I like it.

The NetWorker software distribution feature is quite simple to use. I suggest you give it a try if you are not using it already.

Automate the download of VMWorld 2013 online session content

If you are like me and have been looking for an automated way to download the session content from VMWorld 2013 San Francisco, you probably found yourself clicking through the links hoping there was a better way.

Well I hope you have found this blog post because I have done the work for you without breaking the rules.

The perl script at the end of this post uses WWW::Mechanize to connect to VMware’s servers and download the video content and resource links from the online sessions.

All you should need to do is pass your VMWorld 2013 username and password to the script via the command line and let it run. It will go through the site and identify all the mp3 and pdf content. It will then output a list of commands to download the content URLs, including the mp4 videos.

To run it and download the content do something like this:

# script <username> <password> > output.sh
# sh ./output.sh

This will download the URLs in sequence using curl. If you want to get a little more adventurous, look at one of my other blog posts on using GNU parallel to run commands in parallel.
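
If you do want to parallelise the downloads, GNU parallel will happily treat each line of output.sh as a command to run; something like the following (with a job count to suit your connection) should do it:

# parallel -j 4 < output.sh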

The code was developed on Linux and is provided as-is. Hope it saves you some time.

NOTE: Many of the sessions do not contain pdf or mp3 content. Not sure if this is by design, a bug, or the content is yet to be posted. Anyway, the content is around 28GB as of writing.

#!/usr/bin/env perl
#
# Downloads MP4, MP3 and PDF session content from VMWorld 2013
#
# Accepts your VMWorld 2013 username and password
#
# Usage:
#
# download_VMWorld2013 <username> <password>

use WWW::Mechanize;
my $wp = WWW::Mechanize->new();

my $username = shift;
my $password = shift;

if (!$username || !$password)
{
        print "Must supply username and password on command line\n";
        exit 1;
}

$wp->get('https://www.vmworld.com/login.jspa');

$wp->submit_form(
        form_name => 'loginform01',
        fields      => {
                username    => $username,
                password    => $password
        }
);

my(@downloadurls);

$wp->get('http://www.vmworld.com/community/sessions/2013/');

for my $i (@{$wp->find_all_links(url_regex => qr#docs/DOC-[0-9]+#)})
{
        my $url = "http://www.vmworld.com" . $i->[0];
        my $name = $i->[1];
        $name =~ s/[^[:ascii:]]+//g;
        $name =~ s#\s+#_#g;
        $name =~ s#/#_#g;
        $name =~ s#[:?]##g;
        $wp->get($url);
        $wp->content =~ m#<tr><th><p>Session ID:</p></th><td><p>([^<]+)</p></td></tr><tr>#;
        my $sessionid = lc $1;
        my $res = $wp->find_link(url_regex => qr/mylearn\?classID=[0-9]+/);
        next if (!defined $res);
        $res->url =~ m#classID=(\d+)#;
        my $classid = $1;
        $wp->get("http://www.vmworld.com" . $res->url);
        $url = "http://sessions.vmworld.com/mgrCourse/launchCourse.cfm?mL_method=player";
        $wp->get($url);
        for my $l (@{($wp->find_all_links(url_regex => qr#/courseware/[0-9]+/.+[0-9]+\..+#))})
        {
                my $url = $l->[0];
                my $filename = $url;
                $filename =~ m#([^/]+)$#;
                $filename = join("_", $name, $1);
                push(@downloadurls, [$url, $filename]);
        }
        my $url = "http://sessions.vmworld.com/lcms/mL_course/courseware/$classid/$sessionid.mp4";
        my $filename = join('_', $name, "$sessionid.mp4");
        push(@downloadurls, [$url, $filename]);
}

# Output URL's to stdout
#

map { print "curl -s -S -o \"$_->[1]\" \"$_->[0]\"\n" if (!-f $_->[1]); } @downloadurls;

EMC NetWorker parallel saveset cloning with nsrclone and GNU parallel

It should be no surprise that driving backup or cloning throughput requires high levels of parallelism. As of this writing, NetWorker 8.0.2 supports cloning of individual savesets, which commences at the completion of a savegroup. The individual savesets that require cloning are passed to the nsrclone command, which runs through them in sequence.

To optimise for cloning, savesets need to be spread across multiple savegroups in order to create clone session parallelism. Splitting up savesets this way is not always possible or desirable.

I wrote pnsrclone to speed up the cloning process without requiring savegroup redesign.

This script is a multi-process wrapper written around the standard Networker nsrclone command. It does everything nsrclone can do with the added benefit of managing a predefined number of parallel nsrclone sessions to process a queue of savesets.

The script supports the command line switches available to nsrclone with the exception of -P and -W which are not relevant for cloning.
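
To illustrate the underlying idea (this is a minimal sketch, not the pnsrclone script itself), saveset IDs can be fed to GNU parallel, which keeps a fixed number of nsrclone sessions running until the queue is drained. The mminfo query and pool name below are placeholders, and the grep simply drops the mminfo header line:

# mminfo -q "group=Exchange,savetime>=last week" -r ssid | grep -v ssid | sort -u | parallel -j 8 --joblog /tmp/pnsrclone.log nsrclone -b "Target Clone Pool" -S {}

pnsrclone layers the job log handling, locking and dry run behaviour described below on top of this basic pattern.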

Three environment variables can be set to control the scripts behaviour:

PARALLELISM=<n>

Sets the number of parallel nsrclone sessions to spawn and maintain while the saveset queue is processed. The default is 10.

JOBLOG=<filename>

Accepts a filename that is used to record relevant statistics from each parallel nsrclone session. As each nsrclone session completes various statistics are recorded by GNU parallel. These statistics can be used to determine the state of completed nsrclone sessions.

The following statistics are logged:

  1. Seq: The sequence number or order in which this command was run
  2. Host: N/A
  3. Starttime: Start time as seconds since epoch
  4. Runtime: Command runtime in seconds
  5. Send: N/A
  6. Receive: N/A
  7. Exitval: Exit value of nsrclone command
  8. Signal: Value of signal if received
  9. Command: nsrclone command run with arguments

There is no default defined for JOBLOG; the environment variable is mandatory.
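For reference, GNU parallel writes the joblog as a plain tab-separated text file; with the versions I have used the header row looks roughly like the following (parallel labels the runtime column JobRuntime rather than Runtime):

Seq  Host  Starttime  JobRuntime  Send  Receive  Exitval  Signal  Command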

DRYRUN=1

When set to 1, DRYRUN instructs pnsrclone to perform a dry run without executing the nsrclone sessions. This parameter can be used to determine the number of savesets that would be cloned. The default is not to perform a dry run.
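For example, to see how many savesets would be cloned without starting any clone sessions, a dry run along the following lines can be used (the selection criteria here simply mirror the examples further down):

# dry run: report what would be cloned, clone nothing
env DRYRUN=1 PARALLELISM=8 JOBLOG=/tmp/pnsrclone.dryrun.log pnsrclone -F -g Exchange -b "Target Clone Pool" -C 1 -t "last week" -S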

All output from pnsrclone, and the underlying output from the nsrclone sessions, is written to standard out and standard error. GNU parallel does its best to keep the output of each nsrclone session together. When run from cron, it is best to redirect standard out and error to a scratch file.

The script allows one instance of pnsrclone to run per unique filename defined by the JOBLOG environment variable. This is to prevent multiple invocations of pnsrclone from cloning the same savesets at the same time.

If there is a requirement to run multiple instances, a different JOBLOG must be specified for each. To prevent instances from cloning the same savesets, avoid specifying the same group (-g) or volume ID (-V) across pnsrclone instances that are likely to run at the same time.
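For instance, two concurrent instances could be kept apart by giving each its own JOBLOG and a non-overlapping group (the group names below are only illustrative):

# each instance has its own JOBLOG and clones a different group
env PARALLELISM=8 JOBLOG=/tmp/pnsrclone.exchange.log pnsrclone -F -g Exchange -b "Target Clone Pool" -S
env PARALLELISM=8 JOBLOG=/tmp/pnsrclone.oracle.log pnsrclone -F -g Oracle -b "Target Clone Pool" -S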

The script requires GNU parallel to be installed, which is available from here.

Testing was conducted on a Linux distribution with NetWorker 8.0.2. It should work on other UNIX variants. If you have success feel free to drop a comment here. There is no Windows version yet. If I get enough requests I may try and port it.

Run it from either the NetWorker server or a storage node, and schedule it via cron or some other method.
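As a rough example, a crontab entry like the one below would run it nightly and capture all output in a scratch file as suggested earlier (the install path, schedule and selection criteria are illustrative):

# run pnsrclone at 02:00 every night, redirecting all output to a scratch file
0 2 * * * env PARALLELISM=8 JOBLOG=/tmp/pnsrclone.log /usr/local/bin/pnsrclone -F -g Exchange -b "Target Clone Pool" -C 1 -t "last week" -S > /tmp/pnsrclone.out 2>&1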

When pnsrclone runs, you should see multiple clone sessions in NetWorker Administrator, as below:

[Screenshot: NetWorker Administrator showing multiple concurrent clone sessions]

The number of clone sessions will correspond to the PARALLELISM setting.

Example script runs are included below:

  • Clone all backups that were created from the Exchange group up to a week ago to the clone pool named Target Clone Pool. Only clone backups that do not already have a copy in the target clone pool. Use up to 8 parallel nsrclone sessions.

env PARALLELISM=8 JOBLOG=/tmp/pnsrclone.log pnsrclone -F -g Exchange -b "Target Clone Pool" -C 1 -t "last week" -S

  • As above, but set the clone savesets' browse and data retention policies to 6 months from the backup copy

env PARALLELISM=8 JOBLOG=/tmp/pnsrclone.log pnsrclone -F -g Exchange -b "Target Clone Pool" -y "6 month" -w "6 month" -C 1 -t "last week" -S

Disclaimer: the script is provided as-is and is not supported or maintained by EMC.

Download version 1.2 from here.

Cloning at warp speed: 100 VMs in 1 minute using one command line

I had a requirement to test backing up 100 VMs simultaneously using Avamar. I wasn't particularly interested in creating the VMs through the vSphere UI. Instead, I downloaded the vSphere Perl SDK and found a script called vmclone.pl. This script is able to create clones by connecting to a vCenter and specifying a source and target VM.

The command syntax is as follows:

/usr/lib/vmware-viperl/apps/vm/vmclone.pl --username <vcenter username> --password <vcenter password> --url https://<vcenter hostname>/sdk/webService --vmhost <esx hostname> --vmname <vm to clone> --vmname_destination <vm to create> --datastore <datastore>

The next challenge was to run the process as fast as possible and that requires parallelism. To achieve that in a shell context I used GNU parallel. This program is able to spawn multiple shell commands concurrently. A concurrency limit can be defined with the -j flag.

In this instance I wanted to run up to 8 clone operations in parallel until all 100 were complete. The one-line command I used to achieve this is below:

seq -w 0 99 | parallel -j8 -k /usr/lib/vmware-viperl/apps/vm/vmclone.pl --username administrator --password <password> --url https://winvc.mlab.local/sdk/webService --vmhost esx.mlab.local --vmname blank --vmname_destination blank{} --datastore raidarray

This created 100 VMs named blank00 to blank99 using the existing VM named blank.
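The zero-padded names come from seq -w, which pads its output to a uniform width, combined with parallel's {} placeholder, which substitutes each input line into the command. A quick way to see the effect without touching vCenter:

# prints blank00 through blank99; -k keeps the output in input order
seq -w 0 99 | parallel -j8 -k echo blank{}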

I then put all 100 VMs into a resource pool called test100 using vmmigrate.pl.

seq -w 00 99 | parallel -j8 -k /usr/lib/vmware-viperl/apps/vm/vmmigrate.pl --username administrator --password <password> --url https://winvc.mlab.local/sdk/webService --targetpool test100 --vmname blank{} --sourcehost esx.mlab.local --targethost esx.mlab.local --targetdatastore raidarray

Job complete.

I was pretty chuffed with vmclone.pl. But it still took 4 hrs to copy from one thin VM to 100 thin VMs. There had to be a better way. Introducing the linked clone.

What is a linked clone? Essentially, it is a read/write snapshot of a source VM's snapshot. In my case the source VM is called blank and the snapshot is called lc.

To create the linked clones I used virtuallyGhetto. The virtuallyGhetto package comes with a script, vGhettoLinkedClone.pl, which is a modified version of vmclone.pl that creates linked clones.

So here it is: 100 VMs in 1 minute using one command line.

seq -w 0 99 | parallel -j10 -k PERL_LWP_SSL_VERIFY_HOSTNAME=0 vGhettoLinkedClone.pl --server winvc.mlab.local --vmhost esx.mlab.local --username administrator --password <password> --vmname blank --vmname_destination blank{} --snapname lc --convert sesparse --datastore raidarray

And if you are wondering, it took 1m13.084s. Yes, I cheated 🙂

Now you might say this could be done with vSphere PowerCLI. That may be true, but the UNIX shell is my preferred terminal window for automation tasks.