Disaster Recovery
HDFS with DistCP
Replication of datasets to a target HDFS or cloud storage system is, in its simplest form, accomplished with DistCP. This section covers the basics of using DistCP, HDFS snapshots, and Oozie to perform interval-based copies. As a sanity check, it is recommended that this process be tested once both clusters are online. If simulating the failure of a production dataset is not an option, a secondary data location on the production cluster can be used. For example, rather than restoring the display-ads dataset to its production location, it could be restored to a location under the HDFS /tmp directory to avoid impacting running tasks.
HDFS Snapshots
Snapshots are the HDFS mechanism for backup and recovery of the local cluster namespace. They can also be leveraged to provide a fully consistent view of a specific set of data by exposing an HDFS namespace that is unchanging. Before executing a DistCP job, the source dataset can be snapshotted, allowing the snapshot to be copied over to the target storage while other operations take place on the source dataset without conflict. After replication to the target HDFS has completed, another snapshot should be taken there to produce a mirror of the backups and allow for application-level recovery at daily granularity.
$ hdfs dfs -createSnapshot /data/project/iot-data/tables 2013-09-09
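Note that a directory must first be made snapshottable by an HDFS administrator before snapshots can be taken on it. A minimal sketch of the two steps, using the path from the example above (the run wrapper echoes the hdfs commands by default, so the sketch can be exercised without a cluster):

```shell
#!/usr/bin/env sh
# Sketch: enable snapshots on a directory, then take a dated snapshot.
# DRY_RUN=1 (the default) echoes the hdfs commands instead of running them,
# so the logic can be tried without a live cluster.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

SRC_DIR=/data/project/iot-data/tables
SNAP_NAME=$(date +%Y-%m-%d)   # dated name, e.g. 2015-05-08

# One-time step, as an HDFS administrator: mark the directory snapshottable.
run hdfs dfsadmin -allowSnapshot "$SRC_DIR"
# Per run: create the dated snapshot that DistCP will copy from.
run hdfs dfs -createSnapshot "$SRC_DIR" "$SNAP_NAME"
```

Set DRY_RUN=0 on a live cluster to execute the commands for real.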
DistCP Validation
As part of the map stage, validation occurs between the source and target files by comparing the checksums of the two files. A DistCP counters report is generated after every job; any failures are captured in this report, which can be used to alert on non-zero FAIL counters.
DistCP Counters
org.apache.hadoop.tools.mapred.CopyMapper$Counter
BYTESFAILED=10313844400
BYTESSKIPPED=7993229410
COPY=3
FAIL=2
SKIP=13
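One way to act on this report is to capture the job output and alert whenever the FAIL counter is non-zero. A minimal sketch, with the sample report above embedded so the parsing can be exercised without running a job:

```shell
#!/usr/bin/env sh
# Sketch: parse a DistCP counters report and flag failed copies.
# In practice, REPORT would be the captured stdout of the distcp job.
REPORT='org.apache.hadoop.tools.mapred.CopyMapper$Counter
        BYTESFAILED=10313844400
        BYTESSKIPPED=7993229410
        COPY=3
        FAIL=2
        SKIP=13'

# Extract the FAIL counter (matches FAIL=, not BYTESFAILED=); default to 0.
FAILED=$(printf '%s\n' "$REPORT" | awk -F= '$1 ~ /FAIL$/ { print $2 }')
FAILED=${FAILED:-0}

if [ "$FAILED" -gt 0 ]; then
  echo "ALERT: distcp reported $FAILED failed copies"
else
  echo "distcp clean: no failed copies"
fi
```

A script like this can exit non-zero on the ALERT branch to feed a monitoring or paging system.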
Oozie Workflows and Coordinator Jobs
Oozie coordinators are recommended for scheduling replication jobs because they allow data-aware scheduling of workflows. Workflows should accept a single HDFS directory as the input path for replication. Intermediate datasets, such as incoming, working, or failed directories, should be excluded from the process.
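A data-aware coordinator for such a workflow might be sketched as follows; the application name, frequency, dates, and paths here are illustrative assumptions, not values from this document:

```xml
<!-- Sketch: daily coordinator that triggers the replication workflow only when
     the day's input directory exists (signaled by a _SUCCESS done-flag).
     All names, dates, and paths are illustrative. -->
<coordinator-app name="replication-coord" frequency="${coord:days(1)}"
                 start="2015-05-08T00:00Z" end="2016-05-08T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="tables" frequency="${coord:days(1)}"
             initial-instance="2015-05-08T00:00Z" timezone="UTC">
      <uri-template>hdfs://srcHDFSNN/data/project/iot-data/tables/${YEAR}-${MONTH}-${DAY}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="tables">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://srcHDFSNN/apps/replication-wf</app-path>
      <configuration>
        <property>
          <name>inputDir</name>
          <value>${coord:dataIn('input')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
```

The single inputDir property matches the recommendation above that workflows accept one HDFS directory as their replication input.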
End to End Replication
1. Snapshot the Source Directory

This step guarantees that the copy is taken from a fully consistent state, regardless of ongoing work within that directory. It also decouples the scheduling and execution of backups from any other tasks that may need to take place. This step is optional.

$ hdfs dfs -createSnapshot /data/project/iot-data/tables 2015-05-08
2. Execute the DistCP Job

Initiate a distcp to copy data from the source cluster to the target backup cluster. Optionally, this task may be run from the backup cluster; the execution location is determined by the YARN ResourceManager specified in the job configuration. Running the task on the backup cluster frees up resources on the production cluster for ongoing work. That said, be certain snapshots are being created on the correct HDFS site. If executed on the backup cluster, the workflow functions as a pull rather than a push of data.

$ hadoop distcp -update -prbugp -m 32 \
    hdfs://srcHDFSNN/data/project/iot-data/tables/.snapshot/2015-05-08 \
    hdfs://baHDFSNN/data/project/iot-data/tables
3. Snapshot the Target Directory

Execute a snapshot on the target in order to mirror the historical backups. This provides history for cases where application-layer corruption takes place and a full DR-level restoration is not required.

$ hdfs dfs -createSnapshot /data/project/iot-data/tables 2015-05-08
4. Clean Up Older Snapshots

Clean up or delete old snapshots on both systems as defined by business retention rules and capacity allocations. These parameters may also be passed to Oozie on a per-job basis, allowing data retention rules to be set per dataset. As long as a snapshot exists, the data exists: deleting data from a directory that has a snapshot, even with -skipTrash, does not free up space. Space can be reclaimed only when all references to that data are gone.

$ hdfs dfs -deleteSnapshot <snapshotDir> <snapshotName>
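A retention pass might be sketched as follows, keeping only the newest N date-named snapshots. The snapshot list is embedded here for illustration (in practice it would come from listing the directory's .snapshot folder), and the hdfs call is echoed so the logic can be tried offline:

```shell
#!/usr/bin/env sh
# Sketch: keep only the newest KEEP date-named snapshots of a directory.
# Snapshot names are dates (YYYY-MM-DD), so lexical sort equals time order.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

DIR=/data/project/iot-data/tables
KEEP=2
# In practice: SNAPSHOTS=$(hdfs dfs -ls "$DIR/.snapshot" | awk '{print $NF}' | xargs -n1 basename)
SNAPSHOTS='2015-05-01
2015-05-02
2015-05-08
2015-05-09'

DELETED=0
# Newest first; everything past the first KEEP entries gets deleted.
for SNAP in $(printf '%s\n' "$SNAPSHOTS" | sort -r | tail -n +$((KEEP + 1))); do
  run hdfs dfs -deleteSnapshot "$DIR" "$SNAP"
  DELETED=$((DELETED + 1))
done
echo "removed $DELETED snapshot(s) beyond the newest $KEEP"
```

With the sample list above, the two oldest snapshots (2015-05-01 and 2015-05-02) would be deleted.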
5. Calculate Snapshot Differences

The flags in the output indicate what changed between the two snapshots.

$ hdfs snapshotDiff /data/project/iot-data/tables 2015-05-08 2015-05-09
Difference between snapshot 2015-05-08 and snapshot 2015-05-09 under directory /data/project/iot-data/tables:
M   .
+   ./attempt
R   ./raw -> ./orig
M   ./namenode/fs_state/2016-12.txt
M   ./namenode/nn_info/2016-12.txt
M   ./namenode/top_user_ops/2016-12.txt
M   ./scheduler/queue_paths/2016-12.txt
M   ./scheduler/queue_usage/2016-12.txt
-   ./scheduler/queues/2016-12.txt

+   The file/directory has been created.
-   The file/directory has been deleted.
M   The file/directory has been modified.
R   The file/directory has been renamed.
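The diff can also drive incremental replication: DistCP's -diff option (which requires -update) copies only the changes between two snapshots, provided the starting snapshot exists on both clusters and the target is unchanged since it was taken. A sketch using the snapshot names from the example above (the command is echoed rather than executed, so it can be tried without a cluster):

```shell
#!/usr/bin/env sh
# Sketch: incremental replication using snapshot diffs.
# -diff requires -update, requires FROM_SNAP to exist on BOTH clusters,
# and requires the target to be unchanged since FROM_SNAP was taken.
FROM_SNAP=2015-05-08
TO_SNAP=2015-05-09
SRC=hdfs://srcHDFSNN/data/project/iot-data/tables
DST=hdfs://baHDFSNN/data/project/iot-data/tables

CMD="hadoop distcp -update -diff $FROM_SNAP $TO_SNAP $SRC $DST"
echo "$CMD"   # drop the echo (run the command) on a live cluster
```

This avoids re-scanning the full dataset on every interval, at the cost of stricter preconditions on the target.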
Recovery
Restoring from a backup is effectively the copy operation in reverse. To restore, follow these steps, adjusting as necessary:
- On the backup system, select the snapshot you would like to restore.
- Wipe the target directory on the production system.
- Execute the distcp without the -update option.
Alternatively, differential copies may be attempted; however, in a DR scenario it is generally recommended that data be returned to the production cluster in bulk.
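The restore copy then reverses the roles of the two clusters from the earlier replication command. A sketch (echoed rather than executed, so it can be tried without a cluster):

```shell
#!/usr/bin/env sh
# Sketch: bulk restore -- copy a chosen backup snapshot back to production.
# Per the steps above: snapshot is selected on the backup side, the production
# directory has been wiped, and -update is intentionally omitted.
SNAP=2015-05-08
BACKUP=hdfs://baHDFSNN/data/project/iot-data/tables
PROD=hdfs://srcHDFSNN/data/project/iot-data/tables

CMD="hadoop distcp -prbugp -m 32 $BACKUP/.snapshot/$SNAP $PROD"
echo "$CMD"   # drop the echo (run the command) on a live cluster
```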
Some Hints
Initial migrations of data between systems are very expensive in terms of network I/O, and you probably don't want to have to do that again, ever. I recommend keeping a snapshot of the original copy on each system, or of some major checkpoint you can go back to, in case the process is compromised.
If distcp can't validate that the snapshot (by name) is the same on the source and the target, and that the data at the target hasn't changed since that snapshot, the process will fail. If the failure is because the target directory has been updated, you'll need to use the baseline snapshots above to restore it without migrating all that data again, and then restart the process.