josephxsxn

HDFS DR 101

Jul 28th, 2017

Disaster Recovery

HDFS with DistCP

In the simplest terms, replication of datasets to a target HDFS or cloud storage system is accomplished with DistCP. This section covers the basics of using DistCP, HDFS Snapshots, and Oozie to perform interval-based copies. For sanity, it is recommended that this process be tested once both clusters are online. If simulating the failure of a production dataset is not an option, a secondary location on the production cluster can be used instead. For example, rather than restoring the display ads dataset to its production location, it could be returned to a location under the HDFS /tmp directory to avoid impacting running tasks, as sketched below.
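
For illustration, such a test restore might look like the following (a sketch; the /tmp target path is an assumption, and the hostnames reuse the srcHDFSNN/baHDFSNN examples from later in this document):

$ hadoop distcp \
     hdfs://baHDFSNN/data/project/iot-data/tables \
     hdfs://srcHDFSNN/tmp/iot-data-restore-test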

HDFS Snapshots

Snapshots are the HDFS mechanism for backup and recovery of a local cluster namespace. They can also be leveraged to provide a fully consistent view of a specific set of data, since a snapshot presents an HDFS namespace that does not change. Before a DistCP job executes, the source dataset can be snapshotted, allowing the snapshot to be copied to the target storage while other operations proceed on the source dataset without conflict. After replication to the target HDFS completes, another snapshot should be taken there to produce a mirror of the backups and allow for application-level recovery at daily granularity.

$ hdfs dfs -createSnapshot /data/project/iot-data/tables 2013-09-09
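
Note that snapshots can only be taken of a directory after it has been made snapshottable, a one-time administrative step:

$ hdfs dfsadmin -allowSnapshot /data/project/iot-data/tables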

DistCP Validation

As part of the map stage, each file is validated between the source and target by comparing checksums. A DistCP counters report is generated after every job; any failures are captured in this report, which can be used to alert on non-zero FAIL counters.

DistCP Counters

org.apache.hadoop.tools.mapred.CopyMapper$Counter        
BYTESFAILED=10313844400        
BYTESSKIPPED=7993229410        
COPY=3        
FAIL=2        
SKIP=13
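
Since the counters print with the job output, a minimal alerting sketch might capture and scan them (the log path, the mail command and address, and GNU grep's -P option are assumptions; in practice the counters could also be pulled from the job history server):

$ hadoop distcp -update -prbugp -m 32 \
     hdfs://srcHDFSNN/data/project/iot-data/tables/.snapshot/2015-05-08 \
     hdfs://baHDFSNN/data/project/iot-data/tables 2>&1 | tee distcp.log
$ failed=$(grep -oP 'FAIL=\K[0-9]+' distcp.log)
$ if [ "${failed:-0}" -gt 0 ]; then
      echo "DistCP reported ${failed} failed copies" | mail -s 'DistCP FAIL' ops@example.com
  fi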

Oozie Workflows and Coordinator Jobs

Oozie coordinators are recommended for scheduling replication jobs because they support data-aware scheduling of workflows. Workflows should accept a single HDFS directory as the input path for replication. Intermediate datasets such as incoming, working, or failed directories should be excluded from the process.
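
As a minimal sketch of submitting such a coordinator per dataset (the application path, server URL, and the replicationRoot property name are hypothetical; the coordinator and workflow definitions themselves are not shown):

# replication-coord.properties (hypothetical property names, for illustration)
oozie.coord.application.path=hdfs://srcHDFSNN/apps/replication/coordinator.xml
replicationRoot=/data/project/iot-data/tables

$ oozie job -oozie http://oozie-host:11000/oozie -config replication-coord.properties -run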

End to End Replication

  1. Snapshot the Source Directory

    • This step guarantees that the copy will be in a fully consistent state, regardless of ongoing work within the directory. It also decouples the scheduling and execution of backups from any other tasks that may need to take place. This step is optional.
      $ hdfs dfs -createSnapshot /data/project/iot-data/tables 2015-05-08
  2. Execute DistCP Job

    • Initiate a DistCP job to copy data from the source cluster to the target backup cluster. Optionally, this task may be run from the backup cluster; the execution location is determined by the YARN ResourceManager specified in the configuration. Running the task on the backup cluster frees up resources on the production cluster for ongoing work. That said, be certain snapshots are being created on the correct cluster. If executed on the backup cluster, the workflow functions as a pull rather than a push of data.

      $ hadoop distcp -update -prbugp -m 32 \
           hdfs://srcHDFSNN/data/project/iot-data/tables/.snapshot/2015-05-08 \
           hdfs://baHDFSNN/data/project/iot-data/tables
  3. Snapshot the Target Directory

    • Execute a snapshot on the TARGET in order to mirror the historical backups. This provides history for recovery in cases where application-layer corruption occurs, rather than requiring a full DR-level restoration.

      $ hdfs dfs -createSnapshot /data/project/iot-data/tables 2015-05-08
  4. Clean Up Older Snapshots

    • Clean up or delete old snapshots on both systems as defined by business retention rules and capacity allocations. These parameters may also be passed to Oozie on a per-job basis, allowing retention rules to vary per defined dataset. As long as a snapshot exists, the data exists: deleting data from a directory that has a snapshot, even with -skipTrash, does not free up space. Space can be reclaimed only once all references to that data are gone.
      $ hdfs dfs -deleteSnapshot /data/project/iot-data/tables <snapshotName>
  5. Calculate Snapshot Differences

    • The flags next to each path in the output indicate what change occurred between the two snapshots.

      $ hdfs snapshotDiff /data/project/iot-data/tables 2015-05-08 2015-05-09
      Difference between snapshot 2015-05-08 and snapshot 2015-05-09
      under directory /data/project/iot-data/tables:
      M       .
      +       ./attempt
      R       ./raw -> ./orig
      M       ./namenode/fs_state/2016-12.txt
      M       ./namenode/nn_info/2016-12.txt
      M       ./namenode/top_user_ops/2016-12.txt
      M       ./scheduler/queue_paths/2016-12.txt
      M       ./scheduler/queue_usage/2016-12.txt
      -       ./scheduler/queues/2016-12.txt
      + The file/directory has been created.
      - The file/directory has been deleted.
      M The file/directory has been modified.
      R The file/directory has been renamed.
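
Tying steps 1 through 4 together, a minimal driver script might look like this (a sketch; the paths and hostnames reuse the examples above, while the 30-day retention window and GNU date are assumptions; error handling is omitted):

#!/bin/bash
# Sketch of one end-to-end replication pass (steps 1-4 above).
DIR=/data/project/iot-data/tables
SNAP=$(date +%Y-%m-%d)
CUTOFF=$(date -d '-30 days' +%Y-%m-%d)   # assumed retention window; GNU date

# 1. snapshot the source directory
hdfs dfs -createSnapshot "$DIR" "$SNAP"

# 2. replicate the snapshot to the backup cluster
hadoop distcp -update -prbugp -m 32 \
    "hdfs://srcHDFSNN$DIR/.snapshot/$SNAP" \
    "hdfs://baHDFSNN$DIR"

# 3. snapshot the target to mirror history
hdfs dfs -createSnapshot "hdfs://baHDFSNN$DIR" "$SNAP"

# 4. delete source snapshots older than the retention window
#    (snapshot names are dates, so lexical comparison works);
#    repeat against hdfs://baHDFSNN$DIR for the target
hdfs dfs -ls "$DIR/.snapshot" | awk 'NR>1 {print $NF}' | while read -r p; do
    s=$(basename "$p")
    if [[ "$s" < "$CUTOFF" ]]; then
        hdfs dfs -deleteSnapshot "$DIR" "$s"
    fi
done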

Recovery

Restoring from a backup is effectively the copy operation in reverse. To restore, follow these steps, adjusting as necessary:

  1. On the backup system, select the snapshot you would like to restore.
  2. Wipe the target directory on the production system.
  3. Execute the distcp without the -update option, as sketched below.
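
For example, a bulk restore of the 2015-05-08 snapshot from the backup cluster might look like this (a sketch reusing the hostnames and paths from the examples above):

$ hadoop distcp -prbugp -m 32 \
     hdfs://baHDFSNN/data/project/iot-data/tables/.snapshot/2015-05-08 \
     hdfs://srcHDFSNN/data/project/iot-data/tables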

Alternatively, differential copies may be attempted; however, in a DR scenario it is generally recommended that data be returned to the production cluster in bulk.

Some Hints

Initial migrations of data between systems are very expensive in terms of network I/O, and you probably don't want to have to repeat them, ever. I recommend keeping a snapshot of the original copy on each system, or of some major checkpoint you can return to, in the event the process is compromised.

If 'distcp' can't validate that the snapshot (by name) is the same on the source and the target, and that the data at the target hasn't changed since that snapshot, the process will fail. If the failure is because the target directory has been updated, you'll need to use the baseline snapshots above to restore it without migrating all that data again, and then restart the process.
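
This validation applies to snapshot-diff-based incremental copies, which might look like the following sketch (it assumes snapshots 2015-05-08 and 2015-05-09 exist on the source, and that 2015-05-08 also exists on the target from the previous run):

$ hadoop distcp -update -diff 2015-05-08 2015-05-09 \
     hdfs://srcHDFSNN/data/project/iot-data/tables \
     hdfs://baHDFSNN/data/project/iot-data/tables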
