INTEGRATING VMWARE SITE RECOVERY MANAGER WITH VNX MIRRORVIEW/A

Jason L. Gates
Systems Engineer - Storage & Virtualization
Presidio Networked Solutions
[email protected]
http://www.linkedin.com/in/mrjasongates

2012 EMC Proven Professional Knowledge Sharing 2

Table of Contents

Audience
Overview
What is VMware Site Recovery Manager?
SRM Architecture
VNX MirrorView/A Replication Overview
Delta Set and Gold Copy
Reserve LUN Pool Recommendations for Performance and Sizing
MirrorView Replication Link Best Practices and Settings
Installing SRM
Configuring SRM
Troubleshooting SRM
Conclusion
References

Disclaimer: The views, processes, or methodologies published in this article are those of the author. They do not necessarily reflect EMC Corporation’s views, processes, or methodologies.

Audience

This article is intended for system administrators, VMware engineers, and storage administrators. Readers are assumed to be familiar with VMware ESX/ESXi hosts, basic CLARiiON® system operations, fabric switches, basic networking, vCenter Server, EMC Navisphere®/Unisphere® Manager, and creating EMC MirrorView replication.

Overview

This Knowledge Sharing article walks through the VMware Site Recovery Manager (SRM) workflow that must be completed to allow successful, automated service failover from the designated SRM protected site to the designated SRM recovery site using EMC MirrorView technology. It also provides an overview of the considerations and guidance for executing a failover of services to the recovery site and a failback to the original production site.

VMware SRM provides business continuity and disaster recovery protection for virtual environments. MirrorView/A, EMC’s leading remote replication suite, is ideal to provide disaster recovery solutions for the virtual environment running on the VNX®. I will discuss best practices when using MirrorView with SRM and common pitfalls that can cause multiple issues on the VNX, including performance, storage processor utilization problems, and oversaturation of the replication link.

Additionally, step-by-step instructions are provided on how to administer and troubleshoot replication and disaster recovery on the world’s number one mid-range storage sub-system and VMware’s leading virtualization software.

What is VMware Site Recovery Manager?

Site Recovery Manager is an end-to-end disaster recovery (DR) automation product. It is designed to protect virtual machines residing in datastores on replicated storage. In the event of a true failure or complete site failure, virtual machines can be failed over to a remote data center. SRM is targeted for use with array-based replication, although VM-level replication is possible with SRM 5.0. This article, however, discusses VNX MirrorView-based replication.

A customer once asked me to explain the major difference between VMware HA and SRM. Basically, VMware HA is a clustering product designed to provide intrasite fault tolerance. HA is designed to power on virtual machines on surviving local cluster members in the event of a hardware failure; it has no capability to power on virtual machines outside the local HA cluster. SRM, by contrast, fails virtual machines over to a remote data center.

SRM Architecture

To fully appreciate and understand SRM, we must know the following components, which make the product work in your environment:

SRM Server software: installed on a separate machine or on the vCenter Server; I recommend the vCenter Server for smooth integration
Storage Replication Adapter (SRA) software package (EMC VNX Replication Adapter)
VMware vCenter
SRM database (SQL or SQL Express)
SAN replication (MirrorView or RecoverPoint)
SRM licenses

There are also some key terms used with the SRM product:

Protected Site: the data center containing the virtual machines whose data is being replicated to the recovery site
Recovery Site: the data center where the virtual machines are recovered in case of disaster
Protection Group: replicated datastores containing a set of VMs that are protected with SRM
Inventory Mapping: mapping between resource pools, networks, and virtual machine folders on the protected site and their counterparts on the recovery site
Recovery Plan: SRM’s version of a runbook

High Level Overview Diagram of SRM Workflow

VNX MirrorView/A Replication Overview

MirrorView is a VNX business continuity solution that provides block-level replication. The copy of the data on the production VNX is called the primary image, and the copy at the recovery site is called the secondary image. The design goal of MirrorView is to allow speedy recovery from a disaster. To accomplish this, MirrorView/A replicates over low-cost, long-distance connectivity using an asynchronous, interval-based update mechanism, which I will explain in detail. Topologies include direct connect, SAN connect, and WAN connect. Because replication is asynchronous, data on the secondary image is not identical to the primary at all times. So how does MirrorView handle this? The answer lies in VNX’s data migration software, SAN Copy! Internally, the SAN Copy delta set mechanism is used to track changes to the primary image and ship those changes to the secondary image as defined by the update interval. In addition, a gold copy (a protective snapshot of the secondary) is captured, which guarantees that the data can revert to a previous known good state in the event of failure during the update cycle.

[Figure: High Level Diagram of MirrorView/A Configuration. A Protected Site and a Recovery Site, each with VMs, storage, and a storage pool, connected by MirrorView/A over a WAN. Deltas are shipped from the Delta Set at the protected site; the Gold Copy is held at the recovery site.]

MirrorView/A Work Flow

1. An I/O write is received from the server by the primary array.
2. An acknowledgement is sent to the server.
3. A point-in-time “gold copy” of the secondary is created to protect secondary data during Delta Set transport.
4. The Delta Set is created on the primary.
5. The Delta Set is applied to the secondary mirror, the gold copy is removed, and the Delta Set is cleared for the next update cycle.
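The workflow above can be sketched as a small toy model. This is only an illustration of the sequence described in this article, not how the array implements it: the class name `AsyncMirror`, its methods, and the `link_fails_after` parameter are all hypothetical, and real MirrorView/A tracks changed regions on disk rather than Python list entries.

```python
class AsyncMirror:
    """Toy model of one MirrorView/A mirror (illustrative assumption only)."""

    def __init__(self, blocks):
        self.primary = list(blocks)    # primary image
        self.secondary = list(blocks)  # secondary image, starts in sync
        self.tracking = set()          # indexes changed since the last update

    def host_write(self, index, value):
        # The write lands on the primary and is acknowledged right away;
        # nothing is sent to the secondary until the next update cycle.
        self.primary[index] = value
        self.tracking.add(index)
        return "ack"

    def update(self, link_fails_after=None):
        """Run one update cycle; return True if the secondary was updated."""
        gold_copy = list(self.secondary)  # protective point-in-time copy
        # Freeze the changes to ship and start tracking new ones.
        delta_set = {i: self.primary[i] for i in sorted(self.tracking)}
        self.tracking = set()
        for sent, (index, value) in enumerate(delta_set.items()):
            if link_fails_after is not None and sent >= link_fails_after:
                self.secondary = gold_copy       # roll back to last good state
                self.tracking |= set(delta_set)  # these changes must be resent
                return False
            self.secondary[index] = value
        return True  # update complete; the gold copy is no longer needed


mirror = AsyncMirror(["A", "B", "C"])
mirror.host_write(2, "E")                     # host changes block C to E
interrupted = mirror.update(link_fails_after=0)
completed = mirror.update()
```

In this sketch the interrupted cycle leaves the secondary at its previous consistent contents, and the following cycle ships the same changes again.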

Delta Set and Gold Copy

During my years working as Level 2 CLARiiON® support, I was asked many times to explain Gold Copy and Delta Set features in depth during the replication phase and when troubleshooting live production problems in many different environments. Needless to say, I’ve seen it all in regard to replication and configuring the product. Below is an in-depth explanation of Gold Copy and Delta Set.

Gold Copy: The Gold Copy tracks all of the updates made to the secondary. When a region on the secondary is updated, the original region is copied to the reserved LUN pool to preserve a consistent point-in-time view of the secondary LUN at the time of an update. This is a key feature that always ensures a consistent view of the secondary LUN. If the update from primary to secondary is interrupted due to a link failure or a failure at the primary site, the Gold Copy is used by the MirrorView/A software to roll the secondary back to its previous consistent state.

Delta Set: MirrorView/A uses asynchronous writes, which means that I/O is not sent to the remote site at the same time as the host I/O. The Delta Set is created, and changes are tracked, during a MirrorView/A replication cycle. MirrorView/A replicates only the blocks changed since the previous replication cycle, resulting in a lower bandwidth requirement than synchronous replication. The Delta Set is a local snap taken on the source side at the time of replication.

[Figure: MirrorView - Work Flow Diagram During Update. Labels recoverable from the diagram: Primary Image (blocks A, B, C; a write from the host changes block C to E, giving A, B, E; no write to the other blocks), Delta Tracking Map, Delta Transfer Map, Secondary Image, Gold Copy, Snapshot.]

Reserve LUN Pool Recommendations for Performance and Sizing

The reserved LUN pool (RLP) configuration is key for performance and for accommodating the host(s) accessing the source LUN in the mirror pair. The anticipated duration of a MirrorView/A update depends on the amount of data that must be transferred, as well as the transfer rate and available bandwidth. The sync rates for mirrors are high, medium, and low. I recommend increasing the sync rate to high (the default is medium), which speeds up the data transfer and reduces the copy on first write (COFW) activity that occurs on the source LUN. Why? Because the pointer-and-copy design of snapshots can affect source LUN performance: when data that has not changed on the source volume is accessed, reads to the snapshot hit the same disks or spindles as reads to the source volume. Since COFW requires data to be read from and written to the reserved LUN pool, the pool can become overloaded, resulting in disk latencies if the configuration is not optimal. Below are my recommendations for the RLP:

Try to avoid the vault disks: 0_0_0 - 0_0_4
NL-SAS drives are not recommended due to their lower performance; if the host is writing at a high rate, this will result in heavy COFW activity
Load balance the RLP between storage processors
Dedicate RAID group(s) to the RLP if possible to increase spindle count
Use RAID 5 for protection; this tends to be a good general-purpose RAID type
RLP LUNs should not be thin-enabled or in a storage pool
RLP LUNs should not share the same drives as the source LUNs
RLP LUN capacity should be 15-20% of the source LUN being replicated
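As a quick illustration of the 15-20% sizing guideline above, the following sketch computes the suggested RLP capacity range for a source LUN. The function name `rlp_size_range_gb` and the example LUN size are hypothetical; only the percentages come from the recommendation above.

```python
def rlp_size_range_gb(source_lun_gb, low_pct=15, high_pct=20):
    """Return (minimum, maximum) suggested RLP capacity in GB for one source LUN.

    The 15-20% range is the guideline from this article; adjust it if the
    change rate in your environment calls for more headroom.
    """
    if source_lun_gb <= 0:
        raise ValueError("source LUN size must be positive")
    return (source_lun_gb * low_pct / 100, source_lun_gb * high_pct / 100)


# Example with a hypothetical 500 GB replicated datastore LUN.
low_gb, high_gb = rlp_size_range_gb(500)
print(f"Reserve between {low_gb:.0f} GB and {high_gb:.0f} GB of RLP capacity")
```

For several replicated LUNs, the same calculation can simply be repeated per LUN and the results summed.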

MirrorView Replication Link Best Practices and Settings

Performance issues and resets of replication links (iFCP/FCIP) happen, and there can be many reasons for this. There are general recommendations that increase throughput and overall reliability; these are not specific to any type or model of SAN router/IP device, so please refer to your OEM guides. In my experience, when mirrors fracture during replication cycles, the culprit 85% of the time is the link between sites. There are some minor tweaks and settings I recommend that are standard in SAN router/IP devices:

Available WAN bandwidth for I/O transfer should be equal to a T3 or higher, if possible.
Confirm that Fast Write is enabled. Fast Write mitigates latency effects for SCSI write operations: it enables the entire data segment of a SCSI write operation to be transported across the link between the initiator and target without the inefficiency of waiting for transfer ready (FCP_XFER_RDY) commands to travel back and forth across the link.
Enable compression, and confirm that the IP port speeds coincide with the available bandwidth.
Increase the TCP window scaling size.
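The bandwidth and TCP window recommendations above can be sanity-checked with simple arithmetic. The sketch below estimates how long a Delta Set of a given size takes to cross the link, and whether the link's bandwidth-delay product exceeds the 65,535-byte limit of an unscaled TCP window. The function names, the 0.7 efficiency factor, and the example delta size and round-trip time are assumptions for illustration; 44.736 Mbit/s is the nominal T3 line rate.

```python
T3_MBPS = 44.736               # nominal T3 (DS3) line rate, megabits per second
MAX_UNSCALED_WINDOW = 65_535   # largest TCP window without window scaling, bytes


def transfer_time_seconds(delta_gb, link_mbps, efficiency=0.7):
    """Estimate seconds needed to ship delta_gb (decimal GB) over the link.

    efficiency is an assumed fraction of the line rate that is usable after
    protocol overhead; measure your own link rather than trusting 0.7.
    """
    bits = delta_gb * 8 * 1000 ** 3
    return bits / (link_mbps * 1_000_000 * efficiency)


def bandwidth_delay_product_bytes(link_mbps, rtt_ms):
    """Bytes that must be in flight to keep the link full."""
    return link_mbps * 1_000_000 / 8 * (rtt_ms / 1000)


def needs_window_scaling(link_mbps, rtt_ms):
    """True when an unscaled TCP window cannot fill the link."""
    return bandwidth_delay_product_bytes(link_mbps, rtt_ms) > MAX_UNSCALED_WINDOW


# Hypothetical example: 10 GB of changes over a T3 with a 50 ms round trip.
minutes = transfer_time_seconds(10, T3_MBPS) / 60
print(f"Estimated update time: {minutes:.0f} minutes")
print("Window scaling needed:", needs_window_scaling(T3_MBPS, 50))
```

If the estimated update time approaches or exceeds the mirror's update interval, the link is likely to saturate, which is the situation the recommendations above are meant to avoid.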

Also run this command: navicli -h <SP_IP_address> port -diagnose -sancopy -clean. This cleans up old SAN Copy connections on the MirrorView ports, helping throughput and the internal login table of the VNX.
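If the cleanup command has to be run against more than one storage processor, a small wrapper can build the exact command shown above for each address. This is a hedged sketch: the function names and the example addresses (taken from the documentation-only 192.0.2.0/24 range) are hypothetical, it assumes navicli is on the PATH, and it defaults to a dry run that only returns the commands without executing anything.

```python
import subprocess


def sancopy_clean_command(sp_address):
    """Build the cleanup command from this article for one SP address."""
    return ["navicli", "-h", sp_address, "port", "-diagnose", "-sancopy", "-clean"]


def clean_sancopy_connections(sp_addresses, dry_run=True):
    """Return the commands for each SP; execute them only when dry_run is False."""
    commands = [sancopy_clean_command(sp) for sp in sp_addresses]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if navicli reports a failure.
            subprocess.run(command, check=True)
    return commands


# Placeholder addresses; replace with the real SP management IPs.
for command in clean_sancopy_connections(["192.0.2.10", "192.0.2.11"]):
    print(" ".join(command))
```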

Hardware Configuration for Testing

Hardware Type
HP ProLiant BL 460c G6, 12 CPUs
VNX 5300 & NS-120, FLARE 30
VMware vSphere 5 Enterprise
1 server at each site
Cisco MDS Switch 9124
Application: Apache Web Server

Installing SRM

SRM software can be downloaded from the VMware website. Some screenshots of the install process are shown below. The install is fairly simple and similar to installing vCenter.

Adding SRM to vCenter Server at Protected/Production Site

Adding SRM to Recovery Site

Install Certificate

Configuring Site Name, Ports, etc.

Configuring SQL Database at each site

Complete the install of SRM at each site

Install Complete and Install Plug-In

Configuring SRM

Once MirrorView replication has been configured and the SRM software installed, we are ready to build the DR runbook to be used in case of failure. By runbook, I mean the actual configuration of the failover scenario. The tasks inside SRM break down as follows:

Array Manager Configuration

Protection Groups, including configuring protection mappings (inventory mappings)

Recovery Plans

Testing

Inventory mappings are key. There are three main areas that can affect a failover to a recovery site: resource pools, networking, and folders. These mappings control where virtual machines connect and land when they are moved from the protected site to the recovery site. When mapping networks, exercise care, because network mappings do not confirm that the virtual machines will have the proper network connectivity when they fail over; it is possible to map to a port group on a non-routed internal network or on the wrong physical network. We must also configure placeholder virtual machines on one or more datastores at the recovery site. These VMs, used to reserve a place in the inventory of the recovery site, contain .vmx and .vmsd files; the .vmdk files are not present. Screenshots of each task required to configure complete site failover protection are shown below.
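To make the idea of inventory mappings concrete, here is a minimal sketch that models the three mapping areas as dictionaries and reports protected-site items that are unmapped or mapped to something that does not exist at the recovery site. It does not talk to SRM or vCenter; the function name, the data layout, and every inventory name in the example are hypothetical. As noted above, a mapping that passes this kind of check still says nothing about actual network connectivity.

```python
CATEGORIES = ("resource_pools", "networks", "folders")


def check_inventory_mappings(protected, recovery, mappings):
    """Return a list of problems found in the protected-to-recovery mappings.

    protected, recovery: dict of category -> list of inventory names per site
    mappings: dict of category -> {protected name: recovery name}
    """
    problems = []
    for category in CATEGORIES:
        mapped = mappings.get(category, {})
        for item in protected.get(category, []):
            target = mapped.get(item)
            if target is None:
                problems.append(f"{category}: '{item}' has no mapping")
            elif target not in recovery.get(category, []):
                problems.append(f"{category}: '{item}' maps to unknown '{target}'")
    return problems


# Hypothetical inventories for a protected and a recovery site.
protected_site = {
    "resource_pools": ["Prod-Pool"],
    "networks": ["VM Network", "DMZ"],
    "folders": ["Web"],
}
recovery_site = {
    "resource_pools": ["DR-Pool"],
    "networks": ["DR Network"],
    "folders": ["Web-DR"],
}
site_mappings = {
    "resource_pools": {"Prod-Pool": "DR-Pool"},
    "networks": {"VM Network": "DR Network", "DMZ": "DR-DMZ"},
    "folders": {},
}

for problem in check_inventory_mappings(protected_site, recovery_site, site_mappings):
    print(problem)
```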

Home Page Inside of vCenter > Select Site Recovery

Sites created during the install, renamed to cities: Minneapolis & Jacksonville

Configuring the Protection Group: select the datastore, placeholder, and VMs called Web Server

All the mappings and settings in the event of failover to recovery site

Configuring Recovery Plan for the Protection Group Web Servers

Network settings for Recovery Site under Recovery Plan

Note: Auto creates an isolated internal network for testing.

Running a test failover of the Recovery Plan

Steps in test execution: preparing storage and mounting at recovery site

Test was successful! Failover occurred and the Web server came up at the recovery site.

Troubleshooting SRM

In most cases, if there are issues with the SRM install and/or the SRA, I recommend starting from scratch and reinstalling the software. There are certain cases where the SRM service fails to start, or starts and then stops. This can be caused by loss of network connectivity to the database server or even database corruption. In most cases, if database problems are indicated by a failed installation and the service will not start, try restarting the SRM service. When SRM has problems detecting the VNX array, confirm that the x86 version of Solutions Enabler is installed, even if the host running SRM is 64-bit. Also, there is a documented issue when the word "name" appears in the name of any LUN used by the Storage Replication Adapter and SRM; the parser has a problem with LUN names that contain the word "name". For advanced options, you can force SRM to sync mirrors prior to failing over, when possible. Under Failover Plan, right-click Prepare Storage:

1. Click Add Message.
2. You will be prompted to add a message. You should see a reminder to synchronize the mirrors prior to the failover and to click continue in SRA after the synchronization finishes.
3. Click OK to continue.
4. When you run the Recovery Plan, you will be prompted to synchronize the relevant mirror.
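The LUN naming issue mentioned earlier in this section is easy to check for ahead of time. The sketch below flags LUN names that contain the word "name"; the function name and the sample LUN names are hypothetical, and matching case-insensitively is an assumption made to err on the safe side, since the article does not say whether the parser is case-sensitive.

```python
def luns_with_problem_names(lun_names, token="name"):
    """Return the LUN names that contain the token, ignoring case."""
    return [lun for lun in lun_names if token.lower() in lun.lower()]


# Hypothetical LUN names pulled from an array inventory.
candidates = ["Web_Datastore_01", "LUN_name_12", "SRM_Placeholder"]
for lun in luns_with_problem_names(candidates):
    print(f"Consider renaming: {lun}")
```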

Always gather the SRM logs when troubleshooting; the logs are located at C:\Documents and Settings\All Users\Application Data\VMware\VMware vCenter Site Recovery Manager\Logs
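Gathering the logs can be scripted so that a timestamped copy is captured before anything is reinstalled. The following is a minimal sketch using only the Python standard library; the function name `archive_logs`, the `srm-logs` label, and the choice of a zip archive are assumptions, and the default path is the log location given above.

```python
import shutil
import time
from pathlib import Path

# Log location as given in this article; adjust for your installation.
SRM_LOG_DIR = Path(
    r"C:\Documents and Settings\All Users\Application Data"
    r"\VMware\VMware vCenter Site Recovery Manager\Logs"
)


def archive_logs(log_dir=SRM_LOG_DIR, dest_dir=".", label="srm-logs"):
    """Zip the contents of log_dir into dest_dir and return the archive path."""
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        raise FileNotFoundError(f"log directory not found: {log_dir}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base_name = Path(dest_dir) / f"{label}-{stamp}"
    # make_archive appends ".zip" and returns the full archive file name.
    return Path(shutil.make_archive(str(base_name), "zip", root_dir=str(log_dir)))
```

Calling `archive_logs()` on the SRM server writes the archive to the current directory; pass `dest_dir` to place it elsewhere, but keep it outside the log directory itself.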

The best tip I can share from my experience with VMware SRM and EMC’s SRAs is to make sure all software is the latest and greatest! This can save you tons of time troubleshooting.

Conclusion

EMC’s ground-breaking VNX and VMware SRM provide an industry-leading, robust Information Lifecycle solution. These technologies complement each other well when configured properly using best practices. I sincerely hope that this Knowledge Sharing article will be a great asset to EMC Proven Professionals and the community in general.

References

EMC CLARiiON Integration with VMware ESX Server - White Paper

www.yellow-bricks.com/2009/08/11/srm-faq/ - Scott Lowe

Administering VMware Site Recovery Manager 5.0 - Mike Laverick

Techbook Using VNX Storage with VMware vSphere - EMC

Next Generation Best Practices for Storage and VMware - http://virtualgeek.typepad.com

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.