INTEGRATING VMWARE SITE RECOVERY MANAGER WITH VNX MIRRORVIEW/A
Jason L. Gates
Systems Engineer - Storage & Virtualization
Presidio Networked [email protected]
www.linkedin.com/in/mrjasongates
2012 EMC Proven Professional Knowledge Sharing 2
Table of Contents
Audience
Overview
What is VMware Site Recovery Manager?
SRM Architecture
VNX MirrorView/A Replication Overview
Delta Set and Gold Copy
Reserve LUN Pool Recommendations for Performance and Sizing
MirrorView Replication Link Best Practices and Settings
Installing SRM
Configuring SRM
Troubleshooting SRM
Conclusion
References
Disclaimer: The views, processes, or methodologies published in this article are those of the
author. They do not necessarily reflect EMC Corporation’s views, processes, or methodologies.
Audience
This article is intended for system administrators, VMware engineers, and storage
administrators. Readers are assumed to be familiar with VMware ESX/ESXi hosts, basic
CLARiiON® system operations, fabric switches, basic networking, vCenter Server, EMC
Navisphere®/Unisphere® Manager, and configuring EMC MirrorView replication.
Overview
This Knowledge Sharing article walks through the VMware Site Recovery Manager (SRM)
workflow that must be completed to allow successful, automated service failover
from the designated SRM protected site to the designated SRM recovery site using EMC
MirrorView technology. It also provides an overview of the considerations and guidance
for executing a failover of services to the recovery site and a failback to the original
production site.
VMware SRM provides business continuity and disaster recovery protection for virtual
environments. MirrorView/A, EMC’s leading remote replication suite, is ideal to provide disaster
recovery solutions for the virtual environment running on the VNX®. I will discuss best practices
when using MirrorView with SRM and common pitfalls that can cause multiple issues on the
VNX, including performance, storage processor utilization problems, and oversaturation of the
replication link.
Additionally, step-by-step instructions are provided on how to administer and troubleshoot
replication and disaster recovery on the world’s number one mid-range storage sub-system and
VMware’s leading virtualization software.
What is VMware Site Recovery Manager?
Site Recovery Manager is an end-to-end disaster recovery (DR) automation product. It is
designed to protect virtual machines residing in datastores on replicated storage. In the event of
a disaster or complete site failure, virtual machines can be failed over to a remote data
center. SRM is targeted for use with array-based replication, although VM-level replication is
possible with SRM 5.0. In this article, however, we will discuss VNX MirrorView-based replication.
A customer once asked me to explain the major difference between VMware HA and SRM.
Basically, VMware HA is a clustering product designed to provide intrasite fault tolerance. HA is
designed to power on virtual machines on surviving local cluster members in the event of a
hardware failure; it has no capability to power on virtual machines outside the local HA cluster.
SRM, by contrast, fails virtual machines over to a separate recovery site.
SRM Architecture
To fully appreciate and understand SRM, we must know the following components, which make
the product work in your environment:
- SRM Server Software: Software that is installed on a separate machine or on the
  vCenter Server; I recommend the vCenter Server for smooth integration
- Storage Replication Adapter Software Package (EMC VNX Replication Adapter)
- VMware vCenter
- SRM Database (SQL or SQL Express)
- SAN replication (MirrorView or RecoverPoint)
- SRM Licenses
There are also some key terms used with the SRM product:
- Protected Site: The data center containing the virtual machines whose data is being
  replicated to the recovery site
- Recovery Site: The data center where the virtual machines are recovered in case of
  disaster
- Protection Group: Replicated datastores containing a set of VMs that are protected with
  SRM
- Inventory Mapping: Mapping between resource pools, networks, and virtual machine
  folders on the protected site and their destinations on the recovery site
- Recovery Plan: SRM's version of a runbook
High Level Overview Diagram of SRM Workflow
VNX MirrorView/A Replication Overview
MirrorView is a VNX business continuity solution that provides block-level replication. The copy
of the data on the production VNX is called the primary image and the copy at the recovery site
is called the secondary image. The design goal of MirrorView is to allow speedy recovery from a
disaster. To accomplish this, MirrorView/A replicates over low-cost, long-distance connectivity
using an asynchronous, interval-based update mechanism, which I will explain in detail.
Topologies include direct connect, SAN connect, and WAN connect. Because replication is
asynchronous, data on the secondary image is rarely identical to the primary at any given
moment. So how does MirrorView handle this? The answer lies in VNX's data migration software,
SAN Copy! Internally, the SAN Copy delta set mechanism is used to track changes to the primary
image and ship those changes to the secondary image at the defined update interval. In
addition, a gold copy (a protective snapshot of the secondary) is captured, which guarantees
that the data can revert to a previous known good state in the event of a failure during the
update cycle.
High Level Diagram of MirrorView/A Configuration (figure labels: Protected Site with VMs,
Storage Pool, and Delta Set; MirrorView/A over the WAN; Recovery Site with VMs, Storage Pool,
and Gold Copy)
MirrorView/A Work Flow
1. I/O write is received from the server into the primary array
2. Acknowledgement is sent to the server
3. Point-in-time "gold copy" of the secondary is created to protect secondary data during
   Delta Set transport
4. Delta Set is created on the primary
5. Delta Sets are applied to the secondary mirror, the gold copy is removed, and the Delta Set
   is cleared for the next update cycle
Delta Set and Gold Copy
During my years working as Level 2 CLARiiON® support, I was asked many times to explain
Gold Copy and Delta Set features in depth during the replication phase and when
troubleshooting live production problems in many different environments. Needless to say, I’ve
seen it all in regard to replication and configuring the product. Below is an in-depth explanation
of Gold Copy and Delta Set.
Gold Copy: The Gold Copy tracks all of the updates. When a region on the secondary is
updated, the original region is copied to the reserved LUN pool to preserve a consistent
point-in-time view of the secondary LUN at the time of an update. This is a key feature that
always ensures a consistent view of the secondary LUN. If the update from primary to secondary
is interrupted due to a link failure or a failure at the primary site, the Gold Copy is used by
the MirrorView/A software to roll the secondary back to its previous consistent state.
Delta Set: MirrorView/A uses asynchronous writes, which means that I/O is not sent to the remote
site at the same time as the host I/O. The Delta Set is created and changes are tracked during a
MirrorView/A replication cycle. MirrorView/A replicates only the blocks that changed during the
replication cycle, resulting in a lower bandwidth requirement than synchronous replication. The
Delta Set is a local snap taken at the source side at the time of replication.
MirrorView - Work Flow Diagram During Update (figure: the Primary Image holds blocks A B C; a
host write changes block C to E, giving A B E; the Delta Tracking Map and Delta Transfer Map
carry the change to the Secondary Image, which becomes A B E, while the Gold Copy snapshot
preserves A B C)
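The update cycle described above can be illustrated with a small simulation. This is only a toy sketch of the idea, not how MirrorView/A is implemented internally: the class name `AsyncMirror`, its methods, and the `fail_after` parameter are all hypothetical, and blocks are modeled as list elements.

```python
# Toy model of an interval-based asynchronous mirror with a delta set and a
# gold copy. All names are hypothetical; this is not MirrorView code.

class AsyncMirror:
    def __init__(self, blocks):
        self.primary = list(blocks)
        self.secondary = list(blocks)
        self.tracking = set()  # indexes changed since the last update started

    def host_write(self, index, data):
        # The write completes locally; only the changed index is recorded.
        self.primary[index] = data
        self.tracking.add(index)

    def update(self, fail_after=None):
        """Run one update cycle. fail_after simulates a link failure after
        that many blocks have been shipped. Returns True on success."""
        transfer = sorted(self.tracking)  # what this cycle will ship
        self.tracking = set()             # later writes go to a fresh map
        snapshot = {i: self.primary[i] for i in transfer}
        gold = {}                         # original secondary regions (COFW)
        for sent, i in enumerate(transfer):
            if fail_after is not None and sent >= fail_after:
                # Roll the secondary back to its last consistent state.
                for j, old in gold.items():
                    self.secondary[j] = old
                # Everything in this cycle must be resent next time.
                self.tracking.update(transfer)
                return False
            gold.setdefault(i, self.secondary[i])
            self.secondary[i] = snapshot[i]
        return True


if __name__ == "__main__":
    mirror = AsyncMirror(["A", "B", "C"])
    mirror.host_write(2, "E")      # host changes block C to E
    print(mirror.secondary)        # secondary still holds the old data
    print(mirror.update())         # ship the delta set
    print(mirror.secondary)        # secondary now matches the primary
```

Passing `fail_after=1` to `update()` after two further writes shows the rollback path: the secondary returns to its pre-update contents and the changed blocks stay queued for the next cycle.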
Reserve LUN Pool Recommendations for Performance and Sizing
The reserved LUN pool (RLP) configuration is key for performance and for accommodating the
host(s) accessing the source LUN in the mirror pair. The anticipated duration of the
MirrorView/A update will depend on the amount of data that must be transferred, as well as the
transfer rate and available bandwidth. The sync rates for mirrors are high, medium, and low. I
recommend increasing the sync rate to high (default is medium), which will speed up the data
transfer and reduce the copy-on-first-write (COFW) activity that occurs on the source LUN. Why?
Because the pointer-and-copy design of snapshots can affect source LUN performance. When
snapshot data that has not changed on the source volume is accessed, reads to the snapshot
hit the same disks or spindles as reads to the source volume. Since COFW requires data to be
read from and written to the reserved LUN pool, the reserved LUN pool can become overloaded,
resulting in disk latencies, if the configuration is not optimal. Below are my recommendations
for the RLP:
- Try to avoid the vault disks: 0_0_0 - 0_0_4
- NL-SAS drives are not recommended due to their lower performance; if the host is
  writing at a high rate, this will result in heavy COFW activity
- Load balance the RLP between storage processors
- Dedicate RAID Group(s) for the RLP if possible to increase spindle count
- Use RAID 5 for protection; this tends to be a good general purpose RAID type
- RLP LUNs should not be Thin-enabled or in a Storage Pool
- RLP LUNs should not share the same drives as the source LUNs
- The RLP LUN size should always be 15-20% of the source LUN being replicated
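The 15-20% sizing guideline above can be turned into a quick calculation. The following is a minimal sketch; the function name, the dictionary layout, and the example LUN names and sizes are made up for illustration.

```python
# Sketch: apply the 15-20% RLP sizing guideline to a set of source LUNs.
# LUN names and sizes are illustrative only.

def rlp_size_range_gb(source_luns_gb, low=0.15, high=0.20):
    """Return {lun_name: (min_gb, max_gb)} plus a 'TOTAL' entry."""
    result = {}
    for name, size_gb in source_luns_gb.items():
        result[name] = (size_gb * low, size_gb * high)
    total = sum(source_luns_gb.values())
    result["TOTAL"] = (total * low, total * high)
    return result


if __name__ == "__main__":
    luns = {"web_datastore_01": 500, "web_datastore_02": 1000}
    for name, (lo, hi) in rlp_size_range_gb(luns).items():
        print(f"{name}: {lo:.0f}-{hi:.0f} GB of reserved LUN capacity")
```

The range is a starting point; hosts with a high write rate generate more COFW activity and may need capacity toward the upper end.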
MirrorView Replication Link Best Practices and Settings
Performance issues and resets of replication links (iFCP/FCIP) happen, and there can be
many reasons for this. There are general recommendations that increase throughput and overall
reliability; these are not specific to any type or model of SAN router/IP device, so please
refer to your OEM guides. From my experience, when mirrors fracture during replication cycles,
the culprit 85% of the time is the link between sites. There are some minor tweaks and settings
I recommend that are standard in SAN router/IP devices:
- Available WAN bandwidth for I/O transfer should be equal to a T3 or higher, if
  possible.
- Confirm that Fast Write is enabled; Fast Write mitigates latency effects for SCSI write
  operations. Fast Write enables the entire data segment of a SCSI write operation to
  be transported across the link between the initiator and target without the
  inefficiencies of waiting for the transfer ready (FCP_XFER_RDY) commands to travel
  back and forth across the link.
- Enable compression, and confirm that the IP port speeds match the bandwidth available.
- Increase the TCP window scaling size.
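To see why the link bandwidth and the TCP window matter, a rough estimate helps. The sketch below assumes a T3 line rate of 44.736 Mbit/s and the 65,535-byte limit of a TCP window without window scaling; the function names, the `efficiency` factor, and the example delta size and round-trip time are assumptions chosen for illustration, and real throughput will be lower once protocol overhead is included.

```python
# Back-of-the-envelope link estimates for an asynchronous update cycle.
# Example inputs are illustrative, not measurements.

T3_BITS_PER_SECOND = 44.736e6     # nominal T3/DS3 line rate
MAX_UNSCALED_TCP_WINDOW = 65_535  # bytes, without TCP window scaling

def transfer_seconds(changed_bytes, link_bps, efficiency=1.0):
    """Time to ship changed_bytes at link_bps; efficiency (0-1] models overhead."""
    return changed_bytes * 8 / (link_bps * efficiency)

def bandwidth_delay_product_bytes(link_bps, rtt_seconds):
    """Bytes that must be in flight to keep the link full."""
    return link_bps * rtt_seconds / 8


if __name__ == "__main__":
    delta_bytes = 20 * 1000**3  # assume 20 GB changed since the last update
    hours = transfer_seconds(delta_bytes, T3_BITS_PER_SECOND, efficiency=0.8) / 3600
    print(f"Estimated update duration: {hours:.2f} hours")

    bdp = bandwidth_delay_product_bytes(T3_BITS_PER_SECOND, rtt_seconds=0.050)
    print(f"Bandwidth-delay product at 50 ms RTT: {bdp:.0f} bytes")
    if bdp > MAX_UNSCALED_TCP_WINDOW:
        print("An unscaled TCP window cannot fill this link; window scaling is needed.")
```

If the estimated duration exceeds the update interval, the delta set keeps growing between cycles, which is one way the replication link becomes oversaturated.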
Also run this command: navicli -h <SP_IP_address> port -diagnose -sancopy -clean. This will
clean up old SAN Copy connections on the MirrorView ports, which helps throughput and the
internal login table of the VNX.
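If the cleanup needs to be run against both storage processors, the command above can be wrapped in a small script. This is a sketch only: the command arguments are taken verbatim from this article, while the function names, the `dry_run` switch, and the placeholder addresses are hypothetical. By default the script only prints what it would run.

```python
# Sketch: run the SAN Copy connection cleanup from this article against
# several storage processors. Addresses below are placeholders.
import subprocess

def sancopy_clean_command(sp_ip):
    """Build the navicli argument list quoted in the article for one SP."""
    return ["navicli", "-h", sp_ip, "port", "-diagnose", "-sancopy", "-clean"]

def clean_sancopy_connections(sp_ips, dry_run=True):
    """Return the command lines (dry run) or the captured output per SP."""
    results = []
    for ip in sp_ips:
        cmd = sancopy_clean_command(ip)
        if dry_run:
            results.append(" ".join(cmd))
        else:
            done = subprocess.run(cmd, capture_output=True, text=True, check=False)
            results.append(done.stdout)
    return results


if __name__ == "__main__":
    for line in clean_sancopy_connections(["<SPA_IP_address>", "<SPB_IP_address>"]):
        print(line)
```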
Hardware Configuration for Testing
Hardware Type
HP ProLiant BL 460c G6
12 CPUs
VNX 5300 & NS-120 FLARE 30
VMware vSphere 5 Enterprise
1 Server each site
Cisco MDS Switch 9124
Application: Apache Web Server
Installing SRM
SRM software can be downloaded from the VMware website. Some screen shots of the install
process are shown below. The install is fairly simple and similar to installing vCenter.
Adding SRM to vCenter Server at Protected/Production Site
Install Complete and Install Plug-In
Configuring SRM
Once MirrorView replication has been configured and the SRM software installed, we are ready to
build the DR runbook to be used in case of failure. By runbook I mean the actual configuration
of the failover scenario. The tasks inside SRM break down as follows:
- Array Manager Configuration
- Protection Groups, including configuring protection mappings (inventory mappings)
- Recovery Plans
- Testing
Inventory mappings are key. There are three main areas that can affect a failover to a recovery
site: resource pools, networking, and folders. These mappings control where virtual machines
connect and land when they are moved from the protected site to the recovery site. When
mapping networks, exercise care because network mappings do not confirm that the virtual
machines will have the proper network connectivity when they fail over; it is possible to map a
port group to a non-routed internal network or the wrong physical network. We must also
configure placeholder virtual machines on one or more datastores at the recovery site. These
VMs, used to reserve a place in the inventory of the recovery site, contain .vmx and .vmsd
files. The .vmdk files are not present. Screen shots of each task to configure complete site
failover protection are shown below.
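Because SRM does not verify that a mapping is sensible, it can help to review the three mapping areas in a simple checklist form before testing. The sketch below works on plain dictionaries typed in by hand; it does not call any SRM or vCenter API, and the function name and every inventory name in the example are hypothetical.

```python
# Sketch: find protected-site objects that have no recovery-site mapping.
# Works on hand-entered data; no SRM or vCenter API is used.

def unmapped_items(protected_inventory, mappings):
    """Return {kind: [names]} for objects without a recovery-site target."""
    missing = {}
    for kind, names in protected_inventory.items():
        mapped = mappings.get(kind, {})
        gaps = [name for name in names if not mapped.get(name)]
        if gaps:
            missing[kind] = gaps
    return missing


if __name__ == "__main__":
    protected = {
        "resource_pools": ["Prod-Pool"],
        "networks": ["VM Network", "DMZ"],
        "folders": ["Web Servers"],
    }
    mappings = {
        "resource_pools": {"Prod-Pool": "DR-Pool"},
        "networks": {"VM Network": "DR VM Network"},
        "folders": {"Web Servers": "Web Servers"},
    }
    print(unmapped_items(protected, mappings))  # reports the unmapped DMZ network
```

A check like this only catches missing mappings; whether a mapped port group is routed correctly still has to be confirmed with a test failover.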
Home page inside vCenter > select Site Recovery
Sites created during the install; sites renamed to cities, Minneapolis and Jacksonville
Configuring the Protection Group called Web Server: select the datastore, placeholder, and VMs
All the mappings and settings in the event of failover to recovery site
Configuring Recovery Plan for the Protection Group Web Servers
Network settings for Recovery Site under Recovery Plan
Note: Auto creates an isolated internal network for testing.
Running a test failover of the Recovery Plan
Steps in the test: executing, preparing storage, and mounting at the recovery site
Test was successful! Failover occurred and the Web server came up at the recovery site.
Troubleshooting SRM
In most cases, if there are issues with the SRM install and/or the SRA, I recommend starting
from scratch and reinstalling the software. There are certain cases where the SRM service fails
to start, or starts and then stops. This can be caused by loss of network connectivity to the
database server or even database corruption. If database problems are indicated by a failed
installation and the service will not start, first try restarting the SRM service. When SRM has
problems detecting the VNX array, confirm that the x86 version of Solutions Enabler is installed
even if the host running SRM is 64-bit. Also, there is a documented issue when the word "name"
appears in the name of any LUN used by the Storage Replication Adapter and SRM: the
parser mishandles LUN names with the word "name" in them. For advanced options, you can
force SRM to sync mirrors prior to failing over, when possible. Under Failover Plan, right-click
Prepare Storage:
1. Click Add Message.
2. You will be prompted to add a message. You should see a reminder to synchronize the
mirrors prior to the failover and to click continue in SRA after the synchronization finishes.
3. Click OK to continue.
4. When you run the Recovery Plan, you will be prompted to synchronize the relevant
mirror.
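The LUN naming issue mentioned above is easy to screen for ahead of time. The following is a minimal sketch that flags LUN names containing the word "name"; the function name and example LUN names are made up, and treating the match as case-insensitive is an assumption made to err on the side of caution.

```python
# Sketch: flag LUN names containing the word "name", which the article
# reports as a problem for the SRA parser. Example names are illustrative.

def luns_with_reserved_word(lun_names, word="name"):
    """Return the LUN names that contain word (case-insensitive, an assumption)."""
    return [lun for lun in lun_names if word in lun.lower()]


if __name__ == "__main__":
    luns = ["SRM_Web_01", "Hostname_LUN", "RLP_LUN_0"]
    for lun in luns_with_reserved_word(luns):
        print(f"Consider renaming: {lun}")
```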
Always gather the SRM logs when troubleshooting; the logs are located at:
C:\Documents and Settings\All Users\Application Data\VMware\VMware vCenter Site Recovery Manager\Logs
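Gathering the logs into a single archive makes them easier to attach to a support case. The sketch below uses only the Python standard library; the function name and the output file name are hypothetical, the log directory is the one given above, and the archive should be written outside the log directory so it does not include itself.

```python
# Sketch: bundle a log directory into one zip archive for a support case.
import zipfile
from pathlib import Path

def archive_logs(log_dir, archive_path):
    """Zip every file under log_dir; return the number of files added."""
    log_dir = Path(log_dir)
    count = 0
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(log_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the log directory.
                zf.write(path, str(path.relative_to(log_dir)))
                count += 1
    return count


if __name__ == "__main__":
    logs = (r"C:\Documents and Settings\All Users\Application Data"
            r"\VMware\VMware vCenter Site Recovery Manager\Logs")
    added = archive_logs(logs, "srm_logs.zip")  # hypothetical output name
    print(f"Archived {added} log files")
```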
The best tip I can share from my experience with VMware SRM and EMC's SRAs is to
make sure all software is at the latest available version. This can save you a great deal of
troubleshooting time.
Conclusion
EMC’s ground-breaking VNX and VMware SRM provide an industry-leading, robust Information
Lifecycle solution. These technologies complement each other well when configured properly
using best practices. I sincerely hope that this Knowledge Sharing article will be a great asset to
EMC Proven Professionals and the community in general.
References
EMC CLARiiON Integration with VMware ESX Server - White Paper
www.yellow-bricks.com/2009/08/11/srm-faq/ - Scott Lowe
Administering VMware Site Recovery Manager 5.0 - Mike Laverick
Techbook Using VNX Storage with VMware vSphere - EMC
Next Generation Best Practices for Storage and VMware - http://virtualgeek.typepad.com
EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION
MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO
THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an
applicable software license.