Brian Likosar, Sr. Solutions Architect
Dave Sullivan, Sr. Consultant
Red Hat
4-May-2011
PROVIDING HIGH AVAILABILITY
FOR ORACLE DATABASES
WITHOUT HIGH COST
Costs?
● Oracle RAC costs: $4600/core, plus $5060 for support
● RHEL: Resilient Storage Add-On: $799/socket pair
● Specific example: HP DL360 G7 (2-way, 4 cores ea.)
● Oracle RAC: $41,860 per year
● RHEL: $799 per year
● To be fair, RAC provides scalability as well
● Sources: hp.com, shop.oracle.com, and www.redhat.com/rhel/purchasing_guide.html
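The per-year figure for the 8-core example above can be sanity-checked with a quick calculation (this treats the $5060 support figure as a flat add-on, which is what the slide's total implies):

```shell
# Sanity-check of the slide's RAC cost example: 8 cores at $4600/core
# plus $5060 support, vs. the flat $799/socket-pair Resilient Storage Add-On.
cores=8
rac_per_year=$((cores * 4600 + 5060))
rhel_per_year=799
echo "Oracle RAC: \$${rac_per_year}/yr vs RHEL: \$${rhel_per_year}/yr"
```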
Items to discuss
● HA (Highly Available)
● Minimize unexpected downtime
● Reduce MTTR (Mean Time To Recovery)
● Redundancy Across The Board (No SPOFs!)
● Very customizable
● Automated Fail-over
Infrastructure Component View
[Diagram: two clustered Oracle DB nodes with dual-rail power and a fence device, connected via the Production, Remote MGT, Heartbeat/Fence, and Backup VLANs across an inter-switch link to the production network]
Highly Available Oracle on RHEL with HA-LVM
● Oracle Database
● Red Hat Enterprise Linux 6
● Red Hat Cluster Suite
● Controls Oracle Database
● Automates Fail-over
● HA-LVM (part of Resilient Storage add-on)
● Shared storage
● iSCSI
● SAN
RHEL OS & Cluster Component View
[Diagram: system software stack — RHEL6 OS running Oracle 11gR2, RHEL Multipath, and Red Hat Cluster Suite; cluster components shown: cman, qdiskd, fenced, corosync, clvmd (or HA-LVM), and rgmanager managing lvm, fs, vip, and oracle resources on LVM]
Red Hat Enterprise Linux
● Included in the OS:
● DM (device mapper) Multipath
● LVM (logical volume management)
● Ext4 (4th extended filesystem)
● Red Hat Cluster Suite components:
● corosync (previously openais/aisexec) #heartbeat
● cman - “Cluster Manager”
● clvmd - “Cluster Logical Volume Manager”
● qdiskd - “Quorum Disk”
● fenced - “I/O Fencing”
● rgmanager - “Resource Group Manager”
Cluster Logical Volume Manager (CLVMD)
● Daemon that runs on all cluster nodes and controls concurrent access to the same storage
● For our purposes – it's what prevents our logical volumes from being mounted on more than one system at a time
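On RHEL of this era, this locking mode is what `lvmconf --enable-cluster` switches on; the relevant lvm.conf setting looks like this (excerpt; a sketch, not a full configuration):

```
# /etc/lvm/lvm.conf (excerpt)
global {
    # 3 = built-in clustered locking via clvmd; every cluster node
    # must be running clvmd for LVM commands to succeed
    locking_type = 3
}
```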
HA-LVM w/o CLVMD
● Older way of doing HA-LVM
● Uses Volume “Tagging” Scheme to provide LVM Mutual Exclusion
● Provides a way to do LVM maintenance on a larger clustered set of LVMs (i.e., flipping tags on the particular VG)
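A sketch of the tagging scheme, with a hypothetical hostname and volume group name; the HA-LVM resource agent flips the tag with `vgchange --addtag`/`--deltag` during fail-over:

```
# /etc/lvm/lvm.conf (excerpt) -- hostname and VG names are examples
locking_type = 1
# Only activate VGs/LVs that are listed system volumes or carry
# this node's hostname as a tag:
volume_list = [ "rootvg", "@node1.example.com" ]
```

After editing volume_list, the initrd must be rebuilt so that early boot honors the restriction.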
Quorum Disk
● Adds complexity to cluster
● Typically utilized in even node clusters to act as tie-breaker in split-brain situations
● Prevents fence-loop and fence death situations
● Provides heuristics to determine “Health” of the cluster
● Provides all-but-one failure mode as well as others
● e.g. 4-node cluster where 3 nodes fail
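A cluster.conf sketch of a quorum disk with a single ping heuristic; the device path, label, gateway address, and timings below are invented for illustration:

```xml
<!-- cluster.conf fragment; disk first initialized with:
     mkqdisk -c /dev/mapper/quorum -l oraqdisk -->
<quorumd interval="1" tko="10" votes="1" label="oraqdisk">
  <!-- node remains eligible only while it can reach the gateway -->
  <heuristic program="ping -c1 -w1 10.15.183.1" score="1" interval="2" tko="3"/>
</quorumd>
```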
I/O Fencing
● Provides Countermeasure To Remove Misbehaving Or Dead Node From Shared Storage
● Most Critical Part Of Cluster That Utilizes Shared Storage (SAN/ISCSI)
● Protects Data From Corruption
● Node Kernel Panic
● Node Freezes
● Node Hangs
I/O Fencing
● Allows Nodes To Safely Assume Control Over Shared Resources In Network Partition Situations
● Fencing Types
● Power Fencing
● Normally allows for full automated recovery
● Reduction in MTTR
● SCSI & SAN Fabric Fencing
● Node normally requires manual reboot
● Allows for system troubleshooting
● May not require additional hardware
Resource Manager (rgmanager)
● Daemon that watches for running processes
● Coordinates resources and their startup order
● Can be observed via “clustat” and controlled via “clusvcadm”
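A hypothetical session sketch (the service and node names match the sample configuration later in this deck):

```
# clustat                                                 # show member and service state
# clusvcadm -r summitdemo -m botan.salab.dfw.redhat.com   # relocate the service to another node
# clusvcadm -d summitdemo                                 # disable (stop) the service
# clusvcadm -e summitdemo                                 # re-enable it
```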
MPIO Layer: Device-Mapper-Multipath
● Redundancy at I/O Fibre Layer
● MPIO Failover pre-rhel5u5:
● polling_interval=5 #interval check for failed paths (seconds); default 5s
● Normal path check: 5 * polling_interval (default 20s) #interval check for good paths
● MPIO Failover post-rhel5u5:
● polling_interval=5 #interval check for failed paths (seconds); default 5s
● checker_timeout #assuming set; otherwise pulled from /sys/block/sdX/device/timeout
● rc.local to make persistent
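The polling knobs above live in /etc/multipath.conf; a minimal excerpt (the values are the slide's examples, not recommendations):

```
# /etc/multipath.conf (excerpt)
defaults {
    polling_interval 5     # seconds between path checks
    # checker_timeout 15   # RHEL 5.5+; if unset, falls back to
    #                      # /sys/block/sdX/device/timeout
}
```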
Cluster Timeouts
● Understanding of cluster timeouts is critical
● HBA Device Timeout (lpfc/qlogic/etc.)
● e.g. modinfo lpfc #lpfc_devloss_tmo
● Multipath Failover Timings
● Quorum Disk Timeout
● Quorum Device Poll
● Cman Timeout
Cluster Timeout Matrix
Component            Variable                                                          Equation   Example
HBA timeout          lpfc_devloss_tmo (lpfc), qlport_down_retry (qlogic); default=30s  x          10s
Multipath timeout    checker_timeout (as of rhel5u5) or /sys/block/sdx/device/timeout  x + 5s     15s
Qdisk timeout        interval * tko                                                    x + 20s    30s
Quorum device poll   quorum_dev_poll                                                   x + 25s    45s
Cman timeout         token                                                             x + 50s    60s

(Timeline: eviction ----> cman tmo -->)
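The layering rule — each layer must wait longer than the one beneath it — can be sketched using the equation column, with x a hypothetical 10-second HBA timeout. (Note that applying x + 25s literally gives 35s for the quorum device poll, while the slide's example column shows 45s, so treat the equations as guidelines rather than exact formulas.)

```shell
# Timeout cascade per the matrix: each layer's timeout builds on the
# HBA device-loss timeout x, so lower layers always give up first.
x=10                   # hypothetical HBA timeout (seconds)
multipath=$((x + 5))   # checker_timeout
qdisk=$((x + 20))      # interval * tko
qpoll=$((x + 25))      # quorum_dev_poll
cman=$((x + 50))       # totem token
echo "hba=${x}s multipath=${multipath}s qdisk=${qdisk}s qpoll=${qpoll}s cman=${cman}s"
```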
LVM / FS Configuration examples
● lvmconf --enable-cluster
● chkconfig clvmd on; service clvmd start
● pvcreate /dev/disk
● vgcreate oraclevg /dev/disk
● lvcreate -n oradatalv -L 10G oraclevg
● mkfs -t ext4 /dev/oraclevg/oradatalv
Other items to assist with setup
● ip addr add 1.2.3.4/24 dev eth0
● Have DBA create database using “cluster” filesystems and configure listener against alias (VIP)
● Be sure that they do not create any “local” configurations – use spfile where appropriate.
● Once you're able to test everything manually on both sides of cluster, create the cluster configuration
Common Pitfalls
● Multicast support / IGMP snooping
● Partition table state inconsistent
● Review/modification of oracledb.sh script required
● Understanding Power I/O Fencing
● When To Use Quorum Disk
● If you have a simple two-node cluster with properly architected network power fencing, you don't need qdisk
● Not Validating SPOFs
● Run any service scripts outside your cluster first to validate
● Think echo $? #should return 0
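rgmanager judges resource scripts by their exit status, so the manual check boils down to this sketch (`/bin/true` stands in for a real start script such as oracledb.sh):

```shell
# Any service/resource script must exit 0 on success; validate it
# standalone before handing it to the cluster.
/bin/true              # stand-in for ./oracledb.sh start
status=$?
if [ "$status" -eq 0 ]; then
    echo "script OK for cluster use"
fi
```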
Cluster Configuration File

<?xml version="1.0"?>
<cluster config_version="20" name="summit2011">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="ayame.salab.dfw.redhat.com" nodeid="1" votes="1">
      <fence>
        <method name="fence_ayame">
          <device name="wti" port="7"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="botan.salab.dfw.redhat.com" nodeid="2" votes="1">
      <fence>
        <method name="fence_botan">
          <device name="wti" port="8"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_wti" ipaddr="10.15.183.249" login="root" name="wti" passwd="red22hat"/>
  </fencedevices>
  .....
Cluster Configuration File Cont'd ....

  <rm>
    <failoverdomains/>
    <resources>
      <lvm lv_name="misclv" name="misclv" vg_name="demovg"/>
      <lvm lv_name="datalv" name="datalv" vg_name="demovg"/>
      <lvm lv_name="redolv" name="redolv" vg_name="demovg"/>
      <fs device="/dev/demovg/misclv" fsid="54499" mountpoint="/oradata/oramisc" name="oramisc"/>
      <fs device="/dev/demovg/redolv" fsid="54499" mountpoint="/oradata/oraredo" name="oraredo"/>
      <fs device="/dev/demovg/datalv" fsid="54499" mountpoint="/oradata/oradata" name="oradata"/>
      <ip address="10.15.183.183" monitor_link="off" sleeptime="30"/>
      <oracledb home="/home/oracle/product/11.2.0" listener_name="listener" name="summit" type="base" user="oracle" vhost="orcl.salab.dfw.redhat.com"/>
    </resources>
    <service autostart="1" exclusive="0" max_restarts="3" name="summitdemo" recovery="restart" restart_expire_time="90">
      <lvm ref="misclv"/>
      <lvm ref="datalv"/>
      <lvm ref="redolv"/>
      <fs ref="oramisc"/>
      <fs ref="oradata"/>
      <fs ref="oraredo"/>
      <ip ref="10.15.183.183"/>
      <oracledb ref="summit"/>
    </service>
  </rm>
</cluster>
Helpful links
● HA-LVM implementation: https://access.redhat.com/kb/docs/DOC-3068
● Multicast notes: https://access.redhat.com/kb/docs/DOC-5933
● Timing with qdisk: https://access.redhat.com/kb/docs/DOC-2882
Supplemental information for this presentation
● Will be available at http://people.redhat.com/~blikosar/
● Brian can be reached at [email protected] (don't expect rapid response, he's a busy guy!)