Troubleshooting of SIB Messaging Engine Failover Problems ...

36
Click to add text IBM Software Group ® WebSphere ® Support Technical Exchange Troubleshooting of SIB Messaging Engine Failover Problems in a Clustered Environment Jhansi Kolla ([email protected] ) Ty Shrake ([email protected] ) 8 th April 2015

Transcript of Troubleshooting of SIB Messaging Engine Failover Problems ...

Click to

add text

IBM Software Group

®

WebSphere® Support Technical Exchange

Troubleshooting of SIB Messaging Engine Failover Problems in a Clustered Environment Jhansi Kolla ([email protected]) Ty Shrake ([email protected]) 8th April 2015

IBM Software Group

WebSphere® Support Technical Exchange 2

Agenda

What are the different possible SIB configurations in clustering environment?

What are the pros and cons of each configuration?

The importance of core group policies in SIB ME clusters.

Important steps in SIB configuration when using clusters.

Reasons for ME failovers.

Why and how does an ME failover to a different server?

Troubleshooting of ME failover issues.

Summary

IBM Software Group

WebSphere® Support Technical Exchange 3

Different possible SIB configurations in clustering environment.

You can have more than one ME, depending on the policy you choose.

High Availability (HA) – One ME in the entire cluster with Failover. (failback is optional)

Scalability / Work load management - Multiple messaging engines/ one ME dedicated to each

cluster member with no failover.

High Availability + Scalability - Multiple messaging engines/ one ME dedicated to each cluster

member with failover. (failback is optional)

Custom - (Use this policy when the other options of High availability, Scalability, or Scalability with high

availability do not provide the messaging engine behavior you require, and you are familiar with creating

messaging engines and configuring messaging engine policy settings.)

IBM Software Group

WebSphere® Support Technical Exchange 4

High Availability configuration pros and cons: High Availability

If you configure your environment with a High Availability policy it creates only one ME in that

cluster. This single ME is able to start on any JVM™'s in this cluster depending on how you

configure the core groups like preferred servers. By default the ME starts on the first server

in the cluster to start. The ME is able to failover to any JVM in that cluster. You can also

setup failback to failback the ME if the preferred server is back up and running.

Pros:

The messaging engine is highly available. If the messaging engine or the server on which it runs fails, then the messaging

engine starts up on another available server in the cluster.

You can set the preferred servers where you want to give preference to some servers to run the ME.

You can set the ME failback criteria if preferred server is back.

All messaging functionality is in one place for low traffic scenarios.

Application is able to connect to any SIB bootstrap server to send or consume messages.

Cons:

If you set the “Preferred servers only” option in the core groups panel for this ME policy then, if no preferred server is

available, the ME won't start.

If you set only one preferred server and setup for the “Preferred servers only” and the ME is running on this preferred server

and if catastrophic error condition occurs on this preferred server the ME is not able to failover to another server in that ME

cluster, because you set it up for ME run on preferred server only and there are no more preferred servers available.

No WLM option. Only one ME to handle all messaging functions in that cluster. Work load can't be shared.

IBM Software Group

WebSphere® Support Technical Exchange 5

High Availability configuration

IBM Software Group

WebSphere® Support Technical Exchange 6

High Availability configuration

When a server that is hosting a messaging engine fails, the messaging engine is

activated and run on another server. The messaging engine is configured to fail over to

any of the application servers in the cluster. All the application servers in the cluster are

added to the preferred servers list, and this list determines the order in which the servers

are used for failover. The earlier the server in the preferred servers list, the stronger the

preference for that server to run the ME.

IBM Software Group

WebSphere® Support Technical Exchange 7

One messaging engine is created for each application server in the cluster. The

messaging engines cannot fail over.

Pros:

One messaging engine dedicated to one server in the cluster.

Destinations are partitioned across the messaging engines. The messaging engines can share all traffic

passing through the destination, reducing the impact of one messaging engine failing.

Greater throughput of messages by the application.(spreading the messaging load across multiple servers,

and optionally across multiple hosts.)

Performance improvement.

Better accessibility for the application. (local ME access to the application).

Cons:

Message order is no longer be guaranteed due to the workload balancing of messages across queue points.

Default scalability option don't provide the failover of the messaging engine.

Scalability configuration pros and cons:

IBM Software Group

WebSphere® Support Technical Exchange 8

Scalability configuration

IBM Software Group

WebSphere® Support Technical Exchange 9

Scalability configuration

The scalability policy ensures that there is a messaging engine for each server in a cluster.

Messaging engines runs on its dedicated servers. No failover. If server is down that specific ME

is also down. ME's run on only preferred servers.

IBM Software Group

WebSphere® Support Technical Exchange 10

High Availability + Scalability configuration pros and cons

This policy provides one ME per JVM with failover option.

Pros:

Improved performance.

Failover option

Better access to the application due to local ME to the application.

Able to configure the failback to the preferred servers.

Work load management.

Cons:

Message order is no longer be guaranteed due to the workload balancing of messages

across queue partitions.

A little more complex than other options

IBM Software Group

WebSphere® Support Technical Exchange 11

High Availability + Scalability configuration

IBM Software Group

WebSphere® Support Technical Exchange 12

High Availability + Scalability configuration

In this example there are three servers, each on a separate node, and three messaging

engines. Each messaging engine has a preferred server and one other server it can use

for failover. Each server is the preferred host for one messaging engine, and the failover

host for one other messaging engine.

IBM Software Group

WebSphere® Support Technical Exchange 13

High Availability + Scalability messaging view

IBM Software Group

WebSphere® Support Technical Exchange 14

Supported Core Group policies in ME clustering:

Core group policies we support:

a) Static.

b) One of N - with no preferred servers ( This is the Default SIBus Policy)

c) One of N - with preferred servers

d) One of N - with preferred servers and the Fail back setting

e) One of N - with preferred servers and the Preferred servers only setting

f) No operation

You cannot use the "All active" or "M of N" policy types, because they are not supported for the

service integration bus component. (any ME can run on only one JVM at any specific time)

IBM Software Group

WebSphere® Support Technical Exchange 15

Policy setting for ME clusters

IBM Software Group

WebSphere® Support Technical Exchange 16

How to find out where ME is running in the cluster:

Servers > Core groups > Core group settings > Name of the Core group > RunTime tab >

Click the Show groups > High Availability group name >

IBM Software Group

WebSphere® Support Technical Exchange 17

ME lost its DataStore connection and is unable to the get a new lock

on the Datastore after failover

ME is running on one of the cluster members and suddenly it loses the database connection

due to either network issues or the database is hung/shutdown or any other reason.

ConnectionEve A J2CA0056I: The Connection Manager received a fatal connection error from the Resource Adapter for

resource jdbc/<data source name>.

CWSIS1594I: The messaging engine, ME_UUID=<ME_UUID>, INC_UUID=< INC_UUID>, has lost the lock on the data store.

CWSIS1519E: Messaging engine <ME_NAME> cannot obtain the lock on its data store, which ensures it has exclusive

access to the data.

CWSID0046E: Messaging engine <ME_NAME>_bus detected an error and cannot continue to run in this server.

HAGroupImpl I HMGR0130I: The local member of group IBM_hc=<cluster_name>, WSAF_SIB_BUS=<bus name>,

WSAF_SIB_MESSAGING_ENGINE=<ME_NAME>,type=WSAF_SIB has indicated that is it not alive. The JVM will be

terminated. (only in WAS7.0 and 8.0)

SystemOut O Panic:component requested panic from isAlive (only in WAS7.0 and 8.0)

SystemOut O java.lang.RuntimeException: emergencyShutdown called: (only in WAS7.0 and 8.0)

(JVM termination is default behavior in WAS 7.0 / 8.0. In WAS 8.5 and above levels the JVM is not terminated and the ME is

disabled in this JVM. Other applications continue to run. Also note that the new “Restrict long running locks ” feature in WAS

8.5 gets around this locking problem.)

JDBC™

IBM Software Group

WebSphere® Support Technical Exchange 18

sib.msgstore.jdbcFailoverOnDBConnectionLoss

In order to maintain the data integrity in WAS 7.0 and above we introduced this custom parameter:

sib.msgstore.jdbcFailoverOnDBConnectionLoss with the default value as TRUE.

By setting the sib.msgstore.jdbcFailoverOnDBConnectionLoss custom property on a messaging

engine, you can control the behavior of the messaging engine and its hosting server in the event that

the connection to the data store is lost.

If this parameter is set to TRUE (the default value) then HA manager stops the messaging engine

and its hosting application server when the next core group service “Is Alive” check takes place (the

default value is 120 seconds). You will also see a “Panic” message in the SystemOut.log and the

hosting JVM is terminated (which also stops the ME).

If this parameter is set to FALSE then the messaging engine continues to run and accept work, and

periodically attempts to regain the connection to the data store. If work continues to be submitted to

the messaging engine while the data store is unavailable, the results can be unpredictable, and the

messaging engine can be in an inconsistent state when the data store connection is restored. JMS™

messages could also be lost.

ME lost the DataStore connection and is unable to the get the lock

after failover

IBM Software Group

WebSphere® Support Technical Exchange 19

ME lost the DataStore connection and is unable to the get the lock

after failover At this point the HMGR will failover the ME to the other cluster member if its available based on the core group

policies configuration.

From the ME side it lost the lock on the data store, but on the data store (database) side the lock is still alive.

IBM Software Group

WebSphere® Support Technical Exchange 20

ME lost the DataStore connection and is unable to the get the lock

after failover

HMGR starts the ME in the some other available server based on the core group policies

configuration.

This new instance of the ME tries to get a new lock on the data store. If the attempt fails it will continue

attempting for 15 minutes. If the messaging engine is unable to acquire the lock after 15 minutes it will

become disabled. (In WAS 8.5 the ME will be automatically re-enabled)

In the case of an abnormal shutdown the lock on the data store and OS socket are not removed in the

DB server. Keep in mind that the database, not WAS, owns the database lock.

CWSIS1538I: The messaging engine, ME_UUID=<ME_UUID>, INC_UUID=<INC_UUID>,, is attempting to obtain an exclusive

lock on the data store.

CWSIS1593I: The messaging engine, ME_UUID=<ME_UUID>, INC_UUID=<INC_UUID>, has failed to gain an initial lock on the

data store.

Above pattern continues for 15 minutes if the ME is unable to get lock on the data store. After 15

minutes HMGR disables the ME in this server and moves the ME to next available server in that

cluster. The root cause of this locking issue is an orphaned lock that still exists on the SIBOWNER

table and OS socket is still open.

IBM Software Group

WebSphere® Support Technical Exchange 21

ME lost the dataStore connection and unable to the get the lock after

failover

The reason for the failure is that when the disconnection originally occurred the messaging engine became aware

of this fact but the database did not. So while the messaging engine is attempting to reacquire its data store lock on

the SIBOWNER table the database is still holding both the lock and the socket from the previous connection before

the unexpected disconnect occurred. As long as this socket and data store lock remain in place the new

instance of messaging engine will not be able to acquire a new lock and resume messaging functions.

Old (orphaned) data store lock will not be released by the database until the socket connection to the database is

first cleaned up. Once the socket is cleaned up the lock will be released by the database and the messaging engine

will be able to acquire a fresh lock, start the ME and resume its messaging functions.

There are 3 solutions for this problem.

1) DB Admin can manually remove the lock from the SIBOWNER table.

2) Restart the DB server. (Most of the time this is not a good option because many applications are being

used the same DB server. So this is not feasible for most people)

3) Configure the database server to clean the stale/orphaned sockets. (no manual intervention)

IBM Software Group

WebSphere® Support Technical Exchange 22

Solution for this problem is TCP KeepAlive If you are running WAS 7 or 8 then TCP KeepAlive is a good solution to this problem. KeepAlive is a feature of TCP

that has 2 main benefits: It can detect that a connection to a peer is no longer valid and it can prevent a network

connection from being terminated due to inactivity.

KeepAlive periodically checks for broken connections (stale/orphaned sockets). The default OS level KeepAlive

setting is 2 hours. SoKeepAlive will check the socket connection after 2 hours of inactivity. If the connection is stale

KeepAlive will clean up the socket. In order to cleanup this stale socket connection in a timely manner (2 hours is

usually much too long) set this value to approximately ~ 5 minutes. (When you set this KeepAlive setting please

consider other timeout settings like transaction time out which is 3 minutes default).

If you set this to approximately 5 minutes then after 5 minutes of inactivity this socket will be tested to see if the

connection is valid. If it fails the test (because the connection is stale/broken) then it will be cleaned up. The old lock on

the SIBOWNER table will then be released by the database, and the messaging engine will be able to acquire a fresh

lock and resume messaging.

This KeepAlive setting is done differently on each OS. Check with your OS admin to set this.

NOTE! In WAS version 8.5 a new locking mechanism is implemented that solves this problem without the need for

setting the KeepAlive timeout. Just enable the “Restrict long running locks” feature in the messaging engines

Message Store configuration and restart the messaging engine.

IBM Software Group

WebSphere® Support Technical Exchange 23

KeepAlive settings for each OS AIX: TCP_KEEPIDLE

The default value is 14,400 half-seconds, which is 2 hours. This value can be changed using the following command:

no -o tcp_keepidle=600

In the example above the value is set to 600 half-seconds (5 minutes).

Linux: tcp_keepalive_time

The default value is 7200 seconds (2 hours). This value can be changed using the following command (procfs

interface):

# echo 300 > /proc/sys/net/ipv4/tcp_keepalive_time

The example above sets the value to 300 seconds (5 minutes).

Solaris: TCP_KEEPALIVE_INTERVAL

The default value is 7200000 milliseconds (2 hours). This value can be changed using the following command:

ndd -set /dev/tcp tcp_keepalive_interval 300000

The example above sets the value to 300,000 milliseconds (5 minutes).

Windows: KeepAliveTime

The default value is 7200000 milliseconds (2 hours). You can change this value in the Windows Registry here:

\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Tcpip\Parameters

IBM Software Group

WebSphere® Support Technical Exchange 24

JVM hang / terminated

When the JVM the ME is running on is hung or abnormally terminated the ME is also terminated

along with JVM. In this case HA manager will fail over the ME to some other available cluster

member (depending on your core group policies). Also remember that the lock on the data store

side still exists if you are running WAS 7 or 8.

In the failover server ME tries to get a new lock on the data store. If the previous orphaned lock is

not released this ME instance won't be able to get a new lock. As previously mentioned lowering

the KeepAlive to 5 minutes or so removes the the OS level socket and SIBOWNER table lock,

allowing the new ME instance to get the lock, complete startup and resume messaging functions.

IBM Software Group

WebSphere® Support Technical Exchange 25

Locking issues with NFS3 file system

If you are using a File Store instead of a Data Store you should be aware of the following items

regarding failovers:

During the failover customers who are using NFS3 file system to store the file store files may see

locking issues after the failover and the new instance of ME may be unable to get the lock on the

file store. You may see this exception during ME startup:

CWSIS1579E: The file stores log file is locked by another process.

NFS3 file system has some locking issues and we suggest that our customers use NFS4.

IBM File System Locking Protocol Test for WebSphere Application Server:

http://www.ibm.com/support/docview.wss?uid=swg24010222

Network File System (NFS) Recommendations for WebSphere Application Server:

http://www.ibm.com/support/docview.wss?uid=swg21645066

IBM Software Group

WebSphere® Support Technical Exchange 26

HA Manager is disabled

HMGR0010I: The High Availability Manager has been disabled. Services that rely on the High Availability Manager

might not work as expected.

Do not disable the High Availability Manager! It is required for failovers to work correctly!

By default every cluster member has an equal opportunity to host and run the ME. If you disable HA Manager in

some ME cluster members HA Manager is unable to failover the ME to that server because HA Manager is

disabled.

IBM Software Group

WebSphere® Support Technical Exchange 27

Cluster members are in different core groups

In very rare scenarios we have seen ME cluster members in different core groups and no core

group bridge is defined between the core groups. In this case the HA manager is unable to

failover to the other ME server which is in the other core group because there is no

communication bridge between these 2 core groups.

It's always better to keep all ME cluster members in the same core group. It ensures a smooth ME

failover will occur, it's a better system management practice and it also improves failover

performance.

IBM Software Group

WebSphere® Support Technical Exchange 28

Run ME in preferred servers only and no preferred servers are

running If you configure the “preferred servers only” core group policy option make sure that you have at least one preferred

server available as a standby instance all the times.

When the ME is unable to continue on the server where it was running and needs to failover to the other server, if no

preferred server is available (you choose to run ME in preferred servers only) then ME failover won't happen.

IBM Software Group

WebSphere® Support Technical Exchange 29

Issues with the core group policies

In most cases the default, predefined messaging engine policies support frequently-used

cluster configurations. High Availability is enabled by default

There could be some scenarios, such as modifications you make to comply with business

requirements, where you create a coregoup policy that prevents ME failovers.

Static - The messaging engine can run only on the server to which it is restricted,

and cannot fail over to any other server in the cluster

One of N - with preferred servers and the Preferred servers only setting:

If no preferred servers are available, it cannot fail over to any other server

in the cluster.

No operation : Messaging engine is managed by an external high availability

framework, so it is outside of WebSphere control.

IBM Software Group

WebSphere® Support Technical Exchange 30

DCSV* / HMGR* messages DCS views are groups of servers that 'see' each other and are communicating with each other. Sometimes

servers may get unexpectedly removed from a view and this can cause problems. This can result in differences in

views between the server(s) and JVM where the HA coordinator is running.

Due to this view instability the HA manager may be unable to find a suitable JVM for an ME failover. Reasons for

this are:

Check for any HMGR0130I / HMGR0131I / HMGR0132I messages

Frequent changes in DCS views. Check for DCSV messages like DCSV8050I / DCSV8051I /

DCSV8052I

Firewall issues

Networking issues

JVM hangs

Sometimes CPU starvation issues occur where the HA coordinator is running. Look for HMGR0152W messages.

Check for any hung thread messages in the failover server (WSVR0605W)

IBM Software Group

WebSphere® Support Technical Exchange 31

DCSV* / HMGR* messages When there are problems with DCS views the HA Manager may not be certain where the ME is running. As a

result it may try to start a second copy of the same ME. This second copy will fail to start because it will attempt to

get a lock on SIB tables that are already locked by the copy that was already running.

If you see a second instance of a messaging engine that is trying (and failing) to start check for “DCSV” type

warning messages before the failover started, such as DCSV8104W messages, indicating that one or more

servers have been “removed” from the view. If one of the messaging servers was removed from the view before

the ME startup attempt then that is the cause of the failover. This kind of situation is usually driven by networking

problems so check for socket exceptions and other TCP related errors in the logs.

IBM Software Group

WebSphere® Support Technical Exchange 32

Mixed WebSphere versions (from info center)

Messaging engine failover is not supported for mixed version clusters.

For example you have WAS 8.0 and WAS 7.0 servers as ME cluster members. (different versions of nodes)

A messaging engine that is hosted on WebSphere Application Server Version 8.0 cannot fail over to a server that

is hosted on a different WebSphere Application Server version. If you have a cluster bus member that consists of

a mixture of Version 8.0 and other version servers, you must ensure that the high availability policy is configured to

prevent this type of failover. (same with 7.0 and 8.5 versions)

To prevent failover of a Version 8.0 messaging engine to a server on different version, configure the high

availability policy for the messaging engine so that the cluster is divided into one set of servers for Version 8.0 and

another set of servers for other versions, and the Version 8.0 messaging engine is restricted to the servers at

Version 8.0 only. (same with 7.0 and 8.5 versions)

IBM Software Group

WebSphere® Support Technical Exchange 33

Summary

WAS supports a variety of HA and Failover options

We usually recommend an HA configuration with multiple servers in your

messaging cluster

Core Group Policies allow you to control where messaging engines run and what

servers they failover to

There are many possible reasons for a failover but checking the WAS

SystemOut.log files will usually provide a clear indication why the failover

occurred.

Failovers are usually caused by disconnections between a messaging engine

and its message store or DCS view problems

IBM Software Group

WebSphere® Support Technical Exchange 34

Connect with us!

1. Get notified on upcoming webcasts

Send an e-mail to [email protected] with subject line “wste

subscribe” to get a list of mailing lists and to subscribe

2. Tell us what you want to learn

Send us suggestions for future topics or improvements about our

webcasts to [email protected]

IBM Software Group

WebSphere® Support Technical Exchange 35

Open Lines for Questions

IBM Software Group

WebSphere® Support Technical Exchange 36

Additional WebSphere Product Resources

Learn about upcoming WebSphere Support Technical Exchange webcasts, and access previously recorded presentations at: http://www.ibm.com/software/websphere/support/supp_tech.html

Discover the latest trends in WebSphere Technology and implementation, participate in technically-focused briefings, webcasts and podcasts at: http://www.ibm.com/developerworks/websphere/community/tp://www.ibm.com/software/websphere/events_1.html

IBM Software Events: http://www.ibm.com/software/websphere/events_1.html

Join the Global WebSphere Community:

http://www.websphereusergroup.org

Access key product show-me demos and tutorials by visiting IBM Education Assistant: http://www.ibm.com/software/info/education/assistant

View a webcast replay with step-by-step instructions for using the Service Request (SR) tool for submitting problems electronically: http://www.ibm.com/software/websphere/support/d2w.html

Sign up to receive weekly technical My Notifications emails: http://www.ibm.com/software/support/einfo.html