Mobile phone messaging telemedicine for facilitating self management of long‐term illnesses
Troubleshooting of SIB Messaging Engine Failover Problems ...
-
Upload
khangminh22 -
Category
Documents
-
view
1 -
download
0
Transcript of Troubleshooting of SIB Messaging Engine Failover Problems ...
Click to
add text
IBM Software Group
®
WebSphere® Support Technical Exchange
Troubleshooting of SIB Messaging Engine Failover Problems in a Clustered Environment Jhansi Kolla ([email protected]) Ty Shrake ([email protected]) 8th April 2015
IBM Software Group
WebSphere® Support Technical Exchange 2
Agenda
What are the different possible SIB configurations in clustering environment?
What are the pros and cons of each configuration?
The importance of core group policies in SIB ME clusters.
Important steps in SIB configuration when using clusters.
Reasons for ME failovers.
Why and how does an ME failover to a different server?
Troubleshooting of ME failover issues.
Summary
IBM Software Group
WebSphere® Support Technical Exchange 3
Different possible SIB configurations in clustering environment.
You can have more than one ME, depending on the policy you choose.
High Availability (HA) – One ME in the entire cluster with Failover. (failback is optional)
Scalability / Work load management - Multiple messaging engines/ one ME dedicated to each
cluster member with no failover.
High Availability + Scalability - Multiple messaging engines/ one ME dedicated to each cluster
member with failover. (failback is optional)
Custom - (Use this policy when the other options of High availability, Scalability, or Scalability with high
availability do not provide the messaging engine behavior you require, and you are familiar with creating
messaging engines and configuring messaging engine policy settings.)
IBM Software Group
WebSphere® Support Technical Exchange 4
High Availability configuration pros and cons: High Availability
If you configure your environment with a High Availability policy it creates only one ME in that
cluster. This single ME is able to start on any JVM™'s in this cluster depending on how you
configure the core groups like preferred servers. By default the ME starts on the first server
in the cluster to start. The ME is able to failover to any JVM in that cluster. You can also
setup failback to failback the ME if the preferred server is back up and running.
Pros:
The messaging engine is highly available. If the messaging engine or the server on which it runs fails, then the messaging
engine starts up on another available server in the cluster.
You can set the preferred servers where you want to give preference to some servers to run the ME.
You can set the ME failback criteria if preferred server is back.
All messaging functionality is in one place for low traffic scenarios.
Application is able to connect to any SIB bootstrap server to send or consume messages.
Cons:
If you set the “Preferred servers only” option in the core groups panel for this ME policy then, if no preferred server is
available, the ME won't start.
If you set only one preferred server and setup for the “Preferred servers only” and the ME is running on this preferred server
and if catastrophic error condition occurs on this preferred server the ME is not able to failover to another server in that ME
cluster, because you set it up for ME run on preferred server only and there are no more preferred servers available.
No WLM option. Only one ME to handle all messaging functions in that cluster. Work load can't be shared.
IBM Software Group
WebSphere® Support Technical Exchange 6
High Availability configuration
When a server that is hosting a messaging engine fails, the messaging engine is
activated and run on another server. The messaging engine is configured to fail over to
any of the application servers in the cluster. All the application servers in the cluster are
added to the preferred servers list, and this list determines the order in which the servers
are used for failover. The earlier the server in the preferred servers list, the stronger the
preference for that server to run the ME.
IBM Software Group
WebSphere® Support Technical Exchange 7
One messaging engine is created for each application server in the cluster. The
messaging engines cannot fail over.
Pros:
One messaging engine dedicated to one server in the cluster.
Destinations are partitioned across the messaging engines. The messaging engines can share all traffic
passing through the destination, reducing the impact of one messaging engine failing.
Greater throughput of messages by the application.(spreading the messaging load across multiple servers,
and optionally across multiple hosts.)
Performance improvement.
Better accessibility for the application. (local ME access to the application).
Cons:
Message order is no longer be guaranteed due to the workload balancing of messages across queue points.
Default scalability option don't provide the failover of the messaging engine.
Scalability configuration pros and cons:
IBM Software Group
WebSphere® Support Technical Exchange 9
Scalability configuration
The scalability policy ensures that there is a messaging engine for each server in a cluster.
Messaging engines runs on its dedicated servers. No failover. If server is down that specific ME
is also down. ME's run on only preferred servers.
IBM Software Group
WebSphere® Support Technical Exchange 10
High Availability + Scalability configuration pros and cons
This policy provides one ME per JVM with failover option.
Pros:
Improved performance.
Failover option
Better access to the application due to local ME to the application.
Able to configure the failback to the preferred servers.
Work load management.
Cons:
Message order is no longer be guaranteed due to the workload balancing of messages
across queue partitions.
A little more complex than other options
IBM Software Group
WebSphere® Support Technical Exchange 11
High Availability + Scalability configuration
IBM Software Group
WebSphere® Support Technical Exchange 12
High Availability + Scalability configuration
In this example there are three servers, each on a separate node, and three messaging
engines. Each messaging engine has a preferred server and one other server it can use
for failover. Each server is the preferred host for one messaging engine, and the failover
host for one other messaging engine.
IBM Software Group
WebSphere® Support Technical Exchange 13
High Availability + Scalability messaging view
IBM Software Group
WebSphere® Support Technical Exchange 14
Supported Core Group policies in ME clustering:
Core group policies we support:
a) Static.
b) One of N - with no preferred servers ( This is the Default SIBus Policy)
c) One of N - with preferred servers
d) One of N - with preferred servers and the Fail back setting
e) One of N - with preferred servers and the Preferred servers only setting
f) No operation
You cannot use the "All active" or "M of N" policy types, because they are not supported for the
service integration bus component. (any ME can run on only one JVM at any specific time)
IBM Software Group
WebSphere® Support Technical Exchange 16
How to find out where ME is running in the cluster:
Servers > Core groups > Core group settings > Name of the Core group > RunTime tab >
Click the Show groups > High Availability group name >
IBM Software Group
WebSphere® Support Technical Exchange 17
ME lost its DataStore connection and is unable to the get a new lock
on the Datastore after failover
ME is running on one of the cluster members and suddenly it loses the database connection
due to either network issues or the database is hung/shutdown or any other reason.
ConnectionEve A J2CA0056I: The Connection Manager received a fatal connection error from the Resource Adapter for
resource jdbc/<data source name>.
CWSIS1594I: The messaging engine, ME_UUID=<ME_UUID>, INC_UUID=< INC_UUID>, has lost the lock on the data store.
CWSIS1519E: Messaging engine <ME_NAME> cannot obtain the lock on its data store, which ensures it has exclusive
access to the data.
CWSID0046E: Messaging engine <ME_NAME>_bus detected an error and cannot continue to run in this server.
HAGroupImpl I HMGR0130I: The local member of group IBM_hc=<cluster_name>, WSAF_SIB_BUS=<bus name>,
WSAF_SIB_MESSAGING_ENGINE=<ME_NAME>,type=WSAF_SIB has indicated that is it not alive. The JVM will be
terminated. (only in WAS7.0 and 8.0)
SystemOut O Panic:component requested panic from isAlive (only in WAS7.0 and 8.0)
SystemOut O java.lang.RuntimeException: emergencyShutdown called: (only in WAS7.0 and 8.0)
(JVM termination is default behavior in WAS 7.0 / 8.0. In WAS 8.5 and above levels the JVM is not terminated and the ME is
disabled in this JVM. Other applications continue to run. Also note that the new “Restrict long running locks ” feature in WAS
8.5 gets around this locking problem.)
JDBC™
IBM Software Group
WebSphere® Support Technical Exchange 18
sib.msgstore.jdbcFailoverOnDBConnectionLoss
In order to maintain the data integrity in WAS 7.0 and above we introduced this custom parameter:
sib.msgstore.jdbcFailoverOnDBConnectionLoss with the default value as TRUE.
By setting the sib.msgstore.jdbcFailoverOnDBConnectionLoss custom property on a messaging
engine, you can control the behavior of the messaging engine and its hosting server in the event that
the connection to the data store is lost.
If this parameter is set to TRUE (the default value) then HA manager stops the messaging engine
and its hosting application server when the next core group service “Is Alive” check takes place (the
default value is 120 seconds). You will also see a “Panic” message in the SystemOut.log and the
hosting JVM is terminated (which also stops the ME).
If this parameter is set to FALSE then the messaging engine continues to run and accept work, and
periodically attempts to regain the connection to the data store. If work continues to be submitted to
the messaging engine while the data store is unavailable, the results can be unpredictable, and the
messaging engine can be in an inconsistent state when the data store connection is restored. JMS™
messages could also be lost.
ME lost the DataStore connection and is unable to the get the lock
after failover
IBM Software Group
WebSphere® Support Technical Exchange 19
ME lost the DataStore connection and is unable to the get the lock
after failover At this point the HMGR will failover the ME to the other cluster member if its available based on the core group
policies configuration.
From the ME side it lost the lock on the data store, but on the data store (database) side the lock is still alive.
IBM Software Group
WebSphere® Support Technical Exchange 20
ME lost the DataStore connection and is unable to the get the lock
after failover
HMGR starts the ME in the some other available server based on the core group policies
configuration.
This new instance of the ME tries to get a new lock on the data store. If the attempt fails it will continue
attempting for 15 minutes. If the messaging engine is unable to acquire the lock after 15 minutes it will
become disabled. (In WAS 8.5 the ME will be automatically re-enabled)
In the case of an abnormal shutdown the lock on the data store and OS socket are not removed in the
DB server. Keep in mind that the database, not WAS, owns the database lock.
CWSIS1538I: The messaging engine, ME_UUID=<ME_UUID>, INC_UUID=<INC_UUID>,, is attempting to obtain an exclusive
lock on the data store.
CWSIS1593I: The messaging engine, ME_UUID=<ME_UUID>, INC_UUID=<INC_UUID>, has failed to gain an initial lock on the
data store.
Above pattern continues for 15 minutes if the ME is unable to get lock on the data store. After 15
minutes HMGR disables the ME in this server and moves the ME to next available server in that
cluster. The root cause of this locking issue is an orphaned lock that still exists on the SIBOWNER
table and OS socket is still open.
IBM Software Group
WebSphere® Support Technical Exchange 21
ME lost the dataStore connection and unable to the get the lock after
failover
The reason for the failure is that when the disconnection originally occurred the messaging engine became aware
of this fact but the database did not. So while the messaging engine is attempting to reacquire its data store lock on
the SIBOWNER table the database is still holding both the lock and the socket from the previous connection before
the unexpected disconnect occurred. As long as this socket and data store lock remain in place the new
instance of messaging engine will not be able to acquire a new lock and resume messaging functions.
Old (orphaned) data store lock will not be released by the database until the socket connection to the database is
first cleaned up. Once the socket is cleaned up the lock will be released by the database and the messaging engine
will be able to acquire a fresh lock, start the ME and resume its messaging functions.
There are 3 solutions for this problem.
1) DB Admin can manually remove the lock from the SIBOWNER table.
2) Restart the DB server. (Most of the time this is not a good option because many applications are being
used the same DB server. So this is not feasible for most people)
3) Configure the database server to clean the stale/orphaned sockets. (no manual intervention)
IBM Software Group
WebSphere® Support Technical Exchange 22
Solution for this problem is TCP KeepAlive If you are running WAS 7 or 8 then TCP KeepAlive is a good solution to this problem. KeepAlive is a feature of TCP
that has 2 main benefits: It can detect that a connection to a peer is no longer valid and it can prevent a network
connection from being terminated due to inactivity.
KeepAlive periodically checks for broken connections (stale/orphaned sockets). The default OS level KeepAlive
setting is 2 hours. SoKeepAlive will check the socket connection after 2 hours of inactivity. If the connection is stale
KeepAlive will clean up the socket. In order to cleanup this stale socket connection in a timely manner (2 hours is
usually much too long) set this value to approximately ~ 5 minutes. (When you set this KeepAlive setting please
consider other timeout settings like transaction time out which is 3 minutes default).
If you set this to approximately 5 minutes then after 5 minutes of inactivity this socket will be tested to see if the
connection is valid. If it fails the test (because the connection is stale/broken) then it will be cleaned up. The old lock on
the SIBOWNER table will then be released by the database, and the messaging engine will be able to acquire a fresh
lock and resume messaging.
This KeepAlive setting is done differently on each OS. Check with your OS admin to set this.
NOTE! In WAS version 8.5 a new locking mechanism is implemented that solves this problem without the need for
setting the KeepAlive timeout. Just enable the “Restrict long running locks” feature in the messaging engines
Message Store configuration and restart the messaging engine.
IBM Software Group
WebSphere® Support Technical Exchange 23
KeepAlive settings for each OS AIX: TCP_KEEPIDLE
The default value is 14,400 half-seconds, which is 2 hours. This value can be changed using the following command:
no -o tcp_keepidle=600
In the example above the value is set to 600 half-seconds (5 minutes).
Linux: tcp_keepalive_time
The default value is 7200 seconds (2 hours). This value can be changed using the following command (procfs
interface):
# echo 300 > /proc/sys/net/ipv4/tcp_keepalive_time
The example above sets the value to 300 seconds (5 minutes).
Solaris: TCP_KEEPALIVE_INTERVAL
The default value is 7200000 milliseconds (2 hours). This value can be changed using the following command:
ndd -set /dev/tcp tcp_keepalive_interval 300000
The example above sets the value to 300,000 milliseconds (5 minutes).
Windows: KeepAliveTime
The default value is 7200000 milliseconds (2 hours). You can change this value in the Windows Registry here:
\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Tcpip\Parameters
IBM Software Group
WebSphere® Support Technical Exchange 24
JVM hang / terminated
When the JVM the ME is running on is hung or abnormally terminated the ME is also terminated
along with JVM. In this case HA manager will fail over the ME to some other available cluster
member (depending on your core group policies). Also remember that the lock on the data store
side still exists if you are running WAS 7 or 8.
In the failover server ME tries to get a new lock on the data store. If the previous orphaned lock is
not released this ME instance won't be able to get a new lock. As previously mentioned lowering
the KeepAlive to 5 minutes or so removes the the OS level socket and SIBOWNER table lock,
allowing the new ME instance to get the lock, complete startup and resume messaging functions.
IBM Software Group
WebSphere® Support Technical Exchange 25
Locking issues with NFS3 file system
If you are using a File Store instead of a Data Store you should be aware of the following items
regarding failovers:
During the failover customers who are using NFS3 file system to store the file store files may see
locking issues after the failover and the new instance of ME may be unable to get the lock on the
file store. You may see this exception during ME startup:
CWSIS1579E: The file stores log file is locked by another process.
NFS3 file system has some locking issues and we suggest that our customers use NFS4.
IBM File System Locking Protocol Test for WebSphere Application Server:
http://www.ibm.com/support/docview.wss?uid=swg24010222
Network File System (NFS) Recommendations for WebSphere Application Server:
http://www.ibm.com/support/docview.wss?uid=swg21645066
IBM Software Group
WebSphere® Support Technical Exchange 26
HA Manager is disabled
HMGR0010I: The High Availability Manager has been disabled. Services that rely on the High Availability Manager
might not work as expected.
Do not disable the High Availability Manager! It is required for failovers to work correctly!
By default every cluster member has an equal opportunity to host and run the ME. If you disable HA Manager in
some ME cluster members HA Manager is unable to failover the ME to that server because HA Manager is
disabled.
IBM Software Group
WebSphere® Support Technical Exchange 27
Cluster members are in different core groups
In very rare scenarios we have seen ME cluster members in different core groups and no core
group bridge is defined between the core groups. In this case the HA manager is unable to
failover to the other ME server which is in the other core group because there is no
communication bridge between these 2 core groups.
It's always better to keep all ME cluster members in the same core group. It ensures a smooth ME
failover will occur, it's a better system management practice and it also improves failover
performance.
IBM Software Group
WebSphere® Support Technical Exchange 28
Run ME in preferred servers only and no preferred servers are
running If you configure the “preferred servers only” core group policy option make sure that you have at least one preferred
server available as a standby instance all the times.
When the ME is unable to continue on the server where it was running and needs to failover to the other server, if no
preferred server is available (you choose to run ME in preferred servers only) then ME failover won't happen.
IBM Software Group
WebSphere® Support Technical Exchange 29
Issues with the core group policies
In most cases the default, predefined messaging engine policies support frequently-used
cluster configurations. High Availability is enabled by default
There could be some scenarios, such as modifications you make to comply with business
requirements, where you create a coregoup policy that prevents ME failovers.
Static - The messaging engine can run only on the server to which it is restricted,
and cannot fail over to any other server in the cluster
One of N - with preferred servers and the Preferred servers only setting:
If no preferred servers are available, it cannot fail over to any other server
in the cluster.
No operation : Messaging engine is managed by an external high availability
framework, so it is outside of WebSphere control.
IBM Software Group
WebSphere® Support Technical Exchange 30
DCSV* / HMGR* messages DCS views are groups of servers that 'see' each other and are communicating with each other. Sometimes
servers may get unexpectedly removed from a view and this can cause problems. This can result in differences in
views between the server(s) and JVM where the HA coordinator is running.
Due to this view instability the HA manager may be unable to find a suitable JVM for an ME failover. Reasons for
this are:
Check for any HMGR0130I / HMGR0131I / HMGR0132I messages
Frequent changes in DCS views. Check for DCSV messages like DCSV8050I / DCSV8051I /
DCSV8052I
Firewall issues
Networking issues
JVM hangs
Sometimes CPU starvation issues occur where the HA coordinator is running. Look for HMGR0152W messages.
Check for any hung thread messages in the failover server (WSVR0605W)
IBM Software Group
WebSphere® Support Technical Exchange 31
DCSV* / HMGR* messages When there are problems with DCS views the HA Manager may not be certain where the ME is running. As a
result it may try to start a second copy of the same ME. This second copy will fail to start because it will attempt to
get a lock on SIB tables that are already locked by the copy that was already running.
If you see a second instance of a messaging engine that is trying (and failing) to start check for “DCSV” type
warning messages before the failover started, such as DCSV8104W messages, indicating that one or more
servers have been “removed” from the view. If one of the messaging servers was removed from the view before
the ME startup attempt then that is the cause of the failover. This kind of situation is usually driven by networking
problems so check for socket exceptions and other TCP related errors in the logs.
IBM Software Group
WebSphere® Support Technical Exchange 32
Mixed WebSphere versions (from info center)
Messaging engine failover is not supported for mixed version clusters.
For example you have WAS 8.0 and WAS 7.0 servers as ME cluster members. (different versions of nodes)
A messaging engine that is hosted on WebSphere Application Server Version 8.0 cannot fail over to a server that
is hosted on a different WebSphere Application Server version. If you have a cluster bus member that consists of
a mixture of Version 8.0 and other version servers, you must ensure that the high availability policy is configured to
prevent this type of failover. (same with 7.0 and 8.5 versions)
To prevent failover of a Version 8.0 messaging engine to a server on different version, configure the high
availability policy for the messaging engine so that the cluster is divided into one set of servers for Version 8.0 and
another set of servers for other versions, and the Version 8.0 messaging engine is restricted to the servers at
Version 8.0 only. (same with 7.0 and 8.5 versions)
IBM Software Group
WebSphere® Support Technical Exchange 33
Summary
WAS supports a variety of HA and Failover options
We usually recommend an HA configuration with multiple servers in your
messaging cluster
Core Group Policies allow you to control where messaging engines run and what
servers they failover to
There are many possible reasons for a failover but checking the WAS
SystemOut.log files will usually provide a clear indication why the failover
occurred.
Failovers are usually caused by disconnections between a messaging engine
and its message store or DCS view problems
IBM Software Group
WebSphere® Support Technical Exchange 34
Connect with us!
1. Get notified on upcoming webcasts
Send an e-mail to [email protected] with subject line “wste
subscribe” to get a list of mailing lists and to subscribe
2. Tell us what you want to learn
Send us suggestions for future topics or improvements about our
webcasts to [email protected]
IBM Software Group
WebSphere® Support Technical Exchange 36
Additional WebSphere Product Resources
Learn about upcoming WebSphere Support Technical Exchange webcasts, and access previously recorded presentations at: http://www.ibm.com/software/websphere/support/supp_tech.html
Discover the latest trends in WebSphere Technology and implementation, participate in technically-focused briefings, webcasts and podcasts at: http://www.ibm.com/developerworks/websphere/community/tp://www.ibm.com/software/websphere/events_1.html
IBM Software Events: http://www.ibm.com/software/websphere/events_1.html
Join the Global WebSphere Community:
http://www.websphereusergroup.org
Access key product show-me demos and tutorials by visiting IBM Education Assistant: http://www.ibm.com/software/info/education/assistant
View a webcast replay with step-by-step instructions for using the Service Request (SR) tool for submitting problems electronically: http://www.ibm.com/software/websphere/support/d2w.html
Sign up to receive weekly technical My Notifications emails: http://www.ibm.com/software/support/einfo.html