1) OCSSD Reboots and 11g RAC
The main cause of node reboots involving the Oracle Cluster Synchronization
Daemon (OCSSD) in an Oracle 11g RAC configuration is network failure or latency
between the cluster nodes. The problem usually manifests itself with the OCSSD
process as what is called a missed checkin condition. Because heartbeats are
issued once per second in the Oracle 11g RAC environment, 30 consecutive missed
checkins (the default misscount value on Linux) are enough to cause a node
reboot within the cluster. How do we find out whether we have these missed
checkin conditions? The answer lies in the logs for the CSS processing.
Whenever OCSSD misses checkins in the Oracle 11g Clusterware, they appear
in the log files. The following is an example of this condition from the CSS
logfile with Oracle 11g RAC on Linux:
WARNING: clssnmPollingThread: node (1) at 50% heartbeat fatal,
eviction in 29.100 seconds
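To confirm these warnings quickly, you can search the CSS daemon log directly. A minimal sketch, assuming the typical 11g R1 log location under the Clusterware home and a node named racnode1 (both are placeholders; adjust for your environment):
# Scan the CSS log for heartbeat warnings that precede an eviction
grep -i "heartbeat fatal" $CRS_HOME/log/racnode1/cssd/ocssd.log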
You should also review the Linux or UNIX system messages file to determine
the root cause of these OCSSD failures in your Oracle RAC configuration. The
following guidelines help you decide whether missed checkins were behind a
node reboot (a timestamp comparison sketch follows the list):
• If the messages file reboot time is less than the missed checkin time, then the
node eviction was likely not due to these missed checkins
• If the messages file reboot time is greater than the missed checkin time, then
the node eviction was likely a result of the missed checkins
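A quick way to make this comparison is to pull both timestamps from the evicted node. A minimal sketch, again using placeholder paths and node names; the syslog restart line is one common boot marker:
# When did the node come back up, according to syslog?
grep restart /var/log/messages | tail -1
# When was the last missed-checkin warning logged by CSS?
grep -i "heartbeat fatal" $CRS_HOME/log/racnode1/cssd/ocssd.log | tail -1
Compare the two timestamps against the guidelines above to decide whether missed checkins were the trigger.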
This comparison helps explain node evictions that involve missed checkins and
OCSSD failure conditions in the Oracle 11g Clusterware stack. Another reason
node reboots occur is that the Clusterware daemon OCSSD cannot read from or
write to the voting disk. A quick review of the CSS logfiles tells us whether
this is the case.
The following example from the Oracle 11g CSS log file shows the problem of failed
access to the voting disks:
ERROR: clssnmDiskPingMonitorThread: voting device access hanging (160008
miliseconds)
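Alongside the log review, you can confirm that the voting disks are configured and visible from the node. A minimal sketch using the standard crsctl utility; the raw device path in the read test is only an example for your storage layout:
# List the voting disks known to this node
crsctl query css votedisk
# Optionally confirm the underlying device is readable
dd if=/dev/raw/raw1 of=/dev/null bs=512 count=1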
2) OPROCD failure and node reboots
The following are the four primary conditions that cause the Oracle Process
Monitor Daemon (OPROCD) to fail within the Oracle 11g RAC environment (a
quick log check follows the list):
• An OS scheduler problem
• The OS is getting locked up in a driver or hardware issue
• Excessive amounts of load on the machine, thus preventing the scheduler
from behaving reasonably
• An Oracle bug such as Bug 5015469
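Before working through each condition, it is worth confirming that OPROCD actually fired. A sketch, assuming the default Linux log location for OPROCD (verify the path and filename on your platform; racnode1 is a placeholder):
# OPROCD writes a per-host activity log on Linux
ls /etc/oracle/oprocd/
tail /etc/oracle/oprocd/racnode1.oprocd.log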
An OS scheduler problem is often a clock problem in disguise: OPROCD is
sensitive to large time adjustments, so the fix is to configure the ntpd
daemon so that the system clock stays synchronized and corrections are applied
gradually (slewed rather than stepped), keeping it in step with the Oracle 11g
Clusterware and, in particular, with the OPROCD process on Linux. To verify
that ntpd is synchronized, check the ntpd logs alongside the Clusterware logs.
The ntp logfiles live under the /var/log directory on Linux and most UNIX
platforms, and ntp is configured by editing the ntp.conf file under the /etc
directory.
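To verify that ntpd is synchronized and will not step the clock abruptly, you can check its peers and its startup options. A sketch for Red Hat style systems; the -x slewing flag is the commonly recommended setting for clusters, but confirm it against your platform documentation:
# The peer marked with * is the selected time source
ntpq -p
# On Red Hat style systems, -x makes ntpd slew rather than step the clock
grep OPTIONS /etc/sysconfig/ntpd
# for example: OPTIONS="-x -u ntp:ntp -p /var/run/ntpd.pid"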
If the OS is locking up in a driver, or a hardware issue is at fault, the
root cause depends on the operating system, storage devices, and hardware
configuration; work with the hardware vendor and Oracle Support to resolve
the OS conditions that cause the node reboot.
If there is an excessive amount of load on the machine, the likely cause is
improper system design for the Oracle 11g RAC environment. Adequate memory,
shared storage, and network capacity are required to prevent scheduler
failures with the Oracle 11g Clusterware.
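To see whether scheduler pressure is plausible, sample the load around the reboot window with standard Linux tools. A minimal sketch:
# Load averages relative to the CPU count
uptime
grep -c processor /proc/cpuinfo
# Run queue depth, memory, and swap activity, sampled five times at 5-second intervals
vmstat 5 5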
Finally, an Oracle software bug, such as the one listed above, might be the
root cause of the OPROCD failure that reboots a node in the cluster; whether
it strikes depends on the environment. Now let's review some node reboot
conditions that are linked directly to the operation of the OCLSOMON daemon
process within the Oracle 11g Clusterware.
3) OCLSOMON failure and node reboots
Several root causes lead to a node reboot when the OCLSOMON daemon process
fails in Unix and Linux environments for the Oracle 11g Clusterware. They can
be summarized as follows:
• Hung threads within the CSS daemon
• OS scheduler problems
• Excessive amounts of load on the machine
• Oracle software bugs with Clusterware and database
When the OCLSOMON process fails, the result is a node reboot in Oracle 11g
RAC environments. OCLSOMON monitors the multithreaded CSS daemon; when CSS
threads hang, for example because the operating system scheduler cannot give
them the resources they need, OCLSOMON treats the cluster stack as unhealthy
and the node reboots (see the log check below).
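When hung CSS threads are the suspect, the OCLSOMON log is the first place to look. A sketch, assuming the usual 11g R1 location under the Clusterware home (path and node name are placeholders):
# OCLSOMON records the CSS health checks that preceded the reboot
tail -50 $CRS_HOME/log/racnode1/cssd/oclsomon/oclsomon.log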
The next condition that may cause a node reboot is excessive load placed on
the systems, which comes back to the architecture and implementation of the
hardware, shared storage, and network configuration within the Oracle 11g RAC
ecosystem. Proper capacity planning will prevent this issue.
The last condition is a software bug within the Oracle 11g Clusterware
software itself. By consulting with Oracle Support and opening a service
request (formerly an iTAR), you can obtain a patch that resolves the bug at
the root of the node reboot involving the OCLSOMON process. Now that we have
discussed the primary causes of and solutions to node reboots within the
Oracle 11g Clusterware, we will discuss how to address issues that arise with
the Clusterware as a result of system and network conditions.