8.10. Diagnosing and Correcting Problems in a Cluster

To diagnose problems in a cluster properly, event logging must be enabled. In addition, if problems arise, set the severity level for the cluster daemons to DEBUG. The daemons then log descriptive messages that may help solve the problem.

Note: Once any issues have been resolved, reset the severity level to its default value of WARN to prevent excessively large log files from being generated. Refer to Section 8.6 Modifying Cluster Event Logging for more information.

Use Table 8-3 to troubleshoot issues in a cluster.

Problem / Symptom / Solution
Problem: SCSI bus not terminated
Symptom: SCSI errors appear in the log file
Solution: Each SCSI bus must be terminated only at the beginning and end of the bus. Depending on the bus configuration, it might be necessary to enable or disable termination in host bus adapters, RAID controllers, and storage enclosures. To support hot plugging, the SCSI bus must use external termination.
In addition, be sure that no devices are connected to a SCSI bus using a stub that is longer than 0.1 meter.
Refer to Section 2.4.4 Configuring Shared Disk Storage and Section D.3 SCSI Bus Termination for information about terminating different types of SCSI buses.

Problem: SCSI bus length greater than maximum limit
Symptom: SCSI errors appear in the log file
Solution: Each type of SCSI bus must adhere to restrictions on length, as described in Section D.4 SCSI Bus Length.
In addition, ensure that no single-ended devices are connected to the LVD SCSI bus, because this causes the entire bus to revert to a single-ended bus, which has more severe length restrictions than a differential bus.

Problem: SCSI identification numbers not unique
Symptom: SCSI errors appear in the log file
Solution: Each device on a SCSI bus must have a unique identification number. Refer to Section D.5 SCSI Identification Numbers for more information.

Problem: SCSI commands timing out before completion
Symptom: SCSI errors appear in the log file
Solution: The prioritized arbitration scheme on a SCSI bus can result in low-priority devices being locked out for some period of time. This may cause commands to time out if a low-priority storage device, such as a disk, is unable to win arbitration and complete a command that a host has queued to it. For some workloads, this problem can be avoided by assigning low-priority SCSI identification numbers to the host bus adapters.
Refer to Section D.5 SCSI Identification Numbers for more information.

Problem: Mounted quorum partition
Symptom: Messages indicating checksum errors on a quorum partition appear in the log file
Solution: Be sure that the quorum partition raw devices are used only for cluster state information. They cannot be used for cluster services or for non-cluster purposes, and cannot contain a file system. Refer to Section 2.4.4.3 Configuring Shared Cluster Partitions for more information.
These messages could also indicate that the underlying block device special file for the quorum partition has been erroneously used for non-cluster purposes.

Problem: Service file system is unclean
Symptom: A disabled service cannot be enabled
Solution: Manually run a checking program such as fsck. Then, enable the service.
Note that, by default, the cluster infrastructure runs fsck with the -p option to automatically repair file system inconsistencies. For particularly severe errors, you may need to initiate file system repair manually.

Problem: Quorum partitions not set up correctly
Symptom: Messages indicating that a quorum partition cannot be accessed appear in the log file
Solution: Run the /sbin/shutil -t command to check that the quorum partitions are accessible. If the command succeeds, run the shutil -p command on both cluster systems. If the output differs between the systems, the quorum partitions do not point to the same devices on both systems. Check that the raw devices exist and are correctly specified in the /etc/sysconfig/rawdevices file. Refer to Section 2.4.4.3 Configuring Shared Cluster Partitions for more information.

Problem: Cluster service operation fails
Symptom: Messages indicating the operation failed appear on the console or in the log file
Solution: There are many different reasons for the failure of a service operation (for example, a service stop or start). To help identify the cause of the problem, set the severity level for the cluster daemons to DEBUG to log descriptive messages. Then, retry the operation and examine the log file. Refer to Section 8.6 Modifying Cluster Event Logging for more information.

Problem: Cluster service stop fails because a file system cannot be unmounted
Symptom: Messages indicating the operation failed appear on the console or in the log file
Solution: Use the fuser and ps commands to identify the processes that are accessing the file system, and use the kill command to stop them. Alternatively, use the lsof -t file_system command to display the identification numbers of the processes that are accessing the specified file system. If needed, pass that output to the kill command as arguments (for example, with xargs), because kill does not read process identification numbers from standard input.
To avoid this problem, be sure that only cluster-related processes can access shared storage data. In addition, modify the service and enable forced unmount for the file system. This enables the cluster service to unmount a file system even if it is being accessed by an application or user.

Problem: Incorrect entry in the cluster database
Symptom: Cluster operation is impaired
Solution: The Cluster Status Tool can be used to examine and modify service configuration. The Cluster Configuration Tool is used to modify cluster parameters.

Problem: Incorrect Ethernet heartbeat entry in the cluster database or /etc/hosts file
Symptom: Cluster status indicates that an Ethernet heartbeat channel is OFFLINE even though the interface is valid
Solution: Examine and modify the cluster configuration by running the Cluster Configuration Tool, as specified in Section 8.4 Modifying the Cluster Configuration, and correct the problem.
In addition, be sure to use the ping command to send a packet to all network interfaces used in the cluster.

Problem: Loose cable connection to power switch
Symptom: Power switch status using clufence returns an error or hangs
Solution: Check the serial cable connection.

Problem: Power switch serial port incorrectly specified in the cluster database
Symptom: Power switch status using clufence indicates a problem
Solution: Examine the current settings and modify the cluster configuration by running the Cluster Configuration Tool, as specified in Section 8.4 Modifying the Cluster Configuration, and correct the problem.

Table 8-3. Diagnosing and Correcting Problems in a Cluster
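
For the entry on SCSI commands timing out, it may help to see which identification numbers count as low priority. On a SCSI bus, ID 7 has the highest arbitration priority, followed by 6 down to 0; on a wide (16-bit) bus, IDs 15 down to 8 rank below ID 0. The following Python sketch orders a set of IDs accordingly. The function names are illustrative only and are not part of the cluster software.

```python
def arbitration_rank(scsi_id: int) -> int:
    """Return 0 for the highest-priority ID; larger values mean lower priority."""
    if not 0 <= scsi_id <= 15:
        raise ValueError(f"SCSI ID out of range: {scsi_id}")
    # IDs 7..0 take ranks 0..7; IDs 15..8 take ranks 8..15.
    return 7 - scsi_id if scsi_id <= 7 else 23 - scsi_id


def by_priority(ids):
    """Sort IDs from highest to lowest arbitration priority."""
    return sorted(ids, key=arbitration_rank)


if __name__ == "__main__":
    # Hypothetical bus: the IDs at the end of the list are the low-priority ones.
    print(by_priority([0, 7, 8, 15]))
```

Under this ordering, a host bus adapter moved from ID 7 to an ID near the end of the sorted list yields arbitration to the disks more often, which is the adjustment the table entry describes.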
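
For the entry on an unclean service file system, the exit status of fsck indicates whether the automatic repair was sufficient. The status is a sum of the condition values documented in the fsck manual page. The Python sketch below decodes such a status; the helper names are assumptions made for illustration, and the sketch does not run fsck itself.

```python
# Condition values as documented in the fsck(8) manual page.
FSCK_FLAGS = {
    1: "file system errors corrected",
    2: "system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def decode_fsck_status(status: int):
    """Return the list of conditions encoded in an fsck exit status."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in FSCK_FLAGS.items() if status & bit]


def needs_manual_repair(status: int) -> bool:
    """True when fsck reported errors that it left uncorrected."""
    return bool(status & 4)
```

A status for which needs_manual_repair returns True corresponds to the case in which file system repair has to be initiated manually before the service is enabled.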
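
For the entry on quorum partitions that are not set up correctly, the comparison of the shutil -p output from the two cluster systems can be scripted. The Python sketch below assumes that the output of each system has already been saved as text; it makes no assumption about the format of that output, and the function name and the labels member1 and member2 are hypothetical.

```python
import difflib


def compare_outputs(first: str, second: str, names=("member1", "member2")):
    """Return unified-diff lines; an empty list means the two outputs match."""
    return list(
        difflib.unified_diff(
            first.splitlines(),
            second.splitlines(),
            fromfile=names[0],
            tofile=names[1],
            lineterm="",
        )
    )
```

An empty result suggests that both systems report the same information. Any returned lines show where the outputs differ, which is the point at which to check the /etc/sysconfig/rawdevices file on each system.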
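
For the entry on a file system that cannot be unmounted, the lsof -t step can be combined with signal delivery in a short script. The Python sketch below assumes that lsof is installed and that the caller has permission to signal the processes; the function names are illustrative, and SIGTERM is chosen here as a cautious default rather than taken from the cluster documentation.

```python
import os
import signal
import subprocess


def parse_pids(lsof_output: str):
    """Turn `lsof -t` output (one PID per line) into a sorted list of unique ints."""
    return sorted({int(token) for token in lsof_output.split() if token.isdigit()})


def pids_using(file_system: str):
    """Run `lsof -t` on a mount point and return the PIDs it reports."""
    # lsof exits with a non-zero status when it finds nothing, so the
    # return code is not treated as an error here.
    result = subprocess.run(
        ["lsof", "-t", file_system], capture_output=True, text=True
    )
    return parse_pids(result.stdout)


def stop_processes(pids, sig=signal.SIGTERM, kill=os.kill):
    """Send a signal to each PID; `kill` is injectable so the logic can be tested."""
    for pid in pids:
        kill(pid, sig)
```

A typical call would be stop_processes(pids_using(mount_point)), where mount_point is the mount point of the service file system. Reviewing the PID list with ps before sending any signal is the safer order of operations.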
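
For the entry on an incorrect Ethernet heartbeat entry, both checks (the /etc/hosts file and the ping test) can be sketched in Python. The host names in the usage comment are hypothetical, the function names are not part of the cluster software, and the sketch assumes a Linux ping that accepts the -c option.

```python
import subprocess


def parse_hosts(hosts_text: str):
    """Map each host name or alias in /etc/hosts-style text to its address."""
    table = {}
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, then split
        if len(fields) < 2:
            continue
        address, names = fields[0], fields[1:]
        for name in names:
            table.setdefault(name, address)  # keep the first address seen
    return table


def missing_names(hosts_text: str, heartbeat_names):
    """Return the heartbeat host names that have no entry in the hosts text."""
    table = parse_hosts(hosts_text)
    return [name for name in heartbeat_names if name not in table]


def ping_once(host: str, run=subprocess.run) -> bool:
    """Send one packet with `ping -c 1`; True when ping exits with status 0."""
    result = run(["ping", "-c", "1", host], capture_output=True)
    return result.returncode == 0


# Example (hypothetical names):
#   text = open("/etc/hosts").read()
#   print(missing_names(text, ["clu1", "clu2"]))
#   print({name: ping_once(name) for name in ["clu1", "clu2"]})
```

A name reported by missing_names, or a host for which ping_once returns False, points to the interface entry to correct with the Cluster Configuration Tool.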