有些尴尬的一次集群启动故障排错

因为工作性质改变,有许久没动手处理故障了,今天的排错也是非生产环境,为验证一些测试临时搭的一套11g RAC环境,为了省时间,直接拿之前备份的vbox的环境拷贝,结果启动机器发现集群无法启动:

[root@jystdrac1 ~]# su - grid
[grid@jystdrac1 ~]$ crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[grid@jystdrac1 ~]$ crsctl stat res -t -init
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Status failed, or completed with errors.

查看集群alert日志报错:

[grid@jystdrac1 jystdrac1]$ pwd
/opt/app/11.2.0/grid/log/jystdrac1
[grid@jystdrac1 jystdrac1]$ tail -20f alertjystdrac1.log
2021-07-01 00:26:27.379:
[/opt/app/11.2.0/grid/bin/oraagent.bin(4526)]CRS-5818:Aborted command 'start' for resource 'ora.mdnsd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2021-07-01 00:26:31.384:
[ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.mdnsd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
2021-07-01 00:28:32.889:
[/opt/app/11.2.0/grid/bin/oraagent.bin(4568)]CRS-5818:Aborted command 'start' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2021-07-01 00:28:36.895:
[ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.gpnpd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
2021-07-01 00:28:38.424:
[mdnsd(4644)]CRS-5602:mDNS service stopping by request.
2021-07-01 00:30:38.407:
[/opt/app/11.2.0/grid/bin/oraagent.bin(4633)]CRS-5818:Aborted command 'start' for resource 'ora.mdnsd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2021-07-01 00:30:42.412:
[ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.mdnsd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
2021-07-01 00:32:43.923:
[/opt/app/11.2.0/grid/bin/oraagent.bin(4676)]CRS-5818:Aborted command 'start' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2021-07-01 00:32:47.928:
[ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.gpnpd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
2021-07-01 00:32:49.455:
[mdnsd(4822)]CRS-5602:mDNS service stopping by request.

进一步看mdns.log等最新报错信息(gpnp.log类似,为节省篇幅没有贴出):

[grid@jystdrac1 mdnsd]$ pwd
/opt/app/11.2.0/grid/log/jystdrac1/mdnsd
[grid@jystdrac1 mdnsd]$ tail -20 mdnsd.log
2021-06-30 22:50:59.275: [    MDNS][1534236416] mdnsd exit
2021-06-30 22:53:03.989: [ default][1342412544]

================================================================================
2021-06-30 22:53:03.989: [ default][1342412544]mdnsd START pid=2201
[  clsdmt][1335961344]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jystdrac1DBG_MDNSD))
2021-06-30 22:53:03.991: [  clsdmt][1335961344]PID for the Process [2201], connkey 9
2021-06-30 22:53:03.991: [  clsdmt][1335961344]Creating PID [2201] file for home /opt/app/11.2.0/grid host jystdrac1 bin mdns to /opt/app/11.2.0/grid/mdns/init/
2021-06-30 22:53:03.992: [  clsdmt][1335961344]Writing PID [2201] to the file [/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid]
2021-06-30 22:53:03.992: [  clsdmt][1335961344]Failed to record pid for MDNSD
2021-06-30 22:53:03.992: [  clsdmt][1335961344]Terminating process
2021-06-30 22:53:03.992: [    MDNS][1335961344] clsdm requested mdnsd exit
2021-06-30 22:53:03.992: [    MDNS][1335961344] mdnsd exit
2021-06-30 22:57:14.236: [ default][747345664]

================================================================================
2021-06-30 22:57:14.236: [ default][747345664]mdnsd START pid=2375
[  clsdmt][740894464]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jystdrac1DBG_MDNSD))
2021-06-30 22:57:14.239: [  clsdmt][740894464]PID for the Process [2375], connkey 9
2021-06-30 22:57:14.239: [  clsdmt][740894464]Cr[grid@jystdrac1 mdnsd]$

MOS 也有篇文章介绍了RAC起不来的五大问题:

  • Grid Infrastructure 启动的五大问题 (Doc ID 1526147.1)

其中问题 4:Agent 或者 mdnsd.bin, gpnpd.bin, gipcd.bin 未运行,就和目前的现象很匹配。

文档中描述了可能的原因和对应解决方案:

可能的原因:

1. orarootagent 缺少执行权限
2. 缺少进程相关的 <node>.pid 文件或者这个文件的所有者/权限不对
3. GRID_HOME 所有者/权限不对

解决方案:

1. 和一个好的GRID_HOME比较所有者/权限,并做相应的改正,或者以root用户执行:,
   # cd <GRID_HOME>/crs/install
   # ./rootcrs.pl -unlock
   # ./rootcrs.pl -patch
这将停止集群软件,对需要的文件的所有者/权限设置为root用户,并且重启集群软件。
2. 如果对应的 <node>.pid 不存在, 就用touch命令创建一个具有相应所有者/权限的文件, 否则就按要求改正文件<node>.pid的所有者/权限, 然后重启集群软件.
这里是<GRID_HOME>下,所有者属于root:root 权限 644的<node>.pid 文件列表:
  ./ologgerd/init/<node>.pid
  ./osysmond/init/<node>.pid
  ./ctss/init/<node>.pid
  ./ohasd/init/<node>.pid
  ./crs/init/<node>.pid
所有者属于<grid>:oinstall,权限644
  ./mdns/init/<node>.pid  
  ./evm/init/<node>.pid
  ./gipc/init/<node>.pid
  ./gpnp/init/<node>.pid

3. 对第3种原因,请参考解决方案1

可是依次排查下来发现均无问题,奇怪了,为啥权限都正确就是写不进去呢?

手工vi试下看看呢?

[grid@jystdrac1 jystdrac1]$ vi /opt/app/11.2.0/grid/mdns/init/jystdrac1.pid
2201

保存时发现报错:

"/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid"
"/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid" E514: write error (file system full?)
Press ENTER or type command to continue

什么?文件系统空间满了???

[grid@jystdrac1 jystdrac1]$ df -h
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/vg_linuxbase-lv_root   28G   27G     0 100% /
tmpfs                             1.5G     0  1.5G   0% /dev/shm
/dev/sda1                         485M   39M  421M   9% /boot

额,果然.. 好尴尬,居然是最初级的空间容量问题。
赶紧清理下空间后重启集群再试是否正常启动?
It’s Ok!

This entry was posted in Oracle故障处理 and tagged , . Bookmark the permalink.