DG gap sequence修复一例

环境:Oracle 11.2.0.4 DG
故障现象:
客户在备库告警日志中发现GAP sequence提示信息:

Mon Nov 21 09:53:29 2016
Media Recovery Waiting for thread 1 sequence 12034
Fetching gap sequence in thread 1, gap sequence 12034-12078
Mon Nov 21 09:55:20 2016
FAL[client]: Failed to request gap sequence
 GAP - thread 1 sequence 12034-12078
 DBID 3493955325 branch 881855745
FAL[client]: All defined FAL servers have been attempted.
------------------------------------------------------------
Check that the CONTROL_FILE_RECORD_KEEP_TIME initialization
parameter is defined to a value that's sufficiently large
enough to maintain adequate log switch information to resolve
archivelog gaps.
------------------------------------------------------------

修复过程:

1.查询备库SCN

查询备库当前SCN,如果人为造成控制文件、数据文件、数据文件头的SCN不一致,那么需要根据日志中gap的起始sequence#找出对应的SCN。可以查看文末reference中惜分飞的博客评论部分。

SQL> col CURRENT_SCN for 999999999999999999

SQL> SELECT CURRENT_SCN FROM V$DATABASE;

        CURRENT_SCN

-------------------

     11906842766974

2.确定主库是否添加数据文件(这里没有)

select FILE#,name from v$datafile where CREATION_CHANGE#> =11906842766974;

确定主库在这个scn之后是否有添加数据文件,如果添加文件,需要手工在备库添加。本次没有遇到。

3.备库停止日志应用

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

4.主库增量备份并传输到备库上

主库进行增量备份然后传输到备库上

RMAN> backup as compressed backupset INCREMENTAL from scn 11906842766974 database format '/backup/dumpfile/%u.bak';

$ scp *.bak 192.168.56.158:/oradata/rman/

5.备库上进行恢复

RMAN> CATALOG START WITH '/oradata/rman/';

--注意如果此时库是read only则需要置换为mount后再进行recover操作;
RMAN> RECOVER DATABASE NOREDO;

6.主库上创建standby controlfile文件并传输到备库

主库上创建standby controlfile文件并传输到备库

RMAN> BACKUP CURRENT CONTROLFILE FOR STANDBY FORMAT '/home/oracle/std_ctl.bak';

[oracle@localhost ~]$ scp std_ctl.bak 192.168.56.158:/home/oracle/

7.备库恢复控制文件

备库关库,启动到nomount状态后恢复控制文件,最后启动到mount状态

RMAN> shutdown;

RMAN> STARTUP NOMOUNT;

RMAN> RESTORE STANDBY CONTROLFILE FROM '/home/oracle/std_ctl.bak';

RMAN> alter database mount;

8.清空备库日志组(这里不用)

本次DG中使用了standby log模式,不需要此步骤。

SQL> ALTER DATABASE CLEAR LOGFILE GROUP 1;

如果配置了physical standby redo log则不需该步骤;

如果没有采用standby log模式,有几组需要清空几组。

9.备库重设flashback(根据实际情况选做,这里本身就没开启)

备库重设flashback(根据实际情况选做,这里DG环境备库本身就没开启,所以不用操作)

SQL> ALTER DATABASE FLASHBACK OFF;

SQL> ALTER DATABASE FLASHBACK ON;

10.备库重新接收并应用日志

备库重新接收并应用日志:

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE using current logfile DISCONNECT FROM SESSION;

恢复过程备库最后的日志(最后需要出现Media Recovery Waiting for 字样):

Mon Nov 21 17:17:05 2016

Managed Standby Recovery starting Real Time Apply

Parallel Media Recovery started with 32 slaves

Waiting for all non-current ORLs to be archived...

All non-current ORLs have been archived.

Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE using current logfile DISCONNECT FROM SESSION

Media Recovery Log /oradata/arch/1_12131_881855745.dbf

Mon Nov 21 17:18:59 2016

Media Recovery Log /oradata/arch/1_12132_881855745.dbf

Mon Nov 21 17:20:44 2016

Media Recovery Log /oradata/arch/1_12133_881855745.dbf

Mon Nov 21 17:21:02 2016

Media Recovery Log /oradata/arch/1_12134_881855745.dbf

Mon Nov 21 17:22:22 2016

Media Recovery Waiting for thread 1 sequence 12135 (in transit)

11.备库重新开启read only模式

根据实际情况,备库重新开启read only模式,本次需求是需要备库read only状态应用日志(11g ADG特性)

SQL> alter database RECOVER MANAGED STANDBY DATABASE CANCEL;

SQL> alter database open;

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE using current logfile DISCONNECT FROM SESSION;

12.验证修复是否成功

12.1 对比最大sequence#
不一定准确(比如中间产生过gap,但是后期的归档日志正常传输,那么实际上虽然结果相同,但是还是有gap)

在主库中执行alter system switch logfile;

分别主备库中执行:select max(sequence#) from v$archived_log;

12.2 通过跟踪alert文件

主库告警:

tail -200f /oracle/diag/rdbms/shoucall/shoucall/trace/alert_shoucall.log

备库告警:

tail -200f /u01/app/oracle/diag/rdbms/shoucall_dg/shoucall/trace/alert_shoucall.log

Reference

细节补充

  • a) 针对“2.确定主库是否添加数据文件”(这里演示有的情况,在增量备份恢复完成之后,重新恢复备库的控制文件后,要使用alter database create datafile语句创建)
SQL> alter system set standby_file_management = manual;

System altered.

SQL> select checkpoint_change# from v$datafile_header;

CHECKPOINT_CHANGE#
------------------
          33790297
          33790297
          33790297
          33790297
          33790297
          33790297
          33790297
          33790297
          33790297
                 0
          33790297

11 rows selected.

SQL> select name from v$datafile where file# = 10;

NAME
-----------------------------------------------------
/u01/oradata/datafile/soe.285.957625983

SQL> alter database create datafile 10 as '/u01/oradata/datafile/soe.285.957625983';

Database altered.

SQL> select checkpoint_change# from v$datafile_header;

CHECKPOINT_CHANGE#
------------------
          33790297
          33790297
          33790297
          33790297
          33790297
          33790297
          33790297
          33790297
          33790297
          33486132
          33790297

11 rows selected.

SQL> alter system set standby_file_management = auto;

System altered.

SQL> alter database recover managed standby database disconnect from session;

Database altered.
  • b) 关于文中提到的“如果人为造成控制文件、数据文件、数据文件头的SCN不一致,那么需要根据日志中gap的起始sequence#找出对应的SCN。可以查看文末reference中惜分飞的博客评论部分。”我也实际遇到了这类场景,记录下关键信息,供以后参考:
--备库提示gap
Tue Jan 02 21:50:41 2018
Clearing online redo logfile 8 complete
Media Recovery Waiting for thread 1 sequence 1612
Fetching gap sequence in thread 1, gap sequence 1612-1711

--备库的SCN已经由于恢复过控制文件,不再具有参考价值
SQL> select current_scn||'' from v$database;

CURRENT_SCN||''
----------------------------------------
48779127

--备库的数据文件头信息
SQL> select checkpoint_change# from v$datafile_header;

CHECKPOINT_CHANGE#
------------------
          48775720
          48775720
          48775720
          48775720
          48775720
          48775720
          48775720
          46063891
          46064457
          44684686
          44685159

CHECKPOINT_CHANGE#
------------------
          46064873
          46076635
          46080427
          46081384
          46081884

16 rows selected.

--主库根据gap的sequence#查询first_change#
SYS@jyzhao1 >select FIRST_CHANGE#  from v$archived_log where SEQUENCE#  =1612;

FIRST_CHANGE#
-------------
     44660277
     44660277
可以看到,这个值的确比数据文件头的最小的checkpoint_change#还要小,符合逻辑。
  • c) 演示主库对应gap的归档还可以找到但无法自动传输到备库的情况,可以直接备份对应的归档日志传输到备库进行恢复:

1.主库备份对应归档日志:

backup as copy archivelog sequence 43315 thread 1 format '/tmp/incr/arch_seq_43315_thread_1.bak';
backup as copy archivelog sequence 76073 thread 2 format '/tmp/incr/arch_seq_76073_thread_2.bak';
backup as copy archivelog sequence 88970 thread 3 format '/tmp/incr/arch_seq_88970_thread_3.bak';

–并行备份指定sequence范围内的归档日志:

run{
allocate channel c1 device type disk;
allocate channel c2 device type disk;
allocate channel c3 device type disk;
allocate channel c4 device type disk;

backup as copy archivelog from sequence 88925 until sequence 88961 thread 3 format '/tmp/incr/arch_seq_88925_thread_3_88961_%u.bak';

release channel c1;
release channel c2;
release channel c3;
release channel c4;
}

2.传送到备库:

scp -P 22 arch_seq_88925_thread_3_88961_* 10.xx.xx.11:/tmp/incr/

3.备库rman注册:

catalog start with '/tmp/incr/arch';

之后备库就可以继续recover..

This entry was posted in Oracle故障处理 and tagged , , , , . Bookmark the permalink.