- Subject: TECH: Internals of Recovery
- Type: REFERENCE
- Creation Date: 13-SEP-1996
- Oracle7 v7.2 Recovery Outline
- Authors: Andrea Borr & Bill Bridge
- Version: 1 May 3, 1995
- Abstract
- This document gives an overview of how database recovery works
- in Oracle7 version 7.2. It is assumed that the reader is familiar
- with the Database Administrator's Guide for Oracle7 version 7.2.
- The intention of this document is to describe the recovery
- algorithms and data structures, providing more details than the
- Administrator's Guide.
- Table of Contents
- 1 Introduction
- 1.1 Instance Recovery and Media Recovery: Common Mechanisms
- 1.2 Instance Failure and Recovery, Crash Failure and Recovery
- 1.3 Media Failure and Recovery
- 2 Fundamental Data Structures
- 2.1 Controlfile
- 2.1.1 Database Info Record (Controlfile)
- 2.1.2 Datafile Record (Controlfile)
- 2.1.3 Thread Record (Controlfile)
- 2.1.4 Logfile Record (Controlfile)
- 2.1.5 Filename Record (Controlfile)
- 2.1.6 Log-History Record (Controlfile)
- 2.2 Datafile Header
- 2.3 Logfile Header
- 2.4 Change Vector
- 2.5 Redo Record
- 2.6 System Change Number (SCN)
- 2.7 Redo Logs
- 2.8 Thread of Redo
- 2.9 Redo Byte Address (RBA)
- 2.10 Checkpoint Structure
- 2.11 Log History
- 2.12 Thread Checkpoint Structure
- 2.13 Database Checkpoint Structure
- 2.14 Datafile Checkpoint Structure
- 2.15 Stop SCN
- 2.16 Checkpoint Counter
- 2.17 Tablespace-Clean-Stop SCN
- 2.18 Datafile Offline Range
- 3 Redo Generation
- 3.1 Atomic Changes
- 3.2 Write-Ahead Log
- 3.3 Transaction Commit
- 3.4 Thread Checkpoint
- 3.5 Online-Fuzzy Bit
- 3.6 Datafile Checkpoint
- 3.7 Log Switch
- 3.8 Archiving Log Switches
- 3.9 Thread Open
- 3.10 Thread Close
- 3.11 Thread Enable
- 3.12 Thread Disable
- 4 Hot Backup
- 4.1 BEGIN BACKUP
- 4.2 File Copy
- 4.3 END BACKUP
- 4.4 "Crashed" Hot Backup
- 5 Instance Recovery
- 5.1 Detection of the Need for Instance Recovery
- 5.2 Thread-at-a-Time Redo Application
- 5.3 Current Online Datafiles Only
- 5.4 Checkpoints
- 5.5 Crash Recovery Completion
- 6 Media Recovery
- 6.1 When to Do Media Recovery
- 6.2 Thread-Merged Redo Application
- 6.3 Restoring Backups
- 6.4 Media Recovery Commands
- 6.4.1 RECOVER DATABASE
- 6.4.2 RECOVER TABLESPACE
- 6.4.3 RECOVER DATAFILE
- 6.5 Starting Media Recovery
- 6.6 Applying Redo, Media Recovery Checkpoints
- 6.7 Media Recovery and Fuzzy Bits
- 6.7.1 Media-Recovery-Fuzzy
- 6.7.2 Online-Fuzzy
- 6.7.3 Hotbackup-Fuzzy
- 6.8 Thread Enables
- 6.9 Thread Disables
- 6.10 Ending Media Recovery (Case of Complete Media Recovery)
- 6.11 Automatic Recovery
- 6.12 Incomplete Recovery
- 6.12.1 Incomplete Recovery UNTIL Options
- 6.12.2 Incomplete Recovery and Consistency
- 6.12.3 Incomplete Recovery and Datafiles Known to the
- Controlfile
- 6.12.4 Resetlogs Open after Incomplete Recovery
- 6.12.5 Files Offline during Incomplete Recovery
- 6.13 Backup Controlfile Recovery
- 6.14 CREATE DATAFILE: Recover a Datafile Without a Backup
- 6.15 Point-in-Time Recovery Using Export/Import
- 7 Block Recovery
- 7.1 Block Recovery Initiation and Operation
- 7.2 Buffer Header RBA Fields
- 7.3 PMON vs. Foreground Invocation
- 8 Resetlogs
- 8.1 Fuzzy Files
- 8.2 Resetlogs SCN and Counter
- 8.3 Effect of Resetlogs on Threads
- 8.4 Effect of Resetlogs on Redo Logs
- 8.5 Effect of Resetlogs on Online Datafiles
- 8.6 Effect of Resetlogs on Offline Datafiles
- 8.7 Checking Dictionary vs. Controlfile on Resetlogs Open
- 9 Recovery-Related V$ Fixed-Views
- 9.1 V$LOG
- 9.2 V$LOGFILE
- 9.3 V$LOG_HISTORY
- 9.4 V$RECOVERY_LOG
- 9.5 V$RECOVER_FILE
- 9.6 V$BACKUP
- 10 Miscellaneous Recovery Features
- 10.1 Parallel Recovery (v7.1)
- 10.1.1 Parallel Recovery Architecture
- 10.1.2 Parallel Recovery System Initialization Parameters
- 10.1.3 Media Recovery Command Syntax Changes
- 10.2 Redo Log Checksums (v7.2)
- 10.3 Clear Logfile (v7.2)
- 1 Introduction
- The Oracle RDBMS provides database recovery facilities capable
- of preserving database integrity in the face of two major failure
- modes:
- 1. Instance failure: loss of the contents of a buffer cache, or data
- residing in memory.
- 2. Media failure: loss of database file storage on disk.
- Each of these two major failure modes raises its own set of
- challenges for database integrity. For each, there is a set of
- requirements that a recovery utility addressing that failure mode
- must satisfy.
- Although recovery processing for the two failure modes has much
- in common, the requirements differ enough to motivate the
- implementation of two different recovery facilities:
- 1. Instance recovery: recovers data lost from the buffer cache
- due to instance failure.
- 2. Media recovery: recovers data lost from disk storage.
- 1.1 Instance Recovery and Media Recovery: Common Mechanisms
- Both instance recovery and media recovery depend for their
- operation on the redo log. The redo log is organized into redo
- threads, referred to hereafter simply as threads. The redo log of a
- single-instance (non-Parallel Server option) database consists of a
- single thread. A Parallel Server redo log has a thread per instance.
- A redo log thread is a set of operating system files in which an
- instance records all changes it makes - committed and
- uncommitted - to memory buffers containing datafile blocks.
- Since this includes changes made to rollback segment blocks, it
- follows that rollback data is also (indirectly) recorded in the redo
- log.
- The first phase of both instance and media recovery processing is
- roll-forward. Roll-forward is the task of the RDBMS recovery
- layer. During roll-forward, changes recorded in the redo log are re-
- applied (as needed) to the datafiles. Because changes to rollback
- segment blocks are recorded in the redo log, roll-forward also
- regenerates the corresponding rollback data. When the recovery
- layer finishes its task, all changes recorded in the redo log have
- been restored by roll-forward. At this point, the datafile blocks
- contain not only all committed changes, but also any uncommitted
- changes recorded in the redo log.
- The second phase of both instance and media recovery processing
- is roll-back. Roll-back is the task of the RDBMS transaction layer.
- During roll-back, undo information from rollback segments (as
- well as from save-undo/deferred rollback segments, if appropriate)
- is used to undo uncommitted changes that were applied during the
- roll-forward phase.
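- The two phases described above can be sketched as follows. This is an
- illustrative Python sketch, not Oracle code; the structures and names
- (roll_forward, roll_back, the block/undo representations) are
- hypothetical simplifications.

```python
# A minimal, self-contained sketch (not Oracle source) of the two recovery
# phases: roll-forward by the recovery layer, then roll-back by the
# transaction layer. All names and structures here are hypothetical.

def roll_forward(blocks, redo_log):
    """Phase 1 (recovery layer): reapply all logged changes, committed or not."""
    for block_id, value, txn in redo_log:   # each record: (block_id, new_value, txn)
        blocks[block_id] = value            # includes rollback-segment blocks,
                                            # so undo data is regenerated too

def roll_back(blocks, undo, committed):
    """Phase 2 (transaction layer): undo changes of uncommitted transactions."""
    for txn, (block_id, old_value) in undo.items():
        if txn not in committed:
            blocks[block_id] = old_value

blocks = {}                                  # post-crash datafile blocks
redo = [("b1", "v1", "T1"), ("b2", "v2", "T2")]
undo = {"T1": ("b1", None), "T2": ("b2", None)}
roll_forward(blocks, redo)                   # blocks now hold T1 and T2 changes
roll_back(blocks, undo, committed={"T1"})    # T2 was uncommitted: undone
```

- After both phases, the blocks hold all committed changes and no
- uncommitted ones, which is the end state the text describes.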
- 1.2 Instance Failure and Recovery, Crash Failure and Recovery
- Instance failure, a failure resulting in the loss of the instance's
- buffer cache, occurs when an instance is aborted, whether
- unexpectedly or deliberately. Unexpected instance aborts result,
- for example, from an operating system crash, a power failure, or a
- background process failure. Deliberate instance aborts result from
- the commands SHUTDOWN ABORT and STARTUP FORCE.
- Crash failure is the failure of all instances accessing a database. In
- the case of a single-instance (non-Parallel Server option) database,
- the terms crash failure and instance failure are used
- interchangeably. Crash recovery (equivalent to instance recovery in
- this case) is the process of recovering all online datafiles to a
- consistent state following a crash. This is done automatically in
- response to the ALTER DATABASE OPEN command.
- In the case of the Parallel Server option, the term crash failure is
- used to refer to the simultaneous failures of all open instances.
- Parallel Server crash recovery is the process of recovering all
- online datafiles to a consistent state after all instances accessing the
- database have failed. This is done automatically in response to the
- ALTER DATABASE OPEN command. Parallel Server instance
- failure refers to the failure of an instance while a surviving instance
- continues in operation. Parallel Server instance recovery is the
- automatic recovery by a surviving instance of a failed instance.
- Instance failure impairs database integrity because it results in loss
- of the instance's dirty buffer cache. A "dirty" buffer is one whose
- memory version differs from its disk version. An instance that
- aborts has no opportunity for writing out "dirty" buffers so as to
- prevent database integrity breakage on disk following a crash. Loss
- of the dirty buffer cache is a problem due to the fact that the cache
- manager uses algorithms optimized for OLTP performance rather
- than for crash-tolerance. Examples of performance-optimizing
- cache management algorithms that make the task of instance
- recovery more difficult are as follows:
- * LRU (least recently used) based buffer replacement
- * no-datablock-force-at-commit (see 3.3).
- As a consequence of the performance-oriented cache management
- algorithms, instance failure can cause database integrity breakage
- as follows:
- A. At crash time, the datafiles on disk might contain some but not
- all of a set of datablock changes that constitute a single atomic
- change to the database with respect to structural integrity
- (see 2.5).
- B. At crash time, the datafiles on disk might contain some
- datablocks modified by uncommitted transactions.
- C. At crash time, the datafiles on disk might contain some
- datablocks missing changes from committed transactions.
- During instance recovery, the RDBMS recovery layer repairs
- database integrity breakages A and C. It also enables subsequent
- repair - by the RDBMS transaction layer - of database integrity
- breakage B.
- In addition to the requirement that it repair any integrity breakages
- resulting from the crash, instance recovery must meet the following
- requirements:
- 1. Instance recovery must accomplish the repair using the current
- online datafiles (as left on disk after the crash).
- 2. Instance recovery must use only the online redo logs. It must
- not require use of the archived logs. Although instance
- recovery could work successfully from archived logs (except for a
- database running in NOARCHIVELOG mode), it could not
- work autonomously (requirement 4) if an operator were
- required to restore archived logs.
- 3. The invocation of instance recovery must be automatic,
- implicit at the next database startup.
- 4. Detection of the need for repair and the repair itself must
- proceed autonomously, without operator intervention.
- 5. The duration of the roll-forward phase of instance recovery is
- governed by both RDBMS internal mechanisms (checkpoint)
- and user-configurable parameters (e.g. number and sizes of
- logfiles, checkpoint-frequency tuning parameters, parallel
- recovery parameters).
- As seen above, Oracle's buffer cache component is optimized for
- OLTP performance rather than for crash-tolerance. This document
- describes some of the mechanisms used by the cache and recovery
- components to solve the problems posed by use of performance-
- optimizing cache algorithms such as LRU buffer replacement and
- no-datablock-force-at-commit. These mechanisms enable instance
- recovery to meet its requirements while allowing optimal OLTP
- performance. These mechanisms include:
- * Log-Force-at-Commit: see 3.3.
- Facilitates repair of breakage type C by guaranteeing that, at
- transaction commit time, all of the transaction's redo records,
- including its "commit record," are stored on disk in the on-line
- redo log.
- * Checkpointing: see 3.4, 3.6.
- Bounds the amount of transaction redo that instance recovery
- must potentially apply.
- Works in conjunction with online-log switch management to
- ensure that instance recovery can be accomplished using only
- online logs and current online datafiles.
- * Online-Log Switch Management: see 3.7.
- Works in conjunction with checkpointing to ensure that
- instance recovery can be accomplished using only online logs
- and current online datafiles. It guarantees that the current
- checkpoint is beyond an online logfile before that logfile is
- reused.
- * Write-Ahead-Log: see 3.2.
- Facilitates repair of breakage types A and B by guaranteeing
- that: (i) at crash time there are no changes in the datafiles that
- are not in the redo log; (ii) no datablock change was written to
- disk without first writing to the log sufficient information to
- enable undo of the change should a crash intervene before
- commit.
- * Atomic Redo Record Generation: see 3.1.
- Facilitates repair of breakage types A and B.
- * Thread-Open Flag: see 5.1.
- Enables detection at startup time of the need for crash
- recovery.
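- The interplay of write-ahead-log and log-force-at-commit can be
- sketched as follows. This is a hypothetical simplification, not the
- real DBWR/LGWR protocol; the Log class, RBA-as-record-count, and all
- names are invented for illustration.

```python
# Hedged sketch (hypothetical names, not Oracle code) of the write-ahead-log
# rule: a dirty buffer may reach disk only after the redo describing its
# changes is on disk, and commit forces only the log, not the datablocks.

class Log:
    def __init__(self):
        self.records, self.flushed_to = [], 0
    def append(self, rec):
        self.records.append(rec)
        return len(self.records)            # "RBA", simplified to a record count
    def flush_to(self, rba):                # LGWR: force log to disk up to rba
        self.flushed_to = max(self.flushed_to, rba)

def write_buffer(log, buffer):
    """DBWR path: enforce write-ahead before a datablock write."""
    if buffer["last_rba"] > log.flushed_to:
        log.flush_to(buffer["last_rba"])    # write-ahead: flush redo first
    buffer["on_disk"] = True

def commit(log, txn_rba):
    """Log-force-at-commit: only the redo is forced, not the datablocks."""
    log.flush_to(txn_rba)

log = Log()
buf = {"last_rba": log.append("change b1"), "on_disk": False}
commit(log, buf["last_rba"])                # commit record on disk; block may stay dirty
write_buffer(log, buf)                      # later, DBWR writes the block
```

- The commit call illustrates no-datablock-force-at-commit: the block
- can remain dirty in the cache long after the transaction commits.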
- 1.3 Media Failure and Recovery
- Instance failure affects logical database integrity. Because instance
- failure leaves a recoverable version of the online datafiles on the
- post-crash disk, instance recovery can use the online datafiles as a
- starting point.
- Media failure, on the other hand, affects physical storage media
- integrity or accessibility. Because the original datafile copies are
- damaged, media recovery uses restored backup copies of the
- datafiles as a starting point. Media recovery then uses the redo log
- to roll-forward these files, either to a consistent present state or to a
- consistent past state. Media recovery is run by issuing one of the
- following commands: RECOVER DATABASE, RECOVER
- TABLESPACE, RECOVER DATAFILE.
- Depending on the failure scenario, a media failure has the potential
- for causing database integrity breakages similar to those caused by
- an instance failure. For example, an integrity breakage of type A,
- B, or C could result if I/O accessibility to a datablock were lost
- between the time the block was read into the buffer cache and the
- time DBWR attempted to write out an updated version of the
- block. More typical, however, is the case of a media failure that
- results in the permanent loss of the current version of a datafile, and
- hence of all updates to that datafile that occurred since the last time
- the file was backed up.
- Before media recovery is invoked, backup copies of the damaged
- datafiles are restored. Media recovery then applies relevant
- portions of the redo log to roll-forward the datafile backups,
- making them current. Current implies a pre-failure state consistent
- with the rest of the database.
- Media recovery and instance recovery have in common the
- requirement to repair database integrity breakages A-C. However,
- media recovery and instance recovery differ with respect to
- requirements 1-5. The requirements for media recovery are as
- follows:
- 1. Media recovery must accomplish the repair using restored
- backups of damaged datafiles.
- 2. Media recovery can use archived logs as well as the online
- logs.
- 3. Invocation of media recovery is explicit, by operator
- command.
- 4. Detection of media failure (i.e. the need to restore a backup) is
- not automatic. Once a backup has been restored, however,
- detection of the need to recover it via media recovery is
- automatic.
- 5. The duration of the roll-forward phase of media recovery is
- governed solely by user policy
- (e.g. frequency of backups, parallel recovery parameters)
- rather than by RDBMS internal mechanisms.
- 2 Fundamental Data Structures
- 2.1 Controlfile
- The controlfile contains records that describe and keep state
- information about all the other files of the database.
- The controlfile contains the following categories of records:
- * Database Info Record (1)
- * Datafile Records (1 per datafile)
- * Thread Records (1 per thread)
- * Logfile Records (1 per logfile)
- * Filename Records (1 per datafile or logfile group member)
- * Log-History Records (1 per completed logfile)
- Fields of the controlfile records referenced in the remainder of this
- document are listed below, together with the number(s) of the
- section(s) describing their use:
- 2.1.1 Database Info Record (Controlfile)
- * resetlogs timestamp: 8.2
- * resetlogs SCN: 8.2
- * enabled thread bitvec: 8.3
- * force archiving SCN: 3.8
- * database checkpoint thread (thread record index): 2.13, 3.10
- 2.1.2 Datafile Record (Controlfile)
- * checkpoint SCN: 2.14, 3.4
- * checkpoint counter: 2.16, 5.3, 6.2
- * stop SCN: 2.15, 6.5, 6.10, 6.13
- * offline range (offline-start SCN, offline-end checkpoint): 2.18
- * online flag
- * read-enabled, write-enabled flags (1-1: read/write, 1-0:
- read-only)
- * filename record index
- 2.1.3 Thread Record (Controlfile)
- * thread checkpoint structure: 2.12, 3.4, 8.3
- * thread-open flag: 3.9, 3.11, 8.3
- * current log (logfile record index)
- * head and tail (logfile record indices) of list of logfiles in
- thread: 2.8
- 2.1.4 Logfile Record (Controlfile)
- * log sequence number: 2.7
- * thread number: 8.4
- * next and previous (logfile record indices) of list of logfiles in
- thread: 2.8
- * count of files in group: 2.8
- * low SCN: 2.7
- * next SCN: 2.7
- * head and tail (filename record indices) of list of filenames in
- group: 2.8
- * "being cleared" flag: 10.3
- * "archiving not needed" flag: 10.3
- 2.1.5 Filename Record (Controlfile)
- * filename
- * filetype
- * next and previous (filename record indices) of list of filenames
- in group: 2.8
- 2.1.6 Log-History Record (Controlfile)
- * thread number: 2.11
- * log sequence number: 2.11
- * low SCN: 2.11
- * low SCN timestamp: 2.11
- * next SCN: 2.11
- 2.2 Datafile Header
- Fields of the datafile header referenced in the remainder of this
- document are listed below, together with the number(s) of the
- section(s) describing their use:
- * datafile checkpoint structure: 2.14
- * backup checkpoint structure: 4.1
- * checkpoint counter: 2.16, 3.4, 5.3, 6.2
- * resetlogs timestamp: 8.2
- * resetlogs SCN: 8.2
- * creation SCN: 8.1
- * online-fuzzy bit: 3.5, 6.7.1, 8.1
- * hotbackup-fuzzy bit: 4.1, 4.4, 6.7.1, 8.1
- * media-recovery-fuzzy bit: 6.7.1, 8.1
- 2.3 Logfile Header
- Fields of the logfile header referenced in the remainder of this
- document are listed below, together with the number(s) of the
- section(s) describing their use:
- * thread number: 2.7
- * sequence number: 2.7
- * low SCN: 2.7
- * next SCN: 2.7
- * end-of-thread flag: 6.10
- * resetlogs timestamp: 8.2
- * resetlogs SCN: 8.2
- 2.4 Change Vector
- A change vector describes a single change to a single datablock. It
- has a header that gives the Data Block Address (DBA) of the block,
- the incarnation number, the sequence number, and the operation.
- After the header is information that depends on the operation. The
- incarnation number and sequence number are copied from the
- block header when the change vector is constructed. When a block
- is made "new," the incarnation number is set to a value that is
- greater than its previous incarnation number and the sequence
- number is set to one. The sequence number on the block is
- incremented after every change is applied.
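- The incarnation/sequence bookkeeping above can be sketched as a
- version check. This is a deliberate simplification (hypothetical
- structures and names); the real applicability test during recovery is
- more involved than a plain equality comparison.

```python
# Illustrative sketch (not Oracle code): deciding whether a change vector
# applies to a block version, using the incarnation and sequence numbers
# copied into the vector header when it was constructed.

def needs_change(block, cv):
    """The vector was built against (inc, seq); in this simplified model it
    applies only if the block is at exactly that version, i.e. the change
    is not yet reflected in the block."""
    return (block["inc"], block["seq"]) == (cv["inc"], cv["seq"])

def apply_change(block, cv):
    if needs_change(block, cv):
        block["data"] = cv["new_data"]
        block["seq"] += 1                   # sequence advances after every change

block = {"inc": 3, "seq": 1, "data": "old"}
cv = {"inc": 3, "seq": 1, "new_data": "new"}
apply_change(block, cv)                     # applied: block moves to seq 2
apply_change(block, cv)                     # replayed during recovery: no-op
```

- The second call models redo replay: a change already present on the
- block is recognized by its version numbers and skipped.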
- 2.5 Redo Record
- A redo record is a group of change vectors describing a single
- atomic change to the database. For example, a transaction's first
- redo record might group a change vector for the transaction table
- (rollback segment header), a change vector for the undo block
- (rollback segment), and a change vector for the datablock. A
- transaction can generate multiple redo records. The grouping of
- change vectors into a redo record allows multiple database blocks
- to be changed so that either all changes occur or no changes occur,
- despite arbitrary intervening failures. This atomicity guarantee is
- one of the fundamental jobs of the cache layer. Recovery preserves
- redo record atomicity across failures.
- 2.6 System Change Number (SCN)
- An SCN defines a committed version of the database. A query
- reports the contents of the database as it looked at some specific
- SCN. An SCN is allocated and saved in the header of a redo record
- that commits a transaction. An SCN may also be saved in a record
- when it is necessary to mark the redo as being allocated after a
- specific SCN. SCN's are also allocated and stored in other data
- structures such as the controlfile or datafile headers. An SCN is at
- least 48 bits long, so SCN's can be allocated at a rate of 16,384
- per second for over 534 years without running out of them: at that
- rate we would run out of SCN's in June, 2522 AD (timestamps use
- 31-day months).
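- The arithmetic behind that claim works out as follows, using the
- document's 31-day months (12 * 31 days per "year" for timestamps):

```python
# Worked arithmetic for the SCN-exhaustion claim above.

scn_space = 2 ** 48                     # at least 48-bit SCNs
rate = 16_384                           # SCNs allocated per second
seconds_per_year = 12 * 31 * 24 * 3600  # 32,140,800 with 31-day months

years = scn_space / rate / seconds_per_year
print(round(years, 1))                  # a bit over 534 "years"
```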
- 2.7 Redo Logs
- All changes to database blocks are made by constructing a redo
- record for the change, saving this record in a redo log, then
- applying the change vectors to the datablocks. Recovery is the
- process of applying redo to old versions of datablocks to make
- them current. This is necessary when the current version has been
- lost.
- When a redo log becomes full it is closed and a log switch occurs.
- Each log is identified by its thread number (see below), sequence
- number (within thread), and the range of SCN's spanned by its redo
- records. This information is stored in the thread number, sequence
- number, low SCN, and next SCN fields of the logfile header.
- The redo records in a log are ordered by SCN. Moreover, redo
- records containing change vectors for a given block occur in
- increasing SCN order across threads (case of Parallel Server). Only
- some records have SCN's in their header, but every record is
- applied after the allocation of the SCN appearing with or before it
- in the log. The header of the log contains the low SCN and the next
- SCN. The low SCN is the SCN associated with the first redo record
- (unless there is an SCN in its header). The next SCN is the low
- SCN of the log with the next higher sequence number for the same
- thread. The current log of an enabled thread has an infinite next
- SCN, since there is no log with a higher sequence number.
- 2.8 Thread of Redo
- The redo generated by an instance - by each instance in the
- Parallel Server case - is called a thread of redo. A thread is
- comprised of an online portion and (in ARCHIVELOG mode) an
- archived portion. The online portion of a thread is comprised of
- two or more online logfile groups. Each group is comprised of one
- or more replicated members. The set of members in a group is
- referred to variously as a logfile group, group, redo log, online log,
- or simply log. A redo log contains only redo generated by one
- thread. Log sequence numbers are independently allocated for each
- thread. Each thread switches logs independently.
- For each logfile, there is a controlfile record that describes it. The
- index of a log's controlfile record is referred to as its log number.
- Note that log numbers are equivalent to log group numbers, and are
- globally unique (across all threads). The list of a thread's logfile
- records is anchored in the thread record (i.e. via head and tail
- logfile record indices), and linked through the logfile records, each
- of which stores the thread number. The logfile record also has fields
- identifying the number of group members, as well as the head and
- tail (i.e. filename record indices) of the list (linked through
- filename records) of filenames in the group.
- 2.9 Redo Byte Address (RBA)
- An RBA points to a specific location in a particular redo thread. It
- is ten bytes long and has three components: log sequence number,
- block number within log, and byte number within block.
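- One plausible rendering of an RBA is a tuple ordered component by
- component. The 4+4+2 byte split below is an assumption chosen to add
- up to the stated ten bytes, not a documented layout.

```python
# Hypothetical sketch of an RBA: ten bytes, three components, ordered
# lexicographically within one redo thread. The per-field byte widths
# (4 + 4 + 2) are an assumption for illustration.

from typing import NamedTuple

class RBA(NamedTuple):
    seq: int       # log sequence number (assumed 4 bytes)
    block: int     # block number within the log (assumed 4 bytes)
    byte: int      # byte offset within the block (assumed 2 bytes)

# NamedTuple comparison gives the natural ordering of positions in a thread:
a = RBA(seq=7, block=12, byte=16)
b = RBA(seq=8, block=1, byte=0)
assert a < b                       # a later log sequence is always "after"
```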
- 2.10 Checkpoint Structure
- The checkpoint structure is a data structure that defines a point in
- all the redo ever generated for a database. Checkpoint structures
- are stored in datafile headers and in the per-thread records of the
- controlfile. They are used by recovery to know where to start
- reading the log thread(s) for redo application.
- The key fields of the checkpoint structure are the checkpoint SCN
- and the enabled thread bitvec.
- The checkpoint SCN effectively demarcates a specific location in
- each enabled thread (for a definition of enabled see 3.11). For each
- thread, this location is where redo was being generated at some
- point in time within the resolution of one commit. The redo record
- headers in the log can be scanned to find the first redo record that
- was allocated at the checkpoint SCN or higher.
- The enabled thread bitvec is a mask defining which threads were
- enabled at the time the checkpoint SCN was allocated. Note that a
- bit is set for each thread that was enabled, regardless of whether it
- was open or closed. Every thread that was enabled has a redo log
- that contains the checkpoint SCN. A log containing this SCN is
- guaranteed to exist (either online or archived).
- The checkpoint structure also stores the time that the checkpoint
- SCN was allocated. This timestamp is only used to print a message
- to aid a person looking for a log.
- In addition, the checkpoint structure stores the number of the
- thread that allocated the checkpoint SCN and the current RBA in
- that thread when the checkpoint SCN was allocated. Having an
- explicitly-stored thread RBA (as opposed to only having the
- checkpoint SCN as an implicit thread location "pointer") makes the
- log sequence number (part of the RBA) and archived log name
- readily available for the single-instance (i.e. single-thread,
- non-Parallel Server) case.
- A checkpoint structure for a port that supports up to 1023 threads
- of redo is 150 bytes long. A VMS checkpoint is 30 bytes and
- supports up to 63 threads of redo.
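- The fields enumerated above can be collected into a hypothetical
- structure like the following. This is a sketch of the logical content
- only, not the on-disk layout (whose size varies by port, as noted).

```python
# Hypothetical rendering (not the real layout) of the checkpoint structure
# fields described in this section.

from dataclasses import dataclass

@dataclass
class Checkpoint:
    scn: int               # checkpoint SCN
    enabled_threads: int   # bitvec: bit t set if thread t was enabled
    timestamp: int         # when the SCN was allocated (used only in messages)
    thread: int            # number of the thread that allocated the SCN
    rba: tuple             # current RBA in that thread at allocation time

    def thread_enabled(self, t: int) -> bool:
        return bool(self.enabled_threads >> t & 1)

ckpt = Checkpoint(scn=1000, enabled_threads=0b101, timestamp=0,
                  thread=0, rba=(42, 1, 0))
assert ckpt.thread_enabled(0) and not ckpt.thread_enabled(1)
```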
- 2.11 Log History
- The controlfile can be configured (using the MAXLOGHISTORY
- clause of the CREATE DATABASE or CREATE CONTROLFILE
- command) to contain a history record for every logfile that is
- completed. Log history records are small (24 bytes on VMS). They
- are overwritten in a circular fashion so that the oldest information
- is lost.
- For each logfile, the log-history controlfile record contains the
- thread number, log sequence number, low SCN, low SCN
- timestamp, and next SCN (i.e. low SCN of the next log in
- sequence). The purpose of the log history is to reconstruct archived
- logfile names from an SCN and thread number. Since a log
- sequence number is contained in the checkpoint structure (part of
- the RBA), single thread (i.e. non-Parallel Server) databases do not
- need log history to construct archived log names.
- The fields of the log history records are viewable via the
- V$LOG_HISTORY "fixed-view" (see Section 9 for a description
- of the recovery-related "fixed-views"). Additionally,
- V$RECOVERY_LOG, which displays information about archived
- logs needed to complete media recovery, is derived from
- information in the log history records. Although log history is not
- strictly needed for easy administration of single-instance (non-
- Parallel Server) databases, enabling use of V$LOG_HISTORY and
- V$RECOVERY_LOG might be a reason to configure it.
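- The log history's core lookup, reconstructing which log covers a
- given SCN in a given thread, can be sketched as below. The record
- layout is hypothetical; building the actual archived filename from
- the returned sequence number depends on the archive name format.

```python
# Hedged sketch (hypothetical records) of the log history's purpose:
# mapping a thread number and SCN to the log sequence number, from which
# an archived log name can be constructed.

def find_log_seq(history, thread, scn):
    for t, seq, low, nxt in history:    # rec: (thread, seq, low SCN, next SCN)
        if t == thread and low <= scn < nxt:
            return seq
    return None    # record overwritten (circular reuse) or never present

history = [(1, 17, 900, 1000), (1, 18, 1000, 1200), (2, 5, 950, 1100)]
assert find_log_seq(history, 1, 1050) == 18
assert find_log_seq(history, 2, 1050) == 5
assert find_log_seq(history, 1, 100) is None
```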
- 2.12 Thread Checkpoint Structure
- Each enabled thread's controlfile record contains a checkpoint
- structure called the thread checkpoint. The SCN field in this
- structure is known as the thread checkpoint SCN. The thread
- number and RBA fields in this structure refer to the associated
- thread.
- The thread checkpoint structure is updated each time an instance
- checkpoints its thread (see 3.4). During such thread checkpoint
- events, the instance associated with the thread writes to disk in the
- online datafiles all dirty buffers modified by redo generated before
- the thread checkpoint SCN.
- A thread checkpoint event guarantees that all pre-thread-
- checkpoint-SCN redo generated in that thread for all online
- datafiles has been written to disk. (Note that if the thread is closed,
- then there is no redo beyond the thread checkpoint SCN; i.e. the
- RBA points just past the last redo record in the current log.)
- It is the job of instance recovery to ensure that all of the thread's
- redo for all online datafiles is applied. Because of the guarantee
- that all of the thread's redo prior to the thread checkpoint SCN has
- already been applied, instance recovery can make the guarantee
- that, by starting redo application at the thread checkpoint SCN, and
- continuing through end-of-thread, all of the thread's redo will have
- been applied.
- 2.13 Database Checkpoint Structure
- The database checkpoint structure is the thread checkpoint of the
- thread that has the lowest checkpoint SCN of all the open threads.
- The number of the database checkpoint thread - the number of
- the thread whose thread checkpoint is the current database
- checkpoint - is recorded in the database info record of the
- controlfile. If there are no open threads, then the database
- checkpoint is the thread checkpoint that contains the highest
- checkpoint SCN of all the enabled threads.
- Since each instance guarantees that all redo generated before its
- own thread checkpoint SCN has been written, and since the
- database checkpoint SCN is the lowest of the thread checkpoint
- SCNs, it follows that all pre-database-checkpoint-SCN redo in all
- instances has been written to all online datafiles.
- Thus, all pre-database-checkpoint-SCN redo generated in all
- threads for all online datafiles is guaranteed to be in the files on
- disk already. This is described by saying that the online datafiles
- are checkpointed at the database checkpoint. This is the rationale
- for using the database checkpoint to update the online datafile
- checkpoints (see below) when an instance checkpoints its thread
- (see 3.4).
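- The selection rule for the database checkpoint, lowest thread
- checkpoint SCN among open threads, else highest among enabled
- threads, can be sketched as follows (hypothetical structures):

```python
# Sketch (not Oracle code) of choosing the database checkpoint among the
# per-thread checkpoints, per the rule described above.

def database_checkpoint(threads):
    """threads: dicts with 'ckpt_scn', 'open', 'enabled' fields."""
    open_threads = [t for t in threads if t["open"]]
    if open_threads:
        # Lowest checkpoint SCN of all open threads.
        return min(open_threads, key=lambda t: t["ckpt_scn"])
    # No open threads: highest checkpoint SCN of all enabled threads.
    enabled = [t for t in threads if t["enabled"]]
    return max(enabled, key=lambda t: t["ckpt_scn"])

threads = [
    {"name": 1, "ckpt_scn": 500, "open": True,  "enabled": True},
    {"name": 2, "ckpt_scn": 450, "open": True,  "enabled": True},
    {"name": 3, "ckpt_scn": 700, "open": False, "enabled": True},
]
assert database_checkpoint(threads)["name"] == 2
```

- Taking the minimum over open threads is what makes the guarantee
- hold: all redo below that SCN has been written by every instance.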
- 2.14 Datafile Checkpoint Structure
- The header of each datafile contains a checkpoint structure known
- as the datafile checkpoint. The SCN field in this structure is known
- as the datafile checkpoint SCN.
- All pre-checkpoint-SCN redo generated in all threads for a given
- datafile is guaranteed to be in the file on disk already. An online
- datafile has its checkpoint SCN replicated in its controlfile record.
- Note: Oracle's recovery layer code is designed to "tolerate" a
- discrepancy in checkpoint SCN between the file header and the
- controlfile record. These values could get out of sync should an
- instance failure occur between the time the file header was updated
- and the time the controlfile "transaction" committed. (Note: A
- controlfile "transaction" is an RDBMS internal mechanism,
- independent of the Oracle transaction layer, that allows an
- arbitrarily large update to the controlfile to be "committed"
- atomically.)
- The execution of a datafile checkpoint (see 3.6) for a given datafile
- updates the checkpoint structure in the file header, and guarantees
- that all pre-checkpoint-SCN redo generated in all threads for that
- datafile is on disk already.
- A thread checkpoint event (see 3.4) guarantees that all pre-
- database-checkpoint-SCN redo generated in all threads for all
- online datafiles has been written to disk. The execution of a thread
- checkpoint may advance the database checkpoint (e.g. in the
- single-instance case; or if the thread having the oldest checkpoint
- changed from being the current thread to another thread). If the
- database checkpoint does advance, then the new database
- checkpoint is used to update the datafile checkpoints of all the
- online datafiles (except those in hot backup: see Section 4).
- It is the job of media recovery (see Section 6) to ensure that all redo
- for a recovery-datafile (i.e. a datafile being media-recovered)
- generated in any thread through the recovery end-point is applied.
- Because of the guarantee that all recovery-datafile-redo generated
- in any enabled thread prior to that datafile's checkpoint SCN has
- already been applied, media recovery can make the guarantee that,
- by starting redo application in each enabled thread with the datafile
- checkpoint SCN and continuing through the recovery end-point
- (e.g. end-of-thread on all threads in the case of complete media
- recovery), all redo for the recovery-datafile from all threads will
- have been applied.
- Since the datafile checkpoint is stored in the header of the datafile
- itself, it is also present in backup copies of the datafile. It is the job
- of hot backup (see Section 4) to ensure that - despite the
- occurrence of ongoing updates to the datafile during the backup
- copy operation - the version of the datafile's checkpoint captured
- in the backup copy satisfies the checkpoint-SCN guarantee with
- respect to the versions of the datafile's datablocks captured in the
- backup copy.
- 2.15 Stop SCN
- Each datafile's controlfile record has a field called the stop SCN. If
- the file is offline or read-only, the stop SCN is the SCN beyond
- which no further redo exists for that datafile. If the file is online and
- any instance has the database open, the stop SCN is set to
- "infinity." The stop SCN is used during media recovery to
- determine when redo application for a particular datafile can stop.
- This ensures that media recovery will terminate when recovering
- an offline file while the database is open.
- The stop SCN is set whenever a datafile is taken offline or set
- read-only. This is true whether the offline was "immediate" (due to
- an I/O error, or due to taking the file's tablespace offline "immediate"),
- "temporary" (due to taking the file's tablespace offline
- "temporary"), or "normal" (due to taking the file's tablespace
- offline "normal"). However, in the case of a datafile taken offline
- "immediate," there is no file checkpoint (see 3.6), and dirty buffers
- are discarded. Hence, media recovery may need to apply redo from
- before the stop SCN in order to bring the datafile online. However,
- media recovery does not need to look for redo after the stop SCN,
- since it does not exist. If the stop SCN is equal to the datafile
- checkpoint SCN, then the file does not need recovery.
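- The stop-SCN decisions above can be sketched as follows. This is an
- illustrative Python sketch; the names are ours, not actual kernel
- structures.

```python
# Illustrative sketch of the stop-SCN logic. INFINITY stands for the
# special value the stop SCN holds while the file is online and the
# database is open.
INFINITY = float("inf")

def needs_media_recovery(checkpoint_scn, stop_scn):
    # If the stop SCN equals the datafile checkpoint SCN, no redo beyond
    # the checkpoint exists, so the file needs no recovery.
    return stop_scn != checkpoint_scn

def redo_application_may_stop(next_redo_scn, stop_scn):
    # Media recovery for this file can stop once redo application reaches
    # the stop SCN: no redo for the file exists at or beyond it.
    return next_redo_scn >= stop_scn
```

- Note that, as described above, a file taken offline "immediate" may
- still need redo from before its stop SCN; the sketch covers only the
- stop test and the needs-recovery test.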
- 2.16 Checkpoint Counter
- There is a checkpoint counter kept in both the datafile header and
- in the datafile's controlfile record. Its purpose is to allow detection
- of the fact that a datafile or controlfile is a restored backup.
- The checkpoint counter is incremented every time checkpoints of
- online files are being advanced (e.g. by thread checkpoint). The
- datafile's checkpoint counter is thus incremented even when the
- datafile's checkpoint itself is not advanced - whether because the
- file is in hot backup (see Section 4), or because its checkpoint SCN
- is already beyond that of the intended checkpoint (e.g. the file is
- new or has undergone a recent datafile checkpoint).
- The old value of the checkpoint counter - matching the
- checkpoint counter in the datafile's controlfile record - is also
- remembered in the file header. It is usually one less than the current
- counter in the header, but may differ from the current counter by
- more than one if the previous file header update failed after the
- header was written but before the controlfile "transaction"
- committed.
- A mismatch in checkpoint counters between the datafile header and
- the datafile's controlfile record is used to detect when a backup
- datafile (or a backup controlfile) has been restored.
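- A minimal sketch of the detection rule (hypothetical field names):

```python
def is_restored_backup(header_ckpt_counter, controlfile_ckpt_counter):
    # The two counters are advanced together at every checkpoint of
    # online files; a mismatch means the datafile header (or the
    # controlfile) is a restored backup copy.
    return header_ckpt_counter != controlfile_ckpt_counter
```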
- 2.17 Tablespace-Clean-Stop SCN
- TS$, a data dictionary table that describes tablespaces, has a
- column called the tablespace-clean-stop-SCN. It identifies an SCN
- at which a tablespace was taken offline or set read-only "cleanly":
- i.e. after checkpointing its datafiles (see 3.6). The SCN at which the
- datafiles are checkpointed is recorded in TS$ as the
- tablespace-clean-stop SCN. It allows such a "clean-stopped"
- tablespace to survive (i.e. not need to be dropped after) a
- RESETLOGS open (see 8.6). During media recovery, prior to
- resetlogs, the "clean-stopped" tablespace would be set offline.
- After resetlogs, the tablespace - which needs no recovery - is
- permitted to be brought online and/or set read-write. (An
- immediate backup of the tablespace is recommended).
- The tablespace-clean-stop SCN is set to zero (after being set
- momentarily to "infinity" during datafile state transition) when
- bringing an offline-clean tablespace online, or setting a read-only
- tablespace read-write. The tablespace-clean-stop SCN is also
- zeroed when taking a tablespace offline "immediate" or
- "temporary."
- A tablespace that has a non-zero tablespace-clean-stop SCN in TS$
- is clean at that SCN: the tablespace currently contains all redo up
- through that SCN, and no redo for the tablespace beyond that SCN
- exists. If the tablespace's datafiles are still in the state they had
- when the tablespace was taken offline "normal" or set read-only -
- i.e. they are not restored backups, are not fuzzy, and are
- checkpointed at the clean-stop SCN - then the tablespace can be
- brought online without recovery. Note that the semantics of the
- tablespace-clean-stop SCN differ from those of a constituent
- datafile's stop SCN in the datafile's controlfile record. The
- controlfile stop SCN designates an SCN beyond which no redo for
- the datafile exists. This does not imply that the datafile currently
- contains all redo up through that SCN.
- The tablespace-clean-stop SCN is stored in TS$ rather than in the
- controlfile so that it is covered by redo and will finish in the correct
- state - i.e. reflecting the correct online/offline state of the
- tablespace - following an incomplete recovery (see 6.12). Its
- value will not be lost if a backup controlfile is restored, or if a new
- controlfile is created. Furthermore, the presence of the tablespace-
- clean-stop SCN in TS$ allows an offline normal (or read-only)
- tablespace to survive (not need to be dropped after) a
- RESETLOGS open, since it is known that no redo application is
- needed to bring it online (see 8.6 for more detail). Thus, for
- example, an offline normal (or read-only) tablespace that was
- offline during an incomplete recovery can be brought online (or set
- read-write) subsequent to a RESETLOGS open. Without the
- tablespace-clean-stop SCN, there would be no way of knowing that
- the tablespace does not need recovery using redo that was
- discarded by the resetlogs. The only alternative would have been to
- force the tablespace to be dropped.
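- The conditions under which a clean-stopped tablespace can come
- online without recovery (as described above) can be sketched as
- follows; this is a hypothetical illustration, not the actual
- dictionary check.

```python
def can_online_without_recovery(clean_stop_scn, datafiles):
    # datafiles: one (checkpoint_scn, is_fuzzy, is_restored_backup)
    # tuple per constituent datafile.
    if clean_stop_scn == 0:  # zero means "not clean-stopped"
        return False
    # The files must not be restored backups, must not be fuzzy, and
    # must be checkpointed at exactly the clean-stop SCN.
    return all(ckpt_scn == clean_stop_scn and not fuzzy and not restored
               for ckpt_scn, fuzzy, restored in datafiles)
```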
- 2.18 Datafile Offline Range
- The offline-start SCN and offline-end checkpoint fields of the
- controlfile datafile record describe the offline range. If valid, they
- delimit a log range guaranteed not to contain any redo for the
- datafile. Thus, media recovery can skip this log range when
- recovering the datafile, obviating the need to access old archived
- log data (which may be unavailable or unusable due to resetlogs: see
- Section 7). This optimization aids in recovering a datafile that is
- presently online (or read-write), but that was offline-clean (or read-
- only) for a long time, and whose last backup dates from that time.
- For example, this would be the case if, after a RESETLOGS open,
- an offline normal (or read-only) tablespace had been brought online
- (or set read-write), but not yet backed up.
- When a datafile transitions from offline-clean to online (or from
- read-only to read-write), the offline range is set as follows: The
- offline-start SCN is set from the tablespace-clean-stop SCN saved
- when setting the file offline (or read-only). The offline-end
- checkpoint is set from the file checkpoint taken when setting the
- file online (or read-write).
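- Assuming each log's SCN range is known (e.g. its low SCN and next
- SCN), the skip test might be sketched as follows; the names are
- illustrative only.

```python
def log_may_be_skipped(log_low_scn, log_next_scn,
                       offline_start_scn, offline_end_scn):
    # A log whose entire SCN range lies inside the offline range can
    # hold no redo for the datafile, so media recovery may skip it.
    return (offline_start_scn <= log_low_scn
            and log_next_scn <= offline_end_scn)
```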
- 3 Redo Generation
- Redo is generated to describe all changes made to database blocks.
- This section describes the various operations that occur while the
- database is open and generating redo.
- 3.1 Atomic Changes
- The most fundamental operation is to atomically change a set of
- datablocks. A foreground process intending to change one or more
- datablocks first acquires exclusive access to cache buffers
- containing those blocks. It then constructs the change vectors
- describing the changes. Space is allocated in the redo log buffer to
- hold the redo record. The redo log buffer - the buffer from which
- LGWR writes the redo log - is located in the SGA (System
- Global Area). It may be necessary to ask LGWR to write the buffer
- to the redo log in order to make space. If the log is full, LGWR
- may need to do a log switch in order to make the space available.
- Note that allocating space in the redo buffer also allocates space in
- the logfile. Thus, even though the redo buffer has been written, it
- may not be possible to allocate redo log space. After the space is
- allocated, the foreground process builds the redo record in the redo
- buffer. Only after the redo record has been built in the redo buffer
- may the datablock buffers be changed. Writing the redo to disk is
- the real change to the database. Recovery ensures that all changes
- that make it into the redo log make it into the datablocks (except in
- the case of incomplete recovery).
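- The ordering described above - the redo record is built in the log
- buffer before the datablock buffers are changed - can be sketched as
- a toy simulation (all names are illustrative; the real mechanism
- involves LGWR, latching, and possible log switches):

```python
class RedoLogBuffer:
    """Toy stand-in for the redo log buffer in the SGA."""
    def __init__(self):
        self.records = []

    def allocate_and_build(self, change_vectors):
        # In the real kernel this may have to wait for LGWR to free
        # space, or even force a log switch; here it simply appends.
        self.records.append(list(change_vectors))

def atomic_change(log_buffer, cache, change_vectors):
    # Order matters: the redo record is built in the log buffer FIRST;
    # only then are the cached datablocks actually modified.
    log_buffer.allocate_and_build(change_vectors)
    for block_id, field, value in change_vectors:
        cache[block_id][field] = value
```

- For example, one redo record covering changes to two blocks keeps
- the pair of changes atomic with respect to recovery.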
- 3.2 Write-Ahead Log
- Write-ahead log is a cache-enforced protocol governing the order
- in which dirty datablock buffers are written vs. when the redo log
- buffer is written. According to write-ahead log protocol, before
- DBWR can write out a cache buffer containing a modified
- datablock, LGWR must write out the redo log buffer containing
- redo records describing changes to that datablock.
- Note that write-ahead log is independent of log-force-at-commit
- (see 3.3).
- Note also that write-ahead log protocol only applies to datafile
- writes that originate from the buffer cache. In particular, write-
- ahead log does not apply to so-called direct path writes (e.g.
- originating from direct path load, table create via subquery, or
- index create). Direct path writes (targeted above the segment high-
- water mark) originate not as writes out of the buffer cache, but as
- bulk-writes out of the foreground process' data space. Indeed,
- correct handling of direct path writes by media recovery dictates a
- write-behind-log protocol. (The basic reason is that, because the
- bulk-writes do not go through the buffer cache, there is no
- mechanism to guarantee their completion at checkpoint).
- One guarantee made by write-ahead log protocol is that there are
- no changes in the datafiles that are not in the redo log, regardless of
- intervening failure. This is what enables recovery to preserve the
- guarantee of redo record atomicity despite intervening failure.
- Another guarantee made by write-ahead log protocol is that no
- datablock change can be written to disk without first writing to the
- redo log sufficient information to enable the change to be undone
- should the transaction fail to commit. That undo-enabling
- information is written to the redo log in the form of "redo" for the
- rollback segment.
- Write-ahead log protocol plays a key role in enabling the
- transaction layer to preserve the guarantee of transaction atomicity
- despite intervening failure.
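- The write-ahead-log test can be sketched as follows, assuming redo
- positions are comparable RBAs (illustrative names):

```python
def dbwr_may_write(buffer_last_change_rba, lgwr_flushed_rba):
    # Write-ahead log: DBWR may write a dirty block to its datafile
    # only after LGWR has flushed the redo describing the block's
    # latest change.
    return lgwr_flushed_rba >= buffer_last_change_rba
```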
- 3.3 Transaction Commit
- Transaction commit allocates an SCN and builds a commit redo
- record containing that SCN. The commit is complete when all of
- the transaction's redo (including commit redo record) is on disk in
- the log. Thus, commit forces the redo log to disk - at least up to
- and including the transaction's commit record. This is termed log-
- force-at-commit.
- Recovery is designed such that it is sufficient to write only the redo
- log at commit time - rather than all datablocks changed by the
- transaction - in order to guarantee transaction durability despite
- intervening failure. This is termed no-datablock-force-at-commit.
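- A sketch of the log-force-at-commit rule (illustrative names):

```python
def commit_is_durable(commit_record_rba, lgwr_flushed_rba):
    # Log-force-at-commit: the transaction is committed once the log is
    # on disk at least through its commit record; no datablock writes
    # are required (no-datablock-force-at-commit).
    return lgwr_flushed_rba >= commit_record_rba
```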
- 3.4 Thread Checkpoint
- A thread checkpoint event, executed by the instance associated
- with the redo thread being checkpointed, forces to disk all dirty
- buffers in that instance that contain changes to any online datafile
- before a designated SCN - the thread checkpoint SCN. Once all
- redo in the thread prior to the checkpoint SCN has been written to
- disk, the thread checkpoint structure in the thread's controlfile
- record is updated in a controlfile transaction.
- When a thread checkpoint begins, an SCN is captured and a
- checkpoint structure is initialized. Then all the dirty buffers in the
- instance's cache are marked for checkpointing. DBWR proceeds to
- write out the marked buffers in a staged manner. Once all the
- marked buffers have been written, the SCN in the checkpoint
- structure is set to the captured SCN, and the thread checkpoint
- structure in the thread's controlfile record is updated in a controlfile
- transaction.
- A thread checkpoint might or might not advance the database
- checkpoint. If only one thread is open, the new thread checkpoint
- becomes the new database checkpoint. If multiple threads are open,
- the database checkpoint advances only if the local thread's old
- checkpoint was the database checkpoint (i.e. the oldest open thread
- checkpoint). Since the new checkpoint SCN was allocated recently,
- it is most likely greater than the thread checkpoint SCN of some
- other open thread; in that case the database checkpoint advances
- only as far as the new lowest-SCN open thread checkpoint. If the
- old checkpoint SCN for the local thread was higher than the current
- checkpoint SCN of some other open thread, then the database
- checkpoint does not change.
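- The advance rule can be sketched as a function over the open
- threads' checkpoint SCNs (a hypothetical illustration):

```python
def database_checkpoint_after(thread_ckpt_scns, local_thread, new_local_scn):
    # The database checkpoint is the lowest checkpoint SCN among all
    # open threads. Advancing the local thread's checkpoint therefore
    # moves the database checkpoint only if the local thread held the
    # oldest checkpoint, and then only as far as the next-lowest SCN.
    scns = dict(thread_ckpt_scns)
    scns[local_thread] = new_local_scn
    return min(scns.values())
```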
- If the database checkpoint is advanced, then the checkpoint counter
- is advanced in every online datafile header. Furthermore, for each
- online datafile that is not in hot backup (see Section 4), and not
- already checkpointed at a higher SCN (e.g. as would be the case for
- a recently added or recovered file), the datafile header checkpoint is
- advanced to the new database checkpoint, and the file header is
- written to disk. Also, the checkpoint SCN in the datafile's
- controlfile record is advanced to the new database checkpoint SCN.
- 3.5 Online-Fuzzy Bit
- Note that more changes - beyond those already in the marked
- buffers - may be generated after the start of checkpoint. Such
- changes would be generated at SCNs higher than the SCN that will
- be recorded in the file header. They could either be changes to
- marked buffers that were added since checkpoint start, or else
- changes to unmarked buffers. Buffers containing these changes
- could be written out for a variety of reasons. Thus, the online files are
- online-fuzzy; that is, they generally contain changes in the future of
- (i.e. generated at higher SCNs than) their header checkpoint SCN.
- A datafile is virtually always online-fuzzy while it is online and the
- database is open.
- Online-fuzzy state is indicated by setting the so-called online-fuzzy
- bit in the datafile header. The online-fuzzy bits of all online
- datafiles are set at database open time. Also, when a datafile is
- brought online while the database is open, its online-fuzzy bit is
- set.
- The online-fuzzy bits are cleared after the last instance does a
- shutdown "normal" or "immediate." Other occasions for clearing
- the online-fuzzy bits are: (i) the finish of crash recovery; (ii) when
- media recovery "checkpoints" (flushes its buffers) after
- encountering an end-crash-recovery redo record (see 5.5); (iii)
- when taking a datafile offline "temporary" or "normal" (i.e. an
- offline operation that is preceded by a file checkpoint); (iv) when
- BEGIN BACKUP is issued (see 4.1).
- As will be seen in 8.1, open with resetlogs will fail if any online
- datafile has the online-fuzzy bit (or any fuzzy bit) set.
- 3.6 Datafile Checkpoint
- A datafile checkpoint event, executed by all open instances (for all
- open threads), forces to disk all dirty buffers in any instance that
- contain changes to a particular datafile (or set of datafiles) before a
- designated SCN - the datafile checkpoint SCN. Once all datafile-
- related redo from all open threads prior to the checkpoint SCN has
- been written to disk, the datafile checkpoint structure in the file
- header is updated and written to disk.
- Datafile checkpoints occur as part of operations such as beginning
- hot backup (see Section 4) and offlining datafiles as part of taking a
- tablespace offline normal (see 2.17).
- 3.7 Log Switch
- When an instance needs to generate more redo but cannot allocate
- enough blocks in the current log, it does a log switch. The first step
- in a log switch is to find an online log that is a candidate for reuse.
- The first requirement for the candidate log is that it must not be
- active: i.e. it must not be needed for crash/instance recovery. In
- other words, it must be overwritable without losing redo data
- needed for instance recovery. The principle enforced is that a
- logfile cannot be reused until the current thread checkpoint is
- beyond that logfile. Since instance recovery starts at the current
- thread checkpoint SCN/RBA (and expects to find that RBA in an
- online redo log), the ability to do instance recovery using only
- online logs translates into the requirement that the current thread
- checkpoint SCN be beyond the highest SCN associated with redo
- in the candidate log. If this is not the case, then the thread
- checkpoint currently in progress - e.g. the one started when the
- candidate log was originally switched into (see below) - is
- hurried to completion.
- The other requirement for the candidate log is that it does not need
- archiving. Of course, this requirement only applies to a database
- running in ARCHIVELOG mode. If archiving is required, the
- archiver is posted.
- As soon as the log switch completes, a new thread checkpoint is
- started in the new log. Hopefully, the checkpoint will complete
- before the next log switch is needed.
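- The candidate-reuse test can be sketched as follows (illustrative
- names):

```python
def can_reuse_log(log_highest_scn, thread_ckpt_scn,
                  archived, archivelog_mode):
    # A candidate log is reusable only when (a) the current thread
    # checkpoint is beyond all redo in it, so instance recovery will
    # never need it, and (b) in ARCHIVELOG mode, it has been archived.
    if archivelog_mode and not archived:
        return False
    return thread_ckpt_scn > log_highest_scn
```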
- 3.8 Archiving Log Switches
- Each thread switches logs independently. Thus, when running
- Parallel Server, an SCN is almost never at the beginning of a log in
- all threads. However, it is desirable to have roughly the same range
- of SCNs in the archived logs of all enabled threads. This ensures
- that the last log archived in each thread is reasonably current. If an
- unarchived log for an enabled thread contained a very old SCN (as
- would occur in the case of a relatively idle instance), it would not
- be possible to use archived logs from a primary site to do recovery
- to a higher SCN at a standby site. This would be true even if the log
- with the low SCN contained no redo.
- This problem is solved by forcing log switches in other threads
- when their current log is significantly behind the log just archived.
- For the case of an open thread, a lock is used to "kick" the laggard
- instance into switching logs and archiving when it can. For the case
- of a closed thread, the archiving process in the active instance does
- the closed thread's log switch and archiving for it. Note that this
- can result in a thread that is enabled but never used having a bunch
- of archived logs with only a file header. A force archiving SCN is
- maintained in the database info controlfile record to implement this
- feature. The system strives to archive any log that contains that
- SCN or less. In general, the log with the lowest SCN is archived
- first.
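- A sketch of the force-archive test (illustrative names):

```python
def should_force_archive(log_lowest_scn, force_archiving_scn):
    # The system strives to archive any log containing the force
    # archiving SCN or an earlier one, keeping the archived logs of
    # idle threads reasonably current.
    return log_lowest_scn <= force_archiving_scn
```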
- The command ALTER SYSTEM ARCHIVE LOG CURRENT can
- be used to manually archive the current logs of all enabled threads.
- It forces all threads, open and closed, to switch to a new log. It
- archives what is necessary to ensure all the old logs are archived. It
- does not return until all redo generated before the command was
- entered is archived. This command is useful for ensuring all redo
- logs necessary for the recovery of a hot backup are archived. It is
- also useful for ensuring the potential currency of a standby site in a
- configuration in which archived logs from a primary site are
- shipped to a standby site for application by recovery in case of
- disaster (i.e. "standby database").
- 3.9 Thread Open
- When an instance opens the database, it needs to open a thread for
- redo generation. The thread is chosen at mount time. A system
- initialization parameter can be used to specify the thread to mount
- by number. Otherwise, any available publicly-enabled thread can
- be chosen by the instance at mount time. A thread-mounted lock is
- used to prevent two instances from mounting the same thread.
- When an instance opens a thread, it sets the thread-open flag in the
- thread's controlfile record. While the instance is alive, it holds a set
- of thread-opened locks (one held by each of LGWR, DBWR,
- LCK0, LCK1, ...). (These are released at instance death, enabling
- one instance to detect the death of another in the Parallel Server
- environment: see 5.1). Also at thread open time, a new checkpoint
- is captured and used for the thread checkpoint. If this is the first
- database open, this becomes the new database checkpoint, ensuring
- all online files have their header checkpoints advanced at open
- time. Note that a log switch may be forced at thread open time.
- 3.10 Thread Close
- When an instance closes the database, or when a thread is
- recovered by instance/crash recovery, the thread is closed. The first
- step in closing a thread is to ensure that no more redo is generated
- in it. The next step is to ensure that all changes described by
- existing redo records are in the online datafiles on disk. In the case
- of normal database close, this is accomplished by doing a thread
- checkpoint. The SCN from this final thread checkpoint is said to be
- the "SCN at which the thread was closed." Finally, the thread's
- controlfile record is updated to clear the thread-open flag.
- In the case of thread close by instance recovery, the presence in the
- online datafiles of all changes described by thread redo records is
- ensured by starting redo application at the most recent thread
- checkpoint and continuing through end-of-thread. Once all changes
- described by thread redo records are in the online datafiles, the
- thread checkpoint is advanced to the end-of-thread. Just as in the
- case of a normal thread checkpoint, this checkpoint may advance
- the database checkpoint. If this is the last thread close, the database
- checkpoint thread field in the database info controlfile record -
- which normally points to an open thread - will be left pointing at
- this thread, even though it is closed.
- 3.11 Thread Enable
- In order for a thread to be opened, it must be enabled. This ensures
- that its redo will be found during media recovery. A thread may be
- enabled in either public or private mode. A private thread can only
- be mounted by an instance that specifies it in the THREAD system
- initialization parameter. This is analogous to rollback segments. A
- thread must have at least two online redo log groups while it is
- enabled. An enabled thread always has one online log that is its
- current log. The next SCN of the current log is infinite, so that any
- new SCN allocated will be within the current log. A special thread-
- enable redo record is written in the thread of an instance enabling a
- new thread (i.e. via ALTER DATABASE ENABLE THREAD).
- The thread-enable redo record is used by media recovery to start
- applying redo from the new thread. Note that this means it takes an
- open thread to enable another thread. This chicken and egg
- problem is resolved by having thread one automatically enabled
- publicly at database creation. This also means that databases that
- do not run in Parallel Server mode do not need to enable a thread.
- 3.12 Thread Disable
- If a thread is not going to be used for a long while, it is best to
- disable it. This means that media recovery will not expect any redo
- to be found in the thread. Once a thread is disabled, its logs may be
- dropped. A thread must be closed before it can be disabled. This
- ensures all its changes have been written to the datafiles. A new
- SCN is allocated to save as the next SCN for the current log. The
- log header is marked with this SCN and flags saying it is the end of
- a disabled thread. It is important that a new current SCN is
- allocated. This ensures the SCN in any checkpoint with this thread
- enabled will appear in one of the logs from the thread. Note that
- this means a thread must be open in order to disable another thread.
- Thus, it is not possible to disable all threads.
- 4 Hot Backup
- A hot backup is a copy of a datafile that is taken while the file is in
- active use. Datafile writes (by DBWR) go on as usual during the
- time the backup is being copied. Thus, the backup gets a "fuzzy"
- copy of the datafile:
- * Some blocks may be ahead in time versus other blocks of the
- copy.
- * Some blocks of the copy may be ahead of the checkpoint SCN
- in the file header of the copy.
- * Some blocks may contain updates that constitute breakage of
- the redo record atomicity guarantee with respect to other
- blocks in this or other datafiles.
- * Some block copies may be "fractured" (due to front and back
- halves being copied at different times, with an intervening
- update to the block on disk).
- The "hotbackup-fuzzy" copy is unusable without "focusing" (via
- the redo log) that occurs when the backup is restored and
- undergoes media recovery. Media recovery applies redo (from all
- threads) from the begin-backup checkpoint SCN (see Step 2. in
- Section 4.1) through the end-point of the recovery operation (either
- complete or incomplete). The result is a transaction-consistent
- "focused" version of the datafile.
- There are three steps to taking a hot backup:
- * Execute the ALTER TABLESPACE ... BEGIN BACKUP
- command.
- * Use an operating system copy utility to copy the constituent
- datafiles of the tablespace(s).
- * Execute the ALTER TABLESPACE ... END BACKUP
- command.
- 4.1 BEGIN BACKUP
- The BEGIN BACKUP command takes the following actions (not
- necessarily in the listed order) for each datafile of the tablespace:
- 1. It sets a flag in the datafile header - the hotbackup-fuzzy bit
- - to indicate that the file is in hot backup. The header with
- this flag set (copied by the copy utility) enables the copy to be
- recognized as a hot backup. A further purpose of this flag in
- the online file header is to cause the checkpoint in the file
- header to be "frozen" at the begin-backup checkpoint value
- that will be set in Step 4. This is the value that it must have in
- the backup copy in order to ensure that, when the backup is
- recovered, media recovery will start redo application at a
- sufficiently early checkpoint SCN so as to cover all changes to the
- file in all threads since the execution of BEGIN BACKUP (see
- 6.5). Since we cannot guarantee that the file header will be the
- first block to be written out by the copy utility, it is important
- that the file header checkpoint structure remain "frozen" until
- END BACKUP time. This flag keeps the datafile checkpoint
- structure "frozen" during hot backup, preventing it (and the
- checkpoint SCN in the datafile's controlfile record) from being
- updated during thread checkpoint events that advance the
- database checkpoint. New in v7.2: While the file is in hot
- backup, a new "backup" checkpoint structure in the datafile
- header receives the updates that the "frozen" checkpoint
- would have received.
- 2. It executes a datafile checkpoint, capturing the resultant
- "begin-backup" checkpoint information, including the begin-
- backup checkpoint SCN. When the file is checkpointed, all
- instances are requested to write out all dirty buffers they have
- for the file. If the need for instance recovery is detected at this
- time, the file checkpoint operation waits until it is completed
- before proceeding. Checkpointing the file at begin-backup
- time ensures that only file blocks changed after begin-backup
- time might have been written to disk during the course of the
- file copy. This guarantee is crucial to enabling block before-
- image logging to cope with the fractured block problem, as
- described in Step 3.
- 3. [Platform-dependent option]: It starts block before-image
- logging for the file. During block before-image logging, all
- instances log a full block before-image to the redo log prior to
- the first change to each block of the file (since the backup
- started, or since the block was read anew into the buffer
- cache). This is to forestall a recovery problem that would arise
- if the backup were to contain a fractured block copy
- (mismatched halves). This could happen if (the database block
- size is greater than the operating system block size, and) the
- front and back halves of the block were copied to the backup at
- different times - with an intervening update to the block on
- disk. In this eventuality, recovery can reconstruct the block
- using the logged block before-image.
- 4. It sets the checkpoint in the file header equal to the begin-
- backup checkpoint captured in Step 2. This file header
- checkpoint will be "frozen" until END BACKUP is executed.
- 5. It clears the file's online-fuzzy bit. The online-fuzzy bit
- remains clear during the course of the file copy operation, thus
- ensuring a cleared online-fuzzy bit in the file copy. Note that
- the online-fuzzy bit is set again by the execution of END
- BACKUP.
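- The frozen-checkpoint behavior of Steps 1, 4, and 5 - including the
- v7.2 shadow "backup" checkpoint - can be sketched as a toy
- simulation; the names are ours, not the real header layout.

```python
class DatafileHeader:
    """Toy datafile header holding only the fields discussed above."""
    def __init__(self, checkpoint_scn):
        self.checkpoint_scn = checkpoint_scn  # "frozen" during hot backup
        self.backup_checkpoint_scn = None     # v7.2 shadow checkpoint
        self.hotbackup_fuzzy = False
        self.online_fuzzy = True

def begin_backup(hdr, begin_backup_scn):
    hdr.hotbackup_fuzzy = True             # Step 1: mark file in backup
    hdr.checkpoint_scn = begin_backup_scn  # Step 4: value to be frozen
    hdr.online_fuzzy = False               # Step 5: clear online-fuzzy

def thread_checkpoint_advance(hdr, new_db_ckpt_scn):
    # While the file is in hot backup, the frozen checkpoint is left
    # alone and the shadow "backup" checkpoint receives the update.
    if hdr.hotbackup_fuzzy:
        hdr.backup_checkpoint_scn = new_db_ckpt_scn
    else:
        hdr.checkpoint_scn = max(hdr.checkpoint_scn, new_db_ckpt_scn)
```

- A thread checkpoint that advances the database checkpoint while the
- file is in backup thus leaves the frozen value intact for the copy.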
- 4.2 File Copy
- The file copy is done by utilities that are not part of Oracle. The
- presumption is that the platform vendor will have backup facilities
- that are superior to any portable facility that we could develop. It is
- the responsibility of the administrator to ensure that copies are only
- taken between the BEGIN BACKUP and END BACKUP
- commands, or when the file is not in use.
- 4.3 END BACKUP
- The END BACKUP command takes the following actions for each
- datafile of the tablespace:
- 1. It restores (i.e. sets) the file's online-fuzzy bit.
- 2. It creates an end-backup redo record (end-backup "marker")
- for the datafile. This record, interpreted only by media
- recovery, contains the begin-backup checkpoint SCN (i.e. the SCN
- matching that in the "frozen" checkpoint in the backup's
- header). This record serves to mark the end of the redo
- generated during the backup. The end-backup "marker" is used by
- media recovery to determine when all redo generated between
- BEGIN BACKUP and END BACKUP has been applied to the
- datafile. Upon encountering the end-backup "marker", media
- recovery can (at the next media recovery checkpoint: see
- 6.7.1) clear the hotbackup-fuzzy bit. This is only important in
- preventing an incomplete recovery that might erroneously
- attempt to end before all redo generated between BEGIN
- BACKUP and END BACKUP has been applied. Ending
- incomplete recovery at such a point may result in an
- inconsistent file, since the backup copy may already have
- contained changes beyond this endpoint. As will be seen in 8.1, open
- with resetlogs following incomplete media recovery will fail if
- any online datafile has the hotbackup-fuzzy bit (or any other
- fuzzy bit) set.
- 3. It clears the file's hotbackup-fuzzy bit.
- 4. It stops block before-image logging for the file.
- 5. It advances the file checkpoint to the current database
- checkpoint. This compensates for any file header update(s)
- missed during thread checkpoints that may have advanced the
- database checkpoint while the file was in hot backup state, with
- its checkpoint "frozen".
- 4.4 "Crashed" Hot Backup
- A normal shutdown of the instance that started a backup, or the last
- remaining instance, is not allowed while any files are in hot
- backup. Nor may a file in backup be taken offline normal or
- temporary. This is to ensure an end-backup "marker" is generated
- whenever possible, and to make administrators aware that they
- forgot to issue the END BACKUP command, and that the backup
- copy is unusable.
- When an instance failure or shutdown abort leaves a hot backup
- operation incomplete (i.e. lacking termination via END BACKUP),
- any file that was in backup before the failure has its hotbackup-
- fuzzy bit set and its checkpoint "frozen" at the begin-backup
- checkpoint. Even though the online file's datablocks are actually
- current to the database checkpoint, the file's header makes it look
- like a restored backup that needs media recovery and is current
- only to the begin-backup checkpoint. Crash recovery will fail -
- claiming media recovery is required - if it encounters an online
- file in "crashed" hot backup state. The file does not actually need
- media recovery, however; it needs only an adjustment to its file
- header to take it out of "crashed" hot backup state.
- Media recovery could be used to recover and allow normal open of
- a database that has files left in "crashed" hot backup state. For v7.2
- however, a preferable option - because it requires no archived
- logs - is to use the (new in v7.2) command ALTER DATABASE
- DATAFILE... END BACKUP on the files left in "crashed" hot
- backup state (identifiable using the V$BACKUP fixed-view: see
- 9.6). Following execution of this command, crash recovery will
- suffice to open the database. Note that the ALTER TABLESPACE
- ... END BACKUP format of the command cannot be used when the
- database is not open. This is because the database must be open in
- order to translate (via the data dictionary) tablespace names into
- their constituent datafile names.
- 5 Instance Recovery
- Instance recovery is used to recover from both crash failures and
- Parallel Server instance failures. Instance recovery refers either to
- crash recovery or to Parallel Server instance recovery (where a
- surviving instance recovers when one or more other instances fail).
- The goal of instance recovery is to restore the datablock changes
- that were in the cache of the dead instance and to close the thread
- that was left open. Instance recovery uses only online redo logfiles
- and current online datafiles (not restored backups). It recovers one
- thread at a time, starting at the most recent thread checkpoint and
- continuing until end-of-thread.
- 5.1 Detection of the Need for Instance Recovery
- The kernel performs instance recovery automatically upon
- detecting that an instance died leaving its thread-open flag set in
- the controlfile. Instance recovery is performed automatically on
- two occasions:
- 1. at the first database open after a crash (crash recovery);
- 2. when some but not all instances of a Parallel Server fail.
- In the case of Parallel Server, a surviving instance detects the need
- to perform instance recovery for one or more failed instances by
- the following means:
- 1. A foreground process in a surviving instance detects an
- "invalid block lock" condition when it attempts to bring a
- datablock into the buffer cache. This is an indication that
- another instance died while a block covered by that lock was
- in a potentially "dirty" state in its buffer cache.
- 2. The foreground process sends a notification to its instance's
- SMON process, which begins a search for dead instances.
- 3. The death of another instance is detected if the current
- instance is able to acquire that instance's thread-opened locks
- (see 3.9).
- SMON in the surviving instance obtains a stable list of dead
- instances, together with a list of "invalid" block locks. Note: After
- instance recovery is complete, locks in this list will undergo "lock
- cleanup" (i.e. they will have their "invalid" condition cleared,
- making the underlying blocks accessible again).
- 5.2 Thread-at-a-Time Redo Application
- Instance recovery operates by processing one thread at a time,
- thereby recovering one instance at a time. It applies all redo (from
- the thread checkpoint through the end-of-thread) from each thread
- before starting on the next thread. This algorithm depends on the
- fact that only one instance at a time can have a given block
- modified in its cache. Between changes to the block by different
- instances, the block is written to disk. Thus, a given block (as read
- from disk during instance recovery) can need redo applied from at
- most one thread - the thread containing the most recent
- modification.
- Instance recovery can always be accomplished using the online
- redo logs for the thread being recovered. Crash recovery operates
- on the thread with the lowest checkpoint SCN first. It proceeds to
- recover the threads in the order of increasing thread checkpoint
- SCNs. This ensures that the database checkpoint is advanced by
- each thread recovered.
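- The ordering described above can be sketched in a few lines. This is
- an illustrative model, not Oracle code: threads are recovered in
- increasing thread-checkpoint-SCN order, so the database checkpoint
- (the minimum of the thread checkpoints) advances as each thread is
- recovered.

```python
def crash_recovery_order(thread_checkpoints):
    """thread_checkpoints: {thread_number: checkpoint_scn}.
    Returns thread numbers in the order crash recovery processes them:
    lowest checkpoint SCN first."""
    return sorted(thread_checkpoints, key=lambda t: thread_checkpoints[t])
```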
- 5.3 Current Online Datafiles Only
- The checkpoint counters are used to ensure that the datafiles are the
- current online files rather than restored backups. If a backup copy
- of a datafile is restored, then media recovery is required.
- Media recovery is required for a restored backup even if recovery
- can be accomplished using the online logs. The reason is that crash
- recovery applies all post-thread-checkpoint redo from each thread
- before starting on the next thread. Crash recovery can use this
- thread-at-a-time redo application algorithm because a given
- datablock can need redo application from at most one thread.
- However, starting recovery from a restored backup enables no such
- assumption about the number of threads that have relevant redo.
- Thus, the thread-at-a-time algorithm would not work. Recovering a
- backup requires thread-merged redo application: i.e. application of
- all post-file-checkpoint redo, simultaneously merging redo from all
- threads in SCN order. This thread-merged redo application
- algorithm is the one used by media recovery (see Section 6).
- Crash recovery would not suffice - even with thread-merged redo
- application - to recover a backup datafile, even if it were
- checkpointed at the current database checkpoint. The reason is that
- in all but the database checkpoint thread, crash recovery would
- miss applying redo between the database checkpoint and the
- (higher) thread checkpoint. By contrast, media recovery would
- start redo application at the file checkpoint in all threads.
- Furthermore, crash recovery might fail even if it started redo
- application at the file checkpoint in all threads. The reason is that
- crash recovery assumes that it will need only online logfiles. All
- but the database checkpoint thread might have already archived
- and re-used a needed log.
- If the STARTUP RECOVER command is used (in place of simple
- STARTUP), and crash recovery fails due to datafiles needing
- media recovery (e.g. they are restored backups), then media
- recovery via RECOVER DATABASE (see 6.4.1) is automatically
- executed prior to database open.
- 5.4 Checkpoints
- Instance recovery does not attempt to apply redo that is before the
- checkpoint SCN of a datafile. (The datafile header checkpoint
- SCNs are not used to decide where to start recovery, however.)
- The redo from the thread checkpoint through the end-of-thread
- must be read to find the end-of-thread and the highest SCN
- allocated by the thread. These are then used to close the thread and
- advance the thread checkpoint. The end of an instance recovery
- almost always advances the datafile checkpoints, and always
- advances the checkpoint counters.
- 5.5 Crash Recovery Completion
- At the termination of crash recovery, the "fuzzy bits" - online-
- fuzzy, hotbackup-fuzzy, media-recovery-fuzzy - of all online
- datafiles are cleared. A special redo record, the end-crash-recovery
- "marker," is generated. This record is interpreted by media
- recovery to know when it is permissible to clear the online-fuzzy
- and hotbackup-fuzzy bits of the datafiles undergoing recovery (see
- 6.6).
- 6 Media Recovery
- Media recovery is used to recover from a lost or damaged datafile,
- or from a lost current controlfile. It is used to transform a restored
- datafile backup into a "current" datafile. It is also used to restore
- changes that were lost when a datafile went offline without a
- checkpoint. Media recovery can apply archived logs as well as
- online logs. Unlike instance or crash recovery, media recovery is
- invoked only via explicit command.
- 6.1 When to Do Media Recovery
- As was seen in 5.3, a restored datafile backup always needs media
- recovery, even if its recovery can be accomplished using only
- online logs. The same is true of a datafile that went offline without
- a checkpoint. The database cannot be opened if any of the online
- datafiles needs media recovery. A datafile that needs media
- recovery cannot be brought online until media recovery has been
- executed. If the database is open by any instance, media
- recovery can operate only on offline files. Media recovery may be
- explicitly invoked to recover a database prior to open even when
- crash recovery would have sufficed. If so, crash recovery - though
- it may find nothing to do - will still be invoked automatically at
- database open. Note that media recovery may be run - and, in
- cases such as restored backups or datafiles that went offline
- immediate, must be run - even if recovery can be accomplished
- using only the online logs. Media recovery may find nothing to do
- - and signal the "no recovery required" error - if invoked for
- files that do not need recovery.
- If the current controlfile is lost and a backup controlfile is restored
- in its place, media recovery must be done. This is the case even if
- all of the datafiles are current.
- 6.2 Thread-Merged Redo Application
- Media recovery uses a thread-merged redo application algorithm:
- i.e. it applies redo from all threads simultaneously, merging redo
- records in increasing SCN order. The process of media-recovering
- a backup datafile differs from the process of crash-recovering a
- current online datafile in the following fundamental way: Crash
- recovery applies redo from one thread at a time because any block
- of a current online file can need redo from at most one thread (one
- instance at a time can dirty a block in cache). With a restored
- backup, however, no assumption can be made about the number of
- threads that have redo relevant to a particular block. In general,
- recovering a backup requires simultaneous application of redo
- from all threads, with merging of redo records across threads in
- SCN order. Note that this algorithm depends on a redo-generation-
- time guarantee that changes for a given block occur in increasing
- SCN order across threads (case of Parallel Server).
- 6.3 Restoring Backups
- The administrator may copy a backup version of a datafile over the
- current datafile while the database is shut down or the file is offline.
- There is a strong assumption that backups are never copied to files
- that are currently accessible. Every file header read verifies that this
- has not been done by comparing the checkpoint counter in the file
- header with the checkpoint counter in the datafile's controlfile
- record.
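- The checkpoint-counter cross-check can be sketched as below. The
- function and its return strings are illustrative only; the comparison
- of the two counters is the mechanism described above (and the
- greater-than case anticipates the backup-controlfile test of 6.13).

```python
def check_datafile(header_ckpt_count, controlfile_ckpt_count):
    """Compare the checkpoint counter in the file header with the one
    in the datafile's controlfile record."""
    if header_ckpt_count < controlfile_ckpt_count:
        # Header is behind the controlfile: a restored backup.
        return "restored backup: media recovery required"
    if header_ckpt_count > controlfile_ckpt_count:
        # Header is ahead of the controlfile: the controlfile is old.
        return "controlfile is a backup (see 6.13)"
    return "current datafile"
```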
- 6.4 Media Recovery Commands
- There are three media recovery commands:
- * RECOVER DATABASE
- * RECOVER TABLESPACE
- * RECOVER DATAFILE
- The only essential difference in these commands is in how the set
- of files to recover is determined. They all use the same criteria for
- determining if the files can be recovered. There is a lock per
- datafile that is held exclusive by a process doing media recovery on
- a file, and is held shared by an instance that has the database open
- with the file online. Media recovery signals an error if it cannot get
- the lock for a file it is asked to recover. This prevents two recovery
- sessions from recovering the same file, and prevents media
- recovery of a file that is in use.
- 6.4.1 RECOVER DATABASE
- This command does media recovery on all online datafiles that
- need any redo applied. If all instances were cleanly shut down, and
- no backups were restored, this command will signal the "no
- recovery required" error. It will also fail if any instances have the
- database open, since they will have the datafile locks.
- 6.4.2 RECOVER TABLESPACE
- This command does media recovery on all datafiles in the
- tablespaces specified. In order to translate (i.e. via the data
- dictionary) the tablespace names into datafile names, the database
- must be open. This means that the tablespaces and their constituent
- datafiles must be offline in order to do the recovery. An error is
- signalled if none of the tablespace's constituent files needs recovery.
- 6.4.3 RECOVER DATAFILE
- This command specifies the datafiles to be recovered. The database
- may be open; or it may be closed, as long as the media recovery
- locks can be acquired. If the database is open in any instance, then
- datafile recovery can only recover offline files.
- 6.5 Starting Media Recovery
- Media recovery starts by finding the media-recovery-start SCN: i.e.
- the lowest SCN of the datafile header checkpoints of the files being
- recovered. Note: An exception occurs if a file's checkpoint is in its
- offline range (see 2.18). In that case, the file's offline-end
- checkpoint is used in place of its datafile header checkpoint in
- computing the media-recovery-start SCN.
- A buffer for reading redo is allocated for each thread in the enabled
- thread bitvec of the media-recovery-start checkpoint (i.e. the
- datafile checkpoint with the lowest SCN). The initial file header
- checkpoint SCN of every file is saved to ensure that no redo from a
- previous use of the file number is applied, as well as to eliminate
- needlessly attempting to apply redo to a file from before its
- checkpoint. The stop SCNs (from the datafiles' controlfile records)
- are also saved. If finite, the highest stop SCN can be used to allow
- recovery to terminate without needlessly searching for redo beyond
- that SCN to apply (see 6.10). At recovery completion, any datafile
- initially found to have a finite stop SCN will be left checkpointed at
- that stop SCN (rather than at the recovery end-point). This allows
- an offline-clean or read-only datafile to be left checkpointed at an
- SCN that matches the tablespace-clean-stop-SCN of its tablespace.
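- The computation of the media-recovery-start SCN, including the
- offline-range exception, can be sketched as follows. The dict-based
- file representation is an assumption made for illustration.

```python
def media_recovery_start_scn(files):
    """files: list of dicts with 'checkpoint_scn' and, optionally,
    'offline_range' = (offline_start_scn, offline_end_scn).
    Returns the lowest effective checkpoint SCN of the files being
    recovered."""
    def effective_scn(f):
        rng = f.get("offline_range")
        if rng and rng[0] <= f["checkpoint_scn"] <= rng[1]:
            # Checkpoint falls inside the file's offline range (2.18):
            # use the offline-end checkpoint instead.
            return rng[1]
        return f["checkpoint_scn"]
    return min(effective_scn(f) for f in files)
```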
- 6.6 Applying Redo, Media Recovery Checkpoints
- A log is opened for each thread of redo that was enabled at the time
- the media-recovery-start SCN was allocated (i.e. for each thread in
- the enabled thread bitvec of the media-recovery-start checkpoint).
- If the log is online, then it is automatically opened. If the log was
- archived, then the user is prompted to enter the name of the log
- (unless automatic recovery is being used). The redo is applied from
- all the threads in the order it was generated, switching threads as
- needed. The order of application of redo records without an SCN is
- not precise, but it is good enough for rollback to make the database
- consistent.
- Except in the case of cancel-based incomplete recovery (see
- 6.12.1) and backup controlfile recovery (see 6.13), the next online
- log in sequence is accessed automatically, if it is on disk. If not, the
- user is prompted for the next log.
- At log boundaries, media recovery executes a "checkpoint." As
- part of a media recovery checkpoint, the dirty recovery buffers are
- written to disk and the datafile header checkpoints of the files
- undergoing recovery are advanced, so that the redo does not need
- to be reapplied. Another type of media recovery "checkpoint"
- occurs when a datafile initially found to have a finite stop SCN
- reaches that stop SCN. At such a stop SCN boundary, all dirty
- recovery buffers are written to disk, and the datafiles that have been
- made current have their datafile header checkpoints advanced to
- their stop SCN values.
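- The log-boundary "checkpoint" above can be sketched as follows.
- Buffer and header shapes are hypothetical; the essential order is
- that dirty buffers reach disk before the header checkpoints advance,
- so already-applied redo need not be reapplied after an interruption.

```python
def media_recovery_checkpoint(dirty_buffers, file_headers, checkpoint_scn):
    """Write out dirty recovery buffers, then advance the header
    checkpoints of the files undergoing recovery to checkpoint_scn."""
    for buf in dirty_buffers:
        buf["on_disk"] = True          # write dirty buffer to disk
    dirty_buffers.clear()
    for hdr in file_headers:
        hdr["checkpoint_scn"] = max(hdr["checkpoint_scn"], checkpoint_scn)
```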
- 6.7 Media Recovery and Fuzzy Bits
- 6.7.1 Media-Recovery-Fuzzy
- The media-recovery-fuzzy bit is a flag in the datafile header that is
- used to indicate that - due to ongoing redo application by media
- recovery - the file may contain changes in the future of (at SCNs
- beyond) the current header checkpoint SCN. The media-recovery-
- fuzzy bit is set at the start of media recovery for each file
- undergoing recovery. Generally the media-recovery-fuzzy bits can
- be cleared when a media recovery checkpoint advances the
- checkpoints in the datafile headers. They are left clear when a
- media recovery session completes successfully or is cancelled. As
- will be seen in 8.1, open with resetlogs following incomplete
- media recovery will fail if any online datafile has the media-
- recovery-fuzzy bit (or any fuzzy bit) set.
- 6.7.2 Online-Fuzzy
- Upon encountering an end-crash-recovery "marker" (or a file-
- specific offline-immediate "marker": generated when a datafile
- goes offline without a checkpoint), media recovery can (at the next
- media recovery checkpoint) clear (if set) the online-fuzzy and
- hotbackup-fuzzy bits in the appropriate datafile header(s).
- 6.7.3 Hotbackup-Fuzzy
- Upon encountering an end-backup "marker" (or an end-crash-
- recovery "marker"), media recovery can (at the next media
- recovery checkpoint) clear the hotbackup-fuzzy bit. Open with
- resetlogs following incomplete media recovery will fail if any
- online datafile has the hotbackup-fuzzy bit (or any fuzzy bit) set.
- This prevents a successful RESETLOGS open following an
- incomplete recovery that terminated before all redo generated
- between BEGIN BACKUP and END BACKUP had been applied.
- Ending incomplete recovery at such a point would generally result
- in an inconsistent file, since the backup copy may already have
- contained changes between this endpoint and the END BACKUP.
- 6.8 Thread Enables
- A special thread-enable redo record is written in the thread of an
- instance enabling a new thread. If media recovery encounters a
- thread-enable redo record, it allocates a new redo buffer, opens the
- appropriate log in the new thread, and prepares to start applying
- redo from the new thread.
- 6.9 Thread Disables
- When a thread is disabled, its current log is marked as the end of a
- disabled thread. After media recovery finishes applying redo from
- such a log, it deallocates the thread's redo buffer and stops looking
- for redo from the thread.
- 6.10 Ending Media Recovery (Case of Complete Media Recovery)
- The current (i.e. last) log in every enabled thread has the end-of-
- thread flag set in its header. Complete (as opposed to incomplete:
- see 6.12) media recovery always continues redo application
- through the end-of-thread in all threads. The end-of-thread log can
- be identified without having the current controlfile, since the end-
- of-thread flag is in the log header rather than in the logfile's
- controlfile record.
- Note: Backing up and later restoring copies of current online logs
- is dangerous, and can lead to mis-identification of the current true
- end-of-thread. This is because the end-of-thread flag in the backup
- copy will in general be out-of-date with respect to the current end-
- of-thread log.
- If the datafiles being recovered have finite stop SCNs in their
- controlfile records (assuming a current controlfile), then media
- recovery can stop prior to the end-of-threads. Redo application for
- a datafile with a finite stop SCN can terminate at that SCN, since it
- is guaranteed that no redo for that datafile beyond that SCN was
- generated.
- As described in 2.15, the stop SCN is set when a datafile goes
- offline. Note that without the optimization that allows recovery of a
- file with a finite stop SCN to terminate at that SCN, it could not be
- guaranteed that recovery of an offline datafile while the database is
- open would terminate.
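- The stop-SCN optimization reduces to a simple per-file termination
- test, sketched below under assumed names: recovery of a file with a
- finite stop SCN may end once redo through that SCN has been applied;
- a file with an infinite stop SCN must be recovered to end-of-thread.

```python
INFINITE = float("inf")

def file_recovery_done(applied_through_scn, stop_scn):
    """True once redo application for this file may stop: its stop SCN
    is finite (set when the file went offline) and has been reached."""
    return stop_scn != INFINITE and applied_through_scn >= stop_scn
```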
- 6.11 Automatic Recovery
- Automatic recovery is invoked by using the AUTOMATIC option
- of the media recovery command. It saves the user the trouble of
- entering the names of archived logfiles, provided they are on disk.
- If the sequence number of the log can be determined, then a name
- can be constructed by concatenating the current values of the
- initialization parameters LOG_ARCHIVE_DEST and
- LOG_ARCHIVE_FORMAT. The current LOG_ARCHIVE_DEST
- is assumed, unless the user overrides it by specifying a different
- archiving destination for the recovery session. The media-
- recovery-start checkpoint (see 6.5) contains (in the RBA field) the
- initial log sequence number for one thread (i.e. the thread that
- generated the checkpoint). If multiple threads of redo are enabled,
- the log history section of the controlfile (if configured) can be used
- to map the media-recovery-start SCN to a log sequence number for
- each thread. Once the initial recovery log is found for a thread, all
- subsequent logs needed from the thread follow in order. If it is not
- possible to determine the initial log sequence number, the user will
- have to guess and try logs until the right one is accepted. The
- timestamp from the media-recovery-start checkpoint is reported to
- aid in this effort.
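- The name construction described above can be sketched as follows,
- assuming the %t (thread) and %s (sequence) substitutions of
- LOG_ARCHIVE_FORMAT; the paths shown in the usage are examples only.

```python
def archived_log_name(log_archive_dest, log_archive_format,
                      thread, sequence):
    """Build an archived log filename by concatenating
    LOG_ARCHIVE_DEST with LOG_ARCHIVE_FORMAT, substituting the
    thread number for %t and the log sequence number for %s."""
    return log_archive_dest + (log_archive_format
                               .replace("%t", str(thread))
                               .replace("%s", str(sequence)))
```

- For example, with LOG_ARCHIVE_DEST = "/arch/" and
- LOG_ARCHIVE_FORMAT = "log_%t_%s.arc", thread 1 sequence 42 maps to
- "/arch/log_1_42.arc".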
- 6.12 Incomplete Recovery
- A RECOVER DATABASE execution can be stopped and the
- database opened before all the redo has been applied. This type of
- recovery is termed incomplete recovery. The subsequent database
- open is termed a RESETLOGS open.
- Incomplete recovery effectively sets the entire database backwards
- in time to a transaction-consistent state at or near the recovery end-
- point. All subsequent updates to the database are lost and must be
- re-entered.
- Use of incomplete recovery is indicated in the following
- circumstances:
- * Media recovery is necessary (e.g. due to datafile damage or
- loss), but cannot be complete (i.e. all redo cannot be applied)
- because all copies of a needed online or archived redo log
- were lost.
- * All copies of an active (i.e. needed for instance recovery) log
- were damaged or lost while the database was open. Since
- crash recovery is precluded, this case reduces to the previous
- case.
- * It is necessary to reverse the effect of an erroneous user action
- (e.g. table drop or batch run); and it is acceptable to set the
- entire database - not just the affected schema objects -
- backwards to a point-in-time before the error.
- 6.12.1 Incomplete Recovery UNTIL Options
- There are three types of incomplete recovery. They differ in the
- means used to stop the recovery:
- * Cancel-Based (RECOVER DATABASE UNTIL CANCEL)
- * Change-Based (RECOVER DATABASE UNTIL CHANGE)
- * Time-Based (RECOVER DATABASE UNTIL TIME)
- The UNTIL CANCEL option terminates recovery when the user
- enters "cancel" rather than the name of a log. Online logs are not
- automatically applied in this mode in case cancellation at the next
- log is desired. If multiple threads of redo are being recovered, there
- may be logs in other threads that are partially applied when the
- recovery is cancelled.
- The UNTIL CHANGE option terminates redo application just
- before any redo associated with the specified SCN or higher. Thus
- the transaction that committed at that SCN will be rolled back. If
- you want to recover through a transaction that committed at a
- specific SCN, then add one to the specified SCN.
- The UNTIL TIME option works similarly to the UNTIL CHANGE
- option, except that a time rather than an SCN is specified.
- Recovery uses the timestamps in the redo block headers to convert
- the specified time into an SCN. Then recovery is stopped when that
- SCN is reached.
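- The UNTIL CHANGE stopping rule can be sketched as below (record
- shape assumed): application stops just before the first record whose
- SCN is at or beyond the specified change number, which is why a
- transaction committing exactly at that SCN is rolled back.

```python
def apply_until_change(redo_records, until_scn):
    """redo_records: (scn, change) pairs in increasing SCN order.
    Returns the records actually applied: everything strictly before
    until_scn."""
    applied = []
    for scn, change in redo_records:
        if scn >= until_scn:
            break          # stop before any redo at or past until_scn
        applied.append((scn, change))
    return applied
```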
- 6.12.2 Incomplete Recovery and Consistency
- In order to avoid database corruption when running incomplete
- recovery, all datafiles must be recovered to the exact same point.
- Furthermore, no datafile may have any changes in the future of this
- point. This requires that incomplete media recovery must start from
- datafiles restored from backups whose copies completed prior to
- the intended stop time. The system uses file header fuzzy bits (see
- 8.1) to ensure that the datafiles contain no changes in the future of
- the stop time.
- 6.12.3 Incomplete Recovery and Datafiles Known to the Controlfile
- If recovering to a time before a datafile was dropped, the dropped
- file must appear in the controlfile used for recovery. Otherwise it
- would not be recovered. One alternative for achieving this is to
- recover using a backup controlfile made before the datafile was
- dropped. Another alternative is to use the CREATE
- CONTROLFILE command to construct a controlfile that lists the
- dropped datafile.
- Recovering to a time before a file was added is not a problem. The
- extra datafile will be eliminated from the controlfile after the
- database is open. The unwanted file may be taken offline before the
- recovery to avoid accessing it.
- 6.12.4 Resetlogs Open after Incomplete Recovery
- The next database open after an incomplete recovery must specify
- the RESETLOGS option. Amongst other effects (see Section 7),
- resetlogs throws away the redo that was not applied during the
- incomplete recovery, and marks the database so that the skipped
- redo can never be accidentally applied by a subsequent recovery. If
- the incomplete recovery was a mistake (e.g. the lost log was
- found), the next open can specify the NORESETLOGS option.
- However, for the open with NORESETLOGS to succeed, it must
- be preceded by a successful execution of complete recovery (i.e.
- one in which all redo is applied).
- 6.12.5 Files Offline during Incomplete Recovery
- If a file is offline during incomplete recovery, it will not be
- recovered. This is acceptable if the file is part of a tablespace that was taken
- offline normal, and that is still offline normal at the recovery end-
- point. Otherwise, if the file is still offline when the resetlogs is
- done, the tablespace containing the file will have to be dropped.
- This is because it will need media recovery with logs from before
- the resetlogs. In general V$DATAFILE should be checked to
- ensure that files are online before running an incomplete recovery.
- Only files that will be dropped and files that are part of offline
- normal (or read-only) tablespaces should be offline (Section 8.6).
- 6.13 Backup Controlfile Recovery
- If recovery is done with a controlfile other than the current one,
- then backup controlfile recovery (RECOVER
- DATABASE...USING BACKUP CONTROLFILE) must be used.
- This applies both to the case of a restored controlfile backup, and to
- the case of a "backup" controlfile created via CREATE
- CONTROLFILE...RESETLOGS.
- Use of CREATE CONTROLFILE...RESETLOGS makes a
- controlfile that is a "backup." Only a backup controlfile recovery
- can be run after executing CREATE
- CONTROLFILE...RESETLOGS. Only a RESETLOGS open can
- be used after executing CREATE
- CONTROLFILE...RESETLOGS. Use of CREATE
- CONTROLFILE...RESETLOGS is indicated if (all copies of) an
- online redo log were lost in addition to (all copies of) the control
- file.
- By contrast, CREATE CONTROLFILE...NORESETLOGS makes
- a controlfile that is "current"; i.e. it has knowledge of the current
- state of the online logfiles and log sequence numbers. A backup
- controlfile recovery is not necessary following CREATE
- CONTROLFILE...NORESETLOGS. Indeed, no recovery at all is
- required if there was a clean shutdown, and if no datafile backups
- have been restored. A normal or NORESETLOGS open may
- follow CREATE CONTROLFILE ...NORESETLOGS.
- A backup controlfile lacks valid information about the current
- online logs and datafile stop SCNs. Hence, recovery cannot look
- for online logs to automatically apply. Moreover, recovery must
- assume infinite stop SCNs. A RESETLOGS open corrects this
- information. The backup controlfile may have a different set of
- threads enabled than did the original controlfile. That set will be the
- effective enabled thread set following RESETLOGS open.
- The BACKUP CONTROLFILE option may be used either alone or
- in conjunction with an incomplete recovery option. Unless an
- incomplete recovery option is included, all threads must be applied
- to the end-of-thread. This is validated at open resetlogs time.
- It is currently required that a RESETLOGS open follow execution
- of backup controlfile recovery, even if no incomplete recovery
- option was used. The following procedure could be used to avoid a
- backup controlfile recovery and resetlogs in case the only problem
- is a lost current controlfile (and a backup controlfile exists):
- 1. Copy the backup controlfile to the current control file and do a
- STARTUP MOUNT.
- 2. Issue ALTER DATABASE BACKUP CONTROLFILE TO
- TRACE NORESETLOGS.
- 3. Issue the CREATE CONTROLFILE...NORESETLOGS command
- from the SQL script output by Step 2.
- It is important to ensure that the CREATE CONTROLFILE
- command issued in Step 3 creates a controlfile reflecting a database
- structure equivalent to that of the lost current controlfile. For
- example, if a datafile was added since the backup controlfile was
- saved, then the CREATE CONTROLFILE command should be
- modified to declare the added datafile.
- Failure to specify the BACKUP CONTROLFILE option on the
- RECOVER DATABASE command when the controlfile is indeed a
- backup can frequently be detected. One indication of a restored
- backup controlfile would be a datafile header checkpoint count that
- is greater than the checkpoint count in the datafile's controlfile
- record. However, this test may not catch the backup controlfile if
- the datafiles are also backups. Another test validates the online
- logfile headers against their corresponding controlfile records, but
- this too may not always catch an old controlfile.
- 6.14 CREATE DATAFILE: Recover a Datafile Without a Backup
- If a datafile is lost or damaged and no backup of the file is
- available, it can be recovered using only information in the redo
- logs and control file. The following conditions must be met:
- 1. All redo logs written since the datafile was originally created
- must be available.
- 2. A control file in which the datafile is declared (i.e. name and
- size information) must be available or re-creatable.
- The CREATE DATAFILE clause of the ALTER DATABASE
- command is first used to create a new, empty replacement for the
- lost datafile. RECOVER DATAFILE is then used to apply all redo
- generated for the file from the time of its original creation until the
- time it was lost. After all redo logs written since the datafile was
- originally created have been applied, the file will have been
- restored to its state at the time it was lost. This mechanism is useful
- for recovering a recently-created datafile for which no backup has
- yet been taken. The original datafiles of the SYSTEM tablespace
- cannot be recovered by this means, however, since relevant redo
- data is not saved at database creation time.
- 6.15 Point-in-Time Recovery Using Export/Import
- Occasionally, it may become necessary to reverse the effect of an
- erroneous user action (e.g. table drop or batch run). One approach
- would be to perform an incomplete media recovery to a point-in-
- time before the corruption, then open the database with the
- RESETLOGS option. Using this approach, the entire database -
- not just the affected schema objects - would be set backwards in
- time.
- This approach has an undesirable side-effect: it discards committed
- transactions. Any updates that occurred subsequent to the resetlogs
- SCN are lost and must be re-entered. Resetlogs has another
- undesirable side-effect: it renders all pre-existing backups unusable
- for future recovery.
- Setting a mission-critical database globally back in time is often
- not an acceptable solution. The following procedure is an
- alternative whose effect on the mission-critical database is to set
- just the affected schema objects - termed the recovery-objects -
- backwards in time.
- Point-in-time incomplete media recovery is run against a side-copy
- of the production database, called the recovery-database. The
- initial version of the recovery-database is created using backups of
- the production database that were taken before the corruption
- occurred. Non-relevant objects in the recovery-database can be
- taken offline in order to avoid unnecessarily recovering them.
- However, the SYSTEM tablespace and all tablespaces containing
- rollback segments must participate in the media recovery in order
- to allow a clean open. (Note that this is a good reason to place
- rollback segments and data segments into separate tablespaces.)
- After it has undergone point-in-time incomplete media recovery,
- the recovery-database is opened with the RESETLOGS option.
- The recovery-database is now set backwards to a point-in-time
- before the recovery-objects were corrupted. This effectively
- creates pre-corruption versions of the recovery-objects in the
- recovery-database. These objects can then be exported from the
- recovery-database and imported back into the production database.
- Prior to importing the recovery-objects, the production database is
- prepared as follows:
- * In the case of recovering an erroneously updated schema
- object, the copy of the object in the production database is
- prepared by discarding just the data; e.g. the table is truncated.
- * In the case of recovering an erroneously dropped schema
- object, the object is re-created (empty) in the production database.
- The import operation is then executed, using the data-only option
- as appropriate. Since export/import can be a lengthy process, it
- may be desirable to postpone it until a time when recovery-object
- unavailability can be tolerated. In the meantime, the recovery-
- objects can be made available, albeit at degraded performance, via
- a database link between the production database and the recovery-
- database.
- An undesirable side-effect of this approach is that transaction
- consistency across objects is lost. This side-effect can be avoided
- by widening the recovery-object set to include all objects that must
- be kept transaction-consistent.
- 7 Block Recovery
- Block recovery is the simplest type of recovery. It is performed
- automatically by the system during normal operation of the
- database, and is transparent to the user.
- 7.1 Block Recovery Initiation and Operation
- Block recovery is used to clean up the state of a buffer whose
- modification by a foreground process (in the middle of invoking a
- redo application callback to apply a change vector to the buffer)
- was interrupted by the foreground process dying or signalling an
- error. Recovery involves (i) reading the block from disk; (ii) using
- the current thread's online redo logs to reconstruct the buffer to a
- state consistent with the redo already generated; and (iii) writing
- the recovered block back to disk. If block recovery fails, then after
- a second attempt, the block is marked logically corrupt (by setting
- the block sequence number to zero) and a corrupt block error is
- signalled.
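- The three-step procedure, including the retry and the corruption
- marking, can be sketched in Python. This is a minimal model with
- hypothetical names and structures (blocks and disks are plainly not
- dicts in the real kernel); it only illustrates the control flow described
- above:

```python
# Illustrative sketch of block recovery; names and data layouts are
# hypothetical, not Oracle internals.

class CorruptBlockError(Exception):
    pass

def block_recover(disk, addr, redo, max_attempts=2):
    """Rebuild the on-disk block from current-thread redo.

    disk: dict mapping block address -> {'seq': n, 'data': value}
    redo: list of (seq, new_data) change vectors for this block, in log order.
    """
    for _ in range(max_attempts):
        block = dict(disk[addr])                 # (i) read the block from disk
        try:
            for seq, data in redo:
                if seq > block['seq']:           # change not yet on disk: apply it
                    block['seq'], block['data'] = seq, data
            disk[addr] = block                   # (iii) write the recovered block back
            return block
        except Exception:
            continue                             # one retry before giving up
    disk[addr]['seq'] = 0                        # mark the block logically corrupt
    raise CorruptBlockError(addr)
```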
- Block recovery is guaranteed doable using only the current thread's
- online redo logs, since:
- 1. Block recovery cannot require redo from another thread or
- from before the last thread checkpoint.
- 2. Online logs are not reused until the current thread checkpoint
- is beyond the log.
- 3. No buffer currently in the cache can need recovery from
- before the last thread checkpoint.
- 7.2 Buffer Header RBA Fields
- The buffer header (an in-memory data structure) contains the
- following fields pertaining to block recovery:
- Low-RBA and High-RBA: Delineate the range of redo (from the
- current thread) that needs to be applied to the disk version of the
- block in order to make it consistent with redo already generated.
- Recovery-RBA: A place marker for recording progress in case the
- invoker of block recovery is PMON and complete recovery in
- one invocation would take too long (see next section).
- 7.3 PMON vs. Foreground Invocation
- If an error is signalled while a foreground process is in a redo
- application callback, then the process itself executes block
- recovery. If foreground process death is detected during a redo
- application callback, on the other hand, PMON executes block
- recovery.
- Block recovery may require an unbounded amount of time and I/O.
- However, PMON cannot be allowed to spend an inordinate amount
- of time working on the recovery of one block while neglecting
- other necessary time-critical tasks. Therefore, a limit is placed on
- the amount of redo applied by one PMON call to block recovery.
- (A port-specific constant specifies the maximum number of redo
- log blocks applied per invocation). As PMON applies redo during
- invocations of block recovery, it updates the recovery-RBA in the
- buffer header to record its progress. When a PMON call to block
- recovery causes the recovery-RBA to reach the high-RBA, then
- block recovery for that block is complete.
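- PMON's bounded, resumable application of redo can be sketched as
- follows. The model is hypothetical: RBAs are represented as plain
- integers, and the per-call limit stands in for the port-specific
- constant mentioned above:

```python
# Sketch of PMON's bounded block recovery (illustrative model only).

PMON_MAX_LOG_BLOCKS = 4   # stand-in for the port-specific per-call limit

def pmon_block_recover_step(hdr, apply_redo):
    """Apply at most PMON_MAX_LOG_BLOCKS of redo, advancing recovery-RBA.

    hdr: buffer header dict with 'low_rba', 'high_rba', 'recovery_rba'.
    apply_redo(from_rba, to_rba): applies that redo range to the block.
    Returns True once recovery_rba has reached high_rba (recovery done).
    """
    start = max(hdr['recovery_rba'], hdr['low_rba'])
    end = min(start + PMON_MAX_LOG_BLOCKS, hdr['high_rba'])
    apply_redo(start, end)
    hdr['recovery_rba'] = end       # record progress in the buffer header
    return hdr['recovery_rba'] >= hdr['high_rba']
```

- Each invocation picks up where the recovery-RBA left off, so PMON
- can interleave block recovery with its other time-critical tasks.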
- 8 Resetlogs
- The RESETLOGS option is needed on the first database open
- following:
- * Incomplete recovery
- * Backup controlfile recovery
- * CREATE CONTROLFILE...RESETLOGS.
- The primary function of resetlogs is to discard the redo that was not
- applied during incomplete recovery, ensuring that the skipped redo
- can never be accidentally applied by a subsequent recovery. To
- accomplish this, resetlogs effectively invalidates all existing redo
- in all online and archived redo logfiles. This has the side effect of
- making any existing datafile backups unusable for future recovery
- operations.
- Resetlogs also reinitializes the controlfile information about online
- logs and redo threads, clears the contents of any existing online
- redo log files, creates the online redo log files if they do not
- currently exist, and resets the log sequence number in all threads to
- one.
- 8.1 Fuzzy Files
- The most important requirement when doing a RESETLOGS open
- is that all datafiles be validated as recovered to the same point-in-
- time. This is what ensures that all the changes in a single redo
- record are done atomically. It is also important for other
- consistency reasons. If all threads of redo have been applied
- through end-of-thread to all online datafiles, then we can be sure
- that the database is consistent.
- If incomplete recovery was done, there is the possibility that a file
- was not restored from a sufficiently old backup. In the general case,
- this is detectable if the file has a different checkpoint than the other
- files (exceptions: offline or read-only files).
- The other possibility is that the file is fuzzy - i.e. it may contain
- changes in the future of its checkpoint. As seen earlier, the
- following "fuzzy bits" are maintained in the file header to
- determine if a file is fuzzy:
- * online-fuzzy bit (see 3.5, 6.7.2)
- * hotbackup-fuzzy bit (see 4, 6.7.3)
- * media-recovery-fuzzy bit (see 6.7.1)
- Open with resetlogs following incomplete media recovery will fail
- if any online datafile has any of the three fuzzy bits set.
- Redo records are created at the end of a hot backup (the end-
- backup "marker") and after crash recovery (the end-crash-recovery
- "marker") to enable media recovery to determine when it can clear
- the fuzzy bits.
- Except in the following special circumstances, resetlogs signals an
- error if any of the datafiles is recovered to a checkpoint SCN
- different from the one at which the other files are checkpointed (i.e.
- the resetlogs SCN: see 8.2):
- 1. A file recovered to an SCN earlier than the resetlogs SCN is
- tolerated if no redo was generated for the file between its
- checkpoint SCN and the resetlogs SCN. For example, this would
- be the case if the file were read-only and its offline range
- spanned the checkpoint SCN and resetlogs SCN. In this case,
- resetlogs allows the file but sets it offline.
- 2. A file checkpointed at an SCN later than the resetlogs SCN is
- tolerated if its creation SCN (allocated at file creation time and
- stored in the file header) shows it to have been created after the
- resetlogs SCN. During the data dictionary vs. controlfile check
- performed by RESETLOGS open (see 8.7), such a file would be
- found to be missing from the data dictionary but present in the
- controlfile. As a consequence, it would be eliminated from the
- controlfile.
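- The per-file validation just described can be sketched as follows.
- This is an illustrative model: the field names are hypothetical
- stand-ins for the file header contents, and the offline-range test is
- one plausible way to establish that no redo exists for the file in the
- relevant SCN interval:

```python
# Sketch of the resetlogs datafile checks (illustrative only).

class ResetlogsError(Exception):
    pass

def check_datafile(f, resetlogs_scn):
    """Validate one online datafile for RESETLOGS open.

    f: dict with 'fuzzy_bits' (set of bit names), 'checkpoint_scn',
       'creation_scn', and 'offline_range' ((low, high) or None).
    Returns 'ok', 'set_offline', or 'drop_from_controlfile'.
    """
    if f['fuzzy_bits']:
        raise ResetlogsError('fuzzy bits set: %s' % sorted(f['fuzzy_bits']))
    ckpt = f['checkpoint_scn']
    if ckpt == resetlogs_scn:
        return 'ok'
    if ckpt < resetlogs_scn:
        # Exception 1: no redo for the file between its checkpoint SCN
        # and the resetlogs SCN (e.g. its offline range spans both).
        rng = f.get('offline_range')
        if rng and rng[0] <= ckpt and resetlogs_scn <= rng[1]:
            return 'set_offline'
        raise ResetlogsError('file not recovered far enough')
    # Exception 2: file created after the resetlogs SCN; the dictionary
    # check (8.7) will find it missing from FILE$ and drop it.
    if f['creation_scn'] > resetlogs_scn:
        return 'drop_from_controlfile'
    raise ResetlogsError('file recovered past the resetlogs SCN')
```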
- 8.2 Resetlogs SCN and Counter
- A resetlogs SCN and resetlogs timestamp - known together as the
- resetlogs data - are kept in the database info record of the
- controlfile. The resetlogs data is intended to uniquely identify each
- execution of a RESETLOGS open. The resetlogs data is also stored
- in each datafile header and in each logfile header. A redo log cannot
- be applied by recovery if its resetlogs data does not match that in
- the database info record of the controlfile. Except for some very
- special circumstances (e.g. offline normal or read-only
- tablespaces), a datafile cannot be recovered or accessed if its
- resetlogs data does not match that of the database info record of the
- controlfile. This ensures that changes discarded by resetlogs do not
- get back into the database. It also renders previous backups
- unusable for future recovery operations, making it prudent to take a
- database backup immediately after a resetlogs.
- 8.3 Effect of Resetlogs on Threads
- Each thread's controlfile record is updated to clear the thread-open
- flag and to set the thread-checkpoint SCN to the resetlogs SCN.
- Thus, the thread appears to have been closed at the resetlogs SCN.
- The set of enabled threads from the enabled thread bitvec of the
- database info controlfile record is used as is. It does not matter
- which threads were enabled at the end of recovery, since none of
- the old redo can ever be applied to the database again. The log
- sequence numbers in all threads are also reset to one. The
- checkpoint of one of the enabled threads is picked as the database
- checkpoint.
- 8.4 Effect of Resetlogs on Redo Logs
- The redo is thrown away by zeroing all the online logs. Note that
- this means that redo in the online logs would be lost forever - and
- there would be no way to undo the resetlogs in an emergency - if
- the online logs were not backed up prior to executing resetlogs.
- Note that ensuring the ability to undo an erroneous resetlogs is the
- only valid rationale for making backups of online logs. Undoing an
- erroneous resetlogs requires re-running the entire recovery
- operation from the beginning, after restoring backups of all
- datafiles, controlfile, and online logs.
- One log is picked to be the current log for every enabled thread.
- That log header is written as log sequence number one. Note that
- the set of logs and their thread association is picked up from the
- controlfile (i.e. using the thread number and log list fields of the
- logfile records). If it is a backup controlfile, this may be different
- from what was current the last time the database was open.
- 8.5 Effect of Resetlogs on Online Datafiles
- The headers of all the online datafiles are updated to be
- checkpointed at the new database checkpoint. The new resetlogs
- data is also written to the header.
- 8.6 Effect of Resetlogs on Offline Datafiles
- The controlfile record for an offline file is set to indicate the file
- needs media recovery. However that will not be possible because it
- would be necessary to apply redo from logs with the wrong
- resetlogs data. This means that the tablespace containing the file
- will have to be dropped. There is one important exception to this
- rule. When a tablespace is taken offline normal or set read-only, the
- checkpoint SCN written to the headers of the tablespace's
- constituent datafiles is saved in the data dictionary TS$ table as the
- tablespace-clean-stop SCN (see 2.17). No recovery is ever needed
- to bring a tablespace and its files online if the files are not fuzzy
- and are checkpointed at exactly the tablespace-clean-stop SCN.
- Even the resetlogs data in the offline file header is ignored in this
- case. Thus a tablespace that is offline normal is unaffected by any
- resetlogs that leaves the database at a time when the tablespace is
- offline.
- 8.7 Checking Dictionary vs. Controlfile on Resetlogs Open
- After the rollback phase of RESETLOGS open, the datafiles listed
- in the data dictionary FILE$ table are compared with the datafiles
- listed in the controlfile. This is also done on the first open after a
- CREATE CONTROLFILE. There is the possibility that incomplete
- recovery ended at a time when the files in the database were
- different from those in the controlfile used for the recovery. Using a
- backup controlfile or creating one can have the same problem.
- Checking the dictionary does not do any harm, so it could be done
- on every database open; however there is no point in wasting the
- time under normal circumstances.
- The entry in FILE$ is compared with the entry in the controlfile
- for every file number. Since FILE$ reflects the space allocation
- information in the database, it is correct, and the controlfile might
- be wrong. If the file does not exist in FILE$ but the controlfile
- record says the file exists, then the file is simply dropped from the
- controlfile.
- If a file exists in FILE$ but not in the controlfile, a placeholder
- entry is created in the controlfile under the name MISSINGnnnn
- (where nnnn is the file number in decimal). MISSINGnnnn is
- flagged in the controlfile as being offline and needing media
- recovery. The actual file corresponding to MISSINGnnnn (with
- respect to the file header contents as opposed to the file name) can
- be made accessible by renaming MISSINGnnnn to point to it.
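- The reconciliation in both directions can be sketched as follows,
- with FILE$ and the controlfile modeled as simple in-memory
- structures (an illustrative simplification, of course):

```python
# Sketch of the FILE$-vs-controlfile comparison (illustrative only).

def check_dictionary(file_dollar, controlfile):
    """Reconcile the controlfile's datafile list with FILE$.

    file_dollar: set of file numbers known to the data dictionary.
    controlfile: dict file# -> filename; modified in place.
    Returns the list of MISSINGnnnn placeholders created.
    """
    placeholders = []
    # A controlfile entry with no FILE$ row: drop it from the controlfile.
    for fno in list(controlfile):
        if fno not in file_dollar:
            del controlfile[fno]
    # A FILE$ row with no controlfile entry: create an offline
    # placeholder flagged as needing media recovery.
    for fno in sorted(file_dollar):
        if fno not in controlfile:
            name = 'MISSING%04d' % fno
            controlfile[fno] = name
            placeholders.append(name)
    return placeholders
```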
- In the RESETLOGS open case however, rename can succeed in
- making the file usable only in case the file was read-only or offline
- normal. If, on the other hand, MISSINGnnnn corresponds to a file
- that was not read-only or offline normal, then the rename operation
- cannot be used to make it accessible, since bringing it online would
- require media recovery with redo from before the resetlogs. In this
- case, the tablespace containing the datafile must be dropped.
- When the dictionary check is due to open after CREATE
- CONTROLFILE...NORESETLOGS rather than to open resetlogs,
- media recovery may be used to make the file current.
- Another option is to repeat the entire operation that led up to the
- dictionary check with a controlfile that lists the same datafiles as
- the data dictionary. For incomplete recovery, this would involve
- restoring all backups and repeating the recovery.
- 9 Recovery-Related V$ Fixed-Views
- The V$ fixed-views contain columns that extract information from
- data structures dynamically maintained in memory by the kernel.
- These "views" make this information accessible to the DBA under
- SYS. The following is a summary of recovery-related information
- that is viewable via V$ views:
- 9.1 V$LOG
- Contains log group information from the controlfile:
- GROUP#
- THREAD#
- SEQUENCE#
- SIZE_IN_BYTES
- MEMBERS_IN_GROUP
- ARCHIVED_FLAG
- STATUS_OF_GROUP (unused, current, active, inactive)
- LOW_SCN
- LOW_SCN_TIME
- 9.2 V$LOGFILE
- Contains log file (i.e. group member) information from the
- controlfile:
- GROUP#
- STATUS_OF_MEMBER (invalid, stale, deleted)
- NAME_OF_MEMBER
- 9.3 V$LOG_HISTORY
- Contains log history information from the controlfile:
- THREAD#
- SEQUENCE#
- LOW_SCN
- LOW_SCN_TIME
- NEXT_SCN
- 9.4 V$RECOVERY_LOG
- Contains information (from the controlfile log history) about
- archived logs needed to complete media recovery:
- THREAD#
- SEQUENCE#
- LOW_SCN_TIME
- ARCHIVED_NAME
- 9.5 V$RECOVER_FILE
- Contains information on the status of files needing media recovery:
- FILE#
- ONLINE_FLAG
- REASON_MEDIA_RECOVERY_NEEDED
- RECOVERY_START_SCN
- RECOVERY_START_SCN_TIME
- 9.6 V$BACKUP
- Contains status information relative to datafiles in hot backup:
- FILE#
- FILE_STATUS (no-backup-active, backup-active, offline-normal,
- error)
- BEGIN_BACKUP_SCN
- BEGIN_BACKUP_TIME
- 10 Miscellaneous Recovery Features
- 10.1 Parallel Recovery (v7.1)
- The goal of the parallel recovery feature is to use compute and I/O
- parallelism to reduce the elapsed time required to perform crash
- recovery, single-instance recovery, or media recovery. Parallel
- recovery is most effective at reducing recovery time when several
- datafiles on several disks are being recovered concurrently.
- 10.1.1 Parallel Recovery Architecture
- Parallel recovery partitions recovery processing into two
- operations:
- 1. Reading the redo log.
- 2. Applying the change vectors.
- Operation #1 does not easily lend itself to parallelization. The redo
- log(s) must be read sequentially, and merged in the case of
- media recovery. Thus, this task is assigned to one process: the
- redo-reading-process.
- Operation #2, on the other hand, easily lends itself to
- parallelization. Thus, the task of change vector application is
- delegated to some number of redo-application-slave-processes.
- The redo-reading-process sends change vectors to the redo-
- application-slave-processes using the same IPC (inter-process-
- communication) mechanism used by parallel query. The change
- vectors are distributed based on a hash function of the block
- address (i.e. DBA modulo the number of redo-application-slave-
- processes). Thus, each redo-application-slave-process
- handles only change vectors for blocks whose DBAs hash to its
- "bucket" number. The redo-application-slave-processes are
- responsible for reading the datablocks into cache, checking
- whether or not the change vectors need to be applied, and applying
- the change vectors if needed.
- This architecture achieves parallelism in log read I/O, datablock
- read I/O, and change vector processing. It allows overlap of log
- read I/Os with datablock read I/Os. Moreover, it allows overlap of
- datablock read I/Os for different hash "buckets." Recovery elapsed
- time is reduced as long as the benefits of compute and I/O
- parallelism outweigh the costs of process management and inter-
- process-communication.
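- The bucketing performed by the redo-reading-process can be
- sketched as follows. The real IPC mechanism is replaced here by
- in-memory lists, and the names are illustrative:

```python
# Sketch of change-vector distribution to redo-application slaves
# (illustrative; the parallel-query IPC is modeled as simple lists).

from collections import defaultdict

def distribute(change_vectors, n_slaves):
    """Route each change vector to the slave owning its block's hash bucket.

    change_vectors: iterable of (dba, change) pairs in log order.
    Returns bucket -> ordered list of (dba, change); all changes for a
    given block land in the same bucket, preserving their log order.
    """
    buckets = defaultdict(list)
    for dba, change in change_vectors:
        buckets[dba % n_slaves].append((dba, change))  # DBA modulo #slaves
    return buckets
```

- Because the hash depends only on the DBA, all redo for one block is
- applied by a single slave, in log order, with no cross-slave
- coordination on individual blocks.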
- 10.1.2 Parallel Recovery System Initialization Parameters
- PARALLEL_RECOVERY_MAX_THREADS
- PARALLEL_RECOVERY_MIN_THREADS
- These initialization parameters control the number of redo-
- application-slave-processes used during crash recovery or
- media recovery of all datafiles.
- PARALLEL_INSTANCE_RECOVERY_THREADS
- This initialization parameter controls the number of redo-
- application-slave-processes used during instance recovery.
- 10.1.3 Media Recovery Command Syntax Changes
- RECOVER DATABASE has a new optional parameter for
- specifying the number of redo-application-slave-processes. If
- specified, it overrides PARALLEL_RECOVERY_MAX_THREADS.
- RECOVER TABLESPACE has a new optional parameter for
- specifying the number of redo-application-slave-processes. If
- specified, it overrides PARALLEL_RECOVERY_MIN_THREADS.
- RECOVER DATAFILE has a new optional parameter for
- specifying the number of redo-application-slave-processes. If
- specified, it overrides PARALLEL_RECOVERY_MIN_THREADS.
- 10.2 Redo Log Checksums (v7.2)
- The log checksum feature allows a potential corruption in an online
- redo log to be detected when the log is read for archiving. The goal
- is to prevent the corruption from being propagated, undetected, to
- the archive log copy. This feature is intended to be used in
- conjunction with a new command, CLEAR LOGFILE, that allows
- a corrupted online redo log to be discarded without having to
- archive it.
- A new initialization parameter, LOG_BLOCK_CHECKSUM,
- controls activation of log checksums. If it is set, a log block
- checksum is computed and placed in the header of each log block
- as it is written out of the redo log buffer. If present, checksums are
- validated whenever log blocks are read for archiving or recovery. If
- a checksum is detected as invalid, an attempt is made to read
- another member of the log group (if any). If an irrecoverable
- checksum error is detected - i.e. the checksum is invalid in all
- members - then the log read operation fails.
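- The validate-then-fall-back behavior can be sketched as follows.
- This is an illustrative model: CRC-32 stands in for whatever
- checksum the kernel actually uses, and log members are modeled as
- simple maps from block number to (stored checksum, payload):

```python
# Sketch of log block checksum validation across group members
# (illustrative; real checksums live in the log block header).

import zlib

class ChecksumError(Exception):
    pass

def checksum(payload):
    # CRC-32 as a stand-in for the actual log block checksum.
    return zlib.crc32(payload) & 0xffffffff

def read_log_block(members, block_no):
    """Return the first member's copy of the block whose checksum verifies.

    members: list of member "files", each a dict
             block# -> (stored_checksum, payload).
    Raises ChecksumError only if the checksum is invalid in every member.
    """
    for member in members:
        stored, payload = member[block_no]
        if stored == checksum(payload):
            return payload               # good copy found in this member
    raise ChecksumError(block_no)        # irrecoverable: bad in all members
```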
- Note that a rudimentary mechanism for detecting log block header
- corruption was added, along with log group support, in v7.1. The
- log checksum feature extends corruption detection to the whole
- block.
- If an irrecoverable checksum error prevents a log from being read
- for archiving, then the log cannot be reused. Eventually log switch
- - and redo generation - will stall. If no action is taken, the
- database will hang. The CLEAR LOGFILE command provides a
- way to obviate the requirement that the log be archived before it
- can be reused.
- 10.3 Clear Logfile (v7.2)
- If all members of an online redo log group are "lost" or "corrupted"
- (e.g. due to checksum error, media error, etc.), redo generation may
- proceed normally until it becomes necessary to reuse the logfile.
- Once the thread checkpoints of all threads are beyond the log, it is a
- potential candidate for reuse. Possible scenarios preventing reuse
- are the following:
- 1. The log cannot be archived due to a checksum error; it cannot
- be reused because it needs archiving.
- 2. A log switch attempt fails because the log is inaccessible (e.g.
- due to a media error). The log may or may not have been
- archived.
- The ALTER DATABASE CLEAR LOGFILE command is
- provided as an aid to recovering from such scenarios involving an
- inactive online redo log group (i.e. one that is not needed for crash
- recovery). CLEAR LOGFILE allows an inactive online logfile to
- be "cleared": i.e. discarded and reinitialized, in a manner analogous
- to DROP LOGFILE followed by ADD LOGFILE. In many cases,
- use of this command obviates the need for database shutdown or
- resetlogs.
- Note: CLEAR LOGFILE cannot be used to clear a log needed for
- crash recovery (i.e. a "current" or "active" log of an open thread).
- Instead, if such a log becomes lost or corrupted, shutdown abort
- followed by incomplete recovery and open resetlogs will be
- necessary.
- Use of the UNARCHIVED option allows the log clear operation to
- proceed even if the log needs archiving: an operation that would be
- disallowed by DROP LOGFILE. Furthermore, CLEAR LOGFILE
- allows the log clear operation to proceed in the following cases:
- * There are only two logfile groups in the thread.
- * All log group members have been lost through media failure.
- * The logfile being cleared is the current log of a closed thread.
- All of these operations would be disallowed in the case of DROP
- LOGFILE.
- Clearing an unarchived log makes unusable any existing backup
- whose recovery would require applying redo from the cleared log.
- Therefore, it is recommended that the database be immediately
- backed up following use of CLEAR LOGFILE with the
- UNARCHIVED option. Furthermore, the UNRECOVERABLE
- DATAFILE option must be used if there is a datafile that is offline,
- and whose recovery prior to onlining requires application of redo
- from the cleared logfile. Following use of CLEAR LOGFILE with
- the UNRECOVERABLE DATAFILE option, the offline datafile,
- together with its entire tablespace, will have to be dropped from the
- database. This is due to the fact that redo necessary to bring it
- online has been cleared, and there is no other copy of it.
- The foreground process executing CLEAR LOGFILE processes
- the command in several steps:
- * It checks that the logfile is not needed for crash recovery and
- is clearable.
- * It sets the "being cleared" and "archiving not needed" flags in
- the logfile controlfile record. While the "being cleared" flag is
- set, the logfile is ineligible for reuse by log switch.
- * It recreates a new logfile, and performs multiple writes to clear
- it to zeroes (a lengthy process).
- * It resets the "being cleared" flag.
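- The steps above can be sketched as a small state machine. The flag
- names and the log representation are hypothetical stand-ins for bits
- in the logfile's controlfile record:

```python
# Sketch of the CLEAR LOGFILE steps (illustrative state machine only).

class ClearError(Exception):
    pass

def clear_logfile(log, zero_out):
    """Discard and reinitialize an inactive online log.

    log: dict with 'active' (needed for crash recovery) and 'flags' (set).
    zero_out(log): rewrites the log members with zeroes (may be lengthy).
    """
    # Step 1: the log must not be needed for crash recovery.
    if log['active']:
        raise ClearError('log needed for crash recovery')
    # Step 2: while 'being_cleared' is set, log switch skips this log.
    log['flags'] |= {'being_cleared', 'archiving_not_needed'}
    # Step 3: recreate the log and write zeroes over it.
    zero_out(log)
    # Step 4: clear the flag; the log is again eligible for reuse.
    log['flags'].discard('being_cleared')
    return log
```

- If the process dies between steps 2 and 4, the "being cleared" flag
- stays set, which is why the partially-cleared log cannot be used
- until the command is reissued or the log is dropped.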
- If the foreground process executing CLEAR LOGFILE dies while
- execution is in progress, the log will not be usable as the current
- log. Redo generation may stall and the database may hang, much as
- would happen if log switch had to wait for checkpoint completion,
- or for log archive completion. Should the process executing
- CLEAR LOGFILE die, the operation should be completed by
- reissuing the same command. Another option would be to drop the
- partially-cleared log. CLEAR LOGFILE could also fail due to an
- I/O error encountered while writing zeros to a log group member.
- An option for recovering would be to drop that member and add
- another to replace it.