Instance recovery in Oracle 10g

Learning objective

After completing this topic, you should be able to recognize how to tune instance recovery and configure a database for maximum recoverability.

1. Instance recovery

Instance failure occurs when the database instance shuts down before it can synchronize its files.

Causes of instance failure include

software failure such as the failure of background processes
hardware failure or a power outage
the use of emergency shutdown commands such as SHUTDOWN ABORT and STARTUP FORCE

The DBA needs only to restart the instance and automatic recovery will begin.

Recovery involves rolling forward the data files to the time of shutdown using changes in the redo log groups, and then rolling back uncommitted transactions.

Administrators should endeavor to prevent future failures. They can investigate the causes of failure through the alert log, trace files, and Enterprise Manager.

Instance recovery is a four-stage process:

the DBA restarts the instance
the instance mounts the control files and attempts to open the datafiles
the instance rolls forward all datafiles to the time of shutdown
the instance rolls back uncommitted transactions

the DBA restarts the instance: The first stage of Oracle Database 10 g's automatic instance failure recovery occurs when the DBA restarts the instance.
the instance mounts the control files and attempts to open the datafiles: The instance mounts the control files and attempts to open the datafiles.

The instance can only open a datafile if the SCN (System Change Number) contained within its header matches the SCN in the database control file.

If the SCNs do not match, the datafiles have not been synchronized during shutdown, and recovery is initiated.
the instance rolls forward all datafiles to the time of shutdown: The instance sequentially redoes the transactions performed on the datafiles up to the time of shutdown.

To roll forward the datafiles in this way, the instance uses the information in the redo log groups.

The database will now contain data from transactions that were in progress but not committed when failure occurred.
the instance rolls back uncommitted transactions: After the database opens, any uncommitted transactions are rolled back.

At this point only committed data exists in the datafiles.

Question

Identify the typical causes of instance failure.

Options:

Hardware failure
Listener failure
Software failure
The use of emergency shutdown commands

Answer

Instance failure can be the result of software failure, hardware failure, or the use of emergency shutdown commands.

Option 1 is correct. Instance failure can be the result of hardware failure - for example, a disk may be corrupted due to a power outage.

Option 2 is incorrect. Listener failure is a type of network failure and can occur even if the instance is still running.

Option 3 is correct. Instance failure can be the result of software failure - for example, the failure of background processes.

Option 4 is correct. Instance failure can be the result of the issuing of emergency shutdown commands such as SHUTDOWN ABORT and STARTUP FORCE .

Question

Rank the phases of instance recovery.

Options
Option	Description
A	The DBA restarts the instance
B	The instance mounts control files and attempts to open datafiles
C	The instance rolls back uncommitted transactions
D	The instance rolls forward datafiles

Answer

Correct ranking
Option	Description
A	The DBA restarts the instance After the DBA restarts the instance, the instance will mount the control files and attempt to open the datafiles.
B	The instance mounts control files and attempts to open datafiles The instance can only open a datafile if the SCN (System Change Number) contained within its header matches the SCN in the database control file.
D	The instance rolls forward datafiles When rolling forward the datafiles, the instance uses the information in the redo log groups.
C	The instance rolls back uncommitted transactions Transactions that are in progress but not committed when failure occurs get rolled forward along with all the others. They need to be rolled back.

2. Tuning instance recovery time

Before an instance returns the commit complete command for a transaction, information about the transaction is always recorded in the redo log groups.

This information guarantees the recoverability of the transaction if failure occurs.

The same information must also be written to the datafile. The write to the datafile typically happens after the write to the redo log groups, because random writes to datafiles take longer than serial writes to redo log files.

Checkpoints keep track of what has already been written to a datafile.

Wherever a checkpoint occurs, it guarantees that data up to a specific SCN is recorded in a datafile.

Transactions after the checkpoint position may or may not be written to a datafile.

In the graphic illustrating how checkpoints operate, the striped blocks are not yet written to disk.

Instance recovery time equals the time needed to bring the datafiles from the last checkpoint to the latest SCN recorded in the control file.

The administrator controls the duration of instance recovery by setting a Mean Time To Recovery (MTTR) target and altering the size of redo log groups.

Note

The distance between the checkpoint position and the end of the redo log group should never exceed 90 percent of the smallest redo log group.

As a DBA, you can set your own MTTR. Ideally, the duration you specify should match the service level agreement with your customer.

You need to be aware of the drawbacks involved in

setting the MTTR value too high
setting the MTTR value too low

setting the MTTR value too high: Setting the MTTR target too high means that the instance takes too much time to recover after a crash.
setting the MTTR value too low: Setting the MTTR target value too low delays writes to the log groups until writes to the data files have completed. This undermines performance.

Suppose you want to specify the MTTR for your database.

You first open Enterprise Manager's Maintenance page.

To configure a range of recovery settings, you click Configure Recovery Settings in the Backup/Recovery section of the page.

The Configure Recovery Settings page opens. The Instance Recovery section tells you what the current estimated mean time to recover is.

The advisor allows you to set Desired Mean Time To Recover in minutes or seconds. The value you specify is translated into settings for the FAST_START_MTTR_TARGET initialization parameter. The maximum value is 60 minutes.

You want the Mean Time To Recovery to be 1 minute.

You type 1 in the Desired Mean Time To Recover field and click Apply.

A success message displays.

Question

Your service level agreement specifies an MTTR of 30 seconds for the database.

Starting from the Maintenance tabbed page in Enterprise Manager, which steps set the database's MTTR to 30 seconds?

Options:

Click Configure Recovery Settings in the Backup/Recovery section of the page. Then type 30 in the Desired Mean Time To Recover field and select Seconds from the Minutes/Seconds drop-down list. Click Apply.
Click Configure Recovery Settings in the Backup/Recovery section of the page. Then type 30 in the Desired Mean Time To Recover field and click Apply.

Answer

You click Configure Recovery Settings in the Backup/Recovery section of the page. Then type 30 in the Desired Mean Time To Recover field and select Seconds from the Minutes/Seconds drop-down list. Finally, you click Apply.

3. Handling media failure

Media failure describes any failure resulting in the loss or corruption of one or more database files.

A database file could be a datafile, a control file, or a redo log file.

Media failure can occur when a disk drive fails, when a disk controller fails, or when a database file is deleted or corrupted.

If media failure occurs, you should restore the affected files from your backups.

If needs be, you should inform the database of new file locations.

You may also have to recover files by applying redo information.

To aid recoverability, the DBA should

schedule regular backups
multiplex control files
multiplex redo log groups
keep archived copies of redo logs

schedule regular backups: The majority of media failures require you to restore lost or damaged files from backup.

So the more regular the backup, the more up-to-date the restore is likely to be.
multiplex control files: Every database has a set of identical control files.

Losing all the files presents a more serious challenge than losing one.

For maximum recoverability, you should retain at least three copies.
multiplex redo log groups: When instance or media failure occurs, the information stored in redo log groups is used to roll forward data files to the most recently committed transaction.
If your database's redo log groups depend on one redo log file, the loss of that file will most likely result in data loss.

So you should ensure that there are at least two copies of each redo log group member.
keep archived copies of redo logs: When a file is being restored from backup, the instance has to apply redo information up to the most recent SCN in the control file.

By default, once written, redo information is deleted.

But you can configure the database to archive copies of the redo logs by placing it in what is called ARCHIVELOG mode.

Question

Suppose you want to configure a database for maximum recoverability from media failure. Which are the accurate descriptions of measures to achieve this?

Options:

Archive copies of the control files by placing the database in ARCHIVELOG mode
Keep archived copies of the redo log files
Keep multiple copies of the control files
Keep multiple copies of the redo log group members
Schedule regular backups

Answer

You should keep multiple copies of the redo log groups and control files. You should also keep archived copies of redo log files and schedule regular backups.

Option 1 is incorrect. Placing the database in ARCHIVELOG mode means the database is configured to retain archived copies of the redo logs.

Option 2 is correct . When a file is being restored from backup, the instance has to apply redo information up to the most recent SCN in the control file. By default, once written, redo information is deleted. But you can configure the database to archive copies of the redo logs by placing it in ARCHIVELOG mode.

Option 3 is correct. Every database has a set of identical control files. Losing all the files presents a more serious challenge than losing one. So you should retain multiple copies.

Option 4 is correct. If your database's redo log groups depend on one redo log file, the loss of that file will most likely result in data loss. So keeping multiple copies improves recoverability.

Option 5 is correct. Most media failures require you to restore lost or damaged files from backup. So the more regular the backup, the more up-to-date the restore is likely to be.

Summary

Instance failure occurs when the database instance shuts down before it can synchronize its files. The DBA can restart the instance to begin automatic recovery. This involves rolling forward changes in the redo logs and rolling back uncommitted transactions.

To guarantee transaction recovery, an instance always records transaction information in redo log groups and to a datafile. The write to the datafile typically happens later, because it takes longer to perform. The administrator controls the time needed to recover the instance by using the EM to set a target MTTR.

Media failure describes any failure resulting in the loss or corruption of one or more database files. To minimize the impact of media failure, you should schedule regular backups, multiplex control files, multiplex redo log groups, and keep archived copies of redo logs.

Oracle 10g

Thursday, August 30, 2007