Thursday, August 30, 2007

Database failure in Oracle 10g

Database failure in Oracle 10g

Learning objective

After completing this topic, you should be able to recognize the types of database failure that can occur in an Oracle 10g database and identify possible solutions.

1. The role of the DBA

The DBA must ensure that the database is open and available when users need it.

Achieving this goal involves collaborating with the system administrator to

  • protect the database from failure
  • increase the Mean Time Between Failures (MTBF)
  • decrease the Mean Time To Recover (MTTR)
  • minimize loss of data
protect the database from failure
To protect databases from failure, DBAs need to be able to distinguish between failures originating from errors in statements, user processes, the network, user activity, instances, and media.

Once the kind of failure is diagnosed, appropriate remedial action can be taken.
increase the Mean Time Between Failures (MTBF)
To increase the MTBF, the DBA needs to ensure that

  • hardware is as reliable as possible
  • critical components are protected by redundancy
  • operating system maintenance is performed in a timely manner


Two examples of Oracle configuration options that increase MTBF are Real Application Clusters and Streams.
decrease the Mean Time To Recover (MTTR)
Planning by the DBA can decrease the MTTR.

For example, the DBA can practice recovery procedures and configure backups in advance so that they are readily available when needed.
minimize loss of data
By following standards-based best practices, DBAs can configure their databases to ensure that committed transactions are never lost.

Tools that help the DBA to minimize data loss include archived redo logs, standby databases, and Oracle Data Guard.

Question

What should the DBA aim to do in order to ensure that the database is open and available for users?

Options:

  1. Decrease the Mean Time Between Failures
  2. Decrease the Mean Time To Recover (MTTR)
  3. Increase the Mean Time Between Failures
  4. Minimize the loss of data

Answer

The DBA should aim to decrease the Mean Time To Recover (MTTR), to increase the Mean Time Between Failures (MTBF), and to minimize the loss of data.

Option 1 is incorrect. In fact, the DBA wants to increase the Mean Time Between Failures. Decreasing the Mean Time Between Failures would mean increasing the frequency of failures.

Option 2 is correct. Tactics a DBA can employ to decrease the Mean Time To Recover include practicing recovery procedures in advance and configuring backups so that they are readily available when needed.

Option 3 is correct. To increase the Mean Time Between Failures, the DBA needs to ensure that hardware is as reliable as possible, that critical components are protected by redundancy, and that operating system maintenance is performed in a timely manner.

Option 4 is correct. Tools that help the DBA to minimize data loss include archived redo logs, standby databases, and Oracle Data Guard.

As we have seen, the DBA can recognize the following categories of database failures:

  • statement failure
  • network failure
  • user process failure
  • user error failure
  • instance failure
  • media failure
statement failure
A statement failure occurs when a single database operation - such as a SELECT action - fails.
network failure
A network failure occurs when connectivity to the database is lost. The best way to avert network failure is to provide redundant paths for network connections.
user process failure
A user process failure occurs when a single database session fails.This can happen when users perform abnormal disconnects, when their sessions terminate abnormally, or when a program error terminates a user session.
user error failure
A user error failure occurs when a user successfully completes an operation that ought not to have been performed in the first place - for example, dropping a table or entering incorrect data.
instance failure
An instance failure occurs when a database instance shuts down unexpectedly.

Causes of instance failure include software failure, hardware failure, and the use of emergency shutdown commands.
media failure
A media failure error occurs when one or more database files are lost due to a media problem such as a deleted or corrupt disk.

Question

Match the failure category to the situation it describes.

Options:

  1. Statement failure
  2. User process failure
  3. Network failure
  4. User error failure
  5. Instance failure
  6. Media failure

Targets:

  1. The connection to the database is lost
  2. A user completes a damaging operation
  3. A SELECT action fails
  4. A single database session fails
  5. A disk is corrupt
  6. A database instance shuts down unexpectedly

Answer


When a single database operation fails, it's a statement failure. When a single database session fails, it's a user process failure. When the connection to the database goes down, it's a network failure. When a user completes a damaging operation, it's a user error failure. When a database instance shuts down, it's an instance failure. And when a disk is corrupt, it's a media failure.

A statement failure occurs when a single database operation - such as a SELECT action - fails.

A database session can fail when users perform abnormal disconnects, when their sessions terminate abnormally, or when a program error terminates a user session.

The best way to avert network failure is to provide redundant paths for network connections.

A user error failure occurs when a user successfully completes an operation that ought not to have been performed in the first place: for example, dropping a table or entering incorrect data.

Causes of instance failure include software failure, hardware failure, and the use of emergency shutdown commands.

A media failure occurs when one or more database files are lost due to a media problem such as a deleted or corrupt disk.

2. Statement and network failure

As stated above, a statement failure occurs when a single database operation fails.

Causes of failure include

  • users entering invalid data
  • operations attempted with insufficient privileges
  • failed attempts to allocate database space
  • logic errors in applications
users entering invalid data
Statement errors can occur because of invalid data entry by users, for instance attempting to enter data that has a different format from the datatype defined for a column.

One way of avoiding such errors is for the DBA to work with users to validate and correct data.
operations attempted with insufficient privileges
Statement errors can occur when operations are attempted with insufficient privileges.

In these cases, the solution may be for the DBA to amend object and system privileges accordingly.
failed attempts to allocate database space
A statement error can occur when an attempt to allocate space in the database fails.

To resolve this, the DBA could enable resumable space allocation, increase the user quota, or add space to the tablespace.
logic errors in applications
Logic errors in applications can cause statement errors.

In such cases, the DBA should work with developers to correct program logic.

Question

When can a statement failure happen?

Options:

  1. When an attempt to allocate database space fails
  2. When an operation is attempted with insufficient privileges
  3. When a session fails
  4. When a user enters invalid data

Answer

Examples of statement errors include users entering invalid data, operations being attempted with insufficient privileges, and failed attempts to allocate database space.

Option 1 is correct. To resolve this, the DBA could enable resumable space allocation, increase the user quota, or add space to the tablespace.

Option 2 is correct. To solve this, the DBA may have to amend object and system privileges accordingly.

Option 3 is incorrect. By definition, a statement error occurs when a database operation fails. A session is not a database operation.

Option 4 is correct. To avoid such failure occurring, the DBA could work with users to validate and correct data.

A network failure occurs when connectivity to the database is lost. The best way to avert network failure is to provide redundant paths for network connections.

So the DBA needs to configure

  • a backup listener and connect-time failover for when the listener fails
  • multiple Network Interface Cards (NIC) so that connectivity continues when one card fails
  • a backup network connection for when the network connection fails

Question

What should a DBA configure to anticipate NIC failure?

Options:

  1. A backup listener and connect-time failover
  2. A backup network connection
  3. Multiple NICs

Answer

To anticipate NIC failure, the DBA should configure multiple NICs.

Option 1 is incorrect. You configure a backup listener and connect-time failover to anticipate listener failure.

Option 2 is incorrect. You configure a backup network connection to anticipate network connection failure.

Option 3 is correct. If one Network Interface Card fails, you need to have configured multiple other ones so that connectivity can continue.

3. User process and user error failure

A user process failure occurs when a single database session fails.

This can happen when users perform abnormal disconnects, when their sessions terminate abnormally, or when a program error terminates a user session.

DBA intervention is not usually required to resolve such failures - instance background processes will roll back uncommitted changes and release locks. Instead, the DBA should monitor trends.

When users disconnect from the instance abnormally, they sometimes leave behind uncommitted work that needs to be cleaned up.

The PMON process that runs in the background periodically polls server processes, to check whether their sessions remain connected. If PMON uncovers a server process with a newly disconnected user, it recovers from whichever transactions are in progress.

The recovery includes rolling back changes that have not been committed and releasing locks that the failed session held.

It is normal that a small percentage of user processes should fail. If failures are regular, however, this can point to other problems.

A large number of abnormal disconnects may indicate network or application problems. Or it may require that users be trained to log out rather than terminate their programs.

Question

When can a user process failure happen?

Options:

  1. When an operation is attempted with insufficient privileges
  2. When a program error terminates a user session
  3. When a user enters invalid data
  4. When a user performs an abnormal disconnect
  5. When a user session terminates abnormally

Answer

User process failures occur when a program error terminates a user session, a user performs an abnormal disconnect, or a user session terminates abnormally.

Option 1 is incorrect. This is likely to result in a statement failure.

Option 2 is correct. As with all user process failures, DBA intervention is not usually required - instance background processes will roll back uncommitted changes and release locks. However, the DBA should monitor trends.

Option 3 is incorrect. This is likely to result in a statement failure.

Option 4 is correct. This user process failure may be the result of the user not being trained to log out properly. As with all user process failures, DBA intervention is not usually required - instance background processes will roll back uncommitted changes and release locks. However, the DBA should monitor trends.

Option 5 is correct. It is normal that a small percentage of user processes should fail. If failures are regular, however, this could point to other problems.

A user error failure occurs when a user successfully completes an operation that ought not to have been performed in the first place.

For example, a user may increase salaries for all employees instead of just increasing salaries for employees associated with a particular job role!

When a user inadvertently deletes or modifies data, the DBA may be called upon to help recover from the error.

A rollback is the most straightforward solution for instances where the user has not yet committed the changes or quit the program.

If the user has committed the changes, you can find out what the original data values were by using flashback queries.

Then you can update the data to restore the original values.

Flashback queries are only possible if the undo retention period has not been exceeded.

If this period has been exceeded, it may still be possible to retrieve the information using the Oracle LogMiner. This tool allows you to query online redo logs and archived redo logs using a SQL interface.

Transaction data sometimes persists for longer in redo logs than it does in undo logs. And if you have configured Oracle to archive redo information, the information will last until you delete the archived files.

A user can recover a dropped table from the recycle bin by flashing the table back to before the drop.

Even if the recycle bin has been purged, or if the user has used the PURGE option to drop the table, all is not lost - with proper database configuration, the dropped table can still be recovered by using point-in-time recovery (PITR).

Question

Match the appropriate data recovery action to each of these situations in which user error failure has occurred.

Options:

  1. Perform a rollback
  2. Recover the data from the recycle bin
  3. Use a flashback query
  4. Use the Oracle LogMiner tool

Targets:

  1. Invalid data is entered but changes are not committed
  2. A table is dropped accidentally
  3. Invalid data is entered and changes are committed. The undo retention period has not been exceeded.
  4. Invalid data is entered and changes are committed. The undo retention period has been exceeded.

Answer

When invalid data is entered but changes are not committed, you perform a rollback. If changes are committed and the undo retention period has not been exceeded, you use a flashback query. If the retention period has been exceeded, you use Oracle LogMiner. And you can recover dropped tables from the recycle bin.

Because the changes are not committed it's possible to perform a rollback, which is the most straightforward way of recovering data.

You can recover a dropped table from the recycle bin by flashing the table back to before the drop.

Flashback queries are only possible if the undo retention period has not been exceeded - which is the case.

If the undo retention period has been exceeded, you cannot use a flashback query. Instead you need to use the Oracle LogMiner tool.

Summary

The role of the DBA involves protecting the database from failure, increasing the Mean Time Between Failures (MTBF), decreasing the Mean Time To Recover (MTTR), and minimizing data loss. To these ends, the DBA needs to be able to distinguish between statement failure, user process failure, network failure, user error failure, instance failure, and media failure.

A statement failure occurs when a single database operation - such as a SELECT action - fails. A network failure occurs when connectivity to the database is lost, and is best averted by providing redundant paths for network connections.

A user process failure occurs when a single database session fails.This can happen when a user or program error terminates a user session abnormally. A user error failure occurs when a user successfully completes an inadvisable operation. Methods of resolving user error failures include rollbacks, point-in-time recovery, flashback queries, and using the LogMiner tool.

No comments: