As I have discussed in this series of eDBA columns before, availability is the most important issue faced by eDBAs in managing the database environment for an e-business (check out The eDBA and Data Availability). An e-business, by definition, is an online business – and an online business should never close. Customers expect web applications to deliver full functionality regardless of the day of the week or the time of day. And never forget that the Web is worldwide – when it is midnight in New York it is still prime time in Singapore. Simply stated, an e-business must be available and operational 24 hours a day every day of the year.
An e-business must be prepared to engage with customers at any time or risk losing business to a company whose Web site is more accessible. Studies show that if a web user clicks on a link and does not receive a transmission back to his browser within seven seconds he will abandon that request and go somewhere else. And chances are that customer will never come back if his needs were satisfied elsewhere. Outages result in a lost business, and lost business can spell doom for an e-business.
Now, the previous discussion notwithstanding, problems will happen, and problems can cause outages. You can plan for many contingencies, and indeed you should plan for as many as are fiscally reasonable. But regardless of the amount of up-front planning, eventually problems will occur. And when problems impact data, databases will need to be recovered. Therefore, the eDBA must be prepared to resolve data problems by implementing a sound strategy for database recoveries. But this is good advice for all DBAs, not just eDBAs. The eDBA must take database recovery planning to a higher level – a level that anticipates failure with a plan to reduce (perhaps eliminate) downtime during recovery.
The truth of the matter is that an outage-less recovery is usually not possible in most shops today. Sometimes this is the fault of technology and software deficiencies. However, in many cases, technology exists that can reduce downtime during a database recovery, but is not implemented due to budget issues or lack of awareness on the part of the eDBA.
eDatabase Recovery Strategies
A database recovery strategy must plan for all types of database recovery because problems can impact data at many levels and in many ways. Depending upon the nature of the problem and its severity, integrity problems can occur at any place within the database. Several rows, or indeed, only certain columns within those rows may be corrupted. This type of problem is usually caused by an application error. An error can occur that impacts an entire database object such as a table, data space, or table space becoming corrupted. This type of problem is likely to be caused by an application error or bug, a DBMS bug, an operating system error, or a problem with the actual file used by the database object. More severe errors can impact multiple database objects, or even worse, an entire database. A large program glitch, hardware problem or DBMS bug can cause integrity problems for an entire database, or depending on the scale of the system, multiple databases may be impacted.
Sometimes small data integrity problems can be more difficult to eradicate than more massive problems. For example, if only a small percentage of columns of a specific table are impacted it may take several days to realize that the data is in error. However, problems that impact a larger percentage of data are likely to be identified much earlier. In general, the earlier an error is found, the more recovery options available to the eDBA and the easier it is to correct the data. This is true because transactions performed subsequent to the problem may have changed other areas of the database, and may even have changed other data based on the incorrect values.
Recovery-To-Current
A useful database recovery strategy must plan for many different types of recovery. The first type of recovery that usually comes to mind is a recovery-to-current to handle some sort of disaster. This disaster could be anything from a simple media failure to a natural disaster destroying your data center. Applications may be completely unavailable until the recovery is complete.
Outages due to simple media failures can be avoided in many cases these days by implementing modern disk technologies such as RAID. RAID, an acronym for Redundant Arrays of Inexpensive Disks, is a technology that combines multiple disk devices into a single array that is perceived by the system as a single disk drive. There are many levels of RAID technology and, depending on the level in use, different degrees of fault-tolerance that are supported. For more details on RAID refer to the Sidebar titled RAID Levels. Another desirable aspect of RAID arrays is the ability to utilize hot swappable drives so the array does not have to be powered down to replace a failed drive. Instead, a drive can be replaced while the array is up and running – and that is a good thing for eDBAs because it enhances overall data availability.
Sidebar: RAID Levels
There are several levels of RAID that can be implemented. RAID Level 0 (or RAID-0) is also commonly referred to as disk striping. With RAID-0, data is split across multiple drives, which delivers higher data throughput. But there is no redundancy (which really doesn’t fit the definition of the RAID acronym). Because there is no redundant data being stored, performance is usually very good, but a failure of any disk in the array will result in data loss.
RAID-1, sometimes referred to as data mirroring, provides redundancy because all data is written to two or more drives. A RAID-1 array will generally perform better when reading data and worse when writing data (as compared to a single drive). However, RAID-1 provides data redundancy so if any drive fails, no data will be lost.
RAID-2 provides error correction coding. RAID-2 would be useful only for drives without any built-in error detection. RAID-3 stripes data at a byte level across several drives, with parity stored on one drive. RAID-3 provides very good data transfer rates for both reads and writes. RAID-4 stripes data at a block level across several drives, with parity stored on a single drive. For RAID-3 and RAID-4, the parity information allows recovery from the failure of any single drive. The performance of write can be slow with RAID-4 and it can be quite difficult to rebuild data in the event of RAID-4 disk failure.
RAID-5 is similar to RAID-4, but it distributes the parity information among the drives. RAID-5 can outperform RAID-4 for small writes in multiprocessing systems because the parity disk does not become a bottleneck. But read performance can suffer because the parity information is on several disks. RAID-6 is basically an extension of RAID-5, but it provides additional fault tolerance through the use of a second independent distributed parity scheme. Write performance of RAID-6 can be poor.
RAID-10 is a striped array where each segment is a RAID-1 array. Therefore, it provides the same fault tolerance as RAID-1. A high degree of performance and reliability can be delivered by RAID-10, so it is very suitable for high performance database processing. However, RAID-10 can be very expensive.
RAID-53 is a striped array where each segment is a RAID-3 array. Therefore, RAID-53 has the same fault tolerance and overhead as RAID-3. Finally, RAID-0+1 combines the mirroring of RAID-1 with the striping of RAID-0. This couples the high performance of RAID-0 with the reliability of RAID-1
In some cases storage vendors come up with their own variants of RAID. Indeed, there are a good number of proprietary variants and levels of RAID defined by the storage vendors. If you are in the market for RAID storage be sure you understand exactly what the storage vendor is delivering.
For more details on RAID check out the detailed information at http://www.raid-storage.com/04_00.html.
A disaster that takes out your data center is the worst of all possible situations and will definitely result in an outage of some considerable length. The length of the outage will depend greatly on the processes in place to send database copies and database logs to an off-site location. Overall downtime for a disaster also depends a good deal on how comprehensive and automated your recovery procedures are at the remote site. The eDBA should be prepared with automated procedures for handling a disaster.
But simple automation is insufficient. The eDBA must ensure the consistent backup and offsite routing of not just all of the required data, but also the IT infrastructure resources required to bring up the organization’s databases at the remote site. This is a significant task that requires planning, periodic testing, and vigilance. The better the plan, the shorter the outage and the smaller the impact will be on the e-business. Consider purchasing and deploying DBA tools that automate backup and recovery processes to shorten the duration of a disaster recovery scenario.
Of course, other considerations are involved if your entire data center has been destroyed. The resumption of business will involve much more than being able to re-deploy your databases and get your applications back online. But those topics are outside the scope of this particular article.
Point-in-Time Recovery
Another type of database recovery is a Point-in-Time (PIT) recovery. PIT recovery usually is performed to deal with application level problems. Conventional techniques to perform a point in time recovery will remove the effects of all transactions performed since a specified point in time. The traditional approach will involve an outage. Steps for PIT recovery include:
-
Identifying the point-in-time to which the database should be recovered. Depending on the DBMS being used this can be to an actual time, an offset on the database log, or to a specific image copy backup (or set of backups). Care must be taken to ensure that the PIT selected for recovery will provide data integrity not just for the database object impacted, but among all related database objects, as well.
-
The database objects must be taken offline while the recovery process applies the image copy backups.
-
If the recovery is to a PIT later than the time the backup was taken, the DBMS must roll forward through the database logs applying the changes to the database objects.
-
When complete, the database objects can be brought back online.
The outage will last as long as it takes to complete steps 2 through 4. Depending on the circumstances, you might want to make the database objects unavailable for update immediately upon discovering data integrity problems so that subsequent activities do not make the situation worse. In that case, the outage will encompass steps 1 through 4.
Further problems can ensue if there were some valid transactions after the PIT selected that still need to be applied. In that case, an additional step (say, step 5) should be added to re-run appropriate transactions. That is, if the transactions can even be identified and re-running is a valid option.
Overall, though, the quicker this entire process can be accomplished the shorter the outage. Step 1 can take a lot of time and the more it can be automated the better. Tools exist which make it easier to interpret database logs and identify an effective PIT for recovery. For the e-business, this type of tool can pay for itself after a single usage if it significantly reduces an outage and enables the e-business application to come back online quickly.
Transaction Recovery
A third type of database recovery exists for e-businesses willing to invest in sophisticated third party recovery solutions. Transaction Recovery addresses the shortcomings of traditional recoveries by reducing or eliminating downtime and avoiding the loss of good data. Simply stated, Transaction Recovery is the process of removing the undesired effects of specific transactions from the database. This statement, while simple on the surface, hides a bevy of complicated details. Let’s examine the details behind the concept of Transaction Recovery.
Traditional recovery is at the database object level: for example, at the dataspace, table space or index level. When performing a traditional recovery, a specific database object is chosen. Then, a backup copy of that object is applied, followed by re-applying log entries for changes that occurred after the image copy was taken. This approach is used to recover the database object to a specific, desired point-in-time. If multiple objects must be recovered, this approach is repeated for each database object impacted.
Transaction recovery uses the database log instead of image copy backups. Remember that all changes made to a relational database are captured in the database log. So, if the change details can be read from the log, recovery can be achieved by reversing the impact of the logged changes. Log-based transaction recovery can take two forms: UNDO recovery or REDO recovery.
For UNDO recovery the database log is read to find the data modifications that were applied during a given timeframe and:
-
INSERTs are turned into DELETEs
-
DELETEs are turned into INSERTs
-
UPDATEs are turned around to UPDATE to the old value
In effect, an UNDO recovery reverses database modifications using SQL. The traditional DBMS products do not provide native support for this. To generate UNDO recovery SQL you will need a third party solution that understands the database log format and can create the SQL needed to undo the data modifications. An eDBA should note that in the case of UNDO Transaction Recovery, the portion of the database that does not need to be recovered remains undisturbed. When undoing erroneous transactions, recovery can be done online without suffering an outage of the application or the database. UNDO Transaction Recovery is basically an online database recovery. Of course, whether or not it is desirable to keep the database online during a Transaction Recovery will depend on the nature and severity of the database problem.
The second type of Transaction Recovery is REDO Transaction Recovery. This strategy is a combination of PIT recovery and UNDO Transaction Recovery with a twist. Instead of generating SQL for the bad transaction that we want to eliminate, we generate the SQL for the transactions we want to save. Then we do a standard PIT recovery eliminating all the transactions since the recovery point. Finally we re-apply the good transactions captured in the first step.
Unlike the UNDO process which creates SQL statements that are designed to back out all of the problem transactions, the REDO process re-creates SQL statements that are designed to reapply only the valid transactions from a consistent point of recovery to the current time. Since the REDO process does not generate SQL for the problem transactions, performing a recovery and then executing the REDO SQL can restore the data to a current state that does not include the problem transactions.
A REDO Transaction Recovery requires an outage for the PIT recovery. When redoing transactions in an environment where availability is crucial, the database can be brought down during the PIT recovery and when done, the database can brought back online. The subsequent redoing of the valid transactions to complete the recovery can be done with the data online, thereby reducing application downtime.
In contrast with the granularity provided by traditional recovery, Transaction Recovery allows a user to recover a specific portion of the data based on user-defined criteria. So only a portion of the data is affected. And any associated indexes are automatically recovered as the transaction is recovered. Additionally, with Transaction Recovery the transaction may impact data in multiple database objects. A traditional recovery is performed object by object through the database.
A transaction is a set of related operations that, when grouped together, define a logical unit of work within an application. Transactions are defined by the user’s view of the process. This might be the set of panels that comprise a new hire operation. Or perhaps the set of jobs that post to the General Ledger. Examples of user-level transaction definitions might be:
-
All UPDATEs issued by userid DSGRNTLD since last Wednesday at 11:50 AM.
-
All DELETEs made by the application program PAYROLL since 8:00 PM yesterday.
Why is Transaction Recovery a much-needed tool in the arsenal of e-DBAs? Well, applications are prone to all types of problems, bugs, and errors. Using Transaction Recovery, the DBA can quickly react to application-level problems and maintain a higher degree of data availability. The database does not always need to be taken offline while Transaction Recovery occurs (it depends on the type of Transaction Recovery being performed and the severity of the problem).
Choosing the Optimum Recovery Strategy
So, what is the best recovery strategy? Of course, the answer is – it depends! While Transaction Recovery may seem like the answer to all your database recovery problems, there are times when it is not possible or not advisable. To determine the type of recovery to choose, you need to consider several questions:
-
Transaction Identification. Can all the problem transactions be identified? You must be able to actually identify the transactions that will be removed from the database. Can all the work that was originally done be located and redone?
-
Data Integrity. Has anyone else updated the rows since the problem occurred? If they have, can you still proceed? Is all the data required still available? Recovering after a REORG, LOAD, or mass DELETE may require the use of image copy backups. Will any other data be lost? If so, can the data lost be identified in some fashion?
-
Availability. How fast can the application become available again? Can you afford to go offline? What is the business impact of the outage?
These questions actually boil down to a matter of cost. What is the cost of rework and is it actually possible to determine what would need to be redone (what jobs to run, what documents to reenter). This cost needs to be balanced against the cost of long scans of log data sets to isolate data to redo or undo, and the cost of applying that data using SQL.
The ultimate database recovery solution should analyze your overall environment and the transactions needing to be recovered, and recommend which type of recovery to perform. Furthermore, it should automatically generate the appropriate scripts and jobs to perform the recovery to avoid the errors that are sure to be introduced with manually developed scripts and jobs.
Database Design
In some cases you can minimize the impact of future database problems by properly designing the database for the e-business application that will use the database. For example, you might be able to segment or partition the database by type of customer, location, or some other business criterion whereby only a portion of the database can be taken offline while the rest remains operational. In this way, only certain clients will be affected, not the entire universe of users. Of course, this approach is not always workable, but sometimes “up front” planning and due diligence during database design can mitigate the impact of future problems.
Conclusion
These are just a few of the recovery techniques available to eDBAs to reduce outages and the impact of downtime for e-businesses. For example, some disk storage devices provide the capability to very quickly “snap” files using hardware techniques – the result being very fast image copy backups. Some recovery solutions work well with these new, smart storage devices and can “snap” the files back very quickly as well.
Other solutions exist that back out transactions from the log to perform a database recovery. For eDBAs, a backout recovery may be desired in instances where a problem is identified quickly. You may be able to decrease the time required to recover by backing out the effects of a bad transaction instead of going back to an image copy and rolling forward through the log.
The bottom line is, as an eDBA you need to keep up-to-date with the technology available to reduce outages – both hardware and software offerings – and you need to understand how these technologies can work with your database environment. Remember that recovery does not always have to involve an outage. Think creatively, plan accordingly, and deploy diligently and you can deliver the service required of e-database administration.
With proper planning and wise implementation of technologies that minimize outages, you can maintain high e-vailability for your web-enabled databases and applications.
|