Understanding the Oracle Backup Ecosystem


Mirrored at https://blogs.oracle.com/storageops/entry/understanding_the_oracle_backup_ecosystem

Table of Contents

Understanding the Oracle Backup Ecosystem

Backup/Restore Drivers

The “Oops”

Defending against and pursuing lawsuits

Taxes & Audits

Disaster Recovery

Reduce Downtime

Improve Productivity

The Backup/Restore Tiers

Tier 1 Backups

Tier 2 Backups

Tier 3 Backups

Tier 4 Backups

The Tools

ZDLRA

SMU

OSB

ACSLS

STA

Oracle ZFS Storage

Tools For Tiers

Understanding the Oracle Backup Ecosystem

A frequent question I hear these days is something along the lines of “How is Oracle IT leveraging the Zero Data Loss Recovery Appliance, Oracle Secure Backup, and ZFS together?”

Disclaimer 1: The opinions in this blog are my own, and do not necessarily represent the position of Oracle or its affiliates.

Disclaimer 2: In Oracle IT, we “eat our own dog food”. That is, we try to use the latest and greatest releases of our products in production or semi-production environments, and the implementation pain makes us pretty strong advocates for improvements and bug fixes. So what I talk about here is only what we're doing right now; it's not what we were doing a year ago, and probably won't be what we're doing a year from now. Some of today's innovative solutions are tomorrow's deprecated processes. Take it all with a grain of salt!

Disclaimer 3: I'm going to talk about some of my real-world, actual experiences here in Oracle IT over the past decade that influenced my position on backups. Don't take these experiences as an indictment of our Information Technology groups. Accidents happen; some are preventable, some not. The real key to success is not in not failing, but in moving forward and learning from the experience so we don't repeat it.

Backup/Restore Drivers

Typically, offline backup & restore is driven by a few specific types of needs.

The “Oops”

Humans are fallible. We make mistakes. The single most common reason for unplanned restores in Oracle IT is human error. This is also true for other large enterprises: Google enjoyed a high-profile incident of corrupted mailboxes several years ago due to a flawed code update. Storing data in the “cloud” is not a protection against human error. The only real protection you have from this kind of incident is some kind of backup that is protected by virtue of being either read-only or offline.

Defending against and pursuing lawsuits

In today's litigious environment, being able to take “legal hold” offline, non-modifiable, long-retention backups of critical technology is a prerequisite to efficiently defending you and your company from various legal attacks. Trying to back up or restore an environment that has zero backup infrastructure in place is a huge hassle, and can endanger your ability to win a lawsuit. You want to have a mechanism in place to deal with the claims of your attackers – or to support the needs of your Legal team in pursuing infringements – without disrupting your normal operations.

Taxes & Audits

Tax laws in various countries usually require some mandatory minimum of data retention to satisfy potential audit requirements. If you can't cough up the data required to pass an audit – regardless of the reason, even if it's a really good one! – you're probably facing a stiff fine at a minimum.

Disaster Recovery

I'm going to be real here. This is my blog, not some sanitized, glowing sales brochure. Everybody is – or should be! – familiar with what “Disaster Recovery” is. Various natural and man-made disasters have happened in recent decades, and many companies went out of business as a result of inadequate disaster recovery plans. While the chance of a bomb, earthquake, or flood striking your data center is probably very low, it does exist. Here's a short list of minor disasters I've personally observed during my career. There have been many more; I'll only speak of relatively recent ones.

  • A minor earthquake had an epicenter just two miles from one of our data centers. I was in the data center in question at the time; it felt as if a truck struck the building. Several racks of equipment didn't have adequate earthquake protection and shifted; they could easily have fallen over and been destroyed.

  • An uninterruptible power supply's automated transfer switch exploded, resulting in smoke throughout the data center and a small fire that could have spread and destroyed data.

  • Another data center had a failure in the fire prevention system, resulting in sprinklers dousing several racks worth of equipment.

  • Busy staff and a flawed spreadsheet resulted in the wrong rack of equipment being forklifted and shipped to another data center.

  • A data center was in the midst of a major equipment move with very narrow outage windows. During one such time-critical move, facilities staff incorrectly severed the ZFS Appliance “Clustron” cables with a box knife before shipping the unit. I powered the unit up without detecting the break, resulting in a split-brain situation on our appliance that corrupted data. Mea culpa! Seriously, don't do that. I don't think the ZFSSA is vulnerable to this anymore as a result of this incident, but it was painful at the time and I don't want anyone to go through that again...

  • Multiple storage admins on my team have accidentally destroyed the wrong share or snapshot on a storage appliance. When you have hundreds of thousands of similarly-named projects, shares, and snapshots, it's nearly inevitable, even if the “nodestroy” bit is set: if the service request says to destroy a share, and all the leadership signed off on the change request for destroying it, you destroy it despite the “nodestroy” thing. But it's quite rare.

  • Admins allowed too many disks to be evicted from the disk pool on an Exadata (for reasons I won't go into), resulting in widespread data loss and a data restore.

This was the minor stuff. Imagine if it were major! If you don't have solid, tested disaster recovery plans that include some kind of offline or near-line backup, you're exposed and are likely to go out of business even if all you suffer is a user-induced disaster such as the “Oops” category above.

Reduce Downtime

Having a good backup means that you have less downtime for your staff in case of any challenge with your data. Knowing how long it takes to restore your data is a benefit of a regularly-scheduled restore test.

Improve Productivity

Finally, if you don't have a good backup, the chance is high that you'll eventually end up having to do some work over again due to lack of good back-out options. This loss of productivity hurts the bottom line.

The Backup/Restore Tiers

In any large enterprise environment, there exist multiple tiers of needs for backup/restore. It's often helpful to view backup and restore as a single type of tier: if your backup needs tend to be time-sensitive, your restore needs are probably even more so. Therefore, in the interest of simplicity I'll assume your tier need for restores mirrors your tier for backups.

Here's how I view these tiers today. They aren't strictly linear as below – there is a lot of cross-over – but they align nicely with the technologies used to back them up.

  1. Mission-critical, high-visibility, high-impact, unique database content.
  2. Mission-critical, high-visibility, high-impact, unique general purpose content.
  3. Lower-criticality unique database and general purpose content.
  4. Non-unique database and general purpose content.
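As a minimal sketch of how the four tiers above could be expressed as a decision rule, assuming three yes/no attributes per data set. The `Content` class and its field names are hypothetical, invented for illustration, and the real boundaries have a lot more cross-over than this suggests:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Content:
    """Hypothetical description of a data set; field names are illustrative."""
    mission_critical: bool
    unique: bool
    is_database: bool


def backup_tier(content: Content) -> int:
    """Map a data set onto the four tiers listed above (simplified)."""
    if not content.unique:
        return 4  # non-unique content, database or general purpose
    if not content.mission_critical:
        return 3  # unique, but lower criticality
    # Unique and mission-critical: databases are Tier 1, everything else Tier 2.
    return 1 if content.is_database else 2


erp_db = Content(mission_critical=True, unique=True, is_database=True)
print(backup_tier(erp_db))  # 1
```

A rule like this is only a starting point; a database that does not merit Tier 1 treatment would be moved down by hand.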

Tier 1 Backups

For Tier 1 Oracle database backup and restore, there exists one best choice today: The Zero Data Loss Recovery Appliance, or "ZDLRA". While you can perform backups to ZFS or OSB tape directly – which works quite well, and we've done it for years in various environments – the ZDLRA has some important advantages I'll cover below.

That said, the Oracle ZFS Storage Appliance in combination with Oracle Secure Backup can also provide Tier 1-level backups, although the “forever-incremental” strategy available on ZDLRA is simply not an option there. For Tier 1 non-ZDLRA backups, we resort to more typical strategies: rman “backup backupset” using a disk-to-disk-to-tape approach, NFS targets, direct-to-tape options, etc.

For Tier 1, you also want multiple options if possible: layer upon layer of protection.

Tier 2 Backups

For Tier 2 general-purpose content, the ZDLRA just isn't particularly relevant because it doesn't deal with non-Oracle-Database data. By calling it “Tier 2” I'm not implying it's less important than Tier 1 backups, just that you have a lot more flexibility with your backup and recovery strategies. Tier 2 also applies to your Oracle database environments that do not merit the expense of ZDLRA; ZFS and tape tend to be considerably cheaper, but with a corresponding rise in recovery time and management effort.

In Tier 2, you'll have the same kind of backup & restore windows as Tier 1, but will use non-ZDLRA tools to take care of the data: direct-to-tape backups, staging to OSB disk targets for later commitment to tape, etc. Like Tier 1, you want to layer your recovery options. Our typical layers are:

  1. Sound change management process to eliminate the most common category of “Oops” restores.
  2. Snapshots. Usually a week or more of retention, but at minimum 4 daily automated snapshots to create a 3-day snapshot recovery window.
  3. Replication to DR sites. For Oracle Database, this usually means “Data Guard”. For non-DB data, ZFS Remote Replication is commonly used and has proven exceptionally reliable, if occasionally a little tricky to set up for extremely large (100+TB) shares.
  4. For Oracle databases, an every-15-minutes archive log backup to tape that is sent off site regularly at the primary and DR site(s).
  5. Weekly incremental backups to tape at both the primary & DR site(s), using whatever hot-backup technology is available to us on the platform so that the backup is “clean” and can be restored without corrupted in-flight data.
  6. Monthly full backups to tape at both the primary & DR site(s).
  7. Ad-hoc backups to tape as required.
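The snapshot arithmetic in layer 2 is easy to get off by one: four daily snapshots span three days, not four. Here is a minimal sketch of that retention rule. The function names and the example date are made up for illustration; this is not how the ZFS Storage Appliance schedules or prunes snapshots internally:

```python
from datetime import date, timedelta


def snapshots_to_keep(snapshot_dates, keep=4):
    """Return the newest `keep` snapshot dates, newest first."""
    return sorted(snapshot_dates, reverse=True)[:keep]


def recovery_window_days(kept):
    """Days between the newest and the oldest retained snapshot."""
    return (max(kept) - min(kept)).days if kept else 0


today = date(2015, 6, 10)  # arbitrary example date
daily = [today - timedelta(days=n) for n in range(10)]  # ten daily snapshots

kept = snapshots_to_keep(daily, keep=4)
print(recovery_window_days(kept))  # 3
```

Keeping a week or more is the same calculation with a larger `keep` value.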

Tier 3 Backups

Leveraging the same toolset as Tier 2 backups, Tier 3 backups are simply environments that need less-frequent backups of any sort. It's the kind of stuff that, if you lost access to it for 12-24 hours, would inconvenience a bunch of users while your enterprise kept running. It's not stuff that endangers your bottom line – if it's a revenue-producing service, it must be treated as Tier 1 or Tier 2, or else you might end up owing your customers some money back! – but would be painful/irritating/time-consuming to reproduce.

In Oracle IT, this tier of data receives second-class treatment. It gets backed up once per week instead of constantly. Restore windows range from a few hours to a couple of days. Retention policies are narrower. Typically, very static environments like those held for Legal Hold or rarely-read data are stored in this tier. The data is important enough to back up, but the restoration window is much more fluid and the demands infrequent.

ZFS Snapshots are critical for this kind of environment, and typically will be held for a much longer period than the few days one might see in a production environment. Because the data is much more static, the growth of snapshots relative to their filesystems is very low.

Tier 4 Backups

The key phrase for backups in this tier is “non-unique”. In other words, the data could easily be reproduced with roughly the same amount of effort it would take to restore from tape. In general, these Tier 4 systems don't receive much if any backup at all. ZFS snapshots occur on user-modifiable filesystems so that we can recover within a few days from a user “oops” incident, but if we were to lose the entire pool it could be reconstructed within a couple of days. Although it's important to have some mechanism for tape backup should one be required, they will be the exception and not the rule.

The Tools

Now to the fun part. How do we glue these things together in various tiers? What tools do we use?

ZDLRA

  1. The forever-incremental approach to backups means that there is less CPU and I/O load on your database instance. Backup windows typically generate the heaviest load on your appliance, and since the ZDLRA should never require full backups after the first one, it's an outstanding choice for backing up I/O-challenged environments.
  2. The ZDLRA easily services a thousands-of-SIDs environment without backup collisions. This is really critical for Cloud-style environments with many small databases, where traditional rman scheduling tends to fall apart pretty easily due to schedule conflicts over limited tape resources.
  3. Autonomous tape archival helps aggregate backups and provide on-demand in-scope Legal Hold, Disaster Recovery, Environment Retirement, and Tax/Audit backups to tape. Many may think “tape is dead”... but they think wrong!

SMU

Oracle's SMU – “Snap Management Utility” – is a great way to back up Tier 2 Oracle databases to ZFS. It handles putting your database into hot backup mode so that you can take an ACID-compliant snapshot of the data and set up restore points along the way. If you can't afford ZDLRA, SMU + ZFS is a great first step. Just don't forget to take it to tape too!

OSB

OSB version 12 provides “Disk Targets”. This, in essence, gives users of OSB 12 a pseudo-VTL capability. This new Disk Target functionality provides some other unique benefits:

  1. Aggregate multiple rman backups of smaller-than-a-single-tape size onto a single tape.
  2. With sufficient streams to disk, you can be rid of rman scheduling challenges that often vex thousands-of-SIDs environments when backing up to tape.
  3. By aggregating rman and other data to a single archive tape, you increase the density of data on tape, avoid buffer underruns, and maximize the free time for your tape drive. What often happens with a slow rman backup is that the tape ramps its speed down to match the input stream, doubling or even quadrupling the time the tape drive is busy. By buffering the backups to disk first, you can ensure the tape drive is driven at maximum speed once you're ready to use “obtool cpinstance” to copy those instances to tape.
  4. Ability to use any kind of common spindle or SSD storage as a disk target. We use a combination of local disks on Sun/Oracle X5-2L servers running Solaris as well as ZFS Storage Appliance targets over 10 Gbit Ethernet.
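Two of the benefits above come down to simple arithmetic: a drive that slows to match a slow stream stays busy longer, and small backups staged on disk can be packed together onto fewer tapes. A rough sketch of both follows. All the numbers are hypothetical, and the first-fit-decreasing packing is a stand-in chosen for illustration, not a description of how OSB actually lays out instances on tape:

```python
def drive_busy_hours(backup_gb, stream_mb_s, drive_mb_s):
    """Hours the tape drive is occupied when it slows to match the input stream."""
    effective_mb_s = min(stream_mb_s, drive_mb_s)
    return backup_gb * 1024 / effective_mb_s / 3600


def pack_onto_tapes(backup_sizes_gb, tape_capacity_gb):
    """First-fit-decreasing packing of staged backups onto tapes."""
    tapes = []  # each entry is the list of backup sizes written to one tape
    for size in sorted(backup_sizes_gb, reverse=True):
        if size > tape_capacity_gb:
            raise ValueError(f"backup of {size} GB does not fit on one tape")
        for tape in tapes:
            if sum(tape) + size <= tape_capacity_gb:
                tape.append(size)
                break
        else:
            tapes.append([size])  # nothing had room, so start a new tape
    return tapes


# Assumed rates: a 60 MB/s rman stream versus a drive that can sustain 240 MB/s.
direct = drive_busy_hours(1000, stream_mb_s=60, drive_mb_s=240)
staged = drive_busy_hours(1000, stream_mb_s=240, drive_mb_s=240)
print(round(direct / staged, 1))  # 4.0

# Six staged backups and an assumed 2500 GB tape capacity.
print(len(pack_onto_tapes([400, 300, 900, 1200, 700, 500], 2500)))  # 2
```

With these made-up rates, writing directly from the slow stream keeps the drive busy four times as long as copying from a disk target at full speed.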

ACSLS

Oracle's StorageTek Automated Cartridge System Library Software – ACSLS for short – provides a profoundly useful capability: virtualization of our tape silos. We can present a single silo, anything from our smaller SL3000 libraries to the Big Boy SL8500 library, as a virtual tape silo to a given instance of OSB. This allows truly isolated multi-tenancy and reporting for individual customers or lines of business. This capability is leveraged to the max across all of our Enterprise, Cloud, and Managed Cloud environments.

STA

Oracle's StorageTek Tape Analytics (STA) provides predictive failure analysis of tapes and silo components. All storage – tape, SSD, and magnetic spindle – will fail eventually. STA provides valuable insight into the rate of this decay, and works in tandem with ACSLS to proactively, predictively fail media out of the library when it's no longer reliable.

Oracle ZFS Storage

Oracle's ZFS Storage Appliance provides a uniquely flexible, configurable storage platform to leverage as a disk backup target, rman “backup backupset” staging area for massive-throughput Oracle database backups, remote replication source or target, and more. The proven self-healing capabilities of Oracle's ZFS storage – particularly effective in a once-in, many-out backup situation – help guarantee that backups are healthy and exactly what you intended to commit to tape. In many ways, the ZFS Storage Appliance is the fulcrum around which all our other utilities rotate, and its integration as a disk target for OSB over either NFS or NDMP is simple, straightforward, and provides unparalleled analytic ability.

Tools For Tiers

If you've read this far, you probably already have a pretty good idea of what to use for which tier. ACSLS, STA, ZFS, and OSB all factor into every tier of backups in one way or another. By tier:

  1. ZDLRA with a sub-15-minute recovery point objective.
  2. ZFS Snapshots plus hot backups to tape and/or OSB Disk Targets, with a 15-minute recovery point objective; for some specific environments, SMU may also be appropriate.
  3. ZFS Snapshots are the primary “backup”, with a far more generous 24-hour recovery point objective using OSB disk and tape targets.
  4. ZFS Snapshots as the primary or only “backup”; no specific recovery point objective as the environment could be reconstructed if necessary.
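The tier list above can be captured as a small lookup table, which makes it easy to check a measured restore-test result against the stated recovery point objective. This is only a sketch: the `TIER_PLAN` structure and the `meets_rpo` helper are hypothetical names, and the tool lists simply restate the four items above:

```python
from datetime import timedelta

# Tools and RPOs restated from the list above; None means no specific RPO.
# Tier 1 is "sub-15-minute", so 15 minutes is treated here as its upper bound.
TIER_PLAN = {
    1: {"tools": ["ZDLRA"], "rpo": timedelta(minutes=15)},
    2: {"tools": ["ZFS Snapshots", "hot backups to tape/OSB Disk Targets", "SMU"],
        "rpo": timedelta(minutes=15)},
    3: {"tools": ["ZFS Snapshots", "OSB disk and tape targets"],
        "rpo": timedelta(hours=24)},
    4: {"tools": ["ZFS Snapshots"], "rpo": None},
}


def meets_rpo(tier, data_loss):
    """True if the observed data loss is within the tier's recovery point objective."""
    rpo = TIER_PLAN[tier]["rpo"]
    return rpo is None or data_loss <= rpo


print(meets_rpo(3, timedelta(hours=12)))  # True
print(meets_rpo(2, timedelta(hours=1)))   # False
```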

I hope this is helpful for you when figuring out how to back up your Red Stack. All the best!