ZFS Tricks: Scheduling Scrubs

Content mirrored at https://blogs.oracle.com/storageops/entry/zfs_trick_scheduled_scrubs

A frequently-asked-question on ZFS Appliance-related mailing lists is "How often should I scrub my disk pools?"  The answer to this is often quite challenging, because it really depends on you and your data.

Usually when I’m asked a question, I want to provide answers to the questions the asker should have asked first, so that I’m certain our shared conversational contexts match up. So here are some background questions we should answer before tackling the "How often" question.

What is a scrub?

To "scrub" a disk means to read data from all disks in all vdevs in a pool. This process compares blocks of data against their checksums; if any of the blocks don’t match the related checksum, ZFS assumes that data has been corrupted (bit rot happens to every form of storage!) and will look for valid copies of the data. If found, it’ll write a good copy of the data to that storage, marking the old copy as "bad".

What is the benefit of a disk scrub?

Most people have a lot more "stale" data than they think they do: stuff that was written once, and never read from again. If data isn’t read, there’s no way to tell if it’s gone bad due to bit rot or not. ZFS will self-heal data if bad data is found, so a scrub forces a read of all data in the pool to verify that it isn’t currently bit-rotted, and heal the data if it is.

What performance impact is there to a scrub?

The ZFS appliance runs disk scrubs at a very low priority as a nearly-invisible background process. While there is a performance impact to scrubbing disk pools, this very low-priority background process should not have much if any impact to your environment. But the busier your appliance is with other things, and the more data is on-disk, the longer the scrub takes.

How long do scrubs run?

On a fresh system with little data and low utilization, scrubs complete very quickly.  For instance, on a brand-new, quiescent pool with 192 4TB disks, scrubs typically complete in just moments. There is almost no data to read, so the scrubs return almost as soon as we start them.

On very busy systems with very large pools and lots of I/O, it’s possible for scrubs to run for months before completion. For example, a 192-disk, full-rack 7410 with 2TB drives in the Oracle Cloud recently required eight months to complete a pool scrub. The system was used around-the-clock with extreme write loads; the low quantity of RAM (256GB/head), compression (LZJB better than 2:1), and nearly-full pool (80%+) conspired to force the scrub to run extremely slowly.

If the slow-running, low-impact scrub needs to complete in a shorter time than that, contact Support and ask for a workflow to prioritize your scrubs to run a little faster.  Realize, of course, that if you do so, the performance impact goes up as scrubs run at higher priority!

Should I scrub my pools?

  1. Is the pool formatted with either RAIDZ or Mirror2 configuration? Although these two options offer higher performance than RAIDZ2 or Mirror3, redundancy is lower. (No, I’m not going to talk about Stripe. That should only ever be used on a simulator; I don’t even know why it exists on a ZFS appliance.)
  2. Are you unable to absolutely guarantee that every byte of data in the pool is read frequently?  Note that even databases that the DBAs think of as "very busy" often have blocks of data that go un-read for years and are at risk of bit rot. Ask me how I know…
  3. Do you run restore tests of your data less frequently than once per year?
  4. Do you back up every byte of data in your pool less frequently than once per quarter?

If you answer "Yes" to any of the above questions, then you probably want to scrub your pools from time to time to guarantee data consistency.

How often should I scrub my pools?

This question is challenging for Support to answer, because as always the true answer is "It Depends".  So before I offer a general guideline, here are a few tips to help you create an answer more tailored to your use pattern.

  1. What is the expiration of your oldest backup? You should probably scrub your data at least as often as your oldest tapes expire so that you have a known-good restore point.
  2. How often are you experiencing disk failures? While the recruitment of a hot-spare disk invokes a "resilver" — a targeted scrub of just the VDEV which lost a disk — you should probably scrub at least as often as you experience disk failures on average in your specific environment.
  3. How often is the oldest piece of data on your disk read? You should scrub occasionally to prevent very old, very stale data from experiencing bit-rot and dying without you knowing it.

If any of your answers to the above are "I don’t know", I’ll offer a general guideline: you should probably be scrubbing your zpool at least once per quarter. That schedule works well for most use cases, provides enough time for one scrub to complete before the next starts on all but the busiest and most heavily-loaded systems, and even on very large zpools (192+ disks) it usually lets a scrub finish in the window between disk failures.

How do I schedule a pool scrub automatically?

There exists no easy mechanism to schedule pool scrubs from the BUI or CLI as of February 2015. I opened an RFE a few months back for one to be provided, but I’m not certain how far down the development pipeline such a feature is, or whether it will exist at all. So in Oracle IT, we just rolled our own.

The below code is an example of how this can be accomplished. It is provided as-is, with no warranty expressed or implied. Use it at your own risk.

It’s been working well for us for many months. Simply copy/paste the below code into some convenient filename, such as "safe_scrub.akwf", then upload the workflow to your appliance using the "maintenance workflows" BUI screen.  The default schedule runs once every 12 weeks on a Sunday. You can tweak the schedule to match your needs either by editing the source code before you upload it, or by visiting the "maintenance workflows" command-line interface and adjusting the schedule manually afterward.

/*globals run, continue, list, printf, print, get, set, choices, akshDump, nas, audit, shell, appliance*/
/*jslint maxerr: 50, indent: 4, plusplus: true, forin: true */

/* safe_scrub.akwf
 * A workflow to initiate a scrub on a schedule.
 * Author: Matthew P. Barnson
 * Update history:
 * 2014-10-09 Initial concept
 * 2014-11-20 EIS deployment
 * 2015-02-19 Sanitized for more widespread use
 * 2015-02-19 Multiple pool functionality added by: Adam Rappner
 */

/* This program is provided 'as is' without warranty of any kind, expressed or
 * implied, including, but not limited to, the implied warranties of
 * merchantability and fitness for a particular purpose. */

var MySchedules = [
    // Offset 3 days (Sunday), 9 hours, 00 minutes, week interval.
    // The UNIX Epoch -- January 1, 1970 -- occurred on a Thursday.
    // Therefore the ZFS appliance's week in a schedule starts on Thursday.
    // Sample offset: Every week
    //{offset: (3 * 24 * 60 * 60) + (9 * 60 * 60), period: 604800, units: "seconds"}
    // Sample offset: Every 4 weeks
    //{offset: (3 * 24 * 60 * 60) + (9 * 60 * 60), period: 2419200, units: "seconds"}
    // Sample offset: Once every 12 weeks on a Sunday
    {offset: (3 * 24 * 60 * 60) + (9 * 60 * 60), period: 7257600, units: "seconds"}
];

var workflow = {
    name: 'Scheduled Scrub',
    origin: 'Oracle PDIT mbarnson',
    description: 'Scrub on a schedule',
    version: '1.2',
    hidden: false,
    alert: false,
    setid: true,
    scheduled: true,
    schedules: MySchedules,
    execute: function (params) {
        "use strict";
        var myDate = run('date'),
            myReturn = "",
            pools = nas.listPoolNames(),
            p = 0;
        // Iterate over pools & start scrubs
        for (p = 0; p < pools.length; p = p + 1) {
            myDate = run('date');
            try {
                run('cd /');
                run('configuration storage set pool=' + pools[p]);
                run('configuration storage scrub start');
                myReturn += "New scrub started on pool: " + pools[p] + " ";
                audit('Scrub started on pool: ' + pools[p] + ' at ' + myDate);
            } catch (err) {
                myReturn += "Scrub already running on pool: " + pools[p] + " ";
                audit('Scrub already running on pool: ' + pools[p] + ' at ' + myDate);
            }
        }
        return ('Scrub in progress. ' + myReturn + '\n');
    }
};
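
If the 12-week default doesn’t fit your environment, note that offset and period in MySchedules are just seconds: the offset counts from the start of the appliance’s schedule week (Thursday, since the UNIX Epoch fell on a Thursday), and the period is the interval between runs. Here is a small sketch of how to derive your own values; the Saturday-at-02:00, every-8-weeks example is hypothetical, so plug in whatever cadence you settle on.

// Sketch: deriving custom offset/period values for the MySchedules array above.
// The schedule week starts on Thursday, so the offset is seconds from Thursday 00:00.
// The example values (Saturday 02:00, every 8 weeks) are hypothetical.

var SECONDS_PER_DAY  = 24 * 60 * 60;
var SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY;

var daysFromThursday = 2;   // Thu=0, Fri=1, Sat=2, Sun=3, Mon=4, Tue=5, Wed=6
var startHour        = 2;   // 02:00 appliance-local time
var weeksBetweenRuns = 8;   // run every 8 weeks

var myCustomSchedule = {
    offset: (daysFromThursday * SECONDS_PER_DAY) + (startHour * 60 * 60), // 180000 seconds
    period: weeksBetweenRuns * SECONDS_PER_WEEK,                          // 4838400 seconds
    units:  "seconds"
};
// For comparison, the default above is offset 3*86400 + 9*3600 = 291600 (Sunday 09:00)
// with period 12 * 604800 = 7257600 seconds (12 weeks).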

Happy scrubbing!

ZFS: Doing It Right

Imagine you’re a system administrator, and an email arrives from your boss. It goes something like this:

"Hey, bud, we need some new storage for Project Qux.  We heard that this [insert major company here] uses a product called the Oracle Sun ZFS Storage Appliance as the back-end for their [insert really popular app here]. We want to do something like that at similar scale; can you evaluate how well that compares to XYZ storage we already own?"

So you get in touch with your friendly local ZFS sales dudette, who arranges a meeting that includes a Sales Engineer to talk about technical stuff related to your application. The appliance, however, has an absolutely dizzying array of options.  Where do you start?

Without a thorough evaluation of performance characteristics, people evaluating these appliances tend to end up in one of two scenarios:

  1. ZFS choices that will almost certainly fail, and
  2. ZFS choices with a reasonable chance of success despite their lack of knowledge.

To start with, I’ll talk about Scenario 1: setting yourself and your ZFS evaluation up to fail. Doing It Wrong.

How Do People Do It Wrong?

I bumped into several individuals at OpenWorld that had obviously already made choices that guaranteed the ZFS appliance they purchased was not going to work for them.  They just didn’t know it yet. And of course, despite my best intentions to help them cope with the mess they made, they remained unsatisfied with their purchase.

Both the choices and outcome were eminently predictable, and apparently motivated by several common factors.

Misplaced Cost-Consciousness

From my point of view, if someone isn’t ready to invest six figures in storage, then they aren’t yet ready for the kind of performance and reliability an enterprise-grade NAS like the ZFS appliance can offer them.  The hardware they can afford won’t provide an accurate picture of how storage performs at scale.

Any enterprise storage one can buy at a four or five-figure price point is still a toy; a useful one, but still a toy compared with its bigger siblings.

It’ll be nifty and entertaining if the goal is to familiarize oneself with the operating system and interfaces. It will allow users to get a glimpse of the kinds of awesome advantages ZFS offers. It’ll offer a reasonable test platform for bigger & better things later as you explore REST, Analytics, Enterprise Manager, and the Oracle-specific optimizations available to you.  And perhaps it might serve reasonably well as a departmental file server or small-scale storage for a few dozen terabytes of data.  But it won’t offer performance or reliability on a scale similar to what serious enterprises deserve.

Misunderstanding Needs

Most customers that invest in dedicated storage for the first time don’t yet understand their data usage patterns. IOPS? A stab in the dark. Throughput? Maybe a few primitive tests from a prototype workstation. Hot data volume? Read response latency requirements? Burst traffic vs. steady-state traffic? Churn rate? Growth over time? Deduplication or cloning strategies? Block sizes? Tree depth? Filesystem entries per directory? Data structure? Best supported protocol? Protocol bandwidth compared to on-disk usage? Compressibility? Encryption requirements? Replication requirements?

I’m not saying one has to have all these answers prior to purchasing storage.  In fact, the point of this series is to encourage you to purchase a good general-purpose hardware platform that is really good at most workloads, and configure it in a way that you’re less likely to shoot yourself in the foot.  But over and over the people with the biggest problems were the ones who didn’t understand their data, yet hoped that purchasing some low-end ZFS storage would somehow magically solve their poorly-understood problems.

Lack Of Backups

Most data worth storing is worth backing up. While I’m a big fan of the Oracle StorageTek SL8500 tape silo, not everybody is ready for a tape backup solution that can span the size of a football field or Quidditch pitch.

Nevertheless, trusting that the inherent reliability and self-healing of a filesystem will see a company through a disaster is not a good idea.  Earthquakes, tornadoes, errant forklift drivers, newbie admins with root access, and overly-enthusiastic Logistics personnel with a box knife and a typo-ridden list of systems to move are all common.  Backups should be considered and implemented long before valuable data is committed to storage.

Solving Yesterday’s Problems

Capacity planning is crucial in the modern enterprise. While I’m certain our sales guys are really happy to sell systems on an urgent basis with little or no discount in response to poor planning on the part of customers, that kind of decision making is often really hard on the capital expense budget.

A big part of successful capacity planning is forecasting future needs. Products like Oracle Enterprise Manager and ZFS Analytics can help. Home-brewed capacity forecasting is viable and common. A system administrator is at her best when she has already anticipated the needs of the business and has a ready solution for the problems she knows will arrive eventually. With an enterprise NAS, a modest up-front investment in hardware continues to yield dividends as she better understands her data utilization patterns and learns to use the available tools to manage it intelligently.

How To Fail At ZFS And Performance Reviews

Here are the options I would pick if I wanted to set up my ZFS appliance to fail:

  • Go with any non-clustered option; reliability suffers. Failure imminent.
  • Choose the lowest RAM option; space pressure will make my bosses really unhappy with the storage as things slow down. Great way to fail.
  • Buy CPUs at the lowest possible specification; taking advantage of CPU speed for compression would make the storage run better, and using CPU for encryption gives us options for handling sensitive data. Don’t want that if our goal is failure!
  • Pick an absurdly low number of high-capacity, low-IOPS spindles, like maybe twenty to forty 7200RPM drives; I/O pressure will drive me nuts troubleshooting, but heck, it’s job security.
  • Don’t invest in Logzillas (SLOG devices). The resultant write IOPS bottleneck will guarantee everybody hates this storage.
  • If I do invest in Logzillas (SLOG devices), use as few as possible and stripe them instead of mirroring them; that kills two birds with one stone: impaired reliability AND impaired performance!
  • Buy Readzillas (L2ARC), but ignore the total RAM available to the system and go for the big, fat, expensive Readzilla SSDs because I think we’re going to have a "lot of reads" without understanding what Readzillas actually do. This will impair RAM performance further, both wasting my money AND squandering performance!

If you do the above, you’ll pretty much guarantee a bad time for yourself with ZFS storage.  Unfortunately, this seems to be the way far too many people try to configure the storage, and they set themselves up for failure right from the start.

So we’ve talked about Doing It Wrong. How do you Do It Right?

Do It Right: Rock ZFS, Rock Your Performance Review

In case you don’t know what I do, I co-manage several hundred storage appliances for a living (soon to be over a thousand, with hundreds of thousands of disks among them. Wow. The sheer scope of working for Oracle continues to amaze me!). Without knowing anything else about the workload except that the customer wants high-performance general-purpose file storage, below is the reference configuration I would pick if I want to maximize the workload’s chances of success.  If I think I need to differ from this reference configuration, it’s important to ask "How does this improve on the reference configuration?"  This reference configuration has proven its merit time and time again under a dizzying array of workloads, and I’d only depart from it under very compelling arguments to do so.

Such arguments exist, but if they are motivated by price, I am always trading away performance for a lower price!

Understanding The Basics

Guiding this reference configuration are the following priorities:

  1. Redundancy. If it’s worth doing, it’s worth protecting; the ZFS appliance is reliable because it’s very fault-tolerant and self-healing, not because the commodity Sun hardware it’s built with is inherently more reliable than competing options.
  2. Mirrored Logzillas (SLOG devices). Balance this with RAM and spindles, though, as too much of any one of the three means the others will be underused.  And for a few obscure technical reasons related to reliability, I strongly prefer Mirrored Logzillas over Striped.
  3. RAM. ZFS typically leverages RAM really well. You’ll want to balance this with Logzilla & spindles, of course, using ratios similar to the reference configuration.
  4. Spindle read IOPS. Ideally, I should have some idea of the total expected read IOPS of my application, and configure sufficient spindles to handle the maximum anticipated random read load.  If this kind of data is unavailable, I’ll default to the reference configuration.
  5. Network. 10Gbit Ethernet is cheap enough these days that any reasonable storage should use it. It’s still a really tough pipe to fill for most organizations since it’s so large, but it is possible.
  6. CPU. It’s almost an afterthought, really; even the lowest CPU configuration of a given appliance that is capable of handling 1TB of RAM per head (2TB per cluster) comes with abundant CPU. But if I want to use ZFS Encryption heavily, or use the more CPU-intensive compression algorithms, CPU becomes a pretty legitimate thing to spend some money on.
  7. Readzilla/L2ARC/Read Cache. The ARC — main memory — is really your best, highest-performing cache on a ZFS appliance, but if there are specific reasons for investing heavily in Readzilla (L2ARC) cache, we’ll know a few months after we start using it. Basically, if my ARC hit rate drops down into the 80% range or lower, I want to add a Readzilla or two to the system (a tiny sketch of that rule of thumb follows this list). The cool thing is, you can add these any time; you don’t have to put this into the capital expense budget up-front, but it’s something you can do responsively if the storage appliance use pattern starts to suggest you ought to.
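
To make that rule of thumb concrete, here is a tiny hypothetical sketch. The observedArcHitRates array and the 0.90 threshold are illustrative only; the real numbers come from Analytics and your own judgment.

// Hypothetical sketch of the Readzilla rule of thumb described above.
// 'observedArcHitRates' stands in for a series of ARC hit-rate samples
// pulled from Analytics; the name and the 0.90 threshold are illustrative.
function shouldConsiderReadzilla(observedArcHitRates) {
    "use strict";
    var sum = 0, i;
    for (i = 0; i < observedArcHitRates.length; i = i + 1) {
        sum += observedArcHitRates[i];
    }
    // A sustained average in the 80% range or lower suggests adding L2ARC.
    return (sum / observedArcHitRates.length) < 0.90;
}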

Your Best Baseline Hardware Configuration

So here’s the hardware configuration we typically use in Oracle IT. It’s not the biggest, it’s certainly not the most expensive, but it has the advantage of simplicity, flexibility, and stellar performance for the vast majority of our use cases, and it all fits neatly into one little standard 48U rack.  I’ll hold off on part numbers, though, as those change over time.

  • ZS4-4 cluster (two heads).
  • 15 core (or more) processor.
  • 1TB or 1.5TB RAM per head (2TB or 3TB total RAM across the cluster).
  • Dual port 10Gbit NIC per head.  We typically buy two of these for a total of four ports for full redundancy.
  • Two SAS cards per head (required).
  • Clustron (pre-installed) to connect your cluster heads together.
  • 8 shelves. If you anticipate fairly low IOPS and mostly capacity-related pressure, I suggest the DE2-24C (capacity) configuration; if you think IOPS will be pretty heavy, the DE2-24P (performance) shelf is a good alternative, albeit with dramatically reduced capacity.
  • 8x200GB Logzilla SSDs. This is probably overkill, but a few environments can leverage having this much intent log.
  • Fill those shelves with 7200RPM drives as required.  Formatted capacity in TiB as I recommend below will be around 44.5% of raw capacity in TB once spares and the conversion from TB to TiB are taken into account.  Typically in this configuration I’ll have 184 spinning disks, so whatever capacity of disk I buy, I can do the math (a back-of-the-envelope sketch follows this list).  The cool part is that I’ll roughly double this with LZJB compression on typical mixed-use workloads, giving around 67% to 106% of raw capacity when formatted and used.  Which is, in essence, freakin’ awesome.
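
Here is the back-of-the-envelope capacity sketch. The 4TB drive size is just an example, and the 1.5:1 and 2.4:1 LZJB ratios are assumptions chosen to reproduce the 67%-to-106% range quoted above; your mileage will vary with your data.

// Back-of-the-envelope capacity math for the reference configuration above.
// Assumptions: 184 spinning disks, 4TB raw per drive (example only),
// ~44.5% of raw TB usable as formatted TiB, and LZJB ratios of ~1.5:1 to ~2.4:1.

var spindles      = 184;
var rawTBPerDrive = 4;                          // example drive size
var rawTB         = spindles * rawTBPerDrive;   // 736 TB raw

var formattedTiB  = rawTB * 0.445;              // ~327 TiB usable before compression

var effectiveLow  = formattedTiB * 1.5;         // ~491 TiB (~67% of raw TB)
var effectiveHigh = formattedTiB * 2.4;         // ~786 TiB (the ~106% end of the range)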

Fundamental Tuning & Configuration

Now let’s step into software configuration.  If you’ve configured your system as above, random writes are a breeze. Your appliance will rock the writes. The Achilles’ heel of the ZFS appliance in a typical general-purpose "capacity" configuration as above is random reads. They can be both slow themselves, and they can slow down other I/O. You want to do whatever you can to minimize their impact.

  • I’ll create two pools, splitting the shelves down the middle, and when setting up the cluster assign half of each shelf’s resources to a pool.
  • Those pools will be assigned one per head in the cluster configuration. This really lets us exploit maximum performance as long as we’re not failed over.
  • Use LZJB by default for each project. There are numerous technical reasons for this; for now, if you don’t know what they are, take it on faith that LZJB typically gives a ZFS appliance a SERIOUS performance boost, but only if it’s applied before data is written… if applied after, it doesn’t do much.  This speeds up random reads considerably. (A hedged CLI sketch follows this list.)
  • If using an Oracle database, just use OISP. It makes your life so so much easier from configuration to layout: two shares, and done.  If not using OISP, then pay close attention to the best practices for database layout to avoid shooting oneself in the foot!
  • If using an Oracle database, leverage HCC on every table where it’s practical. HCC-compressing the data — despite the CPU cost on your front-end database CPU initially — usually provides a pretty huge I/O boost to the back-end once again for reads. Worth it.
  • Scrub your pools. In a later blog entry I’ll discuss using a scheduled workflow to invoke a scrub, but for now just use Cron on an admin host, or assign some entry-level dude to mash the "scrub" button once a week for data safety. Around year 3 of use, hard drive failure rates peak, and drives keep failing at a more-or-less predictable rate indefinitely. There are certain extremely rare conditions under which it’s possible to lose data that is written once and very infrequently read in a mirror2 configuration; if you scrub your pools on a regularly-scheduled basis (at the default priority, this means more or less continuously), your exposure to that risk drops to the point of being negligible.
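
Here is the hedged CLI sketch promised above, written in the same workflow style as safe_scrub.akwf. The project name is made up, and the shares/compression CLI nodes are from memory rather than from this appliance’s documentation, so verify them against your firmware’s CLI help before using anything like this.

// Hypothetical sketch: make LZJB the default compression on a project.
// 'myproject' (whatever you pass in) is a made-up name; verify the CLI path
// and property name ('shares', 'compression=lzjb') against your firmware.
var setProjectCompressionLzjb = function (projectName) {
    "use strict";
    run('cd /');
    run('shares select ' + projectName);
    run('set compression=lzjb');
    run('commit');
    audit('Default compression set to lzjb on project: ' + projectName);
};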

Wrapping It Up

There you have it: an ideal general-purpose file server with good capacity, great performance for average loads, and something that in typical Oracle Database or mixed-use environments will really make you glad you invested in an Oracle Sun ZFS Storage Appliance.