ZFS: Doing It Right

Imagine you're a system administrator, and an email arrives from your boss. It goes something like this:

"Hey, bud, we need some new storage for Project Qux.  We heard that this [insert major company here] uses a product called the Oracle Sun ZFS Storage Appliance as the back-end for their [insert really popular app here]. We want to do something like that at similar scale; can you evaluate how well that compares to XYZ storage we already own?"

So you get in touch with your friendly local ZFS sales dudette, who arranges a meeting that includes a Sales Engineer to talk about technical stuff related to your application. The appliance, however, has an absolutely dizzying array of options.  Where do you start?

Without a thorough evaluation of performance characteristics, most people evaluating these appliances end up in one of two scenarios:

  1. ZFS choices that will almost certainly fail, and
  2. ZFS choices with a reasonable chance of success despite gaps in their knowledge.

To start with, I'll talk about Scenario 1: setting yourself and your ZFS evaluation up to fail: Doing It Wrong.

How Are People Doing It Wrong?

I bumped into several individuals at OpenWorld who had obviously already made choices that guaranteed the ZFS appliance they purchased was not going to work for them.  They just didn't know it yet. And of course, despite my best intentions to help them cope with the mess they made, they remained unsatisfied with their purchase.

Both the choices and outcome were eminently predictable, and apparently motivated by several common factors.

Misplaced Cost-Consciousness

From my point of view, if someone isn't ready to invest six figures in storage, then they aren't yet ready for the kind of performance and reliability an enterprise-grade NAS like the ZFS appliance can offer them.  The hardware they can afford won't provide an accurate picture of how storage performs at scale.

Any enterprise storage one can buy at a four- or five-figure price point is still a toy: a useful one, but a toy compared with its bigger siblings.

It'll be nifty and entertaining if the goal is to familiarize oneself with the operating system and interfaces. It will allow users to get a glimpse of the kinds of awesome advantages ZFS offers. It'll offer a reasonable test platform for bigger & better things later as you explore REST, Analytics, Enterprise Manager, and the Oracle-specific optimizations available to you.  And perhaps it might serve reasonably well as a departmental file server or small-scale storage for a few dozen terabytes of data.  But it won't offer performance or reliability on the scale serious enterprises require.

Misunderstanding Needs

Most customers that invest in dedicated storage for the first time don't yet understand their data usage patterns. IOPS? A stab in the dark. Throughput? Maybe a few primitive tests from a prototype workstation. Hot data volume? Read response latency requirements? Burst traffic vs. steady-state traffic? Churn rate? Growth over time? Deduplication or cloning strategies? Block sizes? Tree depth? Filesystem entries per directory? Data structure? Best supported protocol? Protocol bandwidth compared to on-disk usage? Compressibility? Encryption requirements? Replication requirements?
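Several of those question marks can be turned into numbers before any purchase with a load generator like fio, run against whatever storage hosts the prototype today. A minimal sketch; the mount point, block size, and run length are placeholder assumptions to adjust for your own workload, and the command is echoed as a dry run:

```shell
#!/bin/sh
# Baseline a candidate workload with fio before talking to Sales.
# TESTDIR is a hypothetical mount point -- point it at real test storage.
# The command is echoed for inspection; drop "echo" where fio is installed.
TESTDIR=/mnt/candidate-storage/fio-test

# 8k random reads, 4 workers, queue depth 32, for two minutes:
echo fio --name=randread-8k --directory="$TESTDIR" \
    --rw=randread --bs=8k --size=4g --runtime=120 --time_based \
    --ioengine=libaio --direct=1 --iodepth=32 --numjobs=4
```

Run a few variants (sequential writes, larger block sizes) and you have real IOPS and throughput figures to bring to the Sales Engineer instead of a stab in the dark.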

I'm not saying one has to have all these answers prior to purchasing storage.  In fact, the point of this series is to encourage you to purchase a good general-purpose hardware platform that is really good at most workloads, and configure it in a way that you're less likely to shoot yourself in the foot.  But over and over the people with the biggest problems were the ones who didn't understand their data, yet hoped that purchasing some low-end ZFS storage would somehow magically solve their poorly-understood problems.

Lack Of Backups

Most data worth storing is worth backing up. While I'm a big fan of the Oracle StorageTek SL8500 tape silo, not everybody is ready for a tape backup solution that can span the size of a football field or Quidditch pitch.

Nevertheless, trusting that the inherent reliability and self-healing of a filesystem will see a company through a disaster is not a good idea.  Earthquakes, tornadoes, errant forklift drivers, newbie admins with root access, and overly-enthusiastic Logistics personnel with a box knife and a typo-ridden list of systems to move are all common.  Backups should be considered and implemented long before valuable data is committed to storage.
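On any ZFS system, the usual starting point for getting data off-box is snapshot replication with zfs send and receive. A sketch, where the dataset, backup host, and receiving pool are all hypothetical names, and the commands are echoed as a dry run:

```shell
#!/bin/sh
# Sketch of off-box snapshot replication. The dataset, backup host, and
# receiving pool are hypothetical -- substitute your own. Commands are
# echoed as a dry run; drop the "echo" prefixes to run them for real.
DATASET="tank/projects/qux"
BACKUP_HOST="backup01.example.com"

# Date-stamped snapshot name, e.g. tank/projects/qux@backup-2015-06-01
SNAP="${DATASET}@backup-$(date +%Y-%m-%d)"

echo zfs snapshot "$SNAP"
echo "zfs send $SNAP | ssh $BACKUP_HOST zfs receive -u backuppool/qux"
```

Cron this daily and follow it with incremental sends (zfs send -i) and you have the skeleton of a backup strategy; the ZFS appliance offers its own remote replication feature that accomplishes the same thing from the BUI.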

Solving Yesterday's Problems

Capacity planning is crucial in the modern enterprise. While I'm certain our sales guys are really happy to sell systems on an urgent basis with little or no discount in response to poor planning on the part of customers, that kind of decision making is often really hard on the capital expense budget.

A big part of successful capacity planning is forecasting future needs. Products like Oracle Enterprise Manager and ZFS Analytics can help. Home-brewed capacity forecasting is viable and common. A system administrator is at her best when she's already anticipated the needs of the business and has a ready solution for the future problems she understands will arrive eventually. With an enterprise NAS, a modest investment in hardware can continue to yield dividends as an admin better understands her data utilization patterns and learns to use the available tools to manage it intelligently.

How To Fail At ZFS And Performance Reviews

Here are the options I would pick if I wanted to set up my ZFS appliance to fail:

  • Go with any non-clustered option; reliability suffers. Failure imminent.
  • Choose the lowest RAM option; space pressure will make my bosses really unhappy with the storage as things slow down. Great way to fail.
  • Buy CPUs at the lowest possible specification; taking advantage of CPU speed for compression would make the storage run better, and using CPU for encryption gives us options for handling sensitive data. Don't want that if our goal is failure!
  • Pick an absurdly low number of high-capacity, low-IOPS spindles, like maybe twenty to forty 7200RPM drives; I/O pressure will drive me nuts troubleshooting, but heck, it's job security.
  • Don't invest in Logzillas (SLOG devices). The resultant write IOPS bottleneck will guarantee everybody hates this storage.
  • If I do invest in Logzillas (SLOG devices), use as few as possible and stripe them instead of mirroring them; that kills two birds with one stone: impaired reliability AND impaired performance!
  • Buy Readzillas (L2ARC), but ignore the total RAM available to the system and go for the big, fat, expensive Readzilla SSDs because I think we're going to have a "lot of reads" without understanding what Readzillas actually do. This will impair RAM performance further, wasting my money AND squandering performance!

If you do the above, you'll pretty much guarantee a bad time for yourself with ZFS storage.  Unfortunately, this seems to be the way far too many people try to configure the storage, and they set themselves up for failure right from the start.

So we've talked about Doing It Wrong. How do you Do It Right?

Do It Right: Rock ZFS, Rock Your Performance Review

In case you don't know what I do, I co-manage several hundred storage appliances for a living (soon to be over a thousand, with hundreds of thousands of disks among them. Wow. The sheer scope of working for Oracle continues to amaze me!). Without knowing anything else about the workload except that the customer wants high-performance general-purpose file storage, below is the reference configuration I would pick to maximize the workload's chances of success.  If I think I need to differ from this reference configuration, it's important to ask "How does this improve on the reference configuration?"  This reference configuration has proven its merit time and time again under a dizzying array of workloads, and I'd only depart from it under very compelling arguments.

Such arguments exist, but if they are motivated by price, I am always trading away performance for a lower price!

Understanding The Basics

Guiding this reference configuration are the following priorities:

  1. Redundancy. If it's worth doing, it's worth protecting; the ZFS appliance is reliable because it's very fault-tolerant and self-healing, not because the commodity Sun hardware it's built with is inherently more reliable than competing options.
  2. Mirrored Logzillas (SLOG devices). Balance this with RAM and spindles, though, as too much of any of the three and one or more will be underused.  And for a few obscure technical reasons related to reliability, I strongly prefer Mirrored Logzillas over Striped.
  3. RAM. ZFS typically leverages RAM really well. You'll want to balance this with Logzilla & spindles, of course, using ratios similar to the reference configuration.
  4. Spindle read IOPS. Ideally, I should have some idea of the total expected read IOPS of my application, and configure sufficient spindles to handle the maximum anticipated random read load.  If this kind of data is unavailable, I'll default to the reference configuration.
  5. Network. 10Gbit Ethernet is cheap enough these days that any reasonable storage should use it. It's still a really tough pipe to fill for most organizations since it's so large, but it is possible.
  6. CPU. It's almost an afterthought, really; even the lowest CPU configuration of a given appliance that is capable of handling 1TB of RAM per head (2TB per cluster) comes with abundant CPU. But if I want to use ZFS Encryption heavily, or use the more CPU-intensive compression algorithms, CPU becomes a pretty legitimate thing to spend some money on.
  7. Readzilla/L2ARC/Read Cache. The ARC -- main memory -- is really your best, highest-performing cache on a ZFS appliance, but if there are specific reasons for investing heavily in Readzilla (L2ARC) cache, we'll know a few months after we start using it. Basically, if my ARC hit rate drops down into the 80% range or lower, I want to add a Readzilla or two to the system. The cool thing is, you can add these any time; you don't have to put this into the capital expense budget up-front, but it's something you can do responsively if the storage appliance use pattern starts to suggest you ought to.
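The ARC hit-rate check in that last item is simple arithmetic: hits divided by hits plus misses. The appliance surfaces these counters through Analytics; on a generic Solaris/illumos host you could pull them from kstat instead. A sketch using illustrative counter values:

```shell
#!/bin/sh
# ARC hit ratio = hits / (hits + misses).
# On the appliance, read these counters from Analytics; on a generic
# Solaris/illumos host you could pull them with kstat, e.g.:
#   HITS=$(kstat -p zfs:0:arcstats:hits | awk '{print $2}')
#   MISSES=$(kstat -p zfs:0:arcstats:misses | awk '{print $2}')
# Illustrative counter values:
HITS=9200000
MISSES=800000

RATIO=$(awk -v h="$HITS" -v m="$MISSES" 'BEGIN { printf "%.1f", 100 * h / (h + m) }')
echo "ARC hit ratio: ${RATIO}%"
# Sustained ratios in the 80s or below suggest adding Readzilla (L2ARC).
```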

Your Best Baseline Hardware Configuration

So here's the hardware configuration we typically use in Oracle IT. It's not the biggest, it's certainly not the most expensive, but it has the advantage of simplicity, flexibility, and stellar performance for the vast majority of our use cases, and it all fits neatly into one little standard 42U rack.  I'll hold off on part numbers, though, as those change over time.

  • ZS4-4 cluster (two heads).
  • 15 core (or more) processor.
  • 1TB or 1.5TB RAM per head (2TB or 3TB total RAM across the cluster).
  • Dual port 10Gbit NIC per head.  We typically buy two of these for a total of four ports for full redundancy.
  • Two SAS cards per head (required).
  • Clustron (pre-installed) to connect your cluster heads together.
  • 8 shelves. I suggest if you anticipate fairly low IOPS and mostly capacity-related pressures that you opt for the DE2-24C configuration (capacity), but if you think IOPS will be pretty heavy, opting for DE2-24P (performance) is a good alternative but with pretty dramatically reduced capacity.
  • 8x200GB Logzilla SSDs. This is probably overkill, but a few environments can leverage having this much intent log.
  • Fill those shelves with 7200RPM drives as required.  Formatted capacity in TiB as I recommend below will be around 44.5% of raw capacity in TB once spares and the conversion from TB to TiB are taken into account.  Typically in this configuration I'll have 184 spinning disks, so whatever capacity of disk I buy, I can do the math.  The cool part is that LZJB compression will roughly double this on average mixed-use workloads, giving around 67% up to 106% of raw capacity when formatted and used.  Which is, in essence, freakin' awesome.
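That capacity math is easy to sanity-check. A sketch for the 184-spindle layout, where the 8 TB drive size is a hypothetical example rather than a recommendation:

```shell
#!/bin/sh
# Back-of-the-envelope capacity math for the 184-spindle reference config.
# The 8 TB drive size is a hypothetical example, not a recommendation.
DISKS=184
DISK_TB=8

RAW_TB=$((DISKS * DISK_TB))
# Roughly 44.5% of raw TB survives as formatted TiB once spares, parity,
# and the TB -> TiB conversion are accounted for.
FORMATTED_TIB=$(awk -v r="$RAW_TB" 'BEGIN { printf "%.0f", r * 0.445 }')
# LZJB roughly doubles usable capacity on average mixed-use workloads.
EFFECTIVE_TIB=$((FORMATTED_TIB * 2))

echo "raw: ${RAW_TB} TB  formatted: ~${FORMATTED_TIB} TiB  with LZJB: ~${EFFECTIVE_TIB} TiB"
```

With these hypothetical 8 TB drives that's 1472 TB raw, about 655 TiB formatted, and roughly 1310 TiB effective once LZJB does its thing.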

Fundamental Tuning & Configuration

Now let's step into software configuration.  If you've configured your system as above, random writes are a breeze. Your appliance will rock the writes. The Achilles' heel of the ZFS appliance in a typical general-purpose "capacity" configuration as above is random reads. They can be both slow themselves, and they can slow down other I/O. You want to do whatever you can to minimize their impact.

  • I'll create two pools, splitting the shelves down the middle, and when setting up the cluster assign half of each shelf's resources to a pool.
  • Those pools will be assigned one per head in the cluster configuration. This really lets us exploit maximum performance as long as we're not failed over.
  • Use LZJB by default for each project. Numerous technical reasons for this; for now, if you don't know what they are, take it on faith that LZJB typically provides a ZFS appliance a SERIOUS performance boost, but only if it's applied before data is written... if applied after, it doesn't do much.  This speeds up random reads considerably.
  • If using an Oracle database, just use OISP. It makes your life so, so much easier from configuration to layout: two shares, and done.  If not using OISP, then pay close attention to the best practices for database layout to avoid shooting oneself in the foot!
  • If using an Oracle database, leverage HCC on every table where it's practical. HCC-compressing the data -- despite the CPU cost on your front-end database CPU initially -- usually provides a pretty huge I/O boost to the back-end once again for reads. Worth it.
  • Scrub your pools. In a later blog entry I'll discuss using a scheduled workflow to invoke a scrub, but for now just use Cron on an admin host, or assign some entry-level dude to mash the "scrub" button once a week for data safety. Around year 3 of use, hard drive failure rates peak, and drives then continue failing at a more-or-less predictable rate indefinitely. There are certain extremely rare conditions under which it's possible to lose data that is written once and very infrequently read in a mirror2 configuration; if you scrub your pools on a regularly-scheduled basis (at the default priority, this means more or less continuously), your exposure to that risk drops to the point of negligible.
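The Cron approach from that last bullet can be a few lines on an admin host. A sketch against a generic ZFS host, with hypothetical pool names matching the two-pool layout above; the appliance itself would be driven through its own CLI or a workflow instead:

```shell
#!/bin/sh
# scrub-pools.sh -- run weekly from cron on an admin host, e.g.:
#   0 2 * * 0 /usr/local/bin/scrub-pools.sh
# Pool names are hypothetical, matching the two-pool layout above; zpool
# lives at the usual Solaris path. Commands are echoed as a dry run --
# drop "echo" to arm it. (The appliance itself is driven via its own
# CLI or a scheduled workflow instead.)
POOLS="pool1 pool2"
ZPOOL=/usr/sbin/zpool

for p in $POOLS; do
    echo "$ZPOOL" scrub "$p"
done
```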

Wrapping It Up

There you have it: an ideal general-purpose file server with good capacity, great performance for average loads, and something that in typical Oracle Database or mixed-use environments will really make you glad you invested in an Oracle Sun ZFS Storage Appliance.