Handy Space Monitoring on ZFSSA

This is a re-post from my blog at http://blogs.oracle.com/storageops/entry/handy_space_monitoring

Semi-real-time space monitoring is pretty straightforward with
ECMAScript & XMLRPC.  I've never really been a fan of using used
+ avail as a metric; it's simply too imprecise for this kind of
work.  With XMLRPC, you can gauge costs down to the byte, and with
JavaScript/ECMAScript you get easy date handling for your
report.

Here's a code snippet to monitor fluctuations in your overall pool
space usage; just copy-paste it at the CLI to run it.  Let's call this
"Matt's Handy Pool Space Delta Monitor".  It updates every 5 seconds;
change the "sleep" interval to speed up or slow down the updates, and
press CTRL-C a few times rapidly to exit.

There must be a way to get the ECMAScript interpreter to break out of
the whole loop on the first CTRL-C, rather than interrupting only the
current pass through the loop and requiring multiple presses, but I'm
not exactly certain how to do it:

script
var previousSize = 0,
  currentSize = 0;
while (true) {
  currentDate = new Date();
  currentSize = nas.poolStatus(nas.listPoolNames()[0]).np_used;
  printf('%s bytes delta: %s bytes\n',
    currentDate.toUTCString(),
    currentSize - previousSize);
  previousSize = currentSize;
  run('sleep 5');
}
.

Here's some sample output from a very busy system which handles some
of Oracle's ZFS bundle analysis uploads.  The system is constantly
extracting, compressing, and destroying data, so it's pretty
dynamic.

aueis19nas09:> script
("." to run)> var previousSize = 0,
("." to run)>   currentSize = 0;
("." to run)> while (true) {
("." to run)>   currentDate = new Date();
("." to run)>   currentSize = nas.poolStatus(nas.listPoolNames()[0]).np_used;
("." to run)>   printf('%s bytes delta: %s bytes\n',
("." to run)>     currentDate.toUTCString(),
("." to run)>     currentSize - previousSize);
("." to run)>   previousSize = currentSize;
("." to run)>   run('sleep 5');
("." to run)> }
("." to run)> .
Wed, 08 Jul 2015 17:44:31 GMT bytes delta: 102937482702848 bytes
Wed, 08 Jul 2015 17:44:36 GMT bytes delta: 0 bytes
Wed, 08 Jul 2015 17:44:42 GMT bytes delta: 362925056 bytes
Wed, 08 Jul 2015 17:44:47 GMT bytes delta: 1039872 bytes
Wed, 08 Jul 2015 17:44:52 GMT bytes delta: 424662016 bytes
Wed, 08 Jul 2015 17:44:57 GMT bytes delta: -181739520 bytes
Wed, 08 Jul 2015 17:45:02 GMT bytes delta: 0 bytes
Wed, 08 Jul 2015 17:45:07 GMT bytes delta: -362792960 bytes
Wed, 08 Jul 2015 17:45:13 GMT bytes delta: -56487936 bytes
Wed, 08 Jul 2015 17:45:18 GMT bytes delta: 0 bytes
Wed, 08 Jul 2015 17:45:23 GMT bytes delta: 311884288 bytes
Wed, 08 Jul 2015 17:45:28 GMT bytes delta: -3111936 bytes
Wed, 08 Jul 2015 17:45:33 GMT bytes delta: 329170944 bytes
Wed, 08 Jul 2015 17:45:38 GMT bytes delta: 94827520 bytes
Wed, 08 Jul 2015 17:45:44 GMT bytes delta: -24576 bytes
Wed, 08 Jul 2015 17:45:49 GMT bytes delta: 356221440 bytes
Wed, 08 Jul 2015 17:45:54 GMT bytes delta: -36864 bytes
Wed, 08 Jul 2015 17:45:59 GMT bytes delta: 503583744 bytes
Wed, 08 Jul 2015 17:46:04 GMT bytes delta: 175494144 bytes
Wed, 08 Jul 2015 17:46:10 GMT bytes delta: -342528 bytes
Wed, 08 Jul 2015 17:46:15 GMT bytes delta: 135242240 bytes
Wed, 08 Jul 2015 17:46:20 GMT bytes delta: -39769600 bytes
Wed, 08 Jul 2015 17:46:25 GMT bytes delta: -124416 bytes
Wed, 08 Jul 2015 17:46:30 GMT bytes delta: -136044544 bytes
^CWed, 08 Jul 2015 17:46:31 GMT bytes delta: 0 bytes
^C^Cerror: script interrupted by user
aueis19nas09:>
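
The first line of that output (102937482702848 bytes) is not a real
delta: previousSize starts at 0, so the first pass reports the pool's
entire used space.  If that bothers you, one option is to track the
baseline separately.  This is only a sketch in plain ECMAScript that
has not been run on an appliance; makeDeltaTracker and nextDelta are
made-up names, not appliance functions, and the sample values are
invented.

```javascript
// Hypothetical helper: remembers the previous sample and returns the
// change since then, or null when there is no earlier sample to compare.
function makeDeltaTracker() {
  var previous = null;
  return function (current) {
    var delta = (previous === null) ? null : current - previous;
    previous = current;
    return delta;
  };
}

// Invented sample values standing in for successive np_used readings.
var nextDelta = makeDeltaTracker();
var first = nextDelta(1000);   // no baseline yet, so null
var second = nextDelta(1500);  // grew by 500
var third = nextDelta(1200);   // shrank by 300
```

In the monitor loop you would pass each np_used reading to nextDelta()
and skip the printf when it returns null.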

Caveats:

  • This isn't actually a 5-second sample; it simply sleeps 5
    seconds between sample periods, and due to execution time you
    will probably get a little drift that will manifest as a
    displayed interval of 6 seconds here & there if left running
    a long time.
  • If you wanted to modify this to be GB instead of bytes, you'd
    replace "currentSize - previousSize" with something like
    "Math.round((currentSize - previousSize) / 1024 / 1024 / 1024)",
    but that will probably just end up with a string of 0 or 1
    results with such a short polling interval.  You'd need to see
    significant and rapid data turnover to get a non-zero result if
    polling by gigabyte every five seconds!
  • This only monitors the first pool on your system.  To monitor
    another pool, change "nas.listPoolNames()[0]" to
    "nas.listPoolNames()[1]", or whichever index the pool you want
    has in the array returned by "nas.listPoolNames()".
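
Each of the caveats above can be softened with a small helper.  The
following is a rough sketch in plain ECMAScript, untested on an
appliance; ratePerSecond, formatBytes and poolIndex are hypothetical
names, not part of the appliance's scripting API.  The first divides
by the time that actually elapsed rather than assuming 5 seconds, the
second picks a unit that suits the size of the delta, and the third
looks a pool up by name in an array such as the one returned by
nas.listPoolNames().

```javascript
// Bytes per second, using the measured gap between two Date samples
// instead of assuming the nominal 5-second sleep.
function ratePerSecond(deltaBytes, prevDate, currDate) {
  var elapsedMs = currDate.getTime() - prevDate.getTime();
  if (elapsedMs <= 0) { return 0; }
  return deltaBytes / (elapsedMs / 1000);
}

// Scale a byte count to the largest binary unit that keeps the value
// at 1 or above, so small deltas do not collapse to 0 GB.
function formatBytes(bytes) {
  var units = ['bytes', 'KB', 'MB', 'GB', 'TB'];
  var value = Math.abs(bytes);
  var i = 0;
  while (value >= 1024 && i < units.length - 1) {
    value = value / 1024;
    i++;
  }
  var sign = bytes < 0 ? '-' : '';
  // Keep one decimal place.
  return sign + (Math.round(value * 10) / 10) + ' ' + units[i];
}

// Position of a pool name in an array of names; -1 if it is not there.
function poolIndex(names, wanted) {
  for (var i = 0; i < names.length; i++) {
    if (names[i] === wanted) { return i; }
  }
  return -1;
}
```

For example, formatBytes(362925056) gives "346.1 MB" and
formatBytes(-24576) gives "-24 KB", which reads more easily than a
column of zeroes in GB.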

Enjoy!