Should I be making multiple dRAID vdevs?

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
Once I figure out how to stop the current replication (https://www.truenas.com/community/threads/how-to-stop-an-active-replication-run.113920/), I can start doing more tests.

I just bought 15 more of those HDDs, but they're not gonna get here for a few days. All said and done, this dRAID will be 45 drives instead of 30 now. With the hardware I have this weekend, I can only test with 44 drives. I wanna play around with the number of data drives and see if I can maximize parity vs data.

At this point, I think a dRAID2 is gonna be better than a dRAID3 since I can sustain 2 drive deaths, plus a third once the distributed hot spare fills in the gaps, however long that takes. If I knew the fill time, I could make an educated call on the risks of dRAID2 vs dRAID3. This is only a backup pool, but I'll eventually make my main pool dRAID as well, so it's important to know.
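For reference, here's roughly how the two layouts get declared at creation time (a sketch only; the device names and the 13-disk count are placeholders, not my actual hardware):
Code:
# dRAID2: 2 parity + 4 data per redundancy group, 1 distributed spare, 13 children
zpool create Temp draid2:4d:13c:1s /dev/sd{a..m}

# dRAID3: same shape with a third parity disk per group (more protection, less capacity)
zpool create Temp draid3:4d:13c:1s /dev/sd{a..m}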
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Let us know how the reconfigure goes and how much better it is! I know I am interested.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
No matter what dRAID layout you try, a 2-way mirror as the special vdev doesn't match the pool's redundancy (it only survives one disk failure); you'd need a 3-way or 4-way mirror to be safe.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
TrueNAS Issues
No matter what dRAID layout you try, a 2-way mirror as the special vdev doesn't match the pool's redundancy (it only survives one disk failure); you'd need a 3-way or 4-way mirror to be safe.
I agree. Right now, I'm testing, and I can always attach a 3rd drive to this mirror.
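If I go that route, it should just be a one-liner against the existing mirror (a sketch; /dev/sdz is a placeholder for whatever the third SSD shows up as):
Code:
# Attach a third device to the special mirror that currently holds sdy and sdt
zpool attach Temp sdy /dev/sdz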

In the new UI, TrueNAS wants me to use another dRAID for every other vdev type, including metadata:

[Attached screenshot: the TrueNAS pool-creation UI only offering dRAID layouts for the metadata vdev]

The whole point of adding a metadata vdev is to speed up file access times by moving metadata to SSDs. For dRAID, you really want one so that small blocks get stored efficiently instead of being padded out to a full stripe width. This is a horrible restriction.
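For what it's worth, once a special vdev does exist, small data blocks can be routed to it with a dataset property (just a sketch; the 64K cutoff is an arbitrary example):
Code:
# Store data blocks <= 64K on the special vdev (metadata goes there regardless)
zfs set special_small_blocks=64K Temp
zfs get special_small_blocks Temp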

DIF Issues
After reformatting a drive with DIF (protection information) issues (2 of them needed it), TrueNAS still won't let me use the drive, even though `sg_readcap` reports Protection as `0`. I have to physically unplug the drive and plug it back in to fix it. Since this is temporary, I didn't bother and went with 41 drives instead; it was that or 44 drives based on what TrueNAS listed.
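For anyone curious, these are roughly the sg3_utils commands involved (a sketch only; /dev/sdX is a placeholder, and the format wipes the drive):
Code:
# Check whether T10 protection information (DIF) is enabled
sg_readcap --long /dev/sdX

# Low-level format without protection information (destroys all data on the drive)
sg_format --format --fmtpinfo=0 /dev/sdX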

Not sure why I can't specify which drives are used for a given dRAID. If I create my own dRAID manually, the zpool won't show up for TrueNAS to import, so it's unusable unless I use the TrueNAS UI. Pretty awful user experience right now.

Adding Special Mirrored vdev
Adding the special mirrored vdev manually requires this command (the second one; the first attempt shows the mismatched-redundancy warning that `-f` overrides):
Code:
# zpool add Temp -o ashift=12 special mirror /dev/sdy /dev/sdt
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool and new vdev with different redundancy, draid and mirror vdevs, 2 vs. 1 (2-way)
# zpool add Temp -f -o ashift=12 special mirror /dev/sdy /dev/sdt

Performance Testing
My SSD pool is showing reads at 1.05GB/s:
Code:
# watch -n 0.5 zpool iostat -v Bunnies
                                            capacity     operations     bandwidth
pool                                      alloc   free   read  write   read  write
----------------------------------------  -----  -----  -----  -----  -----  -----
Bunnies                                   74.5T  39.7T  9.49K    454  1.05G  47.6M

That makes me think the Temp pool is writing at that speed, but it's not:
Code:
# watch -n 0.5 zpool iostat -v Temp
                                            capacity     operations     bandwidth
pool                                      alloc   free   read  write   read  write
----------------------------------------  -----  -----  -----  -----  -----  -----
Temp                                       668G   357T      0  8.31K  5.19K   367M
  draid2:8d:41c:1s-0                       664G   356T      0  7.58K  5.15K   362M

With 4 redundancy groups, it's writing only ~40% faster than with one redundancy group. That's a noticeable gain, but not what I'd expect. Shouldn't it be closer to 4x the performance? I even used a power-of-2 number of data drives, so it's not that.

Strangely, when I run this different command, the numbers are all over the place; sometimes as low as 0 or only 250MB/s, and sometimes as high as 860MB/s:
Code:
# zpool iostat -v Temp -n 0.5
----------------------------------------  -----  -----  -----  -----  -----  -----
                                            capacity     operations     bandwidth
pool                                      alloc   free   read  write   read  write
----------------------------------------  -----  -----  -----  -----  -----  -----
Temp                                       682G   357T      0  15.2K      0   860M
  draid2:8d:41c:1s-0                       678G   356T      0  14.7K      0   858M

I think `-n` is more of a measurement at any given time whereas a regular `iostat` gives you an average (which is why it takes time to build up).

I'll have to get more data later. I think one issue is the two drives I formatted to remove DIF protection (apparently, they made it into the array *shrug*):
Code:
# zpool iostat -vL Temp
                        capacity     operations     bandwidth
pool                  alloc   free   read  write   read  write
--------------------  -----  -----  -----  -----  -----  -----
Temp                  1.15T   357T      0  8.73K  3.45K   429M
  draid2:8d:41c:1s-0  1.14T   355T      0  8.04K  3.43K   424M
    sdap2                 -      -      0    197    106  10.0M
    sdaq2                 -      -      0    196    106  9.99M
    sdar2                 -      -      0    200    106  10.0M
    sdas2                 -      -      0    196    106  9.99M
    sdat2                 -      -      0    196    106  10.1M
    sdau2                 -      -      0    200    106  10.0M
    sdav2                 -      -      0    198    106  10.0M
    sdaw2                 -      -      0    200    106  10.0M
    sdax2                 -      -      0    200    106  10.0M
    sday2                 -      -      0    201    106  10.1M
    sdaz2                 -      -      0    199    106  10.1M
    sdba2                 -      -      0    198    106  10.1M
    sdbb2                 -      -      0    203    106  10.1M
    sdbc2                 -      -      0    196    106  10.0M
    sdbd2                 -      -      0    198    106  10.1M
    sdbe2                 -      -      0    197    106  10.1M
    sdbf2                 -      -      0    198    106  10.1M
    sdbg2                 -      -      0    198    106  10.1M
    sdbh2                 -      -      0    196    106  10.1M
    sdbi2                 -      -      0    196    106  10.1M
    sdbj2                 -      -      0    199    106  10.1M
    sdbk2                 -      -      0    200    106  9.97M
    sdbl2                 -      -      0    199    106  10.0M
    sdbm2                 -      -      0    201    106  10.1M
    sdbn2                 -      -      0    201     10  10.0M
    sdbo2                 -      -      0    198    106  10.1M
    sdbp2                 -      -      0    201    106  10.1M
    sdbq2                 -      -      0    203     10  10.1M
    sdbr2                 -      -      0    197    106  10.1M
    sdbs2                 -      -      0    200    106  10.1M
    sdbt2                 -      -      0    197    106  10.1M
    sdbu2                 -      -      0    195    106  10.0M
    sdbv2                 -      -      0    198    106  10.0M
    sdbw2                 -      -      0    199    106  10.1M
    sdbx2                 -      -      0    196    106  10.0M
    sdby2                 -      -      0    196    106  10.0M
    sdbz2                 -      -      0    197    106  9.97M
    sdca2                 -      -      0    194    106  10.1M
    sdcb2                 -      -      0    200    106  10.1M
    sdcc2                 -      -      0    197    106  10.0M
    sdcd2                 -      -      0    200    106  10.0M
special                   -      -      -      -      -      -
  mirror-1            5.45G  1.81T      0    901     38  5.90M
    sdy                   -      -      0    449     26  2.95M
    sdt                   -      -      0    451     12  2.95M
--------------------  -----  -----  -----  -----  -----  -----

For whatever reason, they have a slower read rate than the other drives, which may be affecting write speed. I think I have to unplug them and plug them back in to fix this, or restart the NAS.

Just noticed in this last set of data, the speed picked up a lot! We're at a 70% increase. Still not even double with 4x the redundancy groups :(.
 
Last edited:

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
For zpool iostat, -n isn't really needed, and your trailing number is the interval: 0.5 seconds. I'd up that to 5 or 10 and see what the numbers are over a larger interval. Ignore the very first line since it's an average; every line after that is real-time.
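Something like this should give steadier numbers (10-second interval, six samples; the first line is the running average):
Code:
zpool iostat -v Temp 10 6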
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
Fun stuff, I pulled both drives with slower read speeds and plugged them back in. Here's what I'm seeing:
Code:
# zpool status -v Temp
  pool: Temp
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver (draid2:8d:41c:1s-0) in progress since Sun Nov  5 21:38:55 2023
        1.33T / 1.20T scanned at 5.12G/s, 331G / 200G issued at 1.25G/s
        32.8G resilvered, 100.00% done, no estimated completion time
config:

    
NAME                  STATE     READ WRITE CKSUM
        Temp                  ONLINE       0     0     0
          draid2:8d:41c:1s-0  ONLINE       0     0     0
            sdap2             ONLINE       0     0     0  (resilvering)
            sdaq2             ONLINE       0     0     0  (resilvering)
            sdar2             ONLINE       0     0     0  (resilvering)
            sdas2             ONLINE       0     0     0  (resilvering)
            sdat2             ONLINE       0     0     0  (resilvering)
            sdau2             ONLINE       0     0     0  (resilvering)
            sdav2             ONLINE       0     0     0  (resilvering)
            sdaw2             ONLINE       0     0     0  (resilvering)
            sdax2             ONLINE       0     0     0  (resilvering)
            sday2             ONLINE       0     0     0  (resilvering)
            sdaz2             ONLINE       0     0     0  (resilvering)
            sdba2             ONLINE       0     0     0  (resilvering)
            sdbb2             ONLINE       0     0     0  (resilvering)
            sdbc2             ONLINE       0     0     0  (resilvering)
            sdbd2             ONLINE       0     0     0  (resilvering)
            sdbe2             ONLINE       0     0     0  (resilvering)
            sdbf2             ONLINE       0     0     0  (resilvering)
            sdbg2             ONLINE       0     0     0  (resilvering)
            sdbh2             ONLINE       0     0     0  (resilvering)
            sdbi2             ONLINE       0     0     0  (resilvering)
            sdbj2             ONLINE       0     0     0  (resilvering)
            sdbk2             ONLINE       0     0     0  (resilvering)
            sdbl2             ONLINE       0     0     0  (resilvering)
            sdbm2             ONLINE       0     0     0  (resilvering)
            spare-24          ONLINE       0     0     0
              sdek2           ONLINE       0     0     0
              draid2-0-0      ONLINE       0     0     0  (resilvering)
            sdbo2             ONLINE       0     0     0  (resilvering)
            sdbp2             ONLINE       0     0     0  (resilvering)
            sdbn2             ONLINE       0     0     0  (resilvering)
            sdbr2             ONLINE       0     0     0  (resilvering)
            sdbs2             ONLINE       0     0     0  (resilvering)
            sdbt2             ONLINE       0     0     0  (resilvering)
            sdbu2             ONLINE       0     0     0  (resilvering)
            sdbv2             ONLINE       0     0     0  (resilvering)
            sdbw2             ONLINE       0     0     0  (resilvering)
            sdbx2             ONLINE       0     0     0  (resilvering)
            sdby2             ONLINE       0     0     0  (resilvering)
            sdbz2             ONLINE       0     0     0  (resilvering)
            sdca2             ONLINE       0     0     0  (resilvering)
            sdcb2             ONLINE       0     0     0  (resilvering)
            sdcc2             ONLINE       0     0     0  (resilvering)
            sdcd2             ONLINE       0     0     0  (resilvering)
        special
          mirror-1            ONLINE       0     0     0
            sdy               ONLINE       0     0     0
            sdt               ONLINE       0     0     0
        spares
          draid2-0-0          INUSE     currently in use

It looks like it resilvered insanely fast, if I'm not mistaken. This is a brand-new pool with only 1.2TiB of data, but that's still pretty good! It must've happened so fast it didn't have time to calculate the completion time.

I remember something about a scrub occurring after every resilver with dRAID, but that scrub hasn't happened yet. I'm assuming the combination of writing data to the pool while it's trying to resilver is keeping it from finishing?

Confirmed. Immediately after I killed the transfer, the resilver finished too:
Code:
# zpool status -vL Temp
  pool: Temp
 state: ONLINE
  scan: scrub in progress since Sun Nov  5 21:48:50 2023
        1.59T / 1.59T scanned, 10.3G / 1.59T issued at 961M/s
        0B repaired, 0.63% done, 00:28:48 to go
  scan: resilvered (draid2:8d:41c:1s-0) 36.4G in 00:09:54 with 0 errors on Sun Nov  5 21:48:50 2023
config:

        NAME                  STATE     READ WRITE CKSUM
        Temp                  ONLINE       0     0     0
          draid2:8d:41c:1s-0  ONLINE       0     0     0
            sdap2             ONLINE       0     0     0
            sdaq2             ONLINE       0     0     0
            sdar2             ONLINE       0     0     0
            sdas2             ONLINE       0     0     0
            sdat2             ONLINE       0     0     0
            sdau2             ONLINE       0     0     0
            sdav2             ONLINE       0     0     0
            sdaw2             ONLINE       0     0     0
            sdax2             ONLINE       0     0     0
            sday2             ONLINE       0     0     0
            sdaz2             ONLINE       0     0     0
            sdba2             ONLINE       0     0     0
            sdbb2             ONLINE       0     0     0
            sdbc2             ONLINE       0     0     0
            sdbd2             ONLINE       0     0     0
            sdbe2             ONLINE       0     0     0
            sdbf2             ONLINE       0     0     0
            sdbg2             ONLINE       0     0     0
            sdbh2             ONLINE       0     0     0
            sdbi2             ONLINE       0     0     0
            sdbj2             ONLINE       0     0     0
            sdbk2             ONLINE       0     0     0
            sdbl2             ONLINE       0     0     0
            sdbm2             ONLINE       0     0     0
            sdek2             ONLINE       0     0     0
            sdbo2             ONLINE       0     0     0
            sdbp2             ONLINE       0     0     0
            sdbn2             ONLINE       0     0     0
            sdbr2             ONLINE       0     0     0
            sdbs2             ONLINE       0     0     0
            sdbt2             ONLINE       0     0     0
            sdbu2             ONLINE       0     0     0
            sdbv2             ONLINE       0     0     0
            sdbw2             ONLINE       0     0     0
            sdbx2             ONLINE       0     0     0
            sdby2             ONLINE       0     0     0
            sdbz2             ONLINE       0     0     0
            sdca2             ONLINE       0     0     0
            sdcb2             ONLINE       0     0     0
            sdcc2             ONLINE       0     0     0
            sdcd2             ONLINE       0     0     0
        special
          mirror-1            ONLINE       0     0     0
            sdy               ONLINE       0     0     0
            sdt               ONLINE       0     0     0
        spares
          draid2-0-0          AVAIL

errors: No known data errors

And it did kick off the scrub!
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Yes, 2 steps. The first step is to bring the pool back to an online state via a sequential resilver (the default), which is very fast (maybe 5 times faster, though it depends on how big your stripes are); the second step is a scrub, I believe, and that is still slow. Only the first part has to be done to be healthy again, so that's the big advantage in large pools. If you replace the drive, then the array is rebalanced, like a resilver.

I believe, but am not certain, that if you have a hot spare, the scrub is not done as step 2, but a rebalance instead. Anyone know? I've personally only used it once, so I'm trying to recall.
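If you do end up replacing the failed disk, my understanding is it's just a normal replace and the distributed spare frees up afterward (device names below are placeholders):
Code:
# Rebalance back onto a new disk; draid2-0-0 should return to AVAIL when it finishes
zpool replace Temp sdXX /dev/sdYY
zpool status Temp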
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
The ZFS scrub is mandatory after a hot spare is activated & re-silvered because:
"Checksums are not validated during a sequential resilver."

That sort of makes sense, as the intent is to restore as much redundancy as it can. (Aka for dRAID-1, all redundancy... and for 2 or 3 parity versions, the additional redundancy.) They make the assumption that the dRAID vDev is otherwise clean, and will check it after redundancy is fully restored.

Some of the wording in the referenced doc could use some cleanup. It is not as clear as it could be regarding disk failure, integrated hot spare activation, and disk replacements.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
This is the 4th time I've remade the pool for testing. We're back at 44 drives:
Code:
# zpool iostat -vL Temp -n 10
                        capacity     operations     bandwidth
pool                  alloc   free   read  write   read  write
--------------------  -----  -----  -----  -----  -----  -----
Temp                   137G   385T      0  3.35K      0   625M
  draid2:4d:44c:1s-0   136G   383T      0  3.32K      0   624M

Roughly 900GiB of family videos (I have 3 kids :tongue:):
Code:
# zfs send -RL Bunnies/Videos@transferTest | pv -Wbraft | zfs recv -Fuv -o recordsize=16M Temp/Videos
receiving full stream of Bunnies/Videos@weekly-2023-05-14_00-00 into Temp/Videos@weekly-2023-05-14_00-00
 149GiB 0:05:14 [ 435MiB/s] [ 488MiB/s]
 244GiB 0:08:28 [ 432MiB/s] [ 492MiB/s]

I also set a 16M recordsize on the receive to see if that would speed things up. It probably only affects reads, though.
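To double-check that the override actually landed on the received dataset:
Code:
zfs get recordsize,compression Temp/Videos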

Either way, this dRAID has much less capacity but now runs 7 redundancy groups. It's so freakin' slow. What, is each redundancy group only worth ~100MB/s? I thought newer HDDs, especially these enterprise ones, could do ~200MB/s easily.
 
Last edited:

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
Well crap. I meant to delete an attachment on my post and deleted the whole post...

The same transfer with 3 x draid2:4d:13c:1s:
Code:
received 867G stream in 831.54 seconds (1.04G/sec)

While undeniably faster, I think this is still too slow. With this many drives, mirrors would give me almost the same capacity, but I'm uncertain how much faster they'd be.

I don't understand why the transfer speed to the 39 HDDs in this zpool is so slow. I might have skewed expectations from using SSDs almost exclusively for over 10 years, but this pool should still be a lot faster.

Unless... Maybe the issue is the pool I'm transferring from! Is there any good way to check where the bottleneck is coming from?
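One way I can think of to isolate the source side: send the same snapshot to /dev/null and watch the read rate on its own (reusing the same pv flags):
Code:
# Pure read test of the source pool; nothing gets written to Temp
zfs send -RL Bunnies/Videos@transferTest | pv -Wbraft > /dev/null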
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
The ZFS scrub is mandatory after a hot spare is activated & re-silvered because:


That sort of makes sense, as the intent is to restore as much redundancy as it can. (Aka for dRAID-1, all redundancy... and for 2 or 3 parity versions, the additional redundancy.) They make the assumption that the dRAID vDev is otherwise clean, and will check it after redundancy is fully restored.

Some of the wording in the referenced doc could use some cleanup. It is not as clear as it could be regarding disk failure, integrated hot spare activation, and disk replacements.
Yes, I'm just not clear on whether it skips the scrub with a hot spare, since a hot spare would negate the necessity by doing a healing resilver.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Yes, I'm just not clear on whether it skips the scrub with a hot spare, since a hot spare would negate the necessity by doing a healing resilver.
It is not clear in the documentation...

If the initial hot spare activation was done without verifying checksums, then I would want a checksum verification scan of those blocks after the hot spare is fully synced.


For example, the RAID-Zx column grow feature appears to avoid the checksum verification, so that it can complete faster. This implies a scrub should be done after to make sure no bit rot was copied blindly.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
I wouldn't change the recordsize to 16M myself. 16M would necessitate calculating compression and checksum based on more data. 1M is a sweet spot for the most part. Not sure if it would make a huge difference, but go back to 1M if you can.

I agree your 7 group is too slow. But you are replicating which does add some overhead.

What if you did the test with a cp -a for a copy instead of replication?
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
It is not clear in the documentation...

If the initial hot spare activation was done without verifying checksums, then I would want a checksum verification scan of those blocks after the hot spare is fully synced.


For example, the RAID-Zx column grow feature appears to avoid the checksum verification, so that it can complete faster. This implies a scrub should be done after to make sure no bit rot was copied blindly.
Hm, I wouldn't think so. From the OpenZFS website:

"A traditional healing resilver scans the entire block tree. This means the checksum for each block is available while it’s being repaired and can be immediately verified."

Followed by this:

"Distributed spare space can be made available again by simply replacing any failed drive with a new drive. This process is called rebalancing and is essentially a resilver. When performing rebalancing a healing resilver is recommended since the pool is no longer degraded. This ensures all checksums are verified when rebuilding to the new disk and eliminates the need to perform a subsequent scrub of the pool."

So, the resilver should verify the checksums. But yeah, I agree, it's not clear at all, to me at least. That's just nice to know; the odder issue here is his 7-group draid being so slow! I'm not sure why; any thoughts?
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
Ran some `fio` tests with my SSD and HDD pools. This is the command:
Code:
zfs set primarycache=none Temp
fio --ioengine=libaio --filename=/mnt/Temp/performanceTest --direct=1 --sync=0 --rw=[DEPENDS] --bs=16M --numjobs=1 --iodepth=1 --runtime=10 --size=5G --time_based --name=fio
rm /mnt/Temp/performanceTest
zfs set primarycache=all Temp

HDD (in the 3 x dRAID2 vdev configuration) `--rw=readwrite`
Code:
   READ: bw=3164MiB/s (3318MB/s), 3164MiB/s-3164MiB/s (3318MB/s-3318MB/s), io=30.9GiB (33.2GB), run=10001-10001msec
  WRITE: bw=3336MiB/s (3498MB/s), 3336MiB/s-3336MiB/s (3498MB/s-3498MB/s), io=32.6GiB (35.0GB), run=10001-10001msec

From what I'm seeing, it's not writing as fast as it can since zfs send/recv transfers sequential blocks.
Code:
fio: (g=0): rw=rw, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
fio: Laying out IO file (1 file / 5120MiB)
Jobs: 1 (f=1): [M(1)][100.0%][r=2915MiB/s,w=3331MiB/s][r=182,w=208 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=417174: Mon Nov  6 00:41:19 2023
  read: IOPS=197, BW=3164MiB/s (3318MB/s)(30.9GiB/10001msec)
    slat (usec): min=645, max=200233, avg=2094.72, stdev=6827.10
    clat (nsec): min=500, max=20880, avg=1559.05, stdev=2007.76
     lat (usec): min=646, max=200238, avg=2096.28, stdev=6827.79
    clat percentiles (nsec):
     |  1.00th=[  532],  5.00th=[  580], 10.00th=[  612], 20.00th=[  644],
     | 30.00th=[  684], 40.00th=[  732], 50.00th=[  812], 60.00th=[  932],
     | 70.00th=[ 1144], 80.00th=[ 1880], 90.00th=[ 3408], 95.00th=[ 5280],
     | 99.00th=[ 9664], 99.50th=[15040], 99.90th=[20864], 99.95th=[20864],
     | 99.99th=[20864]
   bw (  MiB/s): min=  992, max= 5184, per=99.72%, avg=3155.53, stdev=1229.60, samples=19
   iops        : min=   62, max=  324, avg=197.21, stdev=76.85, samples=19
  write: IOPS=208, BW=3336MiB/s (3498MB/s)(32.6GiB/10001msec); 0 zone resets
    slat (usec): min=1133, max=13434, avg=2803.03, stdev=1217.90
    clat (nsec): min=640, max=124045, avg=2002.21, stdev=3347.53
     lat (usec): min=1134, max=13445, avg=2805.04, stdev=1219.36
    clat percentiles (nsec):
     |  1.00th=[   860],  5.00th=[   940], 10.00th=[   980], 20.00th=[  1048],
     | 30.00th=[  1112], 40.00th=[  1176], 50.00th=[  1256], 60.00th=[  1400],
     | 70.00th=[  1656], 80.00th=[  2256], 90.00th=[  3344], 95.00th=[  5344],
     | 99.00th=[ 11712], 99.50th=[ 14144], 99.90th=[ 21376], 99.95th=[ 23936],
     | 99.99th=[124416]
   bw (  MiB/s): min=  800, max= 5184, per=98.85%, avg=3297.16, stdev=1328.73, samples=19
   iops        : min=   50, max=  324, avg=206.05, stdev=83.06, samples=19
  lat (nsec)   : 750=20.45%, 1000=16.39%
  lat (usec)   : 2=41.77%, 4=13.81%, 10=6.42%, 20=0.96%, 50=0.17%
  lat (usec)   : 250=0.02%
  cpu          : usr=3.58%, sys=58.67%, ctx=5063, majf=0, minf=13
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1978,2085,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
HDD `--rw=randrw`
Code:
   READ: bw=358MiB/s (376MB/s), 358MiB/s-358MiB/s (376MB/s-376MB/s), io=3584MiB (3758MB), run=10005-10005msec
  WRITE: bw=397MiB/s (416MB/s), 397MiB/s-397MiB/s (416MB/s-416MB/s), io=3968MiB (4161MB), run=10005-10005msec

Code:
fio: (g=0): rw=randrw, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
fio: Laying out IO file (1 file / 5120MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=208MiB/s,w=320MiB/s][r=13,w=20 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=412966: Mon Nov  6 00:39:04 2023
  read: IOPS=22, BW=358MiB/s (376MB/s)(3584MiB/10005msec)
    slat (usec): min=1201, max=486549, avg=41983.91, stdev=46202.53
    clat (nsec): min=1440, max=110244, avg=4379.27, stdev=7352.40
     lat (usec): min=1203, max=486556, avg=41988.29, stdev=46203.31
    clat percentiles (nsec):
     |  1.00th=[  1608],  5.00th=[  2320], 10.00th=[  2672], 20.00th=[  2896],
     | 30.00th=[  3088], 40.00th=[  3344], 50.00th=[  3568], 60.00th=[  3792],
     | 70.00th=[  4016], 80.00th=[  4384], 90.00th=[  5728], 95.00th=[  7136],
     | 99.00th=[ 12608], 99.50th=[ 22400], 99.90th=[110080], 99.95th=[110080],
     | 99.99th=[110080]
   bw (  KiB/s): min=98304, max=524288, per=99.60%, avg=365363.20, stdev=104854.91, samples=20
   iops        : min=    6, max=   32, avg=22.30, stdev= 6.40, samples=20
  write: IOPS=24, BW=397MiB/s (416MB/s)(3968MiB/10005msec); 0 zone resets
    slat (usec): min=1138, max=4315, avg=2409.60, stdev=612.09
    clat (nsec): min=710, max=6561, avg=2300.41, stdev=627.52
     lat (usec): min=1140, max=4318, avg=2411.90, stdev=612.28
    clat percentiles (nsec):
     |  1.00th=[  852],  5.00th=[ 1656], 10.00th=[ 1768], 20.00th=[ 1912],
     | 30.00th=[ 2008], 40.00th=[ 2160], 50.00th=[ 2224], 60.00th=[ 2352],
     | 70.00th=[ 2448], 80.00th=[ 2576], 90.00th=[ 2864], 95.00th=[ 3152],
     | 99.00th=[ 4192], 99.50th=[ 5856], 99.90th=[ 6560], 99.95th=[ 6560],
     | 99.99th=[ 6560]
   bw (  KiB/s): min=65536, max=753664, per=100.00%, avg=406323.20, stdev=188207.80, samples=20
   iops        : min=    4, max=   46, avg=24.80, stdev=11.49, samples=20
  lat (nsec)   : 750=0.21%, 1000=0.85%
  lat (usec)   : 2=15.47%, 4=67.80%, 10=15.04%, 20=0.21%, 50=0.21%
  lat (usec)   : 250=0.21%
  cpu          : usr=0.84%, sys=8.77%, ctx=513, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=224,248,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
SSD (39 mirrors) `--rw=readwrite`
Code:
   READ: bw=2209MiB/s (2316MB/s), 2209MiB/s-2209MiB/s (2316MB/s-2316MB/s), io=21.6GiB (23.2GB), run=10003-10003msec
  WRITE: bw=2318MiB/s (2430MB/s), 2318MiB/s-2318MiB/s (2430MB/s-2430MB/s), io=22.6GiB (24.3GB), run=10003-10003msec

The SSD array has well over 1GB/s of bandwidth to spare, yet these numbers are surprisingly low. I think that's because I'm doing both read and write operations at once; when I do either individually, the results are 1-3GB/s faster on this SSD zpool. Still, this looks bad!
Code:
fio: (g=0): rw=rw, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
fio: Laying out IO file (1 file / 5120MiB)
Jobs: 1 (f=1): [M(1)][100.0%][r=1890MiB/s,w=2306MiB/s][r=118,w=144 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=429273: Mon Nov  6 00:46:09 2023
  read: IOPS=138, BW=2209MiB/s (2316MB/s)(21.6GiB/10003msec)
    slat (usec): min=832, max=70369, avg=4795.38, stdev=6672.22
    clat (nsec): min=580, max=12401, avg=1815.52, stdev=1380.53
     lat (usec): min=833, max=70375, avg=4797.19, stdev=6672.90
    clat percentiles (nsec):
     |  1.00th=[  620],  5.00th=[  660], 10.00th=[  724], 20.00th=[  852],
     | 30.00th=[ 1288], 40.00th=[ 1384], 50.00th=[ 1464], 60.00th=[ 1544],
     | 70.00th=[ 1704], 80.00th=[ 2008], 90.00th=[ 3568], 95.00th=[ 5216],
     | 99.00th=[ 6752], 99.50th=[ 7328], 99.90th=[10304], 99.95th=[12352],
     | 99.99th=[12352]
   bw (  MiB/s): min= 1408, max= 2880, per=100.00%, avg=2213.05, stdev=399.72, samples=19
   iops        : min=   88, max=  180, avg=138.32, stdev=24.98, samples=19
  write: IOPS=144, BW=2318MiB/s (2430MB/s)(22.6GiB/10003msec); 0 zone resets
    slat (usec): min=1137, max=6843, avg=2305.48, stdev=680.92
    clat (nsec): min=720, max=6461, avg=1420.27, stdev=588.31
     lat (usec): min=1138, max=6846, avg=2306.90, stdev=681.35
    clat percentiles (nsec):
     |  1.00th=[  772],  5.00th=[  884], 10.00th=[  940], 20.00th=[ 1032],
     | 30.00th=[ 1096], 40.00th=[ 1160], 50.00th=[ 1224], 60.00th=[ 1336],
     | 70.00th=[ 1464], 80.00th=[ 1720], 90.00th=[ 2160], 95.00th=[ 2640],
     | 99.00th=[ 3664], 99.50th=[ 3920], 99.90th=[ 5920], 99.95th=[ 6432],
     | 99.99th=[ 6432]
   bw (  MiB/s): min= 1600, max= 3872, per=100.00%, avg=2325.89, stdev=554.26, samples=19
   iops        : min=  100, max=  242, avg=145.37, stdev=34.64, samples=19
  lat (nsec)   : 750=6.22%, 1000=13.32%
  lat (usec)   : 2=63.96%, 4=11.77%, 10=4.59%, 20=0.14%
  cpu          : usr=1.59%, sys=75.17%, ctx=11479, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1381,1449,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
SSD `--rw=randrw`
Code:
   READ: bw=786MiB/s (824MB/s), 786MiB/s-786MiB/s (824MB/s-824MB/s), io=7872MiB (8254MB), run=10018-10018msec
  WRITE: bw=835MiB/s (876MB/s), 835MiB/s-835MiB/s (876MB/s-876MB/s), io=8368MiB (8774MB), run=10018-10018msec

Code:
fio: (g=0): rw=randrw, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
fio: Laying out IO file (1 file / 5120MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=784MiB/s,w=768MiB/s][r=49,w=48 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=430476: Mon Nov  6 00:46:33 2023
  read: IOPS=49, BW=786MiB/s (824MB/s)(7872MiB/10018msec)
    slat (usec): min=946, max=202518, avg=17939.80, stdev=12526.76
    clat (nsec): min=890, max=13860, avg=2661.49, stdev=1702.05
     lat (usec): min=948, max=202526, avg=17942.46, stdev=12527.27
    clat percentiles (nsec):
     |  1.00th=[ 1320],  5.00th=[ 1416], 10.00th=[ 1448], 20.00th=[ 1496],
     | 30.00th=[ 1592], 40.00th=[ 1704], 50.00th=[ 1848], 60.00th=[ 2128],
     | 70.00th=[ 2704], 80.00th=[ 4192], 90.00th=[ 5280], 95.00th=[ 6048],
     | 99.00th=[ 8640], 99.50th=[ 9664], 99.90th=[13888], 99.95th=[13888],
     | 99.99th=[13888]
   bw (  KiB/s): min=622592, max=917504, per=99.98%, avg=804454.40, stdev=91747.32, samples=20
   iops        : min=   38, max=   56, avg=49.10, stdev= 5.60, samples=20
  write: IOPS=52, BW=835MiB/s (876MB/s)(8368MiB/10018msec); 0 zone resets
    slat (usec): min=1381, max=10733, avg=2268.95, stdev=639.04
    clat (nsec): min=810, max=7041, avg=1788.96, stdev=786.72
     lat (usec): min=1383, max=10739, avg=2270.74, stdev=639.51
    clat percentiles (nsec):
     |  1.00th=[  932],  5.00th=[ 1096], 10.00th=[ 1176], 20.00th=[ 1256],
     | 30.00th=[ 1336], 40.00th=[ 1448], 50.00th=[ 1544], 60.00th=[ 1720],
     | 70.00th=[ 1896], 80.00th=[ 2192], 90.00th=[ 2608], 95.00th=[ 3184],
     | 99.00th=[ 5472], 99.50th=[ 6112], 99.90th=[ 7072], 99.95th=[ 7072],
     | 99.99th=[ 7072]
   bw (  KiB/s): min=524288, max=1146880, per=100.00%, avg=856883.20, stdev=166500.21, samples=20
   iops        : min=   32, max=   70, avg=52.30, stdev=10.16, samples=20
  lat (nsec)   : 1000=0.79%
  lat (usec)   : 2=64.43%, 4=23.55%, 10=11.13%, 20=0.10%
  cpu          : usr=0.88%, sys=24.23%, ctx=18161, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=492,523,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
 
Last edited:

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
I wouldn't change the recordsize to 16M myself. 16M would necessitate calculating compression and checksum based on more data. 1M is a sweet spot for the most part. Not sure if it would make a huge difference, but go back to 1M if you can.

I agree your 7 group is too slow. But you are replicating which does add some overhead.

What if you did the test with a cp -a for a copy instead of replication?
I did some `fio` tests and was writing a response when you posted. Do those suffice? You're right, recalculating at 16M does require more compression work, and that could be negatively affecting performance.

The SSD zpool dataset is 128K and the HDD one is 1M by default, so either way (1M or 16M), it's gonna have to redo those calculations anyway. 16M allows for larger blocks, which leads to better compression, but I'm assuming that compression step gets heavier on the CPU the larger the block is? I have a 16-core Epyc (Zen 3) in this system with 256GB of RAM. I'd assume inline compression should be fast enough not to make a difference here, since write speed to the HDDs is gonna be the slower part anyway.
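For reference, here's how I'd double-check what each dataset is actually using:
Code:
zfs get recordsize,compression,compressratio Bunnies/Videos Temp/Videos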
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
The compression might add some latency due to the larger recordsize. Yes, in the end it calculates over the total data anyway, but I still think it adds some latency. Is it significant? Not sure. The testing I've seen says 16M is slower, except for certain use cases. Not sure it's enough to make much of a difference, though.

But I'm not even sure which draid config you are testing now; it all gets confusing eventually. How wide is your fixed stripe now (without parity)? I presume the destination draid is a new pool, so these should all be sequential writes.

I do wonder about the large recordsize, though. If you have 6 data drives in a group, that's 24k uncompressed per stripe; let's say it doesn't compress, as most video won't. A 16M recordsize is an awful lot of 24k data stripes! The dRAID primer recommends a 1M recordsize for sequential IO, which reduces a write to 43 stripes or so.

Openzfs has this to say also which should be the way it works and what we all said earlier I think:

In general a smaller value of D will increase IOPS, improve the compression ratio, and speed up resilvering at the expense of total usable capacity

Note, on zfs send, in an incremental at least, it's not really sequential in the sense that the changelog is processed in time order, so, data can be all over the place. It's quite random.

I'd suggest resetting to 1M recordsize and I'd like to eliminate the send/recv and just go with cp -a some big directory with lots of video and time it. I think send/recv introduces other factors and isn't representative of how fast/slow a pool might perform. Simple amount copied / time give you the speed. Just copy some actual files.

One thing I got by re-reading the openzfs doc, and I didn't consider the time I used draid. Since we know small files will eat a lot of space with draid, they say this:

"If a dRAID pool will hold a significant amount of small blocks, it is recommended to also add a mirrored special vdev to store those blocks."

Hah, makes total sense! That eliminates the small file issue.

Now I wish I had a bunch of drives to play with dRAID again.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
The compression might add some latency due to the larger recordsize. Yes, in the end it calculates over the total data anyway, but I still think it adds some latency. Is it significant? Not sure. The testing I've seen says 16M is slower, except for certain use cases. Not sure it's enough to make much of a difference, though.

But I'm not even sure which draid config you are testing now; it all gets confusing eventually. How wide is your fixed stripe now (without parity)? I presume the destination draid is a new pool, so these should all be sequential writes.

I do wonder about the large recordsize, though. If you have 6 data drives in a group, that's 24k uncompressed per stripe; let's say it doesn't compress, as most video won't. A 16M recordsize is an awful lot of 24k data stripes! The dRAID primer recommends a 1M recordsize for sequential IO, which reduces a write to 43 stripes or so.

Openzfs has this to say also which should be the way it works and what we all said earlier I think:

In general a smaller value of D will increase IOPS, improve the compression ratio, and speed up resilvering at the expense of total usable capacity

Note, on zfs send, in an incremental at least, it's not really sequential in the sense that the changelog is processed in time order, so, data can be all over the place. It's quite random.

I'd suggest resetting to 1M recordsize and I'd like to eliminate the send/recv and just go with cp -a some big directory with lots of video and time it. I think send/recv introduces other factors and isn't representative of how fast/slow a pool might perform. Simple amount copied / time give you the speed. Just copy some actual files.

One thing I got by re-reading the openzfs doc, and I didn't consider the time I used draid. Since we know small files will eat a lot of space with draid, they say this:

"If a dRAID pool will hold a significant amount of small blocks, it is recommended to also add a mirrored special vdev to store those blocks."

Hah, makes total sense! That eliminates the small file issue.

Now I wish I had a bunch of drives to play with dRAID again.
My dRAID is currently the 3 x dRAID2 with 6d, 2p, 1s, 13c. I have a 2-drive 2 TiB SSD mirror for metadata. Wouldn't create a dRAID without it!

I'm testing zfs send/recv because this pool is going to be used only as a backup pool for my SSDs. What's the exact `cp -a` command you'd like me to run?

Even with 1M recordsize, the write speed is unchanged in this configuration (averaging 1GB/s):
Code:
60.5GiB 0:01:00 [ 813MiB/s] [1.01GiB/s]

But I wonder how sequential the reads are for this transfer from my SSD zpool.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
You shouldn't have to use a 3x stripe; I would hope you can get the 7-group or 6-group layout to work, which is the ideal scenario.

Just `cp -a directory somedirontargetpool/` (that is the whole command), with any old directory that has a large enough number of files, at least 50GB say. This will avoid any send/recv issues and test real performance with your data. I realize you want to use it for send/recv, but first I'd like to see real-world performance.
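For example (paths below are just examples; any directory with 50GB+ of files works):
Code:
# Time a plain copy, then divide the copied size by the elapsed time for MB/s
time cp -a /mnt/Bunnies/Videos /mnt/Temp/
du -sh /mnt/Temp/Videos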

I think, but am not certain, that your fio test is flawed, and I want to avoid it anyway: iodepth=1 and numjobs=1 seem like they will greatly slow things down, acting more like sync writes. Maybe someone more familiar with fio can comment.
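If you do re-run fio, something along these lines would add queue depth and parallel jobs (a sketch built from your earlier flags; adjust to taste):
Code:
fio --ioengine=libaio --filename=/mnt/Temp/performanceTest --direct=1 --sync=0 \
    --rw=readwrite --bs=1M --numjobs=4 --iodepth=16 --group_reporting \
    --runtime=30 --time_based --size=5G --name=fio-parallel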
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
You shouldn't have to use a 3x stripe; I would hope you can get the 7-group or 6-group layout to work, which is the ideal scenario.

Just `cp -a directory somedirontargetpool/` (that is the whole command), with any old directory that has a large enough number of files, at least 50GB say. This will avoid any send/recv issues and test real performance with your data. I realize you want to use it for send/recv, but first I'd like to see real-world performance.

I think, but am not certain, that your fio test is flawed, and I want to avoid it anyway: iodepth=1 and numjobs=1 seem like they will greatly slow things down, acting more like sync writes. Maybe someone more familiar with fio can comment.
HDD: 4 x draid2:5d:15c:1s (no SSD metadata)
iodepth=1
Code:
# zfs set primarycache=none Temp
# fio --ioengine=libaio --filename=/mnt/Temp/performanceTest --direct=1 --sync=0 --rw=readwrite --bs=16M --numjobs=1 --iodepth=1 --runtime=10 --size=5G --time_based --name=fio
# rm /mnt/Temp/performanceTest
# zfs set primarycache=all Temp

   READ: bw=3537MiB/s (3709MB/s), 3537MiB/s-3537MiB/s (3709MB/s-3709MB/s), io=35.0GiB (37.6GB), run=10137-10137msec
  WRITE: bw=3720MiB/s (3901MB/s), 3720MiB/s-3720MiB/s (3901MB/s-3901MB/s), io=36.8GiB (39.5GB), run=10137-10137msec

No `iodepth` set
Code:
# zfs set primarycache=none Temp
# fio --ioengine=libaio --filename=/mnt/Temp/performanceTest --direct=1 --sync=0 --rw=readwrite --bs=16M --numjobs=1 --runtime=10 --size=5G --time_based --name=fio
# rm /mnt/Temp/performanceTest
# zfs set primarycache=all Temp

   READ: bw=3591MiB/s (3765MB/s), 3591MiB/s-3591MiB/s (3765MB/s-3765MB/s), io=35.1GiB (37.7GB), run=10013-10013msec
  WRITE: bw=3769MiB/s (3953MB/s), 3769MiB/s-3769MiB/s (3953MB/s-3953MB/s), io=36.9GiB (39.6GB), run=10013-10013msec

SSD: 39 x mirrors w/ NVMe Optane metadata
No `iodepth` set
Code:
   READ: bw=3707MiB/s (3887MB/s), 3707MiB/s-3707MiB/s (3887MB/s-3887MB/s), io=36.2GiB (38.9GB), run=10001-10001msec
  WRITE: bw=3889MiB/s (4078MB/s), 3889MiB/s-3889MiB/s (4078MB/s-4078MB/s), io=38.0GiB (40.8GB), run=10001-10001msec

Code:
# time cp -a /mnt/Bunnies/performanceTest /mnt/Temp/
Transfer Speed 2280.31 MB/s
 