Should I be making multiple dRAID vdevs?

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
I keep hearing that dRAID is there for 100+ drive pools, and looking at TrueNAS SCALE Cobia, the UI is clearly designed around a single dRAID vdev. It doesn't even let you add more dRAID vdevs.

So I'm wondering if dRAID really is this mystical "one vdev is plenty" scenario. Does it work so differently from RAID-Z and mirrors that you no longer need to stripe multiple vdevs, since it's doing that internally?
 

ABain

Bug Conductor
iXsystems
Joined
Aug 18, 2023
Messages
172
There is some information available in the dRAID primer in our documentation, which covers both the layouts and the concepts. Hope this helps.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
I've read that multiple times already. Can you point me to or quote the section that explains how you lay out dRAID vdevs versus RAID-Z?

From reading the article and testing it out in TrueNAS, it looks like I wanna put as many disks in my zpool as I can into a single dRAID vdev.

That article says it's for hundreds of drives, and others note that it was tested on 90-device configurations (how many spares? Which dRAID level?). I also read somewhere that the ZFS devs think it doesn't make sense for fewer than 20 devices, while other devs are telling me RAID-Z2 is bad in large vdevs, even just 11 drives wide, because of the resilver time.

I've been trying to figure this out for a week now, and having a single vdev could make sense. Unlike RAID-Z, dRAID splits the pool data into chunks sized by data + parity + spares. So if the number of children is larger than that minimum, you're able to use more drives as if you were using more vdevs.

ZFS may not work this way with vdevs though. I'm wondering if many smaller dRAID vdevs would be better because you get the benefit of striping and fast resilvering, at the cost of at least one extra drive, unless you go down from RAID-Z2 to dRAID1 + 1 spare, where it's the same number of drives.
 
Last edited by a moderator:

iXChris

Solutions
iXsystems
Joined
Oct 25, 2023
Messages
5
Hey, Chris with iX here. dRAID can be confusing, perhaps more so for users who are familiar with RAIDz than for those who are new to ZFS in general, as some of the terminology is similar and some is not.

In general a single vdev is ideal for dRAID, as the distributed spares are shared across the entire vdev; multiple vdevs would require proportionally more spares to provide the same level of spare availability. In dRAID configurations, redundancy groups are similar in function to vdevs in a RAIDz pool. A redundancy group is composed of a number of data devices and a number of parity devices, and a number of redundancy groups make up a single dRAID vdev. Data is striped across redundancy groups, providing a performance benefit similar to multiple vdevs in a RAIDz pool.

For me personally, it helped to look at the ZFS CLI commands to better understand how a dRAID vdev is constructed.

Code:
zpool create tank draid2:8d:2s:22c /dev/sd[...]


This command creates a pool named tank backed by a single dRAID vdev.

draid2 indicates a parity level, P, of 2 per redundancy group.
8d indicates 8 data devices, D, per redundancy group.
2s indicates 2 distributed spares, S, which are distributed across the entire vdev.
22c indicates the number of children, C, or total number of drives in the dRAID vdev.

The number of redundancy groups can be calculated as follows:

Code:
number of redundancy groups = (C - S)/(D + P)


So in this case we would have (22 - 2)/(8 + 2), which results in 2 redundancy groups. Data will be striped across these 2 redundancy groups, each of which has 10 total disks. Creation of this same pool using the TrueNAS WebUI is as follows:
[Screenshot: creating the same draid2:8d:2s:22c layout in the TrueNAS WebUI]
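If you want to sanity-check a layout before building it, the arithmetic is simple enough to do in a shell. This is just an illustration using bash integer math with the example values above, not a TrueNAS tool:

Code:
D=8; P=2; S=2; C=22
echo "redundancy groups: $(( (C - S) / (D + P) ))"   # prints 2 for this layout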


dRAID may not be a "one size fits all" solution, but it is most valuable in large pools composed of large disks, as resilver time is greatly improved over RAIDz and the pool can get back to a healthy state far faster. Part of this increase in speed is because dRAID uses a sequential resilver, in which blocks are read and restored sequentially and a scrub is performed after the resilver to verify checksums, unlike a RAIDz resilver, where the entire block tree is traversed and checksums are verified during the resilver.

The cost of using sequential resilver is the requirement to use a fixed stripe width. In the above example, using 4k sector disks, the minimum allocation size is 32k (8 data devices, D, multiplied by 4K sectors). Any file smaller than 32k will be padded with zeros to take up an entire stripe. This could have a significantly negative impact for certain use cases, such as databases. In addition, best storage utilization occurs when the number of data devices is a power of 2, which aligns with ZFS record sizes.
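To put some rough numbers on that, here is a simple shell illustration (assuming 4k sector disks; parity and padding come on top of this):

Code:
for D in 4 8 16; do
  echo "${D} data devices x 4K sectors = $(( D * 4 ))K minimum data allocation"
done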

One other caveat is that dRAID pools depend on precomputed permutation maps to determine where data, parity, and spare capacity reside across the pool; this ensures that during resilvers IO is distributed equally across all members of the vdev. Because of this, distributed spares cannot be added or removed after creation, unlike spares in a RAIDz pool.

Please let me know if that helps; I'm happy to clear up any questions.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
Thanks for your response, Chris!

The piece I got out of it is that you don't need to stripe a bunch of dRAID vdevs to achieve the same performance. That's the part that I was struggling to understand in my first post.

Some other things that are unclear:
  1. General recommendations (like "never use RAID-Z") for how many distributed hotspares to use for what number of drives and which dRAID redundancy.
  2. How long it might take to fill in the distributed spare gaps if a drive is taken out?
  3. Is the healing resilver time the same as with RAID-Z (relative to the number of data + parity devices) when replacing a drive?
I was already familiar with the rest of what you wrote. This article from IBM about their dRAID technology is what explained it to me:

Since dRAID has no vdev expansion, only pool expansion, you have to be much more careful when setting it up: you cannot expand without adding another vdev, and these vdevs are meant to be very large.

For instance, if I had a dRAID vdev of 60 drives and wanted to add 15 more drives, I'd have to make another dRAID vdev in the same pool with only 15 rather than 60, making the pool unbalanced. ZFS is supposed to handle this, but TrueNAS complains. In fact, TrueNAS won't even let me do it in `23.10.0`; I have to use the `zpool` CLI command, and then the dashboard UI looks funky.
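(For reference, this is the kind of command I mean; the layout and device list here are just placeholders, not my actual setup:)

Code:
zpool add tank draid2:8d:15c:1s /dev/sd[...]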

That's why I'm also wondering if it makes sense to use dRAID for smaller groups of drives (15-16) with multiple vdevs.

I opened a GitHub issue for adding dRAID expansion: https://github.com/openzfs/zfs/issues/15413. This is one of the key features of IBM's dRAID implementation.
 
Last edited:

iXChris

Solutions
iXsystems
Joined
Oct 25, 2023
Messages
5
  1. General recommendations (like "never use RAID-Z") for how many distributed hotspares to use for what number of drives and which dRAID redundancy.
We do not have any solid guidance around the number of hot spares or dRAID parity level at this point. We will test over time and get additional insight from that testing. In general, I would suggest giving each redundancy group a similar number of drives and parity level as you would a RAIDz vdev.
  2. How long it might take to fill in the distributed spare gaps if a drive is taken out?
Again, our testing is limited. However, I would expect sequential resilver time to decrease in proportion to the total number of children in the vdev. Since all drives are read from and written to at the same time, total IO should scale with the total number of drives.
  3. Is the healing resilver time the same as with RAID-Z (relative to the number of data + parity devices) when replacing a drive?
Rebalancing occurs after a drive is replaced; this is essentially a traditional resilver and will likely take a similar amount of time as a standard RAIDz resilver, but the array is not in a degraded state during this process.

Ultimately we will discover more performance characteristics as we get more testing time with dRAID, but in general the benefits of dRAID over RAIDz should become more apparent with a higher total number of disks.
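As an aside, the physical replacement itself uses the standard ZFS workflow; from the CLI it would look something like this (pool and device names are placeholders):

Code:
zpool replace tank <failed-disk> /dev/sdX
zpool status tank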
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
Thanks for your detailed response! This gives me more information even if some of it is "we haven't done enough testing" :P.

Rebalancing occurs after a drive is replaced; this is essentially a traditional resilver and will likely take a similar amount of time as a standard RAIDz resilver
At least the rebuild is only as slow as RAID-Z and not slower.

the array is not in a degraded state during this process
That's great! Not being in a degraded state when doing a week-long resilver is a clear benefit of dRAID.

It sounds like even with multiple vdevs (so long as you have at least 1 spare), dRAID is gonna be better than RAID-Z, except in the case where you have a bunch of really small files. Even then, as long as you load it up with more drive capacity, you should be fine.

I'm excited to use this technology in my new 30- and 100-drive storage pools. I'll probably go with smaller 15-, 16-, and 32-drive vdevs because they're easier to grow (for the amount of money I'll be spending) while keeping each vdev uniform.
 

iXChris

Solutions
iXsystems
Joined
Oct 25, 2023
Messages
5
Happy to be of assistance! That said, I hope that my assumptions are accurate and can be verified with testing. Enterprise demand will dictate where our testing resources are devoted, so any community users who can help contribute to tribal knowledge will be greatly appreciated!
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
I have to use the `zpool` CLI command, and then the dashboard UI looks funky.
To avoid that, you need to construct the vdev referring to individual partitions by UUID. On FreeBSD I am familiar enough with gpart to construct the necessary layout from scratch. Since that is not the case on Linux, I have come to this method for constructing custom vdev layouts in SCALE:

Check the "swap size" setting, e.g. set to zero for SSDs if you already have a pool of spinning disks with swap on them. Or make sure to match the size on the disks already present. You get the idea.

Create a dummy pool containing all your new drives. RAIDZ2, whatever, using the UI.

Export the dummy pool.

Use lsblk (IIRC) to find the UUIDs of all the ZFS partitions on the new disks.

Create new vdev referring to these UUIDs.

Done. The UI will look consistent now.
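Roughly, the last two steps look like this; the UUIDs, pool name, and layout below are placeholders (use zpool add instead of zpool create if you are extending an existing pool):

Code:
# list the ZFS data partitions and their partition UUIDs
lsblk -o NAME,SIZE,PARTUUID

# build the real vdev against those partition UUIDs
zpool create tank draid2:8d:2s:22c \
    /dev/disk/by-partuuid/<uuid-1> /dev/disk/by-partuuid/<uuid-2> [...]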
 
Last edited:

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I thought dRAID was faster to re-silver onto a hot spare. This is because the data being read is scattered, as is the destination. (Aka integrated hot spare.) There is no single disk being written, which would cause a write bottleneck. So multiple reads and writes can occur at the same time to re-silver after a disk failure.

Next, of course the dRAID should be degraded during disk loss. I don't know if it should show it or not. But there is a loss of some redundancy on disk loss.

Last, replacement of the failed disk does take as long as it would for RAID-Zx. This is because there is now a write bottleneck due to a single disk being written. But, because an (integrated) hot spare is in use, the dRAID vdev is not degraded.


Take my understanding with a huge block of salt. I could be wrong on some or all of the points.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
Next, of course the dRAID should be degraded during disk loss. I don't know if it should show it or not. But there is a loss of some redundancy on disk loss.
I see what you mean. Since all drives are technically data + parity + spares (gaps), you'll be degraded until the hotspare gaps are filled. Once filled though, you're back to normal even while a replacement drive is resilvering.

I don't think redundancy is lost until you lose all your spares though.

ASSUMPTIONS
If you have 1 spare + 1 parity, losing one drive uses up the remaining spare space. If you lose another drive during this rebuild, your data is gone; but if the rebuild finishes fast enough, you're back to 1-parity protection until you replace that physical drive.

With 4 hotspares and 2 parity, you can lose 2 drives (degraded) before a sequential resilver has completed and still retain your data. Once resilvered, you could lose 2 more drives, and you're still good (degraded until resilvered). It isn't until you lose a 5th drive that you're in parity-loss territory and are running degraded until you replace a physical drive. Losing a 6th drive is the same. A 7th in this case would mean a complete failure of the vdev and zpool, but even 3 drive failures could've killed your pool if they happened before a sequential resilver finished.

That's why it's important to know how long a sequential resilver takes at different dRAID sizes.
 
Last edited:

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
To avoid that you need to construct the vdev referring to individual partitions by UUID. In FreeBSD I am familiar enough with gpart to construct the necessary layout from scratch. Since that is not the case for Linux I have come to this method to construct custom vdev layouts in SCALE:

Check the "swap size" setting, e.g. set to zero for SSDs if you already have a pool of spinning disks with swap on them. Or make sure to match the size on the disks already present. You get the idea.

Create a dummy pool containing all your new drives. RAIDZ2, whatever, using the UI.

Export the dummy pool.

Use lsblk (IIRC) to find the UUIDs of all the ZFS partitions on the new disks.

Create new vdev referring to these UUIDs.

Done. The UI will look consistent now.
I don't understand what you mean.

The Dashboard UI is wrong because it doesn't understand how to read multiple dRAID vdevs. I don't think it has to do with how you've set them up. If you have one vdev, it works fine.

Even recently, they changed the "Mixed sizes" display for when you have mirrors of different sizes. While it doesn't report all the sizes, it does report the correct number of vdevs now. I think it's a UI issue.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
...
But there is a loss of some redundancy on disk loss.
...
...
I don't think redundancy is lost until you lose all your spares though.
...
My wording was precise: some redundancy. Meaning if you have 2 parity and lose a drive, you now have less redundancy than you had before.

The exact point at which a dRAID loses all redundancy is completely based on the user's design:
  • Amount of parity
  • Amount of hot spares
  • Amount of failed disks
  • Re-silvers in progress
  • etc...
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
I don't understand what you mean.
I'm referring to the fact that if you construct a pool or vdev with the CLI and refer to disks by device node, e.g. /dev/sdb2, instead of the partition UUID, the UI for pool and disk management gets messed up. I assumed you were referring to that problem.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
I forgot to post the speed of my new 30 HGST HDDs in a dRAID:
Code:
# zpool iostat -v Wolves
                                            capacity     operations     bandwidth
pool                                      alloc   free   read  write   read  write
----------------------------------------  -----  -----  -----  -----  -----  -----
Wolves                                    60.4T   200T      0  11.7K    395   267M
  draid3:16d:30c:1s-0                     60.4T   198T      0  11.7K    118   267M
    0e296301-337a-4eb7-a766-a8a4568ecc4c      -      -      0    397      4  8.89M
    f035f03c-fe6b-4d6e-95da-61439d502555      -      -      0    397      4  8.89M
    3320bc66-0c40-4c77-ad65-6d0f0c3a073d      -      -      0    397      3  8.89M
    331f34ad-b1af-486d-8c7c-baa1b9f3fb8c      -      -      0    397      4  8.89M
    a277d48a-dc04-4834-aed9-acd33eabb934      -      -      0    396      3  8.88M
    24eae042-e2fd-407d-8963-f7ed0a3db56d      -      -      0    400      4  8.89M
    0847e712-339c-479c-acce-ed022e64ff8a      -      -      0    396      3  8.88M
    57137b8b-b212-4b64-b688-ba1e92a0fad0      -      -      0    397      5  8.88M
    3d9deea1-3b48-43d8-be38-8d99d736949c      -      -      0    400      3  8.88M
    bee3816d-e208-4a38-ac11-347f9ed380cc      -      -      0    397      4  8.89M
    d39ada4f-2d3b-4a8b-a6e0-fea518e0d1c6      -      -      0    396      4  8.88M
    a5c09a64-b97a-4a1f-8b66-287814aad4db      -      -      0    397      4  8.89M
    ca9812c7-9e0a-4737-866b-3b47bb803f72      -      -      0    397      2  8.88M
    353c85c2-71b4-4f57-abf9-13410d0bbc92      -      -      0    397      3  8.89M
    236e1cba-04eb-414c-ae96-456e2fb7c483      -      -      0    400      3  8.88M
    1d84b8cf-12ef-4fd6-a50c-c31d99b8eceb      -      -      0    397      4  8.89M
    bdf36023-cfa7-4aa0-8670-c71fa9316407      -      -      0    397      3  8.88M
    4151fc39-f807-4c91-9567-982700e83843      -      -      0    397      4  8.89M
    583d3927-6756-4700-8894-d69cb059e012      -      -      0    397      4  8.89M
    331f6303-ddb4-4b61-8be4-8ae014a8693e      -      -      0    397      3  8.89M
    3ed2b1b7-3ca1-4968-b507-8dfb289e3f6c      -      -      0    397      4  8.88M
    d3c0dc5c-2e90-4213-ba26-ea30c9b3fd50      -      -      0    397      3  8.88M
    7d656de7-1c66-4f9b-bd3a-bf51875f4f9e      -      -      0    397      3  8.89M
    95b4a28e-d908-4d97-ba5f-4b89f6d406f1      -      -      0    397      3  8.88M
    23cfe926-25be-4926-a69e-b3a167f1c34f      -      -      0    397      3  8.89M
    fc321b7e-4ce4-42de-b342-ebbb397c9a1b      -      -      0    397      3  8.88M
    1961b04b-7a83-495a-88d8-56b397b07db7      -      -      0    400      4  8.89M
    849599cf-d842-4807-a888-cf47d7203a1b      -      -      0    400      4  8.88M
    02e5b50c-fb9f-4afd-a327-53cc3eae0ade      -      -      0    397      4  8.89M
    33df4da3-1b9d-42dd-8830-669cca3a736f      -      -      0    400      3  8.88M
special                                       -      -      -      -      -      -
  mirror-1                                54.4G  1.76T      0     76    384   877K
    sdi                                       -      -      0     38    170   439K
    sdw                                       -      -      0     38    213   439K
----------------------------------------  -----  -----  -----  -----  -----  -----

The two special drives are Crucial MX500 2TB.

This is really slow to me. Like, unbearably slow. I expected this to take 11-12 hours, not a week, and this is the fastest it's been. It was copying at 214MB/s yesterday, which is still slow. Why isn't this doing at least 1GB/s?

Strangely, the read speed from the other pool, which is all-SSD (a mix of Crucial MX500 2TB and 4TB drives), doesn't even match the write speed above:

Code:
# zpool iostat -v Bunnies
                                            capacity     operations     bandwidth
pool                                      alloc   free   read  write   read  write
----------------------------------------  -----  -----  -----  -----  -----  -----
Bunnies                                   71.3T  42.8T  2.05K    230   218M  21.0M
  mirror-0                                3.58T  51.0G     91      1  9.19M   141K
    ed5edcb7-8763-49f0-bc00-f780a3b6f409      -      -     44      0  4.54M  70.7K
    e30dc3cd-cc08-4bf1-9b84-f9419c83b536      -      -     46      0  4.65M  70.7K
  indirect-1                                  -      -      0      0      0      0
  mirror-2                                1.79T  20.0G     38      0  4.30M  15.7K
    d0b4495b-7e13-11ed-a976-a8a159c2849a      -      -     19      0  2.13M  7.87K
    d0ba8170-7e13-11ed-a976-a8a159c2849a      -      -     19      0  2.17M  7.87K
  mirror-3                                1.79T  19.5G     39      0  4.25M  14.5K
    fafb2e96-d1a5-45a5-8fa4-c79cbe9439dd      -      -     19      0  2.11M  7.23K
    b9790124-e168-49d7-bf15-4ef5d92c9727      -      -     20      0  2.14M  7.23K
  mirror-4                                1.79T  20.2G     39      0  4.24M  17.8K
    81d41751-fe27-4b1a-a45f-9ec11efd0ec2      -      -     19      0  2.12M  8.89K
    da8385bc-ce10-43f4-b1d6-b75ac22fc886      -      -     19      0  2.12M  8.89K
  mirror-5                                1.79T  20.1G     39      0  4.25M  14.1K
    070917b1-2043-4c93-8c46-34c6ec1f0a8a      -      -     19      0  2.15M  7.07K
    26bad8a4-7d6f-4542-8c17-d1c5aec98440      -      -     19      0  2.10M  7.07K
  indirect-6                                  -      -      0      0      0      0
  indirect-7                                  -      -      0      0      0      0
  mirror-8                                1.79T  19.7G     39      0  4.26M  14.6K
    fe6db0d0-45ca-4a9e-96d3-0c6d9ff235c3      -      -     20      0  2.15M  7.28K
    e6e0c7e6-b380-40ca-87e1-25292cad6215      -      -     19      0  2.10M  7.28K
  indirect-9                                  -      -      0      0      0      0
  mirror-11                               1.79T  19.6G     40      0  4.24M  14.7K
    aad8d68e-1763-4c44-aa42-d5b95521bc12      -      -     20      0  2.15M  7.35K
    0c34ca9a-787d-4486-bab1-49edffee1432      -      -     19      0  2.10M  7.35K
  mirror-15                               1.79T  18.9G     44      0  4.60M  14.3K
    d8569602-18d1-441e-a97b-0c15c73c5776      -      -     22      0  2.32M  7.14K
    e7167f98-c5c0-4a5d-bda2-4348579bf98d      -      -     22      0  2.28M  7.14K
  mirror-16                               1.79T  18.9G     45      0  4.64M  14.4K
    dbc073ca-54e6-4391-8086-95f61a7d97bd      -      -     22      0  2.34M  7.20K
    1771fac6-ce5e-4ed9-afb1-cd840cad7cf4      -      -     22      0  2.30M  7.20K
  mirror-17                               1.79T  19.0G     46      0  4.65M  14.2K
    c2ef0edd-8450-4874-a238-8a89c8ef0f90      -      -     22      0  2.31M  7.10K
    a30ec5fa-b252-48c3-87f1-187146911fe5      -      -     23      0  2.34M  7.10K
  mirror-18                               1.79T  19.3G     46      0  4.66M  13.9K
    83cddd5d-5778-4a87-890d-885134f5d897      -      -     22      0  2.31M  6.93K
    db183522-817f-489e-b1bd-078867d52cd6      -      -     23      0  2.35M  6.93K
  indirect-19                                 -      -      0      0      0      0
  mirror-20                               1.79T  19.3G     49      0  5.16M  14.2K
    d0548c12-6f97-42b4-986d-5f53cb79d682      -      -     25      0  2.59M  7.11K
    f4394834-efa9-45e3-8553-133c3e73091f      -      -     24      0  2.56M  7.11K
  indirect-21                                 -      -      0      0      0      0
  indirect-22                                 -      -      0      0      0      0
  indirect-23                                 -      -      0      0      0      0
  mirror-24                               3.52T   108G    112      1  10.4M   135K
    e7c12f73-a06c-446a-8145-ca2fa121c326      -      -     56      0  5.18M  67.6K
    3b5a83a8-acc5-439c-ac58-59ee2277ae4b      -      -     56      0  5.18M  67.6K
  indirect-25                                 -      -      0      0      0      0
  mirror-26                               1.79T  19.3G     52      0  5.52M  35.0K
    355b0cd1-11f8-4369-bba3-7988f921730c      -      -     26      0  2.76M  17.5K
    f653c5f5-a4a6-46ff-a4f9-aef5626b7330      -      -     26      0  2.76M  17.5K
  mirror-27                               3.57T  57.5G    105      1  10.2M   142K
    b3253c5f-7b4b-4d06-bdea-18f57e349394      -      -     52      0  5.08M  71.1K
    45dca8cf-752b-43ac-b698-7abb80506f08      -      -     52      0  5.08M  71.1K
  mirror-28                               3.35T   281G    107      1  10.3M   154K
    5f8c7fb5-e2fd-4740-a47e-886632fb27cd      -      -     53      0  5.14M  77.2K
    d8563a9f-0a41-43ec-9649-b6858cc8b2e1      -      -     53      0  5.14M  77.2K
  mirror-29                               3.29T   340G     97      1  10.1M   228K
    635785c4-202e-418f-b190-3ca68cb0fa55      -      -     48      0  5.03M   114K
    e44144a7-8f9f-470c-96f4-e72e55484fbe      -      -     48      0  5.04M   114K
  mirror-31                               3.25T   381G     99      2  9.81M   242K
    fd63daf7-4c04-4f52-ba01-b41ed627b406      -      -     49      1  4.90M   121K
    b4863955-3c1a-4d37-a41e-930137c90432      -      -     49      1  4.90M   121K
  mirror-32                               3.25T   382G     96      1  9.77M   207K
    5e00ea87-03a0-4915-82ea-907ed4cdabe1      -      -     48      0  4.88M   103K
    7f31f244-1133-4076-b961-463f28a1f2b6      -      -     48      0  4.89M   103K
  mirror-33                               3.21T   426G     92      1  9.60M   215K
    9940c16a-920e-4bb0-9a41-dac0182dd00f      -      -     46      0  4.80M   107K
    482e06b1-7d2c-4238-b917-82c8b621a773      -      -     46      0  4.80M   107K
  mirror-34                               3.21T   422G     91      1  9.55M   215K
    c6b6b2ea-6aa7-4b63-881c-f1e9131e4d3d      -      -     45      0  4.78M   108K
    ac596471-f702-40c9-8fc9-0fd90acf7b7d      -      -     45      0  4.77M   108K
  mirror-35                               3.11T   528G     95      1  9.67M   232K
    d9af6488-edd1-4d5d-a48e-379c974a5712      -      -     47      0  4.83M   116K
    e6af351b-a051-4e36-ba3c-aa36f3d90975      -      -     47      0  4.84M   116K
  mirror-36                               2.31T  1.31T     75      5  8.21M   666K
    b909055e-3d2d-436c-9aeb-ba0f8f40e5f4      -      -     37      2  4.10M   333K
    85b801c8-1be4-441b-beb3-882b85ec496b      -      -     37      2  4.11M   333K
  mirror-37                               2.33T  1.29T     76      5  8.29M   652K
    ae8a8eba-d9bd-4ff8-948e-9a2866414d89      -      -     38      2  4.15M   326K
    eab40ed3-2944-4862-b6de-44dc3875fcbf      -      -     38      2  4.15M   326K
  mirror-38                               2.34T  1.28T     76      5  8.22M   659K
    740c27f8-8b93-4e5c-b785-297807576164      -      -     38      2  4.11M   330K
    5e8f2818-b4ed-4014-be31-b13248e117cb      -      -     38      2  4.11M   330K
  mirror-39                               2.18T  1.45T     72      6  7.88M   766K
    953f8850-ca66-4a9f-bdef-84d79d4eafc3      -      -     36      3  3.94M   383K
    d694be02-99e7-4bbd-8681-9a8f5c85e52a      -      -     36      3  3.94M   383K
  mirror-40                               2.16T  1.46T     72      5  7.83M   735K
    92386503-a62a-48d0-9ef1-35b95f121d9b      -      -     35      2  3.91M   368K
    b082cda3-9dc4-41ae-b5b0-faa385ad95a7      -      -     36      2  3.93M   368K
  mirror-41                               1.83T  1.79T     62      7  7.08M   877K
    94d4fff3-2f45-4068-8307-6107f204d377      -      -     31      3  3.54M   439K
    3bff4ac6-797f-4146-97dd-8a1f974d4a0f      -      -     31      3  3.54M   439K
  mirror-42                                933G  2.71T     34     10  4.24M  1.29M
    ef5d48d5-8718-4619-92aa-f22f962afb94      -      -     17      5  2.12M   662K
    87ead9e9-9e82-462d-b8d7-4c6ba67bca67      -      -     17      5  2.12M   662K
  mirror-43                                287G  3.35T     13     12  1.61M  1.53M
    4b399c04-a2bc-47f4-8f78-a426416c9a1b      -      -      6      6   826K   784K
    71fb1330-ab29-4ace-a00f-1af100faaede      -      -      6      6   824K   784K
  mirror-44                                286G  3.35T     13     12  1.61M  1.53M
    6600905e-d6e6-4e11-8576-e7739705e4ab      -      -      6      6   824K   782K
    32643442-942b-403a-b2ab-4c8c59c8b747      -      -      6      6   825K   782K
  mirror-45                                287G  3.35T     13     12  1.61M  1.53M
    0a9c9085-9005-477d-963f-6ca44e65c936      -      -      6      6   824K   783K
    20e17af2-00f0-48cb-9fa3-a2f419aabe08      -      -      6      6   824K   783K
  mirror-46                                287G  3.34T     13     12  1.61M  1.53M
    1663e707-eb45-4083-8505-f96b2457e070      -      -      6      6   826K   786K
    adcabd99-24d9-4d22-b30d-70441a123d65      -      -      6      6   826K   786K
  mirror-47                                287G  3.34T     13     12  1.61M  1.53M
    eea123b7-ef78-4e15-9c6a-bee04379804d      -      -      6      6   825K   785K
    abac17e0-9804-4696-9b00-43e1a398a88d      -      -      6      6   825K   785K
  mirror-48                                286G  3.35T     13     12  1.61M  1.53M
    4b31a493-1510-4410-81de-ec755a625cb7      -      -      6      6   830K   782K
    f02fd389-2b0d-4de5-b4ed-f48d4880044e      -      -      6      6   821K   782K
  mirror-49                                291G  3.34T     13     12  1.64M  1.57M
    bf17603c-d89b-45d8-ae9c-ab653b3f7b39      -      -      6      6   840K   801K
    052f8be3-a067-4467-9e6d-ff8631a3c767      -      -      6      6   839K   801K
  mirror-50                                287G  3.35T     13     12  1.61M  1.53M
    12fe2ac8-95bd-47a8-9d05-8e2538a20934      -      -      6      6   829K   783K
    e55e3a81-7178-4cca-8504-09d5a4c7ebc4      -      -      6      6   821K   783K
special                                       -      -      -      -      -      -
  mirror-13                               46.3G   842G      6     30  54.9K   570K
    4f961158-125c-474c-a90b-fdc6f9833f4a      -      -      3     15  27.4K   285K
    ac5c6fe2-0b3b-4941-9337-28649600136e      -      -      3     15  27.5K   285K
  mirror-14                               46.2G   842G      6     34  54.8K   624K
    25e755ed-cc1c-438c-890c-d9ff5be64191      -      -      3     17  27.3K   312K
    548a3fcd-4aba-4b70-814c-079e388e41dc      -      -      3     17  27.5K   312K
----------------------------------------  -----  -----  -----  -----  -----  -----


Yesterday, when Wolves was writing at 214MB/s, Bunnies was showing a read speed of 167MB/s. Today, it's about the same ratio: 267MB/s write vs 218MB/s read.

But after doing this transfer, I think I get why you would use fewer data drives in a dRAID vdev. I was honestly toying around with numbers as high as 32 or even 64 data drives. If I wanted to mix HDDs and SSDs, I could even try out 128 data drives! I'd also like to try out values that aren't powers of 2, even though that goes against TrueNAS's recommendation. TrueNAS devs, I need to see a link to an article with charts and graphs when you warn me about bad performance decisions or I won't believe you.

Guessing
Using fewer data drives means you're writing smaller chunks, and you can do that in parallel. So for 30 children, choosing 16 data drives (plus parity) means each write operation uses more than half of the drives, so the next write has to overlap with drives that are already being written to. That means everything slows down to roughly the speed of 1 drive in this configuration. If I used 8 data drives, I'd get the speed of 3 drives.

Next Steps
I have 14 more of these HDDs sitting around that I'm gonna use in another 36-drive NAS. I can add them to this system and figure out the fastest way to configure dRAID that gives me an adequate drive capacity. At this point, I think I'm gonna have to buy more HDDs (60 total) if I want dRAID performance to be at least on par with a single QLC SSD.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Using fewer data drives means you're writing smaller chunks, and you can do that in parallel.
This is correct: smaller data drive numbers = better performance with Draid. Not only will performance be better, you also won't waste as much disk space as you are with 16d. A file with 1 byte will still consume a full stripe across all 16 data drives! Stripe width is fixed in Draid. So, 16d is fine for the width as long as the vast majority of your files are not much smaller than the stripe width. But it won't perform great.

Remember what ixChris posted above, he posted:

number of redundancy groups = (C - S)/(D + P)

where redundancy groups = vdevs roughly. So, using your Draid layout:

(30 - 1) / (16 + 3) = 29 / 19 ≈ 1.5, basically 1, so, yes, single-disk performance.

With your proposed 8, you'd have:

(30 - 1) / (8 + 3) = 29 / 11 ≈ 2.6, so 2, not quite 3.

Use that formula to make as many "vdevs" (redundancy groups) as you want. And the lower the D number, the smaller the stripe width.
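If it helps, here is a quick sketch you can paste into a shell to compare layouts for your 30 drives; bash integer math rounds down, which matches the "basically 1" above, and this assumes you keep draid3 and 1 spare:

Code:
C=30; S=1; P=3
for D in 4 5 6 8 16; do
  echo "${D}d: $(( (C - S) / (D + P) )) group(s), minimum data allocation $(( D * 4 ))K"
done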
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
This is correct: smaller data drive numbers = better performance with Draid. Not only will performance be better, you also won't waste as much disk space as you are with 16d. A file with 1 byte will still consume a full stripe across all 16 data drives! Stripe width is fixed in Draid. So, 16d is fine for the width as long as the vast majority of your files are not much smaller than the stripe width. But it won't perform great.

Remember what ixChris posted above, he posted:

number of redundancy groups = (C - S)/(D + P)

where redundancy groups = vdevs roughly. So, using your Draid layout:

(30 - 1) / (16 + 3) = 29 / 19 ≈ 1.5, basically 1, so, yes, single-disk performance.

With your proposed 8, you'd have:

(30 - 1) / (8 + 3) = 29 / 11 ≈ 2.6, so 2, not quite 3.

Use that formula to make as many "vdevs" (redundancy groups) as you want. And the lower the D number, the smaller the stripe width.
Well I feel like a dummy. I didn't catch that `(C - S)/(D + P)` (redundancy groups) is a different calculation from storage capacity: `(C - S) × D/(D + P) × disk size`. They looked similar when I was reading it before, but now that I take a closer look... Sorry about that @iXChris!
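So for my draid3:16d layout, the data fraction works out like this (just a quick awk sanity check, ignoring ZFS overhead):

Code:
awk 'BEGIN { D = 16; P = 3; printf "usable fraction of non-spare raw space = %.1f%%\n", 100 * D / (D + P) }'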

I'm using a metadata vdev. Don't those help with stripe width of small files? Or is it more like they're only helping with the metadata about where each block is located but not anything to do with the contents of those blocks?
  1. In my case, I do have small files, but most of my files taking up the majority of space are straight rips of my Blu-ray and UHD Blu-rays for Plex taking up over 65 TiB.
  2. I have other files taking up terabytes like my 12Mp family photos and 4K60 videos.
  3. In terms of small files, it's only my many Git repos. Those could take up a lot of space. Windows is showing over 5 million files! That dataset's already been transferred to those HDDs. All of that + my games backups from multiple PCs adds up to only 2.5 TiB.
  4. Even if we 10x'd the size of my personal documents (many of which are in the 1kB range), it'd still be well under 100GiB. Nothing to worry about.
What I really care about is making the HDDs have a reasonable write speed. This is a backup pool, but I still want to be able to back up to it at least as fast as I can rip a bunch of Blu-ray discs in a night. Also, I'm going to remake my SSD zpool with all 4 TiB drives in a dRAID configuration, so if I can at least read this data back at a reasonable speed, I won't have much downtime.
 
Last edited:

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Well I feel like a dummy. I didn't catch that `(C - S)/(D + P)` (redundancy groups) is a different calculation from storage capacity: `(C - S) × D/(D + P) × disk size`. They looked similar when I was reading it before, but now that I take a closer look... Sorry about that @iXChris!

I'm using a metadata vdev. Don't those help with stripe width of small files? Or is it more like they're only helping with the metadata about where each block is located but not anything to do with the contents of those blocks?

In my case, I do have small files, but most of my files taking up the majority of space are multiple gigabyte rips of my Blu-ray and UHD Blu-rays for Plex taking up something like over 65TiB.

I have other files taking up terabytes like my 12Mp family photos and 4K60 videos. In terms of small files, it's only my many Git repos. Even if we 10x'd the size of my documents (many of which are in the 1kB range), it'd still be under 20GiB. Nothing to worry about.

What I really care about is making the HDDs have a reasonable write speed. At least then, I could copy my data and read it back at a reasonable speed when I go to remake my SSD zpool with all 4 TiB drives in a hopefully-fast dRAID.
If you want speed, you need more redundancy groups. Play with the numbers to make more redundancy groups. Stripe width is fixed for Draid, so, no, a metadata vdev does not help. As configured, any file will use a minimum of one full 16-wide stripe; with 4k disks, that's 64k for a 1-byte file for the data portion alone. Recordsize does still come into play here, so the usual applies to your media files: a 1M recordsize is good. This is one reason why Draid is better suited to a larger number of disks; you likely lose more space than with a raidz config.

Since you don't care much about small files, 16d is fine, but if you care about speed, smaller is better. A 6d would give you 3 groups. With just 1 more drive, a 7d would give 3 groups exactly. With 3 more drives, a 5d would give you 4 groups.
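Same formula as before if you want to double-check those options yourself (assuming you stay at draid3 with 1 spare):

Code:
P=3; S=1
for layout in "30 6" "31 7" "33 5"; do
  set -- $layout; C=$1; D=$2
  echo "${C} children, ${D}d: $(( (C - S) / (D + P) )) groups"
done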
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
If you want speed, you need more redundancy groups. Play with the numbers to make more redundancy groups. Stripe width is fixed for Draid, so, no, a metadata vdev does not help. As configured, any file will use a minimum of one full 16-wide stripe; with 4k disks, that's 64k for a 1-byte file for the data portion alone. Recordsize does still come into play here, so the usual applies to your media files: a 1M recordsize is good. This is one reason why Draid is better suited to a larger number of disks; you likely lose more space than with a raidz config.

Since you don't care much about small files, 16d is fine, but if you care about speed, smaller is better. A 6d would give you 3 groups. With just 1 more drive, a 7d would give 3 groups exactly. With 3 more drives, a 5d would give you 4 groups.
Recordsize
I'm not understanding.

By Recordsize, do you mean the 128k that it defaults to for a dataset? I just leave it at the default 128k for all datasets. I've heard people say you can get better speeds by messing with it, but I don't have a good way to test, so I left it as-is even though I have tons of datasets and could easily benefit from changing it.

Are you suggesting I change it to 1M across the board? If there's a good script I could run to create a new dataset, copy my data over without breaking permissions, and then rename the new dataset back to the old dataset's name, I'll do that.

dRAID vs RAID-Z
And then you said dRAID is good for larger numbers of drives because you lose more space compared to RAID-Z? How does losing space make dRAID better?

Non-^2 dRAID data drives
Can I do a 6d dRAID without incurring a performance penalty? TrueNAS specifies that you want the data drives in powers of 2. If I knew I could use other numbers, I'd pick something that fits nicely within the number of children, as you suggested.

Since I can technically only fit 12 of those 14 drives in another PC, that leaves me with 2 extras. I was hoping to keep those around as cold spares, and it'd allow me to add another dRAID in the future, but maybe I should just go all-in and buy the full 60 drives. Then I could use a stripe width of 7 and get 6 redundancy groups. And my total capacity wouldn't suffer as much.
 

sfatula

Guru
Joined
Jul 5, 2022
Messages
608
Recordsize 1M, yes; the default is 128k. 1M performs better with many workloads, including larger files as you described, so you will get better speeds. I'm saying that if you have a dataset that is media files (blurays, photos, etc.), set it to 1M. I can't comment on other data you may or may not have, but in general 1M is better than 128k for many things, and it is better for larger files. So, ideally segregate your larger files into their own dataset if you don't want other stuff at 1M. I can't comment on whether you want to set 1M everywhere as I don't know all the uses you have; you have described some of them, but I like being conservative.
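Setting it is a one-liner per dataset; something like this (dataset name is just an example), and keep in mind it only applies to files written after the change:

Code:
zfs set recordsize=1M tank/media
zfs get recordsize tank/media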

Losing more space is not better per se, but it's balanced against the other benefits and downsides. In other words, a guy at home with 6 drives might not like draid, as he's potentially losing a lot of space and maybe can't afford more drives (it all depends on the pool config). And really, with 6 drives I'd go with 3 mirrors or Raidz2, not draid. Whereas a business that wants the benefits of draid, such as a faster return to a healthy state after a drive goes bad, can more easily absorb the space loss because other things matter more; that's the way I look at it. It's the same as Raidz1 vs Raidz2: you could just as well ask how Raidz2 is better when you're losing more space than with Raidz1.

The power of 2 "rule" applies to space used, not performance.

If I understand correctly, you have more drives you can use to experiment with performance (and capacity)?

It's good to see someone here experimenting with Draid (sorry, I use the wrong case every time). I think if you fine-tune that D number, you'll get some decent speed (along with 1M recordsize, which will give even more speed).
 