Seemingly healthy disks faulting with read/write errors

Status
Not open for further replies.

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
I have two FreeNAS file servers (FS05 and FS06) with identical hardware configurations. I've noticed drives keep randomly being labelled as faulted with a small number of READ/WRITE errors. If I run smartctl on the disks, they report healthy with no errors.

Info:
Chassis - Dell R510 + 2x SuperMicro SC847J
OS - FreeNAS 9.3-STABLE (latest update)
Network - 10GbE
RAM - 128GB
HBA - LSI 9300E

Here is an example; this happened under very light load:

upload_2015-5-6_15-14-25.png


So I check the disk and see no errors (it's brand new):

upload_2015-5-6_15-15-3.png


Now if I check the "dmesg.today" log file I see these errors (DA6/DA10 also failed on a different raidz3):

upload_2015-5-6_15-16-12.png


If I reboot, it clears the counters and all is well. What could be causing these timeouts? They are two identical boxes in different racks having the same problem, so I doubt it's cabling. FS05 has more load and I've noticed more drives have this issue.

Drives are a mix of WD 4TB/6TB and some Seagate 4TB.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
I would say it's the cable(s) or the PSU.

I think the PSU is more probable because of this: "FS05 has more load and I've noticed more drives have this issue."
 

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
Yeah, during database backups FS05 will hit 350-400MB/s; FS06 only has replication, which tops out at 80-100MB/s.

The randomness, I think, screams hardware; it's just odd that both are having similar issues. I might order a full set of new cables for both of them, and maybe I'll do PSUs as well (doesn't hurt to have spares).

Could it possibly be the controller? The firmware on both is:
mpr1: Firmware: 05.00.00.00, Driver: 05.255.05.00-fbsd
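
For anyone comparing several boxes, here is a minimal Python sketch that pulls the firmware and driver strings out of a line like the one above. The regular expression is inferred from that single line only, and the names `parse_mpr_versions` and `same_versions` are hypothetical helpers, not part of FreeNAS or any LSI tool. It only extracts and compares the strings; it makes no claim about which firmware/driver pairs are actually compatible.

```python
import re

# Assumption: the line layout is "mprN: Firmware: <ver>, Driver: <ver>",
# based solely on the one line quoted in this post.
VERSION_LINE = re.compile(
    r"^(?P<unit>mpr\d+): Firmware: (?P<firmware>[\d.]+), Driver: (?P<driver>\S+)$"
)


def parse_mpr_versions(line):
    """Return unit, firmware and driver strings, or None if the line does not match."""
    match = VERSION_LINE.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


def same_versions(line_a, line_b):
    """True if two hosts report the same firmware and driver strings."""
    a = parse_mpr_versions(line_a)
    b = parse_mpr_versions(line_b)
    if a is None or b is None:
        return False
    return (a["firmware"], a["driver"]) == (b["firmware"], b["driver"])


if __name__ == "__main__":
    line = "mpr1: Firmware: 05.00.00.00, Driver: 05.255.05.00-fbsd"
    print(parse_mpr_versions(line))
```

Feeding the same line from FS05 and FS06 into `same_versions` is a quick way to confirm that both controllers really run identical firmware before swapping cables or PSUs.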
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
It could be the firmware. You should use the IT P16 version, as it's the only one 100% compatible with the driver (and you can't, or at least don't want to, change the driver version). I can't tell from these numbers whether you have the right version; wait for an answer from a more experienced member.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
No, that's not true, Bidule0hm. This is the SAS 12Gb stuff.

That LSI card is basically untested and extremely alpha. You shouldn't be using it in production, at all. It's not expected to work very well, if at all, and it could go south without warning.

So yeah, I'd call this "working as expected".

There's a reason why my hardware recommendations thread doesn't recommend the SAS 12Gb stuff. It's basically "use at your own risk" hardware.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Oh sorry, didn't see the catch.

Actually there is another thread on the same problem right here.
 

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
Yeah, I couldn't find much info on it when selecting cards. We knew it was a risk. Nothing has broken; it's just been annoying.

A follow-up question: when I do a zpool clear I still see disks listed as "too many errors". Aside from rebooting, is there any way to clear these counters and have them revert to the "online" state if I know they are in good standing?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Nope. The expectation with ZFS is that if a disk has problems that get it kicked for 'too many errors', you'd resolve the issue as appropriate. Continuing to operate with a disk that appears to have problems while trying to say "I run the most resilient file system ever conceived" is a contradiction in terms.

I'd recommend you ditch that card and get one of the cards we recommend so you can avoid this problem. I wouldn't expect this to be fixed in the next year (or more) as I wouldn't expect a driver update until FreeNAS 10 comes out, which is at least a year away by any good guesstimate.
 

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
And since ZFS is not hardware-based, I can just swap cards and zpool import, correct? Thanks for the help; I've learned so much in the last few months.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You sure can! Just get another controller and put it in. Poof, ZFS will "just work". :)

Isn't ZFS awesome!
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Based on my limited experience, the operating system appears to search all available controllers for the drives that make up the expected pool, and then mounts the pool even if the drives are connected to different controllers in a different physical order than when the pool was established. Even if some of the drives are not detected, the GUI will tell you that the system is degraded, and you can shut it down, connect the missing drives, bring it back up, and it will just work. I am sure that is not a recommended action, but it did work in version 8.2 when an external drive enclosure did not come ready after the server was shut down to perform hardware maintenance.
 

fullspeed

Contributor
Joined
Mar 6, 2015
Messages
147
For anyone who finds this thread, I fixed it a while back by downgrading to an LSI 9206-16E and also downgrading the firmware on the card to P16 to be in line with FreeNAS.

Although the whole process wasn't easy or fun, I haven't had a drive pop out in weeks.
 