Delayed boot after upgrading from Core to SCALE

urobe

Contributor
Joined
Jan 27, 2017
Messages
113
After I upgraded my existing Core installation to SCALE last night, I see the following message:

mpt2sas_cm1: log_info(0x31160000): originator(PL), code(0x16), success

img1.jpg


I tried adding the pool from that Core installation to a fresh SCALE installation, and I have the same issue there upon reboot. It takes quite a while but then boots normally. At first I thought it was probably a one-time thing, but it happened again after another reboot.

Any hint or advice on how to prevent that is greatly appreciated!
-Tobi
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hello @urobe

Sorry to hear you're having trouble. Can you please share your complete hardware and software configuration, paying special attention to the storage controllers, cabling, and pathing to disks?

Based on the photo, it looks like you might have some rackmounted hardware - the presence of mpt2sas in the log does hopefully indicate that you're using an appropriate LSI/Broadcom/Avago HBA.

My initial guesses would be related to, in no particular order - cables loose/faulted, potential multipath detection, a disk shelf that has some manner of controller attempting to be "intelligent" when what's needed is a pure expansion setup.
 

urobe

Contributor
Joined
Jan 27, 2017
Messages
113
Thank you very much for the reply @HoneyBadger ! It's a Supermicro 847 case, and the cables of the SAS expander backplane are going into two Dell PERC H310s, which should both be in IT mode. Could it be that one isn't?
I will open the case now and check for loose cables.
How can I check for multipath detection?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Based on your description, if you have two separate PERC H310 cards connecting to the backplane, you definitely have SAS multipath existing. I don't know the SCSI error codes off the top of my head, but I believe those ones are related to reservations/multipath claiming; it could be the devices trying to determine which HBA will be the "owner" of the disks.

There might be two slightly different firmwares present on the cards as well - matching them would be a good idea if so.

Can you post the output of sas2flash -listall inside of [CODE] tags?
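
A rough way to check from the SCALE shell whether the same disk is visible through both HBAs is to look for one serial number showing up under two device names. The sketch below assumes a Linux shell where `lsblk` is available; the function name `find_multipath_candidates` is made up for this example and is not part of any TrueNAS tooling.

```python
import json
import subprocess
from collections import defaultdict


def find_multipath_candidates(lsblk_json: dict) -> dict:
    """Group whole disks by serial number.

    A serial that appears under more than one device name suggests the
    same physical disk is reachable through two paths.
    """
    by_serial = defaultdict(list)
    for dev in lsblk_json.get("blockdevices", []):
        serial = dev.get("serial")
        if dev.get("type") == "disk" and serial:
            by_serial[serial].append(dev["name"])
    return {s: names for s, names in by_serial.items() if len(names) > 1}


def read_lsblk() -> dict:
    """Run lsblk and return its JSON output as a dict."""
    # -J: JSON output, -d: whole disks only, -o: columns to print
    out = subprocess.run(
        ["lsblk", "-J", "-d", "-o", "NAME,TYPE,SERIAL,WWN"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

Calling `find_multipath_candidates(read_lsblk())` on the affected machine would return an empty dict if every disk is seen only once. Anything else is worth a closer look at the cabling.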
 

urobe

Contributor
Joined
Jan 27, 2017
Messages
113
Okay... thank you very much for taking the time.
Unfortunately the system freezes upon executing the command. I will try it on an older mainboard as well (X9SCL-F) - the one in the main system is an X11DPI-N. Could it be that the controller on the mainboard is interfering?
 

urobe

Contributor
Joined
Jan 27, 2017
Messages
113
@HoneyBadger , alright, here's the output:
1677275206561.png


But as I said, it freezes on the X11DPI-N; this is the output with the actual cards on the X9SCL-F.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The X9 does look like it's picked up both cards successfully, and they're showing as identical. Are you using true SAS drives in this multipath setup, or do you have SATA drives with/without an interposer?

unfortunately the [X11DPI-N] freezes upon executing the command

That's definitely unexpected, especially since that motherboard doesn't have an embedded SAS controller, only the Intel C621 SATA controller. Did it freeze the entire system, or just the shell/prompt you ran it from?

(Side note - that's a very nice board, and it even has support for Optane DC pmem devices if you have 2nd-gen Xeon Scalables.)
 

urobe

Contributor
Joined
Jan 27, 2017
Messages
113
The X9 does look like it's picked up both cards successfully, and they're showing as identical. Are you using true SAS drives in this multipath setup, or do you have SATA drives with/without an interposer?



That's definitely unexpected, especially since that motherboard doesn't have an embedded SAS controller, only the Intel C621 SATA controller. Did it freeze the entire system, or just the shell/prompt you ran it from?

(Side note - that's a very nice board, and it even has support for Optane DC pmem devices if you have 2nd-gen Xeon Scalables.)
Yes, I have SATA drives without an interposer. Was that a bad idea? Under TrueNAS Core the error was not present, and the system ran just fine for about three years (but on the X9 board). It has only appeared since I upgraded to SCALE (and upgraded the mainboard). I think I upgraded the mainboard first and then the system, and the delayed boot didn't happen under Core with the new mainboard, but I can verify that quickly.

I booted from a USB stick into FreeDOS, which then froze. I couldn't Ctrl+Alt+Del out of it.

The board is very nice indeed. "Unfortunately" I have 2 Xeon 6154s, but for the price I picked them up at, I really can't complain.
I did look for backup CPUs but couldn't find anything affordable yet. The only thing I saw was a Xeon 8222L, but I couldn't find much info about it, as it isn't even listed on the Intel website, which made me a bit suspicious.
 

urobe

Contributor
Joined
Jan 27, 2017
Messages
113
The X9 does look like it's picked up both cards successfully, and they're showing as identical. Are you using true SAS drives in this multipath setup, or do you have SATA drives with/without an interposer?
@HoneyBadger , do you think I need to use an interposer?

I have now actually had another major issue on this machine: in a pool of 6 drives, 2 HDDs failed, and two more are about to fail. The 6 drives are a week old (WD Ultrastar DC HC560). As far as I know they should be CMR drives. Do you think that might be linked to the issue in this thread?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Yes, I have sata drives without an interposer. Was that a bad idea?
SATA drives only have a single data channel - it's possible you're confusing the HBAs by having them connect to both ports on the backplane, and they're seeking that second channel that they never find. Try shutting down, disconnecting the secondary HBA, and putting both cables into the same HBA.

I have now actually had another major issue on this machine: in a pool of 6 drives, 2 HDDs failed, and two more are about to fail. The 6 drives are a week old (WD Ultrastar DC HC560). As far as I know they should be CMR drives. Do you think that might be linked to the issue in this thread?
What is the failure they're reporting? It could be related to the multipath issue here.
 

urobe

Contributor
Joined
Jan 27, 2017
Messages
113
SATA drives only have a single data channel - it's possible you're confusing the HBAs by having them connect to both ports on the backplane, and they're seeking that second channel that they never find. Try shutting down, disconnect the secondary HBA, and put both cables into the same HBA.


What is the failure they're reporting? It could be related to the multipath issue here.
There are actually three cables coming out of the backplane; that's why I had two cards installed. Is one of them not necessary?

I have to power up the system and check again. Right now it is shut down; I wanted to figure out this error first, before putting the pool back online.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
There are actually three cables coming out of the backplane; that's why I had two cards installed. Is one of them not necessary?
While it's powered down, can you look inside and provide the exact model of the backplane, as well as perhaps some photos and a wiring diagram? The required wiring will differ depending on if your backplane has a SAS expander chip or not. "Three ports" seems to identify with the 12-bay "826A" series backplanes.
 

urobe

Contributor
Joined
Jan 27, 2017
Messages
113
Omg! I just read the manual of the backplane, and I completely failed at connecting the PCB to the board... I will do it correctly tomorrow! It's my first contact with a backplane... and a steep learning curve...
Thank you very much for pointing me in the right direction...

 

urobe

Contributor
Joined
Jan 27, 2017
Messages
113
Alright! After properly connecting everything, the error is gone! Thank you so much!

The pool that had failed drives also looks slightly better:
1677583305256.png
How can I see the type of error the disks have/had?
Can I tell the system to reuse the degraded disks somehow?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Under your Storage dashboard, look at the ZFS Health option and run a Scrub. You can also check the "Manage Devices" view, and select each disk individually to see the number of read/write/checksum errors, as well as run manual SMART tests and review the results from the "Manage Disks" option.
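
The same per-disk read/write/checksum counters are also printed by `zpool status` in a shell. As a minimal sketch, assuming the usual `NAME STATE READ WRITE CKSUM` table layout, the snippet below picks out the rows that are not clean. The sample text is invented for illustration (it is not output from the pool in this thread), and `devices_with_errors` is a hypothetical helper name.

```python
# Made-up sample in the layout `zpool status` normally prints.
SAMPLE = """
  pool: tank
 state: DEGRADED
config:

	NAME        STATE     READ WRITE CKSUM
	tank        DEGRADED     0     0     0
	  raidz2-0  DEGRADED     0     0     0
	    sda     ONLINE       0     0     0
	    sdb     FAULTED     12     3     0  too many errors
	    sdc     ONLINE       0     0     7

errors: No known data errors
"""


def devices_with_errors(status_text: str) -> list:
    """Return (name, state, read, write, cksum) for every table row
    that is not ONLINE or has a non-zero error counter."""
    rows = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True
            continue
        if not in_table:
            continue
        if not fields:
            in_table = False  # a blank line ends the config table
            continue
        if len(fields) < 5:
            continue  # section labels such as "logs" carry no counters
        name, state, read, write, cksum = fields[:5]
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            rows.append((name, state, read, write, cksum))
    return rows


for row in devices_with_errors(SAMPLE):
    print(*row)
```

The counters are compared as strings on purpose, since large counts may be abbreviated rather than printed as plain integers. For the drive's own view of its health, `smartctl -a` on the device shows the SMART attributes and self-test log.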
 

urobe

Contributor
Joined
Jan 27, 2017
Messages
113
I started the scrub, and after a minute of running it says that it'd take 23 years. Is this to be expected?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I started the scrub, and after a minute of running it says that it'd take 23 years. Is this to be expected?
The time estimate being vastly incorrect, yes; it actually taking 23 years, no.

Did/does the pending scrub time update after a few moments?
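
To illustrate why an early estimate can be so far off: if the remaining time is simply extrapolated from the progress made so far, a slow first minute produces an enormous number. This is a sketch of that arithmetic only, under the assumption of a plain linear extrapolation; it is not the actual ZFS calculation, and `naive_eta_seconds` is a made-up name.

```python
def naive_eta_seconds(total_bytes: float, done_bytes: float,
                      elapsed_seconds: float) -> float:
    """Linear extrapolation: remaining work divided by observed rate."""
    if done_bytes <= 0 or elapsed_seconds <= 0:
        raise ValueError("need some progress and elapsed time to extrapolate")
    remaining = total_bytes - done_bytes
    # remaining / (done / elapsed), rearranged to avoid a tiny divisor
    return remaining * elapsed_seconds / done_bytes


# Hypothetical numbers: the same pool, one minute in versus one hour in.
TIB = 1024 ** 4
early = naive_eta_seconds(40 * TIB, 100 * 1024 ** 2, 60)
later = naive_eta_seconds(40 * TIB, 2 * TIB, 3600)
print(f"after 1 minute: {early / 86400:.0f} days")
print(f"after 1 hour:   {later / 3600:.0f} hours")
```

Once the scrub has been running at its normal rate for a while, the extrapolation settles to something realistic.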
 