SATA Disk Temperatures Missing After Upgrade

alexmarkley · Oct 28, 2023

Yesterday I upgraded my TrueNAS SCALE box from Bluefin 22.12.3.2 to Cobia 23.10.0. After the upgrade, I noticed that none of my SATA drives are reporting temperatures anymore.

Here are some relevant screenshots:

At a low level, everything still seems to be working:

Code:

root@veritas2[~]# for DISK in $(smartctl --scan | awk '{ print $1; }'); \
    do echo "==== Smartctl on $DISK ====" ; \
    smartctl -a $DISK | grep -E '^Temperature:|Airflow_Temperature|ATTRIBUTE_NAME|Health Information'; \
    echo ; done
==== Smartctl on /dev/sda ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   061   048   000    Old_age   Always       -       39 (Min/Max 29/41)

==== Smartctl on /dev/sdb ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   059   044   000    Old_age   Always       -       41 (Min/Max 31/44)

==== Smartctl on /dev/sdc ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   060   049   000    Old_age   Always       -       40 (Min/Max 30/44)

==== Smartctl on /dev/sdd ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   064   043   000    Old_age   Always       -       36 (Min/Max 30/41)

==== Smartctl on /dev/sde ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   061   053   000    Old_age   Always       -       39 (Min/Max 28/40)

==== Smartctl on /dev/sdf ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   067   044   040    Old_age   Always       -       33 (Min/Max 29/35)

==== Smartctl on /dev/sdg ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   058   049   000    Old_age   Always       -       42 (Min/Max 30/44)

==== Smartctl on /dev/sdh ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   058   048   000    Old_age   Always       -       42 (Min/Max 31/44)

==== Smartctl on /dev/sdi ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   064   038   040    Old_age   Always   In_the_past 36 (2 54 39 30 0)

==== Smartctl on /dev/sdj ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   062   040   000    Old_age   Always       -       38 (Min/Max 30/43)

==== Smartctl on /dev/sdk ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   069   045   000    Old_age   Always       -       31 (Min/Max 28/35)

==== Smartctl on /dev/sdl ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   066   048   000    Old_age   Always       -       34 (Min/Max 27/35)

==== Smartctl on /dev/sdm ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   063   045   000    Old_age   Always       -       37 (Min/Max 27/38)

==== Smartctl on /dev/sdn ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   064   052   000    Old_age   Always       -       36 (Min/Max 28/37)

==== Smartctl on /dev/sdo ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   068   048   000    Old_age   Always       -       32 (Min/Max 28/36)

==== Smartctl on /dev/sdp ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0032   071   061   000    Old_age   Always       -       29

==== Smartctl on /dev/sdq ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0032   072   062   000    Old_age   Always       -       28

==== Smartctl on /dev/nvme0 ====
SMART/Health Information (NVMe Log 0x02)
Temperature:                        48 Celsius

==== Smartctl on /dev/nvme1 ====
SMART/Health Information (NVMe Log 0x02)
Temperature:                        42 Celsius
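
For anyone who wants just the current reading from output like the above, here is a minimal sketch, not an official TrueNAS tool. It assumes the text layout shown in this post: the raw value sits in the tenth column of the ATA attribute row, and NVMe devices print a `Temperature:` line. The function name `extract_temp` is hypothetical, and matching `Temperature_Celsius` is an assumption, since the drives in this thread only report attribute 190.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -a` text on stdin and print the
# current temperature in Celsius (first match only).
extract_temp() {
    awk '
        # ATA attribute table: field 2 is the attribute name and
        # field 10 is the first token of RAW_VALUE (the current reading).
        $2 == "Airflow_Temperature_Cel" || $2 == "Temperature_Celsius" {
            print $10
            exit
        }
        # NVMe health log line, e.g. "Temperature:    48 Celsius"
        $1 == "Temperature:" {
            print $2
            exit
        }
    '
}

# Example use on a live system (assumes smartctl is installed, run as root):
#   for DISK in $(smartctl --scan | awk '{ print $1 }'); do
#       printf '%s %s\n' "$DISK" "$(smartctl -a "$DISK" | extract_temp)"
#   done
```

Prints nothing for a device that matches neither pattern, so an empty result means "no temperature found" rather than zero.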

The important hardware details on this machine:

Supermicro A+ Server 1014S-WTRT
AMD EPYC 7232P 8-Core Processor @ 3.10GHz
128 GB ECC DDR4 3200
Micron 480 GB NVMe boot device
Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02) HBA controller
15 SATA drives on an external SAS-3 backplane for tank pool
Samsung SSD 990 PRO 1TB NVMe as an L2ARC for tank
A couple of mirrored SSDs for apps and VMs in a separate pool

Do we know if this issue is "just" a reporting issue? Does it also impact alert notifications? If it's just a GUI bug, it's annoying. If it breaks drive temperature alerts, that's a lot more concerning.

I had posted in an older thread about this issue but @morganL recommended I start a fresh thread. For reference: https://www.truenas.com/community/t...ting-or-storage-dashboard-in-cobia-rc.112857/

This should go without saying, but I'm happy to provide additional debug information and/or perform additional troubleshooting if it helps track this down.

morganL · Oct 28, 2023

There is a bug, probably in the UI:

[NAS-124784] - iXsystems TrueNAS Jira (ixsystems.atlassian.net)

Temps do rise over time and are accurate in reporting, but not on the dashboard.

alexmarkley · Oct 28, 2023

morganL said:
Temps do rise over time and are accurate in reporting, but not on the dashboard.

Please take a closer look at my reporting screenshot. You'll notice that most of my disk drives have no visible temperature within the reporting screen. Only the two NVMe drives are reporting a valid temperature.

Of the drives that are missing from reporting, they are all SATA. Most, but not all, are connected via a SAS HBA. Two of them are SATA SSDs, whereas the rest of them are HDDs.

alexmarkley · Oct 28, 2023

More Clarity:

morganL · Oct 28, 2023

@alexmarkley
Feel free to register a separate bug report. If you could roll back to Bluefin and verify that the current hardware works there, that would be useful.

alexmarkley · Oct 28, 2023

@morganL Can you confirm if this is the correct bug reporting procedure for this scenario? https://www.truenas.com/docs/contributing/issuereporting/jiraissuereporting/

Also yes, it was absolutely working in Bluefin with the current hardware configuration.

morganL · Oct 28, 2023

alexmarkley said:
@morganL Can you confirm if this is the correct bug reporting procedure for this scenario? https://www.truenas.com/docs/contributing/issuereporting/jiraissuereporting/

Also yes, it was absolutely working in Bluefin with the current hardware configuration.

Yes, that should work. This is the modernized approach; hopefully it's easier and we get the full details of the problem.
You should get a NAS ticket ID afterward that you can share here.

alexmarkley · Oct 28, 2023

Thanks @morganL. I've opened a ticket: https://ixsystems.atlassian.net/browse/NAS-124892

cryptochrome · Oct 30, 2023

I can confirm; I have the exact same issue. I just don't know how to add a comment to the Jira ticket.

yandalorian · Dec 3, 2023

Came in here to report that I have the same issue as well.

alexmarkley · Jan 18, 2024

My disk temperature issue is resolved in 23.10.1.1. I suspect the relevant changelog entry was: Fix disk temperature reporting (NAS-125841).

Thanks everyone who worked on it!

Belperite · Jan 18, 2024

Not fixed for me :( It reports the NVMe temp but not the spinning rust temp.

alexmarkley · Jan 18, 2024

Belperite said:
Not fixed for me :( It reports the NVMe temp but not the spinning rust temp.

Were your disk temperatures being reported correctly in Bluefin? In my case, temperatures were working in Bluefin and quit working when I upgraded to Cobia.

Belperite · Jan 18, 2024

Yes, they were fine in Bluefin.

CJRoss · Feb 3, 2024

My temperatures are working in 23.10.1.1 but upgrading to 23.10.1.3 causes them to disappear. Rolling back to 23.10.1.1 brought them back.

Protopia · Feb 16, 2024

This is apparently STILL VERY BROKEN!!!!

I have just tried to look at Disk Temperatures again, and yet again the graphs are broken, despite having 23.10.1.1 installed, having done the extra reboot, and having seen them collected a few weeks ago in the hours after the extra reboot.

I noted in the ticket on 25 Jan that a second reboot after the upgrade to 23.10.1.1 was needed to make it work. My uptime is currently 21 days 17 hours, which means that my NAS hasn't been rebooted since I reported that disk temps were working.

Protopia · Feb 16, 2024

NAS-127387 raised.

CJRoss · Feb 16, 2024

Protopia said:
This is apparently STILL VERY BROKEN!!!!

I have just tried to look at Disk Temperatures again, and yet again the graphs are broken, despite having 23.10.1.1 installed, having done the extra reboot, and having seen them collected a few weeks ago in the hours after the extra reboot.

I noted in the ticket on 25 Jan that a second reboot after the upgrade to 23.10.1.1 was needed to make it work. My uptime is currently 21 days 17 hours, which means that my NAS hasn't been rebooted since I reported that disk temps were working.

My temps are still working on 23.10.1.1. I will note that sometimes I have to go in and out of Reporting a couple of times as various things don't always show correctly. Have you tried that?

I need to update to 24 but I probably won't get to that until next month.

Protopia said:
NAS-127387 raised.

I really wish we didn't need an account just to view JIRA items.

Protopia · Feb 16, 2024

So the techs closed my ticket on the basis that only I was experiencing this problem, but that is not in fact the case.

I will try going in and out of reporting a few times and see if that helps.

I will also try another reboot and see if that fixes it.

Protopia · Feb 16, 2024

I did a reboot, and 4 of 5 HDDs showed temp graphs, but without any historic data for the last 3-4 weeks. I then logged out and went back in, and it showed temp graphs for 5 of 5 HDDs. So my guess is that there are multiple problems with Disk Temps in their new reporting functionality which they have yet to address.
