SATA Disk Temperatures Missing After Upgrade

alexmarkley

Dabbler
Joined
Jul 27, 2021
Messages
40
Yesterday I upgraded my TrueNAS SCALE box from Bluefin 22.12.3.2 to Cobia 23.10.0. After the upgrade, I noticed that none of my SATA drives are reporting temperatures anymore.

Here are some relevant screenshots:

1698509930562.png


1698509997306.png


At a low level, everything still seems to be working:

Code:
root@veritas2[~]# for DISK in $(smartctl --scan | awk '{ print $1; }'); \
    do echo "==== Smartctl on $DISK ====" ; \
    smartctl -a $DISK | grep -E '^Temperature:|Airflow_Temperature|ATTRIBUTE_NAME|Health Information'; \
    echo ; done
==== Smartctl on /dev/sda ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   061   048   000    Old_age   Always       -       39 (Min/Max 29/41)

==== Smartctl on /dev/sdb ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   059   044   000    Old_age   Always       -       41 (Min/Max 31/44)

==== Smartctl on /dev/sdc ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   060   049   000    Old_age   Always       -       40 (Min/Max 30/44)

==== Smartctl on /dev/sdd ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   064   043   000    Old_age   Always       -       36 (Min/Max 30/41)

==== Smartctl on /dev/sde ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   061   053   000    Old_age   Always       -       39 (Min/Max 28/40)

==== Smartctl on /dev/sdf ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   067   044   040    Old_age   Always       -       33 (Min/Max 29/35)

==== Smartctl on /dev/sdg ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   058   049   000    Old_age   Always       -       42 (Min/Max 30/44)

==== Smartctl on /dev/sdh ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   058   048   000    Old_age   Always       -       42 (Min/Max 31/44)

==== Smartctl on /dev/sdi ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   064   038   040    Old_age   Always   In_the_past 36 (2 54 39 30 0)

==== Smartctl on /dev/sdj ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   062   040   000    Old_age   Always       -       38 (Min/Max 30/43)

==== Smartctl on /dev/sdk ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   069   045   000    Old_age   Always       -       31 (Min/Max 28/35)

==== Smartctl on /dev/sdl ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   066   048   000    Old_age   Always       -       34 (Min/Max 27/35)

==== Smartctl on /dev/sdm ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   063   045   000    Old_age   Always       -       37 (Min/Max 27/38)

==== Smartctl on /dev/sdn ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   064   052   000    Old_age   Always       -       36 (Min/Max 28/37)

==== Smartctl on /dev/sdo ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   068   048   000    Old_age   Always       -       32 (Min/Max 28/36)

==== Smartctl on /dev/sdp ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0032   071   061   000    Old_age   Always       -       29

==== Smartctl on /dev/sdq ====
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0032   072   062   000    Old_age   Always       -       28

==== Smartctl on /dev/nvme0 ====
SMART/Health Information (NVMe Log 0x02)
Temperature:                        48 Celsius

==== Smartctl on /dev/nvme1 ====
SMART/Health Information (NVMe Log 0x02)
Temperature:                        42 Celsius
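
For anyone who wants to condense output like the above into one temperature per drive, here is a minimal sketch. It assumes the two layouts shown in this thread: an ATA attribute row (ID 190 or 194) whose RAW_VALUE begins with the temperature, and the NVMe "Temperature: NN Celsius" line. `extract_temp` is a hypothetical helper name, not part of smartmontools, and other drives may report temperature differently.

```shell
#!/bin/sh
# Sketch only: extract the current temperature from `smartctl -a` text on stdin.
# Assumption: ATA drives expose attribute 190 or 194 with the temperature as the
# first number of RAW_VALUE (field 10); NVMe drives print "Temperature: NN Celsius".
extract_temp() {
    awk '
        # ATA attribute table row, e.g. "190 Airflow_Temperature_Cel ... - 39 (Min/Max 29/41)"
        ($1 == "190" || $1 == "194") && $2 ~ /Temperature/ { print $10; exit }
        # NVMe health log line, e.g. "Temperature:   48 Celsius"
        /^Temperature:/ { print $2; exit }
    '
}

# Example use on a live system (needs root and smartctl, so left commented out):
# for DISK in $(smartctl --scan | awk '{ print $1 }'); do
#     printf '%s %s\n' "$DISK" "$(smartctl -a "$DISK" | extract_temp)"
# done
```

The commented loop would print lines such as `/dev/sda 39`, which are easier to compare against what the Reporting screen shows.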


The important hardware details on this machine:
  • Supermicro A+ Server 1014S-WTRT
  • AMD EPYC 7232P 8-Core Processor @ 3.10GHz
  • 128 GB ECC DDR4 3200
  • Micron 480 GB NVMe boot device
  • Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02) HBA controller
  • 15 SATA drives on an external SAS-3 backplane for tank pool
  • Samsung SSD 990 PRO 1TB NVMe as an L2ARC for tank
  • A couple of mirrored SSDs for apps and VMs in a separate pool

Do we know if this issue is "just" a reporting issue? Does it also impact alert notifications? If it's just a GUI bug, it's annoying. If it breaks drive temperature alerts, that's a lot more concerning.
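
Until it is clear whether the built-in alerts are affected, a stopgap check that does not depend on the GUI could look roughly like the sketch below. It reads "device temperature" pairs on stdin, prints a warning for each drive at or above a limit, and returns non-zero if any drive is too hot. `check_temps` is a hypothetical helper, the 45 C default is an arbitrary assumption, and this is not how TrueNAS implements its own alerts.

```shell
#!/bin/sh
# Sketch only: warn about drives at or above a temperature limit.
# Input (stdin): one "device temperature" pair per line, e.g. "/dev/sda 39".
# Argument 1: limit in degrees Celsius (assumed default: 45).
check_temps() {
    awk -v limit="${1:-45}" '
        # Skip lines whose second field is not a plain number (e.g. missing readings).
        $2 ~ /^[0-9]+$/ && $2 + 0 >= limit + 0 {
            printf "WARNING: %s at %d C (limit %d C)\n", $1, $2, limit
            hot = 1
        }
        END { exit hot + 0 }   # exit status 1 if any drive was at or above the limit
    '
}
```

For example, `printf '/dev/sda 39\n/dev/sdg 42\n' | check_temps 40` prints a warning for `/dev/sdg` only and returns a non-zero status, which a cron job or script could use to send a notification.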

I had posted in an older thread about this issue but @morganL recommended I start a fresh thread. For reference: https://www.truenas.com/community/t...ting-or-storage-dashboard-in-cobia-rc.112857/

This should go without saying, but I'm happy to provide additional debug information and/or perform additional troubleshooting if it helps track this down.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
There is a bug, probably in the UI.

Temps do rise over time and are accurate in reporting, but not dashboard.
 

alexmarkley

Dabbler
Joined
Jul 27, 2021
Messages
40
Temps do rise over time and are accurate in reporting, but not dashboard.

Please take a closer look at my reporting screenshot. You'll notice that most of my disk drives have no visible temperature within the reporting screen. Only the two NVMe drives are reporting a valid temperature.

The drives that are missing from reporting are all SATA. Most, but not all, are connected via a SAS HBA. Two of them are SATA SSDs; the rest are HDDs.
 

alexmarkley

Dabbler
Joined
Jul 27, 2021
Messages
40
More Clarity:
1698514987408.png
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
@alexmarkley
Feel free to register a separate bug report. If you could roll back to Bluefin and verify that the current hardware works there, that would be useful.
 

cryptochrome

Cadet
Joined
Oct 23, 2023
Messages
4
I can confirm. Have the exact same issue. I just don't know how to add a comment to the Jira thing.
 

alexmarkley

Dabbler
Joined
Jul 27, 2021
Messages
40
My disk temperature issue is resolved in 23.10.1.1. I suspect the relevant changelog entry was: Fix disk temperature reporting (NAS-125841).

Thanks everyone who worked on it!
 

alexmarkley

Dabbler
Joined
Jul 27, 2021
Messages
40
Not fixed for me :( It reports the NVME temp but not the spinning rust temp.
Were your disk temperatures being reported correctly in Bluefin? In my case, temperatures were working in Bluefin and quit working when I upgraded to Cobia.
 

CJRoss

Contributor
Joined
Aug 7, 2017
Messages
139
My temperatures are working in 23.10.1.1 but upgrading to 23.10.1.3 causes them to disappear. Rolling back to 23.10.1.1 brought them back.
 

Protopia

Dabbler
Joined
Jul 11, 2023
Messages
37
This is apparently STILL VERY BROKEN!!!!

I have just tried to look at Disk Temperatures again, and yet again the graphs are broken. This is despite having 23.10.1.1 installed, having done the extra reboot, and having seen temperatures collected a few weeks ago in the hours after that reboot.

I noted in the ticket on 25 Jan that a second reboot after the upgrade to 23.10.1.1 was needed to make it work. My uptime is currently 21 days 17 hours, which means my NAS hasn't been rebooted since I reported that disk temps were working.

:frown:
 

CJRoss

Contributor
Joined
Aug 7, 2017
Messages
139
My temps are still working on 23.10.1.1. I will note that sometimes I have to go in and out of Reporting a couple of times as various things don't always show correctly. Have you tried that?

I need to update to 24 but I probably won't get to that until next month.


I really wish we didn't need an account just to view JIRA items.
 

Protopia

Dabbler
Joined
Jul 11, 2023
Messages
37
So the techs closed my ticket on the basis that only I was experiencing this problem, but that is not in fact the case.

I will try going in and out of reporting a few times and see if that helps.

I will also try another reboot and see if that fixes it.
 

Protopia

Dabbler
Joined
Jul 11, 2023
Messages
37
I did a reboot, and 4 of 5 HDDs showed temp graphs, but without any historical data for the last 3-4 weeks. I then logged out and went back in, and temp graphs were showing for 5 of 5 HDDs. So my guess is that there are multiple problems with Disk Temps in their new reporting functionality which they have yet to address.
 