Stress testing for temperature on a home NAS
Temperature Watch
Using stress, lm-sensors, and hddtemp to sort out temperature and reliability related issues with a home-based NAS box.
I recently found myself in a difficult situation with my home NAS brought about by some sketchy construction work done at my apartment. Long story short, the workers didn't mention that they would be using my work area as their work area and set about cutting bathroom tiles right next to my main workstation, my home NAS server, and a toy i3-based rig that I occasionally use for testing and special projects. Consequently, my NAS suffered failure after failure – with ghosts that I am still chasing more than three months later.
My NAS is a home-built rig using a Supermicro X8STi with an X5680, 24GB of ECC DDR3, and six disks total, including the operating system (OS) drive itself. The OS is on an SSD, and the data is all on HDDs. There is no redundancy (see my article on MergerFS elsewhere in this issue). So, to be clear, this is a mess primarily of my own making, but one not without others at fault.
The first problem that I experienced after the aforementioned construction was that my disks would randomly disappear after having been present at boot. I cleaned the machine out as best as I could and reseated the SATA cables. This worked temporarily, but two of the drives would still drop off, requiring a reboot, and in some cases, requiring me to reseat the SATA cable. I should note that, at the time, I was using old, inexpensive SATA cables that had been collected over the years from motherboard purchases, and so I decided it was time to switch to more modern cables with locking latches. The problems remained.
At this point, having opened the box numerous times, I realized that the intake fan in front of the hard drive cage had failed. The fan was covered in red dust from the construction. Though I cannot say definitively that this was what caused its demise, it was followed very quickly by the CPU fan. Both were 120mm fans running full-tilt 24/7, and neither had filters or screens in front of them. The next item to go was the power button, oddly enough. Although I cannot assume that dust directly caused that failure, the power and reset buttons certainly got some additional mileage after the construction was completed as I chased these demons down. The case was an inexpensive affair made of cheap plastic and jagged steel, which seemed to take its final shape only once fully populated with gear, but I hadn't had any issues with it before the construction.
At this point, I decided that a cursory cleaning wasn't enough, and I really needed to rip everything out, clean everything off well, rewire it all, replace the faulty fans, and test. Needless to say, it is good that this is only a home NAS box and not something used in any sort of real production, as this is exactly the type of scenario that is avoided at all costs in the server world.
After taking everything out and dusting it all as carefully as possible, it would have been sacrilege for me not to replace the CPU Thermal Interface Material (TIM) between the CPU lid and the heat sink. The X5680 is a 130W CPU and lets you know it when under load as its temperature ramps up quickly.
At this point in my epic saga, I had a freshly rebuilt box with known working components but without a feel for the performance and temperature with the new fans installed. I decided to use three simple and commonly used programs to address these needs:
- stress [1] (which is also available in an enhanced version known as stress-ng) is a workload generator that you can use to apply a configurable load to your system.
- hddtemp [2] is a tool that will monitor the temperature of your hard drive to make sure the drive is operating in the recommended range. hddtemp works by accessing the Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) information available with many hard drives.
- lm-sensors [3] is a utility that can read and report data from sensors located in the hardware, including sensors for monitoring temperature, fan speed, and voltage for the CPU, mainboard, and other components.
These tools have all been around for a while, and you might already be familiar with them. What I decided to do was to run stress on all 12 threads of the CPU while at the same time having Plex add all of my preexisting media to the Plex library. This test is really the most use that those drives will ever get at one time. With the CPU working on all threads, the test would simulate not only the maximum power draw but also allow me to see how well the cheap 120mm knock-off fans perform.
Simple Stress Test
To install (on Ubuntu-based systems such as mine) and to make sure that you have the most recent versions of the programs, run the following commands in the terminal (Figure 1):
$ sudo apt update && sudo apt install hddtemp lm-sensors stress -y
$ sudo apt update && sudo apt upgrade -y
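Before starting a long test, it can help to confirm that all three tools actually landed on the path. The following is a minimal sketch; the function name missing_tools is hypothetical, and it assumes the lm-sensors package provides its usual sensors binary.

```shell
# Print the name of every listed tool that cannot be found on the PATH.
# Prints nothing when everything is installed.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The three programs used in this article (lm-sensors installs "sensors").
missing_tools stress sensors hddtemp
```

Any name printed by the last line still needs to be installed; an empty result means the box is ready for testing.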
My system was running Ubuntu [4] 20.04.3 LTS and was up-to-date at the time of stress testing. The install only takes a short moment, and from there, all operations are done from the terminal, so you can use Ctrl+L to clear the screen and leave the terminal open to start the stressing. I would also recommend opening a second tab or another terminal instance for the monitoring. If you are connecting to a headless server via SSH or using some other terminal emulator, such as that which is found in Webmin [5] or Cockpit [6], you will need to adjust your workflow accordingly. At any rate, I found it was easier having two terminal windows open, one to run stress and to be able to stop the test from running using Ctrl+C, and the other to refresh the monitoring programs as needed.
To start running stress, I used the following command to load up all of the threads on the CPU:
$ stress --cpu 12
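On a machine with a different thread count, the 12 can be replaced with whatever nproc reports, and a timeout can stop the run without Ctrl+C. This is a sketch under those assumptions; the helper name stress_cmd and the 10-minute default are illustrative choices, not part of the original test.

```shell
# Build a stress command line sized to this machine's thread count.
# Usage: stress_cmd [duration]  (stress accepts suffixes such as s, m, and h)
stress_cmd() {
    printf 'stress --cpu %s --timeout %s\n' "$(nproc)" "${1:-10m}"
}

# Print the command for review before running it.
stress_cmd 10m
```

Printing the command rather than executing it makes it easy to check the thread count first; the printed line can then be pasted into the terminal reserved for stress.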
The output appears in Figure 2. Once the test had been running for a few minutes, I ran the following command in the second terminal window in order to see where the temps were:
$ sudo sensors && hddtemp /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
Figure 3 shows the result. You might need to run the following command first in order for it to know which sensors exist and which can be checked by the program itself:
$ sudo sensors-detect
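Once sensors reports sensible values, the readings can be kept in a log rather than only read off the screen. The sketch below adds a timestamp to each line it receives; the function name stamp and the log file name temps.log are assumptions made for illustration.

```shell
# Prefix every line read from stdin with the current time (HH:MM:SS).
stamp() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
    done
}

# On the NAS, a loop along these lines would log one reading per minute:
#   while true; do sensors | stamp >> temps.log; sleep 60; done

# Demonstration with a made-up reading instead of real sensor output.
printf 'Core 0: +75.0 C\n' | stamp
```

A timestamped log makes it easier to see how quickly the temperatures climb after stress starts and where they level off.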
At first, I didn't know why I had to spell out each of the drives individually for hddtemp to work. I expected that simply typing hddtemp would list the temperatures of all drives using the default Celsius scale, with each drive appearing on its own line, but that didn't work for me. Adding the /dev/sdX for each drive after the command did work and displayed each drive on its own line with temps shown in Celsius. I wondered whether this had something to do with one drive being an SSD and the rest being HDDs, or with my cheap "RAID" card (it is not really a RAID card but rather an inexpensive Marvell chip with some SATA ports connected to it), which connects via PCIe and allows for four additional SATA III drives to be installed in the system. As far as I can tell, though, the likelier explanation is simply that hddtemp expects at least one device to be named on the command line.
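Typing out every /dev/sdX by hand gets tedious, so the device list can be generated instead. This is a rough sketch that filters the output of lsblk down to whole disks; the function name disk_paths is hypothetical, and the sample input below stands in for what lsblk would print on a real system.

```shell
# Turn "NAME TYPE" lines (as printed by: lsblk -d -n -o NAME,TYPE)
# into /dev paths, keeping whole disks and dropping optical or loop devices.
disk_paths() {
    awk '$2 == "disk" { print "/dev/" $1 }'
}

# On the NAS, the list could be handed straight to hddtemp:
#   sudo hddtemp $(lsblk -d -n -o NAME,TYPE | disk_paths)

# Demonstration with sample input.
printf 'sda disk\nsdb disk\nsr0 rom\n' | disk_paths
```

With the sample input, only /dev/sda and /dev/sdb are printed, because the optical drive is not of type disk.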
For about an hour, I pressed the up arrow and reran the monitoring command every so often to make sure that the temps were OK. My CPU will run without a problem and boost normally up to around 81°C; during testing, mine peaked at about 75°C. Typical home-use HDDs, such as the WD Blues that I was using, shouldn't exceed 35-40°C, and mine hovered around 31°C during testing. Some folks will do stress testing for hours or even for a day or more, but this is a home NAS with non-critical data stored on it in a temperature-controlled environment. The heat sink for the CPU was thoroughly heat-soaked after about 10 minutes of operation, and the drives are located directly behind the previously mentioned 120mm intake fan, meaning they were receiving fresh (bathroom tile dust-free) air for the test.
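The comparison against a limit can also be left to the shell. The sketch below picks the hottest value from a list of readings and checks it against the 40°C upper bound mentioned above; the function names max_temp and within_limit are hypothetical, and it assumes that hddtemp -n prints one bare number per drive.

```shell
# Print the highest whole number read from stdin, one value per line.
# Lines that are not plain numbers (for example, error text) are skipped.
max_temp() {
    awk '/^[0-9]+$/ { if ($1 > max) max = $1 } END { print max + 0 }'
}

# Succeed when the reading ($1) is at or below the limit ($2).
within_limit() {
    [ "$1" -le "$2" ]
}

# On the NAS, the input could come from: sudo hddtemp -n /dev/sda /dev/sdb
# Demonstration with sample readings.
hottest="$(printf '31\n29\n33\n' | max_temp)"
if within_limit "$hottest" 40; then
    echo "OK: hottest drive at ${hottest} C"
else
    echo "WARNING: hottest drive at ${hottest} C"
fi
```

The 40 here is only the upper end of the range quoted in the text; a drive's data sheet is the better source for a real limit.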
Conclusion
After my experience with the recent construction in my home, I can pass on one very important lesson: make sure that your devices are covered and turned off if someone is doing construction nearby. Dust from construction is not great when run through your server, workstation, or laptop.
Running stress-testing software with a monitoring program is a great way to make sure that your system will stand the test of time and that your components will run optimally, especially after a new build or a rebuild. It is important to have a good baseline for your system to know if and when it is running well. One of the best things that I have found in the open source community is that there is a program, app, script, or Flatpak for just about anything you can imagine. With all of the issues that I have had to deal with pertaining to this NAS server, software was never an issue. Handy utilities like stress, hddtemp, and lm-sensors were easy to use and gave me important insights on my system.
It is easy to forget about the folks who have supported the open source community over the years, especially when you find yourself in situations that you would rather not be in and with hardware that seemed to have been cursed, but I can say, unequivocally, that I have far more faith in the contributors unknown to me who helped to make my NAS system work than I do in the construction workers I let into my house. For that, I would just like to say – very loudly, so that you can hear me over these cheap fans – thank you!
Infos
[1] stress: https://launchpad.net/ubuntu/+source/stress/
[2] hddtemp: https://launchpad.net/ubuntu/+source/hddtemp/
[3] lm-sensors: https://launchpad.net/lmsensors/
[4] Ubuntu: https://ubuntu.com/
[5] Webmin: https://www.webmin.com/
[6] Cockpit: https://cockpit-project.org/