The QNAP of Death

Alternate title: the day my NAS died

A QNAP NAS with System Booting text

Not quite the same System Booting text I was greeted with, but close enough. Excuse the dust.

System booting? Yes, but the system has been booting for literally hours now. If it hasn’t booted within five minutes, there’s something wrong.

And dear reader, there was indeed something wrong. I tried all the usual stuff: turning it off and on again, leaving it off for a couple of days, pulling all the hard drives out, and turning it off and on in between all of those steps. Nothing worked, nor did it give me any kind of video output to indicate what might be wrong. It turned on, but wouldn’t boot into the OS. That probably should have been my first clue that although something was wrong, maybe it wasn’t completely dead. And if it wasn’t completely dead, then maybe there was something we could do to fix it.

But after unplugging every piece of hardware I had added to the QNAP and returning it to the stock hardware configuration, the thing would still not boot up properly, giving me that same error message. System booting. Whatever was wrong with it, it wasn’t because of something I had added or done to the system, which probably meant it was hardware-related. Ugh.

With my extensive troubleshooting prowess exhausted, it was time to turn to old mate Google.

Google immediately led me to a 100-page forum thread about the issue on QNAP’s own forums. This was either very good, or very bad. In my case it was initially very bad, because it meant I had to read through most of it, but then things turned out very good because within those 100 pages, there was the trifecta: a known recurring issue, exact steps to diagnose that specific issue, and a fix that worked for enough people for it to be considered the official unofficial fix.

The problem, as it was described, is some kind of “degraded” LPC clock. As I understand it, basically there’s some kind of timing component that keeps things in your PC running on time for lower-pin (Intel’s definition of lower-pin here actually means 1170 soldered pins) processors like the Intel Celeron J1900 in the QNAP that I have. What happens is that in some systems, including my QNAP and even some Synology units, the circuit for this LPC clock degrades over time due to “reasons”, and eventually reaches a state where it fails to provide a stable clock to the system, meaning that the CPU doesn’t work like it should. Or something along those lines, anyway.

According to the forum post, it’s remarkably similar to an issue that affected the Intel C2000 Atom processors, which Cisco and Synology both issued advisories about all the way back in 2017, although that case was slightly more serious as it caused C2000 Atom-equipped gear to fail after as little as 18 months. In the case of my QNAP, it lasted over six years. Not bad, but buyer beware, I guess, not that you’d be able to tell this kind of thing at the time of purchase.

Thankfully, diagnosing the issue is pretty easy. Use a multimeter to measure the voltage between some pins or pads on the motherboard, depending on your specific model of QNAP, and if the voltage shows over 2V, your LPC CLK is likely broken and needs to be fixed if you want to use your NAS again.
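If you like your rules of thumb written down, the check boils down to a single comparison. Here's a minimal Python sketch of it, assuming you've already taken a reading with your multimeter at whatever pins or pads apply to your model. The function name and the sample readings are made up for illustration; only the 2V threshold comes from the forum thread.

```python
# Threshold from the forum thread as described above. Where to measure
# depends on your QNAP model and is not encoded here.
LPC_CLK_SUSPECT_VOLTS = 2.0


def lpc_clk_looks_degraded(measured_volts: float,
                           threshold_volts: float = LPC_CLK_SUSPECT_VOLTS) -> bool:
    """Return True when a multimeter reading is above the suspect threshold."""
    if measured_volts < 0:
        raise ValueError("expected a non-negative DC voltage reading")
    return measured_volts > threshold_volts


if __name__ == "__main__":
    # Hypothetical readings for illustration, not real measurements.
    for reading in (1.6, 2.4):
        if lpc_clk_looks_degraded(reading):
            verdict = "over the threshold, LPC CLK likely degraded"
        else:
            verdict = "at or below the threshold"
        print(f"{reading:.2f} V: {verdict}")
```

A reading over the threshold only tells you the symptom matches; it doesn't rule out anything else being wrong with the board.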

The fix is easy enough as well. Because we need to drop the voltage of the LPC CLK signal, we can drop in a resistor. Experimentation by some helpful forum members indicated that a 100 Ohm resistor, soldered between the “negative cycle transistor” and ground, will restore the voltage to a correct value to allow the LPC CLK to supply a correct clock signal to the CPU.

Simple, right?

There was just one problem. Well, besides “the problem”. I don’t own a multimeter, nor a soldering iron. Oh, and I don’t really know how to solder. I’ve soldered before, but I wouldn’t say I’m particularly good at it. But as my old swimming coach used to say, no one is born knowing how to swim or solder, so I grabbed a cheap and cheerful soldering iron, some solder, a multimeter, and prepared myself for the hackiest soldering job in the world. Yes, it was really that bad. No, I didn’t trim the ends of the resistor. Yes, I probably should have. Yes, I managed to melt a little plastic connector next to where I was soldering, but in my defence, it was impractical to pull out the entire motherboard for easier access, so I kind of had to do it in situ while it was still attached to the case, which made it all the more awkward. No, I’m not going to show you a picture. Suffice to say, I got the job done. Just.

After all was said and done, and I put my drives back into the system, it booted up just fine. Not that I didn’t expect it to, given so many other people had had success after attempting the same fix, but it was still a relief. Getting the system back up and running again meant I didn’t have to go to great lengths to recover the data I cared about, never mind wondering what was on there that I might have forgotten about in the first place.

I wish that was the end of the story. Alas, the forum had one more golden nugget of information to dispense, and that was that the fix was only temporary. Continued degradation of the clock timer was inevitable, and the next time it failed, there was no guarantee it would be fixable with any kind of resistor. It was hard to estimate how long the fix would work for, but six months to a couple of years seemed reasonable. Reasonable, but only if you were willing to put up with the fact that your NAS might die at any moment, and maybe even be completely unrecoverable from that point on.

Which worked for me, because now that I knew it was on the way out, it was time to build a replacement.

