When chips go bad
Doghouse – Chip Replacement
Chip replacement isn’t always the answer.
Last month, I wrote about the Meltdown and Spectre issues. The vendors and the community are working hard to evaluate the whole problem and see what types of solutions exist.
Many people are demanding that the chips be replaced. While this would be wonderful, in reality it is unlikely to ever happen, for many reasons.
This problem has been going on a long time, over many iterations of chips. The fabrication plants for creating the older chips have started making more modern chips. Going back in time to replace these older chips would be difficult, and even if the chip makers could do that, many of the chips are soldered to the motherboard. Removing the chip from the motherboard and resoldering it would be very expensive. Even if replacing the chips could be done, unless the chip is an exact replacement, the motherboard circuitry could be different, and the instructions to boot and run the operating system would have to change.
Perhaps it is a more modern chip, manufactured only a short time ago. The customer would have to identify the chip, find out if it is indeed affected, and submit that chip to the vendor to replace. Many of the same issues would exist, although the operating system might still be adaptable to the newer chip.
Another Time and Place
VAXstation 3100 computer systems were desktop computers designed by Digital Equipment Corporation (DEC) around 1986. They were one of DEC's first forays into "high volume" manufacture, which was then measured in hundreds of thousands, not hundreds of millions.
DEC had a whole team of engineers on this product. Hardware circuitry, case design, manufacturing engineers, and of course both VMS and Ultrix (DEC's Unix product) engineers. We met every week in a project meeting.
This product utilized daughter cards for the RAM memory, way before the concept of industry standard SIMMs of memory. DEC had used memory chips from two different Japanese memory manufacturers. Not wanting to be too "cutting edge," DEC used memory chips that were shipping in the millions of units all over the world to many manufacturers, including DEC. The VAXstation 3100 was also the first system that DEC created that had only parity error detection, not ECC correctable memory.
The systems were headed toward field test. Because this was such an important product, there were a huge number of units made, not the typical 20-30 field test units that a larger system might have generated.
As the units were being readied to field test, it was noticed that Ultrix kept crashing while VMS was rock solid. The hardware engineers made comments about how terrible the Ultrix code was.
We eventually proved that the memory chips from one of the Japanese memory vendors were "forgetting," even though they were being properly clocked and fed the right amount of electricity – something that was not supposed to happen. If you did not read or write to that company's memory chip for over 45 seconds, it "forgot." Not all the time … only about half the time.
Wait! Why only Ultrix and not VMS? Because VMS always read a swapped-out process back into memory from disk when it swapped that process in, so RAM that had not been written to for a long period was "refreshed" with a good copy. Unix (and Ultrix) reasoned that if the RAM had not been changed, what was the sense of "refreshing" it, since it still contained the information it had before? Right?
We brought in the Japanese firm and proved to them what was happening. That was when I learned the Japanese words that could not be printed here.
Wait a minute! Why had this not shown up before in those millions and millions of other systems with that vendor's chips?
Most servers of the day had ECC correctable memory. If one chip forgot, the others recreated the missing bit. This was called a "soft hit," something only noticeable if you looked for it in the system log.
PCs (which only had parity memory detection, not ECC correction) usually ran with small, active memories, so the chips were always being accessed. And if the system did crash, it was probably due to that crappy Microsoft operating system, right?
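For readers who want to see the difference between parity detection and ECC correction, here is a minimal sketch in Python. It is only an illustration, not what any of these machines actually did: real ECC memory works on much wider words, and the Hamming(7,4) code used here is a toy stand-in that I chose for brevity. All of the function names are made up for this example.

```python
def parity_bit(bits):
    """Return the even-parity bit for a list of 0/1 values."""
    return sum(bits) % 2


def parity_ok(bits, stored_parity):
    """Parity can only say whether something changed, not which bit."""
    return parity_bit(bits) == stored_parity


def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword: p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Fix at most one flipped bit; return (data bits, bad position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # 1-based index of the flipped bit
    if position:
        c[position - 1] ^= 1  # flip the "forgotten" bit back
    return [c[2], c[4], c[5], c[6]], position


data = [1, 0, 1, 1]

# Parity-only memory: the error is noticed, but the data cannot be repaired.
stored_parity = parity_bit(data)
forgotten = [1, 0, 0, 1]  # one bit has "forgotten" its value
parity_detected_error = not parity_ok(forgotten, stored_parity)

# ECC-style memory: the same kind of single-bit error is quietly repaired.
codeword = hamming74_encode(data)
codeword[4] ^= 1  # one bit has "forgotten" its value
recovered, bad_position = hamming74_correct(codeword)
```

With parity alone, the best the system can do is notice the error and stop. With the correcting code, the original four bits come back intact and the only trace is the reported bit position, which is roughly the kind of thing that would end up in a system log.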
The memory company begged us not to announce this. DEC decided not to announce it, since it literally would bankrupt the memory company, which would have solved nothing. DEC did insist they buy back every chip they had ever sold us for the price we paid, and when the fix was done, replace the chips at current market price. The difference in price allowed DEC to replace every chip in every board owned by DEC's customers (including manufacture and field service on-site replacement) and still make twelve million dollars of profit on the replacement.
DEC never revealed which company had the bad chips, and I never will either.