The IT Industry Is a Disaster
It is worse than that. The mountain of software on which we increasingly base our lives is a Tower of Babel, shaking as fragments fall all around us, and it is on a path to collapse. Yet most people are oblivious to this, or do not understand the situation, or are in denial, or simply have not thought about it. They take unreliable devices as normal, without asking where that leads as more and more of the things in our world work that way: things we trust with our lives and use throughout the day. A day sucks when nothing works and people are hacking you left and right; that is where we are headed.
Case in point: this past weekend I upgraded my Mac from Sierra to High Sierra, the latest OS for Macs. The update process destroyed my disk drive's partition map. I lost all of my data. All of it. (Luckily I had it backed up.) And High Sierra is software built by the world's richest software company. The same day, I was in my new Subaru, and when I turned off the ignition the radio came on. Huh? Even more frustrating, each time I start the car, the car's computer has to boot; but the first thing it does is turn on the radio, and the radio controls are unresponsive until it finishes booting, about ten seconds, during which I have to listen to the radio whether I want to or not, at whatever volume it was set to the last time I used the car. Cars of 20 years ago were not like that, because they did not have computers: you pushed a button, and whatever that button did took effect immediately. The electronics were analog, wires and transistors, no code. It worked really, really well.
Things that used to work reliably are not reliable anymore.
Then, this morning, I got to work and tethered my iPhone to my Mac; but since my disk drive had been reformatted, my Mac did not remember my iPhone's mi-fi password, so it prompted me for it. Yet the Mac connected anyway and downloaded a batch of new emails while it was still waiting for me to enter the password! Was the password cached in a driver somewhere? I have no idea how it did that; it should be impossible. Note that it could not have connected to another network: my Mac had been reformatted, so its list of authorized networks was empty. Yet it downloaded emails. Draw your own conclusions!
I then turned on my second laptop, a Windows PC, and a message popped up: “Unable to map drive”. That is the kind of transient failure we have all come to accept: if you try something and it fails, try again. But it is also a sign of the Von Neumann paradigm, in which things happen in a fixed sequence rather than in response to events. (More on that in a moment.)
Last Friday I was talking with a colleague, an expert in application security, about why it is that Google employs some of the best security people in the world, yet Google's Chrome browser still has a steady stream of security vulnerabilities that shows no sign of drying up. If Google can't write secure code, who can? We reasoned that while there are lots of security experts at Google, they are dwarfed by the much larger number of people writing code for Chrome, and so most of the programmers are steadily creating a mountain of insecure code that the experts cannot keep up with.
It is like that for everything. Today companies hire programmers fresh out of college, and those folks immediately write mountains of code riddled with vulnerabilities, and often other kinds of bugs as well; and then we as a society base our lives on that code: our phones, our cars, our Internet services, and, if the industry can convince us to trust it, every other device that, according to the industry, should be connected to the Internet.
The industry’s solution to this has been scanning: adding more tools that look at the code and magically find the vulnerabilities. It does not work. Repeat: it does not work. Von Neumann code is too hard to interpret, in both its intention and its timing. It takes a human to examine code and find these problems, and code is being produced far faster than Von Neumann code can be examined properly.
What amazes me is that people are blissfully unaware, even though the signs are all around us: computers not working, devices not doing what they are supposed to, people and companies being hacked left and right. Imagine that everything is computerized: is that a world that we want? People don’t question that, but they should. They think it has to be that way, but it doesn’t — or I should say, it didn’t.
The root causes of the problem are many:
- The industry settled on a computer architecture that is fundamentally flawed: the Von Neumann architecture, a design in which a computer is controlled by a sequence of steps. Real-world devices like phones, TVs, and cars are a poor fit for that design, because they handle real-time events; as a result, Von Neumann programs are susceptible to “race conditions”, also known as “time-of-check-to-time-of-use” (TOCTOU) programming errors, unless the programmer knows how to write real-time programs, which very few do.
- Today’s computing industry is driven by features, and by “coders” who can rapidly produce features. The culture of engineering has left the programming industry. There used to be a professional category called “real-time programmer”; there no longer is. There are only “coders”, as if knowing how to code were the same as knowing how to design and implement a reliable software-controlled system. It is not.
TOCTOU — the Bane of Von Neumann
Probably the most pervasive and reliability-destroying type of error in software today is the TOCTOU error, aka “race condition”. In pseudocode, it goes like this:
1. Check whether something you need, say network or file A, is available.
2. If it is, start using it; otherwise, do something else.
This works fine for, say, an accounting program; but it does not work for mobile phone software, or any kind of server software, or software that controls an automobile, a tractor, an airplane, or a TV. Those are “real-time” devices: they interact with humans, networks, and sensors all the time. Right after step 1, things can, and often do, change: the thing you checked for, which was available, may suddenly become unavailable, or the converse. Yet your code chugs merrily along on the assumption that things are still as they were when you performed the check.
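The flaw is easy to demonstrate concretely. A minimal sketch in Python (the file stands in for any shared resource; the deliberate `os.remove` simulates another process changing the world between the check and the use):

```python
import os
import tempfile

def check_then_act(path):
    # TOCTOU pattern: the check and the use are separate steps.
    if os.path.exists(path):            # time of check
        # Another process can delete or replace the file right here.
        with open(path) as f:           # time of use: may fail anyway
            return f.read()
    return None

def act_and_handle(path):
    # The check and the use collapse into one attempted operation,
    # so there is no window for the world to change in between.
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return None

# Set up a file, then yank it away, as a concurrent process might.
fd, path = tempfile.mkstemp()
os.write(fd, b"data")
os.close(fd)
assert check_then_act(path) == "data"
os.remove(path)                         # simulates the "other process"
assert act_and_handle(path) is None     # degrades gracefully, no race window
```

In a single-threaded demo the removal is explicit; on a real device the same change happens asynchronously, at any moment, which is exactly why the check-then-act version cannot be trusted.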
That is the Von Neumann paradigm: check, then act. In contrast, there are other computing paradigms, such as event paradigms and data-flow paradigms, and these can be implemented directly at the hardware level, so that it is impossible to subvert them. Such architectures were actively researched until the late 1980s, when it became clear that the personal computing industry had settled on commodity Intel and Motorola chips, which were based on the Von Neumann approach. We are stuck with that choice today, a choice that none of us had any say in.
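The difference in shape between the two paradigms can be sketched in a few lines. The event bus below is a toy of my own construction, not any particular framework; the point is that the program declares reactions to events rather than polling state it checked earlier:

```python
class EventBus:
    """Toy event dispatcher: handlers run in response to events,
    never on the basis of a possibly-stale earlier check."""
    def __init__(self):
        self.handlers = {}

    def on(self, event, handler):
        # Register a reaction; nothing runs until the event arrives.
        self.handlers.setdefault(event, []).append(handler)

    def emit(self, event, payload=None):
        # Deliver the event to every registered handler, in order.
        for handler in self.handlers.get(event, []):
            handler(payload)

log = []
bus = EventBus()
# Declare what to do when the world changes; never ask "is the
# network up?" and act on the answer later.
bus.on("network_up", lambda _: log.append("start sync"))
bus.on("network_down", lambda _: log.append("pause sync"))

bus.emit("network_up")
bus.emit("network_down")
assert log == ["start sync", "pause sync"]
```

A software event bus still runs on Von Neumann hardware underneath, of course; the architectures mentioned above made dispatch like this a hardware primitive, which is what made it impossible to subvert.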
Indeed, it has recently been found that the Linux operating system kernel is full of race conditions.
It is possible to write real-time Von Neumann software that behaves correctly. However, it requires a certain level of sophistication, even a certain personality, that most programmers do not have. Yet today we have devolved software engineering into “coding”, and calls to democratize programming are getting louder, uninformed by the reality that writing reliable code on a Von Neumann system is extremely hard and not something that the average person can do.
Standards Through Popularity
The fact that the entire computing industry is hobbled by a bad choice of architecture is bad enough, but the industry is also rudderless and suffers from mob rule.
Take the Internet standards organization, the IETF. Its central mechanism for considering and publishing new technical standards is the “Request for Comments” (RFC) process. Anyone can propose a new standard, which is good, and standards get vetted by industry groups that weigh in; however, there is no coordinated, engineering-based process for analyzing and proving out a standard. Basically, if the participating voices, usually big networking companies, agree that the standard should be published, it is. Standards also get layered one atop another, creating a heap with no real architecture or holistic concept.
That is a fine process for, say, trade agreements: make sure all the parties feel the agreement is in their best interests. It is a terrible process for technical standards. The reason is that no one is looking out for us! There is no process to ensure that a standard will result in reliable end-user applications; that is, that it is a good standard from the perspective of its whole usage lifecycle, including how programmers will use it and how reliable the applications built with it will be.
And then there are self-appointed standards organizations such as the World Wide Web Consortium, aka W3C, which took the approach used by HTML and ran with it, producing one horrendously arcane and unreadable standard after another. Among the standards that emerged from that world is the infamous WSDL, largely abandoned today and notorious as one of the most unnecessarily complicated specs of all time, built to let programmers do something that is actually really simple: send a message from one place to another. Unfortunately it was immediately adopted by Web service platform vendors anxious to push a new technology that would force everyone to buy a new generation of products, much as new tennis racket sizes forced everyone in the 1970s to buy new rackets. That is another problem with the industry: it is driven too much by vendors who have their own interests in mind, not ours.
Who made W3C king of the Web, with the authority to create these standards? I didn’t; you didn’t: they did. They appointed themselves. The fact that Tim Berners-Lee created W3C played a large role in why people listened; yet it is widely acknowledged that the core technologies of the Web, HTTP and HTML, both created by Berners-Lee, are deeply flawed. In fact, “deeply flawed” is a huge understatement; “unmitigated disaster” would be more accurate.
How can we fix this?
We can’t. Things are too far gone. However, things could be made a lot better. For one, we need to make programmers (or their employers, as the case may be) liable for losses that result from security bugs and other bugs. That is the only way that individuals and companies who write software will start to pay attention to how reliable their product is. It is the only way.
Second, we need to stop trying to democratize coding. Coding is not like narrative writing: it is not for everyone. Coding is an intricate design process, and only experts can do it well. Those who do it as amateurs can sometimes produce interesting and useful things, but they cannot generally write reliable products fit to be used by millions of people on real-time devices such as their phones. To do that, you need engineers: people trained to write real-time software. The alternative is to put up with the security vulnerabilities and escalating bugs that we see today.
Third, we need real standards organizations that are not driven by vendors, and that have consumers represented in some way. I am not holding my breath, but that is what is needed: the current system has created a mess that puts the entire future of computing at risk. The failing IoT industry is the first sign of the collapse, the tipping point at which the current path becomes unsustainable.
Who Am I?
Why should you believe me? What do I know about this? Be your own judge, but here’s my background: I was on the team at Intermetrics that created VHDL, the real-time programming language that is widely used today for designing chips and systems. I wrote the first simulation program using VHDL, and I wrote the first behavioral-synthesis silicon compiler for VHDL, back in the late 1980s while at CAD Language Systems. After that I co-founded a company that built a great number of enterprise systems using Java and Solaris server technology, and wrote Sun Microsystems’ Enterprise Java book. Since then I have consulted with companies to figure out why their applications are unreliable, and collaborated with Peter Neumann on a book about application security and reliability. In recent years I have focused on DevOps and DevSecOps. I am a developer, not a check-the-boxes control person. I work with teams to help them code better, but I also understand what does not work, and I see a lot of it.