Wednesday, January 17, 2007

lifetime failures (LF)

This morning at LCA Andrew Tanenbaum gave a talk about Minix 3 and his work on creating reliable software.

He cited examples of consumer electronics devices such as TVs that supposedly don't crash. However, in the past I have power-cycled TVs after they didn't behave as desired (I'm not sure whether it was a software crash, but that seems like a reasonable possibility), and I have had a DVD player crash when dealing with damaged discs.

It seems to me that there are two reasons that TV and DVD failures aren't regarded as a serious problem. One is that there is hardly any state in such devices, and most of that is not often changed (long-term state such as frequencies used for station tuning is almost never written and therefore unlikely to be lost on a crash). The other is that the reboot time is reasonably short (generally less than two seconds). So when (not if) a TV or DVD player crashes the result is a service interruption of two seconds plus the time taken to get to the power point and no loss of important data. If this sort of thing happens less than once a month then it's likely that it won't register as a failure with someone who is used to rebooting their PC once a day!

Another example that was cited was cars. I have been wondering whether there are any crash situations for a car electronic system that could result in the engine stalling. Maybe sometimes when I try to start my car and it stalls it's really doing a warm-boot of the engine control system.

Later in his talk Andrew presented the results of killing some Minix system processes, which showed minimal interruption to service (killing an Ethernet device driver every two seconds decreased network performance by about 10%). He also described how some service state is stored so that it can be used if the service is restarted after a crash. Although he didn't explicitly mention it in his talk, it seems that he has followed the same combination of minimal data loss and fast recovery that we are used to seeing in TVs and DVD players.
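The restart-with-saved-state idea can be sketched in a few lines of Python. This is only a toy model of the concept, not how Minix implements it: the names DataStore, EthernetDriver and Supervisor are made up for illustration, and the "crash" is just an exception. The point it shows is that if a service parks its state somewhere that outlives it, a supervisor can replace a dead instance and the new one carries on from where the old one stopped.

```python
class DataStore:
    """Hypothetical place where services park state that should outlive them."""

    def __init__(self):
        self._saved = {}

    def save(self, service, state):
        # Keep a copy so a crashing service can't corrupt it afterwards.
        self._saved[service] = dict(state)

    def load(self, service):
        return dict(self._saved.get(service, {}))


class EthernetDriver:
    """Toy driver; a 'crash' is modelled as a RuntimeError."""

    name = "eth"

    def __init__(self, store):
        self.store = store
        # Pick up whatever the previous instance saved, if anything.
        self.state = store.load(self.name) or {"packets_sent": 0}

    def send(self, crash=False):
        if crash:
            raise RuntimeError("driver crashed")
        self.state["packets_sent"] += 1
        self.store.save(self.name, self.state)


class Supervisor:
    """Restarts the driver when it dies (assumed design, for illustration)."""

    def __init__(self, store):
        self.store = store
        self.driver = EthernetDriver(store)
        self.restarts = 0

    def send(self, crash=False):
        try:
            self.driver.send(crash)
            return True
        except RuntimeError:
            # Replace the dead driver; the new one reloads the saved state.
            self.driver = EthernetDriver(self.store)
            self.restarts += 1
            return False


sup = Supervisor(DataStore())
results = [sup.send(), sup.send(), sup.send(crash=True), sup.send()]
```

In this run one request is lost to the crash, but the packet counter survives the restart, which is the same "brief interruption, no important data lost" behaviour as power-cycling a TV.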

The design of Minix also has some good features for security. When a process issues a read request it will grant the filesystem driver access to the memory region that contains the read buffer - and nothing else. It seems likely that many types of kernel security bug that would compromise systems such as Linux would not be a serious problem on Minix. Compromising a driver for a filesystem that is mounted nosuid and nodev would not allow any direct attacks on applications.
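To make the "read buffer and nothing else" idea concrete, here is a small Python model. It is a sketch under my own assumptions, not the Minix interface: Grant and fs_read are hypothetical names, and Python can't really enforce the boundary the way a kernel does. It only illustrates the rule that the filesystem driver is handed a window onto the caller's memory and any write outside that window is refused.

```python
class Grant:
    """Hypothetical grant: permission to write to one slice of a caller's memory."""

    def __init__(self, memory, start, length):
        self._memory = memory
        self._start = start
        self._length = length

    def write(self, offset, data):
        # Refuse anything that would fall outside the granted window.
        if offset < 0 or offset + len(data) > self._length:
            raise PermissionError("write outside granted region")
        begin = self._start + offset
        self._memory[begin:begin + len(data)] = data


def fs_read(grant, file_bytes):
    """Toy filesystem driver: it sees only the grant, never the whole memory."""
    grant.write(0, file_bytes)


# The application's memory: an 8-byte read buffer with other data on both sides.
process_memory = bytearray(b"SECRET--" + bytes(8) + b"--SECRET")
grant = Grant(process_memory, start=8, length=8)

fs_read(grant, b"filedata")  # fits in the buffer, so it succeeds

try:
    fs_read(grant, b"filedata-and-more")  # a buggy or hostile driver overruns
    overrun_blocked = False
except PermissionError:
    overrun_blocked = True
```

After this runs the buffer holds the file data and the bytes on either side are untouched, even though the second call tried to write past the end of the buffer.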

Every delegate of LCA was given a CD with Minix 3. I'll have to install it on one of my machines and play with it. I may put a public access Minix machine online at some time if there is interest.


Paul said...

The (relatively) new Siemens trains in Melbourne are highly computerised and I was on one which had to be rebooted, after pulling out of a station on the Upfield line. The driver had to lower the pantographs, removing all the power, and then raise them again. The process took a number of minutes.

Apparently the drivers are particularly frustrated with the level of computerisation in these trains.

Thomas said...

I've had a Smart's electronic transmission control "crash" in such a way that neither the automatic nor the manual gear switching worked. Luckily, that was at an intersection with little traffic, and power cycling the car helped.

I've experienced a situation in which the instruments of a Volvo 850 did not work (even the speedometer) after shutting off and restarting the car quickly. Here also, turning the car off and back on solved the problem. It doesn't seem that uncommon...

Anonymous said...

Oh, TVs do crash!

Basically you can see a TV as two parts, one dealing with the stream of data (signal processing) and one controlling that stream. Traditionally the former was implemented in hardware and would happily work on even if the latter part died a horrible death and rebooted. The viewer would be none the wiser since the signal processing was done in hardware.

According to what I've heard some Philips TVs would become unresponsive to the remote control at times. Yepp, that was the control software rebooting. It would only take a few seconds so most people wouldn't notice.

Justin said...

We had a projector hard lock on us last week in class. It wouldn't accept any input; even the power button didn't do anything. I finally had to pull the power cord on it, then put it back in. It works fine now, but that's a strange failure mode for something that supposedly never crashes. I've also seen horrible set top boxes offered by cable companies.

Tanenbaum never says how these devices he's seen manage to work flawlessly, or what design approaches and techniques were used in creating that software. The answer appears to be to do as little as possible in software. Engine computers are generally very small micro-controllers programmed to look up in a table how much fuel to inject based on readings from an oxygen sensor, put that in the back of a queue, and inject as much fuel as the front of the queue says. That is the most complex piece of software involved under the hood that I know of. The rest of the stuff is dashboard, power windows, A/C, that sort of stuff: things that, if they fail, don't make the driver lose control or the car explode.

Anonymous said...

We have a stretch of road in the UK, which got known for a particular model of car having a problem. Never admitted/proven but suspected to be a software issue triggered by a specific sequence of behaviour by the driver.

I think Alan Cooper, in "The Inmates are Running the Asylum", summed it up well: when we put computers in things, they behave like thing+computer, and crucially they fail like computers fail.

Of course, for many of these devices the needed complexity of the computerization is quite low. But when we worked with an embedded device company, they were driven by the hardware/software cost equation to use newer, bigger chips that emulate older, simpler chips, so that they could reuse tested software on newer hardware once the old chips were out of production. One can soon see, in a world where such decisions get made, how software errors could creep into otherwise relatively simple computerized systems.