Hello, my name is Mike Whalen, and I'm going to talk about dependability. This particular lecture is just about terminology, so that when we talk about testing, we have our terms straight for what we're looking for in a program. So, dependability is what you would expect: we're interested in determining whether or not the software is dependable, that is, whether it delivers a service such that we can rely on it. Service is the system behavior as it's perceived by the user of the system. So in an airplane, the service of the airplane is to fly people from one destination to another. A failure occurs when the delivered service deviates from the specification that defines what the desired service is. Say we have a website where we'd like to be able to buy things, and when we try to buy something, it returns an error. That would be a failure that the user can see. An error is the part of the system state that can lead to a failure. Errors can be latent or effective: we have some bug buried in our code, and on some execution we actually hit that bug, and the error becomes effective. Before that, it was latent. And finally, we have faults, which are the root cause of the error. Now, when we're dealing with mechanical systems, it could be that a piece breaks, so you actually have a physical fault in a part. Or you have some error of cognition, an error of understanding: the programmer doesn't completely understand what the requirements are for the system, and so when they start writing the code, they don't do it correctly. So, just flipping it around and going in the other direction: we have programmer mistakes, which are faults, which are the root cause of latent errors, that is, bugs in the program. Those become effective errors when the program is executing and we actually hit the line of code that contains the bug.
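To make that chain concrete, here is a minimal sketch in Python. The function and its spec are invented for illustration: the fault is the programmer's misunderstanding of the requirement, the latent error is the wrong comparison operator in the code, and the error only becomes effective, and visible as a failure, on an input that actually reaches it.

```python
# Hypothetical spec: orders of $100 or more get a 10% discount.
# Fault: the programmer misread "or more" and wrote a strict comparison.
def discounted_total(total):
    if total > 100:        # latent error: should be `>= 100` per the spec
        return total * 0.9
    return total

# On most executions the buggy line behaves acceptably, so the error stays latent:
print(discounted_total(150))  # 135.0, as the spec requires
# On the boundary input the error becomes effective and the user sees a failure:
print(discounted_total(100))  # 100, but the spec says it should be 90.0
```

Notice that nothing about the code changed between the two calls; whether the error is latent or effective depends only on which inputs the execution happens to hit.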
And depending on whether we have some fault tolerance built into the code, that error may not become visible to the user. But if it does, and it causes the system to misbehave, then we have a failure. That could be that the program crashes, or it returns some error code or other thing that causes the user not to be able to do what they expect to be able to do. And our goal with testing, but also with a variety of different software development life cycle processes, is to build dependable software. In order to do that, we have to look at four different kinds of approaches to dependability. First, we have fault avoidance, which is preventing, by construction, certain kinds of faults. If you look at different programming languages, C versus Java for example, they have different kinds of fault avoidance. In C, I can declare an array and then just read past the end of it, and I can have something called a buffer overflow, which causes lots of security problems. Java, on the other hand, has array bounds checking, so it's not possible to write that code, the code that will cause the failure later on. So certain kinds of languages actually have built-in fault avoidance. Certain languages such as Haskell have very rich type systems that allow you, at compile time, to find a lot of the problems that you might otherwise inject into your program in, say, C. So that's one way of achieving dependability. Another one is fault tolerance. Here, what we're going to do is have redundant subsystems, such that one of them can fail and the rest will continue to operate. You see this a lot in critical systems where you have, for example, multiple actuators that control an aileron in an airplane. If we lose one of the actuators, if it misbehaves for some reason, we can still use the other one to move that control surface on the aircraft.
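That bounds-checking contrast can be sketched. The lecture's examples are C and Java; this sketch uses Python, which, like Java, checks array bounds at runtime, so the same programmer mistake that C would turn into a silent buffer overflow surfaces immediately as an exception.

```python
buffer = [10, 20, 30, 40]

# In C, `buffer[4]` would silently read memory past the end of the array,
# a buffer overflow. A bounds-checked language turns the same mistake into
# an immediate, visible error instead of latent corruption.
try:
    value = buffer[4]
except IndexError as exc:
    print("out-of-bounds read caught:", exc)
```

The fault (the off-by-one index) still exists, but the language prevents it from turning into the dangerous class of failure.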
When you look at big websites, you tend to have lots and lots of redundant servers, and if one of them crashes, the rest of the servers can handle the remaining load. Then we have error removal. In this process, what we're trying to do is get rid of the errors themselves, so we're going to apply verification to remove latent errors. And finally, we have error forecasting, which, unlike the other three, is just a way of measuring how likely we are to have failures based on looking at the behavior of the program. So if you think about it, under which of these categories is testing? Okay, hopefully everyone was able to come up with error removal. If we look at this graphically, we have impairments, which are the things we're trying to avoid: faults. All programmers make mistakes, and they're going to eventually introduce errors into the code, but we'd rather that these didn't lead to failures. So what we're going to do is look at means of achieving dependability. We're going to look at, in the case of testing, error removal: we're going to run tests against the software and remove some of those errors from the code. And based on the results of our testing, we might do error forecasting that says, well, we've run a big test suite for three weeks and we haven't found any problems, so we think the software is likely to be reliable. But we can also look at fault prevention: certain techniques that will prevent faults from being introduced in the first place. Places like JPL, the automotive industry, and the aerospace industry have a lot of guidance on what you can and can't write in your programs, and on how you design things, in order to avoid whole classes of faults. The other thing we can do is be tolerant of faults. We know that there's a certain level of errors that we're going to find in code.
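The redundant-servers idea can be sketched as a simple failover loop. All the names here are hypothetical, not from any real framework: the point is only that a request succeeds as long as at least one replica is still up.

```python
def handle_request(replicas, request):
    """Try each redundant replica in turn; one crash doesn't fail the request."""
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError:
            continue  # this replica is down; the survivors absorb the load
    raise RuntimeError("all replicas failed")

def crashed_server(request):
    raise ConnectionError("server down")

def healthy_server(request):
    return "200 OK: " + request

# The first replica has crashed, but the request still succeeds:
print(handle_request([crashed_server, healthy_server], "GET /checkout"))
```

Note that fault tolerance here does not remove the error in the crashed replica; it just keeps that error from becoming a failure the user can see.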
And we're going to build things around those possibly erroneous components in such a way that the system can continue to operate, even in the presence of errors. And finally, we're going to try and measure our dependability, and we use two different metrics for doing this: reliability and availability. So, what's the difference? Availability is the readiness of the software to respond to user requests. Reliability is continuity of correct service. It seems like those are basically the same thing, but they're not, because it turns out you may have an unreliable system, but if you can reboot it really fast, it's actually still pretty available. So reliability says: how long can I run this thing continuously and have it still work correctly? And availability says: what are the chances, at any given time, that the system is available for me? There are some other measures that are equally important, especially when you start looking at safety-critical systems. Safety is the absence of catastrophic consequences based on failures of the software. And you may think, again, what's the difference between safety and reliability? Well, it turns out you can have a system that's very safe but very unreliable. For example, if your car never starts, it's pretty safe as long as it's in your driveway, but it's very unreliable. On the other hand, you can have a fairly reliable system that occasionally is very unsafe. You could have a car that, every once in a while, exhibits unintended acceleration. Most of the time it's reliable, but when that corner case happens, it's very unsafe. Some other measures that are important are integrity, which is the absence of improper system alteration. When you think about security and someone taking over your computer, what you have is a failure of integrity: they've exploited some buffer overflow, they're able to change the software, and thereby they gain access to your data.
And finally, maintainability, the ability of a system to undergo modifications and repairs. With software, it's always the case that you're upgrading things and changing the way the software works, if it's successful in being used at scale. This maintainability actually contributes to availability, because if you have to take the software offline to maintain it, and that takes a long time, it's going to decrease your availability. Then we can turn these into numbers. For reliability, we talk about mean time between failures (MTBF), this idea of being able to run the system continuously for a long time. Recoverability is how quickly the system, if it fails, can be restored to correct operation; this is measured as mean time to repair (MTTR). Then we can put those two numbers together and determine availability: availability is the mean time between failures divided by the mean time between failures plus the time to recover, that is, Availability = MTBF / (MTBF + MTTR). So basically, if the system runs for this amount of time, and it takes this much extra to recover once a failure occurs, that ratio gives you your availability. And when we think about designing software, if you're in a critical environment, you have to plan for failure, and that doesn't have to mean software for airplanes. It could be that you work at a bank and you have hundreds of millions of dollars going through the system. You have to expect that other software systems are going to lie to you, that physical actuators and sensors may not behave as expected, and that the hardware you're running on may also be unreliable. So one of the things that becomes important when you work in critical systems is being able to determine how robust your system is in the presence of these failures.
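The availability ratio can be computed directly; the numbers below are illustrative, not from the lecture. They also show the earlier point that an unreliable system with fast recovery can be more available than a more reliable one that is slow to repair.

```python
def availability(mtbf, mttr):
    """Availability = MTBF / (MTBF + MTTR), both in the same time unit."""
    return mtbf / (mtbf + mttr)

# A system that fails every 100 hours but reboots in 0.1 hours is more
# available than one that fails every 1000 hours but takes 50 hours to repair:
print(round(availability(100, 0.1), 4))   # 0.999
print(round(availability(1000, 50), 4))   # 0.9524
```

So reliability (the MTBF term) and availability (the ratio) really are different numbers, tied together by how quickly you can recover.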
And where this comes into the testing process is that when we look at the expected behavior for tests, we're going to have to set up a testing environment where we can cause some of the inputs to be unreliable, and the system should still do the right thing. We may even have a stress test where we pull the plug on certain pieces of hardware. This is, in fact, what Netflix does to test the reliability and robustness of its systems: it has something called Chaos Monkey, an automated testing tool that just nukes certain pieces of software and hardware from time to time, and then they measure how well the system responds. So I'm going to talk, just for a minute, about how one piece of guidance for critical systems looks at planning for failure, and that example is DO-178B, which is used for airplanes. What we do is categorize pieces of software by asking: what's the worst that can happen if this thing fails? There are five levels of criticality. One is catastrophic, which means that if the software fails, the plane may crash, and in fact it may be likely to crash. Another one is hazardous: it's not going to cause the plane to crash, but it's going to make it really hard for the crew to fly the airplane. Major: in this case, it's going to significantly increase the workload of the crew. Minor: in this case, it's an annoyance. And finally, no effect. What happens is that we're going to drive the rigor of the testing process based on the criticality of the software and its required robustness. So if we have a Level A piece of software that has catastrophic failure conditions, we're going to require all kinds of robustness to a variety of different environmental scenarios, and we're going to define a bunch of different objectives that the software has to meet. We're going to talk later on about adequacy criteria for tests: what kinds of things do we have to do before we consider the system to be adequately tested?
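A Chaos-Monkey-style experiment can be sketched as a toy simulation. This is an invented model, not Netflix's actual tool: we randomly "nuke" some replicas and then check whether the system can still respond.

```python
import random

def chaos_experiment(replica_count, kills, seed=42):
    """Randomly kill `kills` replicas, then report whether any survive."""
    rng = random.Random(seed)  # seeded so the experiment is reproducible
    alive = set(range(replica_count))
    for _ in range(kills):
        if alive:
            alive.discard(rng.choice(sorted(alive)))  # nuke one live replica
    return len(alive) > 0  # the cluster "responds" if any replica survives

print(chaos_experiment(replica_count=5, kills=2))  # True: survivors handle the load
print(chaos_experiment(replica_count=3, kills=3))  # False: nothing left to respond
```

A real chaos test would kill actual processes or machines and then run health checks against the live system, but the measurement idea is the same: inject failures deliberately and observe whether service continues.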
And what we're going to do is match the level of criticality of the software to the rigor of the adequacy measure. So basically, what we're trying to do is be systematic about determining how important a piece of software is, and then how much time and effort to spend verifying it. Just to recap, what we talked about here is dependability: how much confidence can we place in a piece of software, based on the way it was developed, the testing process that we used to remove errors from it, and the mechanisms built into our design processes or our language that prevent certain classes of faults. And the way that we measure dependability, the way we talk about it, is: first we have faults, which are the programmer's errors in understanding of the situation. Then the latent errors, which are those errors in the brain turning into errors in the code. Then those turn into effective errors when we're executing the program and we actually hit that erroneous piece of code, which can turn into failures if we don't have any fault tolerance techniques in place that can respond to those errors. And when we look at testing, testing is very important for building dependable software, but it's only one part of an effective strategy for error removal and for creating dependable software. We have to look at fault avoidance, fault tolerance, error removal, and error forecasting. And in order to create dependable software, it's not even enough for our programs to do the right thing. When we're talking about critical systems, we have to plan for failure. We have to have software that's robust. So it's not enough that, if all the inputs match our expectations, the system behaves as intended; we have to be robust to situations where the inputs don't match our expectations, and be able to respond to those situations.
And so what we're going to do is when we define testing regimens, we're going to determine the rigor that's necessary based on the criticality of the software that we make. And just for some references, here is where that information comes from. Thank you.