So this is the framework of reinforcement learning, which is characterized by the goal of developing a policy: a rule for taking actions when the world is presented to us in a certain state, such that over time we receive a large reward on average. Our goal in reinforcement learning is to learn that policy, the policy of which action to take when we're in a given state. Now, I recognize that this construct may seem rather abstract and hard to fully grasp, so what I'd like to do now is give a concrete example, hopefully easily understood, that will make these somewhat abstract constructs more understandable. So let's consider again our doctor as an example, and let's assume we have a diabetes doctor, a doctor who deals with diabetic patients. The goal of this doctor might be to develop a regimen that keeps the health of a diabetic patient under control. So let's consider how we might set this problem up as a reinforcement learning problem, where over time the doctor would like to devise a policy that specifies the optimal action to take for any state of health of a diabetic patient. What we're going to do in this example is characterize the state of health of the patient, which, remember, is denoted by s. Let's represent the state of health of the patient, from the standpoint of a diabetes doctor, as the minimum and maximum glucose concentration from the previous day for that patient. So these are two numbers: from the standpoint of the doctor, the state of health of the patient on the previous day is characterized by the minimum glucose concentration from the previous day and the maximum glucose concentration from the previous day. Now, given that state, given those two numbers, the doctor would like to specify which action to take, ideally to get the glucose level of the patient under control.
So here, in the context of our diabetes doctor, let's assume there are two things the doctor can control from the standpoint of medication: the rate of continuous insulin supply and the bolus dose. Let's say we have a diabetic patient who is being supplied insulin; the doctor can specify the rate of continuous insulin supply as well as the bolus dose. So the action the doctor takes in this case consists of setting the rate of continuous insulin supply and the bolus dose. Our patient is then going to get a reward: the reward the patient receives when, starting in state s, we take action a, and the patient's health or state transitions to s prime. What we would like to do is devise that reward r such that if s prime, the subsequent health of the patient characterized by the minimum and maximum glucose concentration, is better than s, then the reward is high, and if s prime is worse than s, in other words the health of the patient has deteriorated, the reward is low. We assume that we, or the doctor, may specify what that reward is. Now, to work further toward making this concrete, let's look again at our state. The state is characterized by the minimum and maximum glucose concentration from the previous day, and we're going to discretize that continuous range of values into bins. What this means is that while the minimum and maximum glucose concentration levels from the previous day could be any continuous number, we're going to break that range up into bins, and if the min and max glucose concentration levels fall within a particular bin, we assign the health from the previous day to that bin. So this is a discretization of the states of health of the patient.
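To make the binning step concrete, here is a minimal sketch in Python. The bin edges and the helper name `discretize_state` are assumptions for illustration, not clinical values:

```python
import numpy as np

# Hypothetical glucose bin edges in mg/dL; these cut points are an
# assumption for illustration, not clinical guidance.
BIN_EDGES = np.array([70.0, 100.0, 140.0, 180.0, 250.0])

def discretize_state(min_glucose, max_glucose):
    """Map yesterday's (min, max) glucose pair to a pair of bin indices."""
    min_bin = int(np.digitize(min_glucose, BIN_EDGES))
    max_bin = int(np.digitize(max_glucose, BIN_EDGES))
    return min_bin, max_bin
```

With these edges, a reading of (min 85, max 190) lands in bins (1, 4); the same binning idea will apply to the doctor's actions as well.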
In a similar way, the actions, which in this case are the rate of continuous insulin supply and the bolus dose, are also in general continuous numbers, and we're going to discretize those continuous numbers into bins or ranges as well. Instead of being concerned with the exact number for the rate of continuous insulin supply or the bolus dose, we ask which bin that value falls into. So by this setup, we're taking the state of health of the patient and discretizing it, or placing it into bins, and likewise the actions the doctor can take are discretized, or placed into bins. For both the state of health and the actions, we simplify our solution via a discretization step. With that done, we're going to define something called a Q function. The Q function is a function of the state and the action, and it is now an n by m matrix: if the continuous range of states is broken up into n bins and the continuous range of actions is broken up into m bins, we have a table, an n by m matrix, represented by this function Q, which is a function of s and a. Now, s and a are discrete, so consequently Q(s, a) is an n by m matrix. This matrix denotes, as we'll see subsequently, the value of taking action a when in state s. So this Q function, which will be really important as we move forward in reinforcement learning, is the value of taking action a when the patient is in state s. The key goal of reinforcement learning is to learn this Q function, to learn this table or matrix Q(s, a).
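As a sketch, under assumed bin counts (the real n and m depend on the discretization chosen), the Q table is just an n by m array:

```python
import numpy as np

n_states = 6   # assumed number of state bins (n), for illustration
n_actions = 4  # assumed number of action bins (m), for illustration

# Q[s, a] holds the estimated value of taking action a in state s.
Q = np.zeros((n_states, n_actions))
```

Each row of the table corresponds to one discretized state of health, and each column to one discretized insulin-rate/bolus-dose action.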
Because if we learn that Q function, or learn that Q table, then whenever the patient is presented to us in any state s, discretized in the way we talked about, we choose the action a that maximizes that Q function, so that we can maximize the reward to the patient. So the goal of reinforcement learning is to learn this Q function or this Q matrix, and the way we're going to do that is by experiencing the world. Remember how reinforcement learning works: in this case, a patient is presented to us in state s, the doctor specifies an action a, the patient transitions into a new state s prime, and a reward is manifested for that patient; then a new action is taken, the patient transitions into a new state, and this happens repeatedly. Through this repeated sequential process of state, action, new state, reward, over and over, our goal is to learn this Q function or Q table, Q(s, a), which is a good representation of the value of taking action a when the patient is in state s. The challenge, then, is how we learn this table Q(s, a) based upon experience. The way we're going to do this is to initialize the table Q(s, a) in some way. The initialization can be done in multiple ways: if we have some intuition or prior knowledge about which actions a are good for particular states s, we may reflect that in the way we initialize the table, or we may simply initialize Q(s, a) at random. So we initialize it in some way; the initialization is not particularly important. We then have a patient presented to us in state s, we choose an action a, the patient transitions to state s prime, and then we observe a reward.
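A minimal sketch of the two initialization options and the greedy action choice described here; the table sizes and the helper name `greedy_action` are illustrative assumptions:

```python
import numpy as np

n_states, n_actions = 6, 4            # illustrative bin counts
rng = np.random.default_rng(0)

# Option 1: initialize the table at random.
Q = rng.uniform(size=(n_states, n_actions))
# Option 2 (alternative): encode prior knowledge, or start from zeros:
# Q = np.zeros((n_states, n_actions))

def greedy_action(Q, s):
    """Pick the action a that maximizes Q[s, a] for the current state s."""
    return int(np.argmax(Q[s]))
```

Once the table has been learned, `greedy_action` is exactly how the doctor would read a recommended action off the table for a given state.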
The reward is what is manifested whenever the patient is initially in state s, the doctor takes action a, and the patient transitions into s prime. The initial state s and the new state s prime are observable, and after we take action a, the reward is also observable. Based upon those observations, we now want to update our Q function. So the key question is: given these measurements, a patient in state s, an action a, a transition to state s prime, and an observed reward, how should we update the Q function to try to learn a better model? The way we might do this is as follows. Let Q-old, Q superscript old, be our old table: some approximation to the value of taking action a when the patient is in state s. We have now taken some measurements: we have a state s and an action a, we transitioned to state s prime, and we observed a reward. How should we update our Q function to produce what we'll call Q-new, our new table? What we're going to do is take the difference between the reward and our old Q function. Here r(s, a, s prime) is the reward we observe, and Q-old is our old representation of the value of taking action a when the patient is in state s. If the reward r is larger than our previous estimate of the value of taking action a in state s, then we should increase the value of that action. So what you see here is that we take the old Q and add the difference between the new reward and our old Q function. If r, our new reward, is larger than our old Q function, then we increase the value of the Q function in Q-new. If r is less than the old Q function, which means our old Q function overestimated the value of taking action a in state s, then we diminish the Q function.
So what we're doing here is a rather simple construct: we adjust the value of the Q function depending on how the new reward r compares to the old Q function, Q-old. It is, I think, a rather intuitive construct. Written out, the update is Q-new(s, a) = Q-old(s, a) + alpha * (r - Q-old(s, a)). This parameter alpha is called the learning rate, and it is what it says: it tells us the rate at which we're going to learn, or the rate at which we adjust the Q function based upon our observations, and alpha, the learning rate, is a number between zero and one. This equation is going to be at the heart of reinforcement learning; we're going to build upon it, but this is really an important equation for us to understand. The new Q function is equal to the old Q function plus a weighted version of the temporal difference. If the temporal difference, or TD, is positive, that means the immediate reward r is larger than our previous estimate of the value of that Q function, which means we probably underestimated the value previously and should increase it, and that's what this equation does. If r is less than Q-old, that implies we overestimated the value of the old Q function; then r minus Q-old is a negative number, and the temporal difference diminishes the old Q function's value. So the temporal difference compares the immediate reward r to our old estimate of the value of taking action a in state s, and the new Q function is an adjustment of the old one based upon that difference. Again, alpha is called the learning rate; it controls the relative balance between our old estimate of the Q function and our new estimate based upon the observed reward.
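The update rule described here fits in one line of code; this is a sketch of the myopic temporal-difference update, with `td_update` as a hypothetical helper name:

```python
def td_update(q_old, reward, alpha=0.1):
    """Myopic temporal-difference update:
    Q_new(s, a) = Q_old(s, a) + alpha * (r - Q_old(s, a)).
    """
    return q_old + alpha * (reward - q_old)
```

If the observed reward exceeds the old estimate, the value is nudged upward toward the reward; if it falls short, the value is nudged downward; if they agree, the estimate is left unchanged.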
If the temporal difference is positive, then we take the old Q and increase it, reflecting the fact that the value of taking action a in state s is higher than we previously thought. If the temporal difference is negative, then the old Q function is diminished when we compute the new Q function. This equation, which I hope is relatively intuitive, is going to be at the heart of reinforcement learning. So what's going to happen is that, in this case, our doctor is going to see a patient or patients: see states of health, take actions, see new states s prime, measure rewards, and do this repeatedly; over time, through this equation, the doctor learns a Q function that quantifies the value of taking action a for any given state s. After this Q function is learned, it effectively constitutes a policy for our doctor, because for any state of health of the patient, the doctor will know how to choose the action that is most valuable to the patient for that state. So this summarizes the setup. Now, one problem we need to think about is that this setup only accounts for the immediate reward r; it doesn't account for what might happen subsequently to the patient. Let's think about this a little bit. What we're trying to do with this simple equation is learn a Q function that is good at predicting the immediate reward to a patient. But in medicine, as in many fields, you may have a situation where the immediate reward is good but the long-term effects are bad. One example: if you're a doctor and you have a young woman as your patient, you might be able to prescribe a medication that immediately makes that young woman better, and therefore the immediate reward would be very positive.
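Putting the pieces together, here is a sketch of the full myopic learning loop: repeated state, action, new state, reward, with the table updated after each transition. The patient dynamics in `simulate_step` are a toy stand-in invented purely for illustration, not a model of glucose physiology:

```python
import numpy as np

n_states, n_actions, alpha = 6, 4, 0.1   # illustrative sizes and learning rate
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def simulate_step(s, a):
    """Toy stand-in for patient dynamics: random next state, reward 1 if
    the (pretend) health index improves. Purely illustrative."""
    s_prime = int(rng.integers(n_states))
    reward = 1.0 if s_prime < s else 0.0
    return s_prime, reward

s = int(rng.integers(n_states))
for _ in range(1000):
    a = int(rng.integers(n_actions))       # explore by acting at random
    s_prime, r = simulate_step(s, a)
    Q[s, a] += alpha * (r - Q[s, a])       # the myopic TD update
    s = s_prime

policy = Q.argmax(axis=1)                  # learned action for each state
```

After enough transitions, reading off the argmax of each row of Q gives the doctor's (myopic) policy; the next step in the lecture is fixing the myopia itself.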
But there may be side effects to that medication which prevent that woman from subsequently having children, so the long-term impact of that action could be very bad. Therefore, whenever we look at this Q function, if we only learn it based upon the immediate reward, we have a very serious limitation, because we do not take into account the long-term consequences of actions. Recall that previously we talked about the goal of developing a non-myopic policy, which means a policy that takes into account the immediate reward as well as long-term consequences. The setup we have here, while very simple and perhaps intuitive and attractive, is not a good solution, because it only takes into account the immediate reward and not effects down the road. In other words, this solution is myopic, and we therefore do not want it. What we would like to do is take this relatively simple solution and extend it in a rather simple way that will allow us to develop a non-myopic policy.