Let's continue describing the way in which we set up and solve reinforcement learning problems. The MD, who in this case is our running example for introducing the concept of reinforcement learning, interacts with the patient through a series of states, actions, and rewards. Here, the state of health of the patient at time t-1 is S_{t-1}. The doctor takes an action A_{t-1} at time t-1: prescribes a medication, considers a procedure, and so on. Then a reward R_{t-1} is realized and the state of health changes to S_t. An action A_t is again taken by the doctor, we get a reward R_t, and the state of health changes to S_{t+1}. So the thing to notice about reinforcement learning is that the interaction, in this case of the doctor with the patient, is characterized by a sequence of states, actions, and rewards: state, action, reward, new state, new action, new reward, and so on, followed sequentially.

The goal of the doctor, and the goal of reinforcement learning, is to develop a policy that defines for the doctor the optimal action to take when presented with a patient in state s. So the goal of reinforcement learning is to learn this policy, and the policy will effectively define the standard of care: how a doctor should choose care for a patient. The optimal policy should maximize the average reward over time, and it should account both for the patient outcome and for the cost of delivering care. The policy should also be non-myopic, which means that it should think ahead: not only consider the most immediate impact on the patient, but also the long-run impact. In the context of this non-myopic characteristic, we would typically like to weight near-term impacts more heavily than what happens in the long run; but nevertheless, with a non-myopic policy, we want to account for the reward immediately as well as into the future.

The challenge we face when we solve a reinforcement learning problem is that we typically do not know the underlying distribution P(s, a, s'). Recall that P(s, a, s') is the probability that, given a patient in state s, taking an action a will cause the patient to transition to state s'. That underlying probability is typically unknown. So the question is: how can we learn a policy that is optimal without having access to that underlying probability P(s, a, s')?

Conceptually, what we can do is simply experience the world: try things and record what happens. The idea is to adapt the policy so that, over time, we reinforce actions that lead to good outcomes and discourage actions that lead to poor outcomes. This is in fact the heart of the term reinforcement learning. In reinforcement learning, we are going to experience the world: we see the state of the world, we take an action, we see the new state and the reward, and we do that repetitively. The idea is that, through experiencing states, actions, rewards, and new states, we hope to learn a policy that is optimal. A minimal sketch of this idea in code follows below.
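One standard way to make this concrete is tabular Q-learning, sketched below. To be clear about what is assumed: the health states, actions, costs, and transition probabilities inside simulate_step are hypothetical stand-ins invented for illustration, not anything from the lecture; they exist only so the learner has a world to experience. Note that the learning loop never reads those probabilities directly; it only ever sees (state, action, reward, next state) tuples. The discount factor gamma < 1 is what makes the policy non-myopic while still weighting near-term rewards more heavily.

```python
import random

# Minimal tabular Q-learning sketch. The learner only sees
# (state, action, reward, next state) experience tuples; it never
# reads the transition model inside simulate_step.

STATES = ["sick", "stable", "healthy"]          # hypothetical health states
ACTIONS = ["medication", "procedure", "wait"]   # hypothetical actions

def simulate_step(state, action):
    """Stand-in for the unknown world P(s, a, s'): returns (reward, s')."""
    # Hypothetical dynamics: treatment tends to improve health; waiting is risky.
    improve = {"medication": 0.6, "procedure": 0.7, "wait": 0.2}[action]
    idx = STATES.index(state)
    if random.random() < improve:
        idx = min(idx + 1, len(STATES) - 1)
    else:
        idx = max(idx - 1, 0)
    next_state = STATES[idx]
    cost = {"medication": 1.0, "procedure": 5.0, "wait": 0.0}[action]
    outcome = {"sick": 0.0, "stable": 5.0, "healthy": 10.0}[next_state]
    return outcome - cost, next_state   # reward = patient outcome minus cost of care

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # gamma < 1: non-myopic, near term weighted more

state = "sick"
for _ in range(50_000):
    # Mostly follow the current policy, occasionally try something else.
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    reward, next_state = simulate_step(state, action)
    # Reinforce: move Q(s, a) toward the observed reward plus discounted future value.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state

policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
print(policy)   # the learned "standard of care": best action for each state
```

Run long enough, the update rule raises the value of actions whose observed consequences were good and lowers the value of actions whose consequences were poor, which is exactly the reinforcing and discouraging described above.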
Optimal in the sense of achieving an outcome, or reward, for the patient that is on average very good: good for the patient and also good for the health system from the standpoint of keeping costs under control. So reinforcement learning is the formulation of this fundamental challenge. We experience the world as a series of states, actions, rewards, and new states, then a new action, a new reward, a new state, and we do that repeatedly. Based upon that experience, we would like to learn an optimal policy. This is a very fundamental problem; this is what reinforcement learning is, and the reinforcement learning methodology that we're going to talk about addresses this challenge.

The thing that is important to recognize is that while we have used health and medicine as our underlying thematic example, largely because it's intuitively understandable to most people, this construct of trying to develop an optimal policy over time arises in other settings as well. For example, consider the case in which you are operating the machine floor of a factory, and you would like to monitor the health of the machines and determine your maintenance schedule: when should you take a particular machine offline and repair it or tune it up? It would probably not be a good idea to wait for machines to break, so what is the optimal way to monitor the health of machines, and what is the optimal way to maintain that health? This is also a reinforcement learning problem, and it maps onto the same interface, as the sketch at the end of this section shows.

Another example is investing. If you look at the prices of companies over time, you would like to make good investments, but it is very difficult to understand the underlying statistics of the stock market. So what you might try to do is develop a policy that guides which stocks you buy and which stocks you sell, in a way that, over time, yields on average a positive investing outcome.

So the thing I want to highlight from this slide is that while we're going to use medicine as our underlying theme, as in the past and as we move forward in these lectures, this construct of reinforcement learning is very general. We also talked about it in the context of the game Go and a machine playing a human in a game like Go. These are all sequential decision problems, and reinforcement learning, which we're now going to start to dive into in greater detail, is a fundamental machine learning framework for solving each of them.
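To make that generality concrete, here is a sketch of how the factory-maintenance example maps onto the same state/action/reward interface. Again, the state names, actions, costs, and dynamics are hypothetical stand-ins invented for illustration.

```python
import random

# The same interaction loop, re-skinned for the factory-maintenance example.
# Only the meaning of the states, actions, and rewards changes; the learning
# problem, and the Q-learning loop sketched earlier, stays the same.

STATES = ["worn", "ok", "tuned"]            # hypothetical machine health levels
ACTIONS = ["run", "tune_up", "overhaul"]    # hypothetical maintenance actions

def simulate_step(state, action):
    """Hypothetical stand-in for the unknown machine dynamics P(s, a, s')."""
    if action == "overhaul":
        return -50.0, "tuned"               # take the machine offline: costly, fully restores it
    if action == "tune_up":
        return -10.0, "tuned" if state != "worn" else "ok"
    # action == "run": earn output, but the machine degrades and may break
    if state == "worn" and random.random() < 0.2:
        return -200.0, "worn"               # breakdown: the cost of waiting for machines to break
    idx = STATES.index(state)
    if random.random() < 0.3:
        idx = max(idx - 1, 0)               # gradual wear from running
    return 20.0, STATES[idx]                # production reward for a completed run
```

The design point is that only simulate_step changes between domains; the earlier Q-learning loop can be reused unchanged against this environment to learn a maintenance policy.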