Hi, and welcome to today's lecture on neural networks. Motivated by improvements in neural networks and deep learning, people have applied this technology in many areas, and machine translation in particular has been significantly improved by neural network technology. Before we look in the next lectures at how we can use neural networks in machine translation, we first want to take a look at the basic principles of neural networks: what a neural network is, how we can use one, and for what tasks we can use it. First, we will look at the basic units of neural networks. We will start with the basic perceptron, which is the basic unit of every neural network. But we will see that a single perceptron does not work very well on its own. Therefore, we will then introduce the multilayer perceptron, or feedforward neural network, which is the most commonly used type of neural network. After introducing the multilayer perceptron, we will have a look at how to train a neural network. The two most important things here are, on the one hand, the error function, which measures how good the neural network currently is, and on the other hand, the backpropagation algorithm, which improves the neural network by looking at how each of the parameters contributes to the error and then adjusting each parameter accordingly. So let's start directly with the basic unit of a neural network. The basic unit of a neural network is the perceptron, as shown here on the right side of the slide. First, we have the input of the perceptron, which we will often refer to as a feature vector. The input is shown here as X; it is the vector X1 to Xn, so it is an n-dimensional vector. Commonly, we use X0 as a bias input, which always has the value 1, so that the weight W0 acts as the bias. The rest of the input describes the actual input to our neural network. The first thing the perceptron does is calculate, here in green, the weighted sum of all the inputs.
So we calculate X1 times W1 plus X2 times W2, and so on, until Xn times Wn. This is the first step of the perceptron. The second step is the activation function. Commonly used activation functions are, for example, the sigmoid function or the hyperbolic tangent. This function is applied to the weighted sum, and the result is the output of the perceptron. So in the end, the output is the activation function applied to the weighted sum of the inputs and the weights, and the parameters of this perceptron are the weights W0 to Wn. Although people were at first very enthusiastic about this model, researchers soon found out that the perceptron cannot solve a lot of problems. One drawback of this approach is that the perceptron is a linear classifier, so it can only solve problems where we can linearly separate the input into the two classes. A very famous example which cannot be solved by a perceptron is the XOR problem. Imagine you have a perceptron which gets two values as input and should output the XOR of these two values: if both of them are 0 or both of them are 1, the output should be 0, and if one of them is 1 and the other is 0, the output should be 1. This quite simple problem cannot be solved by a perceptron. Therefore, people became interested in more complex models which can also solve problems like this. A way to extend the perceptron so that it can also solve nonlinear problems, which are very common, is the so-called multilayer perceptron, or feedforward neural network. The idea here is that we connect several perceptrons into a network and thereby get a network of perceptrons, or multilayer perceptron. On the right side we see an example of such a multilayer perceptron. As you can see, the nodes can be separated into different layers: first we have the input layer with the input nodes, then the hidden layer with the hidden nodes, and finally the output layer.
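The two steps just described, the weighted sum and the activation, can be sketched in a few lines of Python. This is a minimal illustration assuming a sigmoid activation; the function name is my own, not from the slide.

```python
import math

def perceptron(x, w):
    """x is the input vector x1..xn; w is [w0, w1, ..., wn], where w0 is
    the bias weight for the constant input x0 = 1."""
    weighted_sum = w[0] + sum(xi * wi for xi, wi in zip(x, w[1:]))
    return 1.0 / (1.0 + math.exp(-weighted_sum))  # sigmoid activation
```

For any choice of weights, the decision boundary perceptron(x, w) = 0.5 is exactly the hyperplane where the weighted sum is 0, which is why a single perceptron is a linear classifier and cannot represent XOR.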
You can see that all the nodes of one layer are connected to all the nodes of the next layer: the input nodes are connected to each node of the hidden layer, and each node of the hidden layer is connected to each node of the output layer. Let's have a more detailed look at each of these layers. First we have the input layer, where each node represents one input value; in this case, for example, we can input an n-dimensional value. Next, in the second layer, we have the hidden nodes. Each of these hidden nodes is a perceptron of its own, so each of these nodes will first calculate a weighted sum over all the input nodes and then apply the activation function to this value. One important thing about the hidden layer is that here we have simplified the picture by using only one hidden layer, but we are not limited to a single hidden layer: we can also have two, three, or even more hidden layers, where each hidden layer is connected to all the hidden nodes in the next hidden layer. Finally, we have the output layer, which is connected to the hidden layer and calculates the output. The number of output nodes again depends on our task: if we want a two-dimensional output, we need two output nodes, and if we want an output with five different values, we need five output nodes. So the sizes of the input and output layers are defined by our task, while the size of the hidden layer is a hyperparameter, which we can select in order to get the best performance on our task. As I already said, each node in the hidden layer and in the output layer is a perceptron, so the connections are all weighted connections between a node in one layer and a node in the next layer, and the weight is then used in the calculation of the weighted sum.
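The wiring of these layers can be sketched as follows, assuming one hidden layer and sigmoid activations throughout; all names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights):
    """Apply one layer of perceptrons; each row of `weights` is
    [bias, w1, ..., wn] for one node of the layer."""
    return [sigmoid(w[0] + sum(x * wi for x, wi in zip(inputs, w[1:])))
            for w in weights]

def forward(x, hidden_weights, output_weights):
    hidden = layer(x, hidden_weights)     # activations of the hidden nodes
    return layer(hidden, output_weights)  # activations of the output nodes
```

With more hidden layers, forward would simply chain further layer calls, each one consuming the activations of the previous layer.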
When we want to train this neural network, the task we have to fulfill is to select weights so that the inputs are mapped to the correct outputs. This training is normally done in the following way. First, we initialize the weights randomly. If we then give the network some input, we get an output, which of course will not be the correct one, since we just used random weights. During training, the first step is to calculate the error of the current network: we look at what output the network currently generates given the current weights, and compare it to some reference output we want to have. Given this error, we then change the weights in order to lower the error, and in the end we hopefully learn a mapping function with a very small error, or no error at all. So let's start with the first step of this training, which is to calculate the error of the current network. Remember that we now have some weights, which may be randomly initialized or already trained for some time, and we want to know how good our network currently is. This type of training can be used in a supervised learning scenario; that means we have labeled data, where for each input we are also given the target, which is the correct output. We now need to train our neural network so that the output it generates is the same as the target. The task of the error function is to compare the output and the target and see how well this task is fulfilled. While this may at first seem very simple, the dimension of the output and the target can get very high, and we have to decide what a good measure of similarity between the output and the target is. So the main task of the error function is to measure the difference between the output and the target, and over the years several metrics have been proposed to compare the output of the neural network with the target.
The two error functions commonly used to train neural networks in this scenario are the ones shown here at the bottom: first the mean squared error, and second the cross entropy. Let's have a detailed look at each of these two error metrics. First you see that we sum over all examples, so these error metrics do not compare just one example but are able to compare several examples; the first sum runs over all the examples we are considering. Then we have the second sum, which goes from i equals 1 to n and compares each dimension of the output individually. Here n is the number of output neurons, and for each of these output neurons we compare the output with the target and calculate the difference between the two. For the mean squared error, we take the square of this difference: for each output node we take the difference between the output and the target, square it, and then sum first over all output dimensions and second over all the examples we are considering. The second error function that is very often used is the cross entropy. If we look at it, we again see the two sums: first we sum over all the examples, and then we sum over the output dimensions. The difference lies in how we compare the output and the target: in this case, we take the product of the target and the logarithm of the output. This type of cross entropy is used when we want to predict a class. In this case, one of the targets is 1 and all the others are 0, and the output should be a probability distribution, where each value is the probability that the input belongs to that class. If we look at the error function in this case, only the dimension where the target value is 1 contributes to the error, because in all the other dimensions T is 0 and, therefore, the product is also 0.
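The two error functions can be written out directly, summing over the examples and the output dimensions exactly as in the formulas; note that constant factors (such as a 1/2 in front of the squared error) vary between textbooks.

```python
import math

def mean_squared_error(outputs, targets):
    # sum over examples, then over output dimensions, of (t - o)^2
    return sum((t - o) ** 2
               for out, tgt in zip(outputs, targets)
               for o, t in zip(out, tgt))

def cross_entropy(outputs, targets):
    # only dimensions with t > 0 contribute; for a 1-of-n target this is
    # just -log of the probability assigned to the correct class
    return -sum(t * math.log(o)
                for out, tgt in zip(outputs, targets)
                for o, t in zip(out, tgt) if t > 0)
```

For a perfectly classified example, cross_entropy([[1.0, 0.0]], [[1, 0]]) gives log(1) = 0, matching the argument above that a correct prediction contributes no error.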
The error is then minimized if the output probability for this dimension is very high; in the best case, where we classify correctly and the output probability for the correct class is 1, we have the logarithm of 1, which is 0, and therefore no error at all. So now we know how we can measure how well the neural network is currently performing its task. The next step is: how can we change the neural network in order to improve its performance? The algorithm we use for performing these updates is stochastic gradient descent. The main idea is that we look at how big the current error is and then try to change the weights in order to minimize it. So whenever we change the weights, we want to lower the error, and we do this by calculating the derivative. How does gradient descent work? We know that when we change one weight, we also change the output, and therefore we get a different error. An example is shown here on the right side: on the x-axis we show different values of the weight, and on the y-axis we see the error. We see that sometimes the error is very low, as for example here, and in other cases we have a very high error, as for example here. The problem is that we do not have a nice convex curve; this function can be very complex, and we only know the error at the current value of our weight, which is here. Since the number of parameters is very high and they all interact, we cannot simply try all possible combinations. So, given that we know the error for the current value of the weight, the question is: how should we change this value in order to minimize the error? Remember that we do not know the whole curve, and we do not know that the performance is best here; we only know the performance at this position. So what we do is calculate the derivative of the error function with respect to the parameter.
If we know the derivative, as here, we know that it points in the direction in which the error is increasing. The main idea is therefore to change the weight in the opposite direction and thereby decrease the error. This leads to the update function we see here: we update our weight wi by subtracting eta times the derivative of the error function with respect to this weight, where eta is the learning rate. In stochastic gradient descent, we always randomly choose one of our examples, calculate the error function, calculate the derivative of the error function with respect to the weights, update our weights, and then go to the next example. There we again calculate the error, now on the new example with the new weights, again calculate the derivative, and do another update. We continue like this for quite a while until we hopefully arrive at a stable value. So now we nearly know how to train a neural network. The last question we have to address is: how do we calculate the derivative of the error function with respect to a weight? This is done using the backpropagation algorithm. The idea is that we first calculate the derivative of the error function with respect to the output, and then, since all the functions we apply in the neural network are differentiable, we can backpropagate the error to the individual weights and thereby calculate the derivative of the error with respect to each weight. So first we start with our output. We saw two typical error functions, and we can directly calculate the derivative of the error function with respect to the output using the normal rules of differentiation. Then we know that this output unit is just a simple perceptron, so the next step is to calculate the derivative with respect to the input of the activation function; for this, we just have to calculate the derivative of the activation function.
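The update rule w ← w − eta · dE/dw can be made concrete with a one-weight toy example; the "network" (output = w · x), the training example, and the learning rate are illustrative assumptions, not values from the lecture.

```python
def sgd_step(w, x, t, eta=0.1):
    output = w * x                   # forward pass of the toy network
    grad = -2.0 * (t - output) * x   # dE/dw for E = (t - output)^2
    return w - eta * grad            # step against the gradient

w = 0.0
for _ in range(100):
    w = sgd_step(w, x=1.0, t=3.0)    # repeatedly update on one example
```

Each step moves w a little in the direction that lowers the error; here w approaches the value 3.0, at which the error is zero.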
Then we can apply the chain rule and thereby get this value. For the weighted sum, we can directly calculate the derivative, and then we already have the first derivative for this weight. We can then backpropagate these values to the previous weights, calculate for every weight of the neural network its contribution to the error, and then update all of our weights. This algorithm is commonly called backpropagation, since we backpropagate the error from the output backwards to the input of the neural network. So, in summary, what did we learn today? We looked in detail at the multilayer perceptron, which is again shown here on the right side. We have learned that all the nodes are perceptrons, which are connected to the nodes of the previous layer, so we always have connections between one layer and the next. And we learned the three main types of layers: first the input layer, then the hidden layer, of which we can have one or several, and then at the end the output layer. If we want to calculate the values of the output layer, we put in the input, then calculate the activations of the hidden units just as we defined for our perceptron, and continue until we are at the output layer; this is called the forward pass. If we want to train our neural network, we have to perform several steps: first the forward pass, which calculates the activations of the output neurons; then we measure the error using one of the error functions; then we do the backward pass, where we calculate the derivative of the error with respect to each of the weights; and finally, we apply the update rule for our weights and use the gradient descent algorithm to change the weights in order to minimize the error of the neural network.
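These steps, forward pass, error measurement, backward pass, and weight update, can be sketched end to end on the XOR problem from earlier. This is a compact illustration, not the lecture's own code: the network size (2 inputs, 2 hidden units, 1 output), seed, learning rate, squared-error loss, and number of steps are all illustrative choices.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# each unit stores its weights as [bias, w1, w2]
hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
out = [random.uniform(-1, 1) for _ in range(3)]
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
eta = 0.5  # learning rate

def forward(x):
    h = [sigmoid(w[0] + w[1] * x[0] + w[2] * x[1]) for w in hidden]
    o = sigmoid(out[0] + out[1] * h[0] + out[2] * h[1])
    return h, o

def total_error():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

error_before = total_error()
for _ in range(10000):
    x, t = random.choice(data)           # stochastic: pick one example
    h, o = forward(x)
    delta_o = (o - t) * o * (1 - o)      # error signal at the output unit
    for j in range(2):                   # backpropagate to the hidden units
        delta_h = delta_o * out[j + 1] * h[j] * (1 - h[j])
        hidden[j] = [hidden[j][0] - eta * delta_h,
                     hidden[j][1] - eta * delta_h * x[0],
                     hidden[j][2] - eta * delta_h * x[1]]
    out = [out[0] - eta * delta_o,
           out[1] - eta * delta_o * h[0],
           out[2] - eta * delta_o * h[1]]

print(total_error() < error_before)      # training should lower the error
```

Note that delta_o and delta_h are exactly the chain-rule terms described above: the derivative of the error with respect to each unit's weighted sum, passed backwards from the output layer to the hidden layer. Whether the network learns XOR perfectly can depend on the random initialization, but the error should decrease during training.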