Good. We've looked at descriptive statistics before, but that was very easy because we had created computer variables that simply held lists of values from a simulation that we controlled. Now the values live inside a dataframe, which is much more common. We actually import spreadsheets using the DataFrames package, and the data then lives as a dataframe. So when we want to describe the data through descriptive statistics, things look a little different. Notice that I didn't type StatsBase dot in front; we're just going to say describe, that's all we need to do, because we've already imported StatsBase. I'm going to say, "Describe the data." What that does is look at all the columns and do its best at describing each one based on the data type that it found. We see eltype, the element type, and based on that the dataframe decides how best to describe the column. Look at Age. That's a 64-bit integer, so it's a numerical variable, and describe knows it can do a mean, a minimum, a median, and a maximum. So those are the descriptive statistics it decided it could do. We also see number unique there; it shows nothing for Age because that's not a categorical variable. Number missing is very important if we have missing data. If you ever work with data, try not to have any missing data; try to impute it with something inside your spreadsheet file before you bring it in. That is a whole different problem, just getting your data into shape. By that I mean that the fields are correct and the data is entered correctly before we start doing the analysis. This little module that we've created for you is an idealized world where the data that we brought in is absolutely perfect. In reality, getting the data into shape is where you actually spend most of your time. The actual analysis in Julia you can do with a smile on your face. It's that quick and easy.
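As a rough sketch of the describe call discussed here, the code below builds a small simulated dataframe and summarizes it. The seed, the column name WCC, the value ranges, and the third Result level ("Static") are assumptions made for illustration and are not taken from the lecture. Recent DataFrames releases do not show the number-unique column by default, so the sketch asks for it explicitly.

```julia
using DataFrames, Random

Random.seed!(1)   # hypothetical seed, only so the sketch is repeatable
n = 100

# Simulated stand-in for the lecture's dataset (names and ranges are assumptions)
data = DataFrame(
    Age       = rand(18:80, n),                          # 64-bit integers
    WCC       = rand(n) .* 10 .+ 5,                      # 64-bit floats
    CRP       = rand(n) .* 100,                          # 64-bit floats
    Treatment = rand(["A", "B"], n),                     # categorical, 2 levels
    Result    = rand(["Improved", "Static", "Worse"], n) # categorical, 3 levels
)

# One row per column of `data`; the statistics are chosen by element type.
# String columns get no mean or median, and min/max are alphabetical.
overview = describe(data, :mean, :min, :median, :max, :nunique, :nmissing, :eltype)
println(overview)
```

With string columns, the nunique entry gives the size of the sample space (2 for Treatment and 3 for Result in this simulated data).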
So let's look at white cell count. Again, describe has decided that all these elements are 64-bit floating point values, so it could do the mean, the minimum, the median, and the maximum. CRP, exactly the same. Now we get to our two categorical variables. It noticed that they were both strings, and it still reports a minimum and a maximum. What it did there was simply order the values alphabetically. So for Treatment it's going to be A as the minimum and B as the maximum. For Result, where we had Improved and Worse, that's the I and the W. So it just does that alphabetical ordering for us. Very nicely, though, we also see the number of unique values, which is the sample space for that variable. It said that for Treatment the sample space had two elements in it, and rightly so: there was just A and B. The sample space of possible elements for the Result variable was three, and we designed it as such, so we understand where the three comes from. But when you bring in a big database or a big spreadsheet file, it is very easy to see through these unique values what the sample space of a categorical variable was. Now let's start to wrangle this data, because we want some answers here. The fifth question that I want an answer for is: how many patients actually ended up in group A and in group B? If we go down the Treatment column, we saw A, A, B, A, B. How many were As and how many were Bs? Remember, we drew from an unweighted uniform distribution, so it should be about 50-50. Each of the 100 draws was equally likely to be A or B, but that's not to say we end up with exactly 50 of each. So let's see what actually happened in my instance; yours is going to be different. One way that we can go about this is to use the by function. Now, at the moment, for me personally and on my system, DataFrames is not working exactly the way that I intend or think it should work.
One of the functions that works perfectly for me is the by function. I use it a lot in my data analysis with Julia 1.0 because it does the job for me, so I really want to concentrate on this function, by. It's part of the DataFrames package. The first argument is the dataframe that we're interested in, so that's just data. The second one is the column that I'm interested in, and that's the Treatment column. So: go down the Treatment column of my data dataframe and do the following for me. Now you will see this little anonymous function with the arrow operator, the minus and greater-than sign. I'm going to create a temporary variable df, and the function is going to return a dataframe where I say N, for counting, equals the size of df comma one. Let me show you what that returns, and you'll see that it answers exactly the question we had about how many As there were and how many Bs. Very clearly you can see I've got 39 As in my instance and 61 Bs from a uniform distribution. This was a bit of an outlier. When you run the code, of course, you're going to see something else, but you see that it has created a dataframe for me. So this third argument creates a temporary dataframe that holds the variable N. I chose N because in statistics it's the usual symbol for a count; it's a good placeholder symbol. What I want it to do is give me back the number of elements down that column. So we see that for A there were 39, and for B, 61. Now, there's a slightly easier way to go about this than the anonymous function we have up there: let's just pass size as the third argument. So instead of pulling out only the count, I ask for the whole size. Let me show you this easy one, and this is the one I use most of the time. You've got to interpret it slightly differently, but really, it's easy to read. For A it just returns 39 comma 5.
So that's number of rows comma number of columns. In the version before, the comma one meant we only wanted the first element of that tuple returned to us, only the 39 and the 61, which is nicely what we got. Here I'm going to see the number of rows, comma, the five, which is still the number of variables, the number of columns. We know it's five, so I just ignore that when I see it. I'm interested in the first element of the tuple, the 39 and the 61, and it's much easier for me to write size than to write that whole function up there. So I like to do that. Now, instead of counting them, we could also get a descriptive statistic of one of the other variables in the dataset. So here I'm asking: please group by, and it's the by function, group by the Treatment column. We know the Treatment column holds categorical values, A and B. And give me the mean of the ages of those that belong to A and those that belong to B. That's why it's important for us to know this little function up here, because I'm doing the same thing again. I'm saying df, then my minus greater-than sign, and then do the mean for me of df.Age. That's also why it's so important not to put spaces or illegal characters in the names that you give the columns of your dataframe. Please don't do that, because now I can just refer to the column as df.Age. That df, remember, is just a placeholder that we created; it stands in for the part of the data dataframe that belongs to each group, and df.Age refers to one of its columns, the Age column. So when I run this code, very nicely, I can see the mean of the Age column separated out: the mean of the Age values only for the A patients and only for the B patients. I can very quickly see that there is, in my instance here at least, a large difference between the two. Same for standard deviation, exactly the same here.
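To make the group-by pattern concrete, here is a minimal sketch on a tiny, made-up dataframe (five hypothetical patients, not the lecture's data). The lecture uses the by function from the DataFrames release that was current for Julia 1.0; newer DataFrames releases removed by in favor of groupby together with combine, so that is what the sketch assumes. The output column names N, MeanAge, and SDAge are chosen here for illustration.

```julia
using DataFrames, Statistics

# Hypothetical toy data: two patients in group A, three in group B
data = DataFrame(
    Treatment = ["A", "B", "A", "B", "B"],
    Age       = [50, 40, 60, 44, 48],
)

groups = groupby(data, :Treatment)   # one sub-dataframe per treatment group

# Count per group with an anonymous function, mirroring the lecture's
# df -> DataFrame(N = size(df, 1)) idea
counts = combine(groups, df -> DataFrame(N = size(df, 1)))

# Mean and standard deviation of Age within each group
means = combine(groups, :Age => mean => :MeanAge)
sds   = combine(groups, :Age => std  => :SDAge)

println(counts)
println(means)
println(sds)
```

In each result the groups appear in the order they are first seen in the Treatment column, here A and then B.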
Instead of mean, I'm using the built-in function std for standard deviation, and I can see the standard deviation of each group very quickly. So I can start understanding the data through these descriptive statistics. Because it's inside a dataframe, I can start to think, "Oh, there might be something in here." If I do a t-test, is there going to be a statistically significant difference between the ages of these two groups of patients? We're slowly starting to tease out the information in the data just by describing it, through descriptive statistics. Very powerful. Now, I can also just use the StatsBase describe function. If we do that, we get A and B separately. Unfortunately, it's not going to say which is which. You see one set of summary stats there and another there, but they will be in the order that the groups are listed down here. So the first one is going to be A and the second is going to be B. We see the 55 that was the mean for A and the 47 for B, but now we can also see the minimum, the quartiles, and then the length and the type. So this, for me, is a very powerful function, and something that we use all the time. By, then the dataframe, then the column that we want it grouped by, and then the statistical function that we want to describe with: that works well for us because it brings out all the information in one go. So that's fantastic. Now that we're starting to understand the data, I want you to play around and try some others. Group it by the Result column, for instance, use one of the other numerical variables for the description, and see what you get. Next up, we're going to visualize the data, and that's an even richer way to try to understand the knowledge that this data is hiding.
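One possible way to get the per-group summary while keeping track of which summary belongs to which group is sketched below. It uses summarystats from StatsBase, which returns the mean, minimum, quartiles, and maximum as an object, rather than the describe call from the lecture, and it runs on a small hypothetical dataframe rather than the lecture's data.

```julia
using DataFrames, StatsBase

# Hypothetical toy data: two patients in group A, three in group B
data = DataFrame(
    Treatment = ["A", "B", "A", "B", "B"],
    Age       = [50, 40, 60, 44, 48],
)

# Map each treatment label to the summary statistics of Age in that group,
# so there is no need to rely on the order in which the groups are listed
stats = Dict(first(sub.Treatment) => summarystats(sub.Age)
             for sub in groupby(data, :Treatment))

for label in sort(collect(keys(stats)))
    println("Treatment ", label)
    println(stats[label])
end
```

Swapping :Treatment for another categorical column and sub.Age for another numerical column gives the kind of variation suggested in the exercise.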