In this lecture we're going to look at some of the basic protocols for filtering datasets. We've seen a little bit of very simple filtering in previous lectures; today we're going to look at that in a little more detail. So, we'll see some simple protocols and provide some basic Python code to extract useful subsets of our datasets. So why might we want to do this kind of filtering and pre-processing? So far we've been looking at datasets from Amazon and Yelp that have already been cleaned to some extent. They're fairly nice, they're fairly easy to work with. But even with these clean datasets, there could be a number of reasons why we want to further clean or pre-process them. For example, think about the following: we might have certain fields or entries that are just missing, we might have entries that are garbled or poorly formatted, or we might have data that is stale. For example, if we were looking at a review dataset, maybe we'd want to ignore very old reviews because we think they're no longer accurate. We might also have some very strange statistical outliers in our dataset, for instance, reviews that get written on Christmas Day or who knows what. Those could be unusual, and we may want to discard them, or we may want to keep them, depending on what kind of application we're working with. We might want to remove data pertaining to very inactive users, who could also be outliers. Maybe we want to restrict our dataset to a certain demographic, so we can look at a small subset of it, and so on. There are all kinds of different reasons. Alternatively, given the very large datasets we've been working with, we might just want to create smaller subsets for more convenient manipulation. In this lecture, we'll just look at a few ways we can do these types of filtering and pre-processing in Python. The first thing we'll do, of course, is read our dataset. We're going to look at the same Amazon gift card dataset we've been looking at for several lectures now.
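As a rough sketch, that reading code might look something like the following. The file path here is a placeholder, and the exact field names (`helpful_votes`, `vine`, and so on) are assumptions based on the Amazon review data, so adjust them to match your copy.

```python
import gzip
import csv

def read_dataset(path):
    """Read a gzipped, tab-separated review file into a list of dictionaries.

    Field names follow the Amazon review datasets (an assumption for this
    sketch); adjust the conversion lists below if your fields differ."""
    dataset = []
    with gzip.open(path, 'rt', encoding='utf8') as f:
        reader = csv.reader(f, delimiter='\t')
        header = next(reader)  # first row holds the field names
        for line in reader:
            # Store each row as a dict mapping field name -> value
            d = dict(zip(header, line))
            # Convert integer-valued fields to native ints
            for field in ['helpful_votes', 'total_votes', 'star_rating']:
                d[field] = int(d[field])
            # Convert Y/N flags to native booleans
            for field in ['vine', 'verified_purchase']:
                d[field] = (d[field] == 'Y')
            dataset.append(d)
    return dataset
```

Wrapping the logic in a function just makes it easy to reuse later, for instance on the Yelp data.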
So this code should be fairly familiar. All we're doing here is using the gzip library and the csv library to read our data into some convenient data structures, as we introduced in the reading structured data lecture. So we read our file line by line, we store each line as a dictionary object containing the different fields, and we convert Boolean- and integer-valued entries to native Python data types. Okay. So let's just recall what our data actually looks like. The total length of the dataset is about 148,000 reviews. One particular review, in this case the first one, looks something like this: we have fields including the customer ID, the number of helpful votes, which marketplace it's from, etc. These are the fields we're going to be using and manipulating to filter and clean our datasets to produce smaller and more useful subsets. Okay. So the first type of filtering we might consider is to filter our reviews by date. In this case, let's try and remove very old reviews from our dataset. Looking at date data is something we're going to do a lot more of in a later lecture. For the moment, we'll just try and extract the year in which each review was written, and we'll filter out reviews that were written more than several years ago. So the first thing we have to do is take the year from each review and convert that to an integer. That's going to work like the following: for each entry in the dataset, we're going to store an additional field, which is the year as an integer. We'll take the review date, look at the first four characters of that string, and convert those to an int. Recall that the dates of these reviews are actually strings containing a year, month, and day. In this case, we'd just like to extract the year, which is the first four characters, and convert that to an integer. Fairly straightforward. Okay. So we run that code, and oddly enough it raises an error.
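The year-extraction step just described might be sketched as follows; `review_date` matches the Amazon field name, while `yearInt` is just an illustrative name for the new field. As noted, running this on the raw dataset raises an error.

```python
def add_year_field(dataset):
    """Store each review's year as an integer field on the entry.

    Assumes each entry has a 'review_date' string of the form 'YYYY-MM-DD';
    raises a KeyError if any entry is missing that field."""
    for d in dataset:
        # The year is the first four characters of the date string
        d['yearInt'] = int(d['review_date'][:4])
```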
What happened? Why does this fairly simple piece of code throw an error? Well, if we look at the error, it's a KeyError, which essentially says that the review date field is missing, at least from some reviews. So it seems that we have some reviews not containing a review date field, which is a good reason to perform some additional filtering. Before we can convert the date to an integer, we'll first have to remove those reviews that don't contain a date field, i.e., some additional pre-processing. Let's try and remove each instance in this dataset that does not contain a review date. That will look something like this. We'll just use a simple list comprehension: we'll iterate through each entry of our dataset, and we'll only keep those entries where the review date is present. If we look at the length of our dataset again, not sure if you remember the old value, but it seems it's actually only gone down by one review. So for some reason there was a single review in this dataset that was missing a date field, and we've now produced a cleaner version where every single review contains a date. Now we can again try the pre-processing step from before to include an integer year field for every single review, and that should work fine. Next let's try filtering out old reviews. In this case we'll discard those written before 2010. That'll be a fairly simple list comprehension: we'll go through each review in our dataset, look at its integer year field, and ask, "Is it greater than 2009?" Now our new dataset contains 148,095 entries. So only a very small number of reviews were actually that old, and they've now been discarded from this new dataset. Okay. Note that in this case, and in most cases throughout this lecture, we're doing pre-processing using a fairly simple list comprehension. The next thing we might want to do is filter reviews by quality. One notion of quality could be this idea of review helpfulness.
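Putting the whole date-handling pipeline together, a sketch might look like this, with the same assumed field names as before:

```python
def filter_by_year(dataset, min_year=2010):
    """Drop entries with no date, then keep reviews from min_year onward."""
    # First remove entries missing the 'review_date' field entirely,
    # so the year extraction below can't raise a KeyError
    dataset = [d for d in dataset if 'review_date' in d]
    # Store the year as an integer (first four characters of 'YYYY-MM-DD')
    for d in dataset:
        d['yearInt'] = int(d['review_date'][:4])
    # Discard reviews written before min_year
    return [d for d in dataset if d['yearInt'] >= min_year]
```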
We have two fields in our reviews indicating the total number of votes each review received, and, of that total, how many were helpful. So we can filter by helpfulness using another fairly simple list comprehension. It will look something like this: we take our dataset, look at each review, and ask, "Does it have fewer than three total helpfulness votes? Or is its helpfulness ratio greater than a half?" So what's going on here? We're saying, in the first part of this condition, that we'll keep those reviews that have fewer than three total votes. Basically, if only a very small number of users has rated this review so far, we'll keep it, because we don't really have a good sense of its true helpfulness ratio yet. If on the other hand it has received three or more votes, we'll only keep it if its helpfulness ratio is more than a half, i.e., at least half of the people who read this review and voted on it considered it to be helpful. Okay. We'll look at the length again, and we've now reduced the size of our dataset to a little under 148,000. Next we might want to filter our dataset by user activity. As I suggested before, we might want to discard the most inactive users, in this case, users who have written only a single review in our dataset. To do this, we'll use the defaultdict class, which we introduced in the simple statistics lecture. We'll build an instance of a defaultdict, which is going to store a counter for each user saying how many reviews that user has written. Then we just iterate through our entire dataset and increment that counter each time we see a particular customer ID. That gives us a dictionary mapping each customer ID to the number of reviews that user has written. We can now filter our dataset to keep only users with two or more reviews, using a simple list comprehension.
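The helpfulness filter and the per-user review counts might be sketched like this; `total_votes`, `helpful_votes`, and `customer_id` are the field names from the Amazon data.

```python
from collections import defaultdict

def filter_by_helpfulness(dataset):
    """Keep reviews with too few votes to judge (< 3), or with a
    helpfulness ratio greater than one half."""
    return [d for d in dataset
            if d['total_votes'] < 3
            or d['helpful_votes'] / d['total_votes'] > 0.5]

def count_reviews_per_user(dataset):
    """Map each customer ID to the number of reviews that user wrote."""
    nReviews = defaultdict(int)  # missing keys default to 0
    for d in dataset:
        nReviews[d['customer_id']] += 1
    return nReviews
```

The three-vote threshold is just the cutoff used in the lecture; in practice you'd tune it to your application.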
So we'll iterate through all of the entries in our dataset and check whether the number of reviews written by that user is greater than or equal to two. So we're keeping only users who've written two or more reviews. Now if we look at the length of our dataset, it's down to 11,000. So it seems that this operation, at least, cut down our dataset drastically: in order to discard inactive users, we lost something like 90% of our data. Okay. The final filtering we might do is to filter reviews by review length. We might think a review that contains only a few words is particularly uninformative, so we should discard it. In this case, let's take our dataset and discard any review that has, say, fewer than 10 words. To do this, we'll use the following list comprehension. We'll ask, "Is the length of the review body, once we apply the split operation to it, greater than or equal to 10?" What the split operation is going to do, as we introduced way back when we were reading CSV files, is split that review on any whitespace character. So essentially it will produce a list of the words in the review, and we can then count its length. Once we perform that filtering, we've cut down another 30 or 40 percent of the reviews. We're down to a dataset of length 7,000. Okay. So those are just a few of the different ways we can filter our dataset. There are many, many more you could apply in practice. For instance, you could extend this by looking for products that have only been reviewed a few times, much as we did with inactive users. We could also look at inactive products. We could filter out those users who don't give very informative ratings, for example, people who just rate everything 5 stars. We could filter reviews that were not part of the Vine program or were not verified purchases. We could filter users who seem to have stopped writing reviews for a long period of time. And so on and so forth.
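These last two filters, by user activity and by review length, might look like the following, again with the Amazon field names (`customer_id`, `review_body`) assumed:

```python
from collections import defaultdict

def filter_inactive_users(dataset, min_reviews=2):
    """Keep only entries written by users with at least min_reviews reviews."""
    # Count how many reviews each user wrote
    nReviews = defaultdict(int)
    for d in dataset:
        nReviews[d['customer_id']] += 1
    # Keep entries whose author meets the threshold
    return [d for d in dataset if nReviews[d['customer_id']] >= min_reviews]

def filter_short_reviews(dataset, min_words=10):
    """Discard reviews whose body contains fewer than min_words words."""
    # split() breaks the body on any whitespace, giving a list of words
    return [d for d in dataset if len(d['review_body'].split()) >= min_words]
```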
But the moral of the story is that, as you apply these types of filtering, they're quickly and drastically going to reduce the amount of data you have to work with, especially when doing things like removing inactive users. It turns out that in a dataset like this one, the vast majority of users are inactive. Just to summarize: we gave a few examples of the basic protocols we can use to filter data, we presented Python code for doing so, and we saw the effect that pre-processing can have on real datasets. On your own, since we showed this for the Amazon data, you can try the same thing for the Yelp dataset. So try loading the Yelp dataset and performing similar pre-processing operations, and see what effect it has on the amount of workable data that is left.