Hello, everyone. Welcome to the data mining project course. This is part of the data mining specialization offered through the Master of Science in Data Science degree by the University of Colorado Boulder. Today we're going to start by identifying the key components of real-world data mining projects and really start thinking about designing and proposing your own project. In the data mining specialization, we have two previous courses. The first one covers the data mining pipeline, which basically talks about the whole pipeline: how you do data mining by taking raw data through various stages of processing and analysis so that you can have good results at the end. Then in the second course, data mining methods, we dive into some of the details, trying to understand specific techniques and then apply them in different settings. The third course is the data mining project. This is where we're actually trying to do a concrete data mining project. As a bit of a review of what we have covered so far, we started by saying that when you look at data mining, think about four different views. Because in any particular data mining setting, you're always dealing with some kind of data, of course. Then using that data, you're trying to learn certain knowledge, and that knowledge can be used for specific application scenarios, and of course you need the technique to do that. Then we looked at the data mining pipeline. This is where you take the raw data and know what you want to get to: identifying specific knowledge that can be useful for particular application scenarios. The technique really covers the whole process, starting with how you understand your data. Because when you have your raw data, don't rush into any modeling process, but first try to understand your data. Understand the general properties of your data set.
Then we need to really think about how you pre-process your data, because there are various issues with your data set, or there are certain steps you would like to do before your data is ready for the modeling step. Also, data warehousing is particularly useful. This is in terms of how you manage your data. You may have very different types of data. There is multi-dimensional data, and it may be used at very different granularities. If you actually spend some time managing your data properly, it will make your whole process much more efficient and more flexible. Then, of course, the data modeling process. This is where you have the data prepared, you have the specific data you need for a particular modeling problem. There are, of course, many techniques that handle that. Identifying the right technique to use for your particular modeling scenario is important. Related to that is pattern evaluation, because you don't just build a model and use it. You want to make sure you evaluate your model properly so that you're confident that what you have identified or what you have built will be robust, accurate, and reliable in the real-world setting. Then we talked about some of the detailed techniques, because, as we said, data mining actually leverages a wide range of techniques for different problems. We talked about frequent pattern analysis. This is where, generally, you have a data set. We used the transaction scenario quite a bit. How do you identify patterns that occur frequently? The pattern could be a frequent itemset, that is, a set of items that occur together frequently. But you can also look at sequential data and graph data, so then you can look for frequent subsequences or frequent substructures. All those are very useful to give you a good understanding of the general patterns or frequent patterns. We then started talking about the classification and prediction scenario. This is where you have some prior knowledge.
This is what we refer to as supervised learning, because you have some ground-truth labels, either the categories or classes of specific objects, or you already have historical information about how things change when you have numerical data. Classification and prediction are really just trying to build a model so that you can classify objects. When you see new objects with specific attributes, you can classify each object into one of the specific classes. Prediction is similar, but here you are dealing with numerical values. You're trying to predict the actual value. We then talked about clustering. This is more like the unsupervised learning scenario, where you don't have predefined classes. Instead, you look at all the different objects, you have some kind of similarity or dissimilarity measure between those objects, and then you try to assign the objects to different clusters such that within the same cluster, objects are similar. If they are in different clusters, then those objects should be different. We also talked about anomaly detection. This is a different [inaudible] the setting of finding general patterns. Of course, it's very useful to find a general pattern, but there are also many scenarios where identifying anomalies, which are just things that are different from the general pattern, is also very useful. We also briefly talked about trend and evolution analysis, which particularly relates to temporal information. We also covered a little bit about some of the more advanced techniques in the field. Depending on the data type, some of those are much more complex, or you actually fuse together multiple types of data for your particular problem setting. Of course, data mining is a very active research field. We also talked about what is happening in the research field, or the frontline research.
Now, with all that, we have looked at the different views, we have looked at the data mining pipeline, and we have looked at the techniques. What I'm trying to accomplish [inaudible] this data mining project setting. You may say, we have actually worked on different tasks where you have a specific input that's being provided to you, and then you build a model or do some analysis to generate the expected output. Think about some of the assignments we have done in the previous courses. For example, we have worked on scatter plots: you're provided with the data set and then you generate the scatter plot, or you can do correlation analysis if you have specific attributes. Then you could even take one of your classifiers and build an image classifier. Those are all very useful and concrete tasks. You can view those as the building blocks, so we definitely want to learn all that. In this course, we're really emphasizing the architect view. That means you are now the architect of a data mining project. This is instead of somebody giving you a specific task or list of tasks: take this, do that, give us this data or output, or do this and generate a particular model. Those are very specific tasks. But as an architect, you're really designing the whole thing. Really just think about this big-picture view. You want to be able to identify the problem, and you want to be able to figure out what tasks should be accomplished, so that when they work together, they allow you to tackle real-world problems at a larger scale. Throughout this course, really position yourself as the architect of a data mining project. Keep this big-picture view, and always use analytical thinking.
By analytical thinking, the idea is that you're always reasoning about why you should do this, why you should do it in this way, and what the results mean. That's a very important part, because this differs from the more mechanical view where you say, "Oh yeah, I'm being told to do tasks 1, 2, 3, and now I just finish 1, 2, 3." As I said, that's still very useful, but we really want to think more in terms of the design part. That's really the purpose of this course. We will walk through the process of identifying and designing a problem, and actually finishing a data mining project. Of course, the question there is, that sounds great, but where do we start? There are many possibilities and many ways to do it, but I don't know where to start. Number 1, that's really about your interests. I really think it's much more interesting when you're working on a project that you like. That's really what I want you to start with. Think about whatever things, of course related to data mining, that you're potentially interested in, or you already know you are interested in. That then really gets us to the four views. As we said, any data mining project should have those components. You should know what kind of applications you're dealing with, what kind of knowledge, what kind of technique, what kind of data. But you can start in different places, depending on what you're interested in and what you already have as your background. You could start with applications. You could say, "In this application domain, I have good knowledge and that's really what I want to work on for my data mining project," or you could say, "Oh, I always wondered about this particular application, and I think I want to do something with it." Great. Or you could say, "I'm particularly interested in certain types of knowledge. I always wondered about maybe some kind of spatial or temporal relationship, or correlations between things," or "I'm just interested in [inaudible]."
Those are just particular types of knowledge you may be interested in learning, identifying, or answering. That would also be a very good starting point. There are actually many cases where you say, "Well, I don't know what the application is or what the knowledge would be yet, but I know this is a very good dataset." This dataset has some very interesting information, and I just want to see what I can do with it. That's another way to start. Start with a dataset that you are interested in and then explore what you can do with it. There are also scenarios where you say, actually I'm really interested in the technical design of a particular method, so I want to dive deeper and see whether I can apply this technique in a particular setting, where I can identify the usage scenario later, or whether I have some idea to improve this method further. I can start with this particular technique. I could say, I'm really interested in classification, or even more specifically, this particular method, and I think I have some idea to make it better or use it in some settings. All those can be very good starting points. The number one thing, and also the thing I will ask you to keep in mind throughout the project, is that you want to work on something that you're interested in, that you feel excited about, where you're curious about what you will find out when you go through all the effort and the whole process. That's really the starting point. More specifically, when you think about [inaudible] think about the application scenario. You know that there is a wide range of application domains. Nowadays, a lot of domains have various types of data or various data mining related problems. Start with something that you're interested in. As an example here, let's talk about bird migration. In this case, it gets a little bit to this notion of domain knowledge. Because when you're picking a particular domain, you want to have some knowledge in that domain.
You could say, "This is in my area. I have been working on this for a long time. I have deep knowledge in this." Great. But sometimes it's okay that you're not the expert in that domain. You're just interested and you maybe have some knowledge. Here, let's say I'm maybe just an amateur bird watcher. I'm just interested, or I just want to see what I can find. Those are all fine, but really, when you look at a particular application, think about what domain knowledge you may be able to leverage. Because that would then allow you to identify a problem that is a challenge in this domain. You don't want to just wander around and pick anything, because that may or may not be the most useful one. When you have a bit of domain knowledge, think about what the challenge is. Whatever you do, you want to be able to contribute, by identifying what the challenge is and, more importantly, what the value of solving that problem is. I'm asking this because you want to say, "I want to spend my time working on a data mining project, so I'd better use it on something useful." Pick a domain that you're interested in. Think about what domain knowledge you can leverage, and then identify one or two challenging problems in this domain, but then really ask yourself the question: what's the value? What's the benefit? If I solve it, would this be of use to the domain? That's about application. Then, think about the knowledge. Knowledge is really what you learn. The question here is, what are you trying to learn? Do you have something concrete in mind, or are you really just trying to explore? An example of this is weather prediction. You have a lot of potential data you can leverage: historical data, and there are also actually models, like physical models, that you can leverage. But then, of course, you want to make good predictions.
For that, you want to really identify good relationships among the different features, which will be the input of your problem. Many times when you talk about the knowledge, think about the general patterns because, like I said, a general pattern can be very useful, and if you can identify good relationships across different attributes, then you can use that general pattern in many different settings. But of course there are also scenarios where you'll be focusing more on the anomalies: things that just stand out. They're just different from the others. Of course, there are many scenarios where you can use this. Here, then, your knowledge focus could be the rare cases. Many times you run into the scenario where you know the specific pattern you're looking for. For example, I want to know how, let's say, wind speed and temperature may impact the chance of precipitation. That's actually a very specific problem, and you're looking for a specific pattern to answer that question. But there are also many scenarios where you say, "Well, I don't know. Anything could be interesting. I don't know exactly what patterns I'll find," and it's more about just saying, "I want to find some pattern that may be interesting and useful." That's okay too. But really, the thing is that it's something of interest and something that may be useful. Now, data. Of course, we cannot have a data mining project without good data. Here, of course, you need to ask yourself, what kind of data do you want to use? My example is that if I'm working with remote sensing data, I know this is time series data, and I can also associate it with different spatial areas, and there are different granularities. Also, depending on what type of sensors I'm getting data from, I know how many channels there are, and also the specific meaning of those channels.
Similarly, when you have your data, or there is a dataset that you're considering using, always think about our four Vs, which are used to characterize your dataset. Just to refresh your memory on the four Vs: the first one is really just volume, like how big is your dataset? This is important for almost all data mining projects. You want to have a reasonably large dataset. Of course, this term is vague. I always get asked, how big is big enough? I want to say that if you're talking about hundreds or thousands of data points, that's really too few. Most of the time, we're talking about hundreds of thousands or millions of data points, because you need a reasonably large dataset so that you can have a meaningful process. That's the volume. Then also think about the variety, because there may be scenarios where you're just dealing with one type of data, which is okay. But there are many scenarios where you may be dealing with different types of data, and there are also good benefits to integrating different types of data. Think about the variety. Are you dealing with different types of data, and how may you be able to integrate them so that you can make good use of them? Velocity. Are you talking about static data? Are you talking about dynamic data? Dynamic meaning that these could be things that change over time. You may be dealing with real-time or historical data, and either is fine. But think about whether your data changes over time and how quickly it changes. Then veracity. This is the fourth V, which was actually added later. Because when we start looking at the actual data in the real world, you will see various issues and why quality is important. Most real-world datasets are not of high quality. Do pay close attention to the veracity: to what extent do you think your data is good? If not, then what can you do to make it better?
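To make the four Vs a little more concrete, here is a minimal sketch of how you might profile a candidate dataset before committing to it. Everything in it is an assumption for illustration: the function name `profile_four_vs`, the record layout (a list of dictionaries), and the small weather-style sample are hypothetical, and the particular measures chosen (types per field for variety, records per day for velocity, missing fraction for veracity) are just one simple way to quantify each V.

```python
from datetime import datetime


def profile_four_vs(records, time_field):
    """Summarize a list of dict records along the four Vs (illustrative only)."""
    n = len(records)
    fields = sorted({key for rec in records for key in rec})

    # Variety: which value types appear in each field (missing values excluded).
    variety = {
        f: sorted({type(rec[f]).__name__ for rec in records if rec.get(f) is not None})
        for f in fields
    }

    # Velocity: average number of records per day over the observed time span.
    times = sorted(rec[time_field] for rec in records if rec.get(time_field) is not None)
    span_days = (times[-1] - times[0]).total_seconds() / 86400 if len(times) > 1 else 0.0
    records_per_day = len(times) / span_days if span_days > 0 else None

    # Veracity: fraction of missing (None or absent) values per field.
    missing = {
        f: (sum(1 for rec in records if rec.get(f) is None) / n) if n else 0.0
        for f in fields
    }

    return {
        "volume": n,  # Volume: number of records
        "variety": variety,
        "records_per_day": records_per_day,
        "missing_fraction": missing,
    }


# Hypothetical sample: four weather readings, one with a missing temperature
# and one with no station recorded.
records = [
    {"time": datetime(2024, 1, 1), "temp_c": 3.5, "station": "A"},
    {"time": datetime(2024, 1, 2), "temp_c": None, "station": "A"},
    {"time": datetime(2024, 1, 3), "temp_c": 4.1, "station": "B"},
    {"time": datetime(2024, 1, 5), "temp_c": 2.0},
]

profile = profile_four_vs(records, "time")
for name, value in profile.items():
    print(name, value)
```

A quick profile like this will not tell you whether a dataset is good enough, but it gives you numbers to reason about when you ask how big, how varied, how fast-changing, and how clean the data is.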
Here, I will highlight the point of data availability, because, as we said, if you don't have the data, you cannot have your data mining project. Depending on what kind of data you want to get, it can take time, or you may never get it. In my courses, I have seen a lot of students who say they have a great idea, and a dataset that, if they had it, could really allow them to do a very interesting project, but the data is not easy to get, and they may not get the data in time. That would almost make your project infeasible. Pay very close attention to your data availability. Really make sure that you have the data you need for your project when you propose it, and actually get the data quickly rather than saying, "Oh yeah, I know it's there. I'll get it." Don't rely on that. Have the data readily available when you start your project. The fourth piece is technique. As I said, there are different application scenarios, different types of knowledge you can learn, and different datasets, and of course the core part is about what techniques you use. We covered many different techniques. They can be useful in many different settings, and in particular, we talked about which techniques are suitable for what kinds of problems. That is very important. If you remember our discussion of, for example, the clustering scenario, we talked about quite a few clustering algorithms. They all have different characteristics. They're designed differently, they work in different scenarios, and they may find different shapes of clusters. On one side, you need to understand your problem setting, but on the other side, there are many different techniques. Which one or which ones are more suitable?
That's very important to start with. In this process, really think about your evaluation metrics, because many times when you have your problem setting, you need a way to show that you have accomplished your goals, a way of demonstrating the success, or how good the model is, or how reliable a pattern is. You want to really think about your evaluation metrics. Then many times you will be doing some comparison. Remember, as we talked about in our previous course, there are different techniques, and for a particular problem, you may be able to compare across the methods and pick the ones that are most suitable for your problem setting. You will probably be doing some comparison of more than one technique, and that's totally fine. But the key point here is that your comparison should not stop at reporting the results. Say you have this problem and, say, three different methods. You run the experiments, report the results of the three methods, and say, "Oh yeah, among them, method three works best, so I'll just pick it." Well, you really need to take one more step and ask, "What do the results tell you?" Say method three has the highest accuracy. You should also look at when it's not good, or when you actually see another method work better, so that you have a bit of reasoning behind your results. Then you can say, okay, this is how I did my comparison, and using my evaluation metrics, this is why I'm picking this one, while also noting that this method doesn't work as well in certain scenarios. That gives you the reasoning, and more importantly, it may lead to your innovation. Because you may be able to say, well, I tried the different methods, but by looking at the results and the performance, I see that they all have their pros and cons, and by looking at when they make mistakes, I can actually come up with a way to improve them.
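As a small illustration of not stopping at the headline number, the sketch below compares two hypothetical methods on made-up labels. The labels, the predictions, and the names `method_1` and `method_2` are all invented for the example, and accuracy plus per-class recall are just one possible choice of evaluation metrics. The point it tries to show is that the method with the highest accuracy can still fail completely on the rare class.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its examples predicted correctly."""
    total = defaultdict(int)
    hit = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return {label: hit[label] / total[label] for label in total}


# Made-up ground truth: "rain" is the rare class.
y_true = ["rain", "rain", "dry", "dry", "dry", "dry", "dry", "dry"]

# Made-up predictions from two hypothetical methods.
predictions = {
    "method_1": ["rain", "rain", "dry", "dry", "dry", "rain", "rain", "rain"],
    "method_2": ["dry", "dry", "dry", "dry", "dry", "dry", "dry", "dry"],
}

report = {
    name: {"accuracy": accuracy(y_true, pred), "recall": per_class_recall(y_true, pred)}
    for name, pred in predictions.items()
}

# Picking by accuracy alone selects the method that never predicts rain.
best = max(report, key=lambda name: report[name]["accuracy"])

for name, scores in report.items():
    print(name, scores)
print("highest accuracy:", best)
```

Here the second method wins on accuracy only because it always predicts the majority class. Looking at where each method makes its mistakes is what tells you whether the winner is actually suitable for your problem, and it can point to how a method might be improved.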
That's actually very useful, and that's actually very exciting. That's how we push the field forward. We don't stop at just reporting the results and picking the best one. You reason about the results, and you try to identify limitations and potential ways to improve your method. You don't have to, and not every data mining project will actually get to that point, but I think that reasoning is very important. When you're picking a method, you want to understand why you're picking that method. Now, when you have things roughly defined and identified, this is the data I'm going to use, this is the application scenario, this is the knowledge, this is roughly the technique, always think about the whole pipeline. As we said, you usually don't work on just one piece of it. You really want to take the data, spend time understanding your data, and pre-process the data as needed. You may want to think about how to do a bit of data warehousing or management, so that your data is nicely organized and it's easier for you to do your analysis. Then, of course, the modeling and validation. Again, think about all those components. You may, of course, spend different amounts of time on specific ones. If my data quality is great, then I probably don't need to spend much time there. But it is always important and useful to think about the whole pipeline and see where you need to spend effort and how those different pieces come together. Again, think of yourself as the architect of your project. There is also another important aspect I want to call out, which is the scope of your project, because we only have a few weeks. We'll be talking about finishing a reasonably scoped data mining project. It's something we're interested in, we're learning through this whole process, and we're producing something useful.
Really think about your timeline, depending on how much workload you have, how much time you can spend, and how soon you can make progress. Really just think about the timeline: how do you actually get to the finish line as you're planning out your whole project? Also, it's always good to have a prioritized list of tasks. As I said, those tasks are the more specific components. When you're putting a project together, you will have specific tasks. There may be many questions you want to answer and many things you want to address, but you cannot do everything. Instead, it's really about which ones are the most interesting and important. Also, of course, there are some dependencies across these tasks. You want to lay out a plan to say, okay, these are the ones I have to do, and these are the ones I need to do first before I can do the other ones. Or, depending on how I'm progressing, I may need to cut out a particular branch because it's not applicable, or it's not as interesting, or I just don't have time to finish it, and that's okay. Or you can have things where you say, actually, if I have time, I would really like to try this, but I may not get to it if time is limited. Always think about which are the things you would like to do and which are the things you have to do in order to get to your particular point. You plan out those things so that you have a better sense of, well, these are the other things I'd like to explore. Then, depending on how much time is available and how close you get to the finish line, you may be able to adjust things along the way. But having that bigger picture is very helpful. Also, always think about your expected outcome. As I said, we may plan a grander view of our project. But if I'm just starting on many different pieces, and then at some point it's the end of the term and I don't have anything to show, that's not good.
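The idea of a prioritized task list with dependencies can be sketched in a few lines. This is only an illustration under assumptions: the task names, the priority numbers (1 for must-do, larger numbers for nice-to-have), and the `plan` function are hypothetical, and the ordering rule (among tasks whose prerequisites are done, take the highest priority first) is one simple choice rather than a prescribed method. It uses `graphlib.TopologicalSorter` from the Python standard library, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical project tasks: priority 1 = must do, higher numbers = optional.
# "needs" lists the tasks that have to be finished first.
tasks = {
    "get_data": {"priority": 1, "needs": []},
    "clean_data": {"priority": 1, "needs": ["get_data"]},
    "baseline_model": {"priority": 1, "needs": ["clean_data"]},
    "evaluate": {"priority": 1, "needs": ["baseline_model"]},
    "extra_features": {"priority": 2, "needs": ["clean_data"]},
    "improved_model": {"priority": 3, "needs": ["extra_features", "baseline_model"]},
}


def plan(tasks):
    """Order tasks so dependencies come first, preferring higher-priority tasks."""
    sorter = TopologicalSorter({name: info["needs"] for name, info in tasks.items()})
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    order = []
    ready = []
    while sorter.is_active():
        # Collect tasks whose prerequisites are all done, then take the most
        # important one (ties broken by name so the result is deterministic).
        ready.extend(sorter.get_ready())
        ready.sort(key=lambda name: (tasks[name]["priority"], name))
        current = ready.pop(0)
        order.append(current)
        sorter.done(current)
    return order


order = plan(tasks)
must_do = [name for name in order if tasks[name]["priority"] == 1]

print("full plan:", order)
print("must-do path:", must_do)
```

With this layout, the must-do path finishes before any optional branch starts, so if time runs short you can cut the optional tasks at the end and still have something to show.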
Really think about this as you're planning out your project. You could say, if I can do these tasks, then I expect that I can show this, or I can learn this part. That's the expected outcome. Of course, you can add to it as you make more progress, but you want to make sure that you have something you can present within the timeframe of your project. This is also really important, and it gets to the evaluation part, because you need to show you have accomplished something. How do you demonstrate that? This gets to the evaluation metrics and how you're going to evaluate. I have seen students proposing projects where the evaluation would require maybe getting thousands of users, if not more, to provide feedback, which is very difficult. It would be very nice, but it's very difficult. Just think about how you want to demonstrate the success or the accomplishment of your project at the end. Analytical thinking. I'll keep reminding you, and you'll probably hear me saying this all the time, because I really feel this is a core value, and the most important training in this whole process is about being the architect. You always ask yourself these questions: Why am I doing this? Why should I design it this way? Why choose this method over the other? Why is this result showing better performance? That's the whole reasoning piece. As I said, we really want to go beyond the mechanical piece of saying, I'm doing Task 1, Task 2, and I'm just reporting what I'm doing. You really want to add in those question marks along the way on almost everything. Why am I doing this? Why do I want to do it this way? Why doesn't that work, so that I have to choose a different approach? I really see that as a core value of a data scientist. [inaudible] talk about how you propose your own data mining project. We'll talk about a few pieces that may get you started. First, just start the brainstorming process. By brainstorming, I mean pick a few initial ideas.
As I said, this could be starting with application, knowledge, data, technique, whatever. Just write down a few sentences about what you think is potentially interesting. Once you have that, think a little bit about it, and then check feasibility. As I said, you want to check the availability of the data you want to use, and also think about how you would like to evaluate and whether that is feasible. These are just some general checks to make sure this is potentially doable. Then the very important part is actually discussing with other people. You may say, well, I'm working on this project by myself, so why should I talk to other people? One thing I have said about data science, or just data analytics in general, is that different people can have very different perspectives. Even somebody who's not an expert in your project or in your domain can provide a different perspective. I feel that is very valuable. Also, in the process of trying to describe what you are trying to do and discussing it with others, it helps you better formulate your project. This process is really valuable and actually very good practice. Try to do this with other people: family, friends, or colleagues. As I said, they don't need to be experts, and they don't need to be your project partners, but they can really help you in this process. Then iterate. You may start with some initial ideas, and you may have already decided and say, this is the project I'm going to work on, I'm done. That's fine. But many times, you may have a few choices, or even for the same general area, there may be specific tasks or directions you can go. Take the time to iterate through the different ideas or different pieces that may interest you, and know that you won't have the answer at this stage. Meaning that you won't know for sure that this will work out, or that this is the best way to do it. We don't. That's how we explore things.
But the key is really to think about it, do the brainstorming, use your analytical thinking, and try to pick something that, to the best of your knowledge, is of interest and is valuable. Let's begin. As I said, we will be walking through the whole process of a data mining project, and this is really just the beginning of it. But start to think about your ideas, iterate through them, and then we'll come back and discuss how we can make them more concrete. We will get to the proposal stage very soon. That's all for today. Thank you.