Hello, everyone. Today, let's continue our discussion of the course project. Specifically, we're going to talk about the project checkpoint. Right now, you should be somewhere around the midpoint. We've finished the project proposal, and now we're in the process of designing and developing our solution for the real-world data mining project. The key part is to think about the full data mining pipeline. Just as a quick review, we covered the whole pipeline. As we said, data mining is not just one particular component where you take some data, do some modeling, and report the results. We really need to think about the whole process, starting with collecting the raw data: how are you getting the specific datasets needed for your project? Once we have the data, we progress through the whole pipeline: trying to understand the data, thinking about whether any preprocessing is needed, thinking about data warehousing (how you can manage or analyze your data in a particular way), and then, of course, the modeling process and evaluation.
Throughout this whole process, keep in mind the big picture of your project. Remember, when we proposed our project, we said what the application is, what scenario you are trying to apply the data mining methods and processes to, what kind of knowledge you are trying to learn, what kind of data you're using, and, of course, what kind of techniques. All of those need to come together. This is really how we integrate the various pieces that we have learned so far in the data mining specialization and put them into an actual solution for a real-world project.
Now think about your progress. Right now, we're looking back and, of course, looking forward: what kind of progress are we expecting in your course project? The starting point is that you should have obtained the data you need. As I said, you need the data to accomplish your project.
Do you have the data you need? Are you still in the process of collecting data, or do you have all the data you need locally available? Also, think about the tools. Data mining is a very active field and a lot of tools have already been developed, so we really should leverage the tools and methods that are already available instead of reinventing the wheel. Identify the tools that are potentially useful for your project. If you're not too familiar with the tools already, spend a little time learning about them. Most of those tools have good tutorials or examples to help you get familiar with them and start using them.
Then, of course, I would say, take your time. Try to understand your data, because this is the stage where you basically just take your data and ask: what kind of attributes do I have? What kind of distribution am I seeing? Do I see any extreme values? You may try to visualize your data in different ways so you have a better understanding of it.
Then, in the preprocessing stage, think about what kind of processing you have done already and whether there are things you may still need to do. There may be missing values, and maybe there's some filtering you have to do, or aggregation, or normalization. Basically, here, you're taking the different types of data after you have looked at them, and now you're trying to prepare them. Again, depending on the specific project and the specific data you're dealing with, this step may take more or less time, but it is important to at least think about what you need to do in this stage.
The next one is warehousing. Our project may not be so big that it requires extensive design of a data warehouse, but the key idea here is whether, from your project's point of view, there are better ways to organize and manage your data so that it's easier for you to do your [inaudible] or an iterative process.
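As a minimal sketch of the data understanding and preprocessing steps described above, the snippet below assumes a small made-up table with `age` and `income` attributes (neither comes from the lecture). It counts missing values, flags extreme values with the common 1.5 * IQR rule, and applies min-max normalization. The helper names `profile` and `min_max_normalize` are illustrative only, and a real project would more likely rely on an existing library.

```python
import statistics

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 23, "income": 41000},
    {"age": 35, "income": 52000},
    {"age": 31, "income": None},
    {"age": 29, "income": 48000},
    {"age": 41, "income": 250000},
    {"age": 38, "income": 55000},
]


def profile(rows, column):
    """Summarize one numeric attribute: counts, center, and extreme values."""
    values = [r[column] for r in rows if r[column] is not None]
    # Quartiles (inclusive method), then the 1.5 * IQR rule for extremes.
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "extreme": [v for v in values if v < low or v > high],
    }


def min_max_normalize(values):
    """Rescale values to the range [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


for column in ("age", "income"):
    print(column, profile(records, column))
```

Whether a flagged value is an error or a genuine observation is still a judgment call for the specific dataset; the sketch only surfaces candidates to look at.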
Here, you can either say, "I have the data in the right format I need, so I'm fine." That's okay. But if you are dealing with different types of data and there are benefits to rearranging things or breaking them into different subsets, take the effort and make sure the data is organized properly.
Then as you progress further, you've got your data, you have a good understanding of it, and you are ready for the actual modeling. This is a core step where you think about what specific modeling problems you're trying to address. We covered quite a few in our data mining methods course. Depending on your project, you may focus on one particular problem, or maybe a couple. That's fine, but think about what is the most useful method for your particular problem setting. We talked about different methods, and there are, as I said, tools readily available, so you can utilize what is available. The key part is identifying your problem and identifying the suitable methods for it. In this process, do you see ways of improving your method further? All this is, of course, challenging, but that's really the interesting part, and potentially the most innovative aspect.
You would do this as an iterative process. You know your modeling problem. You try out some method, run a basic evaluation to see how it performs, and then see whether you can improve it further, so there will be a bit of back and forth, in particular between the modeling and evaluation steps. We talked about how you need to think about the metrics. At the start, given your problem setting, you ask, "Okay, how do I evaluate success? What kind of metrics do I use?" Keep in mind that you may be focused more on the effectiveness side, which is the quality of the method, but efficiency also matters in many scenarios.
Think about how the efficiency angle is also applicable. How long does it take for you to process that much data? There is the offline training piece and also the online decision-making piece, and maybe there are different subtasks. If you're comparing methods, also think about how they compare in terms of effectiveness and in terms of efficiency. Many times, you will be asking about trade-offs, because in many scenarios different methods perform differently. Maybe [inaudible] there's one method that's superior to all the other approaches. Fine, that's good. But in many scenarios it's not clear that one method is always better than the others, so now you get to this trade-off: "Okay, this method is more accurate, but it takes more time. The other method is slightly lower in terms of accuracy, but it's much more efficient," or, "This method actually performs better in this kind of scenario." When you look at the errors, like the false positive and false negative trade-offs, you can also see different things: this method has a lower false positive rate, but then it has a higher false negative rate, or something like that. This is the important part of the reasoning. It's about knowing the methods, knowing how they work, applying them to your particular problem and your dataset, seeing the results, and then being able to reason about the pros and cons of different methods.
As you're progressing, we now get to the checkpoint stage, where we're really just trying to get a status update: are things on track? Remember, we proposed our project. That was one step, where we were just planning things out: we'd like to work on this, we'd like to tackle those tasks. But as we progress into the project, we have a better understanding of the problem, and we have some preliminary results already.
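To make the earlier trade-off discussion a little more concrete, here is a small sketch under stated assumptions: the labels and the two prediction lists (`method_a`, `method_b`) are invented for illustration, and `error_rates` and `timed` are hypothetical helper names. It shows two methods with the same accuracy but opposite false positive and false negative behavior, plus a simple way to record running time alongside the quality numbers.

```python
import time


def error_rates(y_true, y_pred):
    """Return accuracy, false positive rate, and false negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Made-up ground truth and predictions from two hypothetical methods.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
method_a = [1, 1, 1, 1, 1, 1, 0, 0]  # finds every positive, raises false alarms
method_b = [1, 1, 0, 0, 0, 0, 0, 0]  # no false alarms, misses some positives

for name, preds in (("A", method_a), ("B", method_b)):
    rates, seconds = timed(error_rates, y_true, preds)
    print(name, rates, f"{seconds:.6f}s")
```

Which of the two is preferable depends on the application: the accuracy alone does not distinguish them, so the report should say which kind of error matters more and why.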
All that allows you to revisit the proposal, in a way, and check where things are. For the purpose of the checkpoint, really think about your progress and also your changes. Progress means what you have accomplished so far. Changes are important too because, as we said, the proposal stage was really tentative. We were still planning and figuring out a lot of things. But by now, you may have more concrete ideas, so you may have adjusted things, or even made bigger changes. All of those are fine, but this is what we would like to identify for the checkpoint.
Again, there are just two pieces to submit. This is very similar to the proposal stage: you will be submitting checkpoint slides and a checkpoint report. Remember, for the project proposal, you also submitted slides and a report. For the checkpoint, you don't need to do everything from scratch. Rather, you take your proposal slides and your proposal report and update them because, presumably, this is still a similar problem setting. You may have adjusted things a little bit, or you may have added something new, or if there's a more substantial change to your project, that is still fine. You can reuse some of the pieces you have already, and then say what has changed and what you have done so far. These are the two main pieces: the checkpoint slides and the checkpoint report.
What should be included in the checkpoint slides? As we said, this is an updated version of your proposal slides, so you'll probably have very similar content. If you remember, we originally suggested roughly 5-10 slides for the proposal stage because we were just planning things out, more like high-level planning and an overview of the project. Hopefully, by now, you have a few more slides to add. Again, this is just a guideline, not a hard limit.
You don't have to fit your slides exactly between 10-15, but by now you should have a few more slides with a bit more concrete information on them. Throughout the slides, it is about highlighting your progress and highlighting any potential changes. For the purpose of the checkpoint slides, you should give a good overview of the project, because it has been a while since we saw your proposals, and you may have a different reviewer looking at your checkpoint report. Make sure you provide a good overview. The structure is very similar to the proposal: say what your project is about, have a good title, and think about your problem statement, related work, proposed work, evaluation, and timeline. You can take your proposal slides and, on top of that, add information that is new or different, but make sure you highlight it: "Originally, I planned to focus on those tasks, but there have been a few changes," or, "I tried this, but it's not working out well, so I'm shifting my focus a little bit to a different one." That is all okay. It is, again, a good summary of your project, but with a focus on the progress and changes.
Here, I want to remind everybody again of the slide style. If you paid good attention during your proposal stage, your slides should already be good: reasonably clean in terms of the overall style, with key points, and easy to read. All of those are good. But keep in mind that, many times, as we get into the details, we tend to put more information on a single slide. You'll say there's some important detail you want to put in, but always sit back and reexamine your slides. Make sure they're clean, simple, concise, and to the point, and that they really highlight the key points you want to make. Don't worry about including all the details; you will have the report.
Of course, you will be writing the report with the details, so the goal of the slides is not to convey all the information. Your slides should give a good overview and highlight the most important parts. That's it for your checkpoint slides.
Let's look at the report. The report, again, should be an updated and expanded version of your proposal report. You use the same ACM proceedings template. Here, we're giving a bit of a guideline on the page count, because many times, for a written report, people ask, "How many pages do I have to write?" We didn't specify this for the proposal report. The idea was that you were just getting started, so you would write whatever you felt comfortable with, using the information you could provide then. But by the time you get to the checkpoint, hopefully, you have a bit more content to include. Again, 3-9 pages is just a guideline. You don't have to fit right into that; a little bit shorter is fine, and a little bit longer is fine too. The key point is that you want to convey the information effectively in this report. Think about the specific sections we discussed for the proposal stage and see how you can update the content for each of those sections.
As a starting point, we'll review all the different pieces. We went through this for the proposal stage. For the checkpoint, we look at the same guideline, but now see whether you have more concrete things and a better idea of what you can put into each of those sections.
Title. As I said, we want to start with a good title. At the beginning, you may not have had a clear idea of what you wanted to do, and now, as you progress further with your project, you say, "Now I have something more specific. I want to change my title." This could be a subtle change or a more substantial one; either way, it's okay.
But what we want out of your project title is a concise and informative title that really conveys the key idea of your project. You may have started with a reasonably good project title, but as you're writing your checkpoint, also think about whether you can improve it a little bit. Ultimately, you want people to hear your title and remember it, of course, but also understand what you are trying to do in that particular project.
Abstract. This is the executive summary of the report. It's one or two paragraphs long, and it should provide a good summary of your project. If you look back at what you wrote for the proposal stage, some of it may have been more about planning things out, a bit vague, with some specifics not spelled out. So take that version and revise it so that you have, hopefully, a more concrete description. But it should still be concise. The point is not to have the abstract spell out the technical details, but rather to highlight what the project is about, what you're trying to accomplish, and, if you have some information or details ready, what the key findings are. You can start putting some of that in.
Introduction. This is the main intro section. There are four key components: What is the problem? Why is this problem important? What are the limitations of existing work? And what is your potential contribution? Take some time to read what you wrote for your proposal, and then update it. Or you could say, "That still stays true. Whatever I wrote for the proposal is still good, so I'm fine." That's great.
But you may see ways to refine your description. You may have a more concrete understanding of the problem, or a slightly better justification for the importance of the project. You may also have identified some other related work, which may require an update to the limitations of existing solutions. And, of course, your contribution: as you get a better understanding of the work you're doing for your project and what you could accomplish, you may also be able to update your contribution component. That's the introduction section.
Related work section. Here we're, of course, talking about what has been done already. We talked about how you want to group prior work into different categories, so that each paragraph covers one particular topic or one focus. The important part is how your work differs from and builds upon that prior work. What you don't want is to say, "This is what has been done, and what I'm doing is just repeating or following exactly what they have done, redoing it for myself." That's not particularly helpful. It's useful in terms of learning, but we want to see that you're building upon prior work. That is how you show the value of your project and the contribution of your work. It doesn't have to be a substantial contribution, but you want to think about how your work adds something new to what has been done already.
Now, in terms of updating for the checkpoint stage, whatever you have written in the related work section may still apply, but you may have things you want to add or adjust. You may say, "Since I'm changing my project, this is not as relevant anymore, so I'm adding something else instead," or, "I have found a new piece of related work that is quite relevant, so I want to add that." All of those are good; it's basically an update. You don't have to rewrite the whole section.
You probably have some good content already, so it's more about refining it with any new information you have.
Then proposed work. This is a core section, and by now you may decide that it's too big a section and separate it into multiple sections, which is fine. This is really about what you're trying to do. Again, you have your proposal report, so you have your initial plan, and now you're putting in more details. For example: "I now have more information about the datasets, so I want to explain specifically what kind of data I have, what kind of attributes they have, how big they are, and all that." Or if you are collecting data on the fly, you can write a little more about your data collection process or the tools you're using. You may have identified new tools, or you may have changed tools: "I was originally thinking about using this tool, but it turned out it wasn't particularly useful, so I'm now trying something new." All of those are good; this is the updated description of datasets and tools.
Then the main tasks. This is where you can hopefully add in some of the details as you're progressing through your project. As we talked about, when you have your data, you can do some initial statistical analysis. You can report the size, the number of dimensions, and maybe the distribution, using a histogram, a scatterplot, or whatever visualization fits; if the data has temporal or spatial information, maybe some kind of heat map. Think about different ways to convey the key characteristics of your dataset.
Also preprocessing and warehousing. This is the stage where we're just preparing the data. Actually, a lot of real-world projects call for a lot of effort just getting the data ready. You may be dealing with missing values: what do you do with the missing values?
Do you just filter things out? If you're merging different datasets, describe how you're preparing the merged dataset. If you have done any normalization or data selection, for example selecting subsets of the data for different scenarios, all of that is useful to describe. You can also say how you do the data warehousing: "I initially had multiple CSV files, and now I've organized them into a database," or you have some other way of managing all the data, so it's easier to retrieve the different subsets. All of those are things you can write about as you're making progress, because this is important.
You're describing what you are doing. But it's also important that you're not just listing steps 1, 2, 3, 4, because this shouldn't read like a report that somebody would just follow and repeat. You want to add the reasoning: "I'm seeing this in my dataset," or, "I'm trying to tackle this particular problem, and this is why I chose this method to do it." You're basically documenting not only the steps, but also why you're doing them.
Then, of course, the specific questions, the patterns, and the modeling you're trying to do. There are many approaches. Again, the focus is on why you're choosing this method or this set of methods and why you think they will work. Provide your reasoning and some of the details about what you're doing in order to explore particular patterns or build particular models.
Then, of course, evaluation. By now, hopefully, you have something more concrete. For the proposal stage, it was totally fine to talk only generally about how you planned to evaluate. Now think about the metrics. Do you think those metrics still apply, or do you have slightly different ones, or additional metrics you want to consider? Those are all fine.
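The warehousing idea mentioned above, moving from CSV files into a database so subsets are easier to pull out, can be sketched roughly as follows. This is only an illustration under assumptions: the CSV content, the `sales` table, and its columns are invented, and an in-memory SQLite database stands in for whatever storage a real project would choose.

```python
import csv
import io
import sqlite3

# Made-up CSV content standing in for one of several raw files.
RAW_CSV = """city,year,sales
Denver,2022,120
Denver,2023,150
Boulder,2022,80
Boulder,2023,
"""


def load_into_sqlite(csv_text):
    """Load CSV rows into an in-memory SQLite table, keeping blanks as NULL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (city TEXT, year INTEGER, sales REAL)")
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        sales = float(rec["sales"]) if rec["sales"] else None
        rows.append((rec["city"], int(rec["year"]), sales))
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


conn = load_into_sqlite(RAW_CSV)

# Count missing values, then aggregate per city over the non-missing rows.
missing = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE sales IS NULL"
).fetchone()[0]
totals = conn.execute(
    "SELECT city, SUM(sales) FROM sales "
    "WHERE sales IS NOT NULL GROUP BY city ORDER BY city"
).fetchall()
print("missing:", missing)
print("totals:", totals)
```

Keeping the missing value as NULL rather than dropping the row at load time leaves the filtering decision visible in the query, which makes it easier to document why rows were excluded.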
Don't worry if you have to say, "I changed it," or, "I had to come up with something different." That's okay. That's how we learn and that's how a project progresses. Just make sure you add the reasoning about why you're adding a particular metric, or why you're removing one because it doesn't apply.
Then the experimental setup. By now, hopefully, you have your data ready and you may have done some preliminary exercises in trying things out. See whether your experimental setup makes sense: the training set and testing set split, and any subset scenarios. Check and update your description of the experimental setup, and also, of course, the comparison. If you are comparing across different methods, do you have any update? Maybe you're not there yet. You say, "I'm still just playing with the first method." That's okay. But if you have any further understanding and a further plan for how you want to compare different methods, update your report.
Also, if you have some preliminary results, this is where you can start thinking about the reasoning. One key part of the evaluation section is to make sure that you don't just report the results. You could say, "I took my data, I prepared my dataset, and I'm building a classifier. I tried a decision tree classifier; this is the tree I got, and this is the performance I get. I also tried a support vector machine, and this is the result I get," and just put in a table saying this is the performance you're getting. But you really want to go further. Now that you're seeing some preliminary results, you have your evaluation metrics and you've seen the numbers, so ask: what does it mean? Is this good performance, and why is one method better than the other?
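One common way to start answering "is this good performance?" is to compare against a trivial baseline on a held-out test set. The sketch below is an assumption-laden illustration, not the course's prescribed setup: the data is synthetic, `threshold_rule` is a stand-in for whatever model the project actually trains, and the helper names are hypothetical.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows with a fixed seed and split into train/test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def majority_baseline(train_labels):
    """Return a model that always predicts the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: most_common


def accuracy(model, rows):
    """Fraction of (features, label) rows the model predicts correctly."""
    return sum(1 for x, y in rows if model(x) == y) / len(rows)


# Synthetic data: one numeric feature, label is 1 when the feature is >= 5.
data = [(x, int(x >= 5)) for x in range(20)]
train, test = train_test_split(data)

baseline = majority_baseline([y for _, y in train])


def threshold_rule(x):
    """Stand-in for a trained model; here it matches the labeling rule."""
    return int(x >= 5)


print("baseline accuracy:", accuracy(baseline, test))
print("model accuracy:", accuracy(threshold_rule, test))
```

If a model barely beats the majority baseline, that is itself a finding worth reasoning about in the evaluation section.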
You might say, "I expected them to perform well overall, and that looks reasonable, but I want to look at some of the errors, and I see some issues." This is also where you can potentially say, "I can actually do better," or, "Because this method is designed this way, it doesn't really address this particular challenge." Fine. This is the step where, as you're getting preliminary results, you start doing the reasoning and start talking about the trade-offs: when one method works better and when another may work better. It's totally fine if your results are not perfect and you say, "I tried this, but the results really don't look good." That's not how we define the success or failure of a project. It's really more about your understanding of why something worked, why something doesn't work, and how you may be able to make it better.
Then the discussion section. This section serves as [inaudible] talking about how your project is progressing, any challenges, and any changes, so basically adding more of the reasoning. For your proposal report, we asked you to put in a timeline: when you were planning out your project, what you planned to do, how much time each component would take, and your current status. Now, take that and update it. You may say, "Okay, everything's on track. I have finished all this. I'm following the timeline." Great. Or, "That took a bit longer, so I'm slightly delayed, and I'm going to adjust the timeline a little to make sure I still have a good finish for my project." Or, "I've spent some time on this and it's all good, but now I will shift my project a little bit, so I have an updated task list and timeline." All good.
Also, update your current status part to say, "Okay, this is what I have accomplished already, and these are the things that remain to be done."
Also, challenges. These could be potential challenges down the road, challenges you have already faced, or ones you're facing right now. Talk about challenges that have occurred or may occur in your project. If you have adjusted, great: talk about how you tackled that particular challenge. We still keep this notion of alternative approaches, a backup plan. You may have written something for your proposal stage, and now you say, "Great, I figured it out, so I don't think this is a problem anymore." Or, "I made some progress, but this is still a part I haven't figured out, so I may still run into some problems." That's okay; just update your planning and reasoning about how the remaining part of the project will be carried out.
Here, again, you're highlighting some of the changes, and also any lessons you have learned. These could be good lessons or bad lessons. For lessons that worked: "I tried this, and it really helped me in the further analysis." Great. Or, "Initially, I didn't do this, but it turned out it was really limiting my analysis down the road, so I had to go back and change something." Those are all good. In the end, this is how we learn. Keep this in mind as you're writing your report, especially when discussing the challenges, the changes, the reasoning, and what may come down the road. The reasoning is the key part.
Finally, conclusion. Now, looking at the conclusion you wrote for the proposal report, update it.
By now, you probably have a lot more information and a better understanding of the project, so update your summary to say, "Okay, in this project, this is what I'm doing, and this is what I have accomplished so far," and state the key findings. Of course, you may only have preliminary results, but if you have one or two interesting findings already, put them in, because that's how you provide the takeaway message. Your conclusion should highlight the interesting parts that you have identified. Of course, the future work will come later when you actually finish your project, but that's how you summarize your report.
So, basically, as you're writing your checkpoint report, you're taking your proposal report, going through the individual sections, and seeing how you can update each one. Update it with new information, update it with the changes, add the results or progress, and also reason about how things are working out, or about things that had to be changed because of some challenges. All of those are good. That shows you're learning in this process and shows your capability as an independent data scientist, because you are planning and figuring out your project. That's really what we're trying to accomplish with the project checkpoint stage.
As you're updating your checkpoint slides and your checkpoint report, keep all this in mind and see where things are in terms of your progress and how things may have changed. Whether things worked out just as planned or had to be adjusted, all of that is good; just put it in. This is the real-world experience of working on a project.
We start with something that we think would be very interesting, that we'd like to pursue, and we have some plan. By the checkpoint, it's like, "Okay, we have a better understanding, things are working out to some extent, there are also things that we are changing, and, of course, there are things we're learning along the way." That's all for the project checkpoint. Thank you. We'll see you later.