Hello everyone, and welcome to this course on how to analyze log data on the Alibaba Cloud Big Data platform. Actually, it is not only about Alibaba's products. We will also talk about how the whole process goes on a traditional system like Hadoop. If you want to know that too, I think you are in the right place.

This course is made up of three topics. One is what we can get from these logs. As you know, we generate tons of logs every day, and we want to make full use of that data rather than throw it away like we did before. All the big companies put a lot of effort into dealing with this data, and there should be a reason. Two, I believe you will realize how important the data is after step one, so you are going to look for a way to process it. The first thing that comes to mind is a system that can scale out, like Hadoop, because one of the original uses of a distributed system like Hadoop was to store and process massive volumes of data. We are going to talk a little more about how to do it on Hadoop. Three, we will focus on the Big Data platform powered by Alibaba Cloud. Let's get started.

The first thing I would like to talk about is what we can get from the log data. To make this more interesting, I suggest you think of it as a role-playing game. From now on, we are the engineers who work for a fictional [inaudible] education corporation. That means we've got a website with all our courses on it. People visit our website and leave a trail on the server. That trail is typically captured in semi-structured form, that is, weblogs. Our mission is to help our organization get better insight by analyzing the data. All the examples will be in this context. In this section we will go through the business problems we've got and what we engineers can do, and then we will review the value and show it to senior management. Here are the business issues that matter. Only four are listed here.
But in fact, in a real-world business scenario, there will be many more. Let's go through these questions. One, how can we get more people to know us? Two, which course is the most popular one? Three, why does it take so long to open our website? Four, why does the page look so messy after I changed my browser? These are real business issues that really affect the user experience. If [inaudible] the easier one, I will pick number two, because you can see the result by running a not-so-complicated SQL statement over a small data set (not so small sometimes). That works because it is structured data stored in an RDBMS, that is, a relational database management system, like Oracle and MsSQL. But what about the other three? We cannot get that information from the transactional data. That's why we need the web clickstream log data. You can find the information for all the other three questions, and many, many more questions, from it.

Let's see what we can get from the clickstream log data. I should confess first that I don't know much about this gentleman. All I know is that he is a Nobel Prize winner in economics and that one of his sayings is famous. I would like to quote it: "If you torture the data long enough, it will confess."

Let's start with the data. When someone clicks a link, types a URL, or submits a form, the browser sends a request to the server. It might be asking for a page or sending data, but either way, that is called an HTTP request. A web server like Nginx, httpd, or Tomcat will generate one or several lines of data like this. I list five important analysis metrics here. Let's go through the whole line in detail. If you split the line on whitespace, the first field is the IP address. From the IP address you can tell the user's location and network operator, and so the distribution of your customers across the country or across the whole world.
You may then put your servers with that operator and in that location.

The second one is the time. You can see the time is divided into two parts by whitespace. The first part is the date and time information. The second is the time zone information. It shows how long a user stays on your website and at what time most users visit. Once you know when the rush hour of your business is, you can adjust your optimization accordingly. About the time zone, there is an interesting question you may be thinking about: how many time zones are there? Generally, what we know is that if each time zone were one hour apart, there would be 24 in the world. But several time zones have only 30- or 45-minute offsets. Places like India and Afghanistan, and North Korea for a few years, use offsets that are not a whole number of hours, which makes the total number of time zones worldwide much higher. As far as I know, there are 39 different local time zones in use.

The next one is the request. It has the request method, such as POST or GET, the content you are asking for, and the HTTP protocol and version. Then comes 200, which is the HTTP status code. Usually it is invisible to the user, though I'm sure you have seen some of the very common response codes, such as 404, which means the page you requested was not found. Codes in the 500 range mean that something went wrong on the web server.

Then we can get the browser information. It answers one of the questions we listed just now, that is, why does the page look so messy after I changed my browser? You can collect information on the different browser types that your customers use, and optimize your web page style and content for each kind of browser, like Chrome, Safari, Firefox, etc.

The last one listed here is the user source. It shows where the user comes from: whether they visited the website directly or were redirected from a search engine after typing some keywords into the search page.
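As a rough illustration of splitting one log line into the fields just described, here is a minimal Python sketch. It assumes the "combined" access-log layout that Nginx and Apache httpd can both write; the sample line, the regular expression, and the `parse_line` helper are made up for illustration and are not taken from the course's own data.

```python
import re
from datetime import datetime

# Assumption: the log uses the "combined" access-log layout:
# ip ident user [time zone] "method path protocol" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # The bracketed field holds the date and time, a space, then the zone offset.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record


# Hypothetical sample line (documentation IP address and example domain).
SAMPLE = (
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0800] '
    '"GET /courses/big-data HTTP/1.1" 200 5120 '
    '"https://www.example.com/search?q=big+data" '
    '"Mozilla/5.0 (X11; Linux x86_64) Firefox/118.0"'
)

record = parse_line(SAMPLE)
print(record["ip"], record["time"], record["method"], record["status"])
print(record["referrer"])
print(record["agent"])
```

A named-group regular expression is used here instead of a plain split on whitespace because the request, referrer, and browser fields themselves contain spaces.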
If most of your users are direct visitors rather than arrivals from search engines, you may suggest that the marketing team put more budget into advertising on search engines like Google, Bing, and Yahoo. These things are not so hard to tell. What you are going to do is dig more information out of them.

After this course, you will be able to make some of these dashboards. Let's take a look at the first. Page view, also called PV, stands for a single click on the website. Unique visitor, also called UV, stands for one single user. One user can make anywhere from one to many page views. Let me ask you a question. If these two statistics are similar in your case, what does it mean? It means that your website is not so attractive, and that people just left after seeing one of your pages. In most cases, the page views will be many times more than the unique users.

What can you tell from the number of unique IP addresses and the number of unique users? That depends. For most cases in China, if there are more IP addresses than unique users, it means that most of your customers visit your website from home, because home connections usually do not have a fixed IP address. If not, it probably means that you have more visitors from company offices, because each company usually has a fixed IP address.

You can also generate a chart to show the traffic at different times of each day. It will show you the rush hour that you need to pay more attention to. This is the user source information we just talked about: direct visits to your website, other links, and search engines. In this case, you have more users from direct visits, so you need to increase your budget for search engines.

This one shows the total number of requests with HTTP code 404 for each source IP. A high count means the source IP is most likely a bad one.
It is like owning a convenience store: the bad guys pretend to be ordinary customers in order to exhaust your limited resources, so we need to put the IP addresses listed here on the blacklist.

The last one is the user distribution chart. It shows where your customers are from, and so indicates where you should put your servers. If a large number come from the north of China, you should move your servers into data centers located in the north.
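To make the dashboard numbers a little more concrete, here is a minimal Python sketch of how page views, unique IP addresses, traffic by hour, user source, and 404 counts per IP could be computed. Everything in it is an assumption for illustration: the `RECORDS` list is hypothetical, already-parsed data, the IP address plus browser pair is only a rough stand-in for a unique user (a real site would more likely use a cookie or login ID), and the `SEARCH_ENGINES` list is just the three engines mentioned earlier.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical, already-parsed records; a real job would read them from the weblog.
RECORDS = [
    {"ip": "203.0.113.7", "agent": "Firefox", "hour": 9, "status": 200, "referrer": "-"},
    {"ip": "203.0.113.7", "agent": "Firefox", "hour": 9, "status": 200, "referrer": "-"},
    {"ip": "203.0.113.7", "agent": "Chrome", "hour": 20, "status": 200,
     "referrer": "https://www.google.com/search?q=course"},
    {"ip": "198.51.100.9", "agent": "curl", "hour": 20, "status": 404, "referrer": "-"},
    {"ip": "198.51.100.9", "agent": "curl", "hour": 20, "status": 404, "referrer": "-"},
    {"ip": "192.0.2.44", "agent": "Safari", "hour": 20, "status": 200,
     "referrer": "https://blog.example.org/post"},
]

# Assumed host fragments used to recognize search-engine referrers.
SEARCH_ENGINES = ("google.", "bing.", "yahoo.")


def classify_source(referrer):
    """Label a request as a direct visit, a search-engine visit, or another link."""
    if referrer in ("", "-"):
        return "direct"
    host = urlparse(referrer).netloc.lower()
    if any(name in host for name in SEARCH_ENGINES):
        return "search"
    return "link"


def summarize(records):
    """Compute the dashboard-style counts discussed in this section."""
    return {
        "page_views": len(records),
        "unique_ips": len({r["ip"] for r in records}),
        # Rough approximation of unique users: one per (IP, browser) pair.
        "unique_visitors": len({(r["ip"], r["agent"]) for r in records}),
        "by_hour": Counter(r["hour"] for r in records),
        "by_source": Counter(classify_source(r["referrer"]) for r in records),
        "not_found_by_ip": Counter(r["ip"] for r in records if r["status"] == 404),
    }


summary = summarize(RECORDS)
rush_hour = summary["by_hour"].most_common(1)[0][0]
print("PV:", summary["page_views"], "UV (approx.):", summary["unique_visitors"])
print("Unique IPs:", summary["unique_ips"], "Rush hour:", rush_hour)
print("Sources:", dict(summary["by_source"]))
print("404s by IP:", summary["not_found_by_ip"].most_common())
```

On a real platform the same counts would more likely be expressed as grouped queries over the full log table. The in-memory version is only meant to show what each chart is counting.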