This is video four of the third course. This video is about files. We will first do a short introduction with some generalities about using files from a program. After that we will talk about the three ways we can manipulate files in C: the open function and all the related functions, the fopen function and all the related functions, and the mmap function. After that we will talk about what is important to consider when we work with multiple files at the same time. We will open the discussion on the use of transactions to make manipulating files safer. And we will finish with a quick conclusion.
Introduction. A program works with many kinds of files. The first is an input file. An input file generally comes from the user, and in this case the code must be prepared to handle an empty file, a file shorter than expected, or a file longer than expected. If the file is longer than expected, it is very important to never read the extra data and put it in memory. Most probably, if the file is longer than expected, it is because someone malicious added code that they will try to execute using a buffer overflow exploit or something like that. The second kind of file the program works with is an output file, so a file where we store the result of our processing. In this case the program needs to be careful and consider that a write may fail, just because of size limitations or other policies on the system. The program must also be careful not to leak data into the output file. The third kind of file the program manipulates is a data file, so a file that contains information used every time the program runs. The first important thing is that the user who runs the program may not be able to manipulate this file directly. So the program is responsible for manipulating this file, using the special privileges the program has, in a way that does not leak the data the file contains.
It must also do so in a way that keeps the file intact, so that the data in this file is not corrupted in any way.
Open. When we open a file with the open function, or the CreateFile function on Windows, we have many functions we can use to manipulate files. I have listed them here in the table, with the name of the function on Linux and the name of the function on Windows. When an application calls the write or the WriteFile function, the data is not written directly to the file, unless we request that when opening the file. Normally the data is simply put into the system cache and is transferred to the physical disk when the device is available. The operating system does that to improve performance, and it improves performance in two ways. First, if the data is written again just afterwards, the data is changed in the cache and we save a physical access. Also, giving priority to read operations on the physical disk reduces application wait time. So the system does the reads first and waits for the disk to be available to do the writes. This way we reduce the application wait time and the user wait time. If the application crashes, the data in the file cache is still saved to the disk by the operating system, so we don't have to worry about that. If the operating system crashes, some data can be lost, because the operating system is not able to save the data that is still in the cache and has yet to be saved.
We are now going to look at an example, the clientlist_0 example. This is a very simple program. When a new customer subscribes to a data service, they provide the destination URL and the transfer frequency. Another part of the system then calls the clientlist_0 executable to add the new client to the client list. In this example, the client list is kept in a binary file where all records are the same size, which makes it easier to navigate from one record to another.
After this example runs, we just imagine that some other part of the system, so another program that runs periodically, sends the data to the client's server as needed, respecting the requested frequency. The code works on Windows and on Linux, but I prefer to present it to you on Windows. We see here that we have some differences depending on the operating system. The program accepts four arguments. The first one is the command, so ADD or REMOVE, then the ID, so the customer ID, then the address and the frequency. At the beginning we have the validation of the argument count. Here we only validate that we have the command argument and the ID. Here we create, open would be a better word, the data file. We request read and write access, because we need to be able to add a new customer or remove a customer. Depending on the first argument, we execute the add function or the remove function, and after that we just close the data file.
We'll first look at the remove function. If we want to remove a client from the client list, we call the program with REMOVE. This is the remove function. At the start we validate the number of arguments, because we need an extra argument, which is the ID of the client to remove. We parse this argument. And here we read the ID from the file, the ID in the first record. If the ID matches, we just rewind the file and write back a zero as the ID to indicate that the record is no longer valid. And if the ID does not match, we just go to the next record and begin again. So let's just try it. At this time the file is empty, so if I do that I just get an error: ID not found. We'll try again later with some data in the file. Even if this code is not really perfect, it does not present any security issue: it is safe to remove an ID if we find it. The removal either works or does not work, but it will not corrupt the file.
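The remove logic described above can be sketched with the POSIX calls as follows. This is only an illustration, not the code shown in the video: the function name `remove_client`, the 64-byte `RECORD_SIZE` and the 32-bit ID stored at the start of each record are assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define RECORD_SIZE 64   /* assumed record size, with a 32-bit ID stored first */

/* Returns 0 if the record was found and invalidated, 1 if the ID was not
 * found, and -1 on an I/O error or an invalid ID. */
int remove_client(int fd, uint32_t id)
{
    if (id == 0)
        return -1;                       /* 0 is the "removed" marker itself */
    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;

    for (;;) {
        uint32_t current;
        ssize_t n = read(fd, &current, sizeof current);
        if (n == 0)
            return 1;                    /* end of file: ID not found */
        if (n != (ssize_t)sizeof current)
            return -1;                   /* read error or truncated record */

        if (current == id) {
            uint32_t zero = 0;
            /* Step back over the ID we just read and overwrite it with 0. */
            if (lseek(fd, -(off_t)sizeof current, SEEK_CUR) < 0)
                return -1;
            n = write(fd, &zero, sizeof zero);
            return n == (ssize_t)sizeof zero ? 0 : -1;
        }

        /* Skip the rest of this record to reach the next ID. */
        if (lseek(fd, (off_t)(RECORD_SIZE - sizeof current), SEEK_CUR) < 0)
            return -1;
    }
}
```

The only write is a single small overwrite of an existing field, which is why a failed removal leaves the file layout unchanged.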
We will now look at the add function. So this is the add function. Same thing, we revalidate the number of arguments here, because we need three arguments when we do an add: the ID of the customer, the address or URL where to send the data, and the frequency of the data transfer. We parse the values of the two numerical arguments. We go to the end of the file, because we want to add a new client to the file. So we move to the end, and here we start to write. The first thing we do is write the ID. After that we copy the address from the argument into a variable. The program is very careful not to cause any buffer overflow. We first initialize the memory, in order to avoid leaking data from memory, as we have seen earlier in the course. After that, we copy the data to the variable and we take care not to overflow this local variable. If the address is longer than the variable, it is simply truncated, and the rest of the program works the same way. The only consequence is that when we try to transfer data to this address, it will most probably not work, and at that time we will fix the situation. Once we have truncated the address, if needed, we just write it to the file. After that we compute the period, because the argument is provided as a frequency but in the file we keep a period in minutes. So we take the number of minutes in a day and we divide it by the frequency. And we write the period to the file.
Now, how does it work if an error occurs? If, for example, something prevents us from writing the address, most probably because the disk is full or something like that, we call the AddAbort function. Let's look at what this function does. It is just a little bit above here. Again, there are two versions of the function, one for Linux and one for Windows. We go back to the offset that was the beginning of the new record and we truncate the file there.
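A minimal sketch of this write-then-revert pattern on POSIX systems follows. It is not the code from the video: the names `add_client`, `add_abort` and `write_field`, and the field sizes (32-bit ID, 58-byte address, 16-bit period), are assumptions chosen to give a 64-byte record. The period is passed in already computed, so the sketch shows only the revert logic.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define ADDRESS_SIZE 58   /* assumed field size, giving a 64-byte record */

/* Truncates the file back to where the new record started.
 * Returns 0 on success, -1 on error. */
int add_abort(int fd, off_t record_start)
{
    if (ftruncate(fd, record_start) != 0)
        return -1;
    return lseek(fd, record_start, SEEK_SET) < 0 ? -1 : 0;
}

/* Writes one field; a short write counts as a failure. */
static int write_field(int fd, const void *data, size_t size)
{
    return write(fd, data, size) == (ssize_t)size ? 0 : -1;
}

/* Appends ID, address and period as three separate writes.
 * Any failure reverts the file to its state before the call. */
int add_client(int fd, uint32_t id, const char address[ADDRESS_SIZE],
               uint16_t period_minutes)
{
    off_t record_start = lseek(fd, 0, SEEK_END);
    if (record_start < 0)
        return -1;

    if (write_field(fd, &id, sizeof id) != 0 ||
        write_field(fd, address, ADDRESS_SIZE) != 0 ||
        write_field(fd, &period_minutes, sizeof period_minutes) != 0) {
        add_abort(fd, record_start);   /* remove the partial record */
        return -1;
    }
    return 0;
}
```

The revert only helps when the program is still running after the failure; it does nothing if the process dies between two of the writes.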
The Linux version does the same thing. So we will not leave a file that has a valid ID in a record without a valid address or without a valid period. The same thing happens if we are not able to write the period to the file. So that is the revert part: if an error occurs, we just go back and put the file in the same state it was in just before we started the operation. This part is very good. This other part is less good, and we will see why in a few minutes. Let's just try the add for now. So I will ADD client 1. I want to send the data to an FTP site, and I want to send it six times every day. And it works. If we look at the file, we see that we have 64 bytes in the file. I can also look at the content of the file, so we'll just drop it into Visual Studio. And we have the file: the client ID, the address, and at the end the period. Now if I want, I can use REMOVE 1. The file still exists, but the information in the file is no longer valid, because the client ID is zero. We can validate it. And yes, the client ID is now zero.
So let's get back to the code. Some of you may have noticed that we do a division with data that comes directly from the user. So it may be a security issue. It is clearly a security issue, because the user may provide zero as the frequency. Then the period cannot be calculated: the divide by zero closes the program with an error. We already talked about the divide by zero error, so it is not the goal of this example to show that. What is more of a problem in this case is that we have already written the customer ID. So the file contains a valid client ID, and the record will be considered valid by the other part of the system. And we have also already written the address to the file. So yes, the other part of the system will be able to send data to this address. But the last part of the record will not be written, so the period will not be set.
The first problem is that the file will not be of the correct length. So if we try to add a new customer after that, the program will most probably run, but the add will not work correctly, because the file size will not be a multiple of the record size. So we end up in a situation where the program no longer works as expected because of the input of the user. Let's try. So I do an ADD, but rather than going with a frequency of six, we go with a frequency of zero. And yes, I have a crash. If I look at the file, the file size is 126. It's really not good, because it should be 128, so 64 multiplied by two. If I try to add a new client, client 2, with a valid frequency, the program works, but again the file is too short, so the record is misplaced and the program no longer works correctly. We can see it here. Here we have the first record, which is no longer valid. That's okay. The second record is the record where the divide by zero happened. Then the other record we tried to add just starts at the wrong offset. So the other part of the system will not be able to work with this file. So because of an error that a user can cause, the program no longer works correctly. Or, in this case, the second record will work: its period will depend on the next ID we try to add. So the customer who did that will receive the data, and most probably will not be charged, because the addition failed. So it's a clear security issue. Just before leaving, I will scroll through the code this way, so if you want to look at the code, you will be able to pause the video and look at it. And you will also be able to look at the code needed to make this code work on Linux.
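Both symptoms discussed above, the division by a zero frequency and a file whose size is no longer a multiple of the record size, can be detected before anything is written. The sketch below is one possible way to do it and is not part of the example from the video; `compute_period` and `file_has_whole_records` are hypothetical helper names, and the 64-byte record size and 16-bit period are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <sys/stat.h>

#define RECORD_SIZE     64     /* record size reported in the example */
#define MINUTES_PER_DAY 1440

/* Converts transfers per day into a period in minutes.
 * Returns 0 on success, -1 if the frequency cannot give a usable period. */
int compute_period(unsigned long frequency, uint16_t *period_minutes)
{
    if (frequency == 0 || frequency > MINUTES_PER_DAY)
        return -1;          /* would divide by zero, or round down to 0 */
    *period_minutes = (uint16_t)(MINUTES_PER_DAY / frequency);
    return 0;
}

/* Returns 1 if the file size is a whole number of records, 0 otherwise. */
int file_has_whole_records(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return 0;
    return st.st_size % RECORD_SIZE == 0;
}
```

Calling `compute_period` before the first write means a bad frequency is rejected while the file is still untouched. With the numbers from the demonstration, a frequency of 6 gives a period of 240 minutes, and a 126-byte file is reported as inconsistent.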
Remember that this code is present on the virtual computer we provide with the course for students to do the final project, but you can also use it to look at the code or even try it.
The example client list 08 is a fix for the security issue we described in the previous example. So we'll just look at the changes in the code. The problem was in the add function, so the fix is also in the add function. Here is that function. This time we use a structure that represents a record in the file. We can take a look at the record, so we go to the definition. We see that the structure is exactly what we have in the file: the ID, the address and the period in minutes. We initialize the whole structure, and this way we avoid leaking data. Remember the part of the course about that vulnerability. We parse the client ID from the command line argument and put it in the structure. We copy the address into the structure, taking care not to overflow the buffer, and we also parse the frequency from the argument and compute the period. I left the divide by zero possibility here because it was not the problem I was trying to fix. Then I move the file pointer to the end of the file in order to add the record, and I write the whole record at once. That's it.
There are some other notes I want to make on the security side of using files. First of all, the size of the record: it helps if the size is, not a multiple, but a power of two, like this one. We see that the size of the structure is 64 bytes. This avoids having a single record cross a block boundary on the disk or in the file system cache. So a record cannot end up with one part written to the disk and another part not written because of a system crash or something like that.
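The fix described above can be sketched as follows. The exact field sizes are not given in the video, so the layout below (32-bit ID, 58-byte address, 16-bit period) is an assumption chosen to reach the 64-byte record size, and `add_record` is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define ADDRESS_SIZE 58   /* assumed, so that the record is 64 bytes */

struct record {
    uint32_t id;
    char     address[ADDRESS_SIZE];
    uint16_t period_minutes;
};

/* Compile-time check that the record size is the expected power of two. */
_Static_assert(sizeof(struct record) == 64, "record must be 64 bytes");

/* Builds the whole record in memory, then appends it with a single write.
 * Returns 0 on success, -1 on error. */
int add_record(int fd, uint32_t id, const char *address,
               uint16_t period_minutes)
{
    struct record rec;
    memset(&rec, 0, sizeof rec);          /* no stale stack bytes reach the file */

    rec.id = id;
    size_t len = strlen(address);
    if (len > ADDRESS_SIZE - 1)
        len = ADDRESS_SIZE - 1;           /* truncate, keep the terminating 0 */
    memcpy(rec.address, address, len);
    rec.period_minutes = period_minutes;

    if (lseek(fd, 0, SEEK_END) < 0)
        return -1;
    return write(fd, &rec, sizeof rec) == (ssize_t)sizeof rec ? 0 : -1;
}
```

Because every field is computed before the single `write`, a failure while preparing the record happens before the file has been touched.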
So it's really a good practice to have a record size that is a power of two and smaller than a disk block, which is generally 4 KB on modern operating systems. It's a good security practice and a good programming practice too. So, to do a quick recap: fewer write operations are generally safer, and, to make it a more general statement, simpler code is generally safer.
To manipulate files, we can also use the fopen function and all the associated functions. The difference between open and fopen is an internal application buffer. The problem with open and the read and write functions is that every time the program calls read or write, the execution has to communicate with the operating system in order to transfer the information. If we do a lot of small data operations, this adds significant overhead. So the people who created the C language also created the fopen function and the fread and fwrite functions, which use an internal buffer in the application. When we do an fread or an fwrite, the data is first moved to the application buffer, and the real operation takes place only when this buffer is no longer large enough to contain the information. So we minimize the number of times the application has to talk with the operating system to transfer data. This application buffer creates a risk for the correct functionality of the program, and also for security: if the application crashes, the data that is in this buffer, waiting to be put on the disk, is lost. This table shows the functions we use with the fopen family. The right column shows that some of these functions have a safe version on Windows, and when running on Windows it is better to use them.
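The effect of the application buffer can be observed with a small sketch like the one below. It writes through the stdio buffer and records the file size the operating system reports before and after an explicit `fflush`. The helper names `size_on_os` and `write_and_flush` are hypothetical, and the exact buffering behaviour depends on the C library, so the "before" value is only typically zero for a small write.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>

/* Size of the file as the operating system currently sees it, or -1. */
long size_on_os(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 ? (long)st.st_size : -1;
}

/* Writes `size` bytes through the stdio buffer and reports the file size
 * seen by the OS before and after fflush. Returns 0 on success. */
int write_and_flush(const char *path, const void *data, size_t size,
                    long *before_flush, long *after_flush)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;

    int ok = fwrite(data, 1, size, f) == size;
    /* A small write usually still sits in the application buffer here,
     * so the OS typically reports a size of 0. */
    *before_flush = size_on_os(path);

    if (ok && fflush(f) != 0)             /* hand the buffer to the OS now */
        ok = 0;
    *after_flush = size_on_os(path);

    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

If the process were killed between the `fwrite` and the `fflush`, the bytes still held in the application buffer would never reach the operating system.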
I want to make a quick note about the last line in this table, fflush. This function simply flushes the buffer. So when the buffer is waiting to be written, waiting to have more data in order to improve performance, fflush writes the data anyway and hands it to the operating system right away. If a program uses this family of functions to manipulate a data file, it should flush the data just after completing an operation. So rather than waiting for the normal flush, you can decide when it's appropriate to flush the data to the disk. Doing that clearly degrades performance, because we use the buffering precisely to avoid small writes to the disk, and flushing very often just doesn't let the library do what it tries to do. The other possibility is trying to detect an application crash in order to flush the application buffer to the disk. So we let the fread and fwrite functions do their normal work, and if the application crashes, we detect it and call fflush just before the crash. We will see how to do that in the video about handling signals.
Another way to manipulate a file is to use the mmap function. When we use the mmap function, what we do is use the memory manager of the operating system. We create a range of addresses in the process address space, and each time the process writes or reads data in this range, it accesses the file. It works the same way the operating system implements virtual memory, and it's very efficient. Writing or reading a small amount of data is clearly not a performance issue this time. Yes, the data is transferred from the disk in large pages, but the operation is really quick even if we just read or write a small amount of data.
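As a rough illustration of the mapped approach on Linux, the sketch below maps a whole file of fixed-size records and changes one ID with an ordinary memory write. It relies on the POSIX `mmap`, `msync` and `munmap` calls; the function name `set_id_mapped`, the 64-byte record size and the 32-bit ID at the start of each record are assumptions made for the example, and the descriptor is assumed to be open for reading and writing.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define RECORD_SIZE 64   /* assumed record size, with a 32-bit ID stored first */

/* Sets the ID of record `index` through a shared mapping of the file.
 * Returns 0 on success, -1 on error or if the record does not exist. */
int set_id_mapped(int fd, size_t index, uint32_t id)
{
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0)
        return -1;
    size_t size = (size_t)st.st_size;
    if (index >= size / RECORD_SIZE)
        return -1;                        /* never touch memory past the file */

    unsigned char *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return -1;

    /* A plain memory write: the OS tracks the dirty page for us. */
    memcpy(base + index * RECORD_SIZE, &id, sizeof id);

    /* Ask the OS to write the modified pages back before we continue. */
    int rc = msync(base, size, MS_SYNC);
    if (munmap(base, size) != 0)
        rc = -1;
    return rc;
}
```

The bounds check matters more here than with `read` and `write`, because an access outside the mapped range is a memory error rather than a short read.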
And the important thing too is that even if the application crashes, all the writes that happened will find their way to the disk, because the operating system remembers the pages and writes them back to the disk when the disk becomes available. If the operating system crashes, some data can be lost for sure, because some pages that had to be written will probably not be written. Here we see the functions on Linux and on Windows to map the file. We first open the file with open or CreateFile, exactly as we would open it the normal way, and after that we map the file. If we want to ensure that everything written to the mapped memory is correctly written to the disk, we can use the msync function, or the FlushViewOfFile function on Windows.
Multiple files. The problem we saw, where a program corrupts a data file because of an error or because of a malicious action, is more likely to happen if the program manipulates multiple files, because some files can be related to each other. Think about a file with the invoices and a file with the inventory. If we invoice something, we also want to change the inventory to reduce the number in stock. If we are able to change one but not the other, we have a problem with data integrity. In many cases both files need to be modified at the same time, and computers don't do many things at the same time. So it naturally happens one operation after another. Something can happen between the two, and it causes a problem for data integrity. One way to solve this kind of problem is using transactions. When we talk about transactions, we talk about a way to modify multiple files in an atomic manner, so exactly as if all the modifications happened at the same time.
So the goal of a transaction is to always either execute the transaction completely or not execute it at all. A transaction can contain many write operations, to one or many files: many operations on the same file or many operations on different files. If a program wants to use transactions, it first creates a transaction. After that it executes the write operations using the functions that support the transaction model. And after that it commits the transaction. It is at the commit that all the writes are confirmed and happen on the system. If one of the operations fails, write operations or others, because we will see on the next slide that a transaction can contain more than just write operations, the program does not commit the transaction. It rather reverts the transaction. And if the program creates a transaction and fails to commit it, because the application crashes or because the code simply has an error and does not call the commit function, the operating system will not commit the changes. It uses either a timeout or the end of the execution of the application to decide that the transaction has not been committed, and it reverts all the parts of it that have been executed so far. On Windows, transactions are supported, and the function we use to create a transaction is the CreateTransaction function. We have many functions to execute operations that will be part of a transaction. CreateFileTransacted lets us open a file in a transacted way, so all our operations on it will be part of the transaction. We also have other functions to copy a file, move a file, delete a file and other things. And at the end we have CommitTransaction or RollbackTransaction. At some point in the processing of a transaction, the program may decide that it cannot complete the transaction.
In that case it can call RollbackTransaction rather than CommitTransaction, to roll back all the changes already made in this transaction.
It's now time for the conclusion. First, reading a file is always safe: we will not corrupt a file by reading it. Writing a simple file is easy to do safely, using a few simple tricks. Writing complex or multiple files is challenging, and this is the main thing to remember from this video. Using transactions can help. Sometimes we just don't have the choice: if we think about people writing a database library or database software, they clearly have no choice but to use transactions to make things safe. Often, when we program, if complex or multiple files are needed, using a database is a very good way to go. It avoids having the problem in your hands: we just leave this problem in the hands of the people who create the database services and the database software. So if the program needs to handle multiple files in a consistent way, it is often the better way to go. This is the end of this video. The next video is about standard input, standard output and standard error.