Today we're going to talk about string processing, and we're going to start with string sorting and take a look at some classic methods. But first we need to talk a little bit about just what strings are. That's really dependent on the programming language you're using: different programming languages nowadays have completely different implementations of strings. So, to get started, we need to take a look at efficient implementations of basic operations on strings. We're going to tailor our algorithms to Java's particular implementation. They can be made to work in other situations, but as a starting point you have to be specific.

So, what is a string? A string is just a sequence of characters. It's a very important fundamental abstraction that's been with us from the beginning of information processing. Pretty much everything we communicate is a string: our email is strings, our programs are strings. Another important area of significance that has arisen recently is computational biology, where our understanding of the way that life works depends on genomic sequences, and it's essentially based on string processing. We'll see examples of that later on. This is a quote that talks about that issue: "The digital information that underlies biochemistry, cell biology, and development can be represented by a simple string of G's, A's, T's and C's. This string is the root data structure of an organism's biology." So, we're not talking just about a data structure for information processing, but about a model of life itself.

Now, back to computers. Strings are made up of characters. What's a character? Well, the classical representation of a character is the so-called 7-bit ASCII code. The underlying data type is actually 8-bit, so you can have up to 256 characters, but for many, many years programmers used only 128 of those characters.
That would include all the upper case and lower case letters, the digits, and some punctuation. So that's 7-bit ASCII, and that's the standard for the C programming language, which is a very widely used language. Nowadays, people use what's called Unicode, where a character is a 16-bit integer. That allows for many, many more characters: two to the 16th, 65,536, instead of only 256. And that allows for encoding characters from many of the world's languages, as well as mathematical characters. So, it's a much more general and generous representation of what a character is. But it's important to be specific. In Java, the standard is that a 'char' is a 16-bit unsigned integer. Now, not all programming systems and applications have moved up to the Unicode standard. So, sometimes you'll find programs looking for one kind of encoding and not finding it, and this t-shirt is a joke that has something to do with that. It's supposed to be a heart, which is a valid character in some world, but not on that t-shirt. So, we'll come back to what a character is, and we're going to use the simpler version in between those two, where a character is an 8-bit integer.

Now let's talk about what a string is in Java. There's a built-in String data type. It's not quite built-in, but many of the features for processing strings are built into the Java language, so it's okay to think of it as being built-in. In Java, a string is a sequence of characters, and it's immutable: once you create a string, you can't change it. And the primary operations that you can perform efficiently with a string are these. Number one, you can find its length, which is just the number of characters in it. You can index into a string and get the 'iᵗʰ' character; that's the 'charAt' method. And you can extract a substring of the string to create a new string that's a contiguous subsequence of the characters in the string.
And one of the big features of Java's implementation is that you can get that operation done in constant time. Then you can also do what's called concatenation, and that's adding a character to the end of a string. That one can't be done in constant time in the standard Java String data type.

So, this is what the implementation looks like for the String data type in Java. The private instance variables are an array of characters; an 'offset', which is the index of the first character of the string in the array; and the 'length'. Also, to make it more efficient to search using string keys, Java keeps a private variable that holds the hash code for the string. So, once the string is built and the hash code is computed, then when it's time to use the hash code in a hashing algorithm, it's immediately available. The 'length' method simply returns that length. To get the 'iᵗʰ' character of the string, we add 'i' to 'offset' and return the character at that position in the array. And to create a string given an 'offset', a 'length', and a 'char' array, we just set those values. Then the key thing is the 'substring' method. Since all it involves is a reference to the immutable character array, the index of the first character, and a length, we can build a string in constant time just by copying the reference to the character array. So, that implementation should give you a good feeling for why the 'substring' method takes constant time for strings.

So this is the performance. A String is an immutable sequence of characters, and the underlying implementation is an immutable 'char' array, with instance variables that give the array, the offset, and the length. That means we can get the length in constant time, 'charAt' in constant time just by adding the offset and indexing, and 'substring' in constant time just by essentially copying those instance variables. But to concatenate, to make a new string that results from adding one character to a string, we have to create a whole new string and copy the characters into it, because the string itself is immutable.
So it takes time proportional to the number of characters in the string, and it involves making a new string. You can imagine string implementations, and they exist in various programming languages, where these performance guarantees are different. And actually Java has different implementations for applications where you might want different performance guarantees. If you work out the memory usage for a string of length 'N', it's '40 + 2N' bytes. You might consider using a 'char' array instead, but then you lose a lot of the convenience of being able to extract a substring pretty much instantly, and also the language features that support strings.

So, here's a different implementation of a sequence of characters in Java, one that is mutable: StringBuilder. The idea is that you can use this data type when you're building up a string a piece at a time, like maybe reading characters off standard input. The underlying implementation in this case is a resizing array of characters. When it fills up, it doubles, as we've done many times before, and it keeps the length as an instance variable. So with StringBuilder you can get the length in constant time. You can get characters in constant time just by indexing. And you can concatenate a new character in amortized constant time. Most of the time it's constant; every once in a while you might have to double, but you pay for that doubling with the number of operations that you did before it. The thing you lose, though, is that it takes linear time to extract a substring, because to extract a substring you have to make a new 'char' array that can be resized and so forth and is amenable to concatenation. So, those are two different implementations of a sequence of characters in Java, with importantly different performance characteristics, and we have to be mindful of that in applications.
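The String representation described above can be sketched in a few lines. This is a minimal model for illustration, not the JDK source: the class name 'SketchString' and all of its members are hypothetical. More recent JDK releases are understood to copy characters in 'substring' rather than share the array, so the constant-time 'substring' shown here reflects the design described in the lecture rather than any particular Java version.

```java
// Hypothetical model of an immutable string that shares its char array
// between substrings, as described in the lecture. Not the JDK source.
final class SketchString {
    private final char[] value;  // characters; never modified after construction
    private final int offset;    // index in value[] of this string's first character
    private final int length;    // number of characters in this string
    private int hash;            // cached hash code (0 means not yet computed)

    SketchString(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SketchString(char[] value, int offset, int length) {
        this.value = value;
        this.offset = offset;
        this.length = length;
    }

    int length() { return length; }

    char charAt(int i) {
        if (i < 0 || i >= length) throw new IndexOutOfBoundsException("index " + i);
        return value[offset + i];
    }

    // Constant time: the new object shares value[]; no characters are copied.
    SketchString substring(int from, int to) {
        if (from < 0 || to > length || from > to)
            throw new IndexOutOfBoundsException(from + ", " + to);
        return new SketchString(value, offset + from, to - from);
    }

    // Linear time: immutability forces a fresh array holding both strings.
    SketchString concat(SketchString that) {
        char[] joined = new char[length + that.length];
        System.arraycopy(value, offset, joined, 0, length);
        System.arraycopy(that.value, that.offset, joined, length, that.length);
        return new SketchString(joined, 0, joined.length);
    }

    // Computed on first use, then served from the cached field.
    int hash() {
        if (hash == 0)
            for (int i = 0; i < length; i++)
                hash = 31 * hash + value[offset + i];
        return hash;
    }

    @Override
    public String toString() { return new String(value, offset, length); }

    public static void main(String[] args) {
        SketchString s = new SketchString("hello world");
        System.out.println(s.substring(6, 11));  // world
    }
}
```

The design choice to notice is that 'substring' only creates a small object with a different 'offset' and 'length', while 'concat' has to allocate and fill a new array, which is where the linear cost comes from.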
And again, in other programming languages something like StringBuilder is more like the standard, and you just have to know what the implementation is. There's another one in Java called StringBuffer as well, which we'll skip for now.

So, here's a typical example, a simple computation: how do we efficiently reverse a string? You could use a String or you could use a StringBuilder. With String, you get to declare it almost like a built-in type and simply initialize it with the empty string. Then, to compute the reversed string, we go backwards through the original string and concatenate the characters, starting at the back, to create the reversed string. Or, with the StringBuilder data type, you create an object that uses the doubling array and use the 'append' operation. So, which one of these do you think is going to be more efficient for a long string? The answer is StringBuilder, because with the built-in String, every time you do a concatenation you have to make a copy of the whole string. So, if the string is of length 'N', it's going to take one plus two plus three, all the way up to 'N', which is about N²/2. So this algorithm takes quadratic time for a long string, and that's going to preclude using it for huge strings. As we've seen so many times, you can't be using a quadratic-time algorithm for a lot of data. On the other hand, with StringBuilder it's linear time, because each append operation is amortized constant time. So, that's a simple example.

Here's another example, a computation that we're going to look at later on, at the end of the lecture: how do we form an array of suffixes? We have an input string, and the suffixes of the string are the strings that you get by starting at each position. So the first suffix is the whole string. The next one starts at position one, the next one starts at position two, and so forth, each one a character shorter than the one before.
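The two string-reversal approaches compared above can be sketched side by side. The class and method names are illustrative assumptions, and the input in 'main' is a sample chosen for this sketch; both methods return the same result and differ only in running time.

```java
final class ReverseSketch {
    // Quadratic time: each += builds a brand-new String by copying the partial result.
    static String reverseWithString(String s) {
        String rev = "";
        for (int i = s.length() - 1; i >= 0; i--)
            rev += s.charAt(i);
        return rev;
    }

    // Linear time: each append is amortized constant on the resizing array.
    static String reverseWithBuilder(String s) {
        StringBuilder rev = new StringBuilder();
        for (int i = s.length() - 1; i >= 0; i--)
            rev.append(s.charAt(i));
        return rev.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseWithString("algorithm"));   // mhtirogla
        System.out.println(reverseWithBuilder("algorithm"));  // mhtirogla
    }
}
```

For short strings the difference is not noticeable; it is on long inputs that the repeated copying in the first version dominates.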
And so we have algorithms that gain efficiency by forming an array of suffixes of a given string. So how do we create that thing in the first place? Again, you can do it with String or you can do it with StringBuilder. Let's look at it with String. We get the length, and that's going to be the length of the array. Then, for all values of 'i', we set 'suffixes[i]' to the substring of 's' that you get by starting at 'i' and going all the way to 'N'. And that's our suffix array. And this is the corresponding code for StringBuilder. But now, in this case, the standard String is going to be linear, because there's only one underlying string and the substrings are just a few pointers into that string. Whereas for StringBuilder, we have to make a new string for each suffix, and there's a quadratic number of characters in all of those strings, so it takes quadratic time. So you can't use StringBuilder to build a suffix array for a huge string. Again, those are typical examples of string processing where it really matters which string implementation you're using. If you're using Java, these tradeoffs are clear. If you're using some other programming language, you'd better make sure that you know how strings are implemented before you even get started with string processing.

So, here's a simple computation that we'll be using. Suppose that we have two strings, and what we're interested in knowing is the length of the longest common prefix. Here is a static method that implements it. This function takes two strings as arguments. We only need to go as far as the length of the shorter of the two strings, so that's 'N'. Then we just start at the beginning and compare. As long as the characters are equal, we increment 'i'. And if we get to a point where they're not equal, that's when we return 'i', because that's the length of the longest common prefix. In this case they're not equal at four.
That means they match for four characters. And if we get to the end of one of them, then that whole string is the prefix, so we just return 'N'. So that's just a little bit of warm-up code, and the amount of time it takes is proportional to the length of the longest common prefix. If the prefix is short, like if the two strings have a different first character, then it's sub-linear: it doesn't have to look at all the data, just the amount that matches. The idea of a sub-linear-time algorithm for string processing is a really important one that we're going to be taking advantage of as we move into more complicated algorithms. For example, you can compare two strings without looking at all of their characters. You just have to find the first place that they differ. If you don't look at all the data, that's sub-linear time. We're going to see sorting algorithms that take advantage of that.

Now, we're not really going to do it in the code that we show in lecture or even in the book, but it's actually fairly easy to take many of the algorithms that we're going to look at and modify them so that they work for general alphabets. For different applications, it might be entirely appropriate to customize the code to a particular alphabet. For example, if the things being processed are numbers, like positive integers or account numbers, maybe only the 10 decimal digits can occur, so we might as well work with strings made from those 10 well-defined characters. In DNA there are only four characters, so we might as well know that we're working with four characters, and so forth. We'll often talk about the radix, which is the number of possible different character values in the string. Now, just to fix ideas, we're always going to use what's called extended ASCII, where the radix is 256, and the number of bits needed to represent a character is therefore the log base 2 of that: 8 bits for 256.
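The suffix-array construction and the longest-common-prefix method walked through earlier can be sketched as follows. The class and method names are illustrative assumptions, and the strings in 'main' are sample inputs chosen for this sketch. One caveat, stated as an assumption about the runtime: the linear-time claim for the String version relies on 'substring' sharing the underlying array as described in the lecture; on JDK releases where 'substring' copies characters, both versions produce a quadratic number of characters.

```java
final class PrefixSuffixSketch {
    // Suffixes via String.substring: suffixes[i] is s from position i to the end.
    static String[] suffixes(String s) {
        int n = s.length();
        String[] suffixes = new String[n];
        for (int i = 0; i < n; i++)
            suffixes[i] = s.substring(i, n);
        return suffixes;
    }

    // Same result via StringBuilder.substring, which copies the characters of every suffix.
    static String[] suffixesWithBuilder(String s) {
        int n = s.length();
        StringBuilder sb = new StringBuilder(s);
        String[] suffixes = new String[n];
        for (int i = 0; i < n; i++)
            suffixes[i] = sb.substring(i, n);
        return suffixes;
    }

    // Length of the longest common prefix of s and t.
    static int lcp(String s, String t) {
        int n = Math.min(s.length(), t.length());
        for (int i = 0; i < n; i++)
            if (s.charAt(i) != t.charAt(i))
                return i;   // first mismatch: exactly i characters matched
        return n;           // reached the end of the shorter string
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", suffixes("abcd")));  // abcd bcd cd d
        System.out.println(lcp("prefetch", "prefix"));           // 4
    }
}
```

Note that 'lcp' stops at the first mismatch, so when two strings differ in their first character it examines only one character from each, which is the sub-linear behavior described above.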
And when we talk about the performance of algorithms, we'll use 'R' and 'log R', just to make it clear that if we're working with a smaller alphabet or a larger alphabet, we can still use the algorithms, but the performance is going to depend on the radix. So, that's the introduction to string processing.
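The relationship between the radix 'R' and the number of bits per character can be checked with a small calculation. This is a sketch under assumptions: the class name and the method name 'bitsPerChar' are hypothetical, and the method rounds up to a whole number of bits so that alphabets whose size is not a power of two, such as the 10 decimal digits, are handled.

```java
final class RadixSketch {
    // Bits needed to distinguish `radix` different character values: ceil(log2(radix)).
    static int bitsPerChar(int radix) {
        if (radix < 1) throw new IllegalArgumentException("radix must be positive");
        // The largest value to encode is radix - 1; count the bits it occupies.
        return 32 - Integer.numberOfLeadingZeros(radix - 1);
    }

    public static void main(String[] args) {
        System.out.println(bitsPerChar(256));  // 8  (extended ASCII)
        System.out.println(bitsPerChar(4));    // 2  (DNA: A, C, G, T)
        System.out.println(bitsPerChar(10));   // 4  (decimal digits)
    }
}
```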