• In English we distinguish the three different entities: letters, words, and sentences. There is a certain parallelism between the fact that groups of letters make up words and the fact that groups of words make up sentences.


  • This situation also exists with computer languages. Certain character strings are recognizable words (GOTO, END . . .). Certain strings of words are recognizable commands. Certain sets of commands become a program (with or without data).


  • We begin with only one finite set of fundamental units out of which we build structures. We shall call this the alphabet.


  • A certain specified set of strings of characters from the alphabet will be called the language. Those strings that are permissible in the language we call words.


  • We shall wish to allow a string to have no letters. This we call the empty string or null string, and we shall denote it by the symbol Λ


  • Two words are considered the same if all their letters are the same and in the same order so there is only one possible word of no letters. For clarity, we do not allow the symbol A to be part of the alphabet for any language.


  • Meaning is something we do not refer to in formal languages. We are primarily interested in syntax alone, not semantics or diction


  • To be an acceptable specification of a language, a set of rules must enable us to decide, in a finite amount of time, whether a given string of alphabet letters is or is not a word in the language.


  • The set of rules can be of two kinds. They can either tell us how to test a string of alphabet letters that we might be presented with, to see if it is a valid word; or they can tell us how to construct all the words in the language by some clear procedures.


  • Let us consider some simple examples of languages. If we start with an alphabet having only one letter, the letter x \( \sum = {x} \) we can define a language by saying that any nonempty string of alphabet characters is a word.


  • \(L_1 = {x, xx, xxxx ... }\) or to write this in an alternate form: \(L_1 = {x^n ; n = 1,2,3... }\)


  • Because of the way we have defined it, this language does not include the null string. We could have defined it so as to include Λ, but we didn't.


  • In this language, as in any other, we can define the operation of concatenation, in which two strings are written down side by side to form a new longer string. In this example, when we concatenate the word xxx with the word xx, we obtain the word xxxxx. The words in this language are clearly analogous to the positive integers, and the operation of concatenation is analogous to addition: \( x^n concatenated with x^m is the word x^{n+m} \)


  • It will often be convenient for us to designate the words in a given language by new symbols, that is, other than the ones in the alphabet.


  • For example, we could say that the word xxx is called a and that the word xx is b.


  • Then to denote the word formed by concatenating a and b we write the letters side by side: ab = xxxxx


  • It is not always true that when two words are concatenated they produce another word in the language.


  • For example if the language is \( L_2= { x, xxx, xxxxx,...} = {x^{odd}} = {x^{2n+1} ; n = 0,1,2,3 } \) then a = xxx and b = xxxxx are both words in L , but their concatenation ab = xxxxxxxx is not in L2.



EXAMPLE


  • Consider another language. Let us begin with the alphabet: Σ = {O, 1, 2, 3, 4, 5, 6, 7, 8, 9} and define the set of words: L3= { any finite string of alphabet letters that does not start with the letter zero }


  • This language L3 then looks like the set of all positive integers written in base 10.


  • L3 ={1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,.. . }


  • We define the function "length of a string" to be the number of letters in the string. We write this function using the word "length." For example, if a = xxxx in the language L1 above, then length(a) = 4 ; length(Λ) = 0


  • If \( L_4 ={\Lambda, x, xx, xxx, xxxx...} ; L_4 = {x^n ; n = 0,1,2,3...} \)


  • Here we have said that x0 = Λ, not x0 = 1 as in algebra. In this way xn is always the string of n x's.


  • It is very important not to confuse 0, which is a string of length 1, with Λ. Remember, even when Λ is a word in the language, it is not a letter in the alphabet.


  • Let us introduce the function reverse. If a is a word in some language L, then reverse(a) is the same string of letters spelled backward, called the reverse of a, even if this backward string is not a word in L. reverse(145) = 541


  • Let us define a new language called PALINDROME over the alphabet Σ = {a, b}. PALINDROME = { Σ, and all strings x such that reverse(x) = x }


  • If we begin listing the elements in PALINDROME we find PALINDROME = { Λ, a, b, aa, bb, aaa, aba, bab, bbb, aaaa, abba ... }


  • Given an alphabet Σ, we wish to define a language in which any string of letters from I is a word, even the null string. This language we shall call the closure of the alphabet. It is denoted by writing a star (an asterisk) after the name of the alphabet as a superscript Σ* This notation is sometimes known as the Kleene star after the logician who was one of the founders of this subject.



Q. If Σ = {x}, then


  • Σ* = L4 = {Λ , x, xx, xxx ... }


  • Σ = {0, 1} = {Λ , 0, 1, 00, 01, 10, 11, 000, 001 ... }


  • Σ = {a, b, c} = {Λ , a, b, c, aa, ab, ac, ba, bb, bc, ca, cb, cc, aaa ... }


  • We can think of the Kleene star as an operation that makes an infinite language of strings of letters out of an alphabet.


  • When we say "infinite language" we mean infinitely many words each of finite length.


  • If S is a set of words, then by S* we mean the set of all finite strings formed by concatenating words from S, where any word may be used as often as we like, and where the null string is also included.



Q. If S = {aa,b}, then


  • S*= { A plus any word composed of factors of aa and b }


  • = { Λ plus all strings of a's and b's in which the a's occur in even clumps }


  • = {Λ, b, aa, bb, aab, baa, bbb, aaaa, aabb, baab, bbaa, bbbb, aaaab, aabaa, aabbb, baaaa, baabb, bbaab, bbbaa, bbbbb, ... }



Q. Let S = { a, ab }. Then


  • S* = {Λ plus any word composed of factors of a and ab}


  • = {Λ plus all strings of a's and b's except those that start with b and those that contain a double b}


  • = {Λ, a, aa, ab, aaa, aab, aaaa, aaab, aaba, abaa, abab, aaaaa, aaaab, aaaba, aabaa, aabab, abaaa, abaab, ababa,... }


  • By the phrase "double b" we mean the substring bb. For each word in S* every b must have an 'a' immediately to its left. The substring bb is impossible, as is starting with a b. Any string without the substring bb that begins with an a can be factored into terms of (ab) and (a).



Q. consider S = {xx, xxx}


  • S* = { Λ and all strings of more than one x }


  • ={xn for n=0, 2, 3, 4, 5...}


  • ={Λ xx, xxx, xxxx, xxxxx, xxxxxx,.. .}


  • Notice that the word x is not in the language S*. The string xxxxxxx is in this closure for any of these three reasons. It is: (xx) (xx) (xxx) or (xx) (xxx) (xx) or (xxx) (xx) (xx)


  • These three factors are all in the set S; therefore their concatenation is in S*. If this is the only way to factor this string into factors of (xx) and (xxx) , we say that the factoring is unique.


  • It is important to note here that the parentheses, (), are not letters in the alphabet but are used for the sole purpose of demarcating the ends of factors.






  • Let us suppose that we wanted to prove mathematically that this set S* contains all xn for n ≠ 1


  • First, we consider the possibility that there were some powers of x that we could not produce by concatenating factors of (xx) and (xxx).


  • Obviously, since we can produce x4 , x5 , x6 , the examples of strings that we cannot produce must be large.


  • Let us ask the question, "What is the smallest power of x (larger than 1) that we cannot form out of factors of xx and xxx?" Let us suppose that we start making a list of how to construct the various powers of x.


  • On this list we write down how to form x2, x3 and so on.


  • Let us say that we work our way successfully up to x373 we cannot figure out how to form x374 . We become stuck, so a friend comes over to us and says, "Let me see your list. How did you form the word X372


  • Why don't you just concatenate another factor of xx in front of this and then you will have the word x374 that you wanted."


  • Our friend is right, and this story shows that while writing this list out we can never really become stuck.


  • This discussion can easily be generalized into a mathematical proof of the fact that S* contains all powers of x greater than 1.


  • We have just established a mathematical fact by a method of proof that we have rarely seen in other courses.


  • It is a proof based on showing that something exists (the factoring) because we can describe how to create it (by adding xx to a previous case).


  • What we have described can be formalized into an algorithm for producing all the powers of x from the factors xx and xxx.


  • The method is to begin with xx and xxx and, when we want to produce xn, we take the sequence of concatenations that we have already found will produce xn-2, and we concatenate xx on to that.


  • The method of proving that something exists by showing how to create it is called proof by constructive algorithm.



Note


  • If Σ = Ф (the empty set), then Σ* = {Λ}


  • If S = {Λ}, then S* = {Λ}


  • If for some reason we wish to modify the concept of closure to refer to only the concatenation of some (not zero) strings from a set S, we use the notation + instead of *


  • If Σ = {x}, then Σ+ = { x, xx, xxx. . . }


  • If S = {xx, xxx} then S+ is the same as S* except for the word Λ, which is not in S+. This is not to say that S+ cannot in general contain the word Λ. It can, but only on condition that S contains the word Λ. In this case, Λ is in S+ , since it is the concatenation of some (actually one) word from S (Λ itself).


  • If S is the set of three words = {a, b, c} then S+ = {a,b,c,aa,ab,ac,ba,bb,bc,ca,cb,cc,aaa,...}


  • Now suppose we start with the set S* and try to form its closure, which we denote as (S*)* or S**


  • If S is not the trivial empty set, then S* is infinite, so we are taking the closure of an infinite set.


  • This should present no problem since every string in the closure of a set is a combination of only finitely many words from the set.


  • Even if the set S has infinitely many words, we use only finitely many at a time. This is the same as with ordinary arithmetic expressions, which can be made up of only finitely many numbers at a time even though there are infinitely many numbers to choose from.


  • If S is not the trivial empty set, then S* is infinite, so we are taking the closure of an infinite set.


  • This should present no problem since every string in the closure of a set is a combination of only finitely many words from the set.


  • Even if the set S has infinitely many words, we use only finitely many at a time.


  • This is the same as with ordinary arithmetic expressions, which can be made up of only finitely many numbers at a time even though there are infinitely many numbers to choose from.