RECURSIVE DEFINITIONS

Say for example that S = {a,b}.

Then S* is clearly all strings of the two letters a and b of any finite length whatsoever.

Now what would it mean to take strings from S* and concatenate them

= aababaaaaaba = (aaba)(baaa)(aaba) = [(a)(a)(b)(a)] [(b)(a)(a)(a)] [(a)(a)(b)(a)] = (a)(a)(b)(a)(b)(a)(a)(a)(a)(a)(b)(a)

Let us consider one more illustration.

If S = {aa, bbb}, then S* is the set of all strings where the a's occur in even clumps and the b's in groups of 3, 6, 9. . . Some words in S* are aabbbaaaa, bbb, bbbaa

If we concatenate these three elements of S*, we get one big word in S**, which is again in S*. = aabbbaaaabbbbbbaa = [(aa)(bbb)(aa)(aa)] [(bbb)] [(bbb)(aa)]

This theorem expresses a trivial but subtle point.

It is analogous to saying that if people are made up of molecules and molecules are made up of atoms, then people are made up of atoms.

Every factor from S* is made up of factors from S.

Therefore, every word in S** is made up of factors from S.

Therefore, every word in S** is also a word in S*.

We can write this as S** C S* using the symbol "C" from Set Theory, which means "is contained in or equal to."

Now in general it is true that for any set A we know that A C A*, since in A* we can chose as a word any one factor from A.

So if we consider A to be our set S*, we have S* C S**

Together, these two inclusions prove that PROBLEMS S* = S**

A recursive definition is characteristically a three-step process. First, we specify some basic objects in the set. Second, we give rules for constructing more objects in the set from the ones we already know. Third, we declare that no objects except those constructed in this way are allowed in the set.

Example : Suppose that we are trying to define the set of positive even integers One standard way of defining this set is: EVEN is the set of all positive whole numbers divisible by 2.

Another way we might try is this: EVEN is the set of all 2n where n = 1 2 3 4 ...

The set EVEN is defined by these three rules:

Rule 1 : 2 is in EVEN

Rule 2 : If x is in EVEN then so is x + 2.

Rule 3 : The only elements in the set EVEN are those that can be produced from the two rules above.

Example 2 : Let us consider the way polynomials are usually defined: A polynomial is a finite sum of terms each of which is of the form a real number times a power of x (that may be x⁰ = 1). Now let us consider a recursive definition

Rule 1 : Any number is in POLYNOMIAL.

Rule 2 : The variable x is in POLYNOMIAL.

Rule 3 : If p and q are in POLYNOMIAL, then so are p + q and (p) and pq.

Rule 4 : POLYNOMIAL contains only those things which can be created by the three rules above.

The symbol pq, which looks like a concatenation of alphabet letters, in algebraic notation refers to multiplication.

These rules are very crude in that they make us write subtraction as p + (- 1)q and they do not show us how to simplify this to p - q.

We could include rules for making the notation prettier, but the rules above do allow us to produce all polynomials in some form or another, and the rules themselves are simple.

The reason that these definitions are called "recursive" is that one of the rules used to define the set mentions the set itself. We define EVEN in terms of previously known elements of EVEN, POLYNOMIAL in terms of previously known elements of POLYNOMIAL.

EXAMPLE : Definition of the set of people who are descended from Henry VIII and Definition of factorial

Rule 1 : 0! = 1

Rule 2 : (n + 1)! = (n + 1)(n!)

Rule 1 :The children of Henry VIII are all elements of DESCENDANTS.

Rule 2 : If x is an element of DESCENDANTS, then so are x's children

We defined L₁ by the symbols: L = {xⁿ for n = 1, 2, 3, ... }

We also defined L₂ by the symbols: L = {xⁿ for n = 1, 3, 5, ... }

L₅ by the symbols: L = {xⁿ for n = 1, 4, 9, ... }

However for certain languages such as L₆ = { xⁿ for n = 3, 4, 8, 22, . . . } ? What words are in the language.

Perhaps these are the ages of the sisters of Louis XIV when he assumed the throne.

More precision and less guesswork is required, especially where computers are concerned.

In this chapter we shall develop some new language-defining symbolism that will be much more precise than the ellipsis (which is what the three dots . . . are called).

regular expressions.

We will define the term regular expression itself recursively.

The languages that are associated with these regular expressions are called regular languages and are also said to be defined by finite representation.

L₄={ Λ, x, xx, xxx, xxxx, . . .}

Let S = {x}.

Then L₄ = {x}* As shorthand for this we could have written: L4 = {x}*

We now introduce the use of the Kleene star applied not to a set but directly to the letter x and written as a superscript as if it were an exponent.

x* The simple expression x* will be used to indicate some sequence of x's (maybe none at all).

x* = Λ or x or x² or x³ or x⁴ ... = xⁿ for some n = 0, 1, 2, 3, 4,...

That is x* stands for a string of x's, but we do not specify how many.

It stands for any string of x's in the language L₄.

L₄ = language (x*)

Since x* is any string of x's, L₄ is then the set of all possible strings of x's of any length (including Λ).

We should not confuse x*, which is a language-defining symbol, with L which is the name we have given to a certain language.

This is why we use the word "language" in the equation.

Suppose that we wished to describe the language L over the alphabet Σ = {a,b} where L = {a, ab, abb, abbb, abbbb, ... }

We could summarize this language by the English phrase "all words of the form one a followed by some number of b's (maybe no b's at all.)".

Using our star notation, we may write: or without the space, L = language (a b*) L = language (ab*)

We can apply the Kleene star to the string ab if we want, as follows:

(ab)* = Λ or ab or abab or ababab ...

Parentheses are not letters in the alphabet of this language, so they can be used to indicate factoring without accidently changing the words. Since the star represents some kind of exponentiation, we use it as powers are used in algebra, where by universal understanding the expression xy² means (xy)².

The language L, can be defined by any of the expressions below:

XX* or X⁺ or XX*X* or X*XX* or x x* or x*x⁺ or x**x*xx* or

Q. The language defined by the expression ab*a

The set of all strings of a's and b's that have at least two letters, that begin and end with a's, and that have nothing but b's inside (if anything at all).

language (ab*a) = {aa, aba, abba, abbba, abbbba, ... .}

It would be a subtle mistake to say only that this language is the set of all words that begin and end with an 'a' and have only 'b's' in between, because this description may also apply to the word "a" depending on how it is interpreted. Our symbolism eliminates this ambiguity.

Q. The language of the expression a*b* contains all the strings of a's and b's in which all the a's (if any) come before all the b's (if any).

language (a*b*) = {Λ, a, b, aa, ab, bb, aaa, aab, abb, bbb, aaaa, . . . }

Here we should again be very careful to observe that a*b* ≠ (ab)*

since the language defined by the expression on the right contains the word abab, which the language defined by the expression on the left does not. This cautions us against thinking of the * as a normal algebraic exponent.

The language defined by the expression a*b*a* contains the word baa since it starts with zero a's followed by one b followed by two a's.

The following expressions both define the language L₂ = {x^odd}

x(xx)* or (xx)*x

x*xx* does not since it includes the word (xx), x, (x).

Q. Consider the language T defined over the alphabet Σ = {a, b, c}. T = {a, c, ab, cb, abb, cbb, abbb, cbbb, abbbb, cbbbb, ... }

All the words in T begin with an 'a' or 'ac' and then are followed by some number of b's.

Symbolically, we may write this as

T = language ((a + c)b*)

= language (either a or c then some b's)

We now introduce another use for the plus sign.

By the expression x + y where x and y are strings of characters from an alphabet, we mean "either x or y".

This means that x + y offers a choice, much the same way that x* does.

Care should be taken so as not to confuse this with + as an exponent. We should, of course, have said "some or no b's". We often drop the zero option because it is tiresome.

We let the word "some" always mean "some or no," and when we mean "some positive number of' we say that.

We say that the expression (a + c)b* defines a language in the following sense. Every + and * ask us to make a choice. For each * or + used as a superscript we must select some number of factors for which it stands.

For each other + we must decide whether to choose the right-side expression or the left-side expression.

For every set of choices we have generated a particular string. The set of all strings produceable by this method is the language of the expression. In the example (a + c)b* we must choose either the a or the c for the first letter and then we choose how many b's the b* stands for.

Each set of choices is a word. If from (a + c) we choose c and we choose b* to mean bbb, we have the word cbbb.

Q. Now let us consider a finite language L that contains all the strings of a's and b's of length exactly three.

L = {aaa, aab, aba, abb, baa, bab, bba, bbb}

The first letter of each word in L is either an 'a' or a 'b'. The second letter of each word in L is either an 'a' or a 'b'. The third letter of each word in L is either an 'a' or a 'b'. So we may write L = language ((a + b)(a + b)(a + b))

L = language ((a + b)³)

If we want to define the set of all seven letter strings of a's and b's, we could write (a + b)⁷ . In general, if we want to refer to the set of all possible strings of a's and b's of any length whatsoever we could write, (a + b)*

Q. We can describe all words that begin with the letter 'a' simply as:

L = a(a + b)* that is, first an a, then anything (as many choices as we want of either letter a or b).

Q. All words that begin with an a and end with a b can be defined by the expression

a(a + b)*b = a (arbitrary string)b

Q. Let us consider the language defined by the expression (a + b)*a(a+b)*

At the beginning we have (a + b)*, which stands for anything, that is any string of a's and b's, then comes an 'a', then another anything.

All told, the language is the set of all words over the alphabet Y = {a,b} that have an 'a' in them somewhere.

The only words left out are those that have only b's and the word Λ.

For example, the word abbaab can be considered to be of this form in three ways: (Λ)a(bbaab) or (abb)a(ab) or (abba)a(b)

Q. The language of all words that have at least two a's can be described by the expression

(a + b)*a(a + b)*a(a + b)*

= (some beginning)(the first important a)(some middle)(the second important a)(some end)

where the arbitrary parts can have as many a's (or b's) as they want.

In the last three examples we have used the notation (a + b)* as a factor to mean "any possible substring," just as we have seen it stand for the language of all words.

Q. Another expression that denotes all the words with at least two a's is:

b*ab*a(a + b)*

We scan through some jungle of b's (or no b's) until we find the first a, then more b's (or no b's), then the second a, then we finish up with anything. In this set are abbbabb and aaaaa.

We can write: (a + b)*a(a + b)*a(a + b)* = b*ab*a(a + b)*

where by the equal sign we do not mean that these expressions are equal algebraically in the same way as x+x=2x

We could write language ((a + b)*a(a + b)*a(a + b)*) = language (b*ab*a (a + b)*) = all words with at least two a's.

To be careful about this point, we say that two regular expressions are equivalent

If they describe the same language.

The expressions below also describe the language of words with at least two a's.

and EXAMPLE (a + b) * ab * ab* ; b*a(a + b)*ab*

Q. If we wanted all the words with exactly two a's, we could use the expression

b*ab*ab*

which describes such words as aab, baba, and bbbabbbab. To make the word aab, we let the first and second b* become Σ and the last becomes b.

Q. Complex equivalences

The language of all words that have at least one a and at least one b = (a + b)*a(a + b)* b(a + b)*

= (arbitrary)a(arbitrary) b(arbitrary)

we are then requiring that an a precede a b in the word. Such words as ba and bbaaaa are not included in this set. Since, however, we know that either the a comes before the b or the b comes before the a, we could define this set by the expression:

(a+b)*a(a+b)*b(a+b)* + (a+b)*b(a+b)*a(a+b)*

There is a simpler expression that defines the same language.

If we are confident that the only words that are omitted by the first term (a + b)*a(a + b)*b(a + b)* are the words of the form some b's followed by some a's, then it would be sufficient to add these specific exceptions into the set.

These exceptions are all defined by the regular expression: bb*aa* The language of all words over the alphabet Σ = {a, b} that contain both an a and a b is therefore also defined by the expression: (a + b)*a(a + b)*b(a + b)* + bb*aa* Notice that it is necessary to write bb*aa* because b*a* will admit words we do not want, such as aaa.

These language-defining expressions cannot be treated like algebraic symbols. We have shown that (a + b)*a(a + b)*b(a + b)* + (a + b)*b(a + b)*a(a + b)* = (a + b)*a(a + b)*b(a + b)* + bb*aa*

The first terms on both sides of this equation are the same, but if we cancel them we get (a+b)*b(a+b)*a(a+b)* = bb*aa*

which is false, since the left side includes the word aba, which the expression on the right side does not.

The only words that do not contain both an a and a b in them somewhere are the words of all a's, all b's, or Λ.

When these are added into the language, we get everything.

Therefore, the regular expression: (a+b)*a(a+b)*b(a+b)* + bb*aa* + a* + b* defines all possible strings of a's and b's.

The word Λ is included in both a* and b*.

We can then write: (a + b)* = (a+b)*a(a+b)*b(a+b)* + bb*aa* + a* + b* which is not a very obvious equivalence at all.

THEOREM 1 : For any set S of strings we have S* = S**.

REGULAR EXPRESSIONS

Regular languages