Examples of equivalences

(a+b)* = (a+b)* + (a+b)*

(a+b)* = (a+b)* (a+b)*

(a+b)* = a(a+b)* + b(a+b)* + Λ

(a+b)* = (a+b)* ab(a+b)* + b*a*

We can represent a finite language by using the plus (union sign) alone.

If the language L over the alphabet X = {a, b} contains only the finite list of words given below

L = {abba, baaa, bbbb} then we can represent L by the symbolic expression

L = language (abba + baaa + bbbb) Every word in L is some choice of options of this expression.

If L is a finite language that includes the null word Λ, then the expression that defines L must also employ the symbol Λ.

For example, if L = {Λ, a, aa, bbb} then the symbolic expression for L must be L = language (Λ + a + aa + bbb) The symbol Λ is a very useful addition to our system of language-defining symbolic expressions.

Q. Let V be the language of all strings of a's and b's in which the strings are either all b's or else there is an a followed by some b's. Let V also contain the word Λ.

V = {Λ, a, bab, bb, abb, bbb, abbb, bbbb, ....} We can define V by the expression b* + ab* where the word A is included in the term b*.

Alternatively, we could define V by the expression: (Λ + a)b*

This would mean that in front of the string of some b's we have the option of either adding an a or nothing.

Since we could always write b* = Ab*, we have what appears to be some sort of distributive law at work. Λb* + ab* = (Λ + a)b*

We have factored out the b* just as in algebra.

It is because of this analogy to algebra that we have denoted our disjunction by the plus sign instead of the union sign U or the symbolic logic sign V.

We have a hybrid system: the * is somewhat like an exponent and the + is somewhat like addition.

But the analogies to algebra should be approached very suspiciously, since addition in algebra never means choice and algebraic multiplication has properties different from concatenation (even though we sometimes conventionally refer to it as product):

ab = ba in algebra ab + ba in formal languages Let us reconsider the language T = {a, cab, cb, abb, cbb, ... }.

T can be defined as above by (a + c)b* but it can also be defined by ab* + cb* This is another example of the distributive law.

We have all the parts we need in order to define regular expressions recursively. The symbols that appear in regular expressions are: the letters of the alphabet Σ, the symbol for the null string Λ, parentheses, the star operator, and the plus sign.

Rule 1 : Every letter of Σ can be made into a regular expression by writing it in boldface; Λ is a regular expression.

Rule 2 : If r₁ and r₂ are regular expressions, then so are (r₁), r₁r₂, r₁ + r₂, r_l* ; r_l⁺ = r_lr₁*,

Rule 3 : Nothing else is a regular expression.

If S and T are sets of strings of letters (whether they are finite or infinite sets), we define the product set of strings of letters to be :

ST = {all combinations of a string from S concatenated with a string from T }

S = {a, aa, aaa} T = {bb, bbb}

ST = {abb, abbb, aabb, aabbb, aaabb, aaabbb}

Note that these words are not in proper order.

If S and T are sets of strings of letters (whether they are finite or infinite sets), we define the product set of strings of letters to be :

If S = {a, bb, bab} T = {a, ab}

ST = {aa, aab, bba, bbab, baba, babab}

If P = {a, bb, bab} Q = {Λ, bbbb} then PQ = {a, bb, bab, abbbb, bbbbbb, babbbbb}

M = {Λ, x, xx} N-={Λ, y, yy, yyy, yyyy,...}

MN = {Λ, y, yy, yyy, yyyy, ... x, xy, xyy, xyyy, xyyyy,... xx, xxy, xxyy, xxyyy, xxyyyy,.. ..}

(a + aa + aaa)(bb + bbb) = abb + abbb + aabb + aabbb + aaabb + aaabbb

(a + bb + bab)(a + ab) = aa + aab + bba + bbab + baba + babab

(a + bb + bab)(Λ + bbbb) = a+bb+bab+ab⁴ +b⁶ + bab⁵

(Λ + x + xx)(y*) = y* + xy* + xxy*

Rule 1 : The language associated with the regular expression that is just a single letter is that one-letter word alone and the language associated with Λ is just {Λ}, a one-word language.

Rule 2 : If r, is a regular expression associated with the language L₁, and r₂ is a regular expression associated with the language L₂ then

(i) The regular expression (r₁) (r₂) is a regular expression associated with the language L₁ times L₂ language(r₁r₂) = L₁L₂

(ii) The regular expression r₁ + r₂ is associated with the language formed by the union of the sets L₁ and L₂ language(r₁ + r₂) = L₁ + L₂.

(iii) The language associated with the regular expression (r₁)* is L₁*, the Kleene closure of the set L₁ as a set of words. language (r₁*) = L₁*

To make one regular expression that defines the language L, turn all the words in L into boldface type and stick pluses between them.

For example, the regular expression that defines the language L = {baa, abbba, bababa} is baa + abbba + bababa

If L = {aa, ab, ba, bb} the algorithm described above gives the regular expression aa + ab + ba + bb

The reason this trick only works for finite languages is that an infinite language would become a regular expression that is infinitely long, which is forbidden.

Q. L {Λ, x, xx, xxx, xxxx, xxxxx}

The regular expression we get from the theorem is Λ + x + xx + xxx + xxxx + xxxxx

A more elegant regular expression for this language is (Λ + x)⁵

Of course the 5 is, strictly speaking, not a legal symbol for a regular expression although we all understand it means (Λ + x)(Λ + x)(Λ + x)(Λ + x)(Λ + x)

Q. Consider the expression: (a + b)*(aa + bb)(a + b)*

This is the set of strings of a's and b's that at some point contain a double letter.

We can think of it as (arbitrary)(double letter)(arbitrary)

Let us now ask, "What strings do not contain a double letter?"

Some examples are: Λ a b ab ba aba bab abab baba .... The expression (ab)* covers all of these except those that begin with 'b' or end in 'a'.

Adding these choices gives us the regular expression (Λ + b) (ab)* (Λ + a)

Q. Consider the regular expression below: E = (a + b)*a(a + b)*(a + Λ) (a + b)*a(a + b)*

= (arbitrary) a (arbitrary) [a or nothing] (arbitrary) a (arbitrary).

One obvious fact is that all the words in the language of E must have at least two a's in them. Let us break up the middle plus sign into its two cases: either the middle factor contributes an a or else it contributes a Λ.

E = (a+b)*a(a+b)*a(a+b)*a(a+b)* + (a + b)*a(a + b)* Λ(a + b)*a(a + b)*

This is a more detailed use of the distributive law.

The first term above clearly represents all words that have at least three a's in them.

Before we analyze the second term let us make the observation that (a + b)* Λ(a + b)* which occurs in the middle of the second term is only another way of saying "any string whatsoever" and could be replaced with the more direct expression (a + b)*

This would reduce the second term of the expression to (a + b)*a(a + b)*a(a + b)* which we have already seen is a regular expression representing all words that have at least two a's in them.

Therefore, the language associated with E is the union of all strings that have three or more a's with all strings that have two or more a's.

But since all strings with three or more a's are themselves already strings with two or more a's, this whole language is just the second set alone.

The language associated with E is no different from the language associated with (a + b)*a(a + b)*a(a + b)* which we have examined before with three of its avatars.

It is possible by repeated application of the rules for forming regular expressions to produce an expression in which the star operator is applied to a subexpression that already has a star in it.

Some examples are: (a + b*)* (aa + ab*)* ((a + bbba*) + ba*b)*

In the first of these expressions, the internal * adds nothing to the language

(a + b*)* = (a + b)*

since all possible strings of a's and b's are described by both expressions.

Also, in accordance with Theorem 1, (a*)* = a*

However, (aa + ab*)* ≠ (aa + ab)*

Since the language for the expression on the left includes the word abbabb, which the language on the right does not.

(The language defined by the regular expression on the right cannot contain any word with a double b.)

Q. Consider the regular expression: (a*b*)*

The language defined by this expression is all strings that can be made up of factors of the form a*b*

But since both the single letter a and the single letter b are words of the form a*b*, this language contains all strings of a's and b's.

It cannot contain more than everything, so (a*b*)* = (a + b)*

Q. E = [aa + bb + (ab+ba)(aa+bb)*(ab+ba)]*

This regular expression represents the collection of all words that are made up of "syllables" of three types:

type1 = aa

type2 = bb

type3 = (ab + ba)(aa + bb)*(ab + ba)

E = [type1 + type2 + type3]*

Suppose that we are scanning along a word in the language of E from left to right reading the letters two at a time.

First we come to a double a (type1), then to a double b type2 , then to another double a (type1 again).

Then perhaps we come upon a pair of letters that are not the same.

Say, for instance, that the next two letters are ba.

This must begin a substring of type1.

It starts with an undoubled pair (either ab or ba), then it has a section of doubled letters (many repetitions of either aa or bb), and then it finally ends with another undoubled pair (either ab or ba again).

One property of this section of the word is that it has an even number of a's and an even number of b's.

If the section started with a ba, it could end with an ab still giving two a's and two b's on the ends with only doubled letters in between.

If it started with a ba and ended with an ab, again, it would give an even number of a's and an even number of b's.

After this section of type3 we could proceed with more sections of type, or type1 or type 2 until we encountered another undoubled pair, starting another type section.

We know that another undoubled pair will be coming up to balance off the initial one.

The total effect is that every word of the language of E contains an even number of a's and an even number of b's.

If this were all we wanted to conclude, we could have done so more quickly. All words in the language of E are made up of these three types of substrings and, since each of these three has an even number of a's and an even number of b's, the whole word must, too.

However, a stronger statement is also true.

All words with an even number of a's and an even number of b's belong to the language of E.

The proof of this parallels our argument above.

Consider a word w with even a's and even b's. If the first two letters are the same, we have a type1 or type syllable.

Scan over the doubled letter pairs until we come to an unmatched pair such as ab or ba.

Continue scanning by skipping over the double a's and double b's that get in the way until we find the balancing unmatched pair (ab or ba) to even off the count of a's and b's.

If the word ends before we find such a pair, the a's and b's are not even.

Once we have found the balancing unmatched pair, we have completed a syllable of type3.

By "balancing" we do not mean it has to be the same unmatched pair: ab can be balanced by either ab or ba.

Consider them book-ends or open and close parentheses; whenever we see one we must later find another.

Therefore, E represents the language of all strings with even a's and even b's.

Q. Consider the language defined by the regular expression: b*(abb*)*(Λ + a)

This is the language of all words without a double a.

The typical word here starts with some b's.

Then come repeated factors of the form abb* (an a followed by at least one b).

Then we finish up with a final a or we leave the last b's as they are. This is another starred expression with a star inside.

Journey of a Thousand Miles begins with One Click!!!

Rules for Regular Expressions

DEFINITION - The following rules define the language associated with any regular expression.

THEOREM : If L is a finite language (a language with only finitely many words), then L can be defined by a regular expression.