Regular expressions are text strings used to match a specific pattern or to search for a specific location, such as the start or end of a line or a word. They can contain both normal characters and so-called meta-characters, such as * and $.
Many text editors and utilities such as vi, sed, awk, find and grep work extensively with regular expressions. Some of the popular computer languages that use regular expressions include Perl, Python and Ruby.
Regular expressions can get rather complicated, and whole books have been written about them; thus, we will do no more than skim the surface here.
These regular expressions are different from the wildcards (or meta-characters) used in filename matching in command shells such as bash. The table below lists search patterns and their usage.
Search Patterns | Usage |
---|---|
.(dot) | Match any single character |
a|z | Match a or z |
$ | Match end of string |
^ | Match beginning of string |
* | Match preceding item 0 or more times |
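For example, consider the following sentence, written here in lower case and without a final period because matching is case-sensitive and every character counts: the quick brown fox jumped over the lazy dog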
Some of the patterns that can be applied to this sentence are as follows:
Pattern | Match |
---|---|
a.. | matches azy |
b.|j. | matches both br and ju |
..$ | matches og |
l.* | matches lazy dog |
l.*y | matches lazy |
the.* | matches the whole sentence |
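The matches in the table above can be checked from the command line. The following is a minimal sketch, not a definitive recipe: it assumes the sentence is stored in lower case with no final period, it relies on the -o option (print only the matched text) found in GNU and BSD grep, and show_match is a hypothetical helper name introduced only for this illustration.

```shell
# Assumption: the sample sentence is lower case and has no trailing period.
sentence='the quick brown fox jumped over the lazy dog'

# show_match is a hypothetical helper: it prints only the text matched by
# the pattern given as its first argument.
# -o prints just the match; -E enables the | alternation operator.
show_match() {
    printf '%s\n' "$sentence" | grep -oE "$1"
}

show_match 'a..'      # azy
show_match 'b.|j.'    # br and ju, one per line
show_match '..$'      # og
show_match 'l.*'      # lazy dog
show_match 'l.*y'     # lazy
show_match 'the.*'    # the whole sentence
```

Note that .* is greedy: the.* starts at the first "the" and runs to the end of the line, which is why it covers the whole sentence.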
grep is extensively used as a primary text searching tool.
It scans files for specified patterns and can be used with regular expressions, as well as simple strings, as shown in the table:
Command | Usage |
---|---|
grep [pattern] <filename> | Search for a pattern in a file and print all matching lines |
grep -v [pattern] <filename> | Print all lines that do not match the pattern |
grep '[0-9]' <filename> | Print the lines that contain any digit from 0 through 9 |
grep -C 3 [pattern] <filename> | Print each matching line together with its context: the specified number of lines (here, 3) above and below it |
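To see these options in action without touching a real file, the sketch below builds a small scratch file first. The file contents and the variable name sample are made up for this illustration.

```shell
# Hypothetical sample data, written to a temporary file.
sample=$(mktemp)
printf '%s\n' 'alpha' 'beta 42' 'gamma' 'delta' 'epsilon' > "$sample"

grep 'beta' "$sample"         # beta 42
grep -v 'a' "$sample"         # epsilon (the only line without an "a")
grep '[0-9]' "$sample"        # beta 42 (the only line containing a digit)
grep -C 1 'gamma' "$sample"   # beta 42, gamma, delta (one line of context each side)

rm -f "$sample"               # clean up the scratch file
```

The pattern [0-9] is quoted so that the shell passes it to grep unchanged instead of trying to expand it as a filename wildcard.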
strings is used to extract all printable character strings found in the file or files given as arguments.
It is useful in locating human-readable content embedded in binary files; for text files, you can just use grep.
For example, to search for the string my_string in a spreadsheet:
strings book1.xls | grep my_string
The tr utility is used to translate specified characters into other characters or to delete them. The general syntax is as follows:
tr [options] set1 [set2]
The items in the square brackets are optional.
tr requires at least one argument and accepts a maximum of two. The first, designated set1 in the example, lists the characters in the text to be replaced or removed.
The second, set2, lists the characters that are to be substituted for the characters listed in the first argument.
Sometimes, these sets need to be surrounded by apostrophes (single quotes, ') so that the shell does not interpret characters that have a special meaning to it.
It is usually safe (and may be required) to use single quotes around each of the sets, as you will see in the examples below.
For example, suppose you have a file named city containing several lines of text in mixed case.
To translate all lower case characters to upper case, at the command prompt type cat city | tr a-z A-Z and press the Enter key.
Command | Usage |
---|---|
$ tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ | Convert lower case to upper case |
$ tr '{}' '()' < inputfile > outputfile | Translate braces into parentheses |
$ echo "This is for testing" | tr '[:space:]' '\t' | Translate white-space to tabs |
$ echo "This is for testing" | tr -s '[:space:]' | Squeeze repetition of characters using -s |
$ echo "the geek stuff" | tr -d 't' | Delete specified characters using -d option |
$ echo "my username is 432234" | tr -cd '[:digit:]' | Complement the sets using -c option |
$ tr -cd '[:print:]' < file.txt | Remove all non-printable characters from a file |
$ tr -s '\n' ' ' < file.txt | Join all the lines in a file into a single line |
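As a quick check of what several of these commands produce, here is a short sketch that feeds tr fixed strings. The input strings are arbitrary examples, and every set is quoted so that the shell passes it through untouched.

```shell
echo 'Hello World' | tr 'a-z' 'A-Z'               # HELLO WORLD
echo '{a, b}' | tr '{}' '()'                      # (a, b)
echo 'too    many   spaces' | tr -s '[:space:]'   # too many spaces
echo 'the geek stuff' | tr -d 't'                 # he geek suff

# -c complements the set, so -cd deletes everything that is NOT a digit,
# including the trailing newline; the extra echo restores it.
echo 'my username is 432234' | tr -cd '[:digit:]'; echo   # 432234
```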
tee takes the output from any command, and, while sending it to standard output, it also saves it to a file.
In other words, it "tees" the output stream from the command: one stream is displayed on the standard output and the other is saved to a file.
For example, to list the contents of a directory on the screen and save the output to a file, at the command prompt type ls -l | tee newfile and press the Enter key.
Typing cat newfile will then display the output of ls -l.
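The same behavior can be seen with a fixed input, which makes the result easy to compare. In this sketch the variable out and the temporary file it names are hypothetical stand-ins for newfile.

```shell
out=$(mktemp)                            # scratch file standing in for newfile

printf 'first\nsecond\n' | tee "$out"    # the two lines appear on the terminal...
cat "$out"                               # ...and the same two lines were saved

rm -f "$out"                             # clean up the scratch file
```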
wc (word count) counts the number of lines, words, and characters in a file or list of files. Options are given in the table below:
Option | Description |
---|---|
-l | Displays the number of lines |
-c | Displays the number of bytes |
-w | Displays the number of words |
By default, all three of these options are active.
For example, to print only the number of lines contained in a file, type wc -l filename and press the Enter key.
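A small worked example helps to see what each option counts. This is only a sketch: the file contents are made up, and the exact spacing of the numbers in the output differs between wc implementations.

```shell
f=$(mktemp)                           # hypothetical scratch file
printf 'one two\nthree\n' > "$f"      # 2 lines, 3 words, 14 bytes

wc "$f"          # all three counts, followed by the file name
wc -l < "$f"     # 2
wc -w < "$f"     # 3
wc -c < "$f"     # 14

rm -f "$f"       # clean up the scratch file
```

Reading the file through < instead of naming it as an argument keeps the file name out of the output, which is convenient when the count is used in a script.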
cut is used for manipulating column-based files and is designed to extract specific columns. The default column separator is the tab character.
A different delimiter can be given as a command option.
For example, to display the third column delimited by a blank space, at the command prompt type ls -l | cut -d" " -f3 and press the Enter key.
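One detail worth knowing is that cut treats every single delimiter character as a field boundary, so a run of several spaces yields empty fields. The sketch below uses made-up input strings to show this, along with one possible workaround that squeezes repeated spaces with tr first.

```shell
printf 'x\ty\tz\n' | cut -f2                    # y  (tab is the default delimiter)
printf 'alice:x:1000\n' | cut -d: -f1           # alice

# Two spaces in a row mean there is an empty field between them.
printf 'a  b\n' | cut -d' ' -f2                 # prints an empty line
printf 'a  b\n' | tr -s ' ' | cut -d' ' -f2     # b
```

This is why the field numbers in ls -l output may not line up as expected when columns are padded with extra spaces.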
Search for all instances of the user command interpreter (shell) equal to /sbin/nologin in /etc/passwd and replace them with /bin/bash. (Do not overwrite /etc/passwd.)
Solution
To get the output on standard out (terminal screen):
sed s/'\/sbin\/nologin'/'\/bin\/bash'/g /etc/passwd
or to direct to a file:
sed s/'\/sbin\/nologin'/'\/bin\/bash'/g /etc/passwd > passwd_new
Note this is kind of painful and obscure, because the forward slash (/) appears both inside the strings and as the delimiter between fields, so every slash inside a string has to be escaped. Instead, you can do:
sed s:'/sbin/nologin':'/bin/bash':g /etc/passwd
where we have used the colon (:) as the delimiter instead (you are free to choose your delimiting character!). In fact, when doing this, we don’t even need the single quotes:
sed s:/sbin/nologin:/bin/bash:g /etc/passwd
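To confirm that the two delimiter styles give the same result without reading the real /etc/passwd, the following sketch applies both to a single made-up line in /etc/passwd format. The variable name line and its contents are assumptions for illustration only.

```shell
# Hypothetical entry in /etc/passwd format.
line='daemon:x:2:2:daemon:/sbin:/sbin/nologin'

# Slash as delimiter: every / inside the strings must be escaped.
printf '%s\n' "$line" | sed 's/\/sbin\/nologin/\/bin\/bash/g'

# Colon as delimiter: no escaping needed.
printf '%s\n' "$line" | sed 's:/sbin/nologin:/bin/bash:g'

# Both commands print: daemon:x:2:2:daemon:/sbin:/bin/bash
```

Only the shell field changes; the home directory /sbin is left alone because it is not followed by /nologin.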
Generate a column containing a unique list of all the shells used for users in /etc/passwd.
You may need to consult the manual page for /etc/passwd as in:
man 5 passwd
Which field in /etc/passwd holds the account’s default shell (user command interpreter)?
How do you make a list of unique entries (with no repeats)?
Solution
The field in /etc/passwd that holds the shell is #7. To display the field holding the shell in /etc/passwd using awk and produce a unique list:
awk -F: '{print $7}' /etc/passwd | sort -u
or
awk -F: '{print $7}' /etc/passwd | sort | uniq
For example:
$ awk -F: '{print $7}' /etc/passwd | sort -u
/bin/bash
/bin/sync
/sbin/halt
/sbin/nologin
/sbin/shutdown
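Because the contents of /etc/passwd differ from system to system, the sketch below runs the same pipeline on a few made-up entries so the result is predictable. The function name passwd_sample and the entries it prints are hypothetical. It also shows cut, described earlier, as an alternative to awk for picking out the seventh field.

```shell
# Hypothetical helper that prints sample lines in /etc/passwd format.
passwd_sample() {
    printf '%s\n' \
        'root:x:0:0:root:/root:/bin/bash' \
        'daemon:x:2:2:daemon:/sbin:/sbin/nologin' \
        'student:x:1000:1000:Student:/home/student:/bin/bash'
}

passwd_sample | awk -F: '{print $7}' | sort -u   # /bin/bash, then /sbin/nologin
passwd_sample | cut -d: -f7 | sort -u            # same result using cut
```

The sort step matters in the sort | uniq form because uniq only removes repeated lines that are adjacent.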
In the following, we give some examples of things you can do with the grep command; your task is to experiment with these examples and extend them.
Search for your username in file /etc/passwd.
Find all entries in /etc/services that include the string ftp:
Restrict to those that use the tcp protocol.
Now, restrict to those that do not use the tcp protocol, while printing out the line number.
Get all lines that start with ts or end with st.
Solution
Search for your username in file /etc/passwd:
grep your-username /etc/passwd
Find all entries in /etc/services that include the string ftp:
grep ftp /etc/services
Restrict to those that use the tcp protocol:
grep ftp /etc/services | grep tcp
Now, restrict to those that do not use the tcp protocol, while printing out the line number:
grep -n ftp /etc/services | grep -v tcp
Get all lines that start with ts or end with st:
grep ^ts /etc/services
grep st$ /etc/services
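The two anchored searches can also be combined into a single grep call by giving each pattern its own -e option. The sketch below uses made-up input lines rather than /etc/services so that the result is predictable.

```shell
# Sample input (hypothetical): one line starting with ts, one ending with st,
# and one that matches neither pattern.
printf '%s\n' 'tsunami' 'fast' 'middle' | grep -e '^ts' -e 'st$'
# prints tsunami and fast, one per line
```

Quoting the patterns keeps the shell from interpreting ^ and $ before grep sees them.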
The tee utility is very useful for saving a copy of your output while you are watching it being generated.
Execute a command such as doing a directory listing of the /etc directory:
ls -l /etc
while both saving the output in a file and displaying it at your terminal.
Solution
ls -l /etc | tee /tmp/ls-output
less /tmp/ls-output
Using wc (word count), find out how many lines, words, and characters there are in all the files in /var/log that have the .log extension.
Solution
wc /var/log/*.log
Note that you would have to do this with sudo to get every file counted, because some log files are readable only by root.