
Regular Expressions

Introduction

Regular expressions are a metalanguage used for pattern matching. They let a user tell a program to search for a particular pattern of characters, in addition to fixed strings. BASH has minimal built-in support for pattern matching: it lets you write patterns that match filenames. This is called globbing. In this lecture I discuss globbing and the more advanced pattern-matching syntax used by some UNIX commands.

Globbing

Globbing works by replacing literal characters with wildcards. The most commonly used wildcard is the star (*) symbol. A star matches any sequence of characters, including none at all. The one exception is a leading dot: hidden files are skipped. Run the following command in your home directory:

$ echo * 

Here's the output from when I run the command:

$ echo * 
arduino-1.0.5 baseline.pcapng bin Blackboard ConfigEngnie Demo Desktop Documents Downloads eagle Eclipse fritzing-0.9.2b.linux.AMD64 Fun grading Horticulture intro ISOs kernel Music packet.cap Pictures Public RDPShare rover_workspace sketchbook Snappy Templates Videos Wireshark Work

BASH's globbing feature replaces the * with the name of every non-hidden file and directory in my home directory. Compare for yourself by running the command:

$ ls 

The ls command formats its output differently, but the names it lists are the same. By itself * isn't very useful, but you can constrain what files it matches:

$ echo a*
arduino-1.0.5

The command above matches only files that begin with "a". There's more:

$ echo *ng
baseline.pcapng grading

Now we match only files that end with "ng". You can use the * in the middle of a pattern as well, and more than once:

$ echo D*o*
Demo Desktop Documents Downloads 

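Notice that "Demo" matches even though nothing follows its "o". The * matches zero or more characters. It's the ultimate wildcard! There are other wildcards too: ? matches exactly one character, [chars] matches any one of the listed characters (ranges such as [a-z] are allowed), and [!chars] matches any one character that is not listed.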
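
If you would rather not experiment in your real home directory, here is a minimal sketch that repeats the patterns above in a throwaway directory. The file names (including the hidden file .hidden) are made up for illustration, and the variable names are my own choices.

```shell
# Work in a scratch directory so nothing in your home directory is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Hypothetical sample files, including one hidden file.
touch arduino bin Demo Desktop Documents grading notes.pcapng .hidden

all=$(echo *)        # every non-hidden name (order depends on your locale)
starts_a=$(echo a*)  # names beginning with "a"
ends_ng=$(echo *ng)  # names ending with "ng"
d_o=$(echo D*o*)     # "D", anything, "o", anything

echo "$all"
echo "$starts_a"     # arduino
echo "$ends_ng"      # grading notes.pcapng
echo "$d_o"          # Demo Desktop Documents
```

The first line of output should list every file except .hidden, which shows that * skips names that begin with a dot.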

Globbing Exercises

Given the following files in your current directory:


feb86   jan12.89  jan19.89 jan26.89 
jan5.89 jan85     jan86    jan87 
jan88   mar88     memo1    memo10 
memo2   memo2.sv
 
What would be the output from the following commands?
  1. echo *
  2. echo m[a-df-z]*
  3. echo jan*
  4. echo ?????
  5. echo jan?? feb?? mar??
  6. echo *[!0-9]
  7. echo [A-Z]*
  8. echo *.*
  9. echo *89
  10. echo [fjm][ae][bnr]*
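
To check your answers, you can create the exercise files yourself. This is only a setup sketch: it assumes the mktemp command is available on your system, and the directory it creates is a scratch location, not part of the exercise.

```shell
# Create the exercise files in a scratch directory (mktemp -d picks the name).
exercise_dir=$(mktemp -d)
cd "$exercise_dir" || exit 1

touch feb86 jan12.89 jan19.89 jan26.89 \
      jan5.89 jan85 jan86 jan87 \
      jan88 mar88 memo1 memo10 \
      memo2 memo2.sv

# Confirm that all 14 files exist, then try the exercise commands here.
count=$(ls | wc -l)
echo "$count"
```

Work out each answer on paper first, then run the command in this directory to compare.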

Using Sed

Sed is the UNIX "Stream Editor." It's a program that can be used to alter and manipulate text from the command line. Suppose you had the following file, called message.txt, in your home directory:

$ cat message.txt 
The Unix operating system was pioneered by Ken
Thompson and Dennis Ritchie at Bell Laboratories
in the late 1960s.  One of the primary goals in
the design of the Unix system was to create an
environment that promoted efficient progarm
development. 

Here are some ways you can use sed to transform the text. Sed writes the result to its output and leaves message.txt itself unchanged:

$ sed 's/Dennis/Menace/' message.txt 
The Unix operating system was pioneered by Ken
Thompson and Menace Ritchie at Bell Laboratories
in the late 1960s.  One of the primary goals in
the design of the Unix system was to create an
environment that promoted efficient progarm
development. 

That's easy. What if we want to change every instance of "the" into "THE"?

$ sed 's/the/THE/' message.txt 
The Unix operating system was pioneered by Ken
Thompson and Dennis Ritchie at Bell Laboratories
in THE late 1960s.  One of the primary goals in
THE design of the Unix system was to create an
environment that promoted efficient progarm
development. 

Notice we didn't get all of them. That's because sed replaces only the first match on each line unless you add the 'g' (global) flag:

$ sed 's/the/THE/g' message.txt 
The Unix operating system was pioneered by Ken
Thompson and Dennis Ritchie at Bell Laboratories
in THE late 1960s.  One of THE primary goals in
THE design of THE Unix system was to create an
environment that promoted efficient progarm
development. 

Almost, but to sed "The" is different from "the". If you want the match to be case-insensitive, add the 'i' flag as well (this flag is a GNU sed extension, so it may not be available in every version of sed):

$ sed 's/the/THE/gi' message.txt 
THE Unix operating system was pioneered by Ken
Thompson and Dennis Ritchie at Bell Laboratories
in THE late 1960s.  One of THE primary goals in
THE design of THE Unix system was to create an
environment that promoted efficient progarm
development. 

The full set of regular expressions works in sed, not just the wildcards used for globbing.
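
As a rough illustration of regular expressions inside sed, the sketch below recreates message.txt in a scratch directory and runs two substitutions. The replacement words YEAR and WORD and the variable names are arbitrary choices of mine. The -n option combined with the trailing p prints only the lines that were changed.

```shell
# Recreate message.txt in a scratch directory so the commands can be rerun safely.
cd "$(mktemp -d)" || exit 1
cat > message.txt <<'EOF'
The Unix operating system was pioneered by Ken
Thompson and Dennis Ritchie at Bell Laboratories
in the late 1960s.  One of the primary goals in
the design of the Unix system was to create an
environment that promoted efficient progarm
development.
EOF

# [0-9]\{4\} means "exactly four digits in a row".
year_line=$(sed -n 's/[0-9]\{4\}/YEAR/p' message.txt)
echo "$year_line"
# in the late YEARs.  One of the primary goals in

# ^T[a-z]* means "a T at the start of the line, then any lowercase letters".
t_words=$(sed -n 's/^T[a-z]*/WORD/p' message.txt)
echo "$t_words"
# WORD Unix operating system was pioneered by Ken
# WORD and Dennis Ritchie at Bell Laboratories
```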

Sed Exercises 

These exercises use the text above. Copy and paste it into a file so that you can test your sed commands. Use sed to:
  1. Replace all instances of "in" with "--in--"
  2. Same as #1 but only when "in" appears at the beginning of a line. 
  3. Same as #1 but only when "in" appears at the end of a line. 
  4. Replace all vowels with "-" (upper and lower case)

Using Grep 

The grep command also uses regular expressions. Grep is used to find what you're looking for in a text file. Here's an example of how to use grep on the same message from before:

$ grep Dennis message.txt 
Thompson and Dennis Ritchie at Bell Laboratories

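Notice only the line with "Dennis" is placed on the output. Grep always treats its pattern as a regular expression. The "-e" option simply marks the next argument as the pattern, which is useful when the pattern starts with a dash or when you want to supply several patterns: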

$ grep -e '^the' message.txt
the design of the Unix system was to create an

Notice that "The" and "the" are still not the same. You can make grep ignore case with the "-i" flag. Important: when you combine short flags into a single argument, "-e" must come last, because the argument that follows it is taken as the pattern:

$ grep -ie '^the' message.txt
The Unix operating system was pioneered by Ken
the design of the Unix system was to create an

Grep is one of the most useful UNIX commands you will ever learn. I use it daily. 
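
Here is a small sketch of how the "-e" and "-i" options interact, using the same message.txt recreated in a scratch directory. The variable names are my own, and the choice of "Ken" and "Bell" as a pair of patterns is just for illustration.

```shell
# Recreate message.txt in a scratch directory.
cd "$(mktemp -d)" || exit 1
cat > message.txt <<'EOF'
The Unix operating system was pioneered by Ken
Thompson and Dennis Ritchie at Bell Laboratories
in the late 1960s.  One of the primary goals in
the design of the Unix system was to create an
environment that promoted efficient progarm
development.
EOF

# The pattern is a regular expression with or without -e.
no_e=$(grep '^the' message.txt)
with_e=$(grep -e '^the' message.txt)

# -i can be clustered in front of -e, or given as a separate option.
clustered=$(grep -ie '^the' message.txt)
separate=$(grep -e '^the' -i message.txt)

# -e can be repeated; a line that matches either pattern is selected.
# -c prints the number of matching lines instead of the lines themselves.
either=$(grep -c -e 'Ken' -e 'Bell' message.txt)

echo "$no_e"       # the design of the Unix system was to create an
echo "$clustered"
echo "$either"     # 2
```

The second echo prints both the "The Unix" line and the "the design" line, the same as the "-ie" example above.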

Grep Exercises

Use grep on the message.txt file from the sed exercises. Use grep to:
  1. Print lines that contain "at" as a whole word (a line should not match just because a longer word such as "that" has "at" in it).
  2. Print lines that contain numbers.
  3. Using the who command and a pipe determine what CIS-98 students are logged in.

Regular Expressions Quick Reference

Each entry gives the notation, its meaning, whether it also works in globbing, and examples with what they match.

*  Zero or more occurrences of the previous regular expression
   Globbing: Yes (but in a glob, * on its own matches zero or more of any character)
     x*              0 or more consecutive x's
     xx*             1 or more consecutive x's
     .*              0 or more characters
     w.*s            w followed by 0 or more characters followed by an s

.  Any single character
   Globbing: No (a glob uses ? for this; a dot in a glob is just a dot)
     a..             a followed by any two characters

^  Beginning of line
   Globbing: No
     ^The            The only if it appears at the beginning of a line

$  End of line
   Globbing: No
     x$              x only if it is the last character on the line
     ^INSERT$        a line containing just the characters INSERT
     ^$              a line that contains no characters

[chars]  Any character in chars
   Globbing: Yes
     [Tt]            Lower or uppercase t
     [a-z]           Lowercase letter
     [a-zA-Z]        Any alphabetic character

[^chars]  Any character not in chars
   Globbing: Yes (a glob also accepts [!chars])
     [^0-9]          Any nonnumeric character
     [^a-zA-Z]       Any nonalphabetic character

\<chars\>  Mark word boundaries around chars
   Globbing: No
     \<the\>         The word "the", but not "them", "there", or "other"

\{min,max\}  At least min and at most max occurrences of the previous regular expression
   Globbing: No
     x\{1,5\}        At least 1 and at most 5 x's
     [0-9]\{3,9\}    Anywhere from 3 to 9 successive digits
     [0-9]\{3\}      Exactly 3 digits
     [0-9]\{3,\}     At least 3 digits

\(...\)  Store characters matched between parentheses in next register (1-9)
   Globbing: No
     ^\(.\)          First character on the line, stored in register 1
     ^\(.\)\1        First and second characters on the line if they're the same
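
To see a few of these notations in action, here is a sketch that runs grep over a small sample file. The file name sample.txt and its contents are invented for this example. The word-boundary markers \< and \> are an extension that GNU grep supports, so I'm assuming a grep that understands them.

```shell
# Build a small, hypothetical sample file in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'the cat' 'them' 'other' 'call 555-1234' 'id 42' 'aardvark' 'INSERT' > sample.txt

whole_word=$(grep '\<the\>' sample.txt)       # "the" as a whole word
three_plus=$(grep '[0-9]\{3,\}' sample.txt)   # at least 3 digits in a row
doubled=$(grep '^\(.\)\1' sample.txt)         # first two characters are the same
exact=$(grep '^INSERT$' sample.txt)           # a line containing just INSERT

echo "$whole_word"   # the cat
echo "$three_plus"   # call 555-1234
echo "$doubled"      # aardvark
echo "$exact"        # INSERT
```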

One thing to remember about regular expressions is that they are greedy: they match the longest string that fits the pattern, not the shortest.
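
A quick sketch of what greedy matching means in practice. The sample sentence and the replacement letter Q are made up. The second command shows one common workaround, which is to use a character class that cannot cross the closing delimiter.

```shell
line='say "one" and "two" now'

# .* is greedy: the match runs from the first quote to the LAST quote.
greedy=$(printf '%s\n' "$line" | sed 's/".*"/Q/')

# [^"]* cannot cross a quote, so the match stops at the next one.
bounded=$(printf '%s\n' "$line" | sed 's/"[^"]*"/Q/')

echo "$greedy"    # say Q now
echo "$bounded"   # say Q and "two" now
```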
