=====================================
COMP 151 - Lecture Notes
=====================================
------------------------------------------
REGULAR LANGUAGES AND REGULAR EXPRESSIONS:
------------------------------------------
Regular languages are the simplest family of formal languages. Their grammar rules take one of the following forms:
A --> aB
A --> a
Where capital letters are non-terminals and lower-case letters are terminals.
Example: Unary numbers
--------
S --> A
A --> 1
A --> 1A
There are infinitely many sentences in this language.
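A derivation in this grammar can be traced mechanically. The sketch below is only an illustration: the function name derive_unary is hypothetical and not part of the course material. It applies the rule A --> 1A as often as needed and finishes with A --> 1.

```shell
# derive_unary N: print the unary sentence that consists of N ones (N >= 1).
# Hypothetical helper, for illustration only.
derive_unary() {
    n=$1
    sentence=""
    # Each pass applies the rule A --> 1A, which emits one terminal.
    while [ "$n" -gt 1 ]; do
        sentence="${sentence}1"
        n=$((n - 1))
    done
    # The last step applies A --> 1, which ends the derivation.
    printf '%s\n' "${sentence}1"
}

derive_unary 1   # prints 1
derive_unary 4   # prints 1111
```

Since N can be any positive integer, the loop shows why the language has no longest sentence.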
Example: Binary Numbers
--------
S --> A
A --> 0
A --> 1
A --> 0A
A --> 1A
There are infinitely many sentences in this language as well.
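The binary grammar can also be read as a recipe for checking a string one symbol at a time. The following is a minimal sketch under that reading; is_binary is a hypothetical name chosen for this example.

```shell
# is_binary STRING: succeed if STRING can be derived from S.
# Hypothetical helper, for illustration only.
is_binary() {
    s=$1
    # Every derivation produces at least one terminal.
    [ -n "$s" ] || return 1
    while [ -n "$s" ]; do
        case $s in
            # A --> 0A or A --> 1A (A --> 0 or A --> 1 on the last symbol)
            [01]*) s=${s#?} ;;
            # Any other symbol is not a terminal of this grammar.
            *) return 1 ;;
        esac
    done
    return 0
}

is_binary 1011 && echo "1011 is in the language"
is_binary 102 || echo "102 is not in the language"
```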
Example: Decimal Numbers?
--------
The rule list will be unnecessarily long for such a simple language!
Regular expressions are a concise way of describing regular languages in computer applications. The following conventions are used for a "minimalist" regular expression syntax:
* The alphabet is the ASCII character set (or another character set that the computer system supports).
* Any character in the alphabet represents itself, except those that are used by the regular expression syntax, such as *, [, ], etc. These characters must be escaped (preceded by a backslash, \) to represent themselves.
* Square brackets, [], indicate alternatives: exactly one of the enclosed characters is matched.
* The Kleene star, *, indicates zero or more repetitions of the preceding item (a character or a bracket).
Examples:
---------
Regular Expression    Sentences in the regular language
------------------    ---------------------------------
abc                   abc
a[bc]d                abd acd
ab*c                  ac abc abbc abbbc ...
a[bc]*d               ad abd acd abbd abbbd accd abcbcbbccd ...
a\[c                  a[c
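The table entries can be checked on a computer. The sketch below uses the grep utility, which is described later in these notes; the helper name in_language is hypothetical. The -x option makes grep accept a line only when the whole line matches, and -q suppresses the output so that only the exit status is used.

```shell
# in_language PATTERN STRING: succeed if the whole STRING matches PATTERN.
# Hypothetical helper, for illustration only.
in_language() {
    printf '%s\n' "$2" | grep -qx -e "$1"
}

in_language 'a[bc]*d' 'abcbcbbccd' && echo "abcbcbbccd is in the language"
in_language 'ab*c' 'ac' && echo "ac is in the language (zero repetitions of b)"
in_language 'a\[c' 'a[c' && echo "the escaped bracket matches itself"
in_language 'a[bc]d' 'abcd' || echo "abcd is not in the language"
```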
REGULAR EXPRESSIONS IN PRACTICE:
--------------------------------
Although the minimal syntax above is sufficient to describe any regular language, a few more constructs are used to make the expressions even more concise:
* Within a bracket, the dash, -, expresses a range of characters. For example, "[a-z]" means all characters between a and z (inclusive). This is easier than writing all 26 characters by hand. A few shortcuts exist for common ranges, such as "[:alnum:]", which is written inside a bracket: "[[:alnum:]]" stands for "[0-9A-Za-z]".
* When a caret, ^, appears at the beginning of a bracket, it means a negation. For example [^0-9] means "any character except the range 0-9".
* The dot, ., matches any single character.
Examples:
---------
Regular Expression    Sentences in the regular language
------------------    ---------------------------------
a[b-k]l               abl acl .... akl
a[^a-z]z              a0z a1z a.z a@z "a z" ...
a.z                   abz "a z" acz a0z ...
a.*z                  az aaz aaaaaz abz abcdz a01bcdefz ...
a\.\]\\               a.]\
a[[:alnum:]]z         aaz abz ... a0z ... aAz aBz ...
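To see ranges, negation, the dot and the [:alnum:] shortcut at work, the sketch below counts how many strings of a small sample list match each expression. The sample strings and the names candidates and count_matches are hypothetical. LC_ALL=C is set on the assumption that this makes ranges such as [a-z] follow ASCII order.

```shell
# Hypothetical sample strings, one per line.
candidates() {
    printf '%s\n' abl akl aml a0z abz 'a z' az
}

# count_matches PATTERN: count the candidates that match PATTERN as a whole line.
count_matches() {
    candidates | LC_ALL=C grep -cx -e "$1"
}

count_matches 'a[b-k]l'         # matches abl and akl
count_matches 'a[^a-z]z'        # matches a0z and "a z"
count_matches 'a.z'             # matches a0z, abz and "a z"
count_matches 'a.*z'            # matches a0z, abz, "a z" and az
count_matches 'a[[:alnum:]]z'   # matches a0z and abz
```

Note that "aml" is rejected by a[b-k]l because m lies outside the range, and "az" is rejected by a.z because the dot needs exactly one character.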
EXTENDED REGULAR EXPRESSIONS:
-----------------------------
Over time, as the power of computers increased, the regular expression syntax was extended to include languages that are slightly more complex than regular languages. This syntax is referred to as "extended regular expression syntax", and it adds the following conventions:
* ^ outside a bracket expression represents the beginning of a line
* $ represents the end of a line.
* ? means the preceding item is optional: it appears at most once.
* + means the preceding item is repeated one or more times.
* {n} means the preceding item is repeated exactly n times.
* {n,} means the preceding item is repeated n or more times.
* {n,m} means the preceding item is repeated at least n times, but not more than m times.
* Parts of regular expressions can be grouped within parentheses, ().
* Two regular expressions can be joined with the infix operator, |, meaning OR. Thus either match is allowed.
* \n, where n is a single digit, is a reference to the substring that was previously matched by the n'th parenthesized subexpression.
NOTE: The above list extends the set of special characters to: .+*?[]{}()^$|\
Examples:
---------
Regular Expression    Sentences in the regular language
------------------    ---------------------------------
^abc                  "abc" at the beginning of the line
^abc$                 a line that contains only "abc"
^[0-9]*$              a line that contains only digits (or is empty)
ab?c                  ac abc
ab+c                  abc abbc abbbc ...
ab{2}c                abbc
ab{2,}c               abbc abbbc abbbbc ...
ab{2,3}c              abbc abbbc
(abc)|(def)           abc def
([0-9])\.\1           0.0 1.1 2.2 3.3 ...
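The extended constructs can be tried out in the same way. The sketch below is illustrative: count_ere is a hypothetical helper, and it relies on the "-E" option of grep (described in the next section) to switch to the extended syntax. The last line assumes that the installed grep supports back-references together with -E, as GNU grep does.

```shell
# count_ere PATTERN STRING...: count the strings that match PATTERN as a whole
# line, interpreting PATTERN as an extended regular expression.
# Hypothetical helper, for illustration only.
count_ere() {
    pattern=$1
    shift
    printf '%s\n' "$@" | grep -Ecx -e "$pattern"
}

count_ere 'ab?c'        ac abc abbc abbbc abbbbc   # matches ac and abc
count_ere 'ab+c'        ac abc abbc abbbc abbbbc   # matches all except ac
count_ere 'ab{2,3}c'    ac abc abbc abbbc abbbbc   # matches abbc and abbbc
count_ere '(abc)|(def)' abc def abcdef             # matches abc and def
count_ere '([0-9])\.\1' 0.0 1.2 3.3                # matches 0.0 and 3.3
```

The string 1.2 is rejected by the last expression because \1 must repeat the digit that the parenthesized group matched.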
-------------------------------------------------------
USING REGULAR EXPRESSIONS TO SEARCH OR MANIPULATE TEXT:
-------------------------------------------------------
We will use two utilities here: "grep" and "sed".
grep stands for "global regular expression print", after the ed editor command g/re/p. It searches for lines that match a regular expression, either in files or on standard input. The synopsis for its usage is:
grep [options] PATTERN [FILE...]
The PATTERN is a regular expression. If the "-E" option is given, the pattern is interpreted as an extended regular expression.
Examples:
---------
grep "abc"                    : match lines containing "abc" from standard input (press CTRL-D to end input)
grep "abc" some.file          : do a similar match for lines in the file "some.file"
ls -l | grep "txt"            : match the lines from the output of "ls -l" that contain "txt". What happens when a file is named "txt.nottxt"?
ls -l | grep "\.txt$"         : filter for lines that end in ".txt"
ls -l | grep "^-......r.."    : filter for regular files that are world readable
grep " 200[0-9] " some.file   : filter lines from the file that mention the years 2000-2009
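The difference between the unanchored and the anchored search can be tried without touching real files. In the sketch below the file names and text lines are made up for the example, and the helper names names and years are hypothetical.

```shell
# Hypothetical file names standing in for the output of "ls".
names() {
    printf '%s\n' notes.txt txt.nottxt report.pdf
}

names | grep 'txt'       # prints notes.txt and txt.nottxt
names | grep '\.txt$'    # prints only notes.txt

# Hypothetical text lines; the pattern covers the years 2000 to 2009 only.
years() {
    printf '%s\n' "born in 1999 ." "built in 2004 ." "sold in 2010 ."
}

years | grep ' 200[0-9] '   # prints: built in 2004 .
```

The pattern is written in single quotes here so that the shell passes the backslash and the dollar sign to grep unchanged.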
sed stands for "stream editor". Its typical usage is to find text matching a regular expression and replace it with something else. We use sed's "substitute" command for this.
sed s/tomatch/toreplace/
Example: Match the lines starting with "d", replace the "d" with "DIRECTORY" to emphasize the output
ls -l | sed 's/^d/DIRECTORY/'
Example: Emphasize the world readable directories by replacing the "r" with something else (requires extended regular expressions, hence the "-r" option in the command)
ls -l | sed -r 's/^(d......)r/\1WORLD READABLE/'
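The effect of a substitution is easiest to see on fixed input. The sketch below runs sed on made-up lines in the style of "ls -l" output; the helper name listing and the sample lines are hypothetical. The second command is a variant that uses \1 to put back the first seven characters, so that only the eighth character, the "r" for others, is replaced.

```shell
# Hypothetical lines in the style of "ls -l" output.
listing() {
    printf '%s\n' \
        'drwxr-xr-x 2 user group 4096 docs' \
        'drwx------ 2 user group 4096 private' \
        '-rw-r--r-- 1 user group  120 notes.txt'
}

# Replace the leading "d" with "DIRECTORY" on the two directory lines.
listing | sed 's/^d/DIRECTORY/'

# Keep the first seven characters (group 1) and replace the eighth, "r".
# Only the "docs" line changes: "private" is not world readable and
# "notes.txt" does not start with "d".
listing | sed -r 's/^(d......)r/\1WORLD READABLE/'
```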
Last Updated : 2008-11-20 09:40:00