Istanbul Bilgi University Department of Computer Science

=====================================
COMP 151 - Lecture Notes
=====================================

------------------------------------------
REGULAR LANGUAGES AND REGULAR EXPRESSIONS:
------------------------------------------
Regular languages are the simplest family of languages. Their grammar rules all take one of the following forms:
            A --> aB
            A --> a
where capital letters are non-terminals and lower-case letters are terminals.

Example: Unary numbers
--------
    S --> A
    A --> 1
    A --> 1A
    
    There are infinitely many sentences in the language.
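
To see how the rules produce a sentence, a derivation can be traced step by step. The shell function below is only an illustrative sketch; the name derive_unary is made up for these notes. It prints every intermediate form reached while deriving the sentence with N ones.

```shell
# derive_unary N: print the derivation of the unary sentence with N ones (N >= 1).
# Hypothetical helper, for illustration only.
derive_unary() {
    n=$1
    form="A"                  # result of applying S --> A
    echo "S"
    echo "$form"
    while [ "$n" -gt 1 ]; do
        form="${form%A}1A"    # apply A --> 1A to the trailing non-terminal
        echo "$form"
        n=$((n - 1))
    done
    form="${form%A}1"         # finish with A --> 1
    echo "$form"
}

derive_unary 3
```

For N = 3 the printed forms are S, A, 1A, 11A and 111. The rule A --> 1A can be applied any number of times before A --> 1 ends the derivation.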

Example: Binary Numbers
--------
    S --> A
    A --> 0
    A --> 1
    A --> 0A
    A --> 1A

    There are infinitely many sentences in this language as well.

Example: Decimal Numbers?
--------
    The rule list will be unnecessarily long for such a simple language!
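
To get a feeling for the length, the rule list can be generated instead of typed. This sketch assumes the decimal grammar follows the same pattern as the binary one (one "A --> d" and one "A --> dA" rule per digit); the function name decimal_rules is made up.

```shell
# Print the rule list for decimal numbers, assuming the same pattern
# as the binary grammar: two rules per digit plus the start rule.
decimal_rules() {
    echo "S --> A"
    for d in 0 1 2 3 4 5 6 7 8 9; do
        echo "A --> $d"
    done
    for d in 0 1 2 3 4 5 6 7 8 9; do
        echo "A --> ${d}A"
    done
}

decimal_rules            # show the rules
decimal_rules | wc -l    # count them: 21 rules
```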

Regular expressions are a concise way of describing regular languages in computer applications. The following conventions are used for a "minimalist" regular expression syntax:
    * The alphabet is the ASCII character set (or another character set the computer system supports).
    * Any character in the alphabet represents itself, except those that have a special meaning in regular expressions, such as *, [, ], etc. These characters must be escaped (preceded by a backslash, \) to represent themselves.
    * Square brackets, [], indicate alternatives: a bracket expression matches any one of the characters listed inside it.
    * The Kleene star, *, indicates zero or more repetitions of the preceding character or bracket expression.

Examples:
---------
    Regular Expression      Sentences in the regular language
    ------------------      ---------------------------------
    abc                     abc
    a[bc]d                  abd acd
    ab*c                    ac abc abbc abbbc ...
    a[bc]*d                 ad abd acd abbd abbbd accd abcbcbbccd ...
    a\[c                    a[c
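
The rows of the table can be checked on a computer. The sketch below uses grep (introduced later in these notes) with the -x option, which accepts a line only if the whole line matches. The helper name matches is made up.

```shell
# matches REGEX STRING: succeed if STRING as a whole is a sentence of REGEX.
# Hypothetical helper; grep -x requires the match to span the whole line.
matches() {
    printf '%s\n' "$2" | grep -q -x -e "$1"
}

matches 'abc'     'abc'        && echo "abc: yes"
matches 'a[bc]d'  'acd'        && echo "acd: yes"
matches 'ab*c'    'ac'         && echo "ac: yes"
matches 'a[bc]*d' 'abcbcbbccd' && echo "abcbcbbccd: yes"
matches 'a\[c'    'a[c'        && echo "a[c: yes"
matches 'a[bc]d'  'abcd'       || echo "abcd: no"
```

The last line shows a string that is not in the language of a[bc]d: the bracket stands for exactly one character, so "abcd" is too long.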

REGULAR EXPRESSIONS IN PRACTICE:
--------------------------------
Although the minimal syntax above can already describe many regular languages, a few more constructs are used to make the expressions even more concise:
    * Within a bracket, the dash, -, expresses a range of characters. For example, "[a-z]" means all characters between a and z (inclusive). This is easier than writing all 26 characters by hand. A few named classes are available for common ranges, such as "[:alnum:]": written inside a bracket, as in "[[:alnum:]]", it stands for "[0-9A-Za-z]".
    * When a caret, ^, appears at the beginning of a bracket, it negates the bracket. For example, [^0-9] means "any character except those in the range 0-9".
    * The dot, ., matches any character.

Examples:
---------
    Regular Expression      Sentences in the regular language
    ------------------      ---------------------------------
    a[b-k]l                 abl acl .... akl
    a[^a-z]z                a0z a1z a.z a@z "a z" ...
    a.z                     abz "a z" acz a0z ...
    a.*z                    az aaz aaaaaz abz abcdz a01bcdefz ...
    a\.\]\\                 a.]\
    a[[:alnum:]]z           aaz abz ... a0z ... aAz aBz ...
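
Ranges, negation, the dot and named classes can be tried on a small list of candidate strings. This is a sketch only: the candidate list is made up, and LC_ALL=C is set on the assumption that ranges such as a-z should follow plain ASCII order.

```shell
# Candidate strings, one per line (made-up test data).
candidates() {
    printf '%s\n' abl akl aml 'a z' a0z abz aAz az
}

# grep -x keeps only the lines that match as a whole.
candidates | LC_ALL=C grep -x 'a[b-k]l'         # abl akl
candidates | LC_ALL=C grep -x 'a[^a-z]z'        # "a z" a0z aAz
candidates | LC_ALL=C grep -x 'a.z'             # "a z" a0z abz aAz
candidates | LC_ALL=C grep -x 'a.*z'            # the same four lines, plus az
candidates | LC_ALL=C grep -x 'a[[:alnum:]]z'   # a0z abz aAz
```

Note that "aml" never appears: m lies outside the range b-k, and none of the other expressions end in l.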

EXTENDED REGULAR EXPRESSIONS:
-----------------------------
As computers became more powerful, the regular expression syntax was extended. Most of the additions are shorthand that still describes regular languages; back-references (\n below) go further and can describe some languages that are slightly more complex than regular languages. The syntax is referred to as "extended regular expression syntax", and adds the following conventions:
    * ^ outside a bracket expression represents the beginning of a line
    * $ represents the end of a line.
    * ? means the preceding item is optional: it appears at most once.
    * + means the preceding item is repeated one or more times.
    * {n} means the preceding item is repeated exactly n times.
    * {n,} means the preceding item is repeated n or more times.
    * {n,m} means the preceding item is repeated at least n times, but not more
      than m times.
    * Parts of regular expressions can be grouped within parentheses, ().
    * Two regular expressions can be joined with the infix operator, |, meaning OR. Thus either match is allowed.
    * \n, where n is a single digit, is a reference to the substring that previously matched the n'th parenthesized subexpression.

NOTE: The above list extends the set of special characters to: .+*?[]{}()^$|\

Examples:
---------

    Regular Expression      Sentences in the regular language
    ------------------      ---------------------------------
    ^abc                    "abc" at the beginning of the line
    ^abc$                   a line that contains only "abc"
    ^[0-9]*$                a line that contains only digits (or is empty)
    ab?c                    ac abc
    ab+c                    abc abbc abbbc ...
    ab{2}c                  abbc
    ab{2,}c                 abbc abbbc abbbbc ...
    ab{2,3}c                abbc abbbc
    (abc)|(def)             abc def
    ([0-9])\.\1             0.0 1.1 2.2 3.3 ...
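
Some rows of this table can be checked with grep -E, which selects the extended syntax. The helper name ematches is made up, and the back-reference lines assume GNU grep, since not every grep accepts \1 together with -E.

```shell
# ematches REGEX STRING: succeed if the whole STRING matches the extended REGEX.
# Hypothetical helper; -E selects extended syntax, -x anchors at both ends.
ematches() {
    printf '%s\n' "$2" | grep -E -q -x -e "$1"
}

ematches 'ab?c'        'ac'    && echo "ac matches ab?c"
ematches 'ab{2,3}c'    'abbbc' && echo "abbbc matches ab{2,3}c"
ematches 'ab{2,3}c'    'abc'   || echo "abc does not match ab{2,3}c"
ematches '(abc)|(def)' 'def'   && echo "def matches (abc)|(def)"
# Back-reference: \1 must repeat whatever digit the parentheses matched
# (assumes GNU grep).
ematches '([0-9])\.\1' '3.3'   && echo "3.3 matches ([0-9])\.\1"
ematches '([0-9])\.\1' '3.4'   || echo "3.4 does not match ([0-9])\.\1"
```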

-------------------------------------------------------
USING REGULAR EXPRESSIONS TO SEARCH OR MANIPULATE TEXT:
-------------------------------------------------------
We will use two utilities here: the "grep" and "sed" programs.

grep takes its name from the ed editor command g/re/p ("globally search for a regular expression and print"). It searches for lines that match a regular expression, either in files or on standard input. The synopsis for its usage is:
     grep [options] PATTERN [FILE...]

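The PATTERN is a regular expression. If the "-E" option is given, it is interpreted as an extended regular expression. Without "-E", the characters ?, +, {, }, (, ) and | are ordinary characters, while ^ and $ still anchor a match to the beginning and end of a line.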
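
The effect of "-E" is easiest to see with a pattern that contains +. In this small sketch the two input lines are made up: the text "abbc" and the literal text "ab+c".

```shell
# Two input lines: "abbc" and the literal text "ab+c" (made-up sample).
sample() {
    printf '%s\n' 'abbc' 'ab+c'
}

sample | grep 'ab+c'       # basic syntax: + is an ordinary character, prints ab+c
sample | grep -E 'ab+c'    # extended syntax: + means "one or more", prints abbc
```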

Examples:
---------
    grep "abc" : match lines containing "abc" from standard input (press CTRL-D to end input)
    grep "abc" some.file : do a similar match for lines in the file "some.file"
    ls -l | grep "txt" : keep the lines of the "ls -l" output that contain "txt". What happens when a file is named "txt.nottxt"?
    ls -l | grep "\.txt$" : filter for lines that end in ".txt"
    ls -l | grep "^-......r.." : filter for regular files that are world readable
    grep " 200[0-9] " some.file : filter lines from the file that mention the years 2000-2009
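
Since the output of "ls -l" differs from one directory to another, the filters can also be tried on a fixed listing. The listing below is made up for illustration (file names, owner and dates are not real), and the function name listing is hypothetical.

```shell
# A made-up "ls -l"-style listing, so the filters can be tried without real files.
listing() {
    cat <<'EOF'
-rw-r--r-- 1 ali users  120 Mar  3 2004 notes.txt
-rw------- 1 ali users  300 Mar  4 2004 secret.txt
-rw-r--r-- 1 ali users   80 Mar  5 2004 txt.nottxt
drwxr-xr-x 2 ali users 4096 Mar  6 2004 homework
EOF
}

listing | grep "txt"            # three lines, including txt.nottxt
listing | grep "\.txt$"         # notes.txt and secret.txt only
listing | grep "^-......r.."    # world-readable regular files: notes.txt, txt.nottxt
listing | grep " 200[0-9] "     # all four lines: each mentions the year 2004
```

The first two commands show why the escaped dot and the $ anchor matter: "txt" alone also selects txt.nottxt, while "\.txt$" requires the line to end in ".txt".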
    
sed stands for "stream editor". Its typical use is to find text matching a regular expression and replace it with something else. We use sed's "substitute" facility for this:
    sed s/tomatch/toreplace/

Example: Match the lines starting with "d" and replace the "d" with "DIRECTORY" to emphasize the output
    ls -l | sed 's/^d/DIRECTORY/'

Example: Emphasize the world-readable directories by replacing the "r" with something else (requires extended regular expressions, hence the "-r" option in the command)
     ls -l | sed -r 's/^(d......)r/\1WORLD READABLE/'
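
Both kinds of substitution can be tried on a fixed listing. The listing is made up, the function name listing is hypothetical, and "-E" is used here on the assumption that the installed sed accepts it for extended syntax (GNU sed and BSD sed do).

```shell
# Made-up "ls -l"-style lines (hypothetical data).
listing() {
    cat <<'EOF'
-rw-r--r-- 1 ali users  120 Mar  3 2004 notes.txt
drwxr-xr-x 2 ali users 4096 Mar  6 2004 homework
drwx------ 2 ali users 4096 Mar  7 2004 private
EOF
}

# Replace a leading "d" with the word DIRECTORY (changes the last two lines).
listing | sed 's/^d/DIRECTORY/'

# Mark world-readable directories: keep the first seven characters (\1)
# and replace the "r" that follows them (changes only the homework line).
listing | sed -E 's/^(d......)r/\1WORLD READABLE/'
```

In the second command the parentheses capture "d" plus the six permission characters of the owner and the group, so \1 puts them back unchanged and only the eighth character is replaced.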