Friday, April 30, 2021

REGEX (Regular Expressions)

Start and end of line

# matches all lines starting with "cat"

egrep '^cat' regex.txt

# matches all lines ending with "cat"

egrep 'cat$' regex.txt

# matches all lines that have "cat" anywhere on the line

egrep 'cat' regex.txt

# matches a line that contains only "cat"

egrep '^cat$' regex.txt

# matches empty lines

egrep '^$' regex.txt

# matches non-empty lines (-v is for negating the output)

egrep -v '^$' regex.txt
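The anchors above are easy to try on a throwaway file. The sketch below is only an illustration: the sample lines and the `sample` variable are made up (the real contents of regex.txt are not shown in these notes), and `grep -E` is used as the standard spelling of `egrep`.

```shell
# Hypothetical sample file; its contents are assumptions for this demo.
sample=$(mktemp)
printf '%s\n' 'cat' 'catalog' 'bobcat' 'the cat sat' '' 'dog' > "$sample"

# -c prints the number of matching lines instead of the lines themselves.
starts=$(grep -E -c '^cat' "$sample")     # "cat", "catalog"
ends=$(grep -E -c 'cat$' "$sample")       # "cat", "bobcat"
anywhere=$(grep -E -c 'cat' "$sample")    # every line containing "cat"
only=$(grep -E -c '^cat$' "$sample")      # just the line "cat"
empty=$(grep -E -c '^$' "$sample")        # the blank line
nonempty=$(grep -E -v -c '^$' "$sample")  # everything except the blank line

echo "starts=$starts ends=$ends anywhere=$anywhere only=$only empty=$empty nonempty=$nonempty"
rm -f "$sample"
```

Counting with `-c` keeps the output short; drop it to see the matching lines themselves.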

Single character match

# there must be exactly one character (any character) between "a" and "c"

egrep 'a.c' regex.txt

Character class

# matches "gray" or "grey"

egrep 'gr[ae]y' regex.txt

#  combining several character classes

egrep 'sep[ea]r[ea]te' regex.txt

# matches "<H1>", "<H2>", and "<H3>"

egrep '<H[123]>' regex.txt

# same as above ^ (provides a range)

egrep '<H[1-3]>' regex.txt

# matches "<H->" (a "-" placed first in the class is a literal dash, not a range)

egrep '<H[-]>' regex.txt

# multiple ranges are fine

egrep '<H[0123456789abcdefABCDEF]>' regex.txt

# simplified version of the above ^ expression

egrep '<H[0-9a-fA-F]>' regex.txt

# matches "<H!>", "<H.>", "<H_>", and "<H?>" ("." and "?" are literal inside a class)

egrep '<H[!._?]>' regex.txt

# matches "<H", then any one character that is not "x", then ">";

# a character must be present, so "<H>" does not match (remember this concept)

egrep '<H[^x]>' regex.txt

# matches "<Hc>" where c is any one character other than "1", "2",

# or "3"

egrep '<H[^1-3]>' regex.txt
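A negated class still has to consume one character, and a leading "-" is a plain dash. The short sketch below illustrates both points; the sample lines are invented for the demo, and `grep -E` stands in for `egrep`.

```shell
# Sample lines are made up for this demo.
lines=$(printf '%s\n' '<H1>' '<H4>' '<Hx>' '<H>' '<H->')

# [^1-3] needs exactly one character that is not 1, 2, or 3,
# so "<H>" (no character at all) does not match.
negated=$(printf '%s\n' "$lines" | grep -E '<H[^1-3]>')

# A "-" placed first in the class is a literal dash, not a range.
dash=$(printf '%s\n' "$lines" | grep -E '<H[-]>')

echo "negated class matched:"
echo "$negated"
echo "literal dash matched:"
echo "$dash"
```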

Alternatives

# matches "gray" or "grey"

egrep 'gray|grey' regex.txt

# same as above ^

egrep 'gr(a|e)y' regex.txt

# matches any line that begins with 'From: ',

# 'To: ', or 'Subject: '

egrep '^(From|To|Subject): ' regex.txt
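The parentheses matter because "|" has the lowest precedence: without them, "^" and ": " attach only to the first and last alternative. A small sketch with invented mail-like lines (the header values are assumptions for the demo):

```shell
# Invented mail-like lines for this demo.
mail=$(printf '%s\n' 'From: alice' 'Reply-To: bob' 'Subject: hi' 'To: carol')

# With parentheses, "^" and ": " apply to every alternative.
grouped=$(printf '%s\n' "$mail" | grep -E -c '^(From|To|Subject): ')

# Without them the pattern means "^From" OR "To" OR "Subject: ",
# so "To" may match anywhere, including inside "Reply-To".
ungrouped=$(printf '%s\n' "$mail" | grep -E -c '^From|To|Subject: ')

echo "grouped=$grouped ungrouped=$ungrouped"
```

Here the grouped pattern skips the "Reply-To" line, while the ungrouped one counts it.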

Word boundaries

# matches all lines that have a word which starts with "cat"

egrep '\<cat' regex.txt

# matches all lines that have a word which ends with "cat"

egrep 'cat\>' regex.txt

# matches only lines that have the word "cat" on its own, not

# embedded within another word (or string), e.g. this will

# match the line `the cat is furry` but not `concatenate this file`

egrep '\<cat\>' regex.txt
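To see how the three word-boundary forms differ, the sketch below counts matches over a few invented lines. It assumes a grep that supports `\<` and `\>` (GNU grep does; they are not part of POSIX ERE), and uses `grep -E` in place of `egrep`.

```shell
# Sample lines are made up for this demo.
text=$(printf '%s\n' 'the cat is furry' 'concatenate this file' 'catalog' 'bobcat')

n_plain=$(printf '%s\n' "$text" | grep -E -c 'cat')       # all four lines
n_start=$(printf '%s\n' "$text" | grep -E -c '\<cat')     # "the cat is furry", "catalog"
n_end=$(printf '%s\n' "$text" | grep -E -c 'cat\>')       # "the cat is furry", "bobcat"
n_whole=$(printf '%s\n' "$text" | grep -E -c '\<cat\>')   # "the cat is furry" only

echo "plain=$n_plain start=$n_start end=$n_end whole=$n_whole"
```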

Optional items

# matches lines with string "color" or "colour" ("u" is optional)

egrep 'colou?r' regex.txt

# same as `egrep '(July|Jul) (4th|four|4)' regex.txt`

egrep 'July? (four|4(th)?)' regex.txt

# allows one optional space

egrep '<H1 ?>' regex.txt

Quantifiers: repetition

# matches "<H1>", "<H1 >", "<H1  >", "<H1   >", and

# so on (no space, w/ one space, or w/ more than

# one space after H1)

egrep '<H1 *>' regex.txt

# matches "<H1 >", "<H1  >", "<H1   >", and so on

# (at least one space after H1 is required)

egrep '<H1 +>' regex.txt

# matches "<H>", "<H1>", "<H2>", "<H3>", ... "<H9>", and also

# multi-digit forms such as "<H12>" (number after "H" is not required)

egrep '<H[0-9]*>' regex.txt

# matches "<H0>", "<H1>", "<H2>", ... "<H9>", "<H12>", and so on

# (at least one digit after "H" is required)

egrep '<H[0-9]+>' regex.txt

# matches "o" at least once and at most 3

# times ({min,max})

egrep 'co{1,3}l' regex.txt

# matches "o" exactly 3 times ({min,max} with min equal to max)

egrep 'co{3,3}l' regex.txt

# see p 75/780 of "OReilly - Mastering Regular Expressions" book

egrep '<HR +SIZE *= *[0-9]+ *>' regex.txt
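The quantifiers are easiest to compare side by side on the same input. The sketch below is an illustration with invented words and tags; `grep -E` stands in for `egrep`, and the pattern is single-quoted so the shell does not treat "<" and ">" as redirections.

```shell
# Words with zero to four "o"s between "c" and "l" (made up for this demo).
words=$(printf '%s\n' 'cl' 'col' 'cool' 'coool' 'cooool')

n_opt=$(printf '%s\n' "$words" | grep -E -c 'co?l')         # zero or one "o"
n_star=$(printf '%s\n' "$words" | grep -E -c 'co*l')        # zero or more
n_plus=$(printf '%s\n' "$words" | grep -E -c 'co+l')        # one or more
n_range=$(printf '%s\n' "$words" | grep -E -c 'co{1,3}l')   # one to three
n_exact=$(printf '%s\n' "$words" | grep -E -c 'co{3}l')     # same as co{3,3}l

# Invented <HR> tags: only the first two have the required space and digits.
tags=$(printf '%s\n' '<HR SIZE=14>' '<HR   SIZE = 14 >' '<HRSIZE=14>' '<HR SIZE=>')
n_hr=$(printf '%s\n' "$tags" | grep -E -c '<HR +SIZE *= *[0-9]+ *>')

echo "?=$n_opt *=$n_star +=$n_plus {1,3}=$n_range {3}=$n_exact hr=$n_hr"
```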

Parentheses and backreferences

# matches all words that are repeated at least

# twice (with space between repetitions) like

# "the the", "apple apple apple", etc

#   - not all `egrep` supports backreference and `\< .. \>`

egrep '\<([a-zA-Z]+) +\1\>' regex.txt

# same as above but this time this version also

# matches double words with different capitalization

# like "The the"

#

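# the pattern is unchanged; the -i option is what makes the

# match (including the backreference, at least on GNU grep) ignore case

egrep -i '\<([a-zA-Z]+) +\1\>' regex.txt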
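A rough way to check the doubled-word patterns is to run them with and without `-i` over a few invented lines. This sketch assumes GNU grep, where backreferences, `\<`, and `\>` are available and `-i` also applies to the backreference; other implementations may behave differently.

```shell
# Sample lines are made up for this demo.
text=$(printf '%s\n' 'the the cat' 'The the cat' 'the theory' 'apple apple apple' 'no repeats here')

# Case-sensitive: "The the" is not a repeat, and "the theory" is
# rejected because \> requires the second copy to end a word.
same_case=$(printf '%s\n' "$text" | grep -E -c '\<([a-zA-Z]+) +\1\>')

# With -i, GNU grep also accepts "The the".
any_case=$(printf '%s\n' "$text" | grep -E -i -c '\<([a-zA-Z]+) +\1\>')

echo "same_case=$same_case any_case=$any_case"
```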

Escape sequence

# removes the special meaning of ".", which

# is to match any single character

egrep 'www\.facebook\.com' regex.txt

Miscellaneous

# shell glob, not a regex: moves all non-hidden files whose names

# contain a "." from the current directory to the target directory

mv *.* Archive/
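Note that `*.*` is expanded by the shell as a glob, where "." is a literal dot and "*" means any run of characters. The sketch below tries it in a scratch directory; the file names and the `work` variable are made up for the demo.

```shell
# Work in a scratch directory so nothing real gets moved.
work=$(mktemp -d)
(
  cd "$work" || exit 1
  mkdir Archive
  touch report.txt photo.jpg notes .hidden.txt
  # Matches non-hidden names that contain a "." somewhere.
  mv *.* Archive/
)

echo "left behind:"
ls -A "$work"            # "notes" has no dot; ".hidden.txt" is hidden
echo "moved:"
ls -A "$work/Archive"
# Remove the scratch directory with: rm -r "$work"
```

A file without an extension, such as `notes`, stays where it was, so `*.*` is not quite "all non-hidden files".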

Some examples

# matches a variable name that is allowed to contain only

# alphanumeric characters and underscores, but which may

# not begin with a number

egrep '[a-zA-Z_][a-zA-Z_0-9]*' regex.txt

# a string within double quotes (see book for explanation)

egrep '"[^"]*"' regex.txt

# dollar amount (with optional cents)

egrep '\$[0-9]+(\.[0-9][0-9])?' regex.txt

# time of day, such as "9:17 am" or "12:30 pm"

egrep '(1[012]|[1-9]):[0-5][0-9] (am|pm)' regex.txt
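These patterns find a match somewhere in a line; they do not check that the whole line is valid. If the goal is validation, one option is to anchor the pattern with "^" and "$". The sketch below does that for the variable-name pattern; `is_identifier` is a hypothetical helper written for this demo.

```shell
# is_identifier NAME: succeed only if the whole string is a valid name.
# (Hypothetical helper for this demo, not part of grep.)
is_identifier() {
  printf '%s\n' "$1" | grep -E -q '^[a-zA-Z_][a-zA-Z_0-9]*$'
}

# Unanchored, the pattern only needs to match somewhere in the line,
# so "9lives" still matches (starting at the "l").
printf '%s\n' '9lives' | grep -E -q '[a-zA-Z_][a-zA-Z_0-9]*' \
  && echo "unanchored: 9lives matches"

is_identifier 'max_count2' && echo "max_count2 is accepted"
is_identifier '9lives'     || echo "9lives is rejected"
```

The same idea applies to the time-of-day pattern, which would otherwise find "9:17 am" inside "99:17 am".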

Metacharacters

- special characters that are used to match and manipulate patterns

^  :  matches start of line

$  :  matches end of line

|  :  provides alternatives

.  :  matches any single character

() :  you can put alternatives inside (separated by |)

?  :  quantifier - optional item (must be placed after the optional item)

*  :  quantifier - similar to ?, matches zero, one, or more of the immediately-preceding item (the quantified item can never fail to match, since zero occurrences is allowed)

+  :  quantifier - similar to ?, MUST match one or more of the immediately-preceding item (fails if there is not at least one)

{min,max}  :  interval quantifier - matches the immediately-preceding item at least "min" times and at most "max" times
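The difference between "*" and "+" shows up directly in grep's exit status (0 when a line matched, 1 when none did). A minimal sketch, with `grep -E` standing in for `egrep` and a made-up input line:

```shell
# "x*" can match zero x's, so it matches every line, even "abc".
printf '%s\n' 'abc' | grep -E -q 'x*'
star_status=$?

# "x+" needs at least one x, so "abc" does not match.
printf '%s\n' 'abc' | grep -E -q 'x+'
plus_status=$?

echo "x*: exit $star_status   x+: exit $plus_status"
```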

Character Class

[]  :  represents a single character to match

Character Class Metacharacter

- these are special characters put inside character classes

- they have different meanings inside a character class compared to when placed outside a character class

-  :  provides range of characters (not considered a metacharacter if it is the first character in the class)

^  :  negates the list (only when it is the first character in the class)

Metasequences

- these are used for word boundaries

- use this if you want to search for a particular string that is not embedded in a larger word

- let's say you want to look only for the word "cat" and disregard lines with "catleya", "concatenate", etc

- this is not supported on all versions of egrep

\< :  the position at the start of a word

\> :  the position at the end of a word

\1 :  matches again the text that was matched by the first parenthesized group (used as backreferencing tool)

EGREP

egrep "^(From|Subject): " --> same as egrep "^From: |^Subject: "

- Not all egrep programs are the same. The supported set of metacharacters, as well as their meanings, are often different—see your local documentation

- The useful -i option discounts capitalization during a match

grep -i 'regular_expression' text_file  ## searches the contents of text_file for lines matching the regular expression

grep -i '^$' text_file  ## searches for blank lines

grep -i '^$' text_file | wc -l  ## returns the number of blank lines

grep . text_file  ## prints only the non-blank lines (the file itself is not modified)
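Since grep only prints lines and never edits its input, stripping blank lines means redirecting the output to a new file. A small sketch under that assumption; the file contents and the `src` and `clean` names are invented for the demo.

```shell
# Throwaway input with three blank lines (contents made up for this demo).
src=$(mktemp)
clean=$(mktemp)
printf '%s\n' 'first' '' 'second' '' '' 'third' > "$src"

# -c counts matching lines, which avoids the extra "| wc -l".
blank=$(grep -c '^$' "$src")

# Write the non-blank lines to a new file; "$src" is left untouched.
grep . "$src" > "$clean"

kept=$(grep -c . "$clean")
still_blank=$(grep -c '^$' "$src")

echo "blank=$blank kept=$kept blank lines still in source=$still_blank"
rm -f "$src" "$clean"
```

Note that `grep .` keeps lines that contain only spaces or tabs, because those still contain a character.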
