My Lazy Admin: REGEX (Regular Expressions)

Start and end of line

# matches all lines starting with "cat"
egrep '^cat' regex.txt

# matches all lines ending with "cat"
egrep 'cat$' regex.txt

# matches all lines that have "cat" anywhere on the line
egrep 'cat' regex.txt

# matches a line that contains only "cat"
egrep '^cat$' regex.txt

# matches empty lines
egrep '^$' regex.txt

# matches non-empty lines (-v inverts the match, printing lines that do NOT match)
egrep -v '^$' regex.txt
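
The anchors above can be tried on a throwaway file. This is a minimal sketch: the sample lines and the `sample` variable are made up for illustration, and `grep -E` is used as the standard equivalent of `egrep`.

```shell
# Build a small sample file (contents are invented for this example).
sample=$(mktemp)
printf '%s\n' 'cat' 'catalog' 'bobcat' 'the cat sat' '' > "$sample"

starts=$(grep -E '^cat' "$sample")    # lines beginning with "cat"
ends=$(grep -E 'cat$' "$sample")      # lines ending with "cat"
only=$(grep -E '^cat$' "$sample")     # lines that are exactly "cat"
blank=$(grep -c -E '^$' "$sample")    # number of empty lines

printf 'starts with cat:\n%s\n\nends with cat:\n%s\n\n' "$starts" "$ends"
printf 'exactly cat: %s\nempty lines: %s\n' "$only" "$blank"
rm -f "$sample"
```

Note that `'^cat'` also picks up "catalog", while `'^cat$'` does not.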

Single character match

# there must be exactly one character (any character) between "a" and "c"
egrep 'a.c' regex.txt

Character class

# matches "gray" or "grey"
egrep 'gr[ae]y' regex.txt

# combining several character classes
egrep 'sep[ea]r[ea]te' regex.txt

# matches "<H1>", "<H2>", and "<H3>"
egrep '<H[123]>' regex.txt

# same as above ^ (provides a range)
egrep '<H[1-3]>' regex.txt

# matches "<H->" (a "-" at the start of a class is a literal dash, not a range)
egrep '<H[-]>' regex.txt

# an explicit list of hexadecimal digits
egrep '<H[0123456789abcdefABCDEF]>' regex.txt

# simplified version of the above ^ expression
# (multiple ranges in one class are fine)
egrep '<H[0-9a-fA-F]>' regex.txt

# matches "<H!>", "<H.>", "<H_>", or "<H?>" ("." and "?" are literal inside a class)
egrep '<H[!._?]>' regex.txt

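# matches "<H", then any one character that is not "x", then ">".
# A character must be present: this matches "<H1>" but not
# "<H>" or "<Hx>" (remember this concept)
egrep '<H[^x]>' regex.txt

# matches "<H", then one character other than "1", "2",
# or "3", then ">"
egrep '<H[^1-3]>' regex.txt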
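
A negated class still has to consume one character, which is different from inverting the whole match with `-v`. The following is a small sketch on made-up sample lines (the `sample` file and variable names are assumptions for illustration; `grep -E` stands in for `egrep`).

```shell
sample=$(mktemp)
printf '%s\n' '<H>' '<H1>' '<H4>' '<Hx>' > "$sample"

# "<H", one character that is not "x", then ">": "<H>" has no such character.
not_x=$(grep -E '<H[^x]>' "$sample")

# "<H", one character outside 1-3, then ">".
not_123=$(grep -E '<H[^1-3]>' "$sample")

# By contrast, -v prints every line that does not contain "<H1>".."<H3>",
# including "<H>", which the negated class cannot match.
inverted=$(grep -v -E '<H[1-3]>' "$sample")

printf '[^x]:\n%s\n\n[^1-3]:\n%s\n\n-v [1-3]:\n%s\n' "$not_x" "$not_123" "$inverted"
rm -f "$sample"
```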

Alternatives

# matches "gray" or "grey"
egrep 'gray|grey' regex.txt

# same as above ^
egrep 'gr(a|e)y' regex.txt

# matches any line that begins with 'From: ',
# 'To: ', or 'Subject: '
egrep '^(From|To|Subject): ' regex.txt
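
The parentheses matter here because `|` has the lowest precedence. As a hedged sketch on invented mail headers (the addresses and the `sample` file are made up, and the pattern includes the `: ` described in the comment above), compare the grouped pattern with an ungrouped one:

```shell
sample=$(mktemp)
printf '%s\n' \
  'From: alice@example.com' \
  'To: bob@example.com' \
  'Subject: lunch' \
  'Date: today' \
  'Reply-To: carol@example.com' > "$sample"

# Grouped: the anchor and the ": " apply to every alternative.
grouped=$(grep -E '^(From|To|Subject): ' "$sample")

# The same thing spelled out without parentheses.
spelled=$(grep -E '^From: |^To: |^Subject: ' "$sample")

# Ungrouped: this means "^From" OR "To" OR "Subject: ",
# so it also picks up the "Reply-To" line.
ungrouped=$(grep -E '^From|To|Subject: ' "$sample")

printf 'grouped:\n%s\n\nungrouped:\n%s\n' "$grouped" "$ungrouped"
rm -f "$sample"
```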

Word boundaries

# matches all lines that have a word which starts with "cat"
egrep '\<cat' regex.txt

# matches all lines that have a word which ends with "cat"
egrep 'cat\>' regex.txt

# matches only lines that have the word "cat" which is not
# embedded within another word, e.g. this will
# match line `the cat is furry` but not `concatenate this file`
egrep '\<cat\>' regex.txt
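
The three word-boundary forms can be compared side by side. This sketch assumes a grep that supports `\<` and `\>` (GNU grep does; they are not part of the POSIX syntax) and uses made-up sample lines. The `-w` option, available in GNU and BSD grep, is shown as an alternative for the whole-word case.

```shell
sample=$(mktemp)
printf '%s\n' \
  'the cat is furry' \
  'concatenate this file' \
  'catalog of cats' \
  'a bobcat' > "$sample"

prefix=$(grep -E '\<cat' "$sample")     # a word starts with "cat"
suffix=$(grep -E 'cat\>' "$sample")     # a word ends with "cat"
whole=$(grep -E '\<cat\>' "$sample")    # "cat" as a complete word
with_w=$(grep -w 'cat' "$sample")       # -w: match whole words only

printf 'starts a word:\n%s\n\nends a word:\n%s\n\nwhole word: %s\n' \
  "$prefix" "$suffix" "$whole"
rm -f "$sample"
```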

Optional items

# matches lines with string "color" or "colour" ("u" is optional)
egrep 'colou?r' regex.txt

# same as `egrep '(July|Jul) (4th|four|4)' regex.txt`
egrep 'July? (four|4(th)?)' regex.txt

# allows one optional space
egrep '<H1 ?>' regex.txt

Quantifiers: repetition

# matches "<H1>", "<H1 >", "<H1  >", "<H1   >", and
# so on (no space, w/ one space, or w/ more than
# one space after H1)
egrep '<H1 *>' regex.txt

# matches "<H1 >", "<H1  >", "<H1   >", and so on
# (at least one space after H1 is required)
egrep '<H1 +>' regex.txt

# matches "<H>", "<H1>", "<H2>", ... "<H9>", and also
# "<H10>", "<H123>", etc. (any number of digits after "H",
# including none)
egrep '<H[0-9]*>' regex.txt

# matches "<H0>", "<H1>", ... "<H9>", "<H10>", etc.
# (at least one digit after "H" is required)
egrep '<H[0-9]+>' regex.txt

# matches "o" at least once and up to 3
# times ({min,max}): "col", "cool", "coool"
egrep 'co{1,3}l' regex.txt

# matches "o" exactly 3 times ({min,max}); same as 'co{3}l'
egrep 'co{3,3}l' regex.txt

# see p 75/780 of "O'Reilly - Mastering Regular Expressions" book
# (the pattern must be quoted, otherwise the shell treats "<" and ">" as redirections)
egrep '<HR +SIZE *= *[0-9]+ *>' regex.txt
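
To see how the quantifiers differ, the same five made-up words can be run through each one. This is only a sketch: the `sample` file and the `matching` helper are hypothetical names introduced here, and the patterns are anchored with `^` and `$` so that each line is tested as a whole.

```shell
sample=$(mktemp)
printf '%s\n' 'cl' 'col' 'cool' 'coool' 'cooool' > "$sample"

# Hypothetical helper: print every line matching pattern $1, joined by spaces.
matching() { grep -E "$1" "$sample" | paste -s -d ' ' -; }

optional=$(matching '^co?l$')       # zero or one "o"
star=$(matching '^co*l$')           # zero or more "o"
plus=$(matching '^co+l$')           # one or more "o"
interval=$(matching '^co{1,3}l$')   # between one and three "o"
exact=$(matching '^co{3}l$')        # exactly three "o"

printf '?      : %s\n*      : %s\n+      : %s\n{1,3}  : %s\n{3}    : %s\n' \
  "$optional" "$star" "$plus" "$interval" "$exact"
rm -f "$sample"
```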

Parentheses and backreferences

# matches all words that are repeated at least
# twice (with space between repetitions) like
# "the the", "apple apple apple", etc
#
# - not all `egrep` supports backreference and `\< .. \>`
egrep '\<([a-zA-Z]+) +\1\>' regex.txt

# same as above but this time this version also
# matches double words with different capitalization
# like "The the"
#
# the pattern alone is case-sensitive; the -i option
# is what makes it ignore capitalization
egrep -i '\<([a-zA-Z]+) +\1\>' regex.txt
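
The doubled-word pattern can be checked on a few made-up lines. This sketch assumes a grep that supports backreferences and `\<`/`\>` in extended mode (GNU grep does) and that `-i` also applies to the backreference, which is the GNU grep behaviour as far as I know. The sample file is invented for illustration.

```shell
sample=$(mktemp)
printf '%s\n' \
  'the the cat' \
  'The the cat' \
  'the theory' \
  'apple apple apple' > "$sample"

# Case-sensitive: \1 must repeat the captured word exactly.
same_case=$(grep -E '\<([a-zA-Z]+) +\1\>' "$sample")

# With -i, "The the" is treated as a repeated word too.
any_case=$(grep -E -i '\<([a-zA-Z]+) +\1\>' "$sample")

printf 'case-sensitive:\n%s\n\nwith -i:\n%s\n' "$same_case" "$any_case"
rm -f "$sample"
```

"the theory" is rejected in both runs because `\>` requires the repeated text to end at a word boundary.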

Escape sequence

# removes the special function of ".", which
# is to match any single character
egrep 'www\.facebook\.com' regex.txt

Miscellaneous

# note: this is a shell glob, not a regular expression.
# moves all non-hidden files whose names contain a "."
# from the current directory to the target directory
mv *.* Archive/
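
Since `*.*` is expanded by the shell rather than by a regex engine, it may help to see which names it selects. This is a rough sketch in a scratch directory with made-up file names. The regular expression shown next to it is my own approximation of the glob, not something taken from the notes above.

```shell
dir=$(mktemp -d)
touch "$dir/notes.txt" "$dir/archive.tar.gz" "$dir/README" "$dir/.hidden.cfg"

# Shell glob: "*" means "any run of characters" and "." is a literal dot.
# Names starting with "." are skipped by default, and "README" has no dot.
globbed=$(cd "$dir" && printf '%s\n' *.*)

# Roughly equivalent regular expression applied to the directory listing:
# a first character that is not ".", then anything, then a literal dot.
regexed=$(ls -A "$dir" | grep -E '^[^.].*\.')

printf 'glob *.* expands to:\n%s\n' "$globbed"
rm -rf "$dir"
```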

Some examples

# matches variable names that may contain only
# alphanumeric characters and underscores, but which may
# not begin with a number
egrep '[a-zA-Z_][a-zA-Z_0-9]*' regex.txt

# a string within double quotes (see book for explanation)
egrep '"[^"]*"' regex.txt

# dollar amount (with optional cents)
egrep '\$[0-9]+(\.[0-9][0-9])?' regex.txt

# time of day, such as "9:17 am" or "12:30 pm"
egrep '(1[012]|[1-9]):[0-5][0-9] (am|pm)' regex.txt
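
One way to check patterns like these is to print only the matched text with `-o` (available in GNU and BSD grep, not required by POSIX). The sentence below is made up. The stricter time pattern with `\<` and `\>` is an addition of mine, shown because the original pattern can match inside a longer number.

```shell
text='Lunch at 12:30 pm cost $9.50; the 9:17 am coffee was $3 and 13:45 pm is invalid.'

# Dollar amounts with optional cents.
amounts=$(printf '%s\n' "$text" | grep -o -E '\$[0-9]+(\.[0-9][0-9])?')

# Time of day as written above: it also finds "3:45 pm" inside "13:45 pm".
times_loose=$(printf '%s\n' "$text" | grep -o -E '(1[012]|[1-9]):[0-5][0-9] (am|pm)')

# Adding word boundaries (assumes \< and \> are supported) rejects "13:45 pm".
times_strict=$(printf '%s\n' "$text" | grep -o -E '\<(1[012]|[1-9]):[0-5][0-9] (am|pm)\>')

printf 'amounts:\n%s\n\nloose times:\n%s\n\nstrict times:\n%s\n' \
  "$amounts" "$times_loose" "$times_strict"
```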

Metacharacters

- special characters that are used to match and manipulate patterns

^ : matches start of line

$ : matches end of line

| : provides alternatives

. : matches any single character

() : you can put alternatives inside (separated by |)

? : quantifier - optional item (must be placed after the optional item)

* : quantifier - similar to ?, matches none, one or more of the immediately-preceding item (this part of the match always succeeds, because matching nothing is allowed)

+ : quantifier - similar to ?, MUST match one or more of the immediately-preceding item (fails if the item does not appear at least once)

{min,max} : interval quantifier - matches the immediately-preceding item at least "min" times and at most "max" times

Character Class

[] : represents a single character to match

Character Class Metacharacter

- these are special characters put inside character classes

- they have different meanings inside a character class compared to when placed outside a character class

- : provides a range of characters (treated as a literal dash if it is the first character in the class)

^ : negates the list (only when it is the first character in the class; elsewhere it is a literal "^")
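
The position-dependent meaning of `-` and `^` inside a class can be seen on a few made-up tags. This is a small sketch: the `sample` file and variable names are assumptions for illustration, and `grep -E` stands in for `egrep`.

```shell
sample=$(mktemp)
printf '%s\n' '<H->' '<H^>' '<H1>' '<Ha>' > "$sample"

# "-" first in the class is a literal dash: matches "-" or "1".
dash_literal=$(grep -E '<H[-1]>' "$sample")

# "^" not first in the class is a literal caret: matches "1" or "^".
caret_literal=$(grep -E '<H[1^]>' "$sample")

# "^" first in the class negates it: any one character except "1".
negated=$(grep -E '<H[^1]>' "$sample")

printf '[-1]:\n%s\n\n[1^]:\n%s\n\n[^1]:\n%s\n' \
  "$dash_literal" "$caret_literal" "$negated"
rm -f "$sample"
```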

Metasequences

- \< and \> are used for word boundaries; \1 is a backreference

- use the word boundaries if you want to search for a particular string that is not embedded in a larger word

- let's say you want to look only for the word "cat" and disregard lines with "catleya", "concatenate", etc

- this is not supported on all versions of egrep

\< : the position at the start of a word

\> : the position at the end of a word

\1 : matches the same text that was matched by the first parenthesized group (used as a backreferencing tool; \2 refers to the second group, and so on)

EGREP

egrep "^(From|Subject): " --> same as egrep "^From: |^Subject: "

- Not all egrep programs are the same. The supported set of metacharacters, as well as their meanings, are often different; see your local documentation.

- The useful -i option discounts capitalization during a match

grep -i 'regular_expression' text_file ## prints the lines of text_file that match the regular expression, ignoring case
grep -i '^$' text_file ## searches for blank lines
grep -i '^$' text_file | wc -l ## returns the number of blank lines
grep . text_file ## prints all non-empty lines (the file itself is not modified)
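
A short sketch of the blank-line commands on a throwaway file. The file contents are made up, and the temporary-file-then-`mv` step is just one common way to actually rewrite a file, not something the notes above prescribe. `grep -c` is used in place of `| wc -l` to count matching lines directly.

```shell
text_file=$(mktemp)
printf '%s\n' 'first' '' 'second' '' '' 'third' > "$text_file"

blank_count=$(grep -c '^$' "$text_file")    # number of empty lines
non_blank=$(grep . "$text_file")            # lines with at least one character
lines_before=$(grep -c '' "$text_file")     # total lines; the file is unchanged

# To really drop the blank lines, write the output elsewhere and swap it in.
cleaned=$(mktemp)
grep . "$text_file" > "$cleaned" && mv "$cleaned" "$text_file"
lines_after=$(grep -c '' "$text_file")

printf 'blank: %s, lines before: %s, lines after: %s\n' \
  "$blank_count" "$lines_before" "$lines_after"
rm -f "$text_file"
```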


Friday, April 30, 2021
