https://xkcd.com/208/

Regular Expressions

What are Regular Expressions?

Regex is a language for describing patterns in strings.

Use regex for:

  • Finding needles in haystacks.
  • Changing one string to another.
  • Pulling data out of strings.
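A rough sketch of the three uses above, using grep and sed (both covered later in these slides). The sample text and variable names are made up for illustration.

```shell
# Hypothetical sample text, invented for this sketch.
text='Order 1234 shipped to Ann'

# Finding: keep the line only if it contains a run of digits.
found=$(echo "$text" | grep -E '[0-9]+')

# Changing: replace the run of digits with a placeholder.
changed=$(echo "$text" | sed -E 's/[0-9]+/N/')

# Pulling data out: print only the digits themselves.
extracted=$(echo "$text" | grep -oE '[0-9]+')

echo "$found"      # Order 1234 shipped to Ann
echo "$changed"    # Order N shipped to Ann
echo "$extracted"  # 1234
```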

Writing Regular Expressions

The Basic Idea

You're writing a search pattern.

Regular expressions are complicated-looking strings that describe a search pattern.

Some Examples

Try these out on https://regex101.com/

  • a+b+ : Matches any string containing one or more as, followed by one or more bs.
  • \d{5} : Matches any string containing five digits (a regular ZIP code).
  • \d{5}-\d{4} : Matches any string containing 5 digits followed by a dash and 4 more digits (a ZIP+4 code).
  • \d{5}(-\d{4})? : Matches any ZIP code with an optional +4 extension.
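If you'd rather try the examples above in a terminal, here is one way to do it with grep -o, which prints only the matched text. This is a sketch with made-up sample strings. One assumption to note: grep -E does not understand \d, so [0-9] stands in for it.

```shell
# grep -E (extended regex) has no \d, so [0-9] is used instead.
ab=$(echo 'xaabbby' | grep -oE 'a+b+')
zip5=$(echo 'ZIP 12345' | grep -oE '[0-9]{5}')
zip9=$(echo 'ZIP 12345-6789' | grep -oE '[0-9]{5}-[0-9]{4}')

# Optional +4 part: both forms match in full.
either=$(printf '12345\n12345-6789\n' | grep -oE '[0-9]{5}(-[0-9]{4})?')

echo "$ab"      # aabbb
echo "$zip5"    # 12345
echo "$zip9"    # 12345-6789
echo "$either"  # 12345, then 12345-6789 on the next line
```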

Basic Patterns

  • . Matches any single character (in most engines, anything except a newline).
  • \w Matches a word character (letters, numbers, and _).
  • \W Matches everything \w doesn’t (punctuation, etc.).
  • \d Matches a digit.
  • \D Matches anything that isn’t a digit.
  • \s Matches whitespace (space, tab, newline, carriage return, etc.).
  • \S Matches non-whitespace (everything \s doesn’t match).
  • \ is also the escape character: \. matches a literal dot, and \\ matches a literal backslash.
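A small sketch of a few of these patterns in action. It assumes GNU grep, where \w and \S are supported as extensions; the helper name show and the sample strings are made up.

```shell
# show TEXT REGEX: print every match of REGEX in TEXT, joined by spaces.
# (Hypothetical helper for this sketch.)
show() { echo "$1" | grep -oE "$2" | paste -sd ' ' -; }

any_char=$(show 'cat cot c.t' 'c.t')      # . stands for any one character
literal_dot=$(show 'cat cot c.t' 'c\.t')  # \. is an escaped, literal dot
words=$(show 'id_7: ok!' '\w+')           # runs of word characters
non_space=$(show 'id_7: ok!' '\S+')       # runs of non-whitespace
digits=$(show 'id_7: ok!' '[0-9]')        # grep -E has no \d

echo "$any_char"     # cat cot c.t
echo "$literal_dot"  # c.t
echo "$words"        # id_7 ok
echo "$non_space"    # id_7: ok!
echo "$digits"       # 7
```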

Variable-length Patterns

  • {n} matches the previous pattern exactly n times.
  • {n,m} matches the previous pattern between n and m times (inclusive).
  • {n,} matches the previous pattern at least n times.
  • * matches the previous pattern zero or more times ({0,}).
  • + matches the previous pattern one or more times ({1,}).
  • ? matches the previous pattern zero or one time ({0,1}).
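To see how the length modifiers differ, one option is to run each against the same four candidate lines. This sketch uses grep -x, which only accepts a line when the whole line matches; the helper name matches is made up.

```shell
# matches REGEX: list which of b, ba, baa, baaa match REGEX as a whole line.
# (Hypothetical helper for this sketch.)
matches() { printf 'b\nba\nbaa\nbaaa\n' | grep -xE "$1" | paste -sd ' ' -; }

echo "ba{2}   : $(matches 'ba{2}')"    # baa
echo "ba{1,2} : $(matches 'ba{1,2}')"  # ba baa
echo "ba{2,}  : $(matches 'ba{2,}')"   # baa baaa
echo "ba*     : $(matches 'ba*')"      # b ba baa baaa
echo "ba+     : $(matches 'ba+')"      # ba baa baaa
echo "ba?     : $(matches 'ba?')"      # b ba
```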

DIY character classes

  • [abc\d] matches a character that is either a, b, c, or a digit.
  • [a-z] matches any one character in the range a through z.
  • ^ at the start of a character class negates it: [^abc] matches any character except a, b, and c.
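A quick sketch of these classes with grep. One assumption: inside brackets, grep does not understand \d, so 0-9 is used for "a digit". LC_ALL=C keeps the a-z range strictly lowercase ASCII. The helper name chars and the sample strings are made up.

```shell
# chars TEXT CLASS: print each single-character match, joined by spaces.
# (Hypothetical helper for this sketch.)
chars() { echo "$1" | LC_ALL=C grep -oE "$2" | paste -sd ' ' -; }

set_or_digit=$(chars 'a1-xb2' '[abc0-9]')  # a, b, c, or a digit
lower=$(chars 'aB3z' '[a-z]')              # lowercase letters only
negated=$(chars 'abcde' '[^abc]')          # anything except a, b, c

echo "$set_or_digit"  # a 1 b 2
echo "$lower"         # a z
echo "$negated"       # d e
```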

Anchors

  • ^ forces the pattern to start matching at the beginning of the line.
  • $ forces the pattern to finish matching at the end of the line.
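  • \b matches at a word boundary: a position between a word character (\w) and a non-word character, or at the start or end of the string next to a word character.
  • \B matches at any position that is not a word boundary.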
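Here is a sketch of each anchor filtering the same four lines. It assumes GNU grep, where \b and \B are available as extensions. The helper name keep and the sample lines are made up.

```shell
# keep REGEX: list the sample lines that contain a match, joined by commas.
# (Hypothetical helper for this sketch.)
keep() { printf 'cat\nconcat\ncatalog\nthe cat sat\n' | grep -E "$1" | paste -sd ',' -; }

echo "^cat    : $(keep '^cat')"      # cat,catalog
echo "cat\$    : $(keep 'cat$')"      # cat,concat
echo "\\bcat\\b : $(keep '\bcat\b')"  # cat,the cat sat
echo "\\Bcat   : $(keep '\Bcat')"     # concat
```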

Groups

  • (ab|c) matches either ‘ab’ or ‘c’.
  • You can use length modifiers on groups, too: (abc)+ matches one or more repetitions of ‘abc’.
  • The real power of grouping is backreferences. You can refer to the thing matched by the 1st group, 2nd group, etc. with \1, \2, etc.
  • For example, (ab|cd)\1 matches ‘abab’ or ‘cdcd’ but not ‘abcd’ or ‘cdab’.
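The group examples above can be checked with grep -x (whole-line matching). This is a sketch that assumes GNU grep, which accepts backreferences such as \1 with -E as an extension. The candidate lines are made up.

```shell
# Alternation inside a group: only 'ab' and 'c' match as whole lines.
alt=$(printf 'ab\nc\nabc\n' | grep -xE '(ab|c)' | paste -sd ' ' -)

# A length modifier applied to a whole group.
rep=$(printf 'abc\nabcabc\nabcab\n' | grep -xE '(abc)+' | paste -sd ' ' -)

# Backreference: \1 must repeat exactly what group 1 matched.
back=$(printf 'abab\ncdcd\nabcd\ncdab\n' | grep -xE '(ab|cd)\1' | paste -sd ' ' -)

echo "$alt"   # ab c
echo "$rep"   # abc abcabc
echo "$back"  # abab cdcd
```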

Greedy vs. Polite matching

  • Regular expressions are greedy by default: they match as much as they possibly can.
  • Usually this is what you want, but sometimes it isn’t.
  • You can make a variable-length match non-greedy by putting a ? after it.
  • For example, against ‘a.b.c.’, .+\. matches the whole string, while .+?\. stops at the first dot and matches only ‘a.’.
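A caveat worth knowing in the terminal: grep -E and sed have no non-greedy modifiers, so the ? trick needs a Perl-style engine (such as the default one on regex101). The sketch below shows the greedy match with grep, then a common workaround that gets the same short match with a negated character class. The sample string is made up.

```shell
text='one.two.three.'

# Greedy: .+ grabs as much as it can, so the match runs to the last dot.
greedy=$(echo "$text" | grep -oE '.+\.')

# Workaround for tools without +? : "anything but a dot, then a dot".
# head -n 1 keeps only the first match.
shortest=$(echo "$text" | grep -oE '[^.]+\.' | head -n 1)

echo "$greedy"    # one.two.three.
echo "$shortest"  # one.
```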

grep

Search with grep

  • grep: Global Regular Expression Print
  • grep 'REGEX' FILES: Search FILES using REGEX and print matches.
  • If you don’t specify FILES, grep will read from STDIN (so you can pipe stuff into it).

  • -C NUMLINES prints NUMLINES lines of context around each match.
  • -v prints every line that doesn’t match (invert).
  • -i ignores case when matching.
  • -o prints only the part of the line the regex matches.
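A sketch of the four options on the same five lines of input. The helper name fruit and the sample lines are made up; output lines are joined with commas to keep them on one row.

```shell
# fruit: emit the hypothetical sample input, one item per line.
fruit() { printf 'apple\nBanana\ncherry\nbanana split\ngrape\n'; }

context=$(fruit | grep -C 1 'cherry' | paste -sd ',' -)   # match plus 1 line each side
inverted=$(fruit | grep -v 'an' | paste -sd ',' -)        # lines without 'an'
nocase=$(fruit | grep -i 'banana' | paste -sd ',' -)      # ignore case
only=$(fruit | grep -oE 'b[a-z]+' | paste -sd ',' -)      # just the matched part

echo "$context"   # Banana,cherry,banana split
echo "$inverted"  # apple,cherry,grape
echo "$nocase"    # Banana,banana split
echo "$only"      # banana
```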

Some Examples

For this example, we'll use STDIN as our search text. That is, grep will use the pattern (passed as an argument) to search the input received over STDIN.

$ echo "bananas" | grep 'b\(an\)\+as'
bananas
$ echo "banananananananas" | grep 'b\(an\)\+as'
banananananananas
$ echo "banas" | grep 'b\(an\)\+as'
banas
$ echo "banana" | grep 'b\(an\)\+as'

sed

Editing with sed

  • sed is a stream editor. Use it for editing files or STDIN.
  • It uses regular expressions to perform edits to text.
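  • -r enables extended regular expressions (GNU sed; -E does the same and also works in BSD sed).
  • -n suppresses sed’s automatic printing, so only lines you explicitly print (e.g. with p) are shown. (Very useful for debuggin'.)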

The Print Command

  • sed -n '/REGEX/ p' works pretty much exactly like grep.
  • Use this to make sure your regexes are matching what you want them to.
  • (You can also use p together with s, which we’ll talk about next.)
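Here is a sketch of the print command on three made-up lines, with and without -n, plus p attached to a substitution. It uses -E (extended regular expressions) so the group and + need no backslashes; the helper name lines is hypothetical.

```shell
# lines: emit the hypothetical sample input.
lines() { printf 'bananas\nbanas\nbanana\n'; }

# With -n, only lines explicitly printed by p appear (grep-like).
printed=$(lines | sed -nE '/b(an)+as/ p' | paste -sd ',' -)

# Without -n, every line prints once, and matching lines print twice.
doubled=$(lines | sed -E '/b(an)+as/ p' | paste -sd ',' -)

# p after s prints only the lines where a substitution happened.
changed=$(lines | sed -nE 's/nas$/NAS/p' | paste -sd ',' -)

echo "$printed"  # bananas,banas
echo "$doubled"  # bananas,bananas,banas,banas,banana
echo "$changed"  # banaNAS,baNAS
```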

The Substitute Command

  • s/REGEX/REPLACEMENT/ replaces the thing matched by REGEX with REPLACEMENT.
  • Patterns can be any regular expression that we’ve talked about so far.

  • Replacements can be plain text and/or backreferences!
  • Add a g to the end (e.g. s/apple/banana/g) to make the substitution global (every match on each line).
  • Add an i (e.g., s/apple/banana/i) to make the match case-insensitive (a GNU sed extension).
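A sketch of the substitute command with and without flags, plus a backreference in the replacement. The sample strings are made up, and the i flag assumes GNU sed.

```shell
text='apple apple Apple'

first=$(echo "$text" | sed 's/apple/banana/')       # first match only
global=$(echo "$text" | sed 's/apple/banana/g')     # every match on the line
nocase=$(echo "$text" | sed 's/apple/banana/gi')    # every match, any case (GNU sed)

# Backreferences in the replacement: swap "Last, First" to "First Last".
swapped=$(echo 'Doe, Jane' | sed -E 's/([A-Za-z]+), ([A-Za-z]+)/\2 \1/')

echo "$first"    # banana apple Apple
echo "$global"   # banana banana Apple
echo "$nocase"   # banana banana banana
echo "$swapped"  # Jane Doe
```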