Handy Little Guide to Regular Expression Search

Regular expressions, or regexps, are logical statements that describe sets of strings. They are easy to use and very powerful. Here is a quick overview of how to use them. Whenever a regular expression appears in this text, it will be highlighted in red with yellow background.

Regular expressions consist of atoms -- individual logical pieces that can be glued together to form a full expression. The simplest atom is a character that you would like to match in a string. For example, a is a regular expression with a single atom that will match any string that contains the letter "a" (depending on whether you are searching or matching, this will match either just "a" or any string with "a" in it -- our tools search rather than match, so the latter will happen). get is also a regular expression, with three atoms, that will match strings like "get", "gettysburg", "beget", and "budgetary".
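
To make the search-versus-match distinction concrete, here is a minimal sketch using Python's standard re module. The word list is made up for the example, and our tools may well use a different regexp engine, so treat this as an illustration only.

```python
import re

# A made-up word list for illustration.
words = ["get", "gettysburg", "beget", "budgetary", "cat"]

# Searching: the pattern may occur anywhere in the string.
found = [w for w in words if re.search("get", w)]

# Matching: the whole string has to be the pattern.
exact = [w for w in words if re.fullmatch("get", w)]

print(found)  # ['get', 'gettysburg', 'beget', 'budgetary']
print(exact)  # ['get']
```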

There are special single-character atoms that serve useful purposes. For us, the important atoms are:

. - matches any single character (letter or digit or punctuation). So, for example, ...get will match a string like "budget", because there are three characters preceding "get", but not "beget", since there are only 2.

^ and $ - match the beginning and end of a single line, respectively. If you are searching a dictionary list with one word per line (which is what our tools mostly use), these are useful to mark beginnings and ends of words. For example, ^...get$ will match strings like "budget", "gadget", "forget", but not "budgetary".

\(, \), \", etc. - backslash means that the character that follows is 'escaped', meaning if it's a character that's used in the regular expression logic, it will be considered a regular character instead. More on special regular expression characters further down.
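
The special atoms above can be tried out with a short Python sketch (standard re module). The word list, including the artificial entry "get." with a trailing period, is an assumption made for the example.

```python
import re

# Made-up sample lines; "get." ends in a literal period.
words = ["budget", "gadget", "forget", "beget", "budgetary", "get."]

# Exactly three characters, then "get", anchored to the whole line.
six = [w for w in words if re.search(r"^...get$", w)]

# Without the anchors, the same atoms are also found inside "budgetary".
loose = [w for w in words if re.search(r"...get", w)]

# An escaped dot matches only a literal period, not "any character".
dotted = [w for w in words if re.search(r"get\.$", w)]

print(six)     # ['budget', 'gadget', 'forget']
print(loose)   # ['budget', 'gadget', 'forget', 'budgetary']
print(dotted)  # ['get.']
```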

Regexp atoms can be extended with special modifier characters that follow them. The important ones are:

* - when following an atom, * means "0 or more". For example, ^.*get$ will match all lines in a file that end in "get", including "get" by itself (since .* will match 0 characters as well).

+ - when following an atom, + means "1 or more". So ^.+get$ will match all lines that end in "get" and have at least one character preceding it. The result in this case will be the same as above, but without the word "get" by itself.

? - when following an atom, ? means "0 or 1". So ^.?get$ will match "aget" and "get".

{n} - when following an atom, the curly brackets with a number in them mean "exactly n". So ^.{3}get$ will match all lines that have six characters, the last three of which are "get". .{3} is of course equivalent to three dots in a row (...).

{m,n} - when following an atom, the curly brackets with a pair of numbers m and n mean "at least m and at most n". So ^.{0,3}get$ will match all words up to 6 letters that end in "get".
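
The modifiers above can be compared side by side with a small Python sketch. The helper name lines_matching and the word list are hypothetical, chosen only for this illustration.

```python
import re

# Made-up sample lines, one "word" per entry.
words = ["get", "aget", "beget", "budget", "underbudget", "getaway"]

def lines_matching(pattern, lines):
    """Return the lines in which the pattern is found (search, not match)."""
    return [line for line in lines if re.search(pattern, line)]

star = lines_matching(r"^.*get$", words)       # 0 or more characters before "get"
plus = lines_matching(r"^.+get$", words)       # 1 or more
opt = lines_matching(r"^.?get$", words)        # 0 or 1
exact3 = lines_matching(r"^.{3}get$", words)   # exactly 3
upto3 = lines_matching(r"^.{0,3}get$", words)  # between 0 and 3

print(star)    # ['get', 'aget', 'beget', 'budget', 'underbudget']
print(plus)    # ['aget', 'beget', 'budget', 'underbudget']
print(opt)     # ['get', 'aget']
print(exact3)  # ['budget']
print(upto3)   # ['get', 'aget', 'beget', 'budget']
```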

An atom can also represent a set of characters. We saw that above, with . representing any character. If you wanted to match a narrower set of characters, you can use square brackets. The content of the square brackets matches a single character, so everything placed in the square brackets is chained with logical OR. For example, [abc] will match "a OR b OR c". You can also use a few convenient range representations to avoid typing out full ranges. For example, [A-Z] will match "any capital letter from A to Z", and [0-9] will match "any digit from 0 to 9". Ranges and individual atoms can be combined within square brackets. For example, [H-Z2-7ad] will match "any capital letter from H to Z or any digit from 2 to 7 or 'a' or 'd'". One note on square brackets: you can create negative sets (that's "any character except these") by putting a ^ symbol after the opening square bracket. Note that this is different than using ^ by itself, which is an atom for "beginning of a line".
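
Here is a brief Python sketch of square-bracket sets, including a negative set. The set [^aeiou] ("any character except a lowercase vowel") and the sample characters are assumptions added for the example.

```python
import re

# One character from the set: a capital H-Z, a digit 2-7, or 'a' or 'd'.
in_set = re.compile(r"^[H-Z2-7ad]$")

# Negative set: one character that is NOT a lowercase vowel.
not_vowel = re.compile(r"^[^aeiou]$")

samples = ["H", "G", "5", "8", "a", "b", "d"]

accepted = [s for s in samples if in_set.search(s)]
non_vowels = [s for s in samples if not_vowel.search(s)]

print(accepted)    # ['H', '5', 'a', 'd']
print(non_vowels)  # ['H', 'G', '5', '8', 'b', 'd']
```

Note how ^ plays both roles here: outside the brackets it anchors to the beginning of the line, and right after the opening bracket it negates the set.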

Ordinary parentheses in a regexp allow you to create so-called "groups". A parenthesized regexp matches the same strings as an unparenthesized one, but you can refer to the group again later in the expression. Parentheses also allow you to apply modifiers to sequences of atoms. For example, cat* will match "ca", "cat", "catt", "cattt", and so on, because * applies only to the t; whereas (cat)* will match the empty string, "cat", "catcat", and so on.
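
The difference between a modifier on a single atom and a modifier on a group can be checked with a short Python sketch. The patterns are anchored with ^ and $ so that each sample string has to fit the pattern as a whole; the sample strings are made up.

```python
import re

# The star applies only to the final "t": "ca" followed by 0 or more "t".
t_star = re.compile(r"^cat*$")

# The star applies to the whole group: 0 or more copies of "cat".
group_star = re.compile(r"^(cat)*$")

samples = ["ca", "cat", "cattt", "catcat", ""]

by_t_star = [s for s in samples if t_star.search(s)]
by_group_star = [s for s in samples if group_star.search(s)]

print(by_t_star)      # ['ca', 'cat', 'cattt']
print(by_group_star)  # ['cat', 'catcat', '']
```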

Referencing groups you already matched can be quite useful. For example, the regular expression ^(.{3})men(.{3})$ will match any 9-letter word with "men" in the middle (like "alimental" or "commended"). However, if we replace the third part of the regexp with \1 -- meaning "reference to the first group in this regexp" -- like so: ^(.{3})men\1$, we are now looking for a word that starts with "any three letters", continues with "men", and ends with "the same as first group" -- which is the same three letters that it starts with. In an English dictionary of 500K+ words, there happens to be just one such word, and that's "tormentor". If there were a word "tormentormen", you'd be able to find it with ^(.{3})(men)\1\2$ -- where you refer to both the first and second groups.
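
The group references described above behave the same way in Python's re module, as this minimal sketch shows. The four-entry word list stands in for a real dictionary and includes the imaginary word "tormentormen".

```python
import re

# A tiny stand-in for a dictionary; "tormentormen" is not a real word.
words = ["alimental", "commended", "tormentor", "tormentormen"]

# Any 9-character line with "men" in the middle.
middle = [w for w in words if re.search(r"^(.{3})men(.{3})$", w)]

# \1 must repeat exactly what the first group matched.
same_ends = [w for w in words if re.search(r"^(.{3})men\1$", w)]

# \1 and \2 refer to the first and second groups respectively.
doubled = [w for w in words if re.search(r"^(.{3})(men)\1\2$", w)]

print(middle)     # ['alimental', 'commended', 'tormentor']
print(same_ends)  # ['tormentor']
print(doubled)    # ['tormentormen']
```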

As we established, if you just put regular expressions next to each other, that "glues" them together -- so ^cat$ will match "beginning of line, followed by 'c', followed by 'a', followed by 't', followed by end of line". What if we wanted to match a line that has a substring "cat" OR a substring "dog"? For that, we use an OR chaining symbol, which is a vertical bar. So the expression ^.*(cat|dog).*$ will match "beginning of line, followed by any number of any characters, followed by 'cat' OR by 'dog', followed by any number of any characters, followed by end of line". So both "anti-Catholic" (when the search ignores case) and "anti-dogmatic" will be included :)
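
A small Python sketch of the OR symbol follows. It also shows that a capitalized word like "anti-Catholic" is only found when the search ignores case; whether our tools ignore case is not stated here, so the re.IGNORECASE flag is an assumption of the example, as is the word list.

```python
import re

words = ["anti-Catholic", "anti-dogmatic", "caterpillar", "bird"]

pattern = r"^.*(cat|dog).*$"

# Case-sensitive search: the capital "C" in "Catholic" does not match "c".
sensitive = [w for w in words if re.search(pattern, w)]

# Case-insensitive search: "Cat" now counts as "cat".
insensitive = [w for w in words if re.search(pattern, w, re.IGNORECASE)]

print(sensitive)    # ['anti-dogmatic', 'caterpillar']
print(insensitive)  # ['anti-Catholic', 'anti-dogmatic', 'caterpillar']
```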

You can also chain regular expressions with a logical AND (this relies on lookahead, which older regexp parsers may not support). To do that, you can use the following notation:

(?=regexp1)(?=regexp2)(?=regexp3)....

Each (?=...) group is a lookahead: it means "check that regexp1 matches from here, but don't move forward in the string" -- so the parser goes on to check the second expression, the third expression, and so on, all from the same starting position. So, for example: (?=^.*cat.*$)(?=^.*dog.*$)(?=^[^-]{10}$) will match "any word that has a substring 'cat' AND a substring 'dog' AND consists of exactly 10 characters, none of which is a dash". In a large English dictionary, there is one such word: dogcatcher (if we allowed a dash, there's one other such word: 'cattle-dog').
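
The AND chaining can be tried in Python, whose re module supports the (?=...) notation. The short word list below is a made-up stand-in for a dictionary, and the second pattern (with . in place of [^-]) is added here only to show the effect of allowing a dash.

```python
import re

words = ["dogcatcher", "cattle-dog", "catalog", "dogmatic", "concatenated"]

# Has "cat" AND has "dog" AND is exactly 10 non-dash characters.
no_dash = r"(?=^.*cat.*$)(?=^.*dog.*$)(?=^[^-]{10}$)"

# Same conditions, but any 10 characters (so a dash is allowed).
any_char = r"(?=^.*cat.*$)(?=^.*dog.*$)(?=^.{10}$)"

strict = [w for w in words if re.search(no_dash, w)]
relaxed = [w for w in words if re.search(any_char, w)]

print(strict)   # ['dogcatcher']
print(relaxed)  # ['dogcatcher', 'cattle-dog']
```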

NEW: In our tool, you can also use the notation:
regexp1
regexp2
regexp3
...

(one expression per line), which will get converted to the question mark notation by the back-end.
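
How the back-end performs this conversion is not described here. The following Python sketch is only a guess at the idea: the function name lines_to_lookaheads is hypothetical, and it simply wraps every non-empty line in (?=...).

```python
import re

def lines_to_lookaheads(text):
    """Hypothetical helper: wrap each non-empty line in a (?=...) group."""
    parts = [line for line in text.splitlines() if line]
    return "".join(f"(?={part})" for part in parts)

# One expression per line, as typed into the tool.
query = "^.*cat.*$\n^.*dog.*$\n^[^-]{10}$"

combined = lines_to_lookaheads(query)
print(combined)  # (?=^.*cat.*$)(?=^.*dog.*$)(?=^[^-]{10}$)

# The combined expression behaves like the hand-written one.
print(bool(re.search(combined, "dogcatcher")))  # True
print(bool(re.search(combined, "cattle-dog")))  # False
```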

This notation actually allows you to use regexp to find single-word anagrams and even use wildcards while doing so. For example, if you wanted a 7-letter anagram from letters a,c,c,c,y and two unknown letters, you can do:

(?=^.*c.*c.*c.*$)(?=^.*y.*)(?=^.*a.*)(?=^.{7}$)

resulting in 'acyclic'. If we wanted 7- and 8-letter anagrams, we can do:

(?=^.*c.*c.*c.*$)(?=^.*y.*)(?=^.*a.*)(?=^.{7,8}$)

resulting in "cycladic, accuracy, acyclic, cyclecar, cyclical, peccancy".
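
The two anagram searches above can be reproduced with a short Python sketch. The word list is a small made-up sample (the six words from the result plus two that should be rejected), not a full dictionary.

```python
import re

words = ["acyclic", "cycladic", "accuracy", "cyclecar", "cyclical",
         "peccancy", "cynical", "tacky"]

# At least three c's, a y, an a, and exactly 7 characters in total.
seven = r"(?=^.*c.*c.*c.*$)(?=^.*y.*)(?=^.*a.*)(?=^.{7}$)"

# Same letters, but 7 or 8 characters in total.
seven_or_eight = r"(?=^.*c.*c.*c.*$)(?=^.*y.*)(?=^.*a.*)(?=^.{7,8}$)"

hits_7 = [w for w in words if re.search(seven, w)]
hits_7_8 = [w for w in words if re.search(seven_or_eight, w)]

print(hits_7)    # ['acyclic']
print(hits_7_8)  # ['acyclic', 'cycladic', 'accuracy', 'cyclecar', 'cyclical', 'peccancy']
```

One thing to keep in mind: each lookahead only checks that the required letters are present, so the pattern means "at least three c's", and the unknown letters are free to be more c's, a's, or y's.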