Patterns

The patterns in the input are written using regular expressions in the style of lex, with a more Caml-like syntax. These are:

  • 'c': match the character ‘c’. Character constants use the same syntax as Objective Caml character constants.
  • _: (underscore) match any character.
  • eof: match the end of the input (end-of-file).
  • "foo": match the literal string “foo”. String constants use the same syntax as Objective Caml string constants.
  • ['x' 'y' 'z']: character set; in this case, the pattern matches either an ‘x’, a ‘y’, or a ‘z’.
  • ['a' 'b' 'j'-'o' 'Z']: character set with a range in it; a range 'c1'-'c2' denotes all characters between c1 and c2, inclusive. In this case, the pattern matches an ‘a’, a ‘b’, any letter from ‘j’ through ‘o’, or a ‘Z’.
  • [^ 'A'-'Z']: a negated character set, i.e., any character but those in the class. In this case, any character EXCEPT an uppercase letter.
  • [^ 'A'-'Z' '\n']: any character EXCEPT an uppercase letter or a newline.
  • r*: zero or more r’s, where r is any regular expression
  • r+: one or more r’s, where r is any regular expression
  • r?: zero or one r’s, where r is any regular expression (that is, “an optional r”)
  • ident: the expansion of the “ident” defined by an earlier let ident = regexp definition.
  • (r): match an r; parentheses are used to override precedence (see below)
  • rs: the regular expression r followed by the regular expression s; called “concatenation”
  • r|s: either an r or an s
  • r#s: match the difference of two character sets, that is, any character in r that is not in s; both r and s must be character sets.
  • r as ident: bind the string matched by r to identifier ident
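
As an illustration of how several of these pattern forms fit together, here is a minimal sketch of an ocamllex specification. The file name, the token type, and the names digit, letter, and alpha are all invented for this example; they are not part of ocamllex itself.

```ocaml
(* patterns_demo.mll: hypothetical file name; process it with ocamllex. *)
{
  (* Token type invented for this illustration. *)
  type token =
    | LET
    | INT of int
    | IDENT of string
    | OTHER of char
    | EOF
}

(* Named regular expressions, referred to later by their identifiers. *)
let digit  = ['0'-'9']
let letter = ['a'-'z' 'A'-'Z' '_']
(* Character-set difference: any letter except the underscore. *)
let alpha  = letter # ['_']

rule token = parse
  | [' ' '\t' '\n']+               { token lexbuf }           (* skip whitespace *)
  | "let"                          { LET }                    (* literal string *)
  | '-'? digit+ as n               { INT (int_of_string n) }  (* optional sign *)
  | alpha (letter | digit)* as id  { IDENT id }               (* concatenation *)
  | _ as c                         { OTHER c }                (* any other character *)
  | eof                            { EOF }
```

Because ‘as’ has the lowest precedence, n and id are bound to the whole text matched by their clauses. The "let" clause is placed before the identifier clause because, when two clauses match a prefix of the same length, ocamllex selects the one that appears first.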

The operators above differ in precedence. ‘#’ binds most tightly, followed by the postfix operators ‘*’, ‘+’, and ‘?’, then concatenation, then ‘|’, and finally ‘as’. For example,

"foo" | "bar"*

is the same as

("foo")|("bar"*)

since the ‘*’ operator has higher precedence than alternation (‘|’). This pattern therefore matches either the string “foo” or zero or more repetitions of the string “bar”.

To match zero or more repetitions of either “foo” or “bar”, group the alternation with parentheses:

("foo"|"bar")*
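
The difference between the two groupings can be seen in a small sketch with two entry points. The file name and the rule names grouped and ungrouped are hypothetical, chosen only for this illustration.

```ocaml
(* precedence_demo.mll: hypothetical file name; process it with ocamllex. *)

(* Parenthesized alternation: the star applies to the whole group. *)
rule grouped = parse
  | ("foo" | "bar")* as s   { s }

(* Without parentheses this is read as (("foo") | ("bar"*)) as s,
   because '*' binds tighter than '|' and 'as' binds loosest of all. *)
and ungrouped = parse
  | "foo" | "bar"* as s     { s }
```

Given the input “foobarfoo!”, grouped should return “foobarfoo”, while ungrouped returns only “foo”, since its first alternative matches three characters and its second matches none. Given “barbarfoo”, ungrouped returns “barbar”.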

A negated character set such as the example [^ 'A'-'Z'] above will match a newline unless '\n' (or an equivalent escape sequence) is one of the characters explicitly present in the negated character set (e.g., [^ 'A'-'Z' '\n']). This is unlike how many other regular expression tools treat negated character sets, but unfortunately the inconsistency is historically entrenched. Matching newlines means that a pattern like [^ '"']* can match the entire input unless there’s another quote in the input.
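
A minimal sketch of the practical consequence, using two hypothetical rules for a double-quoted string (the file name and the names quoted, quoted_one_line, and body are assumptions made for this example):

```ocaml
(* negated_set_demo.mll: hypothetical file name; process it with ocamllex. *)

(* The negated set excludes only the quote, so it also matches newlines:
   body may span several lines, up to the next double quote. *)
rule quoted = parse
  | '"' ([^ '"']* as body) '"'        { body }

(* Excluding '\n' as well confines the string to a single line. *)
and quoted_one_line = parse
  | '"' ([^ '"' '\n']* as body) '"'   { body }
```

With this sketch, quoted_one_line has no clause that matches a string whose closing quote is on a later line, so the generated lexer would fail on such input instead of silently consuming the following lines.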
