When the generated scanner is run, it analyzes its input looking for strings which match any of its patterns. If it finds more than one match, it takes the one matching the most text (the “longest match” principle). If it finds two or more matches of the same length, the rule listed first in the ocamllex input file is chosen (the “first match” principle).
Once the match is determined, the text corresponding to the match (called the token) is made available in the form of a string. The action corresponding to the matched pattern is then executed (a more detailed description of actions follows), and then the remaining input is scanned for another match.
If no match is found, the scanner raises the exception Failure "lexing: empty token".
Now, let’s look at some examples that show how the patterns are applied.
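As a minimal sketch of this failure case, the following complete ocamllex file feeds the input "123" to a scanner that only knows lowercase words. The file name no_match.mll, the option return type, and the trailer code are assumptions made for this illustration; they are not taken from the examples in this text.

```ocaml
(* no_match.mll: a hypothetical file name used for this sketch *)
rule token = parse
  | ['a'-'z']+ as word { Some word }   (* a lowercase word *)
  | eof                { None }        (* end of input *)

{
  (* Trailer: plain OCaml copied to the end of the generated file. *)
  let () =
    let lexbuf = Lexing.from_string "123" in
    match token lexbuf with
    | Some word -> print_endline ("Word: " ^ word)
    | None -> print_endline "End of input"
    (* No rule matches a digit, so the scanner raises Failure. *)
    | exception Failure msg -> print_endline ("No match: " ^ msg)
}
```

Assuming the file is saved under that name, running `ocamllex no_match.mll` followed by `ocaml no_match.ml` should print `No match: lexing: empty token`.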
rule token = parse
  | "ding"              { print_endline "Ding" }            (* "ding" pattern *)
  | ['a'-'z']+ as word  { print_endline ("Word: " ^ word) } (* "word" pattern *)
  ...
When “ding” is given as input, both the ding pattern and the word pattern match, and both match the same four characters. The ding pattern is selected because it comes before the word pattern. So if you write the rules in the opposite order:
rule token = parse
  | ['a'-'z']+ as word  { print_endline ("Word: " ^ word) } (* "word" pattern *)
  | "ding"              { print_endline "Ding" }            (* "ding" pattern *)
  | ...
the ding pattern is never selected: whenever it matches, the word pattern matches at least as much text and is listed first.
In the following example, there are three patterns: ding, dong and dingdong.
rule token = parse
  | "ding"      { print_endline "Ding" }      (* "ding" pattern *)
  | "dong"      { print_endline "Dong" }      (* "dong" pattern *)
  | "dingdong"  { print_endline "Ding-Dong" } (* "dingdong" pattern *)
  ...
When “dingdong” is given as input, there are two candidates: the ding pattern followed by the dong pattern, or the dingdong pattern alone. By the “longest match” principle, the dingdong pattern is selected.
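The two selection principles can also be modelled in a few lines of plain OCaml. This is only a toy sketch: the names pattern, literal, lowercase_word and select are invented for the illustration, and ocamllex itself does not work this way internally (it compiles the rules into an automaton). Each pattern is represented as a function returning the length of its match at the start of the input.

```ocaml
(* A pattern returns the length of its match at the start of the input,
   or None when it does not match. Hypothetical model, not ocamllex's API. *)
type pattern = string -> int option

(* Matches exactly the string [lit]. *)
let literal lit : pattern = fun s ->
  let n = String.length lit in
  if n <= String.length s && String.sub s 0 n = lit then Some n else None

(* Models ['a'-'z']+ : one or more lowercase letters, as many as possible. *)
let lowercase_word : pattern = fun s ->
  let n = String.length s in
  let rec go i =
    if i < n && s.[i] >= 'a' && s.[i] <= 'z' then go (i + 1) else i
  in
  match go 0 with 0 -> None | k -> Some k

(* Pick the rule with the longest match. A later rule replaces the current
   best only when it is strictly longer, so ties go to the earlier rule. *)
let select (rules : (string * pattern) list) input =
  List.fold_left
    (fun best (name, pat) ->
      match pat input, best with
      | None, _ -> best
      | Some len, Some (_, best_len) when len <= best_len -> best
      | Some len, _ -> Some (name, len))
    None rules

let () =
  let rules =
    [ ("ding", literal "ding"); ("dong", literal "dong");
      ("dingdong", literal "dingdong") ]
  in
  match select rules "dingdong" with
  | Some (name, len) -> Printf.printf "%s (%d characters)\n" name len
  | None -> print_endline "no match"
```

Run as an ordinary OCaml program, this prints `dingdong (8 characters)`. Swapping the order of the "ding" and "word" rules in a call such as `select [("ding", literal "ding"); ("word", lowercase_word)] "ding"` changes which name is returned, which mirrors the “first match” principle.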
The “shortest match” principle is used less often, but ocamllex supports it as well. If you want to select the shortest prefix of the input that matches, use the shortest keyword instead of the parse keyword. The “first match” principle still applies under the “shortest match” principle.
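To make the difference concrete, here is a small sketch of an ocamllex file that defines the same pattern twice, once with parse and once with shortest. The entry point names longest_word and shortest_word, and the trailer that runs them, are assumptions made for this example.

```ocaml
(* Same pattern, two matching disciplines. Entry point names are hypothetical. *)
rule longest_word = parse
  | ['a'-'z']+ as word { word }   (* takes as many letters as possible *)

and shortest_word = shortest
  | ['a'-'z']+ as word { word }   (* stops at the shortest matching prefix *)

{
  (* Trailer: run both entry points on the same input. *)
  let () =
    print_endline (longest_word (Lexing.from_string "ding"));
    print_endline (shortest_word (Lexing.from_string "ding"))
}
```

After processing the file with ocamllex and running the generated code, the first line printed should be `ding` and the second `d`, since one letter is already enough to satisfy the pattern.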