In my previous blog post, I discussed how recursive-regex has been
maturing but that it still wasn’t and was not intended to compete with
actual parser toolkits. I wanted to quantify this assumption and
present a head-to-head analysis of using recursive-regex vs using
cl-yacc.
I chose cl-yacc because
I already had a working css3-selector parser implemented. My
existing cl-yacc implementation is based on
the published
CSS3-Selectors Grammar (but the modified some to get it to work).
CL-YACC Parser
I found implementing the parser in cl-yacc to be fairly tedious,
time consuming and error prone, even for such a small language as
css-selectors. It doesn’t help that I tried to do it using the
published css3 grammar and lex files which are somewhat awkward (eg:
open parens are in the lexer and close parens are in the grammar).
I wrote a not very well documented (f)lex file reader to read in the
existing CSS3 flex
file and build a lexer compatible with cl-yacc. After getting a
valid lexer, I started working with the published grammar to convert
it to a format cl-yacc would approve of. Along the way I was finding
the syntax for cl-yacc to be pretty cumbersome, so I used some
read time execution to turn forms like the first below, into forms like
the second (macro expanded) one below which are valid cl-yacc
productions. (This is not a great idea or great code, but it did simplify
the parser def for me.)
(or-sel #.(rule (comb-sel :|,| spaces or-sel) (list :or comb-sel or-sel)) #.(rule (comb-sel) comb-sel)) (or-sel (comb-sel :|,| spaces or-sel #'(lambda (comb-sel |,| spaces or-sel) (declare (ignorable comb-sel |,| spaces or-sel)) (list :or comb-sel or-sel))) (comb-sel #'(lambda (comb-sel) (declare (ignorable comb-sel)) comb-sel)))
The difficulties I had in implementation were mostly concerning where
white space could appear, in which productions, at which levels in the
grammar. It seemed like it was quite a task to get an unambiguous
grammar and have all my extra unimportant white space everywhere still
be parsable. After I had finally managed to rid my grammar of
reduce/reduce conflicts, I was able to parse my language and it seemed
like a fairly peppy parser.
The only problem I really have with this implementation is that it
seems like it is totally illegible after the fact. Even knowing about
parsers and having written this one, I don’t feel comfortable
modifying the language. It seems like it would be difficult to get
working again. Thankfully, I don’t anticipate having to do much
language rewriting in this case.
CL-YACC parser definition (lexer not shown)
(yacc:define-parser *css3-selector-parser* (:start-symbol selector) (:terminals (:|,| :|*| :|)| :|(| :|>| :|+| :|~| :|:| :|[| :|]| :|=| :|-| :S :IDENT :HASH :CLASS :STRING :FUNCTION :NTH-FUNCTION :INCLUDES :DASHMATCH :BEGINS-WITH :ENDS-WITH :SUBSTRING :integer)) (:precedence ((:left :|)| :s :|,| :|+| :|~| )) ) (selector #.(rule (or-sel) or-sel)) (or-sel #.(rule (comb-sel :|,| spaces or-sel) (list :or comb-sel or-sel)) #.(rule (comb-sel) comb-sel)) (comb-sel #.(rule (and-sel combinator comb-sel) (list combinator and-sel comb-sel)) #.(rule ;; need to handle trailing spaces here ;; to avoid s/r (and-sel spaces) and-sel)) (combinator (:s (constantly :child)) (spaces :|>| spaces (constantly :immediate-child)) (spaces :|~| spaces (constantly :preceded-by)) (spaces :|+| spaces (constantly :immediatly-preceded-by))) (and-sel #.(rule (and-sel simple-selector) (list :and and-sel simple-selector)) #.(rule (simple-selector) simple-selector)) (simple-selector #.(rule (:HASH) `(:hash ,(but-first hash))) #.(rule (:CLASS) `(:class ,(but-first class))) #.(rule (:IDENT) `(:element ,ident)) (:|*| (constantly :everything)) (attrib #'identity) (pseudo #'identity)) (attrib #.(rule (:|[| spaces :IDENT spaces :|]|) `(:attribute ,ident)) #.(rule (:|[| spaces :IDENT spaces attrib-value-def spaces :|]|) `(:attribute ,ident ,attrib-value-def)) ) (attrib-value-def #.(rule (attrib-match-type attrib-value) (list attrib-match-type attrib-value))) (attrib-match-type #.(rule (:|=|) :equals) #.(rule (:includes) :includes) #.(rule (:dashmatch) :dashmatch) #.(rule (:begins-with) :begins-with) #.(rule (:ends-with) :ends-with) #.(rule (:substring) :substring)) (attrib-value #.(rule (:ident) ident) #.(rule (:string) (but-quotes string))) (pseudo #.(rule (:|:| :IDENT) (list :pseudo ident)) #.(rule (:|:| :FUNCTION spaces selector :|)|) (list :pseudo (but-last function) selector)) #.(rule (:|:| :NTH-FUNCTION spaces nth-expr spaces :|)| ) `(:nth-pseudo ,(but-last nth-function) ,@nth-expr))) (nth-expr #.(rule (:ident) (cond ((string-equal ident "even") (list 2 0)) ((string-equal ident "odd") (list 2 1)) (T (error "invalid nth subexpression")))) #.(rule (nth-sign :integer) (list 0 (if (string-equal nth-sign "-") (* -1 (parse-integer integer)) (parse-integer integer)))) #.(rule (nth-sign :integer :ident) (let (extra-num) (cond ((string-equal "n" ident) T) ;; this is because our lexer will recogince n-1 as a valid ident ;; but n+1 will hit the rule below ((alexandria:starts-with-subseq "n" ident) (setf extra-num (parse-integer (subseq ident 1)))) (T (error "invalid nth subexpression in (what is ~A)" ident))) (list (or (if (string-equal nth-sign "-") (* -1 (parse-integer integer)) (parse-integer integer)) 0) (or extra-num 0)))) #.(rule (nth-sign :integer :ident nth-sign :integer) (when (and integer-1 (null nth-sign-1)) (error "invalid nth subexpression 2n+1 style requires a sign before the second number")) (list (or (if (string-equal nth-sign-0 "-") (* -1 (parse-integer integer-0)) (parse-integer integer-0)) 0) (or (if (string-equal nth-sign-1 "-") (* -1 (parse-integer integer-1)) (parse-integer integer-1)) 0)) )) (nth-sign #.(rule (:|+|) :|+|) #.(rule (:|-|) :|-|) #.(rule () ())) (spaces (:S ) ( )))
Recursive-Regex Parser
After having written recursive-regex, I wanted a way to beat some of
the bugs out as well as have a good example of what I could use this
tool for, and with what performance characteristics. To accomplish
this task, I converted the CL-YACC grammar and lex file into a single
parser definition and
wrote some code to turn that file into the recursive dispatch functions.
REX: a recursive expression file format based (loosely) on lex
I had a really shoddy lexish reader for the existing lex based lexer.
I converted this to a similar tool (without the c blocks) for defining
named recursive expressions. These files have the .rex extension.
They consists of options, inline-definitions, and named productions.
The definitions for css-selectors are in css3.rex.
Once I had my recursive expression definition, getting the parser was
a fairly easy task. Along the way I added some code to minimize the
parse tree results by promoting children, when a parent only had a
single child that matched the same length as the parent. I also
improved the parse tracing, so that I could observer and debug what
the expression was doing while it was matching. With the tree minimized
I also had to revise many of my tests.
css3.rex: css3-selectors parser definition
%option case-insensitive
h => [0-9a-f]
nmstart => [a-z]
nmchar => [a-z0-9-]
nl => \n|\r\n|\r|\f
string1 => \"([\t !#$%&(-~]|\\{nl}|\')*\"
string2 => \'([\t !#$%&(-~]|\\{nl}|\")*\'
id => [-]?{nmstart}{nmchar}*
name => {nmchar}+
int => [0-9]+
num => [0-9]+|[0-9]*\.[0-9]+
string => {string1}|{string2}
url => ([!#$%&*-~])*
w => [ \t\r\n\f]*
s => [\s\f]+
%%
comment => \/\*[^*]*\*+([^/][^*]*\*+)*\/
cdo => <!--
cdc => -->
class => \.(?<ident>)
hash => #(?<ident>{name})
ident => {id}
element => {id}
important_sym => !{w}important
selector => (?<or>)
or => (?<comb-sel>)(\s*,\s*(?<or>))?
comb-sel => (?<and>)((?<combinator>)(?<comb-sel>))?
combinator => (?<child>\s+)|(?<immediate-child>>)|(?<preceded-by>~)|(?<immediatly-preceded-by>\+)
and => (?<sim-sel>)(?<and>)?
sim-sel => (?<hash>)|(?<class>)|(?<element>)|\*|(?<attrib>)|(?<pseudo>)
attrib => \[\s*(?<ident>)\s*(?<attrib-val>)?\]
attrib-val => [\^\$\*\|\~]?=((?<ident>)|(?<string>))
pseudo => :(?<ident>)(?<parens>\s*(?<selector>)|(?<nth-expr>)\s*)?
nth-expr => even|odd|[\+-]?{int}(n([\+-]{int})?)?
Performance Numbers
These are the performance numbers for the two parsers each parsing 6
short inputs 1000 times. Also included is (a version) of that
output. (Recursive-expressions return CLOS object trees that have in
the results below been converted to a list representation for easy of
viewing.) As you can see the recursive-expressions version is ten
times slower and uses twenty times the memory.
(defun run-some-tests ()
(timed-side-by-side-parses
(list ".foo" "#bar" ":bast" "div[onclick]"
"div[onclick=foo]"
":nth-last-child( 2n+1 ), foo.bar>bast:blech( .foo )" )
1000))
CSS> (run-some-tests)
;;; Recursive-Regex Results
Evaluation took:
7.548 seconds of real time
7.480000 seconds of total run time (6.850000 user, 0.630000 system)
[ Run times consist of 1.410 seconds GC time, and 6.070 seconds non-GC time. ]
99.10% CPU
18,821,417,333 processor cycles
1,167,530,480 bytes consed
((:CLASS ".foo" (:IDENT "foo")) (:HASH "#bar" (:IDENT "bar"))
(:PSEUDO ":bast" (:IDENT "bast"))
(:AND "div[onclick]" (:ELEMENT "div")
(:ATTRIB "[onclick]" (:IDENT "onclick")))
(:AND "div[onclick=foo]" (:ELEMENT "div")
(:ATTRIB "[onclick=foo]" (:IDENT "onclick")
(:ATTRIB-VAL "=foo" (:IDENT "foo"))))
(:OR ":nth-last-child( 2n+1 ), foo.bar>bast:blech( .foo )"
(:PSEUDO ":nth-last-child( 2n+1 )" (:IDENT "nth-last-child")
(:MATCHED-PARENS "( 2n+1 )" (:BODY "2n+1 " (:NTH-EXPR "2n+1"))))
(:COMB-SEL "foo.bar>bast:blech( .foo )"
(:AND "foo.bar" (:ELEMENT "foo") (:CLASS ".bar" (:IDENT "bar")))
(:IMMEDIATE-CHILD ">")
(:AND "bast:blech( .foo )" (:ELEMENT "bast")
(:PSEUDO ":blech( .foo )" (:IDENT "blech")
(:MATCHED-PARENS "( .foo )"
(:BODY " .foo" (:CLASS ".foo" (:IDENT "foo")))))))))
;;; CL-YACC Results
Evaluation took:
0.790 seconds of real time
0.790000 seconds of total run time (0.770000 user, 0.020000 system)
[ Run times consist of 0.080 seconds GC time, and 0.710 seconds non-GC time. ]
100.00% CPU
1,968,753,765 processor cycles
64,695,792 bytes consed
((:CLASS "foo") (:HASH "bar") (:PSEUDO "bast")
(:AND (:ELEMENT "div") (:ATTRIBUTE "onclick"))
(:AND (:ELEMENT "div") (:ATTRIBUTE "onclick" (:EQUALS "foo")))
(:OR (:NTH-PSEUDO "nth-last-child" 2 1)
(:IMMEDIATE-CHILD (:AND (:ELEMENT "foo") (:CLASS "bar"))
(:AND (:ELEMENT "bast") (:PSEUDO "blech" (:CLASS "foo"))))))
TL/DR
In conclusion, I am certainly not going to replace my working, fast,
memory efficient cl-yacc parser with my recursive-expressions parser.
However, if I wanted to have a working, legible (maybe) parser
definition, that will match as I intuitively expect, I might use
recursive-expressions. Because I am so used to using regex’s for
matching, if performance was not an issue, I would probably always
prefer the recursive expressions version. I could also see the
recursive expressions solution being a nice prototyping tool to help
develop the cl-yacc parser.
Obviously some of these opinions are going to be biased because
I wrote one of these libraries and not the other
CL-YACC
Pros
- Pretty quick and very memory efficient parser
- Easy to customize parser output (each production has a lambda body to build whatever output is necessary)
- Theoretically well grounded
Cons
- Its hard to make unambiguous grammars
- Not exceedingly helpful with its suggestions for how to fix your ambiguities
Recursive Regex
Pros
- Relatively easy to get working
- lexer and parser are the same tool, built on top of cl-ppcre, which
you presumably already know, and has a good test environment (regex-coach, repl) - Parses ambiguous grammars
- Has reasonable parser tracing built in, so debugging can be somewhat easier
Cons
- Not very concerned with parse time or memory consumption
- Parses ambiguous grammars
- Bad pathological cases (exponential search)
- Currently no architecture for modifying the parse tree other than an after
the fact rewrite