In my previous blog post, I discussed how recursive-regex has been
maturing but that it still wasn’t and was not intended to compete with
actual parser toolkits. I wanted to quantify this assumption and
present a head-to-head analysis of using recursive-regex vs using
cl-yacc.
I chose cl-yacc because
I already had a working css3-selector parser implemented. My
existing cl-yacc implementation is based on
the published
CSS3-Selectors Grammar (but the modified some to get it to work).
CL-YACC Parser
I found implementing the parser in cl-yacc to be fairly tedious,
time consuming and error prone, even for such a small language as
css-selectors. It doesn’t help that I tried to do it using the
published css3 grammar and lex files which are somewhat awkward (eg:
open parens are in the lexer and close parens are in the grammar).
I wrote a not very well documented (f)lex file reader to read in the
existing CSS3 flex
file and build a lexer compatible with cl-yacc. After getting a
valid lexer, I started working with the published grammar to convert
it to a format cl-yacc would approve of. Along the way I was finding
the syntax for cl-yacc to be pretty cumbersome, so I used some
read time execution to turn forms like the first below, into forms like
the second (macro expanded) one below which are valid cl-yacc
productions. (This is not a great idea or great code, but it did simplify
the parser def for me.)
(or-sel
#.(rule (comb-sel :|,| spaces or-sel)
(list :or comb-sel or-sel))
#.(rule (comb-sel) comb-sel))
(or-sel
(comb-sel :|,| spaces or-sel
#'(lambda (comb-sel |,| spaces or-sel)
(declare (ignorable comb-sel |,| spaces or-sel))
(list :or comb-sel or-sel)))
(comb-sel
#'(lambda (comb-sel) (declare (ignorable comb-sel)) comb-sel)))
The difficulties I had in implementation were mostly concerning where
white space could appear, in which productions, at which levels in the
grammar. It seemed like it was quite a task to get an unambiguous
grammar and have all my extra unimportant white space everywhere still
be parsable. After I had finally managed to rid my grammar of
reduce/reduce conflicts, I was able to parse my language and it seemed
like a fairly peppy parser.
The only problem I really have with this implementation is that it
seems like it is totally illegible after the fact. Even knowing about
parsers and having written this one, I don’t feel comfortable
modifying the language. It seems like it would be difficult to get
working again. Thankfully, I don’t anticipate having to do much
language rewriting in this case.
CL-YACC parser definition (lexer not shown)
(yacc:define-parser *css3-selector-parser*
(:start-symbol selector)
(:terminals (:|,| :|*| :|)| :|(| :|>| :|+| :|~| :|:| :|[| :|]| :|=| :|-|
:S :IDENT :HASH :CLASS :STRING :FUNCTION :NTH-FUNCTION
:INCLUDES :DASHMATCH :BEGINS-WITH :ENDS-WITH :SUBSTRING
:integer))
(:precedence ((:left :|)| :s :|,| :|+| :|~| )) )
(selector #.(rule (or-sel) or-sel))
(or-sel
#.(rule (comb-sel :|,| spaces or-sel)
(list :or comb-sel or-sel))
#.(rule (comb-sel) comb-sel))
(comb-sel
#.(rule (and-sel combinator comb-sel)
(list combinator and-sel comb-sel))
#.(rule
;; need to handle trailing spaces here
;; to avoid s/r
(and-sel spaces) and-sel))
(combinator
(:s (constantly :child))
(spaces :|>| spaces (constantly :immediate-child))
(spaces :|~| spaces (constantly :preceded-by))
(spaces :|+| spaces (constantly :immediatly-preceded-by)))
(and-sel
#.(rule (and-sel simple-selector)
(list :and and-sel simple-selector))
#.(rule (simple-selector) simple-selector))
(simple-selector
#.(rule (:HASH) `(:hash ,(but-first hash)))
#.(rule (:CLASS) `(:class ,(but-first class)))
#.(rule (:IDENT) `(:element ,ident))
(:|*| (constantly :everything))
(attrib #'identity)
(pseudo #'identity))
(attrib
#.(rule (:|[| spaces :IDENT spaces :|]|)
`(:attribute ,ident))
#.(rule (:|[| spaces :IDENT spaces attrib-value-def spaces :|]|)
`(:attribute ,ident ,attrib-value-def))
)
(attrib-value-def
#.(rule (attrib-match-type attrib-value)
(list attrib-match-type attrib-value)))
(attrib-match-type
#.(rule (:|=|) :equals)
#.(rule (:includes) :includes)
#.(rule (:dashmatch) :dashmatch)
#.(rule (:begins-with) :begins-with)
#.(rule (:ends-with) :ends-with)
#.(rule (:substring) :substring))
(attrib-value
#.(rule (:ident) ident)
#.(rule (:string) (but-quotes string)))
(pseudo
#.(rule (:|:| :IDENT) (list :pseudo ident))
#.(rule (:|:| :FUNCTION spaces selector :|)|)
(list :pseudo (but-last function) selector))
#.(rule (:|:| :NTH-FUNCTION spaces nth-expr spaces :|)| )
`(:nth-pseudo ,(but-last nth-function)
,@nth-expr)))
(nth-expr
#.(rule (:ident)
(cond ((string-equal ident "even") (list 2 0))
((string-equal ident "odd") (list 2 1))
(T (error "invalid nth subexpression"))))
#.(rule (nth-sign :integer)
(list 0 (if (string-equal nth-sign "-")
(* -1 (parse-integer integer))
(parse-integer integer))))
#.(rule (nth-sign :integer :ident)
(let (extra-num)
(cond
((string-equal "n" ident) T)
;; this is because our lexer will recogince n-1 as a valid ident
;; but n+1 will hit the rule below
((alexandria:starts-with-subseq "n" ident)
(setf extra-num (parse-integer (subseq ident 1))))
(T (error "invalid nth subexpression in (what is ~A)" ident)))
(list (or (if (string-equal nth-sign "-")
(* -1 (parse-integer integer))
(parse-integer integer))
0)
(or extra-num 0))))
#.(rule (nth-sign :integer :ident nth-sign :integer)
(when (and integer-1 (null nth-sign-1))
(error "invalid nth subexpression 2n+1 style requires a sign before the second number"))
(list (or (if (string-equal nth-sign-0 "-")
(* -1 (parse-integer integer-0))
(parse-integer integer-0))
0)
(or (if (string-equal nth-sign-1 "-")
(* -1 (parse-integer integer-1))
(parse-integer integer-1))
0))
))
(nth-sign
#.(rule (:|+|) :|+|)
#.(rule (:|-|) :|-|)
#.(rule () ()))
(spaces
(:S )
( )))
Recursive-Regex Parser
After having written recursive-regex, I wanted a way to beat some of
the bugs out as well as have a good example of what I could use this
tool for, and with what performance characteristics. To accomplish
this task, I converted the CL-YACC grammar and lex file into a single
parser definition and
wrote some code to turn that file into the recursive dispatch functions.
REX: a recursive expression file format based (loosely) on lex
I had a really shoddy lexish reader for the existing lex based lexer.
I converted this to a similar tool (without the c blocks) for defining
named recursive expressions. These files have the .rex extension.
They consists of options, inline-definitions, and named productions.
The definitions for css-selectors are in css3.rex.
Once I had my recursive expression definition, getting the parser was
a fairly easy task. Along the way I added some code to minimize the
parse tree results by promoting children, when a parent only had a
single child that matched the same length as the parent. I also
improved the parse tracing, so that I could observer and debug what
the expression was doing while it was matching. With the tree minimized
I also had to revise many of my tests.
css3.rex: css3-selectors parser definition
%option case-insensitive
h => [0-9a-f]
nmstart => [a-z]
nmchar => [a-z0-9-]
nl => \n|\r\n|\r|\f
string1 => \"([\t !#$%&(-~]|\\{nl}|\')*\"
string2 => \'([\t !#$%&(-~]|\\{nl}|\")*\'
id => [-]?{nmstart}{nmchar}*
name => {nmchar}+
int => [0-9]+
num => [0-9]+|[0-9]*\.[0-9]+
string => {string1}|{string2}
url => ([!#$%&*-~])*
w => [ \t\r\n\f]*
s => [\s\f]+
%%
comment => \/\*[^*]*\*+([^/][^*]*\*+)*\/
cdo => <!--
cdc => -->
class => \.(?<ident>)
hash => #(?<ident>{name})
ident => {id}
element => {id}
important_sym => !{w}important
selector => (?<or>)
or => (?<comb-sel>)(\s*,\s*(?<or>))?
comb-sel => (?<and>)((?<combinator>)(?<comb-sel>))?
combinator => (?<child>\s+)|(?<immediate-child>>)|(?<preceded-by>~)|(?<immediatly-preceded-by>\+)
and => (?<sim-sel>)(?<and>)?
sim-sel => (?<hash>)|(?<class>)|(?<element>)|\*|(?<attrib>)|(?<pseudo>)
attrib => \[\s*(?<ident>)\s*(?<attrib-val>)?\]
attrib-val => [\^\$\*\|\~]?=((?<ident>)|(?<string>))
pseudo => :(?<ident>)(?<parens>\s*(?<selector>)|(?<nth-expr>)\s*)?
nth-expr => even|odd|[\+-]?{int}(n([\+-]{int})?)?
Performance Numbers
These are the performance numbers for the two parsers each parsing 6
short inputs 1000 times. Also included is (a version) of that
output. (Recursive-expressions return CLOS object trees that have in
the results below been converted to a list representation for easy of
viewing.) As you can see the recursive-expressions version is ten
times slower and uses twenty times the memory.
(defun run-some-tests ()
(timed-side-by-side-parses
(list ".foo" "#bar" ":bast" "div[onclick]"
"div[onclick=foo]"
":nth-last-child( 2n+1 ), foo.bar>bast:blech( .foo )" )
1000))
CSS> (run-some-tests)
;;; Recursive-Regex Results
Evaluation took:
7.548 seconds of real time
7.480000 seconds of total run time (6.850000 user, 0.630000 system)
[ Run times consist of 1.410 seconds GC time, and 6.070 seconds non-GC time. ]
99.10% CPU
18,821,417,333 processor cycles
1,167,530,480 bytes consed
((:CLASS ".foo" (:IDENT "foo")) (:HASH "#bar" (:IDENT "bar"))
(:PSEUDO ":bast" (:IDENT "bast"))
(:AND "div[onclick]" (:ELEMENT "div")
(:ATTRIB "[onclick]" (:IDENT "onclick")))
(:AND "div[onclick=foo]" (:ELEMENT "div")
(:ATTRIB "[onclick=foo]" (:IDENT "onclick")
(:ATTRIB-VAL "=foo" (:IDENT "foo"))))
(:OR ":nth-last-child( 2n+1 ), foo.bar>bast:blech( .foo )"
(:PSEUDO ":nth-last-child( 2n+1 )" (:IDENT "nth-last-child")
(:MATCHED-PARENS "( 2n+1 )" (:BODY "2n+1 " (:NTH-EXPR "2n+1"))))
(:COMB-SEL "foo.bar>bast:blech( .foo )"
(:AND "foo.bar" (:ELEMENT "foo") (:CLASS ".bar" (:IDENT "bar")))
(:IMMEDIATE-CHILD ">")
(:AND "bast:blech( .foo )" (:ELEMENT "bast")
(:PSEUDO ":blech( .foo )" (:IDENT "blech")
(:MATCHED-PARENS "( .foo )"
(:BODY " .foo" (:CLASS ".foo" (:IDENT "foo")))))))))
;;; CL-YACC Results
Evaluation took:
0.790 seconds of real time
0.790000 seconds of total run time (0.770000 user, 0.020000 system)
[ Run times consist of 0.080 seconds GC time, and 0.710 seconds non-GC time. ]
100.00% CPU
1,968,753,765 processor cycles
64,695,792 bytes consed
((:CLASS "foo") (:HASH "bar") (:PSEUDO "bast")
(:AND (:ELEMENT "div") (:ATTRIBUTE "onclick"))
(:AND (:ELEMENT "div") (:ATTRIBUTE "onclick" (:EQUALS "foo")))
(:OR (:NTH-PSEUDO "nth-last-child" 2 1)
(:IMMEDIATE-CHILD (:AND (:ELEMENT "foo") (:CLASS "bar"))
(:AND (:ELEMENT "bast") (:PSEUDO "blech" (:CLASS "foo"))))))
TL/DR
In conclusion, I am certainly not going to replace my working, fast,
memory efficient cl-yacc parser with my recursive-expressions parser.
However, if I wanted to have a working, legible (maybe) parser
definition, that will match as I intuitively expect, I might use
recursive-expressions. Because I am so used to using regex’s for
matching, if performance was not an issue, I would probably always
prefer the recursive expressions version. I could also see the
recursive expressions solution being a nice prototyping tool to help
develop the cl-yacc parser.
Obviously some of these opinions are going to be biased because
I wrote one of these libraries and not the other
CL-YACC
Pros
- Pretty quick and very memory efficient parser
- Easy to customize parser output (each production has a lambda body to build whatever output is necessary)
- Theoretically well grounded
Cons
- Its hard to make unambiguous grammars
- Not exceedingly helpful with its suggestions for how to fix your ambiguities
Recursive Regex
Pros
- Relatively easy to get working
- lexer and parser are the same tool, built on top of cl-ppcre, which
you presumably already know, and has a good test environment (regex-coach, repl)
- Parses ambiguous grammars
- Has reasonable parser tracing built in, so debugging can be somewhat easier
Cons
- Not very concerned with parse time or memory consumption
- Parses ambiguous grammars
- Bad pathological cases (exponential search)
- Currently no architecture for modifying the parse tree other than an after
the fact rewrite