Comparing Recursive-Regex and CL-YACC

In my previous blog post, I discussed how recursive-regex has been
maturing but that it still wasn’t and was not intended to compete with
actual parser toolkits. I wanted to quantify this assumption and
present a head-to-head analysis of using recursive-regex vs using
cl-yacc.
I chose cl-yacc because

I already had a working css3-selector parser implemented
. My
existing cl-yacc implementation is based on
the published
CSS3-Selectors Grammar
(but the modified some to get it to work).


CL-YACC Parser

I found implementing the parser in cl-yacc to be fairly tedious,
time consuming and error prone, even for such a small language as
css-selectors. It doesn’t help that I tried to do it using the
published css3 grammar and lex files which are somewhat awkward (eg:
open parens are in the lexer and close parens are in the grammar).

I wrote a not very well documented (f)lex file reader to read in the
existing CSS3 flex
file
and build a lexer compatible with cl-yacc. After getting a
valid lexer, I started working with the published grammar to convert
it to a format cl-yacc would approve of. Along the way I was finding
the syntax for cl-yacc to be pretty cumbersome, so I used some
read time execution to turn forms like the first below, into forms like
the second (macro expanded) one below which are valid cl-yacc
productions. (This is not a great idea or great code, but it did simplify
the parser def for me.)

(or-sel
   #.(rule (comb-sel :|,| spaces or-sel)
       (list :or comb-sel or-sel))
   #.(rule (comb-sel) comb-sel))



(or-sel
 (comb-sel :|,| spaces or-sel
  #'(lambda (comb-sel |,| spaces or-sel)
      (declare (ignorable comb-sel |,| spaces or-sel))
      (list :or comb-sel or-sel)))
 (comb-sel
  #'(lambda (comb-sel) (declare (ignorable comb-sel)) comb-sel)))

The difficulties I had in implementation were mostly concerning where
white space could appear, in which productions, at which levels in the
grammar. It seemed like it was quite a task to get an unambiguous
grammar and have all my extra unimportant white space everywhere still
be parsable. After I had finally managed to rid my grammar of
reduce/reduce conflicts, I was able to parse my language and it seemed
like a fairly peppy parser.

The only problem I really have with this implementation is that it
seems like it is totally illegible after the fact. Even knowing about
parsers and having written this one, I don’t feel comfortable
modifying the language. It seems like it would be difficult to get
working again. Thankfully, I don’t anticipate having to do much
language rewriting in this case.

CL-YACC parser definition (lexer not shown)

(yacc:define-parser *css3-selector-parser*
  (:start-symbol selector)
  (:terminals (:|,| :|*| :|)| :|(| :|>| :|+| :|~| :|:| :|[| :|]| :|=| :|-|
		:S :IDENT :HASH :CLASS :STRING :FUNCTION :NTH-FUNCTION
		:INCLUDES :DASHMATCH :BEGINS-WITH :ENDS-WITH :SUBSTRING
		:integer))
  (:precedence ((:left :|)| :s :|,| :|+| :|~| )) )
  
  (selector #.(rule (or-sel) or-sel))

  (or-sel
   #.(rule (comb-sel :|,| spaces or-sel)
       (list :or comb-sel or-sel))
   #.(rule (comb-sel) comb-sel))
  
  (comb-sel
   #.(rule (and-sel combinator comb-sel)
       (list combinator and-sel comb-sel))
   #.(rule
	 ;; need to handle trailing spaces here
	 ;; to avoid s/r
	 (and-sel spaces) and-sel))

  (combinator 
   (:s (constantly :child))
   (spaces :|>| spaces (constantly :immediate-child))
   (spaces :|~| spaces (constantly :preceded-by))
   (spaces :|+| spaces (constantly :immediatly-preceded-by)))
  
  (and-sel
   #.(rule (and-sel simple-selector)
       (list :and and-sel simple-selector))
   #.(rule (simple-selector) simple-selector))
  
  (simple-selector
   #.(rule (:HASH) `(:hash ,(but-first hash)))
   #.(rule (:CLASS) `(:class ,(but-first class)))
   #.(rule (:IDENT) `(:element ,ident))
   (:|*| (constantly :everything))
   (attrib #'identity)
   (pseudo #'identity))

  (attrib
   #.(rule (:|[| spaces :IDENT spaces :|]|)
       `(:attribute ,ident))
   
   #.(rule (:|[| spaces :IDENT spaces attrib-value-def spaces :|]|)
       `(:attribute ,ident ,attrib-value-def))
   )

  (attrib-value-def
   #.(rule (attrib-match-type attrib-value)
       (list attrib-match-type attrib-value)))

  (attrib-match-type
   #.(rule (:|=|) :equals)
   #.(rule (:includes) :includes)
   #.(rule (:dashmatch) :dashmatch)
   #.(rule (:begins-with) :begins-with)
   #.(rule (:ends-with) :ends-with)
   #.(rule (:substring) :substring))

  (attrib-value
   #.(rule (:ident) ident)
   #.(rule (:string) (but-quotes string)))
  
  (pseudo
   #.(rule (:|:| :IDENT) (list :pseudo ident))
   
   #.(rule (:|:| :FUNCTION spaces selector :|)|)
       (list :pseudo (but-last function) selector))
   #.(rule (:|:| :NTH-FUNCTION spaces nth-expr spaces :|)| )
       `(:nth-pseudo ,(but-last nth-function)
		     ,@nth-expr)))

  (nth-expr
   #.(rule (:ident)
       (cond ((string-equal ident "even") (list 2 0))
	     ((string-equal ident "odd") (list 2 1))
	     (T (error "invalid nth subexpression"))))
   
   #.(rule (nth-sign :integer)
       (list 0 (if (string-equal nth-sign "-")
		   (* -1 (parse-integer integer))
		   (parse-integer integer))))
   
   #.(rule (nth-sign :integer :ident)
       (let (extra-num)
	 (cond
	   ((string-equal "n" ident) T)
	   ;; this is because our lexer will recogince n-1 as a valid ident
	   ;; but n+1 will hit the rule below
	   ((alexandria:starts-with-subseq "n" ident)
	    (setf extra-num (parse-integer (subseq ident 1))))
	   (T (error "invalid nth subexpression in (what is ~A)" ident)))
	 (list (or (if (string-equal nth-sign "-")
		       (* -1 (parse-integer integer))
		       (parse-integer integer))
		   0)
	       (or extra-num 0))))
   #.(rule (nth-sign :integer :ident nth-sign :integer)
       (when (and integer-1 (null nth-sign-1))
	 (error "invalid nth subexpression 2n+1 style requires a sign before the second number"))
       (list (or (if (string-equal nth-sign-0 "-")
		     (* -1 (parse-integer integer-0))
		     (parse-integer integer-0))
		 0)
	     (or (if (string-equal nth-sign-1 "-")
		     (* -1 (parse-integer integer-1))
		     (parse-integer integer-1))
		 0))
       ))
  
   (nth-sign
    #.(rule (:|+|) :|+|)
    #.(rule (:|-|) :|-|)
    #.(rule () ()))
  
   (spaces
    (:S )
    ( )))

Recursive-Regex Parser

After having written recursive-regex, I wanted a way to beat some of
the bugs out as well as have a good example of what I could use this
tool for, and with what performance characteristics. To accomplish
this task, I converted the CL-YACC grammar and lex file into a single
parser definition and

wrote some code to turn that file into the recursive dispatch functions
.

REX: a recursive expression file format based (loosely) on lex

I had a really shoddy lexish reader for the existing lex based lexer.
I converted this to a similar tool (without the c blocks) for defining
named recursive expressions. These files have the .rex extension.
They consists of options, inline-definitions, and named productions.

The definitions for css-selectors are in css3.rex.

Once I had my recursive expression definition, getting the parser was
a fairly easy task. Along the way I added some code to minimize the
parse tree results by promoting children, when a parent only had a
single child that matched the same length as the parent. I also
improved the parse tracing, so that I could observer and debug what
the expression was doing while it was matching. With the tree minimized
I also had to revise many of my tests.

css3.rex: css3-selectors parser definition

%option case-insensitive

h => [0-9a-f]
nmstart => [a-z]
nmchar => [a-z0-9-]
nl => \n|\r\n|\r|\f
string1 => \"([\t !#$%&(-~]|\\{nl}|\')*\"
string2 => \'([\t !#$%&(-~]|\\{nl}|\")*\'
id => [-]?{nmstart}{nmchar}*
name => {nmchar}+
int => [0-9]+
num => [0-9]+|[0-9]*\.[0-9]+
string => {string1}|{string2}
url => ([!#$%&*-~])*
w => [ \t\r\n\f]*
s => [\s\f]+

%%

comment => \/\*[^*]*\*+([^/][^*]*\*+)*\/
cdo => <!--
cdc => -->
class => \.(?<ident>)
hash => #(?<ident>{name})
ident => {id}
element => {id}
important_sym => !{w}important

selector => (?<or>)
or => (?<comb-sel>)(\s*,\s*(?<or>))?

comb-sel => (?<and>)((?<combinator>)(?<comb-sel>))?
combinator => (?<child>\s+)|(?<immediate-child>>)|(?<preceded-by>~)|(?<immediatly-preceded-by>\+)
and => (?<sim-sel>)(?<and>)?
sim-sel => (?<hash>)|(?<class>)|(?<element>)|\*|(?<attrib>)|(?<pseudo>)
attrib => \[\s*(?<ident>)\s*(?<attrib-val>)?\]
attrib-val => [\^\$\*\|\~]?=((?<ident>)|(?<string>))
pseudo => :(?<ident>)(?<parens>\s*(?<selector>)|(?<nth-expr>)\s*)?
nth-expr => even|odd|[\+-]?{int}(n([\+-]{int})?)?


Performance Numbers

These are the performance numbers for the two parsers each parsing 6
short inputs 1000 times. Also included is (a version) of that
output. (Recursive-expressions return CLOS object trees that have in
the results below been converted to a list representation for easy of
viewing.) As you can see the recursive-expressions version is ten
times slower and uses twenty times the memory.

(defun run-some-tests ()
(timed-side-by-side-parses
(list ".foo" "#bar" ":bast" "div[onclick]"
"div[onclick=foo]"
":nth-last-child( 2n+1 ), foo.bar>bast:blech( .foo )" )
1000))

CSS> (run-some-tests)

;;; Recursive-Regex Results
Evaluation took:
7.548 seconds of real time
7.480000 seconds of total run time (6.850000 user, 0.630000 system)
[ Run times consist of 1.410 seconds GC time, and 6.070 seconds non-GC time. ]
99.10% CPU
18,821,417,333 processor cycles
1,167,530,480 bytes consed

((:CLASS ".foo" (:IDENT "foo")) (:HASH "#bar" (:IDENT "bar"))
(:PSEUDO ":bast" (:IDENT "bast"))
(:AND "div[onclick]" (:ELEMENT "div")
(:ATTRIB "[onclick]" (:IDENT "onclick")))
(:AND "div[onclick=foo]" (:ELEMENT "div")
(:ATTRIB "[onclick=foo]" (:IDENT "onclick")
(:ATTRIB-VAL "=foo" (:IDENT "foo"))))
(:OR ":nth-last-child( 2n+1 ), foo.bar>bast:blech( .foo )"
(:PSEUDO ":nth-last-child( 2n+1 )" (:IDENT "nth-last-child")
(:MATCHED-PARENS "( 2n+1 )" (:BODY "2n+1 " (:NTH-EXPR "2n+1"))))
(:COMB-SEL "foo.bar>bast:blech( .foo )"
(:AND "foo.bar" (:ELEMENT "foo") (:CLASS ".bar" (:IDENT "bar")))
(:IMMEDIATE-CHILD ">")
(:AND "bast:blech( .foo )" (:ELEMENT "bast")
(:PSEUDO ":blech( .foo )" (:IDENT "blech")
(:MATCHED-PARENS "( .foo )"
(:BODY " .foo" (:CLASS ".foo" (:IDENT "foo")))))))))

;;; CL-YACC Results

Evaluation took:
0.790 seconds of real time
0.790000 seconds of total run time (0.770000 user, 0.020000 system)
[ Run times consist of 0.080 seconds GC time, and 0.710 seconds non-GC time. ]
100.00% CPU
1,968,753,765 processor cycles
64,695,792 bytes consed

((:CLASS "foo") (:HASH "bar") (:PSEUDO "bast")
(:AND (:ELEMENT "div") (:ATTRIBUTE "onclick"))
(:AND (:ELEMENT "div") (:ATTRIBUTE "onclick" (:EQUALS "foo")))
(:OR (:NTH-PSEUDO "nth-last-child" 2 1)
(:IMMEDIATE-CHILD (:AND (:ELEMENT "foo") (:CLASS "bar"))
(:AND (:ELEMENT "bast") (:PSEUDO "blech" (:CLASS "foo"))))))


TL/DR

In conclusion, I am certainly not going to replace my working, fast,
memory efficient cl-yacc parser with my recursive-expressions parser.
However, if I wanted to have a working, legible (maybe) parser
definition, that will match as I intuitively expect, I might use
recursive-expressions. Because I am so used to using regex’s for
matching, if performance was not an issue, I would probably always
prefer the recursive expressions version. I could also see the
recursive expressions solution being a nice prototyping tool to help
develop the cl-yacc parser.

Obviously some of these opinions are going to be biased because
I wrote one of these libraries and not the other

CL-YACC

Pros
  • Pretty quick and very memory efficient parser
  • Easy to customize parser output (each production has a lambda body to build whatever output is necessary)
  • Theoretically well grounded
Cons
  • Its hard to make unambiguous grammars
  • Not exceedingly helpful with its suggestions for how to fix your ambiguities

Recursive Regex

Pros
  • Relatively easy to get working
  • lexer and parser are the same tool, built on top of cl-ppcre, which
    you presumably already know, and has a good test environment (regex-coach, repl)
  • Parses ambiguous grammars
  • Has reasonable parser tracing built in, so debugging can be somewhat easier
Cons
  • Not very concerned with parse time or memory consumption
  • Parses ambiguous grammars
  • Bad pathological cases (exponential search)
  • Currently no architecture for modifying the parse tree other than an after
    the fact rewrite

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>