Comparing Recursive-Regex and CL-YACC

In my previous blog post, I discussed how recursive-regex has been
maturing but that it still wasn’t and was not intended to compete with
actual parser toolkits. I wanted to quantify this assumption and
present a head-to-head analysis of using recursive-regex vs using
cl-yacc.
I chose cl-yacc because

I already had a working css3-selector parser implemented
. My
existing cl-yacc implementation is based on
the published
CSS3-Selectors Grammar
(but the modified some to get it to work).


CL-YACC Parser

I found implementing the parser in cl-yacc to be fairly tedious,
time consuming and error prone, even for such a small language as
css-selectors. It doesn’t help that I tried to do it using the
published css3 grammar and lex files which are somewhat awkward (eg:
open parens are in the lexer and close parens are in the grammar).

I wrote a not very well documented (f)lex file reader to read in the
existing CSS3 flex
file
and build a lexer compatible with cl-yacc. After getting a
valid lexer, I started working with the published grammar to convert
it to a format cl-yacc would approve of. Along the way I was finding
the syntax for cl-yacc to be pretty cumbersome, so I used some
read time execution to turn forms like the first below, into forms like
the second (macro expanded) one below which are valid cl-yacc
productions. (This is not a great idea or great code, but it did simplify
the parser def for me.)

(or-sel
   #.(rule (comb-sel :|,| spaces or-sel)
       (list :or comb-sel or-sel))
   #.(rule (comb-sel) comb-sel))



(or-sel
 (comb-sel :|,| spaces or-sel
  #'(lambda (comb-sel |,| spaces or-sel)
      (declare (ignorable comb-sel |,| spaces or-sel))
      (list :or comb-sel or-sel)))
 (comb-sel
  #'(lambda (comb-sel) (declare (ignorable comb-sel)) comb-sel)))

The difficulties I had in implementation were mostly concerning where
white space could appear, in which productions, at which levels in the
grammar. It seemed like it was quite a task to get an unambiguous
grammar and have all my extra unimportant white space everywhere still
be parsable. After I had finally managed to rid my grammar of
reduce/reduce conflicts, I was able to parse my language and it seemed
like a fairly peppy parser.

The only problem I really have with this implementation is that it
seems like it is totally illegible after the fact. Even knowing about
parsers and having written this one, I don’t feel comfortable
modifying the language. It seems like it would be difficult to get
working again. Thankfully, I don’t anticipate having to do much
language rewriting in this case.

CL-YACC parser definition (lexer not shown)

(yacc:define-parser *css3-selector-parser*
  (:start-symbol selector)
  (:terminals (:|,| :|*| :|)| :|(| :|>| :|+| :|~| :|:| :|[| :|]| :|=| :|-|
		:S :IDENT :HASH :CLASS :STRING :FUNCTION :NTH-FUNCTION
		:INCLUDES :DASHMATCH :BEGINS-WITH :ENDS-WITH :SUBSTRING
		:integer))
  (:precedence ((:left :|)| :s :|,| :|+| :|~| )) )
  
  (selector #.(rule (or-sel) or-sel))

  (or-sel
   #.(rule (comb-sel :|,| spaces or-sel)
       (list :or comb-sel or-sel))
   #.(rule (comb-sel) comb-sel))
  
  (comb-sel
   #.(rule (and-sel combinator comb-sel)
       (list combinator and-sel comb-sel))
   #.(rule
	 ;; need to handle trailing spaces here
	 ;; to avoid s/r
	 (and-sel spaces) and-sel))

  (combinator 
   (:s (constantly :child))
   (spaces :|>| spaces (constantly :immediate-child))
   (spaces :|~| spaces (constantly :preceded-by))
   (spaces :|+| spaces (constantly :immediatly-preceded-by)))
  
  (and-sel
   #.(rule (and-sel simple-selector)
       (list :and and-sel simple-selector))
   #.(rule (simple-selector) simple-selector))
  
  (simple-selector
   #.(rule (:HASH) `(:hash ,(but-first hash)))
   #.(rule (:CLASS) `(:class ,(but-first class)))
   #.(rule (:IDENT) `(:element ,ident))
   (:|*| (constantly :everything))
   (attrib #'identity)
   (pseudo #'identity))

  (attrib
   #.(rule (:|[| spaces :IDENT spaces :|]|)
       `(:attribute ,ident))
   
   #.(rule (:|[| spaces :IDENT spaces attrib-value-def spaces :|]|)
       `(:attribute ,ident ,attrib-value-def))
   )

  (attrib-value-def
   #.(rule (attrib-match-type attrib-value)
       (list attrib-match-type attrib-value)))

  (attrib-match-type
   #.(rule (:|=|) :equals)
   #.(rule (:includes) :includes)
   #.(rule (:dashmatch) :dashmatch)
   #.(rule (:begins-with) :begins-with)
   #.(rule (:ends-with) :ends-with)
   #.(rule (:substring) :substring))

  (attrib-value
   #.(rule (:ident) ident)
   #.(rule (:string) (but-quotes string)))
  
  (pseudo
   #.(rule (:|:| :IDENT) (list :pseudo ident))
   
   #.(rule (:|:| :FUNCTION spaces selector :|)|)
       (list :pseudo (but-last function) selector))
   #.(rule (:|:| :NTH-FUNCTION spaces nth-expr spaces :|)| )
       `(:nth-pseudo ,(but-last nth-function)
		     ,@nth-expr)))

  (nth-expr
   #.(rule (:ident)
       (cond ((string-equal ident "even") (list 2 0))
	     ((string-equal ident "odd") (list 2 1))
	     (T (error "invalid nth subexpression"))))
   
   #.(rule (nth-sign :integer)
       (list 0 (if (string-equal nth-sign "-")
		   (* -1 (parse-integer integer))
		   (parse-integer integer))))
   
   #.(rule (nth-sign :integer :ident)
       (let (extra-num)
	 (cond
	   ((string-equal "n" ident) T)
	   ;; this is because our lexer will recogince n-1 as a valid ident
	   ;; but n+1 will hit the rule below
	   ((alexandria:starts-with-subseq "n" ident)
	    (setf extra-num (parse-integer (subseq ident 1))))
	   (T (error "invalid nth subexpression in (what is ~A)" ident)))
	 (list (or (if (string-equal nth-sign "-")
		       (* -1 (parse-integer integer))
		       (parse-integer integer))
		   0)
	       (or extra-num 0))))
   #.(rule (nth-sign :integer :ident nth-sign :integer)
       (when (and integer-1 (null nth-sign-1))
	 (error "invalid nth subexpression 2n+1 style requires a sign before the second number"))
       (list (or (if (string-equal nth-sign-0 "-")
		     (* -1 (parse-integer integer-0))
		     (parse-integer integer-0))
		 0)
	     (or (if (string-equal nth-sign-1 "-")
		     (* -1 (parse-integer integer-1))
		     (parse-integer integer-1))
		 0))
       ))
  
   (nth-sign
    #.(rule (:|+|) :|+|)
    #.(rule (:|-|) :|-|)
    #.(rule () ()))
  
   (spaces
    (:S )
    ( )))

Recursive-Regex Parser

After having written recursive-regex, I wanted a way to beat some of
the bugs out as well as have a good example of what I could use this
tool for, and with what performance characteristics. To accomplish
this task, I converted the CL-YACC grammar and lex file into a single
parser definition and

wrote some code to turn that file into the recursive dispatch functions
.

REX: a recursive expression file format based (loosely) on lex

I had a really shoddy lexish reader for the existing lex based lexer.
I converted this to a similar tool (without the c blocks) for defining
named recursive expressions. These files have the .rex extension.
They consists of options, inline-definitions, and named productions.

The definitions for css-selectors are in css3.rex.

Once I had my recursive expression definition, getting the parser was
a fairly easy task. Along the way I added some code to minimize the
parse tree results by promoting children, when a parent only had a
single child that matched the same length as the parent. I also
improved the parse tracing, so that I could observer and debug what
the expression was doing while it was matching. With the tree minimized
I also had to revise many of my tests.

css3.rex: css3-selectors parser definition

%option case-insensitive

h => [0-9a-f]
nmstart => [a-z]
nmchar => [a-z0-9-]
nl => \n|\r\n|\r|\f
string1 => \"([\t !#$%&(-~]|\\{nl}|\')*\"
string2 => \'([\t !#$%&(-~]|\\{nl}|\")*\'
id => [-]?{nmstart}{nmchar}*
name => {nmchar}+
int => [0-9]+
num => [0-9]+|[0-9]*\.[0-9]+
string => {string1}|{string2}
url => ([!#$%&*-~])*
w => [ \t\r\n\f]*
s => [\s\f]+

%%

comment => \/\*[^*]*\*+([^/][^*]*\*+)*\/
cdo => <!--
cdc => -->
class => \.(?<ident>)
hash => #(?<ident>{name})
ident => {id}
element => {id}
important_sym => !{w}important

selector => (?<or>)
or => (?<comb-sel>)(\s*,\s*(?<or>))?

comb-sel => (?<and>)((?<combinator>)(?<comb-sel>))?
combinator => (?<child>\s+)|(?<immediate-child>>)|(?<preceded-by>~)|(?<immediatly-preceded-by>\+)
and => (?<sim-sel>)(?<and>)?
sim-sel => (?<hash>)|(?<class>)|(?<element>)|\*|(?<attrib>)|(?<pseudo>)
attrib => \[\s*(?<ident>)\s*(?<attrib-val>)?\]
attrib-val => [\^\$\*\|\~]?=((?<ident>)|(?<string>))
pseudo => :(?<ident>)(?<parens>\s*(?<selector>)|(?<nth-expr>)\s*)?
nth-expr => even|odd|[\+-]?{int}(n([\+-]{int})?)?


Performance Numbers

These are the performance numbers for the two parsers each parsing 6
short inputs 1000 times. Also included is (a version) of that
output. (Recursive-expressions return CLOS object trees that have in
the results below been converted to a list representation for easy of
viewing.) As you can see the recursive-expressions version is ten
times slower and uses twenty times the memory.

(defun run-some-tests ()
(timed-side-by-side-parses
(list ".foo" "#bar" ":bast" "div[onclick]"
"div[onclick=foo]"
":nth-last-child( 2n+1 ), foo.bar>bast:blech( .foo )" )
1000))

CSS> (run-some-tests)

;;; Recursive-Regex Results
Evaluation took:
7.548 seconds of real time
7.480000 seconds of total run time (6.850000 user, 0.630000 system)
[ Run times consist of 1.410 seconds GC time, and 6.070 seconds non-GC time. ]
99.10% CPU
18,821,417,333 processor cycles
1,167,530,480 bytes consed

((:CLASS ".foo" (:IDENT "foo")) (:HASH "#bar" (:IDENT "bar"))
(:PSEUDO ":bast" (:IDENT "bast"))
(:AND "div[onclick]" (:ELEMENT "div")
(:ATTRIB "[onclick]" (:IDENT "onclick")))
(:AND "div[onclick=foo]" (:ELEMENT "div")
(:ATTRIB "[onclick=foo]" (:IDENT "onclick")
(:ATTRIB-VAL "=foo" (:IDENT "foo"))))
(:OR ":nth-last-child( 2n+1 ), foo.bar>bast:blech( .foo )"
(:PSEUDO ":nth-last-child( 2n+1 )" (:IDENT "nth-last-child")
(:MATCHED-PARENS "( 2n+1 )" (:BODY "2n+1 " (:NTH-EXPR "2n+1"))))
(:COMB-SEL "foo.bar>bast:blech( .foo )"
(:AND "foo.bar" (:ELEMENT "foo") (:CLASS ".bar" (:IDENT "bar")))
(:IMMEDIATE-CHILD ">")
(:AND "bast:blech( .foo )" (:ELEMENT "bast")
(:PSEUDO ":blech( .foo )" (:IDENT "blech")
(:MATCHED-PARENS "( .foo )"
(:BODY " .foo" (:CLASS ".foo" (:IDENT "foo")))))))))

;;; CL-YACC Results

Evaluation took:
0.790 seconds of real time
0.790000 seconds of total run time (0.770000 user, 0.020000 system)
[ Run times consist of 0.080 seconds GC time, and 0.710 seconds non-GC time. ]
100.00% CPU
1,968,753,765 processor cycles
64,695,792 bytes consed

((:CLASS "foo") (:HASH "bar") (:PSEUDO "bast")
(:AND (:ELEMENT "div") (:ATTRIBUTE "onclick"))
(:AND (:ELEMENT "div") (:ATTRIBUTE "onclick" (:EQUALS "foo")))
(:OR (:NTH-PSEUDO "nth-last-child" 2 1)
(:IMMEDIATE-CHILD (:AND (:ELEMENT "foo") (:CLASS "bar"))
(:AND (:ELEMENT "bast") (:PSEUDO "blech" (:CLASS "foo"))))))


TL/DR

In conclusion, I am certainly not going to replace my working, fast,
memory efficient cl-yacc parser with my recursive-expressions parser.
However, if I wanted to have a working, legible (maybe) parser
definition, that will match as I intuitively expect, I might use
recursive-expressions. Because I am so used to using regex’s for
matching, if performance was not an issue, I would probably always
prefer the recursive expressions version. I could also see the
recursive expressions solution being a nice prototyping tool to help
develop the cl-yacc parser.

Obviously some of these opinions are going to be biased because
I wrote one of these libraries and not the other

CL-YACC

Pros
  • Pretty quick and very memory efficient parser
  • Easy to customize parser output (each production has a lambda body to build whatever output is necessary)
  • Theoretically well grounded
Cons
  • Its hard to make unambiguous grammars
  • Not exceedingly helpful with its suggestions for how to fix your ambiguities

Recursive Regex

Pros
  • Relatively easy to get working
  • lexer and parser are the same tool, built on top of cl-ppcre, which
    you presumably already know, and has a good test environment (regex-coach, repl)
  • Parses ambiguous grammars
  • Has reasonable parser tracing built in, so debugging can be somewhat easier
Cons
  • Not very concerned with parse time or memory consumption
  • Parses ambiguous grammars
  • Bad pathological cases (exponential search)
  • Currently no architecture for modifying the parse tree other than an after
    the fact rewrite

Recursive-Regex Update

I think I have enough of the bugs worked out and enough tests now to actually recommend that others use the recursive-regex library up at my github.

There is a brief overview of the project in the introductory blog post. Also, in response to one of the comments on that blog post, there is an example s-exp parser as part of the test suite.

While this started as a toy, to scratch an intellectual itch, I think that this project is potentially a nice mid point between full blown parser frame work and regular expressions. Grammars are hard to get right though, so if you are writing your own language you might want to investigate something from the cliki parser generators page (eg: cl-yacc).

Introducing Recursive-Regex

Recursive-Regex is the end result of a weekend of playing with the code I published on Thursday about adding named dispatch functions to CL-PPCRE regular expressions. I kept at it and I think that this approach might have some promise for building up a library of reusable regexp/matcher chunks. I also found that this made it somewhat easier to obtain results from the regular expression search because I get back a full parse tree rather than the bindings typically supplied by CL-PPCRE.

I have it somewhat documented, loadable and testable, with all my current tests passing. There is even a recursive regex csv parser defined in the default dispatch table (mostly as a simple, but practical proof of concept).

Comma-List: [\t ]*(?:(?<body>[^,]*)[\t ]*,)*[\t ]*(?<body>[^,]*)[\t ]*
CSV-Row: (?<comma-list>((?<double-quotes>)|[^\n,]*))(?:\n|$)
CSV-File: (?<csv-row>)*

Double quotes and body both go to custom dispatcher functions. Body defines where the body regex should be matched and what to use if no body is supplied.

I don’t really have long term plans for this project, but it scratched an intellectual itch I was experiencing. Perhaps it will be useful for someone down the road.

Even More Fun With CL-PPCRE Filter Functions

A while ago I posted about my adventures playing with CL-PPCRE filter functions. In the previous blog post I destructively modify a cl-ppcre parse tree to add a filter function that can handle matching matched pairs of parentheses (a typical example of what regular expressions are NOT capable of). In this post I formalize that example into something that could be more broadly applied with less understanding of the underlying mechanics.

To begin with I define a function create-scanner-with-filters that will handle creating these special scanners for me. My idea is to provide a table of functions that should be called when we see certain strings inside of the regular expression. Because there are already named groups (see *allow-named-registers*) that can have parameters and that CL-PPCRE is already parsing for me, I decided to tie into the named registers to handle my function dispatching. This has the added niceness that whatever your filter matches is going to be stored in a register.

An over view of this process is: parse the regex, replace any named-register nodes’ (that have a function in the table) third element (usually a regex whose match will be stored in a register) with our specialized filter function, compile the new scanner and return that to the end user. I also decided that the regex that is the body of the named group should be available to the filter and in most cases should probably be used as part of the filter function.

If I continue to play with this, I might eventually release it as a library, but for now its stands well on its own.

Without further ado:

(cl-interpol:enable-interpol-syntax)
(declaim (optimize (debug 3)))

;; TODO: group binds in body expressions
;; TODO: propogate current scanner options to body scanners

(defun make-matched-pair-matcher (open-char close-char)
  "Will create a regex filter that can match arbitrary pairs of matched characters
   such as (start (other () some) end)"
  (lambda (body-regex)
    (setf body-regex (if (eql body-regex :void)
                         nil
                         (cl-ppcre:create-scanner
                          `(:SEQUENCE :START-ANCHOR ,body-regex :END-ANCHOR))))
    (lambda (pos)
      ;;(format T "TEST3 ~A ~A ~%" cl-ppcre::*reg-starts* cl-ppcre::*reg-ends*)
      (iter
        (with fail = nil)
        (with start = pos)
        (with cnt = 0)
        (for c = (char cl-ppcre::*string* pos))
        (if (first-iteration-p)
            (unless (eql c open-char) (return fail))
            ;; went past the string without matching
            (when (>= pos (length cl-ppcre::*string*))
              (return fail)))
        (cond
          ((eql c open-char) (incf cnt))
          ((eql c close-char)
           (decf cnt)
           (when (zerop cnt) ;; found our last matching char
             (if (or (null body-regex)
                     (cl-ppcre:scan body-regex cl-ppcre::*string*
                                    :start (+ 1 start)
                                    :end pos))
                 (return (+ 1 pos))
                 (return fail)))))
        (incf pos)))))

(defun default-dispatch-table ()
  "Creates a default dispatch table with a parens dispatcher that can match
   pairs of parentheses"
  `(("parens" . ,(make-matched-pair-matcher #\( #\) ))))

(defun create-scanner-with-filters
    (regex &optional (function-table (default-dispatch-table)) )
  "Allows named registers to refer to functions that should be in
   the place of the named register"
  (let* ((cl-ppcre:*allow-named-registers* T)
         (p-tree (cl-ppcre:parse-string regex)))
    (labels ((dispatcher? (name)
               "Return the name of the dispatcher from the table if
                applicable"
               (cdr (assoc name function-table :test #'string-equal)))
             (mutate-tree (tree)
               "Changes the scanner parse tree to include any filter
                functions specified in the table"
               (typecase tree
                 (null nil)
                 (atom tree)
                 (list
                  (aif (and (eql :named-register (first tree))
                            (dispatcher? (second tree)))
                       `(:named-register (second tree)
                         (:filter ,(funcall it (third tree))))
                       (iter (for item in tree)
                         (collect (mutate-tree item))))))))
      ;; mutate the regex to contain our matcher functions
      ;; then compile it
      (cl-ppcre:create-scanner (mutate-tree p-tree)))))

(defparameter *example-function-phrase*
  "some times I like to \"function (calling all coppers (), another param (), test)\" just to see what happens")

(defun run-examples ()
  "Just runs some examples expected results:

   ((\"function (calling all coppers (), another param (), test)\"
     #(\"(calling all coppers (), another param (), test)\"))
    (\"function (calling all coppers (), another param (), test)\"
     #(\"(calling all coppers (), another param (), test)\"))
    (NIL))

  "
  (flet ((doit (regex)
           (multiple-value-list
            (cl-ppcre:scan-to-strings
             (create-scanner-with-filters regex)
             *example-function-phrase*))))
  (list
   (doit #?r"function\s*(?<parens>)")
   (doit #?r"function\s*(?<parens>([^,]+,)*[^,]+)")
   (doit #?r"function\s*(?<parens>not-matching-at-all)"))))

PS. I don’t claim this is actually worth anything, only that I had fun doing it.

Fun with CL-PPCRE: Counting Parentheses

So traditionally, regular expressions have been… err… regular.  While regular expressions are incredibly useful, being regular is a limitation.  The canonical example of what a regular language cannot do, is counting parentheses.  For example, if I am trying to match an expression with arbitrary nesting, I need a more robust matching solution than a regular language can provide.

I have seen that CommonLisp – Portable Perl-compatible regular expressions (cl-ppcre) supported a parse tree representation of regular expressions.  My hope was that this would allow arbitrary embedding of CL code into the matching structure of the regular expressions.  It does.

To teach myself how to use this tool, I decided to count parentheses.  The basic structure of the parse tree is just a list of symbols that get compiled into specific matching functions.

  • ‘.’ -> :EVERYTHING
  • ‘*’ -> :GREEDY-REPITITION

To embed a lisp function you create a node in that parse-tree named :FILTER whose child is a function of one argument (the position in the string this function should try to match).  This function returns how many charcters it consumes (by position), or nil if it did not match.

Keep in mind that the following sample is not reentrant, but could be made to be so.  Now without further ado, the code:
link to the paste