Introducing Recursive-Regex

Recursive-Regex is the end result of a weekend of playing with the code I published on Thursday about adding named dispatch functions to CL-PPCRE regular expressions. I kept at it and I think that this approach might have some promise for building up a library of reusable regexp/matcher chunks. I also found that this made it somewhat easier to obtain results from the regular expression search because I get back a full parse tree rather than the bindings typically supplied by CL-PPCRE.

I have it somewhat documented, loadable and testable, with all my current tests passing. There is even a recursive regex csv parser defined in the default dispatch table (mostly as a simple, but practical proof of concept).

Comma-List: [\t ]*(?:(?<body>[^,]*)[\t ]*,)*[\t ]*(?<body>[^,]*)[\t ]*
CSV-Row: (?<comma-list>((?<double-quotes>)|[^\n,]*))(?:\n|$)
CSV-File: (?<csv-row>)*

Double quotes and body both go to custom dispatcher functions. Body defines where the body regex should be matched and what to use if no body is supplied.

I don’t really have long term plans for this project, but it scratched an intellectual itch I was experiencing. Perhaps it will be useful for someone down the road.

Even More Fun With CL-PPCRE Filter Functions

A while ago I posted about my adventures playing with CL-PPCRE filter functions. In the previous blog post I destructively modify a cl-ppcre parse tree to add a filter function that can handle matching matched pairs of parentheses (a typical example of what regular expressions are NOT capable of). In this post I formalize that example into something that could be more broadly applied with less understanding of the underlying mechanics.

To begin with I define a function create-scanner-with-filters that will handle creating these special scanners for me. My idea is to provide a table of functions that should be called when we see certain strings inside of the regular expression. Because there are already named groups (see *allow-named-registers*) that can have parameters and that CL-PPCRE is already parsing for me, I decided to tie into the named registers to handle my function dispatching. This has the added niceness that whatever your filter matches is going to be stored in a register.

An over view of this process is: parse the regex, replace any named-register nodes’ (that have a function in the table) third element (usually a regex whose match will be stored in a register) with our specialized filter function, compile the new scanner and return that to the end user. I also decided that the regex that is the body of the named group should be available to the filter and in most cases should probably be used as part of the filter function.

If I continue to play with this, I might eventually release it as a library, but for now its stands well on its own.

Without further ado:

(cl-interpol:enable-interpol-syntax)
(declaim (optimize (debug 3)))

;; TODO: group binds in body expressions
;; TODO: propogate current scanner options to body scanners

(defun make-matched-pair-matcher (open-char close-char)
  "Will create a regex filter that can match arbitrary pairs of matched characters
   such as (start (other () some) end)"
  (lambda (body-regex)
    (setf body-regex (if (eql body-regex :void)
                         nil
                         (cl-ppcre:create-scanner
                          `(:SEQUENCE :START-ANCHOR ,body-regex :END-ANCHOR))))
    (lambda (pos)
      ;;(format T "TEST3 ~A ~A ~%" cl-ppcre::*reg-starts* cl-ppcre::*reg-ends*)
      (iter
        (with fail = nil)
        (with start = pos)
        (with cnt = 0)
        (for c = (char cl-ppcre::*string* pos))
        (if (first-iteration-p)
            (unless (eql c open-char) (return fail))
            ;; went past the string without matching
            (when (>= pos (length cl-ppcre::*string*))
              (return fail)))
        (cond
          ((eql c open-char) (incf cnt))
          ((eql c close-char)
           (decf cnt)
           (when (zerop cnt) ;; found our last matching char
             (if (or (null body-regex)
                     (cl-ppcre:scan body-regex cl-ppcre::*string*
                                    :start (+ 1 start)
                                    :end pos))
                 (return (+ 1 pos))
                 (return fail)))))
        (incf pos)))))

(defun default-dispatch-table ()
  "Creates a default dispatch table with a parens dispatcher that can match
   pairs of parentheses"
  `(("parens" . ,(make-matched-pair-matcher #\( #\) ))))

(defun create-scanner-with-filters
    (regex &optional (function-table (default-dispatch-table)) )
  "Allows named registers to refer to functions that should be in
   the place of the named register"
  (let* ((cl-ppcre:*allow-named-registers* T)
         (p-tree (cl-ppcre:parse-string regex)))
    (labels ((dispatcher? (name)
               "Return the name of the dispatcher from the table if
                applicable"
               (cdr (assoc name function-table :test #'string-equal)))
             (mutate-tree (tree)
               "Changes the scanner parse tree to include any filter
                functions specified in the table"
               (typecase tree
                 (null nil)
                 (atom tree)
                 (list
                  (aif (and (eql :named-register (first tree))
                            (dispatcher? (second tree)))
                       `(:named-register (second tree)
                         (:filter ,(funcall it (third tree))))
                       (iter (for item in tree)
                         (collect (mutate-tree item))))))))
      ;; mutate the regex to contain our matcher functions
      ;; then compile it
      (cl-ppcre:create-scanner (mutate-tree p-tree)))))

(defparameter *example-function-phrase*
  "some times I like to \"function (calling all coppers (), another param (), test)\" just to see what happens")

(defun run-examples ()
  "Just runs some examples expected results:

   ((\"function (calling all coppers (), another param (), test)\"
     #(\"(calling all coppers (), another param (), test)\"))
    (\"function (calling all coppers (), another param (), test)\"
     #(\"(calling all coppers (), another param (), test)\"))
    (NIL))

  "
  (flet ((doit (regex)
           (multiple-value-list
            (cl-ppcre:scan-to-strings
             (create-scanner-with-filters regex)
             *example-function-phrase*))))
  (list
   (doit #?r"function\s*(?<parens>)")
   (doit #?r"function\s*(?<parens>([^,]+,)*[^,]+)")
   (doit #?r"function\s*(?<parens>not-matching-at-all)"))))

PS. I don’t claim this is actually worth anything, only that I had fun doing it.

Fun with CL-PPCRE: Counting Parentheses

So traditionally, regular expressions have been… err… regular.  While regular expressions are incredibly useful, being regular is a limitation.  The canonical example of what a regular language cannot do, is counting parentheses.  For example, if I am trying to match an expression with arbitrary nesting, I need a more robust matching solution than a regular language can provide.

I have seen that CommonLisp – Portable Perl-compatible regular expressions (cl-ppcre) supported a parse tree representation of regular expressions.  My hope was that this would allow arbitrary embedding of CL code into the matching structure of the regular expressions.  It does.

To teach myself how to use this tool, I decided to count parentheses.  The basic structure of the parse tree is just a list of symbols that get compiled into specific matching functions.

  • ‘.’ -> :EVERYTHING
  • ‘*’ -> :GREEDY-REPITITION

To embed a lisp function you create a node in that parse-tree named :FILTER whose child is a function of one argument (the position in the string this function should try to match).  This function returns how many charcters it consumes (by position), or nil if it did not match.

Keep in mind that the following sample is not reentrant, but could be made to be so.  Now without further ado, the code:
link to the paste

Yay Memes

I have two Debian boxes that I do most of my work on (remotely from my desktop)

russ@sekhmet:~$ history|awk ‘{print $2}’|awk ‘BEGIN {FS=”|”} {print $1}’|sort|uniq -c|sort -rn|head -10
82 sudo
58 cd
56 darcs
43 svk
35 psql
35 ll
26 fg
22 ls
16 sbcl
13 su
russ@mahes:~$ history|awk ‘{print $2}’|awk ‘BEGIN {FS=”|”} {print $1}’|sort|uniq -c|sort -rn|head -10
213 sudo
102 darcs
41 ls
37 fg
34 cd
8 ll
7 python
6 svn
5 ruby
5 rm
I am guessing that quite a bit of the sudo was in fact sudo python or sudo emacs which I use extensivly in system administration. I dont even ever remember running ruby so not sure how it made its way onto the list (I don’t even know the language)
via b. clementson