A while ago I posted about my adventures playing with CL-PPCRE filter functions. In the previous blog post I destructively modify a cl-ppcre parse tree to add a filter function that can handle matching matched pairs of parentheses (a typical example of what regular expressions are NOT capable of). In this post I formalize that example into something that could be more broadly applied with less understanding of the underlying mechanics.
To begin with I define a function create-scanner-with-filters that will handle creating these special scanners for me. My idea is to provide a table of functions that should be called when we see certain strings inside of the regular expression. Because there are already named groups (see *allow-named-registers*) that can have parameters and that CL-PPCRE is already parsing for me, I decided to tie into the named registers to handle my function dispatching. This has the added niceness that whatever your filter matches is going to be stored in a register.
An over view of this process is: parse the regex, replace any named-register nodes’ (that have a function in the table) third element (usually a regex whose match will be stored in a register) with our specialized filter function, compile the new scanner and return that to the end user. I also decided that the regex that is the body of the named group should be available to the filter and in most cases should probably be used as part of the filter function.
If I continue to play with this, I might eventually release it as a library, but for now its stands well on its own.
Without further ado:
(cl-interpol:enable-interpol-syntax) (declaim (optimize (debug 3))) ;; TODO: group binds in body expressions ;; TODO: propogate current scanner options to body scanners (defun make-matched-pair-matcher (open-char close-char) "Will create a regex filter that can match arbitrary pairs of matched characters such as (start (other () some) end)" (lambda (body-regex) (setf body-regex (if (eql body-regex :void) nil (cl-ppcre:create-scanner `(:SEQUENCE :START-ANCHOR ,body-regex :END-ANCHOR)))) (lambda (pos) ;;(format T "TEST3 ~A ~A ~%" cl-ppcre::*reg-starts* cl-ppcre::*reg-ends*) (iter (with fail = nil) (with start = pos) (with cnt = 0) (for c = (char cl-ppcre::*string* pos)) (if (first-iteration-p) (unless (eql c open-char) (return fail)) ;; went past the string without matching (when (>= pos (length cl-ppcre::*string*)) (return fail))) (cond ((eql c open-char) (incf cnt)) ((eql c close-char) (decf cnt) (when (zerop cnt) ;; found our last matching char (if (or (null body-regex) (cl-ppcre:scan body-regex cl-ppcre::*string* :start (+ 1 start) :end pos)) (return (+ 1 pos)) (return fail))))) (incf pos))))) (defun default-dispatch-table () "Creates a default dispatch table with a parens dispatcher that can match pairs of parentheses" `(("parens" . ,(make-matched-pair-matcher #\( #\) )))) (defun create-scanner-with-filters (regex &optional (function-table (default-dispatch-table)) ) "Allows named registers to refer to functions that should be in the place of the named register" (let* ((cl-ppcre:*allow-named-registers* T) (p-tree (cl-ppcre:parse-string regex))) (labels ((dispatcher? (name) "Return the name of the dispatcher from the table if applicable" (cdr (assoc name function-table :test #'string-equal))) (mutate-tree (tree) "Changes the scanner parse tree to include any filter functions specified in the table" (typecase tree (null nil) (atom tree) (list (aif (and (eql :named-register (first tree)) (dispatcher? (second tree))) `(:named-register (second tree) (:filter ,(funcall it (third tree)))) (iter (for item in tree) (collect (mutate-tree item)))))))) ;; mutate the regex to contain our matcher functions ;; then compile it (cl-ppcre:create-scanner (mutate-tree p-tree))))) (defparameter *example-function-phrase* "some times I like to \"function (calling all coppers (), another param (), test)\" just to see what happens") (defun run-examples () "Just runs some examples expected results: ((\"function (calling all coppers (), another param (), test)\" #(\"(calling all coppers (), another param (), test)\")) (\"function (calling all coppers (), another param (), test)\" #(\"(calling all coppers (), another param (), test)\")) (NIL)) " (flet ((doit (regex) (multiple-value-list (cl-ppcre:scan-to-strings (create-scanner-with-filters regex) *example-function-phrase*)))) (list (doit #?r"function\s*(?<parens>)") (doit #?r"function\s*(?<parens>([^,]+,)*[^,]+)") (doit #?r"function\s*(?<parens>not-matching-at-all)"))))
PS. I don’t claim this is actually worth anything, only that I had fun doing it.