
7. Regular expressions

Unix is cursed with a number of incompatible regular expression syntaxes, used by different programs and supporting different features. Shell globbing patterns are the most frequently used. They are simple and terse, but they are not fully general regular expressions. Q's solution is to extend the conventional globbing syntax, as the Korn Shell does.

These are the most important special characters:

`?'
Matches any single character except `\0' or `\n'. When matching against file names, also does not match a `/', nor an initial `.'.

`*'
Matches any number of characters that might match `?'. Also, remembers the matched characters for use by replacement commands.

`*(pattern)'
Matches any number of instances of the pattern, including none.

`"chars"'
Match the quoted characters. C-style escapes are recognized.

`\X'
Match the character X exactly.

`(pattern)'
Grouping. Same as pattern, but also remembers the matched characters for use by replacement commands.

`pattern1|pattern2'
Match either pattern.

`[charset]'
Standard character sets.

`[:keyword:]'
Extension loophole. Do some special matching operation depending on the keyword.

The builtin function match tries to match a string against a quoted pattern:
 
"abcd" match a*d  # Succeeds
match also accepts an optional second pattern, the replacement, in which `*' stands for the characters matched by `*' in the first pattern:
 
"abcd" match a*d A*D
==> "AbcD"
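
The replacement behaviour can be imitated outside Q. The following is a minimal Python sketch, not Q's implementation: it assumes a pattern made only of literals, `?' and `*', and the names glob_to_regex, matches and match_replace are hypothetical. Each `*' is turned into a capture group, and each `*' in the replacement is filled from those groups in order.

```python
import re

def glob_to_regex(pattern):
    """Translate literals, `?` and `*` into a compiled regular expression.

    Each `*` becomes a capture group so the matched text can be reused.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("(.*)")
        elif ch == "?":
            parts.append(".")  # any single character except newline
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def matches(text, pattern):
    """True if the whole of `text` matches `pattern`."""
    return glob_to_regex(pattern).fullmatch(text) is not None

def match_replace(text, pattern, replacement):
    """Return `replacement` with each `*` filled in, or None on failure."""
    m = glob_to_regex(pattern).fullmatch(text)
    if m is None:
        return None
    captured = iter(m.groups())
    # Replacement stars beyond the number of captures become empty strings.
    return "".join(next(captured, "") if ch == "*" else ch for ch in replacement)

print(matches("abcd", "a*d"))               # True
print(match_replace("abcd", "a*d", "A*D"))  # AbcD
```

Grouping with parentheses, alternation and character sets are left out of the sketch to keep it short.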



7.1 glob

The glob function takes a single string, interprets it as a globbing pattern, and returns a sorted vector of matching file names. The result is the empty vector if there are no matches.

One unique feature of glob is that it knows how to search through an unbounded number of sub-directories. To find every Makefile in any sub-directory of dld-* do:
 
glob "dld-*/*(*/)Makefile"

The algorithm works by scanning the filenames in a directory. Each filename (prefixed with the name of the current directory) is matched against the pattern. If the pattern matches the entire filename, we have found a match. Otherwise, the regular expression matcher has been modified to distinguish two kinds of failure. A prefix-partial-match happens when the matcher runs out of characters in the candidate: the candidate is not a valid match, but it might be a prefix of one. In that case, if the candidate names a directory, we recursively scan that directory. Any other kind of match failure tells us to give up on that particular file.
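
The scan described above can be sketched in Python. This is an illustration under stated assumptions, not Q's matcher: it handles only literals, `?', `*' and `*(pattern)' with balanced parentheses, it ignores the rule about an initial `.', and the names parse, ends, classify and q_glob are hypothetical. The matcher records whether it ever ran out of candidate characters, which is what separates a prefix-partial-match from an outright failure.

```python
import os

FULL, PARTIAL, FAIL = "full", "partial", "fail"

def parse(pattern):
    """Parse into nodes: ("lit", c), ("any",), ("star",), ("rep", [nodes])."""
    nodes, i = [], 0
    while i < len(pattern):
        c = pattern[i]
        if c == "*" and pattern[i + 1:i + 2] == "(":
            depth, j = 1, i + 2
            while depth:  # find the matching ")"; assumes balanced parentheses
                depth += {"(": 1, ")": -1}.get(pattern[j], 0)
                j += 1
            nodes.append(("rep", parse(pattern[i + 2:j - 1])))
            i = j
            continue
        nodes.append(("star",) if c == "*" else ("any",) if c == "?" else ("lit", c))
        i += 1
    return nodes

def ends(nodes, s, pos, state):
    """Yield every position at which `nodes` can finish matching s[pos:]."""
    if not nodes:
        yield pos
        return
    head, rest = nodes[0], nodes[1:]
    if head[0] in ("lit", "any"):
        if pos == len(s):
            state["ran_out"] = True  # wanted a character, candidate ended
        elif (head[0] == "lit" and s[pos] == head[1]) or \
             (head[0] == "any" and s[pos] != "/"):
            yield from ends(rest, s, pos + 1, state)
    elif head[0] == "star":
        p = pos
        while True:  # `*` takes zero or more characters, never a "/"
            yield from ends(rest, s, p, state)
            if p == len(s) or s[p] == "/":
                break
            p += 1
    else:  # "rep": zero instances, or one instance followed by more
        yield from ends(rest, s, pos, state)
        for mid in ends(head[1], s, pos, state):
            if mid > pos:  # skip empty instances to guarantee progress
                yield from ends(nodes, s, mid, state)

def classify(nodes, candidate):
    """Return FULL, PARTIAL (prefix-partial-match) or FAIL."""
    state = {"ran_out": False}
    if any(end == len(candidate) for end in ends(nodes, candidate, 0, state)):
        return FULL
    return PARTIAL if state["ran_out"] else FAIL

def q_glob(pattern, root="."):
    """Return sorted names under `root` matching `pattern`, pruning the scan."""
    nodes, found = parse(pattern), []
    def scan(directory, prefix):
        for name in os.listdir(directory):
            candidate, path = prefix + name, os.path.join(directory, name)
            outcome = classify(nodes, candidate)
            if outcome == FULL:
                found.append(candidate)
            elif outcome == PARTIAL and os.path.isdir(path):
                scan(path, candidate + "/")  # only descend on a partial match
    scan(root, "")
    return sorted(found)

nodes = parse("dld-*/*(*/)Makefile")
print(classify(nodes, "dld-3.2"))           # partial
print(classify(nodes, "src"))               # fail
print(classify(nodes, "dld-3.2/Makefile"))  # full
```

Because a directory such as src fails outright on its first character, the sketch never lists its contents. That pruning is the property the text relies on.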

Note that the above example runs 2-3 times faster than GNU find on:
 
find . -regex "./dld-.*Makefile" -print
The reason is that find has to look into every sub-directory of ., while Q's glob only looks at sub-directories matching dld-*.



7.2 globlist

The function globlist does "shell-style" globbing, using the routine glob. It takes a vector of strings, does tilde-expansion, and calls glob on each pattern. Any empty result from glob is replaced by a one-element vector containing the original pattern, but with quotes and parentheses removed. To find every Makefile in any sub-directory of dld-* do:
 
glob "dld-*/*(*/)Makefile"
Note that the above example is 2-3 times faster than GNU find, because find has to look into every sub-directory of ., while Q's glob only looks at sub-directories matching dld-*. (See full paper for Q's algorithm.)



7.3 globlist

globlist takes a vector of strings, does tilde-expansion, and calls glob on each pattern. When a pattern has no match, its result is replaced by the original pattern, but with quotes and parentheses removed. The per-pattern results are concatenated into one vector.
 
globlist (quote x(3+4)y p"ar"se* f(oo).*)
might yield:
 
["x3+4y" "parserule.o" "parsemacros.o" "parse.o" "foo.*"]
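
The same control flow can be sketched in Python. This is an approximation, not Q's code: Python's glob.glob stands in for Q's glob and does not understand Q's quoting or `*(pattern)' syntax, and the expand parameter is a hypothetical hook added so that another expander can be substituted.

```python
import glob
import os
import re

def globlist(patterns, expand=glob.glob):
    """Expand each pattern and concatenate the results into one list."""
    result = []
    for pattern in patterns:
        # Tilde-expansion first, then globbing.
        found = sorted(expand(os.path.expanduser(pattern)))
        if not found:
            # No match: keep the original pattern, minus quotes and parentheses.
            found = [re.sub(r'["()]', "", pattern)]
        result.extend(found)
    return result

# A pattern that matches nothing comes back stripped of quotes and parentheses.
print(globlist(["f(oo).*"], expand=lambda pattern: []))  # ['foo.*']
```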

Here is a no-frills implementation of echo:
 
Q1> :(macro echo :X@)= parse "__echo (quote " X@ ")"
Q2> :(__echo :L)= sprintf "%{%s%^ %}\n" (globlist L)
Q3> echo parse*
parserule.o parsemacros.o parse.o
The function __echo does the actual work: It calls globlist to do globbing, and then concatenates the results together using the sprintf routine. (The "%{...%}" format directives loop over a sequence, just like Common Lisp's ~{...~} directives.)
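
For readers who do not know the Common Lisp directives, the effect of the format string can be approximated in Python. The sketch assumes that "%^" ends the loop body after the last element, as ~^ does in Common Lisp, so the space appears only between words. The name echo_line is hypothetical.

```python
def echo_line(words):
    # Words separated by single spaces, followed by one newline.
    return " ".join(words) + "\n"

print(echo_line(["parserule.o", "parsemacros.o", "parse.o"]), end="")
```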

In Unix, the shell traditionally does globbing. This is usually convenient, but sometimes the standard expansion is inappropriate, as with the patterns passed to grep and find. Non-Unix systems may instead leave globbing under application control, which provides more flexibility. The Q approach provides the same flexibility in a Unix framework.

As an example, consider ren, an intelligent (and simplified) mv:
 
:(__ren :src :dst)=(
  :X=(glob src)
  {run mv $(X?) $(X? match $src $dst)} do)
:(macro ren :args@) = parse "__ren (quote " args@ ")@"
The __ren routine takes two patterns. It first finds the filenames matching the first pattern. Then, for each match, it runs mv with the matching filename X? and the new name X? match $src $dst.

The ren macro allows you to write:
 
ren *.c.BAK BAK/*.c
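
The renaming step can be sketched in Python as a dry run that only computes the (old, new) pairs. This is an illustration, not Q's code: it assumes patterns made only of literals and `*', takes the candidate names as a list instead of calling glob, and the names star_regex and plan_renames are hypothetical.

```python
import re

def star_regex(pattern):
    """Compile a pattern of literals and `*`; each `*` is a capture group."""
    return re.compile("".join("(.*)" if ch == "*" else re.escape(ch) for ch in pattern))

def plan_renames(names, src, dst):
    """Return (old, new) pairs for every name that matches `src`."""
    src_re = star_regex(src)
    plan = []
    for name in names:
        m = src_re.fullmatch(name)
        if m is None:
            continue  # not selected by the source pattern
        captured = iter(m.groups())
        # Fill each `*` of the destination pattern with the captured text.
        new_name = "".join(next(captured, "") if ch == "*" else ch for ch in dst)
        plan.append((name, new_name))
    return plan

for old, new in plan_renames(["foo.c.BAK", "bar.c.BAK", "notes.txt"], "*.c.BAK", "BAK/*.c"):
    print(old, "->", new)
# foo.c.BAK -> BAK/foo.c
# bar.c.BAK -> BAK/bar.c
```

Each pair could then be passed to os.rename or shutil.move to perform the move.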



This document was generated by Per Bothner on December 4, 2001 using texi2html