
Character escapes in classes

Thanks again to Richard Smith for pointing out that character escapes in classes don't work. Here's a new test case inspired by his bug report; it's slightly more realistic than many of our test cases:

parse foo foo
parse 'foo bar' 'foo,bar'
parse 'foo  bar' 'foo,bar'
parse 'foo  bar' 'foo,bar'
parse 'foo   456    qux' 'foo,bar,qux'

#include <stdlib.h>
#include <string.h>

/* Join a and b with a comma into a newly allocated string. */
char *allocat(char *a, char *b) {
    char *x = malloc(strlen(a) + strlen(b) + 2);
    if (!x) nomem();
    strcpy(x, a); strcat(x, ","); strcat(x, b);
    return x;
}

Words :: char * ← w:Word Space* ws:Words { allocat(w, ws) } / w:Word → w
Word ← (!Space .)+ → { ref_str() }
Space :: void ← [ \t\v\r\n]

It may not be obvious, but the inputs include some tab characters, which should be picked up as a Space by the \t in the character class. Currently, however, pacc treats the tabs as Word characters. I found this surprising, as I was sure that I'd included code to interpret the traditional character escapes (I already knew we don't support the fancy new escapes in C99).

However, on looking at that code again, I realised that it was wrong (and unfortunately I hadn't written any test cases for it): although we correctly recognise that \t means something special, it ends up meaning lower-case t!
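
To make the mistake concrete, here is a minimal sketch of a lookup for the traditional single-character escapes. It is not pacc's actual code, and the name simple_escape is hypothetical; it only illustrates the difference between returning the control character an escape names and returning the letter that follows the backslash.

```c
/* Hypothetical helper, not taken from pacc: map the character that
 * follows a backslash to the value the escape denotes.
 * Returns -1 if c does not name a simple escape. */
int simple_escape(int c) {
    switch (c) {
    case 'a':  return '\a';
    case 'b':  return '\b';
    case 'f':  return '\f';
    case 'n':  return '\n';
    case 'r':  return '\r';
    case 't':  return '\t';   /* the bug amounts to yielding 't' here */
    case 'v':  return '\v';
    case '\\': return '\\';
    case '\'': return '\'';
    case '"':  return '"';
    case '?':  return '?';
    default:   return -1;     /* not a simple escape */
    }
}
```

With this mapping, simple_escape('t') yields the tab character rather than the letter t, which is what the Space rule in the test case relies on.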

Thinking about it some more, I've taken rather a wrong turn in handling character escapes. I've started building my own compiler for them: in s_range() we put a number into the parse tree. While this could work, it's a lot of effort which we can instead hand off to the C compiler! Even better, this should mean that to get all the fancy escapes, we only have to recognise them.

Well, that's easy, and works beautifully. Almost. Unfortunately, although turning [\t] into '\t' is a great idea, turning [€] into '€' isn't so good: it elicits a warning from gcc about multi-byte character constants.
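
The warning makes sense once you count bytes: assuming the grammar is UTF-8 encoded, € occupies three bytes, so '€' is a character constant with several characters in it. The sketch below uses a hypothetical helper, fits_char_constant, to show the distinction; the euro sign is spelled out byte by byte so the example does not depend on the encoding of the source file.

```c
#include <string.h>

/* Hypothetical check, not part of pacc: a class member can be emitted
 * as a plain 'x' character constant only if it is a single byte. */
int fits_char_constant(const char *member) {
    return strlen(member) == 1;
}

/* U+20AC (euro sign) encoded as UTF-8: three bytes. */
static const char euro_utf8[] = "\xe2\x82\xac";
```

Here fits_char_constant("\t") is true, while fits_char_constant(euro_utf8) is false, which is why handing everything to the C compiler as a character constant is not enough.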

So, back to making s_range() work. In fact, the task of interpreting C character escapes is not at all hard, even including the exotic new ones. I added about 20 lines to s_range(), and there are some extra rules in pacc.pacc which were largely copied directly from the syntax in the standard. As usual, one advantage of PEGs shines through: the C standard has a disambiguating “semantic” rule:

Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.

PEGs avoid ambiguity, though, and the maximal munch rule is expressed in the grammar itself:

OctalEscape ← "\\" [0-7][0-7]?[0-7]?
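
For comparison, the same longest-match behaviour can be sketched in hand-written C. This is an illustration only, not the code added to s_range(); the function name octal_escape and its interface are assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical helper, not pacc's s_range(): decode an octal escape.
 * s points just past the backslash. Consumes one to three octal
 * digits, always taking as many as are available (maximal munch).
 * Stores the number of digits consumed in *len (0 if s does not
 * start with an octal digit) and returns the decoded value. */
unsigned octal_escape(const char *s, size_t *len) {
    unsigned value = 0;
    size_t n = 0;

    while (n < 3 && s[n] >= '0' && s[n] <= '7') {
        value = value * 8 + (unsigned)(s[n] - '0');
        n++;
    }
    *len = n;
    return value;
}
```

The loop bound of three digits and the digit test play the role of [0-7][0-7]?[0-7]? in the grammar rule: given "1018", the function consumes "101" and stops, and given "18" it consumes only "1".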

Last updated: 2015-05-24 19:45:25 UTC


