Unicode character classes

The development version of pacc now supports Unicode characters in character classes. (As always, UTF-8 is currently the only supported encoding.)

There is not much to say about it, really. Characters in classes were already (ultimately) matched by the any matcher, ".", so we were already reading Unicode characters from the input correctly. Emitting the right code meant adding a UTF-8 decoder whilst building the AST, and adding support for numeric nodes, which turned out to be trivial.
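To give a flavour of what a decoder of this kind involves, here is a minimal sketch in C. It is not pacc's actual code: the name utf8_decode, its signature, and the decision to reject overlong forms and surrogates are all assumptions made for illustration. It turns the bytes of one UTF-8 sequence into a numeric code point, which is the sort of value a numeric node could then carry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from pacc. Decodes one UTF-8 sequence
 * from at most n bytes starting at s. On success, stores the code point
 * in *cp and returns the number of bytes consumed; returns 0 if the
 * input is empty, truncated, or malformed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    assert(s != NULL && cp != NULL);
    if (n == 0) return 0;

    /* The lead byte determines the sequence length and the first payload bits. */
    if (s[0] < 0x80)                { len = 1; c = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (len > n) return 0;          /* truncated sequence */

    /* Each continuation byte is 10xxxxxx and contributes six bits. */
    for (i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (c < min_cp[len]) return 0;              /* overlong encoding */
    if (c > 0x10FFFF) return 0;                 /* beyond the Unicode range */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;   /* surrogate code point */

    *cp = c;
    return len;
}
```

Under these assumptions, decoding the three bytes E2 82 AC consumes all three and yields the code point 0x20AC.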

Previously we had a single node type, crange, which held either a character class equality, a range start, or a range end; the three cases were distinguished by their first character. This was a rather ugly scheme. Replacing crange with three node types, cceq, ccge, and ccle, seemed extravagant, but in fact it considerably tidied up the code in emit.c.
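The following C sketch illustrates why three numeric node types are easy to work with. Only the names cceq, ccge, and ccle come from pacc; the struct layout, the function cc_match, and the convention that a range is a ccge immediately followed by a ccle are assumptions for illustration. pacc emits code rather than interpreting nodes like this, so treat it as a model of the comparisons involved, not of emit.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Node type names follow the post; everything else here is hypothetical. */
enum cc_type { cceq, ccge, ccle };

struct cc_node {
    enum cc_type type;
    uint32_t cp;        /* numeric code point to compare against */
};

/* Returns 1 if code point c is in the class described by nodes[0..n-1],
 * and 0 otherwise. Assumed layout: a cceq stands alone, and a range is
 * written as a ccge (lower bound) immediately followed by a ccle (upper
 * bound). */
int cc_match(const struct cc_node *nodes, size_t n, uint32_t c)
{
    size_t i = 0;

    assert(nodes != NULL || n == 0);
    while (i < n) {
        if (nodes[i].type == cceq) {
            if (c == nodes[i].cp) return 1;
            i += 1;
        } else if (nodes[i].type == ccge && i + 1 < n
                   && nodes[i + 1].type == ccle) {
            if (c >= nodes[i].cp && c <= nodes[i + 1].cp) return 1;
            i += 2;
        } else {
            return 0;   /* not a well-formed class under the assumed layout */
        }
    }
    return 0;
}
```

With this layout, a class covering Greek lower-case alpha to omega plus an underscore would be the three nodes { ccge, 0x3B1 }, { ccle, 0x3C9 }, { cceq, '_' }. Each node is a plain integer comparison, with no need to inspect the first character of a string to find out what kind of test it is.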

One task remains: we output exactly the same code to read the next Unicode character from the input for every any matcher and every cclass matcher. This is obviously silly. Putting that code into a function will tidy things up considerably, and will also make it more feasible to support encodings other than UTF-8 one day.
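As a rough sketch of where this could lead, the C below routes both kinds of matcher through a single reader function, so that changing the encoding means changing one function rather than every matcher. None of it is pacc code: the read_char_fn type, both readers, and both matchers are hypothetical, and Latin-1 and UTF-16LE were chosen only as examples of encodings other than UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interface: decode one character from at most n bytes at s,
 * store it in *cp, and return the bytes consumed (0 on failure). */
typedef size_t (*read_char_fn)(const unsigned char *s, size_t n, uint32_t *cp);

/* ISO 8859-1: each byte is the code point with the same value. */
size_t read_latin1(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n < 1) return 0;
    *cp = s[0];
    return 1;
}

/* UTF-16LE: one 16-bit unit, or a high/low surrogate pair. */
size_t read_utf16le(const unsigned char *s, size_t n, uint32_t *cp)
{
    uint32_t hi, lo;

    if (n < 2) return 0;
    hi = (uint32_t)s[0] | ((uint32_t)s[1] << 8);
    if (hi < 0xD800 || hi > 0xDFFF) { *cp = hi; return 2; }
    if (hi > 0xDBFF || n < 4) return 0;   /* lone low surrogate, or truncated */
    lo = (uint32_t)s[2] | ((uint32_t)s[3] << 8);
    if (lo < 0xDC00 || lo > 0xDFFF) return 0;
    *cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    return 4;
}

/* "any" matcher: accepts any one character; returns bytes consumed. */
size_t match_any(read_char_fn rd, const unsigned char *s, size_t n)
{
    uint32_t cp;

    assert(rd != NULL);
    return rd(s, n, &cp);
}

/* Range matcher sharing the same reader; returns bytes consumed, or 0. */
size_t match_range(read_char_fn rd, const unsigned char *s, size_t n,
                   uint32_t lo, uint32_t hi)
{
    uint32_t cp;
    size_t len;

    assert(rd != NULL);
    len = rd(s, n, &cp);
    return (len != 0 && cp >= lo && cp <= hi) ? len : 0;
}
```

The matchers never look at raw bytes themselves, which is the point of the exercise: the decoding logic lives in exactly one place per encoding.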

Last updated: 2015-05-24 19:45:27 UTC


