
Unicode character classes

The development version of pacc now supports Unicode characters in character classes. (As always, UTF-8 is currently the only supported encoding.)

There is not much to say about it, really. Characters in classes were (ultimately) matched by the any matcher (.), so we were already reading Unicode characters from the input correctly. Emitting the right code meant adding a decoder that runs whilst building the AST, and adding support for numeric nodes, which turned out to be trivial.

Previously we had a single node type, crange, which held either a character class equality, a range start, or a range end; the three were distinguished by their first character. This was a rather ugly scheme. Replacing crange with three node types (cceq, ccge, and ccle) seemed extravagant, but in fact it considerably tidied up the code in emit.c.
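To make the three-node scheme concrete, here is a minimal sketch of how a class could be matched once its members are numeric code points. The struct layout, the enum names, the cc_match() function, and the convention that a range start is immediately followed by its range end are all assumptions made for illustration; this is not the code pacc generates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node kinds, named after the node types in the post. */
enum cc_kind { CCEQ, CCGE, CCLE };

struct cc_node {
    enum cc_kind kind;
    uint32_t cp;            /* a numeric code point, not a byte */
};

/* Returns 1 if code point c is in the class, 0 otherwise.
 * Assumption: every CCGE node is immediately followed by the CCLE
 * node that closes its range. */
static int cc_match(const struct cc_node *nodes, size_t n, uint32_t c)
{
    size_t i = 0;

    while (i < n) {
        switch (nodes[i].kind) {
        case CCEQ:
            if (c == nodes[i].cp) return 1;
            i += 1;
            break;
        case CCGE:
            assert(i + 1 < n && nodes[i + 1].kind == CCLE);
            if (c >= nodes[i].cp && c <= nodes[i + 1].cp) return 1;
            i += 2;
            break;
        case CCLE:
            /* A range end with no range start is malformed in this sketch. */
            assert(0 && "CCLE without a preceding CCGE");
            return 0;
        }
    }
    return 0;
}
```

With one kind per node, each case is a plain comparison, and nothing has to inspect the first character of a string to work out what the node means.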

One task remains: we emit exactly the same code to read the next Unicode character from the input for every any matcher and every cclass matcher. This is obviously silly. Moving that code into a function will tidy things up considerably, and will also make it more feasible to support encodings other than UTF-8 one day.
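As a rough idea of what such a shared function might look like, the sketch below decodes one UTF-8 sequence and reports how many bytes it consumed. The name utf8_next() and its signature are hypothetical, and the strict rejection of overlong forms and surrogates is a choice made for this example rather than a description of pacc's behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s[0..len).
 * On success, stores the code point in *cp and returns the number of
 * bytes consumed (1 to 4). Returns 0 if the input is empty, truncated,
 * or malformed. */
static size_t utf8_next(const unsigned char *s, size_t len, uint32_t *cp)
{
    uint32_t c, min;
    size_t n, i;

    assert(s != NULL && cp != NULL);
    if (len == 0) return 0;

    c = s[0];
    if (c < 0x80) { *cp = c; return 1; }            /* ASCII */
    else if ((c & 0xE0) == 0xC0) { n = 2; c &= 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { n = 3; c &= 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { n = 4; c &= 0x07; min = 0x10000; }
    else return 0;      /* stray continuation byte or invalid lead byte */

    if (len < n) return 0;                          /* truncated sequence */
    for (i = 1; i < n; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;        /* not a continuation */
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;

    *cp = c;
    return n;
}
```

Both kinds of matcher could then call the one function and advance by its return value. Supporting another encoding would mean swapping this function out rather than changing every place where the reading code is emitted.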

Last updated: 2015-05-24 19:45:27 UTC

