
UTF-8 Support

Just about the last thing left on the list. Of course, for the most part pacc already supports UTF-8: its own grammar uses characters like ← and ε with no problems. So long as such characters occur in literal strings, they work just fine (on the assumption that all the world's UTF-8).

It's not my intention to do anything about the all-the-world's-UTF-8 assumption yet. I'll need to take advice on that from the userbase (when there is one!), as I am no expert.

So the infelicities that need to be fixed for the first release are the Any matcher (which is supposed to match a single character, but in fact matches a single byte) and Character Classes. I don't actually know how to approach the latter yet, but I think I can do Any, so let's knock that off first, starting with a test case, of course...

Also, we get the wrong answers for column counts in coords due to encoding issues.
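To make the column problem concrete, here is a minimal sketch of counting columns in characters rather than bytes. It is not pacc's actual _pacc_coords() code: the function name utf8_column is hypothetical, and it assumes the input is valid UTF-8, that lines end with '\n', and that columns are 1-based. The trick is simply to skip continuation bytes (those of the form 10xxxxxx) when counting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of pacc): returns the 1-based column of the
 * byte at offset `off` in `buf`, counted in code points rather than bytes.
 * Assumes `buf` holds valid UTF-8 and that lines are terminated by '\n'. */
size_t utf8_column(const char *buf, size_t off)
{
    size_t i, start = off, col = 1;

    assert(buf != NULL);

    /* Back up to the first byte of the current line. */
    while (start > 0 && buf[start - 1] != '\n')
        --start;

    /* Every byte that is not a continuation byte (10xxxxxx) starts a new
     * character, so only those bytes advance the column. */
    for (i = start; i < off; ++i)
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            ++col;

    return col;
}
```

With the input a←b, the b sits at byte offset 4 but in column 3; a byte-based count would put it in column 5.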

Right, so Any matching works, thanks to the genius decoder by Björn Höhrmann.
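For illustration only, here is roughly what Any has to decide at each position: how many bytes make up the next character. This is a simple branching sketch, not Höhrmann's table-driven decoder and not the code that went into pacc; the name utf8_char_len is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not pacc's code): returns the length in bytes (1..4)
 * of the well-formed UTF-8 sequence starting at s[0], or 0 if the bytes are
 * not valid UTF-8 or the sequence would run past the n available bytes. */
size_t utf8_char_len(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned long cp;

    assert(s != NULL);
    if (n == 0)
        return 0;

    if (s[0] < 0x80)
        return 1;                                   /* plain ASCII */
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) { len = 2; cp = s[0] & 0x1F; }
    else if (s[0] >= 0xE0 && s[0] <= 0xEF) { len = 3; cp = s[0] & 0x0F; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { len = 4; cp = s[0] & 0x07; }
    else
        return 0;   /* stray continuation byte or impossible lead byte */

    if (len > n)
        return 0;   /* truncated sequence */

    /* Each following byte must be a continuation byte (10xxxxxx). */
    for (i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong encodings, surrogates, and values above U+10FFFF. */
    if (len == 3 && cp < 0x800)
        return 0;
    if (len == 4 && cp < 0x10000)
        return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp > 0x10FFFF)
        return 0;

    return len;
}
```

A byte-oriented Any always advances by 1; a character-oriented Any would advance by the value returned here, so ← (three bytes) and ε (two bytes) each count as a single match.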

And I'm pretty sure I can see how to apply that to Character Classes too, but I'm too tired to start on that now.

On another note, whilst writing the tutorial, I fairly early hit the problem of explaining to the punters why the first pacc parser they write gives answers for yessir and noway as well as the expected inputs. I thought I had some notes from ages ago about the possibility of anchoring the parse to the end of the string by default, but I can't find them now. I think I threw it out on grounds of compatibility with other PEG parsers.

But I'm going to bring it back in again on the grounds of: it's the right thing to do. Sure, it's easy enough to anchor explicitly, with a rule such as End ← !. (that is, "not Any"). But if we anchor by default, it's just as easy to unanchor with End ← .* instead. And every real-world grammar wants to be anchored; we anchor at the left end, don't we? So I'm going to do that, before the release. Sure, that may not be how any other PEG parser in the world works, but they're wrong. :-)
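The difference between the two behaviours can be sketched in a few lines of C. This is a hand-written stand-in, not output from pacc: it assumes the tutorial grammar accepts just yes or no, and the names match_answer, parse_unanchored and parse_anchored are invented for the example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a generated parser: tries to match "yes" or "no"
 * at the start of `input` and returns the number of bytes consumed, or -1 if
 * neither alternative matches. */
int match_answer(const char *input)
{
    assert(input != NULL);
    if (strncmp(input, "yes", 3) == 0)
        return 3;
    if (strncmp(input, "no", 2) == 0)
        return 2;
    return -1;
}

/* Unanchored: succeeds as soon as some prefix of the input matches. */
int parse_unanchored(const char *input)
{
    return match_answer(input) >= 0;
}

/* Anchored: additionally requires that nothing is left over, which is the
 * effect of ending the start rule with !. in the grammar. */
int parse_anchored(const char *input)
{
    int n = match_answer(input);
    return n >= 0 && input[n] == '\0';
}
```

Here parse_unanchored accepts yessir and noway because a prefix matches, whereas parse_anchored rejects both and still accepts yes and no.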

Last updated: 2015-05-24 19:45:29 UTC


Porting and packaging

One thing pacc needs is more users. And, perhaps, one way to get more users is to reduce the friction in getting started with pacc. An obvious lubricant is packaging.

Release relief

Looking at _pacc_coords(), I noticed that it seemed to have the same realloc() bug that I'd just fixed in _pacc_result(). However, the "list of arrays" trick really wasn't going to work here.
