Character Classes

2010-11-22

At the moment I'm thinking about character classes. It seems to me that we must add at least two node types to the grammar: cclass introduces a class, and its children, of type crange, list the various ranges within the class. The value of the cclass node itself is null for a normal character class, and anything non-null (^, say) for an inverted character class.

The crange node's value consists of a type byte, which is >, =, or <, followed by the character in question. So for example the character class [A-Z_] would appear in the tree as:

cclass: (null)
  crange: ">A"
  crange: "<Z"
  crange: "=_"

Code emission is fairly obvious I think.
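As a rough illustration of the test that the emitted code has to perform, here is a minimal sketch in C. It is not pacc's emitter: struct crange and cclass_match() are hypothetical names, and the reading of the type bytes (> opens a range with "at least", the < entry that follows closes it with "at most", = is an exact match) is an assumption taken from the [A-Z_] example above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory form of one crange node: a type byte followed by
 * the character, as in ">A", "<Z", "=_". */
struct crange {
    char type;  /* '>', '<', or '=' */
    int ch;     /* the character the type byte applies to */
};

/* Assumption: '>' opens a range (c >= ch) and the '<' entry immediately
 * after it closes the range (c <= ch); '=' matches a single character.
 * `inverted` stands for a non-null value on the cclass node. */
bool cclass_match(const struct crange *r, size_t n, bool inverted, int c)
{
    bool hit = false;
    for (size_t i = 0; i < n && !hit; ++i) {
        if (r[i].type == '=') {
            hit = (c == r[i].ch);
        } else if (r[i].type == '>' && i + 1 < n && r[i + 1].type == '<') {
            hit = (c >= r[i].ch && c <= r[i + 1].ch);
            ++i;  /* skip the '<' half of the pair */
        }
    }
    return inverted ? !hit : hit;
}
```

With the three crange entries shown above, this accepts M and _ and rejects a; with the inverted flag set, the results are reversed.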

Also, I'm now more or less determined that we will support different character encodings. The important ones I know of are byte (no character encoding, each byte in the input represents itself), UTF-8, and UTF-16. Mainly I just need two functions. One returns the length in bytes of the next character on the input (obviously this is always 1 for the byte noncoding; more complex for UTF-8 and UTF-16), and is used by the any matcher. The other returns the next character on the input decoded into an int (or UTF-32), for use in the character class matcher.
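For UTF-8, the two functions might look something like the sketch below. The names utf8_len() and utf8_decode() are invented for illustration and are not part of pacc. The decoder checks continuation bytes but, to stay short, does not reject overlong forms or surrogate code points.

```c
#include <stddef.h>

/* Length in bytes of the character starting at p, judged from its lead
 * byte. Returns 0 for a byte that cannot start a UTF-8 sequence. */
size_t utf8_len(const unsigned char *p)
{
    if (p[0] < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((p[0] & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((p[0] & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((p[0] & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* continuation or invalid byte */
}

/* Decode the character at p (with `avail` bytes of input left) into a
 * code point. Returns -1 if the sequence is truncated or malformed. */
int utf8_decode(const unsigned char *p, size_t avail)
{
    /* Mask for the payload bits of the lead byte, indexed by length. */
    static const unsigned char lead_mask[] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
    size_t n = avail ? utf8_len(p) : 0;
    if (n == 0 || n > avail) return -1;
    int c = p[0] & lead_mask[n];
    for (size_t i = 1; i < n; ++i) {
        if ((p[i] & 0xC0) != 0x80) return -1;  /* must be 10xxxxxx */
        c = (c << 6) | (p[i] & 0x3F);
    }
    return c;
}
```

The any matcher would only need utf8_len(); the character class matcher would compare the result of utf8_decode() against the ranges in the class.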

Supporting various combinations like "grammar is written in UTF-8 but input is in UTF-16" will be a little more complicated, but I think I can do it using function pointers in the struct _pacc_parser object. But more of that later.
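One possible shape for the function-pointer idea is sketched below. Everything here is an assumption: struct enc_hooks stands in for whatever fields would be added to struct _pacc_parser, the function names are invented, and little-endian byte order is assumed for UTF-16. Since class ranges would already be plain ints, only the input side needs the hooks.

```c
#include <stddef.h>

/* Hypothetical encoding hooks; field names invented for illustration. */
struct enc_hooks {
    size_t (*char_len)(const unsigned char *p, size_t avail); /* 0: none */
    int (*char_decode)(const unsigned char *p, size_t avail); /* -1: error */
};

/* Byte "noncoding": every byte is one character and stands for itself. */
size_t byte_len(const unsigned char *p, size_t avail) { (void)p; return avail ? 1 : 0; }
int byte_decode(const unsigned char *p, size_t avail) { return avail ? p[0] : -1; }

/* Read one 16-bit unit, little-endian (an assumption of this sketch). */
unsigned unit16le(const unsigned char *p) { return p[0] | (p[1] << 8); }

size_t utf16le_len(const unsigned char *p, size_t avail)
{
    if (avail < 2) return 0;
    unsigned u = unit16le(p);
    if (u < 0xD800 || u > 0xDFFF) return 2;  /* ordinary BMP character */
    if (u > 0xDBFF || avail < 4) return 0;   /* stray low surrogate, or truncated */
    unsigned lo = unit16le(p + 2);
    return (lo >= 0xDC00 && lo <= 0xDFFF) ? 4 : 0;
}

int utf16le_decode(const unsigned char *p, size_t avail)
{
    size_t n = utf16le_len(p, avail);
    if (n == 0) return -1;
    unsigned hi = unit16le(p);
    if (n == 2) return (int)hi;
    unsigned lo = unit16le(p + 2);
    /* Combine the surrogate pair into a code point above U+FFFF. */
    return (int)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}

const struct enc_hooks enc_byte = { byte_len, byte_decode };
const struct enc_hooks enc_utf16le = { utf16le_len, utf16le_decode };

/* The any matcher only needs to know how far to advance. */
size_t match_any(const struct enc_hooks *e, const unsigned char *p, size_t avail)
{
    return e->char_len(p, avail);
}
```

Switching the input encoding then amounts to pointing the parser at a different enc_hooks value; the matchers themselves stay the same.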

2010-11-24

So basic character classes work, all the way from pacc.pacc to the compiler! Still missing are inverted character classes, and proper handling of character escapes.

(Inverted character classes are technically unnecessary, since [^x] is equivalent to ![x] followed by the any matcher, that is, "![x] .". Hmm... we could actually implement them that way in pacc.pacc. But on further reflection I think it's easier -- actually less code -- to carry the inversion down to emit, as I'd originally envisaged.)

But first, some tidying up. I made a bit of a hash of syntax.c in the last throes of making character classes work, and thinking about it I suspect snoc is actually append (essentially, we don't distinguish atoms from lists). Let's see... that seems fine. And negated character classes work. Cool.

2010-11-30

What next? Fix character classes to support C escapes, and also the special escape \]. (Note that we don't need to escape -, since it should work at the beginning or end of a class, and also straight after any range, for example [A-Za-z-_]. Need tests for all these cases, of course.) Then rewrite pacc.pacc to use character classes, and copy the new rules into pacc0.c.
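The rule for - can be sketched as a small hand-written scanner in C. This is not the pacc.pacc grammar, only an illustration of the intended behaviour; struct span and parse_class_body() are hypothetical names, and escapes (including \]) are left out.

```c
#include <stddef.h>

struct span { int lo, hi; };  /* a single character has lo == hi */

/* Scan the body of a class (the text between '[' and ']', without
 * escapes) into at most `max` spans. Returns the number of spans, or -1
 * if there are too many or a range is reversed (as in "Z-A"). */
int parse_class_body(const char *s, struct span *out, int max)
{
    int n = 0;
    while (*s) {
        if (n == max) return -1;
        int lo = (unsigned char)*s++;
        int hi = lo;
        /* "x-y" is a range only when some character follows the '-'.
         * A whole range is consumed in one step, so a '-' that comes
         * first, last, or straight after a range is picked up as a
         * literal on a later trip round the loop. */
        if (s[0] == '-' && s[1] != '\0') {
            hi = (unsigned char)s[1];
            s += 2;
        }
        if (hi < lo) return -1;
        out[n].lo = lo;
        out[n].hi = hi;
        ++n;
    }
    return n;
}
```

For the body A-Za-z-_ this yields four spans: A to Z, a to z, a literal -, and a literal _. The bodies -a and a- each yield two single-character spans.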

Hmm... making - work as expected was more effort than I'd expected. I was bitten by the usual problem that PEGs never backtrack, even when you expect that they might, and also that some of the low-level rules that match escaped characters were of type void -- they were normally glued together by a higher-level rule that used refs to get at the actual string value. But for character classes we need to know the characters. (Along the way, I think I spotted some problems with # comments occurring inside rules that I should pursue.)

Last updated: 2015-05-24 19:47:21 UTC
