pacc v0.3: Character classes


7 Character classes

For our next example, let’s consider how we could parse a decimal number. We’ll start by parsing a single digit, and we already know one way we could do that:

Digit <- "0" → 0 / "1" → 1 / "2" → 2 / "3" → 3 / "4" → 4 /
  "5" → 5 / "6" → 6 / "7" → 7 / "8" → 8 / "9" → 9

This is tedious and error-prone, but it’s the best we can do with the matchers we’ve seen so far. Another matcher available in pacc is the character class: square brackets enclose a set of characters, any one of which will match. So the character class ‘[0123456789]’ matches a single decimal digit. To compress it even further, we can write ‘[0-9]’: a hyphen between two characters stands for those two characters and every character between them.
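To make the meaning of the range concrete, here is a minimal C sketch of the test that ‘[0-9]’ and ‘[0123456789]’ both describe. The function names are hypothetical and are not part of pacc or of the code it generates; the sketch only illustrates that the two spellings accept the same characters.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of pacc): the range form, [0-9].
 * C guarantees that '0'..'9' have consecutive values. */
bool is_digit_range(unsigned char c)
{
    return c >= '0' && c <= '9';
}

/* Hypothetical helper (not part of pacc): the listed form, [0123456789]. */
bool is_digit_listed(unsigned char c)
{
    const char *members = "0123456789";
    for (const char *p = members; *p != '\0'; p++) {
        if ((unsigned char)*p == c)
            return true;
    }
    return false;
}
```

Both functions agree on every possible byte value, which is why the shorter spelling is safe to use.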

But while solving one problem—concisely matching any of a set of characters—we have introduced another: how do we write a useful semantic expression to go with a character class matcher?

Digit <- [0-9] -> { /* what goes here? */ }

Now that a single alternative matches any digit, what expression will give us the right value? We need some way to refer back to the input that we have just matched. We’ll see how to do that in the next section.

As well as normal character classes, pacc supports negated character classes. These are written with a caret ‘^’ as the first character. For example, ‘[^)]’ matches any single character except a closing parenthesis.

Should you need to write a character class that includes the literal character ‘^’, simply ensure that it is not the first character in the class. To match a literal hyphen, write it as the first or last character in the class. To match a literal ‘]’, write it as the first character. For example, the character class ‘[~^]’ matches either a tilde or a caret; ‘[-_]’ matches a hyphen or an underscore; and ‘[][]’ matches either an opening or a closing square bracket.
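The placement rules for ‘^’, ‘-’ and ‘]’ are easier to check against a small interpreter. The following C sketch is an illustration only: class_matches is a hypothetical function, not pacc’s implementation, and it works on single bytes rather than full Unicode characters. It also assumes that in a negated class a ‘]’ placed directly after the caret counts as a literal, which the text above does not state.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of pacc): does byte c match the class
 * written in cls, e.g. "[0-9]" or "[^)]"?  Returns false if cls does
 * not start with '['. */
bool class_matches(const char *cls, unsigned char c)
{
    size_t i = 0;
    bool negated = false;
    bool found = false;

    if (cls[i] != '[')
        return false;
    i++;

    /* A caret negates the class only when it comes first. */
    if (cls[i] == '^') {
        negated = true;
        i++;
    }

    /* A ']' in the first member position is a literal, not the end. */
    size_t first = i;
    while (cls[i] != '\0' && (cls[i] != ']' || i == first)) {
        unsigned char lo = (unsigned char)cls[i];

        /* A hyphen forms a range only with a member on both sides, so a
         * hyphen that is first or last in the class is a literal. */
        if (cls[i + 1] == '-' && cls[i + 2] != ']' && cls[i + 2] != '\0') {
            unsigned char hi = (unsigned char)cls[i + 2];
            if (lo <= c && c <= hi)
                found = true;
            i += 3;
        } else {
            if (c == lo)
                found = true;
            i += 1;
        }
    }
    return negated ? !found : found;
}
```

With this sketch, class_matches("[][]", ']') and class_matches("[-_]", '-') are both true, while class_matches("[^)]", ')') is false, matching the examples in the text.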

If you are familiar with character classes in regular expressions, pacc’s character classes are broadly similar. Note, however, that pacc is not aware of locales: a hyphenated range always stands for exactly the characters whose Unicode code points lie in the inclusive range between those of the two named characters. (Believe it or not, in at least some versions of GNU grep, in at least some locales, the character class ‘[a-z]’ matches all lower case letters, and also all upper case letters… except ‘Z’.) Nor does pacc support named classes (such as ‘[:alpha:]’). If you need something like a named class, see Semantic guards.
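The code point rule can be sketched in a few lines of C. The function below is hypothetical and not taken from pacc; it assumes the input has already been decoded into Unicode code points, and it simply shows that membership in a range is a plain numeric comparison with no locale involved.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of pacc): true when code point cp lies
 * in the inclusive range lo..hi.  No collation order is consulted. */
bool in_code_point_range(uint32_t cp, uint32_t lo, uint32_t hi)
{
    return lo <= cp && cp <= hi;
}
```

Under this rule ‘[a-z]’ covers U+0061 to U+007A and nothing else: ‘A’ (U+0041) and ‘é’ (U+00E9) are both outside the range, whatever the locale.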


Last updated: 2016-08-03 21:39:50 UTC
