lux by example

lux is a text-processing tool, like sed. Its job is to take a line of input, transform it according to a rule we specify, and write the result to the output. A rule has two parts: an input template and an output template.

For example, this expression parses time.

Each snippet is laid out as follows: at the top is the control panel, with play and reset buttons and a menu. Directly below it is the input template area.

Below is an area for input and a pre-rendered output.

These snippets are interactive. Change the time in the input and hit the ▶️ button.

The structure of the JSON output is discussed later.

Literals

The simplest parser is a literal parser. To make a literal, surround the desired string with a pair of single or double quotes.

Change the input to something else and see how the program rejects it.
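The accept-or-reject behavior of a literal can be modeled in a few lines of Python. This is only a sketch of the idea, not lux's implementation: the names literal and accepts are hypothetical, and the rule that a line is accepted only when the parser consumes all of it is an assumption.

```python
def literal(expected):
    """Return a parser that accepts exactly `expected` starting at `pos`."""
    def parse(text, pos):
        if text.startswith(expected, pos):
            return pos + len(expected)   # position just after the match
        return None                      # rejection
    return parse

def accepts(parser, line):
    """Assumption: a line is accepted only if the parser consumes all of it."""
    return parser(line, 0) == len(line)

hello = literal("hello")
print(accepts(hello, "hello"))   # True
print(accepts(hello, "help"))    # False
```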

Built-ins

To parse data that is not known in advance, we use built-in primitives. There are three built-ins: any, alpha and digit. To use a built-in parser, just write its name.

The any parser matches a single UTF-8 character.

alpha and digit match a single letter and digit respectively. In this snippet, replace any with alpha or digit and inspect the output.
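As a rough Python model of the three built-ins, each parser below consumes one character or rejects. The function names are hypothetical, and using Python's str.isalpha and str.isdigit as the letter and digit tests is an assumption; lux's own character classes may be narrower or wider.

```python
def any_char(text, pos):
    """Model of `any`: consume one character, whatever it is."""
    return pos + 1 if pos < len(text) else None

def alpha(text, pos):
    """Model of `alpha`: consume one letter (assumed: Python's notion of a letter)."""
    return pos + 1 if pos < len(text) and text[pos].isalpha() else None

def digit(text, pos):
    """Model of `digit`: consume one digit (assumed: Python's notion of a digit)."""
    return pos + 1 if pos < len(text) and text[pos].isdigit() else None

for parser in (any_char, alpha, digit):
    # Each call returns the new position on success, or None on rejection.
    print(parser.__name__, parser("a", 0), parser("7", 0), parser("", 0))
```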

Sequencing

By using , (comma) we can sequence parsers. A compound parser succeeds only if both parsers succeed.
More than two parsers can be sequenced, and built-in parsers can be freely sequenced with literals. With this, we are ready to construct our first parser for the hh:mm time format.
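Sequencing can be modeled as running parsers one after another and failing as soon as any of them fails. The sketch below builds the hh:mm parser described above (two digits, a colon, two digits) in Python. The helper names seq and accepts are hypothetical, and whole-line acceptance is an assumption.

```python
def literal(expected):
    def parse(text, pos):
        return pos + len(expected) if text.startswith(expected, pos) else None
    return parse

def digit(text, pos):
    return pos + 1 if pos < len(text) and text[pos].isdigit() else None

def seq(*parsers):
    """Run parsers left to right; the compound fails if any part fails."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def accepts(parser, line):
    """Assumption: a line is accepted only if the parser consumes all of it."""
    return parser(line, 0) == len(line)

# Two digits, a colon, two digits.
hhmm = seq(digit, digit, literal(":"), digit, digit)

print(accepts(hhmm, "12:34"))   # True
print(accepts(hhmm, "1:34"))    # False
```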

Modifiers

In the previous example, we fixed the number of digits in hours and minutes to two. But what if we want to allow a variable number of something? For this case, we have modifiers. There are four of them in total.

The + (plus) modifier runs the parser one or more times.
The * (star) modifier runs the parser zero or more times.
The ? (question mark) modifier runs the parser once or never.

The final modifier is repeat, and it looks like this: {n-m}. It repeats the modified parser n to m times.

For the common case of {n-n}, a shorthand exists: {n}.

Now that we know modifiers, let's iterate on our hh:mm parser. To make our task a bit less dull, let's allow omitting the leading zero in hours and adding a - (minus) before the time (imagine we're parsing a UTC time zone offset).
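All four modifiers can be modeled with one bounded-repetition helper. The Python sketch below assumes greedy repetition without backtracking, which the text does not confirm for lux. The names repeat, plus, star and opt are hypothetical. The final parser accepts an optional minus, one or two hour digits, a colon, and exactly two minute digits.

```python
def literal(expected):
    def parse(text, pos):
        return pos + len(expected) if text.startswith(expected, pos) else None
    return parse

def digit(text, pos):
    return pos + 1 if pos < len(text) and text[pos].isdigit() else None

def seq(*parsers):
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def repeat(p, lo, hi=None):
    """Model of {lo-hi}: run p greedily lo to hi times (hi=None: unbounded)."""
    def parse(text, pos):
        count = 0
        while hi is None or count < hi:
            nxt = p(text, pos)
            if nxt is None:
                break
            count += 1
            if nxt == pos:      # nothing consumed: stop to avoid looping forever
                break
            pos = nxt
        return pos if count >= lo else None
    return parse

def plus(p): return repeat(p, 1)      # one or more
def star(p): return repeat(p, 0)      # zero or more
def opt(p):  return repeat(p, 0, 1)   # once or never

def accepts(parser, line):
    """Assumption: a line is accepted only if the parser consumes all of it."""
    return parser(line, 0) == len(line)

offset = seq(opt(literal("-")), repeat(digit, 1, 2), literal(":"), repeat(digit, 2, 2))

for line in ("-5:30", "05:30", "123:30", "5:3"):
    print(line, accepts(offset, line))
```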

Alternation

Sometimes we want the parser to accept either one option or another. For example, if we need to parse a binary digit, we need a parser that only accepts 0 or 1.

This behavior is achieved with | (pipe). Parsers connected with | are tried in order until one succeeds.
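Ordered choice can be modeled as trying each parser at the same position and returning the first success. A minimal Python sketch of the binary-digit example follows; the names alt and accepts are hypothetical, and whole-line acceptance is an assumption.

```python
def literal(expected):
    def parse(text, pos):
        return pos + len(expected) if text.startswith(expected, pos) else None
    return parse

def alt(*parsers):
    """Try each parser at the same position; the first success wins."""
    def parse(text, pos):
        for p in parsers:
            nxt = p(text, pos)
            if nxt is not None:
                return nxt
        return None
    return parse

def accepts(parser, line):
    """Assumption: a line is accepted only if the parser consumes all of it."""
    return parser(line, 0) == len(line)

bit = alt(literal("0"), literal("1"))

print(accepts(bit, "0"))   # True
print(accepts(bit, "1"))   # True
print(accepts(bit, "2"))   # False
```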

Parentheses

Any parser can be parenthesized. Parentheses do not change what the parser does, but allow combining parsers in a more complex way.

Match data

So far, every parser we have introduced could only validate input: it either accepted or rejected it. Parsing implies some kind of structured output that can be used down the line.

The structured output that lux yields on successful parsing is called a match. A match contains the piece of source from which it was parsed and, optionally, match data. Match data comes in three kinds.

Array

Perhaps the simplest kind of match data is the array. Arrays contain homogeneous matches.

To construct a parser that yields an array, we surround it with brackets. At runtime, when the parser succeeds, its output is wrapped into a singleton array.

Choose infer type in the snippet menu and push ▶️. Note that this parser has type [u], unlike the parsers we constructed before, all of which had type u. The u in types stands for unit: a match without data. For brevity, we will say type of parser instead of type of value yielded by the parser.

In the output pane, we can see how a match with data is encoded in JSON: field s contains the aforementioned source, and x contains match data when it is present.

By combining modifiers and brackets we can parse non-singleton arrays.
When sequencing parsers, yielded matches get merged. Note that only values from bracketed parsers appear in match data, since non-bracketed ones do not yield any.
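The wrapping and merging rules can be modeled in Python with matches represented as dictionaries that have an s field and an optional x field, following the JSON encoding described above. This is a sketch under assumptions: the exact shape of array elements in lux's output is inferred, not confirmed, and bracket, merge, seq and plus are hypothetical names.

```python
import json

def digit(text, pos):
    if pos < len(text) and text[pos].isdigit():
        return pos + 1, {"s": text[pos]}          # unit match: source only
    return None

def literal(expected):
    def parse(text, pos):
        if text.startswith(expected, pos):
            return pos + len(expected), {"s": expected}
        return None
    return parse

def bracket(p):
    """Model of [p]: wrap the inner match into a singleton array."""
    def parse(text, pos):
        result = p(text, pos)
        if result is None:
            return None
        pos, inner = result
        return pos, {"s": inner["s"], "x": [inner]}
    return parse

def merge(a, b):
    """Merge two matches: join the sources and concatenate any array data."""
    merged = {"s": a["s"] + b["s"]}
    if "x" in a or "x" in b:
        merged["x"] = a.get("x", []) + b.get("x", [])
    return merged

def seq(*parsers):
    def parse(text, pos):
        acc = {"s": ""}
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            pos, match = result
            acc = merge(acc, match)
        return pos, acc
    return parse

def plus(p):
    """One or more repetitions, merging the yielded matches."""
    def parse(text, pos):
        result = p(text, pos)
        if result is None:
            return None
        pos, acc = result
        while True:
            result = p(text, pos)
            if result is None or result[0] == pos:
                break
            pos, match = result
            acc = merge(acc, match)
        return pos, acc
    return parse

time = seq(bracket(plus(digit)), literal(":"), bracket(plus(digit)))
print(json.dumps(time("12:30", 0)[1]))
# {"s": "12:30", "x": [{"s": "12"}, {"s": "30"}]}
```

The colon is part of the source but not of the array, because only the bracketed parsers contribute match data. Placing the modifier outside the brackets instead, as in plus(bracket(digit)), yields one array element per digit.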

Records

Very often, the data we parse is heterogeneous. For that case, lux has records.

As with arrays, records have syntax to wrap a parser: {i:p}. At runtime, the composite parser adds a label i to the value yielded by the inner parser. Note that labels are tracked in types.

Also, lux allows writing {i} instead of {i:i}.

Records with multiple fields are built by sequencing.
When alternating records, only the fields that are present in every record are kept. If we wish to preserve some fields, we need to mention them explicitly in each branch of the alternation.
The ability to modify alternated records allows us to implement something akin to a partition function.

And, of course, records can be nested. See how, again, types mirror the structure of our output.
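The record rules can be modeled in Python by letting each parser carry the set of labels it is known to yield, a crude stand-in for the type tracking mentioned above. Sequencing takes the union of labels, and alternation keeps only the labels common to every branch. This is a sketch under assumptions: Parser, chars, field, seq and alt are hypothetical names, and the nested JSON shape of field values is inferred rather than confirmed.

```python
class Parser:
    def __init__(self, run, fields=frozenset()):
        self.run = run                    # (text, pos) -> (new_pos, match) or None
        self.fields = frozenset(fields)   # labels this parser is known to yield

def chars(pred):
    """One or more characters satisfying pred; yields a unit match."""
    def run(text, pos):
        end = pos
        while end < len(text) and pred(text[end]):
            end += 1
        return (end, {"s": text[pos:end]}) if end > pos else None
    return Parser(run)

def literal(expected):
    def run(text, pos):
        if text.startswith(expected, pos):
            return pos + len(expected), {"s": expected}
        return None
    return Parser(run)

def field(label, p):
    """Model of {label:p}: attach a label to the inner match."""
    def run(text, pos):
        result = p.run(text, pos)
        if result is None:
            return None
        pos, inner = result
        return pos, {"s": inner["s"], "x": {label: inner}}
    return Parser(run, {label})

def seq(*parsers):
    """Sequencing merges records: the result has the fields of every part."""
    def run(text, pos):
        source, record = "", {}
        for p in parsers:
            result = p.run(text, pos)
            if result is None:
                return None
            pos, match = result
            source += match["s"]
            record.update(match.get("x", {}))
        out = {"s": source}
        if record:
            out["x"] = record
        return pos, out
    return Parser(run, frozenset().union(*(p.fields for p in parsers)))

def alt(*parsers):
    """Alternation keeps only the fields present in every branch."""
    common = frozenset.intersection(*(p.fields for p in parsers))
    def run(text, pos):
        for p in parsers:
            result = p.run(text, pos)
            if result is not None:
                pos, match = result
                kept = {k: v for k, v in match.get("x", {}).items() if k in common}
                out = {"s": match["s"]}
                if kept:
                    out["x"] = kept
                return pos, out
        return None
    return Parser(run, common)

word, number = chars(str.isalpha), chars(str.isdigit)
time = seq(field("h", number), literal(":"), field("m", number))
entry = alt(seq(field("name", word), literal("="), field("value", number)),
            field("name", word))

print(time.run("12:30", 0)[1])
print(entry.run("x=1", 0)[1], sorted(entry.fields))
```

In entry, the value field is dropped because the second branch does not mention it, which mirrors the rule that a field must appear in every branch to survive alternation.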

Variants

Often, when alternating between parsers, we want to record in the result which parser succeeded. This is achieved with variants. Like a record, a variant adds a label to the inner value: <i:p>. We call this label-type pair a case. A match with variant data contains exactly one case, unlike a record, which contains every one of its fields. The type of a variant signifies which cases it might contain.

To construct a variant with multiple cases, we use alternation. A compound parser's type is a union of all cases.

Variants are encoded as objects with a single field. The field's name is the label, and its value is the value yielded by the case's parser.
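The single-field encoding can be illustrated with a Python model in which each case wraps the inner match under its label, and alternation picks whichever case succeeds first. The names case, alt and chars are hypothetical, and the nested match shape is an assumption based on the JSON description above.

```python
def chars(pred):
    """One or more characters satisfying pred; yields a unit match."""
    def parse(text, pos):
        end = pos
        while end < len(text) and pred(text[end]):
            end += 1
        return (end, {"s": text[pos:end]}) if end > pos else None
    return parse

def case(label, p):
    """Model of <label:p>: tag the inner match with the label of this case."""
    def parse(text, pos):
        result = p(text, pos)
        if result is None:
            return None
        pos, inner = result
        return pos, {"s": inner["s"], "x": {label: inner}}
    return parse

def alt(*parsers):
    """Try each parser in order; the first success wins."""
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse

token = alt(case("num", chars(str.isdigit)), case("word", chars(str.isalpha)))

print(token("42", 0)[1])    # {'s': '42', 'x': {'num': {'s': '42'}}}
print(token("abc", 0)[1])   # {'s': 'abc', 'x': {'word': {'s': 'abc'}}}
```

Whichever branch succeeds, the x object has exactly one key, so a consumer can tell the cases apart by inspecting that key.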

Examples

Together, records and variants form quite a powerful system that resembles the algebraic data types found in many modern languages. Armed with these tools, let's parse some real-world formats.

date --iso-8601=seconds

ls -l | head -2

head -2 /etc/hosts