I recently starting making a serious attempt at learning Haskell (anyone who has tried before will probably sympathise that it usually takes a couple of tries to crack it). Amongst the many cool things Haskell has to offer is an amazing parsing library that comes with the standard set of packages called Parsec, which lets you describe how to parse complex grammars in what essentially looks like natural language.
Here is how a CSV parser is implemented using Parsec. Don’t worry if you don’t understand all the syntax, the point is that the whole parser is specified in just four lines.
1234567891011
-- Taken from http://book.realworldhaskell.org/read/using-parsec.htmlimportText.ParserCombinators.ParseccsvFile=endBylineeolline=sepBycell(char',')cell=many(noneOf",\n")eol=char'\n'parseCSV::String->EitherParseError[[String]]parseCSVinput=parsecsvFile"(unknown)"input
arcsecond is a parser combinator library, in which complex parsers can be built by composing simple parsers. The simplest parsers just match specific strings or characters:
1234567891011
const{digit,letter,char,str,regex}=require('arcsecond');// digit will parse any number 0-9// letter will parse any letter a-z or A-Z// char and str will parse specific sequencesconstgetA=char('a');constgetHello=str('hello');// regex will parse any given regular expressionconstgetLowNumbers=regex(/^[0-4]+/);
These can then be combined together with the combinators in the library.
const{parse,char,str,sequenceOf,choice}=require('arcsecond');// parse creates a function we can use for parsing textconstparseText=parse(combinedParser);console.log(parseText('hello world'))// -> [ 'hello', ' ', 'world' ]
Combinators
The combinators are where it gets cool. In arcsecond a combinator is a higher order parser, which takes one or more parsers as its input and gives back a new parser that combines those in some way. If you’ve used higher order components in react like connect, withRouter, or withStyles, then you’re already familiar with the idea.
As shown above, sequenceOf is a combinator that will parse text using each of the parsers in order, collecting their results into an array.
choice, on the other hand, will try each of its parsers in order and use the first one that matches. The library contains more, like many, which takes a parser as an argument and matches as much as it can using that parser, collecting up the results in an array:
123456789101112131415161718
const{parse,many,choice,str}=require('arcsecond');constanimal=choice([str('dog'),str('whale'),str('bear'),str('ant')]);constmanyAnimals=many(animal);// parse creates a function we can use for parsing textconstparseText=parse(manyAnimals);console.log(parseText('dogantwhaledogbear'))// -> [ 'dog', 'ant', 'whale', 'dog', 'bear' ]
You can use sepBy to create a parser that matches items separated by something matched by another parser.
1234567891011121314151617181920
const{parse,sepBy,choice,str,char}=require('arcsecond');constanimal=choice([str('dog'),str('whale'),str('bear'),str('ant')]);constcommaSeparated=sepBy(char(','));constcommaSeparatedAnimals=commaSeparated(animal);// parse creates a function we can use for parsing textconstparseText=parse(commaSeparatedAnimals);console.log(parseText('dog,ant,whale,dog,bear'))// -> [ 'dog', 'ant', 'whale', 'dog', 'bear' ]
between will let you match items that occur between two other parsers.
12345678910111213141516171819202122
const{parse,between,sepBy,choice,str,char}=require('arcsecond');constanimal=choice([str('dog'),str('whale'),str('bear'),str('ant')]);constbetweenBrackets=between(char('['))(char(']'));constcommaSeparated=sepBy(char(','));constcommaSeparatedAnimals=commaSeparated(animal);constarrayOfAnimals=betweenBrackets(commaSeparatedAnimals);// parse creates a function we can use for parsing textconstparseText=parse(arrayOfAnimals);console.log(parseText('[dog,ant,whale,dog,bear]'))// -> [ 'dog', 'ant', 'whale', 'dog', 'bear' ]
Curried Functions
arcsecond makes use of curried functions. If you don’t know what a curried function is, give my article “Making Functional Programming Click” a read. If you really can’t be bothered right now, open that in another tab and read this executive summary.
A curried function is one that, if it takes more than one argument, it instead returns a new function that takes the next argument. It’s easier to see with some code:
12345678910111213141516171819202122232425
// Normal function, just a single function taking both arguments and returning a resultfunctionnormalAdd(a,b){returna+b;}normalAdd(1,2);// -> 3// Curried function, a function that takes the first argument and then returns// *another function*, which takes the next argument.functioncurriedAdd(a){returnfunction(b){returna+b;}}curriedAdd(1)(2);// -> 3// We can use curried functions to easily make new, more specific functionsconstaddOne=curriedAdd(1);addOne(2);// -> 3
As you can see above, curriedAdd is first called with 1. Since it then returns a function, we can go ahead and call that with 2, which finally returns the actual result.
We can use curriedAdd to create a new function by calling it with just one argument and then assigning the result to a variable. As a language, JavaScript treats functions as a first class citizen, meaning they can be passed around and assigned to variables.
This principle lies at the heart of arcsecond, and every function in the library is defined this way. sepBy take two parsers — first a separator parser, and second a value parser. Because it is curried, it is easy to create a more specific combinator like commaSeparated by only supplying the first argument.
If you’re not used to it, it will probably seem strange. But a good lesson to learn as a software developer is not to have knee-jerk bad reactions to things that you don’t immediately understand. There is usually a reason, and you’ll find a lot more value by discovering that reason rather than dismissing it.
Error Handling
If you try to parse a string which is not correctly formatted you would expect to get some kind of error message. arcsecond uses a special data type called an Either, which is either a value or an error. It’s like a Promise, which can be either Rejected or Resolved, but without the implication of being asynchronous. In an Either however, the “resolved” type is called a Right, and the “rejected” type is called a Left.
The return type of parse is an Either. You can get at the value or error like this:
12345678910111213141516171819
const{parse,sequenceOf,str,char}=require('arcsecond');constparser=sequenceOf([str('hello'),char(' '),str('world')]);constparseResult=parse(parser)('hello world');// We can use the cata method, which is basically a switch/case but for typesparseResult.cata({Left:err=>{// do something here with the error},Right:result=>{// result -> [ 'hello', ' ', 'world' ] }});
However this might not fit well into your codebase if you don’t have more functional code. For that reason there are two other options. The first is to convert the Either into a Promise:
1234567891011121314151617
const{parse,sequenceOf,str,char,toPromise}=require('arcsecond');constparser=sequenceOf([str('hello'),char(' '),str('world')]);constparseResult=parse(parser)('hello world');toPromise(parseResult).catch(err=>{// do something here with the error}).then(result=>{// result -> [ 'hello', ' ', 'world' ] });
Or you can use toValue, which must be wrapped in a try/catch block:
12345678910111213141516
const{parse,sequenceOf,str,char,toValue}=require('arcsecond');constparser=sequenceOf([str('hello'),char(' '),str('world')]);constparseResult=parse(parser)('hello world');try{constresult=toValue(parseResult);// -> [ 'hello', ' ', 'world' ] }catch(err){// do something with err}
Something more complex: JSON
Let’s put arcsecond through its paces by using it to create a parser for JSON.
So to write a JSON parser, we just need to write parsers for all these values.
Types
In order for our parser to be useful we need to be able to identify what we’ve parsed, and the best way to do that is to put the results into a data type, which will provide a common interface for us to interact with the JSON tree. Every type has a type name, a value, and a toString function to pretty print the structure.
// Function to create a single-valued "type constructor"constmakeBasicType=typeName=>value=>({type:typeName,value,toString:()=>`${typeName}(${value})`});// Function to create a multi-valued "type constructor"constmakeMultiType=typeName=>values=>({type:typeName,value:values,toString:()=>`${typeName}(${values.map(v=>v.toString()).join(', ')})`});conststringType=makeBasicType('String');constnumberType=makeBasicType('Number');constbooleanType=makeBasicType('Boolean');constarrayType=makeMultiType('Array');constobjectType=makeMultiType('Object');// For the object, we can store the key/value pairs in their own typeconstkeyValuePair=(key,value)=>({type:'KeyValuePair',value:[key,value],toString:()=>`KeyValuePair(${key.toString()},${value.toString()})`});constnullType=()=>({type:'Null',value:null,toString:()=>'Null'});arrayType([stringType("hello"),numberType(42),stringType("world"),numberType(84),]).toString();// -> Array(String(hello), Number(42), String(world), Number(84))
With our types in hand let’s start with the absolute simplest of parsers: true, false and null. These are just literal strings:
The parsers in arcsecond have a map method — just like arrays —which allows you to transform the value the parser matched. With map we can put the values matched into the data types defined above.
Numbers are a bit more complex. The JSON specification has this railroad diagram showing how a number can be parsed:
The forks in the railroads show optionality, so the simplest number that can be matched is just 0.
Basically a numbers like:
1
-0.2
3.42e2
-0.4352E-235
Are all valid in JSON, while something like 03.46 is not, because no path in the railroad would allow for that.
const{choice,sequenceOf,char,digits,regex,possibly,anyOfString}=require('arcsecond');// The possibly parser either returns what it matches or null// We can use the possibly parser to make one that returns empty string// instead of null, which is more useful in this caseconstorEmptyString=parser=>possibly(parser).map(x=>(x)?x:'');// We also want to get sequences of matches, but then join them back// together afterwards, instead of being in an arrayconstjoinedSequence=parsers=>sequenceOf(parsers).map(x=>x.join(''));constnumberParser=joinedSequence([orEmptyString(char('-')),choice([char('0'),regex(/^[1-9][0-9]*/)]),orEmptyString(joinedSequence([char('.'),digits])),orEmptyString(joinedSequence([anyOfString('eE'),orEmptyString(anyOfString('-+')),digits]))]).map(x=>numberType(Number(x)));// Now we can add the numberParser to the jsonParserconstjsonParser=choice([nullParser,trueParser,falseParser,numberParser]);
If you take the time to read through the numberParser, you’ll see it lines up pretty much 1:1 with the diagram above.
Let’s try strings next. A string is anything between double quotes, but it can also contain escaped quotes.
const{between,many,choice,sequenceOf,char,anythingExcept}=require('arcsecond');// Like joinedSequence, we want to get many of the same matches and then join themconstjoinedMany=parser=>many(parser).map(x=>x.join(''));constbetweenQuotes=between(char('"'))(char('"'));conststringParser=betweenQuotes(joinedMany(choice([joinedSequence([char('\\'),char('"')]),anythingExcept(char('"'))]))).map(stringType);// Adding the stringParser to the jsonParserconstjsonParser=choice([nullParser,trueParser,falseParser,numberParser,stringParser]);
The anythingExcept parser comes in very handy here, and is especially expressive when compared with the image on the JSON spec website.
That only leaves Array and Object, which can both be pitfalls because they are basically just containers of jsonValue. To illustrate how this might go wrong, we can write the Array parser the “wrong” way first and then see how to address it.
We can use the whitespace parser — which matches zero or more whitespace characters — to ensure the the array brackets and comma operator allow for any (optional) whitespace that might be there.
123456789101112131415161718192021222324
const{between,char,choice,sepBy,whitespace}=require('arcsecond');constwhitespaceSurrounded=between(whitespace)(whitespace)constbetweenSquareBrackets=between(whitespaceSurrounded(char('[')))(whitespaceSurrounded(char(']')));constcommaSeparated=sepBy(whitespaceSurrounded(char(',')));constarrayParser=betweenSquareBrackets(commaSeparated(jsonParser)).map(arrayType);constjsonParser=choice([nullParser,trueParser,falseParser,numberParser,stringParser,arrayParser]);// ReferenceError: jsonParser is not defined
Because arrayParser is defined in terms of jsonParser, and jsonParser is defined in terms of arrayParser, we run into a ReferenceError. If we moved the definition of arrayParser below jsonParser, we’d still have the same problem. We can fix this by wrapping the jsonParser in a special parser, aptly named recursiveParser. The argument to recursiveParser is a thunk, which will allow us to reference variables that are not yet in scope.
12345678910111213141516171819202122232425
const{between,char,choice,sepBy,whitespace,recursiveParser}=require('arcsecond');constwhitespaceSurrounded=between(whitespace)(whitespace)constbetweenSquareBrackets=between(whitespaceSurrounded(char('[')))(whitespaceSurrounded(char(']')));constcommaSeparated=sepBy(whitespaceSurrounded(char(',')));// Move this parser to the top, and wrap it in a recursiveParser,// which takes a function that returns a parserconstjsonParser=recursiveParser(()=>choice([nullParser,trueParser,falseParser,numberParser,stringParser,arrayParser]));constarrayParser=betweenSquareBrackets(commaSeparated(jsonParser)).map(arrayType);
Implementing the arrayParser is actually quite trivial — as simple as JSON values, separated by commas, in square brackets.
Object is only marginally more complex. The values in an object are pairs of strings some other JSON value, with a colon as a separator.
Since I opened up with a CSV parser in Haskell, let’s see how that would look in arcsecond. I’ll keep it minimal and forgo creating data types to hold the values, and some extra strengthening that the Parsec version also doesn’t have.
The result should be an array of arrays — the outer array holds “lines” and the inner arrays contain the elements of the line.
const{parse,char,many,regex,anythingExcept,sepBy}=require('arcsecond');constjoinedMany=parser=>many(parser).map(x=>x.join(''));constcell=joinedMany(anythingExcept(regex(/^[,\n]/)));constcells=sepBy(char(','))(cell);constparser=sepBy(char('\n'))(cells);constdata=`1,ReactJS,"A declarative efficient and flexible JavaScript library for building user interfaces"2,Vue.js,"Vue.js is a progressive incrementally-adoptable JavaScript framework for building UI on the web."3,Angular,"One framework. Mobile & desktop."4,ember.js,"Ember.js - A JavaScript framework for creating ambitious web applications"`;console.log(parse(parser)(data));// -> [// [ '' ],// [ '1', 'React JS', '"A declarative efficient and flexible JavaScript library for building user interfaces"' ],// [ '2', 'Vue.js', '"Vue.js is a progressive incrementally-adoptable JavaScript framework for building UI on the web."' ],// [ '3', 'Angular', '"One framework. Mobile & desktop."' ],// [ '4', 'ember.js', '"Ember.js - A JavaScript framework for creating ambitious web applications"' ],// ]
Conclusion
There are quite a few key features of arcsecond that didn’t get a mention in this article, including the fact it can parse context sensitive languages, and the parsing model is based on a Fantasy Land -compliant Monad. My main goal with this project was to bring the same level of expressivity that Parsec has to JavaScript, and I hope I’ve been able to do that.