Haskell for Elm developers: giving names to stuff (Part 4 - Parser combinators)
⚠️ DISCLAIMER ⚠️ This is by no means a full in-depth explanation of parser combinators, as there are many papers on the subject. This post assumes you are somewhat familiar with
elm/parser
, and thus you are equipped with the tools you need to get familiar with parser combinators in Haskell!
Hey! Long time no see, I’ve finally gathered my inner strength to be brave enough to write this post, yay! 🎉
Even though this might not be a fully complete explanation of parser combinators, as said in the above disclaimer, you might want to have a refresher on what Applicative
functors were:
This will become relevant to this post, as you will soon notice, but first of all, another brief reminder before we get onto the good stuff.
What is a Parser?
Basically speaking, a parser is a function that takes some text input and returns some structure (normally an Abstract Syntax Tree, aka: AST) as an output. A parser combinator is, thus, a higher-order function that takes parsers as input and returns a new parser as output.
As this definition is better explained with examples, let’s have a look at the simplest possible definition of a parser in Haskell:
type Parser a = String -> Maybe (a, String)
Although no production ready parser combinators library would be as naive, it serves well our purpose to understand the simple essence of it:
- Await a
String
value. - Produce a result that may or not succeed (that’s why it returns a
Maybe
, although with this minimalistic design there is no possible way to know why the parser failed). - Return a tuple of the value you want to parse and whatever’s left of the string that you did not consume to produce a value of type
a
.
Some people like to think about parsers as somewhat simple machines that need a cursor, and that parse left to right the input and produce results as they traverse the text input, but this illustration might not be so accurate in certain implementations.
For the Haskell curious (which I guess you are since you are reading this post), notice how the above definition of a Parser
looks suspiciously familiar to the State
monad:
newtype State s a =
State { runState :: s -> (a, s) }
We could maybe dedicate a following blogpost on its own just to the State
monad, but this is out of the scope of this post. Suffice to say that during the research for writing these lines I found that there’s actually a quite nice Elm package that implements the State
monad! 🤯
How are parsers defined in Elm?
Now that we are getting a bit more familiar with the way a parser is constructed, let’s have a look at the de facto only parser library that exists for Elm, elm/parser
and see how the Parser
type is defined there:
-- Parser.Advanced.elm
type Parser context problem value =
Parser (State context -> PStep context problem value)
type PStep context problem value
= Good Bool value (State context)
| Bad Bool (Bag context problem)
If you squint your eyes hard enough, you can see the resemblance. For example, if we strip the context
and problem
type variables, which are used to give more information to the user about why a certain parser failed, and we simplify the State
type that the author used here (not the State
monad, do not be confused!), it will look like this:
type Parser value =
Parser (String -> PStep value)
type PStep value
= Good value String
| Bad String
This looks a bit closer to the initial Haskell definition we looked at, and you can therefore now notice that PStep
is just a sophisticated version of Maybe
that will give you more information in case of failure.
The hidden typeclass
The hidden secret of parser combinators, in my humble opinion, lies in the simple oneOf
function:
oneOf : List (Parser a) -> Parser a
When you are parsing stuff, sometimes you are parsing something AND something (that’s when the Applicative
typeclass kicks in), but usually you also need to parse something OR something else. This is where the oneOf
function comes in handy, for example in the elm/json
library you have the following decoding function:
maybe : Decoder a -> Decoder (Maybe a)
maybe decoder =
oneOf
map Just decoder
[ , succeed Nothing
]
This simple decoder allows us to declare that you might parse something present in your input (or not), and appropriately represent that into a Maybe
type. But, how would such combinator look in Haskell? 🤔
optional :: Alternative f => f a -> f (Maybe a)
= Just <$> d <|> pure Nothing optional d
As you can see, the maybe
decoder function is called optional
in Haskell, but the interesting stuff is the typeclass constraint: Alternative f =>
. This is the magical final piece of the puzzle we need to understand to get parser combinators!
Let’s have a look at simplified version of how the Alternative
typeclass is defined in Haskell:
class Applicative f => Alternative f where
-- The identity of '<|>'
empty :: f a
-- An associative binary operation
(<|>) :: f a -> f a -> f a
-- One or more.
some :: f a -> f [a]
-- Zero or more.
many :: f a -> f [a]
Since some
and many
are defined in terms of <|>
, we can notice quickly that the relevant part is the new <|>
operator, which does not have a name but by using the pipe I think it might convey the idea of a lifted OR (|
) operator.
So again, where in Elm we can express that we can parse one thing or another as items in a list given to oneOf
, in Haskell there is the infix binary operator <|>
to define all the possible things we want to parse.
If you also noticed, Alternative
requires an Applicative
instance to also be present! So, in every possible Haskell implementation of Parser
, mandatorily you are going to have this code blow:
instance Applicative Parser where
-- ...
instance Alternative Parser where
-- ...
This is what makes parser combinators work: an Applicative
instance, and an Alternative
one, that is it! ✨
Interesting aha! moment
While scanning though the elm/parser
library, you will find this code:
-- INFIX OPERATORS
infix left 5 (|=) = keeper
infix left 6 (|.) = ignorer
{-| Just like the [`(|=)`](Parser#|=) from the `Parser` module.
-}
keeper : Parser c x (a -> b) -> Parser c x a -> Parser c x b
keeper parseFunc parseArg =
map2 (<|) parseFunc parseArg
{-| Just like the [`(|.)`](Parser#|.) from the `Parser` module.
-}
ignorer : Parser c x keep -> Parser c x ignore -> Parser c x keep
ignorer keepParser ignoreParser =
map2 always keepParser ignoreParser
Which is funny, because we CANNOT define infix operators in Elm, but Evan can 😜. Besides that, what is actually interesting is that, for conveniency, it is better to have infix operators for the keeper
and ignorer
(or the eater
operator, as Tereza Sokol called it in her nice talk 🤣) functions that allows us to consume or discard input, because it leads to somewhat more readable code™️.
If we want to find those functions in Haskell, let me confess something right now: I did not show you in the previous Applicative
post all there is to it regarding applicative functors, you have been lied to! 😈
Here is the complete definition of the Applicative
typeclass:
class (Functor f) => Applicative f where
{-# MINIMAL pure, ((<*>) | liftA2) #-}
-- | Lift a value.
pure :: a -> f a
-- | Sequential application.
(<*>) :: f (a -> b) -> f a -> f b
<*>) = liftA2 id
(
-- | Lift a binary function to actions.
-- ==== __Example__
-- >>> liftA2 (,) (Just 3) (Just 5)
-- Just (3,5)
liftA2 :: (a -> b -> c) -> f a -> f b -> f c
= (<*>) (fmap f x)
liftA2 f x
-- | Sequence actions, discarding the value of the first argument.
--
-- ==== __Examples__
-- If used in conjunction with the Applicative instance for 'Maybe',
-- you can chain Maybe computations, with a possible "early return"
-- in case of 'Nothing'.
--
-- >>> Just 2 *> Just 3
-- Just 3
--
-- >>> Nothing *> Just 3
-- Nothing
(*>) :: f a -> f b -> f b
*> a2 = (id <$ a1) <*> a2
a1
-- | Sequence actions, discarding the value of the second argument.
(<*) :: f a -> f b -> f a
<*) = liftA2 const (
In case you did not notice, lets have the type signatures side by side now:
(<*>) :: f (a -> b) -> f a -> f b
|=) : Parser c x (a -> b) -> Parser c x a -> Parser c x b (
and…
(<*) :: f keep -> f ignore -> f keep
|.) : Parser c x keep -> Parser c x ignore -> Parser c x keep (
So this means that the <*
and the <*>
operators are, respectively, the ignorer
and keeper
functions from elm/parser
!! 🤯🤯🤯
What about the *>
operator? Well, as you know, in Haskell we have this massive operator overflow, so it is exactly the same as <*
but it let’s us ignore the argument to the left of the operator, as we will now see in a REAL WORLD™️ parser code sample. 😉
A real world™️ Haskell parser example
To make sure we actually learned something about parser combinators in this post, let’s have a look at a portion of my language-avro
Haskell library:
-- | Parses a single import into the 'ImportType' structure.
parseImport :: MonadParsec Char T.Text m => m ImportType
=
parseImport "import"
reserved *> ( (impHelper IdlImport "idl" <?> "Import of type IDL")
<|> (impHelper ProtocolImport "protocol" <?> "Import of type protocol")
<|> (impHelper SchemaImport "schema" <?> "Import of type schema")
)where
impHelper :: MonadParsec Char T.Text m => (T.Text -> a) -> T.Text -> m a
= ct <$> (reserved t *> strlit <* symbol ";") impHelper ct t
If you are curious regarding how such a parser would look with other libraries (like
trifecta
), you can have a look at this code, which is surprisingly similar to Elm!
Before you freak out, let me explain to you what this piece of code is trying to parse: in the Avro IDL language (which is used for example, in Kafka), you can define imports of 3 different types:
import idl "foo.avdl";
import protocol "foo.avpr";
import schema "foo.avsc";
To represent this in Haskell, first we need a data type that we want our parser to return:
-- | Type for the possible import types in 'Protocol'.
data ImportType
= IdlImport T.Text
| ProtocolImport T.Text
| SchemaImport T.Text
deriving (Eq, Show, Ord)
And now we can proceed with the parsing stuff, you will notice a few interesting things
- There is a strange
reserved
combinator, which is a user defined combinator that is pretty clever and has the notion of comments/whitespace. - Similarly,
strlit
is also user defined and it helps us to parse string literals. - What the hell is
<?>
?? Another crazy operator?? Well, do not worry, it is just to give proper error messages when the parser get’s stuck, the Elm equivalent would be theproblem : String -> Parser a
function. 😉 - What is that scary
MonadParsec Char T.Text m => m
typeclass constrain? Well, I would gladly read that as justParser
in the example we were giving before, but since in my package I used themegaparsec
library, I did not want to lie to you again and show you the real type of the parser (more onmegaparsec
later).
Here is a similar parser, written with elm/parser
as a reference!
import Parser
exposing
|.)
( (, (|=)
, Parser
, chompIf
, chompWhile
, getChompedString
, oneOf
, spaces
, succeed
, symbol
)
type Import
= Idl String
| Protocol String
| Schema String
parser : Parser Import
parser =
let
importHelper :
String -> Import)
(-> String
-> Parser Import
importHelper ct t =
succeed ct
|. symbol t
|. spaces
|= strlit
|. symbol ";"
in
succeed identity
|. symbol "import"
|. spaces
|= oneOf
importHelper Idl "idl"
[ , importHelper Protocol "protocol"
, importHelper Schema "schema"
]
strlit : Parser String
strlit =
getChompedString <|
succeed ()
|. chompIf (\c -> c == '"')
|. chompWhile Char.isLower
|. chompIf (\c -> c == '.')
|. chompWhile Char.isLower
|. chompIf (\c -> c == '"')
output =
Parser.run parser "import protocol \"foo.avpr\";"
-- > Ok (Protocol "\"foo.avpr\"")
As you can see, the code is fairly similar, we only used the oneOf
instead of the <|>
operator, and the only complicated thing in Elm was figuring out how our strlit
combinator had to look like. (Obviously this implementation is not perfect, but it is good enough for educational purposes).
If you were able to understand the above Haskell code, congratulations, you know parser combinators already! 🎉🎉🎉
The state of parsers in the Haskell ecosystem
As opposed to Elm, where there is only one choice (elm/parser
), the Haskell ecosystem is much more rich and diverse, each one with their different tradeoffs. Here is an incomplete list of parser combinator libraries I’m aware of:
- Parsec
- Trifecta
- Attoparsec
- Megaparsec
- Earley
- … (many, many more!!)
The only one I’ve used in production and am a bit more familiar with is megaparsec
, and I learned a lot regarding how to use it from Mark Karpov excellent’s blogpost. I really recommend it since it is quite performant and it has some really nice and smart combinators that will spare you a ton of work.
Acknowledgements
Special thanks to @serras for technical proofreading this post again. 🙏🏻
Hope Parser
combinators finally clicked for you ✨ (if they had not already) and you learned something new. Ah! And in case you did not notice…
JSON Decoders are actually parser combinators! 🤯🤯🤯
So, as always, if you were doing JSON Decoders you were using parser combinators all along without noticing it! 😁
If you enjoyed this post and would like me to continue the series, please consider sponsoring my work, share it in your social networks and follow me on Twitter! 🙌🏻