Extensible Enums in Haskell

Posted on January 13, 2024

Your task is to write a library to work with an HTTP API of a service provider you use. A common pattern in these payloads is to use a string that is documented to be like an enumeration:

{ ...
  "detailed_result": "INVALID_TOKEN"
  ...
}

Where the values of the key "detailed_result" belong to a set of values which have some meaning attached to them.

You might be tempted, after reading their documentation, to enumerate each of the values in a sum-type:

data DetailedResult
  = InvalidToken
  | AddressMatchFailed
  ...
  deriving (Bounded, Eq, Enum, Show)

We can then use the aeson library to derive the JSON parser and serializer for us. This will work until the service provider decides to add new values to the pseudo-enumeration. It is not a true enumeration but an open set of string constants, one that the service provider might add new members to without informing you.
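To make the failure mode concrete, here is a minimal sketch using only base. The names toWire and fromWire are made up for this illustration and stand in for what the JSON instances would do. The value "CARD_EXPIRED" is also invented and not taken from any real API.

```haskell
import Data.List (find)

-- The closed enumeration, limited to the two values named in this post.
data DetailedResult
  = InvalidToken
  | AddressMatchFailed
  deriving (Bounded, Eq, Enum, Show)

-- Hypothetical helper: the wire string for each constructor.
toWire :: DetailedResult -> String
toWire InvalidToken       = "INVALID_TOKEN"
toWire AddressMatchFailed = "ADDRESS_MATCH_FAILED"

-- Hypothetical helper: search every constructor for a matching wire string.
-- `Nothing` means the provider sent a value we have never heard of.
fromWire :: String -> Maybe DetailedResult
fromWire s = find ((== s) . toWire) [minBound .. maxBound]
```

In this sketch fromWire "INVALID_TOKEN" gives Just InvalidToken, while fromWire "CARD_EXPIRED" gives Nothing. In a JSON decoder, that Nothing turns into a parse failure for the whole payload.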

In this post I will show you three ways we can handle this situation in Haskell. We want to make sure that our code only operates on valid values of this string. But we also want to make sure that we can accept new values without causing a parse failure. It’s also worth noting that we do trust this system to a degree and so storing unrecognized values is okay as long as we don’t use them.

The Cop Out: Avoid Types

This is the easiest solution. Don’t make it an enumeration or try to parse it at all. Treat the value as Text and match on the known values at run-time:

case detailedResult of
  "INVALID_TOKEN" -> _
  "ADDRESS_MATCH_FAILED" -> _
  _ -> error "Unrecognized detailed_result"

The trade-off here is that we will not be able to lean on the type checker to help us. We will have to document the valid, recognized values somewhere. Programmers using our code will have to know to look those values up… and we had better be careful about spelling errors in the string literals we’re matching against, because the compiler will not catch them.

The Open-Ended Sum Type

This approach has many of the benefits of encoding an enumeration in a sum type with the addition of a constructor that holds any unrecognized value.

data DetailedResult
  = InvalidToken
  | AddressMatchFailed
  ...
  | UnrecognizedDetailedResult Text
  deriving (Eq, Show)

We will have to write the JSON instances by hand here.
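As a rough sketch of what “by hand” could look like, here is a pair of conversion functions over Text. The names fromText and toText are my own for this illustration, and only the two constructors named above are covered. With aeson, a FromJSON instance could wrap the first one in something like withText "DetailedResult" (pure . fromText), and a ToJSON instance could be String . toText, though that part is left out here to keep the sketch free of extra dependencies.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)

data DetailedResult
  = InvalidToken
  | AddressMatchFailed
  | UnrecognizedDetailedResult Text
  deriving (Eq, Show)

-- Total: every input maps to some constructor, so parsing never fails.
fromText :: Text -> DetailedResult
fromText t = case t of
  "INVALID_TOKEN"        -> InvalidToken
  "ADDRESS_MATCH_FAILED" -> AddressMatchFailed
  _                      -> UnrecognizedDetailedResult t

-- Unrecognized values are written back out exactly as they were received.
toText :: DetailedResult -> Text
toText r = case r of
  InvalidToken                 -> "INVALID_TOKEN"
  AddressMatchFailed           -> "ADDRESS_MATCH_FAILED"
  UnrecognizedDetailedResult t -> t
```

Because the fallback constructor keeps the original text, a value we do not recognize survives a round trip through our code unchanged.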

All that is required from code that uses this type now is to avoid using the value of UnrecognizedDetailedResult when pattern matching.

processResult :: DetailedResult -> IO ()
processResult detailedResult = case detailedResult of
  InvalidToken -> putStrLn "Handling INVALID_TOKEN"
  AddressMatchFailed -> putStrLn "Handling ADDRESS_MATCH_FAILED"
  ...
  UnrecognizedDetailedResult _ ->
    -- Do not use the value of `UnrecognizedDetailedResult`...
    putStrLn "Skipping unrecognized result"

This approach allows us to lean more on the type system. It gives us the benefit of enumerating the valid, recognized values as constructors which will play nicely with our tooling. However the trade-off is that we will have to always handle the unrecognized case in our pattern matches. This means we have to avoid using the unrecognized value by convention since the type system will not prevent any callers from using it.

The Phantom Parameter

This solution uses type-level machinery. We want to be able to add a tag to our structure which will tell the type checker whether the value is recognized. This way we can write functions that only accept recognized values and get a type error if we make a mistake.

First we will need to use some language extensions:

{-# LANGUAGE GADTs #-}
{-# LANGUAGE EmptyDataDeriving #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE StandaloneDeriving #-}

Let’s define some empty types to represent whether a value is recognized or not:

data Recognized deriving (Eq, Show)
data Unrecognized deriving (Eq, Show)

We need the EmptyDataDeriving extension for these definitions. Normally you can’t derive stock instances for types with no constructors. We may want to be able to use these instances, at least when our code is under development, so we use this handy extension.

We will use these types as “tags” to tell the type system which values we consider “recognized”.

Next we will define a generalized algebraic data type or GADT which will take a type parameter:

data DetailedResult a where
  InvalidToken       :: DetailedResult Recognized
  AddressMatchFailed :: DetailedResult Recognized
  UnrecognizedResult :: String -> DetailedResult Unrecognized

deriving instance Show a => Show (DetailedResult a)

Using a GADT allows us to “set” the type of a with our constructors. When I say that we will be tagging the structure with a type, this is what I mean. The type parameter here will be used as a tag by us to tell the type checker whether our function cares about Recognized values or Unrecognized values. This means that we can write explicit types such as:

processDetailedResult :: DetailedResult Recognized -> IO ()
processDetailedResult _ = putStrLn "processing valid detailed result"

And we will get a type error if we try to pass in an unrecognized value:

ghci> processDetailedResult (UnrecognizedResult "WAT")
<interactive>:184:24-48: error: [GHC-83865]
     Couldn't match type ‘Unrecognized’ with ‘Recognized’
      Expected: DetailedResult Recognized
        Actual: DetailedResult Unrecognized
    ...

It also means that the body of processDetailedResult doesn’t have to handle unrecognized values because it will never accept them by definition.
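Here is a small self-contained sketch of that point. The function describeRecognized is a hypothetical variant of processDetailedResult that returns a String and matches on each constructor instead of using a wildcard. The Show instances are left out to keep it short.

```haskell
{-# LANGUAGE GADTs #-}
{-# OPTIONS_GHC -Wincomplete-patterns #-}

-- Empty tag types; they are never used at the value level.
data Recognized
data Unrecognized

data DetailedResult a where
  InvalidToken       :: DetailedResult Recognized
  AddressMatchFailed :: DetailedResult Recognized
  UnrecognizedResult :: String -> DetailedResult Unrecognized

-- Hypothetical helper: there is no UnrecognizedResult branch, yet the match
-- is exhaustive, because that constructor can never have the type
-- `DetailedResult Recognized`.
describeRecognized :: DetailedResult Recognized -> String
describeRecognized r = case r of
  InvalidToken       -> "Handling INVALID_TOKEN"
  AddressMatchFailed -> "Handling ADDRESS_MATCH_FAILED"
```

Even with incomplete-pattern warnings switched on, GHC should have nothing to complain about here, since the only constructors that can produce a DetailedResult Recognized are both covered.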

But why not use this type parameter on a regular sum-type?

-- Note the addition of the `a` type parameter...

data DetailedResult' a
  = InvalidToken'
  | AddressMatchFailed'
  ...
  | UnrecognizedResult' Text
  deriving (Eq, Show)

Well, that’s because we can’t “set” the type of a from a constructor. In an ordinary data declaration every constructor is polymorphic in a, so a stands for whatever type the caller chooses. This means that if we try to block unrecognized values from being processed with:

processResult' :: DetailedResult' Recognized -> IO ()
processResult' _ = putStrLn "Processing recognized result..."

The type signature says DetailedResult' Recognized, but that does not restrict the argument to recognized constructors:

ghci> processResult' (UnrecognizedResult' "WAT")
Processing recognized result...

It turns out the type of UnrecognizedResult' "WAT" is DetailedResult' a, which unifies with DetailedResult' Recognized when we apply processResult'. Every constructor has this polymorphic type, so each one will match the signature.

When we use a GADT, however, we plug this “for all types” hole by fixing the type parameter according to the constructor used. This tells the compiler what type a is for each constructor, and it uses this information when type checking calls to processDetailedResult.

Now let us write a basic text parser. It should give you an idea of how to write the JSON instances for this type:

fromText :: String -> DetailedResult ???

This isn’t going to work. What type do we need for ???? If we use a polymorphic type variable, we will get a type error as soon as we try to return InvalidToken, because its type is fixed to DetailedResult Recognized while the signature promises a result for every a the caller might choose, as in:

fromText :: forall a. String -> DetailedResult a

Recognized is only one of all those types. There’s one more special type we need to make this work:

data Some t where
  Some :: Show a => t a -> Some t

It might not be clear to you what this is useful for if you don’t have a strong grasp of GADTs and pattern matching yet. That’s okay! What this type is doing is filling in for our ??? type. This type is telling callers of our function that they will get some DetailedResult and they will have to figure out which one they have.

So we can change our type signature and fill in the definition like so:

fromText :: String -> Some DetailedResult
fromText s = case s of
  "INVALID_TOKEN" -> Some InvalidToken
  "ADDRESS_MATCH_FAILED" -> Some AddressMatchFailed
  ...
  _ -> Some $ UnrecognizedResult s

Callers can figure out which DetailedResult they have by using pattern matching.

Before we demonstrate its use, let’s add a helpful Show instance for Some DetailedResult:

-- This is why we need the FlexibleInstances extension
deriving instance Show (Some DetailedResult)

This means we can use fromText like so:

ghci> fromText "INVALID_TOKEN"
Some InvalidToken

And we can determine which kind of DetailedResult we have received like this:

case fromText "INVALID_TOKEN" of
  Some (UnrecognizedResult _) ->
    putStrLn "Cannot process unrecognized result"
  Some r@InvalidToken -> processDetailedResult r
  Some r@AddressMatchFailed -> processDetailedResult r

Did you notice the symmetry between the arrow in the Some constructor and the pattern match above? Pay attention to the “shapes” of the expressions: t a -> Some t and Some InvalidToken -> _. Even though the a is on the left side of the arrow in the GADT definition, we can recover it by matching on the constructor on the left side of the pattern match. It turns out this notion is generally useful, and there’s the some library for working with such types.

The benefit here is that we no longer have to write functions that always handle the case of unrecognized values. Instead we can write functions that only accept recognized values.

The trade-off here is that we need to express more of what we want to the type system, which requires a little more code and effort.
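To pull the pieces of this section together, here is a compact sketch that compiles on its own. It is a simplified variant rather than the exact code above: the Show constraint on Some is dropped, and toText, handle and dispatch are hypothetical names added for this illustration.

```haskell
{-# LANGUAGE GADTs #-}

data Recognized
data Unrecognized

data DetailedResult a where
  InvalidToken       :: DetailedResult Recognized
  AddressMatchFailed :: DetailedResult Recognized
  UnrecognizedResult :: String -> DetailedResult Unrecognized

-- Existential wrapper: hides the tag until we pattern match.
data Some t where
  Some :: t a -> Some t

fromText :: String -> Some DetailedResult
fromText s = case s of
  "INVALID_TOKEN"        -> Some InvalidToken
  "ADDRESS_MATCH_FAILED" -> Some AddressMatchFailed
  _                      -> Some (UnrecognizedResult s)

-- Serializing works for either tag, so it needs no wrapper.
toText :: DetailedResult a -> String
toText r = case r of
  InvalidToken         -> "INVALID_TOKEN"
  AddressMatchFailed   -> "ADDRESS_MATCH_FAILED"
  UnrecognizedResult s -> s

-- Hypothetical handler that only accepts recognized values.
handle :: DetailedResult Recognized -> String
handle r = "processing " ++ toText r

-- Hypothetical entry point: decide once, at the boundary, what we received.
dispatch :: String -> String
dispatch s = case fromText s of
  Some (UnrecognizedResult _) -> "skipping unrecognized result"
  Some r@InvalidToken         -> handle r
  Some r@AddressMatchFailed   -> handle r
```

In this sketch dispatch "INVALID_TOKEN" gives "processing INVALID_TOKEN" and dispatch "WAT" gives "skipping unrecognized result". The unrecognized case is handled exactly once, in dispatch, and everything downstream of handle never sees it.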

The Symbolic Approach

We can go even further with the type system in Haskell. With some more extensions and libraries it’s possible for us to write code that will promote our "detailed_result" values to type-level Symbols… so long as the values are valid Haskell symbols as well.

However, doing so doesn’t remove any of the trade-offs listed in the previous section, and it makes the code significantly more difficult to read, since it relies on type inference rather than terms to determine which value is being matched.

I may discuss this approach in a future post.

Conclusion

Use the approach that is sufficient for the task at hand.

I would go for the Cop Out for a task-oriented script where the code isn’t going to be shared or re-used. A simple comment enumerating the values or a URL where the reader can find them is good enough. When it’s more important to get the job done and out of the way this will be my preferred approach.

However if I intend to write a library to interact with this service that will be shared and re-used in many places I would prefer the Phantom Parameter. The added benefit of tagging the values that are recognized means that we can eliminate the need to handle unrecognized values in every function while still getting the benefit of exhaustive pattern matching and support from the type checker.

If I am working with a team that is predominantly junior-to-intermediate Haskell developers that will be maintaining this library I might consider the Open-Ended Sum Type since it requires a little less Haskell knowledge to get going with and is still type safe. We may have to be careful about handling unrecognized values but hopefully this can be caught by code review and testing.

All in all, how much you leverage the type system is up to you. Happy hacking!