Wanted: a short and clever morse code decoder in Haskell. This article presents a series of successively smaller solutions that gradually include computer science concepts like dichotomic search, stack-based languages and finally deforestation, an optimization technique specific to purely functional languages like Haskell.
Table of contents
This article previously appeared in issue 14 of The Monad.Reader.
In a post to reddit, user CharlieDancey presented a challenge to write a short and clever morse code decoder. Of course, what could be more clever than writing it in Haskell? ;-)
In the following, I’ll present a series of solutions that gradually include dichotomic search (the straightforward generalization of binary search), stack-based languages or reverse polish notation, and finally deforestation (also called fusion), an optimization technique specific to purely functional languages like Haskell.
For your convenience, source code for the individual sections is available.
Morse code was designed to be produced and read, or rather heard by humans, who have to memorize the following table:
To write a program that decodes sequences of dots and dashes like
-- --- .-. ... . -.-. --- -.. .
into letters, we will have to store this table in some form. For instance, we could store each letter and corresponding morse code in a big association list
type MorseCode = String
dict :: [(MorseCode, Char)]
dict =
[(".-" , 'A'),
("-..." , 'B'),
("-.-." , 'C'),
...
which we name dict
because it’s a dictionary translating between the latin alphabet and morse code. To decode a letter, we simply browse through this list to determine whether it contains a given code of dots and dashes
decodeLetter :: MorseCode -> Char
decodeLetter code = maybe ' ' id $ lookup code dict
Decoding a whole sequence of letters is done with
decode = map decodeLetter . words
so that we have for example
> decode "-- --- .-. ... . -.-. --- -.. ."
"MORSECODE"
Of course, association lists are a rather slow data structure. A binary search tree, trie, or hash table would be a better choice. But fortunately, it is very easy to change data structures in Haskell. All you have to do is to a qualified import of a different module, such as Data.Map, and change the definition of dict
to
dict :: Data.Map.Map MorseCode Char
dict = Data.Map.fromList $
[(".-" , 'A'),
("-..." , 'B'),
("-.-." , 'C'),
...
and use Data.Map.lookup
instead of Data.List.lookup
.
Faithfully replicating the morse code table is clear and preferable, but makes for quite large source code. So, let’s make an exception today and think of something as clever as we can.
The idea is the following: whenever an encoded letter starts with a dash '-'
, we know that it can’t be for example an A or E, because those start with a dot '.'
. And if the next symbol is a dot, then it can’t be a G or M because their second symbol is a dash. With each symbol we read, more and more alternatives disappear until we are left with a single alternative. We can depict this with a binary tree:
At each node, the left subtree contains all the letters that can be obtained by adding a dot to the current letter, while the right subtree contains those letters that can be obtained with a dash. This is illustrated by means of a dotted line connected to the left subtree and a dashed line connected to the right subtree.
Now, to decode a letter, we simply start at the root of the tree and follow the dotted or dashed lines depending on the symbols read. For instance, to decode the sequence "-..."
, we have to go right once and then left thrice, ending up at the letter B. This procedure is also called dichotomic search because at each point in our search for the right letter, we ask a yes/no question “Dot or dash?” (a dichotomy) that partitions the remaining search space into to disjoint parts.
Let’s implement this in Haskell. First, we assume that our dictionary is given as a tree
data Tree a = Leaf
| Branch { tag :: a
, left :: Tree a
, right :: Tree a }
dict :: Tree Char
dict = Branch ' ' (Branch 'E' ... ) (Branch 'T' ...)
Of course, writing out the dict
tree in full is going to be repetitive and boring, but let’s just imagine we have already done that. Then, decoding a letter by walking down the tree is simply a left fold over the code word
decodeLetter = tag . foldl (flip step) dict
where
step '.' = left
step '-' = right
In this case, the accumulated value of the fold is the current subtree, although the term “accumulate” is a bit misleading in that we don’t make the dictionary bigger; we make it smaller by passing to a subtree.
The curious reader may notice that we have actually implemented a trie.
Now, let’s reduce the source code needed for the dictionary tree. In Haskell, we’d have to write something like
dict = Branch ' ' (Branch 'E' ... ) (Branch 'T' ...)
Compared to the association list, we got rid of the dots '.'
and dashes '-'
, they have been made implicit in the structure of our tree. But we still need a lot of parenthesis and applications of the Branch
function.
A clever way to get rid of parenthesis is reverse polish notation, commonly used in stack-based languages. The idea is that a program in such a language is a sequence of instructions that pop and push values from a stack. For example,
1 2 +
is a program to calculate 1 plus 2. Reading from left to right, the first instruction is 1
which pushes the integer 1 onto the stack. The second instruction is 2
, pushing 2 onto the stack. And finally, the instruction +
pops the two topmost integers from the stack and pushes their sum onto the stack. This procedure is visualized as follows:
To see how that eliminates the need for parenthesis, take a look the program
1 2 + 3 4 + +
which calculates the sum (1+2)+(3+4).
To build our dictionary tree, we are going to devise a very similar language. Instead of integers, the stack will store whole trees. In analogy to the numerals, there will be an instruction for pushing a Leaf
onto the stack; and in analogy to +
, there is going to be an instruction for applying the Branch
constructor to the two topmost subtrees.
Here’s the program for building the tree, stored as a string:
program :: String
program = "__5__4H___3VS__F___2 UI__L__+_ R__P___1JWAE"
++ "__6__=B__/_XD__C__YKN__7_Z__QG__8_ __9__0 OMT "
Each character represents one instruction. The underscore '_'
pushes a Leaf
onto the stack while all the other character push a Branch
tagged with the character in question onto the stack. The interpreter for this tiny programming language can be written using a simple left fold:
dict = interpret program
where
interpret = head . foldl exec []
exec xs '_' = Leaf : xs -- push Leaf
exec (x:y:xs) c = Branch c y x : xs -- combine subtrees
The stack xs
is represented as a list of trees.
We have found a concise way to represent the morse code dictionary in source code, but cleverness does not end here. For instance, how about eliminating the tree entirely?
More precisely, it seems wasteful to construct the dictionary tree by creating leaves and connecting branches only to deconstruct them again when decoding letters. Wouldn’t it be more efficient to interpret the morse code tree as a call graph rather than as a data structure Tree Char
?
In other words, imagine say a function e
corresponding to the node labeled ‘E’ in the tree, and working as follows:
e :: MorseCode -> Char
e ('.':ds) = i ds
e ('-':ds) = a ds
e [] = 'E'
If the next symbol is a dot or dash, we proceed with the functions i
or a
corresponding to 'I'
and 'A'
respectively. And in case we have already reached the end of the code, we know that we have decoded an 'E'
. The functions i
and a
work just like e
, except that they proceed with other letters; and the morse code tree becomes their call graph.
Now, writing all these functions by hand would be most error-prone and tedious, they all look the same. But abstracting repetitive patterns is exactly where functional programming shines; we are to devise a combinator that automates the boring parts of the code for us.
What does this combinator look like? Well, the non-boring parts are the letter it represents and the two follow-up functions, so these will be parameters:
combinator :: Char -- letter
-> (MorseCode -> Char) -- function for dot
-> (MorseCode -> Char) -- function for dash
-> (MorseCode -> Char) -- result function
That’s all we need; we can implement
combinator c x y = \code -> case code of
'.':ds -> x ds
'-':ds -> y ds
[] -> c
which allows us to write
e = combinator 'E' i a
i = combinator 'I' s u
... etc ...
But wait, what have we done? The type signature of combinator
looks exactly like that of
Branch :: Char -- letter
-> Tree Char -- tree for dot
-> Tree Char -- tree for dash
-> Tree Char -- result tree
but with the type Tree Char
replaced by (MorseCode -> Char)
! In fact, let’s rename the combinator to lowercase branch
branch c x y = \code -> case code of
'.':ds -> x ds
'-':ds -> y ds
[] -> c
and observe that the tree of functions now reads
e = branch 'E' i a
i = branch 'I' s u
... etc ...
or
e = branch 'E' (branch 'I' ... ) (branch 'A' ...)
if we inline the function definitions. But this is of course just the tree from the section on dichotomic search with each Branch
replaced by branch
!
In other words, branch
is like a drop-in replacement for the constructor Branch
. To implement the morse code tree directly as functions instead of as an algebraic data type, we simply replace every occurrence of Branch
with branch
in our previous code. Thus, we have a new implementation
dict :: MorseCode -> Char
dict = interpret program
where
interpret = head . foldl exec []
exec xs '_' = leaf : xs -- push leaf
exec (x:y:xs) c = branch c y x : xs -- combine subtrees
decodeLetter = dict
where
leaf = undefined
is a trivial replacement for the Leaf
constructor.
To further understand how and why the replacement of Branch
by branch
works, it is instructive to derive it systematically from just the program text.
In particular, consider the implementation of decodeLetter
from the earlier section on dichotomic search:
decodeLetter :: MorseCode -> Char
decodeLetter = tag . foldl (flip step) dict
where
step '.' = left
step '-' = right
In this formulation, the focus is on the recursion over the input string performed by foldl
, neglecting the recursive descent into the dictionary tree. Let’s systematically rewrite this code to highlight the latter.
First, we interpret it as converting the dictionary tree into a function that decodes letters
decodeLetter = decodeWith dict
decodeWith :: Tree Char -> (MorseCode -> Char)
decodeWith dict = tag . foldl (flip step) dict
where
step '.' = left
step '-' = right
Then, we turn the foldl
into explicit recursion
decodeWith dict [] = tag dict
decodeWith dict (d:ds) = decodeWith (step d dict) ds
where
step '.' = left
step '-' = right
and dissolve the step
function into the pattern matching
decodeWith dict [] = tag dict
decodeWith dict ('.':ds) = decodeWith (left dict) ds
decodeWith dict ('-':ds) = decodeWith (right dict) ds
We group the pattern matching on the input code into a case expression
decodeWith dict = \code -> case code of
[] -> tag dict
('.':ds) -> decodeWith (left dict) ds
('-':ds) -> decodeWith (right dict) ds
and finally, we replace the field selectors by pattern matching on the tree, also noting the case of Leaf
:
decodeWith Leaf = undefined
decodeWith (Branch c x y) = \code -> case code of
[] -> c
('.':ds) -> (decodeWith x) ds
('-':ds) -> (decodeWith y) ds
The recursion on the tree is apparent now, decodeWith
simply traverses both subtrees and combines the results. We can make this even more evident by appealing to the combinator we defined in the previous section:
decodeWith Leaf = leaf
decodeWith (Branch c x y) = branch c (decodeWith x) (decodeWith y)
In other words, decodeWith
takes a tree and simply substitutes each Leaf
constructor with leaf
and each Branch
constructor with branch
.
Of course, first using Branch
and then replacing it with branch
is a waste; we should rather use branch
from the start and thus cut away the intermediate tree. That’s exactly what we’ve done in the previous section!
Replacing constructors is a very general pattern, not restricted to binary trees. For instance, you probably know following function:
foldr f z [] = z
foldr f z (x:xs) = x `f` foldr f z xs
It’s the good old fold! And at its heart, it just replaces the constructors of the list data type: z
is the substitute for the empty list []
, and f
is put in lieu of the (:)
constructor.
In its honor, any function that substitutes constructors in such fashion is known as a generalized fold. A more exotic but commonly used alternative name is “catamorphism”. In other words, decodeWith
is a catamorphism.
In this generality, the idea of deforestation is that instead of first constructing a data structure and then chopping it up with a catamorphism, it is more efficient to saw the pieces properly at the time of creation. For example, creating a list and then adding all elements
sum [1..100] = foldr (+) 0 (enumFromTo 1 100)
is less efficient than adding the elements as they are created
sumFromTo a b -- changes to enumFromTo
| a > b = 0 -- 0 replaces []
| otherwise = a + sumFromTo (a+1) b -- + replaces (:)
Merging two functions in this fashion is also called fusion.
Performing deforestation manually, as we did, sacrifices reusability: the morse code tree only exists as a call graph and cannot be printed out anymore, and sumFromTo
is not nearly as useful as are sum
and enumFromTo
were.
Hence, the goal is to teach the compiler to automatically fuse catamorphisms with their structure creating counterparts, the so called anamorphisms. That’s what efforts like a short-cut to deforestation and the recent stream fusion are set out to do, yielding dramatic gains in efficiency while preserving the compositional style.
I hope you had fun doodling around with morse code; I certainly did. If you long for more, here a few suggestions:
While we’re at it, the intermediate lists created by words
in
decode = map decodeLetter . words
can be deforested as well and fused into the letter decoding. Try it yourself, or see the example source code.
Polish notation, the converse of reverse polish notation, puts function symbols before their arguments and is thus closer to the Haskell syntax. When the arities of the functions are known in advance, like 2 for Branch
and 0 for Leaf
, this notation doesn’t require parenthesis either. Write a program
in polish notation and a corresponding tiny interpreter to create the morse code tree.
Since deforestation is supposed to make things faster, how about comparing our deforested morse code tree to say an implementation in C? But instead of doing benchmarks, let me give two general remarks on the machine representation.
First, one might think that the call graph we’ve built is compiled to function calls, with e
calling i
or a
and each function having different machine code, just as we originally intended. But this is actually not the case. When defined with branch
, the functions are represented as closures, sharing the same machine code but carrying different records that store their free variables c
, x
and y
. This is not very different from storing c
, x
and y
in a Branch
constructor. Template Haskell or some kind of partial evaluation would be needed to hard-code the free variables into the executable. For more on the Haskell execution model, take a look at the GHC commentary.
Second, all our implementations decipher a letter by follow a chain of pointers: descending a tree means following pointers to deeper nodes, and making procedure calls means repeatedly jumping to different code parts. This of course raises questions of cache locality, branch prediction or memory requirements for storing all these pointers.
In C, however, there is another technique available: instead of following a chain of pointers to get to the destination, the address of the final stop is calculated directly with pointer arithmetic. The following example illustrates this:
#include <stdio.h>
#include <string.h>
char buf[99], tree[] = " ETIANMSURWDKGOHVF L PJBXCYZQ ";
int main() {
while (scanf("%s", buf)) {
int n, t=0;
for(n=0; buf[n]!=0; n++)
t = 2*t + 1 + (buf[n]&1); /* compute destination */
putchar(tree[t]); /* fetch letter */
}
}
Here, the tree is represented as an array and paths to nodes are encoded as (zero-less) binary numbers. The program calculates the path t
to the proper letter and then fetches it with an O(1) memory lookup. Compare also the morse code translator by reddit user kayamon.
Since the number of memory accesses is kept to a minimum, this technique is the most efficient; but it is impossible to reproduce with algebraic data types alone. Fortunately, arrays libraries are readily available in Haskell as well.
The main drawback of this approach, and of arrays in general, is the lack of clarity; indices are notorious for being non-descript and messy. The raw index arithmetic in the example C code above sure seems like magic!
Such magic is best hidden behind a descriptive abstract data type. How about rewriting the example in Haskell such that decodeLetter
looks exactly like the one in the section on dichotomic search? In other words, the abstract data type is to support the functions left
, right
and tag
. One possible solution can be found in the source code accompanying this article.