citizen428.blog()

Try to learn something about everything

Tom and Michael vs the Monad

So, over the last few weeks Tom Crinson and I read another paper, this time “The essence of functional programming” (available here) by Philip Wadler, one of the principal designers of Haskell. Tom’s post is here.

The paper

Don’t let yourself be fooled by the title: this is not an introduction to functional programming or lambda calculus. Instead it talks about monads, a topic one reads a lot about and that’s often made out to be a lot more complicated than it really is.

The whole paper is 23 pages long, so I won’t really summarize it all, but rather try to give you an idea of whether it might be an interesting read for you. Section 1 lays out some basics and explains the difference between pure and impure code. In section 2 the reader finally encounters the monad, defined using the usual triple (M, unitM, bindM). After that we see several versions of a simple call-by-value interpreter, demonstrating various well-known monads (identity, state, error, list, etc.), as well as a rewrite of the interpreter to implement call-by-name semantics. For some more theory, Wadler lays down the monad laws and provides a definition of monads in terms of mapM and joinM. He also mentions that list comprehension notation generalizes to arbitrary monads, which I found quite interesting. Section 3 introduces continuation-passing style (CPS) and its relation to monads, before section 4 talks about the influence of monads on Haskell, including some interesting tidbits about early Haskell versions that still lacked do-notation and the bind operator (the paper is from 1992 after all). Last but not least there’s the wrap-up, which puts the work in the context of other research.
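
For readers who haven’t seen the paper yet, here’s a rough Haskell sketch of what that triple looks like, using the identity monad (the simplest instance from section 2) to make the signatures concrete; the naming follows the paper, the little example computation is my own:

type I a = a

unitI :: a -> I a
unitI a = a

bindI :: I a -> (a -> I b) -> I b
a `bindI` k = k a

-- A tiny computation chained with bindI; for the identity monad this
-- boils down to ordinary function application.
example :: I Int
example = unitI 5 `bindI` \x -> unitI (x + 1)   -- 6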

Takeaway

Monads are not rocket science; they are just another tool in a programmer’s toolbox, especially for a programmer in a pure functional language. In the end, a monad really isn’t much more than a formalized computation, following certain rules like “I put in an A and get out a B, so whatever comes next has to take a B”. OK, maybe that was a bit oversimplified, but it’s definitely not magic. In fact I wrote up an explanation for Tom, trying to get the basic idea across:

If you imagine a monad as some sort of box, unitM is what puts an item into the box. bindM extracts such a value (M a), processes it with a function (a → M b) and returns another item in a box (M b). Why is that cool? Because you can’t leave the box! If you think about it – and this is an oversimplification – it’s a bit like aligning your types so you can chain method calls (except that unit and bind have to follow the monad laws). Imagine in Ruby you have an intermediary result, but instead of returning a plain value OR an array, you always make sure to return [*values]. This way you can chain anything that expects an array, no matter how many intermediary values there were. That’s a bit like having an item wrapped up in some sort of array monad (once again, we are in oversimplification land here, but maybe the analogy helps).
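
To make the box analogy a little more concrete, here’s a small Haskell sketch (not from the paper, just an illustration) using Maybe as the box: return puts a value into the box, >>= (bind) hands the boxed value to the next step, and the result never leaves the box.

halve :: Int -> Maybe Int
halve n
  | even n    = Just (n `div` 2)
  | otherwise = Nothing

-- 12 is halved twice; the result stays wrapped in Maybe.
chained :: Maybe Int
chained = return 12 >>= halve >>= halve   -- Just 3

-- One more step would give Nothing (3 is odd), but still inside the box:
-- return 12 >>= halve >>= halve >>= halve   -- Nothing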

Difficulty of the paper

The author states that he assumes no previous knowledge of category theory or Haskell (the language used in the examples), but judging by Tom’s reaction to the paper, I’m not convinced of that. For somebody who has seen Haskell before and/or already has a basic understanding of monads, this is a pretty good and easy-to-follow read though.

Q&A

Since Tom had some trouble with the paper – but bravely fought his way through it – we exchanged quite a few emails about the actual content. This is cool, because it’s exactly the reason we are doing this! First we needed to get some Haskell syntax out of the way:

Let’s do it slow with actual Haskell (this is a stupid function though, you could just do add = (+)):

add :: Int → Int → Int
add x y = x + y

This means that “add” is a function that takes two “Ints” and returns an “Int”. The type after the last → is the actual return type of the function. If you wonder about the intermediate arrows, that’s because Haskell functions technically only take one argument and return a new function, which takes another argument, and so on.

So what happens when you call

add 5 6

is that “add” gets applied to the argument 5, which returns another function roughly equivalent to (add y = 5 + y); applying that function to 6 then yields the number 11. This is a result of Haskell being based on the lambda calculus, where the equivalent expression would be \x.\y.x+y. For practical purposes it’s enough to remember the following though: whatever goes before the last → are the arguments to the function, the thing after the last → is the return type.
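
To see that intermediate function explicitly you can simply give it a name (addFive is just an illustrative name, not something from the paper):

add :: Int -> Int -> Int
add x y = x + y

-- Partially applying add to 5 yields a new one-argument function.
addFive :: Int -> Int
addFive = add 5

eleven :: Int
eleven = addFive 6   -- 11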

What about polymorphic functions? Easy:

add :: (Num a) => a → a → a
add x y = x + y

Here we just say that we take 2 arguments of type “a” and return another “a”, with the constraint that “a” has to be numeric.
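
A quick illustration (again mine, not the paper’s) of what the constraint buys us: the same definition works for any numeric type.

add :: (Num a) => a -> a -> a
add x y = x + y

intSum :: Int
intSum = add 1 2          -- 3

doubleSum :: Double
doubleSum = add 1.5 2.5   -- 4.0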

Since it’s also covered in the paper, here’s an example of a higher-order function, good old map (that’s the same as Ruby’s map):

map :: (a → b) → [a] → [b]
map f [] = []
map f (x:xs) = f x : map f xs

This basically tells us that “map” takes two arguments: the first is itself a one-argument function ((a → b)), the second a list of values ([a]), and we return another list of values ([b]). Note that “a” and “b” can be the same type, but don’t have to be.
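
A quick usage example (mine, not the paper’s) showing both cases – first a and b are the same type, then they differ:

incremented :: [Int]
incremented = map (+ 1) [1, 2, 3]   -- [2, 3, 4]

flags :: [Bool]
flags = map even [1, 2, 3]          -- [False, True, False]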

Could you please, as simply as possible, define a Monad?
I have to quote Brian Beckman on that, “Monads are a theory for interacting things”. Or to paraphrase, Monads are an abstraction that allows you to define how the various stages of a computation interact. It’s a sequence of operations, put into context.
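
A small sketch of what “put into context” means in practice: the same two-step pipeline behaves differently depending on which monad supplies the context. This is purely illustrative and not from the paper.

pipeline :: Monad m => m Int -> (Int -> m Int) -> m Int
pipeline start step = start >>= step >>= step

-- In Maybe, the context means: stop as soon as a step fails.
maybeResult :: Maybe Int
maybeResult = pipeline (Just 3) (\x -> if x < 10 then Just (x * 2) else Nothing)
-- Just 12

-- In the list monad, the context means: explore every combination.
listResult :: [Int]
listResult = pipeline [1, 2] (\x -> [x, x + 10])
-- [1,11,11,21,2,12,12,22]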

What would you use a Monad for?
As seen in the paper, Monads are quite a flexible abstraction and can be used for many different things (take for example the MapReduce Monad). Personally I’ve never used one outside of my limited Haskell experience, but .Net developers may find it interesting to know that LINQ is a monad (more on that here).

Can I use them in Ruby?
You can; in fact there’s a link listed below that shows simple monads implemented in Ruby. It’s more a question of how useful they actually are in a mostly object-oriented language that allows for side effects.

If I wanted to learn more about this sort of thing where should I go?
Brian Beckman recorded an awesome 67-minute video on the topic, which is probably the best introduction I’ve ever seen.

Links

Information Overload 2011-05-15

Information Overload 2011-05-08

So You Want to Learn Clojure?

For last Monday’s Lambdaheads meeting I prepared a list of resources for people interested in learning Clojure, so I thought I’d share it here. The actual presentation had a lot of comments, but I’m afraid I’m too lazy to repeat them in writing. Maybe the list is still useful for some people though:

Books

Books WIP

Online reading

Coding

Screencasts / Videos

Community

Tom and Michael vs Google’s MapReduce

The other day on Twitter I complained about my own laziness when it comes to reading academic papers, and Tom Crinson chimed in to let me know that he’d also like to read more of them. That’s how we decided to set up a sort of online reading group for this type of thing, where we’ll choose a paper to read, discuss it via email and then blog about it (Tom’s post is here).

The first in this series of blog posts is about Google’s MapReduce framework:

MapReduce: Simplified Data Processing on Large Clusters

The paper and a presentation based on the paper

Summary (by Tom Crinson)

This is a paper on an architectural software paradigm and framework that Google uses to process or generate vast amounts of data on commodity hardware.

The basic elements are that each task is split into two stages:

  • Mapping – a user-defined processing function is applied to each piece of the input, converting it into an intermediate key/value format.
  • Reducing – a second stage of processing takes the intermediate output from the mapping stage and applies user-defined aggregation functions to generate the desired output (see the sketch after this list).
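
Here’s a rough, single-machine Haskell sketch of the paper’s word-count example to make the two stages concrete. The function names and the toy “framework” are mine; the real framework of course distributes this work across many machines.

import qualified Data.Map.Strict as Map

-- Map stage: one input record (here a line of text) is turned into a
-- list of intermediate key/value pairs.
mapFn :: String -> [(String, Int)]
mapFn line = [(word, 1) | word <- words line]

-- Reduce stage: all values collected for one intermediate key are aggregated.
reduceFn :: String -> [Int] -> Int
reduceFn _word counts = sum counts

-- Toy driver: group the intermediate pairs by key, then hand each group
-- to the reducer.
mapReduce :: [String] -> Map.Map String Int
mapReduce inputs =
  Map.mapWithKey reduceFn $
    Map.fromListWith (++) [ (k, [v]) | (k, v) <- concatMap mapFn inputs ]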

The paper indicates that many different types of tasks can be split up in this way. Obvious ones for Google are processing pages, indexes, links etc., but people are also using the framework for artificial intelligence tasks. This variety proves the flexibility of the paradigm.

The idea itself is pretty simple, and we do it a lot in day-to-day programming: get a set of data, do some operation on all of the elements and then perform some kind of aggregation – summing, for example. Google just took it one step further, at a huge scale. They talk about setting up clusters of thousands of machines to do the processing. This is where their framework comes in.

The framework takes all of the pain out of distributing the tasks, setting workers onto individual jobs and handling input and output for each. It also takes care of the nitty-gritty of fault tolerance, so for example if a worker dies the master process will notice and assign another worker to re-execute all of the tasks that the dead worker had done. The master will also take care of de-duplicating the jobs so your results are as they should be. There’s also a pretty nice feature for handling “stragglers” (slow tasks): the framework preemptively sets another set of workers to duplicate the effort on the last few jobs, just in case there is an issue with a worker machine that is slowing the process down. This was shown to have a huge effect on task completion time; without the backup processing, completion took something like 44% longer.

In case of master failure the whole process stops, but the master writes periodic “checkpoints” as it goes along, so after a crash a new master can be started up and continue from the last checkpoint. This is a manual process though, as the user should probably check whether any of the input caused the failure.

Another very handy feature is bad record detection. If there is a spurious data element that for some reason causes a worker to hang repeatedly, the master is notified and on the next attempt to process that record set, the bad record is skipped.

There are some other key elements to the system, like splitting the input so it can be assigned to the different workers. This uses a default hash function on the key, but a user-specified function can be supplied if you don’t simply want hash(key) mod numWorkers, as this is quite simplistic. See an article on consistent hashing for alternatives and for why the naive modulo approach can be problematic. The input data tends to consist of plain input files, however an API is exposed to allow custom data providers, which is a nice feature.
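
As a tiny illustration of that default partitioning scheme (my sketch, using the hashable package; the names are not from the paper): an intermediate key is assigned to one of the workers by hashing it and taking the remainder.

import Data.Hashable (hash)

partitionFor :: Int -> String -> Int
partitionFor numWorkers key = hash key `mod` numWorkers
-- e.g. partitionFor 4 "foo" yields a worker index between 0 and 3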

Takeaway

For me the key takeaway was the following sentence: “[it] allows programmers who have no experience with distributed and/or parallel systems to exploit large amounts of resources easily”, which basically means that MapReduce takes an inherently complex task and abstracts all the difficult bits away, so that programmers can focus solely on the data manipulation/transformation aspects.

Difficulty of the paper

Very easy. It’s a simple textual description of MapReduce including good examples. There’s basically no math and the language is easy to understand and not full of computer science jargon, so even less technically oriented people should be able to understand most of it. The presentation also serves as an excellent summary.

Examples and further reading, food for thought

I hacked together a little something in IRB to illustrate the basic workings of the algorithm (it’s not perfect, but good enough for our purposes):

>> words = "foo bar baz qux foo bar lala"
#=> "foo bar baz qux foo bar lala"
>> # map (user-supplied)
..   mapped = words.split(/ /).map { |w| [w,1] }
#=> [["foo", 1], ["bar", 1], ["baz", 1], ["qux", 1], ["foo", 1], ["bar", 1], ["lala", 1]]
>> # aggregate (built-in)
..   aggregated = mapped.group_by { |w, c| w }
#=> {"foo"=>[["foo", 1], ["foo", 1]], "bar"=>[["bar", 1], ["bar", 1]], "baz"=>[["baz", 1]], "qux"=>[["qux", 1]], "lala"=>[["lala", 1]]}
>> # reduce (user-supplied)
>> aggregated.each_with_object({}) { |(k, v), h| h[k] = v.map(&:last).inject(:+) }
#=> {"foo"=>2, "bar"=>2, "baz"=>1, "qux"=>1, "lala"=>1}

When thinking about the algorithm a bit more, it also dawned on me that MapReduce sounds a lot like a Monad. When I started searching for this, it soon turned out that someone else had the same idea:

MapReduce as a monad

Then there’s also a nice article I found on this last week, more or less by accident:

Map / Reduce – A visual explanation

Last but not least some links to MapReduce related frameworks and libraries:

Hadoop
BashReduce
Skynet
Cascalog

Information Overload 2011-05-01

Information Overload 2011-04-24

Ruby Book Recommendations

As Ruby matured and became part of the programming mainstream (at least to a certain extent), publishers got on the bandwagon and now supply us with more and more Ruby books, many of which are really good. Alas, time is limited and most of us won’t manage to read them all, so I thought I’d write up my current recommendations.

From zero to hero

For people new to Ruby, there are three books I consider absolute must-reads:

  • The Well-Grounded Rubyist by David A. Black:
    This book really lives up to its title and gives the reader a solid grounding that will be a great basis for diving deeper into the language (I wrote a longer review here). Alternatively you might want to consider Programming Ruby by Dave Thomas, with Chad Fowler and Andy Hunt, which is also an excellent book and a standard recommendation for Ruby newbies.
  • Eloquent Ruby by Russ Olsen:
    This book is a true gem! Mostly written for people coming to Ruby from other programming languages, Russ Olsen doesn’t dwell on banalities, but instead teaches you good, idiomatic Ruby. The presented material is excellent and the writing is witty and clever. If you are new to programming this is probably not the best book for you, but if you have experience in some other language or “just” want to get more fluent in Ruby, “Eloquent Ruby” is hard to beat.
  • Ruby Best Practices by Gregory T. Brown:
    Once you’ve worked your way through the other two books or have already coded in Ruby for a while, you should definitely check out “Ruby Best Practices”. As the title suggests, this is not about learning Ruby; it’s about getting the most out of the language, for yourself as well as for potential collaborators and colleagues. No matter how long you’ve been doing Ruby, I’m almost certain you’ll still pick up a little trick here and there.

Now obviously you can’t just sit down, read those three books and become a great Rubyist. Start reading, work with the text, use IRB to experiment and find yourself some challenges to try your hands on (e.g. the Ruby Koans or Code Katas).

Advanced Rubyists

  • The Ruby Programming Language by David Flanagan and Yukihiro Matsumoto:
    I guess this wouldn’t really need to be mentioned, but for completeness’ sake I’ll include it here. While I don’t think it’s a great book for learning the language, it’s something every serious Ruby developer should own as a reference.
  • Refactoring: Ruby Edition by Jay Fields:
    Granted, this may not sound exciting, but “Refactoring: Ruby Edition” is a really good book that will come in handy once you have to work on a somewhat significant Ruby codebase. Think of it as design patterns for refactoring.

Of course “Ruby Best Practices” also falls in this category, but I already mentioned it in the first section.

What Ruby books do you recommend and why?

Information Overload 2011-04-17

ClojureX Is Back

Considering how many tools for getting up and running with Clojure are available nowadays (e.g. cljr, cake or Dejour), I’m not sure if anyone still cares, but as of today I’m actively working on ClojureX again.

There’s no particularly good reason for that, but I noticed I’ve been absent from the Clojure community for way too long and would like to change that, so see this as a first step.
