Functional Programming and Big Data

Saturday, November 12, 2016

Update: I wrote this while preparing a talk for the Columbus Functional Programmers meetup. You can find the talk on YouTube. It has more humor than these words, but then you'd have to listen to my voice.

This post is a long one, so here’s a brief roadmap. We’ll start with a quick introduction to functional programming, then cover Apache Spark and the history of big data, and after that we’ll get to a hands-on demo of Spark. Okay, are you with me? Let’s go!

Intro to Functional Programming

Motivation

First of all, why should you even care about functional programming?

Simply put, functional programming matters because it is a big part of the future of the software industry. The industry is buzzing about functional programming (FP). Elements of FP are working their way into most mainstream languages. Even C++ and Java, stalwarts of the procedural object-oriented camp, have adopted lambda functions. It is less common to see FP adopted wholesale, but functional languages like Scala, F#, and Clojure are gaining in popularity. Although uncommon, companies are even using Haskell in production systems.

You should care about functional programming even if you never use it in production (although, I suspect you will). Functional programming gives you a completely different way of thinking about problems and is a good tool in any programmer's toolbelt. Of course, getting this other perspective comes with a price: FP usually takes a significant investment to learn and to learn well.

Fluffy Abstract Explanation

So, with the benefits in mind, let's tackle the first question: what is functional programming? Wikipedia defines it as "a programming paradigm [...] that treats computation as evaluation of mathematical functions and avoids changing-state and mutable data". Let's break that down piece by piece:

  • a programming paradigm is essentially a style of programming and the set of features it uses. The paradigms you'll hear about most frequently are imperative, object-oriented, procedural, functional, and declarative. There is often overlap between them; mostly, paradigms are a way to classify languages and talk about them more easily.
  • computation as evaluation of mathematical functions means that instead of a "data recipe" where you have a set of instructions that you follow step by step, you describe with math what you expect as output based on what you provide as input. That is, you precisely describe the relationship between the set of all inputs and the set of permitted outputs of your function.
  • avoiding changing-state and mutable data means that you can't say x = 5 and then later say x = 10. When you set a value equal to something, it is equal forever and you can't change the state. If you create a list and you need to add a new element to it, you don't modify it in-place; you create a new list with the element added to it. This gives a few nice properties: you don't have to worry about concurrent accesses to data structures, since those are read-only; you don't have to worry about a function modifying data you pass in, since it can't; and it simplifies testing. (There's a small sketch of this just after this list.)
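
To make the immutability point concrete, here's a minimal Scala sketch (the values are made up for illustration):

val xs = List(1, 2, 3)
val ys = 0 :: xs      // "adding" an element builds a new list
// xs is still List(1, 2, 3); ys is List(0, 1, 2, 3)
// xs = List(9)       // would not compile: a val cannot be reassigned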

So, in a functional programming language, you write code using functions that don't have side effects. Since we are arguably removing features from imperative languages (mutable data, side effects, etc.), we must also be adding features (or creating a very strange language). Here are a couple of features we will always have in functional languages:

  • Higher order functions: functions that can take functions as arguments, and can return functions as results. This makes it so you can do really cool things like writing your own control structures. We'll see examples of this in the next section, since it underpins most of functional programming.
  • Lambda functions are anonymous functions. They sometimes have restrictions in what they can do (for example, lambdas in Python cannot do everything lambdas in Haskell can do) but in principle, a lambda function is just an unnamed function.
  • Algebraic datatypes are composite types, most commonly product types (such as tuples or records) and sum types (such as tagged unions). We will see a product type (a case class) in the next section, and there's a small sum type sketch just after this list.
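
Since the examples below only show a product type (a case class), here's a minimal sum type sketch (the Shape type and its cases are made up for illustration):

sealed trait Shape                                        // a sum type: a Shape is either a Circle or a Rectangle
case class Circle(radius: Double) extends Shape           // each case is itself a product type
case class Rectangle(w: Double, h: Double) extends Shape

def area(s: Shape): Double = s match {
  case Circle(r)       => math.Pi * r * r
  case Rectangle(w, h) => w * h                           // the compiler warns if a case is missing
}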

There are a lot of other features that you see more in functional programming languages, but it is important to keep in mind that not all FP languages are Haskell, and you can do FP even if your language is technically in a different paradigm (for example, JS has a strong community building around doing FP, especially with the rise of frameworks like React and Redux and libraries like Ramda).

That made no sense, show me the code

Let's not pretend that was perfectly clear. Unless you've actually done some functional programming, that explanation probably felt abstract, so let's look at a few concrete examples. These will all be in Scala (it can show both imperative and functional styles, and it is the language Spark is written in).

"Hello Fibonacci"

The canonical example for getting started with functional programming seems to be calculating the Fibonacci sequence. It's short and digestible and shows a little bit of the flavor (and avoids IO, which can be difficult in functional languages).

n.b.: I'm assuming the user will always pass in valid input, and we aren't concerned with error handling here. That's for another blog post.

First, let's take a look at an imperative implementation:

def imperativeFibonacci(n: Int): Int = {
  var a: Int = 0
  var b: Int = 1
  var index: Int = 0

  while (index < n) {
    index += 1

    val next = a + b
    a = b
    b = next
  }

  a
}

This is basically the version we all wrote when we were learning. It was kind of tricky to write, and a lot of that trickiness comes from the fact that when we look at the definition of the Fibonacci series on Wikipedia, it is not expressed as this kind of calculation. Wouldn't it be nice if we could write it in a way that's closer to how it's defined?

We're in luck. Here is one way we could write a functional implementation:

def fibonacci(n: Int): Int = n match {
  case 0 => 0
  case 1 => 1
  case _ => fibonacci(n-1) + fibonacci(n-2)
}

This is much cleaner. It has two major problems, though: it will result in a stack overflow if we run with too high of an n value, and it will be really slow for large n (it's O(2^n), which makes kittens cry).

Here's another functional approach which is still clean and avoids both of these problems:

def fibonacci(n: Int): Int = {
  @annotation.tailrec
  def fib(n: Int, a: Int, b: Int): Int = n match {
    case 0 => a
    case _ => fib(n-1, b, a+b)
  }
  fib(n, 0, 1)
}

This one avoids stack overflows by using a tail call, which the Scala compiler optimizes into a loop (the @annotation.tailrec annotation asks the compiler to verify that the call really is in tail position). It also is more efficient, since it compiles down to something very similar to our imperative version above.

What makes this better than the imperative approach? Truthfully, it isn't necessarily better. It definitely is different, and having a different approach will benefit you.

Examples (Lambdas, Maps, Folds)

Now that we have seen a basic example, let's look at a more thorough and realistic one. It is obviously contrived, but it should give you the flavor of functional programming.

Let's pretend that you're a professor and your program has a list of student records in it (containing name, id, and grade). First, let's define the datatype we are using:

case class Student(name: String, id: String, grade: Float)

Now you want to know who is failing your course so you can intervene and help them get a better grade. We need to find the students who are currently failing. As an imperative programmer, you might write something like this:

def getFailingStudents(roster: Seq[Student]): Seq[Student] = {
  var disappointments = Seq[Student]()
  for (student <- roster) {
    if (student.grade < 90.0) { // we have high standards
      disappointments :+= student
    }
  }
  disappointments
}

If you also want to find the students who are passing, you will have to write nearly identical code. Let's see how we would do both of them in a functional style. I'm going to skip actually implementing the filter function and just show you how we do it with some functional constructs (higher order functions, lambda functions):

val failingStudents = roster.filter(x => x.grade < 90.0)
val passingStudents = roster.filter(x => x.grade >= 90.0)

Without higher order functions, we would not be able to define this kind of filter function. (We could hack it together using anonymous classes and overriding methods, like was done in Java for a long time, but that is ugly and very cumbersome; this is very clean.) The great thing about doing filters this way is we don't have to reimplement anything for passing students, we just use a different predicate.
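
For the curious, here's roughly what such a filter could look like if we wrote it ourselves (a minimal recursive sketch, not the standard library's implementation):

def myFilter[A](xs: Seq[A])(pred: A => Boolean): Seq[A] = xs match {
  case Seq()                      => Seq()                          // nothing left to keep
  case head +: tail if pred(head) => head +: myFilter(tail)(pred)   // keep the head, recurse on the tail
  case _ +: tail                  => myFilter(tail)(pred)           // drop the head, recurse on the tail
}

val failingStudents = myFilter(roster)(x => x.grade < 90.0)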

Now let's compute the average grade of your students. Again, first imperative...

def averageGrade(roster: Seq[Student]): Double = {
  var total = 0.0
  for (student <- roster) {
    total += student.grade
  }
  total / roster.length
}

...and then functional...

val sum = roster.map(student => student.grade)
                .foldLeft(0.0)((a,b) => a + b)
val avg = sum / roster.length

Here we have introduced two new concepts:

  • map is used to transform one list into another list. It applies the supplied function to every element of the list. In this case, we transform a Seq[Student] into a Seq[Float]. This generally preserves the structure of the list, but transforms the content of it.
  • fold is used to compact a list down into a single resulting value (foldLeft and foldRight control the direction in which elements are combined). The first argument is the initial accumulator; the given function is then applied to the current accumulator and the next element of the list to produce the new accumulator. In our case, we collapse the Seq[Float] into a single sum. Note: fold is also sometimes called reduce. (There's a small step-by-step trace just after this list.)
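
To make the fold mechanics concrete, here's how a small foldLeft unfolds step by step (with made-up grades):

Seq(90.0f, 80.0f, 70.0f).foldLeft(0.0)((a, b) => a + b)
// == ((0.0 + 90.0) + 80.0) + 70.0
// == 240.0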

What's Left?

There is a wealth of knowledge to gain in functional programming, and this introduction has come nowhere close to covering everything useful about it. It's worth spending some time reading and learning more on your own. Hopefully this has been a useful taste. Now we have to move on to other things.

Why is FP in Big Data?

I think at least a little bit of the hype about functional programming lately is thanks to the big data community. That should be apparent after learning more about how it is applied. Let's go through the history of big data to see how we've gotten to where we are, then go through the core concepts from FP that are useful in big data and how to use them and apply them.

We haven't always had the infrastructure needed for handling big data, in terms of network speed and storage capacity. One of the first companies with both the capacity for big data and the need for it was Google. Another was Yahoo. (It turns out the internet is big and generates a lot of data.) Doug Cutting, who would later join Yahoo, created the Lucene search library in 1999 and then built the Nutch web crawler on top of it. Nutch ran well for a while but was running into scaling problems, and Google happened to release a paper on its distributed filesystem (in 2003), whose ideas were then implemented in Nutch. In 2004, Google released a paper about a framework called MapReduce, and that was implemented in Nutch as well. In 2006, these pieces were pulled out into their own project, called Hadoop, and Yahoo invested heavily in it. The Hadoop ecosystem grew over time, and eventually some very smart folks at UC Berkeley created Spark, which is now the de facto big data processing framework.

So, what is MapReduce, and what is Spark?

What is MapReduce? Simply put, MapReduce is a way to compute on large amounts of data by providing Map and Reduce operations. You can have as many iterations of your computation as you want, and in each one you define a Mapper, which is run over each input record and generates output, and a Reducer, which reduces down the results and prepares them either for output or for further computation. These operations are designed to run across many machines, often hundreds or thousands, which imposes some specific requirements. We discussed map and reduce (fold) above, so we already know these concepts are drawn from functional programming. It's curious that the entire computing model Google released is built around two fundamental functions from functional programming, so let's dig in to see why those functions were chosen (there's a tiny sketch of the mapper/reducer shape just after the list below). It turns out that the assumptions we make in functional programming are very helpful for distributed computation:

  • Avoiding side effects makes life better. With functional programming, one of the core tenets is that you do not use side effects when computing values, so if f(10) returns 3 the first time you evaluate it, then f(10) will return 3 every time you evaluate it. Why does this matter for distributed computing? Because machine and network failures are fairly common, and you are almost guaranteed to encounter them when you run a cluster of hundreds or thousands of machines. If your computation always returns the same output for the given input, then dealing with failures is easy - just rerun the failed part of the computation on a new machine. But if it doesn't always return the same result (such as doing a distributed random shuffle of an array), then you have to start the entire computation over if any single part of it fails.
  • Avoiding global state makes life better. This goes hand-in-hand with avoiding side effects, but is a subtly different point (or a more specific one). By avoiding global mutable state, you make it really easy to distribute your computation across many machines, because you no longer have to worry about shared global locks or synchronizing state between the machines. You only have to worry about getting each machine the data it is computing on.
  • Without side effects, testing is easier. Since our computation doesn't (or shouldn't) have side effects, we can test things more easily, because we don't have to reset the computation between runs. We just pass in reasonable input to the test and as long as we get back the correct output, we are good to go. Whereas with side effects, we would have to worry about cleaning up after the tests, make sure that the computation can run correctly even if a previous run failed, etc.
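
To ground the shape of the model, here's a tiny single-machine word count written as a mapper and a reducer in plain Scala (this is just the conceptual shape, not the Hadoop API; the sample lines are made up):

// The mapper runs over each input record and emits (key, value) pairs.
def mapper(line: String): Seq[(String, Int)] =
  line.split(" ").map(word => (word, 1)).toSeq

// The reducer collapses all the values for one key down to a result.
def reducer(word: String, counts: Seq[Int]): (String, Int) =
  (word, counts.sum)

val lines  = Seq("functional programming is fun", "big data is big")
val counts = lines.flatMap(mapper)
                  .groupBy { case (word, _) => word }
                  .map { case (word, pairs) => reducer(word, pairs.map(_._2)) }
// counts contains ("is" -> 2, "big" -> 2, and 1 for every other word)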

Now, Hadoop (the open source implementation of MapReduce) was not perfect. Since Java did not support lambda functions or first-class functions until Java 8, Hadoop MapReduce required you to write classes for the mapper and reducer, and these were very large and very clunky even when you were doing something relatively simple. Some people figured the solution was to add bindings for Python, where these implementations could be much shorter. Still, it is a big lift to write a whole class just to run a couple of functions... we should be able to pass those in directly. Further, people started to recognize that MapReduce was not the right paradigm for solving every single problem: it worked very well for some, and most others could be shoved into it, but the fit was often awkward.

Along comes Spark to save the day.

What is Spark? Apache Spark is an engine for large-scale data processing. It lets you do things like compute product recommendations, figure out duplicate patients in an electronic health record system, and analyze clickstream data for that sweet, sweet advertising revenue. Basically, it lets you pump in a lot of data, do some computations on it, and pump out results (and it supports doing this on streaming data, too). This is a lot like Hadoop MapReduce, except that you are not restricted to running a map and a reduce over your data; you can do many other operations. All of this was enabled by the work done on Hadoop, parts of which were generalized into a resource manager (YARN) that Spark commonly runs on top of.

So, if we can do the same things we could with Hadoop MapReduce, why do we need Spark at all? Well, we need it because it borrowed more from functional programming - and being written in Scala, these functional concepts are much easier to apply.

  • First-class functions make life easier. Instead of defining a mapper class, we just pass in a mapping function: ourData.map(_ + 1). Instead of taking another whole file for the class just to create a function to pass in as the mapper, we can do it in one line, by just defining the map function.
  • We get better error handling. Instead of returning null when a computation returns nothing, or hand-crafting a datatype that captures either-this-or-nothing, we have built-in datatypes that cover this (Either and Option), and we get an added bonus: the compiler checks that code pattern matching against our return type handles both cases, so we can rest assured that we won't have unhandled code paths. This is mostly a benefit brought in by algebraic data types (there's a small sketch just after this list).
  • Operating over collections is easy. Remember that filter example above? We can do exactly that in Spark by just passing in a filter. The same with averages, or any other computation we can think of. Spark exposes a collections API we can use much like the built in collections, so we can do things almost exactly like we would on in-memory data (in a functional style), and get distributed computation for free.
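
To make the error-handling point concrete, here's a minimal sketch of matching on an Option (parseAge is a hypothetical helper, not part of any API):

import scala.util.Try

def parseAge(raw: String): Option[Int] = Try(raw.toInt).toOption

parseAge("42") match {
  case Some(age) => println(s"age is $age")
  case None      => println("could not parse an age")   // dropping this case earns a compiler warning
}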

Hands on with Spark

Now that we've learned what Spark is and where it came from, let's get our hands dirty with some actual examples of how Spark works. We will look at some standard functional programming functions and properties, and how these apply to writing Spark jobs.
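
The snippets below assume you're in the Spark shell (spark-shell), where a SparkContext is already available as sc. The example RDDs they operate on (rdd, stringNums, students, lines, someNumbers, and so on) aren't defined in the snippets themselves, so here's one way you might set up some toy data to follow along (the sample values are made up):

// In spark-shell, sc is a ready-made SparkContext.
val rdd         = sc.parallelize(1 to 10)                              // some numbers
val stringNums  = sc.parallelize(Seq("1", "2", "three", "4"))          // note the unparseable entry
val students    = sc.parallelize(Seq(Student("Ada", "a1", 95.0f),      // reuses the Student case class from earlier
                                     Student("Grace", "g2", 88.5f)))
val lines       = sc.parallelize(Seq("functional programming is fun",
                                     "big data is big"))
val someNumbers = sc.parallelize(Seq(5, 3, 8, 1))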

Higher Order Functions

Now let's go through some of the common higher order functions you'll use when you're writing Spark jobs.

Filter

In functional languages, filtering lists (or any collection) is simple:

List(1,2,3,4,5).filter(x => x%2 == 0) // Only even numbers

We can do the same thing in Spark:

rdd.filter(x => x%2 == 0) // Only even numbers

It is the same operation we had before. We simply pass in a function, and it gets applied to our data automatically.

Map

Mapping over a collection applies a function to every element, converting a collection of one type into a collection of another (possibly the same) type. Suppose you have a list of Strings:

List("1","2").map(x => x.toInt)

We can do the same thing in Spark:

stringNums.map(x => x.toInt)

The problem here is that sometimes we might have something that cannot be parsed; an uncaught exception will fail the task, and Spark will abort the job if a task fails too many times, so we should not have uncaught exceptions. How do we solve this problem in a functional style? We simply use the Option type (and Try, a handy Scala wrapper that turns exceptions into None and returned values into Some(...) when converted with .toOption). Here's the same conversion, but safe:

stringNums.map(x => Try(x.toInt).toOption)

This is great, but the problem is we now have RDD[Option[Int]] where we wanted RDD[Int]. How do we correct this? By reading the next section!

Flattening

When we have a list of lists (or generally, a collection of collections), we can flatten that into just the outer shell. Essentially, we take the innermost nested elements, and we pull them out of their containers into the parent containers. That's kind of hard to understand abstractly, so let's look at an example. Here's some vanilla Scala code that takes a Seq[Seq[Int]] and applies flatten, resulting in a Seq[Int]:

Seq(Seq(1,2), Seq(3), Seq(), Seq(4,5,6)).flatten == Seq(1,2,3,4,5,6)

We can do this with Options, too! Here's what that looks like:

Seq(None, Some(1), Some(2), None, None, Some(3)).flatten == Seq(1,2,3)

Okay, so now we need to see how to do it in Spark. Spark, unfortunately, does not have flatten built in, but it does have flatMap, which means "apply map to this, and then flatten the results". We can work with that. There are two ways we can rewrite our old code to utilize our newfound flattening capabilities:

stringNums.map(x => Try(x.toInt).toOption).flatMap(identity)
stringNums.flatMap(x => Try(x.toInt).toOption)

The first line maps over the collection and then flattens it after the fact, while the second just uses flatMap in the first place and flattens it as it goes. The second is preferred, but the first is an option if you have a really good reason to do it.

Reduce (and friends)

We saw reduce before, and we can use it in Spark, as well. Let's say we have an RDD[Student] that contains all our students, and we want to compute the average grade right now. We can do that by first extracting their grades, then reducing across it, and then dividing that by the total number of students.

val numStudents = students.count
val sum = students.map(s => s.grade)
                  .reduce(_ + _)
val average = sum / numStudents

What if we want to count the words in a document? Suppose we have the document line-by-line. Then we can use one of the cousins of reduce, reduceByKey, to do this after we turn each word into a word-count-pair. This example leverages flatMap and map, and then combines everything down with a reduceByKey:

lines.flatMap(line => line.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)

At the end, we will have turned an RDD[String] of lines into an RDD[(String, Int)] of word counts, which is exactly what we were looking for.

There are other higher order functions we can also use, and these are available in the API docs. Now, let's move on and look at a couple of other things we need to know about how functional programming applies to Spark.

Associativity and commutativity

First, some super dry terminology:

  • An associative operation is one where you can add in parentheses wherever you want and still get the same result. This means that, to be associative, we must have: (a + b) + c == a + (b + c). This holds true for most things we do, like addition and multiplication, but does not hold true for exponentiation: it's not the case that (2 ^ 3) ^ 4 == 2 ^ (3 ^ 4). It's also not true that (2 - 3) - 4 == 2 - (3 - 4).
  • A commutative operation is either one that drives to work, or it's one where you can rearrange the order of the elements and still get the same result. This means that, to be commutative, we must have a * b == b * a (note: the * can mean multiplication, but it stands in for any operation we are doing). So, we can notice again that this does not hold for exponentiation or subtraction, but does hold for addition and multiplication.

This is important to understand when writing Spark programs, because the operations you pass to reduce and its friends (usually) need to be associative and commutative. If they are not, your results will depend on how Spark happens to partition and combine the data, so your code will behave non-deterministically.

Suppose you wrote this:

someNumbers.reduce(_ - _)

What would you expect the result to be? The short answer is: we don't know. Subtraction is neither associative nor commutative, so we have broken both constraints this operation needs in order to behave well. In practice the job will run, but the result will depend on how Spark partitions the data and the order in which it combines the partial results.
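
If you want to see this for yourself, here's a small experiment to try in spark-shell (a sketch; the exact numbers you get depend on how the data is partitioned, which is the whole point):

sc.parallelize(1 to 10, 2).reduce(_ - _)   // reduced within 2 partitions, then across them
sc.parallelize(1 to 10, 5).reduce(_ - _)   // same data, different partitioning, likely a different answer
// With an associative and commutative operation like _ + _, both lines would agree.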

Usually you won't try to reduce with - or ^, but this is something to always keep in mind. I know from personal experience that with sufficiently complex Spark jobs, you can break associativity and commutativity in subtle ways that only surface occasionally and are very difficult to debug. So keep it in mind, and think about it if your job sporadically fails or returns inconsistent results.

What if you try side effects / IO?

Another thing to note is that sometimes, it is tempting to do IO or side effects within your Spark job. For example, you might want to compute a new interest rate for each customer, then write it back to the database:

customers.map(cust => calculateNewInterestRate(cust))
         .map(rate => writeToDb(rate))

The problem is, we've just massively distributed our computation, and now we are essentially going to launch a distributed denial of service attack on our own database! This is problematic for obvious reasons, and I'd like to say that folks wouldn't try this, but I've seen it done at places I've worked and at places where friends have worked.

You can also do something similar by reading data in, such as configuration files:

customers.map(cust => if (getConfig.flagIsOn) .......)

If you aren't careful, you'll read the configuration file for every single customer, and then your operations team will come hunting for you. Let's hope they don't have any unresolved anger issues.

Beyond just having your ops team hate you, this style of coding also is very difficult to test, because you have to have the configuration server/files, your database, etc. available just to run the code, even if you're not testing that interaction.

So, how do you resolve both of these cases? Basically, you do what you are supposed to do in any functional programming language: cleanly separate anything that is "pure" (no side effects, no IO) from anything that relies on the outside world.
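
As a rough sketch of what that separation might look like for the two examples above (writeBatchToDb is a hypothetical batched write helper, and this variant of calculateNewInterestRate takes the config flag as a plain argument):

// Do the impure work once, on the driver, and pass plain values into the job.
val flagIsOn = getConfig.flagIsOn                                                // read the config a single time
val newRates = customers.map(cust => calculateNewInterestRate(cust, flagIsOn))   // pure transformation

// Keep the database writes at the edge, batched per partition rather than per record.
newRates.foreachPartition { rates =>
  writeBatchToDb(rates.toSeq)
}

Batching per partition keeps the number of database connections proportional to the number of partitions rather than the number of records, and because the transformation itself is pure, it stays easy to test without a database or a config server.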

Conclusion

Hopefully by now, you have the basic flavor of functional programming and you've seen how it has influenced Spark, and big data in general. There is a lot here to learn, but it is worth it and will ultimately make you a stronger engineer by giving you a second, independent way of thinking about your problems.

If you have any questions, feel free to contact me (info in the side bar).

