Elixir/Erlang Clustering in Kubernetes

August 4, 2016

At work, our infrastructure runs on OpenShift Origin, a Red Hat open source project that layers a lot of nice tooling on top of Kubernetes. It’s been really pleasant to work with for the most part, though there have been some growing pains and lessons learned along the way. Since I was responsible for pushing to adopt it and for setting up the cluster, I’ve become sort of the go-to for edge cases and for advice on designing our applications around it. One of the first things that came up, and which I’ve spent a lot of time working on since, is how to handle some of our Elixir/Erlang applications which need to form a cluster of nodes.

For those not entirely familiar, here’s a brief recap of how Erlang does distribution. Nodes must be configured to start in distributed mode, with a registered long or short name, and a magic cookie which will be used for authentication when connecting nodes. Typically you configure your node for distribution in vm.args like so:

-name myapp@192.168.1.2
-setcookie myapp

The above tells the VM, on startup, to enable distribution and register the node under the long name myapp@192.168.1.2 with the magic cookie myapp. Other nodes which wish to connect to this node must explicitly connect with :net_adm.connect_node(:'myapp@192.168.1.2'), have it listed in the .hosts.erlang file they read, or rely on connecting implicitly when the node is referenced in a call to :rpc.call/4 or the like. It’s important to note that the host part, i.e. 192.168.1.2 in this case, must be routable, so just putting any old hostname in there is not a good idea. Anyway, once we’ve connected to the node, we can talk to processes on the other node, and so on.
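
To make that concrete, here’s what connecting looks like from an iex session on another machine, reusing the example node name and cookie from above (the console node name here is just made up for illustration):

    $ iex --name console@192.168.1.3 --cookie myapp
    iex(console@192.168.1.3)1> Node.connect(:"myapp@192.168.1.2")
    true
    iex(console@192.168.1.3)2> Node.list()
    [:"myapp@192.168.1.2"]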

All of this is pretty manual, other than the .hosts.erlang file. If you’re wondering what that is, here’s the relevant excerpt from the Erlang manual:

File .hosts.erlang consists of a number of host names written as Erlang terms. It is looked for in the current work directory, the user's home directory, and $OTP_ROOT (the root directory of Erlang/OTP), in that order.

The format of file .hosts.erlang must be one host name per line. The host names must be within quotes.

Example:

    'super.eua.ericsson.se'.
    'renat.eua.ericsson.se'.
    'grouse.eua.ericsson.se'.
    'gauffin1.eua.ericsson.se'.

Even that file, though, requires knowing in advance what hosts to connect to. If you’re familiar with Kubernetes, you’ve probably already realized by now that this is not really possible. Container IP addresses are dynamic, and even if they were static, dynamically scaling the number of replicas based on load means that nodes will be joining and leaving the cluster all the time.

So how does one handle clustering in such a dynamic environment? My first shot at solving this was to use Redis as a cluster registry. It worked fine, mostly, but I hated the dependency and wanted something easily reusable across our other apps. My next shot at addressing those issues was to build a library I recently released, called Swarm, which, as part of its functionality, includes autoclustering via a UDP gossip protocol. This worked nicely in my test environment, which was not run under Kubernetes, but when I pushed it to our development environment in OpenShift, I found out that OpenShift does not currently route UDP packets over the pod network. Damn it.

It was around this point, while doing some maintenance on one of our applications, that I discovered Kubernetes mounts an API token and the current namespace for the pod’s service account into every container. This API token can then be used to query the Kubernetes API for the set of pods in a given service. Swarm is built with pluggable “cluster strategies”, so I wrote one to pull all pod IPs associated with services which match a given label selector, e.g. app=myapp. It then polls the Kubernetes API every 5s and connects to new nodes when they appear. To be clear, since all you get from Kubernetes is the pod IP, you only have half of the node name you need for the -name flag in vm.args, but for my use case I could simply share the same node basename, i.e. myapp, across every node and combine it with the pod IP to get the full node name. Success!
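
To make the idea concrete, here’s a rough sketch of that discovery step. This is not Swarm’s actual implementation: the module and function names are made up, it assumes Poison for JSON decoding, and it skips TLS verification of the API server for brevity (real code should use the CA certificate Kubernetes mounts alongside the token).

    defmodule KubeDiscovery do
      # Kubernetes mounts these into every container for the pod's service account
      @token_path "/var/run/secrets/kubernetes.io/serviceaccount/token"
      @namespace_path "/var/run/secrets/kubernetes.io/serviceaccount/namespace"

      # basename is the shared node basename (e.g. "myapp"),
      # selector is a label selector (e.g. "app=myapp")
      def discover(basename, selector) do
        {:ok, _} = Application.ensure_all_started(:inets)
        {:ok, _} = Application.ensure_all_started(:ssl)

        token = @token_path |> File.read!() |> String.trim()
        namespace = @namespace_path |> File.read!() |> String.trim()

        url =
          "https://kubernetes.default.svc/api/v1/namespaces/#{namespace}/endpoints" <>
            "?labelSelector=#{URI.encode_www_form(selector)}"

        headers = [{'authorization', String.to_charlist("Bearer " <> token)}]

        {:ok, {{_, 200, _}, _resp_headers, body}} =
          :httpc.request(:get, {String.to_charlist(url), headers},
                         [ssl: [verify: :verify_none]], [])

        # Each item in the EndpointsList has subsets -> addresses -> ip;
        # combine every pod IP with the shared basename to get full node names.
        body
        |> List.to_string()
        |> Poison.decode!()
        |> Map.get("items", [])
        |> Enum.flat_map(fn item -> Map.get(item, "subsets") || [] end)
        |> Enum.flat_map(fn subset -> Map.get(subset, "addresses") || [] end)
        |> Enum.map(fn %{"ip" => ip} -> :"#{basename}@#{ip}" end)
      end

      # Connect to anything we haven't seen yet; called on the 5s poll.
      def connect_new(nodes) do
        Enum.each(nodes -- [Node.self() | Node.list()], &Node.connect/1)
      end
    end

The interesting bit is the final pipeline: every pod IP in the EndpointsList becomes basename@ip, which is exactly the node name scheme described above.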

If you are running on Kubernetes and want to cluster some Elixir/Erlang nodes, give Swarm a look. In the priv directory, it has a .yaml file containing the definition of a Role which grants any user associated with it the ability to list endpoints (the set of pods in a service). To give you a quick rundown of the steps required:

  • Create the endpoints-viewer role using the Swarm-provided definition.
  • Grant the default serviceaccount in the namespace you plan to cluster in, the endpoints-viewer role.
  • Configure Swarm with the node basename and label selector to use for locating nodes (a sketch of this configuration follows after this list).
  • Start your app!
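
For that configuration step, it ends up being just a few lines in config.exs. The keys below are illustrative only; check Swarm’s README for the exact names in the version you’re using:

    # config/config.exs (illustrative; the exact keys may differ by Swarm version)
    use Mix.Config

    config :swarm,
      autocluster: :kubernetes,            # use the Kubernetes cluster strategy
      kubernetes_node_basename: "myapp",   # the shared name part of myapp@<pod-ip>
      kubernetes_selector: "app=myapp"     # label selector used to find the pods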

I should probably talk about Swarm in more depth in another blog post at some point, but beyond autoclustering it also provides a distributed, global process registry, similar to gproc, except leaderless and able to handle a much larger number of registered processes. In addition to the process registry, it also does process grouping, so you can publish messages to all members of a group, or call all members and collect the results. Names can be any Erlang term, which gives you a great deal more flexibility with naming. It has its own tradeoffs versus gproc though, so it isn’t necessarily the go-to solution for every problem, but it was necessary for my own use cases at work because we’re an IoT platform and have per-device processes which need messages routed to them wherever they are in the cluster.
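
As a quick taste of the registry and group APIs (the worker module here is made up, and the exact signatures may vary across Swarm versions):

    # Register a process under an arbitrary term; Swarm starts it from the given
    # MFA so it can restart or move it elsewhere in the cluster as topology changes.
    {:ok, pid} = Swarm.register_name({:device, "1234"}, MyApp.Device, :start_link, ["1234"])

    # Process groups: join, publish to every member, or call them all and collect replies.
    :ok = Swarm.join(:devices, pid)
    Swarm.publish(:devices, {:reading, 42})
    replies = Swarm.multi_call(:devices, :get_state)

    # Registered names also work through a :via tuple, wherever the process lives.
    GenServer.cast({:via, :swarm, {:device, "1234"}}, :refresh)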

Reach out on Twitter or GitHub if you have questions about my experiences, I’d love to know how other people are tackling these kinds of things!