Case study: How LinkedIn uses containers to run its professional network
- 14 September, 2016 19:09
When LinkedIn started in 2003 it was a simple Java application with a web server.
Today the company, which calls itself the world’s largest professional network, is a powerhouse that Microsoft agreed earlier this summer to acquire for $26.2 billion.
“Growing the site has been a journey,” says LinkedIn’s director of engineering Steve Ihde. And recently, application containers have played a big role.
Ihde says over the years there have been a couple of major “inflection” points for the company from an infrastructure engineering and application development perspective. Around 2011 LinkedIn began feeling the pains of its monolithic app becoming too complex to manage. The engineering team was attempting to coordinate a new release of the app every two weeks, which made updates and new feature releases difficult to manage.
Project Inversion was an effort to decentralize the services that make up the application and rethink the development of new code. “We basically blew up the release process,” Ihde explains.
As part of Project Inversion, LinkedIn pivoted to a fine-grained, microservices-based approach. Each of the thousand or so services that make up the app was managed independently, with an owner and a development team that released new features when they were ready.
“We decentralized and decoupled the whole thing,” he says.
It’s not an easy environment to manage, Ihde acknowledges. Many services depend on each other, so when one is updated another must be too; sometimes dozens of services need to be updated simultaneously. “It does become complicated to choreograph that whole effort,” he says. If too many services rely on one another, though, that can be a sign the service boundaries need to be redrawn, he says.
In come containers
The past few years have been another inflection point and Ihde says application containers have driven this one.
“We started to realize we weren’t using the hardware resources as efficiently as we could,” he says. “We were basically hand-managing a system for resource allocation.” One team would file a ticket asking for hardware resources (compute capacity, for example) and get it. “It works reasonably well, but it’s not optimized for global efficiency,” he says.
In 2014 the company launched its private cloud, called LinkedIn Platform as a Service (LPS), which is used to manage new application development and automate the underlying hardware needed to run those apps. With LPS, the entire infrastructure was pooled together and automatically allocated to the services that need it.
There are two major custom-developed components of LPS. One is named Rain, an API-driven infrastructure automation platform. Developers write their code and, through APIs, request the amount of memory and CPU needed, which Rain configures. “With Rain, applications no longer need to ask for an entire machine as a unit of resource, but instead can request a specific amount of system resources,” Ihde explained in a blog post earlier this year.
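Ihde’s description suggests a request shaped roughly like the sketch below: asking for a slice of a machine rather than the machine itself. The function name, field names, and validation here are illustrative assumptions, not LinkedIn’s actual Rain API.

```python
import json

# Hypothetical sketch of an API-driven resource request in the spirit of
# Rain; the payload shape and field names are assumptions, not LinkedIn's API.
def build_resource_request(service, memory_mb, cpu_cores):
    """Build a JSON payload requesting specific resources, not a whole machine."""
    if memory_mb <= 0 or cpu_cores <= 0:
        raise ValueError("resource amounts must be positive")
    return json.dumps({
        "service": service,
        "resources": {"memory_mb": memory_mb, "cpu_cores": cpu_cores},
    })

# Example: a team asks for 2 GB of memory and 2 CPU cores for its service.
payload = build_resource_request("profile-service", memory_mb=2048, cpu_cores=2)
```

The point of the shape is the contrast Ihde draws: the unit of allocation is an amount of memory and CPU, not a ticket for a box.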
A second major component (there are many others) of LPS is named Maestro, which Ihde calls the “conductor” of LPS. Maestro is a higher-level automation platform compared to the infrastructure-focused Rain. Maestro handles the process of adding new services into the application, configuring them with other services, routing traffic to the necessary locations, registering the feature with the broader system and basically making sure new services fit nicely into the broader environment.
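The “conductor” role Ihde describes (adding a service, wiring it to its dependencies, routing traffic) can be sketched as a toy registry. Everything below is an illustrative assumption about the shape of such a system; LinkedIn’s Maestro is far more involved.

```python
# Toy sketch of a Maestro-like "conductor": register a service, check its
# dependencies are known, and record where traffic for it should be routed.
# All names and the address convention are made up for illustration.
class Conductor:
    def __init__(self):
        self.services = {}  # service name -> list of dependency names
        self.routes = {}    # service name -> address traffic is routed to

    def register(self, name, depends_on=(), address=None):
        missing = [d for d in depends_on if d not in self.services]
        if missing:
            raise ValueError(f"unregistered dependencies: {missing}")
        self.services[name] = list(depends_on)
        self.routes[name] = address or f"{name}.internal:8080"

conductor = Conductor()
conductor.register("identity")
conductor.register("profile-service", depends_on=["identity"])
```

Refusing to register a service whose dependencies are unknown mirrors the “fit nicely into the broader environment” check the article describes.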
Both Rain and Maestro were designed to manage code developed and packaged in application containers, specifically using the Docker runtime. LinkedIn runs LPS on a bare-metal, non-virtualized environment with containers, eliminating the overhead of a hypervisor, multiple operating system instances and virtual machines.
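On a setup like this, a per-service resource request ultimately becomes kernel-enforced limits on a container. The snippet below only assembles a `docker run` command line using Docker’s real `--memory` and `--cpus` flags (it does not execute it); the image name and values are made-up examples, not LinkedIn’s.

```python
def docker_run_command(image, memory, cpus):
    # --memory and --cpus are real Docker flags that the daemon translates
    # into Linux cgroup limits; the image name below is a made-up example.
    return ["docker", "run", "--detach",
            f"--memory={memory}", f"--cpus={cpus}", image]

cmd = docker_run_command("example/profile-service:1.0", memory="2g", cpus="2")
```

The flags cap the container to the requested slice of the bare-metal host, which is how a Rain-style allocation can be enforced without a hypervisor.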
Containers also allow LinkedIn’s engineering team to enforce fine-grained security controls. Ihde says the goal is to give each service the impression that it is running on an empty host with nothing else on the machine. If a bug or a hack were to compromise a component, the idea is that the blast radius of the incident would be limited. Application containers help do this: configuring namespaces for each container in the Linux kernel hides a container’s processes from the other containers on the same host. Each container also gets its own network namespace and IP address.
No thanks, public cloud
So why wouldn’t LinkedIn just do all this in a public cloud? Microsoft recently bought LinkedIn, so will this whole infrastructure platform move over to Azure? LinkedIn officials said they could not comment on anything related to Microsoft’s purchase of the company. But Ihde did say his team has explored using public cloud services. “It’s just not cost effective for us at our pretty large scale,” he says. “We can meet our needs ourselves.”
Ihde acknowledges that may not be the case for every company, but LinkedIn operates four data centers around the world and has a staff of 3,000 people involved in engineering and operations, so by running at that scale, they’re able to optimize for their own efficiencies.