It takes a lot of horsepower to support LinkedIn’s 467 million members worldwide, especially when you consider that each member is getting a personalized experience, a web page that includes only their contacts. Supporting the load are some 100,000 servers spread across multiple data centers. To learn more about how LinkedIn makes it all happen, Network World Editor in Chief John Dix recently talked to Sonu Nayyar, VP of Production Operations IT, and Zaid Ali Kahn, Senior Director of Infrastructure Engineering.
Let’s start with the big picture view of what you have for data centers around the world.
Nayyar: We have three main data centers in the U.S. that serve LinkedIn.com worldwide, one in Richardson, Texas, one in Ashburn, Virginia, and a new one in Oregon that we just launched. We have a smaller data center in Singapore that we launched earlier this year, and its main purpose is to improve our member experience in APAC. It has basically a complete set of data but it’s only for our members in APAC. All four are connected by our MPLS backbone and 13 global points of presence (POPs).
Are they all of similar architecture, or a mix given they were built at different times?
Nayyar: We have a mix. We started with a colo facility prior to building our first data center in Ashburn. And obviously the technology has improved incrementally year after year, and now Oregon is a complete step-up function.
Kahn: Virginia was when we first started the shift to the wholesale model. So instead of using retail providers [of compute capacity] from companies like Equinix, etc., we leased data center space – basically a big empty shell – and built out everything inside, right down to the power, the busways, the racks, all those things. And after Virginia we built another one in Texas because we were scaling pretty fast. When we got to Oregon we were able to step back and think about what we wanted the future of our data centers to look like. That’s when we made the transition to a hyperscale model. Going forward, we will retrofit our other data centers for the new model.
You guys aren’t building your own servers like some of the Web giants, are you?
Kahn: No. We’re using standalone rack servers. We work closely with OEM vendors to make sure they meet our specifications for performance, etc. We were one of the first big users of Cisco’s UCS, but we have moved more towards Supermicro commodity hardware.
Are these data centers also supporting your business needs?
Nayyar: We have a hybrid. We do have a small footprint in Santa Clara where we have our corporate data center resources -- HR, Finance, development, pre-prep production, etc. -- but we constructed Oregon so we can use security zones to support those corporate needs from any data center.
What does the customer-facing LinkedIn application look like?
Nayyar: Our app is complex, so everything in the data center is in support of rendering the page when you go to LinkedIn.com. As you can imagine, you have different connections than I do, everyone does, so the page you see is highly customized and there’s a ton of east to west traffic in our data center to generate each page. A lot of computation has to go on. For every byte that comes into our network we go 100x east to west to generate the page.
Nayyar: With our application everything is connected. Obviously some parts of our site are separate, like Recruiter has a different interface. But for the general consumer member, LinkedIn.com is all connected.
Kahn: We have multiple products and thousands of services. You’ve probably heard about Rest.Li, one of our most talked about gateway integrators. When these things communicate you also end up with a very large volume of data moving between data centers.
Does each data center support the same thing or are duties distributed?
Nayyar: Any one site can serve traffic. If there is a failure in one data center we just route traffic to another site. There’s replication going on between all the data centers in real time across our 100Gbps MPLS backbone. They all serve the same thing and that’s how we improve our availability. If there is an outage at one site, whether it’s a bug, a network issue, a power issue or even a change gone bad, we can easily fail traffic out within five minutes. All of them work together serving LinkedIn.com.
Are you serving the populace by geographic region?
Kahn: Yes. We are a heavy user of Anycast [the ability to advertise one IP address from multiple points in the network], which means we can route our members to the closest POP [point of presence].
Nayyar: We try to figure out which set of users from which part of the country should be routed where and route them to the nearest POP. POPs are small scale data centers with mostly network equipment and proxy servers that act as end-points for users’ TCP connection requests.
Kahn: We select the location of the POP based on member experience. We know which geographical areas are challenging. We have a data science model we use to do predictive analytics that shows, if we put a POP in Australia, then page download time will improve by X percent. Then we have to build POPs in these areas and they tie back to our data centers. All the heavy lifting of the pages is done at the POPs, and then there’s backend data connectivity, but the POPs help to make the page download time faster. We’ve seen improvements of up to 25% in page download time just by having a POP in Asia.
Nayyar: We monitor our site speed very closely across the world and we’re constantly looking at how we can improve that, whether via the network, by continuing to improve the app and the heaviness of the pages, or within the data center by reducing the time it takes to build that page and present it to our members.
Ok, let’s turn to your newest data center in Oregon, which came online in November. How is it different?
Kahn: Compute-wise it is much more dense. Typically people do 7-9 kilowatts per rack. We don’t own these facilities so we wanted to optimize for space by packing more servers into a rack. We can do more than 14 kilowatts per rack. But with dense compute, as you can imagine, there’s going to be a lot of heat, so we had to figure out how to innovate the design of the data center cooling system. We ended up going with rear-door heat exchangers. We’re one of the first to do water-based cooling at the rack. There’s a CapEx cost to do that, obviously, but over time we will be using a lot less power.
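To put the density figures in perspective, here is a minimal sketch of the rack-count arithmetic. The 1 MW IT load is a hypothetical figure chosen only for illustration, and 8 kW is used as the midpoint of the 7-9 kW range Kahn cites; neither describes an actual LinkedIn deployment.

```python
import math


def racks_needed(it_load_kw: float, kw_per_rack: float) -> int:
    """Smallest whole number of racks that can host the given IT load."""
    if kw_per_rack <= 0:
        raise ValueError("kw_per_rack must be positive")
    return math.ceil(it_load_kw / kw_per_rack)


# Hypothetical load, not a figure from the interview.
HYPOTHETICAL_IT_LOAD_KW = 1000

typical = racks_needed(HYPOTHETICAL_IT_LOAD_KW, 8)   # midpoint of 7-9 kW per rack
dense = racks_needed(HYPOTHETICAL_IT_LOAD_KW, 14)    # the 14 kW per rack quoted above
print(f"typical: {typical} racks, dense: {dense} racks")
```

Under these assumptions the same load fits in roughly 40% fewer racks, which is the space saving Kahn describes for leased facilities.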
So you’re pumping water through the racks?
Nayyar: We’re basically precooling water outside and circulating it through these rear-door heat exchangers, which neutralizes the hot air right at the rack so there is no cold air/hot air-aisle containment necessary.
Any concern about pumping water around all those systems?
Nayyar: That was one of the concerns as we were looking at the technology, but we tested it thoroughly and the designs are really robust. We also have quite a bit of monitoring around this so we know if there is any kind of leak, but we’re not concerned as of right now.
Using outside air to cool the water must be pretty efficient. What kind of PUE (Power Usage Effectiveness) are you looking at for the Oregon data center?
Nayyar: Oregon is commissioned for 1.06. And it’s worth mentioning that our corporate goal is to be using 100% sustainable energy in the future. We’re not there yet, obviously, but we’re working towards it and that’s part of the reason why we chose Infomart in Oregon because they have direct access to renewable energy.
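For reference, PUE is the ratio of total facility power to the power delivered to IT equipment, so a PUE of 1.06 implies about 6% overhead for cooling and power distribution. A minimal sketch of that arithmetic follows; the 8 MW IT load is a hypothetical value for illustration, not a number from the interview.

```python
def facility_power_mw(it_load_mw: float, pue: float) -> float:
    """Total facility power implied by an IT load and a PUE value."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_load_mw * pue


def overhead_fraction(pue: float) -> float:
    """Non-IT power (cooling, distribution losses) as a fraction of IT power."""
    return pue - 1.0


HYPOTHETICAL_IT_LOAD_MW = 8.0  # assumed for illustration only
OREGON_PUE = 1.06              # the commissioned figure quoted above

total = facility_power_mw(HYPOTHETICAL_IT_LOAD_MW, OREGON_PUE)
print(f"{total:.2f} MW total, {overhead_fraction(OREGON_PUE):.0%} overhead")
```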
Let’s turn to the innovative work you’ve done on the network side, what you have spelled out in your Project Altair design documents. As I understand it, each of your racks has a top-of-rack switch and they communicate with multiple fabric devices.
Kahn: Yes. The Altair design is one big fabric solution. Think of it as a big flat network. There is no core, no chassis. Say you were building for 100,000-plus servers using the traditional enterprise model. A packet going from one server to another would end up traversing 25 to 30 chipsets, giving you milliseconds of latency between two servers. What we’ve done is reduce that to less than five chipsets for server-to-server communications using a five-stage Clos architecture, a spine and leaf design, and that reduces our switching latency between two servers to microseconds (see Figure 1).
So in our spine and leaf topology everything is broken down into different stages. Each top-of-rack has four paths upwards to four different spines, and those four spines talk multiple ways to the spines above it, so all of those spine switches become one big fabric. Every top-of-rack switch has four or more paths to get out using equal-cost multi-path (ECMP) (see Figure 2).
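ECMP generally works by hashing fields of each flow so that all packets of one flow take the same path while different flows spread across the available uplinks. The sketch below illustrates that idea for a top-of-rack switch with four spine uplinks. The uplink names, the `Flow` type and the use of SHA-256 are assumptions made for illustration; real switches use hardware hash functions, and nothing here describes LinkedIn's actual configuration.

```python
import hashlib
from typing import NamedTuple, Sequence


class Flow(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # 6 = TCP, 17 = UDP


# Hypothetical names for the four uplinks of one top-of-rack switch.
SPINE_UPLINKS = ("spine-1", "spine-2", "spine-3", "spine-4")


def pick_uplink(flow: Flow, uplinks: Sequence[str] = SPINE_UPLINKS) -> str:
    """Map a flow to one uplink by hashing its 5-tuple (stand-in for a hardware hash)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(uplinks)
    return uplinks[index]


flow = Flow("10.0.0.5", "10.0.9.7", 40000, 443, 6)
# The same flow always hashes to the same uplink, which keeps its packets in order.
print(pick_uplink(flow) == pick_uplink(flow))
```

Hashing per flow rather than per packet is the usual design choice because it avoids reordering packets within a TCP connection.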
Are the top of rack and spine switches similar?
Kahn: Yes, they are actually exactly the same. We’ve gone to a single SKU model which means we buy only one type of switch, a 1U device.
Do you get all your switches from the same supplier?
Kahn: No. It’s one platform. They are all the same design and same chipset. One SKU. You can have multiple suppliers, but the same platform. Ours use a Tomahawk chipset and are 32x100G ports, 3.2Tbps. We bring 50Gbps to each server, which is different. We believe we’re the first to have actually deployed in a way that every server can have a 10G, 25G or 50G and, in the future, even a 100Gbps path. We’ve kind of future-proofed this for the next four or more years.
All of the spines are 100Gbps and the subscriptions between the spines are one to one so, if you send in 100Gbps, you always get 100Gbps out. Down to the top-of-rack we bring 50Gbps and we do that using the PSM4 standard so we can take two 100Gbps ports and split them into four 50Gbps ports, bringing the effective available cabinet bandwidth to 200Gbps.
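The port figures quoted above can be checked with simple arithmetic. This sketch models the breakout as plain division of port bandwidth; it does not model the PSM4 optics themselves, and the function names are illustrative.

```python
from typing import List

PORTS_PER_SWITCH = 32    # from the interview: 32x100G
PORT_SPEED_GBPS = 100


def switch_capacity_tbps(ports: int, speed_gbps: int) -> float:
    """Aggregate one-direction capacity of a switch in Tbps."""
    return ports * speed_gbps / 1000


def split_ports(ports: int, speed_gbps: int, lanes_per_port: int) -> List[int]:
    """Break each port into equal-speed lanes and return the resulting link speeds."""
    lane_speed = speed_gbps // lanes_per_port
    return [lane_speed] * (ports * lanes_per_port)


capacity = switch_capacity_tbps(PORTS_PER_SWITCH, PORT_SPEED_GBPS)
server_links = split_ports(ports=2, speed_gbps=PORT_SPEED_GBPS, lanes_per_port=2)
print(capacity, server_links, sum(server_links))
```

Two 100Gbps ports split two ways give four 50Gbps links, which sum to the 200Gbps of cabinet bandwidth mentioned above.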
I read in some of your documentation that top-of-racks are not redundant, so that means you can afford to lose a whole cabinet. Is that because everything is replicated across the servers?
Kahn: Yes, and across data centers. It’s all about distribution of failure domains and simplifying infrastructure. At this scale you have to share fate. The applications are fault tolerant enough that we can lose an entire cabinet and things will just fail over, either within the data center or across data centers.
Do I understand right that you’re running your own code in your top-of-rack switches?
Kahn: Some of them. We are a mix of OEM (Original Equipment Manufacturer) and ODM (Original Design Manufacturer). OEMs would be a provider like Cisco, or whatever. Then we have ODM suppliers and we run our own code on those, and we are slowly adopting that as we are building out new cabinets and a new set of databases.
Why develop your own?
Kahn: We had very specific things we wanted to control. We wanted to focus on how we manage our fabric. Our goal is not to necessarily build the world’s best network operating system. That’s not our goal. Our goal is to build the applications on top of the control plane that manages our fabric network.
For example, we want to do streaming telemetry from the switch itself and upload it into a platform for machine learning and use that to figure out how to intelligently route traffic, find performance bottlenecks and just operate the network better. That’s our goal. Internally we call this initiative the Programmable Data Center. We want to understand more about the application level of the network and optimize traffic inside the data center for that.
Ok. And you’re supporting both IPv4 and v6 with the goal of moving to v6 across the board?
Kahn: Yeah. We are very active on the v6 front. A few years ago we launched www.linkedin.com on IPv6 to address the inevitable exhaustion of IPv4 addresses. We decided to tackle the problem on the edge first so we can address markets that are sending IPv6-only traffic. We have seen high IPv6 growth in mobile traffic and also some performance gains. Recently we started to look at IPv6 inside the data center as we scale. We will soon be running out of v4 inside our data centers so we decided to dual stack v4 and v6 with the goal that eventually we will be v6 only in a couple of years.
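To give a rough sense of the address-space gap behind the move to dual stack, the sketch below uses Python's standard `ipaddress` module. The `10.0.0.0/8` block and the `fd00::/64` prefix are illustrative assumptions (the interview does not say which ranges LinkedIn uses), and `preferred_address` is a hypothetical helper showing how a dual-stacked host might favor IPv6 while keeping IPv4 as a fallback.

```python
import ipaddress
from typing import Sequence

V4_BLOCK = ipaddress.ip_network("10.0.0.0/8")   # assumed private IPv4 range
V6_SUBNET = ipaddress.ip_network("fd00::/64")   # assumed single IPv6 subnet


def preferred_address(addresses: Sequence[str], prefer_v6: bool = True) -> str:
    """Pick one address from a dual-stacked host's list, favoring one family."""
    if not addresses:
        raise ValueError("host has no addresses")
    parsed = [ipaddress.ip_address(a) for a in addresses]
    wanted = 6 if prefer_v6 else 4
    for addr in parsed:
        if addr.version == wanted:
            return str(addr)
    return str(parsed[0])  # fall back to whichever family is available


print(V4_BLOCK.num_addresses)   # size of the entire private /8
print(V6_SUBNET.num_addresses)  # size of one IPv6 /64
print(preferred_address(["10.1.2.3", "fd00::10"]))
```

A single IPv6 /64 holds far more addresses than the whole private IPv4 /8, which is why exhaustion inside a large data center is a v4-only concern.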
What’s the total capacity of your data centers, and what do you anticipate in terms of growth going forward, especially given your acquisition by Microsoft?
Nayyar: If I include our corporate data center, I would say we’re close to 40 megawatts. We’re definitely adding more capacity next year. That’s in the plan. What we don’t know is how the integration with Microsoft is going to affect usage. The deal just closed, so we’re starting to figure out how we can work together. Right now our plans factor in organic growth, but we’re going to have to wait to see how things work out.
I think that was everything on my list. Anything I didn’t think to ask you about?
Nayyar: One thing. Our philosophy has always been that, wherever it makes sense, we want to give back and open source the projects we’ve been working on. Zaid mentioned switching telemetry, which is a very scalable, fast, replicated streaming app we built, a messaging pipeline. We open-sourced that and there are a couple reasons why.
Obviously if we open-source something other people can benefit from it, but we also believe there are business benefits. One is we get multiple people sharing back, which improves the effort, and two we believe it helps the craftsmanship of our engineers because when they’re working on code that’s being looked at by millions of people they do a better job with documentation, and they create more elegant code because their name is on it.
Nayyar: We have an open hardware initiative called Open19 that has created a little bit of buzz and you’ll see more happening in that spot next year. But just to give you an idea, we decided to create an open standard for a 19-inch rack environment for your server, storage and networking. The goal is to reduce common components by 50%. Everything in a rack requires power and network so we are consolidating anything that’s a common component inside the rack by 50%.
Besides saving significant CapEx, Open19 can help you integrate racks 2-3x faster. If you have a shared power module, a shared networking component, you don’t have to have messy cables anymore. We have a lot of OEM and ODM vendors signing up because they are able to retain their intellectual property but, by adhering to this standard, they can enable a lot of flexibility for their future base.
We’re creating a consortium and LinkedIn is one of the leaders of that consortium. We are partnering with others strategically and the idea is that the committee will come together and we will then open up the designs and move the initiative forward.