When the Large Hadron Collider (LHC) starts back up in June, the data collected and distributed worldwide for research will surpass the 200 petabytes exchanged among LHC sites the last time the collider was operational. Network challenges at this scale are different from what enterprises typically confront, but Harvey Newman, Professor of Physics at Caltech, who has been a leader in global scale networking and computing for the high energy physics community for the last 30 years, and Julian Bunn, Principal Computational Scientist at Caltech, hope to introduce a technology to this rarified environment that enterprises are also now contemplating: Software Defined Networking (SDN). Network World Editor in Chief John Dix recently sat down with Newman and Bunn to get a glimpse inside the demanding world of research networks and the promise of SDN.
Can we start with an explanation of the different players in your world?
NEWMAN: My group is a high energy physics group with a focus on the Large Hadron Collider (LHC) program that is about to start data taking at a higher energy than ever before, but over the years we've also had responsibility for the development of international networking for our field. So we deal with many teams of users located at sites throughout the world, as well as individuals and groups that are managing data operations, and network organizations like the Energy Sciences Network, Internet2, and GEANT (in addition to the national networks in Europe and the regional networks of the United States and Brazil).
For the last 12 years or so we've developed the concept of including networks along with computing and storage as an active part of global grid systems, and had a number of projects along these lines. Working together with some of the networks, like Internet2 and ESnet, we were able to use dynamic circuits to support a set of flows or dataset transfers and give them priority, with some level of guaranteed bandwidth.
So that was a useful thing to do, and it's been used to some extent. But that approach is not generalizable in that not everybody can slice up the network into pieces and assign themselves guaranteed bandwidth. And then you face the issue of how well they're using the bandwidth they reserved, and whether we would be better off assigning large data flows to slices or just doing things in a more traditional way using a shared general purpose network.
So I presume that's where the interest in SDN came in?
NEWMAN: What's happening with SDN is a couple of things. We saw the possibility to intercept selected sets of packets on the network side, and assign flow rules to them so we don't really have to interact much with the application. There is some interaction but it's not very extensive, and that allows us to identify certain flows without requiring a lot of the applications to be changed in any pervasive way. The other thing is, beyond circuits you have different classes of flows to load-balance, and you want to prevent saturating any sector of the infrastructure. In other words, we want mechanisms so our community can handle these large flows without impeding what other people are doing. And in the end, once the basic mechanisms are in place we want to apply machine learning methods to optimize the LHC experiments' distributed data analysis operations.
Most advanced research and education networks in the last year or two have made the transition from 10 Gb/sec (Gbps) backbones to 100 Gbps, so people tend to say, "Wow, now you have lots of bandwidth." But the laboratory and university groups in our field have deployed facilities where many petabytes of data are stored, along with very large numbers of servers which are moving from 1 Gbps to 10 Gbps and in some cases 40 Gbps network interfaces. And 100 Gbps interfaces are expected within the next few months, so as fast as the core networks progress, the capabilities at the edges are progressing even faster. This is a real issue.
Are you looking at using SDN on specific research networks or trying to implement the capabilities across a range of them?
NEWMAN: There is one project called the LHC Open Network Environment (LHCONE) that was originally conceived to help with operations that involved multiple centers. To understand this, though, I have to explain the structure of the data and computing facilities.
The LHC Computing Model was originally a hierarchical picture that included a set of "tiered" facilities. We called CERN the "Tier 0" where the data taken at the LHC are first analyzed. There are now 13 Tier 1 centers, which are major national computing centers including centers at the Fermi National Accelerator Laboratory (Fermilab) and the Brookhaven National Lab (BNL) in the US.
There are also more than 160 so-called Tier 2 centers at universities and other labs throughout the world, each of which serves a region of a large country like the United States, or in some cases they serve an entire country. Then every physics group has a so-called Tier 3 cluster, and there are about 300 of those. All of these facilities are interconnected by the research and education networks mentioned.
The US is involved mainly in the two biggest experiments at the LHC. The one I work on is called CMS, short for the Compact Muon Solenoid, which is served by Fermilab, and our competing experiment is called ATLAS, which is served by BNL.
CMS and ATLAS are multipurpose particle physics experiments exploring the most fundamental constituents of matter and forces of nature. In 2012 they both discovered the Higgs boson, thought to be responsible for mass in the universe. And with the restart of the LHC at higher energy and luminosity (intensity) we anticipate even greater discoveries of physics beyond the Standard Model of particle physics that embodies our current knowledge.
University connections to the Tier 1 centers are mostly through Internet2 and regional networks. So, for example, I am at Caltech so we work with Internet2 and with CENIC, which is the California regional network.
So the experiments at the Large Hadron Collider are the data sources and the networks are used to distribute this data to users for analysis?
NEWMAN: Right. Data is taken by the experiments, processed for the first time at CERN and then distributed to the Tier 1 centers via a sort of star network with some dedicated crosslinks for further processing and analysis by hundreds of physics teams. Once the data is at the Tier 1s it can be further distributed to the Tier 2s and Tier 3s, and once there, any site can act as the source of data to be accessed, or transferred to another site for further analysis. It is important to realize that the software base of the experiments, each consisting of several millions of lines of code, is under continual development as the physics groups improve their algorithms and their understanding and calibration of the particle detector systems used to take the data, with the goal of optimally separating out the new physics "signals" from the "backgrounds" that result from physics processes we already understand.
The data distribution from CERN to the Tier 1s is relatively straightforward, but data distribution to and among the Tier 2s and Tier 3s at sites throughout the world is complex. That is why we invented the LHCONE concept in 2010, together with CERN: to improve operations involving the Tier 2 and Tier 3 sites, and allow them to make better use of their computing and storage resources in order to accelerate the progress of the LHC program.
To understand the scale, and the level of challenge, you have to realize that more than 200 petabytes of data were exchanged among the LHC sites during the past year, and the prospect is for even greater data transfer volumes once the next round of data taking starts at the LHC this June.
The first thing that was done in LHCONE was to create a virtual routing and forwarding fabric (VRF). This was something proposed and implemented by all the research and education networks, including Internet2, ESnet, GEANT, and some of the leading national networks in Europe and Asia, soon to be joined by Latin America.
That has really improved access. We can see dataflow has improved. It was a very complex undertaking and very hard to scale because we have all of these special routing tables. But now the next part of LHCONE, and the original idea, is a set of point-to-point circuits.
You remember I talked about dynamic circuits and assigning flows to circuits with bandwidth guarantees. A member of our team at Caltech together with a colleague at Princeton has developed an application that sets up a circuit across LHCONE, and then assigns a dataset transfer (consisting of many files, totaling from one to 100 terabytes, typically) to the circuit.
ESnet's dynamic circuits, which have been in service for quite a while, are called OSCARS. There is an emerging standard which is promoted by the Open Grid Forum called NSI, and we'll integrate NSI with the application, so that's one upcoming milestone for the LHCONE part of this picture.
One might ask, "What sets the feasible scale of worldwide LHC data operations?" Two factors stand out. The first is the data stored, which is hundreds of petabytes and growing at a rate that will soon reach an exabyte. The second is the ability to send the data across networks, over continental and transoceanic distances.
One venue to address the second factor and show the year-to-year progress in the ability to transfer many petabytes of data at moderate cost using multiple generations of technology is the Supercomputing Conference. This is a natural place to bring our efforts on network transfer applications, state of the art network switching and server systems and software defined network architectures together, in one intensive exercise spanning a week from setup to teardown.
Caltech and its partners, notably Michigan, Vanderbilt, Victoria and the US HEP labs (Fermilab and BNL), FIU, CERN, and other university and lab partners, along with the network partners mentioned above, have defined the state of the art (nearly) every year since 2002 as our explorations of high speed data transfers climbed from 10 Gbps to 100 Gbps and, more recently, hundreds of Gbps.
The Supercomputing 2014 event hosted the largest and most diverse exercise yet, defining the state of the art in several areas. We set up a Terabit/sec ring with a total of 24 100 Gbps links among the Caltech, NITRD/iCAIR and Vanderbilt booths, using optical equipment from Padtec, a Brazilian company, and Layer 2 OpenFlow-capable switching equipment from Brocade (a fully populated MLXe16) and Extreme Networks.
We also connected to the Michigan booth over a 100 Gbps dedicated link and we had four 100 Gbps wide area network links connecting to remote sites in the US, Europe and Latin America over ESnet, Internet2, and the Brazilian national and regional networks RNP and ANSP.
In addition to the networks we constructed a compact data center capable of very high throughput using state of the art servers from Echostreams and Intel with many 40 Gbps interfaces and hundreds of SSDs from Seagate and Intel.
Then apart from the high throughput and dynamic changes at Layer 1, one of the main things was to be able to show software-defined networking control of large data flows. So that's where our OpenFlow controller, which is written mainly by Julian, came in.
We demonstrated dynamic circuits across this complex network, intelligent network path selection with a variety of algorithms in Julian's OpenDaylight controller, and the ability of the controller to react to changes in the underlying optical network topology, which were driven by an SDN Layer 1 controller written by a Brazilian team from the university in Campinas.
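The interview does not say which path selection algorithms were used. As one illustration of the general idea, here is a minimal sketch that picks, among candidate paths, the one whose most constrained link has the most spare capacity. The node names, link capacities, and function names are all hypothetical.

```python
# Hypothetical sketch: pick the candidate path with the largest bottleneck capacity.
# This is NOT the algorithm used in the SC14 demonstration, which the interview
# does not specify; it only illustrates what "path selection" can mean.

def bottleneck_gbps(path, available_gbps):
    """Spare capacity (Gbps) of the most constrained link along a path."""
    return min(available_gbps[(a, b)] for a, b in zip(path, path[1:]))


def pick_path(candidate_paths, available_gbps):
    """Return the candidate whose bottleneck link has the most spare capacity."""
    return max(candidate_paths, key=lambda p: bottleneck_gbps(p, available_gbps))


# Made-up topology: spare capacity in Gbps for each directed link.
available_gbps = {
    ("A", "B"): 40, ("B", "D"): 90,
    ("A", "C"): 70, ("C", "D"): 60,
}
candidates = [["A", "B", "D"], ["A", "C", "D"]]

best = pick_path(candidates, available_gbps)
print(best)
```

In this made-up topology the path through C is preferred, because its tightest link still has 60 Gbps free while the path through B is limited to 40 Gbps.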
Once set up, we quickly achieved more than 1 Tbps on the conference floor and about 400 Gbps over the wide area networks. The whole facility was set up, operated with all the SDN related aspects mentioned above, and torn down and packed for shipment in just over one intense week.
The exercise was a great success, and one that we hope will show the way towards next generation extreme-scale global systems that are intelligently managed and efficiently configured on the fly.
After Supercomputing in 2014, we set up a test bed and Julian has started to work with a number of different SDN-capable switches--including Brocade's SDN-enabled MLXe router at Caltech and switches at other places. So we're progressing and we expect to go from testing to preproduction and we hope into production in the next year or so.
This is just one cycle in an ongoing development effort, keeping pace with the expanding needs and working at the limits of the latest technologies and developing new concepts of how to deal with data on a massive scale in each area. One target is the Large Hadron Collider, but other projects in astrophysics, climatology and genomics, among others, could no doubt benefit from our ongoing developments.
So the goal of all these efforts is to enable users to set up large flows using SDN?
NEWMAN: Yes. The first users are data managers with very large volumes of data, from tens of terabytes to petabytes, who need to transfer data in an organized way. We can assign those flows to circuits, and give them dedicated bandwidth while the transfers are in progress, to make the task of transferring the data shorter and more predictable.
Then there are thousands of physicists who access and process the data remotely, and repeatedly, as they continue to improve their software and analysis methods in the search for the next round of discoveries. This large community also uses dynamic caching methods, where chunks of the data are brought to the user so that the processing power available locally to each group of users can be well used. We'll probably treat each research team, or a set of research teams in a given region of the world, as a group, in order to reduce the overall complexity of an already complex global undertaking.
So some folks will have direct access to the controller while others will have to make requests of you folks?
NEWMAN: People are authorized once they have enough data to deal with. You see, there's a scale matching problem. Given the throughput we deal with, if you have less than, let's say a terabyte of data, it hardly matters. If I have a data center with tens to hundreds of terabytes to transfer at a time, there would be some interaction between the data manager side and the network side. The data manager can make a request, "I've got this data to transfer from A to B," and the network side can use a set of controllers to help manage the flows, and see that the entire set of data arrives, in an acceptable time.
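To make the scale matching point concrete, here is a rough back-of-the-envelope sketch of ideal transfer times. It assumes a single 100 Gbps path running at full line rate with no protocol or storage overhead, and decimal terabytes (10^12 bytes); those assumptions, and the function name, are illustrative rather than taken from the interview.

```python
# Rough sketch: ideal transfer time at full line rate, ignoring all overheads.
# The 100 Gbps figure and decimal units are assumptions for illustration.

TB = 1e12  # bytes in a decimal terabyte


def transfer_time_seconds(data_bytes: float, link_gbps: float) -> float:
    """Ideal time to move data_bytes over a link of link_gbps (10^9 bits/s)."""
    return data_bytes * 8 / (link_gbps * 1e9)


for size_tb in (1, 10, 100):
    seconds = transfer_time_seconds(size_tb * TB, link_gbps=100)
    print(f"{size_tb:>4} TB at 100 Gbps: {seconds:,.0f} s ({seconds / 3600:.2f} h)")
```

Under these assumptions a terabyte takes a little over a minute, while 100 terabytes occupies the link for more than two hours, which is roughly where coordination between the data manager and the network starts to pay off.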
We've worked out a solution where, for each data set transfer, we know which of the many, many compute nodes are going to be involved. In order to direct traffic, we get a list of all the source IP addresses and pass those on to the controller, and when the controller sees the source and destination IPs it can set up a flow rule and map the flow onto a dynamic circuit between A and B.
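The mapping described here can be sketched roughly as follows: a table keyed by source and destination IP that points each matching flow at a circuit. This is a simplified model only, not the actual controller code or the OpenDaylight API; the function names, circuit label, and addresses (taken from reserved documentation ranges) are hypothetical.

```python
# Simplified model of mapping (source IP, destination IP) pairs onto a circuit.
# Names, addresses, and the circuit label are hypothetical; a real controller
# would install OpenFlow rules on switches rather than fill a Python dict.
from ipaddress import IPv4Address
from typing import Dict, Iterable, Optional, Tuple

FlowKey = Tuple[IPv4Address, IPv4Address]


def build_flow_rules(source_ips: Iterable[str], dest_ip: str,
                     circuit_id: str) -> Dict[FlowKey, str]:
    """Create one rule per source node taking part in a dataset transfer."""
    dst = IPv4Address(dest_ip)
    return {(IPv4Address(src), dst): circuit_id for src in source_ips}


def circuit_for(rules: Dict[FlowKey, str], src: str, dst: str) -> Optional[str]:
    """Return the circuit for a flow, or None if it is ordinary shared traffic."""
    return rules.get((IPv4Address(src), IPv4Address(dst)))


# The list of source nodes would come from the data transfer application.
rules = build_flow_rules(
    source_ips=["192.0.2.10", "192.0.2.11", "192.0.2.12"],
    dest_ip="198.51.100.5",
    circuit_id="circuit-A-to-B",
)

print(circuit_for(rules, "192.0.2.11", "198.51.100.5"))  # a managed flow
print(circuit_for(rules, "192.0.2.99", "198.51.100.5"))  # not in the transfer
```

Flows that do not match any rule simply stay on the general purpose network, which reflects the point that the applications themselves need little or no change.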
When dealing with individuals, it's just going to be a question of looking at the aggregate traffic and how it's flowing and trying to direct flows. Down the line we intend to apply machine learning classifications and understanding to learn the patterns from the flow data so we can manage it. That's somewhat down the line but I think it's an interesting application to apply to this kind of problem.
How many controllers do you think you'll end up with?
NEWMAN: That's an interesting question because this is actually a collaboration of many organizations. We start with one controller. I think ultimately there will be a few at strategic points, a handful, but how the different controllers interact is not very well developed in the OpenDaylight framework.
How many switches will ultimately be controlled?
NEWMAN: There are 13 Tier 1 sites and 160 Tier 2 sites, but I think we'll probably end up somewhere in the middle, which is a few dozen switches involved with the largest flows.
Did you look at buying a controller versus building one?
NEWMAN: We looked at some controllers. We had previous development based on the Floodlight controller. Julian?
BUNN: The OpenDaylight controller is open-source software and supported by the major vendors and many research groups. It has become sort of the de facto SDN controller in the community. There have been others, such as the Floodlight controller, which we've used. Some of these were a little less open. That's why we picked OpenDaylight. We'd already worked with Floodlight so we knew how an SDN controller worked.
NEWMAN: The Brocade Vyatta Controller is based on OpenDaylight and Brocade is an active contributor to the OpenDaylight project, but the funding agencies prefer that we choose open-source software because of the potential benefits of engaging a larger community of users and developers.
What version of OpenFlow have you settled on here?
BUNN: OpenFlow 1.0 because we found the particular switches we've been using support that very well. We don't need any of the features in 1.3. The sort of flows we're writing into the switch tables don't really need anything more advanced than 1.0 at the moment.
NEWMAN: The other aspect is, when you have test events there are typically different flavors of switches involved, so by requiring OpenFlow 1.0 it's easier to make them all work together. We foresee moving to OpenFlow 1.3 when the number of switches supporting it increases, and when there is a greater need to moderate the size of flows on the fly (a feature supported in 1.3).
We're also following the OpenDaylight releases. We worked with the Hydrogen release and then, after the SC14 conference, we tried some exercises with the Helium release. So we look at what's being developed and what features there are and if any are important we adapt them. The next OpenDaylight release is called Lithium. It's an enhancement of Helium, and we will use it when it's available in June.
Speaking of timing, what's the next step? How long will it take to see this vision through?
NEWMAN: It's very progressive. We're starting to get it out in the field. Our test bed at Caltech has six switches, three different types, including Brocade MLXe and CER switch routers and others, and we're going to add a fourth type at Michigan. Julian is set up to try his flow rules and we have our mechanism to integrate with the end application, where we get these lists of IP addresses and use them to set up matching flow rules. As soon as we have exercised that, we will start to do it again in the wide area.
Part of our team is at CERN in Geneva and we certainly will want to set up a switch there. That should happen in the next few months, and then the idea is to set up a preproduction operation starting with some of these managed flows and the application in my CMS experiment, so in the next year or two we'll be well on our way to production.
So this is predominately a wide area thing, but are there data center or campus implications as well?
NEWMAN: It depends. Campus, maybe. Brocade's ICX campus switches are OpenFlow 1.3 ready, so flow control can be done down to the workstation or server level. Data center and directing flows, I can see a lot of potential there. The point is where you have shifting loads and you have large data flows and want to have them go efficiently, this could be very useful. It clearly is a big vision. We'll start to implement this and see how it goes. But I think it will have a big impact, with implications for research and education networks and the universities and labs they serve.
The scale you guys deal with is so different from the enterprise folks I typically talk to, so it's very interesting.
NEWMAN: Yes, I should give you some numbers. In 2012 during the last LHC run, about 200 petabytes of data were transferred. After that we stopped taking data and you'd think the level of activity would be less, but we still sent 100 petabytes. The next run of the LHC, which is a three-year run, will start in June (commissioning of the accelerator is going on right now), and we're expecting much larger data flows than before. (Since the interview, the next run of the LHC has started.)
The Energy Sciences Network (ESnet) reached 18 petabytes per month at the end of last year and the growth rate since 1992 is a factor of ten every four and a quarter years, which is a growth rate of 72% per year. The projection forward is an exabyte a month by about 2020 and 10 exabytes a month by about 2024.
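The annual growth figure follows directly from the tenfold period quoted above. A short check of that arithmetic, with variable names chosen only for illustration:

```python
# Annual growth implied by "a factor of ten every four and a quarter years".
years_per_tenfold = 4.25

# If traffic grows by a constant factor g each year, then g ** 4.25 == 10.
annual_factor = 10 ** (1 / years_per_tenfold)
annual_growth_percent = (annual_factor - 1) * 100

print(f"annual factor: {annual_factor:.3f}")
print(f"annual growth: {annual_growth_percent:.0f}% per year")
```

The yearly factor comes out at about 1.72, which matches the 72% per year quoted.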
In terms of individual flows, we can already do several tens of gigabytes per second in production and we can saturate 200 Gbps links (for example a 100 Gbps link bidirectionally) over long distances at will.