A few years ago, I was involved in a consulting project with a large company in the healthcare industry that was in the middle of a data center migration. After the networks and servers were stood up at the new location they needed to migrate massive amounts of data in bulk so the company secured a pair of OC192 circuits, providing nearly 10Gbps of throughput in each direction on each circuit.
Everything seemed to be in order, so they began transferring data. To their surprise, they were only seeing throughput in the tens of megabits per second, even on servers connected to the network via gigabit Ethernet switches. After exhausting all the normal troubleshooting steps, they decided to bring in a fresh set of eyes. What we discovered may seem counterintuitive: this company's pipes were just too big. The company was suffering from a Long Fat Network (LFN).
The LFN problem addressed here relates to one function of one protocol in one layer of the OSI model: the Transmission Control Protocol, or TCP. Layer 4, the Transport Layer, provides numerous functions, including:
Segmentation of data. If the amount of data sent by an application exceeds the capability of the network, or of the sender or receiver's buffer, the Transport Layer can split up the data into segments and send them separately.
Ordered delivery of segments. If a piece of data is broken up into multiple segments and sent separately, there is no guarantee the segments will arrive at the destination in the correct order. The Transport Layer is responsible for receiving all the segments and, if necessary, putting the data stream back together in the correct order.
Multiplexing. If a single computer is running multiple applications, the Transport Layer differentiates between them and ensures data arriving on the network is sent to the correct application.
In addition, the Transport Layer traditionally has been responsible for reliability, or guaranteed delivery of data. Not all Transport Layer protocols provide reliability mechanisms, and which Transport Layer protocol is used by a given application depends on a number of considerations. However, the majority of data traversing networks today utilizes the TCP, which does indeed provide a reliability mechanism. And it is TCP's reliability mechanism that is at the heart of the LFN problem.
When data is ready to be sent TCP performs the following sequence of events:
1. TCP on the initiating computer establishes a connection with TCP on the remote computer.
2. Each computer advertises its Window Size, which is the maximum amount of data that the other computer should send before pausing to wait for an acknowledgment. The advertised window size is typically related to the size of the computer's receive buffer.
3. TCP begins transmitting the data in intervals equal to the maximum segment size, or MSS (also negotiated by the hosts). Once the amount of data transmitted equals the window size, TCP pauses and waits for an acknowledgment. TCP will not send any more data until an acknowledgment has been received.
4. If an acknowledgment is not received in a timely manner, TCP retransmits the data and once again pauses to wait for an acknowledgment.
This "send and wait" method of reliability ensures that data has been delivered, and frees applications and their developers from having to reinvent the wheel every time they want to add reliability to their applications. However, this method lends itself to inefficiencies based on two factors: 1) how much data a computer sends before pausing, and 2) how long the computer has to wait to receive an acknowledgment. It is these two factors that are critical to understanding, and ultimately overcoming, the LFN problem.
We now have enough information to understand the LFN problem. TCP is efficient on Short Skinny Networks, but not on Long Fat Networks. The longer the network (i.e. the higher the latency), the longer TCP has to sit by twiddling its thumbs waiting for an acknowledgment before it can send more data. And the fatter the network (i.e. the faster a sender can serialize data onto the wire), the greater the percentage of time TCP is sitting by idly. When you put those two together -- Longness and Fatness, or high latency and high bandwidth -- TCP can become very inefficient.
Here is an analogy. Let's say you have a coworker who talks a lot. It's not that he has a lot to say, he just talks really slowly. When you are having a conversation face to face, he can pretty much just keep talking and talking and talking. He gets near-instant acknowledgment that you heard what he said, so he can just keep talking. There is very little dead air. This is equivalent to a Short Skinny network.
Now let's say your coworker becomes an astronaut and flies to Mars. He calls you on his astronaut phone to tell you about the trip, and he is really excited so he talks really fast. But the delay is really long. Since he can't see you, he decides that every 25 words he will pause and wait for you to respond before he continues speaking.
Since your friend talks really fast, let's say it only takes him five seconds to spit out 25 words before pausing to wait for a response. If the round trip delay between Earth and Mars is 10 seconds, he will only be able to speak 33% of the time. The other 67% of the time the line between you and the Martian is sitting idle.
It wouldn't be such a big deal if he didn't speak so fast. If it took him two minutes to speak those same 25 words instead of blurting them out in five seconds, he'd be speaking for about 92% of the time. Likewise, if the round trip latency between you were lower, let's say two seconds, the utilization percentage of the line would go up as well. In this case he would speak for five seconds and then pause for two, achieving a utilization of about 71%.
Let's look at a real-world network scenario. Two computers, Computer A and Computer B, are located at two different sites that are connected by a T-3 link. The computers are connected to Gigabit Ethernet switches. The one-way latency is 70 milliseconds. Computer A initiates a data transfer to Computer B using an FTP PUT operation. The following sequence of events occurs (for the sake of simplicity, I will leave out some of the TCP optimizations that may occur in the real world):
1. Computer A initiates a TCP connection to Computer B for the data transfer.
2. Each computer advertises a window size of 16,384 bytes, and an MSS of 1,460 bytes is negotiated.
3. Computer A starts sending data to Computer B. With an MSS of 1,460 bytes and a window size of 16,384 bytes, Computer A can send 11 segments before pausing to wait for an acknowledgment from Computer B.
So how efficient is our sample network? To figure this out, we need to calculate two numbers:
1. The maximum amount of data that could be in flight on the wire at any given point in time. This is called the bandwidth-delay product. Think of it like an oil pipeline: how much oil is contained within a one mile stretch of pipe if the oil is flowing at 10mph and you are pumping 10 gallons per minute? (Answer: 60 gallons). In our example, the T-3 bandwidth is 44.736Mbps (or 5.592 megabytes per second) and the delay is 70 milliseconds. So the bandwidth-delay product is 5.592 x .07, or about 0.39MB (399.36KB). This means at any given point in time, if the T-3 link is totally saturated, there is 0.39MB of data in flight on the wire in each direction.
2. The amount of data actually transmitted by Computer A before pausing to wait for an acknowledgment. In our example, Computer A sends 11 segments, each being 1460 bytes. So Computer A can only send 1460 x 11 = 16,060 bytes (15.68KB) before having to pause and wait for an acknowledgment from Computer B.
So, if the network link could support 399.36KB at any given point, but Computer A can only put 15.6KB on the wire before pausing to wait for an acknowledgment, the efficiency is only 3.9%. That means that the link is sitting idle 96.1% of the time!
Do you see the problem? In some cases, TCP sacrifices performance for the sake of reliability, particularly when latency and/or bandwidth is relatively high. But is it possible to achieve both performance and reliability? Can we have our cake and eat it too?
Yes, and we'll look at the options in a moment. But first, let's look at the two obvious but unrealistic solutions:
1. Decrease latency. If we could decrease the amount of time it takes for a bit to make it from one side of the network to the other, computers wouldn't have to go get a cup of coffee every time they send a TCP Window's worth of data. But until someone comes out with the Quantum Wormhole router, or figures out how to increase the speed of light or bend space, you're probably stuck with the latency you've got.
2) Decrease throughput. If we turn a Fat network into a Skinny network without changing the latency or the TCP window size, it stands to reason that link utilization would go up. But I recommend thinking twice before bringing this option up with your colleagues ("What, you want less bandwidth?").
OK, now that we've got that out of the way, let's look at the real solutions:
1) TCP window scaling. One might wonder why the TCP window size field is only 16 bits long, allowing for a maximum of a 65,535 byte window. But remember that TCP's reliability mechanism was written in a day when data link bandwidth was measured in bits. Today, 10 Gigabit links are common (and 40G and 100Gbps links are becoming more common).
RFC 1323, titled "TCP Extensions for High Performance," was published in 1992 to address some of the performance limitations of the original TCP specification in a world of ever increasing bandwidth. In particular, TCP Option 3, titled "Window Scaling," addressed the 65,535 byte window size limitation. Rather than increasing the window size field in the TCP header to a number larger than 16 bits (and thus rendering it incompatible with existing implementations), Option 3 introduces a value by which the TCP window size is bitwise shifted to the left. A value of 1 shifts the 16 bits to the left by 1 bit, doubling the window size. A value of 2 shifts the 16 bits to the left by 2 bits, quadrupling the window size. The maximum value of 14 shifts the 16 bits to the left by 14, increasing the window size by 2^14.
Increasing the window size has the obvious benefit of allowing TCP to send more segments before pausing to wait for a response. However, this performance benefit comes with some risk, such as buffer issues and larger retransmits when segments are lost. Virtually every modern operating system in use today uses TCP window scaling by default, so if you're seeing small window sizes on the network, you may need to do some troubleshooting. Are there any firewalls or IPS devices on the network stripping TCP options? Are hosts scaling back the window size due to buffers filling up or excessive packet loss?
2) Multiple TCP sessions. The problem described here applies to a single TCP session only. In the earlier example, Computer A's TCP session was only utilizing 3.9% of the link's bandwidth. If 25 computers were transmitting, each using a single TCP session, a link utilization of 97.5% could be achieved. Or, if Computer A was able to open 25 TCP sessions simultaneously, the same utilization could be achieved. This will almost never be a good solution to the problem at hand, but is included here for completeness.
3) Different transport layer protocol. TCP isn't the only transport layer protocol available. TCP's unreliable cousin, the User Datagram Protocol, does not provide any guarantee of delivery, and is therefore free to consume all available resources.
4) Caching. Content caching utilizes proxies to store data closer to the client. The first client to access the data "primes" the cache, while subsequent requests for the same data are served from the local proxy. Content caching is a band-aid solution that is becoming increasingly obsolete in an age of constantly changing and dynamically created content, but it is still worth mentioning.
5) Edge computing. Paid services like Akamai decentralize content and push it to the edges of the network, as close to the clients as possible. One of the results is lower latency between clients and servers.
6) WAN optimization and acceleration. Products from companies like Riverbed and Silver Peak, or open source alternatives like TrafficSqueezer, employ various techniques such as data deduplication, compression, and dictionary templating to increase the perceived performance of a WAN link.
Earlier, I said the company in question here had too much bandwidth. Well, that wasn't really the case. Their real problem was the behavior of the protocol they were using. One of their system admins, at the direction of a software developer, had monkeyed with the configuration of their servers in an attempt to tune application performance. The application wasn't able to pick up packets from the buffer fast enough (due to a software bug), causing the TCP window to scale back the window size, but not before buffers were filled, packets were dropped, and retransmits were occurring. In an attempt to eliminate the retransmits, this admin had statically set the TCP window size on the servers to a relatively low value. Since TCP was being artificially constrained, it was never able to scale up to fill the available bandwidth of the data link.
Today, barring misconfiguration, most networks won't run into LFN problems because TCP window scaling is widely used by default. However, with bandwidth ever on the rise, performance will most certainly become more and more of an issue because latency is fixed (unless you figure out how to bend space-time), and we are likely to see similar symptoms.
Heder, CCIE No. 24788, is a network architect with NES Associates in Alexandria, Va., specializing in large-scale network design. Heder holds a master's degree with a concentration in network architecture and design, and has a patent filed for an IPv6 technology. He can be reached at firstname.lastname@example.org.