Information Systems Frontiers, DOI 10.1007/s10796-013-9463-4

IDTCP: An effective approach to mitigating the TCP incast problem in data center networks

Guodong Wang · Yongmao Ren · Ke Dou · Jun Li

© Springer Science+Business Media New York 2013

Abstract Recently, the TCP incast problem in data center networks has attracted a wide range of industrial and academic attention, and many attempts have been made to address it through experiments and simulations. This paper analyzes the TCP incast problem in data centers by focusing on the relationship between TCP throughput and the size of TCP's congestion control window. The root cause of the TCP incast problem is explored, and the essence of the current methods for mitigating TCP incast is explained. The rationality of our analysis is verified by simulations. The analysis and the simulation results provide significant implications for the TCP incast problem. Based on these implications, an effective approach named IDTCP (Incast Decrease TCP) is proposed to mitigate the TCP incast problem. Analysis and simulation results verify that our approach effectively mitigates the TCP incast problem and noticeably improves TCP throughput.

Keywords Congestion control · Data center networks · TCP incast

G. Wang (✉) · K. Dou
Graduate University of Chinese Academy of Sciences, Beijing, China
e-mail: [email protected]

K. Dou
e-mail: [email protected]

G. Wang · Y. Ren · K. Dou · J. Li
Computer Network Information Center of Chinese Academy of Sciences, Beijing, China

Y. Ren
e-mail: [email protected]

J. Li
e-mail: [email protected]

1 Introduction

Data centers have become more and more popular for storing large volumes of data. Companies like Google, Microsoft, Yahoo, and Amazon use data centers for web search, storage, e-commerce, and large-scale general computations. Building data centers with commodity TCP/IP and Ethernet networks remains attractive because of their low cost and ease of use. However, TCP easily suffers a drastic reduction in throughput when multiple senders communicate with a single receiver in these networks, a phenomenon termed incast.

To mitigate the TCP incast problem, many methods have been proposed. The accepted methods include enlarging the switch buffer size (Phanishayee et al. 2008), decreasing the Server Request Unit (SRU) (Zhang et al. 2011a), and shrinking the Maximum Transmission Unit (MTU) (Zhang et al. 2011b). Reducing the retransmission timeout (RTO) to microseconds has also been suggested to shorten TCP's response time, although the availability of such a timer is a stringent requirement for many operating systems (Vasudevan et al. 2009). Another commonly used method is to modify the TCP congestion control algorithm. Phanishayee et al. (2008) tried disabling TCP's Slow Start mechanism to slow down the congestion control window (cwnd) updating rate, but their experimental results show that this approach does not alleviate TCP incast. Alizadeh et al. (2010) provide fine-grained window-based congestion control by using Explicit Congestion Notification (ECN) marks. Wu et al. (2010) measure the sockets' incoming data rate periodically and then adjust the rate by modifying the advertised receive window.

In view of so many methods, this paper provides the following contributions. First, we analyze the root cause of the TCP incast problem from the perspective of the cwnd size of TCP. Based on the analysis, the relationships among the TCP throughput, the SRU, the switch buffer size and the cwnd are discussed. Second, we explore the essence of the current methods adopted to smooth the TCP incast problem, including enlarging the switch buffer size, decreasing the SRU, and shrinking the MTU. Third, a series of simulations is conducted to verify the accuracy of our analysis. The analysis and the simulation results provide significant implications for the TCP incast problem. Finally, based on these implications, an effective method named IDTCP is proposed to mitigate the TCP incast problem.

The remainder of this paper is organized as follows. Section 2 describes the related work. Section 3 presents our analysis of the TCP incast problem. Section 4 validates the analysis by simulation and presents the implications. Section 5 describes the IDTCP congestion control algorithm. Section 6 conducts a series of simulations to evaluate the performance of IDTCP. Section 7 concludes this paper.

2 Related work

The problem of incast has been studied in several papers. Gibson et al. (1998) first mentioned the TCP incast phenomenon. Phanishayee et al. (2008) found that TCP incast is related to the overflow of the bottleneck buffer. The phenomenon was then analyzed in Chen et al. (2009) and Zhang et al. (2011a), relating it to the number of senders and to different SRU sizes. Similarly, Zhang et al. (2011b) used a smaller MTU to increase the bottleneck buffer size in terms of the number of packets.

Phanishayee et al. (2008) outlined several solutions, which include using a different TCP variant for better packet loss recovery, disabling Slow Start to slow down the increase of the TCP sending rate, or even reducing the retransmission timeout (RTO) to microseconds so that there is a shorter idle time upon RTO. Vasudevan et al. (2009) further investigated the RTO in detail and found it very effective if a high-resolution timer can be used in the OS kernel. However, the availability of such a timer for each TCP connection is a stringent requirement for many operating systems. Although Phanishayee et al. (2008) pointed out that disabling Slow Start did not alleviate TCP incast, a detailed explanation and analysis were not presented.

Throttling TCP flows to prevent the bottleneck buffer from being overwhelmed is a straightforward idea. DCTCP (Alizadeh et al. 2010) uses ECN to explicitly signal the network congestion level and achieves fine-grained congestion control by counting ECN marks. However, deploying DCTCP may require updating switches to support ECN. ICTCP (Wu et al. 2010) measures the bandwidth of the total incoming traffic to obtain the available bandwidth, and then controls the receive window of each connection based on this information. The weakness of ICTCP is that the number of concurrent senders that do not experience TCP incast is still not large enough, and its flow rate adaptation is not immediate.

3 Analysis

3.1 Model

Data center networks are usually composed of high-speed, low-propagation-delay links with limited switch buffers. In addition, client applications usually stripe data over different servers for reliability and performance. The model presented in Phanishayee et al. (2008) abstracts the most basic representative setting in which TCP incast occurs. In order to simplify our analysis we use the same model, as shown in Fig. 1.

In our model, a client requests a data block, which is striped over n servers. Table 1 shows the specific model notation which we use.

3.2 The congestion control window and TCP incast

Traditional TCP’s cwnd updating mechanism combines twophases: Slow Start and Congestion Avoidance. At the SlowStart phase, the TCP doubles its cwnd from the default value(in standard TCP, the default value is 1) at every RTT, whileat the Congestion Avoidance phase, it leads to an approx-imately linear increase with time. The standard TCP cwndupdating rules are given by

$$W_{i+1} = \begin{cases} W_i + 1, & \text{Slow Start} \\ W_i + 1/W_i, & \text{Congestion Avoidance} \end{cases} \quad (1)$$

where the index i denotes the reception of the ith ACK. For a given SRU, the number of ACKs which is used to acknowledge the received packets can be described by

$$ACK = \frac{SRU}{MSS} \quad (2)$$
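For concreteness, the per-ACK update of Eq. 1 can be written as a short function. This is our own illustrative sketch, not code from the paper; the slow-start threshold ssthresh is assumed here to mark the switch between the two phases.

    def reno_cwnd_update(cwnd, ssthresh):
        """Per-ACK congestion window update of standard TCP (Eq. 1).

        In Slow Start the window grows by 1 per ACK (doubling per RTT);
        in Congestion Avoidance it grows by 1/cwnd per ACK (about +1 per RTT).
        """
        if cwnd < ssthresh:          # Slow Start
            return cwnd + 1.0
        else:                        # Congestion Avoidance
            return cwnd + 1.0 / cwnd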

Fig. 1 A simplified TCP incast model


Table 1 Model notation

Symbol   Description
n        server number
B        switch buffer size (in packets)
BDP      Bandwidth Delay Product
SRU      Server Request Unit (in KB)
ACK      the number of Acknowledgements
MSS      Maximum Segment Size (1460 Byte)
MTU      Maximum Transmission Unit (1500 Byte)
cwnd     congestion control window (in packets)

To find the maximum cwnd (cwnd_max) which corresponds to these ACKs, we suppose the link is not congested and TCP is in the Slow Start phase. Each ACK's arrival increases the cwnd of TCP by 1 from the initial value, so the maximum cwnd can be given by

$$cwnd\_max = ACK + 1 = \frac{SRU}{MSS} + 1 \quad (3)$$

The range of the current cwnd (cwnd(t)) of each server (n) can be given by

$$1 \le cwnd(t)_n \le cwnd\_max_n \quad (4)$$

From Fig. 1, each server will send $cwnd(t)_n \times MTU$ data synchronously to the client; with n servers, $\sum_{i=1}^{n} cwnd(t)_n \times MTU$ data will be synchronously appended to the switch buffer. As the servers fundamentally share the same switch, in order to ensure that the concurrent packets will not exceed the link capacity, the following condition should be met

$$\sum_{i=1}^{n} cwnd(t)_n \times MTU \le B + BDP \quad (5)$$

From Eq. 5, in order to increase the number of servers (n), one approach is to enlarge the switch buffer size B (Phanishayee et al. 2008). Another approach is to decrease the cwnd, which can be achieved by reducing the SRU (Zhang et al. 2011a): the maximum cwnd is in direct proportion to the SRU as shown in Eq. 3, so limiting the SRU limits the maximum cwnd. The third approach is to decrease the MTU (Zhang et al. 2011b). Therefore, the essence of these approaches (Phanishayee et al. 2008; Zhang et al. 2011a, b) can be explained well by Eq. 5.
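As an illustration of how Eq. 5 can be used, the sketch below (ours, not the authors') checks whether a synchronized burst from n servers fits into the switch buffer plus the in-flight capacity; the default parameter values (1500-byte MTU, 128KB buffer, 1 Gbps bandwidth, 0.1 ms RTT) follow the setting used later in the paper, and the cwnd is assumed equal for all servers.

    def incast_safe(n, cwnd, mtu_bytes=1500, buf_bytes=128 * 1024,
                    bw_bps=1e9, rtt_s=0.1e-3):
        """Return True if the synchronized burst of n servers fits into
        the switch buffer plus the in-flight capacity (Eq. 5)."""
        bdp_bytes = bw_bps * rtt_s / 8      # bandwidth-delay product in bytes
        burst = n * cwnd * mtu_bytes        # left-hand side of Eq. 5
        return burst <= buf_bytes + bdp_bytes

    # With cwnd = 1 and a 128KB buffer, about 95 servers still fit:
    print(incast_safe(95, 1), incast_safe(96, 1))   # True False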

Besides, cwnd(t) can be given by

$$cwnd(t) = \begin{cases} 2^{t/RTT}, & \text{Slow Start} \\ \dfrac{t - t_\gamma}{RTT} + \gamma, & \text{Congestion Avoidance} \end{cases} \quad (6)$$

where t represents the elapsed time, and $t_\gamma$ and $\gamma$ are the time and the cwnd, respectively, when TCP exits the Slow Start phase. To mitigate the TCP incast problem, another approach is therefore to decrease the cwnd growth rate in both the Slow Start and Congestion Avoidance phases.

3.3 The congestion control window size in DCN

From Eq. 6, the relationship between the cwnd and the updating time (T(cwnd), the time that changes with the growth of cwnd) can be given by

$$T(cwnd) = \begin{cases} RTT \times \log_2 cwnd, & \text{Slow Start} \\ RTT \times (cwnd - \gamma), & \text{Congestion Avoidance} \end{cases} \quad (7)$$

As TCP transmits approximately cwnd packets within one RTT, the average throughput (TP) can be given by

$$TP = \frac{cwnd \times MSS}{RTT} \quad (8)$$

where the default MSS is 1460 bytes. From Eq. 8, it is clear that $cwnd = \frac{TP \times RTT}{MSS}$. Taking the typical data center network as an example, the bandwidth is 1 Gbps and the RTT is 0.1 ms, so $cwnd = \frac{1\,Gbps \times 0.1\,ms}{1460\,Byte} = 8.562$. Based on this discussion, we calculate the cwnd and the cwnd updating time of the traditional TCP for different BDPs in Table 2.

Table 2 shows that on a link with 1000 Mbps bandwidth and 0.1 ms RTT, TCP can fill the 1 Gbps pipe with a cwnd of only 8.562. The time to finish Slow Start (supposing Slow Start ends when the throughput reaches 50 % of the bandwidth, which follows the standard TCP mechanism of reducing the cwnd to 50 % of its current value when congestion occurs) is only 0.210 ms (calculated by Eq. 7, $RTT \times \log_2 cwnd = 0.1\,ms \times \log_2 4.281 = 0.210\,ms$), which is too abrupt for the standard TCP to control the flow.

The requirement for a small TCP cwnd in data center networks indicates that the standard TCP is too aggressive to avoid the TCP incast problem. Therefore, slowing down the cwnd growth rate of the traditional TCP is a critical step towards mitigating the incast problem. Additionally, it can be seen from Table 2 that a small cwnd (8.562) can fill the 1 Gbps bandwidth, so the total throughput will not be excessively affected by slowing down the cwnd growth rate.

Table 2 The relationships between BDP and cwnd updating time

BDP (Mbps×ms)   cwnd     SS time (ms)   CA time (ms)
1000×0.1         8.562    0.210          0.428
1000×0.2        17.12     0.620          1.712
1000×0.3        25.68     1.105          3.853
1000×0.4        34.25     1.639          6.849
1000×0.5        42.81     2.210         10.702
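The entries of Table 2 follow from Eqs. 7 and 8. The sketch below is our own reconstruction, not the authors' code; it assumes, as in the text above, that Slow Start ends when the window reaches half of the pipe-filling cwnd (γ = cwnd/2) and that the MSS is 1460 bytes.

    import math

    MSS = 1460.0  # bytes

    def table2_row(bw_mbps, rtt_ms):
        """cwnd needed to fill the pipe (Eq. 8) and its updating times (Eq. 7)."""
        bdp_bytes = bw_mbps * 1e6 * rtt_ms * 1e-3 / 8
        cwnd = bdp_bytes / MSS                  # window that fills the link
        gamma = cwnd / 2                        # assumed Slow Start exit point (50 %)
        ss_time = rtt_ms * math.log2(gamma)     # time to reach gamma in Slow Start
        ca_time = rtt_ms * (cwnd - gamma)       # Congestion Avoidance: RTT * (cwnd - gamma)
        return round(cwnd, 3), round(ss_time, 3), round(ca_time, 3)

    print(table2_row(1000, 0.1))   # (8.562, 0.21, 0.428): matches the 1000x0.1 row of Table 2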


4 Validation and implications

In this section, we validate the accuracy of Eq. 5 through simulation on the NS-2 platform (Fall and Vardhan 2007) and discuss the impact of the parameters on TCP incast throughput. The network topology is shown in Fig. 1. The bottleneck link has 1 Gbps capacity, and the RTT between the server and the client is 0.1 ms. The TCP incast module was developed by Phanishayee et al. (2008).

Firstly, we use Eq. 5 to estimate the TCP incast point, which is the number of concurrent servers when TCP incast occurs. Suppose the MSS is 1460 Byte (then the MTU is 1500 Byte accordingly) and the switch buffer size (B) is 128KB. To explore the extreme case, we set the cwnd of all servers to 1, the minimum value. From Eq. 5, the TCP incast point can be calculated by

$$incast\_point = \frac{B + BDP}{MTU} = \frac{128 \times 1024\,Byte + 1\,Gbps \times 0.1\,ms}{1500\,Byte} = \frac{131072\,Byte + 12500\,Byte}{1500\,Byte} = 95.71 \quad (9)$$
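The value in Eq. 9, and the other cwnd = 1 rows of Table 3 below, can be reproduced with a few lines. This is our own sketch under the same assumptions as the text (MTU = 1500 bytes, 1 Gbps bottleneck, 0.1 ms RTT).

    def incast_point_cwnd1(buf_kb, mtu_bytes=1500, bw_bps=1e9, rtt_s=0.1e-3):
        """Estimated incast point for cwnd = 1 (Eq. 9): (B + BDP) / MTU."""
        bdp_bytes = bw_bps * rtt_s / 8
        return (buf_kb * 1024 + bdp_bytes) / mtu_bytes

    for b in (32, 64, 128, 256):
        print(b, round(incast_point_cwnd1(b), 2))
    # 30.18, 52.02, 95.71, 183.1 -- cf. the cwnd = 1 rows of Table 3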

In the same way, we estimate other TCP incast points and show them in Table 3.

In order to validate the TCP incast points estimated in Table 3, the following experiments are conducted.

4.1 The extreme case when cwnd is 1

We first validate the extreme case when cwnd is set to 1. In this experiment, the SRU is set to 64KB and the switch buffer size increases from 32KB to 256KB.

Table 3 The estimated TCP incast points

B (KB)   MTU (Byte)   cwnd   incast points
32       1500         1       30.18
64       1500         1       52.02
128      1500         1       95.71
256      1500         1      183.09
32       1500         2       19.25
64       1500         2       30.18
32       1500         3       15.61
64       1500         3       22.89

Figure 2 shows the throughput for different numbers of servers when cwnd is 1. When the switch buffer size is 32KB, the incast point is around 28 servers, which approximates the value presented in Table 3. With the increase of the switch buffer size, the number of concurrent servers increases accordingly. When the buffer size is 64KB, 128KB and 256KB, the simulated incast points are around 50, 97 and 180, respectively, which are in conformity with the estimated points presented in Table 3.

4.2 The throughput when cwnd is 2 or 3

We further validate the cases when cwnd is set to 2 or 3. In order to compare with the results presented in Fig. 2, the SRU is set to 64KB and the switch buffer size varies from 32KB to 64KB.

Figure 3 shows that with the increase of cwnd, the number of concurrent servers with the same SRU and switch buffer size decreases accordingly. This trend is well anticipated from Eq. 5. Most importantly, the simulated results are also in conformity with the estimated incast points presented in Table 3.

4.3 The throughput with a varying SRU

In this section, we validate the throughput with a varying SRU when cwnd is set to the minimum value 1. In this experiment, the switch buffer size increases from 64KB to 128KB.

Figure 4 shows that when cwnd is set to a constant value, the influence of a varying SRU on the incast points is significantly weakened. This is well explained by Eq. 5, since there is no direct relationship between the SRU and the TCP incast points. However, it should be noted that, as described in Section 3.2, shrinking the SRU can greatly increase the TCP incast points when cwnd is not a constant value (as in the traditional TCP or other TCP variants), because the nature of shrinking the SRU is to reduce the maximum cwnd (cwnd_max) as described in Eq. 3, and reducing the maximum cwnd can indeed increase the TCP incast points according to Eq. 5. If the cwnd is a constant value, however, the maximum cwnd is fixed, and the TCP incast point is not influenced by the value of the SRU. Figure 4 also shows that after TCP incast happens, more bandwidth is exploited with a larger SRU, because a larger SRU takes more time to transmit and accordingly grabs more bandwidth. All in all, these experiments once again validate the accuracy and rationality of Eq. 5, and the simulated results are again in conformity with the estimated incast points presented in Table 3.

The above simulations validate the accuracy and rationality of Eq. 5, since most of the relationships among the cwnd, the switch buffer size and the SRU can be explained well by it. Moreover, Eq. 5 and the simulation results provide the following significant implications.


Fig. 2 Throughput for different numbers of servers when cwnd is 1 (x-axis: The Number of Servers (n), 0–200; y-axis: Throughput (Mbps), 0–1000; curves: B=32KB, B=64KB, B=128KB, B=256KB)

• Optimizing the TCP congestion control algorithm alone cannot solve the TCP incast problem, but it can mitigate it. To mitigate TCP incast, the cwnd growth rate should be reduced. To solve the TCP incast problem, an application-layer approach such as scheduling the concurrent servers (Podlesny and Williamson 2012) may be the way out.

• The switch buffer size and the MTU play a significant role in mitigating the TCP incast problem. To avoid packet loss, Eq. 5 should be met.

• Setting the cwnd to 1 maximizes the number of concurrent servers, but the bandwidth utilization suffers when the number of concurrent servers is small.

5 IDTCP congestion control algorithm

Based on the above discussion, a novel congestion control algorithm named IDTCP is proposed in this section. IDTCP mitigates the TCP incast problem through the following strategies.

5.1 Constantly monitoring the congestion level of the link

Using packet loss as the only congestion signal, as traditional TCP does, is insufficient for accurate flow control (Katabi et al. 2002). This binary signal (loss or no loss) only expresses two extreme states of the network link; therefore, IDTCP adds a delicate and effective mechanism that continuously estimates the bottleneck status.

In large BDP networks, estimating the number of queuing packets (Brakmo and Peterson 1995) is widely used to measure the congestion level of the link. However, in data center networks, the cwnd required to fill the 1 Gbps bandwidth is much smaller than in large BDP networks, as shown in Table 2. As a consequence, using the number of queuing packets to estimate the congestion level of the link in data center networks is not as accurate as in large BDP networks. Therefore, in the IDTCP algorithm, we use

$$\alpha = \frac{RTT\_avg - RTT\_base}{RTT\_base}$$

to estimate the congestion level of the link, where RTT_base is the minimum RTT measured by the sender, and RTT_avg is the average RTT estimated during the current cwnd's packet updating period.
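As an illustrative sketch (ours, not the authors' implementation), α can be maintained from per-ACK RTT samples as follows; using one cwnd's worth of samples per measurement period is an assumption based on the description above.

    class CongestionEstimator:
        """Tracks alpha = (RTT_avg - RTT_base) / RTT_base over a measurement period."""

        def __init__(self):
            self.rtt_base = float("inf")   # minimum RTT seen so far (propagation delay)
            self.samples = []              # RTT samples of the current cwnd of packets

        def on_ack(self, rtt):
            self.rtt_base = min(self.rtt_base, rtt)
            self.samples.append(rtt)

        def alpha(self):
            """Normalized queuing delay: 0 means no queuing, 1 means the
            queuing delay equals the propagation delay."""
            if not self.samples:
                return 0.0
            rtt_avg = sum(self.samples) / len(self.samples)
            self.samples = []              # start a new measurement period
            return (rtt_avg - self.rtt_base) / self.rtt_base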

Fig. 3 Throughput for different numbers of servers when cwnd is 2 and 3 (x-axis: The Number of Servers (n), 0–50; y-axis: Throughput (Mbps), 0–1000; curves: cwnd=2, B=32KB; cwnd=2, B=64KB; cwnd=3, B=32KB; cwnd=3, B=64KB)


Fig. 4 Throughput for different numbers of servers when cwnd is 1 and the switch buffer increases from 64KB to 128KB. (a) Buffer Size=64KB; (b) Buffer Size=128KB (x-axis: The Number of Servers (n); y-axis: Throughput (Mbps); curves: SRU=32KB, SRU=64KB, SRU=128KB, SRU=256KB)

5.2 Slowing down and dynamically adjusting the congestion control window growth rate

TCP’s Slow Start algorithm is not slow at all, for it dou-bles the cwnd every RTT. Table 2 verifies the aggressive-ness of Reno in data center networks by computing theSlow Start time. Besides, Eqs. 5 and 6 imply that slowingdown the cwnd growth rate can significantly increase thenumber of the concurrent TCP servers. Therefore, IDTCPslows down and dynamically adjusts the cwnd growth rateaccording the congestion level which is estimated by α asfollows.

$$W(t) = (1 + m)^{t/RTT}, \quad 0 \le \alpha \le 1 \quad (10)$$

where m is the cwnd growth rate of IDTCP, and α is the normalized queuing delay measured from the link. Compared with standard Reno as shown in Eq. 6, the Slow Start and Congestion Avoidance mechanisms of Reno are integrated into one mechanism in IDTCP. The cwnd growth in IDTCP is a discrete exponential increase per RTT whose base is dynamically adjusted according to the congestion level of the link. The specific relationship between m and α is shown in Fig. 5.

Figure 5 shows that the cwnd of IDTCP starts up with the Slow Start mechanism when there is little congestion (α ≤ 0.1) in the link. As the congestion level increases, the cwnd growth rate of IDTCP decreases accordingly, which allows as many concurrent servers as possible to join the network.
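A minimal sketch of the resulting window update is given below, reflecting our reading of Eq. 10, Fig. 5 and Section 5.3 rather than the authors' code. The exact m(α) curve is defined only by Fig. 5; the linear fall-off from 1 down to 0.1 used here is purely an assumption for illustration.

    def growth_rate(alpha):
        """cwnd growth rate m as a function of congestion level alpha.
        Assumed shape: m = 1 (doubling, as in Slow Start) while alpha <= 0.1,
        then falling linearly towards 0.1 as the link approaches full congestion
        (the exact curve is given by Fig. 5)."""
        if alpha <= 0.1:
            return 1.0
        return max(0.1, (1.0 - alpha) / 0.9)

    def idtcp_update(cwnd, alpha):
        """Per-RTT IDTCP window update: W(t + RTT) = W(t) * (1 + m)  (Eq. 10),
        with the window reset to 1 when the link is totally congested (Section 5.3)."""
        if alpha >= 1.0:
            return 1.0                     # totally congested: fall back to cwnd = 1
        return cwnd * (1.0 + growth_rate(alpha))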

5.3 Setting the cwnd to 1 if the link is totally congested

From Eq. 5 and the simulation results in Section 4.1, we can see that setting the cwnd to 1 maximizes the number of concurrent servers. The shortcoming of setting the cwnd to 1 is that when the number of concurrent servers is small (e.g., fewer than 8 in the typical DCN with an RTT of 0.1 ms and a bottleneck bandwidth of 1 Gbps), the total throughput under such a cwnd cannot fully utilize the bandwidth.


Fig. 5 The relationship between α and m (x-axis: the value of α, 0.1–1; y-axis: the value of m, 0.1–1)

That is, when the total cwnd $\sum_{i=1}^{n} cwnd(t)_n$ is 8 (8 servers, each with a cwnd of 1), the throughput calculated by Eq. 8 is $TP = \frac{1460\,Byte \times 8}{0.1\,ms} = 934.4\,Mbps$, which is less than the bandwidth of 1 Gbps. On the other hand, if the number of concurrent servers is more than 8, the 1 Gbps bandwidth can be fully utilized even when the cwnd of each server is 1. There are many approaches that could be adopted to obtain the number of concurrent servers, but for simplicity, IDTCP uses the congestion level of the link to decide when the cwnd is set to 1. Specifically, if the queuing delay equals the propagation delay, which means that RTT_avg is 2 times RTT_base and α = 1, we consider the link totally congested and set the cwnd to 1.
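The 934.4 Mbps figure can be checked directly from Eq. 8; the short sketch below (ours) also confirms that from 9 concurrent servers onward the aggregate demand of cwnd = 1 flows already exceeds the 1 Gbps link.

    MSS_BYTES, RTT_S = 1460, 0.1e-3

    def aggregate_demand_bps(n_servers, cwnd=1):
        """Aggregate sending rate of n servers, each pacing cwnd packets per RTT (Eq. 8)."""
        return n_servers * cwnd * MSS_BYTES * 8 / RTT_S

    print(aggregate_demand_bps(8) / 1e6)          # 934.4 Mbps, below the 1 Gbps link
    print(aggregate_demand_bps(9) / 1e6 > 1000)   # True: 9 servers already fill the pipe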

Fig. 6 Throughput for different numbers of servers when SRU is 64KB. (a) SRU=64KB, Buffer Size=32KB; (b) SRU=64KB, Buffer Size=64KB (x-axis: The Number of Servers (n); y-axis: Throughput (Mbps); curves: IDTCP, Reno)


Fig. 7 Throughput for different numbers of servers when SRU is 128KB. (a) SRU=128KB, Buffer Size=128KB; (b) SRU=128KB, Buffer Size=256KB (x-axis: The Number of Servers (n); y-axis: Throughput (Mbps); curves: IDTCP, Reno)


6 Performance evaluation

In this section, we use the model developed by Phanishayee et al. (2008), described in Section 4, to verify the performance of IDTCP. The performance metric is the throughput of the bottleneck link, defined as the total number of bytes received by the receiver divided by the completion time of the last sender. We explore the throughput by varying parameters such as the number of servers, the switch buffer size and the server request unit size.
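For clarity, the throughput metric can be written down explicitly. The helper below is our own illustration of the definition above (total bytes received divided by the completion time of the last sender); the sender byte counts and finish times in the example are hypothetical.

    def goodput_mbps(bytes_per_sender, finish_times_s):
        """Throughput metric used in the evaluation: total bytes received by the
        client divided by the completion time of the last (slowest) sender."""
        total_bits = 8 * sum(bytes_per_sender)
        return total_bits / max(finish_times_s) / 1e6

    # e.g. 10 senders delivering 64KB each, the last one finishing after 6 ms:
    print(goodput_mbps([64 * 1024] * 10, [0.005] * 9 + [0.006]))   # about 873.8 Mbps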

Fig. 8 Throughput for different numbers of servers when the SRU varies (x-axis: The Number of Servers (n), 0–50; y-axis: Throughput (Mbps), 0–1000; curves: SRU=32KB, SRU=64KB, SRU=128KB, SRU=256KB)



Figure 6 shows the throughput for different numbers of servers when the SRU is 64KB. When the switch buffer size is 32KB, only 3 concurrent servers are enough to cause Reno to collapse. Our approach, IDTCP, increases the number of concurrent servers to 25 under such a small buffer size. With the increase of the switch buffer size, the number of concurrent servers increases accordingly, and the increase for IDTCP is much more prominent than for Reno. When the buffer size is 64KB, the number of concurrent servers supported by IDTCP reaches 47 without throughput collapse.

Figure 7 shows the throughput for different numbers of servers when the SRU is fixed to 128KB. The increase of the buffer size greatly expands the number of concurrent servers for IDTCP. When the switch buffer size rises to 128KB, the number of concurrent servers of IDTCP reaches nearly 100 when throughput collapse occurs, while Reno still supports no more than 10. If the buffer size is further expanded to 256KB, the number of concurrent servers of IDTCP increases to 175 accordingly, while Reno supports fewer than 20.

We further evaluate the performance of IDTCP when the switch buffer size is fixed but the SRU varies from 32KB to 256KB, as shown in Fig. 8. It is clear that before the network becomes congested, the larger the SRU, the greater the throughput, which differs from the results shown in Fig. 4. However, as the number of concurrent servers increases, the network inevitably becomes congested. Once the network is totally congested, the cwnd of IDTCP is set to 1, and the performance of IDTCP is then close to the extreme case shown in Fig. 4.

7 Conclusion

In this paper, we discuss the TCP incast problem in data centers by analyzing the relationship between TCP throughput and the cwnd size of TCP. Our analysis explores the essence of the current methods adopted to smooth the TCP incast problem. We verify the rationality and accuracy of our analysis by simulations. The analysis and the simulation results provide many valuable implications for the TCP incast problem. Based on these implications, we further propose a new congestion control algorithm named IDTCP to mitigate the TCP incast problem. Simulation results verify that our algorithm greatly increases the number of concurrent servers and effectively mitigates TCP incast in data center networks. More detailed theoretical modeling, as well as more extensive evaluation in real networks, is the subject of our ongoing work.

Acknowledgments This work is an extension of the conference paper entitled 'The Effect of the Congestion Control Window Size on the TCP Incast and its Implications', which was presented at IEEE ISCC 2013. This work is partially supported by the foundation funding of the Internet Research Lab of the Computer Network Information Center, Chinese Academy of Sciences, the President Funding of the Computer Network Information Center, Chinese Academy of Sciences under Grant No. CNIC ZR 201204, and the Knowledge Innovation Program of the Chinese Academy of Sciences under Grant No. CNIC QN 1303.

References

Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., Sridharan, M. (2010). Data center TCP (DCTCP). In ACM SIGCOMM Computer Communication Review (Vol. 40, No. 4, pp. 63–74). ACM.

Brakmo, L., & Peterson, L. (1995). TCP Vegas: end to end congestion avoidance on a global Internet. IEEE Journal on Selected Areas in Communications, 13(8), 1465–1480.

Chen, Y., Griffith, R., Liu, J., Katz, R., Joseph, A. (2009). Understanding TCP incast throughput collapse in datacenter networks. In Proceedings of the 1st ACM Workshop on Research on Enterprise Networking (pp. 73–82). ACM.

Fall, K., & Vardhan, K. (2007). The Network Simulator (ns-2). Available: http://www.isi.edu/nsnam/ns.

Gibson, G., Nagle, D., Amiri, K., Butler, J., Chang, F., Gobioff, H., Hardin, C., Riedel, E., Rochberg, D., Zelenka, J. (1998). A cost-effective, high-bandwidth storage architecture. In ACM SIGOPS Operating Systems Review (Vol. 32, No. 5, pp. 92–103). ACM.

Katabi, D., Handley, M., Rohrs, C. (2002). Congestion control for high bandwidth-delay product networks. In ACM SIGCOMM Computer Communication Review (Vol. 32, No. 4, pp. 89–102). ACM.

Phanishayee, A., Krevat, E., Vasudevan, V., Andersen, D., Ganger, G., Gibson, G., Seshan, S. (2008). Measurement and analysis of TCP throughput collapse in cluster-based storage systems. In Proceedings of the 6th USENIX Conference on File and Storage Technologies.

Podlesny, M., & Williamson, C. (2012). Solving the TCP-incast problem with application-level scheduling. In IEEE 20th International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2012 (pp. 99–106). IEEE.

Vasudevan, V., Phanishayee, A., Shah, H., Krevat, E., Andersen, D., Ganger, G., Gibson, G., Mueller, B. (2009). Safe and effective fine-grained TCP retransmissions for datacenter communication. In ACM SIGCOMM Computer Communication Review (Vol. 39, No. 4, pp. 303–314). ACM.

Wu, H., Feng, Z., Guo, C., Zhang, Y. (2010). ICTCP: incast congestion control for TCP in data center networks. In Proceedings of the 6th International Conference (p. 13). ACM.

Zhang, J., Ren, F., Lin, C. (2011a). Modeling and understanding TCP incast in data center networks. In Proceedings of IEEE INFOCOM, 2011 (pp. 1377–1385). IEEE.

Zhang, P., Wang, H., Cheng, S. (2011b). Shrinking MTU to mitigate TCP incast throughput collapse in data center networks. In Communications and Mobile Computing (CMC), 2011 Third International Conference (pp. 126–129). IEEE.


Guodong Wang received the Ph.D. degree from the University of Chinese Academy of Sciences in 2013. Dr. Wang is currently with the Computer Network Information Center, Chinese Academy of Sciences, China. His research focuses on Future Networks and Protocols, Fast Long Distance Networks, and Data Center Networks.

Yongmao Ren is an associate professor at the Computer Network Information Center, Chinese Academy of Sciences. He received his B.Sc. in Computer Science and Technology in 2004 at the University of Science and Technology of Beijing. He obtained the Ph.D. in Computer Software and Theory from the Graduate University of the Chinese Academy of Sciences in 2009. His main research interests are congestion control and future Internet architecture.

Ke Dou received the M.Sc. from the University of Chinese Academy of Sciences in 2013. He is currently pursuing the Ph.D. degree at the University of Virginia. His research focuses on Data Center Networks.

Jun Li is currently a professor, Ph.D. supervisor and vice chief engineer of the Computer Network Information Center, Chinese Academy of Sciences (CAS). He earned his M.S. degree and Ph.D. degree from the Institute of Computing Technology, CAS, in 1992 and 2006, respectively. Before that, he received his B.S. degree from Hunan University in 1989. He has been working on research and engineering in the field of computer networks for over 20 years and has many achievements in the areas of Internet routing, architecture, protocols and security. He developed the first router in China and won national technological progress awards. He has been PI for many large research projects, such as an 863 program project. He has published over 50 peer-reviewed papers and one book. His current research interests include network architecture, protocols and security.
