Architectural Characterization of Processor Affinity in Network Processing


Annie Foong, Jason Fung, Don Newell, Seth Abraham, Peggy Irelan, Alex Lopez-Estrada Intel Corporation

[email protected]

Abstract

Network protocol stacks, in particular TCP/IP software implementations, are known for their inability to scale well in general-purpose monolithic operating systems (OS) for SMP. Previous researchers have experimented with affinitizing processes/threads, as well as interrupts from devices, to specific processors in an SMP system. However, general-purpose operating systems give minimal consideration to user-defined affinity in their schedulers. Our goal is to expose the full potential of affinity through an in-depth characterization of the reasons behind the performance gains. We conducted an experimental study of TCP performance under various affinity modes on IA-based servers. Results showed that interrupt affinity alone provided a throughput gain of up to 25%, and that combined thread/process and interrupt affinity can achieve gains of 30%. In particular, calling out the impact of affinity on machine clears (in addition to cache misses) is a characterization that has not been done before.

1. Introduction

The arrival of 10 Gigabit Ethernet (GbE) allows a standardized physical fabric to handle the tremendous speeds previously attributed only to proprietary networks. Though aimed primarily at the traffic loads seen in data centers and storage area networks (SANs), the concept of a converged fabric across WANs, LANs and SANs is appealing. However, supporting multiple gigabits per second of TCP traffic can quickly saturate the capabilities of an SMP server today. At the platform level, integration of the memory controller onto the CPU die will effectively scale memory bandwidth with processing power. Next-generation buses, such as PCI Express, will potentially deliver 64Gbps of bandwidth. While platform improvements will continue to address bus and memory bottlenecks, a system bottleneck still exists in terms of a processor's capacity to process TCP at these

high speeds. Adding more processors to a system, by itself, does not address the problem – TCP/IP software implementations are known for their inability to scale well in general-purpose SMP operating systems (OS) [10][16]. However, next generation chip multiprocessors (CMP) will bring multiple cores to each CPU [8], making SMP scaling a major operating systems design issue.

Previous work had shown potential performance improvements by careful affinity of processes/threads to processors in a SMP system [13][21]. However, current general purpose operating systems support only static affinity, and have minimal consideration of user-defined affinity in their schedulers. The ultimate goal of this work is to make a case for generic OS schedulers to provide mechanisms that account for user-directed affinity. Before that can be done, we must first expose the full potential of affinity by an in-depth characterization of the reasons behind performance gains.

We provide the background necessary to understand the motivation for our work and the problem statement in Sections 2 and 3. In Section 4, we give implementation details and the tools used for the analysis. In Section 5, we provide overall performance data for all the possible affinity modes, so as to determine the data points worthy of further study. We focus on these data points in Section 6, where we go in-depth to analyze our results in the context of a reference TCP stack. As we proceed through the different stages of analysis, we gradually home in on the events and metrics that matter. Where pertinent, we also call out places where affinity does not make a difference. We conclude by discussing related and future work.

2. Background

The major overheads of TCP are well studied [3][5][9]. A seminal paper by [4] showed that the number of instructions for TCP protocol processing itself is minimal. The non-scalability of TCP stems from the fact that it requires substantial support from

the operating system (e.g. buffer management, timers, etc) and incurs substantial memory overheads. These include memory accesses for data movement (the copy-based BSD sockets programming model can incur up to three accesses to memory per request). We refer interested readers to [5] where we describe an implementation of the TCP fast paths as typified by Linux.

Network adapter (NIC) manufacturers' efforts to offload functionality from the processor have resulted in real but incremental improvements. They range from checksum and segmentation offloads [3] to complete offload of the TCP stack to hardware [1]. In the case of full-fledged TCP offload engines (TOEs), the industry's success has been elusive at best. TCP has a far more complex state machine than most other transports. Unlike some newer protocols (e.g. Fibre Channel and InfiniBand), which were designed specifically for hardware implementation from the ground up, TCP began as a software stack. Corner cases abound that are not so easily addressed if the solutions are hardwired. Finally, the most commonly overlooked overheads are those incurred by scheduling and interrupts [9]. Though not cost-intensive operations by themselves, these have an indirect, intrusive effect on cache and pipeline effectiveness. The intrusions into a TCP stack come in the form of applications issuing requests from above and interrupts coming from devices as data arrives or leaves.

The impact of intrusions can be substantial in general-purpose SMP OSes. These OSes are designed to run a wide variety of applications. As such, their schedulers always attempt to load-balance, moving processes from processors with heavier loads to those with lighter loads. However, process migration is not free. The migrated process has to pay the price of warming the various levels of data cache, instruction cache and translation-lookaside buffers on the processor it has just migrated to. On the other hand, generic OSes do not attempt to balance interrupts across processors. In their default SMP configurations, both Windows NT and Linux send device interrupts to CPU0. Under high loads, CPU0 saturates before the other processors in the system. Previous attempts by OSes to redistribute interrupts in a random or round-robin fashion gave rise to bad side effects [1]. Interrupt handlers ended up being executed on random processors and created more contention for shared resources. Furthermore, cache contents (e.g. those holding TCP contexts) are not reused optimally, as packets from devices are sent to a different processor on every interrupt.

Since the scheduler prioritizes balancing over process-to-interrupt affinity, going to more processors

will only exacerbate the scalability problem. Finally, the TCP/IP stack differs from most applications in that it is completely event-driven. Applications issue requests from the top, independent of data arriving from the network. Data can arrive/leave inside OR outside of the requesting process's context. This creates interesting affinity problems and possibilities.

3. Problem Statement

While it is always possible to design different scheduling algorithms that can be effective under differing workloads [15], the OS remains oblivious to application needs. We propose that an application knows its own workload best and is in a better position to "place" itself than the OS scheduler is. However, leveraging such an optimization can be difficult in general-purpose OSes. A first step in validating this hypothesis is to characterize the impact of user-directed affinity as it is supported by today's OSes. Our research questions and methodology in this study thus become:

1. What is the baseline profile and characterization of TCP processing? To answer this, we break down the TCP stack into logical, functional bins. Separation at the procedure-call level (> 300 procedures) would render any analysis useless, while examining only the top few functions [1] provides only a partial view. Instead, we have carefully examined all the code of a reference TCP stack (Linux-2.4.20) and separated the procedures into basic blocks of TCP functionality. We provide a full baseline characterization of TCP processing in two affinity modes; this forms the foundation for the comparative study.

2. Where exactly do affinity improvements happen, and by how much? To quantify this, we performed a series of speedup analyses using Amdahl's Law [6]. We compare processing times (and other events) in the no-affinity mode against those in the full-affinity mode, and derive speedups and improvements accordingly.

3. What (subset of) architectural events are responsible for the performance improvements? While previous researchers have attributed performance improvement to better cache locality, there has not been an extensive attempt to fully expose the architectural reasons behind the improvements. We monitor the counts of various events, including last-level cache misses. It must be noted that we did not exhaustively look at all possible events. Rather, we focused on the usual performance culprits (e.g. cache misses, branch mispredictions, TLB misses, etc.). By using expected costs for event penalties seen in these processors, we are able to obtain a first-order approximation of the primary performance-affecting events.

Experiment-based characterizations of entire application runs [14] and analytic models of protocol stacks [20] have given us an overall understanding of networking requirements. In addition to this architectural understanding, we also hope to bring a systems-software perspective to this study. By abstracting TCP processing to a level where analysis gives useful insights, we can quantify and directly relate architectural events to the software implementation. For example, while it is important to know that the overall CPI of transmit processing of 64KB requests is about 5, it is extremely useful to further realize that data copy routines incur CPIs of 4, while interface routines incur CPIs of 17. Such a view allowed us to (i) provide a solid understanding of TCP processing in different affinity modes; (ii) showcase exactly where affinity brings benefits to TCP processing; and (iii) relate these benefits to improvements seen in various architectural events.

4. Experimental Setup

                  System under Test (SUT)                        Client
Processors        Intel 2GHz P4 Xeon MP × 2                      Intel 3.06GHz P4 Xeon × 2
Cache             512KB L2, 2MB L3                               512KB L2
FSB Freq          400 MHz                                        533 MHz
Memory            DDR 200MHz Registered ECC, 256MB/channel × 4   DDR 266MHz Registered ECC, 2GB/channel × 2
Board (Chipset)   Shasta-G (ServerWorks GC-HE)                   Westville (Intel E7501)
PCI Bus           64-bit PCI-X 66/100MHz                         64-bit PCI-X 66/100MHz
NIC               Dual-port Intel PRO/1000 MT × 4                Dual-port Intel PRO/1000 MT × 1

Figure 1 System configurations and Cluster Setup

Figure 1 summarizes the configuration of the system under test (SUT) and clients, and the setup of our tiny cluster. We used the ttcp micro-benchmark to exercise bulk data transmits (TX) and receives (RX) between two nodes. A connection is set up once between two nodes. Data is sent from the transmitter(s) to the receiver(s), reusing the same buffer space for all iterations. The ttcp workload primarily characterizes bulk data transfer behavior, and must be understood in that context. We chose this simple workload because it exercises the typical and optimal TCP code path, and allows us to focus on understanding the network stack without application-related distractions. Moreover, our focus is on quantifying the differences that affinity brings in an ideal scheduling case, where load-balancing quirks do not come into play.

This workload projects directly to real workloads that are based on long-lived connections and bulk data (e.g. iSCSI and other network storage). ttcp's caching behavior is also representative of real web or file servers that serve static file content to/from the network (no touching of payload data). Web characterization studies [24] showed that although 50% of web requests may be dynamic in nature, they result in 30-60% quasi-static "templates" that can be reused. More importantly, we can partition any general workload into "network fast paths", "network connection setup/teardown" and "application processing", as exemplified in [14]. The study of affinity benefits done here projects directly to the portions involving network fast paths.

To study the various affinity modes, we used the mechanisms available through Red Hat's patched version of the Linux-2.4.20 kernel (since officially folded into the mainline Linux-2.6 kernel [11]). These mechanisms allow processes, threads and interrupts to be statically bound to processors. In our tests, one connection (unique IP address) is owned by one instance of ttcp and serviced by one physical NIC. There are a total of 8 GbE NICs, 8 connections and 8 ttcp processes running on our SUT. We compare 4 modes of affinity: (i) no affinity (no aff); (ii) interrupt-only affinity (IRQ aff), e.g. interrupts from NICs 1-4 are directed to CPU0; (iii) process-only affinity (proc aff), e.g. ttcp processes 1-4 are bound to CPU0; and (iv) full affinity (full aff). Full affinity is the case where a ttcp process is affinitized to the same processor as the interrupts coming from the NIC it is assigned to (Figure 2). We modified ttcp to use sys_sched_setaffinity() to set process affinity [12]. We statically redirect interrupts from a device to a specific processor by setting a bit


mask in smp_affinity under Linux's /proc filesystem.

Figure 2 Two possible permutations of interrupt and process affinity
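For concreteness, the following is a minimal sketch (ours, not the paper's actual test harness) of the two mechanisms just described: binding the calling process to a CPU with sched_setaffinity(), and steering a device's interrupts by writing a CPU bitmask to its smp_affinity entry under /proc. The IRQ number and CPU choice below are purely illustrative.

/* Minimal sketch of the affinity mechanisms described in Section 4.
 * Assumption: the IRQ number (26) and CPU ids are illustrative only. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void pin_process_to_cpu(int cpu)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        exit(1);
    }
}

static void pin_irq_to_cpu(int irq, int cpu)
{
    char path[64];
    FILE *f;
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (f == NULL) {
        perror(path);
        exit(1);
    }
    /* smp_affinity takes a hexadecimal CPU bitmask: 1 = CPU0, 2 = CPU1, ... */
    fprintf(f, "%x\n", 1u << cpu);
    fclose(f);
}

int main(void)
{
    pin_process_to_cpu(0);  /* full-affinity mode: this process on CPU0 ... */
    pin_irq_to_cpu(26, 0);  /* ... and its NIC's interrupts on CPU0 as well */
    /* a ttcp-style send/receive loop would follow here */
    return 0;
}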

To get processing-distribution insights for our in-depth analysis, we used Oprofile-0.7 [18] as our measurement tool. Oprofile is a low-overhead, system-wide profiler based on event sampling. It allows us to determine the number of events that occurred in any function during a run. The events are those supported by the processor's hardware event counter registers [7]. For example, if the event of interest is set to cycles, we can determine the time spent in a function; if the event is last-level cache (LLC) misses, we can determine how many times a data access resulted in a memory access. It must be noted that Oprofile is based on statistical sampling; it is not an exact accounting tool. When a profile is taken over a long run, it gives a fairly accurate distribution of where events lie. For the profiling to capture all cycles, we further ensured that the processors poll on idle, instead of using the default power-saving mode.
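As a small illustration of the post-processing this enables (our sketch, not the authors' tooling), per-bin metrics such as CPI and MPI follow directly from sampled event counts once procedures have been grouped into functional bins. All counts below are invented.

/* Sketch: derive per-bin CPI and MPI from sampled event counts.
 * The bin names match Section 6.1; the counts are hypothetical. */
#include <stdio.h>

struct bin_counts {
    const char *name;
    double cycles;        /* sampled clock cycles attributed to the bin */
    double instructions;  /* sampled instructions retired in the bin    */
    double llc_misses;    /* sampled last-level cache misses in the bin */
};

int main(void)
{
    struct bin_counts bins[] = {
        { "Engine",   2.3e9, 4.6e8, 3.2e6 },
        { "Buf Mgmt", 2.5e9, 4.2e8, 2.7e6 },
        { "Copies",   2.4e9, 6.1e8, 6.5e6 },
    };
    for (unsigned i = 0; i < sizeof(bins) / sizeof(bins[0]); i++) {
        double cpi = bins[i].cycles / bins[i].instructions;
        double mpi = bins[i].llc_misses / bins[i].instructions;
        printf("%-8s CPI = %.2f  MPI = %.4f\n", bins[i].name, cpi, mpi);
    }
    return 0;
}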

5. Overview of Performance

In this section, we present a performance overview of the various combinations of process and interrupt affinity. Figure 3 shows the TCP performance comparison of the four affinity modes. The bars show CPU utilization (almost fully utilized in all cases), while throughput is represented by lines. We see that process affinity alone has little impact on throughput. Under this mode, CPU0 not only has to service all interrupts, but also at least 4 ttcp processes. Any affinity benefits are negated by the more pronounced load imbalance. On the other hand, interrupt affinity alone can improve throughput by as much as 25%. This behavior is a result of the scheduling algorithm. To

reduce cache interference, the scheduler tries as much as possible to schedule a process onto the same processor on which it was previously running. By the same token, "bottom halves/tasklets" (i.e. tasks scheduled to run at a later time) of interrupt handlers are usually scheduled on the same processor where their corresponding "top halves" had previously run. As a result, interrupt affinity indirectly leads to process affinity as well. Of course, there is no guarantee, and interrupt and process contexts can still end up on different CPUs. The best improvement in throughput (up to 29%) is therefore achieved with full affinity. We also ran similar tests on 4P systems (not shown here) and observed an even larger improvement brought on by affinity. However, this has more to do with workload imbalance than with the intrinsic impact of affinity. Without affinity, the bottleneck that CPU0 imposes on a 4P system becomes even more pronounced: CPU0 is fully saturated with interrupt processing, even though there are idle cycles available on the other processors. Given these caveats, further analysis is done only on 2P systems.

A more illuminating view is to normalize processor cycles by the work done - GHz/Gbps (i.e. cycles per bit transferred). This "cost" metric allows us to account for both CPU and throughput improvements at the same time (Figure 4). To better interpret these charts, we look at the cost of a 64KB transmit. It is about 1.9 in the no affinity case, and is reduced to 1.4 in the full affinity case, a reduction of about 25%. Affinity has a bigger impact on large transfers, and we will explain why in the next section.
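As a hedged illustration (ours, not the paper's), the sketch below computes this cost metric under the reading that it is simply consumed CPU cycles per second divided by delivered throughput; the inputs are invented, chosen only to land near the ~1.9 GHz/Gbps quoted above for the no-affinity 64KB transmit.

/* Sketch of the GHz/Gbps cost metric: consumed CPU GHz per Gbps moved.
 * Sample inputs are illustrative, not measured data from the paper. */
#include <stdio.h>

static double cost_ghz_per_gbps(double util_fraction, int num_cpus,
                                double clock_ghz, double throughput_gbps)
{
    /* consumed GHz = utilization x number of CPUs x clock frequency */
    return (util_fraction * num_cpus * clock_ghz) / throughput_gbps;
}

int main(void)
{
    /* e.g. two 2 GHz CPUs ~95% busy while sustaining ~2 Gbps of TX traffic */
    printf("cost = %.2f GHz/Gbps\n", cost_ghz_per_gbps(0.95, 2, 2.0, 2.0));
    return 0;
}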

6. Detailed Analysis and Methodology

In this section, we do a detailed analysis of the extreme data points identified in the previous section, i.e. receives and transmits of 128B and 64KB, under the no affinity and full affinity modes. The behavior of other data points will fall somewhere in between these extremes. We begin with a baseline analysis of the two affinity modes and show pertinent metrics that characterize the stack. We then extract performance impact indicators based on the Pentium 4's expected event penalties [23]. Once these are identified, we conclude with a comparative study that evaluates the impact of affinity in terms of these events.

[Diagram panels: "No Affinity - Interrupts default to CPU0, OS-based scheduling" and "Full Affinity - Each interrupt and process mapped to a specific CPU", showing the eight connections and their TCP processing distributed across CPU0 and CPU1.]

Figure 3 TCP CPU utilization and throughput

Figure 4 TCP processing costs

6.1. Baseline TCP Characterization

Table 1 shows a comprehensive characterization of

the stack. We have separated the compute-intensive parts of TCP protocol processing (Engine), i.e. the cranking of the state machine, from the memory-intensive parts of TCP processing (Buffer mgmt). Buffer mgmt includes memory and buffer management routines, the manipulation of TCP control structures, etc. Copies cover the movement of payload data only. This allows us to highlight the impact of the copy semantics imposed by BSD-based synchronous sockets [5]. Data copies are always uncached on the receive side, since packets arrive in memory via device DMA. Whether or not data is cached on the send side depends on the application's caching behavior. In our experiments, we set ttcp to serve data directly from cache. This reflects how modern server applications are designed; for example, in-kernel web servers

(IIS, TUX) serve data out of the buffer cache. A

full implementation of the sockets interface includes not only the obvious BSD sockets API (both in kernel and user), but also the Linux system call (sys_call routine), and schedule-related routines. This is how an application causes a socket action to be executed from the user level all the way down to the TCP stack. We put all these functions into the interface bin. Driver includes both the NIC driver routines, and NIC interrupt processing. Locks include all synchronization-related routines. Timers refer to all of the timer routines that TCP uses.

A few baseline observations are worth calling out and we shall reserve comparisons to later sections. For 64KB transfers, the top timing hotspots include the TCP engine, buffer mgmt and copies. For 128B transfers, the hotspots are the sockets interface and the TCP engine.

[Figure 3 charts: "TX Bandwidth vs CPU Utilization" and "RX Bandwidth vs CPU Utilization" - CPU utilization (%) and bandwidth (Mb/s) vs. transaction size (128B to 64KB) for No Aff, Proc Aff, IRQ Aff and Full Aff.]

[Figure 4 charts: "Tx Cost in GHz/Gbps" and "Rx Cost in GHz/Gbps" vs. transaction size (128B to 64KB) for the same four affinity modes.]

Table 1 Baseline Characterization

CPI: cycles per instruction
MPI: last-level cache misses per instruction
% Branches: number of branches / number of instructions
% Br Mispredicted: number of branches mispredicted / number of branches

TX 64KB      % Cycles         CPI              MPI               % Branches        % Br mispredicted
             No Aff  Full Aff No Aff  Full Aff No Aff   Full Aff No Aff   Full Aff No Aff  Full Aff
Interface    6.0%    5.0%     17.62   11.27    0.0212   0.0063   20.06%   20.41%   6.66%   8.90%
Engine       25.5%   21.8%    5.01    3.41     0.0070   0.0016   16.96%   16.40%   1.83%   2.24%
Buf Mgmt     28.0%   20.3%    5.93    4.06     0.0065   0.0007   16.92%   16.49%   1.07%   0.53%
Copies       27.1%   37.1%    3.93    4.12     0.0106   0.0095   2.20%    2.24%    0.37%   0.39%
Driver       10.4%   12.2%    6.06    5.35     0.0049   0.0030   14.93%   14.68%   1.37%   1.57%
Locks        0.6%    0.0%     14.65   16.49    0.0025   0.0040   24.80%   20.09%   0.78%   31.73%
Timers       2.0%    3.0%     4.07    7.10     0.0029   0.0116   9.99%    10.96%   0.15%   0.27%
Overall      99.7%   99.5%    5.04    4.14     0.0078   0.0047   11.53%   10.76%   1.41%   1.43%

TX 128B      % Cycles         CPI              MPI               % Branches        % Br mispredicted
             No Aff  Full Aff No Aff  Full Aff No Aff   Full Aff No Aff   Full Aff No Aff  Full Aff
Interface    42.4%   46.0%    8.68    8.73     0.0034   0.0037   17.77%   18.61%   0.20%   0.19%
Engine       29.0%   28.8%    3.38    3.05     0.0020   0.0009   17.98%   18.46%   0.59%   0.54%
Buf Mgmt     11.6%   8.2%     4.44    2.99     0.0046   0.0001   16.59%   16.08%   1.33%   0.81%
Copies       5.9%    6.9%     1.62    1.60     0.0082   0.0079   5.05%    5.31%    3.31%   3.33%
Driver       4.4%    6.0%     5.73    4.38     0.0065   0.0025   14.94%   14.16%   3.21%   3.08%
Locks        3.8%    1.0%     14.96   20.06    0.0030   0.0099   28.77%   32.08%   1.07%   4.54%
Timers       1.5%    2.2%     2.58    3.15     0.0016   0.0042   15.69%   14.49%   0.15%   0.15%
Overall      98.8%   99.1%    4.56    4.11     0.0038   0.0028   15.80%   15.97%   0.90%   0.81%

RX 64KB      % Cycles         CPI              MPI               % Branches        % Br mispredicted
             No Aff  Full Aff No Aff  Full Aff No Aff   Full Aff No Aff   Full Aff No Aff  Full Aff
Interface    3.0%    7.5%     15.44   8.90     0.0195   0.0023   22.46%   37.66%   6.69%   6.62%
Engine       22.8%   22.7%    4.70    3.72     0.0046   0.0016   16.98%   17.92%   0.75%   0.52%
Buf Mgmt     11.2%   20.4%    6.57    4.04     0.0106   0.0039   15.95%   17.34%   1.43%   0.83%
Copies       40.3%   32.1%    66.34   58.03    0.1329   0.1100   11.97%   11.05%   0.65%   0.83%
Driver       11.0%   7.2%     6.89    5.69     0.0108   0.0051   12.63%   13.44%   3.04%   3.68%
Locks        0.3%    1.3%     15.16   22.78    0.0054   0.0222   35.20%   29.93%   1.91%   17.62%
Timers       11.3%   8.2%     5.85    7.35     0.0097   0.0146   9.60%    10.42%   0.19%   0.21%
Overall      99.9%   99.4%    8.49    7.53     0.0133   0.0101   15.28%   16.13%   1.37%   1.20%

RX 128B      % Cycles         CPI              MPI               % Branches        % Br mispredicted
             No Aff  Full Aff No Aff  Full Aff No Aff   Full Aff No Aff   Full Aff No Aff  Full Aff
Interface    41.5%   46%      8.49    8.66     0.0032   0.0036   19.76%   20.74%   0.22%   0.21%
Engine       23.7%   21%      3.38    2.72     0.0021   0.0005   15.21%   15.59%   0.98%   0.92%
Buf Mgmt     10.0%   7%       2.31    1.55     0.0023   0.0002   17.25%   17.32%   0.64%   0.44%
Copies       13.8%   15%      4.99    5.14     0.0074   0.0077   9.93%    10.94%   0.02%   0.00%
Driver       5%      5%       5.64    4.44     0.0063   0.0024   12.64%   13.01%   3.58%   4.27%
Locks        2.7%    1%       17.95   23.22    0.0080   0.0103   30.14%   34.79%   2.38%   3.11%
Timers       2.2%    3%       3.04    3.17     0.0018   0.0042   14.33%   12.92%   0.08%   0.13%
Overall      99.0%   99.0%    4.66    4.23     0.0032   0.0023   16.42%   16.81%   0.68%   0.63%

Time spent in drivers is also quite substantial for

large transfers. This is expected - large transfers primarily involve data and descriptor manipulation, while small transfers primarily involve socket read/write calls. For the Pentium 4, a CPI value of 1 is considered good, and a value of 5 is considered poor [23]. We see that TCP processing, on the whole, does poorly in terms of CPI. Furthermore, TX generally has lower CPIs and MPIs than RX, an indicator that RX is more memory-bound. Very large CPIs are seen in the interface and locks bins. Given the nature of these bins - system calls and resource contention, respectively - we expected these inefficiencies. In all cases, the computation requirement of the TCP engine remains remarkably constant (~20%-30%) once cycles have been normalized to work done. Copies tend to take more time in RX than TX. Copies on RX (under Linux-2.4.x) are implemented via rep movsl (repeated string move), whereas copies on TX are implemented via a carefully crafted unrolled loop that moves data efficiently based on its alignment. This also explains the glaringly large CPI and MPI seen in RX of 64KB - one instruction is responsible for moving a large amount of data. The alignment of TX data is known beforehand, allowing for the unrolled optimization; RX copies were implemented assuming arbitrary arrival of bytes, where alignment is not guaranteed. A more optimized version of the RX copy, based on integer copies, has since appeared in Linux-2.6 [1]. In addition, timers in RX of 64KB take up substantially more time. Most of that time is in the routine do_gettimeofday(), which is used by the bottom half of the receive interrupt handler to compare the current time with the timestamp in arriving packets. There is no corresponding use of this routine on the TX path. Overall, branches make up about 10%-16% of all instructions in the TCP fast path, and the percentage of branch mispredictions is also fairly low (< 2%).

The only deviation from this norm is seen in the locks bin for large transfers in the full affinity mode. Diving deeper into the implementation of spinlocks in Linux (Table 2) and looking at the absolute numbers of branches and mispredictions reveals the reasons.

The seemingly large misprediction rate (in the full affinity case) is not due to more mispredictions, but due to fewer branches. The number of branches and instructions in the full affinity case is about 5-10% of that in the no affinity case. In the full affinity case, processes and interrupts are bound to the same processor, making for minimal spinlock contention. If a processor successfully grabs a lock, there are no jumps or branches required. The number of branches and instructions decreases accordingly. As such, when a branch misprediction does occur, it counts very heavily against the branch misprediction ratio. When contention is high, as in the no affinity case, the processor finds itself spinning in a loop, waiting for the lock to be released. On the Pentium 4, REPZ NOP translates to a PAUSE instruction, which is implemented as a no-op with a pre-defined delay, in an attempt to reduce memory ordering violations [7]. In either the REPZ NOP or the PAUSE implementation, the absolute number of branches taken in the no affinity case will be much larger than in the full affinity case.
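To make the branch-count argument concrete, here is a C-level analogue (ours, not the kernel's code) of the spin-wait pattern in Table 2: an uncontended acquire takes the straight-line path with essentially no taken branches, while a contended lock spins in a loop that issues a PAUSE hint on every iteration.

/* C-level analogue of a test-and-set spinlock with a PAUSE in the spin loop.
 * This mirrors the structure of Table 2, not the kernel's exact code. */
#include <immintrin.h>   /* _mm_pause(): the PAUSE that REP NOP decodes to */

typedef struct { volatile int locked; } spinlock_t;  /* 0 = free, 1 = held */

static void spin_lock(spinlock_t *l)
{
    for (;;) {
        /* atomic test-and-set; returns the previous value of "locked" */
        if (__sync_lock_test_and_set(&l->locked, 1) == 0)
            return;              /* got the lock: no spinning, few branches */
        while (l->locked)        /* contended: read-only spin loop ...      */
            _mm_pause();         /* ... with a PAUSE hint per iteration     */
    }
}

static void spin_unlock(spinlock_t *l)
{
    __sync_lock_release(&l->locked);   /* release: store 0 with a barrier */
}

int main(void)
{
    spinlock_t lock = { 0 };
    spin_lock(&lock);    /* uncontended: acquired immediately */
    spin_unlock(&lock);
    return 0;
}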

6.2. Performance Impact Indicators

As seen in the previous section, there is no limit to the depth we can go to find evidence in support of our observations. While it is tempting to dive deeper, we wanted to make sure that we were studying architectural events that have a substantial impact on performance. To this end, we follow the tuning advice and recommendations found in the manual for VTune 7.1 [23], a profiling tool written specifically for Intel architecture (IA) based processors.

The events we have chosen to study include machine clears (i.e. instruction pipeline flushes), trace cache (TC) misses, L2 cache misses, LLC misses, the number of page walks due to instruction TLB (ITLB)

Table 2 Spinlocks (Linux) implementation

Address                     Instructions                    Comments
c02bd319:                   lock decb 0x2c(%ebx)            atomic decrement of "lock" (lock = 1 in the unlocked state)
                            js c02c2c0e <.text.lock.tcp>    if already held by another processor, jump to .text.lock.tcp
                            ...                             successfully grabbed the lock; continue on the caller's original path
c02c2c0e <.text.lock.tcp>:  cmpb $0x0,0x2c(%ebx)            spinloop: check if the "lock" value is 0
                            repz nop                        spinloop: translates to a PAUSE
                            jle c02c2c0e <.text.lock.tcp>   spinloop: if lock <= 0, i.e. another processor still owns the lock, jump back to .text.lock.tcp
                            jmp c02bd319                    lock is free; start from the beginning to try to grab the lock

misses and data TLB (DTLB) misses, and branch mispredictions (Br Mispredict). The numbers in Figure 5 show the percentage of overall time that is attributed to a particular event for an entire run. We derive these numbers with the following general formula:

% Time attributed to event = (# of occurrences of event × cost of event) / (total cycles in run)

It must be noted that the cost (penalty) shown in

Figure 5 represents a reasonable average and is by no means exact. The cost of a machine clear, in particular, can vary widely, since it is highly dependent on the state of the pipeline (i.e. workload dependent) when the clear occurs. Moreover, in any processor with a deep, out-of-order pipeline, the costs are not linearly additive. However, this method of using performance impact indicators does provide us with a useful first-order approximation of where most of the time is spent and which events we should focus on. As an academic exercise, we have also used the Pentium 4's theoretical maximum of 3 retired instructions per cycle (last row) to obtain a lower bound on what the % of processing time might be. The two primary events that stood out clearly as impacting performance are machine clears and LLC misses. They account for most of the cycles in the entire run. The only place where machine clears and LLC misses do not account for most of the cycles is TX of 128B in the full affinity mode. This points to the possibility of other performance-affecting events that we have not considered in this study.

64KB                   Tx                    Rx
Event           Cost   No Aff    Full Aff    No Aff    Full Aff
Machine clear   500    59.3%     54.8%       71.2%     60.1%
TC miss         20     1.0%      1.4%        0.8%      0.8%
L2 miss         10     2.3%      1.6%        2.2%      2.3%
LLC miss        300    39.8%     33.6%       45.5%     39.0%
ITLB miss       30     0.1%      0.1%        0.2%      0.2%
DTLB miss       36     0.2%      0.3%        0.3%      0.3%
Br Mispredict   30     0.9%      1.1%        0.8%      0.8%
Instr           0.33   6.0%      8.0%        4.0%      4.0%

128B                   Tx                    Rx
Event           Cost   No Aff    Full Aff    No Aff    Full Aff
Machine clear   500    39.8%     22.4%       66.8%     21.3%
TC miss         20     1.6%      1.8%        1.2%      1.4%
L2 miss         10     0.9%      0.7%        0.8%      0.6%
LLC miss        300    24.2%     19.8%       20.6%     15.7%
ITLB miss       30     0.1%      0.1%        0.1%      0.2%
DTLB miss       36     0.2%      0.2%        0.1%      0.1%
Br Mispredict   30     1.0%      1.0%        0.8%      0.8%
Instr           0.33   7.4%      8.2%        7.2%      7.9%

Figure 5 Deriving Performance Impact Indicators
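As a worked sketch of this bookkeeping (ours, using the penalties listed in Figure 5 but invented event counts), each indicator is simply the event count multiplied by an assumed average penalty, divided by the total cycles of the run:

/* Sketch: first-order performance impact indicators (Section 6.2).
 * Penalties follow Figure 5; the counts and run length are hypothetical. */
#include <stdio.h>

struct event {
    const char *name;
    double penalty_cycles;   /* assumed average cost per occurrence */
    double count;            /* measured number of occurrences      */
};

int main(void)
{
    double total_cycles = 1.0e12;              /* hypothetical run length */
    struct event events[] = {
        { "Machine clear", 500.0, 1.2e9 },
        { "LLC miss",      300.0, 1.3e9 },
        { "TC miss",        20.0, 5.0e8 },
        { "Br mispredict",  30.0, 3.0e8 },
    };
    for (unsigned i = 0; i < sizeof(events) / sizeof(events[0]); i++) {
        double pct = 100.0 * events[i].penalty_cycles * events[i].count
                     / total_cycles;
        printf("%-14s ~%5.1f%% of run time\n", events[i].name, pct);
    }
    return 0;
}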

6.3. Comparative Characterization

We focus our deeper analysis on the two primary

factors. Table 3 shows the % improvement in time (i.e. reduction in cycles) in going from the no affinity to the full affinity mode. The no affinity baseline characteristics are repeated there for reference. The overall improvement in time is about 22% for 64KB transfers and 9% for 128B transfers. We derived the % improvement for cycles (or other events) by measuring and comparing the number of events in the two affinity modes (using Amdahl's law). For example, to derive the % improvement in the number of machine clears in the TCP engine when going from the no affinity to the full affinity mode, we use the following:

% Improvement = (clears-engineNo / clears-totalNo) × (1 - clears-engineFull / clears-engineNo)

clears-engineNo = number of machine clears (per work done) observed in the TCP engine under no affinity
clears-engineFull = number of machine clears (per work done) observed in the TCP engine under full affinity
clears-totalNo = total number of machine clears (per work done) observed under no affinity
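A small sketch (ours, with invented event counts) of this speedup bookkeeping: a bin's contribution is its share of the no-affinity total, scaled by how much the bin itself shrank under full affinity.

/* Sketch of the Amdahl-style improvement attribution used in Section 6.3.
 * All event counts below are hypothetical. */
#include <stdio.h>

/* bin_no, bin_full: events in one functional bin (per unit of work) under
 * no affinity and full affinity; total_no: all events under no affinity.  */
static double pct_improvement(double bin_no, double bin_full, double total_no)
{
    return 100.0 * (bin_no / total_no) * (1.0 - bin_full / bin_no);
}

int main(void)
{
    /* invented machine-clear counts for the TCP engine bin */
    double clears_engine_no   = 4.0e8;
    double clears_engine_full = 1.5e8;
    double clears_total_no    = 2.0e9;
    printf("TCP engine contributes %.1f%% improvement in machine clears\n",
           pct_improvement(clears_engine_no, clears_engine_full,
                           clears_total_no));
    return 0;
}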

Diving deeper into TX 64KB transfers, we see that

buffer mgmt contributes 11.6% (i.e. half) of the 22.1% improvement. To understand why, we look across the row and see corresponding improvements in LLC misses and machine clears. More data is served from cache in the full affinity mode. Since buffer management makes up a large part of TCP processing for large transfers, affinity has a larger impact at these sizes. The other large improvement happened in the TCP engine. Despite our efforts to keep this bin compute-only, we cannot completely avoid memory accesses in these routines (e.g. routines that calculate TCP window sizes must also read TCP contexts). What is interesting is that affinity did not seem to affect copies. A deeper understanding of the stack and workload reveals the reasons. ttcp does no work other than read() or write(). On issuing a write(), unless the process has used up its time slice at the system call, the scheduler continues the execution of the write. The copy is most likely done immediately in the context of the process, and on the same processor. RX is more unpredictable, as packets will generally arrive outside the context of the process that issued the read. However, in ttcp, packets are arriving constantly; enough data sits in the socket buffers, ready to be copied into user space.

Table 3 Relating improvements to events

Turning our attention to 128B, we see that the

largest % time spent is in the interface (as should be expected), but the largest % improvement occurs in buffer mgmt and TCP engine. The improvement in the interface itself is minimal. The same sockets layer is used in both affinity modes.

Given the non-memory intensive nature of this layer, affinity has minimal effects. We note that LLC cache misses improve in both bins, for the same reasons noted before. Noteworthy is the improvement seen in machine clears in the TCP engine for small transfers (up to 63%). Machine clears are the major source of pipeline stalls.

Machine clears can be caused by memory

reordering or self-modifying code. They can also be caused by any type of hardware interrupt. For processors with deep pipelines, machine clears can be costly. We collected counts of machine clears due to memory reordering or self-modifying code, and these turned out to be near zero. Once we confirmed that, we were quite certain that the machine clears are due primarily to interrupts. Interrupts may come in the form of interrupts from devices (NICs), inter-processor interrupts (IPIs), page faults and other exceptions.

Quantifying the impact of machine clears due to interrupts had been challenging for various reasons.

RX 128B           no affinity baseline        Improvements in
Functional bins   % Time   CPI    MPI ×10^-3  Cycles   LLC      Clears
Interface         43.1%    8.5    3.5         -0.1%    -1.2%    4.1%
Engine            23.7%    3.4    2.1         4.3%     14.7%    63.2%
Buffer mgmt       10.0%    2.3    2.5         3.7%     13.6%    6.3%
Copies            13.8%    4.9    7.8         0.2%     -0.5%    0.1%
Driver            2.9%     5.7    7.1         0.4%     4.0%     0.8%
Locks             2.7%     21.0   10.6        1.4%     0.9%     1.3%
Timers            2.2%     2.8    1.9         -0.7%    -2.8%    0.0%
Overall                    4.6    3.4         9.2%     28.6%    75.7%

TX 64KB           no affinity baseline        Improvements in
Functional bins   % Time   CPI    MPI ×10^-3  Cycles   LLC      Clears
Interface         6.0%     16.7   20.1        1.6%     1.0%     6.0%
Engine            25.5%    4.9    6.6         7.9%     16.9%    14.2%
Buffer mgmt       28.0%    6.1    6.8         11.6%    18.9%    7.6%
Copies            27.1%    4.2    11.3        0.3%     4.0%     -1.6%
Driver            10.4%    6.2    5.0         0.5%     3.4%     2.4%
Locks             0.6%     16.8   4.1         0.6%     0.1%     0.0%
Timers            2.0%     4.2    3.2         -0.4%    -1.3%    -0.1%
Overall                    5.2    8.0         22.1%    43.0%    28.5%

TX 128B           no affinity baseline        Improvements in
Functional bins   % Time   CPI    MPI ×10^-3  Cycles   LLC      Clears
Interface         42.4%    8.6    3.4         -0.9%    -1.1%    -5.1%
Engine            29.0%    3.4    2.2         2.8%     12.5%    53.9%
Buffer mgmt       11.6%    4.5    5.0         4.2%     14.3%    2.7%
Copies            5.9%     1.6    8.4         -0.1%    -0.5%    -2.0%
Driver            4.4%     5.6    7.0         1.0%     5.4%     1.8%
Locks             3.8%     16.6   4.0         2.7%     0.5%     0.6%
Timers            1.5%     2.5    1.9         -0.5%    -1.7%    -0.1%
Overall                    4.6    4.1         9.3%     29.3%    51.8%

RX 64KB           no affinity baseline        Improvements in
Functional bins   % Time   CPI    MPI ×10^-3  Cycles   LLC      Clears
Interface         7.5%     14.0   15.4        2.2%     3.0%     7.0%
Engine            22.7%    4.7    4.7         4.8%     9.7%     11.1%
Buffer mgmt       20.4%    6.8    10.9        11.6%    15.9%    9.0%
Copies            32.1%    65.7   134.7       0.4%     3.5%     -0.2%
Driver            7.2%     6.9    10.4        1.7%     5.4%     2.2%
Locks             1.3%     18.1   6.8         1.0%     0.2%     -0.3%
Timers            8.2%     6.5    9.5         -0.7%    -2.8%    -0.5%
Overall                    8.6    13.3        21.0%    35.0%    28.2%

Oprofile does not support the explicit counting of interrupts. We can get an idea of the total count of device interrupts from the /proc statistics, but that is only a partial picture. Nevertheless, we can safely deduce that machine clears in driver code are predominantly due to the NICs. Those seen in the TCP engine are most likely due to IPIs; however, we currently cannot prove the IPI reasoning. Interrupts are, by nature, arbitrary intrusions into the execution path. The sampling nature of any event-based tool suffers from "skid" into functions that did not incur the event itself. In other words, a machine clear that is caused by an interrupt may not be accounted for until many instructions later, whereby it is attributed to the function in which the interrupt occurred. While device interrupts are heavier in nature (and are at a granularity visible to Oprofile), IPIs are typically very lightweight and hence difficult to account for. What we can account for are their indirect after-effects, i.e. the machine clears occurring in the TCP engine.

We are in the process of patching the Linux kernel to collect IPI counts that can correlate with and support our claims. Meanwhile, we offer a partial explanation here. In the full affinity mode, each processor is responsible for 4 processes and their respective interrupts. Interrupts are serviced on the same processor that will ultimately run the higher layers of the stack and the user application. There is a direct path of execution within a processor, and IPIs are minimized. In the no affinity mode, CPU1 can potentially be running all of the user processes (since CPU0 is saturated with interrupt processing). The interrupts, and possibly the lower layers of the TCP stack, are executed on CPU0. To complete execution, CPU0 will need to interrupt CPU1. In other words, in the no affinity case, the full execution path is divided between two processors, requiring extra IPIs for synchronization and scheduling. To support our reasoning, a per-CPU view of the Oprofile results is useful. We examine the functions (for small transfers) that give rise to the most machine clears (Table 4). Only the TCP engine and interrupt routines are shown. If our reasoning has merit, we should see more machine clears in the TCP engine on CPU1 in the no affinity case.

We first confirmed that CPU0 is responsible for servicing all device interrupts. IRQ0xnn_interrupt are the corresponding IRQ interrupt handlers for each of our NIC devices. We see a total of 8 interrupt handlers. In the full affinity case, the interrupt handlers are equally divided between the 2 processors. Worthy of note is the number of (absolute) machine clears that each device interrupt handler sees. They are similar across processors and across affinity modes.

This is exactly as expected since affinity does not change the arrival behavior of device interrupts.

Table 4 Functions (TCP engine & interrupts) with the highest number of machine clears

TX 128B, no affinity, CPU 0                TX 128B, full affinity, CPU 0
samples  %      symbol                     samples  %      symbol
2385     17.80  tcp_transmit_skb           1097     20.73  tcp_sendmsg
957      7.14   tcp_sendmsg                737      13.93  IRQ0x23_interrupt
741      5.53   tcp_v4_rcv                 694      13.12  IRQ0x1a_interrupt
822      6.14   IRQ0x1a_interrupt          655      12.38  IRQ0x19_interrupt
796      5.94   IRQ0x23_interrupt          644      12.17  IRQ0x24_interrupt
788      5.88   IRQ0x27_interrupt
757      5.65   IRQ0x1d_interrupt
750      5.60   IRQ0x25_interrupt
672      5.02   IRQ0x1b_interrupt
643      4.80   IRQ0x19_interrupt
576      4.30   IRQ0x24_interrupt

TX 128B, no affinity, CPU 1                TX 128B, full affinity, CPU 1
samples  %      symbol                     samples  %      symbol
4559     63.74  tcp_sendmsg                908      18.67  tcp_sendmsg
437      6.11   tcp_transmit_skb           725      14.91  IRQ0x1b_interrupt
288      4.03   inet_sendmsg               698      14.35  IRQ0x1d_interrupt
                                           682      14.02  IRQ0x25_interrupt
                                           655      13.47  IRQ0x27_interrupt

RX 128B, no affinity, CPU 0                RX 128B, full affinity, CPU 0
samples  %      symbol                     samples  %      symbol
4718     25.02  tcp_v4_rcv                 166      3.36   tcp_recvmsg
1215     6.44   tcp_transmit_skb           637      12.91  IRQ0x23_interrupt
867      4.60   IRQ0x19_interrupt          609      12.34  IRQ0x19_interrupt
744      3.95   IRQ0x23_interrupt          598      12.12  IRQ0x24_interrupt
712      3.78   IRQ0x24_interrupt          541      10.96  IRQ0x1a_interrupt
472      2.50   IRQ0x1d_interrupt
467      2.48   IRQ0x27_interrupt
466      2.47   IRQ0x25_interrupt
459      2.43   IRQ0x1b_interrupt
447      2.37   IRQ0x1a_interrupt

RX 128B, no affinity, CPU 1                RX 128B, full affinity, CPU 1
samples  %      symbol                     samples  %      symbol
4576     25.55  tcp_recvmsg                145      3.35   tcp_recvmsg
3308     18.47  tcp_rcv_established        27       0.62   tcp_v4_rcv
2955     16.50  tcp_transmit_skb           25       0.58   __tcp_select_window
1545     8.63   tcp_v4_do_rcv              25       0.58   tcp_rcv_established
                                           585      13.52  IRQ0x1d_interrupt
                                           585      13.52  IRQ0x25_interrupt
                                           554      12.80  IRQ0x1b_interrupt
                                           544      12.57  IRQ0x27_interrupt

Turning our attention to the TCP engine (and looking across the rows), there is a large decrease in the absolute number of machine clears going from the no affinity to the full affinity mode. In the full affinity case, the number of machine clears is similar for CPU0 and CPU1. However, in the no affinity case, the number of machine clears found within the TCP engine on CPU1 is considerably larger than that on CPU0. We take this as indirect evidence that more machine clears are happening on CPU1 because TCP processing functionality has been divided between the processors, and more IPIs are required for scheduling work across them.

Table 5 Correlating cycles improvement to event improvement

Rank correlation    LLC     Clears
TX 64KB             0.62    0.80
TX 128B             0.93    0.89
RX 64KB             0.82    0.93
RX 128B             0.96    0.79

* Critical value for p=0.05, degrees of freedom = 5, one-tailed, is 0.377.

Finally, to validate the analysis methods used so far with another approach, we quantify the relation between the improvements seen in timing and the improvements seen in LLC misses and machine clears. We performed a Spearman's rank correlation test on the data. A value of +1 shows perfect correlation, i.e., the trend in one data set moves exactly with the other. A value of 0 implies no correlation. Table 5 shows the correlation results, which are also statistically significant. A strong correlation exists, and we conclude that improvements in LLC misses or machine clears are predictive of improvements in processing times. This confirms the impact indicators derived earlier in Section 6.2.
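For reference, a minimal sketch (ours; ties are ignored, and the per-bin improvement values are invented rather than taken from Table 3) of the Spearman rank-correlation computation behind Table 5:

/* Sketch: Spearman rank correlation between per-bin improvements in cycles
 * and per-bin improvements in an architectural event. Ties are ignored. */
#include <stdio.h>

#define N 7   /* one entry per functional bin */

static void ranks(const double *x, double *r)
{
    for (int i = 0; i < N; i++) {
        int smaller = 0;
        for (int j = 0; j < N; j++)
            if (x[j] < x[i])
                smaller++;
        r[i] = smaller + 1;            /* 1-based rank, assuming no ties */
    }
}

static double spearman(const double *x, const double *y)
{
    double rx[N], ry[N], d2 = 0.0;
    ranks(x, rx);
    ranks(y, ry);
    for (int i = 0; i < N; i++)
        d2 += (rx[i] - ry[i]) * (rx[i] - ry[i]);
    return 1.0 - 6.0 * d2 / (N * ((double)N * N - 1.0));
}

int main(void)
{
    /* hypothetical % improvements per bin: cycles vs. machine clears */
    double cycles[N] = { 2.0, 8.0, 12.0, 0.5, 1.0, 0.7, -0.3 };
    double clears[N] = { 6.0, 14.0, 8.0, -1.0, 2.0, 0.1, -0.2 };
    printf("rank correlation = %.2f\n", spearman(cycles, clears));
    return 0;
}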

7. Related Work

Researchers have looked at affinity in various forms. Some took a hard partitioning approach whereby the TCP stack is executed only on specific processors [17][19]. Protocol and application processing are confined to their respective processors, removing intrusive OS effects on the network stack (and vice versa). Muir [19] noted that total instruction cache misses were reduced in their implementation. Analytical models have also been developed to characterize different affinity-based scheduling algorithms. Vaswani [22] presented analytical models for the Sequent Symmetry (a machine with twenty 16MHz Intel 80386 processors). Their studies showed that cache affinity had minimal impact (for the slow processors of that time); however, they did contend that affinity scheduling would become more important as processor speeds increase. Later work [13][21] evaluated affinity-based scheduling of parallelized loops. The success of these algorithms rests on their ability to dynamically achieve good affinity along with load balancing. Salehi [20] presented models that support the effectiveness of affinity-based scheduling for UDP processing on multiprocessors. Their focus was to evaluate the effectiveness of various algorithms, but since they worked in user space, they did not consider

system and implementation costs. The current version of Linux 2.6 adopts a more intelligent scheme whereby the kernel dispatches interrupts to one processor for a short duration before it randomly switches interrupt delivery to a different processor. The random distribution resolves the system bottleneck problem, while the delayed switching provides a best-effort approach to improving cache locality. However, cache inefficiencies are still unavoidable. Furthermore, the kernel has to regularly update the task priority registers (TPRs) in the I/O Advanced Programmable Interrupt Controllers (APICs), increasing the number of uncacheable writes.

8. Conclusions & Future Work

While this paper did not propose new implementations, we have exposed the primary reasons why affinity improves TCP processing performance. We experimentally confirmed the intuition that processor affinity can improve performance via better cache locality. We further showed that machine clears are the other major factor to consider in affinity studies; this had not been called out by previous researchers. We quantified that when full affinity is used, the benefits are seen primarily in the TCP engine and buffer management routines. We also showed that machine clears can be reduced tremendously if we are able to direct the various levels of stack execution onto the same processor. This bodes well for future network adapters that can look deeper into packets to extract flow information (receive-side scaling) [16] and direct connections and interrupts, dynamically, to a specific processor.

It is encouraging to see how simple mechanisms, without the need for hardware offloads or major software rewrites, have afforded us good performance gains. However, we have investigated only bulk data transfers and affinity settings in the most basic configuration. While more scheduling intelligence must accompany affinity to accommodate non-uniform and dynamically changing applications, it can be envisioned that dedicated servers (e.g. a web server running a known number of worker threads and NICs) may have workloads that can leverage static user-directed mechanisms. A necessary extension of this study is to determine the applicability of this characterization to more complex networking workloads (e.g. SPECweb, TPC-C). We have started initial work that shows promising performance gains when running a file I/O benchmark over iSCSI/TCP.

We believe that this study has provided pertinent data for guiding the design of better scheduling algorithms based on user-directed affinity. We also believe we have established methods for leveraging performance counters that can be applied to future performance investigations. User-directed affinity is applicable to other workloads besides networking. As we look towards the arrival of chip multiprocessors, whereby multiple cores, possibly each with multiple threads, reside on each processor, we believe that affinity, and mechanisms to better manage it, will undoubtedly take a central role in future operating systems.

9. Acknowledgements

We would like to thank Ravi Iyer for his advice on

speedup analysis. We are also especially grateful to John Levon, Philippe Elie and Will Cohen, who provided us with a special patch of Oprofile that exposes processor ID information, and who also helped us understand Oprofile's interpretation of events on hyperthreaded processors.

10. References

[1] "Competitive Landscape: TNICs vs. NICs vs. iSCSI HBAs", Alacritech white paper, available at http://www.alacritech.com/assets/applets/Competitive.Landscape.pdf, 2004.

[2] V. Anand and B. Hartner. "TCP/IP Network Stack Performance in Linux Kernel 2.4 and 2.5", Proc. of the Linux Symposium, Ottawa, June 2002.

[3] J. Chase, A. Gallatin and K. Yocum. "End-System Optimizations for High-Speed TCP", IEEE Communications Magazine, 39(4), April 2001.

[4] D. Clark, V. Jacobson, J. Romkey and H. Salwen. "An Analysis of TCP Processing Overhead", IEEE Communications Magazine, 27(6):23-29, June 1989.

[5] A. Foong, T. Huff, H. Hum, J. Patwardhan and G. Regnier. "TCP Performance Re-visited", Proc. of the IEEE Intl. Symposium on Performance Analysis of Systems and Software, Austin, March 2003.

[6] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach, 3rd ed., Morgan Kaufmann, 2003.

[7] IA-32 Intel Architecture Software Developer's Manual, Vol. 3: System Programming Guide, Intel Corporation, 2002.

[8] R. Kalla, B. Sinharoy and J. Tendler. "IBM Power5 Chip: A Dual-Core Multithreaded Processor", Proc. of Hot Chips, 2004.

[9] J. Kay and J. Pasquale. "The Importance of Non-Data Touching Processing Overheads in TCP/IP", Proc. of ACM SIGCOMM, San Francisco, 1993.

[10] P. Leroux. "Meeting the Bandwidth Challenge: Building Scalable Networking Equipment Using SMP", Dedicated Systems Magazine, 2001.

[11] Cross-referencing Linux, available at http://lxr.linux.no.

[12] R. Love. Linux Kernel Development, Sams Publishing, 2004.

[13] E. Markatos and T. LeBlanc. "Using Processor Affinity in Loop Scheduling on Shared-Memory Multiprocessors", Proc. of the 1992 ACM/IEEE Conference on Supercomputing, 104-113, 1992.

[14] S. Makineni and R. Iyer. "Architectural Characterization of TCP/IP Packet Processing on the Pentium M Microprocessor", Proc. of the Intl. Conference on High Performance Computer Architecture, Feb 2004.

[15] D. Mosberger. "A Closer Look at the Linux O(1) Scheduler", HP Labs report, available at http://www.hpl.hp.com/research/linux/kernel/o1.php, 2003.

[16] "Scalable Networking: Eliminating the Receive Processing Bottleneck - Introducing RSS", Microsoft white paper, available at http://www.microsoft.com/whdc/, 2004.

[17] A. Muir and J. Smith. "AsyMOS: An Asymmetric Multiprocessor OS", Proc. of OPENARCH '98, 25-34, April 1998.

[18] "Oprofile: A System-Wide Profiling Tool for Linux", available at http://oprofile.sourceforge.net, 2004.

[19] G. Regnier, D. Minturn, G. McAlpine, V. Saletore and A. Foong. "ETA: Experience with an Intel Xeon Processor as a Packet Processing Engine", IEEE Micro, Jan 2004.

[20] J. Salehi, J. Kurose and D. Towsley. "The Effectiveness of Affinity-Based Scheduling in Multiprocessor Network Protocol Processing", IEEE/ACM Transactions on Networking, 4(4):516-530, 1996.

[21] S. Subramaniam and D. Eager. "Affinity Scheduling of Unbalanced Workloads", Proc. of Supercomputing '94, 214-226, 1994.

[22] R. Vaswani and J. Zahorjan. "The Implications of Cache Affinity on Processor Scheduling for Multiprogrammed, Shared Memory Multiprocessors", Proc. of the 13th ACM Symposium on Operating Systems Principles, Pacific Grove, 26-40, 1991.

[23] VTune Performance Analyzer 7.1 Tuning Assistant Manual, Intel Corporation, 2004.

[24] W. Shi, R. Wright, E. Collins and V. Karamcheti. "Workload Characterization of a Personalized Web Site", Proc. of the 7th Intl. Workshop on Web Content Caching and Distribution, Boulder, CO, Aug 2002.

* Performance numbers reported in this paper do not necessarily represent the best performance available for the processors named and should not be used to compare processor performances. ** Other names and brands may be claimed as the property of others. Intel and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.