Optimizing processor performance for Wintel applications

1998 Demand Technology, Inc.

Optimizing processor performance for Wintel applications: a case study

Demand Technology Software
1020 Eighth Avenue South, Suite 6, Naples, FL 34102
phone: (941) 261-8945 fax: (941) 261-5456
e-mail: markf@demandtech.com
http://www.demandtech.com

1998 Demand Technology, Inc. Wintel application tuning: processor optimization

Application tuning case study

■ The target application is a C program written to perform interval performance data collection for Windows NT
  ♦ It is important that this program perform well because “If you are not part of the solution, you are part of the problem.”
  ♦ Performance Analysts use our product, and they are very demanding customers

Application tuning case study

■ At a point in development where the code was reasonably mature and stable, I subjected it to performance analysis using several tools.
  ♦ Microsoft Visual C++ version 5
  ♦ Rational Visual Quantify execution profiler
  ♦ Intel vTune version 2.5 optimization tool

Windows NT on Intel hardware

■ In order to use vTune effectively, it helps to understand how Intel processor hardware works
  ♦ extensive Intel processor documentation ships on the product CD
■ Target environment:
  ♦ Microsoft Windows NT 4.0
  ♦ Intel Pentium and Pentium Pro hardware

NT performance monitoring

■ Performance SeNTry™ collection agent

initialization
loop until cycle end = TRUE;
    Win32 API calls to retrieve performance data;
    calculate;
    Write data to file;
end loop;

Win32 performance monitoring API

■ Familiar and well-documented interface
  ♦ The only programmatic way to enumerate the Processes running on an NT system
  ♦ NT Performance data is structured as
    ■ Objects (records)
    ■ Counters (fields)
  ♦ Collection agents are associated with Objects
    ■ NT base Objects, including kernel Objects
    ■ extended Objects require a Perflib dll

Data Collection sets

■ Define proper subsets of a Master Collection set
  ♦ Defines all known Objects and Counters
  ♦ Some Objects are instanced: there can be multiple occurrences of instanced Objects
  ♦ Two Parent:Child relationships defined
    ■ Process is the parent of Thread
    ■ Physical Disk is the parent of Logical Disk

Data Collection sets

■ Performance considerations
  ♦ Data collection is performed one Object at a time
    ■ This was necessary due to a bug in the Win32 collection services
    ■ An n:1 correspondence between Objects and their associated collection routines
  ♦ With the exception of Child Objects
    ■ They are collected at the same time as the Parent Objects
  ♦ There are many instances of some Objects
    ■ Process and Thread

Data Collection sets

■ Performance considerations
  ♦ There are compelling reasons why data collection should be done at frequent intervals
    ■ identified by Buzen and Shum, 1996
  ♦ Performance data for processes that terminate before the end of the interval is lost
  ♦ one collection interval used for both Accumulator Counters (processor time) and Instantaneous Counters (e.g., processor Queue length)
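
The distinction between the two counter types matters when raw values are turned into interval statistics. The sketch below assumes, for illustration only, that an Accumulator Counter is reduced by differencing two successive samples over the elapsed time, while Instantaneous Counter samples are simply averaged. The function names and the choice of units are hypothetical and are not taken from the collection agent.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: percent busy over one interval, derived from an
 * accumulator counter. The counter and the timestamps are assumed to be
 * in the same units (for example, 100-nanosecond ticks).
 * Returns -1.0 if no time has elapsed between the two samples. */
double accumulator_percent(uint64_t prev_value, uint64_t curr_value,
                           uint64_t prev_time, uint64_t curr_time)
{
    if (curr_time <= prev_time)
        return -1.0;
    return 100.0 * (double)(curr_value - prev_value)
                 / (double)(curr_time - prev_time);
}

/* Hypothetical helper: mean of the instantaneous samples (for example,
 * queue lengths) observed during an interval. Returns 0.0 for no samples. */
double instantaneous_mean(const uint32_t *samples, size_t count)
{
    uint64_t sum = 0;
    size_t i;

    if (count == 0)
        return 0.0;
    for (i = 0; i < count; i++)
        sum += samples[i];
    return (double)sum / (double)count;
}
```

An accumulator needs only two samples per interval, however long the interval is. An instantaneous value is only as representative as the number of samples taken, which is one reason to consider collecting some Objects more often than others.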

Data Collection sets

■ Performance considerations
  ♦ Ideally, collection should be performed at least once per minute;
  ♦ possibly, some Objects could be collected even more frequently in order to accumulate samples of Instantaneous Counter values
  ♦ Can our code handle it?

Goals of the tuning exercise

■ Profile our code execution path so that we can understand it better
  ♦ Profilers eliminate a lot of idle speculation about what your code is doing
■ Better understand the Win32 services and their interaction with our code
  ♦ We cannot change these services, but perhaps we can interact with them in better ways

Goals of the tuning exercise

■ Evaluate code optimization strategies
  ♦ optimizing Compiler options
    ■ Pentium and Pentium Pro specific optimizations
  ♦ In-line assembler
  ♦ Code restructuring
  ♦ etc.
■ Feed results forward into the development process

VC++ code profiler

■ Built-in compiler option
■ Times program functions during run time
  ♦ Must run the application under the debugger
■ Creates a text report showing:
  ♦ Function time
  ♦ Function+Child Function time
  ♦ Hit Count
■ Example: DefaultCollectionSet once per second

VC++ code profiler output

Module Statistics for dmperfss.exe
----------------------------------
    Time in module: 283541.261 millisecond
    Percent of time in module: 100.0%
    Functions in module: 155
    Hits in module: 11616795
    Module function coverage: 72.3%

        Func          Func+Child           Hit
        Time   %         Time      %      Count  Function
---------------------------------------------------------
  248146.507  87.5   248146.507  87.5       249 _WaitOnEvent (dmwrdata.obj)
    8795.822   3.1     8795.822   3.1    393329 _WriteDataToFile (dmwrdata.
    4413.518   1.6     4413.518   1.6      2750 _GetPerfDataFromRegistry (dm
    3281.442   1.2     8153.656   2.9    170615 _FormatWriteThisObjectCount
    3268.991   1.2    12737.758   4.5     96912 _FindPreviousObjectInstanceC
    2951.455   1.0     2951.455   1.0   3330628 _NextCounterDef (dmwrdata.ob
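
As a quick check on how to read the report, the Func Time and Hit Count columns can be combined into per-call figures. The helpers below are an illustrative sketch (the names are not part of the profiler); they recompute the percentage column and the average time per hit from the numbers in the report.

```c
/* Percentage of the module's total time spent in one function.
 * Both arguments are in milliseconds, as in the profiler report. */
double percent_of_module(double func_time_ms, double module_time_ms)
{
    if (module_time_ms <= 0.0)
        return 0.0;
    return 100.0 * func_time_ms / module_time_ms;
}

/* Average time per call, in milliseconds, from Func Time and Hit Count. */
double ms_per_hit(double func_time_ms, unsigned long hit_count)
{
    if (hit_count == 0)
        return 0.0;
    return func_time_ms / (double)hit_count;
}
```

With the report's figures, percent_of_module(248146.507, 283541.261) gives about 87.5, and ms_per_hit(248146.507, 249) gives roughly 997 milliseconds per call to _WaitOnEvent. That is what one would expect from a once-per-second collection cycle that spends most of each interval waiting. By contrast, _NextCounterDef averages under a microsecond per call and matters only because it is called more than three million times.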

VC++ profiler: Observations

■ Our program is “sleeping” 87.5% of the time!
■ Can only look at your program’s code
  ♦ If your function is spending all its time making Win32 API calls or calling other dlls, they are not visible
■ Parent-child relationships among modules are not readily apparent

Rational Visual Quantify

■ Add-on product
  ♦ Visual Studio “integration”
■ Select profiling at the level of the function call or the line
■ Adds instrumentation to each module during the runtime session
  ♦ Includes all shareable and relocatable exes and dlls called by your program!

Rational Visual Quantify

■ Reporting
  ♦ graphic view of your program’s critical execution path
    ■ breaks out dlls and some system services
  ♦ parent-child relationships among modules are explicit
  ♦ convenient navigation between views
■ Performs analysis of ∆ between two execution runs

Rational Visual Quantify

Rational VQ: Observations

■ Added instrumentation affects absolute function time values observed
  ♦ We only spent 32% of our time “Sleeping”
  ♦ relative timing relationships between functions appear unaffected
■ App is very intuitive and easy to use
  ♦ e.g., relationships between function calls
■ Ability to trace module execution through 3rd party functions can be very useful!

Intel vTune

■ Standalone execution profiler
■ Relies on system-wide sampling
  ♦ maps the location of the Program Counter to the module in memory
  ♦ catches every program, including the OS
■ Optionally, can also be used to report on the Pentium/Pentium Pro performance metrics during program execution
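
The sampling approach can be pictured as a histogram keyed by module. The following is a minimal sketch of the idea only, not a description of how vTune is implemented: each sampled Program Counter value is looked up in a table of module address ranges and the matching module's tally is incremented. The structure, the function names, and the load addresses in the usage note are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One loaded module and the number of samples attributed to it. */
typedef struct {
    const char   *name;
    uintptr_t     base;     /* load address */
    size_t        size;     /* bytes occupied in memory */
    unsigned long samples;  /* tally of sampled program counters */
} Module;

/* Return the index of the module containing pc, or -1 if none does. */
int find_module(const Module *modules, size_t count, uintptr_t pc)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (pc >= modules[i].base && pc - modules[i].base < modules[i].size)
            return (int)i;
    }
    return -1;
}

/* Attribute each sampled program counter to a module.
 * Returns the number of samples that fell outside every known module. */
unsigned long tally_samples(Module *modules, size_t count,
                            const uintptr_t *pcs, size_t pc_count)
{
    unsigned long unknown = 0;
    size_t i;

    for (i = 0; i < pc_count; i++) {
        int index = find_module(modules, count, pcs[i]);
        if (index >= 0)
            modules[index].samples++;
        else
            unknown++;
    }
    return unknown;
}
```

Because the samples are taken system-wide, the table would include the operating system and every other process as well as the application. The same lookup applied to function address ranges instead of module ranges yields a per-function hot spot list.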

Intel vTune

■ High percentage of samples showed NT running the Idle Thread!
■ Switched to Master Collection set once per second to generate more activity
  ♦ Rational VQ overhead was too high to perform a comparable test
  ♦ Result: very different profile of activity

Intel vTune

■ Hotspot analysis showed two functions accounted for > 70% of the activity inside our process address space
  ♦ NextInstanceDef
  ♦ IsPreviousAndParentSameInstance
■ vTune analyzes x86 assembler code to assist you in taking advantage of the superscalar features of the P5

Intel processor performance overview

■ Complex Instruction set (CISC)
  ♦ Maintain upward compatibility with original 8-bit 8080 instruction set
■ With improvements in semiconductor fabrication, add
  ♦ pipelining, TLB, cache, branch prediction
    ■ 486
  ♦ elements of RISC superscalar processors
    ■ Pentium, Pentium Pro

Intel processor evolution

Processor     Year   Clock Speed (MHz)   Bus Width (bits)   Addressable Memory   Transistors
8080          1974   2                   8                  64K                  6,000
8086          1978   5-10                16                 1 MB                 29,000
8088          1979   5-8                 8                  1 MB                 29,000
80286         1982   8-12                16                 16 MB                134,000
386 DX        1985   16-33               32                 4 GB                 275,000
486 DX        1989   25-50               32                 4 GB                 1,200,000
Pentium       1993   60-233              32                 4 GB                 3,100,000
Pentium Pro   1995   150-200             64                 4 GB                 5,500,000
Pentium II    1997   233-333             64                 4 GB                 7,500,000

Intel processor evolution

Processor     Highlights
8080          1 chip microprocessor
8086          10X performance 8080
8088          8 bit version of 8086
80286         Virtual Memory
386 DX        32 bit Registers
486 DX        L1 Cache; pipelined
Pentium       dual integer pipeline
Pentium Pro   microarchitecture
Pentium II    Dual bus; MMX

Intel x86 pipeline

■ 5-stage pipeline introduced with 486
  ♦ Integrated 8K data and instruction cache
  ♦ Prefetch
  ♦ Instruction Decode 1
  ♦ Instruction Decode 2 (address calculation)
  ♦ Execute
  ♦ Write Back

PF  D1  D2  EX  WB

Intel x86 pipeline expectations

Instruction 1   PF  D1  D2  EX  WB
Instruction 2       PF  D1  D2  EX  WB
Instruction 3           PF  D1  D2  EX  WB
Instruction 4               PF  D1  D2  EX  WB
Instruction 5                   PF  D1  D2  EX  WB
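
The diagram above corresponds to the ideal case: one instruction enters the pipeline on every clock and nothing stalls. Under that assumption, n instructions complete in (stages + n - 1) clocks rather than (stages * n). The function names below are illustrative.

```c
/* Clocks needed to complete n instructions on an ideal pipeline with the
 * given number of stages: the first instruction takes `stages` clocks and
 * each later one completes one clock after its predecessor. */
unsigned long pipelined_clocks(unsigned long n, unsigned long stages)
{
    if (n == 0)
        return 0;
    return stages + n - 1;
}

/* Clocks needed if each instruction must finish every stage before the
 * next instruction is allowed to start. */
unsigned long unpipelined_clocks(unsigned long n, unsigned long stages)
{
    return n * stages;
}
```

For the five instructions shown, the ideal 5-stage pipeline needs 9 clocks where strictly sequential execution would need 25. As n grows, the ideal rate approaches one instruction per clock, which is the expectation that multi-cycle instructions and stalls then erode.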

Intel x86 pipeline performance

■ Processor performance ≈ Clock cycle speed * cycles per instruction (CPI)
■ Intel instruction set complexity reduces the effectiveness of pipelining
  ♦ “Integer” instruction cycle times range from 1-9
  ♦ rep instruction prefixes have 4 clock startup overhead
  ♦ 32-bit address far call takes 22 clock cycles
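
In the usual textbook formulation, the time to run a program is the instruction count multiplied by CPI and divided by the clock rate, so for a fixed clock a lower CPI means proportionally more instructions completed per second. A minimal sketch, with hypothetical function names:

```c
/* Estimated run time in seconds for a given instruction count, average
 * cycles per instruction (CPI), and clock rate in Hz.
 * Returns -1.0 if the clock rate is not positive. */
double execution_seconds(double instruction_count, double cpi, double clock_hz)
{
    if (clock_hz <= 0.0)
        return -1.0;
    return instruction_count * cpi / clock_hz;
}

/* Instructions completed per second at a given CPI and clock rate.
 * Returns 0.0 if CPI is not positive. */
double instructions_per_second(double cpi, double clock_hz)
{
    if (cpi <= 0.0)
        return 0.0;
    return clock_hz / cpi;
}
```

For example, one million instructions at a CPI of 2 on a 200 MHz clock take about 10 milliseconds. A single 22-clock far call in that stream costs as much as 22 one-cycle instructions, which is why the mix of instructions matters as much as the clock rate.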

Intel 586 dual integer pipeline

■ Improved processor performance (CPI) because certain instruction pairs can execute in parallel

PF  D1  D2  EX  WB   u pipe
    D1  D2  EX  WB   v pipe

Intel 586 dual integer pipeline

■ Rules for instruction pairing are arcane
  ♦ One-cycle instructions can usually be paired
    ■ a RISC subset within a CISC
  ♦ Instructions with immediate operands or address displacements cannot be paired
  ♦ If there is an explicit Register dependency between instructions, there is no pairing
    ■ x86 only contains 8 General Purpose Registers
■ RISC processors depend on compilers that understand how to exploit them!

Intel 586 dual integer pipeline

■ Intel only manufactures hardware, so…
  ♦ Intel introduces processor measurements so that you can collect actual CPI performance statistics
  ♦ Intel introduces vTune to assist developers in analyzing software developed to run on its hardware

vTune code analysis

VC++ code generation options

■ Try different “G” Optimization switches...
  ♦ G5 → P5 optimization
    ■ some improvement, but the assembler code generated in the hot spot was unchanged.

VC++ code generation options

■ C language code:

PERF_INSTANCE_DEFINITION * NextInstanceDef
    ( PERF_INSTANCE_DEFINITION *pInstance )
{
    PERF_COUNTER_BLOCK *pCtrBlk;

    pCtrBlk = (PERF_COUNTER_BLOCK *)
        ((PBYTE)pInstance + pInstance->ByteLength);
    return (PERF_INSTANCE_DEFINITION *)
        ((PBYTE)pInstance + pInstance->ByteLength + pCtrBlk->ByteLength);
}
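
A typical caller steps through every instance of an Object by calling a function like this repeatedly, which is why it shows up as a hot spot when there are many Process and Thread instances. The sketch below reproduces the same pointer arithmetic on a raw byte buffer so that it can be compiled without the Windows headers. It assumes that each instance definition and each counter block begins with a 32-bit ByteLength field, as in the Win32 performance data structures. The helper names and the bounds checks are additions for illustration and are not part of the original program.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit length field without assuming the buffer is aligned. */
static uint32_t read_length(const unsigned char *p)
{
    uint32_t value;
    memcpy(&value, p, sizeof value);
    return value;
}

/* Same arithmetic as NextInstanceDef: skip the instance definition,
 * then skip the counter block that follows it. No bounds checking. */
const unsigned char *next_instance(const unsigned char *instance)
{
    uint32_t instance_len = read_length(instance);
    uint32_t counters_len = read_length(instance + instance_len);
    return instance + instance_len + counters_len;
}

/* Hypothetical caller: visit up to `expected` instances, stopping early if
 * a length field would run past the end of the buffer. Returns the number
 * of complete instances found. */
size_t walk_instances(const unsigned char *first, size_t total_bytes,
                      size_t expected)
{
    const size_t field = sizeof(uint32_t);
    size_t offset = 0, seen = 0;

    while (seen < expected) {
        size_t remaining = total_bytes - offset;
        uint32_t instance_len, counters_len;

        if (remaining < 2 * field)
            break;                  /* no room for both length fields */
        instance_len = read_length(first + offset);
        if (instance_len < field || instance_len > remaining - field)
            break;                  /* corrupt or truncated instance */
        counters_len = read_length(first + offset + instance_len);
        if (counters_len < field || counters_len > remaining - instance_len)
            break;                  /* corrupt or truncated counter block */
        offset += (size_t)instance_len + counters_len;
        seen++;
    }
    return seen;
}
```

In the real data block the expected count would come from the Object header rather than from the caller. The point of the sketch is that each step costs two dependent memory reads and two additions, so the function's total cost is driven almost entirely by how often it is called.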

VC++ code generation options

■ Generates tight Assembler code

00408D40  mov  ecx,dword ptr [esp+4]
00408D44  mov  edx,dword ptr [ecx]
00408D46  mov  eax,dword ptr [ecx+edx]
00408D49  add  eax,ecx
00408D4B  add  eax,edx
00408D4D  ret
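
The register-dependency pairing rule from the dual integer pipeline slides can be applied to this sequence by hand or with a small check. The sketch below is a deliberately simplified model and an editorial illustration, not vTune's analysis: it treats two neighboring instructions as unpairable whenever the second reads or writes a register that the first writes, and it ignores every other pairing rule. The type and function names are hypothetical, and the register sets were transcribed by hand from the listing above.

```c
#include <stddef.h>

/* Bit flags for the registers used in the listing. */
enum { EAX = 1 << 0, ECX = 1 << 1, EDX = 1 << 2, ESP = 1 << 3 };

typedef struct {
    const char *text;    /* instruction as shown in the listing */
    unsigned    writes;  /* registers written */
    unsigned    reads;   /* registers read, including address registers */
} Insn;

/* Simplified rule: the second instruction depends on the first if it
 * reads or writes any register that the first one writes. */
int has_register_dependency(const Insn *first, const Insn *second)
{
    return (first->writes & (second->reads | second->writes)) != 0;
}

/* Count neighboring instruction pairs with no register dependency. */
size_t count_pairable_neighbors(const Insn *code, size_t n)
{
    size_t pairs = 0, i;

    for (i = 0; i + 1 < n; i++) {
        if (!has_register_dependency(&code[i], &code[i + 1]))
            pairs++;
    }
    return pairs;
}

/* The NextInstanceDef sequence from the listing (ret omitted). */
const Insn next_instance_def[] = {
    { "mov ecx,dword ptr [esp+4]",   ECX, ESP },
    { "mov edx,dword ptr [ecx]",     EDX, ECX },
    { "mov eax,dword ptr [ecx+edx]", EAX, ECX | EDX },
    { "add eax,ecx",                 EAX, EAX | ECX },
    { "add eax,edx",                 EAX, EAX | EDX },
};
```

Under this model every instruction in the sequence depends on the one before it, so count_pairable_neighbors(next_instance_def, 5) finds no candidate pairs. If that reading is right, the code is short but offers the second pipeline nothing to do, which may be one reason the P5 optimization switch left this hot spot unchanged.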

Intel 686 microarchitecture

■ Increased parallelism
■ Instructions are translated into RISC micro-instructions
■ Pool of 40 GP pseudo-Registers
■ Micro-instructions can be executed out of sequence
■ CPI ≈ rate at which instructions are retired

Intel 686 microarchitecture

■ Performance not nearly so dependent on actual code generation, since the processor has the capability to unwind instructions and execute them out of order.
■ my vTune test runs were made on a P6!
  ♦ vTune does not analyze code running on a P6 - there should be little or no need to
  ♦ Note: internally, the Pentium II is a P6

Epilogue

■ Rational VQ and Intel vTune tests gave very different execution profiles
  ♦ Could we isolate the Collection set dependency?
  ♦ Yes, returned to VQ using a Collection set that included Thread Objects and duplicated the vTune results
■ VQ helped us zero in on execution path issues
  ♦ but there was considerable measurement overhead
■ P6 probably makes vTune obsolete

Where to get more information

■ Windows NT Workstation 4.0 Resource Kit
■ Microsoft Developer Network CD
■ Intel vTune documentation (most is also available from Intel’s Web site)
■ Computer Architecture: A Quantitative Approach, Hennessy and Patterson
■ Pentium (Pro) Processor System Architecture, Mindshare, Inc.
■ Inner Loops, Booth
■ The Indispensable Pentium Book, Messmer