Optimizing processor performance for Wintel applications
1998 Demand Technology, Inc.
Optimizing processor performance for Wintel applications: a case study
Demand Technology Software
1020 Eighth Avenue South, Suite 6, Naples, FL 34102
phone: (941) 261-8945 fax: (941) 261-5456
e-mail: markf@demandtech.com
http://www.demandtech.com
1998 Demand Technology, Inc. Wintel application tuning: processor optimization
Application tuning case study
■ The target application is a C program written to perform interval performance data collection for Windows NT
♦ It is important that this program perform well because “If you are not part of the solution, you are part of the problem.”
♦ Performance analysts use our product, and they are very demanding customers
Application tuning case study
■ At a point in development where the code was reasonably mature and stable, I subjected it to performance analysis using several tools.
♦ Microsoft Visual C++ version 5
♦ Rational Visual Quantify execution profiler
♦ Intel vTune version 2.5 optimization tool
Windows NT on Intel hardware
■ In order to use vTune effectively, it helps to understand how Intel processor hardware works
♦ extensive Intel processor documentation ships on the product CD
■ Target environment:
♦ Microsoft Windows NT 4.0
♦ Intel Pentium and Pentium Pro hardware
NT performance monitoring
■ Performance SeNTry™ collection agent
initialization
loop until cycle end = TRUE;
    Win32 API calls to retrieve performance data;
    calculate;
    Write data to file;
end loop;
Win32 performance monitoring API
■ Familiar and well-documented interface
♦ The only programmatic way to enumerate the Processes running on an NT system
♦ NT Performance data is structured as
  ■ Objects (records)
  ■ Counters (fields)
♦ Collection agents are associated with Objects
  ■ NT base Objects, including kernel Objects
  ■ extended Objects require a Perflib dll
Data Collection sets
■ Define proper subsets of a Master Collection set
♦ The Master Collection set defines all known Objects and Counters
♦ Some Objects are instanced: there can be multiple occurrences of instanced Objects
♦ Two Parent:Child relationships defined
  ■ Process is the parent of Thread
  ■ Physical Disk is the parent of Logical Disk
Data Collection sets
■ Performance considerations
♦ Data collection is performed one Object at a time
  ■ This was necessary due to a bug in the Win32 collection services
  ■ An n:1 correspondence between Objects and their associated collection routines
♦ With the exception of Child Objects
  ■ They are collected at the same time as the Parent Objects
♦ There are many instances of some Objects
  ■ Process and Thread
Data Collection sets
■ Performance considerations
♦ There are compelling reasons why data collection should be done at frequent intervals
  ■ identified by Buzen and Shum, 1996
♦ Performance data for processes that terminate before the end of the interval is lost
♦ one collection interval used for both Accumulator Counters (processor time) and Instantaneous Counters (e.g., processor Queue length)
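The difference between the two counter kinds named above can be sketched in C. This is a minimal illustration, not the collection agent's code: the function names and the assumption that raw processor time and elapsed time share the same units are hypothetical, and the real Win32 counter types come in more varieties than these two.

```c
#include <stddef.h>

/* Accumulator Counter: the raw value only ever grows, so the value for an
   interval is the difference between two readings, scaled by elapsed time.
   Returns percent busy; both arguments must use the same time units. */
double accumulator_percent(unsigned long long prev_raw,
                           unsigned long long curr_raw,
                           unsigned long long elapsed)
{
    if (elapsed == 0 || curr_raw < prev_raw)
        return 0.0;                 /* no interval, or the counter was reset */
    return 100.0 * (double)(curr_raw - prev_raw) / (double)elapsed;
}

/* Instantaneous Counter: each reading is a point-in-time sample, so the
   interval value is the mean of however many samples were taken. */
double instantaneous_mean(const unsigned *samples, size_t n)
{
    unsigned long long sum = 0;
    if (n == 0)
        return 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return (double)sum / (double)n;
}
```

With a single sample per interval, `instantaneous_mean` degenerates to that one reading, which is why more frequent collection matters more for Instantaneous Counters than for Accumulator Counters.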
Data Collection sets
■ Performance considerations
♦ Ideally, collection should be performed at least once per minute;
♦ possibly, some Objects could be collected even more frequently in order to accumulate samples of Instantaneous Counter values
♦ Can our code handle it?
Goals of the tuning exercise
■ Profile our code execution path so that we can understand it better
♦ Profilers eliminate a lot of idle speculation about what your code is doing
■ Better understand the Win32 services and their interaction with our code
♦ We cannot change these services, but perhaps we can interact with them in better ways
Goals of the tuning exercise
■ Evaluate code optimization strategies
♦ optimizing Compiler options
  ■ Pentium and Pentium Pro specific optimizations
♦ In-line assembler
♦ Code restructuring
♦ etc.
■ Feed results forward into the development process
VC++ code profiler
■ Built-in compiler option
■ Times program functions during run time
♦ Must run the application under the debugger
■ Creates a text report showing:
♦ Function time
♦ Function+Child Function time
♦ Hit Count
■ Example: Default Collection set once per second
VC++ code profiler output
Module Statistics for dmperfss.exe
----------------------------------
    Time in module: 283541.261 millisecond
    Percent of time in module: 100.0%
    Functions in module: 155
    Hits in module: 11616795
    Module function coverage: 72.3%

        Func           Func+Child          Hit
        Time    %         Time     %     Count  Function
---------------------------------------------------------
  248146.507  87.5  248146.507  87.5       249  _WaitOnEvent (dmwrdata.obj)
    8795.822   3.1    8795.822   3.1    393329  _WriteDataToFile (dmwrdata.
    4413.518   1.6    4413.518   1.6      2750  _GetPerfDataFromRegistry (dm
    3281.442   1.2    8153.656   2.9    170615  _FormatWriteThisObjectCount
    3268.991   1.2   12737.758   4.5     96912  _FindPreviousObjectInstanceC
    2951.455   1.0    2951.455   1.0   3330628  _NextCounterDef (dmwrdata.ob
VC++ profiler: Observations
■ Our program is “sleeping” 87.5% of the time!
■ Can only look at your program’s code
♦ If your function is spending all its time making Win32 API calls or calling other dlls, they are not visible
■ Parent-child relationships among modules are not readily apparent
Rational Visual Quantify
■ Add-on product
♦ Visual Studio “integration”
■ Select profiling at the level of the function call or the line
■ Adds instrumentation to each module during the runtime session
♦ Includes all shareable and relocatable exes and dlls called by your program!
Rational Visual Quantify
■ Reporting
♦ graphic view of your program’s critical execution path
  ■ breaks out dlls and some system services
♦ parent-child relationships among modules are explicit
♦ convenient navigation between views
■ Performs analysis of ∆ between two execution runs
Rational VQ: Observations
■ Added instrumentation affects absolute function time values observed
♦ We only spent 32% of our time “Sleeping”
♦ relative timing relationships between functions appear unaffected
■ App is very intuitive and easy to use
♦ e.g., relationships between function calls
■ Ability to trace module execution through 3rd party functions can be very useful!
Intel vTune
■ Standalone execution profiler
■ Relies on system-wide sampling
♦ maps the location of the Program Counter to the module in memory
♦ catches every program, including the OS
■ Optionally, can also be used to report on the Pentium/Pentium Pro performance metrics during program execution
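The sampling approach described above, mapping each sampled Program Counter value to the module loaded at that address, can be sketched as follows. This is an illustrative sketch only, not vTune's implementation: the `Module` record, the function names, and the assumption that the module table is sorted by base address are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* A loaded module's address range [base, base + size) and its sample count. */
typedef struct {
    const char   *name;
    uintptr_t     base;
    uintptr_t     size;
    unsigned long hits;
} Module;

/* Binary search over modules sorted by ascending base address.
   Returns the index of the module containing pc, or -1 if none does. */
int find_module(const Module *mods, size_t n, uintptr_t pc)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pc < mods[mid].base)
            hi = mid;                          /* pc lies below this module */
        else if (pc - mods[mid].base >= mods[mid].size)
            lo = mid + 1;                      /* pc lies above this module */
        else
            return (int)mid;
    }
    return -1;
}

/* Credit one sampled program counter value to its module.
   Returns 1 if the sample was attributed, 0 if no module covers pc. */
int record_sample(Module *mods, size_t n, uintptr_t pc)
{
    int i = find_module(mods, n, pc);
    if (i < 0)
        return 0;
    mods[i].hits++;
    return 1;
}
```

After enough samples, the `hits` fields form a histogram of where the processor spent its time, which is the raw material for a hotspot report.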
Intel vTune
■ High percentage of samples showed NT running the Idle Thread!
■ Switched to Master Collection set once per second to generate more activity
♦ Rational VQ overhead was too high to perform a comparable test
♦ Result: very different profile of activity
Intel vTune
■ Hotspot analysis showed two functions accounted for > 70% of the activity inside our process address space
♦ NextInstanceDef
♦ IsPreviousAndParentSameInstance
■ vTune analyzes x86 assembler code to assist you in taking advantage of the superscalar features of the P5
Intel processor performance overview
■ Complex Instruction set (CISC)
♦ Maintain upward compatibility with original 8-bit 8080 instruction set
■ With improvements in semiconductor fabrication, add
♦ pipelining, TLB, cache, branch prediction
  ■ 486
♦ elements of RISC superscalar processors
  ■ Pentium, Pentium Pro
Intel processor evolution
Processor    Year  Clock Speed (MHz)  Bus Width (bits)  Addressable Memory  Transistors
8080         1974  2                  8                 64K                 6,000
8086         1978  5-10               16                1 MB                29,000
8088         1979  5-8                8                 1 MB                29,000
80286        1982  8-12               16                16 MB               134,000
386 DX       1985  16-33              32                4 GB                275,000
486 DX       1989  25-50              32                4 GB                1,200,000
Pentium      1993  60-233             32                4 GB                3,100,000
Pentium Pro  1995  150-200            64                4 GB                5,500,000
Pentium II   1997  233-333            64                4 GB                7,500,000
Intel processor evolution
Processor    Highlights
8080         1 chip microprocessor
8086         10X performance of 8080
8088         8 bit version of 8086
80286        Virtual Memory
386 DX       32 bit Registers
486 DX       L1 Cache; pipelined
Pentium      dual integer pipeline
Pentium Pro  microarchitecture
Pentium II   Dual bus; MMX
Intel x86 pipeline
■ 5-stage pipeline introduced with 486
♦ Integrated 8K data and instruction cache
♦ Prefetch
♦ Instruction Decode 1
♦ Instruction Decode 2 (address calculation)
♦ Execute
♦ Write Back
PF D1 D2 EX WB
Intel x86 pipeline expectations
Instruction 1  PF D1 D2 EX WB
Instruction 2     PF D1 D2 EX WB
Instruction 3        PF D1 D2 EX WB
Instruction 4           PF D1 D2 EX WB
Instruction 5              PF D1 D2 EX WB
Intel x86 pipeline performance
■ Processor performance ≈ Clock speed / cycles per instruction (CPI)
■ Intel instruction set complexity reduces the effectiveness of pipelining
♦ “Integer” instruction cycle times range from 1-9
♦ rep instruction prefixes have 4 clock startup overhead
♦ 32-bit address far call takes 22 clock cycles
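To see how a few slow instructions pull the average away from the pipeline's ideal of one cycle per instruction, the effective CPI of a workload can be computed as a weighted average over its instruction mix, and the instruction rate as clock rate divided by CPI. The sketch below is illustrative only: the type and function names are hypothetical, and any mix fractions fed to it are assumptions rather than measurements from this case study.

```c
#include <stddef.h>

/* One class of instruction in a hypothetical workload mix. */
typedef struct {
    double fraction;   /* share of executed instructions, 0.0 to 1.0 */
    double cycles;     /* clock cycles each such instruction takes */
} InstrClass;

/* Weighted-average cycles per instruction for the whole mix. */
double effective_cpi(const InstrClass *mix, size_t n)
{
    double cpi = 0.0;
    for (size_t i = 0; i < n; i++)
        cpi += mix[i].fraction * mix[i].cycles;
    return cpi;
}

/* Instructions per second = clock rate (Hz) / CPI. */
double instructions_per_second(double clock_hz, double cpi)
{
    return clock_hz / cpi;
}
```

For example, an assumed mix of 90% one-cycle instructions, 9% nine-cycle instructions and 1% 22-cycle far calls gives an effective CPI of about 1.93, nearly double the ideal.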
Intel 586 dual integer pipeline
■ Improved processor performance (CPI) because certain instruction pairs can execute in parallel
        D1 D2 EX WB   u pipe
PF
        D1 D2 EX WB   v pipe
Intel 586 dual integer pipeline
■ Rules for instruction pairing are arcane
♦ One-cycle instructions can usually be paired
  ■ a RISC subset within a CISC
♦ Instructions with immediate operands or address displacements cannot be paired
♦ If there is an explicit Register dependency between instructions, there is no pairing
  ■ x86 only contains 8 General Purpose Registers
■ RISC processors depend on compilers that understand how to exploit them!
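The register-dependency rule above has a source-level analogue. In the hedged sketch below (hypothetical function names, not code from the collection agent), the first loop carries one chain of dependent adds, while the second splits the work into two independent chains that a dual pipeline could in principle issue together. Whether a given compiler actually emits pairable instructions for either version is not guaranteed and would need to be checked in the generated assembler.

```c
#include <stddef.h>

/* Single accumulator: every add depends on the result of the previous add. */
long sum_dependent(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Two accumulators: the two adds in each iteration use different
   registers' worth of state and do not depend on each other. */
long sum_two_chains(const int *v, size_t n)
{
    long even = 0, odd = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        even += v[i];
        odd  += v[i + 1];
    }
    if (i < n)              /* leftover element when n is odd */
        even += v[i];
    return even + odd;
}
```

Both functions return the same sum; only the shape of the dependency chain differs.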
Intel 586 dual integer pipeline
■ Intel only manufactures hardware, so…
♦ Intel introduces processor measurements so that you can collect actual CPI performance statistics
♦ Intel introduces vTune to assist developers in analyzing software developed to run on its hardware
VC++ code generation options
■ Try different “G” Optimization switches...
♦ G5 → P5 optimization
  ■ some improvement, but the assembler code generated in the hot spot was unchanged.
VC++ code generation options
■ C language code:
PERF_INSTANCE_DEFINITION * NextInstanceDef ( PERF_INSTANCE_DEFINITION *pInstance )
{
    PERF_COUNTER_BLOCK *pCtrBlk;

    /* The counter block starts immediately after the instance definition. */
    pCtrBlk = (PERF_COUNTER_BLOCK *)
        ((PBYTE)pInstance + pInstance->ByteLength);
    /* The next instance starts immediately after that counter block. */
    return (PERF_INSTANCE_DEFINITION *)
        ((PBYTE)pInstance + pInstance->ByteLength + pCtrBlk->ByteLength);
}
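The same pointer arithmetic can be tried outside Windows with a byte buffer standing in for the performance data. The sketch below is a portable approximation under stated assumptions: it models only the leading 32-bit ByteLength field of each structure, and `read_len`, `next_instance`, `append_instance` and `count_instances` are hypothetical helper names, not part of the Win32 API or of the collection agent.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read the 32-bit ByteLength field assumed to sit at the start of a structure. */
uint32_t read_len(const unsigned char *p)
{
    uint32_t n;
    memcpy(&n, p, sizeof n);     /* avoids alignment and aliasing problems */
    return n;
}

/* Same arithmetic as NextInstanceDef: skip the instance definition,
   then skip the counter block that follows it. */
const unsigned char *next_instance(const unsigned char *inst)
{
    uint32_t def_len = read_len(inst);
    uint32_t blk_len = read_len(inst + def_len);
    return inst + def_len + blk_len;
}

/* Build test data: write an instance definition of def_len bytes followed by
   a counter block of blk_len bytes at offset off. Returns the next offset. */
size_t append_instance(unsigned char *buf, size_t off,
                       uint32_t def_len, uint32_t blk_len)
{
    memcpy(buf + off, &def_len, sizeof def_len);
    memcpy(buf + off + def_len, &blk_len, sizeof blk_len);
    return off + def_len + blk_len;
}

/* Count instances by stepping from first until end is reached. */
size_t count_instances(const unsigned char *first, const unsigned char *end)
{
    size_t n = 0;
    while (first < end) {
        first = next_instance(first);
        n++;
    }
    return n;
}
```

Each step costs two loads and two adds, which is why a function this small can still dominate a profile when it is called once per instance across thousands of Process and Thread instances.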
VC++ code generation options
■ Generates tight Assembler code
00408D40 mov ecx,dword ptr [esp+4]     ; ecx = pInstance
00408D44 mov edx,dword ptr [ecx]       ; edx = pInstance->ByteLength
00408D46 mov eax,dword ptr [ecx+edx]   ; eax = pCtrBlk->ByteLength
00408D49 add eax,ecx
00408D4B add eax,edx                   ; eax = address of the next instance
00408D4D ret
Intel 686 microarchitecture
■ Increased parallelism
■ Instructions are translated into RISC micro-instructions
■ Pool of 40 GP pseudo-Registers
■ Micro-instructions can be executed out of sequence
■ CPI ≈ rate at which instructions are retired
Intel 686 microarchitecture
■ Performance not nearly so dependent on actual code generation, since the processor has the capability to unwind instructions and execute them out of order.
■ my vTune test runs were made on a P6!
♦ vTune does not analyze code running on a P6 - there should be little or no need to
♦ Note: internally, the Pentium II is a P6
Epilogue
■ Rational VQ and Intel vTune tests gave very different execution profiles
♦ Could we isolate the Collection set dependency?
♦ Yes, returned to VQ using a Collection set that included Thread Objects and duplicated the vTune results
■ VQ helped us zero in on execution path issues
♦ but there was considerable measurement overhead
■ P6 probably makes vTune obsolete
Where to get more information
■ Windows NT Workstation 4.0 Resource Kit
■ Microsoft Developer Network CD
■ Intel vTune documentation (most is also available from Intel’s Web site)
■ Computer Architecture: A Quantitative Approach, Hennessy and Patterson
■ Pentium (Pro) Processor System Architecture, Mindshare, Inc.
■ Inner Loops, Booth
■ The Indispensable Pentium Book, Messmer