
Design Automation for Embedded Systems, 9, 101–121, 2004. © 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.

A Framework for Data Partitioning for C++ Data-Intensive Applications

A. MILIDONIS [email protected]
G. DIMITROULAKOS
M. D. GALANIS
A. P. KAKAROUNTAS

VLSI Design Lab., Dept. of Elect. and Comp. Eng., University of Patras, Rio 26110, Greece

G. THEODORIDIS

Section of Electr. & Computers, Dept. of Physics, Aristotle Univ. of Thessaloniki, 54124, Greece

C. GOUTIS

VLSI Design Lab., Dept. of Elect. and Comp. Eng., University of Patras, Rio 26110, Greece

F. CATTHOOR

IMEC, Kapeldreef 75 B-3001 Leuven, Belgium

Abstract. We present an automated framework that partitions the code and data types for the needs of data management in an object-oriented source code. The goal is to identify the crucial data types from a data management perspective and separate these from the rest of the code. In this way, the design complexity is reduced, allowing the designer to easily focus on the important parts of the code to perform further refinements and optimizations. To achieve this, static and dynamic analysis is performed on the initial C++ specification code. Based on the analysis results, the data types of the application are characterized as crucial or non-crucial. Continuing, the initial code is rewritten automatically in such a way that the crucial data types and the code portions that manipulate them are separated from the rest of the code. Experiments on well-known multimedia and telecom applications demonstrate the correctness of the performed automated analysis and code rewriting as well as the applicability of the introduced framework in terms of execution time and memory requirements. Comparisons with Rational's Quantify™ suite show the failure of Quantify™ to analyze correctly the initial code for the needs of data management.

Keywords: background memory management, data type, analysis, automated flow, rewriting

1. Introduction

Current and future multimedia applications, and others with similar behavior (e.g. protocol network applications), are characterized by high complexity, diverse functionality, huge amounts of data transfers, and large data storage requirements [5]. They are usually described by large specification codes using high-level OO-based description languages such as C++ or SystemC. The conventional design procedure for such applications starts by studying and analyzing the initial code to identify its crucial parts in terms of different design factors such as performance, data transfers and storage needs, and power consumption. Next, these crucial parts are refined, optimized, and mapped to predefined or custom-developed platforms [11]. In the majority of cases, only a small amount of the code—that is distributed sparsely over small sets of code lines—is important in terms of the considered design and quality factors [9, 17]. Given that modern applications are described by hundreds of thousands of code lines, there is a strong need to automatically identify these crucial parts and separate them from the rest of the code.

For the data processing itself, the identification of the crucial kernels (usually inner loops) has been treated quite well in the past. But since modern applications are data-dominated, a distinct step called background data management (or memory management), which is responsible for mapping the application's data types to the background memory, should be included. This step aims at reducing the (complex) data transfers and storage requirements of the application and mapping the data structures efficiently on the hierarchical memory organization of the underlying platform [3, 5, 7]. It must be noticed that the background memory consists of larger on- or off-chip memories that require a separate (set of) access cycle(s).

In contrast, the foreground memory management that is present in most conventional compilers focuses on the register files and the registers in the data-path. In [12, 16], algorithms for register allocation are presented. They use a clique-separator technique to reduce space overhead by coloring the interference graphs of code portions and by making live-range splitting and spilling decisions, with the goal of moving overhead operations to infrequently executed parts of a program.

To reduce design complexity and to increase the exploration freedom, the first sub-step of (background) memory management is to classify the application's data types into crucial and non-crucial ones from the data management perspective. Crucial data types are considered to be the ones whose refinement has a large impact on important design factors such as the total memory's power consumption, performance, and total memory size. After the identification of the crucial data types, code transformations and optimizations should be applied only to these data types to reduce the data transfers and storage requirements, while the code associated with the less important data types should not yet be considered. The latter should indeed be postponed to the foreground management step, where the decisions on small data types and scalars really belong (because only then can they be correctly traded off against other data-path related issues to which they have a strong coupling).

As a general illustration of the design-time impact of the foreground/background decision step, Figure 1 shows the amount of design time that was required for an MPEG-4-based video application during a project at IMEC for C-based source code, where similar considerations apply [5]. On the left side, the MPEG-4 design is shown with a two-month penalty for distinguishing the background memory's data types fully manually. On the right side, it is shown by estimation that the design time without the above foreground/background decision (and the resulting pruning of the source code) would be two times larger.

Considering the complexity and the large sizes of the initial specification codes of modern applications, the development of a fully automated flow for partitioning the initial specification code into crucial and non-crucial parts is necessary for the needs of data management, also in object-oriented code. Such a partitioning has not been tackled earlier in the literature. To the best of our knowledge, at this point there is no academic or commercial framework that performs the above partitioning fully automatically and efficiently.

In this paper, an automated framework for deriving the code's crucial parts from the data management perspective and automatically isolating them from the rest of the code is presented.


Figure 1. Shortening of exploration time by data type separation into foreground/background memory.

First, static and dynamic analysis is performed on the initial C++ code. Based on the analysis results, the data types are characterized as crucial or non-crucial. Then, the code is rewritten automatically such that the data to be stored in foreground memory and the associated code portions that manipulate them are placed inside newly generated functions and are abstracted from the code portions related to the background memory's data. A set of experiments on well-known applications shows, on one hand, the correctness of the performed analysis and automated rewriting and, on the other hand, the efficiency of the proposed framework in terms of execution time and memory requirements. Moreover, comparisons with Rational's Quantify™ suite demonstrate the failure of existing analysis frameworks like Quantify™ to correctly perform the required analysis, while the rewriting step is missing entirely.

The paper is organized as follows: In Section 2, previous work on data management and existing code partitioning tools is discussed. In Section 3, the proposed framework is introduced, and its implementation is discussed in detail in Section 4. The experimental results on well-known multimedia applications and comparisons with Rational's Quantify™ are given in Section 5. Finally, the conclusions are presented in Section 6.

2. Data Management and Code Partitioning

One of the major bottlenecks of modern applications is the huge amount of data transfers and storage requirements. This results in high power dissipation and performance degradation [3, 5, 7]. Experiments have shown that the energy consumption due to the memories and busses of the initial un-optimized code exceeds by far the consumption of the data path and control logic. Also, the data transfers between the data path and memory bound the system's performance, while the memory hierarchy dominates the silicon area [3, 5, 20]. Thus, many data management techniques, also called memory management techniques, have been proposed in recent years [5, 7, 14, 18, 19]. Good overviews can be found in [3, 20].

The basic concept of all the data management techniques is to apply code refinements and optimizations to specific code portions, which are responsible for high memory energy, in order to reduce the data transfers or to derive a memory architecture where the frequent memory accesses are performed on small memories. It is worth mentioning that the code transformations for data management are usually applied before the conventional compiler transformations [5].

In [5–7], the Data Transfer and Storage Exploration (DTSE) methodology has been proposed by IMEC. It starts by analyzing the initial specification code, identifying its crucial parts, and manually separating them from the rest of the code. Next, platform-independent control- and data-flow transformations are applied to the code's crucial parts to increase the locality and regularity of memory accesses, remove redundant data transfers, and exploit data reuse opportunities. Afterwards, platform-dependent transformations take place to increase the data bandwidth and derive the memory organization.

In the Bologna-Torino cluster, a focus on memory partitioning has also been initiated. E.g., in [18], after performing profiling on the initial code and identifying the memory access patterns, the use of a special scratch-pad memory called Application Specific Memory (ASM), or memory partitioning according to the application's memory access patterns, has been proposed. Thus, the frequent memory accesses are performed on small on/off-chip memories, resulting in reduced energy dissipation.

At Irvine, the main focus has been on memory-aware compilation. E.g., in [19] the use of an on-chip scratch-pad memory is proposed to reduce the energy due to memory accesses. Also, the application of code transformations for minimizing the power consumption of the on-chip and the off-chip (DRAM) memory has been presented. At Penn State, the main focus has been on transformations. E.g., in [13, 14], special memory management techniques for improving the energy consumption of the cache memory have been presented. It must be stressed that none of the above approaches has a direct automated way to partition the code into background and foreground memory portions (crucial and non-crucial data types). In [22], an algorithm is introduced that analyzes the application and selects program and data parts which are placed into the scratchpad memory. The power consumption is reduced significantly compared to a cache solution. Also, several other groups have recently started working on such issues.

Other important issues are the partitioning of data types with large memory storage requirements into smaller ones so that they can be stored in scratch-pad memories, the application of loop transformations that reduce energy dissipation, and the proposal of a data memory layout that takes into account the address pointer assignment [10, 25].

Considering the complexity and the large sizes of the initial code descriptions of modern applications, some tools to identify the important parts of the code for data management needs have been presented. But it will be shown that these lack important features for our context, since none of them characterizes correctly the data types as crucial or non-crucial from the memory management perspective, and they do not automatically separate the code's crucial parts from the non-crucial ones.

Rational's Quantify [Rational] focuses on function calls, and its output is a dynamic function flow graph, without giving any information about which parts of the code contain the largest amount of memory accesses and which are the most frequently accessed data types.

IMEC's ATOMIUM [ATOMIUM] is another software suite that deals with the desired profiling for memory management. ATOMIUM offers a complete code analysis, but with respect to memory management it still lacks two features. First, it works only for C code, while a growing portion of modern multimedia applications are described in C++. Although the EDG compiler [EDG] offers a C++ to C conversion, it turns out not to be sufficient for further manipulation of the rewritten code. That occurs because all data types and function names change during the conversion, while new intermediate data types are introduced. For that reason, the original C++ code should be used for dynamic analysis, since EDG introduces some data access overhead. The second currently missing feature of ATOMIUM is that the produced C code cannot always be executed, as the compiler in its current version does not support some of the constructs related to the C conversion of dynamic C++ instructions (e.g. new, delete). So, the partitioning of the crucial data types and the associated code that manipulates them should be achieved on the original C++ code. Certainly, that step would have to be performed manually in the pre-processing step, since ATOMIUM doesn't perform it automatically.

3. General Description of the Introduced Framework

3.1. Code Partitioning in Three Hierarchical Layers

As mentioned, the proposed framework aims at identifying the application's crucial data types and separating them from the rest of the code. To achieve this, the initial code should be separated into at least two distinct hierarchical layers [5]. The first layer contains the crucial data types (usually arrays), as far as memory energy consumption, memory access bandwidth, or size is concerned. It also contains the code portions (nested loops and manifest conditions) that manipulate the crucial data types. The second layer, which is called by functions from the first layer, includes the rest of the code. Of these layers, only the first one is considered for exploration during the background data management step. The second layer, which can still contain small non-crucial arrays, is handled by the foreground memory management (register-oriented) steps, which are tightly integrated in more conventional instruction-level compilers.

To perform such code partitioning, profile information needs to be extracted by analyzing the initial code. Then, the data types are characterized as crucial or non-crucial in terms of background memory management. Afterwards, the code is rewritten automatically so that the crucial data types and the associated code portions are separated from the rest of the code, resulting in the construction of the two layers. If several "dynamically created tasks" are present, then the above approach is performed per "dynamically created task". We then also foresee another "top layer" that contains these tasks (usually implemented by means of a thread library). Hence, in total we have three layers, of which the data management related one is the middle Layer 2.

3.2. The Proposed Flow

Figure 2 illustrates the proposed flow. Initially, in the Preprocessing/Analysis step, the C++ source code is converted to C, and useful profiling information is extracted by performing static and dynamic analysis. Continuing, in the Data Type Classification step, all data types are assigned a weight produced by a cost function determining how crucial they are from the memory management perspective. The cost function, which can be provided by the user, is a function of the size and the access frequency of the data type. Finally, in the rewriting step, the initial C++ source code is rewritten automatically by separating it into Layers 2 and 3.

Figure 2. The proposed flow.

In C++, a data type may be defined as an array of class instances. Our approach focuses on arrays that are class states or local arrays inside function bodies, because these are the ones that are considered as unique structures in terms of background memory management. The code portions that manipulate them should be considered for moving from Layer 2 (data management layer) to Layer 3 (register management layer) and vice versa. For this consideration, we take into account whether the objects (instances of each class that contains crucial arrays) contribute significantly to the memory's total power consumption. An array is characterized as crucial if, during the execution of the code, it is accessed very frequently and/or it requires a large amount of memory for storage.

3.3. Compilation and Instrumentation Issues

The proposed code separation does not prevent any conventional compiler from applying the required code transformations. As has been mentioned, the data management precedes the conventional instruction-level compiler optimizations. Thus, after the application of code transformations related to memory management, the whole code is recombined and fed to the compiler. Also, as explained in the code-rewriting step further on, no linker-related problems arise. Finally, it is worth mentioning that the proposed approach does not perform the actual code transformations for the data management needs but only identifies and separates the initial code into crucial and non-crucial parts.

To implement the above automatically, some existing software frameworks have beenused. The EDG compiler [EDG] is used to convert the initial C++ source code to C code.


This transformation has been adopted since good public-domain software tools are available that extract important profile information from C source code in an efficient way. The SUIF2 compiler [1] is such a tool, and it is used to obtain static analysis information. Since it takes only C code as input, the information that is extracted must be correlated with the original C++ source code. It must be stressed that the produced C code is not used for further manipulation (code refinement and transformations) but only for performing static analysis. As already mentioned, that would introduce unnecessary and undesired restrictions to our analysis. Finally, the lexical analyzer LEX [15] is used for adding to the original C++ code special structures (e.g. counters, cost functions, etc.) that are used for gathering information during the dynamic analysis.

4. Detailed Description

In this section, the implementation of the proposed flow is discussed in detail. For a better understanding of its contribution, all the steps presented in the previous section are applied to a small but realistic application code. Figure 3 shows the initial C++ description code of the multimedia full search algorithm. It contains a motion estimation procedure in which the data types (arrays) cur, prev, vec x, and vec y are processed. Next, all the analysis steps and those that process their results are applied to this code.
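Since Figure 3 itself is not reproduced in this transcript, the following is a purely hypothetical reconstruction of its outline. The array dimensions are assumptions chosen only to be consistent with the static sizes reported later in Tables 1 and 3; the actual code of Figure 3 may differ.

    // Hypothetical outline of the full search code of Figure 3 (not the
    // actual figure). Dimensions are assumed so that the static sizes
    // match Tables 1 and 3: 576 x 704 shorts = 811,008 bytes per frame,
    // and (576/16) x (704/16) shorts = 3,168 bytes per vector array.
    const int H = 576, W = 704;     // frame size (assumed)
    const int B = 16;               // macroblock size (assumed)

    class full_s {
        short curr[H][W];           // current frame   (811,008 bytes)
        short prev[H][W];           // previous frame  (811,008 bytes)
        short vec_x[H / B][W / B];  // motion vectors, x (3,168 bytes)
        short vec_y[H / B][W / B];  // motion vectors, y (3,168 bytes)
    public:
        void fs_motion_estimation();  // exhaustive block search on curr/prev
    };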

4.1. Pre-Processing and Analysis

Profile information always needs to be extracted before memory management exploration on the given application. The motivation of the analysis is to get good knowledge of which of the application's data types are most frequently accessed and which require a lot of memory space to be stored. In this step, two kinds of analysis are performed, namely static and dynamic analysis.

Static Analysis

Figure 4 shows the flow that is followed during static analysis. First, the EDG compiler generates C source code from the initial C++ code. Continuing, using properly developed scripts, the SUIF2 compiler operates on the C code, and information about the Function Flow Graph (FFG), the structure definitions, and their static sizes is extracted.

The static size of a structure is considered to be the total memory space that is needed for storing an instance of that structure. As mentioned, a structure is considered as an array, and its static size is the sum of the memory storage space of all its elements. Afterwards, the step traces all the structural declarations and then the names of their instances. However, the EDG compiler slightly changes the names of the functions and class instances when converting from C++ to C source code. For that reason, properly developed scripts are employed to reverse these changes and reveal the original names of the initial C++ code.

Figure 5 shows a part of the developed SUIF2 script for identifying all of the class names and their corresponding objects. It must be noticed that the SUIF2 compiler stores all parsing information gained from the input code inside special objects from the SUIF2 library.


Figure 3. Motion estimation—full search C++ code.

Figure 4. Static analysis flow.


Figure 5. SUIF2 script for identifying classes and objects.

The script calls specific functions from that library that output the needed information. For this implementation, a SUIF2 technique called an iterator is used for searching for special structures inside the code. This is described in lines 2–4, where all VariableSymbol objects of the SUIF2 intermediate format are searched for inside the file produced by the SUIF2 compiler. These objects store the information of all class and object names of the initial C++ code input. By calling the proper functions from the SUIF2 library (Lines 6–8), this information is extracted. Lines 9 and 10 print the object names and their related class names in correspondence. In the application example of Figure 3, since there is no object declared (for reasons of keeping the amount of code short), the output of this procedure will be the name of the class: full s.
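Figure 5 is likewise not reproduced, and the exact SUIF2 class and accessor names vary across versions. The following self-contained C++ analogue therefore only sketches what the script does; the types in it are stand-ins, not the real SUIF2 library classes.

    #include <iostream>
    #include <string>
    #include <vector>

    // Stand-in for the information carried by SUIF2's VariableSymbol
    // objects (an assumption, not the real SUIF2 class).
    struct VariableSymbolInfo {
        std::string object_name;  // instance name in the original C++ code
        std::string class_name;   // name of the class it instantiates
    };

    // Analogue of the Figure 5 loop: print each object name together
    // with the name of its class.
    void list_objects(const std::vector<VariableSymbolInfo> &symbols) {
        for (std::size_t i = 0; i < symbols.size(); ++i)
            std::cout << "object: " << symbols[i].object_name
                      << "  class: " << symbols[i].class_name << '\n';
    }

    int main() {
        // For the code of Figure 3 no object is declared, so the
        // procedure outputs only the class name: full_s.
        std::vector<VariableSymbolInfo> syms;
        VariableSymbolInfo v = {"", "full_s"};
        syms.push_back(v);
        list_objects(syms);
        return 0;
    }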

In Figure 6, a part of another SUIF2 script is shown. With this script, all the class members of the input code are identified, and for each one of them the information about its static size is extracted. Also, the information about the current class is output. For this implementation, two iterators are used: one for scanning all the GroupType objects (Lines 2–4), which store information about each class, and one for scanning all FieldType objects (Lines 5–7), which store information about each member of a class. After calling the proper SUIF2 library functions (Lines 8–17), the procedure outputs the name of the current class member (Line 18), its size in bytes required for its storage in memory (Line 19), and the name of the class that this member belongs to (Line 20). The output of the procedure for the experimental code of Figure 3 is shown in Table 1.

Figure 7 shows a part of a SUIF2 script which identifies all locally declared arrays and variables, their memory size, the name of the function in which they are declared, and the name

Table 1. Information Extracted for Each Class Member

Class member name    curr       prev       vec x
Static size          811,008    811,008    3,168
Class name           full s     full s     full s


Figure 6. SUIF2 script for extracting the memory size of the members of all classes.

Figure 7. SUIF2 script for identifying local arrays inside functions.


Table 2. Information Extracted for Each Locally Declared Data Type

Local data type       p1                      p2                      min                     dist
Array/Variable        Variable                Variable                Variable                Variable
Static size (Bytes)   32                      32                      32                      32
Function name         fs motion estimation    fs motion estimation    fs motion estimation    fs motion estimation
Class name            full s                  full s                  full s                  full s

of the corresponding class that this function belongs to. In this way, the flow extracts all the information needed concerning the application's global and local arrays. The steps that follow use this information.

For this implementation, two iterators are used: one for finding all functions and the information related to them, stored in an object called ProcedureDefinition (Lines 2–5), and one for finding the information for all the variables that are declared locally in these functions. The extracted information is stored in the object VariableSymbol (Lines 7–9). After calling proper functions from the SUIF2 library for identifying the static size of each data type (Lines 12–14), the procedure examines whether it is a variable or an array (Lines 18–19) and prints the data type's name, its static size, and the function in which it is declared locally (Lines 20–22). As mentioned before, the EDG compiler changes the names of the functions by extending them with the names of the classes they belong to. We have developed C scripts that abstract this information. The results of applying these scripts to the code of Figure 3 are shown in Table 2.

Dynamic Analysis

Dynamic analysis needs to be performed to find out which of the data types are most frequently accessed. Figure 8 explains the way this task is accomplished.

After static analysis, all class, object, and array names are known. Also, all the functions of each class have been identified. The proposed flow automatically uses properly developed scripts, which are used by LEX and operate on the initial C++ source code. The goal is to place an increment instruction of a variable underneath each instruction of the initial code that accesses an array. For each array, a unique variable is declared for counting its accesses. Another function is also placed inside the code, which has the mission of printing the number of accesses for each array before execution ends. The above is accomplished using C scripts that generate LEX scripts, which identify the end of the main function of the initial C++ code. Then, they write before it printing instructions for the memory accesses of each of the code's arrays. In addition, further scripts are used to create some extra header files for storing the new data types and counting the arrays' accesses and memory sizes. Also, extra scripts have been developed to realize the function for printing the results of dynamic analysis. Next, the code is executed, and all information about the accesses of each array is extracted.
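As a minimal sketch of what the instrumented code conceptually looks like (the identifier pattern follows the counting instruction quoted later in this section; the exact code emitted by the generated LEX scripts may differ):

    #include <cstdio>

    // Per-array bookkeeping; such records are emitted into the extra
    // header files by the flow's generated scripts (names assumed).
    struct array_info { unsigned long counter; unsigned long size_bytes; };

    array_info prev_full_s_info = {0, 811008};  // one record per array
    array_info curr_full_s_info = {0, 811008};

    // A counting instruction is placed underneath every instruction
    // that accesses an array, e.g.:
    //     diff = curr[y][x] - prev[y + dy][x + dx];
    //     curr_full_s_info.counter++;
    //     prev_full_s_info.counter++;

    // Printing function inserted just before the end of main():
    void print_dynamic_analysis() {
        std::printf("prev accesses: %lu\n", prev_full_s_info.counter);
        std::printf("curr accesses: %lu\n", curr_full_s_info.counter);
    }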


Figure 8. Dynamic analysis flow.

As described before, all array names and the classes they belong to are well known from static analysis. The previously mentioned C scripts use this information to produce LEX scripts that place counting instructions underneath each code instruction in which an array is accessed. Such a script is shown in Figure 9.

As can be seen, the counting instruction "prev full s info.counter++;" is placed underneath each appearance of the array named "prev" in any instruction of any procedure of the class named "full s". Initially, the script searches for the beginning of the body of a function that belongs to class full s (Line 5). Then, while parsing only inside the current function body (Lines 6–7), if the word "prev" appears (Line 8), a counting instruction is printed by our parser after the end of the current instruction (Lines 18–19).

Figure 9. LEX script for placing counters inside the initial code.


Figure 10. Weight assignment procedure.


4.2. Data Type Partitioning

All the above information is going to be used for deciding which data types are the crucial ones from the background data management perspective. For that reason, a cost function must be employed that uses the extracted information from static and dynamic analysis and assigns a weight to each data type. In that way, a measure of how crucial each data type is, is obtained. Figure 10 shows the assignment of weights to the data types.

In order to decide whether a data type is crucial from the data management perspective, its static size and its number of accesses during execution time are taken into account. This information is important since the data management optimization phase that takes place afterwards needs to focus on arrays that require a large amount of memory storage and are accessed very frequently, in order to contribute significantly to reducing the system's total power consumption. The proposed framework uses a simple cost function, which is given in equation (1). k1 is a parameter defined by the designer for approximating the number of accesses for each element of an array, while k2 is the parameter for approximating the static size of the portion of an array that is usually accessed during execution. k2 is important because a data type doesn't use its entire memory space at each instance when it is processed. The designer should also be able to give as an input, using this parameter, the exact required memory space of each data type that is usually processed, instead of the total declared size, which is usually overestimated. The main reason for letting the designer define k1 and k2 is that there are many different cases of accessing arrays that couldn't all be included in one cost function.

However, depending on the designer's intentions, the weights in this cost function may change, since a new one can be given as an input. This is efficient, since all the necessary information for separating the code into portions important for foreground and background data memory management is provided by the proposed framework.

Weight = k1 × (number of accesses) × k2 × (static size)    (1)
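In code, the classification amounts to evaluating equation (1) per data type and comparing against the designer-supplied threshold. A minimal sketch follows (the record layout and names are illustrative, not the tool's actual implementation):

    #include <string>
    #include <vector>

    // One record per data type, filled from the static and dynamic analyses.
    struct data_type_info {
        std::string name;
        unsigned long accesses;     // from dynamic analysis
        unsigned long static_size;  // from static analysis
        double weight;
        bool crucial;
    };

    // Equation (1): Weight = k1 * (number of accesses) * k2 * (static size).
    // k1, k2, and the threshold are supplied by the designer.
    void classify(std::vector<data_type_info> &dts,
                  double k1, double k2, double threshold) {
        for (std::size_t i = 0; i < dts.size(); ++i) {
            data_type_info &d = dts[i];
            d.weight  = k1 * d.accesses * k2 * d.static_size;
            d.crucial = (d.weight > threshold);
        }
    }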


Table 3. Analysis Results for the Full Search Algorithm Code

Data types    # of Accesses    Storage size (bits)    Weight
cur           5,702,400        811,008                4.62E+12
prev          5,436,736        811,008                4.41E+12
vec x         2,509            3,168                  7.95E+06
vec y         2,509            3,168                  7.95E+06

After the weight assignment to each instance of the initial code's data types, the designer should determine the threshold beyond which each instance is considered crucial. This is given to the framework as an input.

Applying the above to the experimental code of Figure 3 and automatically executing the proposed pre-processing/analysis step, static and dynamic analysis information is extracted for each array. Table 3 shows the number of accesses for each array during execution time and the number of bits required for their memory storage. Continuing, during the data type classification step, a weight is assigned to each array according to the weight function. The results are also shown in Table 3.

As can be seen, the weights of the cur and prev arrays are about a million times larger than those of vec x and vec y. For that reason, the first two data types are considered crucial and are stored in the background memory, while the last two should be stored in foreground memory. In this simple code this is also obvious to the developer, but for complex codes with dozens of arrays and thousands of lines of high-level source code, it is not trivial at all. Automated analysis is then essential.

4.3. Code Rewriting

The next step is to separate the code portions that manipulate the non-crucial data types by placing them in Layer 3. This is accomplished by means of properly developed scripts. Using these scripts, the rest of the code is placed in Layer 2. As mentioned, Layer 2 includes the code that manipulates the crucial data types and the function calls to the newly generated functions, which correspond to the code placed in Layer 3.

Usually, the bottleneck of the power consumption in the memories tends to appear in a small fraction of the code, which covers about 10% of the total amount of code but is typically highly scattered across the entire code. These code portions manipulate the crucial arrays, while the remaining 90% of the code accesses only non-crucial ones. For that reason, there will be many functions of the initial code whose body will be totally characterized as code for Layer 3 and a few others that will include code portions that manipulate crucial arrays. That leads to the conclusion that the overhead caused by creating new functions that will include code for Layer 3 will not increase the code size and the execution time significantly. For this task, the main functionality of the flow's scripts is the identification of class function bodies and the basic blocks inside them. A basic block is considered to be a group of instructions that contains no jump instructions, except for function call instructions; only the last instruction of a basic block may be a jump instruction.


Figure 11. Code rewriting procedure.

Rewriting the initial C++ code as described above, no compilation problems occur, and the functionality of the program is identical to that of the initial one. It must be noticed that the output code of the rewriting is as readable and as understandable as the original code. Only the function hierarchy is partly changed. In principle, a postprocessing step could even be added to produce again the selectively inlined code for the designer after the application of a number of actual memory-oriented transformations. In this way, the designer who will use the produced code as input for a memory management optimization step will not meet any extra difficulties in understanding and transforming the code. For the same reason, neither is the optimizing compiler perturbed.

By separating the crucial code from the non-crucial code, the previously discussed hierarchical layers are formed. Figure 11 shows the automated rewriting procedure. At this point, the designer has clear advantages concerning the task of background memory management exploration. He has very good knowledge of the code's data types, their memory sizes, the names of their instances, and the number of times they are accessed during execution time.

Additionally, because the code of Layer 3 is fully separated into different functions, it can be kept in a separate file that is protected so that no errors can be introduced. In this way, the manual application of the transformations that follow can only have an impact on the Layer 2 code. This clearly limits the verification and debugging effort, as also shown in Figure 1. All the above information, combined with the automated way of extracting it, leads to a seriously reduced design time for the given application.

Figure 12 shows a part of a LEX script for identifying basic blocks. Initially, a function body declaration is identified (Line 1). The next step is to identify each discrete basic block inside the current function body. Lines 2–16 describe some different cases in which a basic block may appear. Usually, it consists of a group of instructions between or inside the scope of statements like for, while, if, and do. To identify the whole scope of those statements, the parser uses the code in lines 19–31, where the symbols '{' and '}' are taken into account.

We have also implemented C scripts that use the information extracted from the analysis steps about which data types are considered crucial and automatically generate LEX scripts that perform code partitioning by separating the different basic blocks into Layers 2 and 3. A basic block that doesn't access crucial arrays will be used as the body of a newly generated function, and its function call replaces this basic block in the initial code flow. To implement this efficiently, all variable and array accesses of the considered basic block need to be scanned. All the arrays and variables declared locally must be declared as inputs through the function's parameter list, while all the global arrays and variables should remain unchanged in the new function body.


Figure 12. LEX script for identifying basic blocks.

The function prototype declaration of each newly created function is declared inside the corresponding class declaration.

Figure 13 shows an example of the automated rewriting. Beside every newly generated function, a comment is automatically printed which declares that the function's contents belong to Layer 3. The automated flow scans all the input code's basic blocks and identifies those that don't access any data types from the background memory. It rewrites these basic blocks as bodies of new Layer 3 functions, as shown in Figure 13 (ii). In this way, not only are the foreground and background data types distinguished, but the corresponding code portions are also separated.


Figure 13. Code separation in two layers.
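The following schematic sketch is in the spirit of Figure 13, which is not reproduced here; all names are invented for illustration and the extracted block is hypothetical.

    // Schematic of the rewriting: a basic block that touches only
    // non-crucial data becomes the body of a new Layer 3 function,
    // its prototype is added to the class declaration, and a call
    // replaces the block in the original flow (names illustrative).
    class full_s {
        // ... crucial arrays (background memory, Layer 2) ...
    public:
        void fs_motion_estimation();
        void layer3_init(int &min, int &dist);  // prototype added automatically
    };

    // Newly generated Layer 3 function: locals are passed through the
    // parameter list; globals would remain referenced directly.
    void full_s::layer3_init(int &min, int &dist) {  /* Layer 3 */
        min = 255 * 16 * 16;
        dist = 0;
    }

    void full_s::fs_motion_estimation() {            /* Layer 2 */
        int min, dist;
        layer3_init(min, dist);  // call replaces the extracted basic block
        // ... nested loops over the crucial arrays remain here ...
    }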

4.4. Comparison With Other Commercial Suites

Comparing all the above-described steps of the proposed flow with Rational's Quantify suite, it can be said that the only information given by Quantify is the number of times the fs motion estimation procedure was executed, which is '1'. However, this is not sufficient information for deciding which data types should be placed in foreground and background memory. Also, no code rewriting is performed by Quantify for separating the foreground and background code portions. The last difference is also valid for the ATOMIUM tool suite at IMEC, the Xtel tool at EPFL, and the PowerEscape Analyzer [POWERESCAPE], which handle C code. This is not yet sufficient, as explained in Section 2. For these reasons, the uniqueness of the automated flow's operations is substantiated.

5. Experimental Results

To verify the effectiveness and validity of the introduced framework, experiments on six well-known codes have been performed. The tested codes are parts of multimedia and telecommunication algorithms. More specifically, the OFDM baseband transmitter is based on the physical layer of the IEEE 802.11a specification. JPEG is an international standard for the compression of multilevel still images. Cavity Detection is a medical image-processing algorithm, and ADPCM is a voice codec. VIC [VIC] contains a group of applications; the one among them that the proposed flow's experiments focused on is a video application implementing the H.261 standard. Finally, the last application is Hiperlan/2, a telecom protocol used in wireless networks [4].


Table 4. Tool's Analysis Results vs. Counters Placed by Hand for Array Memory Accesses

Data types                  Accesses (tool)    Accesses (counters by hand)    Static size (Bytes)    Weight

JPEG
Dct                         786,508            786,508                        525,504                4.13E+11
Quantize                    65,664             65,664                         527,552                3.46E+10
Zigzag                      131,200            131,200                        528,576                6.93E+10
Entropy                     13,104             13,104                         549,824                7.2E+09

OFDM
Scrambler                   69,127             69,127                         1,784                  1.23E+08
Fec                         99,876             99,876                         4,080                  4.07E+08
Interleaver                 15,360             15,360                         2,304                  3.54E+07
Cp                          94,176             94,176                         4,608                  4.34E+08

CAVITY DETECTION
g image                     4,572,708          4,572,708                      8,192,000              3.89E+13
GaussBlur                   1,271,696          1,271,696                      8,192,000              1.04E+13
DetectRoot                  1,271,696          1,271,696                      8,192,000              1.04E+13
c image                     1,021,930          1,021,930                      8,192,000              8.37E+12

ADPCM
Encoder                     4,761              4,761                          3,648                  1.74E+07
Decoder                     11,105             11,105                         3,648                  4.05E+07

VIC
Decoder                     140                140                            1,536                  2.15E+05
P64Decoder                  13,565             13,565                         198,208                2.68E+09
Matcher                     104                104                            96                     9.98E+03
TclObject                   757                757                            128                    9.69E+03

HIPERLAN
EC Tx unack::Tx Queue       1,216,833          1,216,833                      7,454,744              9.07E+12
EC Rx unack::Rx Queue       1,216,833          1,216,833                      7,454,744              9.07E+12
ModemBuffer                 1,216,833          1,216,833                      12,000                 1.46E+10
Ethernet Packet Register    1,518              1,518                          1,155,991              1.75E+09


Table 4 shows the measurements taken by the proposed flow for all applications. To verify the correctness of the measurements, special modifications to the code of the tested applications have been performed, since no tool exists that performs measurements similar to what the introduced flow does. For that reason, concerning the accesses of each class state, counters have been placed by hand inside the function bodies under each static appearance of a class state, and their results after execution have been compared with the proposed flow's measurements.


Table 5. Object Function Calls of Quantify

Data types                  Function calls (Quantify)

CAVITY DETECTION
G image                     4,572,708
Gaussblur                   1,271,696
Detectroot                  1,271,696
C image                     1,021,930

JPEG
Dct                         1,024
Quantize                    1,024
Zigzag                      1,024
Entropy                     1,024

OFDM
Scrambler                   40
Fec                         40
Interleaver                 40
Cp                          40

ADPCM
Encoder                     10
Decoder                     26

HIPERLAN/2
EC Rx unack::Rx Queue       9,090
ModemBuffer                 876
Ethernet Packet Register    1,761


Table 5 shows the measurements taken by Quantify™. For the VIC application there are no measurements, since it operates on Linux and no Quantify suite is available there. It is clear that Quantify™ doesn't give the appropriate profiling information for defining how crucial each data type is. In particular, Quantify™ measures only the number of times an object's function is called. Thus, Quantify™ measures the number of all function calls that are made dynamically. No information is produced about the accesses of each state of an object, and none about the total static size of an object. This lack of profiling information in Quantify™ may lead to totally unusable results concerning the declaration of data types as crucial or non-crucial.

As can be seen, Tables 4 and 5 give different pictures of how crucial a data type is. As an example of how misleading the results of Quantify are, one could assume, looking at Table 5, that for JPEG, Entropy is one of the most crucial data types. However, since Entropy's total memory accesses are actually the fewest of all, as reported in Table 4, Entropy should be characterized as the least crucial data type of JPEG.

Table 6 shows the crucial data types of each benchmark, the memory that is required for processing each application by the proposed flow, and the execution time needed to complete the flow. All the above measurements have been taken while executing the proposed automated flow on an Intel Pentium IV processor with 512 MB RAM under the Linux SuSE 7.3 environment. By focusing only on the most crucial data types of each application, the target code is significantly reduced. In these experiments, the crucial data types are selected as those whose sum of weights exceeds 80% of the total application weight, as sketched below.
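This 80% rule can be implemented as a cumulative-weight cut after sorting by weight. A minimal sketch follows (assuming only name/weight pairs; this is not the tool's actual code):

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    // Mark as crucial the heaviest data types until their cumulative
    // weight reaches 80% of the application's total weight.
    std::vector<std::string>
    select_crucial(std::vector<std::pair<std::string, double> > dts) {
        std::sort(dts.begin(), dts.end(),
                  [](const std::pair<std::string, double> &a,
                     const std::pair<std::string, double> &b) {
                      return a.second > b.second;  // heaviest first
                  });
        double total = 0;
        for (std::size_t i = 0; i < dts.size(); ++i) total += dts[i].second;
        std::vector<std::string> crucial;
        double cum = 0;
        for (std::size_t i = 0; i < dts.size() && cum < 0.8 * total; ++i) {
            crucial.push_back(dts[i].first);
            cum += dts[i].second;
        }
        return crucial;
    }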


Table 6. Tool's Memory and Time Requirements and Applications' Total Code Size Decrement

Apps.              Total Lines    Crucial Lines    Crucial data types                      Memory required by tool (Bytes)    Exec. Time (sec)
JPEG               935            140              Dct                                     22,684                             12
OFDM               1,050          406              Fec, Cp, Scrambler                      5,204                              3
Cavity Detection   576            303              g image, Gauss blur, Detect roots       2,352                              3
ADPCM              417            417              Encoder, Decoder                        4,124                              2
VIC                9,208          1,417            P64 Decoder                             5,972                              5
Hiperlan/2         70,000         1,930            EC Tx unack                             106,496                            14

6. Conclusions

The automated flow presented in this paper implements code and data memory partitioning from a background memory management perspective. It is a prototype tool flow which separates the code that is useful for background memory management optimization purposes from the rest of the C++ code. This is a task that no other existing system-level design tool performs. The main benefit of using this tool is a serious reduction in the design time of a given application. The exact part of the code that has to be explored from a background memory management optimization point of view is revealed. This is usually a small part compared to the total size of the application code. The rest of the code can be temporarily ignored for memory exploration purposes. Moreover, this separation process enables a large gain in overall verification time.

Acknowledgments

We thank the European Social Fund (ESF), the Operational Program for Educational and Vocational Training II (EPEAEK II), and particularly the Program PYTHAGORAS, for funding this work.

References

1. Aigner, G., A. Diwan, D. L. Heine, M. S. Lam, D. L. Moore, B. R. Murphy, and C. Sapuntzakis. An Overview of the SUIF2 Compiler Infrastructure. Technical report, Stanford University, 2000.

2. ATOMIUM, http://www.imec.be/atomium

3. Benini, L., and G. De Micheli. System-Level Power Optimization: Techniques and Tools. ACM TODAES, vol. 5, no. 2, pp. 115–192, 2000.

4. Bisdounis, L., S. Blionas, M. Speitel, E. Macii, S. Nikolaidis, A. Milidonis, and M. Bonno. Design Story-II. Public Deliverable of Energy-Aware SYstem-on-chip design of the HIPERLAN/2 standard, IST 2000-30093 Project, 2003.

5. Catthoor, F., S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, and A. Vandecappelle. Custom Memory Management Methodology: Exploration of Memory Organization for Embedded Multimedia System Design. Kluwer Academic Publishers, 1998.

6. Catthoor, F., K. Danckaert, S. Wuytack, and N. D. Dutt. Code Transformations for Data Transfer and Storage Exploration Preprocessing in Multimedia Processors. IEEE Design and Test of Computers, vol. 18, no. 3, pp. 70–82, 2001.

7. Catthoor, F., K. Danckaert, C. Kulkarni, E. Brockmeyer, P. Kjeldsberg, T. Achteren, and T. Omnes. Data Accesses and Storage Management for Embedded Programmable Processors. Kluwer Academic Publishers, 2002.

8. EDG, "C++ Front End", proprietary information of Edison Design Group, www.edg.com, 1992.

9. Edwards, S., L. Lavagno, E. Lee, and A. Sangiovanni-Vincentelli. Design of Embedded Systems: Formal Models, Validation, and Synthesis. In Proc. of the IEEE, 1997, pp. 366–390.

10. Falk, H., and M. Verma. Combined Data Partitioning and Loop Nest Splitting for Energy Consumption Minimization. In International Workshop on Software and Compilers for Embedded Systems (SCOPES), 2004.

11. Gajski, D., F. Vahid, S. Narayan, and J. Gong. Specification and Design of Embedded Systems. Prentice Hall, 1994.

12. Gupta, R., M. Soffa, and D. Ombres. Efficient Register Allocation via Coloring Using Clique Separators. ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 16, no. 3, May 1994.

13. Hu, J. S., N. Vijaykrishnan, M. Kandemir, and M. J. Irwin. Power-Efficient Trace Caches. In Proceedings of the 5th Design Automation and Test in Europe Conference (DATE'02), Paris, France, 2002.

14. Kandemir, M., J. Ramanujam, and A. Choudhary. Improving Cache Locality by a Combination of Loop and Data Transformations. IEEE Trans. on Computers, 1999, pp. 159–167.

15. Levine, J. R., T. Mason, and D. Brown. Lex & Yacc. O'Reilly & Associates Inc., 1992.

16. Lueh, G., T. Gross, and A. Adl-Tabatabai. Fusion-Based Register Allocation. ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 22, no. 3, May 2000.

17. Leupers, R. Code Optimization for Embedded Processors. Kluwer Academic Publishers, 2002.

18. Macii, A., L. Benini, and M. Poncino. Memory Design Techniques for Low Energy Embedded Systems. Kluwer Academic Publishers, 2002.

19. Panda, P., and N. Dutt. Memory Issues in Embedded Systems-on-Chip: Optimization and Exploration. Kluwer Academic Publishers, 1999.

20. Panda, P., F. Catthoor, N. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandecappelle, and P. G. Kjeldsberg. Data and Memory Optimizations for Embedded Systems. ACM TODAES, vol. 6, no. 2, pp. 142–206, 2001.

21. POWERESCAPE, http://www.powerescape.com

22. Steinke, S., L. Wehmeyer, B. Lee, and P. Marwedel. Assigning Program and Data Objects to Scratchpad for Energy Reduction. In Proceedings of the 5th Design Automation and Test in Europe Conference (DATE'02), Paris, France, 2002.

23. Rational Quantify for Windows v2001A. Rational Software Corporation.

24. VIC, VideoConferencing Tool, www-nrg.ee.lbl.gov/vic

25. Wess, B., and T. Zeitlhofer. On the Phase Coupling Problem Between Data Memory Layout Generation and Address Pointer Assignment. In International Workshop on Software and Compilers for Embedded Systems (SCOPES), 2004.