Fast Identification of Custom Instructions for Extensible Processors


IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 2, FEBRUARY 2007 359

Fast Identification of Custom Instructions for Extensible Processors

Xiaoyong Chen, Douglas L. Maskell, Senior Member, IEEE, and Yang Sun

Abstract—This paper proposes a fast algorithm to enumerate all convex subgraphs that satisfy the I/O constraints from the dataflow graph (DFG) of a basic block. The algorithm can be tuned to determine all subgraphs or only the connected ones, allowing a choice between better instruction-set extension (ISE) and faster design space exploration. The algorithm uses a grading method to identify the next node for inclusion in a subgraph. If the selected node is included, other related nodes are included as well, ensuring that the resultant subgraph is always convex while reducing the problem size by a block of nodes. If the selected node is not included, the DFG is split into smaller DFGs, which also reduces the problem size. On this basis, the algorithm employs a simple but efficient method to prune invalid subgraphs that violate the I/O constraints. Results show that for relatively small DFGs with a small exploration space, the new algorithm has runtimes similar to those of existing algorithms. However, for larger DFGs with a much larger exploration space and with multiple input and output constraints, its runtimes can be orders of magnitude shorter than those of existing algorithms. The new algorithm can be used to quickly identify custom instructions for ISE of embedded processors.

Index Terms—Algorithm, configurable processor, instruction-set extension (ISE).

I. INTRODUCTION

MODERN reconfigurable devices [14], [15] make it attractive to customize an embedded processor's architecture for specific applications. By integrating custom function units (CFUs) composed of programmable logic into an existing processor, a cluster of the base processor's instructions can be replaced by a single custom instruction evaluated in the CFU. This kind of instruction-set extension (ISE) combines the speedup and power/area savings offered by application-specific hardware with the simplicity and flexibility offered by a general-purpose processor.

While instruction-set-extensible processors and their software tools have been around for several years, mapping a real application onto them still requires considerable manual effort. To improve the efficiency and correctness of the mapping process, automatic generation of appropriate custom instructions for an application is important. Given a computation-intensive code region, custom-instruction generation involves identifying the instruction clusters suited for implementation as custom instructions and selecting the best combination of custom instructions. In this paper, we focus only on custom-instruction identification.

Manuscript received March 6, 2006; revised May 29, 2006 and July 11, 2006. This paper was recommended by Associate Editor G. E. Martin.

X. Chen and D. L. Maskell are with the School of Computer Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]; [email protected]).

Y. Sun is with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117576 (e-mail: [email protected]).

Digital Object Identifier 10.1109/TCAD.2006.883915

Identifying custom instructions is a time-consuming job, for the possible instruction clusters (or patterns) may grow exponentially with the number of instructions to be considered. Despite this, only a small proportion of the patterns can feasibly be implemented as custom instructions. This is because the instruction-set architecture and microarchitecture may limit the number of I/Os of a custom instruction. For example, the custom instructions in PRISC [7] and ConCISe [8] can only have two inputs and one output. In addition, the patterns need to be convex [1], i.e., able to be executed atomically. The objective of custom-instruction identification is to quickly find the valid patterns that satisfy these constraints. While various approaches have been proposed to solve this problem, some previous work [7]–[10] restricts custom instructions to a single output. Such a strict constraint may prevent extensible processors from achieving higher speedups [11]–[13]. Others [4]–[6] allow multi-output custom instructions but rely on heuristics or genetic algorithms to explore the design space, possibly missing better solutions. It seems that Tensilica AutoTIE can also enumerate all connected valid patterns [17]. However, details about the algorithm and its performance under various constraints do not appear to be available in the literature.

Atasu et al. [1] proposed an algorithm to enumerate all valid patterns by constructing a search tree. On each level of the tree, a node is selected. Including or excluding the node results in two branches of the tree. When a new node is included, a new pattern is found. The new pattern is checked against the convexity and I/O constraints. As the nodes are selected in reverse topologically sorted order, the search space is trimmed by exploiting the relation between the current pattern and a new pattern containing it. Pozzi et al. [2] further improved this algorithm by adding a pruning criterion based on the number of permanent inputs. This algorithm is fast when exploring small basic blocks but is not efficient for large ones, especially when the custom instructions can have multiple outputs. While disconnected patterns may better exploit parallelism in the application [2], they may not be useful if the host processor can already exploit that parallelism. Thus, in some situations, we may only want to enumerate the connected valid patterns. Yu and Mitra proposed a fast algorithm [3] to enumerate only connected valid patterns. The algorithm has two phases. First, the upward cones and downward cones of all nodes are enumerated. Then, all patterns are enumerated by computing the unions of upward

0278-0070/$25.00 © 2007 IEEE


cones and downward cones. As a new pattern is composed by extending an existing pattern, the algorithm considers only connected patterns, reducing the computation time compared to [1]. However, the algorithm may consider a large number of invalid patterns and may consider a pattern more than once, making it still slow for large code regions, especially when there are multiple inputs and outputs.

In the rest of this paper, we present a fast algorithm to enumerate valid patterns. The algorithm can be tuned to determine all valid patterns or only the connected ones. We formulate the problem in Section II and describe our algorithm in Sections III and IV. Section V introduces some implementation details. Section VI presents the results and discussion. Section VII concludes this paper.

II. PROBLEM STATEMENT

For the sake of clarity, we briefly formulate the custom-instruction identification problem in this section. Note that this problem has been formulated in the literature [1].

A dataflow graph (DFG) G(V, E) is a directed acyclic graph that represents the dataflow of a basic block. The nodes V represent primitive operations, and the edges E represent data dependencies among operations. If (u, v) ∈ E, then u is said to be "adjacent to" v, and v is "adjacent from" u. Each DFG G(V, E) has a number of inputs and outputs from and to some external nodes (V+) not in V. Some additional edges (E+) connect the nodes in V+ with the nodes in V; G, V+, and E+ form a new graph G+(V ∪ V+, E ∪ E+). A pattern P is a subgraph of a DFG G: P ⊆ G. Each pattern has some input nodes and output nodes. Given a DFG G, G+, and a pattern P ⊆ G, an input node of P is a node in G+ that is adjacent to a node in P. We use IN(P) to denote the input node set of P. An output node of P is a node in P that is adjacent to a node in G+ but not in P. We use OUT(P) to denote the output node set of P. The pattern P is "convex" [1] if there exists no path from a node u ∈ P to another node v ∈ P that involves a node w ∉ P. Otherwise, the pattern is "nonconvex." For example, in Fig. 1, the pattern {0, 1, 2, 3} is a convex pattern, while the pattern {0, 1, 3} is not. In addition, some operations such as memory operations may not be amenable to hardware synthesis and thus need to be precluded from a valid pattern. Such operations are referred to as "invalid." The custom-instruction identification problem is to find every valid pattern P that satisfies the following constraints.

1) P is convex.
2) |IN(P)| ≤ IN_LIMIT.
3) |OUT(P)| ≤ OUT_LIMIT.
4) P contains no invalid operation.
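
These four checks can be written down directly. The sketch below assumes a DFG stored as adjacency lists (node → list of immediate successors); the graph, node names, and helper names are ours, not the paper's, and external sinks stand in for edges into V+.

```python
# Sketch of the four validity checks above, assuming an adjacency-list DFG.
# The graph, node names, and helper names are ours, not the paper's.
def successors(adj, u):
    """All nodes reachable from u (the transitive successors of u)."""
    seen, stack = set(), [u]
    while stack:
        for v in adj.get(stack.pop(), ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def is_valid(adj, ext_sinks, pattern, in_limit, out_limit, invalid=()):
    """Check constraints 1)-4) for a candidate pattern P."""
    P = set(pattern)
    if P & set(invalid):                      # 4) no invalid operations
        return False
    for u in P:                               # 1) convexity: no path that
        for w in successors(adj, u) - P:      #    leaves P and re-enters it
            if successors(adj, w) & P:
                return False
    # 2) inputs: nodes outside P feeding a node in P
    ins = {u for u in adj if u not in P and set(adj[u]) & P}
    # 3) outputs: nodes of P read outside P or by an external sink
    outs = {u for u in P if (set(adj[u]) - P) or u in ext_sinks}
    return len(ins) <= in_limit and len(outs) <= out_limit

# hypothetical diamond DFG: a feeds b and c, which merge at d
adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
```

In this toy graph, {a, b, d} fails the convexity check (the path a → c → d leaves the pattern and re-enters it), while the full diamond satisfies all four constraints under a (2, 1) I/O limit.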

III. ALGORITHM

To illustrate the motivation for our algorithm, consider the DFG shown in Fig. 1. As an example, we consider one step in the custom-instruction identification process. In this step, we first assume that node 4 is not in the pattern. In this case, nodes 0, 1, 2, and 3 cannot coexist with node 6 in the pattern. Otherwise, the pattern will be nonconvex. This suggests that in

Fig. 1. Example of a DFG.

this situation, we can split the large DFG into two smaller DFGs, {0, 1, 2, 3, 5} and {5, 6}. Thus, the problem reduces to finding the convex patterns of the two smaller DFGs with the constraint that the pattern from {5, 6} must contain at least one node from {6}. This splitting eliminates many of the invalid patterns that are considered in [2] and [3].

Next, we assume that node 4 is in the pattern. Then, if we consider node 0 to be in the pattern, nodes 1, 2, and 3 must also be united into the pattern; otherwise, the pattern is not convex. Thus, we need not consider these nodes in the following exploration, resulting in a reduced search space. We can repeat the unite and split processes until no remaining nodes need to be considered. As all the patterns obtained are convex, we do not need any convexity check.

A. Overview

Given a DFG G(V, E) and a node u ∈ V, the other nodes in V can be grouped into several subsets according to their relationship with u.

1) Predecessors of u: Pred(G, u) = {v | v ∈ V, v ≠ u, there is a path in G from v to u}. Any node in Pred(G, u) is called a predecessor of u.

2) Successors of u: Succ(G, u) = {v | v ∈ V, v ≠ u, there is a path in G from u to v}. Any node in Succ(G, u) is called a successor of u.

3) Disconnected nodes of u: Disc(G, u) = {v | v ∈ V, v ≠ u, there is neither a path from u to v nor from v to u}. Any node in Disc(G, u) is called a disconnected node of u.

Additionally, we define the following.

1) Immediate Predecessors of u: IPred(G, u) = {v | v ∈ V, v ≠ u, u is adjacent from v}. Any node in IPred(G, u) is called an immediate predecessor of u.

2) Immediate Successors of u: ISucc(G, u) = {v | v ∈ V, v ≠ u, u is adjacent to v}. Any node in ISucc(G, u) is called an immediate successor of u.

Similarly, given a graph G(V, E) and a subgraph P ⊆ G, we define several subsets of V according to their relationships with P.

1) Predecessors of P: Pred(G, P) = (∪u∈P Pred(G, u)) − P.
2) Successors of P: Succ(G, P) = (∪u∈P Succ(G, u)) − P.
3) Disconnected nodes of P: Disc(G, P) = {u | u ∈ G, u ∉ P, u ∉ Succ(G, P), u ∉ Pred(G, P)}.
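
The set definitions above transcribe directly into code. The sketch below uses the same adjacency-list DFG model (node → immediate successors); names and the example graph are ours.

```python
# Direct transcription of the Pred/Succ/Disc definitions above; names ours.
def reach(adj, u):
    """Succ(G, u): every node reachable from u."""
    seen, stack = set(), [u]
    while stack:
        for v in adj.get(stack.pop(), ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def pred(adj, u):
    """Pred(G, u): every node with a path to u."""
    return {v for v in adj if v != u and u in reach(adj, v)}

def disc(adj, u):
    """Disc(G, u): nodes with no path to or from u."""
    return set(adj) - {u} - reach(adj, u) - pred(adj, u)

def pred_of_pattern(adj, P):
    """Pred(G, P) = (union of Pred(G, u) over u in P) - P."""
    return set().union(*(pred(adj, u) for u in P)) - set(P)

adj = {"a": ["b"], "b": ["c"], "c": [], "d": []}   # hypothetical chain plus a stray node
```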

According to the preceding definitions, we know that Pred(G, P) ∩ Disc(G, P) = ∅ and Succ(G, P) ∩ Disc(G, P) = ∅. In addition, for a convex graph, we have the following conclusion.


Fig. 2. Algorithm: An overview.

Lemma 1: Given a graph G(V, E) and a subgraph P ⊆ G, if P is convex, then Pred(G, P) ∩ Succ(G, P) = ∅.

Proof: If there exists a node w such that w ∈ Pred(G, P) and w ∈ Succ(G, P), then there is a path from w to a node u in P, and there is a path from a node v in P to w. Therefore, there is a path from v to u that involves a node w not in P. This conflicts with the condition that P is convex. □

Fig. 2 presents an overview of our algorithm. Given a convex DFG G, a convex pattern P: P ⊆ G, and a node rg, if rg equals NOT_A_NODE (NaN), the enumerate function searches all valid patterns Pi: P ⊆ Pi ⊆ G. Otherwise, the function searches all valid patterns Pi: P ⊆ Pi ⊆ G, Pi ∩ Pred(G0, rg) ≠ ∅. We call node rg a "redundancy guarding node" and explain its use later in the text. The algorithm begins with an empty pattern P, the convex DFG that contains all operations in a basic block (denoted as G0), and NaN as the initial value of rg. The search consists of recursive invocations of the enumerate function. In each invocation, a node is selected from the remaining node set (G − P) by calling the select_node function. The unite function handles the situation where the selected node, if a valid node, is included in the pattern. It produces a new pattern and recursively calls the enumerate function with the new pattern as input. The split function handles the situation where the selected node is not included in the pattern. It decomposes the DFG G into one or two smaller DFGs and then recursively calls the enumerate function with the new DFGs as input. The process returns when the remaining node set is empty. We introduce the unite, split, and select_node functions in Sections III-B, III-C, and IV, respectively.
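
For cross-checking a fast implementation of this recursion on tiny DFGs, a brute-force oracle that simply tests every subset is handy. To be clear, this is not the paper's algorithm (it is exponential and enumerates many invalid subsets), and the graph and names are ours.

```python
# Not the paper's algorithm: an exponential brute-force oracle that tests
# every subset of a tiny DFG, useful only to validate a fast enumerator.
from itertools import combinations

def succ_set(adj, u):
    seen, stack = set(), [u]
    while stack:
        for v in adj.get(stack.pop(), ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def convex(adj, P):
    # no path may leave P and re-enter it
    return all(not (succ_set(adj, w) & P)
               for u in P for w in succ_set(adj, u) - P)

def enumerate_valid(adj, ext_sinks, in_limit, out_limit):
    nodes = sorted(adj)
    for r in range(1, len(nodes) + 1):
        for sub in combinations(nodes, r):
            P = set(sub)
            ins = {u for u in adj if u not in P and set(adj[u]) & P}
            outs = {u for u in P if (set(adj[u]) - P) or u in ext_sinks}
            if convex(adj, P) and len(ins) <= in_limit and len(outs) <= out_limit:
                yield P

adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}   # hypothetical diamond
patterns = list(enumerate_valid(adj, {"d"}, 2, 1))
```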

B. Unite Function

Fig. 3 shows the unite function, while Fig. 4 illustrates its operation in different situations. As mentioned in the motivating example, adding a node to a convex pattern may result in a nonconvex pattern. To produce a convex pattern, we need to add other related nodes to the pattern. However, no invalid nodes should be added; this is guaranteed by the select_node function, which is discussed in Section IV. Furthermore, no node should be added unnecessarily; otherwise, some valid patterns may be missed. In other words, we need to form the minimal convex pattern that encloses the current pattern and the selected node. The pattern P′ derived by the unite function is such a pattern.

Fig. 3. Algorithm of the unite function.

Fig. 4. Illustration of unite operations. (a) u ∈ Pred(G, P). (b) u ∈ Succ(G, P). (c) u ∈ Disc(G, P).

Lemma 2: Given a convex DFG G, a convex pattern P: P ⊆ G, and a node u: u ∈ G but u ∉ P, the following conditions hold.

1) The P′ derived by the unite function satisfies P ⊆ P′ ⊆ G, u ∈ P′, and P′ is convex.

2) If Pi is another subgraph of G that satisfies P ⊆ Pi, u ∈ Pi, and Pi is convex, then P′ ⊆ Pi.

Proof:

1) According to the unite function, it is obvious that P ⊆ P′ ⊆ G and u ∈ P′. We only prove that P′ is convex. P′ is computed differently in three situations. In the first case, u ∈ Pred(G, P), and P′ = P ∪ {u} ∪ (Succ(G, u) ∩ Pred(G, P)). According to the assumption, P is convex. Since {u} ∪ (Succ(G, u) ∩ Pred(G, P)) ⊆ Pred(G, P), there is no path from a node in P to a node in {u} ∪ (Succ(G, u) ∩ Pred(G, P)). Assume that vm, ..., vi, ..., vn is a path from a node vm in {u} ∪ (Succ(G, u) ∩ Pred(G, P)) to a node vn in P ∪ {u} ∪ (Succ(G, u) ∩ Pred(G, P)). As G is convex, vi ∈ G. vm is either u or a node in Succ(G, u); according to the definition of Succ(G, u), vi must be in Succ(G, u). vn is in P ∪ {u} ∪ (Succ(G, u) ∩ Pred(G, P)); as vn cannot be u, vn is either in P or in Pred(G, P), so by the definition of Pred(G, P), vi must be either in P or in Pred(G, P). In sum, vi ∈ Succ(G, u) ∩ (P ∪ Pred(G, P)) = (Succ(G, u) ∩ P) ∪ (Succ(G, u) ∩ Pred(G, P)) ⊆ P ∪ (Succ(G, u) ∩ Pred(G, P)) ⊆ P ∪ {u} ∪ (Succ(G, u) ∩ Pred(G, P)). In


Fig. 5. Algorithm of the split function.

Fig. 6. Illustration of split operations. (a) u ∈ Pred(G, P). (b) u ∈ Succ(G, P). (c) u ∈ Disc(G, P).

the second case, u ∈ Succ(G, P), and P′ = P ∪ {u} ∪ (Pred(G, u) ∩ Succ(G, P)). The proof is similar to the first case. In the third case, u ∈ Disc(G, P), and P′ = P ∪ {u}. According to the assumption, P is convex. As G is acyclic, there is no path from u to u; thus, {u} is convex. According to the definition of Disc(G, P), there is neither a path from u to a node in P nor a path from a node in P to u. Therefore, P ∪ {u} is convex.

2) In the first case, u ∈ Pred(G, P), and P′ = P ∪ {u} ∪ (Succ(G, u) ∩ Pred(G, P)). Assume that P ⊆ Pi, u ∈ Pi, Pi is convex, and vi ∈ P′ but vi ∉ Pi. Obviously, vi ∉ P ∪ {u}. Since vi ∈ P′, vi ∈ Pred(G, P) ∩ Succ(G, u). This means that there is a path from u to vi and a path from vi to a node in P, which contradicts the assumption that Pi is convex. In the second case, u ∈ Succ(G, P), and P′ = P ∪ {u} ∪ (Pred(G, u) ∩ Succ(G, P)). The proof is similar to the first case. In the third case, u ∈ Disc(G, P), and P′ = P ∪ {u}. If P ⊆ Pi and u ∈ Pi, then P′ = P ∪ {u} ⊆ Pi. □
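
Under the same adjacency-list model as before, Lemma 2's three cases translate into a small unite sketch. This is our reading of the lemma, not a reproduction of the paper's Fig. 3; helpers and the example graph are ours.

```python
# Lemma 2's three cases as a unite sketch over an adjacency-list DFG.
def _reach(adj, u):
    seen, stack = set(), [u]
    while stack:
        for v in adj.get(stack.pop(), ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def _preds(adj, X):
    """Pred(G, X): nodes outside X with a path into X."""
    return {v for v in adj if v not in X and _reach(adj, v) & X}

def _succs(adj, X):
    """Succ(G, X): nodes outside X reachable from X."""
    return set().union(*(_reach(adj, u) for u in X)) - set(X) if X else set()

def unite(adj, P, u):
    """Minimal convex superset of P and {u} (Lemma 2)."""
    P = set(P)
    if u in _preds(adj, P):      # case (a): include nodes on paths u -> P
        return P | {u} | (_reach(adj, u) & _preds(adj, P))
    if u in _succs(adj, P):      # case (b): include nodes on paths P -> u
        return P | {u} | (_preds(adj, {u}) & _succs(adj, P))
    return P | {u}               # case (c): u disconnected from P

adj = {"a": ["b"], "b": ["c"], "c": []}   # hypothetical chain a -> b -> c
```

On the chain, uniting node a into {c} (or c into {a}) pulls in the intermediate node b, exactly the minimality claimed by the lemma.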

C. Split Function

Fig. 5 shows the split function, while Fig. 6 illustrates its operation in different situations. When a selected node is not included in the pattern, its predecessors and successors cannot coexist in the pattern. Thus, we can split the DFG into smaller DFGs. Nevertheless, the split operation should satisfy three conditions. First, the resultant DFGs must enclose the current pattern; otherwise, the same pattern may be enumerated more than once. Second, the resultant DFGs must be convex so that the unite function can be called with the resultant DFGs as input. Third, this operation must be safe, i.e., no valid pattern is removed from future consideration. We show that our split function satisfies all these requirements in Lemmas 3, 4, and 5.

Lemma 3: Given a convex DFG G, a convex pattern P: P ⊆ G, and a node u: u ∈ G but u ∉ P, the following conditions hold.

1) If u ∈ Pred(G, P), then P ⊆ G − (Pred(G, u) ∪ {u}).
2) If u ∈ Succ(G, P), then P ⊆ G − (Succ(G, u) ∪ {u}).
3) If u ∈ Disc(G, P), then P ⊆ Disc(G, u).

Proof:

1) As u ∈ Pred(G, P), there exists a path from u to a node v ∈ P, so v ∈ Succ(G, u). Assume that there is another node w ∈ P but w ∉ G − (Pred(G, u) ∪ {u}). Then w ∈ Pred(G, u), because w cannot be u according to the definition of Pred(G, P). Hence, there is a path from w to u. In sum, there is a path from w ∈ P to v ∈ P that involves a node u ∉ P. This contradicts the assumption that P is convex.

2) Similar to 1).
3) If v ∈ P but v ∉ Disc(G, u), then v ∈ Succ(G, u) or v ∈ Pred(G, u). In other words, there is a path from u to v or from v to u. This contradicts the assumption that u ∈ Disc(G, P). □

Lemma 4: Given a convex DFG G and a node u: u ∈ G, the following conditions hold.

1) G − (Succ(G, u) ∪ {u}) is convex.
2) G − (Pred(G, u) ∪ {u}) is convex.

Proof:

1) Assume that G − (Succ(G, u) ∪ {u}) is nonconvex. Then there is a path from a node in G − (Succ(G, u) ∪ {u}) to another node in G − (Succ(G, u) ∪ {u}) that involves a node in Succ(G, u) ∪ {u}, because G is convex. According to the definition of Disc(G, u), the path cannot end with a node in Disc(G, u); otherwise, there would be at least one path from u to a node in Disc(G, u) via a node in Succ(G, u). In addition, as the graph is acyclic, the path cannot end with a node in Pred(G, u). So, such a path does not exist.

2) Similar to 1). □

Lemma 5: Given a convex DFG G, a convex pattern P: P ⊆ G, and a node u: u ∈ G but u ∉ P, if Pi is another convex pattern with P ⊆ Pi ⊆ G and u ∉ Pi, then the following conditions hold.

1) If u ∈ Pred(G, P), then Pi ⊆ G − (Pred(G, u) ∪ {u}).
2) If u ∈ Succ(G, P), then Pi ⊆ G − (Succ(G, u) ∪ {u}).
3) If u ∈ Disc(G, P), then Pi ⊆ G − (Pred(G, u) ∪ {u}), or Pi ⊆ G − (Succ(G, u) ∪ {u}) but Pi ∩ Pred(G, u) ≠ ∅.

Proof:

1) Assume that Pi ⊈ G − (Pred(G, u) ∪ {u}). Then Pi contains at least one node v in Pred(G, u) ∪ {u}. Since u ∉ Pi, v ∈ Pred(G, u). Since u ∈ Pred(G, P), at least one node w ∈ P is a successor of u. As P ⊆ Pi, w ∈ Pi. Hence, there is a path from v in Pi to w in Pi via the node u


Fig. 7. Illustration of redundancy guarding node rg.

not in Pi. This contradicts the assumption that Pi is convex.

2) Similar to 1).
3) As u ∉ Pi and Pi is convex, Pi ⊆ G − (Pred(G, u) ∪ {u}) or Pi ⊆ G − (Succ(G, u) ∪ {u}). If Pi ∩ Pred(G, u) = ∅, then Pi ⊆ G − (Pred(G, u) ∪ {u}). □

In the split function, if u ∈ Disc(G, P), two new DFGs are generated, and they have an intersection: Disc(G, u). Thus, the same pattern may be enumerated in exploring both of them. To avoid that, the redundancy guarding node rg is assigned u for the second DFG [Fig. 5 (line 5)]. This forces the select_node function to choose nodes from Pred(G, u). Once a node from Pred(G, u) is united into the pattern, rg is set back to NaN in the unite function [Fig. 3 (line 2)]. This way, a pattern generated from exploring the second DFG contains at least one node in Pred(G, u), making it unique among the result patterns. Taking the DFG in Fig. 1 as an example, Fig. 7 shows how rg works in the enumerating process.
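
Lemmas 3–5 and the rg mechanism can be sketched as a split routine over the same adjacency-list model. This is our reading of the lemmas, not a reproduction of the paper's Fig. 5; the marker name NaN and the example graph are ours.

```python
# The split step (Lemmas 3-5) over an adjacency-list DFG; NaN stands for
# the paper's NOT_A_NODE marker.
NaN = None

def _reach(adj, u):
    seen, stack = set(), [u]
    while stack:
        for v in adj.get(stack.pop(), ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def split(adj, G, P, u):
    """Sub-DFG(s) left to explore when u is excluded from the pattern.

    Returns (nodes, rg) pairs; rg is the redundancy guarding node."""
    Gr = {v: [w for w in adj[v] if w in G] for v in G}   # restrict G0 to G
    succ_u = _reach(Gr, u)
    pred_u = {v for v in Gr if v != u and u in _reach(Gr, v)}
    P = set(P)
    if succ_u & P:     # u in Pred(G, P): drop u and its predecessors
        return [(set(G) - pred_u - {u}, NaN)]
    if pred_u & P:     # u in Succ(G, P): drop u and its successors
        return [(set(G) - succ_u - {u}, NaN)]
    # u in Disc(G, P): two overlapping sub-DFGs; rg = u guards the second so
    # that its patterns must contain a node of Pred(G, u)
    return [(set(G) - pred_u - {u}, NaN), (set(G) - succ_u - {u}, u)]

adj = {"a": ["b"], "b": [], "c": ["b"], "d": []}   # hypothetical DFG
G = {"a", "b", "c", "d"}
```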

IV. NODE SELECTION

By selecting an appropriate node from the remaining nodes (G − P), the select_node function can ensure the correctness and improve the efficiency of the enumerating process. Fig. 8 gives the pseudocode of the select_node function. Our strategy is first to determine the candidates under different constraints and situations and then to select the "best" node by using a grading method. If no candidate satisfying the requirements is found, we return NOT_FOUND to trim the search space.

We divide the remaining nodes (G − P) into two classes: 1) connected nodes (Pred(G, P) ∪ Succ(G, P)) and 2) disconnected nodes (Disc(G, P)). To enumerate only connected valid patterns, we choose a connected node when P is not empty and choose a node in G when P is empty [Fig. 8 (lines 2–4)]. To enumerate all valid patterns, we first check whether the current pattern violates the input or output constraint. If it violates either or both, we attempt to resolve the violation by choosing a connected node [Fig. 8 (lines 5–6)]. Otherwise, we select a disconnected node before selecting any connected nodes [Fig. 8 (lines 7–9)].

Fig. 8. Algorithm of the select_node function.

A. I/O Constraints

When the current pattern violates the I/O constraints, we need to determine whether we can derive a larger pattern that satisfies the I/O constraints and which nodes should be selected to achieve that. As the output constraint is usually more stringent than the input constraint, we handle an output violation before an input violation if both occur.

If the current pattern P violates the output constraint, we first check whether the number of external outputs |EXT_OUT| is already bigger than the output limit. If so, the output violation cannot be resolved [Fig. 8 (lines 11–13)]. Otherwise, we find the possible nodes that can lead to a pattern with fewer outputs. The possible nodes can only come from Succ(G, P). To balance the efficiency and effectiveness of the selection, we choose a node that has at least two immediate predecessors in P or Succ(G, P) and, at the same time, is a successor of at least ⌈(|OUT(P)| − |EXT_OUT|)/(OUT_LIMIT − |EXT_OUT|)⌉ output nodes [Fig. 8 (lines 14–17)]. Fig. 9(a) shows an example of the node selection in this situation.

If the current pattern P violates the input constraint, we first check whether the number of external inputs |IN(P) − G| is already bigger than the input limit. If so, adding any of the remaining nodes cannot resolve the input violation [Fig. 8 (lines 19–20)]. Otherwise, we calculate the possible nodes that can lead to a pattern with fewer inputs. The possible nodes can only come from Pred(G, P). We choose a node that has no inputs or has a sibling in P or Pred(G, P) [Fig. 8 (lines 21–24)]. For example, in the situation shown in Fig. 9(b), the possible nodes include 2, 3, and 4. If no


Fig. 9. Node selection when violating I/O constraint.

such nodes exist, no valid patterns can be found by adding the remaining nodes.

B. Grading Nodes

To select a node from a particular set of candidates, we estimate the remaining search space, assuming that the candidate node is selected. We do this estimation for each candidate and select the one that minimizes the estimated remaining search space. The estimation depends on the following two cases.

Case 1—Selecting a Disconnected Node: We consider the valid nodes first. Referring to Fig. 10(a), in the unite operation, only the selected node is removed from consideration, so we can roughly consider the remaining search space to be 2^(r−1). Similarly, in the split operation, the estimated remaining space is 2^(p+d) + 2^(s+d) − 2^d = 2^d(2^p + 2^s − 1) = 2^(r−1) · (2^p + 2^s − 1)/2^(p+s). As r is a constant for each candidate, the function F1(p, s) = (2^p + 2^s − 1)/2^(p+s) can be used to evaluate the remaining search space. Unfortunately, F1(p, s) is not a good choice, as it may overflow for large p and s and may require the use of floating-point values. Since it is only necessary to compare F1(p, s) for different candidates, the following function is used to replace F1(p, s):

Grade1(p, s) = 0,             if b = 0
Grade1(p, s) = (b ≪ 16) + a,  if b ≠ 0

where a = max(p, s), b = min(p, s), and 0 ≤ p, s < 2^16. It can be shown that for any integers 0 ≤ p1, s1, p2, s2 < 2^16, if F1(p1, s1) < F1(p2, s2), then Grade1(p1, s1) > Grade1(p2, s2); and if F1(p1, s1) = F1(p2, s2), then Grade1(p1, s1) = Grade1(p2, s2). Hence, Grade1(p, s) is an inverse estimate of the remaining search space. For an invalid node, the estimated remaining space along the unite branch is zero, while on the split branch, it is again 2^(r−1) · (2^p + 2^s − 1)/2^(p+s). By comparison, an invalid node always leads to a smaller remaining search space than a valid one and is thus selected first.
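
The inversion property can be checked numerically: packing (min, max) of (p, s) into one integer ranks candidates by a single integer comparison, with no floating point at selection time. The `<< 16` shift below is our reading of the formula above.

```python
# Numeric check of the grading trick: a larger Grade1 value always
# corresponds to a smaller F1 value (the "<< 16" shift is our reading).
def f1(p, s):
    return (2**p + 2**s - 1) / 2**(p + s)

def grade1(p, s):
    a, b = max(p, s), min(p, s)
    return 0 if b == 0 else (b << 16) + a
```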

Case 2—Selecting a Connected Node: Again, we consider valid nodes first. Referring to Fig. 10(b) and (c), in the unite operation, p + 1 nodes are removed from the search space, giving a remaining space of 2^(r−p−1). In the split operation, s + 1 nodes are removed, giving a remaining space of 2^(r−s−1). In total, the remaining search space is 2^(r−p−1) + 2^(r−s−1) = 2^(r−1) · (2^p + 2^s)/2^(p+s). Thus, we only need to compare F2(p, s) = (2^p + 2^s)/2^(p+s) for each node. As before, the following function is used in place of F2:

Fig. 10. Remaining search space.

Grade2(p, s) = (b << 16) + a

where a = max(p, s), b = min(p, s), and 0 ≤ p, s < 2^16. Like Grade1(p, s), Grade2(p, s) is an inverse estimate of the remaining search space. Applying this grading method to an invalid node, b should be p [Fig. 10(b)] or s [Fig. 10(c)], while a should be r. As r is always bigger than p and s, we can use |G0| in place of r. Since invalid nodes near pattern P have greater grades and are selected first, no invalid node will be united into a valid pattern.

V. DATA STRUCTURE AND OPTIMIZATIONS

Data Structure: In our implementation, bit vectors are used to store patterns and other subsets of the DFG. Bit vectors are ideal for dense subsets but not for sparse subsets; our algorithm involves both, and evaluating the data structure's impact on the algorithm's performance is left for future work. During the enumerating process, some subsets such as G and P need to be passed as function arguments. Passing a large, unchanging bit vector from function to function is inefficient. Instead, we statically allocate an array of bit vectors for each subset argument and pass the index of the array element. The number of bit vectors needed for each subset argument is no more than the number of nodes of the DFG. In our implementation, unite and split have a time complexity of O(|G0|).
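As an illustration of the representation (an illustrative Python sketch using arbitrary-width integers as bit vectors, not the paper's statically allocated implementation), the core set operations are single bitwise instructions per machine word:

```python
def singleton(u):
    # bit vector containing only node u
    return 1 << u

def add_nodes(bv, nodes_bv):
    # set union: add node(s) to a pattern or subset
    return bv | nodes_bv

def remove_nodes(bv, nodes_bv):
    # set difference: drop node(s) from the remaining graph G
    return bv & ~nodes_bv

def member(bv, u):
    # membership test for node u
    return (bv >> u) & 1 == 1

def size(bv):
    # population count; on Python 3.10+ this is bv.bit_count()
    return bin(bv).count("1")

p = add_nodes(singleton(5), singleton(6))  # pattern {5, 6}
g = remove_nodes(0b1111111, p)             # 7-node graph without {5, 6}
```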

Calculation of Pred(G, u), Succ(G, u), Pred(G, P), and Succ(G, P): According to the definitions in Section IV, Pred(G, u) = G ∩ Pred(G0, u) and Succ(G, u) = G ∩ Succ(G0, u), where G0 is the basic block's DFG. Pred(G0, u) and Succ(G0, u) can be computed once and reused, so we perform this computation before the enumerating process. Assuming that Node is the adjacency list storing the DFG G0, Fig. 11 gives the pseudocode to calculate the predecessors of each node. The calculation of Succ(G0, u) is similar.
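Since Fig. 11 is not reproduced here, the precomputation can be sketched as below. The function name and the topological-indexing assumption (every direct predecessor of v has a smaller index than v, which a single forward pass requires) are ours; each Pred(G0, v) is stored as a bit vector:

```python
def all_predecessors(node):
    """node[v] lists the direct predecessors of v; nodes are assumed to be
    numbered in topological order. Returns pred, where pred[v] is a bit
    vector with bit u set iff u is a (transitive) predecessor of v."""
    pred = [0] * len(node)
    for v in range(len(node)):        # topological order: preds already done
        for u in node[v]:
            pred[v] |= pred[u] | (1 << u)
    return pred

# diamond DFG: 0 -> 1, 0 -> 2, {1, 2} -> 3
preds = all_predecessors([[], [0], [0], [1, 2]])
print(preds)  # -> [0, 1, 1, 7]
```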


Fig. 11. Calculate predecessors of a node.

Fig. 12. Calculation of Pred(G, P) and Succ(G, P): (a) in the unite operation; (b) in the split operation.

Fig. 13. Calculation of IN(P) and OUT(P).

As Pred(G, P) and Succ(G, P) are used intensively in the algorithm, we compute them incrementally during the enumerating process (see Fig. 12).

Calculation of IN(P) and OUT(P): The input nodes and output nodes of a pattern change when new nodes are united into the pattern. We calculate them by analyzing the old I/O nodes and the new nodes added to the pattern (see Fig. 13).
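Fig. 13 is likewise not reproduced here; as a reference point, a direct (non-incremental) computation can be sketched. We assume, as is common in the ISE literature, that IN(P) is the set of nodes outside P feeding an edge into P and OUT(P) is the set of nodes in P with an edge leaving P or with no successors at all:

```python
def io_nodes(P, direct_pred, direct_succ):
    """P: set of node ids; direct_pred/direct_succ: adjacency lists.
    Returns (IN(P), OUT(P)) under the assumed definitions above."""
    ins = {u for v in P for u in direct_pred[v] if u not in P}
    outs = {v for v in P
            if not direct_succ[v] or any(w not in P for w in direct_succ[v])}
    return ins, outs

# diamond DFG: 0 -> 1, 0 -> 2, {1, 2} -> 3
dp = {0: [], 1: [0], 2: [0], 3: [1, 2]}
ds = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(io_nodes({1, 2}, dp, ds))  # -> ({0}, {1, 2})
```

The incremental version in Fig. 13 avoids rescanning the whole pattern by updating only around the newly added nodes.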

VI. RESULTS

In this section, we compare the performance of our algorithm (referred to as "split") with the algorithm in [2] ("exhaustive") and the algorithm in [3] ("union"). All comparisons were carried out on a 1.5-GHz Pentium 4 CPU with 512 MB of memory. We used the Pentium time stamp counter to measure the time used by the algorithms.

TABLE I
EXPERIMENTAL DATA

Table I describes the sources of the DFGs used in our experiments. All DFGs are obtained from benchmarks in MiBench [18], compiled using GCC 3.4.3 with option -O3. For the benchmark sha, we use the fully unrolled version. The target architecture is the MIPS II integer instruction set [16]. Some computation-intensive basic blocks from the benchmarks are selected for our experiments, and their DFGs are built by analyzing the benchmarks' assembly code. All memory operations are considered invalid nodes in the DFG. In the experiments that enumerate all valid patterns, we use the DFGs of the basic blocks directly. In the experiments that enumerate only connected valid patterns, we use the largest connected region of a basic block's DFG; this region is either a disjoint subgraph of the basic block's DFG or is separated from the other parts by invalid nodes. In addition, we consider the LI instruction invalid when enumerating connected patterns, as this instruction causes union to spend over 90% of its time on redundant patterns for the benchmark bitcnts.

A. Enumerate All Valid Patterns

Table II compares the performance of split and exhaustive in enumerating all valid patterns. For different input and output limits, the two algorithms produce the same patterns. The search spaces of both algorithms, the number of valid patterns, the time used by split, and the speedup of split over exhaustive are shown in Table II from left to right. The search space of exhaustive is the total number of patterns it considered; the search space of split is the number of calls to the enumerate function. For the small to medium DFGs 1, 2, 3, and 4, split is comparable to exhaustive when the output limit is 1 but is about 1.5 to 4.5 times faster when the output limit increases to 2 and 3. For the large DFGs 5, 6, and 7, split is faster than exhaustive in all cases, with speedups of greater than one order of magnitude, especially when the output limit is more than one.

To explain the performance difference between exhaustive and split, we studied the influence of DFG size and I/O limits on each. Fig. 14 shows the search space per valid pattern of exhaustive and split for different DFGs (in ascending order of size) when the I/O constraint is 2/1. As the DFG size increases, the search space per valid pattern of split shows no obvious growth, while that of exhaustive has a clear tendency to expand. This indicates that, compared to exhaustive, DFG size has a smaller influence


TABLE II
PERFORMANCE COMPARISONS: ENUMERATE ALL VALID PATTERNS

Fig. 14. Search space per valid pattern under I/O constraint 2/1.

on split. By selecting the appropriate node, split's uniting and splitting operations can quickly diminish the problem size by ruling out all nonconvex patterns. For example, for the DFG shown in Fig. 1, based on the grades computed, split will first choose node 3. If node 3 is not included, the problem will be decomposed into two problems of much smaller size. If

Fig. 15. Search space per valid pattern for DFG 5 under different I/O constraints.

node 3 is included, split will continue to choose node 0. If node 0 is also included, nodes 1 and 2 will be included as well, again reducing the problem size by a block of nodes. Furthermore, the splitting and uniting operations also help prune the invalid patterns that violate the I/O constraints. For example, the external input check shown in Fig. 8 is seemingly the same as the permanent input check in exhaustive [2]; in fact, the external input check is more efficient. Take the DFG shown in Fig. 1 as an example again. If the current pattern is {5, 6} and node 4 is not included, then for exhaustive, this pattern has one permanent input (node 4), while for split, it has two permanent inputs (nodes 4 and 3). This is because if node 4 is not included, node 3 is removed from G and thus becomes an external node. Moreover, by analyzing the nodes in the remaining space, split can advance in the "correct" direction. For example, in Fig. 9(a), if the current pattern is {6, 7} and the input limit is 2, split will choose node 1 or 2, as these are the nodes that can reduce the number of inputs, whereas exhaustive will try nodes 5, 4, and 3, and various combinations of them.

Fig. 15 plots the search space per valid pattern of exhaustive and split for DFG 3 under different I/O constraints. For both algorithms, the search space per valid pattern decreases as the number of inputs or outputs increases. Nevertheless, the decrease for split is generally sharper than that for exhaustive, especially when the number of outputs increases. When the I/O constraints become looser, there are more valid patterns while the total number of patterns remains unchanged; hence, the search space per valid pattern has an overall tendency to decrease. However, accompanying the increase in the number of valid patterns is an increase in the invalid patterns that satisfy the output constraint but not the others, or that satisfy the output and convexity constraints but not the input constraint. This causes a substantial expansion of the search space. Split is also affected, but it does not consider nonconvex patterns and has a better way to prune the invalid patterns that violate the input constraint, and thus sees a smaller increase in search space.

We can also observe from Table II that the number of valid patterns does not increase exponentially with the size of the DFG, but it increases very quickly with the number of allowable inputs and outputs. This observation is supported by the following analysis.

Lemma 6: Given a DFG G, an input node set I, and an output node set O, there is at most one convex pattern in G whose input and output nodes are I and O, respectively.


Proof: Assume that there are two such patterns P1 and P2, with u ∉ P1 but u ∈ P2. Obviously, u ∉ O, so there exists a path from u to a node in P1. As P2 is convex, all the nodes on the path are in P2. According to the definition of input nodes, this path cannot contain a node in IN(P2). However, since u ∉ P1, this path must contain an input node in IN(P1). This contradicts the assumption that IN(P1) = IN(P2). It can be shown that the pattern, if it exists, is (O ∪ Pred(G, O)) ∩ (Succ(G, I) − I − Pred(G, I)). □
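The closed form at the end of the proof can be exercised directly. The sketch below is ours (hypothetical helper name, transitive Pred/Succ represented as plain Python sets); it reconstructs the unique candidate pattern for a given I and O:

```python
def pattern_from_io(I, O, pred, succ):
    """Candidate pattern whose input/output node sets could be I and O:
    (O ∪ Pred(G, O)) ∩ (Succ(G, I) − I − Pred(G, I)).
    pred[u]/succ[u] are the transitive predecessor/successor sets of u."""
    above = set(O).union(*(pred[o] for o in O)) if O else set()
    below = set().union(*(succ[i] for i in I)) if I else set()
    excluded = set(I).union(*(pred[i] for i in I)) if I else set()
    return above & (below - excluded)

# diamond DFG: 0 -> 1, 0 -> 2, {1, 2} -> 3
pred = {0: set(), 1: {0}, 2: {0}, 3: {0, 1, 2}}
succ = {0: {1, 2, 3}, 1: {3}, 2: {3}, 3: set()}
print(pattern_from_io({0}, {3}, pred, succ))  # -> {1, 2, 3}
```

Whether the returned set actually has input nodes I and output nodes O must still be checked; Lemma 6 only guarantees that no other convex pattern can.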

Lemma 7: Given a DFG G, its associated G+, input limit IN_LIMIT, and output limit OUT_LIMIT, if V is the set of valid patterns in G, then |V| ≤ |G|^OUT_LIMIT · |G+|^IN_LIMIT.

Proof: To obtain a valid pattern, we first choose a number of nodes from G as output nodes; we have no more than |G|^OUT_LIMIT choices. Then, we choose a number of nodes from G+ as input nodes; we have no more than |G+|^IN_LIMIT choices. In total, we have no more than |G|^OUT_LIMIT · |G+|^IN_LIMIT combinations of input and output nodes. According to Lemma 6, we can obtain at most |G|^OUT_LIMIT · |G+|^IN_LIMIT convex patterns from these combinations. □

Lemma 7 tells us that the number of valid patterns is not exponential in DFG size but is likely to be exponential in the number of inputs and outputs, which explains the aforementioned observation. More importantly, it can be shown by example that without any one of the constraints (input, output, and convexity), the number of valid patterns can be exponential in DFG size. Therefore, it is important to prune the invalid patterns using a balanced method so that the algorithm can be efficient for a wide range of DFGs. This explains why the permanent input check added to the exhaustive algorithm in [2] has a dramatic influence on its performance for some DFGs. Without considering invalid nodes, split obtains a convex pattern in each call of enumerate; yet, the number of convex patterns is still very large, or even exponential in DFG size. The method used to prune invalid patterns that violate the I/O constraints (Section IV) greatly reduces the number of patterns considered, sometimes from an exponential value to a polynomial one.

In addition to DFG size and I/O limits, the DFG topology is another important factor that influences the number of valid patterns and the algorithms' performance. For example, under the I/O constraint 2/1, DFG 3 has 450 valid patterns, while the larger DFG 4 has only 257 valid patterns.

B. Enumerate Connected Valid Patterns

Table III compares split with exhaustive and union for enumerating all connected valid patterns. As in [3], we add a connectivity check to exhaustive after a valid pattern is obtained. All three algorithms produce the same valid patterns. The search space of union is the total number of patterns and cones it considers.

From the results, we can see that for the very small DFGs 1 and 6, split is about 1–12 times faster than exhaustive. For the medium DFGs 2, 3, 4, and 5, split achieves no more than a 20× speedup over exhaustive when the output limit is 1 but is orders of

TABLE III
PERFORMANCE COMPARISONS: ENUMERATE CONNECTED VALID PATTERNS

magnitude faster when the output limit is 2 or 3. For the large DFG 7, split is orders of magnitude faster than exhaustive in all situations. The speedup of split over exhaustive is far larger than that in Table II, mainly because split does not need to consider any disconnected patterns, while exhaustive does.

The results also show that split is considerably faster than union in all situations. Although union restricts the constituent patterns or cones to satisfy the convexity and input or output constraints, their combinations may not, so additional checks are unavoidable. For example, in Fig. 1, if node 4 is not in the pattern, all combinations of the valid upper cones {0, 1, 2, 3, 5}, {1, 2, 3, 5}, {1, 3, 5}, {2, 3, 5}, and {3, 5} with the valid down cone {5, 6} are invalid. In addition, for some regions, such as DFGs 1 and 3, union produces a large number of redundant patterns, which slows it down. We can also see that exhaustive's performance is better than union's: the newly inserted permanent input check [2] substantially improved the performance of exhaustive by greatly reducing its search space. There are some discrepancies between the results reported in [3] and ours, perhaps because [3] used different DFGs: for the same region in the benchmark


bitcnts, they obtained 23 valid patterns with a limit of three inputs and one output, while we obtain 52 valid patterns with a limit of two inputs and one output.

VII. CONCLUSION

This paper has presented a fast algorithm to enumerate all convex subgraphs that satisfy the I/O constraints from the DFG of a basic block. The algorithm can be tuned to determine all such subgraphs or only the connected ones. The experiments show that the algorithm's performance is comparable to that of state-of-the-art algorithms for small DFGs but can be orders of magnitude faster for large DFGs, especially when a custom function is allowed to have two or more outputs. Thus, this technique can be used to quickly identify custom instructions for ISE of embedded processors.

REFERENCES

[1] K. Atasu, L. Pozzi, and P. Ienne, "Automatic application-specific instruction-set extensions under microarchitectural constraints," in Proc. 40th Des. Autom. Conf., Jun. 2003, pp. 256–261.

[2] L. Pozzi, K. Atasu, and P. Ienne, "Exact and approximate algorithms for the extension of embedded processor instruction sets," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 7, pp. 1209–1229, Jul. 2006.

[3] P. Yu and T. Mitra, "Scalable custom instructions identification for instruction-set extensible processors," in Proc. Int. Conf. Compilers, Architectures, and Synth. Embed. Syst., Sep. 2004, pp. 69–78.

[4] P. Biswas et al., "Introduction of local memory elements in instruction set extensions," in Proc. 41st Des. Autom. Conf., Jun. 2004, pp. 729–734.

[5] P. Biswas, S. Banerjee, N. Dutt, L. Pozzi, and P. Ienne, "ISEGEN: Generation of high-quality instruction set extensions by iterative improvement," in Proc. Des. Autom. and Test Eur. Conf., Mar. 2005, pp. 1246–1251.

[6] N. Clark, H. Zhong, and S. A. Mahlke, "Processor acceleration through automated instruction set customization," in Proc. Annu. Int. Symp. Microarchit., Dec. 2003, pp. 129–140.

[7] R. Razdan and M. D. Smith, "A high-performance microarchitecture with hardware-programmable functional units," in Proc. 27th Annu. Int. Symp. Microarchit., Nov. 1994, pp. 172–180.

[8] B. Kastrup, A. Bink, and J. Hoogerbrugge, "ConCISe: A compiler-driven CPLD-based instruction set accelerator," in Proc. 7th IEEE Symp. Field-Programmable Custom Comput. Mach., Apr. 1999, pp. 92–101.

[9] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee, "CHIMAERA: A high-performance architecture with a tightly-coupled reconfigurable functional unit," in Proc. 27th Annu. Int. Symp. Comput. Architecture, Jun. 2000, pp. 225–235.

[10] J. Cong et al., "Instruction set extension with shadow registers for configurable processors," in Proc. ACM 13th Int. Symp. Field-Programmable Gate Arrays, Feb. 2005, pp. 99–106.

[11] P. Ienne, L. Pozzi, and M. Vuletic, "On the limits of processor specialisation by mapping dataflow sections on ad-hoc functional units," Swiss Federal Inst. Technol. Lausanne, Comput. Sci. Dept., Lausanne, Switzerland, Tech. Rep. 01/376, Dec. 2001.

[12] P. Yu and T. Mitra, "Characterizing embedded applications for instruction-set extensible processors," in Proc. 41st Des. Autom. Conf., Jun. 2004, pp. 723–728.

[13] X. Chen and D. L. Maskell, "M2E: A multiple-input, multiple-output function extension for RISC-based extensible processors," in Proc. 19th Int. Conf. Architecture Comput. Syst., Mar. 2006, vol. 3894, pp. 191–201.

[14] Altera Corp. Stratix III device handbook. [Online]. Available: http://www.altera.com/literature/hb/stx3/stratix3_handbook.pdf

[15] Xilinx Inc. Virtex-4 user guide. [Online]. Available: http://direct.xilinx.com/bvdocs/userguides/ug070.pdf

[16] G. Kane and J. Heinrich, MIPS RISC Architecture. Englewood Cliffs, NJ: Prentice-Hall, 1992.

[17] D. Goodwin and D. Petkov, "Automatic generation of application specific processors," in Proc. Int. Conf. Compilers, Architectures, and Synth. Embed. Syst., Nov. 2003, pp. 137–147.

[18] M. R. Guthaus et al., "MiBench: A free, commercially representative embedded benchmark suite," in Proc. 4th IEEE Int. Workshop Workload Characterization, 2001, pp. 3–14.

Xiaoyong Chen received the B.E. degree in computer engineering from Huazhong University of Science and Technology, Wuhan, China, in 1997. He is currently working toward the Ph.D. degree at the School of Computer Engineering, Nanyang Technological University, Singapore.

His research interests include (re)configurable computing, embedded processors and software tools, microprocessor simulation, SoC design, and simulation.

Douglas L. Maskell (S'84–M'85–SM'03) received the B.E. (Hons.), M.Eng.Sc., and Ph.D. degrees in electrical and computer engineering from James Cook University, Townsville, Australia, in 1980, 1985, and 1996, respectively.

He is currently an Associate Professor with the School of Computer Engineering, Nanyang Technological University (NTU), Singapore. He is also the Leader of the Reconfigurable Computing Group, Centre for High Performance Embedded Systems (CHiPES), NTU. His current research interests include dynamic (runtime) reconfigurable computing, including efficient utilization of FPGA hardware and architecture resources for near routeless placement and fast configuration. He also conducts research in a number of embedded systems application areas, including biomedical algorithm acceleration using FPGA, embedded applications and architectures in computational cognitive science, low-complexity digital filters, and low-complexity phase and distance measurement.

Yang Sun received the B.E. degree in computer engineering from Huazhong University of Science and Technology, Wuhan, China, in 2000. She is currently working toward the Ph.D. degree at the Department of Electrical and Computer Engineering, National University of Singapore, Singapore.

Her research interests include embedded system design, distributed computing, performance evaluation, and optimization.