Customizing CPU Instructions for Embedded Vision Systems


Stéphane Piskorski, LRI, Université Paris Sud, email: [email protected]

Lionel Lacassagne, IEF, Université Paris Sud, email: [email protected]

Samir Bouaziz, IEF, Université Paris Sud, email: [email protected]

Daniel Etiemble, LRI, Université Paris Sud, email: [email protected]

Abstract—This paper presents the customization of two processors, the Altera NIOS2 and the Tensilica Xtensa, for fundamental algorithms in embedded vision systems: salient point extraction and optical flow computation. Both can be used for image stabilization in drones and autonomous robots. Using 16-bit floating-point instructions, the architecture optimization is done in terms of accuracy, speed and power consumption. A comparison with a PowerPC Altivec is also done.

Keywords

Customization of CPU instructions, 16-bit floating-point instructions, SIMD extensions, embedded vision system, SoC.

I. INTRODUCTION

Embedded vision systems and SoCs are great scientific challenges. Designers have to integrate more and more CPU power to run ever-hungrier algorithms while limiting power consumption. Adding SIMD instructions within a processor is a good way to tackle the problem of power consumption.

For vision and media processing, the debate between integer and floating-point computing is still open. On one hand, people argue for fixed-point computation instead of floating-point computation. For instance, G. Kolli justifies "using fixed-point instead of floating-point for better 3D performance" in the Intel Graphics Performance Primitives library [7]. Techniques for automatic floating-point to fixed-point conversion for DSP code generation have been presented [11]. On the other hand, people propose lightweight floating-point arithmetic to enable floating-point signal processing in low-power mobile applications [4].

This paper follows previous ones dealing with optimized 16-bit SIMD floating-point instructions and presents a consistent Algorithm-Architecture Adequation, where both architectures and algorithms are customized to provide high-performance embedded systems. Our presentation uses two examples: Harris' salient point, or Points of Interest (PoI), extraction and Horn and Schunck's optical flow computation.

The specificity of these algorithms is that computations cannot be done with only 8-bit integers, as these lack the needed accuracy and dynamic range. A fine-grained study of variable-width floating-point formats (16-bit floats and smaller ones) and their impact on algorithm accuracy is done. As optical flow is an iterative algorithm, mastering the error accumulation and propagation from one iteration to the next is very important.

We then present benchmark results for the PowerPC with Altivec, the NIOS2 and the Xtensa processors, which provide accurate information for the implementation of embedded systems. We finally focus on the power consumption and processor area issues.

II. F16 FLOATING-POINT FORMATS

[Figure: bit layouts. F32: 1-bit sign, 8-bit exponent, 23-bit fraction; F16: 1-bit sign, 5-bit exponent, 10-bit fraction; F13: 1-bit sign, 5-bit exponent, 7-bit fraction, completed with three zero bits ("000").]

Fig. 1. F32, F16 and F13 floating-point numbers

Some years ago, a 16-bit floating-point format called "half" was introduced in the OpenEXR format [14] and in the Cg language [13][10] defined by NVIDIA. It is currently used in some NVIDIA GPUs. It is justified by ILM, which developed the OpenEXR graphics format, as a response to the demand for higher color fidelity in the visual effects industry: "16-bit integer based formats typically represent color component values from 0 (black) to 1 (white), but don't account for over-range values (e.g. a chrome highlight) that can be captured by film negative or other HDR displays. Conversely, 32-bit floating-point TIFF is often overkill for visual effects work."

In the remaining part of this paper, this 16-bit floating-point format will be called "half" or F16, the IEEE-754 32-bit single-precision floating-point format will be called F32, and the 8-bit, 16-bit and 32-bit integer formats will be called I8, I16 and I32.

The "half" format is presented in figure 1. A number is interpreted exactly as in the other IEEE floating-point formats. The exponent is biased with an excess value of 15. Exponent value 0 is reserved for the representation of 0 (fraction = 0) and denormalized numbers (fraction != 0). Value 31 represents infinities (fraction = 0) or NaNs (fraction != 0). For 0 < exponent < 31, the value of the coded floating-point number is (−1)^s × 1.fraction × 2^(exponent−15). The range of the format extends from 2^−14 ≈ 6 × 10^−5 to 2^16 − 2^5 = 65504 for normalized numbers.

To balance the embedded hardware and the algorithm accuracy, we also had a look at an even lighter floating-point coding: F13. It has a 5-bit exponent and a 7-bit mantissa (figure 1). For a hardware implementation, such a format saves a lot of logic area (figure 4). For a software implementation, it is stored in 16-bit words, but we call it F13 because, as we know the computation dynamic range for these algorithms, the 3 msb of the exponent are useless (although not equal to zero, because of the bias). We then have one bit for the sign, 5 useful bits for the exponent (stored in 8 bits) and 7 bits for the mantissa, that is, 13 bits. Finally, F13 can be interpreted as the two "most significant bytes" of F32.
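
To make these conversions concrete, here is a minimal C sketch (ours, not the authors' code) of the truncation-based narrowing described above: F32 is cut down to the 1/5/10 half layout without rounding or denormals, and F13 simply clears the three mantissa lsb; NaNs are simplified into infinities.

    #include <stdint.h>
    #include <string.h>

    /* Truncate an IEEE-754 F32 to the 1/5/10 "half" layout.
       No rounding and no denormals, as in the configuration
       retained in this paper; NaNs collapse into infinities. */
    static uint16_t f32_to_f16_trunc(float f)
    {
        uint32_t u;
        memcpy(&u, &f, sizeof u);                    /* reinterpret bits */
        uint16_t sign = (uint16_t)((u >> 16) & 0x8000u);
        int32_t  e = (int32_t)((u >> 23) & 0xFFu) - 127 + 15; /* rebias */
        uint16_t m = (uint16_t)((u >> 13) & 0x03FFu); /* 10 msb of mantissa */
        if (e <= 0)  return sign;                    /* flush to signed zero */
        if (e >= 31) return (uint16_t)(sign | (31u << 10)); /* infinity */
        return (uint16_t)(sign | (uint16_t)(e << 10) | m);
    }

    /* F13: same 5-bit exponent, 7-bit mantissa, i.e. F16 with the
       3 mantissa lsb cleared (the "000" field of figure 1). */
    static uint16_t f32_to_f13_trunc(float f)
    {
        return (uint16_t)(f32_to_f16_trunc(f) & 0xFFF8u);
    }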

III. SUMMARY OF PREVIOUS RESULTS

In [3][8], we have considered the speedup between versions using SIMD 16-bit floating-point instructions (called F16) and versions using 32-bit floating-point instructions (called F32). Speedups have been measured on general-purpose processors: the Pentium 4 and the PowerPC G5. The following benchmarks have been used:

• Deriche filters (smoother and gradient),
• spline zoom [16],
• JPEG compression [9],
• wavelet transform [15].

For all benchmarks, SIMD F16 provides at least a speedup of 2 compared to SIMD F32. The speedup can even reach ×4 thanks to the smaller F16 cache footprint, when the F16 data fit in the cache but the F32 data do not. From a qualitative point of view, the F16 versions of the algorithms are nearly as good as the F32 ones and better than the usual I32 versions. For example, JPEG F16 results in a 5 dB higher PSNR than I32, and matches the F32 PSNR.

For all benchmarks, we also tested several rounding modes: truncation or standard rounding, with or without denormals. Analysis shows that truncation without denormals is very close to rounding, with or without denormals. This is actually the ideal situation for embedded hardware. In this article, we only consider F16 with truncation and without denormals.

IV. ALGORITHMS PRESENTATION

A. Point of Interest

We used the Harris corner and edge detector [5]. The coarsity K is defined by:

K = Sxx × Syy − Sxy × Sxy    (1)

The computation stages are described in figure 2, where Sxx, Sxy and Syy are the smoothed squared first derivatives of the image I. In a real-time implementation context, Gauss and Sobel 3×3 convolution kernels are usually chosen, but they can be replaced by more efficient filters like the Canny-Deriche filters. Here the main computational problem lies in the dynamic range.

[Figure: a first loop nest computes the gradients Ix and Iy of image I, the products Ixx = Ix·Ix, Ixy = Ix·Iy and Iyy = Iy·Iy, and smoothes them with Gauss filters into Sxx, Sxy and Syy; a second loop nest computes the coarsity K = Sxx·Syy − Sxy².]

Fig. 2. Harris PoI algorithm
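
As an illustration of the two loop nests of figure 2, a minimal scalar C sketch follows; all names are ours and the 3×3 Gauss smoothing helper gauss3x3_inplace is assumed to be provided elsewhere.

    /* Scalar sketch of the Harris PoI pipeline of figure 2. */
    void gauss3x3_inplace(float *p, int w, int h); /* assumed helper */

    void harris_coarsity(const float *Ix, const float *Iy,
                         float *Sxx, float *Sxy, float *Syy,
                         float *K, int w, int h)
    {
        /* first loop nest: squared first derivatives, then smoothing */
        for (int i = 0; i < w * h; i++) {
            Sxx[i] = Ix[i] * Ix[i];            /* Ixx */
            Sxy[i] = Ix[i] * Iy[i];            /* Ixy */
            Syy[i] = Iy[i] * Iy[i];            /* Iyy */
        }
        gauss3x3_inplace(Sxx, w, h);
        gauss3x3_inplace(Sxy, w, h);
        gauss3x3_inplace(Syy, w, h);

        /* second loop nest: coarsity K = Sxx*Syy - Sxy^2 (eq. 1) */
        for (int i = 0; i < w * h; i++)
            K[i] = Sxx[i] * Syy[i] - Sxy[i] * Sxy[i];
    }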

B. Optical Flow

[Figure: from the images I0 and I1, the gradient stages compute Ix, Iy and It; an iterative loop alternates the speed average and the speed update to produce (u, v).]

Fig. 3. Horn & Schunck algorithm

u = ū − Ix × (Ix·ū + Iy·v̄ + It) / (α² + Ix² + Iy²)    (2)

v = v̄ − Iy × (Ix·ū + Iy·v̄ + It) / (α² + Ix² + Iy²)    (3)

The Horn and Schunck optical flow computation is based on an iterative framework, whose steps are:

• the spatio-temporal first derivatives Ix, Iy and It,
• the average speed (ū, v̄),
• the speed update (u, v).

Here, the main problem is the accuracy through all the computations, especially during the division in the speed update stage. Because the divisor differs at every pixel, the division cannot easily be tabulated into a LUT.
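
A minimal C sketch (ours) of the per-pixel speed update of equations (2) and (3):

    /* One Horn & Schunck update at a pixel; u_avg and v_avg are the
       neighbourhood averages, alpha2 the regularization term alpha^2. */
    static void hs_update(float Ix, float Iy, float It,
                          float u_avg, float v_avg, float alpha2,
                          float *u, float *v)
    {
        float num = Ix * u_avg + Iy * v_avg + It;
        float den = alpha2 + Ix * Ix + Iy * Iy; /* differs at every pixel, */
        float t   = num / den;                  /* hence no LUT            */
        *u = u_avg - Ix * t;
        *v = v_avg - Iy * t;
    }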

V. ARCHITECTURES PRESENTATION

A. PowerPC G4 Altivec

To allow a fair comparison with the two hardware targets, we also implemented the algorithms on a PowerPC G4, to get a reference in terms of speed and power consumption. The PowerPC G4 was the first processor to implement the 128-bit SIMD multimedia extension called Altivec [2] by Freescale, Velocity Engine by Apple and VMX by IBM. It is a competitive RISC processor: at 1 GHz, the PowerPC 7457 has a power consumption of about 10 watts.

B. Nios II

The algorithms have been implemented in an Altera Stratix 2 FPGA with 60,000 cells, running at 150 MHz when configured with a stand-alone NIOS 2 processor. We focused on two aspects:

• the impact of optimization parameters for the synthesis of one instruction: the addition,
• the implementation of the complete algorithms.

[Figure: a 10-bit Kogge & Stone carry tree built from the cells C1 to C10 over the bit weights 2^0 to 2^9.]

Fig. 4. Kogge & Stone structure based on BK cells

The 16-bit floating-point adder is the time-critical operator.

Switching from F16 to F13 (7-bit mantissa) or F14 (8-bit mantissa) directly provides a speedup, since such a format saves one level of adders at each step of the computation. Figure 4 details the 10-bit mantissa addition step, implemented with the Kogge & Stone structure [6] based on BK cells [1]. The white part of the figure corresponds to the addition of the 7-bit mantissas of the F13 format.
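
To make the carry tree concrete, here is a small word-level C model (ours, for illustration) of a Kogge & Stone addition: the (generate, propagate) pairs are combined in prefix steps of doubling span, so a 10-bit F16 mantissa needs four rows of cells while the 7-bit F13 mantissa needs only three.

    #include <stdint.h>

    /* Word-level model of a Kogge & Stone adder over `width` bits
       (no carry-in). Each prefix step doubles the span of the
       (generate, propagate) pairs, i.e. one row of BK cells. */
    static uint32_t kogge_stone_add(uint32_t a, uint32_t b, int width)
    {
        uint32_t mask = (1u << width) - 1u;
        uint32_t g = a & b;                    /* bitwise generate   */
        uint32_t p = a ^ b;                    /* bitwise propagate  */
        for (int d = 1; d < width; d <<= 1) {  /* prefix combination */
            g |= p & (g << d);
            p &= p << d;
        }
        uint32_t carry = (g << 1) & mask;      /* carry into each bit */
        return ((a ^ b) ^ carry) & mask;
    }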

With NIOS II processors, the customized instructions can be implemented as combinational (1-cycle) or multicycle instructions. Multicycle operations are not pipelined: the pipeline is stalled until the operation's done signal becomes active. Finding the optimal trade-off between the operation latency (number of cycles) and the maximal CPU clock frequency is thus important. This is why we have tested 1-cycle and 2-cycle versions of the 16-bit floating-point arithmetic instructions.

C. Xtensa

Some benchmarks were tested using an Xtensa LX customizable processor from Tensilica Inc. (www.tensilica.com) to compare F16 and F32. The Xtensa LX core is a 32-bit RISC processor targeting SoC designs, with a configurable instruction set. A proprietary Verilog-like language called TIE can be used to add custom instructions, register files or I/O ports, which are automatically taken into account by the provided hardware simulation and software compilation tools. Floating-point instructions with custom formats could thus be easily implemented and immediately tested and compared with each other. They were also compared to the native Xtensa LX FPU (manipulating data referred to as floats hereafter), which provides IEEE-compliant floating-point instructions with a 4-cycle latency.

The Xtensa LX processor simulated in our tests was configured with a 128-bit wide memory interface, a 32-KB 4-way associative cache memory and a 5-stage pipeline. An FPU with a 16-entry register file was incorporated. All speed and area estimates in this section assume an ASIC implementation with a 90 nm GT process geometry. A hardware loop counter is also provided, which avoids branch mispredictions since the image size is known at compile time in our algorithms.

Custom floating-point instructions were implemented using TIE, with the mantissa and exponent widths as parameters: an F16 version and an F32 one, both without subnormals and with truncated calculations. Table I lists the main arithmetic functions that were implemented, along with their estimated area (given by the Xtensa development tools). Instruction decoding, load/store instructions and other miscellaneous logic require about 25% additional area. The custom floating-point instructions use dedicated custom register files, also implemented with the TIE language.

Block                     F16     F32     F16 SIMD (8 elems.)
Add                       1384    2666    11470
Mul                       1695    4430    13480
×2^n                       158     275     1197
/2^n                       124     221      931
Byte→Fxx                   273     493     1885
Fxx→Byte                  1469    2068    11476
Reg. file (16 entries)    5306    9481    34716
Total                    10409   19634    75155

TABLE I. Main TIE logic blocks and estimated sizes (gates)

VI. QUALITATIVE APPROACH

For the optical flow computation, we have tested the following configurations:

• F64: the reference scheme, which can be seen as the high-precision Matlab computation,
• F32: the usual floating-point implementation,
• F16: our proposition,
• F13: our highly embedded proposition,
• I16: the integer competitor of F16,
• I32: the integer competitor of F32.

The classic methodology is to study the accuracy of the floating-point algorithm and then to convert it into a fixed-point algorithm. In the case of the optical flow, the first two steps (first derivatives and speed average) use division operators. The I16 implementation keeps them in place, but the I32 implementation postpones them to the final step, which requires more bits to hold the temporary results: the 1/16 Gauss coefficient is removed, like the 1/4 Sobel coefficient and the 1/12 factor of the average speed computation.

The I16 and I32 versions use radix-8 fixed-point computations, providing an 8-bit fractional part and an 8-bit integer part for the 16-bit version. The speed update equation remains unchanged in I16 but is modified for the I32 version as follows:

u = (1/12) × [ū − Ix × (Ix·ū + Iy·v̄ + 12 × 2^8 × It) / (16α² + Ix² + Iy²)]    (4)
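
For reference, a minimal C sketch (ours) of the radix-8 (Q8.8) fixed-point arithmetic used by the I16 version:

    #include <stdint.h>

    /* Radix-8 (Q8.8) fixed point: 8-bit integer part, 8-bit fraction. */
    typedef int16_t q8_8;

    static q8_8  q8_8_from_float(float f) { return (q8_8)(f * 256.0f); } /* f * 2^8 */
    static float q8_8_to_float(q8_8 x)    { return (float)x / 256.0f; }

    /* a 32-bit intermediate holds the product before renormalization */
    static q8_8 q8_8_mul(q8_8 a, q8_8 b)
    {
        return (q8_8)(((int32_t)a * (int32_t)b) >> 8);
    }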

To optimize speed on the PowerPC, F32 is used for accurate computations, but 16-bit storage is used to reduce the cache footprint. Conversions or truncations occur after the computations:

• F32→I16: F32 computations are stored as I16,
• F32→F16: F32 computations are stored as F16,
• F32→F13: F32 computations are stored as F13.

For I16 storage, a radix 8 is used and implemented with the dedicated instruction i = vec_cti(f, 8), which computes i = f × 2^8. For F16 storage, both the mantissa and the exponent are truncated (while the exponent bias changes from 127 to 15). F13 storage simply requires the mantissa truncation.
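
As an illustration of this storage path, a hedged Altivec sketch follows; vec_ld, vec_cti and vec_packs are standard Altivec intrinsics, while the function name, the 16-byte alignment assumption and the absence of loop handling are ours.

    #include <altivec.h>  /* compile with -maltivec */

    /* Convert and store 8 F32 results as radix-8 I16: vec_cti scales
       by 2^8 during the float-to-int conversion, vec_packs narrows
       with saturation. Pointers are assumed 16-byte aligned. */
    static void store_f32_as_i16_radix8(const float *src, signed short *dst)
    {
        vector float f0 = vec_ld(0,  src);        /* first 4 floats  */
        vector float f1 = vec_ld(16, src);        /* next 4 floats   */
        vector signed int i0 = vec_cti(f0, 8);    /* i = f * 2^8     */
        vector signed int i1 = vec_cti(f1, 8);
        vec_st(vec_packs(i0, i1), 0, dst);        /* 8 saturated I16 */
    }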

[Figure: mean square error versus iterations (0 to 200) for the F64, F32, F32→F16, F16, F32→F13, I32, F32→I16 and I16 versions; the F64, F32 and F16 curves overlap, while I16, F32→I16 and I32 level off at a higher error.]

Fig. 5. Absolute convergence of the optical flow versions

To estimate the impact of the number coding, two images were created, the second one with an offset of (3, 1) pixels. The images were then downscaled by a factor of 4, leading to an apparent motion of (3/4, 1/4) pixel. Figure 5 represents the mean square error between the estimated flow and the real flow. Note that no interpolation has been introduced between Horn & Schunck iterations, in order to avoid introducing any other source of approximation in the computation.

We can see that I16 and I32 converge to a bigger error than F16. We can also notice that there is no difference between F32 and F64, and that F16 is very close to F32. This is evidence of the interest of F16 in image processing algorithms.

Obviously, F32→F16 is better than F16, since the computations are done with a greater accuracy. But what is also very important is that F32→I16 is close to I32 with half the memory footprint, yet leads to a bigger error than F32→F13, which lies between F16 and I32. That reinforces the use of F13, for both the PowerPC and the hardware implementation.

Focusing now on the first ten to thirty iterations of the algorithm (figure 6), instead of hundreds of iterations which are somewhat irrelevant for real-time execution on embedded systems, we can see that all the floating-point versions overlap: the F16 and F13 versions are as accurate as the F32 and F64 versions.

[Figure: zoom on iterations 5 to 30 of the convergence curves (MSE from 0.3 to 0.55); the F16, F32 and F64 curves overlap, with I16, F32→I16 and I32 above them.]

Fig. 6. First iterations of the optical flow convergences

For PoI, the dynamic range is the main issue. Since the input data is an 8-bit image, the output K can require as many as 32 bits. Since the IEEE-754 F16 maximum value is approximately 2^16, there are two possibilities to fix the problem:

• normalizing the input interval to [0 : 1] with a division by 256, which can take place during the conversion from I8 to F16,
• replacing the −15 bias by 0 or by a positive value.

Since computations are performed with truncation, we have tested two configurations: division by 16 and division by 256. The results are very close. The accuracy was estimated through the extraction of PoI on the whole Movi house sequence [12]. We provide, for variable-width mantissas ranging from 4 bits to 10 bits, the minimum, average and maximum PSNR between the custom F16 and F32 over the 120 images of the sequence.

float    mantissa size    min     avg     max
F16      10               78.9    82.0    85.5
F15       9               72.8    76.0    79.9
F14       8               66.0    68.8    72.2
F13       7               58.9    61.8    64.6
F12       6               52.3    55.4    58.3
F11       5               46.9    50.0    53.0
F10       4               42.3    44.5    45.4

TABLE II. PSNR (dB) between variable-width F16 and F32 for PoI

Again, F16 is very close to F32, and the F13 version with a 7-bit mantissa performs well. We can also note that, because of the nature of the algorithm (a multiplication of the horizontal gradient by the vertical gradient), it is very robust to short number codings.
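
For reference, a minimal C sketch (ours) of the PSNR metric used in table II, computed against the F32 reference:

    #include <math.h>

    /* PSNR (dB) between the F32 reference and a custom-format result
       over n values; peak is the dynamic range of the data. */
    static double psnr(const float *ref, const float *test, int n, double peak)
    {
        double mse = 0.0;
        for (int i = 0; i < n; i++) {
            double d = (double)ref[i] - (double)test[i];
            mse += d * d;
        }
        mse /= (double)n;                 /* assumes mse > 0 */
        return 10.0 * log10(peak * peak / mse);
    }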

VII. QUANTITATIVE APPROACH

In this section, we provide the execution time results for the three architectures. The metric used is cpp, the number of clock cycles per pixel:

cpp = (t × Freq) / n²    (5)

where t is the execution time, Freq the clock frequency and n² the number of pixels of an n × n image.

We believe that cpp is more useful than cpi to compare different architectures, since it can also help to compare the complexity-per-point of different algorithms. Moreover, cpp is also useful to detect cache misses from one image size to another.
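
Equation (5) translates directly into a small helper (ours):

    /* cpp (eq. 5) for an n x n image processed in t seconds at freq_hz */
    static double cpp(double t, double freq_hz, int n)
    {
        return t * freq_hz / ((double)n * (double)n);
    }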

A. Quantitative approach with PowerPC

For the PowerPC implementation, we started from a full F32 version where the image is coded with F32 pixels instead of I8: such a version provides a rough idea of the performance and is also easy to develop. Then we switched to I8: the internal loop had to be unrolled four times, since a vector packs 16 I8 pixels but only 4 F32 pixels. Such a scheme provides additional ways to optimize the code:

• software pipelining: to reduce the data dependency stress within a loop iteration (for PoI, it avoids storing the first derivatives into memory) and to reduce the total number of cache accesses,
• data interlacing: to reduce the number of active references in the cache,
• data packing: F32 points are stored as F13 to halve the cache footprint.

[Figure: cpp versus image size (up to 1200) for the SIMD optical flow on the G4: the I8→F32 and I8→F32→F16 versions, each with and without data interlacing.]

Fig. 7. Optical flow on PowerPC Altivec

We provide 4 versions: I8→F32 and I8→F32→F13, with and without data interlacing. For F13, we get a very dense code where data interlacing has a small impact compared to the I8→F32 version. In the end, we achieve a ×2 speedup over the basic SIMD version.

[Figure: cpp versus image size for the SIMD PoI on the G4: I8→F32 planar, I8→F32 with software pipelining, with pipelining and interlacing, and I8→F32→F13 with pipelining and interlacing.]

Fig. 8. PoI on PowerPC Altivec

B. Quantitative approach with Xtensa

For the Xtensa LX implementation, the image processing benchmarks (PoI and Optical Flow) were written in C. A first version used integer values (referred to as int hereafter). A second one used native floats: data had the C language float type, and calculations were made using the Xtensa LX's FPU and the associated 16-entry register file. Two other versions, using F16 and F32, were written with C intrinsics calling our custom floating-point instructions. Since the number of cycles needed to perform computations with our custom instructions depends on the desired operating frequency, simulations were made with several assumptions: an ideal 1- or 2-cycle latency case, and a worst case where the custom instructions have the same 4-cycle latency as the native ones.

The results also depend on the latency of the memory connected to the system. Simulations were first run with the hypothesis of a latency-free memory, which is totally unrealistic but shows the speedup directly brought by the faster instructions. They were then run with memory accesses including the cache behaviour. Finally, a 16-bit SIMD version (8 points per data vector, because of the 128-bit wide memory interface) was tested to get an idea of what performance can be achieved when using F16 at its full potential.

The programs were compiled using the GCC toolchain provided with the Xtensa development environment, with full optimization (GCC option -O3) and for various square image sizes. All the benchmarks started with a completely polluted cache when applicable.

Figures 9, 10, 11 and 12 present the number of cycles needed to compute each algorithm. Table III shows the number of cycles spent processing data cache misses. Finally, table IV shows the number of cycles spent waiting because of data dependencies between instructions in the pipeline.

The simulations run with a zero-latency memory show that F16 computation does not bring a significant speedup by itself compared to its 32-bit equivalent (the results are even contrary to what was expected when compared to native floats, since the compiler is able to optimize the latter better).

However, simulating the cache behaviour highlights the benefit of F16: their twice-smaller size leads to half as many cache misses, which is a real advantage as the processor/memory speed gap increases. Furthermore, this also doubles the number of operations per instruction for a constant SIMD vector size and hardware cost, since the HDL operators shrink in the same proportions.

Fig. 9. PoI's cpp - cache misses not simulated

Fig. 10. PoI's cpp - data cache and memory latencies simulated

Besides, table IV shows the reduction of the number of pipeline stalls due to data dependencies: the reduced number of cycles needed to compute F16 helps the compiler reorder the instructions for a better pipeline efficiency.

image size            64     128    256
Optical Flow
int, floats, F32      0.62   0.63   0.63
F16, F16 SIMD         0.38   0.38   0.38
PoI
int, floats           0.81   0.81   0.81
F32                   0.82   0.82   0.82
F16                   0.44   0.44   0.44
F16 SIMD              0.45   0.45   0.44

TABLE III. Xtensa - data cache misses per point

Fig. 11. Optical Flow algorithm - cache misses not simulated

Fig. 12. Optical Flow algorithm - data cache and memory latencies simulated

C. Quantitative approach with NIOS2

Two kinds of synthesis have been performed:

• structural version: the architecture is only described with structural elements,
• functional version: structural description but with functional comparators.

The reduction of the FPU size when switching from a 4-stage addition to a 3-stage addition is not proportional to the area, but the implementation of F13 instead of F16 provides a higher clock rate with the same area (number of cells) in the case of retiming for the structural version. The speedup due to retiming for F16 and F13 is [×2.3 : ×2.8] while the area increase is only [×1.5 : ×1.6]. Finally, F13 combined with retiming yields an overall speedup of ×3 with only a ×1.5 area increase.

For the NIOS2, three versions of the operators were designed:

• F16-2: 2-cycle latency multiplication and addition,
• F16-1.5: 1-cycle multiplication and 2-cycle addition,
• F16-1: 1-cycle multiplication and addition.

image size            64      128     256
PoI
floats                19.71   20.35   20.67
F32                   14.17   14.58   14.79
int, F16, F16 SIMD     1.00    1.00    1.00

TABLE IV. Xtensa - instruction interlocks per point

                        structural        functional
FP format               Freq    cells     Freq    cells
F16 without retiming     48.0    509       52.5    701
F16 with retiming       110.0    764       99.1    937
F13 without retiming     51.9    487       52.6    853
F13 with retiming       143.0    781      132.0   1143

TABLE V. NIOS2: optimization of the addition operator (Freq in MHz)

For these three versions we provide the maximum frequency and the cpp, and compare the 2-way SIMD F16 versions (SIMD2) to the scalar I32 version.

                          image size
design     Freq    64      128     200     256     260
scalar     150     300.5   310.2   219.3   315.6   227.4
F16-2      130     211.3   229.7   204.0   239.4   208.0
F16-1.5     92     179.0   194.5   167.8   202.8   171.4
F16-1       80     136.5   148.2   120.0   154.6   123.0

TABLE VI. cpp for PoI on NIOS2 (Freq in MHz)

We can notice that the cpp values are better for sizes that are not powers of two (200 and 260). This comes from the NIOS2 direct-mapped data cache. A smart implementation should pad the end of each image line to get "unaligned" line starts.
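
A minimal C sketch (ours) of such padding:

    #include <stdlib.h>

    /* Pad each image line by a few pixels so that consecutive line
       starts do not alias to the same direct-mapped cache sets when
       the width is a power of two. PAD = 8 is an arbitrary choice. */
    #define PAD 8

    static float *alloc_padded(int w, int h, int *stride)
    {
        *stride = w + PAD;  /* "unaligned" line starts */
        return malloc((size_t)(*stride) * (size_t)h * sizeof(float));
    }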

VIII. EMBEDDED APPROACH

To estimate the embedded capability of each architecture, we compute the energy consumption as a rough estimate of the power consumption multiplied by the execution time. We do not try to get accurate results, but trends with which to compare the three architectures.

A. Embedded approach with PowerPC

image size      64     128    256    512
Optical Flow
cpp             7.5    11.6   19.6   28
t (ms)          0.03   0.19   1.24   7.34
energy (mJ)     0.30   1.9    12.8   73
PoI
cpp             22.6   29     40     53.2
t (ms)          0.09   0.48   2.66   13.9
energy (mJ)     0.92   4.75   26.6   139

TABLE VII. Embedded performance of the PowerPC G4

B. Embedded approach with Xtensa

A rough estimate of the Xtensa LX processor power consumption was made by assuming it proportional to the chip area computed by the Xtensa tools, since our academic license did not permit a better estimation. This leads to about 250 mW when adding scalar F16 instructions and 350 mW when adding SIMD F16 instructions, with the same basic assumptions (core, caches, process geometry, etc.) at 500 MHz. This gives an idea of the reachable electric consumption, as given in table VIII, where the results are given for a realistic average case (10:5:5:5 memory latencies, 2-cycle F16 instructions) instead of the worst case previously described in figures 10 and 12.

image size               64      128     256     512
Optical Flow F16
cpp                      157.8   160.0   162.1   n.a.
t (ms)                   1.3     5.2     21.2    n.a.
energy (mJ)              0.3     1.3     5.3     n.a.
PoI F16
cpp                      158.2   162.4   164.3   n.a.
t (ms)                   1.3     5.3     21.5    n.a.
energy (mJ)              0.4     1.4     5.4     n.a.
Optical Flow F16 SIMD
cpp                      26.6    26.8    27      n.a.
t (ms)                   0.2     0.9     3.5     n.a.
energy (mJ)              0.1     0.3     1.2     n.a.
PoI F16 SIMD
cpp                      30.3    30.2    30.1    n.a.
t (ms)                   0.25    1.0     4.0     n.a.
energy (mJ)              0.1     0.4     1.4     n.a.

TABLE VIII. Rough estimates of the Xtensa LX performance

C. Embedded approach with NIOS2

design     freq. (MHz)   power (mW)
scalar     150           1303
F16-2      130           1307
F16-1.5     92           1145
F16-1       80           1123

TABLE IX. Maximum frequency and power consumption of the NIOS2

To estimate the power consumption of the three F16 versions, we used PowerPlay from Quartus, with the pessimistic assumption of 25% of the signals being active at each cycle. We can see that all the F16 versions have a smaller or equal consumption compared to I32, together with a smaller frequency and a smaller cpp.

                          image size
design     freq.   64     128    200    256    260
scalar     150     2.61   2.69   1.90   2.74   1.98
F16-2      130     2.12   2.31   2.05   2.41   2.09
F16-1.5     92     2.23   2.42   2.09   2.53   2.14
F16-1       80     1.93   2.08   1.68   2.17   1.73

TABLE X. NIOS2 energy consumption for PoI (µJ/pixel)

Due to the small parallelism (2 operations per instruction), the speedup of SIMD F16 compared to scalar I32 is small. More importantly, the frequency of the 1-cycle F16 version is half that of the scalar version, making it easier to interface the FPGA with the memory. We can also notice that with such 1-cycle operators there are no more data dependencies in the code, which also decreases the impact of the compiler quality.

IX. CONCLUDING REMARKS

                          image size
archi.        freq.   64      128      256
NIOS II         80    7.848   34.085   113.16
PowerPC G4    1000    0.920    4.750    26.60
Xtensa         500    0.087    0.346     1.38

TABLE XI. PoI energy consumption (mJ): architecture comparison (freq. in MHz)

We have continued the evaluation of 16-bit floating-point operators and instructions. While previous results focused on the impact of 16-bit SIMD on general-purpose processors and automatic code vectorization, we have extended the evaluation of short and customizable floating-point formats to embedded vision systems and SoCs, through the implementation of image stabilization algorithms whose accuracy and dynamic range requirements let F16 assert itself. We got promising results on each customized architecture.

The Xtensa achieves the same level of performance as the Altivec, but with a ×10 smaller energy consumption. The NIOS2 efficiency is not as high as that of the PowerPC or the Xtensa, but this soft-core implementation of F16 can provide low-end embedded systems with floating-point capabilities and flexibility. Its main advantage is to cut the development time compared to classical FPGA design, and also to be ready to integrate into embedded vision systems, like the one currently developed in the IEF laboratory. We can also note that, once optimized, the widespread PowerPC G4 remains a good challenger for embedded applications.

We are currently developing an Altivec-compatible ISA for the Xtensa (with simulation of the 4-operand instructions that are specific to Altivec), to accelerate the port of existing applications and to facilitate comparisons between the different architectures. The last but not least advantage of such a strategy is to avoid re-developing the high-level code and data transformations when an efficient Altivec version already exists.

Future work will consider more robust and sophisticated embedded motion compensation algorithms, and techniques for the automatic exploration of configurations according to speed, power consumption and hardware resource metrics.

REFERENCES

[1] R.P. Brent, H.T. Kung, "A Regular Layout for Parallel Adders", IEEE Transactions on Computers, vol. 31, no. 3, pp. 260-264, 1982.

[2] K. Diefendorff, P.K. Dubey, R. Hochsprung, H. Scales, "Altivec extension to PowerPC accelerates media processing", IEEE Micro, March-April 2000.

[3] D. Etiemble, L. Lacassagne, "16-bit FP sub-word parallelism to facilitate compiler vectorization and improve performance of image and media processing", in Proceedings of ICPP 2004, Montreal, Canada.

[4] F. Fang, T. Chen, R.A. Rutenbar, "Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform", EURASIP Journal on Applied Signal Processing, Special Issue on Applied Implementation of DSP and Communication Systems.

[5] C. Harris, M. Stephens, "A combined corner and edge detector", 4th Alvey Vision Conference, pp. 147-151, 1988.

[6] P.M. Kogge, H.S. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations", IEEE Transactions on Computers, vol. 22, no. 8, pp. 786-793, 1973.

[7] G. Kolli, "Using Fixed-Point Instead of Floating-Point for Better 3D Performance", Intel Optimizing Center, http://www.devx.com/Intel/article/16478

[8] L. Lacassagne, D. Etiemble, "16-bit floating-point instructions for embedded multimedia applications", in IEEE CAMP 2005, Palermo, Italy.

[9] C. Lee, M. Potkonjak, W.H. Mangione-Smith, "MediaBench: a tool for evaluating and synthesizing multimedia and communication systems", in Proceedings of the MICRO-30 conference, Research Triangle Park, NC, December 1997.

[10] W.R. Mark, R.S. Glanville, K. Akeley, M.J. Kilgard, "Cg: A system for programming graphics hardware in a C-like language", ACM Transactions on Graphics (SIGGRAPH 2003), vol. 22, no. 3, 2003.

[11] D. Menard, D. Chillet, F. Charot, O. Sentieys, "Automatic Floating-point to Fixed-point Conversion for DSP Code Generation", in International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES 2002).

[12] IRISA Movi group: http://www.irisa.fr/texmex/base_images/index.html

[13] NVIDIA, Cg User's Manual, http://developer.nvidia.com/view.asp/?IO=cg_toolkit

[14] OpenEXR, http://www.openexr.org/details.html

[15] A. Said, W.A. Pearlman, "A New Fast and Efficient Image Codec Based on Set Partitioning in Hierarchical Trees", IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, pp. 243-250, June 1996.

[16] M. Unser, "Splines: A Perfect Fit for Signal and Image Processing", IEEE Signal Processing Magazine, November 1999, pp. 22-38.

[17] T.R. Halfhill, "Tensilica tackles bottleneck", Microprocessor Report, May 31, 2004.