
National Conference on Advanced Computing and Communication Technology | ACCT-10|

828

FPGA based Network-on-Chip Designing Aspects

Saourbh Singh Mehra
Department of ECE, Vaish College of Engg, Rohtak, Haryana, India
[email protected]

Rajni Kalsen
Department of ECE, Vaish College of Engg, Rohtak, Haryana, India
[email protected]

Rajeev Sharma
Department of ECE, Vaish College of Engg, Rohtak, Haryana, India
[email protected]

Abstract - Networks-on-Chip (NoCs) are a recent solution paradigm adopted to increase the performance of multi-core designs. The key idea is to interconnect the various computation modules (IP cores) in a network fashion and transport packets simultaneously across them, thereby gaining performance. In addition to improving performance by having multiple packets in flight, NoCs also present a host of other advantages, including scalability, power efficiency, and component re-use through modular design. In this paper, we present the alternatives in communication architecture, followed by a detailed description of a Network-on-Chip. We conclude the paper by discussing recent research in the NoC area and its various implementations.

Keywords: FPGA, Network-on-Chip (NoC), Routing, System-on-Chip (SoC)

1. INTRODUCTION

FPGAs comprise a large number of programmable logic blocks and interconnects used to implement user applications. Recently, FPGA companies have adopted a coarse-grained approach that combines the fine-grained reconfigurable resources with hard embedded cores that match their ASIC counterparts in performance, power and area. FPGAs are presently utilized in application domains ranging from mobile portable devices to space devices. They increasingly replace ASICs as the target technology of choice due to their increasing device sizes and operating frequencies.

A System-on-Chip (SoC) design integrates processors, memory, and a variety of IPs in a single design. Due to FPGA capabilities and high time-to-market pressures, complex SoC designs are increasingly targeted to FPGAs. Traditionally, cores in FPGAs are connected using bus-based architectures. NoCs have been proposed as an alternative that eliminates the inherent performance bottleneck of bus-based architectures. In addition to an increase in performance, NoCs present a host of other advantages, particularly for FPGA designs.

1.1. Energy Efficiency

FPGAs are a suitable implementation platform for portable devices and therefore, low power operation is a critical requirement.

Figure 1. Module for on-chip communication via the Network-on-Chip approach using a Virtex2-Pro FPGA from Xilinx [1].

As opposed to bus-based architectures, NoCs consume less power due to less switched capacitance (shorter lines) [2]. Further, due to a high level of parallelism in communication, the overall energy requirement is comparatively low. Thus, NoC-based FPGAs are energy efficient and are beneficial for use in miniature devices.

1.2. Design Reuse

Due to their hierarchical approach, Networks-on-Chip bring a rapid reduction in design and verification time. Further, development of IP cores can be independent from the application and other parts of the design. A steady increase is expected in the percentage of design components re-used, which supports NoC as a design paradigm.

2. COMMUNICATION ARCHITECTURE ALTERNATIVES

This section presents various ways of implementing communication architectures in FPGAs and devices based on them. Some of the main trade-offs involved in choosing an interconnect mechanism are available throughput, static-schedule requirement, switch/interconnect area overhead, signal integrity and bandwidth guarantee. Based on these trade-offs, the interconnection mechanisms can be broadly classified into the following:


2.1. Dedicated FPGA Interconnects

Present FPGA devices support dedicated interconnects. These are spatially distributed FPGA resources configured through programmable switches. The latency of this type of communication is very low and there is guaranteed bandwidth to support the communication. However, the interconnect utilization is extremely low, as the dedicated connections are almost never time-multiplexed for a different communication. With the limited resources available within the FPGA and with increasing design complexities, it is challenging to preserve signal integrity with this form of interconnect.

2.2. Time-muxed Interconnects

Time-muxed interconnects offer high-throughput connections with very high interconnect utilization. This approach requires all the communication schedules to be known off-line. With an increasing number of core communications, the area requirement for context memory offsets the gain achieved in throughput and interconnect utilization [3].

2.3. Circuit-Switched Interconnects

In a circuit-switched mechanism, the resources utilized for the communication architecture can be time-multiplexed. The latency of such a network is minimal and the communication schedules can be determined online. However, preserving signal integrity for larger designs is a big challenge, as a circuit-switched connection, once established, follows a synchronous scheme [4].

2.4. Packet-Switched Networks

The central idea behind this structure is to transport data across modules in the form of packets. Multiple packets are in flight from several (source, destination) pairs, thereby increasing the overall performance. Similar to circuit-switched interconnects, this form of communication is established online and does not require static scheduling. The packetization/serialization overhead involved, along with the lack of bandwidth guarantees, are the main drawbacks of this approach [5].

3. ON-CHIP NETWORK

The main modules present in an on-chip network are the IP cores (computation units), the Network Interface (NI) and the NoC backbone.

3.1. IP Cores

The Intellectual Property (IP) cores form the computation elements of the application. These cores are developed independently and often re-used in different parts of the application. To handle current design complexities, there is a shift in the design trend to a more modular, IP-based approach.

Figure 2. Network-on-Chip components (CPU, routers etc.)

3.2. Network Interface (NI)

The NI forms the glue logic between the IP cores and the network backbone. It standardizes the interface between the IPs present in a network and the routers. Further for packet-switched networks, it also packetizes the outgoing data from the IP and injects it to the network.

When an IP receives data from the network, the NI grants the request of the downstream router to receive the packets. The NI is customized to suit the requirements of a particular network backbone. Research is underway to standardize these network interfaces and IP communication protocols.

3.3. Network-on-Chip Backbone

The router forms the heart of the NoC backbone. It is responsible for transporting packets that originate from the IP cores. Traditionally, a router in a mesh network has four directional ports (North, East, South and West) to communicate with the neighboring routers. Further, it has at least one local port through which an IP core is interfaced to the network. Upon receiving a packet, the router decodes, buffers and routes it in the appropriate direction based on the destination node.
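The receive-buffer-route behavior described above can be sketched in software. The Python model below is purely illustrative: the `Router` class, the port names, the dictionary-based packet format and the pluggable `route_fn` callback are our own inventions for exposition, not any real NoC RTL.

```python
from collections import deque

PORTS = ("N", "E", "S", "W", "L")  # four directional ports plus one local port

class Router:
    """Minimal sketch of a mesh router: buffer incoming packets per input
    port, then forward each one toward its destination via a routing
    callback. Illustrative only."""

    def __init__(self, x, y, route_fn):
        self.x, self.y = x, y              # this router's mesh coordinates
        self.route_fn = route_fn           # decides the output port for a packet
        self.in_buf = {p: deque() for p in PORTS}

    def receive(self, port, packet):
        self.in_buf[port].append(packet)   # buffer the packet on arrival

    def step(self):
        """Route one buffered packet per input port.
        Returns a list of (output_port, packet) pairs."""
        forwarded = []
        for port in PORTS:
            if self.in_buf[port]:
                pkt = self.in_buf[port].popleft()
                out = self.route_fn(self.x, self.y, pkt["dest"])
                forwarded.append((out, pkt))
        return forwarded
```

For example, a router at (1, 1) given a trivial "go east until the destination column" policy would forward a packet destined for (2, 1) out of its East port.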

4. NETWORK-ON-CHIP ASPECTS

An on-chip interconnection network can be described by a set of design choices called network aspects. Choices of suitable network aspects have a large impact on the design metrics of the NoC. Topology, Flow Control, Arbitration, Buffering and Routing are the main aspects of a network.


4.1. Topology

The network topology comprises the arrangement and connectivity of the routers. 2D Mesh, Ring and Star are some of the popular network topologies. The quality of a network can be defined in terms of some of its characteristics, namely bisection bandwidth, degree and diameter. The choice of an appropriate topology has an impact on performance, area utilized and power consumed. Based on interconnect requirements, area and power, 2D-mesh networks have been found suitable for many target devices [6]. In addition, mesh networks can use a simple, deadlock-free XY routing mechanism.

4.2. Flow Control

The three main alternatives for flow control in an NoC are Store and Forward, Virtual Cut-Through, and Wormhole. In the Store & Forward technique, the outgoing packet is completely buffered at the downstream router before making the next hop. This technique, used in [7], sustains high packet latency (directly proportional to packet size). In contrast, the Virtual Cut-Through technique has low latency, as the header progresses without waiting for the rest of the packet. However, the buffer requirements are multiples of the packet size for both of the above approaches. The Wormhole technique sustains low latency with minimum buffer requirements, with a packet residing across multiple nodes, thereby increasing the complexity of the switch.

4.3. Routing

The routing mechanism determines the decisions taken while a packet is in flight. The route established for a packet-based communication can be determined either at the source node (Source Routing) or independently across the routers of the network (Distributed Routing). In the case of source routing, the first part of a packet is the header, which contains the entire route of the packet. Upon decoding the header, the router passes the packet to the appropriate downstream router. In the case of distributed routing, the route decisions can be made throughout the network by the routers (depending on congestion and other network conditions).

Another classification of routing mechanisms is based on adaptivity to network conditions. In deterministic routing, the route between any (source, destination) pair is always the same. XY routing is a widely used, low-complexity routing mechanism that is deadlock free. On the flip side, a deterministic routing mechanism cannot alter routes based on network traffic conditions. Paths taken using adaptive routing can vary and be non-minimal. However, the switch incurs additional complexity to support this mechanism.
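As a concrete sketch of deterministic XY routing, the routine below corrects the X coordinate first and only then the Y coordinate, which is what makes the scheme deadlock free in a 2D mesh. The port names and the "y grows southward" convention are assumptions made for illustration.

```python
def xy_route(cur, dest):
    """Deterministic, deadlock-free XY routing for a 2D mesh:
    travel along X until the destination column is reached,
    then travel along Y. Returns the output port to take."""
    (cx, cy), (dx, dy) = cur, dest
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:
        return "S"   # assuming y grows southward in this sketch
    if dy < cy:
        return "N"
    return "L"       # arrived: deliver to the local IP core
```

A packet travelling from (0, 0) to (2, 1) would hop E, E, S and then exit on the local port, and every packet between the same pair always takes exactly this path.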

4.4. Arbitration

Within a router, arbitration is required while servicing connection requests. When multiple input ports request a common output port, the requested output port determines the order in which the requests are acknowledged. The scheduling can be either static (predetermined) or dynamic. Round-Robin arbitration is a popular technique that provides a fair dynamic granting scheme. Additionally, Quality-of-Service (QoS) guarantees can be provided to certain classes of applications by augmenting this arbitration scheme with a priority policy.

4.5. Buffering

At the intermediate nodes, packets need to be temporarily stored, while waiting for a channel access. Based on where the buffers are placed in the routers, Input Buffering and Output Buffering are the two main classifications. While buffering a packet at the input of a router, additional requests from upstream routers are not granted. This situation is called Head-of-Line (HOL) blocking. Using output buffering, the HOL blocking is avoided, thereby decreasing the average latency of the packets.
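The round-robin granting scheme described in Section 4.4 can be modeled in a few lines. This is an illustrative software sketch of the policy (the class name and interface are ours, not router RTL): the grant pointer moves past the most recent winner, so every requesting port is eventually served.

```python
class RoundRobinArbiter:
    """Fair dynamic arbiter: the port granted last cycle gets the
    lowest priority next cycle, so no requester starves."""

    def __init__(self, n_ports):
        self.n = n_ports
        self.last = self.n - 1   # start so that port 0 has first priority

    def grant(self, requests):
        """requests: sequence of bools, one per input port.
        Returns the granted port index, or None if nobody requests."""
        for i in range(1, self.n + 1):
            port = (self.last + i) % self.n
            if requests[port]:
                self.last = port  # winner gets lowest priority next time
                return port
        return None
```

With ports 0 and 1 both requesting continuously, the grants alternate 0, 1, 0, 1, ... rather than letting port 0 monopolize the output.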

5. CURRENT RESEARCH IN NOC

The International Technology Roadmap for Semiconductors 2005 was the first to address the inherent performance bottleneck due to poor interconnect scaling. The concept of routing packets instead of wires to gain performance was introduced by Dally et al. [8] and was later formally presented as a solution paradigm by Benini et al. [9]. The first proof-of-concept was presented by Kumar et al. [10] by implementing a complete NoC framework. Since then, there have been several academic and industrial advancements in NoC designs.

The NoC framework [11] by Philips is one of the first reported industry implementations of an NoC. For the Sony PlayStation 3, Sony partnered with IBM [12] on the network-on-chip implementation of its multi-core design. Further, Sonics has been involved in the development of interconnect structures based on NoCs. Arteris and Silistix are other recent startup companies providing NoC solutions for applications.

6. FPGA BASED NOC

Marescaux et al. [13] have applied the NoC paradigm on a Virtex device to enable multitasking by tile-based reconfiguration. The Hermes [14] NoC platform was developed particularly for FPGAs to enable dynamic reconfiguration. Until recently, the potential of NoCs to address the performance issues in FPGAs was left unexplored.


Figure 3. SoCWire Network-on-Chip (NoC) [17]

Bartic et al. [15] present an adaptive NoC design for FPGAs and analyze its implementation issues. Saldana et al. [6] address multi-processor designs in FPGAs by considering various NoC topologies while scaling the number of IP cores. Hilton et al. [4] design a flexible circuit-switched NoC for FPGAs. Kapre et al. [16] compare the suitability of packet and circuit switching for FPGAs.

System-on-Chip Wire (SoCWire) [17] has been developed by IDA, Technical University Braunschweig. It is a Network-on-Chip (NoC) approach based on the ESA SpaceWire interface standard to support dynamically reconfigurable Systems-on-Chip (SoC), as can be seen in Figure 3. SoCWire has been developed to provide a robust communication architecture for the harsh space environment and to support dynamic partial reconfiguration in future space applications. SoCWire provides reconfigurable point-to-point communication, high data rates, hot-plug ability to support dynamically reconfigurable modules, link error detection and recovery in hardware, and easy implementation in dynamic partially reconfigurable systems.

7. CONCLUSION

This paper discusses Network-on-Chip (NoC) designing aspects on the basis of FPGAs. The main trade-offs involved in choosing an interconnect mechanism are given, and based on those trade-offs, interconnection mechanisms such as dedicated FPGA interconnects, time-muxed interconnects, circuit-switched interconnects and packet-switched networks are discussed. The different modules in an on-chip network, namely the IP cores (computation units), the Network Interface (NI) and the NoC backbone, are described. As choices of suitable network aspects have a large impact on the design metrics of an NoC, Topology, Flow Control, Arbitration, Buffering and Routing are reviewed. Various ongoing research efforts and developed designs are referenced, and System-on-Chip Wire (SoCWire), a Network-on-Chip based approach, is presented as a recent development in the NoC field.

REFERENCES

[1] http://www.ece.cmu.edu/~cssi/research/systems.html

[2] Arun Janarthanan, “Networks-on-Chip based High Performance Communication Architectures for FPGAs”, M.S. Thesis, University of Cincinnati, 2008.

[3] N. Kapre, “Packet-Switched On-Chip FPGA Overlay Networks”. M.S. Thesis, California Institute of Technology, 2006.

[4] Clint Hilton and Brent Nelson, “PNoC: A Flexible Circuit-Switched NoC for FPGA-based Systems”. In IEEE PCDT, 2006.

[5] N. K. Kavaldjiev and G. J. M. Smit, “A survey of efficient on chip communications for SoC”. In 4th PROGRESS Symp. on Embedded Systems, Nieuwegein, Netherlands, 2003.

[6] Manuel Saldaña, et al., “The Routability of Multiprocessor Network Topologies in FPGAs”. In SLIP'06, pages 49-56, 2006.

[7] Balasubramanian Sethuraman, et al., “LiPaR: A Light-Weight Parallel Router for FPGA-based Networks-on-Chip”. In Great Lakes Symposium on VLSI, 2005.

[8] William J. Dally and Brian Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks”. In Design Automation Conference, pages 684-689, 2001.

[9] Luca Benini and Giovanni De Micheli, “Network on Chips: A New SOC Paradigm”. In IEEE Computer, 2002.

[10] S. Kumar, et al, “A Network on Chip Architecture and Design Methodology”. IEEE Inter. Symposium on VLSI, 2002.

[11] J. Dielissen et al., “Concepts and Implementation of the Philips Network-on-Chip”. In Proceedings of IPSOC'03, 2003.

[12] J. A. Kahle et al., “Introduction to the Cell Multiprocessor”. In IBM Journal of Research and Development, 2005.

[13] T. Marescaux et al., “Interconnection Networks Enable Fine-Grain Multi-Tasking on FPGAs”. In FPL'2002, 2002.

[14] F. Moraes, et al, “HERMES: an Infrastructure for Low Area Overhead Packet-Switching Networks on Chip”. In INTEGRATION, The VLSI Journal, 2002.

[15] T. A. Bartic et al., “Topology Adaptive Network-on-Chip Design and Implementation”. In Computers and Digital Techniques, IEE Proceedings, pages 467-472, 2005.

[16] N. Kapre, et al, “Packet-Switched vs. Time Multiplexed FPGA Overlay Networks”. In IEEE Symposium on Field-Programmable Custom Computing Machines, 2006.

[17] http://www.opencores.org/project,socwire

Implementing FPGA based PCI Express design

Sarun O. S. Nambiar, Yogindra Abhyankar, Sajish Chandrababu
Hardware Technology Development Group
Centre for Development of Advanced Computing, Pune, India
[email protected], [email protected], [email protected]

Abstract— PCI Express (Peripheral Component Interconnect Express), abbreviated as PCIe or PCI-E, is designed to replace the older PCI, PCI-X, and AGP standards. PCIe 2.1, or Gen2, is the latest standard for expansion cards and has recently come out on mainstream personal computers. Conceptually, the PCIe bus can be thought of as a high-speed serial replacement of the older parallel PCI/PCI-X bus. At the software level, PCIe preserves compatibility with PCI; a PCIe device can be configured and used in legacy applications and operating systems that have no direct knowledge of PCIe's newer features. Different solutions for the implementation of a PCIe design using FPGAs are available from a couple of vendors, including Xilinx and Altera. The designer can implement PCIe inside an FPGA either by developing the complete PCIe protocol or by purchasing a readily available IP from the market. Xilinx has both a Soft IP and a Hard IP available to the designer to get started with the design. In this paper we present our PCIe implementation in various Xilinx families of devices and show the advantages and disadvantages of each.

Keywords— PCI Express, Xilinx, Virtex, FPGA.

I. INTRODUCTION

PCI Express [1] is a third-generation high-speed I/O bus used to interconnect peripheral devices in applications such as computing and communication platforms. A key difference between PCIe and earlier buses is a topology based on point-to-point serial links, rather than a shared parallel bus architecture. The PCIe architecture has carried forward the most beneficial features from previous-generation bus architectures and has also taken advantage of new developments in computer architecture.

In terms of bus protocol, PCIe communication is encapsulated in packets. PCIe shares the same usage model and load-store communication model as PCI and PCI-X. The address space model has been retained, allowing existing OSs and driver software to run on a PCIe system without any modifications.

FPGA vendors have various device families and solutions to implement the PCIe design. In this paper we will mainly look into Xilinx FPGAs [2]. Virtex-II Pro, Virtex-4, Virtex-5, Spartan-6 and Virtex-6 are some of the devices which can be used by the designer to implement the PCIe design. Virtex-II Pro and Virtex-4 have a Soft IP that can be used to implement a PCIe design on the device. Virtex-5 and Spartan-6 have a hardware-integrated block (Hard IP) which can be used by the designer. For a PCIe Gen2 implementation, only Virtex-6 is available. Virtex-6 has an integrated block that is compliant with Gen2 and can additionally support the root port configuration.

We begin this paper by briefly discussing the PCIe standard in the next section. Section 3 outlines the various FPGA devices from Xilinx that are relevant to PCIe. In the later sections we describe our implementation of PCIe in FPGAs.

II. PCI EXPRESS SPECIFICATION

PCIe is a third-generation high-speed I/O bus used to interconnect peripheral devices in applications such as computing and communication platforms. PCIe employs point-to-point interconnects for communication between two devices, as against its predecessor buses that used a multi-drop parallel interconnect. A point-to-point interconnect implies a limited electrical load on the link, allowing higher frequencies to be used for communication. Currently, with the PCIe ver2.0 specification, the link speed is 5 Gbps, as against 2.5 Gbps for ver1.1.
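These line rates can be turned into usable bandwidth with a quick back-of-envelope calculation: PCIe Gen1 and Gen2 use 8b/10b encoding, so only 8 of every 10 line bits carry payload data. The helper names below are ours, chosen for illustration.

```python
def lane_bandwidth_MBps(line_rate_gbps, encoding_overhead=10 / 8):
    """Usable per-lane, per-direction bandwidth in MB/s.
    8b/10b encoding (PCIe Gen1/Gen2) sends 10 line bits per 8 data bits."""
    data_gbps = line_rate_gbps / encoding_overhead
    return data_gbps * 1000 / 8   # Gbit/s -> MB/s

def link_bandwidth_MBps(line_rate_gbps, lanes):
    """Aggregate one-way bandwidth of a multi-lane link."""
    return lane_bandwidth_MBps(line_rate_gbps) * lanes
```

This gives the familiar figures of 250 MB/s per lane for Gen1 (2.5 Gbps) and 500 MB/s per lane for Gen2 (5 Gbps), so a Gen1 x4 link peaks at 1000 MB/s in each direction.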

Because of the serial interconnection, the board design cost and complexity has reduced considerably. A PCIe interconnect that connects two devices together is referred to as a link. A link consists of either x1, x2, x4, x8, x16 or x32 signal pairs in each direction. These signals are referred to as Lanes. Figure 1 shows how data is transmitted from a device on one set of signals and received on another set of signals.

PCIe, unlike previous PC expansion standards, is structured around point-to-point serial links, a pair of which (one in each direction) make up a lane, rather than a shared parallel bus. These lanes are routed by a hub on the main-board acting as a switch.

Figure 1. PCIe Link (x1, x2, x4, x8, x16 or x32) between PCIe Device A and PCIe Device B

This dynamic point-to-point behavior allows more than one pair of devices to communicate with each other at the same time. In contrast, older PC interfaces had all devices permanently wired to the same bus; therefore, only one device could send information at a time. This format also allows channel grouping, where multiple lanes are bonded to a single device pair in order to provide higher bandwidth.

The number of lanes is negotiated during power-up or explicitly during operation. By making the lane count flexible, a single standard can provide for the needs of high-bandwidth cards (e.g., graphics) while being economical for less demanding cards.
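The negotiation above can be sketched as picking the widest lane configuration that both link partners support. The helper below is hypothetical and deliberately ignores the actual PCIe link-training state machine (LTSSM); it only illustrates the "widest common width" outcome.

```python
def negotiate_lanes(device_widths, slot_widths=(1, 2, 4, 8, 16, 32)):
    """Return the widest lane count supported by both the card and the
    slot, or None if they share no width. Illustrative sketch only."""
    common = set(device_widths) & set(slot_widths)
    return max(common) if common else None
```

So an x8-capable card plugged into a slot wired for up to x4 trains down to x4, which is exactly how one standard serves both graphics cards and low-bandwidth peripherals.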

Unlike preceding PC expansion interface standards, PCIe is a network of point-to-point connections. This removes the need for bus arbitration or waiting for the bus to be free, and enables full duplex communication. While standard PCI-X (133 MHz 64 bit) and PCIe x4 have roughly the same data transfer rate, PCIe x4 will give better performance if multiple device pairs are communicating simultaneously or if communication between a single device pair is bidirectional.
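The "roughly the same data transfer rate" claim can be checked with simple arithmetic (peak rates, ignoring protocol overhead such as TLP headers; the 250 MB/s per-lane figure is the usable Gen1 rate after 8b/10b encoding):

```python
# Peak one-way rates in MB/s.
pcix_MBps = 133 * 64 / 8        # PCI-X: 133 MHz x 64-bit shared bus -> 1064 MB/s
pcie_x4_MBps = 4 * 250          # PCIe x4 Gen1: 4 lanes x 250 MB/s   -> 1000 MB/s

# Comparable one-way figures, but PCI-X is a shared, half-duplex bus
# while PCIe x4 is full duplex, so both directions together can move
# twice the one-way figure and multiple device pairs can talk at once.
pcie_x4_bidir_MBps = 2 * pcie_x4_MBps
```

This is why PCIe x4 wins whenever traffic is bidirectional or several device pairs communicate simultaneously, even though the one-way peaks are nearly equal.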

Major components in the PCIe system are the root complex, switches and endpoint devices. The root complex denotes the device that connects the CPU and memory subsystem to the PCIe fabric. A PCIe root port is a port on a root complex that maps a portion of the PCIe interconnect hierarchy through an associated virtual PCI-PCI bridge. Endpoints are devices other than root ports and switches that are requesters or completers of PCIe transactions. They are peripheral devices that initiate transactions as a requester or respond to transactions as a completer.

III. XILINX FPGA DEVICES

Xilinx has different families of devices, starting from the low-cost, low-power Spartan to the high-performance, feature-rich Virtex series.

The Spartan series of devices are low cost with limited device resources. The current Spartan series is Spartan-6, which is based on 45 nm technology. It works on a 1.2 V core voltage and includes one integrated Endpoint block for PCIe, GTP resources, Block RAM, a memory controller, DSP slices, etc. The GTP transceivers are used by the Hard IP for its physical layer, which provides the serialization and deserialization of the bits. Block RAM is used for buffering the packets that are formed by the interfacing logic.

Xilinx's high-performance, high-density devices are available under the Virtex range, with Virtex-6 as the latest offering. Xilinx has had a PCIe solution starting from the Virtex-II Pro.

Virtex-II Pro was the first device to support the PCIe specification. It was based on 130 nm technology and offered Rocket IO capable of supporting high speed serial interconnect.

Virtex-4 is based on 90 nm technology and made improvements on the Rocket IO of Virtex-II Pro providing a compliant PCIe solution.

Virtex-5 devices are based on the 65 nm process technology with one to two speed-grade improvements over the Virtex-4 family. Since the core voltage has also been reduced from the Virtex-4 family, we get improved power savings. This family also introduced the 6-input LUT instead of the 4-input LUT used in the earlier generation. This allows for better performance, as the number of logic levels is reduced.

Virtex-6 is the latest offering from Xilinx, based on the 40 nm process technology and providing higher performance and up to 50% lower power compared to the previous generations. It enhances logic efficiency with integrated DSP slices, next-generation PCIe technology and an Ethernet MAC block. The significant reduction in power allows board designers to use smaller heat sinks, fans and power supplies. The device operates on a 1 V core voltage with an option of 0.9 V for low-power operation.

IV. IMPLEMENTATION OF PCIE

In this section we describe our implementation of the PCIe based design on different families. PCIe can be implemented using either the Soft IP [3] or the Hard IP available in the FPGA. Xilinx's device families, from the low-cost, low-power Spartan to the high-performance, feature-rich Virtex series, provide either a Soft IP or a Hard IP for our implementation of the PCIe design.

Figure 2 shows the general implementation using the Soft IP in the PCIe based design. The LogiCORE IP “Endpoint for PCIe” core from Xilinx offers high-bandwidth, scalable, and reliable serial interconnect IP building blocks. As shown in the figure, the Soft IP can be configured for either a 1x, 4x or 8x lane PCIe design. The Soft IP takes resources from the FPGA, thereby leaving the designer with fewer resources for his own design. The IP has an interfacing protocol that needs to be followed for its use. The interface logic forms the transaction layer packets and sends them to the IP with some control and handshaking signals. The interface logic has to make sure that the buffer availability and the payload size set by the configurable Soft IP are respected for all transactions.

Figure 2. General block diagram for a PCIe based design using a Soft IP

Figure 3 shows the general implementation of the PCIe design using the Hard IP block. The Hard IP is a dedicated hardware resource inside the FPGA that takes a very small area and is optimized for performance and power. This IP has the transaction layer and data link layer implemented and an easy interface to the GTP block for realization of the physical layer. To use this IP we require a wrapper around it that helps us utilize the input/output ports and other resources. The wrapper utilizes about 1% of the FPGA resources, leaving the rest of the FPGA available for the designer.

A. Virtex-II Pro and Virtex-4

The Soft IP is the only available option to implement a PCIe Gen1 design in these devices. The Soft IP supports 1x, 4x or 8x designs. However, an 8-lane design is not possible in Virtex-II Pro due to resource limitations.

The Virtex-II Pro Rocket IO transceivers are not fully compliant with the PCIe physical layer constraints regarding clock jitter and need additional workarounds to get the desired results. While testing our design on a Nital PExBuilder–X254 board based on Virtex-II Pro, we were not able to run tests for a sustained period, as the system would reboot. The reason was that the card would lose PCIe signal connectivity.

The Virtex-4 GTPs are compliant with the PCIe constraints and have given good results; however, as the IP is soft, it utilizes a considerable amount of LUTs for its implementation. Typically it uses 20% of the LUTs; this may prompt the designer to migrate to a larger device to implement his complete design, and hence is a slight drawback. Moreover, an 8-lane interface requires the design to be implemented with a 250 MHz clock frequency, which demands a tight timing budget. To meet the constraints we did multiple runs of the implementation software and made some changes in the design.

B. Spartan-6

The current Spartan series, Spartan-6 [4], has an LXT family of devices that includes one integrated Endpoint block for PCIe technology compliant with Gen1. Xilinx provides a light-weight (<100 LUTs), configurable, easy-to-use LogiCORE IP that ties the various building blocks (the integrated Endpoint block for PCIe technology, the GTP transceivers, block RAM, and clocking resources) into a compliant Endpoint solution.

Since only a single-lane PCIe design can be implemented, the designer is severely handicapped in terms of bandwidth. For designs that require less bandwidth, however, this might be the ideal solution.

C. Virtex-5

Virtex-5 [5] LXT/SXT/FXT devices have an inbuilt hard IP block compliant with PCIe ver. 1.1. A wrapper is required to use this IP; it binds the block together with the GTP transceivers and block RAM to generate the endpoint. The wrapper utilizes around 2150 LUTs, i.e. less than 1% of the logic on the XC5VLX50T device. There are restrictions on the resources available to the PCIe block, so the maximum payload we can support is 512 bytes. There is also a limitation on the memory available to the block, which limits the FIFO resources available for sending and receiving transaction packets.

The use of the hard IP frees the FPGA resources for the rest of the design, as against the Virtex-4 soft IP solution, where the soft IP consumes general FPGA resources to implement the PCIe core, forcing the design to migrate to a larger device.

For a 4-lane design, the clock frequency at which the design works is 125 MHz. This frequency is generated by the GTP from the 100 MHz reference clock available at the PCIe connector. We tested this design on the ML555 development board from Xilinx. The test results showed good stability, and we were able to achieve 740 MB/s bandwidth for memory writes and 480 MB/s for memory reads.
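For context, the measured figures can be compared against the raw capacity of the link. A Gen1 lane signals at 2.5 Gb/s, and 8b/10b encoding leaves 80% of that for data, i.e. 250 MB/s per lane. The short sketch below is our own illustration (the measured numbers are the ML555 results above):

```python
# Theoretical PCIe Gen1 bandwidth for an x4 link vs. the measured numbers.
LINE_RATE_GBPS = 2.5          # Gen1 raw line rate per lane
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b encoding overhead
LANES = 4

# Usable payload-carrying rate per lane, in MB/s (1 byte = 8 bits).
per_lane_mbps = LINE_RATE_GBPS * 1000 * ENCODING_EFFICIENCY / 8
link_mbps = per_lane_mbps * LANES  # 1000 MB/s for an x4 Gen1 link

measured_write = 740  # MB/s, memory writes on the ML555 board
measured_read = 480   # MB/s, memory reads

print(f"x4 Gen1 limit: {link_mbps:.0f} MB/s")
print(f"write efficiency: {measured_write / link_mbps:.0%}")
print(f"read efficiency:  {measured_read / link_mbps:.0%}")
```

The remaining headroom is consumed by TLP/DLLP headers, flow control, and completion latency, which is why reads trail writes.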

For an 8-lane design, the frequency must be doubled to 250 MHz. The interface to the IP is through a 64-bit data path and control signals. It is the designer's job to ensure that the transaction layer packets are sent to the core satisfying the interface and timing requirements. To make the design work at 250 MHz, the designer has to provide tight timing constraints to the implementation software and will probably need many reruns.

Figure 4 shows the block diagram of the interface logic that we have implemented for our PCIe design. The transmitter block forms the transaction layer packets based on the maximum payload size or maximum read request size and the memory address requirements; it also interacts with the receiver and DMA blocks for its operation. The receiver block is where the transaction layer packets are decoded and processed. The DMA engine holds details regarding the buffer allocated to the device by the device driver. All of these blocks work at 125 MHz for a 4-lane design and at 250 MHz for an 8-lane design. The interface logic exchanges data with the core through a 64-bit interface and some control lines.

Figure 3. General block diagram of a PCIe based system using a PCIe hard IP. (A configurable wrapper around the PCI Express hard IP block inside the FPGA connects to the user design through interface logic over a 64/128-bit data path, with a x1/x4/x8 PCIe lane on the external side.)

Figure 4. Basic interface logic. (The PCIe core exchanges data with the user design through interface logic comprising receiver, transmitter, and DMA engine blocks.)

V. MIGRATING TO PCIE GEN2

PCIe Gen2 increases the link speed to 5 Gbps from the 2.5 Gbps of Gen1, thereby increasing the bandwidth for the same number of lanes. Virtex-6 [6] is the latest family of FPGAs from Xilinx that supports the 5 Gbps lane speed required for PCIe Gen2 operation. The Virtex-6 integrated block (hard IP) for PCIe contains full support for Gen1 and Gen2 PCIe Endpoint and Root Port configurations. The Root Port configuration increases the scope of utilization of this device: it can be used to build the basis for a compatible Root Complex, to allow custom FPGA-to-FPGA communication via the PCIe protocol, and to attach Endpoint devices such as Fibre Channel HBAs to the FPGA.

The hard IP block is highly configurable to system design requirements and can operate 1, 2, 4, or 8 lanes at 2.5 or 5.0 Gbps data rates. For high-performance applications, the advanced buffering techniques of the block offer a flexible maximum payload size of up to 1024 bytes. The integrated block interfaces to the high-speed GTX transceivers for serial connectivity and to block RAMs for data buffering. Together these elements implement the physical layer, data link layer, and transaction layer of the PCIe protocol.
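Since both generations use 8b/10b encoding, doubling the line rate doubles the usable bandwidth at every lane width. A small illustrative calculation (our own, not from the original paper) of the per-link limits:

```python
# Usable per-lane bandwidth for PCIe Gen1 vs Gen2 (both use 8b/10b encoding),
# and the resulting link limits for the lane widths the hard IP supports.
def usable_mb_per_s(line_rate_gbps: float, lanes: int) -> float:
    """Payload-carrying bandwidth in MB/s after 8b/10b encoding."""
    return line_rate_gbps * 1000 * (8 / 10) / 8 * lanes

for gen, rate in (("Gen1", 2.5), ("Gen2", 5.0)):
    for lanes in (1, 2, 4, 8):
        print(f"{gen} x{lanes}: {usable_mb_per_s(rate, lanes):.0f} MB/s")
```

So a Gen2 x4 link matches a Gen1 x8 link at 2000 MB/s, which is what makes the migration attractive for a fixed lane budget.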

Xilinx provides a light-weight, configurable, easy-to-use LogiCORE wrapper that ties the various building blocks (the integrated block for PCIe, the GTX transceivers, block RAM, and clocking resources) into an Endpoint or Root Port solution. Using this wrapper, the system designer has control over many configurable parameters, such as lane width, maximum payload size, FPGA logic interface speed, reference clock frequency, and base address register decoding and filtering. The wrapper utilizes only 1% of the device resources.

For our 4-lane Gen2 implementation the interfacing design frequency is 250 MHz, and a 250 MHz reference clock is required for the PCIe block to function. The interface to the PCIe block is through a 64-bit data path and control lines. The timing constraints of the design also permit the lower speed grade device to be considered as the minimum option for implementation. The maximum payload size supported for a 4-lane design is 1024 bytes, an advantage over the earlier generation devices.

In order to implement an 8-lane design we are required to use a higher speed grade device than the -1 speed grade. A higher payload size also requires a higher speed grade: the -3 speed grade supports a maximum payload size of only 512 bytes, and the -2 speed grade only 256 bytes. Also, an 8-lane Gen2 design requires a 128-bit interface to the PCIe block, as against the 64-bit interface of a Gen1 design.

The interface logic shown in Figure 4 for the Gen1 design is followed in the Gen2 design as well. The difference is that the interface logic now works at 250 MHz and the data path changes to 128 bits for an 8-lane design. For a 4-lane design the data path remains the same, i.e. 64 bits.

We used the latest version 11.5 of the ISE software, which supports the Virtex-6 family, for the implementation of the PCIe Gen2 design. All coding was done in VHDL and the design was synthesized using Synplify Pro.

While designing a Gen2 based PCIe board, one needs to take the 100 MHz clock from the PCIe connector and convert it to 250 MHz using clock multiplier circuitry. The multiplier should adhere to the PCIe specification.

VI. CONCLUSION

We implemented PCIe Gen1 on Xilinx Virtex-II Pro, Virtex-4, and Virtex-5 devices. In order to implement Gen2, we need to migrate to a Virtex-6 device. Migration from a Virtex-5 design to Virtex-6 is challenging, as the Gen2 frequency requirements are doubled and the board requires a redesign based on the Virtex-6 PCB guidelines. However, no major changes are required to the interface logic apart from the frequency enhancements. By migrating to the higher bandwidth PCIe Gen2 based design on Virtex-6, we can leverage better throughput in our design.

REFERENCES
[1] PCIe Specification, http://www.pcisig.com/specifications/pciexpress/
[2] Xilinx device documentation and datasheets, http://www.xilinx.com/support/documentation/index.htm
[3] Endpoint for PCIe user guide (Soft IP), http://www.xilinx.com/support/documentation/ip_documentation/pci_exp_ep_ug185.pdf
[4] Spartan-6 FPGA Integrated Endpoint Block for PCIe user guide, http://www.xilinx.com/support/documentation/user_guides/s6_pcie_ug654.pdf
[5] Virtex-5 Endpoint Block Plus Wrapper for PCIe user guide, http://www.xilinx.com/support/documentation/ip_documentation/pcie_blk_plus_ug341.pdf
[6] Virtex-6 Integrated Block for PCIe user guide, http://www.xilinx.com/support/documentation/user_guides/v6_pcie_ug517.pdf


National Conference on Advanced Computing and Communication Technology | ACCT-10|


Analysis and Comparison of Different CMOS 1 Bit-Full Adders under Sub-micron Technology

Pankaj Negi1, Sunil Kumar2

[email protected]

Abstract—In this paper, the topologies of one-bit full adders, including several recently proposed ones, are analyzed and compared for speed, power consumption, and power-delay product. The comparison has been performed on two classes of circuits: the former with minimum transistor sizes to minimize power consumption, the latter with optimized transistor dimensions to minimize the power-delay product. The analysis has been done using a 180 nm technology file with a supply voltage of 1.8 V, and design guidelines have been derived to select the most suitable topology for the design features required. In this paper we analyze a 12-transistor (12T) full adder that has better power and delay performance than the existing adders. The 12T adder is compared with the 10T SERF, the 10T CLRCL, and the existing 14T full adders. The paper also discusses a figure of merit to realistically compare multi-bit adders implemented as a chain of one-bit full adders, and shows that for short chains of blocks, or for cases where minimum power consumption is desired, topologies with only pass transistors or transmission gates are suitable. Realistic circuit arrangements are used to demonstrate the performance of each 1-bit full adder cell, and a quantitative comparison of adder cell performance is given.


Index Terms— Adders, SERF, CMOS digital integrated circuits, full adder, performance analysis, power-delay product.

I. Introduction

Addition is one of the fundamental arithmetic operations. It is used extensively in many VLSI systems such as application-specific DSP architectures and microprocessors. In addition to its main task, which is adding two binary numbers, it is the nucleus of many other useful operations such as subtraction, multiplication, division, and address calculation. In most of these systems the adder is part of the critical path that determines the overall performance of the system. That is why enhancing the performance of the 1-bit full-adder cell (the building block of the binary adder) is a significant goal. Recently, building low-power VLSI systems has become highly desirable because of the fast-growing mobile communication and computation technologies. Battery technology does not advance at the same rate as microelectronics technology, so there is a limited amount of power available for mobile systems. Designers are therefore faced with more constraints: high speed, high throughput, small silicon area and, at the same time, low power consumption. Building low-power, high-performance adder cells is thus of great interest. This paper considers three existing low-power full adder cells, the 10-transistor SERF adder, the 10-transistor CLRCL adder, and the 14-transistor full adder, together with the new 12-transistor (12T) full adder. The circuit diagram of the 10T SERF adder [1] is shown in Figure 1. It can be seen that it has two four-transistor XNOR structures to perform the sum operation. Neither XNOR has a ground connection, so the adder has very low power consumption, since there is no direct path from VDD to ground. The carry logic is generated by pass transistor logic.

[email protected], M.Tech student, YMCA University of Science & Technology, Faridabad

[email protected], M.Tech student, YMCA University of Science & Technology, Faridabad

Fig. 1. Circuit diagram of 10T SERF full adder

Though the SERF adder [1] consumes less power, it suffers from the threshold loss problem, as both sum and carry are generated from pass transistor logic. In order to rectify the defects of the SERF adder [1], the 10T CLRCL adder [1] was introduced. The main aim of that design is that the carry signal must not suffer from distortion as it is propagated. The circuit diagram of the 10T CLRCL adder [1] is shown in Figure 2.

Fig. 2. Circuit diagram of 10T CLRCL full adder

It consists of two two-transistor XOR structures. The inverters used prevent the threshold loss problem and propagate full-swing signals to generate the carry. Though the carry signal has full-swing operation, this circuit consumes more power. The circuit diagram of the existing 14T full adder [2] is shown in Figure 3. It has a four-transistor XOR structure and an inverter. The carry is generated using transmission gate logic and the sum is generated from pass transistor logic. The power consumed by this circuit is less than that of the 10T CLRCL full adder [2] and more than that of the 10T SERF full adder [1]. The hybrid adder proposed in [3] has a larger delay penalty in low-voltage circuits. The 12T full adder can overcome the above drawbacks, and its performance characteristics are studied by implementing the filter structures proposed in [4]. The remainder of the paper is organized as follows: Section II focuses on the design of the new 12T full adder, Section III on implementation and comparison, and Section IV concludes.


Fig-3 Circuit diagram of existing 14T full adder.

II. NEW 12T FULL ADDER DESIGN SCHEME

For this adder, the design of the prior full adders is refined. In order to generate reduced-PDP (power-delay product) outputs, the following steps are adopted. 1) Provide a full-swing MID signal: an inverting buffer is added at the output of the XOR block, generating a full-swing MID signal. 2) Modify the SUM block: since the improper MID signals are now inverted, the SUM blocks are modified accordingly to generate correct SUM outputs. 3) Modify the Cout block: the Cout blocks are also modified to generate the correct Cout output.

Sum = (A xnor B)·C + (A xor B)·C̄   (1)

Carry = (A xor B)·C + (A xnor B)·A   (2)

The refined full adder is shown in Fig. 4. With 2 extra transistors as well as the improvements above, we manage to provide correct outputs in the TSMC 0.18 µm CMOS process.
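Equations (1) and (2) can be checked exhaustively against ordinary binary addition. The short script below is our own illustration, independent of the transistor-level implementation:

```python
# Verify Eq. (1) and Eq. (2) of the 12T adder against A + B + C for all inputs.
for A in (0, 1):
    for B in (0, 1):
        for C in (0, 1):
            xor_ab = A ^ B
            xnor_ab = 1 - xor_ab
            s = (xnor_ab & C) | (xor_ab & (1 - C))  # Eq. (1): sum bit
            carry = (xor_ab & C) | (xnor_ab & A)    # Eq. (2): carry-out bit
            # Reference: the two-bit result of adding the three input bits.
            total = A + B + C
            assert s == total & 1 and carry == total >> 1
print("Eqs. (1) and (2) match a full adder for all 8 input combinations")
```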

Fig. 4. Circuit diagram of the proposed 12T full adder

III. IMPLEMENTATION AND COMPARISON

In order to compare the proposed 12T full adder with the prior full adders, a comprehensive power-delay comparison is tabulated in Table I. The refined SERF architecture (the 12T full adder) possesses the minimal power-delay product among the full-swing designs, which leads us to a simple conclusion: the new 12T adder can be used as a better full adder alternative.

Table I. Comparison of power, delay, and power-delay product of different full adders.

Full adder    Power (µW)   Delay (ns)   Power-delay product (10^-15 Ws)
10T CLRCL     21.074       0.1680       3.540432
10T SERF      7.990        0.1078       0.861322
14T           16.960       0.1399       2.372704
New 12T       10.330       0.1123       1.160059

Table I shows the power, delay, and power-delay product comparison of the three existing full adders and the new 12T full adder. It can be seen that the 10T SERF has the least power and power-delay product, but it suffers from a severe threshold loss problem, which leads to circuit malfunction when cascaded in larger circuits. The new 12T full adder has less power consumption and delay than the existing 14T and 10T CLRCL full adders.

IV. CONCLUSION

The new 12T full adder achieves 39% power savings compared with the existing 14T full adder and 50% power savings compared to the 10T CLRCL adder. The new 12T design possesses the advantages of flexibility, better outputs, and a better power-delay product. The comparison between the designs indicates that, among the adders free of threshold loss, the 12T full adder has the minimal power-delay product. In short, the new 12T full adder design can be taken as a better FA alternative.
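These percentages follow directly from the power column of Table I; recomputing them (our own illustration) gives about 39% against the 14T cell and about 51% (quoted as 50%) against the CLRCL cell:

```python
# Relative power savings of the 12T adder, from the power figures in Table I (µW).
p_12t, p_14t, p_clrcl = 10.330, 16.960, 21.074

savings_vs_14t = (p_14t - p_12t) / p_14t * 100        # ~39%
savings_vs_clrcl = (p_clrcl - p_12t) / p_clrcl * 100  # ~51%, quoted as 50%

print(f"vs 14T:   {savings_vs_14t:.1f}% power saved")
print(f"vs CLRCL: {savings_vs_clrcl:.1f}% power saved")
```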

References

[1] H.-T. Bui, A.-K. Al-Sheraidah, and Y. Wang, "Design and analysis of 10-transistor full adders using novel XOR-XNOR gates," in Proc. 5th Int. Conf. on Signal Processing, 2000, vol. 1, pp. 619-622.
[2] Nan Zhuang and Haomin Wu, "A new design of the CMOS full adder," IEEE Journal of Solid-State Circuits, vol. 27, no. 5, pp. 840-844, May 1992.
[3] W. Al-Assadi, A.-P. Jayasumana, and Y.-K. Malaiya, "Pass-transistor logic design," Int. J. Electron., vol. 70, pp. 739-749, 1991.
[4] Jin-Fa Lin and Yin-Tsung Hwang, "A novel high-speed and energy efficient 10-transistor full adder design," vol. 54, no. 5, May 2007.
[5] E. Abu-Shama and M. Bayoumi, "A new cell for low power adders," in Proc. Int. Midwest Symp. Circuits Syst., 1995, pp. 1014-1017.
[6] J. Wang, S. Fang, and W. Feng, "New efficient designs for XOR and XNOR functions on the transistor level," IEEE J. Solid-State Circuits, vol. 29, no. 7, pp. 780-786, Jul. 1994.
[7] R. Shalem, E. John, and L. K. John, "A novel low power energy recovery full adder cell," in Proc. IEEE Great Lakes Symp. VLSI, 1999, pp. 380-383.

A Comparative Study on Folded Cascode Amplifier Circuits in 0.35µm CMOS Technology

Anil Kumar
Department of Electronics & Communication Engineering
Guru Jambheshwar University of Science & Technology, Hisar, 125001, India
Email: [email protected]

Manoj Kumar
Department of Electronics & Communication Engineering
Guru Jambheshwar University of Science & Technology, Hisar, 125001, India
Email: [email protected]

Abstract—The folded-cascode op-amp is commonly used in many analog applications. The cascode arrangement provides high open-loop voltage gain at low frequencies. In this paper, three different configurations of the folded cascode amplifier, namely the conventional folded cascode, the recycling folded cascode (RFC), and the complementary folded cascode (CFC) amplifiers, have been studied. The conventional folded-cascode amplifier provides a gain of 4.7K with a frequency of 100.25 MHz. For the recycling folded-cascode amplifier, a gain of 7.6K and a frequency of 99.99 MHz have been obtained. The complementary folded-cascode amplifier shows a gain of 5.5K with a 103.05 MHz frequency. The circuits have been designed in 0.35 micrometer (µm) CMOS technology and simulated in SPICE. Keywords - CMOS, folded cascode, power dissipation, slew rate, operational transconductance amplifier.

I. INTRODUCTION

The advancement of CMOS technologies provides the opportunity to integrate entire systems onto a single chip and serves the growing market of mobile and portable electronic devices. This growth is driven by the continuous integration of small building blocks such as amplifiers, level shifters, and oscillators on a single chip. Silicon area and power consumption are the two most valued aspects of any VLSI system design [1], [2]. The obstacles to analog CMOS VLSI design are inconsistent circuit performance, small-signal gain, frequency response, phase response, delay, and power dissipation. Various analog signal processing systems have been integrated for the low frequency range. The operational transconductance amplifier (OTA) is a fundamental analog building block, and in many VLSI applications it is the largest and most power-consuming component. Recently, one of the most commonly used architectures, the folded cascode amplifier, has been used for high gain as a single stage or as the first stage in multi-stage amplifiers [1]. The PMOS-input folded cascode amplifier (FC) has become the prime choice over its NMOS counterpart for its higher non-dominant poles, lower noise, and lower input common-mode level [1]. This paper presents a comparative study of three folded cascode amplifier circuits: the conventional folded cascode, the recycling folded cascode (RFC), and the complementary folded cascode (CFC) amplifiers.

II. COMPARATIVE STUDY OF FOLDED CASCODE AMPLIFIERS

A conventional folded cascode op-amp is shown in Fig. 1. It is a simplified schematic of a circuit that applies the folded cascode structure used in various integrated circuits. As shown in Fig. 1, MP2 and MN3 form one cascode while MP3 and MN4 form another. The current mirror converts the differential signal into a single-ended output by sending variations in the drain current of MN3 to the output. The resulting op-amp is known as a folded-cascode amplifier. The cascode in Fig. 1 is said to be folded in the sense that it reverses the direction of the signal flow back towards ground. This reversal has two main advantages when used with other integrated circuits: first, it increases the output swing; secondly, it increases the common-mode input range [11].

Fig.1. Conventional folded cascode op-amp

On the other hand, a modified folded cascode known as the recycling folded cascode (RFC), shown in Fig. 2, is intended to use MN1 and MN2 as driving transistors to enhance the general performance of the folded cascode amplifier by recycling current. First, the input drivers MP2 and MP3 (Fig. 1) are split in half to produce transistors MP2, MP3, MP4, and MP5, which now conduct fixed and equal currents of Ib/2. Next, MN2 and MN1 are split to form the current mirrors MN5:MN6 and MN4:MN7 with a cross-over connection, which ensures that the small-signal currents added at the sources of MN7 and MN8 are in phase. Finally, MN1 and MN2 are sized similarly to MN7 and MN8, and their addition helps maintain the drain potentials of MN5:MN6 and MN4:MN7 equal for improved matching. We refer to this as the recycling folded cascode (RFC) because existing devices and current are released, or recycled, to perform additional tasks. This modification provides the RFC with enhanced features over the conventional folded cascode amplifier, except for a small increase in power dissipation due to an area trade-off [1].

Fig.2. Recycling folded cascode op-amp

Another important application of the conventional folded cascode amplifier is the very high frequency CMOS complementary folded cascode (CFC) amplifier. This is a very fast CMOS transconductance amplifier that can be used for both continuous-time and sampled-data signal processing applications. In this section the complementary folded cascode (CFC) architecture is presented (Fig. 3) and compared with the present design approaches for the conventional and recycling folded cascode amplifiers. The FC structure presents a reduced input capacitance to the gain stage, thus allowing it to be designed with a smaller minimum load capacitor limitation [6]. It is therefore capable of higher frequency operation than the current-mirrored approach for small capacitive loads. Based on this observation, a CMOS VLSI amplifier is desired that can exhibit the fast-settling ac response of the FC architecture for small capacitive loads, while providing a high slew rate for the fast settling of larger loads, as exhibited by the RFC architecture. Such an amplifier would be well suited for on-chip VLSI analog signal processing applications in both the continuous-time and sampled-data domains. In comparison with the FC and RFC, the CFC has much larger power dissipation but provides a better result in terms of delay.

Fig.3. Complementary folded cascode op-amp

III. RESULTS AND DISCUSSIONS

In the current work, a comparative study of the conventional folded cascode, the recycling folded cascode (RFC), and the complementary folded cascode (CFC) has been done in 0.35 µm technology. The supply voltage has been taken as 3.3 V, and the circuits were simulated in SPICE for performance evaluation. The conventional folded cascode shows a power dissipation of 3.3638 nW with a gain of 4.7K, a slew rate of 35.06K, and a frequency of 100.25 MHz. For the recycling folded cascode and complementary folded cascode amplifiers, the gain and frequency are respectively 7.6K and 99.99 MHz, and 5.5K and 103.05 MHz, while their power dissipations are 1.249 µW and 8.1499 µW respectively. From the comparative study we conclude that the power dissipation of the recycling folded cascode and the complementary folded cascode is much larger than that of the conventional folded cascode amplifier because of an area trade-off. The table below compares the various parameters of all three folded cascode configurations.
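The gains quoted here are linear factors (4.7K meaning roughly 4700 V/V); converting them to decibels with 20·log10(Av) puts the comparison in the more conventional form. The snippet below is our own illustration:

```python
import math

# Simulated open-loop gains (V/V) of the three configurations from the study.
amps = {
    "Conventional FC":  4700,
    "Recycling FC":     7600,
    "Complementary FC": 5500,
}

# Convert each linear voltage gain to decibels: Av_dB = 20 * log10(Av).
gains_db = {name: 20 * math.log10(av) for name, av in amps.items()}
for name, db in gains_db.items():
    print(f"{name:16s} {db:.1f} dB")

# The RFC achieves the highest gain of the three configurations.
assert max(gains_db, key=gains_db.get) == "Recycling FC"
```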

TABLE I. MEASURED AND SIMULATION RESULTS FOR THE VARIOUS CONFIGURATIONS OF FOLDED CASCODE AMPLIFIERS

Parameter            Conventional FC   Recycling FC (RFC)   Complementary FC (CFC)
Power dissipation    3.3638 nW         1.249 µW             8.1499 µW
Frequency            100.25 MHz        99.99 MHz            103.05 MHz
Delay                8.7441 ns         4.9572 ns            1.4203 ns
Slew rate            35.06 K           30.33 K              6.102 K
Gain                 4.7 K             7.6 K                5.5 K

IV. CONCLUSION

Three folded cascode amplifier configurations have been studied in the present work. The conventional folded-cascode amplifier shows a gain of 4.7K with a frequency of 100.25 MHz, while for the recycling folded-cascode amplifier the gain and frequency are 7.6K and 99.99 MHz respectively. In contrast, the gain and frequency of the complementary folded-cascode amplifier are 5.5K and 103.05 MHz respectively. Of the three configurations, the recycling folded cascode (RFC) shows better performance in terms of gain, while the excellent wide-band performance of the complementary folded cascode (CFC) demonstrates that it is the architecture of choice for implementing high-frequency CMOS VLSI amplifiers.

REFERENCES
[1] R. S. Assaad and J. Silva-Martinez, "The recycling folded cascode: A general enhancement of the folded cascode amplifier," IEEE J. Solid-State Circuits, vol. 44, no. 9, September 2009.
[2] K. Nakamura and L. R. Carley, "An enhanced fully differential folded cascode op-amp," IEEE J. Solid-State Circuits, vol. 27, pp. 563-568, Apr. 1992.
[3] J. Adut, J. Silva-Martinez, and M. Rocha-Perez, "A 10.7 MHz sixth order SC ladder filter in 0.35 µm CMOS technology," IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 53, no. 8, pp. 1625-1635, Aug. 2006.
[4] R. Assaad and J. Silva-Martinez, "Enhancing general performance of folded cascode amplifier by recycling current," Electron. Lett., vol. 43, no. 23, Nov. 2004.
[5] A. A. Abidi, "On the operation of cascode gain stages," IEEE J. Solid-State Circuits, vol. 23, no. 6, pp. 1434-1437, Dec. 1988.
[6] R. E. Vallee and E. I. El-Masry, "A very high-frequency CMOS complementary folded cascode amplifier," IEEE J. Solid-State Circuits, vol. 29, no. 2, Feb. 1994.
[7] M. Milkovic, "Current gain high-frequency CMOS operational amplifiers," IEEE J. Solid-State Circuits, vol. SC-20, no. 4, pp. 845-851, Aug. 1985.
[8] R. Klinke, B. J. Hosticka, and H. J. Pfleiderer, "A very-high-slew-rate CMOS operational amplifier," IEEE J. Solid-State Circuits, vol. 24, no. 3, pp. 744-746, June 1989.
[9] E. Sackinger and W. Guggenbuhl, "A high-swing, high-impedance MOS cascode circuit," IEEE J. Solid-State Circuits, vol. 25, no. 1, pp. 289-298, Feb. 1990.
[10] Wen-Whe Sue, Zhi-Ming Lin, and Chou-Hai Huang, "A high dc-gain folded-cascode CMOS operational amplifier," vol. 13, no. 3, pp. 176-177, March 1998.
[11] P. R. Gray, P. J. Hurst, S. H. Lewis, and R. G. Meyer, Analysis and Design of Analog Integrated Circuits, 4th ed., John Wiley & Sons, 2001.



OMAP: AN APPLICATION ENVIRONMENT

Seema
Department of Electronics & Communication Engineering
Lingaya’s Institute of Management & Technology, Faridabad, India
[email protected]

Abstract- Texas Instruments’ OMAP (Open Multimedia Application Platform) is a highly integrated hardware and software platform, designed to meet the application processing needs of next-generation embedded devices. The dual-core architecture provides the benefits of both DSP and reduced instruction set computer (RISC) technologies, incorporating a TMS320C55x DSP core and a high-performance ARM core. The OMAP5912 device is designed to run leading open and embedded RISC-based operating systems, as well as the Texas Instruments (TI) DSP/BIOS software kernel foundation. This paper brings out various aspects of this open architecture and how and why 3G technology incorporates this platform for development. It also describes the OMAP hardware architecture and provides an overview of the OMAP strategy, concepts, main milestones, and achievements. The paper outlines why a combined RISC/DSP approach is superior to a RISC-only architecture for wireless multimedia applications, and describes how OMAP seamlessly integrates a software infrastructure, an ARM RISC processor, a high-performance, ultra-low-power TI TMS320C55x-generation digital signal processor (DSP), and a shared memory architecture on the same piece of silicon. Keywords- OMAP, ARM processor, DSP, RISC, embedded systems

I. INTRODUCTION

The wireless market is huge and growing. Wireless technology is advancing rapidly beyond voice to include more sophisticated and refined applications such as mobile e-commerce, real-time internet, speech recognition, audio and full-motion video streaming, and VoIP. For wireless applications, speech, audio, image, and video data require heavy signal compression within the available bandwidth before transmission, and decoding within the receiver; and if these wireless appliances are to be built into an embedded system that needs additional signal processing capability for concurrent noise suppression and echo cancellation algorithms, we need highly power-efficient and fast GPPs (general purpose processors) in addition to DSPs. The OMAP architecture provides this solution. This unique architecture offers an attractive proposition to DSP and OS developers by providing the low-power, real-time signal processing capabilities of a DSP together with the command and control functionality of an ARM. This high-performance, ultra-low-power OMAP platform coordinates dual processors across the two basic components of the wireless appliance and takes advantage of the unique capabilities of each.

In addition, the OMAP application environment is fully programmable. This programmability allows wireless device OEMs, independent developers, and carriers to provide downloadable software upgrades as standards change. With its unique ability to combine ultra-high performance with unmatched power conservation, the OMAP architecture today is playing a major role in the following fields: [3], [9]

• Mobile communications
• WLAN 802.11x
• Bluetooth
• GSM, GPRS, EDGE
• CDMA
• Video and image processing (MPEG4, JPEG, Windows Media Video, etc.)
• Advanced speech applications (text-to-speech, speech recognition)
• Audio processing (MPEG-1 Audio Layer 3 [MP3], AMR, WMA, AAC, and other GSM speech codecs)
• Graphics and video acceleration
• Generalized web access
• Data processing

II. OMAP H/W ARCHITECTURE

OMAP is a complex, low-power and high-performance multimedia application platform. This dual-core architecture integrates a software infrastructure, a host ARM RISC processor, one or more high-performance, ultra-low-power digital signal processors, and a shared memory architecture on an ASIC core developed by Texas Instruments. The DSP commonly used is from TI's TMS320 series. The OMAP hardware architecture is shown in Figure 1. Both processors utilize cache memory and an MMU (to access virtual memory) to minimize external memory accesses, improving data throughput and conserving system power. In general these features are transparent to program execution. The OMAP core also contains a memory traffic controller that supports a direct connection to external synchronous DRAMs at up to 100 MHz and to standard asynchronous memories such as SRAM/ROM, Flash, or burst Flash [1]. Both processors share some external peripherals, ports, timers, and interfaces to access external devices speedily. All the ports and interfaces are buffered to improve system efficiency. There are also separate ports, timers, and peripherals dedicated to each processor. The DSP achieves high performance through a high degree of parallelism at low power dissipation.

Fig 1. Hardware Architecture

The digital signal processor has its own separate RAM, cache memory, data buses, and ALUs. 32-bit timers, a watchdog timer, and multichannel DMA are also present in the DSP core. The ARM core likewise contains separate buses, cache memory, a write buffer, and a translation look-aside buffer for its MMU. A multi-channel DMA controller and an LCD controller with frame buffers are also present on the platform to allow concurrent accesses. A power-saving mode facility is provided individually to each processor. Camera, multimedia card, Secure Digital, and keyboard interfaces are provided on the platform for various applications. The CPU supports an internal bus structure composed of one program bus, three data read buses, and two data write buses; additionally, a high-speed access bus is available to allow an external device to share the main OMAP system memory with peripherals and direct memory access (DMA). JTAG, known by its official name of IEEE 1149.1 boundary scan, is available on the OMAP board to enable engineers to overcome the limited-access challenge, reduce risk, increase quality, decrease defects, and increase productivity with lower inventory levels and less board damage. [1], [3], [8]

III. OMAP FAMILY [6]

The whole OMAP family is classified into three major product groups according to performance and application:

A. High Performance
This group originated for use in smartphones, with processors able to run a significant operating system, support connectivity to personal computers and support various audio and video applications. It consists of several series of the OMAP platform.

1) OMAP1: This series started with a TI-enhanced ARM core and ends with the latest ARM926 core. The series spans various manufacturing technologies and runs at a maximum frequency of 220 MHz. Various cell phone models and the Nokia 770 Internet Tablet are based on it. Examples: OMAP171X, OMAP162X, OMAP5912, OMAP1510, etc.

2) OMAP2: This series was not very popular; it runs at a maximum frequency of 330 MHz. Its product line also includes both Internet Tablets and mobile phones. Examples: OMAP2431, OMAP2430, OMAP2420.

3) OMAP3: This series is further divided into three groups: OMAP34X, OMAP35X and OMAP36X. It is much more advanced, with new features and higher-rate image processing capabilities than OMAP1 and OMAP2, and runs at a maximum of 1 GHz. The series consists of an ARM Cortex core, a DSP and an ISP. Its product line includes handsets such as the Palm Pre, Motorola Droid and Nokia N900 cellular phones.

4) OMAP4: This series has recently been announced. Examples include OMAP4430 and OMAP4440.

B. Basic Multimedia
This group is developed for highly integrated, low-cost consumer product chips. The series is intended for use in digital media coprocessors for mobile devices with high-megapixel digital still and video cameras. Examples: OMAP331, OMAP-DM270, OMAP-DM525, etc.

C. Modem and Applications
This group is intended only for low-cost handset manufacturers. Examples: OMAPV1035, OMAP1030, etc.

IV. THE RISC/DSP ARCHITECTURE ADVANTAGE OVER A RISC-ONLY ARCHITECTURE [1], [2]

The OMAP architecture is based on a combination of one or more DSP (TMS320 series) cores and a high-performance RISC (ARM) CPU. A RISC architecture such as the ARM9xx is well suited for control-type code (operating system, user-interface applications), while a DSP is well suited for signal processing applications such as MPEG4, AAC, high-fidelity audio playback and voice recognition. Combining the two processors gains the maximum benefit from each.


TABLE 1: COMPARATIVE ALGORITHM EXECUTION

Algorithm                                 | ARM9E | StrongARM 1100 | TMS320C5510 | Units
Echo cancellation, 16-bit (32 ms, 8 kHz)  | 24    | 39             | 4           | Mcycles/s
Echo cancellation, 32-bit (32 ms, 8 kHz)  | 37    | 41             | 15          | Mcycles/s
MPEG4/H263 decoding, QCIF at 15 fps       | 33    | 34             | 17          | Mcycles/s
MPEG4/H263 encoding, QCIF at 15 fps       | 179   | 153            | 41          | Mcycles/s
JPEG decoding (QCIF)                      | 2.1   | 2.06           | 1.2         | Mcycles/s
MP3 decoding                              | 19    | 20             | 17          | Mcycles/s
Average cycle ratio with C55x             | 3.12  | 3              | 1           |

Many of the features beginning to appear in wireless equipment do not perform effectively in the typical end-user environment. The latest smart phones, for example, feature speech recognition capabilities. In general, these speech recognition features operate effectively only in perfectly quiet surroundings; when a user attempts to issue voice commands in a crowded, noisy area, speech recognition often fails. The major reason for this failure is that the echo cancellation and noise suppression algorithms in these wireless handsets are implemented on the RISC processor alone. This can work as long as the appliance operates, for the duration of a communication, as a telephone alone. But in the migration from telephone to 3G multimedia appliances, it is reasonable to expect users to play games or watch video as they issue voice commands and use the telephone capabilities of the appliance. The added processing power of the DSP allows these applications to run concurrently with the above-mentioned algorithms, thus providing the environment to use speech recognition successfully. DSPs provide superior power/performance in such applications because these are fundamentally signal processing tasks, and DSPs are by design optimized specifically for signal processing. TI conducted a comparative benchmarking study, based on published data, which shows that a typical signal processing task executed on the latest RISC machines (ARM9E, StrongARM) requires three times as many cycles as the same task requires on a TMS320C55x DSP. In terms of power consumption, tests show that a given signal-processing task executed on such a RISC engine consumes more than twice the power required to execute the same task on the C55x DSP architecture. Battery life, critical for mobile applications, is therefore much longer with a combined architecture like OMAP than with a RISC-only architecture.
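As a sanity check, the "average cycle ratio with C55x" row of Table 1 is consistent with taking the ratio of the total cycle counts in each column to the TMS320C5510 column. That reading is an assumption (the paper does not state how the average was formed), but it reproduces the published figures to within rounding:

```python
# Cycle counts (Mcycles/s) from Table 1, column by column:
# ARM9E, StrongARM 1100, TMS320C5510, one entry per algorithm row.
arm9e = [24, 37, 33, 179, 2.1, 19]
strongarm = [39, 41, 34, 153, 2.06, 20]
c5510 = [4, 15, 17, 41, 1.2, 17]

def cycle_ratio(core, reference):
    """Ratio of the total cycle demand of `core` to that of `reference`."""
    return sum(core) / sum(reference)

print(round(cycle_ratio(arm9e, c5510), 2))      # 3.09, vs. 3.12 reported
print(round(cycle_ratio(strongarm, c5510), 2))  # 3.04, vs. 3 reported
print(cycle_ratio(c5510, c5510))                # 1.0
```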
A single TMS320Cxxx DSP can process a full videoconferencing application (audio and video at 15 images/sec) in real time using only 40% of its total computational capability; the remaining 60% can therefore be employed to run other applications concurrently. At the same time, in the OMAP dual-core platform, the RISC processor is fully available to run the operating system and its related applications. Typically, a mobile user would therefore still have access to the usual OS applications (Word, Excel….) while processing a full videoconferencing application. With a RISC processor alone, the videoconferencing application could run only by itself, with more than twice the power consumption of a TMS320C55x; moreover, the battery life would be dramatically reduced.

V. ARM-DSP COMMUNICATION [4], [5]

The makers of the OMAP platform give users and developers the option to utilize both the ARM and the DSP of the OMAP platform. In general, the all-important DSP-ARM communication can be established in three ways: (1) DSP Gateway, open-source software originally developed by Nokia, which uses Linux "device files" to send data to the DSP from the ARM side. (2) DSP Bridge, a highly specialized, high-level library for DSP-ARM communication; it provides very high-level transfer functions, which makes it difficult to reuse in different applications. (3) DSP/BIOS Link, which provides a more general layer of abstraction over the information transfer between the two CPU cores, making it easy to reuse across different platforms. To understand the working, consider the example of processing audio input shown in Figure 2. The user initiates the ARM to boot the DSP via a user interface, e.g. a GUI command. The ARM then commands the DSP to start acquiring data from the audio codec. The DSP processes this acquired data and passes it to the ARM. The ARM copies the data and sends it back to the DSP; before doing so, the ARM has the option to process the data further if required. The ARM can also set parameters in the DSP, such as filtering. After further processing of the data, the DSP sends the signal out through the codec output. DSP/BIOS Link is the foundation software for the inter-processor communication between ARM and DSP. It provides a generic API that abstracts the characteristics of the physical link connecting the ARM and the DSP from the application. On the ARM side, the OS ADAPTATION LAYER module wraps up the generic OS services that are required by the other components of DSP/BIOS Link. During DSP-ARM communication, the LINK DRIVER is responsible for controlling the


Fig. 2 (A) How the DSP sends data to the ARM, where some processing can be done, before the data are sent back to the DSP for further processing. (B) Software architecture of DSP/BIOS Link, the foundation software for inter-processor communication.

execution of the DSP and data transfer. The processor maintains a book-keeping record for all components. The DSP/BIOS Link API is the interface for all clients on the ARM side. On the DSP side, the LINK DRIVER specializes in communicating with the ARM over the physical link. There is no specific DSP/BIOS Link API for the DSP; the communication is done using the DSP/BIOS modules, namely SIO/GIO/MSGQ.
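The ARM-DSP handshake described above can be modelled with two message queues standing in for the "DSP Link to GPP" / "DSP Link from GPP" channels of Figure 2. The sketch below is purely illustrative: it does not use the real DSP/BIOS Link API, and the doubling/increment steps are placeholders for actual codec processing:

```python
import queue
import threading

# Illustrative model (not the DSP/BIOS Link API): two queues play the role of
# the link channels between the DSP and the ARM (GPP) sides.
to_arm = queue.Queue()  # DSP -> ARM channel
to_dsp = queue.Queue()  # ARM -> DSP channel

def dsp_task(samples):
    """DSP side: acquire samples, pre-process, hand to ARM, post-process the result."""
    for s in samples:          # stands in for "acquire data from the audio codec"
        to_arm.put(s * 2)      # placeholder pre-processing
    to_arm.put(None)           # end-of-stream marker
    out = []
    while (s := to_dsp.get()) is not None:
        out.append(s + 1)      # placeholder post-processing ("codec output")
    return out

def arm_task():
    """ARM side: copy data from the DSP, optionally process, send it back."""
    while (s := to_arm.get()) is not None:
        to_dsp.put(s)          # the ARM may process here before returning the data
    to_dsp.put(None)

result = []
dsp = threading.Thread(target=lambda: result.extend(dsp_task([1, 2, 3])))
arm = threading.Thread(target=arm_task)
dsp.start(); arm.start()
dsp.join(); arm.join()
print(result)  # [3, 5, 7]
```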

VI. CONCLUSION

In this paper we described why and how various features have been wrapped into the OMAP platform to support heavy yet power-efficient multimedia applications. The flexible, open and fully programmable architecture of the TI DSP hardware core can also be extended with multimedia-specific operations and can be ported to a large number of operating systems for further application development. Integrating available third-party and OS-native tools with TI's user-friendly Code Composer Studio integrated development environment (IDE) makes complete software development support available for the OMAP hardware and software architecture. With its unique ability to combine ultra-high performance with unmatched power conservation in wireless Internet appliances, the OMAP architecture is today emerging as a de facto standard for the industry.

REFERENCES

[1] Jamil Chaoui, "OMAP: Enabling Multimedia Applications in Third Generation Wireless Terminals," Dedicated Systems Magazine, 2001 Q2, pp. 34-39.
[2] Tiemen Spits and Paul Werp, "OMAP Technology Overview," white paper from Texas Instruments, Dec. 2000.
[3] Texas Instruments, OMAP5912 Application Processor Data Manual, Dec. 2005, Literature No. SPRS231E.
[4] Texas Instruments, DSP/BIOS Link, Nov. 22, 2005.
[5] TMS320C6000 DSP Multichannel Buffered Serial Port Reference Guide, Dec. 2006.
[6] www.wikipedia.com
[7] http://omapspectrumdigital.com
[8] TI Product Bulletin, Embedded OMAP Processor: OMAP5912, 2004.
[9] www.omap.com



Analysis and Implementation of CMOS Current Mirror for Low Voltage and Low Power

Sudhir, Bhoop Singh, Pradeep Sheoran and Sanjiv Dhull
Department of Electronics and Communication Engineering, Guru Jambheshwar University of Science and Technology, Hisar-125001 (Haryana)
Email: [email protected], [email protected]

Abstract— The demand for portability has made low power usage a key factor in integrated circuit design. Low power circuits normally find use in both digital and analogue mobile systems. Non-mobile applications also prefer low voltage (LV) operation to eliminate instrument-cooling requirements. A current mirror for low voltage analog and mixed-mode circuits is proposed. The current mirror has high input and output voltage swing capability and can operate at a 1.0 V supply. Mentor Graphics simulations confirm an input current range of 1 μA to 50 μA with 2.5 GHz bandwidth for the proposed current mirror. A level shifting technique increases the input voltage swing capability and decreases the undesired offset current.

Key words: CMOS current mirror, high performance, low power, sensitivity.

1. INTRODUCTION

This paper proposes a low voltage CMOS current mirror circuit which has very good temperature and current stability. Recently, low power, low voltage analog and mixed-mode circuits have gained importance, especially for portable electronics and mobile communication systems. It is predicted that future analog and mixed-mode circuits will require power supplies of 1 V or less. Low power, and thereby low voltage, is the ultimate goal, as it reduces the power consumption of the peripheral equipment needed for instrument cooling. As the industry trend towards less expensive, basic CMOS processes continues, integrated circuit designers are challenged with designing circuits that perform as well as their bipolar predecessors [1].

Increasing levels of integration increase the power consumption per unit area of chip, which demands low voltage devices for optimal system performance. A current mirror (CM) is a common building block in both analogue and mixed-mode VLSI circuits. Almost all high-impedance CMs reported so far need high input voltages, which gives them high output voltage signal swing capability.

The input voltage requirement for these CMs depends solely on the threshold voltage (VTH) of the input MOSFET, making them unsuitable for low voltage operation. New Low Voltage Current Mirror (LVCM) circuits therefore need to be investigated to obtain high signal swings at the input and output terminals. A few CM topologies with reduced input voltage requirements have been reported in [2], where either a level shifter or a bulk-driven approach has been used. This paper is organized as follows: a brief introduction is given in Section 1. The conventional current mirror is explained in Section 2. The proposed low-voltage current mirror (LVCM) is introduced in Section 3. Section 4 introduces the performance parameters of the proposed current mirror, including the offset current, sensitivity analysis and temperature effects. Simulation results and discussion are presented in Section 5. Finally, conclusions are drawn in Section 6.

2. CONVENTIONAL CURRENT MIRROR

The drain current IDS of a saturated MOS transistor in the sub-threshold region is given by Equation (1). As M1 operates in the saturation region, the lower limit of VIN is restricted to at least VTH. However, Iin can be low enough to drive M1 into the sub-threshold region, where the drain current is given as [2]:

I_DS1 ≈ I_DO1 (W1/L1) exp[(V_GS1 − V_TN) / (η kT/q)]   …(1)

where η is the inclination of the curve in weak inversion, k is the Boltzmann constant, q is the charge of the electron and T is the absolute temperature.

Referring to Fig. 1, the input voltage (V_IN) is given as [2]:


V_IN ≈ V_TN + η V_TH ln[(I_in / I_DO1)(L1 / W1)]   …(2)

where W1 and L1 represent the channel width and channel length of M1, V_TH (≈ 26 mV) is the thermal voltage and η = 1.5. I_DO1 is the saturation current, normally taken as 20 nA.

Fig. 1 Conventional Current Mirror

In sub-threshold operation of the circuit (Fig. 1), the input voltage (V_IN) described by Equation (2) will be sufficiently low (≤ 100 mV), which is one of the most desired characteristics of an LVCM. To force M1 and M2 into the sub-threshold region, one needs to increase the aspect ratios of these transistors, which lowers the value of V_IN. However, circuits based on sub-threshold MOSFETs have poor frequency response and cannot be used in high frequency applications.
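A quick numeric sketch of Equation (2) illustrates the claim that raising the aspect ratio lowers V_IN. Here η = 1.5, V_TH ≈ 26 mV and I_DO1 = 20 nA come from the text, but the threshold voltage V_TN = 0.5 V is an assumed value used only for illustration:

```python
import math

# Equation (2): V_IN = V_TN + eta * V_T * ln((I_in / I_DO1) * (L1 / W1)).
# eta, V_T (thermal voltage) and I_DO1 are from the text; V_TN is an assumption.
ETA, V_T, I_DO1, V_TN = 1.5, 0.026, 20e-9, 0.5

def v_in(i_in, w_over_l):
    """Input voltage of the conventional mirror in sub-threshold (Eq. 2)."""
    return V_TN + ETA * V_T * math.log(i_in / (I_DO1 * w_over_l))

# Raising the aspect ratio W1/L1 lowers V_IN, as the text argues:
for wl in (10, 100, 1000):
    print(f"W/L={wl:5d}  V_IN={v_in(1e-9, wl) * 1000:.0f} mV")
```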

3. THE PROPOSED CMOS LVCM CURRENT MIRROR

The proposed LVCM is shown in Fig. 2. The level shifter M3 is forced to operate in the sub-threshold region by selecting Ib1 at a sufficiently low level. This ensures that V_IN always exceeds the minimum voltage (V_GS1 − V_TN) necessary to drive M1 into the saturation or sub-threshold region. It is desirable that M1 and M2 operate either in sub-threshold or in saturation, while M3 should operate only in sub-threshold.

When Iin is zero, M1 will be in the sub-threshold region, but the operational region of M2 is decided by the external bias (V_OUT). For a low value of V_OUT, M2 will be in the triode region, while for a high V_OUT, M2 will operate in the sub-threshold region. M5 provides a suitable gate bias to transistor M4. The input current (Iin) injected into M1 is transferred to M2. The use of M4 enhances the output impedance (Rout).

Fig. 2 The Proposed CMOS LVCM Current Mirror.

The output resistance Rout of the proposed circuit can be given as [3]:

R_out ≈ (g_m2 g_m3) / (g_d2² g_d3)   …(3)

The minimum output voltage for the proposed LVCM is given as [3]:

V_out(min) ≈ V_DS2(sat) + V_DS5(sat)   …(4)

where V_DS2 and V_DS5 are the drain-to-source voltages of transistors M2 and M5, respectively.

4. PERFORMANCE PARAMETERS

LVCM design assumes that all process parameters are ideal, which is never the case. Hence, certain performance indices are required to evaluate its performance. For a current mirror, the performance indices are the input compliance voltage, output compliance voltage, bandwidth, input impedance and output impedance [4-6]. Other, equally important parameters originate from process variations, structural dependence or the environment.

These indices include the offset current, which is due to the particular circuit architecture, and the mismatches between the various dimensions and process parameters, which are due to the physical process involved. These non-idealities can be minimized but cannot be eliminated. Temperature variation is the parameter attributable to environmental changes.

4.1 Offset Current

The offset current is the most disturbing factor in LVCMs; it sets the lower limit of the input current. Due to the level shifter


M3, the gates of M1 and M2 are not at the reference potential even when the input current is zero. The gate of M2 is at an elevated voltage, and its drain voltage depends on the externally applied voltage (V_DS1 is decided by Iin). Hence a sub-threshold current will flow in M2 even when Iin equals zero. This current is known as the offset current, and its lower limit is given as [6]:

I_offset = (W2 L3 / L2 W3) (I_DO2 / I_DO3) I_b1   …(5)

Appropriate selection of the W/L ratios for M2 and M3 and a low value for Ib1 ensure a low value of Ioffset.
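Equation (5) can be sketched numerically to show the design rule just stated: with the geometry and saturation-current ratios fixed, the offset floor scales directly with I_b1. All component values below are illustrative assumptions, not values taken from the paper:

```python
# Equation (5): I_offset = (W2*L3 / (L2*W3)) * (I_DO2 / I_DO3) * I_b1,
# written here with the aspect ratios W2/L2 and W3/L3 as arguments.
def i_offset(w2_l2, w3_l3, i_do2, i_do3, i_b1):
    """Lower limit of the offset current of the proposed LVCM (Eq. 5)."""
    return (w2_l2 / w3_l3) * (i_do2 / i_do3) * i_b1

# With equal aspect ratios and saturation currents, the offset tracks I_b1:
for ib1 in (1e-9, 10e-9, 100e-9):
    print(f"I_b1={ib1:.0e} A  ->  I_offset={i_offset(10, 10, 20e-9, 20e-9, ib1):.0e} A")
```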

4.2 Sensitivity Analysis

For the proposed LVCM, sensitivity analysis is used to ascertain the effects, on its performance, of variations in input parameters and process parameters such as Ib1, the threshold voltages, the transistor aspect ratios (W/L) and the mismatch between the threshold voltages of the PMOS and NMOS devices. The output current (Iout) is given as [6]:

I_out ≈ (μ C_ox / 2)(W2/L2) [V_in + η V_TH ln((I_b1 / I_DO3)(L3 / W3)) + V_TP − V_TN]² (1 + λ V_DS2)   …(6)

The normalized sensitivities S_x^{I_out} = (x/I_out)(∂I_out/∂x) with respect to the main parameters are [6]:

S_{V_DS2}^{I_out} ≈ λ V_DS2   …(7)

S_{I_b1}^{I_out} ≈ 2 η V_TH K'_2 (W2/L2) / g_m2   …(8)

S_W^{I_out} ≈ 1   …(9)

S_L^{I_out} ≈ −1   …(10)

S_{V_TN}^{I_out} ≈ −2 K'_2 (W2/L2) V_TN / g_m2   …(11)

S_{V_TP}^{I_out} ≈ −2 K'_3 (W3/L3) V_TP / g_m3   …(12)

S_{ΔV_T}^{I_out} ≈ −2 K'_2 (W2/L2) ΔV_T / g_m2   …(13)
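The geometric sensitivities (Equations 9 and 10) can be spot-checked numerically: for any output current proportional to W/L, the normalized sensitivity to W is +1 and to L is −1. The model below follows the form of Equation (6), with parameter values that are illustrative assumptions, not values from the paper:

```python
import math

# Illustrative parameters for an Eq.-(6)-style model (assumed, not from the paper).
ETA, V_T, LAM, V_DS2 = 1.5, 0.026, 0.05, 1.0

def i_out(W, L, v_in=0.2, k=100e-6, i_b1=10e-9, i_do3=20e-9, w3l3=10, dvt=0.1):
    """Output current in the form of Eq. (6): (k/2)(W/L)*gate^2*(1 + lambda*V_DS2)."""
    gate = v_in + ETA * V_T * math.log((i_b1 / i_do3) / w3l3) + dvt
    return 0.5 * k * (W / L) * gate ** 2 * (1 + LAM * V_DS2)

def sensitivity(f, x, dx=1e-6):
    """Normalized sensitivity S_x = (x/f) * df/dx via central difference."""
    return x * (f(x + dx) - f(x - dx)) / (2 * dx) / f(x)

s_w = sensitivity(lambda w: i_out(w, 1.0), 10.0)
s_l = sensitivity(lambda l: i_out(10.0, l), 1.0)
print(round(s_w, 3), round(s_l, 3))  # 1.0 -1.0
```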

The sensitivity analysis shows that the variations in the output current due to changes in the input variables are almost negligible.

4.3 Temperature Effects

Temperature analysis gives the relationship between Iout and temperature. The mobility (μ) and the threshold voltage V_T are the temperature-dependent parameters. Their temperature dependence is given as [6]:

μ(T) = μ(T0) (T/T0)^(−3/2)   …(14)

V_TH(T) = V_TH(T0) − α (T − T0)   …(15)

The output current of the mirror is then given as [6]:

I_out ≈ (μ0 C_ox / 2)(W2/L2)(T/T0)^(−3/2) [V_in + η V_TH ln((I_b1 / I_DO3)(L3 / W3)) + ΔV_TH]²   …(16)

Assuming the temperature change in ΔV_TH to be negligible (when αn and αp are equal), we get the fractional temperature coefficient (TC_f) as [6]:

TC_f(I_out) ≈ −1.5 / T   …(17)
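The fractional temperature coefficient follows directly from the T^(−3/2) mobility dependence: if I_out ∝ T^(−3/2), then (1/I_out) dI_out/dT = −1.5/T. A numeric derivative confirms this; the proportionality constant below is arbitrary:

```python
# If I_out is proportional to T^(-3/2), its fractional temperature coefficient
# (1/I) * dI/dT equals -1.5/T. The constant c is arbitrary and cancels out.
def i_out(T, c=1e-5):
    return c * T ** (-1.5)

def tc_f(T, dT=1e-3):
    """Fractional temperature coefficient (1/I) * dI/dT via central difference."""
    return (i_out(T + dT) - i_out(T - dT)) / (2 * dT) / i_out(T)

T = 300.0  # kelvin
print(round(tc_f(T) * T, 4))  # -1.5
```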

5. SIMULATION RESULTS AND DISCUSSION

The proposed circuit has been analyzed for realization in 0.35 μm CMOS technology. The circuit has been simulated for the various parameters over the temperature range of −25 °C to 130 °C.

Fig. 3 I-V characteristics of the proposed Low Voltage Current Mirror (LVCM). [Plot: Iout (μA), 0-35 μA, versus applied voltage, 0-2.5 V.]

The transient analysis of the circuit has been performed to measure the proposed circuit's current and power dissipation.


In Fig. 3, the output current of the proposed circuit is plotted as a function of the input supply voltage for different values of input current. Simulation shows that the proposed current mirror produces output currents of 10 μA, 20 μA and 30 μA, with a variation of ±0.05 μA, for input currents of 10 μA, 20 μA and 30 μA respectively.

Fig. 4 Power dissipation as a function of input supply voltage. [Plot: power dissipation (μW), 0-45 μW, versus input supply voltage, 0-2.5 V.]

In Fig. 4, the circuit has been simulated for the variation in power dissipation as a function of the input supply voltage. The power dissipation increases linearly, which shows the circuit linearity and the output voltage stability of the proposed LVCM in response to variations in the input supply voltage.

6. CONCLUSION

In this paper a low-power current mirror based on sub-threshold MOSFETs, without using any diode or bipolar junction transistor, has been proposed. The LVCM circuit provides good performance against temperature and power supply variations in the range of −25 °C to 130 °C. The minimum power supply is 1 V and the maximum power dissipation of the entire circuit is 38.5 μW. The proposed circuit is therefore suitable for high performance, low-voltage and low-power applications.

ACKNOWLEDGEMENT

This work was performed on Mentor Graphics tools with the support of the Department of Electronics and Communication Engineering, Guru Jambheshwar University of Science and Technology, Hisar-125001.

REFERENCES

[1] B.J. Blalock, P.E. Allen, and G.A. Rincón-Mora, "Designing 1-V op amps using standard digital CMOS technology," IEEE Trans. Circuits and Systems II, pp. 769–780, 1998.

[2] A. Zeki and H. Kuntman, "Accurate and high output impedance current mirrors suitable for CMOS current output stages," Electronics Letters, vol. 33, pp. 1042–1043, 1997.

[3] P.R. Gray and R.G. Meyer, Analysis and Design of Analog Integrated Circuits. John Wiley and Sons: Singapore, 1995.

[4] Y. Cheng and C. Hu, MOSFET Modeling & BSIM3 User's Guide. New York: Kluwer, 1999.

[5] R.J. Baker, H.W. Li, and D.E. Boyce, CMOS Circuit Design, Layout and Simulation. Piscataway, NJ: IEEE Press, 1997.

[6] S.S. Rajput and S.S. Jamuar, "A current mirror for low voltage, high performance analog circuits," Analog Integrated Circuits and Signal Processing, vol. 36, pp. 221–233, 2003.

[7] P.E. Allen and D.R. Holberg, CMOS Analog Circuit Design. Oxford University Press, 2000.

[8] The MOSIS service, www.mosis.org