Volume 8, No.3, May - June 2019 International Journal of Advanced Trends in Computer Science and Engineering

Available Online at http://www.warse.org/IJATCSE/static/pdf/file/ijatcse84832019.pdf

https://doi.org/10.30534/ijatcse/2019/84832019

# Power and Area Optimized FRA-CSLA for High-Speed NoC Applications

Sangeeta Singh<sup>1</sup>, J.V.R. Ravindra<sup>2</sup>, B. Rajendra Naik<sup>3</sup>



<sup>1</sup>Research Scholar, Department of ECE, JNTUK, Kakinada, India, sangeeta.singh@ieee.org
 <sup>1</sup>Department of ECE, C-ACRL, Vardhaman College of Engineering, Hyderabad, India
 <sup>2</sup>Department of ECE, C-ACRL, Vardhaman College of Engineering, Hyderabad, India
 <sup>3</sup>Department of ECE, University College of Engineering, Osmania University, Hyderabad, India

## ABSTRACT

Network on Chip (NoC) is a well-established research field and is regarded as a promising paradigm for overcoming the communication bottleneck in future multicore systems. Its major advantages are performance and scalability, which make it well suited to solving problems related to interconnect architecture. The trade-off between area/power consumption and performance is one of the dominant challenges in NoC design. To enhance performance, some techniques increase the number or size of the buffers used, but this in turn increases area and power consumption. This paper proposes an efficient Flexible Router Architecture-Carry Select Adder (FRA-CSLA) method that occupies less area and consumes less power than the conventional flexible router design. The method uses a CSLA to generate the next-level request signals, which reduces area and power compared with the conventional approach of using basic adder cells. The simulation results show that the proposed approach improves the FPGA and ASIC performance of the Flexible Router Architecture (FRA) in comparison with the existing method used for NoC.

**Key words:** Flexible Router, Carry Select Adder, Network-on-Chip, System on Chip

## 1. INTRODUCTION

Continuous technology scaling has led to the integration of a large number of computational units onto the same silicon die. These advancements impose stringent requirements on the communication architecture. Network on Chip (NoC) is considered a promising solution to the communication problems of multi-core systems [1]. Its high bandwidth, scalability, and reusability make NoC better suited to addressing interconnect problems than traditional bus-based architectures.

The NoC is similar to a Local Area Network (LAN) in that it provides the communication links between several modules. The NoC architecture primarily consists of i) switches (routers), ii) links, and iii) network interfaces, as shown in Figure 1 [2]. Various topologies exist for interconnecting the components of a NoC, and the choice of topology determines the system performance and cost.



Figure 1: Basic NoC Architecture [2]

Among all the components of a NoC, the router implementation plays a vital role in the design of high-performance NoCs [3]. A router contains five input ports and five output ports, named West, East, North, South, and Local. It is connected to the neighbouring routers through its directional ports and to the Processing Element (PE) through its local port, as presented in Figure 2 [4]. The router's performance therefore needs to be optimized in order to boost the performance of the complete network.



Figure 2: Basic NoC router architecture [4]

Different router architectures have been proposed to increase the performance of NoCs. The Flexible Router Architecture (FRA) enhances the efficiency of the network by using the same number of available buffers more effectively. Its major advantage over basic routers is the flexibility to store flits of packets that arrive at ports whose buffers are busy. The flexible router allocates any suitable free buffer in any of its input ports to store the incoming packet. This technique resolves contention without having to wait for the originally requested busy input-port buffer to become free. However, the conventional flexible router (FR) structure is expensive in terms of power, area, and cost. To increase the efficiency of the router, the FRA-CSLA method is implemented on ASIC and FPGA platforms for modern NoC systems. In this work, a CSLA is used to process the request and grant signals of the East (E) input port in the input controller module.

The rest of the paper is organized as follows. Previous work related to NoC router architecture is described in Section 2. The methodology of the efficient FRA-CSLA architecture is presented in Section 3. The experimental setup, results, and discussion are presented in Section 4. Finally, Section 5 provides a brief conclusion.

## 2. RELATED WORKS

Researchers have suggested a large number of methods for the efficient implementation of NoC. This section briefly describes some notable contributions from the existing literature.

Psarras *et al.* [5] presented the short-path NoC router with Fine-Grained Pipeline Bypassing (FGPB). FGPB describes a router architecture in which flits skip pipeline stages and are forwarded quickly to the first point of contention. The approach is consistently productive because stages that have already been bypassed are not repeated even if a flit loses arbitration. However, this method increases hardware complexity and thus raises the cost of implementation.

Chen *et al.* [6] presented an easily customizable NoC architecture. It is combined with a directory-based data-sharing mechanism in the Compute Unified Device Architecture-to-FPGA (CUDA-to-FPGA) flow so that the system scales easily, which improves overall system performance. Synthesizable RTL code for the complete NoC is generated by an automated FCUDA-NoC generator, which takes the CUDA code and the network parameters as input.

Qian *et al.* [7] presented a Support Vector Regression (SVR)-based latency model for NoC, in which the waiting times of the channel and source queues are computed with an analytical queuing model from the communication information of the application and the NoC routing technique. This technique improved the accuracy of the latency estimate, but the work does not comment on the power and area requirements of the SVR-based approach.

Yoon *et al.* [8] presented an analysis of virtual channels and multiple physical networks as approaches to NoC design. These approaches provide better quality of service, improve performance, and avoid protocol deadlocks. Virtual channels provide lower latency than multiple physical networks because contention can be resolved dynamically. The multiple-physical-network approach, which uses simpler routers with narrower channels, improves the achievable frequency and power of the target design. However, this technique tends to occupy more area for longer queues, which is its major limitation.

Chen *et al.* [9] proposed a Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART)-hop Setup Request (SSR) system that uses switches and short wires instead of overlapping long broadcast wires, which reduces the energy and channel wire overhead. The main drawback of the SSR system is the increased complexity of the allocator design.

Sayed *et al.* [10] implemented a new flexible router architecture that improves the performance of the entire network while using the same number of available buffers. Hence, it requires neither larger buffers nor additional Virtual Channels (VCs). This router handles requests arriving at a busy buffer by using another buffer in the router. It achieved better performance under hotspot, uniform, and nearest-neighbour traffic patterns. This flexible router architecture uses a traditional adder to process the request and grant signals for the next level.

## 3. FRA-CSLA METHODOLOGY

The performance of a NoC generally increases with the number and size of the buffers employed per input port. However, increasing the buffer count of a router raises two major issues: higher power consumption and larger area utilization. The aim of the work presented in this paper is to improve the ASIC and FPGA performance of the router with the help of a CSLA. The working principle of the FRA-CSLA method is described below.

### 3.1. Flexible Router Architecture

The FRA is formed by adding functionality to the inputs of the Base Router (BR) [10], as shown in Figure 3. The architecture contains five input ports and five output ports, connected through a crossbar switch. Flow\_ctrl denotes the request and grant signals, whereas the Data\_E, W, N and S signals carry the packets from the corresponding upstream routers. In the mesh topology used here, each input and output port is associated with a particular direction: East (E), West (W), North (N), South (S), or Local (L). The network interface connects the local input and output ports to the Processing Element (PE). The FRA structure and the operation of its input and output ports are described in detail below.



Figure 3: Block diagram of the FR structure [10]

The operation of the FRA is equivalent to that of the Base Router (BR) except under contention. When the requested FIFO is full, the BR must wait until at least one slot becomes free. The FRA does not wait: its FIFO Flexibility Controller (FFC) searches for a free slot by placing requests to the FIFOs of the other input ports that are not completely full. Once a free slot is detected, a grant signal is sent back to the upstream router, and the packet is then transferred to the selected FIFO.
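
The contention handling described above can be illustrated with a small behavioural model. This is a minimal sketch and not the authors' RTL: the class names `Fifo` and `FlexibleInputStage`, the FIFO depth, and the fixed search order over the other ports are all assumptions made for illustration, and the bookkeeping a real router needs to retrieve borrowed flits in order is omitted.

```python
from collections import deque

PORTS = ("E", "W", "N", "S", "L")  # port names as used in the text


class Fifo:
    """Bounded FIFO; the depth is an illustrative assumption."""

    def __init__(self, depth=4):
        self.depth = depth
        self.slots = deque()

    def is_full(self):
        return len(self.slots) >= self.depth

    def push(self, flit):
        if self.is_full():
            raise OverflowError("FIFO full")
        self.slots.append(flit)


class FlexibleInputStage:
    """Hypothetical model of the buffer-borrowing idea of the FRA."""

    def __init__(self, depth=4):
        self.fifos = {p: Fifo(depth) for p in PORTS}

    def accept(self, port, flit):
        """Store a flit arriving on `port` and return the FIFO that took it.

        A returned port name stands for a grant to the upstream router;
        None means that no FIFO in the router has a free slot.
        """
        # Try the port's own FIFO first (base-router behaviour), then the
        # FIFOs of the other input ports in a fixed, assumed order.
        candidates = [port] + [p for p in PORTS if p != port]
        for p in candidates:
            if not self.fifos[p].is_full():
                self.fifos[p].push(flit)
                return p
        return None
```

With a depth of 1, two consecutive flits arriving on port E would be stored in the E FIFO and then in a borrowed FIFO, whereas a base router would stall the second flit until the E FIFO drains.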

#### 3.1.1. Input Port Architecture

The block diagram of the East input port of the FRA is depicted in Figure 4. It contains three main components: i) the FIFO Flexibility Controller (FFC), ii) the FIFO buffer, and iii) the routing logic.

##### 3.1.1.1. FIFO Flexibility Controller (FFC)

The FIFO Flexibility Controller finds a suitable FIFO in the router to store the incoming packet by means of the request signals req\_FFCE\_FIFOW, N, S and the grant signals gnt\_FFCE\_FIFOW, N, S, as shown in Figure 4. The FFC is also responsible for communicating with the output ports through "req\_int\_E" and "gnt\_int\_E" so that received packets are forwarded downstream according to a specific routing algorithm.



Figure 4: Block diagram of the East direction-input port of the FRA

##### 3.1.1.2. FIFO Buffer

The FIFO buffer stores the packets received from the upstream router ("pkt\_US") in the FRA-CSLA architecture. Packets are stored and retrieved on a first-in, first-out basis. The storage space is normally a contiguous memory array.
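
A FIFO held in a contiguous array is commonly realised as a circular buffer with separate read and write pointers. The following sketch shows that general technique; the class name `RingFifo`, its method names, and the use of an occupancy counter are assumptions and are not taken from the router design itself.

```python
class RingFifo:
    """FIFO over a fixed-size array with read/write pointers (illustrative)."""

    def __init__(self, depth):
        self.mem = [None] * depth  # contiguous storage array
        self.depth = depth
        self.rd = 0                # index of the oldest stored entry
        self.wr = 0                # index of the next free slot
        self.count = 0             # number of occupied slots

    def full(self):
        return self.count == self.depth

    def empty(self):
        return self.count == 0

    def write(self, pkt):
        """Store pkt; return False (nothing stored) when the FIFO is full."""
        if self.full():
            return False
        self.mem[self.wr] = pkt
        self.wr = (self.wr + 1) % self.depth  # wrap around the array
        self.count += 1
        return True

    def read(self):
        """Return the oldest entry, or None when the FIFO is empty."""
        if self.empty():
            return None
        pkt = self.mem[self.rd]
        self.rd = (self.rd + 1) % self.depth
        self.count -= 1
        return pkt
```

The `full()` condition is what would trigger a search for a free FIFO in another input port in a flexible router.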

##### 3.1.1.3. Routing Logic

The routing logic applies the routing algorithm to the header packet in the FIFO: it determines the direction of the packet from the destination address it carries and selects the suitable output port.
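
The paper does not name the routing algorithm. Purely as an illustration of how a destination address can be mapped to an output port, the sketch below assumes dimension-ordered XY routing on a 2D mesh, with East as increasing x and North as increasing y. The function name `xy_route` and the coordinate convention are assumptions.

```python
def xy_route(current, destination):
    """Return the output port (E, W, N, S or L) for a header packet.

    Assumed convention: coordinates are (x, y) tuples, East is +x and
    North is +y. The x offset is resolved before the y offset.
    """
    cx, cy = current
    dx, dy = destination
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:
        return "N"
    if dy < cy:
        return "S"
    return "L"  # the packet has arrived; deliver to the local port
```

Under these assumptions a packet at router (1, 1) addressed to (3, 0) is first sent East, and only turns South once its x coordinate matches the destination.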

#### 3.1.2. Output Ports

Figure 5 shows the block diagram of the output port of the flexible router, which is composed of three major components: the arbiter, the output controller, and the multiplexer.



Figure 5: Block diagram of the output port

##### 3.1.2.1. Arbiter

An arbiter resolves the conflicts that arise when several input ports try to access the same output port. It takes the incoming request signals (req\_int\_E, W, N, S, L) as input and grants access to the output port to one of them (gnt\_int\_E, W, N, S, L) according to a pre-specified logic. There has been much research on arbitration schemes that increase the efficiency of an on-chip router.

Arbiters are classified into different types, such as round robin, fixed priority, lottery based, and token ring. Round-robin arbitration is the most popular and standard scheme used in NoC to facilitate bus arbitration [11]. It also has the advantage of reduced complexity and treats all requests impartially in scheduling.
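
The round-robin scheme can be sketched behaviourally as a rotating priority pointer. The class name `RoundRobinArbiter` and the port order below are assumptions; the sketch illustrates the general technique rather than the arbiter used in the evaluated router.

```python
class RoundRobinArbiter:
    """Rotating-priority arbiter for one output port (illustrative)."""

    def __init__(self, ports=("E", "W", "N", "S", "L")):
        self.ports = tuple(ports)
        self.next_idx = 0  # port with the highest priority in the next round

    def arbitrate(self, requests):
        """Grant one requesting port, or return None if nobody requests.

        `requests` is a collection of port names whose request line
        (req_int_*) is asserted in this cycle.
        """
        n = len(self.ports)
        for offset in range(n):
            idx = (self.next_idx + offset) % n
            port = self.ports[idx]
            if port in requests:
                # The winner gets the lowest priority in the next round.
                self.next_idx = (idx + 1) % n
                return port
        return None
```

If E and N keep requesting the same output, the grants alternate between them, which is the impartial treatment of requests mentioned above.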

##### 3.1.2.2. Output Controller

This block communicates with the downstream router by means of the signals req\_DS and gnt\_DS.

##### 3.1.2.3. MUX

The MUX block selects the packet that is forwarded to the downstream router based on the response obtained from the arbiter. In this work, a CSLA is used for the addition in the input controller instead of a traditional adder, which reduces the area of the FRA design. The CSLA operation is described in Section 3.2.

### 3.2. Carry Select Adder Design

The CSLA is used in numerous computational units to reduce the delay involved in carry propagation. The CSLA architecture used here employs a Binary-to-Excess-1 Converter (BEC) in place of the Ripple Carry Adder (RCA) that a conventional CSLA evaluates with Cin = 1, in order to obtain low area overhead and power consumption. The BEC requires fewer logic gates than the n-bit Full Adder (FA) structure it replaces. The architecture thus combines an RCA, evaluated with Cin = 0, with a BEC. Both candidate results arrive at the multiplexer earlier than its selection input. Based on the selection input Cin, the multiplexer outputs either the RCA result or the BEC result. The multiplexer delay and the arrival time of the corresponding selection input determine the delay of the different blocks in the design.
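
The BEC-based selection can be checked with a bit-level model. The sketch below is a behavioural illustration under stated assumptions, not the authors' Verilog: the function names, the 16-bit width, and the uniform 4-bit grouping are hypothetical, and for simplicity every group (including the first) uses the RCA/BEC/multiplexer structure. Bit lists are least-significant bit first.

```python
def full_adder(a, b, cin):
    """One-bit full adder returning (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a_bits, b_bits, cin=0):
    """Ripple-carry addition of two LSB-first bit lists."""
    out, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry


def bec(bits):
    """Binary-to-Excess-1: add 1 using only XOR and AND terms.

    Output bit i is bits[i] XOR (AND of all lower input bits), so the
    lowest output bit is simply the inverted lowest input bit.
    """
    out, lower_all_ones = [], 1
    for b in bits:
        out.append(b ^ lower_all_ones)
        lower_all_ones &= b
    return out


def csla_bec_add(a, b, width=16, group=4):
    """Add two unsigned integers group by group; return (sum, carry_out)."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    result, carry = [], 0
    for lo in range(0, width, group):
        # Single RCA per group, evaluated with Cin = 0.
        s0, c0 = ripple_carry_add(a_bits[lo:lo + group], b_bits[lo:lo + group], 0)
        rca_out = s0 + [c0]
        # The BEC over sum bits plus carry yields the Cin = 1 result.
        bec_out = bec(rca_out)
        # Multiplexer: the incoming carry selects RCA or BEC output.
        chosen = bec_out if carry else rca_out
        result.extend(chosen[:-1])
        carry = chosen[-1]
    value = sum(bit << i for i, bit in enumerate(result))
    return value, carry
```

Because the sum of two n-bit operands is at most 2^(n+1) - 2, incrementing the (n+1)-bit RCA result never overflows, which is why the BEC can stand in for a second adder evaluated with Cin = 1.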



Figure 6: Block diagram of the CSLA design

## 4. EXPERIMENTAL RESULTS

This section compares the area occupied and the power consumed by the conventional FRA and the proposed FRA. The designs were simulated using Xilinx ISE, and the corresponding results are presented in Table 1. The FPGA implementation was analysed for Virtex-4 and Virtex-5 devices. In addition, the existing FRA and the proposed FRA designs were synthesized using Cadence Genus Compiler with UMC 180 nm and 45 nm technologies.

The results in Table 1 show that the FRA-CSLA router uses fewer LUTs, flip-flops, and slices than the conventional design. The proposed method also achieves a higher operating frequency than the conventional method.

| Target FPGA          | Circuit  | LUT | FF  | Slices | Frequency<br>(MHz) |
|----------------------|----------|-----|-----|--------|--------------------|
| Virtex4<br>xc4vfx12  | Existing | 183 | 136 | 100    | 1113.519           |
|                      | FRA-CSLA | 107 | 40  | 70     | 1386.12            |
| Virtex5<br>xc5vlx20t | Existing | 166 | 136 | 55     | 1019.069           |
|                      | FRA-CSLA | 58  | 40  | 27     | 1276.126           |

 
 Table 1: FPGA implementation results for existing and FRA-CSLA method
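
The relative changes implied by Table 1 can be derived with a few lines of arithmetic. The sketch below copies the values from the table; the names `TABLE_1`, `percent_change`, and `compare` are introduced here only for illustration.

```python
# Values copied from Table 1: (LUT, FF, slices, frequency in MHz).
TABLE_1 = {
    "Virtex4 xc4vfx12": {
        "Existing": (183, 136, 100, 1113.519),
        "FRA-CSLA": (107, 40, 70, 1386.12),
    },
    "Virtex5 xc5vlx20t": {
        "Existing": (166, 136, 55, 1019.069),
        "FRA-CSLA": (58, 40, 27, 1276.126),
    },
}
METRICS = ("LUT", "FF", "Slices", "Frequency")


def percent_change(old, new):
    """Signed relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


def compare(table):
    """Return {device: {metric: percent change of FRA-CSLA vs. existing}}."""
    return {
        device: {
            metric: percent_change(old, new)
            for metric, old, new in zip(METRICS, rows["Existing"], rows["FRA-CSLA"])
        }
        for device, rows in table.items()
    }


if __name__ == "__main__":
    for device, changes in compare(TABLE_1).items():
        print(device, {m: round(v, 1) for m, v in changes.items()})
```

Computed this way, the LUT count falls by roughly 41.5% on Virtex-4 and 65.1% on Virtex-5, while the reported frequency rises by roughly 24.5% and 25.2% respectively.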

Table 2 presents the results of the ASIC implementation. There is an approximately 24% reduction in power and around 19% reduction in overall area, with little change in delay. The corresponding values of Power-Delay Product (PDP) and Area-Delay Product (ADP) indicate that the proposed design is more beneficial than the existing FRA method.

 
 Table 2: ASIC performance for existing and FRA-CSLA method

| Technology | Methodology | Area<br>(mm<sup>2</sup>) | Power<br>(µW) |
|------------|-------------|----------------------------|---------------|
| 180nm      | Existing    | 78.31                      | 2554.43       |
|            | FRA-CSLA    | 63.51                      | 1941.66       |
| 45nm       | Existing    | 64.05                      | 1480.65       |
|            | FRA-CSLA    | 52.13                      | 1134.15       |
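
The quoted reductions of about 19% in area and 24% in power can be reproduced directly from the entries of Table 2, taking the reduction as the difference between the existing and FRA-CSLA values divided by the existing value:

$$
\begin{aligned}
\text{Area reduction (180 nm)} &= \frac{78.31 - 63.51}{78.31} \times 100\% \approx 18.9\% \\
\text{Power reduction (180 nm)} &= \frac{2554.43 - 1941.66}{2554.43} \times 100\% \approx 24.0\% \\
\text{Area reduction (45 nm)} &= \frac{64.05 - 52.13}{64.05} \times 100\% \approx 18.6\% \\
\text{Power reduction (45 nm)} &= \frac{1480.65 - 1134.15}{1480.65} \times 100\% \approx 23.4\%
\end{aligned}
$$

Both technology nodes therefore give figures close to the rounded values stated in the text.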

The graphical comparison between area measurements for traditional FRA method and the proposed method is depicted in Figure 7.



Figure 7: Area comparison between existing and FRA-CSLA method

The graphical comparison between power measurements for conventional method and proposed method is presented in Figure 8.



Figure 8: Power comparison between existing and FRA-CSLA method



Figure 9: RTL schematic of FRA-CSLA from FPGA implementation

The RTL schematic of FRA-CSLA from FPGA analysis is shown in Figure 9 and the corresponding RTL schematic of FRA-CSLA from Cadence Design Framework is shown in Figure 10.




Figure 10: Cadence RTL schematic of FRA-CSLA method

## 5. CONCLUSION

This paper presented the FRA-CSLA architecture, implemented in Verilog HDL. The FPGA implementation results show that the proposed method achieves a higher operating frequency than the existing method and uses fewer LUTs, slices, and flip-flops. Similarly, the ASIC implementation results demonstrate a reduction of 19% in area and 24% in power for the FRA-CSLA method compared with the conventional method. Future work can aim to enhance the speed of operation in the ASIC implementation so as to further increase the efficiency of the router architecture.

## REFERENCES

- [1] Sudeep Pasricha and Nikhil Dutt, "On-Chip Communication Architectures: System on Chip Interconnect", Elsevier, 2008. https://doi.org/10.1016/B978-0-12-373892-9.00006-2
- [2] L. Benini and G. De Micheli, "Networks on Chips: Technology and Tools", San Francisco, CA, USA: Morgan Kaufmann, 2006. https://doi.org/10.1016/B978-012370521-1/50002-3
- [3] Sangeeta Singh, et al., "Power and Area Calibration of Switch Arbiter for High Speed Switch Control and Scheduling in Network-on-Chip", in IEEE 13<sup>th</sup> International SoC Design Conference (ISOCC), pp. 5-6, Oct. 2016, Jeju, South Korea. https://doi.org/10.1109/ISOCC.2016.7799765
- [4] Wen-Chung Tsai, et al., "Networks on Chips: Structure and Design Methodologies", Hindawi Publishing Corporation Journal of Electrical and Computer Engineering, Volume 2012. https://doi.org/10.1155/2012/509465
- [5] Psarras, Anastasios, et al., "Short-Path: A Network-on-Chip Router with Fine-Grained Pipeline Bypassing", IEEE Transactions on Computers 65, no. 10, pp. 3136-3147, 2016. https://doi.org/10.1109/TC.2016.2519916
- [6] Chen, Yao, et al., "FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow", IEEE Transactions on Very Large Scale Integration (VLSI) Systems 24, no. 6, pp. 2220-2233, 2016. https://doi.org/10.1109/TVLSI.2015.2497259
- [7] Qian, Z.L., et al., "A Support Vector Regression (SVR)-Based Latency Model for Network-on-Chip (NoC) Architectures", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35, no. 3, pp. 471-484, 2016. https://doi.org/10.1109/TCAD.2015.2474393
- [8] Yoon, Young Jin, Nicola Concer, Michele Petracca, and Luca P. Carloni, "Virtual Channels and Multiple Physical Networks: Two Alternatives to Improve NoC Performance", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 32, no. 12, pp. 1906-1919, 2013. https://doi.org/10.1109/TCAD.2013.2276399
- [9] Chen, Xianmin, and Niraj K. Jha, "Reducing Wire and Energy Overheads of the SMART NoC Using a Setup Request Network", IEEE Transactions on Very Large Scale Integration (VLSI) Systems 24, no. 10, pp. 3013-3026, 2016. https://doi.org/10.1109/TVLSI.2016.2538284
- [10] Sayed, Mostafa S., Ahmed Shalaby, Mohamed El-Sayed, and Victor Goulart, "Flexible Router Architecture for Network-on-Chip", Computers & Mathematics with Applications 64, no. 5, pp. 1301-1310, 2012. https://doi.org/10.1016/j.camwa.2012.03.074
- [11] Yanhua Liu, et al., "A Dynamic Adaptive Arbiter for Network-on-Chip", Journal of Microelectronics in Electronic Components and Materials, vol. 43, no. 2, pp. 111-118, 2013.
- [12] Hossam El-Sayed, et al., "Hardware Implementation and Evaluation of Flexible Router Architecture in NoCs", in IEEE 20<sup>th</sup> International Conference on Electronics, Circuits and Systems (ICECS), 2013. https://doi.org/10.1109/ICECS.2013.6815491
- [13] Lu Wang, et al., "A High Performance Reliable NoC Router", IEEE International Conf. ASP-DAC, pp. 712-718, 25-28 Jan. 2016.
- [14] Majdi Elhajji, et al., "FeRoNoC: Flexible and Extensible Router Implementation for Diagonal Mesh Topology", Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP), 2011. https://doi.org/10.1109/DASIP.2011.6136890
- [15] Kasem Khalil, et al., "Flexible Self-Healing Router for Reliable and High-Performance Network-on-Chips Architecture", 31st IEEE International System-on-Chip Conference (SOCC), 4-7 Sept. 2018. https://doi.org/10.1109/SOCC.2018.8618525
- [16] Sharath Chandra Inguva, et al., "Enhanced CORDIC Algorithm using an Area Efficient Carry Select Adder", International Conference on Intelligent and Sustainable Systems (ICISS), 2017.
- [17] Munisha Devi and Nasib Singh Gill, "Performance Evaluation of Dynamic Source Routing Protocol in Smart Environment", International Journal of Advanced Trends in Computer Science and Engineering, Vol. 8(2), 2019.