# Volume 9, No.4, July – August 2020 International Journal of Advanced Trends in Computer Science and Engineering

Available Online at http://www.warse.org/IJATCSE/static/pdf/file/ijatcse117942020.pdf

https://doi.org/10.30534/ijatcse/2020/117942020

# FPGA Implementation of PSO Based RGB-Y Filter

N.Sambamurthy<sup>1</sup>, M.Kamaraju<sup>2</sup>

Assistant Professor, Research Scholar<sup>1</sup>, Professor & Mentor<sup>2</sup>, Department of ECE, Gudlavalleru Engineering College, Gudlavalleru<sup>1,2</sup> sambanaga009@gmail.com<sup>1</sup>, profmkr@gmail.com<sup>2</sup>

## ABSTRACT

FPGA based RGB to Y (luma) conversion is necessary in image denoising, video processing and computer vision applications. RGB color images require more memory than grayscale images, and the human eye is less sensitive to color than to brightness. Converting to grayscale reduces the complexity and storage space and increases the speed of operation for various real time image processing applications. Existing color space converters use RGB streaming and data path elements that have very high computational complexity and dissipate more power. To overcome this drawback, the proposed fixed point RGB-Y color conversion filter uses efficient multi pixel streaming and a constant coefficient multiplier. The RGB coefficients are optimized with the PSO algorithm for different color standards. The designed multiplier and adder give a luma (Y) output for every RGB pixel in 24.04 ns on the targeted FPGA. The power dissipation of the designed architecture is 204 mW for each RGB window.

**Key words**: FPGA, RGB streaming, constant coefficient multiplier, luma output, PSO algorithm

## **1. INTRODUCTION**

The RGB color system is used in computer vision and video surveillance systems. RGB hardware is more complex and is easily affected by additive white Gaussian noise. Processing the R, G and B signals individually requires more computational time and power, so an efficient RGB-Y converter is needed. The recommendations define the rate for Y (luma) as 8 bits for each RGB color, which gives a 1/3 saving in bandwidth [1]. RGB-Y color space conversion is helpful in all image and video processing applications. The RGB-Y linear filter increases the throughput and further reduces the computational complexity.

The RGB-Y linear filter also increases the memory throughput. Low power design is very important for the real time requirements of any image processing computation, and existing systems dissipate more power. There is therefore a need to develop a low power, high speed RGB-Y filter architecture [2] for real time use. Ahirwal et al. discussed the tradeoffs involved in the design of FPGA based conversion from RGB color space to Y and vice versa; that system is still far from real time performance and its complexity is high. B. Gordon and N. Chaddha designed a low-complexity RGB color to Y architecture [3] that uses shift and add operations to perform the color conversion. This architecture eliminates the need for a multiplier in the luma conversion while keeping the same image quality. However, the design, in 180 nm CMOS technology, still has high power dissipation and delay.

Faycal Bensali and A. Amira [4] proposed an FPGA implementation of a distributed arithmetic based RGB to Y converter. The designed architecture is fully pipelined and has a throughput of 234 mega-conversions per second, but its hardware complexity and delay are high. Multiprocessing based architectures are very important for image processing and data security applications, and the techniques suggested in [14] are also helpful for RGB to Y converters.

The authors of [5] presented an RGB to Y converter that uses multipliers and adders, which increases the hardware computational complexity and delay. That system [5] also does not give satisfactory results under real time requirements. Such a design consumes up to 40% of the processing power in a highly optimized decoder environment [6], so it involves more complexity and increased resource utilization [10]. The multiple constant multiplication (MCM) design [7] is based on a common sub expression algorithm for intelligent luma conversion. We therefore propose a multiplier based on the common sub expression elimination algorithm, which reduces the complexity, but the system still needs optimization of the coefficients and memory. The optimization of the color coefficients required by the proposed design is discussed in Section 2.

FIR filters based on distributed arithmetic architectures [8] are computationally efficient. The LUT based multiplier operates at very high speed, but its resource utilization is at its maximum [9]. FPGA based architectures, which are convenient for designing signal processing and image processing applications, are discussed in [12], [13].

#### 2. DESIGN METHODOLOGY

Color range equation

Assume the color coefficients X1 = CKA, X2 = 1 - CKA - CKB and X3 = CKB.

$$Y = \begin{bmatrix} CKA & 1 - CKA - CKB & CKB \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}$$

$$Y = CKA \cdot R + (1 - CKA - CKB) \cdot G + CKB \cdot B \qquad \text{(i)}$$

$$0 < CKA < 1; \quad 0 < CKB < 1; \quad 0 < CKC < 1 \qquad \text{(ii)}$$

The RGB coefficients CKA and CKB are chosen between 0 and 1. The RGB model has intensity values ranging from 0 to 1, where 0 is the lowest value and indicates a black pixel, and 1 is the highest value and indicates a white pixel. The proposed hardware architecture uses the PSO algorithm, run in Matlab, for luma conversion and is then implemented on the targeted FPGA (Kintex) with a clock frequency of 200 MHz. The coefficient values obtained for CKA, CKB and CKC are shown in Table 1.
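As a minimal sketch of the weighted sum in equation (i), the conversion can be written in a few lines of Python. The function name `rgb_to_luma` is hypothetical, the inputs are assumed to be normalized to the 0 to 1 range described above, and the default weights are the 0-255 column of Table 1 applied to R, G and B in that order (an assumption about how the table maps to the channels).

```python
def rgb_to_luma(r, g, b, weights=(0.299, 0.587, 0.114)):
    """Weighted sum Y = wr*R + wg*G + wb*B for intensities in [0, 1].

    The default weights are taken from the 0-255 column of Table 1 and are
    assumed here to apply to R, G and B in that order.
    """
    wr, wg, wb = weights
    return wr * r + wg * g + wb * b


if __name__ == "__main__":
    # Black stays at the bottom of the range, white at the top.
    print(rgb_to_luma(0.0, 0.0, 0.0))
    print(rgb_to_luma(1.0, 1.0, 1.0))
```

Because the three default weights sum to 1, a white pixel maps to the top of the luma range and a black pixel to the bottom.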



Figure 1: PSO algorithm based optimization of color coefficients.

Figure 1 shows the PSO algorithm based optimization of the color coefficients, which searches for the optimal set of weighted coefficients. PSO optimizes the coefficients of the filter, which in turn helps produce an accurate filter output; the resulting values are shown in Table 1. The proposed coefficients are more optimized and useful for real time applications [15].
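The fitness function used for the PSO search is not stated here, so the following is only an illustrative sketch under assumptions, not the authors' procedure. It assumes a hypothetical cost (mean squared error against a reference luma computed with the weights 0.299, 0.587, 0.114 for R, G, B), hypothetical helpers `make_pixels` and `pso_fit`, and common textbook values for the inertia and acceleration constants. The search runs over CKA and CKB, clamped to the range in (ii), with Y formed as in the matrix expression for the color range.

```python
import random


def luma(r, g, b, cka, ckb):
    """Y = CKA*R + (1 - CKA - CKB)*G + CKB*B, following the matrix form."""
    return cka * r + (1.0 - cka - ckb) * g + ckb * b


def make_pixels(n, seed=0):
    """Random normalized RGB pixels (illustrative test data only)."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random(), rng.random()) for _ in range(n)]


def pso_fit(pixels, targets, n_particles=20, n_iter=100, seed=0):
    """Basic global-best PSO over (CKA, CKB) minimizing an assumed MSE cost."""
    rng = random.Random(seed)

    def cost(p):
        return sum((luma(r, g, b, p[0], p[1]) - t) ** 2
                   for (r, g, b), t in zip(pixels, targets)) / len(pixels)

    pos = [[rng.random(), rng.random()] for _ in range(n_particles)]
    vel = [[0.0, 0.0] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_cost = [cost(p) for p in pos]
    best_i = min(range(n_particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = pbest[best_i][:], pbest_cost[best_i]

    w, c1, c2 = 0.7, 1.5, 1.5  # assumed inertia and acceleration constants
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(2):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Keep each coefficient inside the allowed [0, 1] range.
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return gbest, gbest_cost


if __name__ == "__main__":
    px = make_pixels(100, seed=1)
    ref = [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in px]
    print(pso_fit(px, ref))
```

Only the first and third weights are searched; the middle weight follows from the constraint that the three sum to 1, which keeps the search two-dimensional.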

**Table 1:** Proposed constant color coefficients design

| Color coefficients of RGB | Proposed coefficients (0-255) | Proposed coefficients (16-240) | Proposed coefficients (16-235) |
|---------------------------|-------------------------------|--------------------------------|--------------------------------|
| CKA                       | 0.299                         | 0.1819                         | 0.299                          |
| CKB                       | 0.587                         | 0.0618                         | 0.114                          |
| CKC                       | 0.114                         | 0.6495                         | 0.877                          |

2.1 Matlab based RGB-Y converter



Figure 2: Matlab based RGB-Y converter.

In Figure 2, the System Generator based RGB-Y converter uses two multipliers to produce the luma output and takes five clock cycles for each RGB pixel. Each pixel in this design is 8 bits wide. The pixel width is restricted to 8 bits because the computational complexity increases with the pixel width.
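To illustrate what an 8-bit fixed point version of the weighted sum looks like, the sketch below scales the weights by 256, rounds them to integers, and shifts the accumulated result back by 8 bits. The helper names `quantize` and `luma_fixed` are hypothetical, and the weights are the 0-255 column of Table 1 assumed in R, G, B order; this is not a description of the System Generator model itself.

```python
def quantize(weights, frac_bits=8):
    """Scale fractional weights to integers with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    return tuple(round(w * scale) for w in weights)


def luma_fixed(r, g, b, q, frac_bits=8):
    """Integer multiply-accumulate followed by a truncating right shift."""
    return (q[0] * r + q[1] * g + q[2] * b) >> frac_bits


# Assumed weights from the 0-255 column of Table 1.
Q8 = quantize((0.299, 0.587, 0.114))

if __name__ == "__main__":
    print(Q8)
    print(luma_fixed(255, 255, 255, Q8))
```

The quantized weights sum to 256, so a full-scale white pixel still maps to 255 after the shift. A wider word length reduces the rounding error of each weight at the cost of wider multipliers and adders.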

#### 3. HARDWARE DESIGN OF MULTI PIXEL STREAMING BASED RGB-Y CONVERTER

#### (a) Multi pixel streaming based controller

The multi pixel streaming controller provides the three sets of inputs, together with the three color coefficients, at three times the clock rate of the system.

The multi pixel streaming based color conversion filter (RGB-Y) is shown in Figure 3. The designed architecture consists of 66 flip-flops, and 3 conjunction logic switches are used for the streaming process. The counter performs modulo-4 operation and counts from 00 to 11.

In mode 00 the counter holds its position; mode 01 enables the red image pixels, mode 10 the green image pixels, and mode 11 the blue image pixels.

Each time, the RGB pixel is multiplied with the color coefficients, and a luma (Y) output is generated every five clock cycles. This design reuse is useful for image processing applications like RGB-Y conversion.
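The mode sequencing described above can be sketched as a small behavioral model. It is not cycle accurate and does not model the flip-flops or the five-cycle latency; the function name `stream_luma`, the treatment of mode 00 as an idle step that clears the accumulator, and the integer weights scaled by 256 are all assumptions made for illustration.

```python
def stream_luma(pixels, coeffs, frac_bits=8):
    """Behavioral model of a modulo-4 streaming controller.

    pixels: iterable of (r, g, b) integer triples.
    coeffs: integer weights for R, G, B, scaled by 2**frac_bits.
    One multiplier and one adder are reused for the three channels.
    """
    outputs = []
    for rgb in pixels:
        acc = 0
        for mode in range(4):          # counter values 00, 01, 10, 11
            if mode == 0:
                acc = 0                # mode 00: assumed idle/hold step
                continue
            channel = mode - 1         # 01 -> R, 10 -> G, 11 -> B
            acc += coeffs[channel] * rgb[channel]
        outputs.append(acc >> frac_bits)
    return outputs


if __name__ == "__main__":
    print(stream_luma([(255, 255, 255), (0, 0, 0)], (77, 150, 29)))
```

Feeding the channels one at a time through a single multiply-accumulate path is what allows the hardware to be reused, at the price of several clock cycles per luma sample.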



Figure 3: Hardware architecture of Multi pixel streaming based color conversion Filter(RGB-Y)

#### (b) Reconfigurable Base-2 common sub expression algorithm (BCSE)

Reconfigurable BCSE based constant coefficient multiplication is performed between the inputs and the coefficients CKA, CKB and CKC. For a word length of 8 bits, the multiplier operation can be written as

$$\frac{x_{in}}{2}a_7 + \frac{x_{in}}{4}a_6 + \frac{x_{in}}{8}a_5 + \frac{x_{in}}{16}a_4 + \frac{x_{in}}{32}a_3 + \frac{x_{in}}{64}a_2 + \frac{x_{in}}{128}a_1 + \frac{x_{in}}{256}a_0 \qquad \text{(iii)}$$

In this equation, $x_{in}$ is the input image sample and $a_7$ to $a_0$ are the constants related to the color coefficients of the RGB image. The constant coefficient multiplier multiplies the constant color coefficients with each of the R, G and B values. The output is stored as the luma output, or grayscale image.

Writing the product as $a \cdot x_1 = y$, where $x_1$ denotes the input sample, the common terms are grouped step by step:

$$\underbrace{\frac{x_{in}}{2}a_7 + \frac{x_{in}}{4}a_6}_{x_2} + \frac{x_{in}}{8}a_5 + \frac{x_{in}}{16}a_4 + \frac{x_{in}}{32}a_3 + \frac{x_{in}}{64}a_2 + \frac{x_{in}}{128}a_1 + \frac{x_{in}}{256}a_0$$

$$x_2 + \frac{1}{4}\left\{\frac{x_1}{2} + \frac{x_1}{4}\right\} + \frac{1}{16}\left\{\frac{x_1}{2} + \frac{x_1}{4}\right\} + \frac{1}{64}\left\{\frac{x_1}{2} + \frac{x_1}{4}\right\}$$

$$\underbrace{x_2 + \frac{x_2}{4}}_{x_3} + \frac{x_2}{16} + \frac{x_2}{64}$$

$$x_3 + \frac{1}{16}\left\{x_2 + \frac{x_2}{4}\right\}$$

$$\underbrace{x_3 + \frac{x_3}{16}}_{Y}$$

Fixed and reconfigurable constant color coefficients (CKA, CKB, CKC) are loaded into the BRAM and used to compute the luma (Y) output for each set of RGB values, as depicted in Figure 4. The multiplier and adder are reused each time a luma output is generated, and the result is stored in buffer registers for each 3×3 window.



**Figure 4:** 3-bit Base-2 common sub expression elimination algorithm (3-BCSE)[9].
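The grouping into x2 and x3 can be checked numerically. The sketch below assumes the case in which all eight coefficient terms are present (every a set to 1), which is the case the grouped expressions describe, and uses exact fractions so that no rounding hides a mismatch. It also shows a plain shift-and-add multiply for an arbitrary 8-bit constant. The function names are hypothetical and the code is a software illustration, not the hardware of Figure 4.

```python
from fractions import Fraction


def direct_sum(x):
    """x/2 + x/4 + ... + x/256 evaluated term by term."""
    return sum(Fraction(x, 2 ** k) for k in range(1, 9))


def bcse_sum(x):
    """Same value using the shared sub expressions x2 and x3."""
    x1 = Fraction(x)
    x2 = x1 / 2 + x1 / 4      # common 2-term pattern
    x3 = x2 + x2 / 4          # reuse of x2
    return x3 + x3 / 16       # reuse of x3


def shift_add_multiply(x, coeff, frac_bits=8):
    """(x * coeff) >> frac_bits built only from shifts and additions."""
    acc = 0
    for k in range(frac_bits):
        if (coeff >> k) & 1:  # add a shifted copy for every set bit
            acc += x << k
    return acc >> frac_bits


if __name__ == "__main__":
    print(direct_sum(200) == bcse_sum(200))
    print(shift_add_multiply(255, 77))
```

The grouped form needs three additions where the term-by-term sum needs seven, which is the source of the adder savings.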

#### 4. RESULTS AND DISCUSSION

The designed architecture, shown in Figure 5, is simulated using System Generator, synthesized using Xilinx Vivado, and implemented in hardware on a Kintex-7 FPGA.



Figure 5: Simulation results of RGB-Y converter

Figure 5 shows the simulation results of the RGB-Y converter; the Xilinx ISim simulator is driven with the RGB samples and color coefficients as inputs. Each luma (Y) output takes 5 clock cycles. The RTL schematic and the top module, comprising the RGB pixel streaming, multiplier and adder, are depicted in Figures 6 and 7.



Figure 6: RTL diagram of RGB to Y Conversion



Figure 7: Top module of RGB to Y conversion

The RTL schematic of the RGB-Y converter shows the different logic resources used in the design; the RGB pixel streaming and multiplier routing connections are depicted in Figures 6 and 7.

**Table 2:** Computational complexity analysis of the proposed architecture for the RGB-Y converter.

| Multi pixel<br>(RGB)pixel controller | Flip-Flops<br>7(8-bit).<br>Total<br>=7*1=7          | 3 BRAM'S<br>Counters: Total<br>states-4(2 no's)              | Multiplexers<br>2<br>Mux's(4:1),<br>1 Mux(2:1).<br>Total<br>=3*3=9 | AND Gates<br>5(8-bit)<br>Total<br>=5*3=15. |
|--------------------------------------|-----------------------------------------------------|--------------------------------------------------------------|--------------------------------------------------------------------|--------------------------------------------|
| 3 bit BCSE based<br>Multiplier       | Adders<br>1(8-bit),2<br>(9-bit)<br>Total<br>=9*3=27 | Multiplexers/<br>AND Gates<br>3 Mux's(2:1),<br>Total =9*3=27 | -                                                                  | -                                          |

The computation time of the designed system is 24.04 ns for each 3 RGB pixels. The hardware utilization of the proposed design is much lower than that of the existing systems. The computational complexity is reduced by using the proposed pixel streaming and multiplier efficiently, as shown in Table 2.

| Word length | ITU standard | Software: input RGB image PSNR | Software: output image (Y) PSNR | Software: MSE | Hardware: input RGB image PSNR | Hardware: output image (Y) PSNR | Hardware: MSE |
|-------------|--------------|--------------------------------|---------------------------------|---------------|--------------------------------|---------------------------------|---------------|
| 8 Bit       | 0-255        | 54.1                           | 51.4                            | 4.99          | 54.1                           | 51.6                            | 4.62          |
| 8 Bit       | 16-240       | 54.0                           | 51.8                            | 4.07          | 54.0                           | 52.8                            | 2.22          |
| 8 Bit       | 16-235       | 53.8                           | 51.5                            | 4.27          | 53.8                           | 52.5                            | 2.41          |
| 10 Bit      | 0-255        | 66.6                           | 64.0                            | 3.9           | 66.6                           | 65.0                            | 2.4           |
| 10 Bit      | 16-240       | 65.9                           | 63.9                            | 3.03          | 65.9                           | 64.9                            | 1.51          |
| 10 Bit      | 16-235       | 65.8                           | 63.6                            | 3.34          | 65.8                           | 64.6                            | 1.82          |
| 12 Bit      | 0-255        | 72.0                           | 69.1                            | 4.02          | 72.0                           | 71.1                            | 1.25          |
| 12 Bit      | 16-240       | 71.8                           | 68.9                            | 4.03          | 71.8                           | 70.9                            | 1.25          |
| 12 Bit      | 16-235       | 70.1                           | 67.8                            | 3.18          | 70.1                           | 69.1                            | 1.44          |

**Table 3**: PSNR of input RGB with Y image

Note: PSNR:Peak signal to noise ratio, MSE:Mean squared error.
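For readers who want to reproduce this kind of comparison, the sketch below uses the standard textbook definitions of MSE and PSNR. The exact definitions, peak value and test images behind Table 3 are not stated, so the peak of 255 (8-bit samples) and the function names are assumptions.

```python
import math


def mse(ref, test):
    """Mean squared error between two equal-length sample sequences."""
    if len(ref) != len(test) or len(ref) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)


def psnr(ref, test, peak=255):
    """PSNR in dB: 10 * log10(peak^2 / MSE); infinite for identical inputs."""
    err = mse(ref, test)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)


if __name__ == "__main__":
    reference = [10, 20, 30, 40]
    converted = [11, 21, 31, 41]
    print(mse(reference, converted), psnr(reference, converted))
```

For wider word lengths the `peak` argument would change with the sample range, for example 1023 for 10-bit samples.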

The conversion error for both software and hardware is shown in Table 3. The conversion error is lower in hardware than in software. For real time applications, a conversion error of up to 5% is tolerable. The designed system uses the classical clock gating (CCG) technique [11], which reduces the dynamic power by 45% to 50%, as shown in Table 5. Even though the resource utilization increases, the power consumption decreases.

**Table 4:** RTL Implementation and Design summary:

| Data width | Slice FFs | LUTs | IOBs | Clock frequency (MHz) | Power (mW) |
|------------|-----------|------|------|-----------------------|------------|
| 8          | 48        | 37   | 35   | 158.3 (6.3 ns)        | 204        |
| 10         | 58        | 42   | 45   | 160.4 (6.2 ns)        | 198        |
| 12         | 64        | 53   | 56   | 181.3 (5.5 ns)        | 172        |
| 16         | 70        | 65   | 67   | 196 (5.1 ns)          | 169        |

Table 4 shows that the hardware complexity increases slightly with increasing word length, while the power dissipation decreases.

**Table 5**: Comparison of power analysis, with and without clock gating techniques:

| Data width | With intelligent clock gating: power dissipation (mW) | Without intelligent clock gating: power dissipation (mW) |
|---------------|-----------------------------------------------------------|--------------------------------------------------------------------|
| 8             | 204                                                       | 456                                                                |
| 10            | 198                                                       | 398                                                                |
| 12            | 172                                                       | 364                                                                |
| 16            | 169                                                       | 269                                                                |

The clock gating technique reduces the power dissipation by up to 50%-55%, as shown in Table 5 for various word lengths.
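The saving for each data width can be recomputed directly from the two power columns of Table 5. The short sketch below does this arithmetic; the function name `reduction_percent` is hypothetical and the figures are copied from the table.

```python
def reduction_percent(gated_mw, ungated_mw):
    """Percentage power saving of the gated design relative to the ungated one."""
    return 100.0 * (1.0 - gated_mw / ungated_mw)


# (data width, power with clock gating in mW, power without in mW) from Table 5
TABLE_5 = [(8, 204, 456), (10, 198, 398), (12, 172, 364), (16, 169, 269)]

if __name__ == "__main__":
    for width, gated, ungated in TABLE_5:
        print(width, round(reduction_percent(gated, ungated), 1))
```

Computed this way, the saving is not the same for every word length, so the quoted range is best read against the individual rows of the table.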

**Table 6:** Computational complexity analysis of existing and proposed systems.

| Design              | Device utilization               | Power consumption | Software computing time | Hardware computing time |
|---------------------|----------------------------------|-------------------|-------------------------|-------------------------|
| Proposed system     | Slices: 48, LUTs: 37, IOBs: 35   | 204 mW            | 1.04 ms                 | 24.04 ns                |
| Existing system [1] | Slices: 174, LUTs: 316, IOBs: 63 | -                 | 1.26                    | 1.2 ns                  |
| Existing system [4] | Slices: 144, LUTs: 216, IOBs: 50 | -                 | 1.43 ms                 | 0.28 ms                 |

The proposed system is superior to the existing systems [1], [4]. It has lower complexity than the existing systems, as shown in Table 6.

#### 5. CONCLUSION

The designed RGB-Y filter gives a luma output for every RGB pixel at three times the clock rate of the system. The multi pixel streaming is designed with intelligent clock gating, so the power dissipation is reduced by 45% to 50% for different data widths. The designed multiplier also reduces the computational complexity and produces an output every two clock cycles. The overall architecture of the color space converter (CSC) uses fewer resources, and the critical path delay is 24.04 ns for each RGB pixel. The PSO based RGB coefficients are optimized for different value ranges and give better PSNR and lower mean squared error. The computational complexity of the designed architecture changes with word length. With the optimal word length, the power consumption of the color space converter is 204 mW.

#### REFERENCES

[1] T. S. Saidani, H. M. Zayani, "Design of high speed and dynamic architecture for conversion with FPGA for real time processing," International Journal of Computer Science and Network Security (IJCSNS), Vol. 18, No. 1, January 2018.

[2] Ahirwal, Balkrishan, K. Mahesh, R. Mehta, "FPGA based system for color space transformation RGB to YIQ and YCbCr," Intelligent and Advanced Systems, 2007, ICIAS 2007, International Conference on IEEE, 2007. https://doi.org/10.1109/ICIAS.2007.4658603

[3] B. Gordon, N. Chaddha, "A low power Multiplier less YUV to RGB converter based on human vision perception," IEEE Workshop on VLSI Signal Processing, pp. 26-28, October 1994, DOI: 10.1109/VLSISP.1994.574765

[4] F. Bensali, A. Amira, "Design and Implementation of Efficient Architectures for Color Space Conversion," ICGST International Journal on Graphics, Vision and Image-processing, vol. 5, no. 1, pp. 37-47, Dec. 2004, DOI: 10.1007/978-3-540-30117-2.

[5] A. M. Sapkal, Mousami Munot, Joshi, "RGB to YCbCr color space conversion Using FPGA," Proceedings of IET International Conference on Wireless, Mobile and Multimedia Networks, pp. 255-258, Mar. 2008. https://doi.org/10.1049/cp:20080191

[6] M. Bartkowiak, "Optimization of Color Transformation for Real Time Video decoding," Digital Signal Processing for Multimedia Communications and Services, EURASIP, ECMCS, 2001, Budapest, September 11-13, 2001.

[7] Y. Pan and P. K. Meher, "Bit-level optimization of adder-trees for multiple constant multiplications for efficient FIR filter implementation," IEEE Trans. Circuits Systems-I, Reg. Papers, vol. 61, no. 2, pp. 455-462, Feb. 2014.

[8] S. Y. Park and P. K. Meher, "Efficient FPGA and ASIC realizations of DA-based reconfigurable FIR digital filter," IEEE Trans. Circuits System. II, Exp. Brief, vol. 61, no. 7, pp. 511-515, July 2014.

[9] N. Sambamurthy, M. Kamaraju, "FPGA Based Optimized Reconfigurable Base-2 Constant Coefficient Multiplier Architecture for Image Filtering," International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249-8958, Volume-8, Issue-62, August 2019.

[10] "Zynq-7000 All Programmable SoC: Technical Reference Manual," UG585 (v1.6.1), September 2013.

[11] Dushyant Kumar Soni, Ashish Hiradhar, "Dynamic Power reduction of synchronous digital design by using of efficient clock gating technique," International Journal of Engineering and Techniques, Volume 1, Issue 3, June 2015.

[12] N. Sambamurthy, M. Kamaraju, "Power optimized hybrid sorting-based median filtering," International Journal of Digital Signals and Smart Systems, Vol. 4, Issue 1-3, pp. 80-86, 2019. https://doi.org/10.1504/IJDSSS.2020.106075

[13] A. Ramakrishnaraju, N. Sambamurthy, M. Kamaraju, "FPGA implementation of cordic multiplier based redundant floating point butterfly architecture for signal processing applications," International Advanced Research Journal in Science, Engineering and Technology, Vol. 3, Issue 7, July 2016.

[14] A. Anusha, N. Sambamurthy, "Design and Analysis of Parallel AES Encryption and Decryption Algorithm for Multi Processor Arrays," IOSR Journal of VLSI and Signal Processing (IOSR-JVSP), Volume 5, Issue 1, Ver. III (Jan-Feb. 2015), pp. 01-11.

[15] ITU Recommendation BT.709-5, International Telecommunication Union, 2002.

N. Sambamurthy et al., International Journal of Advanced Trends in Computer Science and Engineering, 9(4), July - August 2020, 5003 - 5008