The Korean Institute of Electrical Engineers
The Transactions of the Korean Institute of Electrical Engineers

1. Senior Engineer, Telechips Co.



Keywords: H.265/HEVC, Deblocking Filter, Parallel Processing, VLSI, Memory Allocation

1. Introduction

HEVC is a video codec standard developed by the Joint Collaborative Team on Video Coding (JCT-VC) to improve the compression ratio of the preceding H.264/AVC by about 100%. It is widely used in video applications such as mobile devices, digital camcorders, UHD TV, and IPTV.

Since HEVC performs operations such as quantization, transformation, and prediction on quad-tree blocks, discontinuities can appear in the pixels along the block boundaries of the decompressed image (1). These discontinuities, called block artifacts, degrade the quality of the image as well as the compression efficiency. To reduce the block artifacts, a Deblocking Filter (also called a Loop Filter) is used in HEVC as well as in H.264/AVC. Since the block artifacts appear at the boundaries of Transform and Prediction blocks, the Deblocking Filter is applied there conditionally. In H.264, filtering is applied on 4x4 block boundaries, whereas in HEVC, filtering is applied on 8x8 block boundaries to reduce the computational complexity.

Various studies have addressed the parallel implementation of the Deblocking Filter (7)~(11). In (7), the performance is increased by using 6 filters in parallel; however, the gate count is high and the performance is limited to full HD resolution. In (8), a register array-based architecture is used to reduce overlapped computation, but the achievable performance is limited to 4K UHD resolution. In (9), a zigzag allocation of data into SRAM blocks avoids data conflicts and achieves the best performance among previous works. However, the circuit becomes too large (> 600K gates) if the parallelism is increased to 8 filters to handle video beyond 16K UHD at 60fps.

In this paper, a pipelined architecture with parallel filters is employed to sustain the high data bandwidth required by the Deblocking Filter. After various degrees of parallelism (2, 4, 8, and 16) were tested and compared, a parallelism of 8 filters was found to give the best performance for the same target. We also propose a novel memory allocation algorithm that optimizes the data access throughput of the on-chip SRAM. With the proposed architecture and algorithm, the performance is improved by 90% compared with previous works while the implemented area increases only moderately.

This paper is organized as follows. Section 2 explains the HEVC Deblocking Filter algorithm and previous research. Section 3 describes the proposed VLSI architecture and memory allocation algorithm. Section 4 compares the experimental results, and Section 5 concludes the paper.

Fig. 1. Deblocking Filter Computation

../../Resources/kiee/KIEE.2020.69.11.1755/fig1.png

2. Previous Works

2.1 Deblocking Filter Algorithm

Fig. 2. Memory Allocation Scheme in (9)

../../Resources/kiee/KIEE.2020.69.11.1755/fig2.png

Fig. 1 shows three adjacent 4x4 pixel blocks and the Deblocking Filter computation boundaries. The vertical filtering is performed along the vertical edge; the horizontal dotted box shows, as an example, the pixels used for the vertical filtering of one row. The vertical filtering is performed on all four rows of the two horizontally adjacent 4x4 blocks. The horizontal filtering is performed along the horizontal edge; the vertical dotted box shows, as an example, the pixels used for the horizontal filtering of one column. The horizontal filtering is performed on all four columns of the two vertically adjacent 4x4 blocks.

In HEVC, the Deblocking Filter is applied on 8x8 block boundaries in the following procedure. First, the boundary strength (BS) is computed to determine the strength of the filtering. Second, the β and tc values are computed based on the quantization parameters (QP) of the neighboring blocks. Third, the dE value is computed for the filter on/off and strong/weak filtering decisions. Finally, strong or weak filtering is performed along the horizontal or vertical edges.
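The decision flow above can be sketched in code. This is a simplified illustration only: the real HEVC standard derives β and tc from spec-defined lookup tables indexed by QP, and the on/off and strong/weak tests involve several spec-exact pixel-gradient conditions; the thresholds and function names below are assumptions for illustration.

```python
# Simplified sketch of the per-edge HEVC deblocking decision flow.
# The constants and thresholds are illustrative, not the spec-exact values.

def boundary_strength(p_intra, q_intra, p_has_coeff, q_has_coeff, mv_diff_large):
    """Step 1: boundary strength (BS) for a luma edge (simplified)."""
    if p_intra or q_intra:           # an intra block on either side => BS = 2
        return 2
    if p_has_coeff or q_has_coeff or mv_diff_large:
        return 1                     # coded residual or large MV difference
    return 0                         # otherwise no filtering

def filter_decision(bs, beta, tc, dp0, dq0, dp3, dq3):
    """Steps 3-4: filter on/off and strong/weak decision (simplified).

    dp0/dq0 and dp3/dq3 are the second-derivative activity measures of
    rows 0 and 3 on the P and Q sides of the edge.
    """
    if bs == 0:
        return "off"
    d = dp0 + dq0 + dp3 + dq3        # local activity across the edge
    if d >= beta:                    # too much texture: filtering would blur detail
        return "off"
    # strong filtering requires a much flatter signal near the edge
    strong = (dp0 + dq0) * 2 < beta / 8 and (dp3 + dq3) * 2 < beta / 8
    return "strong" if strong else "weak"
```

The sketch shows why the hardware pipeline can separate the BS/β/tc computation (stages 1-2) from the actual filtering (stage 3): the decision inputs depend only on block metadata and a few boundary pixels.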

2.2 Related Works

In (7), the performance is increased by using 6 filters in parallel. In (8), a register array-based architecture is used to reduce overlapped computation, and 4 filtering boundaries are computed in parallel. In (9), three techniques are used to improve the performance: pipelining, filter order optimization, and zigzag memory mapping. (10) and (11) also concern deblocking filter designs, but the scope of their problems is not directly comparable to the designs compared here.

Fig. 2 shows the memory allocation scheme in (9). Adjacent 4x4 blocks, distinguished by black and white colors, are stored in different SRAMs alternately to avoid data conflicts during SRAM access. However, the performance of this scheme is not sufficient for a highly parallel VLSI architecture.

The architecture in (7) cannot process 8K UHD or higher video. The architectures in (8) and (9) can process 8K UHD 60fps in real time, but the circuits become too large (> 600K gates). In addition, their operating clock frequencies are not high enough to process 16K UHD or higher video in real time.

3. Proposed VLSI Architecture

3.1 Parallel Architecture

Compared with H.264/AVC, HEVC reduces data dependencies further and is therefore better suited to parallel processing. The proposed architecture improves the performance by employing parallelism: a multi-stage pipeline with parallel filters. For parallel filter processing, a dedicated memory allocation algorithm is devised for efficient on-chip SRAM access.

Fig. 3 shows the overall block diagram of the proposed architecture, which consists of the following components: BS calculator, parallel dE calculator, β and tc calculator, parallel filters, and SRAM. The SRAM is divided into 16 blocks, which store the current pixels of a 32x32 block together with their left and above neighboring pixels. Each SRAM block stores eight 4x4 pixel blocks. The data is read from the SRAM blocks for the filtering computation and stored back after the computation. The read/write SRAM block positions are determined by the proposed memory allocation algorithm.

Fig. 3. Block Diagram of the Proposed Architecture

../../Resources/kiee/KIEE.2020.69.11.1755/fig3.png

The proposed architecture works as a 4-stage pipeline as shown in Fig. 3; the circled numbers in Fig. 4 denote the pipeline stages. In stage 1, the data is read from SRAM and the BS calculation is performed. In stage 2, the β and tc calculation and 8 parallel dE calculations are performed. In stage 3, 4x4 block filtering is performed in parallel. In stage 4, the filtered results are stored back into the SRAM blocks.

3.2 Memory Allocation Algorithm

Conflict-free data access to the on-chip SRAM by the parallel filters is essential to maintain high performance. For the proposed parallel filter architecture, we present an efficient memory allocation algorithm that avoids such conflicts. With this algorithm, each filter is guaranteed to read and write its own 2 of the 16 SRAM blocks in parallel, without conflicts between the filters.

Fig. 4. 32x32 Pixel Block (Y, Cb, Cr)

../../Resources/kiee/KIEE.2020.69.11.1755/fig4.png

The filtering boundary of the deblocking filter can be either vertical or horizontal. As shown in Fig. 3, the proposed architecture has 16 SRAM blocks. Fig. 4 shows the 32x32 block waiting for the parallel Deblocking Filter operation; each named block is a 4x4 block. Our goal is to allocate these 4x4 blocks to the 16 SRAM blocks such that no data conflicts occur between the filters during the parallel filter computation.

As explained above, each SRAM block stores eight 4x4 pixel blocks. In Fig. 4, each block is named by two letters, such as AA or BC. Blocks with the same first letter belong to the same vertical filtering group. For example, in Fig. 4, the AE, AA, AB, AC, and AD blocks belong to the same vertical filtering group, and all of them share the first letter 'A'. Fig. 13 shows an example of the final data allocation: the blocks whose first letter is 'A' are stored in the first row of the SRAM blocks, spread evenly across them without any overlaps. Therefore, the vertical filtering of this data group can be performed by 8 filters accessing different SRAM blocks simultaneously without conflicts.

The proposed memory allocation algorithm provides a deterministic way to decide the SRAM location using the two letters of each 4x4 block. The second letter of the block name denotes its horizontal filtering group.

The filtering order of HEVC is: (1) the vertical filtering shown in Fig. 5 is done first, and (2) the horizontal filtering shown in Fig. 6 is done next. For example, the blocks AA, AB, AC, AD, and AE are used for the same vertical filtering, and BA, BB, BC, BD, and BE are used for another vertical filtering, as shown in Fig. 5.

Fig. 5. 32x32 Pixel Block - Vertical Filtering

../../Resources/kiee/KIEE.2020.69.11.1755/fig5.png

Fig. 6. 32x32 Pixel Block - Horizontal Filtering

../../Resources/kiee/KIEE.2020.69.11.1755/fig6.png

For the horizontal filtering, AA, BA, CA, DA, and EA belong to one filtering group, and AB, BB, CB, DB, and EB belong to another. The blocks of the same filtering group must be stored in different SRAM blocks to avoid data access conflicts during the parallel filter computation. By applying these rules to the 32x32 pixel block using 8 alphabet letters as shown in Fig. 4, we eventually obtain the goal allocation shown in Fig. 13: all the pixel blocks are stored in SRAM blocks such that no access conflicts occur during the filter computation.

Fig. 7. Memory Allocation Step 1

../../Resources/kiee/KIEE.2020.69.11.1755/fig7.png

Figs. 7~12 illustrate a deterministic algorithm that reaches the memory allocation shown in Fig. 13 without any access conflicts. First, as shown in Fig. 7, allocate the B-, C-, and D- blocks with gaps of 4 slots between the rows. Note that all the blocks are allocated to SRAM blocks (S_0~S_15) such that, within each SRAM, no two blocks share a first letter or a second letter; therefore, there is no conflict in the SRAM access. After this allocation, the A- and E- blocks remain to be allocated, as shown in Fig. 8.

Fig. 8. Memory Allocation Step 2

../../Resources/kiee/KIEE.2020.69.11.1755/fig8.png

Fig. 9. Memory Allocation Step 3

../../Resources/kiee/KIEE.2020.69.11.1755/fig9.png

Second, as shown in Fig. 9, for each SRAM column with an -A block already assigned, such as SRAM_1, put AE in the first row; for each column with an -E block, put EA in the fifth row. Then, as shown in Fig. 10, fill in the remaining letter in each column that has only one empty slot.

Finally, as shown in Fig. 11, fill in the remaining two rows with AA and EE for one column, and with AE and EA for the other three columns. After all the luminance blocks are allocated, the Cb and Cr blocks are allocated similarly. As shown in Fig. 12, allocate the GG blocks in the SRAMs, and then allocate the FH and HF blocks. The remaining columns are allocated such that the first and second letters do not overlap, as shown in Fig. 13.

Fig. 10. Memory Allocation Step 4

../../Resources/kiee/KIEE.2020.69.11.1755/fig10.png

Fig. 11. Memory Allocation Step 5

../../Resources/kiee/KIEE.2020.69.11.1755/fig11.png

As long as the letters in each position do not overlap within a column, many solutions are possible. In this paper, we presented a deterministic algorithm that finds a working solution quickly.
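The conflict-freedom condition is easy to check mechanically: within each SRAM, no two stored blocks may share a first letter (vertical group) or a second letter (horizontal group). The sketch below expresses that check and builds one valid assignment by cyclic shifting, a scaled-down example with 8 letters and 8 SRAMs rather than the paper's Fig. 7-13 construction with 16 SRAMs; both the example layout and the names are illustrative assumptions.

```python
from string import ascii_uppercase

def conflict_free(alloc):
    """alloc: list of SRAM columns, each a list of (first, second) letter pairs.
    An SRAM column is conflict-free iff all first letters are distinct
    (no two blocks of the same vertical filtering group) and all second
    letters are distinct (same for horizontal groups)."""
    for col in alloc:
        firsts = [b[0] for b in col]
        seconds = [b[1] for b in col]
        if len(set(firsts)) != len(firsts) or len(set(seconds)) != len(seconds):
            return False
    return True

letters = ascii_uppercase[:8]  # A..H, as in the two-letter names of Fig. 4
# One valid assignment, built by cyclic shifting (a Latin-square construction):
# SRAM s, slot r holds block (letters[r], letters[(r + s) % 8]).
alloc = [[(letters[r], letters[(r + s) % 8]) for r in range(8)]
         for s in range(8)]
```

Any layout passing `conflict_free` lets all 8 filters of one filtering pass access distinct SRAMs simultaneously, which is exactly the property the deterministic procedure of Figs. 7-12 guarantees by construction.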

Fig. 12. Memory Allocation Step 6

../../Resources/kiee/KIEE.2020.69.11.1755/fig12.png

Using the proposed memory allocation algorithm with the 8-filter architecture, the execution time of the filter operations on the 96 boundaries of a 32x32 pixel block is 12 clock cycles. Adding the 6 cycles of pipeline start-up and drain delay, the total execution time is 18 clock cycles.
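The cycle count above follows directly from the stated numbers (96 boundaries, 8 parallel filters, 6 cycles of pipeline fill and drain):

```python
# Cycle count per 32x32 block, using only figures stated in the text.
boundaries = 96                        # filtering boundaries in a 32x32 block
filters = 8                            # parallel filters
steady_state = boundaries // filters   # 12 cycles of fully parallel filtering
pipeline_overhead = 6                  # start-up and drain of the 4-stage pipeline
total_cycles = steady_state + pipeline_overhead  # 18 cycles
```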

Fig. 13. Memory Allocation Step 7

../../Resources/kiee/KIEE.2020.69.11.1755/fig13.png

The memory allocation algorithm can be applied to other degrees of filter parallelism in a similar fashion: if the number of SRAM blocks is increased, the number of parallel filters can be increased proportionally. For example, 16 parallel filters can be used with 32 SRAM blocks.

4. Experimental Results

For 16K UHD video, the proposed 8-filter architecture is implemented with a gate count of 266K at an operating clock frequency of 140MHz in a TSMC 65nm process. The filter core is 168K gates, the SRAM is 66K gates, and the remaining logic blocks are 10K gates. The maximum operating frequency of the proposed architecture is up to 200MHz. A quarter-LCU is processed in 18 clock cycles by the pipelined computation.
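The 140MHz figure can be sanity-checked from the per-block cycle count. Assuming 18 cycles per 32x32 block with no overlap between consecutive blocks (a conservative assumption; the pipeline may overlap adjacent blocks):

```python
# Required clock frequency for 16K UHD (15360x8640) at 60 fps,
# assuming 18 cycles per 32x32 block and no inter-block overlap.
w, h, fps = 15360, 8640, 60
blocks_per_frame = (w * h) // (32 * 32)          # 129,600 32x32 blocks
cycles_per_second = blocks_per_frame * fps * 18
freq_mhz = cycles_per_second / 1e6               # ~139.97 MHz
```

The result is just under 140MHz, consistent with the reported operating frequency.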

Table 1 compares the previous works and the proposed architecture. In (9), 4K UHD 30fps video is processed at 28.5MHz. The performance of the proposed architecture is 6.2 times higher than that of (9), although the gate count is 3.25 times larger. The overall improvement, considering both area and time, is 1.9 times (or 90%) over (9). Table 1 also shows the results of the various filter configurations of the proposed architecture.

For the comparison of the parallel filter configurations against (9), the target of 4K UHD 30fps is used. For the 2-parallel configuration, the operating clock frequency is 14MHz and the gate count is 118K: the gate count is increased by 57% and the performance by 103%. For the 4-parallel configuration, the operating clock frequency is 7.7MHz and the gate count is 160K: the gate count is increased by 113% and the performance by 270%. For the 16-parallel configuration, the operating clock frequency is 3MHz and the gate count is 412K: the gate count is increased by 449% and the performance by 850%. We found that the 8-parallel configuration showed the best performance per area.
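The "best performance per area" claim can be reproduced from the Table 1 numbers, taking the clock-frequency reduction relative to (9) as the throughput gain and the gate-count ratio as the area cost (the efficiency metric itself is our illustrative choice, not one defined in the paper):

```python
# Area/performance tradeoff of the configurations in Table 1, relative to
# design (9): 28.5 MHz and 75K gates for the same 4K UHD 30fps target.
base_freq, base_gates = 28.5, 75.0
configs = {2: (14.0, 118.0), 4: (7.7, 160.0), 8: (4.6, 244.0), 16: (3.0, 412.0)}

for n, (freq, gates) in sorted(configs.items()):
    perf = base_freq / freq       # throughput gain over (9)
    area = gates / base_gates     # area cost over (9)
    print(f"{n:2d} filters: perf x{perf:.2f}, area x{area:.2f}, "
          f"efficiency {perf / area:.2f}")
```

The 8-parallel configuration yields the highest performance-per-area ratio (about 1.9x that of (9), i.e. the 90% overall improvement quoted above); 16 filters gain raw speed but lose efficiency to the larger SRAM and filter array.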

Table 1. Comparison of the previous works and the proposed architecture

Design                   Resolution       Clock Frequency   Gate count
(7)                      1920*1080*86     108MHz            36.8K
(8)                      4096*2160*30     94.4MHz           54K
(9)                      4096*2160*30     28.5MHz           75K
Proposed (2-parallel)    4096*2160*30     14MHz             118K
Proposed (4-parallel)    4096*2160*30     7.7MHz            160K
Proposed (8-parallel)    4096*2160*30     4.6MHz            244K
Proposed (16-parallel)   4096*2160*30     3MHz              412K
Proposed (8-parallel)    15360*8640*60    140MHz            244K

5. Conclusion

In this paper, we proposed a parallel VLSI architecture and a novel memory allocation algorithm for a high-performance HEVC Deblocking Filter. Previous designs either did not support 8K or higher resolution video or were too large. The proposed architecture is a multi-stage pipeline with parallel filters, and the deterministic memory allocation algorithm quickly decides the data mapping from a 32x32 pixel block into 16 SRAM blocks so that no data access conflicts occur during the filter operation. The proposed architecture can process 16K UHD 60fps video at 140MHz, a 90% improvement over the previous works.

References

1
JCT-VC, "High Efficiency Video Coding (HEVC) text specification draft 10 (for FDIS & Last Call)," JCTVC-L1003_v34, Geneva, Switzerland, Jan. 2013.
2
G. J. Sullivan, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649-1668, Dec. 2012.
3
J. R. Ohm, "Comparison of the Coding Efficiency of Video Coding Standards - Including High Efficiency Video Coding (HEVC)," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1669-1684, Dec. 2012.
4
F. Bossen, B. Bross, K. Sühring, D. Flynn, "HEVC Complexity and Implementation Analysis," IEEE Trans. Circuits Syst. Video Technol., vol. 22, Dec. 2012.
5
M. T. Pourazad, C. Doutre, M. Azimi, P. Nasiopoulos, "HEVC: The New Gold Standard for Video Compression: How Does HEVC Compare with H.264/AVC?," IEEE Consumer Electronics Magazine, pp. 36-46, June 2012.
6
A. Norkin, "HEVC Deblocking Filter," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1746-1754, Dec. 2012.
7
E. Ozcan, Y. Adibelli, I. Hamzaoglu, "A high performance deblocking filter hardware for High Efficiency Video Coding," in Proc. 23rd Int. Conf. on Field Programmable Logic and Applications (FPL), 2013.
8
J. Bae, "Register array-based VLSI architecture of H.265/HEVC loop filter," IEICE Electronics Express, vol. 10, no. 7, pp. 20130161, 2013.
9
W. Shen, "A high-throughput VLSI architecture for deblocking filter in HEVC," in Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS), 2013.
10
P. Hsu, C. Shen, "The VLSI Architecture of a Highly Efficient Deblocking Filter for HEVC Systems," IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 5, pp. 1091-1103, May 2017.
11
S. Baldev, K. K. Anumandla, R. Peesapati, "Scalable Wavefront Parallel Streaming Deblocking Filter Hardware for HEVC Decoder," IEEE Trans. Consumer Electronics, vol. 66, no. 1, pp. 41-50, Feb. 2020.

About the Authors

Hyunjun Kim
../../Resources/kiee/KIEE.2020.69.11.1755/au1.png

He received B.S. and M.S. degrees in the Dept. of Information and Communication Engineering at Myongji Univ., Korea, in 2013 and 2015, respectively.

He is currently working for Telechips Co. as a senior engineer for the video IP and SoC development.

His research interests include VLSI design, and image/video processing.

Jongwoo Bae
../../Resources/kiee/KIEE.2020.69.11.1755/au2.png

Dr. Bae received a B.S. degree in Control and Instrumentation from Seoul National University, Korea, in 1988, and M.S. and Ph.D. degrees in Computer Engineering from the University of Southern California in 1989 and 1996, respectively.

Dr. Bae worked as an engineer for Actel, Avanti, and Pulsent Co. in San Jose, CA, from 1996 to 2002.

He worked as a principal engineer in Samsung Electronics from 2003 to 2007.

He has been a full professor in the Dept. of Information and Communications Engineering at Myongji University, Korea, since 2007.

His research interests include VLSI design, image/video processing, digital TV, and processors.