# An architecture for the new Adaptive Loop Filter of the High Efficiency Video Coding

Victor Covalski<sup>#1</sup>, Fabiane Rediess<sup>#1</sup>, Pargles Dall'Oglio<sup>#1</sup>, Marcelo Porto<sup>#1</sup>, Luciano Agostini<sup>#1</sup>

\*Centre of Technology Development (CDTec), Federal University of Pelotas (UFPel) P.O. Box 354 – 96001-970 – Pelotas – RS - Brazil

<sup>1</sup>vrcjunes, fkrediess, pwdalloglio, porto, agostini@inf.ufpel.edu.br

Abstract— As portable electronic devices grow more sophisticated, the demand for higher image and video quality also increases. However, video quality deteriorates during compression, and the degradation becomes more evident at higher compression rates. The main goal of HEVC is to enable significantly improved compression performance, in the range of a 50% bit rate reduction for equal perceptual video quality in comparison with the previous standard. To achieve this objective, it was necessary to improve existing techniques from previous standards and to incorporate new tools. This work presents a hardware design for the Adaptive Loop Filter (ALF) core of the High Efficiency Video Coding (HEVC) standard. The ALF is an innovation proposed by HEVC and is responsible for improving the final video quality by reducing errors generated in all encoder steps. The architecture was described in VHDL and synthesized for an Altera Stratix II FPGA; running at 326 MHz, it is able to process Quad Full HD (QFHD, 3840x2160) videos in real time.

**Keywords**— Adaptive Loop Filter; video compression; HEVC; video coding standard

## I. INTRODUCTION

HEVC, the High Efficiency Video Coding standard, is the most recent joint video project of the ITU-T VCEG and ISO/IEC MPEG standardization organizations, working together in a partnership known as the Joint Collaborative Team on Video Coding (JCT-VC) [1].

The main goal of the HEVC standardization is to enable significantly improved compression performance relative to existing standards – in the range of 50% bit rate reduction for equal perceptual video quality [1]. To achieve this objective it was necessary to incorporate new tools and techniques, improving the existing ones in the previous standards.

The coding process aims to reduce many types of redundancy present in digital videos. During this process, the subjective quality can deteriorate, especially in the quantization step, which inserts artefacts in the video as a side effect of the increased compression rate. Recently, new filters have been proposed for insertion into the encoders to increase the subjective visual quality of the encoded videos. HEVC proposed a set of three filters called the In-Loop Filter, which contains the traditional Deblocking Filter (DF), the Sample Adaptive Offset (SAO) and the Adaptive Loop Filter (ALF), which is the focus of this work.

This evolution of video coding algorithms increased the computational complexity of the encoders, and software solutions are not enough to achieve real-time processing of high-resolution videos. To be considered real time, the processing rate must be at least 30 frames per second, which gives the sense of continuous motion. To achieve real-time processing of high-resolution videos, a dedicated hardware solution is usually used.

The focus of this work is the hardware design of the ALF core defined in HEVC. The ALF aims to reduce the distortion accumulated in all previous coding steps, and applying it improves both the subjective and the objective video quality. This work proposes hardware architectures for the two ALF core sizes proposed in the emerging HEVC standard.

This work is organized as follows. Section II describes the Adaptive Loop Filter proposed by the emerging HEVC standard. Section III presents the ALF core architecture. Section IV shows the synthesis results, and Section V discusses related works and comparisons. Finally, Section VI presents the conclusions.

## II. HEVC ADAPTIVE LOOP FILTER

The focus of this work is the In-Loop Filter defined for the emerging HEVC standard. The In-Loop Filter is used to reduce the distortions introduced by the previous encoder steps, especially by the quantization step. The subjective image quality is enhanced when the distortion is reduced.

The In-Loop Filter is an innovation proposed by the HEVC standard. It is not essential to the operation of the encoder; however, when the focus is quality improvement, the filtering process becomes more relevant. The filter set proposed by HEVC must operate with at least the same throughput as the previous encoder steps to avoid extra delay. Thus, the filter implementations must reach processing rates at least as high as those of the other codec parts.

The In-Loop Filter is located immediately before the storage of the reconstructed frame in the coding loop, as shown in Fig. 1. It is composed of the Deblocking Filter (DF), followed by the Sample Adaptive Offset (SAO) and finally by the Adaptive Loop Filter (ALF) [2].



Fig. 1 Video encoder diagram

During the JCT-VC meetings, several alternatives were proposed, such as the frame-based ALF [2], the block-based ALF [3] and the quadtree-based ALF [4]. The frame-based method uses a flag to indicate whether or not the filter is applied to the whole picture [2]. The block-based method uses one flag per block to indicate whether that block must be filtered. In this method, the block can assume several sizes: 8x8, 16x16, 24x24, 32x32, 64x64 or 128x128 samples. The block-based method improves image quality in comparison with the frame-based method because the decision is made on smaller units (blocks) [3]. The quadtree-based method behaves similarly to the block-based method [3], but it relies on a quadtree structure, which corresponds to the Coding Tree Block (CTB) structure found in the newest draft version. The quadtree is a structure that allows blocks of variable and adaptive size to be handled efficiently [4].

In version 4 of the Working Draft [5] and of the HM [6], the ALF filtering process uses multiple diamond-shaped 2-D filters for luminance samples. The allowed filter sizes are 5x5, 7x7 and 9x9.

In version 7 of the Working Draft [7] and of the HM [8], the ALF core was simplified: it now has only two shapes, with fewer operators than in the previous version. The two new shapes are shown in Fig. 2.



Fig. 2 Filter shapes for luminance samples

The ALF was adopted in the Test Model under Consideration (TMuC), the software in which the first proposals for a new video coding standard were evaluated, and it has been part of the Working Draft and of the HEVC reference software since the beginning. The filter is still being studied by the group, and progress is being made in terms of coding efficiency and complexity reduction [9].

At the tenth meeting of the group, in Stockholm, the ALF was not adopted as one of the tools of the HEVC Main Profile, despite the gains obtained [10]. Because of that, the ALF was removed after HM 8.0. Even with these gains, the ALF has high computational complexity because of the multiplications it requires. The main concern is that the ALF is located right before the reconstructed frame is stored as a reference; therefore, it must also be present in the decoder.

Up to the newest versions of the Working Draft and of the HM, both version 10.0, the ALF has not been reinserted into the Main Profile of HEVC.

On the other hand, after the removal of the ALF from the Working Draft and from the HM, discussions arose in the following meetings suggesting optimizations to the ALF process and its reinsertion in another profile.

## III. PROPOSED ARCHITECTURE

This paper proposes a hardware architecture for the HEVC ALF filter core. This design is based on the Working Draft 7 [7] and the Test Model HM-7.0 [8] of the HEVC standard.

Based on Working Draft 7, three sub-processes were identified in the whole ALF filtering process. The first sub-process is the boundary padding. If the coding unit edge does not coincide with a slice border, the border is padded with neighbouring samples; if the coding unit edge coincides with a slice border, the padding is done with a copy of the coding unit border samples.

This step was not implemented because the focus of this work is on the ALF core. Adding it is expected to cause only a slight increase in hardware resource usage, because the boundary padding can be implemented in the control unit alone.
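
Although boundary padding is outside the scope of the implemented core, the rule described above can be illustrated in software. The following Python fragment is only a sketch under stated assumptions: the function name `padded_block`, the `on_slice_border` flags and the 2-sample margin (what a 5x5 filter footprint would need) are hypothetical and are not taken from the Working Draft or from the HM.

```python
def padded_block(picture, x0, y0, size, margin, on_slice_border):
    """Return the (size + 2*margin) square window around a block.

    picture         -- 2-D list of samples, indexed as picture[y][x]
    x0, y0          -- top-left corner of the coding unit (hypothetical API)
    on_slice_border -- dict with boolean keys 'top', 'bottom', 'left', 'right'

    On a side that lies on a slice border, the coding unit's own border
    samples are replicated; on the other sides the real neighbours are read.
    """
    x1, y1 = x0 + size - 1, y0 + size - 1
    window = []
    for y in range(y0 - margin, y1 + margin + 1):
        src_y = y
        if on_slice_border["top"]:
            src_y = max(src_y, y0)      # replicate the first row
        if on_slice_border["bottom"]:
            src_y = min(src_y, y1)      # replicate the last row
        row = []
        for x in range(x0 - margin, x1 + margin + 1):
            src_x = x
            if on_slice_border["left"]:
                src_x = max(src_x, x0)  # replicate the first column
            if on_slice_border["right"]:
                src_x = min(src_x, x1)  # replicate the last column
            row.append(picture[src_y][src_x])
        window.append(row)
    return window


# Illustrative 8x8 picture whose sample value encodes its position.
picture = [[10 * y + x for x in range(8)] for y in range(8)]
flags = {"top": True, "left": True, "bottom": False, "right": False}
window = padded_block(picture, 2, 2, 4, 2, flags)
```

In this example the top and left margins of `window` repeat the coding unit's first row and column, while the bottom and right margins hold the actual neighbouring samples of the picture.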

The next sub-process after the boundary padding is the derivation of the filter coefficients. The coefficients are defined by the Wiener filter, which generates them according to the historical contribution of the filtering process and is influenced by the image characteristics. However, the investigation of this step was not concluded in this work, and the filter coefficients were considered already known.

The last sub-process is the filter itself. Fig. 3 illustrates the filtering process, which multiplies each sample by its corresponding coefficient and sums the resulting products. Because the filter coefficients are symmetrically distributed, the number of multipliers can be reduced. Instead of multiplying each sample by its corresponding coefficient, it is possible to first add the samples that share a coefficient and then perform a single multiplication. This reduces the number of multipliers from 17 to 9. This calculation order is already used by the HM software. To illustrate the process with Fig. 3: the coefficient C6 is present at the top left and at the bottom right of the filter, so the corresponding samples, 'l' and 'm', are added first. The sum is then multiplied by the coefficient C6. After the multiplications are finished, the partial results are accumulated.



| l |   | n |   | p |
|---|---|---|---|---|
|   | f | h | j |   |
| d | b | a | c | e |
|   | k | i | g |   |
| q |   | o |   | m |

Fig. 3 ALF filter shape. (a) Coefficients (b) samples
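
The pre-addition of symmetric samples can be checked with a short software model. The Python sketch below is illustrative only: the sample names follow Fig. 3, but the pairing list, the coefficient ordering (chosen so that the pair 'l'/'m' uses index 6, as in the text) and the numeric values are assumptions, not data from the HM.

```python
# Point-symmetric sample pairs of the 5x5 shape in Fig. 3 (centre sample is 'a').
# Pair k (1-based) is assumed to share coefficient k; the centre uses coefficient 0.
PAIRS = [("b", "c"), ("d", "e"), ("f", "g"), ("h", "i"),
         ("j", "k"), ("l", "m"), ("n", "o"), ("p", "q")]


def filter_direct(samples, coeffs):
    """Reference form: one multiplication per sample (17 in total)."""
    acc = samples["a"] * coeffs[0]
    for k, (s1, s2) in enumerate(PAIRS, start=1):
        acc += samples[s1] * coeffs[k]
        acc += samples[s2] * coeffs[k]
    return acc


def filter_folded(samples, coeffs):
    """Folded form: add each symmetric pair first, then multiply once (9 in total)."""
    acc = samples["a"] * coeffs[0]
    for k, (s1, s2) in enumerate(PAIRS, start=1):
        acc += (samples[s1] + samples[s2]) * coeffs[k]
    return acc


# Arbitrary illustrative inputs (not real video data or real ALF coefficients).
samples = {name: 10 * i + 3 for i, name in enumerate("abcdefghijklmnopq")}
coeffs = [120, 20, -6, 9, 14, 9, -3, 5, -2]

assert filter_direct(samples, coeffs) == filter_folded(samples, coeffs)
```

Both functions return the same sum because multiplication distributes over addition; the folded form simply trades eight multiplications for eight additions, which are cheaper in hardware.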

The ALF core was implemented with eight pipeline stages. The additions between the samples that will be multiplied by the same coefficient are done in the first pipeline stage. The multiplications are done in the second pipeline stage. The adders that accumulate the products are located in the third to sixth stages. In the seventh stage, the constant 128 is added to the accumulated value, which rounds the result and prepares it for clipping in the next stage. Finally, the clipping operation is done in the eighth stage, keeping the result within the typical image sample bit width (8 bits). All other bits are discarded. This configuration with eight pipeline stages was the best in terms of operating frequency, because the critical path in any stage contains only one operation.
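
A behavioural model can make the eight stages concrete. The sketch below is written in Python rather than VHDL and is not the authors' implementation. It assumes that samples are ordered 'a' to 'q' with symmetric partners adjacent, that coefficients are fixed-point integers with 8 fractional bits (suggested by the rounding constant 128, but not stated in the text), and that the accumulation in stages three to six is a balanced adder tree.

```python
def alf_core(samples, coeffs):
    """Behavioural sketch of the eight-stage ALF core.

    samples -- 17 integers ordered a, b, c, ..., q (assumed ordering)
    coeffs  -- 9 integers, assumed fixed point with 8 fractional bits
    """
    centre, rest = samples[0], samples[1:]

    # Stage 1: add the two samples of each symmetric pair (8 adders).
    sums = [centre] + [rest[i] + rest[i + 1] for i in range(0, 16, 2)]

    # Stage 2: one multiplication per coefficient (9 multipliers).
    level = [s * c for s, c in zip(sums, coeffs)]

    # Stages 3 to 6: adder tree reducing 9 -> 5 -> 3 -> 2 -> 1 values (8 adders).
    for _ in range(4):
        reduced = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            reduced.append(level[-1])  # odd element passes through a register
        level = reduced
    accumulated = level[0]

    # Stage 7: add the rounding constant 128 (1 adder).
    rounded = accumulated + 128

    # Stage 8: drop the assumed 8 fractional bits and clip to the 8-bit range.
    return max(0, min(255, rounded >> 8))
```

With coefficients whose total gain is 256 (unity in this fixed-point format), a flat region is returned unchanged; for example, `alf_core([100] * 17, [128, 16, 16, 16, 16, 0, 0, 0, 0])` gives 100.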



Table 1 shows the resources needed by the proposed architecture: it has 8 pipeline stages and uses 17 adders and 9 multipliers. Furthermore, it performs the calculation using 17 samples and 9 coefficients, as explained before.

TABLE 1. ALF Architecture Components

| Characteristics    | Proposed Architecture |
|--------------------|-----------------------|
| Pipeline Stages    | 8                     |
| Adders             | 17                    |
| Multipliers        | 9                     |
| Clipping           | 1                     |
| Input Coefficients | 9                     |
| Input Samples      | 17                    |
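
The operator counts of Table 1 are consistent with a simple tally for a point-symmetric filter. The breakdown below (one pre-adder per symmetric pair, an accumulation of the products that needs one adder fewer than the number of products, and one rounding adder) is an interpretation of the pipeline description, not a figure reported by the synthesis tool; `operator_count` is a hypothetical helper.

```python
def operator_count(num_samples):
    """Return (multipliers, adders) for a point-symmetric filter with an odd
    number of samples, under the assumed breakdown described in the lead-in."""
    pairs = (num_samples - 1) // 2       # symmetric pairs around the centre
    multipliers = pairs + 1              # one per pair plus the centre sample
    pre_adders = pairs                   # stage 1: one adder per pair
    accumulation_adders = multipliers - 1  # summing all products
    rounding_adders = 1                  # adding the rounding constant
    return multipliers, pre_adders + accumulation_adders + rounding_adders


multipliers, adders = operator_count(17)
assert (multipliers, adders) == (9, 17)  # matches Table 1
```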

## IV. RESULTS

The architecture was described in VHDL and synthesized targeting an Altera Stratix II EP2S15F484C3 FPGA using the Quartus II tool, also from Altera. Table 2 presents the synthesis results for the developed ALF core architecture.

The designed architecture has a low hardware cost, considering that it is capable of processing QFHD (3840x2160) videos in real time. It uses less than 1% of the memory bits and only 2% of the ALUTs available on the FPGA.

TABLE 2. Synthesis results.

| Synthesis Results | ALF Core |
|-------------------|----------|
| ALUTs             | 371 (2%) |
| Registers         | 613      |
| Memory bits       | 54 (<1%) |
| DSP block 9-bit   | 9 (9%)   |
| Frequency (MHz)   | 326      |

As Table 3 shows, the designed architecture is able to process 353, 157, 79 and 39 frames per second at 720p (1280x720), 1080p (1920x1080), WQXGA (2560x1600) and QFHD (3840x2160) resolutions, respectively.

Because at least 30 frames per second are necessary to give the sensation of motion in digital videos, the developed architecture is capable of processing QFHD (3840x2160 pixels) videos in real time.

TABLE 3. ALF Architecture Processing Capability

| Video resolution  | Frames / second |
|-------------------|-----------------|
| 720p (1280x720)   | 353             |
| 1080p (1920x1080) | 157             |
| WQXGA (2560x1600) | 79              |
| QFHD (3840x2160)  | 39              |
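
The frame rates of Table 3 are consistent with the core delivering one filtered sample per clock cycle at 326 MHz. That throughput is an assumption inferred from the numbers, not a statement from the synthesis report; under it, the frame rate is the clock frequency divided by the number of samples per frame, rounded down. `frames_per_second` is a hypothetical helper.

```python
CLOCK_HZ = 326_000_000  # operating frequency reported in Table 2


def frames_per_second(width, height, clock_hz=CLOCK_HZ, samples_per_cycle=1):
    """Whole frames filtered per second, assuming a fixed number of
    samples leaves the pipeline on every clock cycle."""
    return (clock_hz * samples_per_cycle) // (width * height)


resolutions = {
    "720p": (1280, 720),
    "1080p": (1920, 1080),
    "WQXGA": (2560, 1600),
    "QFHD": (3840, 2160),
}
rates = {name: frames_per_second(w, h) for name, (w, h) in resolutions.items()}
assert rates == {"720p": 353, "1080p": 157, "WQXGA": 79, "QFHD": 39}
```

Under this assumption, QFHD is the largest of the four resolutions that still exceeds the 30 frames per second threshold.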

## V. RELATED WORKS

Many proposals for the ALF have been presented at the JCT-VC meetings. Some of them work at a high level, proposing other algorithms and adaptation methods. Other proposals focus on the filter core, proposing new filter shapes and/or sizes. The filter architectures proposed in this work can be easily adapted to these other shapes and sizes by including or removing parallel operations in the first two pipeline stages of the architecture and proportionally adapting the following stages.

Since HEVC is an emerging standard under development, there are few works in the literature focusing on hardware solutions for the HEVC tools. We did not find any work in the literature focusing on the hardware implementation of the HEVC ALF.

The work presented by Du and Yu [11] shows a hardware implementation of an ALF; however, its approach differs from the one used in our work, which makes a fair comparison difficult. The work of Du and Yu [11] is focused on the H.264/AVC standard, uses a different filter shape and proposes an architecture combined with the DF. The cited work reached an operating frequency of 211 MHz (when targeting a Xilinx Virtex 5 FPGA), achieving real-time processing for 1080p videos. Our work presents higher processing rates, reaching real-time processing not only for 1080p videos but also for the WQXGA resolution. However, this comparison is not entirely fair because the related work includes the Deblocking Filter in its architecture. The 9x9 filter proposed by Du and Yu [11] uses 41 multipliers and 81 adders, while the architecture developed in our work uses only 20 multipliers and 39 adders. However, the filter shapes of the two versions differ, which makes a direct comparison unfair.

## VI. CONCLUSIONS

This work presented an efficient hardware design for the two ALF core sizes: 5x5 and 9x9. The ALF is one of the filters of the In-Loop Filter proposed by the new HEVC standard. The architectures were described in VHDL and synthesized for an Altera Stratix II FPGA.

The synthesis results show that the designed architecture is able to process 157 frames per second at 1080p and 39 frames per second at QFHD, achieving enough performance for real-time processing at both resolutions. Compared with the only related work found in the literature, and although the comparison is not entirely fair because of the different focuses and approaches adopted in the two works, our work reached real-time processing for higher resolutions using fewer multipliers.

As future work, it is planned to continue the investigation of the Working Draft and of the HM in order to cover the ALF stages that were not concluded in this work, such as the coefficient derivation. It will then be possible to design a complete architecture for the HEVC ALF. An optimization of the designed architectures is also planned, intending to reduce the hardware consumption and to increase the processing rates. This optimization will focus on the last two pipeline stages, which will be merged into a more efficient solution. The use of fast adders in the critical paths of the designed architectures will also be considered.

Another future work is to implement a hardware solution for the ALF in the decoder module of the HEVC standard using a reconfigurable architecture.

## REFERENCES

- [1] B. Bross, W.-J. Han, G. J. Sullivan, J.-R. Ohm, and T. Wiegand, "High efficiency video coding (HEVC) text specification draft 8," ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-J1003, July 2012.
- [2] VCEG, "Adaptive (Wiener) Filter for Video Compression," VCEG-A114, VCEG Meeting, Berlin, July 2008.
- [3] VCEG, "Block-Based Adaptive Loop Filter," VCEG-AI18, VCEG Meeting, Berlin, 2008.
- [4] VCEG, "Specification and experimental results of Quadtree-based Adaptive Loop Filter," VCEG-AK22, VCEG Meeting, Yokohama, April 2009.
- [5] JCT-VC, "High Efficiency Video Coding (HEVC) text specification draft 4," JCTVC-F003, JCT-VC Meeting, Geneva, November 2011.
- [6] JCT-VC, "HM4: High Efficiency Video Coding (HEVC) Test Model 4 Encoder Description," JCTVC-F802, JCT-VC Meeting, Torino, March 2011.
- [7] JCT-VC, "High Efficiency Video Coding (HEVC) text specification draft 7," JCTVC-J0002, JCT-VC Meeting, Stockholm, July 2012.
- [8] JCT-VC, "High Efficiency Video Coding (HEVC) Test Model 7 Encoder Description," JCTVC-J0003, JCT-VC Meeting, Stockholm, July 2012.
- [9] G. Sullivan, J. Ohm, W.-J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Trans. on Circuits and Systems for Video Technology, December 2012.
- [10] JCT-VC, "Inclusion of Adaptive Loop Filtering (ALF) in a New Profile or Main," JCTVC-J0307, JCT-VC Meeting, Stockholm, 2012.
- [11] J. Du and L. Yu, "A Parallel and area-efficient architecture for deblocking filter and Adaptive Loop Filter," IEEE International Symposium on Circuits and Systems (ISCAS 2011).