Abstract—Deep Convolutional Neural Networks (DCNN) have proven to be very effective in many pattern recognition applications, such as image classification and speech recognition. Due to their computational complexity, DCNNs demand implementations that utilize custom hardware accelerators to meet performance and energy-efficiency constraints. In this paper we propose an FPGA-based accelerator architecture which leverages all sources of parallelism in DCNNs. We develop analytical feasibility and performance estimation models that take into account various design and platform parameters. We also present a design space exploration algorithm for obtaining the implementation with the highest performance on a given platform. Simulation results with a real-life DCNN demonstrate that our accelerator outperforms other competing approaches, which disregard some sources of parallelism in the application. Most notably, our accelerator runs faster than the state-of-the-art DCNN accelerator on the same FPGA device.

I. INTRODUCTION

Deep Convolutional Neural Networks (DCNN) have recently led to impressive progress in many challenging machine learning problems, such as machine vision, natural language processing and speech recognition.

The complexity of DCNNs presents their real-time performance as a major challenge that hinders their widespread deployment, particularly in resource-constrained embedded systems. In this paper, we address this topic via systematic architecture exploration. Further, we develop a custom accelerator for efficient implementation of DCNNs’ test phase. We restrict our discussion to the test phase, and thus, assume that a trained model with known weights is already available.

Prior work on acceleration of DCNN implementations includes several platform technologies. One approach involves utilization of commodity Graphics Processing Units (GPU) [7] whose software programmability renders them particularly well suited for research on DCNN models and acceleration of the training phase. The significant energy dissipation of commodity GPUs, however, prohibits their integration in energy constrained and mobile embedded systems. Another approach includes development of ASIC chips [3], which offer the well-known advantage of performance and energy efficiency at the disadvantage of significant fabrication cost and limited flexibility. The tradeoff appears to be unattractive at the moment, given the market size and the fluid and rapidly-evolving state of research in DCNN models. Another group of researchers focus on FPGA based accelerator design [2, 10]. Reasonable price, low power consumption and reconfigurability of FPGAs are characteristics that make them attractive for DCNN implementation.

Several groups have attempted to build DCNN accelerators by focusing on custom computation engines. This approach relies on the implicit assumption that DCNNs are computationally bounded [1, 2, 9]. At the other extreme, some prior work have viewed the issue as a memory bandwidth problem. As an example, Peeman et al. focus on maximization of the reuse of on-chip data [8]. Interestingly, DCNNs require both high memory bandwidth as well as high computation resources. Thus, an optimized accelerator needs to judiciously strike a balance between the two interdependent criteria [10]. Focus on one aspect while ignoring the other, is bound to result in a sub-optimal architecture.

In this paper, we present a systematic approach for the design and development of an FPGA-based DCNN accelerator. Specifically, we design an architecture template that is capable of exploiting all sources of parallelism in a DCNN. We develop a model that considers performance and data transfer for the proposed architecture. This allows us to estimate the architecture’s feasibility and performance with respect to a specific FPGA. The model enables us to prudently quantify the tradeoff of exploiting one source of parallelism vs. another. Subsequently, we develop a design space exploration algorithm, which yields the most efficient architecture that would be feasible on the target platform. Our main contribution includes advancing the state of the art in DCNN acceleration [10] via exploiting more sources of parallelism. Experimental results demonstrate that we substantially outperform prior work on DCNN accelerators.
II. CONVOLUTIONAL NEURAL NETWORK

A. Background and Overview

DCNNs form a subset of artificial neural networks in which, the transformation from the input to output feature maps is determined by a set of convolution kernels. Querying a DCNN in the test phase, which is the focus of this paper, requires using the forward path of the trained network.

Fig. 1 illustrates an example DCNN in which, the first five layers from the left are convolutional layers and the last three layers are fully connected layers. More than 90% of the execution time in DCNNs is spent in the convolutional layers [4]; therefore, in this study we only concentrate on accelerating the convolutional layers. In a DCNN there are $L$ convolutional layers $l_1, ..., l_L$ each of which, consists of a set of convolution kernels along with a pooling and sub sampling functions (in Fig. 1, $L = 5$). By sliding one stack of convolution kernels over an input layer and continuously computing convolutions, a single output feature map can be generated. This sliding is performed by different strides ($S$) in different layers. Each layer is associated with a stack of convolution kernels of size $K_l \times K_l$. The size of this stack is $N_l$, and in each layer, there are $M_l$ different stacks, each of them is used to build one output feature map. Variables $N_l$ and $M_l$ are the number of input and output feature maps in layer $l$ respectively. Equation (1) defines a 3D convolution which is used in DCNNs. Variables $R_l$ and $C_l$ are the number of rows and columns in the input feature maps of layer $l$ respectively. $IFM$, $OFM$ and $W$ stand for the Input Feature Maps, Output Feature Maps and weights of the convolution kernels.

As it is illustrated in Fig. 2, every output feature map is the result of the 3D convolution of a kernel with all of the input feature maps (i.e., the number of output feature maps are equal to the number of kernels).

B. Parallelism in DCNNs

In each DCNN, there are several sources of parallelism (Fig. 3). To achieve the best possible speedup, all of these sources should be recognized and exploited properly.

- **Inter Layer Parallelism**
  As $\forall \ l \in \{1, 2, ..., L\} : IFM_{l(t+1)} = OFM_l$, there is data dependency between layers; hence, different layers cannot be executed in parallel. On the other hand, since real life DCNNs are very large, it is infeasible to implement all layers in a pipelined fashion. Even for small CNNs, pipeline based implementations do not always provide the best performance [5].

- **Inter Output Parallelism**
  Different output feature maps are totally independent of each other, and theoretically all of them can be computed in parallel. To do so, Equation (1) should be calculated in parallel for different values of $m$.

- **Intra Kernel Parallelism**
  Each pixel in each output feature map is the result of a set of convolutions. However, as these convolutions are independent for different output pixels, it is possible to compute all of them concurrently. This is another source of data level parallelism that can be exploited. Therefore in order to exploit this source of parallelism, Equation (1) can be calculated for different values of $r$ and $c$ concurrently.

- **Intra Kernel Parallelism**
  Finally, there is a considerable amount of parallelism in each convolution. A convolution is essentially a set of multiplications and additions. As each multiplication between a weight in a kernel and a pixel in an input feature map is independent from another multiplication, all of them can be performed in parallel.

If there were unlimited area, bandwidth and on chip memory, all of the aforementioned sources of parallelism could be exploited to expedite the neural network as much as possible. However, in practice, this is infeasible. Therefore, the challenge is to find the optimal combination of different parallelism sources that both minimizes the execution time and satisfies the constraints of the target chip.

In this paper we find the optimal solution by introducing an architecture which can utilize all of the parallelism sources and optimizing design parameters for it.

III. PROPOSED ARCHITECTURE TEMPLATE

The proposed architecture is shown in Fig. 4. The first layer which is named $A$ consists of blocks of $T_k$ multipliers. The multipliers can be used concurrently to compute a portion of the required multiplications of a convolution. The results of these multiplications are accumulated using
∀ \( l \in \{1, 2, 3, ..., L\} \); \( l \) : layers
∀ \( r \in \{1, 2, 3, ..., R_l\} \); \( r \) : row in feature maps
∀ \( c \in \{1, 2, 3, ..., C_l\} \); \( c \) : column in feature maps
∀ \( m \in \{1, 2, 3, ..., M_l\} \); \( m \) : output feature maps in layer \( l \)

\[
OFM[l][m][r][c] = \sum_{n=1}^{N_l} \sum_{i=-\lfloor K_l/2 \rfloor}^{\lfloor K_l/2 \rfloor} \sum_{j=-\lfloor K_l/2 \rfloor}^{\lfloor K_l/2 \rfloor} IFM[l][n][r+i][c+j] \times W[l][n][m][i+\lfloor K_l/2 \rfloor][j+\lfloor K_l/2 \rfloor] \quad (1)
\]

**Fig. 3.** Available parallelism sources in DCNNs

The limited amount of on-chip memory mandates a tiled data transfer. We are using tiling in the kernel level as well as feature map level. For DCNNs that have large kernel sizes, tiling in the kernel level improves the performance drastically. This tiling which is extended to the kernel level provides us with the opportunity to search for the optimized architecture among a larger number of candidates (i.e., the design space is a superset of the one in [10]). In tiling, a tile with the dimension of \( T_r \times T_c \) is fetched for each input feature map in each iteration. Likewise, for each pixel of this tile, a tile of weights with the dimension of \( T_i \times T_j \) is fetched. For an optimal architecture, it is required to determine the suitable values for \( T_r, T_c, T_i \) and \( T_j \).

**IV. Analytical Model**

In this section, we develop an analytical model that shows the relation between different design parameters and attainable performance. This model is also used to indicate the required memory bandwidth for the design. Given the goal of minimal execution time, this model can be used to calculate how many times each module should be replicated.
The number of execution cycles is equal to the number of MAC (Multiplication and Accumulation) operations. This number can be computed using Equation (2) where \( M, N, R, C, K \) and \( P \) are number of output feature maps, number of input feature maps, number of rows, number of columns, size of the convolution kernels and the pipeline overhead respectively.

As each convolution includes one multiplication and one addition, the total number of required operations can be shown by Equation (3). Hence, the computation roof can be defined and calculated as shown in Equation (4).

Number of Execution Cycles = 
\[
\left\lceil \frac{M}{T_m} \right\rceil \times \left\lceil \frac{N}{T_n} \right\rceil \times \frac{RC}{T_i \cdot T_C} \times \left\lceil \frac{K}{T_i} \right\rceil \times \left\lceil \frac{K}{T_j} \right\rceil \times (T_i \cdot T_C \times \left\lceil \frac{T_i \cdot T_j}{T_k} \right\rceil + P)
\]

Num of Ops = \( 2 \times R \times C \times M \times N \times K \times K \) \hspace{1cm} (3)

Computation Roof = \( \frac{\text{Number of Operations}}{\text{Number of Execution Cycles}} \)
\[
= \frac{2 \times M \times N \times K^2}{\left\lceil \frac{M}{T_m} \right\rceil \times \left\lceil \frac{N}{T_n} \right\rceil \times \left\lceil \frac{K}{T_i} \right\rceil \times \left\lceil \frac{K}{T_j} \right\rceil \times \left\lceil \frac{T_i \cdot T_j}{T_k} \right\rceil}
\]

The optimal architecture minimizes the number of execution cycles, i.e., maximizes the computation roof. Notice that the nominator in Equation (4) only depends on the DCNN. Hence, for a particular neural network, the total number of operations are constant. A well designed accelerator is able to achieve a computation roof that fully utilizes all of the resources of a particular FPGA.

### B. On-Chip Buffers

The estimated computation roof can be achieved if with the selected parameters, the required data transmission is less than the maximum available bandwidth. Computations are performed on the input feature maps and weights and the results are stored as output feature maps. Hence, three different buffers are required to hold the necessary data. The required sizes for input feature maps, weights and output feature maps are shown in Equations (5), (6) and (7) respectively.

\[
\beta_{in} = T_n (ST_r + T_i - S) (ST_r + T_j - S) \times 4 \text{ Bytes}
\]

\[
\beta_{wght} = T_m \times T_n \times T_i \times T_j \times 4 \text{ Bytes}
\]

\[
\beta_{out} = T_m \times T_i \times T_c \times 4 \text{ Bytes}
\]

It is possible to prove that for the most efficient implementation of Equation (1), the number of loads and stores can be calculated using Equations (8), (9) and (10).

\[
\alpha_{in} = \frac{M}{T_m} \times \frac{N}{T_n} \times \frac{R}{T_r} \times \frac{C}{T_c} \times \frac{K}{T_i} \times \frac{K}{T_j}
\]

\[
\alpha_{wght} = \frac{M}{T_m} \times \frac{N}{T_n} \times \frac{R}{T_r} \times \frac{C}{T_c} \times \frac{K}{T_i} \times \frac{K}{T_j}
\]

\[
\alpha_{out} = 2 \times \frac{M}{T_m} \times \frac{R}{T_r} \times \frac{C}{T_c}
\]

Using these values and buffer sizes, the required Computation To Communication ratio (CTC) can be computed as shown in Equation (11). The CTC is a measure for reuse of data which is fetched to the on-chip memory.

\[
\text{CTC} = \frac{\text{Total required computation}}{\text{Total Required communication}} = \frac{\alpha_{in} \times \beta_{in} + \alpha_{wght} \times \beta_{wght} + \alpha_{out} \times \beta_{out}}{\alpha_{in} \times 2 \times M \times N \times R \times C} \times \frac{K^2}{T_k}
\]

### V. Experimental Results

To find the set of parameters that maximize the performance, we explore the design space by enumerating over all possible configurations and computing the Computation Roof. This enumeration must be performed under four constraints:

1. The sum of all required buffers in a design must be less than or equal to the available on chip memory.
2. The required bandwidth should be less than or equal to the available bandwidth on a particular platform.
3. Only a certain number of Computational Engines (CE) can be implemented on any chip. We adopt the number of CEs from [10] to enable a fair comparison. However, as we use Parallel Convolution Engine, the number of

<table>
<thead>
<tr>
<th>L</th>
<th>Layer Specific Solution</th>
<th>Layer Specific Solution (Fixed ( T_k ))</th>
<th>Static Solution</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>( T_m ) ( T_n ) ( T_k ) Cycles GFLOPS</td>
<td>( T_m ) ( T_n ) ( T_k ) Cycles GFLOPS</td>
<td>( T_m ) ( T_n ) ( T_k ) Cycles GFLOPS</td>
</tr>
<tr>
<td>1</td>
<td>16 3 10 117975 89</td>
<td>48 3 3 124025 85</td>
<td>16 3 9 127050 83</td>
</tr>
<tr>
<td>2</td>
<td>44 24 5 233280 96</td>
<td>10 16 3 255879 87</td>
<td>16 3 9 279936 80</td>
</tr>
<tr>
<td>3</td>
<td>15 32 1 79092 95</td>
<td>16 10 3 79092 95</td>
<td>16 3 9 87204 86</td>
</tr>
<tr>
<td>4</td>
<td>15 32 1 118638 95</td>
<td>32 5 3 118638 95</td>
<td>16 3 9 129792 86</td>
</tr>
<tr>
<td>5</td>
<td>10 48 1 79092 95</td>
<td>10 16 3 79092 95</td>
<td>16 3 9 86528 86</td>
</tr>
<tr>
<td>S</td>
<td>628077</td>
<td>656726</td>
<td>755642</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>L</th>
<th>Layer Specific Solution</th>
<th>Layer Specific Solution (Fixed ( T_k ))</th>
<th>Static Solution</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>( T_m ) ( T_n ) ( T_k ) Cycles GFLOPS</td>
<td>( T_m ) ( T_n ) ( T_k ) Cycles GFLOPS</td>
<td>( T_m ) ( T_n ) ( T_k ) Cycles GFLOPS</td>
</tr>
<tr>
<td>1</td>
<td>16 3 10 117975 89</td>
<td>48 3 3 124025 85</td>
<td>16 3 9 127050 83</td>
</tr>
<tr>
<td>2</td>
<td>44 24 5 233280 96</td>
<td>10 16 3 255879 87</td>
<td>16 3 9 279936 80</td>
</tr>
<tr>
<td>3</td>
<td>15 32 1 79092 95</td>
<td>16 10 3 79092 95</td>
<td>16 3 9 87204 86</td>
</tr>
<tr>
<td>4</td>
<td>15 32 1 118638 95</td>
<td>32 5 3 118638 95</td>
<td>16 3 9 129792 86</td>
</tr>
<tr>
<td>5</td>
<td>10 48 1 79092 95</td>
<td>10 16 3 79092 95</td>
<td>16 3 9 86528 86</td>
</tr>
<tr>
<td>S</td>
<td>628077</td>
<td>656726</td>
<td>755642</td>
</tr>
</tbody>
</table>
CEs should be decreased proportional to $T_k$ as shown in Equation (12).

$$T_m \times T_n \leq \frac{\# \text{CEs}}{T_k} \quad (12)$$

We find the optimal design parameters for AlexNet [7] under the constraints of Xilinx Virtex7 485t FPGA. The results are shown in Table I.

### A. Impact of Reconfigurability

We considered three different scenarios with different degrees of reconfigurability. Initially, the accelerator supports flexible $T_m$, $T_n$, and $T_k$. Subsequently, $T_k$ is fixed for every layer. Finally, $T_m$, $T_n$, and $T_k$ are fixed for all layers. The normalized performance of these three experiments is shown in Fig. 6. Performance is defined as $1/\text{(execution cycles)}$.

**Layer Specific Optimal Solution:** In this case $T_m$, $T_n$, and $T_k$ are fully flexible and the optimal solution is shown in Table I. This configuration delivers the best performance but demands a very complex interconnection network.

The design space for layer 1 of the DCNN is shown in Fig. 5. For any point in the design space, the slope of the line from the origin to that point gives the required bandwidth. The target of this exploration is to find the design with the highest GFLOPS. Among points with highest GFLOPS, point A is the best candidate because it has the best CTC.

**Layer Specific Optimal Solution (Fixed $T_k$):** In the subsequent experiment, ($T_k$) is fixed across different layers. As it is shown in Table I, due to the reduced flexibility, the performance drops by 4.5%. However, the architecture no longer needs to support dynamic sizes for PCE across different layers.

**Static Optimal Solution:** In this experiment $T_m$, $T_n$ and $T_k$ are fixed and the optimal solution is shown in Table I. For a completely static accelerator the number of execution cycles increases by 13% compared to the dynamically reconfigurable accelerator. However, for the comparison between dynamic and static reconfigurability in [2, 10] and this paper, the required area for the interconnection network is not taken into account. Hence, the speedup of 13% is only a very loose upper bound. This indicates that the achievable speedup of dynamic reconfigurability can never be better than 13%.

### B. Performance Comparison

**Performance Comparison Versus CPU:** Based on the execution time which is reported in [10] the proposed accelerator in this paper has a speedup of 23.24X and 6.4X compared to the single and 16 threaded CPU based implementation.

**Performance Comparison Versus Other Accelerators:** The performance of different accelerators are reported for different DCNNs on different FPGAs. Hence, before comparing the performances, it is required to normalize the data as described in [10]. The result are shown in Table II. Compared to the previous approaches, the proposed accelerator has the highest performance density. Our approach is 37% faster compared to the state of the art solution (ISFPGA2015 [10]). In order to determine which accelerator performs better when more resources are available, the design space is explored for an FPGA with 2 times more area and bandwidth than Virtex7 485t and the results are shown in Table III. For this FPGA, our accelerator can achieve a speedup of 1.9X compared to the offered approach in [10]. This shows that the proposed solution can utilize the resources better for more sophisticated chips which will be offered in the future. The normalized performance (divided by [10]) for the proposed solution and [10] are compared in Fig. 7.
TABLE II

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>Precision</td>
<td>fixed point</td>
<td>48bits fixed</td>
<td>fixed point</td>
<td>48bits fixed</td>
<td>32bits float</td>
<td>32bits float</td>
</tr>
<tr>
<td>Frequency</td>
<td>150 MHz</td>
<td>125 MHz</td>
<td>125 MHz</td>
<td>200 MHz</td>
<td>100 MHz</td>
<td>100 MHz</td>
</tr>
<tr>
<td>FPGA Chip</td>
<td>VLX240T</td>
<td>SX35</td>
<td>SX240T</td>
<td>SX240T</td>
<td>VX485T</td>
<td>VX485T</td>
</tr>
<tr>
<td>CNN Size</td>
<td>2.74 GMAC</td>
<td>0.26 GMAC</td>
<td>0.53 GMAC</td>
<td>0.26 GMAC</td>
<td>1.33 GFLOP</td>
<td>1.33 GFLOP</td>
</tr>
<tr>
<td>Performance</td>
<td>17 GOPs</td>
<td>5.25 GOPs</td>
<td>7.0 GOPs</td>
<td>16 GOPs</td>
<td>61.62 GFLOPs</td>
<td>84.2 GFLOPs</td>
</tr>
<tr>
<td>GOPs/Slice</td>
<td>4.5E-04</td>
<td>3.42E-04</td>
<td>1.9E-04</td>
<td>4.3E-04</td>
<td>8.12E-04</td>
<td>11.09E-04</td>
</tr>
</tbody>
</table>

TABLE III

Static Optimal Solution on an FPGA with 2X larger area and bandwidth than the baseline FPGA (Virtex7 485 T).

<table>
<thead>
<tr>
<th>ISFPGA2015 [10]</th>
<th>Proposed Accelerator</th>
<th>Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>$T_m$</td>
<td>$T_n$</td>
</tr>
<tr>
<td>Layer 1</td>
<td>48</td>
<td>3</td>
</tr>
<tr>
<td>Layer 2</td>
<td>64</td>
<td>12</td>
</tr>
<tr>
<td>Layer 3</td>
<td>64</td>
<td>15</td>
</tr>
<tr>
<td>Layer 4</td>
<td>64</td>
<td>15</td>
</tr>
<tr>
<td>Layer 5</td>
<td>64</td>
<td>15</td>
</tr>
<tr>
<td>Total</td>
<td>651757</td>
<td>344027</td>
</tr>
</tbody>
</table>

VI. Conclusion

In this paper, a new accelerator for DCNNs is proposed. This accelerator can effectively leverage all of the available sources of parallelism to minimize the execution time. Moreover, we proposed an improved tiling technique that increases the performance by utilizing the tiling in the convolution kernel level. We also developed an analytical model for the proposed architecture and used that model to determine optimized design parameters for a real-life neural network and a particular FPGA. Experimental results show that the proposed solution outperforms all previous work. Specifically, our accelerator has a speedup of 1.9X compared to the state-of-the-art DCNN accelerator.

REFERENCES