CAP - Laboratory for Embedded and Programmable Systems

Customizable Array Of Processors

The continuation of Moore’s law has enabled the integration of many processing elements on a single chip. Chip Multi-Processor (CMP) platforms exhibit substantial performance and energy improvements over conventional uniprocessors while retaining the flexibility of general-purpose computing systems. Recent demonstrations of several CMP architectures have reported very promising results for the execution of intensive streaming applications on programmable architectures.

Nevertheless, the lack of methodologies for efficiently evaluating and exploring architecture design choices, and of a productive application synthesis framework, has impeded the proliferation of such hardware architectures in the embedded systems domain, where performance and energy consumption are the primary design concerns. To address this problem, we are developing an FPGA-based CMP prototyping and evaluation methodology, along with an application synthesis and compilation tool set, to explore the design space and develop application-specific CMP architectures.

We aim to develop a customizable array of processors (CAP) in which the processors are soft, in that they can be customized to fit the assigned computation workload. Including or excluding functional units and selecting appropriate architecture parameters, such as cache sizes and address word width, are only a few examples of customization knobs. Moreover, the interconnect architecture can be tailored to better serve a selected group of applications.

Task Assignment

Task assignment is the process of partitioning the application task graph to divide tasks among processing elements. To achieve maximum throughput, we must take implementation details into account. Such details include the computation power of the processing elements, the on-chip communication architecture, and the software generation methodology. The following examples show how throughput can change based on such details. As a result, different implementations call for optimizing different cost functions. To address this need, we have developed a graph bipartitioning algorithm that maximizes throughput by optimally minimizing any realistic hardware-inspired cost function. For details of the task assignment algorithm, please refer to:

Matin Hashemi, Soheil Ghiasi, “Exact and Approximate Task Assignment Algorithms for Pipelined Software Synthesis”, IEEE/ACM Design, Automation and Test in Europe (DATE), March 2008.
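
To make the role of the cost function concrete, the sketch below exhaustively searches every two-way assignment of a small task graph and keeps the one with the lowest cost. It is not the algorithm from the paper cited above, which this page does not describe. The brute-force search, the bottleneck_cost function, the task names, and all workload and communication numbers are assumptions chosen purely for illustration.

```python
from itertools import product


def bottleneck_cost(assignment, work, edges, comm_cost_per_word=1.0):
    """Hypothetical cost: the busier of the two processors sets the pipeline period.

    assignment: task name -> processor index (0 or 1)
    work:       task name -> computation time of that task
    edges:      (src, dst) -> number of words sent from src to dst
    """
    load = [0.0, 0.0]
    for task, proc in assignment.items():
        load[proc] += work[task]
    # Assumption: an edge that crosses the cut costs time on both processors.
    for (src, dst), words in edges.items():
        if assignment[src] != assignment[dst]:
            load[assignment[src]] += comm_cost_per_word * words
            load[assignment[dst]] += comm_cost_per_word * words
    return max(load)


def best_bipartition(work, edges, cost=bottleneck_cost):
    """Try all 2^n assignments; only practical for very small graphs."""
    tasks = sorted(work)
    best = None
    for bits in product((0, 1), repeat=len(tasks)):
        assignment = dict(zip(tasks, bits))
        c = cost(assignment, work, edges)
        if best is None or c < best[0]:
            best = (c, assignment)
    return best


# Illustrative four-task chain a -> b -> c -> d (all numbers are made up).
work = {"a": 4, "b": 3, "c": 5, "d": 2}
edges = {("a", "b"): 1, ("b", "c"): 2, ("c", "d"): 1}

cheap_cost, cheap_assignment = best_bipartition(work, edges)


def expensive_link(assignment, work, edges):
    # Same model, but each communicated word is ten times more expensive.
    return bottleneck_cost(assignment, work, edges, comm_cost_per_word=10.0)


costly_cost, costly_assignment = best_bipartition(work, edges, cost=expensive_link)
```

With cheap communication the search splits the chain between tasks b and c, while with the expensive link it keeps every task on one processor. This is the kind of implementation-dependent change in the best assignment that motivates supporting different cost functions.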

Area Estimation

The main motivation for CAP area estimation is to assist in the selection of an optimal or near-optimal candidate architecture for a given application. Hundreds of potential architectures may be performance-profiled in the selection process, yet some of them may not fit within the area bounds of the target FPGA. These bounds may include look-up-table logic utilization, on-chip memory, or dedicated units such as hard multipliers. Although this information is reported during the hardware synthesis process, a designer may wait an hour or more per candidate architecture to obtain it. We therefore need an alternative means of estimating area utilization with an acceptable level of accuracy.

The CAP team performs area estimation by means of a software profiling tool. This tool accepts a hardware specification file as input, parses the file for the customizable hardware elements, and provides a variety of utilization measurements. The same hardware specification file also serves as the final input for architecture generation. Utilization measurements are made via hardware profiles. Note that although vague area details may be available, not all possible combinations of customizations are described in the specification manuals, nor can they all be practically measured in the lab. Furthermore, RTL or gate-level descriptions of the soft core are not publicly available. Therefore, the individual area contribution of each customizable element is measured, and these contributions serve as estimates for larger combinations of multiple levels of customization. The individual contributions led to educated guesses about the inner workings of some hardware structures, but a variety of customization elements may still have to be treated as black boxes. Note that continual hardware updates force most of the area estimation software to depend on the FPGA family and the soft core processor version.
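
The additive idea described above can be sketched as follows: each customizable element has a separately measured area contribution, and the estimate for a full architecture is the sum of the contributions of the elements it contains. This is a minimal sketch and not the CAP tool itself. The specification layout, the element names, and every LUT count below are hypothetical placeholders rather than measured values.

```python
# Hypothetical per-element LUT contributions, as if each had been measured
# in isolation on one FPGA family and one soft core version.
PROFILE_LUTS = {
    "base_core": 1000,
    "barrel_shifter": 200,
    "hw_multiplier": 50,
    "fpu": 1500,
    "timer": 150,
}
LUTS_PER_LINK = 40  # hypothetical cost of one point-to-point link


def estimate_luts(spec):
    """Sum the individual contributions of every element named in the spec."""
    total = 0
    for proc in spec["processors"]:
        total += PROFILE_LUTS["base_core"]
        total += sum(PROFILE_LUTS[option] for option in proc["options"])
    total += LUTS_PER_LINK * len(spec["links"])
    total += sum(PROFILE_LUTS[p] for p in spec.get("peripherals", []))
    return total


def fits(spec, lut_budget):
    """Reject a candidate before synthesis if its estimate exceeds the budget."""
    return estimate_luts(spec) <= lut_budget


# Illustrative three-processor chain: P0 -> P1 -> P2.
chain_spec = {
    "processors": [
        {"name": "P0", "options": ["barrel_shifter"]},
        {"name": "P1", "options": []},
        {"name": "P2", "options": ["barrel_shifter", "fpu"]},
    ],
    "links": [("P0", "P1"), ("P1", "P2")],
    "peripherals": ["timer"],
}

estimated = estimate_luts(chain_spec)
```

A purely additive model assumes the elements do not interact during synthesis. Where that assumption fails, a profile would need entries for the interacting combination rather than for each element alone.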

To verify the accuracy of the area estimation tool, we measured the area of a variety of architectures and levels of customization and compared the measurements with the tool’s estimates. Architectural variation includes using different interconnect architectures such as a chain, star, or mesh. Customization variation includes selecting different knobs such as a barrel shifter, integer and floating point hard units, communication buffer width, or the timer peripheral. We constructed our arrays using Xilinx MicroBlaze via EDK 9.1i and the XUP Virtex-II Pro Development Board. For every architecture we tested, the accuracy of the logic area utilization estimate (in terms of look-up tables or slices) was at least 98%. This far exceeded our expectations, and we expect the area estimation tool to prove invaluable to the performance estimation process. The table below is one example of area estimation results for a 6-processor chain architecture, in which accuracy is no lower than 99%. Please consult the Master’s Thesis or software below for more hardware profile details.

6 Processor Chain Accuracy Across Levels of Customization
Configuration	Pre-map Slice % Error	Pre-map 4LUT % Error	Post-map Slice % Error	Post-map 4LUT % Error
Base Case	0 %	0 %	0.618 %	0.045 %
w/ Timer Added	0 %	0 %	0.661 %	0.033 %
w/ Random FSL Depths	0.419 %	0.709 %	0.509 %	0.557 %
w/ Random Proc Params	0.160 %	0.785 %	0.613 %	0.814 %
w/ All 6 Procs on OPB	0.466 %	0.635 %	0.026 %	0.938 %
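
The short sketch below shows one way to read the figures in the table above. It assumes that percent error means the absolute difference between the estimated and measured area relative to the measured area, and that accuracy is 100% minus that error; neither definition is stated on this page. The error values are copied from the table, and the function and variable names are illustrative.

```python
def percent_error(estimated, measured):
    """Assumed definition: relative absolute error, in percent."""
    return abs(estimated - measured) / measured * 100.0


# Percent errors from the 6-processor chain table, in column order:
# (pre-map slice, pre-map 4LUT, post-map slice, post-map 4LUT)
CHAIN_ERRORS = {
    "Base Case": (0.0, 0.0, 0.618, 0.045),
    "w/ Timer Added": (0.0, 0.0, 0.661, 0.033),
    "w/ Random FSL Depths": (0.419, 0.709, 0.509, 0.557),
    "w/ Random Proc Params": (0.160, 0.785, 0.613, 0.814),
    "w/ All 6 Procs on OPB": (0.466, 0.635, 0.026, 0.938),
}

# The largest error in the table bounds the accuracy from below.
worst_error = max(max(row) for row in CHAIN_ERRORS.values())
min_accuracy = 100.0 - worst_error  # assumed: accuracy = 100% - error

print(f"worst error: {worst_error:.3f} %, minimum accuracy: {min_accuracy:.3f} %")
```

Under these assumed definitions, the largest error in the table is below 1%, which is consistent with the stated accuracy of no lower than 99% for this architecture.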