Programming Model and Platforms

Map-Reduce on FPGA

Machine learning and data mining are gaining increasing attentions of the computing society. FPGA provides a highly parallel, low power, and flexible hardware platform for this domain, while the difficulty of programming FPGA greatly limits its prevalence. MapReduce is a parallel programming framework that could easily utilize inherent parallelism in algorithms. In this paper, we describe FPMR, a MapReduce framework on FPGA, which provides programming abstraction, hardware architecture, and basic building blocks to developers.

Introduction to MapReduce

$$\mathrm{map} ::= \,\, \lt\!\! key, value\!\!\gt \,\, \to \, intermediate \lt\!\! key, value\!\!\gt \\ \mathrm{reduce} ::= \,\, intermediate \lt\!\! key, value\!\!\gt \,\, \to \, result $$

The input data to a computing task is split into many <key,value> pairs and a map function processes these pairs to generate a set of intermediate <key,value> pairs. The intermediate pairs with the same intermediate key are grouped together and passed to reduce function. The communication model within MapReduce is transparent to users so as to alleviate the development efforts. Users only need to design the map and reduce function. Then the MapReduce runtime framework takes care of the parallel execution by issuing multiple map and reduce tasks to computation nodes.

MapReduce Data Flow

FPMR: FPGA MapReduce

The initial <key,value> pairs are prepared by CPU and then transferred to the FPGA through PCI-E bus or CPU bus, e.g. HyperTransport or FSB.

FPMR Framework

The mappers process the initial input <key,value> pairs and generate the intermediate <key,value> pairs. The reducers then merge the intermediate pairs to obtain the final results. In some applications, the outputs of reducers need to be further processed to get a single result, in which case a merger will be implemented. The processor scheduler generates control signals to schedule mappers and reducers. The data controller takes charge of communicating with the host CPU, dispatching data to the mappers, and receiving data from the reducers.

Case Study: RankBoost

RankBoost is a Boosting algorithm targeting for rankings. Giving an exact and complete ranking for large scale objects is difficult. RankBoost is a promising algorithm for this problem by combining many “weak” hypothesises which are partly or nearly right. The result ranking function will be highly accurate by many rounds of training on large scale dataset.

map (int key, pair value):
// key: feature index $fi$
// value: document $bin_{fi}$, document $\pi$
foreach document $d$ in value:
   $hist(bin_{fi}(d)) = hist(bin_{fi}(d)) + \pi(d)$
EmitIntermediate($fi$, $hist_{fi}$)
reduce (int key, array value):
// key: feature index $fi$
// value: histograms $bin_{fi}$, $fi = 1 \ldots N_f $
foreach histogram $hist_{fi}$:
   for $i \gets N_{bin} \ldots 0$:
       $integral_{fi}(i) = hist_{fi}(i) + integral_{fi}(i+1)$
EmitIntermediate($fi$, $integral_{fi}$)

FPGA Virtualization

Cloud computing is becoming a major trend for delivering and accessing infrastructure on demand via the network. Meanwhile, the usage of FPGAs (Field Programmable Gate Arrays) for computation acceleration has made significant inroads into multiple application domains due to their ability to achieve high throughput and predictable latency, while providing programmability, low power consumption and time-to-value. Many types of workloads, e.g. databases, big data analytics, and high performance computing, can be and have been accelerated by FPGAs. As more and more workloads are being deployed in the cloud, it is appropriate to consider how to make FPGAs and their capabilities available in the cloud. However, such integration is nontrivial due to issues related to FPGA resource abstraction and sharing, compatibility with applications and accelerator logics, and security, among others. We proposed a general framework for integrating FPGAs into the cloud and a prototype of the framework is implemented based on OpenStack, Linux-KVM and Xilinx FPGAs. The prototype enables isolation between multiple processes in multiple VMs, precise quantitative acceleration resource allocation, and priority-based workload scheduling. Experimental results demonstrate the effectiveness of this prototype, an acceptable overhead, and good scalability when hosting multiple VMs and processes.

FPGA Framework in a Cloud


We propose an accelerator pool (AP) abstraction as a trade-off between current FPGA limitations and cloud principles. In the AP abstraction, each FPGA chip has several pre-defined accelerator slots, e.g. slots A, B, C and D shown in Figure 2. By using the dynamic partial reconfiguration mechanism of modern FPGAs, each slot can be considered as a virtual FPGA chip with standardized resource types, capacity and interfaces. Therefore, each slot can only host an accelerator with compatible resource requirements and interface design. Using AP, FPGA chips become a pool of accelerators with various functions and performance. Instead of requesting programmable resources in PRP, a tenant directly requests various combination of accelerator functions and performance. A cloud provides a list of pre-defined accelerators, handles tenant requests and configures accelerators into idle slots. If no accelerator matches the requirements, a tenant can submit his own designs and the cloud owner performs the compilation and adds the tenant design into the accelerator list.

Design in FPGAs


We proposed a benefit-based metric instead of conventional throughput-based metric, we also presented two scheduling algorithm based on our FPGA accelerated Cloud system. By applying our benefit-based scheduling metric to a real Openstack-based cloud environment, 60.3% computing resources are economized compared to conventional throughput-based metric. Then in view of the characteristics of cloud, a revised algorithm considering the replacement of tasks is proposed, which makes our FPGA accelerated cloud system 1.386 times faster than using the previous algorithm. Finally, we modify our metric involving utilization and interference, achieving 86.75% performance improvement.

we propose a metric that describes the computing capacity of FPGA equivalent to a certain number of standard virtual CPUs (vCPUs). So, if running a task could achieve n times speedup on FPGA compared with running it on a standard vCPU, then the computing capacity of the FPGA area occupied by this task is equal to n vCPUs. We could use an equation as the summation of all the tasks speedup compared with a vCPU to describe the profit of the whole FPGA chip. We name the metric, benefit.

$$ Benifit_{FPGA} \mathop{=}^{\text{def}} \#vCPU = \sum_{ \forall t_i \in task} Speedup_{t_i} $$

Considering interference and utilization, our metric war modified.

$$ Benifit_{FPGA} \mathop{=}^{\text{def}} \#vCPU = \sum_{ \forall t_i \in task} \frac{ {Speedup\_utili}_{t_i}}{Interfere_{t_i}} \\ = \sum_{ \forall t_i \in task} \frac{ {t\_execution}_{t_i} \cdot {Speedup\_ideal}_{t_i}}{ {t\_interval}_{t_i} \cdot Interfere_{t_i}} $$

Based on this metric, two scheduling algorithm was proposed. Algorithm 2 considered the online scheduling with replacement.



  • Jincheng Yu, Zhilin Xu, Shulin Zeng, Chao Yu, Jiantao Qiu, Chaoyang Shen, Yuanfan Xu, Guohao Dai, Yu Wang, and Huazhong Yang, INCAME: Interruptible CNN accelerator for multi-robot exploration , in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2021. pdf
  • Shulin Zeng, Guohao Dai, Hanbo Sun, Kai Zhong, Guangjun Ge, Kaiyuan Guo, Yu Wang, Huazhong Yang, Enabling Efficient and Flexible FPGA Virtualization for Deep Learning in the Cloud , in, 2020. pdf
  • Guyue Huang, Guohao Dai, Yu Wang and Huazhong Yang, GE-SpMM: General-purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks , in, 2020. pdf
  • Shulin Zeng, Guohao Dai, Hanbo Sun, Kai Zhong, Guangjun Ge, Kaiyuan Guo, Yu Wang, Huazhong Yang, Enabling Efficient and Flexible FPGA Virtualization for Deep Learning in the Cloud , to appear in International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2020. pdf
  • Shaoxia Fang, Shuling Zeng and Yu Wang, Optimizing CNN Accelerator with Improved Roofline Model , to appear in IEEE System-On-Chip Conference, 2020. pdf
  • Ziqian Wan, Guohao Dai, Yun Joon Soh, Jishen Zhao, Yu Wang, An Order Sampling Processing-in-Memory Architecture for Approximate Graph Pattern Mining , 2020. pdf slide
  • Xiaoming Chen, Yinhe Han, Yu Wang, Communication Lower Bound in Convolution Accelerators , in IEEE International Symposium on High Performance Computer Architecture, 2020. pdf
  • Guyue Huang, Guohao Dai, Yu Wang and Huazhong Yang, GE-SpMM: General-purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks , in The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2020. pdf
  • Guyue Huang, Guohao Dai, Yu Wang and Huazhong Yang, Towards Fast Graph Neural Network Training with Efficient and Framework-Compatible Sparse-Dense Matrix Multiplication , in MICRO-53 Student Research Competition (SRC), 2020. pdf
  • Xiaoming Chen, Yinhe Han, Yu Wang, Communication Lower Bound in Convolution Accelerators , in, 2019.
  • Kaiyuan Guo, Shulin Zeng, Jincheng Yu, Yu Wang and Huazhong Yang, A Survey of FPGA-Based Neural Network Inference Accelerator , in ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol.12, No.1, 2019. pdf
  • Tianhao Huang, Guohao Dai, Yu Wang and Huazhong Yang, HyVE: Hybrid Vertex-Edge Memory Hierarchy for Energy-Efficient Graph Processing , in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pp.973-978. pdf
  • Guohao Dai, Tianhao Huang, Yu Wang, Huazhong Yang, John Wawrzynek, NewGraph: Balanced Large-scale Graph Processing on FPGAs with Low Preprocessing Overheads , in International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2018, pp.208-208. pdf
  • Gushu Li, Guohao Dai, Shuangchen Li, Yu Wang, Yuan Xie, GraphIA: An In-situ Accelerator for Large-scale Graph Processing , in International Symposium on Memory Systems (MEMSYS), 2018, pp.79-84. pdf
  • Yuliang Sun, Lanjun Wang, Chen Wang, Yu Wang, Exploiting Stable Data Dependency in Stream Processing Acceleration on FPGAs , in ACM Transactions on Embedded Computing Systems (TECS), 2017. pdf
  • Guohao Dai, Tianhao Huang, Yuze Chi, Ningyi Xu, Yu Wang, Huazhong Yang, ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture , in ACM International Symposium on FPGA (FPGA), 2017, pp.217-226. pdf slide
  • Guohao Dai, Yuze Chi, Yu Wang, Huazhong Yang, FPGP: Graph Processing Framework on FPGA , in ACM International Symposium on FPGA (FPGA), 2016, pp.105-110. pdf slide
  • Yuze Chi, Guohao Dai, Yu Wang, Guangyu Sun, Guoliang Li, Huazhong Yang, NXgraph: An Efficient Graph Processing System on a Single Machine , in IEEE International Conference on Data Engineering (ICDE), 2016, pp.409-420. pdf slide
  • Gushu Li, Xiaoming Chen, Guangyu Sun, Henry Hoffmann, Yongpan Liu, Yu Wang, Huazhong Yang, A STT-RAM-based Low-Power Hybrid Register File for GPGPUs , in 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), 2015, pp.103:1-103:6. pdf
  • Xinyu Niu, Wayne Luk, Yu Wang, EURECA: On-Chip Configuration Generation for Effective Dynamic Data Access , in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2015, pp.74-83. pdf
  • Xiaolong Xie, Yun Liang, Yu Wang, Guangyu Sun, Tao Wang, Coordinated Static and Dynamic Cache Bypassing for GPUs , in Proceedings of the IEEE 21st International Symposium on High Performance Computer Architecture (HPCA) , 2015, pp.76-88. pdf
  • Fei Chen,Yi Shan,Yu Zhang,Yu Wang,Hubertus Franke,Xiaotao Chang,Kun Wang, Enabling FPGAs in the Cloud , in Proceedings of the 11th ACM Conference on Computing Frontiers, 2014, pp.3:1-3:10. pdf
  • Guohao Dai, Yi Shan, Fei Chen, Yu Zhang, Yu Wang, Kun Wang and Huazhong Yang, Online Scheduling for FPGA Computation in the Cloud , in International Conference on Field-Programmable Technology (FPT), 2014, pp.330-333. pdf
  • Xinyu Niu, José Gabriel F. Coutinho, Yu Wang and Wayne Luk, Dynamic Stencil: Effective Exploitation of Run-time Resources in Reconfigurable Clusters , in Proceedings of the International Conference on Field-Programmable Technology (FPT), 2013, pp.214-221. pdf
  • Tianji Wu, Di Wu, Yu Wang, Xiaorui Zhang, Hong Luo, Ningyi Xu, Huazhong Yang, Gemma in April: A matrix-like parallel programming architecture on OpenCL , in Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), 2011, pp.703-708. pdf
  • Yi Shan, Bo Wang, Jing Yan, Yu Wang, Ningyi Xu, Huazhong Yang, FPMR: MapReduce framework on FPGA , in Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 2010, pp.93-102. pdf

copyright 2021 © NICS Lab of Tsinghua University