Goals 
- Patrice the emulation development environment
- Understand Task (Single Work-Item) v.s. NDRange
- Understand Loop Pipeline
- * Understand Coalescing
Prerequisite
a.      PC requirement:
- 64Bit CPU / >4G DDR
- Win7 or Ubuntu 16.04 or other version
b.      About install PC environment
- Obtain Licenses for Quartus , Altera SDK for OpenCL and PR (Partial Reconfiguration), set LM_LICENSE_FILE environment variable to the license
- Install the Quartus II software v. 16.0 standard edition (including Arria10 , Stratix V and Cyclone V device families)
- Please install at c:\altera\16.0 (for run script later, otherwise you need modify the .bat in the test project)
- Download and install the Altera SDK for OpenCL version 16.0
- Compiler Environment
- Windows : Install the Visual C++ 2010 Express version of Microsoft Visual Studio with support for C++ fromhttp://www.microsoft.com/visualstudio/eng/downloads.
- Professional version is OK if you have a license for that.
- Linux :
- Use “apt-get install build-essential” to prepare build environment
- If you do NOT have Visual C++ Professional
- Open the Programs & Features control panel in Windows 7. Manually uninstall all Microsoft Visual C++ 2010 programs that have the word “Redistributable” in them, both x64 and x86 versions
- Download and install Microsoft Windows SDK Version 7.1 from http://www.microsoft.com/en-us/download/details.aspx?id=8279 . This software allows the compilation of 64-bit executables that are needed for the Altera OpenCL solution
- Also, you do NOT need Intel SDK for OpenCL anymore
- Before start the Lab day. Please make sure can understand PSG OpenCL emulator mode
Introduction
   OpenCL allows developers to write portable, high-performance code that can target all varieties of parallel processing platforms, including Intel CPUs, FPGAs and GPUs. Programming OpenCL on FPGA is much different in Programming OpenCL on GPUs. By leveraging loop-pipeline features, Sequential execution is accepted in many cases to get simple performance improvement.  
  This article discusses simple reductions. A reduction is a very simple operation that takes an array of data and reduces it down to a single element, for example – by summing all the elements in the array. Consider this simple CPU-C code, which sums all the elements in an array:
float reduce_cpu_sum(float* input, int length) {
float accumulator = input[0];
for(int i = 1; i < length; i++)
accumulator += input[i];
return accumulator;
}
float accumulator = input[0];
for(int i = 1; i < length; i++)
accumulator += input[i];
return accumulator;
}
This code is completely sequential! There’s no way to parallelize the loop, since every iteration of the loop depends on the iteration before it. How can we loop-pipeline it? How can we parallelize this code? 
 Hope all attendee can understand NDRange and Task implementation through this hands-on.
Lab1 - Task and Loop-pipelined Version
CPU version
float reduce_cpu_sum(float* input, int length) {
float accumulator = input[0];
for(int i = 1; i < length; i++)
accumulator += input[i];
return accumulator;
}
float accumulator = input[0];
for(int i = 1; i < length; i++)
accumulator += input[i];
return accumulator;
}
  Idea - in the above code, length is undetermined variable so compiler can’t unroll the for-loop correctly. We can divide DATA into const-M parts. And wrap the original for-loop by another for-loop with const-M. Each iteration handle length/const-M data, accumulate them and save it into local memory. After that, we can accumulate the local memory again.
Since compiler knows the const-M in compiler times, Compiler can unroll this wrapper for-loop automatically (or manually).
 Advanced Topics - Running on the real device.
To speed up the simulation time, we let DATA_N be a small number 1000. Please increase this number to 1000000 and test the total performance.
Hands-on - implement above algorithm 
Advanced Topics - 
- Running on the real device.
- Compare the performance with Lab1.
Lab2 - Loop-pipelined Version and Memory access patterns
  In the previous code. To speed-up the performance, we unroll the level-1 for-loop in previous code. Since we have 0….M accumulator and global memory read instructions. The memory access pattern is shown below :
 The access pattern is interleave and not continuous. It will cause cache penalty.  
 To improve cache performance and hit rate. We can change the access pattern from above to below - 
Hands-on - implement above algorithm 
Advanced Topics - 
- Running on the real device.
- Compare the performance with Lab1.
Lab3 - NDrange Version (Optional)
 At the OpenCL work-group level. We’ll take advantage of associativity to break the vector into small chunks, each of which we’ll build independent reduction trees for each chunk, and execute them independently, in parallel. We’ll make sure each of the chunks is small enough that it fits in local memory, and then we’ll assign one work-item per element.
 At each stage of the reduction tree, we’ll be loading and storing partial reductions as we compute, so it’s crucial to use local memory to communicate between work-items in the work group. We’ll then execute the reduction tree by using a for loop in conjunction with OpenCL barriers. For example, see the following figure, which performs a min reduction to find the smallest element in a vector:
__kernel
void reduce(
__global float* buffer,
__local float* scratch,
__const int length,
__global float* result) {
int global_index = get_global_id(0);
int local_index = get_local_id(0);
// Load data into local memory
if (global_index < length) {
scratch[local_index] = buffer[global_index];
} else {
// Infinity is the identity element for the min operation
scratch[local_index] = INFINITY;
}
barrier(CLK_LOCAL_MEM_FENCE);
for(int offset = get_local_size(0) / 2;
offset > 0;
offset >>= 1) {
if (local_index < offset) {
float other = scratch[local_index + offset];
float mine = scratch[local_index];
scratch[local_index] = (mine < other) ? mine : other;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
void reduce(
__global float* buffer,
__local float* scratch,
__const int length,
__global float* result) {
int global_index = get_global_id(0);
int local_index = get_local_id(0);
// Load data into local memory
if (global_index < length) {
scratch[local_index] = buffer[global_index];
} else {
// Infinity is the identity element for the min operation
scratch[local_index] = INFINITY;
}
barrier(CLK_LOCAL_MEM_FENCE);
for(int offset = get_local_size(0) / 2;
offset > 0;
offset >>= 1) {
if (local_index < offset) {
float other = scratch[local_index + offset];
float mine = scratch[local_index];
scratch[local_index] = (mine < other) ? mine : other;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if (local_index == 0) {
result[get_group_id(0)] = scratch[0];
}
}
LAB - modify host code (opencl_main.cpp) and execute above code correctly.
Hint for LABs
- Naive, serial-based and task based solution is included in the lab-material. Please modify reduce.cl, follow the document and implement your version
- When attendee use Linux environment, Makefile is ready. And attende can use “make run_emulator” to compile, execute and monitor the result.
- When attendee modify the .cl file. All .cpp and .c file don’t required to recompile
Reference
- [OpenCL™ Optimization Case Study: Simple Reductions]
 
