Running TornadoVM on CPUs and FPGAs via oneAPI
Introduction
In previous posts, we have discussed how to run data-parallel applications on GPUs using TornadoVM. But what if you do not have a GPU? Can we run TornadoVM applications on CPUs and take advantage of all cores?
The answer is yes. All you need is an OpenCL implementation that can run on your CPU. In this post, I will show how you can configure TornadoVM to run on such systems using the Intel oneAPI base toolkit for Intel CPUs. Note that there are also other OpenCL implementations for CPUs, such as PoCL, that can run not only on Intel architectures, but also on ARM and even on RISC-V architectures. I may explore this in future technical articles.
If you prefer, there is also a video version of this tutorial on YouTube.
Installing Intel oneAPI
Installing Intel oneAPI is straightforward. All you need is the installer package, which you can download from the Intel oneAPI website (https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.htm). There are different toolkits within Intel oneAPI, but for running TornadoVM with OpenCL, we only need the base toolkit. If we select the offline installer for Linux, the following commands complete the installation:
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/fdc7a2bc-b7a8-47eb-8876-de6201297144/l_BaseKit_p_2024.1.0.596_offline.sh
sudo sh ./l_BaseKit_p_2024.1.0.596_offline.sh
Note that, by default, Intel oneAPI is installed in /opt/intel/oneapi. It is important to know this directory, because we need it to set up the environment when running with oneAPI. Once the installation is finished, we are ready to run on the CPU.
Running on the CPU with TornadoVM
If we have configured TornadoVM to use the OpenCL backend, we do not have to recompile TornadoVM. Just run the applications on the CPU device.
If I run tornado with the --devices option on my development system, I get the following devices (using TornadoVM 1.0.4):
$ tornado --devices
Number of Tornado drivers: 1
Driver: OpenCL
Total number of OpenCL devices : 2
Tornado device=0:0 (DEFAULT)
OPENCL -- [NVIDIA CUDA] -- NVIDIA GeForce RTX 3070
Global Memory Size: 7.8 GB
Local Memory Size: 48.0 KB
Workgroup Dimensions: 3
Total Number of Block Threads: [1024]
Max WorkGroup Configuration: [1024, 1024, 64]
Device OpenCL C version: OpenCL C 1.2
Tornado device=0:1
OPENCL -- [Intel(R) OpenCL HD Graphics] -- Intel(R) UHD Graphics 770
Global Memory Size: 24.9 GB
Local Memory Size: 64.0 KB
Workgroup Dimensions: 3
Total Number of Block Threads: [512]
Max WorkGroup Configuration: [512, 512, 512]
Device OpenCL C version: OpenCL C 1.2
But, wait, I still do not see the CPU as a target device. In my case, I have access to two devices, an NVIDIA RTX 3070, and an Intel integrated GPU. To get access to the CPU, we need to load the Intel oneAPI environment:
$ source /opt/intel/oneapi/setvars.sh
:: initializing oneAPI environment ...
bash: BASH_VERSION = 5.2.26(1)-release
args: Using "$@" for setvars.sh arguments:
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::
If we inspect the TornadoVM devices again, we see the following:
$ tornado --devices
Number of Tornado drivers: 1
Driver: OpenCL
Total number of OpenCL devices : 4
Tornado device=0:0 (DEFAULT)
OPENCL -- [NVIDIA CUDA] -- NVIDIA GeForce RTX 3070
Global Memory Size: 7.8 GB
Local Memory Size: 48.0 KB
Workgroup Dimensions: 3
Total Number of Block Threads: [1024]
Max WorkGroup Configuration: [1024, 1024, 64]
Device OpenCL C version: OpenCL C 1.2
Tornado device=0:1
OPENCL -- [Intel(R) OpenCL HD Graphics] -- Intel(R) UHD Graphics 770
Global Memory Size: 24.9 GB
Local Memory Size: 64.0 KB
Workgroup Dimensions: 3
Total Number of Block Threads: [512]
Max WorkGroup Configuration: [512, 512, 512]
Device OpenCL C version: OpenCL C 1.2
Tornado device=0:2
OPENCL -- [Intel(R) OpenCL] -- 12th Gen Intel(R) Core(TM) i7-12700K
Global Memory Size: 31.1 GB
Local Memory Size: 32.0 KB
Workgroup Dimensions: 3
Total Number of Block Threads: [8192]
Max WorkGroup Configuration: [8192, 8192, 8192]
Device OpenCL C version: OpenCL C 3.0
Tornado device=0:3
OPENCL -- [Intel(R) FPGA Emulation Platform for OpenCL(TM)] -- Intel(R) FPGA Emulation Device
Global Memory Size: 31.1 GB
Local Memory Size: 256.0 KB
Workgroup Dimensions: 3
Total Number of Block Threads: [67108864]
Max WorkGroup Configuration: [67108864, 67108864, 67108864]
Device OpenCL C version: OpenCL C 1.2
So, based on my configuration, if I want to run a TornadoVM program on the CPU, I need to select device 0:2. And if I want the FPGA, I select device 0:3. Let's do that.
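When runs are scripted, the index can be looked up rather than hard-coded, since it shifts whenever an OpenCL driver is added or removed. The following is a minimal sketch, assuming the tornado --devices output keeps the layout shown above. DeviceFinder and findIndex are hypothetical names used for illustration; they are not part of TornadoVM.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of TornadoVM): scans the text printed by
// `tornado --devices` and returns the "backend:device" index of the first
// device whose description lines contain the given keyword.
class DeviceFinder {

    // Matches lines such as "Tornado device=0:2" or "Tornado device=0:0 (DEFAULT)".
    private static final Pattern DEVICE_LINE = Pattern.compile("Tornado device=(\\d+:\\d+)");

    static Optional<String> findIndex(String listing, String keyword) {
        String currentIndex = null;
        for (String line : listing.split("\\R")) {
            Matcher m = DEVICE_LINE.matcher(line);
            if (m.find()) {
                // A new device entry starts here; remember its index.
                currentIndex = m.group(1);
            } else if (currentIndex != null && line.contains(keyword)) {
                return Optional.of(currentIndex);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Shortened copy of the listing shown above.
        String listing = String.join("\n",
            "Tornado device=0:0 (DEFAULT)",
            "OPENCL -- [NVIDIA CUDA] -- NVIDIA GeForce RTX 3070",
            "Tornado device=0:2",
            "OPENCL -- [Intel(R) OpenCL] -- 12th Gen Intel(R) Core(TM) i7-12700K",
            "Tornado device=0:3",
            "OPENCL -- [Intel(R) FPGA Emulation Platform for OpenCL(TM)] -- Intel(R) FPGA Emulation Device");
        System.out.println("CPU:  " + findIndex(listing, "Core(TM)").orElse("not found"));
        System.out.println("FPGA: " + findIndex(listing, "FPGA Emulation").orElse("not found"));
    }
}
```

The keyword is matched against any line of a device entry, so a distinctive fragment of the device name works best.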
I will demonstrate the CPU access with TornadoVM by running the Blur Filter example from this GitHub repo.
$ git clone https://github.com/jjfumero/tornadovm-examples
$ cd tornadovm-examples
$ source /path-to-your-Tornado-DIR/source.sh
$ export TORNADO_SDK=/path-to-your-Tornado-DIR/bin/sdk
$ mvn clean package
And we are ready to run. Notice that, based on the devices I have installed on my system, the OpenCL CPU device appears at index 0:2 in the TornadoVM device list. This is the index we need to use to run on the multi-core CPU:
$ tornado --threadInfo -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:2
Task info: blur.red
Backend : OPENCL
Device : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [20, 1]
Local work size : null
Number of workgroups : [0, 0]
Task info: blur.green
Backend : OPENCL
Device : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [20, 1]
Local work size : null
Number of workgroups : [0, 0]
Task info: blur.blue
Backend : OPENCL
Device : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [20, 1]
Local work size : null
Number of workgroups : [0, 0]
TornadoVM Total Time (ns) = 5526195869 -- seconds = 5.526195869
That's awesome! Now we can run some benchmarks to compare performance against the sequential and parallel-streams versions. If you want to benchmark on the CPU with TornadoVM, just make sure the right device is selected in the source code:
diff --git a/src/main/java/io/github/jjfumero/BlurFilter.java b/src/main/java/io/github/jjfumero/BlurFilter.java
index ff4cd42..ffc1cb1 100644
--- a/src/main/java/io/github/jjfumero/BlurFilter.java
+++ b/src/main/java/io/github/jjfumero/BlurFilter.java
@@ -381,7 +381,7 @@ public class BlurFilter {
@Setup(Level.Trial)
public void doSetup() {
// Select here the device to run (backendIndex, deviceIndex)
- blurFilter = new BlurFilter(Options.Implementation.TORNADO_LOOP, 0, 3);
+ blurFilter = new BlurFilter(Options.Implementation.TORNADO_LOOP, 0, 2);
}
@Benchmark
Then we recompile and run:
$ mvn clean package
$ tornado --threadInfo -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter jmh
Benchmark Mode Cnt Score Error Units
BlurFilter.Benchmarking.jvmJavaStreams avgt 5 5966823136.333 ± 22489007.669 ns/op
BlurFilter.Benchmarking.jvmSequential avgt 5 72104124393.000 ± 15762868825.115 ns/op
BlurFilter.Benchmarking.runTornadoVM avgt 5 5095359714.433 ± 102896875.200 ns/op
If we talk about performance, we also need to talk about the setup. This benchmark was executed with TornadoVM 1.0.4 on a 12th Gen Intel Core i7-12700K CPU. This CPU has 8 performance cores and 4 efficiency cores. On this architecture, only the performance cores support Hyper-Threading (HT), which gives us a total of 20 hardware threads (8 × 2 + 4). I am using the Intel OpenCL runtime 2024.17.3.0.08_160000 from Intel oneAPI. The Java version is Java HotSpot(TM) 64-Bit Server VM (build 21.0.3+7-LTS-152).
As we see, Java parallel streams and TornadoVM run about 12x and 14.1x faster, respectively, than the Java sequential code. We use the sequential version as the baseline because the input Java code annotated with the @Parallel TornadoVM annotation is the sequential implementation. If we compare TornadoVM against Java parallel streams, TornadoVM is 17% faster on the same hardware. This is great!
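These ratios can be recomputed directly from the JMH average times in the table above. The following sketch does only that arithmetic; SpeedupCheck is a hypothetical class name, and the three constants are copied from the benchmark output.

```java
// Recomputes the speedups quoted in the text from the JMH average times (ns/op).
class SpeedupCheck {

    // Speedup of a candidate over a baseline: how many times faster it runs.
    static double speedup(double baselineNs, double candidateNs) {
        return baselineNs / candidateNs;
    }

    public static void main(String[] args) {
        // Scores copied from the JMH table above.
        double sequential = 72104124393.000;
        double streams = 5966823136.333;
        double tornado = 5095359714.433;

        System.out.printf("Streams vs sequential:   %.2fx%n", speedup(sequential, streams));
        System.out.printf("TornadoVM vs sequential: %.2fx%n", speedup(sequential, tornado));
        System.out.printf("TornadoVM vs streams:    %.1f%% faster%n",
                (speedup(streams, tornado) - 1.0) * 100.0);
    }
}
```

Keep in mind that the sequential score carries a large error margin (± 15762868825.115 ns/op), so the first two ratios are approximate.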
Now, are you up for a challenge? Is it possible to run even faster? If so, how? Let me know, and, if you like this topic, stay tuned for new performance improvements. Let's now look at how we can run on FPGAs using the emulation mode of Intel oneAPI.
Running on the FPGA
Running on FPGAs is a bit trickier than running on GPUs and CPUs. This is because, to compile for FPGAs, we need a different compiler, and the compilation usually takes a long time (more than 30-40 minutes).
Fortunately, some FPGA vendors, like Intel, give us the option to emulate the compilation and execution on an FPGA, even if we don’t have a physical FPGA, which is my case.
To run TornadoVM with FPGA emulation mode, we need to set up a new env variable, and then run TornadoVM as usual. This env variable is tailored to Intel FPGAs, so, bear in mind that, if you use an FPGA from another vendor, this env variable might change.
To demonstrate the FPGA workflow, I will select another example from the tornadovm-examples suite. Ideally, we should be able to run the same application (Blur Filter). However, FPGA support in TornadoVM is still in active development, and this application currently produces some compilation errors. But we can still run on the FPGA using another example:
$ env CL_CONTEXT_EMULATOR_DEVICE_INTELFPGA=1 tornado --threadInfo --jvm="-Ds0.t0.device=0:3" -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.Mandelbrot tornado
Task info: s0.t0
Backend : OPENCL
Device : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [512, 512]
Local work size : [64, 1, 1]
Number of workgroups : [8, 512]
Total Time (ns) = 495149249 -- seconds = 0.495149249
We just ran the Mandelbrot program on the FPGA in emulation mode! As a disclaimer, never measure performance in emulation mode. It is meant only for debugging: the code never runs on real FPGA hardware and is, in fact, emulated on the CPU.
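The device is selected in the command above through a JVM property whose name combines the task-graph name and the task name reported in the task info (s0.t0). As a minimal sketch of that pattern, the following hypothetical helper (DeviceFlag is not part of TornadoVM) builds the flag for any task; that the same pattern applies to other task names is an assumption based on this single example.

```java
// Hypothetical helper (not part of TornadoVM): builds the JVM property used in
// the command above to pin one task to a device.
class DeviceFlag {

    // Produces "-D<graph>.<task>.device=<backend>:<device>".
    static String forTask(String graphName, String taskName, int backendIndex, int deviceIndex) {
        return "-D" + graphName + "." + taskName + ".device=" + backendIndex + ":" + deviceIndex;
    }

    public static void main(String[] args) {
        // Same values as the Mandelbrot run: task s0.t0 on the FPGA emulation device 0:3.
        System.out.println("--jvm=\"" + forTask("s0", "t0", 0, 3) + "\"");
    }
}
```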
If you want to deep-dive into the FPGA compilation, the TornadoVM JIT compiler generates a new folder called fpga-source-comp (which we can also tune) with all the generated FPGA binaries (called bitstreams) as well as the log information. If we inspect the file fpga-source-comp/outputFPGA.log, we see the following content:
Command: [aocl-ioc64, --input=/home/juan/repos/tornadovm-examples/fpga-source-comp/mandelbrotFractal.cl, --device=fpga_fast_emu, --cmd=build, --ir=/home/juan/repos/tornadovm-examples/fpga-source-comp/mandelbrotFractal.aocx]
Setting target instruction set architecture to: Default (Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2))
Platform name: Intel(R) FPGA Emulation Platform for OpenCL(TM)
Device name: Intel(R) FPGA Emulation Device
Device version: OpenCL 1.2
Device vendor: Intel(R) Corporation
Device profile: EMBEDDED_PROFILE
Using build options: -I "/home/juan/repos/tornadovm-examples/fpga-source-comp"
Compilation started
Compilation done
Linking started
Linking done
Device build started
Options used by backend compiler:
Device build done
Kernel "mandelbrotFractal" was successfully vectorized (8)
Done.
--------------------------------------------------------------------
Standard output:
Command: [/home/juan/tornadovm/TornadoVM/bin/sdk/bin/cleanFpga.sh, intel, /home/juan/repos/tornadovm-examples/fpga-source-comp/, mandelbrotFractal]
This means that TornadoVM, after the JIT compiler transformed the Java bytecode to OpenCL C code, compiled the final binary with the aocl-ioc64 compiler tool. This can be tuned: TornadoVM knows how to compile for Intel and Xilinx FPGAs, and it reads the compiler settings from a configuration file. By default, the configuration file is $TORNADO_SDK/etc/intel-oneapi-fpga.conf (recall that TORNADO_SDK points to the bin/sdk directory):
$ cat $TORNADO_SDK/etc/intel-oneapi-fpga.conf
# Configure the fields for FPGA compilation & execution
# [device]
DEVICE_NAME = fpga_fast_emu
# [compiler]
COMPILER = aocl-ioc64
# [options]
DIRECTORY_BITSTREAM = fpga-source-comp/ # Specify the directory
We can tune these values or, even better, provide our own configuration file. To pass a new configuration file for the FPGA, we use the -Dtornado.fpga.conf.file=FILE option. This process is a bit cumbersome, but keep in mind that TornadoVM is still an academic project. Hopefully, the workflow will improve in future versions.
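Before pointing TornadoVM at a custom configuration file, it can help to check which values the file actually defines. The following is a minimal sketch that reads the KEY = VALUE layout shown above. FpgaConfReader is a hypothetical helper, not TornadoVM's own parser, and treating # as the start of a comment anywhere on a line is an assumption based on the sample file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader (not TornadoVM's own parser) for the KEY = VALUE layout
// of the FPGA configuration file shown above.
class FpgaConfReader {

    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String raw : text.split("\\R")) {
            // Assumption: '#' starts a comment that runs to the end of the line.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int eq = line.indexOf('=');
            if (eq <= 0) {
                continue; // blank line, comment-only line, or not KEY = VALUE
            }
            fields.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Content copied from the default configuration file shown above.
        String conf = String.join("\n",
            "# Configure the fields for FPGA compilation & execution",
            "# [device]",
            "DEVICE_NAME = fpga_fast_emu",
            "# [compiler]",
            "COMPILER = aocl-ioc64",
            "# [options]",
            "DIRECTORY_BITSTREAM = fpga-source-comp/ # Specify the directory");
        parse(conf).forEach((key, value) -> System.out.println(key + " -> " + value));
    }
}
```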
Conclusions
TornadoVM can run on any hardware accelerator compatible with one of its backend implementations (OpenCL, PTX and SPIR-V in the latest version at the time of writing this post, 1.0.4). The supported devices include not just GPUs, but also CPUs and FPGAs. This post has shown how to run and benchmark Java programs accelerated with TornadoVM on multi-core CPUs through its OpenCL backend and the Intel oneAPI implementation. In addition, since Intel oneAPI also provides an FPGA emulation mode, this post has shown how to access and configure TornadoVM to run on such devices.