Using oneAPI Construction Kit and TornadoVM to accelerate Java Programs on x86, ARM and RISC-V CPUs

25 minute read

Published:

Key Takeaways

  • oneAPI Construction Kit (OCK) is a new framework that allows software developers and system designers to bridge the gap between hardware accelerators and open standards, by enabling a programming system to implement domain specific instructions for modern hardware such as RISC-V accelerators and custom processors.
  • OCK runs standard OpenCL 3.0 and SPIR-V 1.0. Thus, it can be also used as a standalone acceleration platform that can run on different devices, including ARM and x86-64 CPUs.
  • TornadoVM can use OCK as an OpenCL and SPIR-V driver to deploy the generated OpenCL and SPIR-V code on modern hardware accelerators, including CPUs, allowing Java developers to access a wide range of hardware with minimal programming effort.
  • This post shows how to setup OCK to be used with TornadoVM for Intel, ARM and RISC-V CPUs. Besides, it shows a performance evaluation of TornadoVM/OCK compared to Java Parallel Streams running on the same CPU.

Introduction

The oneAPI Construction Kit (OCK) is an open-source programming framework that allows developers to implement operations for new hardware (e.g., RISC-V accelerators, or any other type of accelerators, such as hardware for AI). 

What makes OCK special is that this framework can facilitate developers to map open programming standards, such as SYCL and OpenCL, to new hardware accelerators. With a few configuration files, developers can define the set of operations for the new hardware, and then map those operations to the implementation of open standards like SYCL. 

Furthermore, OCK could also be used as a runtime dispatcher of OpenCL and SPIR-V kernels, and this is the interest of this post. We could configure OCK to provide an OpenCL dispatcher to run on Intel CPUs, ARM CPUs and even RISC-V cores. 

This post explains how to configure and install OCK to work as a backend for TornadoVM, allowing Java developers to access CPUs from different vendors as OpenCL devices.

The following instructions about how to build and deploy TornadoVM with OCK are focused on Linux systems. However, it should be possible to build OCK for other platforms, such as Windows.

The rest of the post is organised as follows: Firstly, we will explain how to get OCK and TornadoVM started on Intel CPUs, ARM CPUs, and RISC-V CPU Cores with Vector instructions. Secondly, we will show a performance evaluation for Intel and ARM CPUs and compare TornadoVM + OCK with the execution of Java parallel streams and Intel oneAPI running on the same CPU. Do you think that TornadoVM and OCK will run faster than Java? Let’s find out!  

Installing prerequisites

If you use Fedora or Red Hat-based distributions, you will need to install the following dependencies. 

$ sudo dnf install ninja doxygen python3-pip git cmake spirv-tools 
$ sudo pip3 install lit cmakelint 

Installation Guidelines for x86/64 & Performance Evaluation

We are now going to configure and install OCK to be used by TornadoVM as an OpenCL and SPIR-V kernel dispatcher for X86/64 CPUs. Thus, TornadoVM will see a new OpenCL device for our system, and that device can be used to dispatch and accelerate our Java parallel kernels.

Configure LLVM 

mkdir ock 
cd ock 
baseDIR=$PWD 

## Clone LLVM 
git clone https://github.com/llvm/llvm-project.git llvm
cd llvm 
llvmDIR=$PWD 

At the time of writing this post, the supported LLVM version for OCK is 18. Thus, we need to configure LLVM using the 18.x branch:

git checkout release/18.x 

Configure LLVM:

export LLVMINSTALL=$llvmDIR/build-x86_64/install 

cmake llvm -GNinja \
   -Bbuild-x86_64 \
   -DCMAKE_BUILD_TYPE=Release \
   -DCMAKE_INSTALL_PREFIX=$LLVMINSTALL \
   -DLLVM_ENABLE_PROJECTS=”clang;lld” \
   -DLLVM_TARGETS_TO_BUILD=X86

Then, built and install:

ninja -C build-x86_64 install 

Configure and Install OCK 

Configure OCK using the LLVM we have just installed:

cd $baseDIR 
git clone https://github.com/codeplaysoftware/oneapi-construction-kit 
cd oneapi-construction-kit 

cmake . -GNinja\
   -Bbuild-x86_64 \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=$PWD/build-x86_64/install \
  -DCA_ENABLE_API=cl \
  -DCA_ENABLE_DOCUMENTATION=OFF \
  -DCA_LLVM_INSTALL_DIR=$LLVMINSTALL

Build OCK:

ninja -C build-x86_64 install 

Next, we need to configure the Linux system to use the new OpenCL installation. There are various ways to get this. One of them is updating the folder /etc/OpenCL/vendors/ with a new file that contains the path to the libCL.so installation. 

sudo vim /etc/OpenCL/vendors/ock.icd 

And we add the following line:

## content of the file  <your-path-to-libCL.so> 
/home/juan/repos/ock/oneapi-construction-kit/build-x86_64/install/lib/libCL.so 

Configure TornadoVM

We are now ready to install TornadoVM. Note that, if you have TornadoVM already installed, there is no need to reconfigure it. TornadoVM will detect the new device automatically. 

cd TORNADOVM_ROOT 
./bin/tornadovm-installer --jdk jdk21 --backend=opencl,spirv 
source setvars.sh 

Let’s explore all devices available:

tornado --devices 

Number of Tornado drivers: 2 
Driver: SPIR-V 
  Total number of SPIR-V devices  : 2 
  Tornado device=0:0  (DEFAULT) 
SPIRV -- SPIRV OCL - Intel(R) HD Graphics 630 
Global Memory Size: 28.9 GB 
Local Memory Size: 64.0 KB 
Workgroup Dimensions: 3 
Total Number of Block Threads: [256] 
Max WorkGroup Configuration: [256, 256, 256] 
Device OpenCL C version: OpenCL C 1.2 

  Tornado device=0:1 
SPIRV -- SPIRV LevelZero - Intel(R) HD Graphics 630 
Global Memory Size: 28.9 GB 
Local Memory Size: 64.0 KB 
Workgroup Dimensions: 3 
Total Number of Block Threads: [256] 
Max WorkGroup Configuration: [256, 256, 256] 
Device OpenCL C version:  (LEVEL ZERO) 1.3 
 
Driver: OpenCL 
  Total number of OpenCL devices  : 4 
  Tornado device=1:0 
OPENCL --[NVIDIA CUDA] -- NVIDIA GeForce GTX 1050 
Global Memory Size: 3.9 GB 
Local Memory Size: 48.0 KB 
Workgroup Dimensions: 3 
Total Number of Block Threads: [1024] 
Max WorkGroup Configuration: [1024, 1024, 64] 
Device OpenCL C version: OpenCL C 1.2 
 
  Tornado device=1:1 
OPENCL --[Intel(R) OpenCL Graphics] -- Intel(R) HD Graphics 630 
Global Memory Size: 28.9 GB 
Local Memory Size: 64.0 KB 
Workgroup Dimensions: 3 
Total Number of Block Threads: [256] 
Max WorkGroup Configuration: [256, 256, 256] 
Device OpenCL C version: OpenCL C 1.2 
 
  Tornado device=1:2 
OPENCL --[ComputeAorta] -- ComputeAorta x86_64   ## <<<<<<< NEW DEVICE  
Global Memory Size: 7.8 GB 
Local Memory Size: 32.0 KB 
Workgroup Dimensions: 3 
Total Number of Block Threads: [1024] 
Max WorkGroup Configuration: [1024, 1024, 1024] 
Device OpenCL C version: OpenCL C 1.2 Clang 18.1.8 

We see a new device using the IDs 1:2 (backend 1, device 2). Note that, depending on your configuration, you might see a different ordering of devices. The ComputeAorta Device is now configured to use the Intel CPU using the OCK.

If we want to run in this device, we simply add the following options to our tornado launcher:

--jvm="-DtaskGraphName.taskName.device=1:2"

The taskGraphName and the taskName` are application specific. For instance, using one of example within the TornadoVM suite:

tornado --printKernel --jvm="-Ds0.t0.device=1:2" --threadInfo \ 
-m tornado.examples/uk.ac.manchester.tornado.examples.arrays.ArrayAddInt  

#pragma OPENCL EXTENSION cl_khr_fp64 : enable   
#pragma OPENCL EXTENSION cl_khr_fp16 : enable   
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable   
__kernel void add(__global long *_kernel_context, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics, __global uchar *a, __global uchar *b, __global uchar *c) 
{ 
  long l_7, l_8;  
  ulong ul_13, ul_11, ul_9, ul_1, ul_2, ul_0;  
  int i_10, i_12, i_5, i_6, i_3, i_4, i_14, i_15;  
 
  // BLOCK 0 
  ul_0  =(ulong) a; 
  ul_1  =(ulong) b; 
  ul_2  =(ulong) c; 
  i_3  =  get_global_size(0); 
  i_4  =  get_global_id(0); 
  // BLOCK 1 MERGES [0 2 ] 
  i_5  =  i_4; 
  for(;i_5 < 8;) 
  { 
    // BLOCK 2 
    i_6  =  i_5 + 6; 
    l_7  =(long) i_6; 
    l_8  =  l_7 << 2; 
    ul_9  =  ul_0 + l_8; 
    i_10  =  *((__global int *) ul_9); 
    ul_11  =  ul_1 + l_8; 
    i_12  =  *((__global int *) ul_11); 
    ul_13  =  ul_2 + l_8; 
    i_14  =  i_10 + i_12; 
    *((__global int *) ul_13)  =  i_14; 
    i_15  =  i_3 + i_5; 
    i_5  =  i_15; 
  }  // B2 
  // BLOCK 3 
  return; 
}  //  kernel 
 
Task info: s0.t0 
Backend           : OPENCL 
Device            : ComputeAorta x86_64 L_DEVICE_TYPE_CPU (available) 
Dims              : 1 
Global work offset: [0] 
Global work size  : [8] 
Local  work size  : [8, 1, 1] 
Number of workgroups  : [1] 

But we also configured the SPIR-V backend with TornadoVM. So why don’t we see a SPIR-V device called ComputeAorta? This is because the ComputeAorta device supports SPIR-V 1.0, while TornadoVM supports, at minimum, SPIR-V 1.2.

We can override this setting and force it to use SPIR-V 1.0 (or any other version). Note that TornadoVM does not guarantee that it runs all supported features with previous SPIR-V versions. If we want to enable a previous version of SPIR-V, we simply add the following flag:

--jvm="-Dtornado.spirv.version=1.0" 

For instance:

tornado --jvm="-Dtornado.spirv.version=1.0"--devices 
 
Number of Tornado drivers: 2 
Driver: SPIR-V 
  Total number of SPIR-V devices  : 2 
  Tornado device=0:0  (DEFAULT) 
SPIRV -- SPIRV OCL - Intel(R) HD Graphics 630 
Global Memory Size: 28.9 GB 
Local Memory Size: 64.0 KB 
Workgroup Dimensions: 3 
Total Number of Block Threads: [256] 
Max WorkGroup Configuration: [256, 256, 256] 
Device OpenCL C version: OpenCL C 1.2 

  Tornado device=0:1 
SPIRV -- SPIRV OCL - ComputeAorta x86_64     <<<< ComputeAorta 
Global Memory Size: 7.8 GB 
Local Memory Size: 32.0 KB 
Workgroup Dimensions: 3 
Total Number of Block Threads: [1024] 
Max WorkGroup Configuration: [1024, 1024, 1024] 
Device OpenCL C version: OpenCL C 1.2 Clang 18.1.8 

Now we can run on this device with SPIR-V too:

tornado --printKernel --jvm="-Dtornado.spirv.version=1.0 -Ds0.t0.device=0:1" --threadInfo -m tornado.examples/uk.ac.manchester.tornado.examples.arrays.ArrayAddInt 

; MagicNumber: 0x7230203 
; Version: 1.2 
; Generator ID: 32 
; Bound: 81 
; Schema: 0 
# ...
##  SPIRV CODE
# ...
Task info: s0.t0 
Backend           : SPIRV 
Device            : SPIRV OCL - ComputeAorta x86_64 CPU 
Dims              : 1 
Global work offset: [0] 
Global work size  : [8] 
Local  work size  : [8, 1, 1] 
Number of workgroups  : [1] 

It is also possible to dump the Assembly code that the OCK compiler generates from the OpenCL C code. To do so, we need to configure OCK with debug information:

-DCA_ENABLE_DEBUG_SUPPORT=ON

Then, to dump the assembly code, we export the following variable:

export CA_HOST_DUMP_ASM=1 

And run TornadoVM commands as usual:

tornado --jvm="-Ds0.t0.device=1:3" -m tornado.examples/uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D 256 

        ... //  
        vpaddq  %ymm0, %ymm10, %ymm10 
        vpaddq  %ymm0, %ymm9, %ymm9 
        vpbroadcastq    (%rsi), %ymm6 
        vpcmpgtq        %ymm9, %ymm6, %ymm11 
        vpcmpgtq        %ymm10, %ymm6, %ymm6 
        vpackssdw       %ymm6, %ymm11, %ymm6 
        vextracti128    $1, %ymm6, %xmm11 
        vpackssdw       %xmm11, %xmm6, %xmm6 
        vpshufd $216, %xmm6, %xmm6 
        vpand   %xmm6, %xmm8, %xmm8 
        vpsllw  $15, %xmm8, %xmm6 
        vpmovmskb       %xmm6, %edx 

We see that Intel vector AVX instructions are being generated!

Performance on Intel x86/64 CPUs 

Let’s run some experiments to evaluate the performance of OCK with TornadoVM on a modern Intel CPU. For this experiment, we are going to run the MatrixMultiplication2D example from the TornadoVM example suite as follows:

tornado --jvm="-Ds0.t0.device=0:1" -m tornado.examples/uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D 256 

We run this application with four different data sizes (256, 512, 1024 and 2048). The system used is a Linux machine with an Ubuntu 22.04.4 LTS OS and 6.5.0-35-generic Linux Kernel. The Intel driver for OpenCL CPU is 2024.18.6.0.02_160000. The system contains an Intel i9-13900K CPU with 64GB of RAM. 

We use OCK 4.0.0, commit 5be5a8d, and Intel oneAPI 2024.2 for CPUs and OpenCL.  The TornadoVM version used is 1.0.8-dev, commit ea85f22. The Java OpenJDK used is build 21.0.4+8-LTS-274.

Experimental Methodology

We run the matrix multiplication benchmarks for 4 different versions, namely, Java Single-Threaded, Java with Parallel Streams, TornadoVM using OCK, and TornadoVM using Intel oneAPI as a driver for OpenCL.

Each experiment was executed 101 times per version with the goal of comparing the JIT compiled code (after the warmup phase in JVM). Besides, note that in the case of the execution with TornadoVM, the TornadoVM JIT compiler optimises and compiles the Java bytecode to OpenCL and CPU binary in the first run. 

Speedups vs Java Single Threaded 

The following performance plot shows the speedups of each version (Java Parallel Streams, TornadoVM using OCK and TornadoVM using oneAPI for CPUs) versus Java Single Threaded. The reported value is the average of the median of the 101 iterations.

Alt text

The x-axis shows the input data size for the Matrix Multiplication, respectively for 256, 512, 1024 and 2048 square matrices. The y-axis shows speedup of each version compared to Java Single Threaded. Thus, the higher, the better.

As we can see, when we run with small data sizes, the Java Parallel Stream version performs faster than the parallel OpenCL drivers for TornadoVM (both Intel oneAPI and OCK). However, for larger data sizes, TornadoVM outperforms Java Parallel Streams, achieving speedups of up to 52x compared to Java Single-Threaded, and up to 4x faster than Java Parallel Streams for the same CPU. Worth mentioning that OCK is able to autovectorize the code from scalar OpenCL C code.

Regarding OCK, we see that it is generally slower than running the generated OpenCL program with Intel oneAPI. This is due to different optimizations being applied. As the Intel oneAPI is closed source, we do not know exactly which optimizations are being performed, but OCK is in active development and, hopefully, it will become faster overtime with clever compiler optimizations.

Data Distribution

But how are all the runs distributed? The following Figure shows the data distribution of the runtime for each configuration and data size, including the first iteration, in which TornadoVM compiles and optimises the code for CPUs. From left to right, each plot shows the runtime for Java Single-Threaded, Java Parallel Streams, TornadoVM with OCK and TornadoVM with Intel oneAPI.

Alt text

As we see, TornadoVM runs the first iteration faster than the Java sequential code for all the reported data sizes except for the smallest one. In addition, TornadoVM runs faster than Java Streams from size 1024, even from the first iteration.

What can we get from this data?

From this experiment, we see that TornadoVM can outperform the execution of the Java multi-core Parallel Streams of up to 4x using the same Intel CPU for the large data sizes. Note that, in the case of the OCK, the OpenCL JIT compiler is able to auto-vectorize the code, and we can’t confirm this for the Intel oneAPI compiler, since the project is closed source, but quite likely, the code is also auto-vectorized.

If the input data is not big enough, the execution with Java and Java parallel streams should be fast enough compared to TornadoVM using OCK and Intel oneAPI underneath.

TornadoVM for ARM CPUs & Performance Evaluation

Let’s discuss now how to get TornadoVM running on ARM CPUs. We will show you the setup using an ARM Neoverse V2 CPU processor, which is included into the NVIDIA GH200 Grace Hopper Superchip system.

Configuring LLVM for ARM

We need to configure LLVM for Aarch64:

git clone https://github.com/llvm/llvm-project.git 
llvmDIR=$PWD 
cd llvm 

#Remember to switch to the 18.x branch 
git checkout release/18.x 
 
cmake llvm -GNinja \
-Bbuild-aarch64 \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=$PWD/build-aarch64/install \
-DLLVM_ENABLE_PROJECTS="clang;lld" \
-DLLVM_TARGETS_TO_BUILD=AArch64 \
ninja -C build-aarch64 install

Configure OCK

In a new directory we clone the repo and install OCK as follows:

git clone https://github.com/codeplaysoftware/oneapi-construction-kit 
cd oneapi-construction-kit 

cmake . -GNinja\
-Bbuild-aarch64 \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=$PWD/build-aarch64/install \
-DCA_ENABLE_API=cl \
-DCA_ENABLE_DOCUMENTATION=OFF \
-DCA_LLVM_INSTALL_DIR=/home/jfumero/ock/llvm-project/build-aarch64/install \

ninja -C build-aarch64 install

Finally, we need to update the OpenCL ICD env variable to load the libCL.so from OCK. We can do it via either:

a. Adding a new file on Linux under the /etc/OpenCL/vendors/ called ock.icd with the path to the libCL.so from OCK:

<path-to--ock->/oneapi-construction-kit/build-aarch64/install/lib/libCL.so 

b. Updating the OCL_ICD_VENDORS env variable with a new directory that includes the new ICD files. For example:

For example:

export OCL_ICD_VENDORS=/home/jfumero/TornadoVM/icd-ock/ 

Configure TornadoVM for ARM and OCK

cd TORNADOVM_ROOT 
./bin/tornadovm-installer --jdk jdk21 --backend=opencl,spirv 
source setvars.sh 

Then we can run TornadoVM:

tornado --devices 
 
Number of Tornado drivers: 1 
Driver: OpenCL 
  Total number of OpenCL devices  : 2 
  Tornado device=0:0  (DEFAULT) 
        OPENCL --[NVIDIA CUDA] -- GH200 480GB 
                Global Memory Size: 95.0 GB 
                Local Memory Size: 48.0 KB 
                Workgroup Dimensions: 3 
                Total Number of Block Threads: [1024] 
                Max WorkGroup Configuration: [1024, 1024, 64] 
                Device OpenCL C version: OpenCL C 1.2 
 
  Tornado device=0:1 
        OPENCL --[ComputeAorta] -- ComputeAorta AArch64 
                Global Memory Size: 143.5 GB 
                Local Memory Size: 32.0 KB 
                Workgroup Dimensions: 3 
                Total Number of Block Threads: [1024] 
                Max WorkGroup Configuration: [1024, 1024, 1024] 
                Device OpenCL C version: OpenCL C 1.2 Clang 18.1.8 

Performance on ARM Neoverse V2 (NVIDIA GraceHopper Superchip System)

Setup

Similarly to the performance evaluation on the Intel CPUs, we also evaluate the performance of OCK using the same matrix multiplication benchmark on the ARM Neoverse V2 CPU processor. This processor is included into the NVIDIA GraceHopper Superchip system, which shares memory with the GH200 GPU.

The Linux system used is Ubuntu 22.04.4 LTS with the kernel 6.2.0-1015-nvidia-64k. We used OpenJDK 21 build 21.0.3+7-LTS-152. The TornadoVM version used is 1.0.8-dev, commit ea85f22.

For this experiment, we installed the OCK to run on ARM CPUs and compared with Java Single-Threaded and Java Parallel Streams running on the same CPU. Note that Intel oneAPI is not available for ARM systems.

Evaluation

The following Figure shows the speeds of each version compared to Java Single-Threaded (the higher, the better). The speedups are obtained by computing the median value of the 101 iterations. In contrast with Intel CPU systems, the execution on the CPU using Java multithreaded performs faster than TornadoVM dispatching OpenCL via OCK. Only for the largest data size, TornadoVM can benefit in the overall execution compared to Java Parallel Streams, offering up to 16% faster executions.

Alt text

We haven’t looked at the details, and there is plenty of room for analysis at this point, but the Java JIT compiler and the Parallel Stream implementation do a very good job for the ARM ecosystem, at least for this processor. 

Runtime Data Distribution

The following Figure shows the runtime distribution for all tests executed (the lower, the better). For each data size, the boxplot for the Java Single-Threaded, Java Parallel Stream and TornadoVM OCK on the CPU are shown. Note that the y-axis is in log-scale.

Alt text

As we can see, the Java Parallel Stream version using OpenJDK 21 performs faster than TornadoVM with OCK, and only when running the largest data size for the matrix multiplication, TornadoVM OCK outperforms the execution with Java Parallel Streams. Note that even the first iteration of TornadoVM is also a bit slower than the Java Parallel Stream version.

TornadoVM for RISC-V & Debug Modes

Finally, let’s configure OCK to run on RISC-V accelerators and use them as devices for TornadoVM. In the case of RISC-V, since we do not have any hardware available, it can run on emulation mode from an X86/64 machine.

To do so, OCK comes with a simulator, called Codeplay Reference Silicon (RefSi), which is based on SPIKE, and it can be used in combination with our Java/TornadoVM programs to run on RISC-V with Vector Instruction extensions! How cool is this?

Configuring LLVM for RISC-V

We need to install Device-Tree-Compiler:

sudo dnf install dtc

Then, we need to reconfigure LLVM for RISC-V:

cd $llvmDIR 
cmake llvm -GNinja \
-Bbuild-riscv \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=$PWD/build-riscv/install \
-DLLVM_ENABLE_PROJECTS="clang;lld" \
-DLLVM_TARGETS_TO_BUILD='X86;RISCV'

ninja -C build-riscv install 

Configuring OCK for RISC-V

Then build/install OCK. The following configuration does not enable DEBUG information:

cd $baseDIR/oneapi-construction-kit 

cmake -GNinja\
-Bbuild-riscv\
-DCA_RISCV_ENABLED=ON \
-DCA_MUX_TARGETS_TO_ENABLE="riscv"\
-DCA_LLVM_INSTALL_DIR=$llvmDIR/build-riscv/install \
-DCA_ENABLE_HOST_IMAGE_SUPPORT=OFF \
-DCA_ENABLE_API=cl \
-DCA_CL_ENABLE_ICD_LOADER=ON \
-DCMAKE_INSTALL_PREFIX=$PWD/build-riscv/install 

ninja -C build-riscv install 

If we want to enable DEBUG information to dump the Assembly code and all the HAL (Heterogeneous Abstraction Layer) calls, we need to configure OCK using -DCA_ENABLE_DEBUG_SUPPORT=ON -DCA_DEBUG_SUPPORT=ON as follows:

cmake -GNinja \
-Bbuild-riscv-debug\
-DCA_ENABLE_DEBUG_SUPPORT=ON \
-DCA_DEBUG_SUPPORT=ON \
-DCA_RISCV_ENABLED=ON \
-DCA_MUX_TARGETS_TO_ENABLE="riscv" \
-DCA_LLVM_INSTALL_DIR=$llvmDIR/build-riscv/install \
-DCA_ENABLE_HOST_IMAGE_SUPPORT=OFF \
-DCA_ENABLE_API=cl \
-DCA_CL_ENABLE_ICD_LOADER=ON \
-DCMAKE_INSTALL_PREFIX=$PWD/build-riscv-debug/install 

ninja -C build-riscv-debug install 

Running TornadoVM on RISC-V

To enable the debug information when running our TornadoVM applications, we need to export the following env variables as follows:

export CA_RISCV_DUMP_ASM=1    ## Print Assembly code 
export CA_HAL_DEBUG=1         ## Print calls to HAL 

Let’s run some examples:

export CA_RISCV_DUMP_ASM=1    ## Print Assembly code 
export CA_HAL_DEBUG=1         ## Print calls to HAL 
 
$ tornado --printKernel --jvm="-Ds0.t0.device=0:3" --threadInfo -m tornado.examples/uk.ac.manchester.tornado.examples.arrays.ArrayAddInt 
 
refsi_hal_device::mem_alloc(size=4, align=128) -> 0x9800ff80 
refsi_hal_device::mem_alloc(size=512, align=128) -> 0x9800fd80 
refsi_hal_device::mem_alloc(size=56, align=128) -> 0x9800fd00 
refsi_hal_device::mem_alloc(size=56, align=128) -> 0x9800fc80 
refsi_hal_device::mem_alloc(size=56, align=128) -> 0x9800fc00 
refsi_hal_device::mem_alloc(size=24, align=128) -> 0x9800fb80 
 
… 
 
.LBB0_9: 
addi	sp, s0, -80 
ld	ra, 72(sp) 
ld	s0, 64(sp) 
ld	s2, 56(sp) 
ld	s3, 48(sp) 
ld	s4, 40(sp) 
ld	s5, 32(sp) 
ld	s6, 24(sp) 
ld	s7, 16(sp) 
ld	s8, 8(sp) 
ld	s9, 0(sp) 
addi	sp, sp, 80 
ret 
.Lfunc_end0: 
.size	"add.mux-kernel-wrapper", .Lfunc_end0-"add.mux-kernel-wrapper" 
 
.type	notes_global,@object 
.section	notes,"a",@progbits 
.globl	notes_global 
notes_global: 
.size	notes_global, 192 
 
.ident	"clang version 18.1.8 (git@github.com:llvm/llvm-project.git 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)"
.section	".note.GNU-stack","",@progbits 
refsi_hal_device::mem_write(dst=0x9800fd00, size=56) 
refsi_hal_device::mem_write(dst=0x9800fc80, size=56) 
refsi_hal_device::mem_write(dst=0x9800fb80, size=24) 
Task info: s0.t0 
Backend           : OPENCL 
Device            : RefSi G1 RV64 CL_DEVICE_TYPE_ACCELERATOR (available) 
Dims              : 1 
Global work offset: [0] 
Global work size  : [8] 
Local  work size  : [8, 1, 1] 
Number of workgroups  : [1] 
 
refsi_hal_device::program_find_kernel(name='add.mux-kernel-wrapper') -> 0x00010000 
refsi_hal_device::kernel_exec(kernel=0x00010000, num_args=7, global=<8:1:1>, local=<8:1:1>) 
refsi_hal_device::pack_arg(offset=0, align=8, value=0x000000009800fb80) 
refsi_hal_device::pack_arg(offset=8, align=8, value=0x000000009800ff80) 
refsi_hal_device::pack_arg(offset=16, align=8, value=0x0000000000000008) 
refsi_hal_device::pack_arg(offset=24, align=8, value=0x000000009800fd80) 
refsi_hal_device::pack_arg(offset=32, align=8, value=0x000000009800fd00) 
refsi_hal_device::pack_arg(offset=40, align=8, value=0x000000009800fc80) 
refsi_hal_device::pack_arg(offset=48, align=8, value=0x000000009800fc00) 
refsi_hal_device::kernel_exec finished in 0.006 s 
refsi_hal_device::mem_read(src=0x9800fc00, size=56) 

TornadoVM and OCK spit debug information, such as the generated kernel, the generated assembly, and device in which the application is executed and all the calls to the HAL (Heterogeneous Abstraction Layer) API.

However, if we pay attention, there are no RVV instructions being generated. This is because we need to export the following variable with the vector width:

export CA_RISCV_VF=4 

If we run TornadoVM again with the OCK debug information on, we obtain the following:

tornado --printKernel --jvm="-Ds0.t0.device=0:3" --threadInfo -m tornado.examples/uk.ac.manchester.tornado.examples.arrays.ArrayAddInt 

…  
.LBB0_18: 
add	a0, s5, s8 
add	a0, a0, s4 
vsetvli	zero, zero, e64, m2, ta, ma 
vadd.vx	v10, v8, a0 
vsetvli	zero, zero, e32, m1, ta, ma 
vnsrl.wi	v12, v10, 0 
vmsle.vi	v0, v12, 7 
vsetvli	zero, zero, e8, mf4, ta, ma 
vmv.x.s	a0, v0 
andi	a0, a0, 15 
beqz	a0, .LBB0_17 
slli	a3, s7, 2 
add	a2, t3, a3 
add	a0, t4, a3 
add	a3, a3, t6 
vsetvli	zero, zero, e64, m2, ta, ma 
vsll.vx	v10, v10, s2 
vsra.vx	v10, v10, s2 
.LBB0_20: 
vsetvli	zero, zero, e32, m1, ta, ma 
vle32.v	v12, (a3), v0.t 
vle32.v	v13, (a0), v0.t 
vadd.vv	v12, v13, v12 
vse32.v	v12, (a2), v0.t 
vsetvli	zero, zero, e64, m2, ta, ma 
vadd.vx	v10, v10, s9 
vmsle.vi	v12, v10, 7 
vmand.mm	v0, v12, v0 
vsetvli	zero, zero, e8, mf4, ta, ma 
vmv.x.s	a4, v0 
andi	a4, a4, 15 
add	a2, a2, a5 
add	a0, a0, a5 
add	a3, a3, a5 
bnez	a4, .LBB0_20 
j	.LBB0_17 
.Lfunc_end0: 
.size	"__vecz_v4_add.mux-kernel-wrapper", .Lfunc_end0-"__vecz_v4_add.mux-kernel-wrapper" 
 
            … 
 
refsi_hal_device::mem_write(dst=0x9800fd00, size=56) 
refsi_hal_device::mem_write(dst=0x9800fc80, size=56) 
refsi_hal_device::mem_write(dst=0x9800fb80, size=24) 
Task info: s0.t0 
Backend           : OPENCL 
Device            : RefSi G1 RV64 CL_DEVICE_TYPE_ACCELERATOR (available) 
Dims              : 1 
Global work offset: [0] 
Global work size  : [8] 
Local  work size  : [8, 1, 1] 
Number of workgroups  : [1] 

And, if we export the following variable:

export SPIKE_SIM_DEBUG=1 

We can even run step by step with a debugger that is included within the RefSi Simulator. How cool is this? 

Conclusions 

This post has shown the potential of dispatching and running Java and TornadoVM programs on CPUs via the oneAPI Construction Kit (OCK), allowing Java developers to accelerate, not just on GPUs but also on CPUs. This post has shown how to configure OCK for three types of CPUs: Intel, ARM and RISC-V CPU.

Furthermore, in this post we have analysed the performance of the Matrix Multiply applications (widely used these days for AI and Deep Learning workloads) on Intel CPUs and ARM CPUs, and compared the performance against Java Parallel Streams with OpenJDK, and Intel oneAPI, achieving speedups of up to 4x compared to Java Parallel Streams.

In summary, running TornadoVM applications with OCK and Intel oneAPI on Intel CPUs performs better than running with Java and OpenJDK. However, this is not the case for ARM CPU systems.

Hardware systems are now fully heterogeneous, containing a wide range of specialized systems for specialized tasks. TornadoVM can run with multiple configurations and on multiple devices. Besides, TornadoVM can run with the help of the dynamic reconfiguration, which is a feature that we did not explore in this post, thanks to which TornadoVM explores the fastest device by itself and adapts the application to the fastest performing device. This means that TornadoVM is able to select the fastest accelerator when using the Java runtime for the execution.

The ability to select and combine different parallel runtimes is the key to achieve higher performance, in a way, TornadoVM is able to select the right tool/runtime for the right workload.

Acknowledgments

I would like to thank Colin Davidson from Codeplay for the support regarding the oneAPI Construction Kit for TornadoVM. I also thank the University of Pisa for the access to the NVIDIA GH200 GraceHopper Superchip system through the AERO EU Project. Furthermore, thanks to Christos Kotselidis and Thanos Stratikopoulos from the University of Manchester for their constructive feedback.