Can TornadoVM run Matrix Multiply faster than OpenCL Native?


TornadoVM, a Java parallel programming framework for hardware accelerators, can sometimes outperform native OpenCL code on GPUs, despite the latter being closer to the hardware. This is possible thanks to TornadoVM's ability to automatically apply compiler and runtime optimizations.

In this article, we are going to explore and analyse the different optimizations that are applied in TornadoVM using the Matrix Multiplication application as an example. Furthermore, we are going to build an OpenCL C++ application from scratch to replicate, step by step, all compiler and runtime optimizations that TornadoVM automatically applies.

Java Baseline

Since applications written for TornadoVM are only minimally extended from plain Java, let's start with the Java sequential code. You can find all the Java examples used in this post in the following GitHub repository:

https://github.com/jjfumero/tornadovm-examples

The Java sequential code is described as follows:

public static void mxmSequential(FloatMatrix a, FloatMatrix b, FloatMatrix c) {
   for (int i = 0; i < a.M(); i++) {
       for (int j = 0; j < b.N(); j++) {
           float acc = 0;
           for (int k = 0; k < c.M(); k++) {
               acc += a.get(i, k) * b.get(k, j);
           }
           c.set(i, j, acc);
       }
   }
}

The code shows the canonical O(n³) matrix multiplication. FloatMatrix is a custom data type that uses a Java memory segment to store the data in a 1D layout. For simplicity, we are going to use square matrices, and for our experiments we chose matrices of 1024x1024 elements.

The FloatMatrix type is defined as follows:

private static class FloatMatrix {

   private static final int FLOAT_SIZE = 4;

   private final int m;
   private final int n;
   private final MemorySegment segment;

   public FloatMatrix(int m, int n) {
       this.m = m;
       this.n = n;
       final long segmentByteSize = n * m * FLOAT_SIZE;
       segment = Arena.ofAuto().allocate(segmentByteSize, 64);
   }

   public void set(int i, int j, float value) {
       final int index = i * m + j;
       segment.set(JAVA_FLOAT, index * FLOAT_SIZE, value);
   }

   public float get(int i, int j) {
       final int index = i * m + j;
       return segment.get(JAVA_FLOAT, index * FLOAT_SIZE);
   }

   public void initRamdom() {
       Random r = new Random(71);
       for (int i = 0; i < m; i++) {
           for (int j = 0; j < n; j++) {
               set(i, j, r.nextFloat());
           }
       }
   }

   public int M() {
       return m;
   }

   public int N() {
       return n;
   }
}

Hardware and Software Setup

So, how fast is this implementation? To talk about performance, we need to discuss the environment and machine on which this application runs. We are going to use an on-premise server with an Intel i9-13900K CPU and 64GB of RAM. The JDK used is OpenJDK 21.0.4, running on Ubuntu 22.04 LTS with the Linux 6.8.0-47 kernel. In a nutshell:

Hardware:

  • CPU: 13th Gen Intel(R) Core(TM) i9-13900K
  • GPU: RTX 4090
  • RAM: 64GB

Software:

  • OS: Ubuntu 22.04.5 LTS
  • Kernel: Linux 6.8.0-47-generic #47~22.04.1-Ubuntu
  • NVIDIA-DRIVER: 550.107.02
  • CUDA: cuda_12.1.r12.1/compiler.32688072_0
  • GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  • TornadoVM: 1.0.9-dev (1f0dba4) using the OpenCL Backend.
  • JDK: OpenJDK “21.0.4” 2024-07-16 LTS

Application:

  • Matrix Multiplication in FP32
  • Matrix Size: 1024x1024 elements.
  • Link

Performance of Java Sequential Code

The example provided on GitHub contains the instructions to run all benchmarks:

$ export CLASSPATH=target/tornadovm-examples-1.0-SNAPSHOT.jar:$CLASSPATH
$ tornado io.github.jjfumero.MatrixMultiplication --params="--jmh"

If using JMH for micro-benchmarking, we get the following results (showing only the Java sequential implementation):

Benchmark                                                  Mode  Cnt           Score          Error  Units
MatrixMultiplication.Benchmarking.mxmSequential            avgt    5  2113164943.347 ± 17774213.889  ns/op

This is 2.1 seconds. Now, let’s run the Java multi-core implementation using the Java Vector API and analyse how fast we can get with Java parallel code on this hardware.

Java Vectorized Parallel Code

Here’s the multi-threaded and vectorized implementation. To make this work, matrix b needs to be transposed before invoking this method. As we can see, Java streams parallelize the outer loops, and the Java Vector API handles the reduction part of the kernel.

public static void mxmParallelVectorized(FloatMatrix a, FloatMatrix b, FloatMatrix c) {
   VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
   IntStream.range(0, a.M()).parallel().forEach(i -> IntStream.range(0, b.N()).parallel().forEach(j -> {
           float acc = 0;
           for (int k = 0; k < c.M(); k += species.length()) {
               FloatVector vector1 = FloatVector.fromMemorySegment(species, a.segment, (i * a.M() + k) * FLOAT_BYTES, ByteOrder.nativeOrder());
               FloatVector vector2 = FloatVector.fromMemorySegment(species, b.segment, (j * b.N() + k) * FLOAT_BYTES, ByteOrder.nativeOrder());
               acc += vector1.mul(vector2).reduceLanes(VectorOperators.ADD);
           }
           c.set(i, j, acc);
       }));
}

When we run this version with JMH, we obtain the following timers:

Benchmark                                                  Mode  Cnt           Score          Error  Units
MatrixMultiplication.Benchmarking.mxmParallelVectorized    avgt    5    13586191.583 ±    78030.170  ns/op
MatrixMultiplication.Benchmarking.mxmSequential            avgt    5  2113164943.347 ± 17774213.889  ns/op

The Java multi-core version that is parallelized with Java streams and the Vector API runs under 14 ms. This is 155x faster than the sequential implementation.

The CPU’s performance using the Vector API is quite impressive. The CPU has 8 performance cores (P-Cores) with Hyper-Threading support, enabling 16 threads with AVX2 instructions (processing up to 8 FP32 values per instruction). Additionally, there are 16 e-cores (energy efficiency cores) with AVX2 support, running at a lower frequency.

Running on the GPU RTX 4090 with TornadoVM

Next, we are going to adapt the matrix multiplication Java program to use the TornadoVM parallel programming framework and run it automatically on the NVIDIA RTX 4090 GPU.

private static void mxmTornadoVM(Matrix2DFloat a, Matrix2DFloat b, Matrix2DFloat c, final int size) {
   for (@Parallel int i = 0; i < size; i++) {
       for (@Parallel int j = 0; j < size; j++) {
           float sum = 0.0f;
           for (int k = 0; k < size; k++) {
               sum += a.get(i, k) * b.get(k, j);
           }
           c.set(i, j, sum);
       }
   }
}


private static TornadoExecutionPlan createTornadoVMPlan(Matrix2DFloat a, Matrix2DFloat b, Matrix2DFloat c) {
   TaskGraph taskGraph = new TaskGraph("mxm");
   taskGraph.transferToDevice(DataTransferMode.FIRST_EXECUTION, a, b) //
           .task("mxm", Multiplication::mxmTornadoVM, a, b, c, a.getNumRows()) //
           .transferToHost(DataTransferMode.EVERY_EXECUTION, c);
   TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(taskGraph.snapshot());
   executionPlan.withWarmUp().withDevice(TornadoExecutionPlan.getDevice(0, 0));
   return executionPlan;
}

We are not going to explain the TornadoVM parallel programming model in this post. However, it is worth mentioning that the Java code that is parallelized with TornadoVM is implemented using the naive sequential matrix multiplication.

Furthermore, note that we have not specified any low-level optimization, apart from indicating that the two outermost loops can run in parallel on the target accelerator, in our case, the NVIDIA RTX 4090.

For more details about the parallel programming model, you can follow the TornadoVM tutorials and documentation.

When we run the TornadoVM version with JMH, we obtain the following report:

Benchmark                                                  Mode  Cnt           Score          Error  Units
MatrixMultiplication.Benchmarking.mxmParallelVectorized    avgt    5    13586191.583 ±    78030.170  ns/op
MatrixMultiplication.Benchmarking.mxmSequential            avgt    5  2113164943.347 ± 17774213.889  ns/op
MatrixMultiplication.Benchmarking.mxmTornadoVM             avgt    5      814137.714 ±      587.040  ns/op

This is just under 1 millisecond (0.81 milliseconds), which means a speedup of 2595x compared to the Java sequential execution, and 16.6x faster than the vectorized Java multi-core version!

Comparing TornadoVM with Native OpenCL C++

OK, that's great, but how does TornadoVM compare to native OpenCL C++? Since we executed TornadoVM on the RTX 4090 using its OpenCL backend, we can implement an OpenCL program in C++ that solves the same matrix multiplication.

If you are curious, you can download and run the OpenCL C++ versions from the following GitHub repository:

https://github.com/jjfumero/opencl-samples/tree/master/mxm

If we run the OpenCL C++ version on the same machine and the same GPU (an RTX 4090), we get the following result:

## -p <number> to select the OpenCL platform. In my case, platform 2 corresponds with the 4090 GPU
$ ./mxm -p 2 
...
[INFO] OpenCL MxM 
[INFO] Size = 1024x1024
[INFO] 3 has been detected
[INFO] Platform: 0
        [INFO] Vendor: Codeplay Software Ltd.
[INFO] Platform: 1
        [INFO] Vendor: Intel(R) Corporation
[INFO] Platform: 2
        [INFO] Vendor: NVIDIA Corporation
[INFO] Using platform: 2 --> NVIDIA Corporation
[INFO] Using Device with Index: 0 -> name: NVIDIA GeForce RTX 4090
[INFO] Kernel <mxm> loaded
…

Median KernelTime: 4.39085e+06 (ns)
Median CopyInTime: 631200 (ns)
Median CopyOutTime: 268384 (ns)
Median TotalTime: 4.73631e+06 (ns)      << End-To-End time

Note that the application launches the kernel 100 times and reports the median of the kernel time, the copy-in/copy-out times, and the total time. In addition, to make the comparison fair with TornadoVM, it copies the input data only once (just as TornadoVM does when the data transfer to the device is declared as FIRST_EXECUTION).

We are interested in the total time, which represents the end-to-end time (data transfers plus kernel execution) of the whole application, in order to make it comparable with the timers reported by TornadoVM.

We see that the total time in OpenCL is 4.7 milliseconds. This means that the OpenCL C++ version is 5.8x slower than TornadoVM! Is this possible? How?

The answer is that TornadoVM applies many optimizations for us in order to increase performance. TornadoVM is more than a pure translator from Java to OpenCL; it also optimizes the code. In fact, TornadoVM optimizes the code in two ways:

  1. Compiler optimizations: e.g., parallel-thread-id insertion, math optimizations, etc.
  2. Runtime optimizations: e.g., better thread-scheduling.

But is this a fair comparison? My initial thought is no. It is not entirely fair, because the native OpenCL version implements a naive matrix multiplication without any optimizations, as we can see from the following code snippet.

__kernel void mxm(__global float *a,  __global float *b, __global float *c, const int n) {

	uint idx = get_global_id(0);
	uint jdx = get_global_id(1);

	float sum = 0.0;
	for (int k = 0; k < n; k++) {
		sum += a[idx * n + k] * b[k * n + jdx];
	}
	c[idx * n + jdx]  =  sum;
}

However, the TornadoVM version is also a naive implementation with just two annotations; it is the JIT compiler that applies well-known optimizations (as we will discuss in a bit) to improve performance. To achieve high performance in OpenCL, CUDA, and SYCL, manual software optimization is crucial: without it, significant performance gains are left untapped.

Understanding TornadoVM Performance

So, where is the “magic”? TornadoVM applies well-known compiler optimizations for GPUs, and, as we will see, it could be even faster than it currently is.

For this second part of the post, let's explore, step by step, all the optimizations that TornadoVM applies to achieve such high performance. Our goal is to improve our native OpenCL C++ code until it matches the performance of TornadoVM.

Before we dive into all the optimizations, let’s do a recap. The following performance plot shows a summary of the different implementations we have executed so far.

The plot shows the total runtime in nanoseconds of each implementation; thus, the lower, the better. The red line shows the total runtime of the Java sequential implementation. The dashed lines show, just for reference, the 1 second and 1 millisecond marks.

[Figure: total runtime in nanoseconds of each implementation, with reference lines at 1 second and 1 millisecond]

How can we know which optimizations TornadoVM applies?

For this post, we are going to analyse the optimizations by studying the OpenCL code that the TornadoVM JIT compiler generates. To dump the generated OpenCL kernels, pass the --printKernel option to TornadoVM on the command line.
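For example, to dump the kernels generated for the matrix multiplication example used in this post (adding --params as needed, as in the benchmark command above):

$ tornado --printKernel io.github.jjfumero.MatrixMultiplication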

However, if you want to dive deep into the whole list of optimizations, you can use IGV (the Ideal Graph Visualizer). IGV is a tool developed by the GraalVM team to visualize the Intermediate Representation (IR) across all compiler optimizations within Graal. Since TornadoVM extends the Graal compiler, it can also use IGV as an IR debugger.

To enable IGV with TornadoVM, you need to install IGV from the GraalVM repository:

https://github.com/oracle/graal/blob/master/docs/tools/ideal-graph-visualizer.md

And then use the --igv option with TornadoVM:

tornado --igv MyProgram

As I mentioned before, we are not going to use IGV in this post. In the future, I plan to write about IGV with TornadoVM in detail, so that TornadoVM and GraalVM users can get a sense of the optimizations being applied, and even apply their own.

1. Loop Interchange

The following code snippet shows a sketch of the OpenCL kernel generated by the TornadoVM JIT compiler. Let's focus on the general control-flow structure. We see three nested loops that correspond to the three loops in the original Java code.

The first thing to notice is that the loops have been interchanged. The induction variable of the outermost loop uses the OpenCL function get_global_id(1), while the first nested loop uses get_global_id(0).

Note that in TornadoVM, a parallel loop is replaced by a get_global_id(parallelDimension) call. Although the loop structure is kept, the loop iterates in steps of get_global_size(parallelDimension).
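In pseudo-OpenCL, the pattern for a loop parallelized over dimension 0 looks roughly as follows (a sketch, not the exact generated code):

for (int i = get_global_id(0); i < size; i += get_global_size(0)) {
    // original loop body
}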

__kernel void mxmTornadoVM(__global long *_kernel_context,
                           __constant uchar *_constant_region, 
                           __local uchar *_local_region, 
                           __global int *_atomics, 
                           __global uchar *a,
                           __global uchar *b,
                           __global uchar *c,
                           __private int size) {
  // BLOCK 0
  ul_0  =  (ulong) a;
  ul_1  =  (ulong) b;
  ul_2  =  (ulong) c;
  i_3  =  get_global_size(0);
  i_4  =  get_global_size(1);
  ul_5  =  ul_2 + 32L;
  ul_6  =  ul_1 + 32L;
  ul_7  =  ul_0 + 32L;
  i_8  =  get_global_id(0);
  i_9  =  get_global_id(1);
  // BLOCK 1 MERGES [0 8 ]
  i_10  =  i_9;
  for(;i_10 < 1024;)                        // get_global_id(1)
  {
    // BLOCK 2
    i_11  =  i_10 << 10;
    i_12  =  i_11 + 6;
    // BLOCK 3 MERGES [2 7 ]
    i_13  =  i_8;
    for(;i_13 < 1024;)                      // get_global_id(0)
    {
      // BLOCK 4
      i_14  =  i_13 + 6;
      // BLOCK 5 MERGES [4 6 ]
      f_15  =  0.0F;
      i_16  =  0;
      for(;i_16 < 1024;)
      {
               // innermost loop 

       }  // B6
    }  // B7
  }  // B8
}  //  kernel

We see that TornadoVM applies a very well-known compiler optimization called loop interchange, in which two nested loops are swapped to increase performance by reducing memory cache misses.

In a nutshell, given the following loop structure, an optimizing compiler can reorder the loops as long as they are perfectly nested.

[Figure: the canonical triply nested loop structure (i, j, k) of the matrix multiplication]

There are many combinations that an optimizing compiler could apply. For example, interchange i <-> j, or j <-> k, etc. We are going to manually apply the one selected by the TornadoVM JIT compiler, which is interchanging the i and j loops as follows:

[Figure: the loop nest after interchanging the i and j loops]
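In the OpenCL kernel, this interchange amounts to mapping the row index to dimension 1 and the column index to dimension 0. A minimal sketch of the interchanged kernel follows (the actual mxmLI kernel is in the repository linked above and may differ in details):

__kernel void mxmLI(__global float *a, __global float *b, __global float *c, const int n) {
	// The row index now comes from dimension 1 and the column index from dimension 0,
	// so consecutive work-items access consecutive elements of b and c (coalesced accesses).
	uint idx = get_global_id(1);   // i (row)
	uint jdx = get_global_id(0);   // j (column)
	float sum = 0.0f;
	for (int k = 0; k < n; k++) {
		sum += a[idx * n + k] * b[k * n + jdx];
	}
	c[idx * n + jdx] = sum;
}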

To run this version from our OpenCL C++ program:

$ ./mxm -p 2 -k mxmLI 
…
Median KernelTime: 730448 (ns)
Median CopyInTime: 699264 (ns)
Median CopyOutTime: 271024 (ns)
Median TotalTime: 1.08369e+06 (ns)

We have improved our OpenCL program runtime from 4.7 ms to 1.08 ms (4.3x) just by enabling loop interchange.

[Figure: total runtime of each implementation, now including the loop-interchange version]

This optimization increases the speedup over sequential Java from 448x to 1912x, but we are still far from TornadoVM. What else is TornadoVM optimising?

2. FMA Instruction

When we look closely at the TornadoVM-generated code for the matrix multiplication, we see that TornadoVM also generates FMA instructions. FMA (fused multiply-add) is a common optimization that executes a multiplication followed by an addition as a single instruction.

  ...
  ul_30  =  ul_25 + l_29;
  f_31  =  *((__global float *) ul_30);   
  f_32  =  fma(f_23, f_31, f_15);  // FMA operation generated by the TornadoVM JIT Compiler
  i_33  =  i_16 + 1;
  f_15  =  f_32;
  ...

Let’s apply this optimization in our OpenCL kernel:

__kernel void mxmLIfma(__global float *a,  __global float *b, __global float *c, int n) {
	uint idx = get_global_id(1);
	uint jdx = get_global_id(0);
	float sum = 0.0f;
	for (int k = 0; k < n; k++) {
	    float op1 = a[idx * n + k];
	    float op2 = b[k * n + jdx];
	    sum = fma(op1, op2, sum);
	}
	c[idx * n + jdx]  =  sum;
}

When we run this application:

./mxm -p 2 -k mxmLIfma
...
Median KernelTime: 730976 (ns)
Median CopyInTime: 669280 (ns)
Median CopyOutTime: 267440 (ns)
Median TotalTime: 1.07918e+06 (ns)

This looks a bit faster. Let’s plot this new number and compare it with the rest of the implementations.

[Figure: total runtime of each implementation, now including the FMA version]

3. Loop Unroll

Let's continue our exploration of the optimizations. Looking again at the general control-flow structure of the generated kernel, we see that TornadoVM was able to replace the loop bounds (the size parameter from Java) with the actual matrix size:

  // first loop
  for(;i_10 < 1024;)     // get_global_id(1)
  {
    // second loop
    for(;i_13 < 1024;)   // get_global_id(0)
    { 
        for (... ) { ... } 
    }
  }

This can create new optimization opportunities for the underlying OpenCL JIT compiler, because the final compilation step is performed by the OpenCL driver, which in our case is the NVIDIA driver. That compiler could potentially apply loop unrolling and group instructions into vector types. Let's apply loop unrolling explicitly in our OpenCL kernel:

__kernel void mxmLIfmaUnroll(__global float *a,  __global float *b, __global float *c, int n) {

	uint idx = get_global_id(1);
	uint jdx = get_global_id(0);

	float sum = 0.0;

	#pragma unroll 32
	for (int k = 0; k < n; k++) {
	    float op1 = a[idx * n + k];
	    float op2 = b[k * n + jdx];
	    sum = fma(op1, op2, sum);
	}

	c[idx * n + jdx]  =  sum;
}

We can tune the loop unroll factor with different values. In my case, I selected 32 iterations with the intention of matching the warp size on NVIDIA GPUs. However, as we will see, this optimization does not help much in this context. Although the TornadoVM JIT compiler can also apply loop unrolling by itself, I initially thought it could be beneficial for this kernel. So, what is the performance of this implementation?

[Figure: total runtime of each implementation, now including the loop-unrolling version]

As you can see from the previous performance plot, it looks like this version performs worse than the previous optimization. Let’s keep that in mind and continue with the rest of the enhancements in TornadoVM.

4. Compiler Flags

Another optimization that TornadoVM applies is passing, by default, certain compiler flags to the NVIDIA OpenCL JIT compiler in the clBuildProgram call:

-cl-mad-enable -cl-fast-relaxed-math

The first flag allows multiply-add sequences (such as our fma) to be replaced with the mad instruction, which computes a * b + c with reduced accuracy. The second flag tells the NVIDIA JIT compiler to make assumptions about the input data (e.g., no NaNs or infinities), enabling faster computation at the cost of precision.
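In our native OpenCL program, applying this optimization is just a matter of passing the same options when building the program. A minimal sketch (variable names are illustrative, error handling omitted):

const char *buildOptions = "-cl-mad-enable -cl-fast-relaxed-math";
cl_int status = clBuildProgram(program, 1, &device, buildOptions, NULL, NULL);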

When we run this version, we again see a small performance regression. It is not by much, but it reinforces the feeling that the loop unrolling was not a good decision, so let's remove it for the next versions.

[Figure: total runtime of each implementation, now including the version with the compiler flags]

5. Runtime Thread-Block

When programming GPUs, compiler optimizations are as important as scheduling optimizations. One of the things I noticed is that TornadoVM selects a specific block of threads. But what does this mean exactly?

In OpenCL (and also in CUDA and SYCL), developers can select a specific block of threads that runs on a Streaming Multiprocessor (SM) of the GPU. All threads within the same block run on the same SM (roughly equivalent to a CPU core); they can share memory and synchronize. In addition, GPUs can run multiple blocks on the same SM concurrently, increasing resource utilization.

What we want is to maximise resource utilisation and improve the occupancy of the GPU. Choosing the right block size is crucial: if the thread-block is too large, the GPU scheduler cannot place multiple blocks on the same SM; if it is too small, we might underutilize the GPU resources. Finding the right balance is key.
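In OpenCL, the thread-block corresponds to the local work size (the work-group size) passed to clEnqueueNDRangeKernel. As a host-side sketch (illustrative variable names, error handling omitted), launching the kernel with an explicit 2D block of, say, 16x16 work-items looks like this:

size_t globalWorkSize[2] = {1024, 1024};   // one work-item per element of the output matrix
size_t localWorkSize[2]  = {16, 16};       // the thread-block (work-group) size
cl_int status = clEnqueueNDRangeKernel(commandQueue, kernel,
                                       2,                 // work dimensions
                                       NULL,              // global work offset
                                       globalWorkSize,
                                       localWorkSize,     // pass NULL to let the driver choose
                                       0, NULL, NULL);    // no events to wait for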

Unfortunately, there is no single solution; it is mostly trial and error and playing with heuristics. While there are some general guidelines, TornadoVM implements a heuristic that follows the NVIDIA guidelines for improving occupancy.

You can enable the --printThreads option in TornadoVM to dump the selected thread-block along with other scheduling information, which is useful for inspecting the scheduler at run time.

[Figure: output of the --printThreads option showing the thread-blocks selected by TornadoVM]

For this particular problem and the RTX 4090 GPU, TornadoVM selected blocks of threads of 16x16 (in 2D). If we enable the same thread-block scheduling in our OpenCL program, we obtain the following performance:

$ ./mxm -p 2  -f -k mxmLIfma -w 16 

… 
Median KernelTime: 463616 (ns)
Median CopyInTime: 742368 (ns)
Median CopyOutTime: 265824 (ns)
Median TotalTime: 808460 (ns)

The total time is now below one millisecond. When we compare this new version against the previous ones, we see that the new OpenCL program achieves a 2594x speedup over sequential Java, practically the same speedup as TornadoVM!

[Figure: total runtime of each implementation, now including the version with the 16x16 thread-block]

We made it! These are the main optimizations that TornadoVM exploits to achieve its performance. The following performance graph shows the impact of each optimization compared to TornadoVM (the higher, the better).

[Figure: impact of each optimization relative to TornadoVM]

To be fair, as soon as loop interchange and the right thread-block are enabled, we obtain the same speedup as the TornadoVM version.

Discussion about GPU Optimizations

As you can see, TornadoVM is able to optimize a minimally annotated program and achieve great results. But if you know matrix multiplication and GPUs, you will know that more optimizations can be implemented. Two further optimizations play an important role in this application (a sketch combining both follows the list):

  • Exploitation of shared (as in CUDA) or local (as in OpenCL) memory.
  • Using loop-tiling within the kernel.
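The following is a sketch of a tiled matrix multiplication kernel that stages 16x16 blocks of a and b in local memory. This is not TornadoVM-generated code, just an illustration of the technique; it assumes the local work size is 16x16 and that n is a multiple of the tile size.

#define TS 16
__kernel void mxmTiled(__global const float *a, __global const float *b,
                       __global float *c, const int n) {
    const int row = get_global_id(1);       // i (row of c)
    const int col = get_global_id(0);       // j (column of c)
    const int localRow = get_local_id(1);
    const int localCol = get_local_id(0);

    __local float tileA[TS][TS];
    __local float tileB[TS][TS];

    float sum = 0.0f;
    for (int t = 0; t < n; t += TS) {
        // Each work-item loads one element of each tile into local memory
        tileA[localRow][localCol] = a[row * n + (t + localCol)];
        tileB[localRow][localCol] = b[(t + localRow) * n + col];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Multiply the two tiles from fast local memory
        for (int k = 0; k < TS; k++) {
            sum = fma(tileA[localRow][k], tileB[k][localCol], sum);
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    c[row * n + col] = sum;
}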

While the TornadoVM upstream version does not offer these optimizations, there are a few experiments showing their potential in the TornadoVM JIT compiler, achieving an extra 2.5x just by using local memory.

[Figure: performance improvement obtained by using local memory in the TornadoVM JIT compiler]

Link to the paper: link

Hopefully, future TornadoVM versions will integrate this PoC into the main open-source public branch.

Is TornadoVM always better than Native OpenCL?

No, it is not. The example used in this post is not representative of every application you can run with TornadoVM. In some cases, you might get better performance than non-optimized native GPU code. However, this is very rare.

What TornadoVM aims to achieve is a balance between programming effort and performance portability. Hopefully, future TornadoVM versions will include smarter and better optimizations. Besides, since TornadoVM is open source, anyone can contribute to improving it.

FLOPS Achieved for each version

We can measure the Floating-Point Operations Per Second (FLOP/s) achieved by each version and compare it against the maximum theoretical FLOP/s of the system. Since we are running the application on two different processing units, we compare each version against its own theoretical peak: the CPU for the Java versions, and the GPU for the TornadoVM and OpenCL versions.
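For a dense NxN multiplication, the standard operation count is 2·N³ floating-point operations (one multiply and one add per innermost iteration). For N = 1024 this gives 2 × 1024³ ≈ 2.15 GFLOP, so, for example, the sequential Java version achieves roughly 2.15 GFLOP / 2.11 s ≈ 1 GFLOP/s, in line with the table below.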

For the NVIDIA GPU, the manufacturer publishes this number along with all the other GPU specifications:

https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889

We can see that for the 4090 GPU, the theoretical maximum performance in FP32 is 82.58 TFlops (82580 GFLOPS).

For the Intel CPU, I found a public document from Intel specifying the GFLOPS of the most recent Intel processors. The Intel i9-13900K is listed at 1152 GFLOPS (my own theoretical calculation was a bit higher, considering that we can execute up to 16 FMA instructions per cycle, but let's go with what seems to be the official number).

Let's now compute the GFLOP/s achieved and the percentage of the theoretical maximum:

| Version                  | GFLOP/s | Max. Theoretical Perf. | % of Max |
| ------------------------ | ------- | ---------------------- | -------- |
| Java Sequential          | 1.04    | 1152 (CPU)             | 0.09%    |
| Java Parallel Vectorized | 152     | 1152 (CPU)             | 13.19%   |
| TornadoVM                | 2207    | 82580 (RTX 4090 GPU)   | 2.67%    |

In reality, it is very hard to surpass, or even reach, 60% of the maximum theoretical performance, and it depends on many factors such as bandwidth limitations, thermals, etc.

But, looking at these results, it seems that TornadoVM performs very poorly. Let's compare TornadoVM with the matrix multiplication implemented in the CUDA Samples suite, using the same matrix size on the RTX 4090 GPU:

$ ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024 


[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Ada" with compute capability 8.9

MatrixA(1024,1024), MatrixB(1024,1024)
Computing result using CUDA Kernel...
done
Performance= 6510.26 GFlop/s, Time= 0.330 msec, Size= 2147483648 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

CUDA achieves 6.5 TFlops, which is 3x faster than TornadoVM and around 8% of the maximum theoretical peak performance. However, a closer look at the CUDA sample source code reveals that the reported time is the kernel time, not the end-to-end time.

  // Record the start event
  checkCudaErrors(cudaEventRecord(start, stream));

  MatrixMulCUDA<16><<<grid, threads, 0, stream>>>(d_C, d_A, d_B, dimsA.x, dimsB.x);

  // Record the stop event
  checkCudaErrors(cudaEventRecord(stop, stream));

  // Wait for the stop event to complete
  checkCudaErrors(cudaEventSynchronize(stop));

  float msecTotal = 0.0f;
  checkCudaErrors(cudaEventElapsedTime(&msecTotal, start, stop));

To compare like for like, we can enable profiling in TornadoVM to also print the kernel execution time, using the option --enableProfiler console:

{
    "mxm": {
        "COPY_IN_TIME": "0",
        "TOTAL_TASK_GRAPH_TIME": "1082216",
        "TOTAL_DISPATCH_DATA_TRANSFERS_TIME": "12992",
        "TOTAL_KERNEL_TIME": "438144",
        "COPY_OUT_TIME": "992",
        "TOTAL_DISPATCH_KERNEL_TIME": "3616",
        "ALLOCATION_BYTES": "12583104",
        "TOTAL_COPY_OUT_SIZE_BYTES": "4194368",
        "mxm.mxm": {
            "BACKEND": "OPENCL",
            "METHOD": "Multiplication.mxmTornadoVM",
            "DEVICE_ID": "0:0",
            "DEVICE": "NVIDIA GeForce RTX 4090",
            "TOTAL_COPY_IN_SIZE_BYTES": "24",
            "POWER_USAGE_mW": "24254",
            "TASK_KERNEL_TIME": "438144"             // GPU Elapsed Kernel Time by TornadoVM
        }
    }
}

TornadoVM achieves 4.9 TFlops just by looking at the kernel time (0.438 ms)! Not bad at all!

Conclusions

In conclusion, this exploration has shown the impressive performance capabilities of TornadoVM on GPUs, showcasing its ability to sometimes surpass hand-written (albeit naive) native OpenCL code in scenarios like matrix multiplication. Through the automatic application of compiler and runtime optimizations, such as loop interchange and strategic thread-block selection, TornadoVM bridges the gap between the ease of Java programming and the high performance achievable with hardware accelerators.

While the current version may not encompass all possible GPU optimizations, the framework’s potential for future enhancements, coupled with its open-source nature, holds promise for continued advancements in performance portability and streamlined parallel programming experiences.

To know more about TornadoVM: https://www.tornadovm.org/

Acknowledgments

I would like to thank my colleague Thanos Stratikopoulos from the University of Manchester for the constructive feedback.

Appendix

Performance with JMH on the 4090 Server:

Benchmark                                                  Mode  Cnt           Score          Error  Units
MatrixMultiplication.Benchmarking.mxmParallelStreams       avgt    5   105416705.030 ±  3718552.314  ns/op
MatrixMultiplication.Benchmarking.mxmParallelThreads       avgt    5   109084702.302 ±  5743253.306  ns/op
MatrixMultiplication.Benchmarking.mxmParallelVectorized    avgt    5    13586191.583 ±    78030.170  ns/op
MatrixMultiplication.Benchmarking.mxmSequential            avgt    5  2113164943.347 ± 17774213.889  ns/op
MatrixMultiplication.Benchmarking.mxmSequentialVectorized  avgt    5   191306987.055 ±   696420.220  ns/op
MatrixMultiplication.Benchmarking.mxmTornadoVM             avgt    5      814137.714 ±      587.040  ns/op