Understanding Memory Allocation Size Limitations with Level Zero

Key Takeaways

  • By default, there is a size limitation when allocating memory buffers with Level Zero
  • The maximum size can be queried using the device properties from the Level Zero API
  • There is a way to bypass the maximum size limitation with an extended allocation mode offered by Level Zero

Performance engineers and developers always want to use all the capabilities of the available hardware to increase performance and energy efficiency. This not only applies to functional units and full occupancy of the device, but also to memory. For example, if a GPU has 12GB of physical memory, we would like to use it all.

A naive approach would be to allocate and manage buffers as large as possible, so that the whole memory of the accelerator is utilized. However, if you are coming from OpenCL, you might know that the size of a single buffer can be limited to as little as ¼ of the total memory capacity of the GPU. Quoting from the OpenCL standard:

“CL_DEVICE_MAX_MEM_ALLOC_SIZE - Max size of memory object allocation in bytes. The minimum value is max (1/4th of CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024) for devices that are not of type CL_DEVICE_TYPE_CUSTOM.”

Additionally, this has been an active discussion topic in implementations such as the NVIDIA OpenCL implementation. This limitation means that we might be underutilizing the GPU memory if only one buffer is used.
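To make the quoted rule concrete, here is a small sketch of the arithmetic. The helper name openclMinMaxAlloc is my own, not an OpenCL call, and I assume the 12GB GPU from the introduction has 12 GiB of global memory. The value computed is only the lower bound the standard allows; a real driver may report a larger limit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ull * 1024ull;
constexpr std::uint64_t kGiB = 1024ull * kMiB;

// Smallest value the quoted rule permits for CL_DEVICE_MAX_MEM_ALLOC_SIZE
// on a non-custom device: max(1/4 of global memory, 128 MiB).
// Hypothetical helper for illustration; not part of any OpenCL API.
std::uint64_t openclMinMaxAlloc(std::uint64_t globalMemBytes) {
    assert(globalMemBytes > 0 && "a device is expected to report some global memory");
    return std::max<std::uint64_t>(globalMemBytes / 4, 128 * kMiB);
}
```

With these assumptions, openclMinMaxAlloc(12 * kGiB) gives 3 GiB, so a conforming implementation may leave three quarters of the memory out of reach of any single buffer.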

To overcome this limitation, another approach would be to use multiple small buffers. However, sometimes this is not possible, especially if kernels are autogenerated by a tool or a compiler and/or memory allocation and management are handled automatically.
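As a rough illustration of the multiple-buffer approach, the sketch below splits a requested size into chunks that each fit under a given per-buffer limit. The function planChunks is a hypothetical helper of mine, and it only computes sizes; allocating the chunks and adapting kernels to work on them is exactly the part that may be impractical in the situations described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Split totalBytes into chunk sizes, none larger than maxAllocBytes.
// Hypothetical helper: it plans sizes only and performs no allocation.
std::vector<std::uint64_t> planChunks(std::uint64_t totalBytes,
                                      std::uint64_t maxAllocBytes) {
    assert(maxAllocBytes > 0 && "the per-buffer limit must be non-zero");
    std::vector<std::uint64_t> chunks;
    while (totalBytes > 0) {
        const std::uint64_t chunk = std::min(totalBytes, maxAllocBytes);
        chunks.push_back(chunk);
        totalBytes -= chunk;
    }
    return chunks;
}
```

For example, a 10 GiB request under a 4 GiB limit is planned as three chunks of 4 GiB, 4 GiB, and 2 GiB.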

Yet this limitation is not present in alternative parallel and heterogeneous programming frameworks such as CUDA. In this post we want to explore the memory capabilities of the Level Zero API and examine its constraints with respect to memory allocation.

But first, a disclaimer: I am not an expert on the Level Zero API. This is just a collection of notes and gathered information to better understand some of the concepts within Level Zero. Therefore, feel free to comment and share your views and solutions.

Checking the Maximum Buffer Allocation Size

Let’s start by checking the maximum allocation size of a single buffer. This value can be queried by using the device properties as follows:

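ze_device_properties_t deviceProperties = {};
// Tell the driver which structure type it is filling in
deviceProperties.stype = ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES;
zeDeviceGetProperties(device, &deviceProperties);
// Maximum size, in bytes, of a single allocation
auto maxSize = deviceProperties.maxMemAllocSize;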

The maxMemAllocSize field of the device properties holds the maximum size, in bytes, of a single buffer allocation. This value should be similar, if not identical, to the OpenCL max allocation size. On my computer setup [3], this value is 4294959104 bytes (8192 bytes short of 4 GiB).
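Once the limit is known, an application can compare a request against it before calling the allocator. The sketch below shows that check together with a conversion to GiB for printing. Both helper names are my own and are not part of Level Zero; the limit is passed in as a plain integer, so the snippet does not depend on the Level Zero headers.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kGiB = 1024ull * 1024ull * 1024ull;

// Convert a byte count (for instance maxMemAllocSize) to GiB for display.
double toGiB(std::uint64_t bytes) {
    return static_cast<double>(bytes) / static_cast<double>(kGiB);
}

// True if a single allocation of requestedBytes stays within the limit
// reported by the device. Hypothetical helper, not a Level Zero call.
bool fitsInSingleAllocation(std::uint64_t requestedBytes,
                            std::uint64_t maxMemAllocSize) {
    assert(maxMemAllocSize > 0 && "query the device properties first");
    return requestedBytes <= maxMemAllocSize;
}
```

Note that with the value reported on my system, a request for exactly 4 GiB (4294967296 bytes) does not pass this check, because the limit is slightly below 4 GiB.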

Allocating Buffers with Level Zero

To allocate a single buffer in Level Zero, we can use the zeMemAllocShared, zeMemAllocDevice, and zeMemAllocHost functions. If the requested buffer size is larger than the maximum size set for the device, an error is returned during the allocation of the buffer.

status = zeMemAllocShared(context, 
                          &memAllocDesc, 
                          &hostDesc,
                          allocSize, 
                          64,
                          device,
                          &sharedA);
if (status == ZE_RESULT_ERROR_UNSUPPORTED_SIZE) 
   std::cout << "Size argument is not supported by the device" << std::endl;

Therefore, by following the steps until this point, we are limited to a 4GiB buffer size (on a configuration such as my system). As mentioned earlier, in some situations, we would like to allocate larger buffer sizes (e.g., for large batch jobs). Now, let’s take a look at how this can be achieved with the Level Zero API.

Relaxed Allocation Size Limits

There is a way to tell the Level Zero memory allocation functions that we want to exceed the default maximum allocation size and, therefore, be able to allocate larger buffers. We do this by declaring a variable of type ze_relaxed_allocation_limits_exp_desc_t and setting the flag ZE_RELAXED_ALLOCATION_LIMITS_EXP_FLAG_MAX_SIZE as follows:

ze_relaxed_allocation_limits_exp_desc_t exceedCapacity = {
       ZE_STRUCTURE_TYPE_RELAXED_ALLOCATION_LIMITS_EXP_DESC,
       nullptr,
       ZE_RELAXED_ALLOCATION_LIMITS_EXP_FLAG_MAX_SIZE
   };

Now we need to pass this struct variable to the descriptor of the memory allocator to be used. For instance, if we want to allocate a buffer using shared memory, we can do this as follows:

ze_device_mem_alloc_desc_t memAllocDesc = {ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC};
memAllocDesc.flags = ZE_DEVICE_MEM_ALLOC_FLAG_BIAS_CACHED;
memAllocDesc.ordinal = 0;
// Chain the relaxed-limits descriptor onto the device descriptor
memAllocDesc.pNext = &exceedCapacity;

ze_host_mem_alloc_desc_t hostDesc = {ZE_STRUCTURE_TYPE_HOST_MEM_ALLOC_DESC};
// Chain it onto the host descriptor as well
hostDesc.pNext = &exceedCapacity;

result = zeMemAllocShared(context, 
                 &memAllocDesc, 
                 &hostDesc, 
                 6147483648,  // Alloc Size set to ~6GB
                 alignment, 
                 device, 
                 &sharedBuffer);

if (result == ZE_RESULT_ERROR_UNSUPPORTED_SIZE) {
        std::cout << "size argument is not supported by the device \n";
}
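The pNext field used in the descriptors above forms a singly linked chain of extension structures, each tagged with its structure type. The following sketch illustrates the general idea of walking such a chain to find one extension. All types here (StructType, BaseDesc, RelaxedLimitsDesc, DeviceMemAllocDesc) are simplified stand-ins that I made up for illustration; they are not the real Level Zero definitions, and this is not a description of how any actual driver is implemented.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-ins, NOT the real Level Zero types.
enum class StructType : std::uint32_t { DeviceMemAllocDesc, RelaxedAllocationLimits };

// Common header: every descriptor starts with a type tag and a link.
struct BaseDesc {
    StructType stype;
    const void* pNext;
};

struct RelaxedLimitsDesc {
    BaseDesc base;        // header must be the first member
    std::uint32_t flags;
};

struct DeviceMemAllocDesc {
    BaseDesc base;        // header must be the first member
    std::uint32_t flags;
    std::uint32_t ordinal;
};

// Follow pNext links from head until a descriptor of the wanted type is
// found. Returns nullptr if the chain does not contain one.
const void* findInChain(const void* head, StructType wanted) {
    int depth = 0;
    for (const void* p = head; p != nullptr;) {
        ++depth;
        assert(depth <= 64 && "chain is unexpectedly long; is it cyclic?");
        const auto* base = static_cast<const BaseDesc*>(p);
        if (base->stype == wanted) return p;
        p = base->pNext;
    }
    return nullptr;
}
```

In this simplified model, chaining a RelaxedLimitsDesc behind a DeviceMemAllocDesc lets findInChain locate it from the head of the chain, which mirrors why the relaxed-limits descriptor has to be reachable through pNext from the descriptors passed to the allocation call.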

On GitHub, there is a code sample that illustrates how to use this flag for large buffer size allocations. If you execute this sample with the following argument, you can see that a shared memory buffer of about 6GB is allocated.

$ ./levelZeroAlloc 6147483648  
====================================
Device   : Intel(R) UHD Graphics [0x9bc4]
Type     : GPU
Vendor ID: 8086
====================================

Allocating Shared: 6147483648 bytes - 6.14748 (GB) 

Warning Note and Possible Ideas to Improve

From my experience using the extended size mode, it is the programmer's responsibility to ensure that the requested size can be satisfied. In my experiments, if I allocate a few buffers that together exceed 26 GB (the maximum memory capacity of my system), the application crashes and exits without reaching any error handler. This could be something to improve in Level Zero, perhaps with calls such as:

zeCanDeviceBufferSizeBeAllocated(...) 
zeCanHostBufferSizeBeAllocated(...) 
zeCanSharedBufferSizeBeAllocated(...) 
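Until something like this exists, an application can keep its own books. The sketch below is one possible way to do that, assuming the application knows a total capacity (for example the 26 GB of my system) and routes every allocation and free through the same tracker. AllocationBudget is a hypothetical class of mine, not a Level Zero API, and it cannot see memory used by other processes or by the driver, so it only reduces the risk of over-allocation.

```cpp
#include <cassert>
#include <cstdint>

// Tracks how many bytes the application itself has reserved against a
// known capacity. Hypothetical helper; it performs no real allocation.
class AllocationBudget {
public:
    explicit AllocationBudget(std::uint64_t totalBytes) : total_(totalBytes) {}

    // Reserve requestedBytes if they still fit; otherwise change nothing.
    bool tryReserve(std::uint64_t requestedBytes) {
        // reserved_ never exceeds total_, so the subtraction cannot wrap.
        if (requestedBytes > total_ - reserved_) return false;
        reserved_ += requestedBytes;
        return true;
    }

    // Give back bytes after the matching buffer has been freed.
    void release(std::uint64_t bytes) {
        assert(bytes <= reserved_ && "releasing more than was reserved");
        reserved_ -= bytes;
    }

    std::uint64_t remaining() const { return total_ - reserved_; }

private:
    std::uint64_t total_;
    std::uint64_t reserved_ = 0;
};
```

The intended use is to call tryReserve before each allocation and skip the allocation (or report an error) when it returns false, then call release after freeing the buffer.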

As a side note, device buffers are residency allocations, or lazy allocations. Quoting Jaime Arteaga from Intel:

“This works as this: you can allocate several allocations, as long as each allocation is up to the maximum allocatable size. [ … ] That is, you could have N allocations, but your workload might need only one at time when executing in the device. So actually, the device memory doesn’t need to hold all simultaneously.”

If you are interested in this topic, you can follow the discussions online for more information: https://github.com/oneapi-src/level-zero/issues/89

Summary

This post showed how to check the maximum size of a buffer allocation in Level Zero, and how to use the relaxed allocation limits to bypass the default maximum buffer size. You can find a complete example on GitHub.

Discussions

If you have any comments or suggestions, feel free to contact me or join the discussion thread for this post:

https://github.com/jjfumero/jjfumero.github.io/discussions/5

References

  1. https://github.com/oneapi-src/level-zero/issues/88
  2. Device Memory Properties
  3. Laptop Configuration
  4. https://github.com/oneapi-src/level-zero/issues/89

Acks

I want to thank Christos Kotselidis from The University of Manchester for the constructive feedback on this post.