Optimization Methodology
In this chapter we present an overview of the method used to optimize the volumetric path tracer. It follows the design decisions that should be evaluated when approaching this type of problem; different design paths have different advantages and disadvantages, which we cover in turn.
Figure 4.1 shows the full design decision tree. We first examine how to use the GPU and the CPU together, which leads to a decision between a single-kernel and a multi-kernel approach. We then evaluate how to maximize the utilization of the GPU by regenerating threads that become idle during the computation. Finally, we look at how to maximize memory throughput and data locality by compacting the threads that are still active.
Figure 4.6 shows the scenes used for the tests: two come from the application field of 3D printing and the other two from visualization. All the volumetric path tracer tests shown in this section run without Russian roulette.
Testing Hardware
For testing we use a MacBook Pro with a quad-core Intel Core i7 CPU and an Nvidia GeForce GT 650M GPU with the following characteristics:
- Kepler architecture, compute capability 3.0
- Total amount of global memory: 512 MB
- Total number of registers available per block: 65536 (equal to registers per SM)
- Maximum number of threads per block: 1024
- Total amount of shared memory per block: 49152 bytes
More details about the hardware are provided in the Appendix.
Host Control or Device Control
Single Kernel versus Multi Kernel
The topic of using a single “megakernel” versus multiple lighter kernels has already been widely covered in the literature. A naive single-kernel implementation holds the whole algorithm and therefore requires all the resources associated with its different parts. Nvidia OptiX uses a similar but more sophisticated approach: a just-in-time compiler combines the different stages of the path tracer, expressed as PTX (the assembly language for NVIDIA GPUs), into a single kernel. The kernel is structured as a state machine in which, to minimize execution divergence, a scheduler selects a single state for an entire SIMT unit; a thread that does not require that state remains idle during that iteration. Other authors compared the traditional single-kernel approach to a wavefront formulation, which uses different kernels for different control flows. This method avoids code divergence completely: paths are sorted by their next interaction, divided into bins, and a kernel is launched only for the paths in the same bin. Moreover, the launched kernels are perfectly specialized for their task and can fully exploit the GPU resources. However, this work does not explore that possibility, for the following reasons:
- Dynamic parallelism and global device synchronization: supported only by newer CUDA GPUs (compute capability 3.5 and higher), dynamic parallelism gives a single CUDA thread the ability to launch a new kernel with its own configuration. This technique reduces the need to transfer execution control and data between host and device. Moreover, the global device synchronization introduced with CUDA 9.0 (through cooperative groups) removes the need to launch a new kernel just to synchronize all the threads running on the device.
- The scenes analyzed in this work contain just one object: the volume that we want to render. For this reason there are not many different types of interactions; the only two cases are a point inside the volume or a point on its boundary (exactly as explained in section 1.3). The wavefront formulation, however, is best suited for scenes with many different and complex materials.
Inspired by the results shown in prior work, the only multi-kernel approaches that we analyze are composed of at most two kernels. Table 4.1 shows the difference between using a naive approach with a single kernel call and one with multiple kernel calls. The naiveSK method (naive with a single kernel) is a simple volumetric path tracer implementation in which each thread computes only one path and all the data is stored in local memory (the algorithm follows prior work); a sketch of its structure is given after the table. The naiveMK method (naive with multiple kernels) is a simple volumetric path tracer divided into two kernels: the first generates the paths and computes the first intersection with the object, while the second extends the active paths by computing a new intersection with the scene. The single-kernel approach clearly wins over the multi-kernel one in terms of speed. However, there are also advantages in using multiple kernel calls: from a user perspective, smaller kernels allow the CPU to take back control of the application and, when only one GPU is available in the system, they avoid freezing the user interface for too long.
Method | rays/sec |
---|---|
naiveSK | 3.91 |
naiveMK | 1.19 |
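To make the structure of the naiveSK approach concrete, the following is a minimal sketch of what such a kernel could look like. The types and device functions are hypothetical placeholders (only forward-declared here), not the actual implementation; the point is that one thread owns one complete path and keeps all of its state locally.

```cuda
#include <cuda_runtime.h>

// Hypothetical types and helpers standing in for the real volumetric path tracer.
struct Ray        { /* origin, direction */ };
struct PathResult { /* accumulated radiance */ };
__device__ Ray  generateCameraRay(int x, int y, unsigned int& rngState);
__device__ bool extendVolumePath(Ray& ray, PathResult& result, unsigned int& rngState);
__device__ void writePixel(float4* framebuffer, int pixel, const PathResult& result);

// naiveSK-style kernel: one thread per pixel, one complete path per thread,
// all state kept in local variables/registers.
__global__ void naiveSK(float4* framebuffer, int width, int height, unsigned int seed)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    unsigned int rngState = seed ^ (y * width + x);   // trivial per-thread RNG seed
    Ray ray = generateCameraRay(x, y, rngState);
    PathResult result;

    // The whole path lives in this loop. Path lengths differ between threads,
    // so a thread whose path terminates early sits idle until the longest path
    // in the launch finishes; this is the inefficiency addressed later by
    // regeneration.
    while (extendVolumePath(ray, result, rngState)) { /* keep extending */ }

    writePixel(framebuffer, y * width + x, result);
}
```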
Image Tiling
Another possibility for interleaving control between host and device is image tiling: the image is divided into equal tiles and the device works on each of them separately. This technique lowers the memory requirements of the software, because only the data for a single tile needs to be stored. It also means that the rendering is divided into multiple kernel calls; each kernel renders a single tile and is therefore faster. From this perspective it is important not to use tiles that are too small, because this decreases the utilization of the hardware and therefore the performance (we will discuss utilization in the next section). Consequently, to obtain maximum performance the tile size should depend on the hardware. The tile size also affects the specific algorithm in use: most of the algorithms shown in this work have different memory requirements and different behaviors depending on the tile size. However, our tests show that the behavior of each algorithm under different tile sizes is comparable to using different numbers of samples. Potentially, a tiled image also decreases the time needed to transfer the image from the device to the host. In practice, however, the image is usually small and its transfer time is not comparable with the computation time; for this reason, even when using a second buffer for the image tile (a technique usually called double buffering) and CUDA asynchronous memory copies to overlap computation and image transfers, the performance improvement is negligible (a sketch of such a tiled launch loop is given after the table). Table 4.2 shows the results of using different image tiles. Launching multiple kernels on different image tiles does not affect the performance as long as the number of tiles does not become too high: the computation remains stable until the number of pixels processed per kernel call becomes too small, decreasing the utilization of the device. This result also shows that the real performance bottleneck in dividing the path tracing task is not the kernel launch overhead but the repeated loading and storing of the same data.
| tiling setting | (1,1) | (2,2) | (4,4) | (8,8) | (32,32) | (64,64) |
|---|---|---|---|---|---|---|
| n. paths per tile | 18432000 | 4608000 | 1152000 | 288000 | 18000 | 4500 |
| StreamingSK | 61.27 | 67.98 | 72.72 | 76.13 | 90.66 | 224.04 |
| regenerationSK | 7.99 | 10.02 | 9.44 | 9.48 | 34.11 | 98.40 |
| naiveSK | 219.47 | 223.65 | 221.93 | 224.59 | 224.85 | 123.32 |
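As an illustration of the tiled approach, here is a sketch of a host-side launch loop with double buffering and asynchronous copies. `renderTileKernel`, the tile size, and the buffer handling are placeholders; first/last-iteration handling, writing each tile into the full image, and error checking are omitted.

```cuda
#include <cuda_runtime.h>

__global__ void renderTileKernel(float4* tile, int tileX, int tileY, int tileW, int tileH);

// While the device renders tile i into devTile[buf] on the compute stream, the
// previously finished tile in devTile[1 - buf] is copied back on a copy stream.
void renderTiled(int width, int height)
{
    const int TILE_W = 256, TILE_H = 256;              // illustrative tile size
    const size_t tileBytes = TILE_W * TILE_H * sizeof(float4);

    float4 *devTile[2], *hostTile;
    cudaMalloc(&devTile[0], tileBytes);
    cudaMalloc(&devTile[1], tileBytes);
    cudaMallocHost(&hostTile, tileBytes);              // pinned memory for async copies

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    dim3 block(16, 16);
    dim3 grid(TILE_W / block.x, TILE_H / block.y);

    int buf = 0;
    bool havePrevious = false;
    for (int ty = 0; ty < height; ty += TILE_H)
        for (int tx = 0; tx < width; tx += TILE_W) {
            renderTileKernel<<<grid, block, 0, compute>>>(devTile[buf], tx, ty, TILE_W, TILE_H);
            if (havePrevious)                           // overlap copy of tile i-1 with tile i
                cudaMemcpyAsync(hostTile, devTile[1 - buf], tileBytes,
                                cudaMemcpyDeviceToHost, copy);
            cudaStreamSynchronize(copy);
            cudaStreamSynchronize(compute);
            buf = 1 - buf;
            havePrevious = true;
        }
    // ... copy the last tile, then free the buffers and destroy the streams
}
```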
Summary
In this chapter we have seen the first design decision to be made when approaching this problem: giving control of the program to the host or to the device. In the first case, the advantage is a more responsive UI and, if the kernels are well divided into different tasks, an efficient use of the device resources for each task. In our case, most of the computation concerns the same simple task, the interaction with the medium, so dividing the computation does not improve performance. We have also seen another way to divide the computation, namely rendering separate image tiles. This allows the same algorithms to run on different hardware, since the memory requirements depend on the tile size. Moreover, overlapping communication and computation decreases the latency of retrieving the resulting image. The tile size also affects the utilization of the GPU, which is upper bounded by the number of paths to compute. Indeed, the number of paths needed to compute an image tile is
\[\text{paths} = \text{tile.width} \cdot \text{tile.height} \cdot \text{iterations}.\]

While the number of iterations is given by the user, the tile width and height can be chosen according to the hardware in use, which allows the utilization to be tuned for the specific hardware. In the next section we will see which other factors can decrease the utilization of the GPU and how to prevent them.
Maximizing Utilization and Hiding Multiprocessor Latency
The objective of maximizing the utilization of the device is achieved when the device stays busy until the end of the computation. In a path tracer, the problem is the different lengths of the computed paths: when the path assigned to a thread terminates, that thread remains unused until all the others terminate. To overcome this inefficiency, path regeneration has been proposed. This technique uses a persistent thread system (more information about persistent thread programming can be found in the literature) in which new paths, taken from a global pool, are assigned to idle threads (the path is thereby regenerated).
Another factor that can affect utilization is the number of registers and the amount of shared memory required by the kernel. These resources live in fast on-chip memory, which is limited (on devices of compute capability 3.x, shared memory shares the same physical storage as the L1 cache). When a kernel reaches this limit, the maximum number of resident blocks per multiprocessor decreases. If the kernel requires more registers than are available on the multiprocessor, the compiler attempts to minimize register usage (and can be forced to do so), spilling the excess into local memory (register spilling). If instead the kernel requires too much shared memory, the only option is to reduce the number of resident blocks per multiprocessor. When the number of threads resident on the same multiprocessor is high, the latency of inactive warps, which cannot issue their next instruction, can be hidden efficiently: full utilization is achieved when the warp scheduler can always issue an instruction for some warp during the latency period.
Persistent Thread
The persistent thread style of programming helps a developer separate the task to compute from the hardware running it. The threads stay active for the entire duration of a kernel, and each thread fetches new work from a work queue when it finishes its current task. In our case this means regenerating terminated paths and tracing new ones. A new path can be created from an identification number as in algorithm regenerate, so the work queue reduces to a single global counter that is atomically incremented during regeneration (a sketch of this is given after the list). We analyze the different approaches that can be taken for regeneration:
- Thread regeneration: a single kernel is launched and new paths are assigned per thread. When a thread becomes inactive its path is regenerated.
- Warp regeneration: a single kernel is launched and new paths are assigned per warp. When all the threads in a warp become inactive, the whole warp is regenerated, incrementing the atomic counter only once per warp by the warp size.
- Block regeneration: a single kernel is launched and new paths are assigned per block. At every path extension, if the path queue is not empty, all the threads in a block are synchronized and all the inactive ones are regenerated.
- Device regeneration: multiple kernels are launched, one performing regeneration and another performing path extension. If the path queue is not empty, all the inactive threads are regenerated after each extension.
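As an illustration, the following is a minimal sketch of the thread-regeneration variant. It assumes a hypothetical `PathState` struct and helpers (`initPathFromIndex`, `extendPath`, `accumulate`) standing in for the actual path tracer; the point of the example is the work queue, a single atomically incremented global counter.

```cuda
// Hypothetical per-path state; the real struct would hold the ray, throughput,
// accumulated radiance, RNG state, pixel index, etc.
struct PathState { /* ... */ };
__device__ void initPathFromIndex(PathState& path, unsigned int pathIdx);
__device__ bool extendPath(PathState& path);            // one extension step
__device__ void accumulate(const PathState& path);      // store the finished result

__device__ unsigned int g_nextPath;   // global work queue: next path index

__global__ void regenerationSK(unsigned int totalPaths)
{
    // Each persistent thread claims an initial path from the pool.
    unsigned int pathIdx = atomicAdd(&g_nextPath, 1u);
    PathState path;
    if (pathIdx < totalPaths) initPathFromIndex(path, pathIdx);

    while (pathIdx < totalPaths) {
        bool alive = extendPath(path);
        if (!alive) {
            accumulate(path);
            pathIdx = atomicAdd(&g_nextPath, 1u);        // regenerate immediately
            if (pathIdx < totalPaths) initPathFromIndex(path, pathIdx);
        }
    }
}
// g_nextPath must be reset to zero (e.g. with cudaMemcpyToSymbol) before each launch.
```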
These regeneration techniques require different synchronization points. The first is the classic implementation where a thread is regenerated as soon as it becomes inactive; if only one thread in a warp is regenerating, the remaining threads have to wait for it before continuing. The second uses the warp vote functions (described in the CUDA documentation) to decide whether or not to regenerate the warp; with CUDA 9 and on devices of compute capability 3.0 or higher this can be done efficiently with the warp shuffle functions, which exchange values between threads of the same warp so that each thread obtains the right path (a sketch is given after the table). The third requires synchronizing all the threads in a block and regenerates the whole block at once, similarly to previous work. Finally, the last one regenerates all the processed paths and requires synchronizing the whole device; we did not include it in our study because we have already verified that the multi-kernel approach does not favor our setting. In the implementation presented, the synchronization points are placed after each path extension (unless the path queue is empty). This may not be optimal: the synchronization point for regeneration should ideally be placed exactly where performance starts to degrade for lack of active threads (the next paragraph clarifies what this means). However, that placement is not only hardware dependent but also scene dependent, so we did not include this variation in this work. Table 4.6 shows the results of the different methods on our scenes. The regenerationSK method (regeneration with a single kernel) at thread granularity works best on the scenes with constant density (cgg-logo and ad), for which the warps are almost always synchronized. In the other two scenes warp regeneration performs better: when the paths start to diverge, this method maintains coherence inside a warp by restarting the whole warp at once.
Results in rays/sec:

| method | cgg-logo | ad | artifix | manix |
|---|---|---|---|---|
| naiveSK | 3.88 | 2.13 | 8.34 | 8.45 |
| regenerationSK (thread) | 81.62 | 42.37 | 11.80 | 11.42 |
| regenerationSK (warp) | 17.13 | 17.05 | 12.11 | 14.82 |
| regenerationSK (block) | 3.52 | 3.21 | 7.35 | 7.41 |
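As an illustration of the warp-level variant, the following device function sketches how the warp vote and shuffle intrinsics could be used. It assumes a 1D block whose size is a multiple of the warp size, and reuses the hypothetical `PathState`, `g_nextPath`, and helper functions from the previous sketch.

```cuda
// Warp-level regeneration in CUDA 9 style: the warp regenerates only when all
// of its lanes have terminated, incrementing the global counter once per warp.
// Returns whether this lane's path is alive after the call.
__device__ bool warpRegenerate(PathState& path, bool alive, unsigned int totalPaths)
{
    const unsigned int fullMask = 0xffffffffu;
    if (__all_sync(fullMask, !alive)) {              // every lane has terminated
        int lane = threadIdx.x & 31;
        unsigned int base = 0;
        if (lane == 0)
            base = atomicAdd(&g_nextPath, 32u);      // one increment per warp
        base = __shfl_sync(fullMask, base, 0);       // broadcast the base index
        unsigned int pathIdx = base + lane;
        if (pathIdx >= totalPaths) return false;     // pool exhausted for this lane
        initPathFromIndex(path, pathIdx);
        return true;
    }
    return alive;                                    // otherwise the warp keeps going as is
}
```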
Occupancy
The occupancy of a kernel is the ratio between the number of warps (SIMD groups) active on a multiprocessor when the kernel is launched and the maximum number of warps that can reside there. When occupancy is maximal, all the warps that can simultaneously reside on a multiprocessor are active, and the warp scheduler can hide the latency of warps that cannot execute their next instruction. However, if the kernel requires too many registers or too much shared memory, the number of threads active on a single multiprocessor has to be limited. If the kernel requires too much shared memory, there is nothing the compiler can do to increase occupancy; if the problem is the number of registers, one possibility is to limit that number at the cost of register spilling. In our case, the compiler assigns approximately 64 registers per thread to our volumetric path tracing kernel, which limits occupancy to 50%. To achieve 100% occupancy we would have to force the compiler to reduce the number of registers used by our path tracer.
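For reference, with the 65536 registers per multiprocessor of our test GPU and the 2048 resident threads allowed on a compute capability 3.0 multiprocessor, 64 registers per thread bound the resident threads, and hence the occupancy, as

\[\text{threads}_{\max} = \frac{65536}{64} = 1024, \qquad \text{occupancy} = \frac{1024}{2048} = 50\%.\]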
Bounding Register Usage
Bounding the number of registers used by a kernel is not always a good idea. The compiler already limits register usage by itself and tries to strike the best balance between increasing occupancy and keeping values in registers. Sometimes, however, this decision is not the one that gives the best performance. Another possibility is therefore to force a kernel to run a minimum number of blocks per multiprocessor by limiting the register count even further (a sketch of how this can be expressed in CUDA follows the table). Table 4.4 shows the effect of reaching maximum occupancy with this technique: all the results are worse than the version with half the occupancy in table 4.6. Analyzing the execution shows that, even though the number of warps per multiprocessor is now maximal, the latency is higher than before; the time required to reload the spilled registers exceeds the latency-hiding capability of the device.
Results in rays/sec:

| method | cgg-logo | ad | artifix | manix |
|---|---|---|---|---|
| regenerationSK (thread) | 39.04 | 32.81 | 12.66 | 11.37 |
| regenerationSK (warp) | 4.37 | 7.02 | 11.92 | 10.95 |
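In CUDA, the per-kernel variant of this technique is expressed with the `__launch_bounds__` qualifier; the snippet below is only an example of the mechanism, with numbers chosen to match our hardware, not the configuration actually used in the tests.

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells the
// compiler how many blocks we want resident per SM; it then caps register
// usage, spilling to local memory if necessary. The same effect for all
// kernels in a file can be obtained with nvcc's --maxrregcount=N flag.
__global__ void __launch_bounds__(256, 8)   // 256 threads/block, at least 8 blocks/SM
pathExtendBounded()
{
    // 8 blocks x 256 threads = 2048 resident threads; with 65536 registers per
    // multiprocessor the compiler is limited to 32 registers per thread here.
}
```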
Maximize Register Usage
Increasing the occupancy is not the only way to hide thread latency; another possibility is to exploit instruction-level parallelism. A warp whose subsequent instructions are independent can hide the latency of one instruction by executing the next. This technique lowers the occupancy needed to reach the device's peak performance, which can be estimated with Little's law:
\[\text{parallelism} = \text{latency} \cdot \text{throughput}.\]

We used this technique in our volumetric path tracer by assigning more than one path to each thread (a sketch is shown below). This reduces the dependency between subsequent instructions and therefore hides latency within a single thread through instruction-level parallelism. Recent tests have demonstrated that using more registers per thread, with fewer threads per multiprocessor, can lead to better performance at lower occupancy. We also tested our volumetric path tracer at lower occupancy, but without any visible improvement. The conclusion we can draw is that, in this case, the compiler is already making a near-optimal choice about the number of registers and threads to use.
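A sketch of what this looks like in the regeneration kernel is given below. The number of paths per thread is a template parameter, and the types and helpers are again the hypothetical ones used in the earlier sketches.

```cuda
// Instruction-level parallelism by assigning several paths to each thread: the
// extension steps of independent paths can overlap, so one path's memory
// latency is hidden by another's arithmetic without raising occupancy.
template <int PATHS_PER_THREAD>
__global__ void regenerationILP(unsigned int totalPaths)
{
    PathState    path[PATHS_PER_THREAD];  // more registers per thread, fewer warps needed
    unsigned int idx[PATHS_PER_THREAD];

    #pragma unroll
    for (int i = 0; i < PATHS_PER_THREAD; ++i) {
        idx[i] = atomicAdd(&g_nextPath, 1u);
        if (idx[i] < totalPaths) initPathFromIndex(path[i], idx[i]);
    }

    bool anyAlive = true;
    while (anyAlive) {
        anyAlive = false;
        #pragma unroll
        for (int i = 0; i < PATHS_PER_THREAD; ++i) {   // independent extension steps
            if (idx[i] >= totalPaths) continue;
            if (extendPath(path[i])) {
                anyAlive = true;
            } else {
                accumulate(path[i]);
                idx[i] = atomicAdd(&g_nextPath, 1u);   // regenerate this slot
                if (idx[i] < totalPaths) { initPathFromIndex(path[i], idx[i]); anyAlive = true; }
            }
        }
    }
}
```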
Summary
In this section we have seen different applications of the persistent thread technique, which differ in their synchronization points. We have also seen techniques to achieve better performance at lower occupancy by exploiting instruction-level parallelism and more registers per thread. However, the gap in latency between arithmetic operations and memory operations is large, and the best way to improve performance is to reduce it. For this reason, the next section discusses how to localize memory accesses inside the kernels in order to maximize cache usage and decrease memory access latency.
Data Locality and Code Divergence
So far we have discussed the performance-limiting factors related to fully utilizing the GPU and to the logical communication between host and device. In this section we discuss one of the most important limiting factors in GPGPU: bandwidth. In section 3.2 we saw that accesses at each level of the GPU memory hierarchy have different bandwidths, so localizing data accesses inside a kernel is key to reducing the number of low-bandwidth transfers. Code divergence is another limiting factor that can degrade performance by lowering instruction throughput. When a warp executing a kernel meets a flow-control instruction, it diverges if the threads in the warp follow different execution paths and the divergent instruction sequences are substantial (otherwise predication is used). The hardware maintains a bit vector of active threads and executes the code first for the active threads and then for the inactive ones; when all execution paths are complete, the warp re-converges to the original path. All these limitations are present in a Monte Carlo path tracer, whose control flow, and therefore memory access pattern, is random, so it is not straightforward to create a predetermined data access pattern that can be exploited. Dietger et al. have proposed a solution based on stream compaction of the active threads at every path extension: paths that are still active are compacted together and stored in the global memory of the device. In this section we propose a new method that aims to merge the benefits of compaction with a reordering of the rays that exploits warp and data coherence.
Compaction
As already discussed in the introduction of this chapter, active thread compaction separates the threads into active and idle. There are two main advantages in using this approach for a MC path tracer:
- Increased SIMD efficiency: compacting the active threads yields warps that are either fully active or fully inactive.
- Preserved primary-ray coherence: regenerated paths usually follow the same control flow during the first iteration without any code divergence, and they access the same GPU caches, exploiting data locality.
Also in this case we want to analyze the different approaches to compaction:
- Device compaction (StreamingMK): all the active threads on the device are compacted together. The method requires two global-memory streams of path data, one for input and one for output; the compaction is performed efficiently using an atomic operation that tracks the number of elements written to the output stream. This method requires two kernels, one for regenerating inactive paths and one for extending the paths.
- Tiled block compaction (StreamingSK): the active threads are compacted only inside a block. The path data is still stored in global memory, but the compaction of the active threads is performed only on a tile of this data, as large as the number of threads in a block. The output of the compaction is illustrated in figure 4.8 and the pseudo-code in algorithm streamingSK.
In both cases the compaction is performed by the following steps (shown graphically in figure 4.7; a sketch of the block variant follows the list):
- an exclusive sum over the threads' active flags gives the position of each active thread in the output stream (which is the same as the input stream for the tiled block compaction);
- the number of active threads is updated (the device compaction requires an atomic operation);
- the active threads write their data to the output stream.
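A sketch of the tiled block compaction expressed with the CUB library's block-wide scan is given below. `PathState` and its `alive` flag are the hypothetical pieces from the earlier sketches, and staging the compacted paths through shared memory is only for illustration; the actual implementation may scatter directly in global memory.

```cuda
#include <cub/cub.cuh>   // CUB: header-only library of CUDA block primitives

// Tiled block compaction (StreamingSK): an exclusive prefix sum over the
// "active" flags gives every active thread its compacted position inside the
// tile, and the active paths are rewritten at the front of the tile.
template <int BLOCK_THREADS>
__global__ void compactTile(PathState* tile, int* numActivePerTile)
{
    typedef cub::BlockScan<int, BLOCK_THREADS> BlockScan;
    __shared__ typename BlockScan::TempStorage temp;
    __shared__ PathState staging[BLOCK_THREADS];

    int tid  = threadIdx.x;
    int slot = blockIdx.x * BLOCK_THREADS + tid;
    PathState path = tile[slot];
    int active = path.alive ? 1 : 0;

    int pos, total;
    BlockScan(temp).ExclusiveSum(active, pos, total);  // pos: compacted index, total: #active

    if (active) staging[pos] = path;                   // scatter to the compacted slot
    __syncthreads();

    if (tid < total) tile[slot] = staging[tid];        // write the compacted front of the tile
    if (tid == 0) numActivePerTile[blockIdx.x] = total;
}
```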
In our tests the block compaction shows a performance boost over the device compaction. We believe most of this gain comes from the use of a single kernel. Moreover, compared to regenerationSK at warp granularity, streamingSK obtains better results, demonstrating the improvement given by the compaction.
Results in rays/sec:

| method | cgg-logo | ad | artifix | manix |
|---|---|---|---|---|
| regenerationSK (thread) | 81.62 | 42.37 | 11.80 | 11.42 |
| regenerationSK (warp) | 4.37 | 7.02 | 11.92 | 10.95 |
| streamingSK | 52.69 | 37.81 | 7.42 | 8.02 |
| streamingMK | 6.59 | 4.70 | 5.44 | 5.18 |
Reordering
In this section we discuss a new method based on the previously described streamingSK. We have seen in the previous section that compaction helps maintain high SIMD efficiency and primary-ray coherence. Secondary rays, however, are usually not coherent: they are compacted together with the other active rays independently of their position or direction. The method we propose uses the Morton order, also called Z-order, to reorder the active rays while compacting. Moon et al. have already used this technique to achieve cache-oblivious ray reordering for path tracing, obtaining more than an order of magnitude speedup for large models that do not fit into main memory. Our goal is to extend this work to the GPU and to integrate it with the streaming compaction discussed above.
Morton Order
The Morton order (also called Z-order because of its characteristic shape, see figure 4.9) is based on a space-filling curve, a function that maps multidimensional data to one dimension while preserving the locality of the data points. In our implementation, the z-value of a point is calculated by interleaving the bits of the quantized coordinates of the multidimensional data. The coordinates, which lie between 0 and 1, are transformed linearly into 10-bit integers; these integers are then expanded by inserting zeros between the bits of each coordinate, so that the coordinates can be combined, by interleaving their bits, into a single value. This value is the z-value of the coordinate (the structure of the method is illustrated in figure 4.10). The resulting ordering corresponds to a depth-first traversal of a commonly used 3D data structure, the octree.
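A common formulation of this bit-interleaving step, assuming coordinates already normalized to [0,1], is the following; it produces a 30-bit z-value from three 10-bit quantized coordinates.

```cuda
// Expands a 10-bit integer into 30 bits by inserting two zeros between each bit.
__device__ unsigned int expandBits(unsigned int v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// Computes the 30-bit Morton (z-order) code for a point with coordinates in
// [0,1]: each coordinate is quantized to 10 bits, expanded, and interleaved.
__device__ unsigned int morton3D(float x, float y, float z)
{
    x = fminf(fmaxf(x * 1024.0f, 0.0f), 1023.0f);
    y = fminf(fmaxf(y * 1024.0f, 0.0f), 1023.0f);
    z = fminf(fmaxf(z * 1024.0f, 0.0f), 1023.0f);
    unsigned int xx = expandBits((unsigned int)x);
    unsigned int yy = expandBits((unsigned int)y);
    unsigned int zz = expandBits((unsigned int)z);
    return xx * 4 + yy * 2 + zz;
}
```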
Z-Sorting
In our context, the main source of data is the volumetric data accessed through the texture memory of the device. Texture accesses in CUDA give their best performance when the memory access pattern exhibits spatial locality; reordering the rays in z-order should therefore improve the access pattern on the texture. The coordinate used for the ordering is the position of the ray inside the volume, so the bounding box of the volume is used to quantize the ray position, and the quantized positions are then transformed into z-values as described above. Finally, the rays are reordered with a block-wide radix sort whose keys are the z-values of the ray positions (as shown in figure 4.11); if a thread is not active, the maximum z-value is assigned to its ray. This sorts and compacts the threads inside a single block at the same time (a sketch of this step is given below). We analyzed two possible implementations of the z-sorting:
- Sorting and compacting before the texture access (StreamingSK with sorting variant): uses the same tiled block compaction seen before, but instead of a scan operation the active rays are compacted with the Morton sorting.
- Sorting and compacting after the texture access (SortingSK): delays the albedo texture access of the volumetric path tracer until after the rays have been reordered and compacted (pseudo-code in algorithm sortingSK).
The first option should create more coherent rays, which in the next path extension are more likely to access data that is close in memory. The second option uses a shared array of flags recording whether each thread accessed the albedo texture during the last extension; after reordering and compaction, this array decides which of the reordered paths need to access the albedo texture, and the texture is then read by the reordered threads. In other words, the second option accesses the texture in the most coherent way possible given random access positions.
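The sorting-and-compacting pass itself can be sketched with CUB's block-wide radix sort as below. The `PathState` members, the constant-memory bounding box, and the single item per thread are illustrative assumptions, not the exact implementation.

```cuda
#include <cub/cub.cuh>

__constant__ float3 c_volumeMin, c_volumeExtent;   // bounding box of the volume

// Z-sorting step: each thread computes the Morton key of its ray position;
// terminated rays get the maximum key, so sorting both reorders the active
// rays in z-order and pushes the inactive ones to the back of the block.
template <int BLOCK_THREADS>
__global__ void zSortTile(PathState* tile, int* sortedSlot)
{
    typedef cub::BlockRadixSort<unsigned int, BLOCK_THREADS, 1, int> BlockSort;
    __shared__ typename BlockSort::TempStorage temp;

    int slot = blockIdx.x * BLOCK_THREADS + threadIdx.x;
    PathState path = tile[slot];

    unsigned int key[1];
    int          val[1];
    if (path.alive) {
        // Normalize the ray position to [0,1] inside the volume's bounding box.
        float px = (path.position.x - c_volumeMin.x) / c_volumeExtent.x;
        float py = (path.position.y - c_volumeMin.y) / c_volumeExtent.y;
        float pz = (path.position.z - c_volumeMin.z) / c_volumeExtent.z;
        key[0] = morton3D(px, py, pz);
    } else {
        key[0] = 0xFFFFFFFFu;                       // inactive rays sort to the end
    }
    val[0] = slot;                                  // remember the original slot

    BlockSort(temp).Sort(key, val);                 // sort (z-value, slot) pairs

    sortedSlot[slot] = val[0];   // the next extension can gather the path data
                                 // through this index, processing rays in z-order
}
```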
Results in rays/sec:

| method | cgg-logo | ad | artifix | manix |
|---|---|---|---|---|
| regenerationSK (thread) | 81.62 | 42.37 | 11.80 | 11.42 |
| regenerationSK (warp) | 4.37 | 7.02 | 11.92 | 10.95 |
| streamingSK (compaction) | 52.69 | 37.81 | 7.42 | 8.02 |
| streamingSK (sorting) | 20.41 | 17.61 | 6.78 | 6.74 |
| sortingSK | 20.90 | 17.83 | 7.02 | 6.87 |
Summary
In this section we have discussed different techniques for improving data locality and reducing thread divergence. Compaction creates coherent primary rays and improves SIMD efficiency, while ray reordering tries to improve the coherence of secondary rays inside the volume. Unfortunately, the results show that the increased data coherence does not improve the performance of the algorithm. One reason for this behavior could be that the texture resolution is too high compared to the number of sample points inside the volume. If the ratio between these two values is too large, the method cannot exploit data coherence by reordering the rays: given the random positions of the rays inside the volume, it is improbable that they lie close to each other, so even if the texture is accessed in a coherent order, the texture cache is not large enough to retain the values between one texture access and the next.