Adaptable Sliding Window Operations with CUDA


This project focuses on the issues surrounding writing truly reusable and adaptable GPGPU (specifically CUDA) implementations. This is a crucial issue for sustainable development and maintainance of GPU libraries. While many problems can be written in an easily parameterizable fashion many can not. For these types of problems, current GPGPU coding approaches generally result in kernels that are targetted for a specific problem instance or a small range of problem instances and often are restricted to values fit the hardware architecture.

While there is a significant amount of work begin done in the GPGPU compiler space, libraries are still needed to provide maximum performance. A human can embed knowledge about the algorithm that a compiler may not be able to extract. We are focusing on two-dimensional sliding windows, which provides a rich set of problem instances with widely varying characteristics. Through studying a variety of sliding window problems, we hope to find techniques for selecting the correct parameterization for library elements and writing CUDA code that can adapt to different problem instances and hardware. These techniques will then hopefully be generalizable to problems outside the sliding window space.

By using problem-specific kernel compilation, we have created a CUDA template matching application that can adapt to arbitrary template sizes while maintaining good performance when compared to a multi-threaded CPU implementation. In the future we would like to expand adaptability to additional parameters to cover more of the sliding window problem space and develop knowledge about proper parameterization methods for CUDA library elements.

Background: Adaptable CUDA Code Difficulties

There are a number of characteristics of the CUDA programming environment that make writing general code difficult. These difficulties arrise from limitations of the CUDA architectural abstraction and the actual GPU harware implementation. Adaptability is then further complicated by changes in both aspects over time.

Architecture Abstraction

CUDA specifies an abstract hardware target with extreme amounts of available parallelism, in the form of thousands of threads. To organize this parallelism, CUDA provides a two-level thread heirarchy composed of thread blocks and the grid. Thread blocks are limited to 512 threads total and can have one, two, or three logical dimensions. The grid specifies a one or two dimensional space consisting of identical thread blocks, and scales up to 65,536 blocks in either dimension. The threads within a thread block have access to shared resources such as shared memory and registers. Threads within the same block can communicate and synchronize with one another effeciently, while threads from different blocks generally can not.

This thread hierarchy defines and restricts the structure of the parallelism available in CUDA. Adjustable grid sizes allow for the scalability of CUDA applications as well as hardware realizations. However, applications are generally limited to small-scale local (to the same thread block) communication.

Hardware Realization

Real-world hardware limitations also impose restrictions on CUDA GPU kernels. In NVIDIA hardware, blocks are executed by streaming multiprocessors (SMs). The resources per SM are limited to 16,384 32-bit registers and 16 KB of shared memory. To hide latencies and increase throughput, multiple thread blocks are often mapped to a single SM. In this case, the available resources are shared equally among all resident thread blocks, which often decreases the ideal upper bound for per-thread block resource usage.

During execution thread blocks are are flattened to a single dimension and broken up into groups of 32 consecutive threads. These groups are called warps, and threads within a warp execute the same instruction. As a result, whenever a warp encounters conditional instructions, the warp must execute all necessary branches and the threads not participating in a given branch are masked. This characteristic imposes a performance penalty for divergent code.

Memory Heterogeneity

There are also memory-related issues when working with the CUDA environment. CUDA provides a number of different memory types, and each has different performance characteristics and limitations. The following table summarizes a number of important characteristics for each of the memory types:

Memory Type Scope Size Limitations Read/Write Cached O(Latency)
Register Thread 64 KB RW NA 1
Shared Block 16 KB RW No 10
Global Global 64 MB - 4 GB RW No 100
Constant Global 64 KB R Yes 100
Texture Global Varies R Yes 100

In addition to the differences among common characteristics, each memory type also has some unique aspects that can either facilitate increased or inhibit kernel performance. For global memory, memory accesses can be grouped together into larger batch transactions in a process called coalescing. While the details vary among GPUs, generally half-warps should access contiguous memory locations. Bank aliasing can be an issue for both global and shared memory. Global memory has a device dependent number of banks, and shared memory is configured with 16 banks. When using constant memory, all threads within a half-warp must access the same address. When bank aliasing occurs or different addresses in constant memory are requested, memory accesses are serialized introducing additional latency. Texture memory is optimized for 2D locality and can provide some special interpolation and addressing modes that can simplify kernel implementation.

Changes Over Time

In addition to the base level of complexity when working with CUDA, adaptable implementations are made more difficult based on variations in CUDA over time. The above discussion uses numbers for GPUs with compute capability 1.3, but changes in both the amount block-level resources have been made and new features have been added. The following table shows the number of block-level resources for each CUDA Compute Capability level:

Compute Capability Maximum Threads Shared Memory Shared Memory Banks 32-bit Registers
1.0 - 1.1 512 16 KB 16 8 K
1.2 - 1.3 512 16 KB 16 16 K
2.0 - 2.1 1024 16 or 48 KB 32 32 K

Each level of increasing compute capabilty has also added new features that can be utilized by kernel implementers. Some major feature additions have included double-precision floating-point support (1.3), 32-bit global memory atomic operations (1.1), and the surface memory type (2.0), which is similar to texture memory but supports write as well as read accesses. In addition to new features, the relative costs of various operations has changed. The biggest changes have come with the introduction of Fermi (2.x) GPUs, and are summarized below:

Feature 1.x 2.x
24-bit Integer Arithmetic Intrisic Emulated
32-bit Integer Arithmetic Emulated Intrinsic
Texture:Global Memory Bandwidth ~1 <1
Cached Global Memory No Yes
Scalability Issues

As a result of all of this complexity, current CUDA programming practices restrict implementations to target a specificproblem instance or a small range of similar problem instances. While this practice is often sufficient when building a single application, it prevents significant code resuse and is a problem for library developers. Implementing and maintaining a large number of problem specific GPU kernels is not a feasible approach for library developers. Our goal is to create adaptable implementations that maintain as much performance as possible while covering as much of the given problem space as possible. This requires developing both adaptable coding techniques as well as proper parameterization of the library elements.

Sliding Window Problem Space

We have selected the general 2D sliding window problem as an initial area for exploring the issues related adaptability and parameterization. Sliding windows have a large number of applications, including: convolution/correlation (with a focus on non-separable), linear image filtering, tracking (template matching), stereo vision block matching, and particle image velicmetry (PIV).

Considering only non-separable approaches, there are exisiting GPU impelemntations for accelerating 2D convolution and correlation that operate in both the time domain and the frequency domain. In the time domain, these implementations are optimized for small or fixed template sizes and sizes that fit well with GPU architectures. Frequency domain approaches can achieve good performance, but the operations performed for each window location must be amenable to frequency domain implementation, which is only the case for certain operations.

In general, the 2D sliding window problem has a large parameter space with problem instances distinguished by the following characteristics:

  • Size of the region of interest (ROI)
  • Number and spacing of ROIs
  • Size of the template
  • Number and source of templates
  • Search space per ROI
  • Operation per window location

As a specific example of the sliding window problem space, consider linear image filtering, as shown in the following image:

Linear image filter operations per window location

For linear image filter, there is often a single ROI covering the entire input image. Additionally, there is usually a single fixed template that generally is only a few pixels per side and much smaller than the image. The search space covers every possible location for the center of the template, resulting in a large search space that often requires padding on the edges of the image. At each window location, an element-wise multiplication and a summation are performed

Another example of a different part of the problem space is represnted by basic block matching used in stereo vision applications to generate a disparity map. As shown in the image below, a large number of small ROIs are defined around every pixel in one image (left). A template unique to that ROI is extracted from the other image (right) centered around the corresponding pixel. In contrast to linear image filter, we have a large number of relativley small ROIs (often with purely horizontal search) and many unique templates.

Offset dependent template example

Large-Template Tracking Application

The first application that we have targetted for acceleration is a tumor tracking application developed by Cui et al. A screenshot of the application is shown below.

The application uses multiple fixed templates with Person's correlation as a similarity score. Pearson's correlation is performed by the corr2() function in the MATLAB Image Processing Toolbox, and is represented by the following equation:

The A and B in the above equation represent matrix mean. So, for a given template, A, the contribution to the denominator is fixed for all sliding window positions, as is the templates contribution to the denominator. Since the templates are calculated once and used for the duration of a session, these parts of the corr2() function can be computed once and reused, reducing the computation to the following:

It should be noted that the value of B, the frame data mean, depends on the current window location. This aspect of the computation prevents an efficient frequency domain implementation.

Data Sets

We have been working with a number of sample data sets, which are summarized in the following table:

PatientFramesTemplates Template Size (pixels) Shift ±V/±H (pixels)

The sizes of the templates are specified by hand and may vary based on tumor size as well as human factors. As a result, the range of template sizes varies significantly. They are also much larger than the templates usually considered for GPU-accelerated sliding window applications. Another difference is that while there is only one ROI per frame, like linear image filtering, the ROI and corresponding search space is much smaller. These factors place this tumor tracking application into a different part of the sliding window problem space.

Problems Mapping to CUDA

There are a couple problems that arrise when considering a mapping of this problem into CUDA. The first has to do with data placement and memory selection. A common approach for mapping sliding window operations is to assign one or more template positions to a thread. The threads within a block will be assigned adjacent sets of one or more window positions per thread. This works well with the shared memory in NVIDIA GPUs, as adjacent window positions require all of the same underlying image data with the exception of one row or column. Loading the union of needed image data into shared memory and processing out of shared memory significantly decreases the demands on global memory and keeps frequently accessed values in the SM. In general, proper usage of shared memory is crucial for GPGPU performance.

A problem with this approach is that the template sizes in our data sets are larger than usually considered. Only the template and ROI for the smallest patient will fit into shared memory for 1.x devices.

A second issue with mapping this tracking application to CUDA is where to generate parallelism. There are only a few templates. Assuming streaming operation only one ROI is resident at a time, and for each ROI the search space is small - 95 to 703 search positions.

Tiled Template

Our solution to these problems is to break the large templates into tiles and process the tiles separately. Since the outer loops of both the numerator and denominator of the corr2() function are summations, the tiles can be processed independently. With each tile mapped to a thread block, this approach generates more sources for independent parallelism and reduces the working set size for shared memory. A second step is then required to combine the contributions of each tile to produce the final result.

With tiles mapped to the CUDA grid this approach scales to arbitrarily large template sizes, limited only by the limitations on grid size (65,536 in each direction). As can be seen in the above image, a tile size that executes efficiently on a target GPU may not cover an arbitrary template size with an integer number of tiles. In these cases, there will be leftover tiles on the right and/or bottom. When both right and bottom leftover tiles are present, there will also be a corner tile. The defintion of the corr2() function, with the matrix mean subtraction, complicates the padding.

Our current implementation only adjusts one dimension for these edge case tiles. Only the width of the tile will be adjusted for the tiles on the right, and only the height of the tile will be adjusted on the bottom. Depending on the main tile size chosen, this can affect the number and type of leftover tiles. As an example of this, the following table shows various tile sizes and counts for Patient 4 (156×116 pixel template size) when different main tile sizes are selected:

156×116 pixel template
Main Tile Size 8×8 16×10 4×4
Main Blocks 19×14 9×11 39×29
Right Tile Size 8×4 16×6
Right Blocks 19 9
Bottom Tile Size 4×8 12×16
Bottom Tiles 14 11
Corner Tile Size 4×4 12×6

More addition is used to combine the contributions of each tile in a reduction kernel. The intermediate data is grouped by tile. Since each tile is shifted over the same shift area regardless of its dimensions, the main, left, bottom, and corner tiles have the same data layout. This attribute results in a relatively simple and fully coalesced reduction kernel.

The frame data averages, frame data contribution to the denominator, and the numerator are performed in order each each consists of the two stages described above: a tiled summation step and then a reduction. A simple element-wise kernel is used to implement the final fraction.

Current Performance

For the six data sets above, arbitrary tile sizes ranging from 4×4 to 16×16 were benchmarked. The GPU results were compared with reference MATLAB and multi-threaded C implementations of the tumor tracking application. The hardware used for these results was an Intel Xeon W3580 (4 i7 cores @ 3.33 GHz and 6 MB L2 cache) and a NVIDIA Tesla C1060. The machine was running Ubuntu 10.04 64-bit, MATLAB R2010a and CUDA 3.1.

The graph below shows the total application run times for the data sets, including data transfer. The tiled GPU implementation achieves good performance across the data sets. The only scenario where the GPU is not faster than the multi-threaded CPU implementation is for Patient 2, where the small size of the template combined with the small search area prevent enough parallelism to be generated for efficient GPU execution, even when using small templates.

The following table presents the best speedup over the multi-threaded CPU version combined with which of the arbitrary main tile sizes achieved the best performance:

Patient 1 2 3 4 5 6
Template Size 53×54 23×21 76×45 156×116 86×78 141×107
Best Tile Size 16×2 4×4 8×8 16×10 8×8 8×8
Total Speedup 3.61 0.64 4.09 6.92 5.66 6.86

As can be seen in the graph below, the numerator dominates the GPU version runtime. This is because the numerator is the most complicated step as well as the fact that it has to be performed for each template. In contrast, the frame data averages and denominator contribution can be computed only once per frame and are applied to each template.

Ongoing Work

The tiled implementation described above is being used as the basis for further exploration of the issues surrounding adapatable CUDA implementations. Ongoing research includes augmenting the tiled template GPU kernels to handle arbitrary shift areas, examining the proper dimensions for parameterizaiton of a sliding window library element for both problem and hardware target adaptation, and examining the benefits of problem-specific kernel compilation.

Shift Area Tiling

The results described above used a maximum thread count of 192 threads per block. The benchmakred implementation uses one thread per window location, so any time a patient data set consisted of more than 192 search positions per ROI, multiple kernel calls were used. Furthermore, a maximum of 512 (for 1.x GPUs) threads per block is possible for a kernel launch, and one of the patient data sets consists of 703 search positions. In general, between 128 and 256 threads per block is a desirable target for CUDA applications.

To eliminate the need for multiple kernel launches and enable the GPU implementation to handle sliding window problem instances with much larger search areas, we are working on tiling the search space in manner similar to that used for tiling the template. This would allow a single implementation to cover problems with large templates, like the tumor tracking application above, as well as problems with a large search space, such as linear image filtering. Tiling of the search space can take the form of assigning multiple shift positions to each thread as well as launching a larger grid with more thread blocks to cover the extra search area.


Besides the parameters related to the sliding window problem, the current GPU impelentation introduces two implementation parameters: tile size and shift area per block. These parameters allow the implementation to trade limited block resources for the less constrained grid dimensions.

In general, these types of tradeoffs can be used to help adjust implementations to different execution hardware. For example, adjusting the tile size trades shared memory requirements for grid area. The shift area assigned to each block trades grid area for both thread count per block and shared memory requirements. With these implementation parameters, for example, the data working set size can be adjusted to the target GPUs shared memory size (16 KB or 48 KB) and maximum threads per block count (512 vs 1024).

Obtaining knowledge and generating heuristics for selecting the values for implementation parameters for an incoming problem instance is one of the main goals of our ongoing work. We would also like to determine which target hardware parameters are relevant for making these decisions, or if some type of empirical benchmarking search is necessary.

Problem-Specific Kernel Compilation

While not mentioned in the discussion of the tiled template implementation, our GPU kernels are compiled at run time specifically for the current problem. As it stands, run time compilation is used to compile in up to four custom copies of the main computation loops, one for each needed tile size. This allows for full unrolling of the loops, and important GPU compile-time optimization that is not avaiable when the kernel is compiled once before the template dimensions have been determined. Loops for the reductions can also be unrolled. When main tile sizes that are powers of two are selected, the compiler can perform strength reduction on modulus and division operations converting them to bitwise operations.

This approach eliminates overheads assocated with handling the edge cases tiles and their random tile sizes. A good example of this phenomenon occurs for Patient 4. While a 4×4 pixel main tile size eliminates edge-case tiles, the best performance was achieved with a 16×10 pixel main templatesize.

While the current runtime compilation uses appear to be beneficial, particularly in terms of reducing per thread register usage, we are currently exploring greater use of problem-specific kernel compilation. This can be of particular help when dealing with adaptable GPU kernels. GPUs tend not to deal well with control heavy code (adaptable code), as things like thread divergence and loops have relatively high overheads. By adding shift area tiling, we are also mapping an increasing number of parameters into the fixed CUDA thread hierarchy requiring an increasing number of non-compute instructions used to have each thread calculate its offest into the problem space. Compiling custom kernels for each problem helps to sidestep some of these issues by moving repeated expensive runtime evaluation to compile time.

In addition, this approach also has some hardware target adaptability benefits. For example, CUDA C supports C++ style templates that can be used to select between * and __[u]mul24() for integer arithmetic depending on the target. Any additional knowledge the compiler has about the target can also be applied.

An area of interest is exploring and analyzing the benefits of run time kernel compilation. We would like to quantify the benefits, if any, beyond annectdotal experiences of loop unrolling and register usage reduction. In addition, we would like to know the limits of this approach: When should parameters be compiled into a kernel? When should customized instances be included? How much can be done in terms of hardware adaptability? Can we swap out the per-window location processing to reduce redundant code?

Particle Image Velocimetry

The tumor tracking application has provided a starting point for this work and a represents a unique part of the problem space. In order expand problem space coverage and help us derive heuristics for automatic configuration selection and parameterization, we are currently looking at particle image velocimetry (PIV).

PIV is an active area of interest for our group and has been studied for FPGA implementation. Our current formulation of the problem uses multiple non-adjacent ROIs per frame with a small to medium search. This places the problem in a different part of the sliding window problem space.


  1. Y. Cui, J. G. Dy, G. C. Sharp, B. Alexander, and S. B. Jiang, "Multiple Template Based Fluoroscopic Tracking of Lung Tumor Mass without Implanted Fiducial Markers," Physics in Medicine and Biology, Vol. 52, pp. 6229-6242, 2007.
  2. A. Bennis, M. Leeser, G. Tadmor, "Implementing a Highly Parameterized Digital PIV System On Reconfigurable Hardware", 20th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), pp. 32 - 37, 2009.
  3. N. Moore, M. Leeser, and L. S. King, "Efficient template matching with variable size templates in CUDA", IEEE 8th Symposium on Application Specific Processors (SASP), pp. 77 - 80, 2010.
  4. N. Moore, M. Leeser, and L. S. King, "Adaptable and Efficient Variable Size Template Matching in CUDA", 14th High Performance Embedded Computing Workshop (HPEC), 2010.
  5. N. Moore, M. Leeser, "Accelerating a MATLAB Application with Nvidia GPUs: a Case Study for GPU Library Construction", 13th High Performance Embedded Computing Workshop (HPEC), 2009.