Reconfigurable and GPU Computing Laboratory

Portable Application Framework for Heterogeneous Systems

James Brock, Miriam Leeser, and Mark Niedre

Abstract

Architecturally diverse systems, comprising any variety and number of processing elements (multicore processors, FPGAs, GPUs, etc.), have become common in high-performance computing. Each of these architectures presents its own programming challenges and complexities, and often has its own programming language, development environment, and processing constraints. MIT Lincoln Laboratory developed the Parallel Vector Tile Optimizing Library (PVTOL) as a means of writing high-performance signal and image processing code that is portable across a large number of multicore general-purpose computing architectures [1].

This work extends the PVTOL Tasks and Conduits framework to include support for Graphics Processing Units (GPUs), alleviating the difficulty of interfacing with different GPU architectures and creating portable applications. The PVTOL Tasks and Conduits API provides a consistent, portable programming model that hides the complexity of the underlying processor configuration and memory hierarchies, including support for both task and data parallelism. With this framework, it is easy for a programmer to construct an application pipeline, like the one depicted in Figure 1. Tasks are hierarchical, modular abstractions for data processing that contain single program multiple data (SPMD) code, which can execute on one or more processing units. Conduits synchronize and transfer data between tasks. We have extended this framework to include graphics processing units, making use of NVIDIA's Compute Unified Device Architecture (CUDA) [2].

PVTOL Tasks and Conduits

The original version of PVTOL Tasks and Conduits contained support for general-purpose homogeneous and heterogeneous processing platforms (CPUs, clusters, etc.), but had not been extended to include graphics processing units and other coprocessing elements. Extending the PVTOL Tasks and Conduits framework to include these devices allows programmers to exploit the processing power of GPUs in their applications with minimal effort. A version of Tasks and Conduits for GPU coprocessors has been developed to provide abstractions for intelligently allocating memory, transferring data between the host and device, and executing kernels. Support for GPUs is seamlessly integrated with the existing PVTOL Tasks and Conduits API using the Compute Unified Device Architecture (CUDA). The PVTOL Tasks and Conduits framework is based on the concept of separating and abstracting data processing and data transfers. The PVTOL API provides a consistent, portable programming model that hides the complexity of the underlying processor configuration and memory hierarchies. Additionally, the GPU Tasks and Conduits can execute kernel functions from popular third-party function libraries, such as GPUFFTW, CUFFT, and CUBLAS, reducing the amount of user programming needed for many high-performance applications.

PVTOL Tasks are hierarchical, modular structures that abstract data processing so that it can be divided among processing elements in the system. Tasks enable data parallelism by encapsulating single program multiple data (SPMD) code, and are mapped to one or more processing elements [1]. Another concept used in PVTOL is that of a map, which describes how data within a Task will be allocated among processing elements. The Task and Conduit constructs in PVTOL use C++ templates to allow maps to be passed as template arguments at initialization. This separates the developer's code from how data processing is mapped to the hardware, thus achieving portability across different hardware platforms.
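The idea of passing a map as a template argument can be illustrated with a minimal C++ sketch. The names below (`BlockMap`, `ScaleTask`, `range`, `run`) are hypothetical and are not the PVTOL API; the sketch assumes a simple contiguous block distribution and runs each processing element's share sequentially as a stand-in for concurrent execution.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical map policy (not the PVTOL map class): splits n elements into
// contiguous blocks, one block per processing element (PE).
struct BlockMap {
    // Returns the half-open index range [begin, end) owned by element `pe`.
    static std::pair<std::size_t, std::size_t>
    range(std::size_t n, std::size_t num_pes, std::size_t pe) {
        assert(num_pes > 0 && pe < num_pes);
        const std::size_t base = n / num_pes;
        const std::size_t extra = n % num_pes;  // the first `extra` PEs get one more
        const std::size_t begin = pe * base + (pe < extra ? pe : extra);
        const std::size_t len = base + (pe < extra ? 1 : 0);
        return {begin, begin + len};
    }
};

// Hypothetical task: applies the same code (SPMD style) to each PE's share
// of the data, where the share is chosen by the Map template argument.
template <typename Map>
class ScaleTask {
public:
    explicit ScaleTask(std::size_t num_pes) : num_pes_(num_pes) {}

    void run(std::vector<float>& data, float factor) const {
        // Sequential stand-in for work that would run concurrently on each PE.
        for (std::size_t pe = 0; pe < num_pes_; ++pe) {
            const auto r = Map::range(data.size(), num_pes_, pe);
            for (std::size_t i = r.first; i < r.second; ++i) data[i] *= factor;
        }
    }

private:
    std::size_t num_pes_;
};
```

In this sketch, `ScaleTask<BlockMap> task(3); task.run(data, 2.0f);` scales every element of `data` once. Swapping `BlockMap` for another policy with the same `range` function would change the data distribution without touching the processing code, which is the separation the paragraph above describes.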

PVTOL Conduits are responsible for managing the data buffers, data transfers, and synchronized data communication between Tasks. A Conduit can manage data communication between two Tasks that use either separate memory systems or a shared one. When a Conduit's endpoint Tasks use separate memory systems, a separate data buffer is allocated on each, and data is transferred from source to destination as soon as it is available at the source. If the Conduit's two endpoints share the same memory system, a single data buffer is allocated. Whenever the source endpoint has finished filling a buffer with data, it releases control of the data and the destination endpoint takes control of the same buffer for reading. To alleviate potential data hazards, PVTOL Conduits support multi-buffering. Once the source Task fills a data buffer, it can begin filling the next consecutive data buffer while the destination Task reads the available data from the first buffer. Figure 1 shows an example pipeline application built with standard PVTOL Tasks and Conduits. Figure 2 shows the same example built with PVTOL GPU Tasks and Conduits. The only change is the use of a GPU kernel within the DAT task instead of a standard CPU function.
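The multi-buffering behavior described above can be sketched in C++ as a small ring of buffers shared by one source and one destination. This is an illustrative model only, not the PVTOL Conduit implementation: the class name `RingConduit` and its `write`/`read` methods are assumptions, and the sketch covers only the shared-memory case with exactly one source thread and one destination thread.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical shared-memory conduit with a fixed ring of buffers.
// Safe only for a single source thread and a single destination thread.
template <typename T>
class RingConduit {
public:
    RingConduit(std::size_t num_buffers, std::size_t buffer_len)
        : buffers_(num_buffers, std::vector<T>(buffer_len)) {
        assert(num_buffers > 0);
    }

    // Source side: waits for a free buffer, fills it, then hands it over.
    void write(const std::vector<T>& data) {
        std::size_t slot;
        {
            std::unique_lock<std::mutex> lock(mutex_);
            not_full_.wait(lock, [this] { return full_ < buffers_.size(); });
            slot = head_;
        }
        buffers_[slot] = data;  // filled outside the lock so a read can overlap
        {
            std::lock_guard<std::mutex> lock(mutex_);
            head_ = (head_ + 1) % buffers_.size();
            ++full_;
        }
        not_empty_.notify_one();
    }

    // Destination side: waits for a full buffer, copies it out, then
    // returns the buffer to the source.
    std::vector<T> read() {
        std::size_t slot;
        {
            std::unique_lock<std::mutex> lock(mutex_);
            not_empty_.wait(lock, [this] { return full_ > 0; });
            slot = tail_;
        }
        std::vector<T> out = buffers_[slot];  // read outside the lock
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tail_ = (tail_ + 1) % buffers_.size();
            --full_;
        }
        not_full_.notify_one();
        return out;
    }

private:
    std::vector<std::vector<T>> buffers_;
    std::size_t head_ = 0;  // next buffer the source fills
    std::size_t tail_ = 0;  // next buffer the destination reads
    std::size_t full_ = 0;  // number of buffers holding unread data
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
};
```

With two or more buffers, the source can fill one buffer while the destination is still reading the previous one; with a single buffer the two sides simply alternate. A conduit between separate memory systems would replace the in-memory copy with a transfer between the two memories.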

Figure 1: Example PVTOL application

Figure 2: Example PVTOL GPU application

Fluorescence Mediated Tomography

Fluorescence mediated tomography (FMT) is a rapidly developing field of study that is yielding a means of molecular imaging that supports non-invasive, 3-dimensional visualization of live animals. FMT uses fluorescent indicators to highlight particular types of tissue and molecules, so as to make them more responsive to the wavelengths of light being transmitted through them. The technique has useful applications in a number of medical imaging fields [2]. Briefly, image reconstruction in FMT involves three steps: i) optical measurement of the fluorescence intensity transmitted through an animal between light source and detector pairs, ii) accurate modeling of light propagation between source and detector pairs to yield system weight functions (i.e. the forward problem), and iii) inversion of the resulting system of equations to yield the fluorescence image. A major challenge in FMT is the high degree of light scatter through biological tissue, which limits the potential imaging resolution of the technique [3]. This work focuses on step ii, the forward problem.

Several authors (including Niedre et al.) have shown that measuring early-arriving photons selects photons that have taken a straighter path through the biological tissue than in conventional, continuous-intensity FMT [6], and can therefore effectively reduce light scatter. As a result, the image resolution obtained with early-photon tomography (EPT) is significantly improved versus standard FMT. Accurate modeling of photon propagation at early time gates is a difficult problem, particularly when arbitrary physical geometries and highly heterogeneous media need to be modeled. Briefly, the forward problem (i.e. step ii above) entails modeling photon propagation through diffusive biological tissue between all source and detector pairs, which frequently number in the thousands for a particular EPT (or FMT) scanner. A brief depiction of the process can be seen in Figure 3.

Figure 3: (a) X-ray CT of a mouse with two implanted fluorescent tubes (arrows) in the torso. Sinograms of the transmitted fluorescence (normalized to transmitted intensity) as a function of rotation angle are shown for (b) un-gated and (e) early photons. Intermediate angles (less than the 5° rotation step size) were determined using a standard re-binning algorithm. Inspection of a single projection where the tubes are oriented perpendicular to the imaging axis of the CCD shows that the tubes are difficult to separate with (c) un-gated photons, but are easily visualized with (f) early photons. The size and location of the reconstructed tubes are shown for (d) un-gated and (g) early photons. The early-photon image is significantly more accurate in terms of tube size and separation distance.

Monte Carlo techniques represent a rigorous but flexible method of modeling photon propagation in biological tissue [5]. Propagation is modeled numerically by simulating large numbers of photons incident on a tissue volume. Despite its higher accuracy in modeling early photons, Monte Carlo has not to date been routinely used to calculate the forward problem in EPT because computation times are prohibitively long: each simulation requires a long processing time, and thousands of weight functions, one for each source-detector pair, must be calculated. Further, only a small fraction of the total incident photons exit at specific early time gates and detector positions, so large numbers of photons (> 10⁹) must be simulated to obtain accurate statistics. A major motivation of this work is therefore to use GPU acceleration to reduce Monte Carlo processing times so that efficient calculation of the early-photon forward problem using Monte Carlo is feasible. To our knowledge, this has not been done previously.
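To make the time-gating idea concrete, the following C++ sketch simulates photons crossing a homogeneous slab and records how much of the transmitted signal arrives within an early time gate. It is a deliberately simplified illustration and not the FMT application's code: it assumes a one-dimensional slab geometry, isotropic scattering, absorption applied through the Beer-Lambert law along the total path, and hypothetical names (`Medium`, `SlabResult`, `simulate`) and units.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

// Hypothetical result record for this sketch.
struct SlabResult {
    double transmitted = 0.0;  // total weight exiting the far side (un-gated)
    double early = 0.0;        // part of that weight arriving within the gate
};

// Hypothetical description of a homogeneous slab; units are illustrative.
struct Medium {
    double mu_s;       // scattering coefficient [1/mm]
    double mu_a;       // absorption coefficient [1/mm]
    double thickness;  // slab thickness [mm]
    double speed;      // speed of light in the medium [mm/ps]
};

SlabResult simulate(const Medium& m, std::uint64_t num_photons,
                    double gate_ps, std::uint32_t seed) {
    assert(m.mu_s > 0.0 && m.thickness > 0.0 && m.speed > 0.0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::exponential_distribution<double> step(m.mu_s);  // mean free path 1/mu_s
    SlabResult result;

    for (std::uint64_t p = 0; p < num_photons; ++p) {
        double z = 0.0;          // depth inside the slab
        double path = 0.0;       // total distance travelled so far
        double cos_theta = 1.0;  // photon is launched straight into the slab
        while (true) {
            const double s = step(rng);
            const double z_next = z + s * cos_theta;
            if (z_next >= m.thickness) {
                // Exits the far side: count the path only up to the boundary.
                path += (m.thickness - z) / cos_theta;
                const double weight = std::exp(-m.mu_a * path);
                result.transmitted += weight;
                if (path / m.speed <= gate_ps) result.early += weight;
                break;
            }
            if (z_next < 0.0) break;  // escaped back through the entry face
            z = z_next;
            path += s;
            cos_theta = 2.0 * uni(rng) - 1.0;  // isotropic scattering
        }
    }
    return result;
}
```

For example, `simulate(Medium{1.0, 0.01, 5.0, 0.2}, 100000, 50.0, 42u)` uses a 5 mm slab, so no photon can arrive before 25 ps. Only photons with short, nearly straight paths fall inside the 50 ps gate, so the `early` total is a small part of `transmitted`. This is why very large photon counts are needed for good early-gate statistics.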

Current results

The project currently includes implementations of GPU Tasks and of the following GPU Conduits:

Host → GPU
GPU → Host
GPU_x → Host → GPU_y
GPU_x → GPU_x

Additionally, two basic versions of the FMT application have been built using the PVTOL Tasks and Conduits framework: a C/C++ version and a CUDA version. A diagram of the application can be seen in Figure 4.

Figure 4: FMT GPU Application for photon propagation

To verify the accuracy of the GPU-enabled application, the output of the C/C++ code was compared with that of the CUDA code; both are shown in Figure 5. The test simulated 10 million photons travelling from one source to one detector through a homogeneous medium of biological tissue, for both un-gated and early-arriving photons. The end-to-end speed-up (including all memory allocation, transfer, and setup) achieved by running the code on the GPU rather than the CPU is currently approximately 35x, reducing the run time from about 8 hours to about 13 minutes.

Figure 5: Output of a C/C++ simulation of 10 million photons from one source to one detector for un-gated photons (a) and early time-gated photons (b). The output of the same simulation run in CUDA for un-gated photons (c) and early time-gated photons (d).

References

[1] Hahn Kim et al., "PVTOL: providing productivity, performance and portability to DoD signal processing applications on multicore processors," in DoD HPCMP Users Group Conference, 2008. DOD HPCMP UGC, 2008, pp. 327–333.
[2] Sanjeev Mohindra et al., "Task and conduit framework for multi-core systems," in DoD HPCMP Users Group Conference, 2008. DOD HPCMP UGC, 2008, pp. 506–513.
[3] R. Weissleder and M. J. Pittet, "Imaging in the era of molecular oncology," Nature, vol. 452, no. 7187, pp. 580–589, Apr. 2008.
[4] A. H. Hielscher, "Optical tomographic imaging of small animals," Current Opinion in Biotechnology, vol. 16, no. 1, pp. 79–88, Feb. 2005.
[5] S. L. Jacques, L. Zheng, and L. Wang, "MCML-Monte Carlo modeling of light transport in multi-layered tissues," Computer Methods and Programs in Biomedicine, vol. 47, no. 2, pp. 131–146.
[6] Mark Niedre et al., "Early photon tomography allows fluorescence detection of lung carcinomas and disease progression in mice in vivo," Proceedings of the National Academy of Sciences, vol. 105, no. 49, pp. 19126–19131, Dec. 2008.
[7] Lihong V. Wang and Hsin-i Wu, Biomedical Optics: Principles and Imaging. Wiley-Interscience, 1st edition, May 2007.