A Tutorial for using OpenCL in ImageJ
This tutorial is meant to help you leverage OpenCL from Java for use with ImageJ.
Contents
- 1 Background
- 2 Setting up the Development Machine
- 3 Setting up OpenCL
- 4 Setting up Eclipse and needed plugins on Ubuntu
- 5 Downloading and running the ImageJ OpenCL examples
- 6 Understanding platform-specific JOCL native libraries
- 7 SobelFilter example
- 8 Understanding ImageJ + OpenCL
- 9 ImageJ OpenCL: An incremental approach to applying OpenCL
- 10 OpenCL ImageJ plugins following enterprise java patterns
- 11 Hosting OpenCL-accelerated algorithms using Oracle's GlassfishV3
- 12 GPU Based Processing Techniques and the ImageJ Architecture
Background
To use OpenCL from Java in ImageJ we rely on JOCL. JOCL is written on top of a low level JNI API to make using OpenCL a bit easier. The OpenCL code you write can also leverage JOCL to accelerate execution of ImageJ plugins from Java. We have created an OpenCL deconvolution example to demonstrate compute acceleration using OpenCL (both locally and remotely as a binary web service).
Setting up the Development Machine
We set up an Ubuntu based development machine for OpenCL development and testing. OSX 10.6, Windows 64/32, and Linux 64/32 pass the tests and are supported by this package. Here are the steps for setting up Ubuntu:
- Obtain an ISO for Ubuntu 10.04
- Install Ubuntu on the target machine.
- Install g++ with:
sudo apt-get install g++
- Install needed development libs with:
sudo apt-get install freeglut3-dev
sudo apt-get install libxi-dev
sudo apt-get install libxmu-dev
Setting up OpenCL
You will need to install OpenCL for your OpenCL enabled hardware if it does not come installed as part of the OS.
Setting up OpenCL for ATI
If you have ATI GPU hardware, check out these instructions.
Setting up OpenCL for NVidia
For NVidia hardware, install development drivers, CUDA Toolkit, and GPU Computing SDK code samples by following these installation instructions.
Use the wget tool for downloading the three needed install files from NVidia's download site.
wget http://developer.download.nvidia.com/compute/cuda/3_2_prod/drivers/devdriver_3.2_linux_64_260.19.26.run
wget http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/cudatoolkit_3.2.16_linux_64_ubuntu10.04.run
wget http://developer.download.nvidia.com/compute/cuda/3_2_prod/sdk/gpucomputingsdk_3.2.16_linux.run
Stop the graphical desktop manager by typing:
sudo service gdm stop
from the command line.
Install the required files:
sudo sh devdriver_3.2_linux_64_260.19.26.run
sudo sh cudatoolkit_3.2.16_linux_64_ubuntu10.04.run
sh gpucomputingsdk_3.2.16_linux.run
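Per the installation instructions, we set up the environment variables (in .bashrc). Note that PATH is extended rather than overwritten, so the standard system commands remain available:
export LD_LIBRARY_PATH="/usr/local/cuda/lib:/usr/local/cuda/lib64"
export PATH="/usr/local/cuda/bin:$PATH"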
Test the installation by compiling and running a few of the NVidia-provided samples. Start with the CUDA samples by changing directories to:
/NVIDIA_GPU_Computing_SDK/C
and running:
make
We ran into an error and ended up editing the file
/NVIDIA_GPU_Computing_SDK/C/common.mk
by replacing line 169 with:
NVCCFLAGS += --compiler-options -fpermissive
and re-running:
make
Change directories to
/NVIDIA_GPU_Computing_SDK/C/bin/linux/release
and run
./bandwidthTest
to check the binary CUDA install.
Build the OpenCL examples by changing directories to
/NVIDIA_GPU_Computing_SDK/OpenCL
and running:
make
Change directories to
/NVIDIA_GPU_Computing_SDK/OpenCL/bin/linux/release
to run the OpenCL Bandwidth sample using:
./oclBandwidthTest
Setting up Eclipse and needed plugins on Ubuntu
To configure the development environment, we started by installing the JDK with:
sudo apt-get install openjdk-6-jdk
and downloading Eclipse for J2EE Developers from:
http://www.eclipse.org/downloads/download.php
and following the Eclipse installation steps.
We added the SVN plugin to Eclipse by clicking on Help -> Install New Software and adding the SVN adapter site:
http://subclipse.tigris.org/update_1.6.x
Downloading and running the ImageJ OpenCL examples
The ImageJ OpenCL examples can be imported as an Eclipse project by right-clicking in the Package Explorer window and choosing Import. Select Git project and add the site:
https://github.com/uw-loci/opencl-decon
Import the branch and assign a general project name like imagej-opencl.
The folder structure of the source consists of the following:
- src - Java and OpenCL source files (extension .cl). Notice the files fht.cl and sobel.cl in the src directory. When executed, the Java code provided in the examples compiles these OpenCL sources for execution. Note: Runtime compilation of the OpenCL source files allows execution on any OpenCL-enabled device.
- sourcedata - (Point Spread Function) PSF and 3D data used as a small sample data set for the FHT3D Example.
- lib - libraries needed for classes using JOCL, ImageJ, and Hessian 4.0.7
We have included the necessary JOCL native libraries for Windows 32/64, Apple, and Linux 32/64 platforms inside this directory. To use OpenCL from Java in ImageJ we leverage JOCL. JOCL uses JNI to make calls into the OpenCL API. The OpenCL code you write can also leverage JOCL to accelerate execution of ImageJ plugins from Java. Since each OS has its own JOCL native libraries, the runtime environment must be configured so that the Java code can load the libraries that match the platform.
Understanding platform-specific JOCL native libraries
For these samples, three native libraries are needed: gluegen-rt, jocl, and JOCL-'platform'-'arch'. If you look in the lib folder, you will find -natives-xyz.jar files containing the respective libraries. You need to unzip each of the three jar files and copy the dynamic library files (.so, .dylib, or .dll) into the parent directory if they are not already present. Notice the below example where the libgluegen-rt.dylib, libJOCL-apple-x86_64.dylib, and libjocl.dylib files are in the platform specific directory.
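As a quick way to check that the files described above are in place, the following is a minimal sketch in plain Java. The class NativeLibNames and its methods are hypothetical helpers written for this tutorial, not part of JOCL, and the naming rules (lib prefix plus .so or .dylib, or a bare .dll) are the simplified conventions assumed here.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: not part of JOCL or ImageJ.
class NativeLibNames {

    /** Maps a base library name to the file name assumed for the given OS name. */
    static String fileNameFor(String baseName, String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check Mac first: "darwin" also contains the substring "win".
        if (os.contains("mac") || os.contains("darwin")) {
            return "lib" + baseName + ".dylib";
        }
        if (os.contains("win")) {
            return baseName + ".dll";
        }
        return "lib" + baseName + ".so";
    }

    /** Returns the expected library files that are not present in the given directory. */
    static List<String> missingIn(File dir, String osName, String... baseNames) {
        List<String> missing = new ArrayList<>();
        for (String base : baseNames) {
            String fileName = fileNameFor(base, osName);
            if (!new File(dir, fileName).isFile()) {
                missing.add(fileName);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String os = System.getProperty("os.name");
        File dir = new File(args.length > 0 ? args[0] : ".");
        // Add the platform specific JOCL-'platform'-'arch' name for your machine as a third entry.
        System.out.println("Missing in " + dir + ": " + missingIn(dir, os, "gluegen-rt", "jocl"));
    }
}
```

Running it with the library directory as the only argument prints the files that still need to be copied; an empty list suggests the directory is complete for the two platform-independent names.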
Then ensure that the platform specific jar is exported during the project build. For example, notice that the JOCL-0.1.4-beta1.jar file is referenced in the project. (To see this menu, right-click the project and choose Properties -> Java Build Path -> Libraries.)
Finally, ensure that the platform specific files are exported:
Start exploring the examples by viewing the developer comments in the file src/publication/SobelFilterExample.java. Notice that the main() method calls run(), which takes an awt.Image as its input parameter. Modify and run the main() method as a Java application, adjusting the VM arguments (e.g. -Xmx1024m) if needed.
SobelFilter example
Without modification, SobelFilterExample.java loads an image from a web server, processes it locally using OpenCL, and displays the results. There is nothing novel about this example. It simply exercises several system configuration steps at runtime to confirm that the JOCL and OpenCL native libraries are set up correctly. Modify this example to suit your needs, but please ensure proper JOCL and OpenCL configuration before proceeding.
Understanding ImageJ + OpenCL
Working within ImageJ: If you are developing an ImageJ plugin that uses OpenCL, note that programmatic control is passed to your plugin inside the run() method of PlugIn (or PlugInFilter). An example of this can be found in src.demos.OpenCL_SobelFilter.java. For this plugin to run within ImageJ, the JOCL jars and the native libraries for the target platform need to be available to the ImageJ class loader. The supporting JOCL native libraries can be copied into the plugins directory within ImageJ to allow plugin implementations using OpenCL to reference the native libraries provided by the OpenCL installation.
ImageJ OpenCL: An incremental approach to applying OpenCL
Now that you have demonstrated use of OpenCL from Java and within ImageJ, you may wish to see a compute intensive example in which an existing Java implementation is modified to delegate a portion of its work to OpenCL. Take a look at the developer comments in the FHT3D_3D_Deconvolution.java example to see how data is brokered between Java and OpenCL between the steps of an algorithm's implementation.
The approach used to start delegating to OpenCL from an existing Java implementation:
- Assess the performance of the existing implementation to identify the most compute intensive region of code
- Develop a test data set before and after that region
- Write OpenCL code that replaces the compute intensive region
- Test to ensure the new OpenCL code generates the same results using the test data
- Add conditional delegation logic to handle runtime compute capabilities
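The last two steps in the list above can be sketched in plain Java. This is only an illustration under stated assumptions: the class Delegation, its methods, and the doubling "algorithm" are hypothetical stand-ins, and the accelerated path is represented by any function that may fail to load or run. It is not how FHT3D_3D_Deconvolution.java is written.

```java
import java.util.function.UnaryOperator;

// Hypothetical helper illustrating result comparison and conditional delegation.
class Delegation {

    /** Largest absolute difference between two result arrays of equal length. */
    static float maxAbsDiff(float[] expected, float[] actual) {
        if (expected.length != actual.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        float max = 0f;
        for (int i = 0; i < expected.length; i++) {
            max = Math.max(max, Math.abs(expected[i] - actual[i]));
        }
        return max;
    }

    /** Tries the accelerated path first and falls back to the Java path if it cannot run. */
    static float[] run(float[] data,
                       UnaryOperator<float[]> accelerated,
                       UnaryOperator<float[]> javaFallback) {
        try {
            return accelerated.apply(data);
        } catch (LinkageError | RuntimeException e) {
            // Covers a missing native library as well as a device or driver failure at runtime.
            return javaFallback.apply(data);
        }
    }

    /** Stand-in for the existing Java implementation of the compute intensive region. */
    static float[] doubled(float[] in) {
        float[] out = new float[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = 2f * in[i];
        }
        return out;
    }

    public static void main(String[] args) {
        UnaryOperator<float[]> unavailable = in -> {
            throw new UnsatisfiedLinkError("native library not found");
        };
        float[] result = run(new float[] {1f, 2f}, unavailable, Delegation::doubled);
        System.out.println("max difference: " + maxAbsDiff(new float[] {2f, 4f}, result));
    }
}
```

When comparing an OpenCL result with the Java reference, a small tolerance on maxAbsDiff is usually more realistic than exact equality, since floating point results can differ slightly between devices.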
OpenCL ImageJ plugins following enterprise java patterns
Finally, some users and academic labs are building "GPU Supercomputers" to expose compute resources to a wide range of applications running locally. In this case, you may wish to look at the FHTEJBService and Iterative_Deconvolve_3D_WS classes for an example of how to serve your GPU accelerated resources remotely using open source J2EE technologies.
In this example, Hessian Binary Web Services are used to broker data between the Java consumer and the Hessian Servlet. This approach is only recommended for those labs having sufficient throughput between the client application and the OpenCL/GPU servlet host.
Hosting OpenCL-accelerated algorithms using Oracle's GlassfishV3
Start out with the installation instructions available here.
To set up OpenCL support on Glassfish for deploying the ImageJ/Fiji Java based EJBs, navigate to the system's lib directory (for example: /opt/glassfishv3/glassfish/lib) and install the required jars/native libs.
sudo wget http://jogamp.org/deployment/webstart/jocl-natives-linux-amd64.jar
sudo unzip jocl-natives-linux-amd64.jar
sudo rm -rdf META-INF
sudo rm jocl-natives-linux-amd64.jar
sudo wget http://jogamp.org/deployment/webstart/gluegen-rt-natives-linux-amd64.jar
sudo unzip gluegen-rt-natives-linux-amd64.jar
sudo rm -rdf META-INF/
sudo rm gluegen-rt-natives-linux-amd64.jar
sudo wget http://jocl.org/downloads/JOCL-0.1.4-beta1-bin-linux-x86_64.zip
sudo unzip JOCL-0.1.4-beta1-bin-linux-x86_64.zip
sudo mv JOCL-0.1.4-beta1-bin-linux-x86_64/*.so .
sudo mv JOCL-0.1.4-beta1-bin-linux-x86_64/*.jar .
sudo rm -rdf JOCL-0.1.4-beta1-bin-linux-x86_64
sudo rm JOCL-0.1.4-beta1-bin-linux-x86_64.zip
sudo wget http://jogamp.org/deployment/webstart/gluegen-rt.jar
sudo wget http://jogamp.org/deployment/webstart/gluegen.jar
sudo wget http://jogamp.org/deployment/webstart/jocl.jar
The only other step needed to set up Glassfish to support JOCL is to log in to the admin console and, under Common Tasks › Configuration › JVM Settings › Path Settings, set:
Native Library Path Prefix: /opt/glassfishv3/glassfish/lib
GPU Based Processing Techniques and the ImageJ Architecture
Introduction
The primary focus of this paper is to provide an introduction to and evaluation of two common GPU technologies (CUDA and OpenCL) as they could be used within ImageJ. The intent is to provide a light introduction to the software libraries used to perform two basic image processing tasks and present performance metrics that may be useful for deciding future efforts in this area.
Many of the algorithms within ImageJ and ImageJ plug-ins can be implemented to take advantage of GPU and multi-core CPU processors. Having the capability to support plugins that leverage 'many-core hardware processors' poses important architectural issues for ImageJ. An intent of the ImageJ refactoring effort is to implement support for native code integration in a manner that leverages hardware devices 'behind-the-scenes'. Performance is not as important as compatibility with external native libraries and ease of use by non-programming scientists.
Note: 'device' refers to the GPU based hardware that executes the kernels, and 'host' refers to the CPU based system that controls the device.
Background on the use of ImgLib
The future release of ImageJ will adopt the ImgLib generic processing library. A very minor change has been introduced into the ImgLib codebase that allows data to be stored in java.nio buffers. The NIO backed arrays are allocated outside of the Java heap and allow a single copy of the data to be shared with native code.
There are several issues that are encountered when developing GPU based code:
- Byte ordering differences need consideration when using NIO Buffers and exchanging data between different hardware devices with different byte ordering.
- The amount of available host memory, device memory, number of GPU processors, and the computational capabilities of devices may vary significantly.
To address these issues, helper methods can be used to dynamically assess a given host's capabilities at runtime. Working memory for the device and host are important along with the performance characteristics for a device. Profiling performance is also important in assessing a device since a device that is shared between several applications may achieve lower performance than if the device is not shared.
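A minimal sketch of such helper methods for the host side is shown below, using only the standard Java library. The class HostProbe and its method names are hypothetical. Device side figures such as device memory are not visible from plain Java and would have to come from the OpenCL API (for example the CL_DEVICE_GLOBAL_MEM_SIZE device query), which this sketch does not attempt.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical host-side helper; device capabilities must be queried through OpenCL.
class HostProbe {

    /** True if the buffer already uses the byte order of the underlying platform. */
    static boolean hasNativeOrder(ByteBuffer buffer) {
        return buffer.order() == ByteOrder.nativeOrder();
    }

    /** True if a working set of the given size fits within a fraction of the maximum Java heap. */
    static boolean fitsInHeap(long bytes, double fraction) {
        return bytes <= (long) (Runtime.getRuntime().maxMemory() * fraction);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // A newly created ByteBuffer is big-endian until order() is set explicitly.
        ByteBuffer fresh = ByteBuffer.allocateDirect(16);
        System.out.println("native byte order:   " + ByteOrder.nativeOrder());
        System.out.println("fresh buffer native: " + hasNativeOrder(fresh));
        System.out.println("host processors:     " + rt.availableProcessors());
        System.out.println("max heap (bytes):    " + rt.maxMemory());
        System.out.println("64 MB fits in half:  " + fitsInHeap(64L << 20, 0.5));
    }
}
```

Checking the byte order before handing a buffer to native code avoids silently swapped values on little-endian hardware, which is the common case for desktop machines.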
When considering how to access GPU resources from Java, several open-source APIs were considered. For the purposes of this evaluation, Olivier Chafik's JavaCL was chosen due to its Lesser General Public License.
Introduction to GPU processing pipeline
The processing pipeline when using GPUs as compute devices in ImageJ involves several steps:
- Get the data in native arrays with the needed byte ordering from an imglib object
- Choose a device, compile the kernel, and associate the native arrays with the kernel
- Launch the kernel
- Return the results to a compatible Imglib object
Metric/Method
The Sobel filter is a common image processing routine used for edge detection. It is well suited for this evaluation because it is simple to implement and because the GPU code closely resembles the existing open source implementation.
For purposes of timing processing, the 8-bit test image will be loaded into an Imglib NIO backed buffer. The kernel source code is precompiled. The timer is started before the call to execute the kernel and concludes after the results are returned to the Imglib NIO backed buffer. 100 iterations are averaged to determine the recorded value.
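The averaging described above can be sketched as a small timing harness. The class KernelTimer is a hypothetical name, and the workload in main is a stand-in; in the measurement described here the timed task would be the kernel launch plus the copy of the results back into the NIO backed buffer.

```java
import java.util.Arrays;

// Hypothetical timing harness; the timed task is supplied by the caller.
class KernelTimer {

    /** Average wall-clock time in milliseconds per call over the given number of iterations. */
    static double averageMillis(Runnable task, int iterations) {
        if (iterations <= 0) {
            throw new IllegalArgumentException("iterations must be positive");
        }
        long totalNanos = 0L;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1e6 / iterations;
    }

    public static void main(String[] args) {
        byte[] image = new byte[512 * 512];
        // Stand-in workload: replace with kernel execution and result read-back.
        double ms = averageMillis(() -> Arrays.fill(image, (byte) 1), 100);
        System.out.println("average ms per iteration: " + ms);
    }
}
```

Because setup such as kernel compilation happens outside the timed task, the average reflects only execution and data return, matching the method stated above.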
Note: It is almost certainly possible to optimize any of the following implementations. However, the primary goal of this assessment is not performance.
Implementation
The following code demonstrates a partial implementation of the Sobel filter within ImageJ:
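public byte[] filter(int width, int height, byte[] inputImageArray) {
    byte[] pixels = new byte[width*height];
    int p1, p2, p3, p4, p5, p6, p7, p8, p9;
    int offset, sum1, sum2=0, sum=0;
    int rowOffset = width;
    // Assumption: the loop bounds and neighbour reads are reconstructed to
    // mirror the OpenCL kernel that follows; edge pixels are not calculated.
    for (int y=1; y<height-1; y++) {
        offset = 1 + y*width;
        for (int x=1; x<width-1; x++) {
            p1 = inputImageArray[offset-rowOffset-1] & 0xff;
            p2 = inputImageArray[offset-rowOffset] & 0xff;
            p3 = inputImageArray[offset-rowOffset+1] & 0xff;
            p4 = inputImageArray[offset-1] & 0xff;
            p5 = inputImageArray[offset] & 0xff;
            p6 = inputImageArray[offset+1] & 0xff;
            p7 = inputImageArray[offset+rowOffset-1] & 0xff;
            p8 = inputImageArray[offset+rowOffset] & 0xff;
            p9 = inputImageArray[offset+rowOffset+1] & 0xff;
            sum1 = p1 + 2*p2 + p3 - p7 - 2*p8 - p9;
            sum2 = p1 + 2*p4 + p7 - p3 - 2*p6 - p9;
            sum = (int)Math.sqrt(sum1*sum1 + sum2*sum2);
            if (sum > 255) sum = 255;
            pixels[offset++] = (byte)sum;
        }
    }
    return pixels;
}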
There are a few properties that make the above partial implementation well suited to GPU computation. Each output pixel depends only on input pixels, so it can be computed independently of every other output pixel. The input values consumed for one output pixel sit next to each other in memory, which can be exploited for performance. Several computations are performed for each pixel.
Here is the partial implementation of Sobel filter in OpenCL:
__kernel void sobel( __global char* input, __global char* output, int width, int height)
{
    int x = get_global_id(0); //find the X id
    int y = get_global_id(1); //find the Y id
    int p[9]; //allocate a local array used for intermediate values
    int offset = y * width + x; //determine the offset
    if( x < 1 || y < 1 || x > width - 2 || y > height - 2 ) //is this an edge pixel?
    {
        output[offset] = 0; //This partial implementation does not calculate edge values
    }
    else
    {
        p[0] = input[offset - width - 1] & 0xff;
        p[1] = input[offset - width] & 0xff;
        p[2] = input[offset - width + 1] & 0xff;
        p[3] = input[offset - 1] & 0xff;
        p[4] = input[offset] & 0xff;
        p[5] = input[offset + 1] & 0xff;
        p[6] = input[offset + width - 1] & 0xff;
        p[7] = input[offset + width] & 0xff;
        p[8] = input[offset + width + 1] & 0xff;
        int sum1 = p[0] + 2*p[1] + p[2] - p[6] - 2*p[7] - p[8];
        int sum2 = p[0] + 2*p[3] + p[6] - p[2] - 2*p[5] - p[8];
        float sum3 = sum1*sum1 + sum2*sum2;
        int sum = sqrt( sum3 );
        if (sum > 255) sum = 255;
        output[offset] = (char) sum; //write the result to the output array
    }
};
The above OpenCL kernel is almost identical to the Java implementation, except that each invocation uses its global index to locate a single pixel rather than looping through the array. This allows the computation to be spread over many cores and thus provides the potential for speedup.
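To make the index based formulation concrete without a GPU, the following sketch emulates it in plain Java. The class SobelPerPixel is hypothetical: sobelAt plays the role of one kernel invocation, and a parallel stream stands in for the device scheduling work-items in no particular order. Results should match the kernel up to floating point rounding in the square root, which is an assumption rather than something measured here.

```java
import java.util.stream.IntStream;

// Hypothetical CPU emulation of the per-pixel kernel; not the OpenCL code path itself.
class SobelPerPixel {

    /** Computes one output pixel from its 3x3 neighbourhood, like one kernel invocation. */
    static byte sobelAt(byte[] in, int width, int height, int x, int y) {
        if (x < 1 || y < 1 || x > width - 2 || y > height - 2) {
            return 0; // edge pixels are not calculated
        }
        int o = y * width + x;
        int p0 = in[o - width - 1] & 0xff, p1 = in[o - width] & 0xff, p2 = in[o - width + 1] & 0xff;
        int p3 = in[o - 1] & 0xff, p5 = in[o + 1] & 0xff;
        int p6 = in[o + width - 1] & 0xff, p7 = in[o + width] & 0xff, p8 = in[o + width + 1] & 0xff;
        int sum1 = p0 + 2 * p1 + p2 - p6 - 2 * p7 - p8;
        int sum2 = p0 + 2 * p3 + p6 - p2 - 2 * p5 - p8;
        int sum = (int) Math.sqrt(sum1 * sum1 + sum2 * sum2);
        return (byte) Math.min(sum, 255);
    }

    /** Applies sobelAt to every index; each index writes only its own output element. */
    static byte[] filter(byte[] in, int width, int height) {
        byte[] out = new byte[width * height];
        IntStream.range(0, width * height).parallel()
                 .forEach(i -> out[i] = sobelAt(in, width, height, i % width, i / width));
        return out;
    }

    public static void main(String[] args) {
        byte[] image = {0, 0, 10,  0, 0, 10,  0, 0, 10}; // 3x3 image with a vertical edge
        System.out.println("centre pixel: " + (filter(image, 3, 3)[4] & 0xff));
    }
}
```

Because no invocation reads another invocation's output, the order in which indices are processed does not affect the result, which is the property that lets the work be spread across many cores.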
The following example demonstrates how an image is loaded using Imglib in preparation for GPU computation:
//Create an array container factory
ArrayContainerFactory arrayContainerFactory = new ArrayContainerFactory();
//Set the backing type to NIO
arrayContainerFactory.setNIOUse(true);
//Create a image backed by an NIO typed array given an input file
Image inImg = LOCI.openLOCIFloatType( file.getPath(), arrayContainerFactory );
ArrayContainerFactory.setNIOUse(true) ensures that NIO backed arrays are used. NIO backed arrays are used rather than ordinary Java arrays because they allow data to be shared between Java and native code without an extra copy and because they can improve throughput between the host and the device. Both CUDA and OpenCL benefit from the use of host arrays that are not paged to disk. This type of memory is referred to as page-locked memory. Section 5.3.1 of "CUDA Programming Guide Version 3.0" has more specific information on this detail.
Note: OpenCL may use page-locked host memory when the "CL_MEM_ALLOC_HOST_PTR" flag is set.
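As an illustration of what an NIO backed array looks like at the Java level, the sketch below copies a float array into a direct buffer in native byte order and back again. The class DirectFloats is hypothetical and independent of ImgLib. A direct buffer lives outside the Java heap, but whether the driver treats it as page-locked is up to the OpenCL or CUDA implementation and is not guaranteed by this code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical helper showing a direct, native-order buffer; independent of ImgLib.
class DirectFloats {

    /** Copies a Java array into a direct buffer that uses the platform's byte order. */
    static FloatBuffer toDirect(float[] data) {
        FloatBuffer buffer = ByteBuffer.allocateDirect(data.length * Float.BYTES)
                                       .order(ByteOrder.nativeOrder())
                                       .asFloatBuffer();
        buffer.put(data);
        buffer.rewind(); // reset the position so readers start at the first element
        return buffer;
    }

    /** Copies the remaining buffer contents into a new array, leaving the position untouched. */
    static float[] toArray(FloatBuffer buffer) {
        float[] out = new float[buffer.remaining()];
        buffer.duplicate().get(out);
        return out;
    }

    public static void main(String[] args) {
        FloatBuffer buffer = toDirect(new float[] {1.5f, -2f, 3f});
        System.out.println("direct: " + buffer.isDirect() + ", order: " + buffer.order());
        System.out.println("first value read back: " + toArray(buffer)[0]);
    }
}
```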