[Original source: http://www.realworldtech.com/]

By: David Kanter | 12-07-2010


Introduction to OpenCL

Using a GPU for computational workloads is not a new concept. The first work in this area dates back to academic research in 2003, but it took the advent of unified shaders in the DX10 generation for GPU computing to become a plausible future. Around that time, Nvidia and ATI began releasing proprietary compute APIs for their graphics processors, and a number of companies were working on tools to leverage GPUs and other alternative architectures. The landscape back then was incredibly fragmented, and almost every option required a proprietary solution: software, hardware or both. Some of the engineers at Apple looked at the situation and decided that GPU computing had potential, but they wanted a standard API that would let them write code once and run it on many different hardware platforms. It was clear that Microsoft would eventually create one for Windows (ultimately DirectCompute), but what about Linux and OS X? Thus an internal project was born that would eventually become OpenCL.

The goals for OpenCL are deceptively simple: a cross-platform API and ecosystem for applications to take advantage of heterogeneous computing resources for parallel applications. The name also makes it clear that OpenCL is the compute analogue of OpenGL and is intended to fill a similar role. While GPUs were explicitly targeted, a number of other devices have considerable potential but lack a suitable programming model, including IBM's Cell processor and various FPGAs. Multi-core CPUs are also candidates for OpenCL, especially given the difficulty inherent in parallel programming models, with the added benefit of integration with other devices.

OpenCL has a broad and inclusive approach to parallelism, both in software and hardware. The initial incarnations focus on data parallel programming models, partially because of the existing work in the area. However, task level parallelism is certainly anticipated and on the road map. In fact, one of the most interesting areas will be the interplay between the two.

The cross-platform aspect ensures that applications will be portable between different hardware platforms, from a functionality and correctness standpoint. Performance will naturally vary across platforms and vendors, and improve over time as hardware evolves to exploit ever more parallelism. This means that OpenCL embraces multiple cores and vectorization as equally valid approaches and enables software to readily exploit both.

OpenCL is a C-like language, but with a number of restrictions to improve parallel execution (e.g. no recursion and limited pointers). For most implementations, the compiler back-end is based on LLVM, an open-source project out of UIUC. LLVM was a natural choice, as it is extensively used within Apple. It has a more permissive license than the GNU suite, and many of the key contributors are employed by Apple.

The first widely supported, programmable GPUs were the DX10 generation from Nvidia, accompanied by a proprietary API, CUDA, and a fledgling software ecosystem. To take advantage of this, Apple worked closely with Nvidia on their early efforts. The result is that OpenCL was heavily influenced by CUDA. In essence, CUDA served as a starting point, and Apple then incorporated their own vision and a great deal of input from AMD, Imagination Technologies (which is responsible for nearly all cell phone graphics solutions) and Intel. Once the project was in good enough shape, Apple put OpenCL into the hands of the Khronos Group, the standards body behind OpenGL.

The lion's share of the early OpenCL work was done by Apple and Nvidia. The first software implementation of OpenCL was a key feature of Mac OS X v10.6 (Snow Leopard), which was released in August of 2009. In order to promote the burgeoning standard, Apple mandated hardware support across their entire PC line, from the humble Mac Mini to the Mac Pro. Since Nvidia was the only compatible hardware solution early on, this gave them a virtual monopoly on Apple's chipsets and graphics cards for the first several years. The rest of the industry signed onto OpenCL in fairly short order; however, actual hardware and software has only just begun to catch up and take shape.

The progress in the PC ecosystem has just started. Nvidia supports OpenCL across their full product line, as they have from its inception. AMD took a slightly indirect route, first releasing OpenCL for CPUs (and GPUs under OS X) in August of 2009 and adding GPU support for Windows and Linux in December 2009. S3's embedded graphics added OpenCL 1.0 in late 2009, as did VIA for the video processors in their chipsets. IBM also has a version of OpenCL for PowerPC and Cell processors. Of all the major players, Intel is taking the longest to release OpenCL compatible products. Their first CPU implementation will arrive in early 2011 with Sandy Bridge. Unfortunately, the Sandy Bridge GPU lacks certain required functionality, so the first GPU implementation of OpenCL will be on Ivy Bridge, the following year. Of all the different vendors, Nvidia's support is by far the most full-featured and robust, since it leverages their existing investment in CUDA. On the software side, things are moving slightly slower, with only a handful of early adopters, partially because hardware support has only just started to move beyond Nvidia.

Just as OpenGL is used in both the PC and embedded worlds, OpenCL has also generated substantial interest within the mobile and embedded ecosystem. Imagination Technologies, which is responsible for the vast majority of cell phone GPUs, announced OpenCL 1.0 support for the SGX545 graphics core. Samsung has a compatible solution for cell phones, based on an ARM Cortex A9 microprocessor. Perhaps more importantly, Khronos has released an 'Embedded Profile' for OpenCL that relaxes some of the requirements to improve power efficiency and cost. Outside of the mobile world, it is conceivable (albeit unlikely) that FPGA vendors may use OpenCL as a programmer-friendly interface (compared to Verilog) for their hardware, at the cost of some efficiency.
 
 
OpenCL Execution Model

General purpose computing on GPUs has been a topic of interest for a considerable time. The early work was in academia, primarily in the Stanford graphics group, and focused on using the existing limited shader languages (e.g. Brook) for general workloads. Many of the Stanford graphics graduate students went into industry and influenced the evolution of GPUs into programmable hardware. The first commercial API was CUDA, which has in turn influenced later APIs such as OpenCL and DirectCompute. All three APIs use variants of C that add and remove certain features. None of the languages are a superset of C, so not all C programs will map cleanly to the respective languages. Given the shared ancestry and shared starting language, it should not be surprising that there are many similarities between the three.

OpenCL, DirectCompute and CUDA are APIs designed for heterogeneous computing, with both a host (i.e. a CPU) and an OpenCL device. The device can be the same hardware as the host (for instance, a CPU can serve as both), but the OpenCL device is often different (e.g. a GPU or DSP).
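The host/device split is visible in the very first calls an OpenCL application makes. Below is a minimal host-side sketch of device discovery in C; error handling is omitted for brevity, and the printed name will of course vary by machine.

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        char name[256];

        /* A platform is a vendor's OpenCL implementation; a device is
           whatever hardware it drives: a GPU, a CPU, or an accelerator. */
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
        clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("OpenCL device: %s\n", name);
        return 0;
    }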

OpenCL applications have serial portions, which execute on the host CPU, and parallel portions, known as kernels. The parallel kernels may execute on an OpenCL compatible device (CPU or GPU), and synchronization is enforced between kernels and serial code. OpenCL is distinctly intended to handle both task and data parallel workloads, while CUDA and DirectCompute are primarily focused on data parallelism.

A kernel applies a single stream of instructions to vast quantities of data that are organized as a 1-3 dimensional array (called an N-D range). Each piece of data is known as a work-item in OpenCL terminology, and kernels may have hundreds or thousands of work-items. At a high level, this sounds a lot like SIMD execution, where each work-item is a SIMD lane. However, one of the key goals of OpenCL is to provide an extensible form of data parallelism that isn't explicitly tied to specific vector lengths and can be mapped to all sorts of different hardware. So in some sense, an OpenCL kernel is a generalization of SIMD. The kernel itself is organized into many work-groups that are relatively limited in size; for example, a kernel could have 32K work-items, but 64 work-groups of 512 items each. Unlike traditional computation, arbitrary communication within a kernel is strongly limited. However, communication and synchronization are generally allowed locally within a work-group. So work-groups serve two purposes: first, they break up a kernel into manageable chunks, and second, they define a limited scope for communication.
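As a concrete illustration, here is a minimal sketch of an OpenCL C kernel (the kernel name and arguments are hypothetical). Each work-item handles one element of the N-D range; the 512-item work-group size from the example above would be chosen by the host when the kernel is enqueued, not in the kernel itself.

    /* Each work-item scales one element of the range, independently of
       all the others. */
    __kernel void scale(__global float *data, float factor)
    {
        size_t i = get_global_id(0);  /* this work-item's index in the N-D range */
        data[i] = data[i] * factor;
    }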

Kernels form the basis of OpenCL, but they can be composed into a task graph via asynchronous command queues. The programmer indicates dependencies between kernels, and what conditions must be met for a kernel to start execution. The OpenCL run-time layer can simultaneously execute independent kernels, thus extracting task parallelism within an application. While the initial uses of OpenCL will probably focus on data parallelism, the best performance will be achieved by combining task and data parallel techniques.
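A host-side sketch of such a task graph follows, assuming the queue and the kernels (kernel_a, kernel_b and kernel_c are hypothetical names) have already been created; the queue would need to be created with out-of-order execution enabled for the independent kernels to actually overlap.

    cl_event done[2];
    size_t global = 32768;

    /* kernel_a and kernel_b have no dependencies and may run concurrently. */
    clEnqueueNDRangeKernel(queue, kernel_a, 1, NULL, &global, NULL, 0, NULL, &done[0]);
    clEnqueueNDRangeKernel(queue, kernel_b, 1, NULL, &global, NULL, 0, NULL, &done[1]);

    /* kernel_c declares its dependencies via the event wait list, so it
       starts only after both predecessors have completed. */
    clEnqueueNDRangeKernel(queue, kernel_c, 1, NULL, &global, NULL, 2, done, NULL);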

OpenCL defines a broad universe of data types for computation in each work-item. On the integer side, data types include boolean, character, short, int (32-bit), long (64-bit) and long long (128-bit). Most of these integer types are available in both signed and unsigned variants.

For floating point, OpenCL both defines a variety of data types and also specifies precision for most operations. The floating point data types are relatively standard: single precision is required and double precision is optional. In addition, there is half precision (16-bit) floating point for data storage; computation is still done at single precision, but for less precise data, the storage requirements can be cut in half. Thankfully, OpenCL also enforces a minimum level of floating point precision and accuracy, generally consistent with IEEE 754. Double precision has the most stringent requirements, including a fused multiply-add instruction, all four rounding modes (round to nearest even, toward zero, toward +infinity, toward -infinity), and proper handling of denormal numbers, infinities and NaNs. Single precision is somewhat more lax and only requires round to nearest even and handling of infinities and NaNs. In both cases, all operations have a guaranteed minimum precision; this is especially critical for math functions that are implemented in libraries, such as transcendental functions. Half precision requires an IEEE compatible storage format and correct conversion.
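The storage-only role of half precision shows up directly in the kernel language. A brief sketch (kernel name hypothetical) using the built-in conversion functions:

    /* Data is stored as 16-bit halves but computed as 32-bit floats. */
    __kernel void scale_half(__global half *data, float factor)
    {
        size_t i = get_global_id(0);
        float x = vload_half(i, data);     /* 16-bit storage -> 32-bit compute */
        vstore_half(x * factor, i, data);  /* round back to 16-bit storage */
    }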

OpenCL also provides a number of more sophisticated data types on top of these basic ones. Most data types (except half precision and boolean) are part of the specification in vector form, with lengths 2, 4, 8 and 16. Vector operations are component-wise, so each lane is independent. This is a clear contrast to DirectCompute and CUDA, which only support vectors of length 2-4. OpenCL has pointers for many data types, which helps make developers comfortable, but it comes at a cost: pointers create potential aliasing problems (just as in C). Vectorization is critical for performance on many CPUs and GPUs (although not Nvidia GPUs), and will be much more heavily emphasized in OpenCL than in CUDA.
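A short sketch of component-wise vector arithmetic (kernel name hypothetical): because each lane is independent, an implementation is free to map the float4 onto SSE lanes on a CPU or onto the ALUs of a GPU.

    /* Four multiply-adds per work-item, one per vector component. */
    __kernel void saxpy4(__global const float4 *x, __global float4 *y, float a)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }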

There are also data types for 2- and 3-dimensional images, along with texture sampling and filtering of images. The standard has reserved a number of other data types, such as complex numbers (using floating point formats for the imaginary and real parts), matrices and high precision formats (128-bit integers and floating point). These are not yet part of OpenCL, but they are all clearly candidates for future inclusion.


OpenCL Memory Model

The OpenCL memory model defines how data is stored and communicated both within a device and also between a device and the host CPU. There are four memory types (and address spaces) in OpenCL, which closely correspond to those in CUDA and DirectCompute, and they all interact with the execution model.

The first region, global memory, is available to any work-item for both read and write access. Global memory may be cached in the OpenCL device for higher performance and power efficiency, or may reside strictly in DRAM. Global memory is also fully accessible by the CPU host. Constant memory is a read-only region for work-items on the OpenCL device, but the host CPU has full read and write access. Since the region is read-only, it is freely accessible to any work-item. Conceptually, constant memory can be thought of as a portion of global memory that is read-only for the OpenCL device.

The remaining memory regions are only usable by the OpenCL device and are inaccessible to the host. The first is private memory, which is accessible to a single work-item for reads and writes, and corresponds roughly to an architectural register file in a classic instruction set. The vast majority of computation is done using private memory, so in many ways it is the most performance critical. The second region is known as local memory and is accessible to a single work-group for reads and writes. Local memory is intended for shared variables and communication between work-items; in essence, it is an architectural register file that is shared between a limited number of work-items. Local memory can be held in DRAM and cached, which is how most CPUs will implement it, while GPUs tend to favor dedicated hardware structures that are explicitly addressed.
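The four regions map directly onto address space qualifiers in the kernel language. A sketch that touches all of them (names hypothetical):

    __kernel void regions(__global float *out,      /* read/write, host-visible     */
                          __constant float *coeff,  /* read-only on the device      */
                          __local float *scratch)   /* shared within one work-group */
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);
        float acc = coeff[0] * out[gid];  /* acc lives in private memory */
        scratch[lid] = acc;
        out[gid] = scratch[lid];
    }

Note that the __local argument has no backing data on the host; it is sized by the host via clSetKernelArg with a NULL pointer and a byte count.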

The memory consistency model for OpenCL is fairly relaxed, with a number of primitives to assist. OpenCL defines four work-group synchronization primitives: a barrier and three types of fences (a read fence, a write fence, and a general memory fence). The barrier synchronizes an entire work-group, so the scope is limited by definition. The strength of the memory consistency is progressively weaker as the scope widens, which makes sense: a strongly ordered model is easier with fewer caching and memory agents, and increasingly difficult to scale as more agents are added.

At the smallest scope, each work-item has fairly strong consistency and will preserve the ordering between an aliased load and store; however, non-aliased memory instructions can be freely re-ordered. Local memory is a bit weaker: it is only consistent across a work-group at a barrier. Without a barrier, there are no ordering guarantees between the different work-items. Global memory is weaker still; a barrier will guarantee consistency of global memory within a work-group, but there are absolutely no guarantees between different work-groups in a kernel. Global atomic operations were an optional part of OpenCL 1.0 and are required in 1.1; they are used to guarantee consistency between any work-items in a kernel, specifically between different work-groups. Atomic operations are primarily defined for 32-bit integers, with an optional extension for 64-bit integers. They acquire exclusive access to a memory address (to ensure ordering) and perform a read-modify-write, returning the old value. Both OpenCL and CUDA return the old value, while this is strictly optional for DirectCompute. However, the performance cost of atomic operations is fairly high on some hardware, so they should be used sparingly for the sake of scalability and performance. Since constant memory is read-only, it needs no consistency or ordering model.
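These primitives are easiest to see in a reduction. A sketch follows (kernel name hypothetical, and assuming a power-of-two work-group size): the barrier makes local memory consistent within the work-group, and one global atomic per group combines the results. atomic_add is core in OpenCL 1.1; under 1.0 the equivalent is the atom_add extension.

    __kernel void sum(__global const int *in, __global int *total,
                      __local int *part)
    {
        size_t lid = get_local_id(0);
        part[lid] = in[get_global_id(0)];

        /* Tree reduction within the work-group; the barrier guarantees that
           every prior write to part[] is visible before it is read. */
        for (size_t stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            barrier(CLK_LOCAL_MEM_FENCE);
            if (lid < stride)
                part[lid] += part[lid + stride];
        }

        /* One read-modify-write per work-group orders the global update. */
        if (lid == 0)
            atomic_add(total, part[0]);
    }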

OpenCL uses a combination of pointers and buffers to move data within an application. Pointers are valid within a kernel; however, they do not persist past the end of the kernel. So passing data between kernels (or between the host and device) uses buffers. This is another area where OpenCL diverges from CUDA: the latter persists pointers across kernels and does not use any buffers.
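A host-side sketch of the buffer flow just described, assuming the context ctx, the queue, the kernel and the host arrays (all hypothetical names) already exist:

    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, &err);

    /* Copy host data into the buffer and bind it as the kernel's first argument. */
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), host_in, 0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    size_t global = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* A blocking read doubles as synchronization with the kernel. */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), host_out, 0, NULL, NULL);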


Terminology and Summary

One of the more confusing aspects of GPU computing is the terminology. The lexicon for CPUs and computer architecture is relatively consistent across vendors. For graphics APIs, there is a common language and understanding, formed by DirectX and OpenGL, that most hardware and software can follow. However, for graphics hardware, the terminology varies considerably and is often imprecise and subject to change; fortunately, the common APIs give some semblance of order. In contrast to graphics and computer architecture, the idea of using GPUs for computation is relatively new. The industry standards are nascent, but hopefully OpenCL and DirectCompute will provide a relatively standard language to understand the software aspects of GPU computing. While the terminology in these APIs may not be universally adopted, they will reduce confusion by providing common ground. Equally important, since OpenCL is intended to run on almost any device, the common software architecture will be very helpful for understanding the different flavors of hardware.

Table 1 - Comparison of OpenCL, DirectCompute and CUDA


The table shows the correspondence between the different terminology in OpenCL, DirectCompute and CUDA and also compares certain features. One difference in the execution model is that OpenCL and DirectCompute specifically omit any microarchitectural aspects of execution and avoid horizontal operations. Both of these choices improve portability and performance across many different devices. CUDA is a proprietary API and portability was never a goal, so Nvidia exposes warps and certain horizontal warp functions through the API.

The three APIs have changed the local memory capacity over time, in tune with advances in hardware. Early versions, including OpenCL 1.0, Compute Shader 4.x and CUDA 1.x, specified 16KB of local memory, although in OpenCL that was only a minimum. The local memory for OpenCL 1.1 must be at least 32KB, while DirectCompute requires exactly 32KB. CUDA 2.x takes a slightly different tack and mirrors Nvidia's Fermi hardware, allowing local memory to be configured as either 16KB or 48KB. The reason that OpenCL focuses on minimum sizes for local memory is to enable a diversity of hardware. Most CPUs will use regular system memory, held in a cache, for local memory. Even L1 caches can easily exceed 32KB, and L2 and L3 caches are orders of magnitude larger. Moreover, the Cell processor actually has 256KB of local memory.
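Because OpenCL specifies only a minimum, portable code can size its local memory at run time rather than hard-coding 16KB. A sketch (device, kernel and argument index are hypothetical):

    /* Ask the device how much local memory a work-group actually has;
       the spec guarantees only the minimum (16KB in 1.0, 32KB in 1.1). */
    cl_ulong local_mem;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);

    /* Size the kernel's __local scratch argument to match the hardware. */
    clSetKernelArg(kernel, 2, (size_t)local_mem, NULL);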

The changing storage capacity serves to highlight one of the pitfalls of OpenCL. While the specification does ensure functional correctness across different platforms, it does not guarantee optimal performance. Hardware can vary across a number of aspects: the number of work-groups, and the latency, bandwidth and capacity of on-chip and off-chip memory (e.g. caches or registers, DRAM, etc.). Tuning for a specific platform will often result in suboptimal code for other platforms. For example, using 4-wide vectors on AMD GPUs is necessary for optimal performance, while Nvidia GPUs only see mild gains. As a result, software optimized for Nvidia platforms is typically unvectorized and will not run efficiently on AMD GPUs. This problem is universal for almost any cross-platform environment. However, the variations in performance for OpenCL on GPUs will be much larger than for, say, Java on CPUs, because the variations in microarchitecture are also much larger. One related issue is that all memory is statically allocated in OpenCL (i.e. at compile time), without any knowledge of the underlying hardware. Dynamically allocating memory (i.e. at run-time) would help to improve performance across different hardware. As a simple example, software that is written for a smaller 16KB local memory will leave performance on the table when using hardware that has more capacity (say 32KB, like AMD's GPUs).

Despite its flaws, OpenCL holds great promise as an open, compatible and standards-based approach to parallel computing on GPUs and other alternative devices. At present, though, OpenCL is still in the very early stages, with limited hardware and software. However, it has broad support throughout the PC and embedded ecosystems, and is just starting down the path to maturity as a common API for software developers. Judging by history, though, OpenCL and DirectCompute will eventually come to dominate the landscape, just as OpenGL and DirectX became the standards for graphics.
