1 Introduction
Particle-based simulations have been used to simulate granular materials,
fluids, and rigid bodies. To achieve realistic behavior, a
large number of particles have to be simulated. Particle-based simulations
are suited for GPUs because the computation of each particle
is almost the same (i.e., the granularity of the computation
is uniform over the particles). This is preferable for GPUs with a
wide SIMD architecture. However, particle-based simulation on the
GPU has been mostly restricted to simulating particles of identical
size [Harada et al. 2007]. This is because the work granularity is
non-uniform if there are particles with different radii, which leads to
inefficient use of the GPU. Heterogeneous CPU/GPU architectures,
such as AMD Fusion® APUs, can handle such a simulation efficiently
by using the CPU and the GPU at the same time. On a PC with
a CPU and a discrete GPU, whenever a computation is dispatched
to the GPU, the data has to be sent over the PCI Express® bus, which
introduces a latency. However, heterogeneous architectures have
a CPU and a GPU on the same die with a tightly coupled shared
memory, so the same memory space can be accessed from the GPU
and the CPU without any copying, which can facilitate a tight collaboration
between the two processors. In this paper, we describe
a particle-based simulation with particles of various sizes running
on a heterogeneous architecture by dispatching work to the GPU and the CPU
according to its granularity and processing it on both simultaneously.
2 Method
The simulation we developed maximizes the use of all the available
resources of a heterogeneous architecture by performing computation
concurrently on both the CPU and the GPU components.
The simulation uses a CPU thread for dispatching work to the GPU
(GPU control thread) and multiple CPU threads for computation
(CPU computation threads), whereas an application using only the
GPU uses one CPU thread. The target architecture was an AMD A-Series
APU with four CPU cores and a GPU.
!"#$
$$$
%"#$
$$$
&'()*$
+,,-$
./0',/'01$
..$
%2))(3(24$
.$
54/1607824$
9$
54/1607824$
99$
%2))(3(24$
&'()*$
+,,-$
./0',/'01$
.:4,;024(<7824$
"23(824$
=1)2,(/
!0(*$
>20,1$
.:4,;024(<7824$
9.$
%2))(3(24$
Figure 2: A step of the simulation using two CPU threads.
Our implementation for this architecture used the GPU, a GPU control thread, and three
CPU computation threads. For simplicity, we first describe a simulation
using the GPU and two CPU threads (a GPU control thread
and a CPU computation thread). Then we describe how to scale the
simulation to the GPU, a GPU control thread, and multiple CPU
computation threads.
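As a rough sketch of this thread layout (the use of std::thread and the work-function names are our own assumptions, not details taken from the actual implementation), one simulation step could be organized as follows:

```cpp
// Hypothetical sketch of the thread layout described above: one GPU
// control thread that dispatches and waits on GPU work, plus several
// CPU computation threads. The work functions are placeholder stubs.
#include <thread>
#include <vector>

void gpuControlWork()        { /* enqueue GPU kernels for the small particles and wait */ }
void cpuComputeWork(int tid) { /* simulate a share of the large particles */ (void)tid; }

void runStep(int numCpuComputeThreads)
{
    // One CPU thread is dedicated to controlling the GPU.
    std::thread gpuControl(gpuControlWork);

    // The remaining CPU threads perform computation directly.
    std::vector<std::thread> cpuThreads;
    for (int tid = 0; tid < numCpuComputeThreads; ++tid)
        cpuThreads.emplace_back(cpuComputeWork, tid);

    // Join everything before moving on to the next simulation step.
    gpuControl.join();
    for (std::thread& t : cpuThreads) t.join();
}
```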
2.1 Simulation using the GPU, a GPU control thread
and a CPU computation thread
A simulation with particles of various sizes, as shown in Fig. 1, is a
coupling of two simulations: a simulation of identically sized particles
(the small particles, colored blue) and a simulation of variously sized
particles (the large particles, colored red and green).
If the interaction between large and small particles is not considered,
the simulation of small particles has a uniform work granularity.
Thus it is well suited to the GPU. On the other hand, using the CPU is a
better choice for the computation of the large particles because the
granularity of their simulation is not uniform. Therefore, the small- and
large-particle simulations are performed on the GPU and the CPU,
respectively; note that the two also run concurrently.
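A minimal sketch of this partitioning, assuming a simple array-of-structs particle layout and a known small-particle radius (both our assumptions), might look like:

```cpp
// Sketch: split the particles by radius so that the uniformly sized small
// particles are handed to the GPU and the variably sized large particles
// are handed to the CPU. The Particle layout is an assumption.
#include <vector>

struct Particle
{
    float x, y, z;   // position
    float radius;
};

void partitionBySize(const std::vector<Particle>& all,
                     float smallRadius,                   // radius shared by all small particles
                     std::vector<Particle>& smallForGpu,  // uniform granularity -> GPU
                     std::vector<Particle>& largeForCpu)  // non-uniform granularity -> CPU
{
    for (const Particle& p : all)
    {
        if (p.radius <= smallRadius)
            smallForGpu.push_back(p);
        else
            largeForCpu.push_back(p);
    }
}
```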
A simulation step consists of three stages: building an acceleration
structure, collision, and integration. Brute-force collision computation
is prohibitively expensive when there are a large number of particles, so
an acceleration structure is built to make the collision stage efficient.
Colliding particles are searched for and repulsion forces are calculated.
The integration stage updates particle velocities and positions.
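The sketch below walks through these three stages for the identically sized particles on the CPU for clarity; the hashed uniform grid, the penalty-style repulsion force, and the explicit Euler integration are illustrative assumptions rather than the exact formulation of the implementation.

```cpp
// Illustrative single-threaded version of one simulation step:
// (1) build a uniform-grid acceleration structure, (2) search for
// colliding particles and accumulate repulsion forces, (3) integrate.
#include <cmath>
#include <unordered_map>
#include <vector>

struct Particle { float x, y, z, vx, vy, vz, fx, fy, fz, radius; };
using Grid = std::unordered_map<long long, std::vector<int>>; // cell key -> particle indices

static long long cellKey(int ix, int iy, int iz)
{
    // Pack three cell coordinates into one 64-bit key (assumes moderate grid extents).
    return ((long long)(ix & 0xFFFFF) << 40) |
           ((long long)(iy & 0xFFFFF) << 20) |
            (long long)(iz & 0xFFFFF);
}

static Grid buildGrid(const std::vector<Particle>& ps, float cellSize)
{
    Grid grid;
    for (int i = 0; i < (int)ps.size(); ++i)
        grid[cellKey((int)std::floor(ps[i].x / cellSize),
                     (int)std::floor(ps[i].y / cellSize),
                     (int)std::floor(ps[i].z / cellSize))].push_back(i);
    return grid;
}

static void collide(std::vector<Particle>& ps, const Grid& grid,
                    float cellSize, float stiffness)
{
    for (Particle& p : ps) p.fx = p.fy = p.fz = 0.0f;
    for (int i = 0; i < (int)ps.size(); ++i)
    {
        int cx = (int)std::floor(ps[i].x / cellSize);
        int cy = (int)std::floor(ps[i].y / cellSize);
        int cz = (int)std::floor(ps[i].z / cellSize);
        // Only the 27 surrounding cells can contain colliding particles
        // (assuming the cell size is at least the particle diameter).
        for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
        {
            auto it = grid.find(cellKey(cx + dx, cy + dy, cz + dz));
            if (it == grid.end()) continue;
            for (int j : it->second)
            {
                if (j == i) continue;
                float rx = ps[i].x - ps[j].x, ry = ps[i].y - ps[j].y, rz = ps[i].z - ps[j].z;
                float d = std::sqrt(rx * rx + ry * ry + rz * rz);
                float overlap = ps[i].radius + ps[j].radius - d;
                if (overlap > 0.0f && d > 1e-6f)
                {
                    // Penalty-style repulsion along the contact normal.
                    float s = stiffness * overlap / d;
                    ps[i].fx += s * rx; ps[i].fy += s * ry; ps[i].fz += s * rz;
                }
            }
        }
    }
}

static void integrate(std::vector<Particle>& ps, float dt)
{
    for (Particle& p : ps)
    {
        // Explicit Euler update of velocity and position (unit mass assumed).
        p.vx += p.fx * dt; p.vy += p.fy * dt; p.vz += p.fz * dt;
        p.x  += p.vx * dt; p.y  += p.vy * dt; p.z  += p.vz * dt;
    }
}

void simulationStep(std::vector<Particle>& ps, float cellSize, float stiffness, float dt)
{
    Grid grid = buildGrid(ps, cellSize);    // 1. acceleration structure
    collide(ps, grid, cellSize, stiffness); // 2. collision
    integrate(ps, dt);                      // 3. integration
}
```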
For a coupled simulation such as the one shown in Fig. 1, we must consider
how to handle collisions between large and small particles (LS
collision). LS collision is performed by searching for colliding
small particles for each large particle and accumulating the forces
on the small and large particles. To improve the efficiency of the
search, we can reuse the data structure built for small-small (SS)
collision. For each large particle, a bounding box in the coordinate
system of the uniform grid is calculated, and small particles are searched
for in the grid cells that overlap the bounding box. Work granularity
for each large particle depends on the size of the particle, because the
number of overlapping cells depends on the size of its bounding box.
Therefore, it is more efficient to perform LS collision on the CPU
computation thread.
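As a rough sketch of this search (the hashed grid, Particle layout, and force accumulation are the same illustrative assumptions as in the previous sketch, not the exact implementation), LS collision for one large particle might look like:

```cpp
// Illustrative LS collision for one large particle: compute its bounding
// box in grid-cell coordinates, then test only the small particles stored
// in the overlapped cells of the grid built for SS collision.
#include <cmath>
#include <unordered_map>
#include <vector>

struct Particle { float x, y, z, fx, fy, fz, radius; };
using Grid = std::unordered_map<long long, std::vector<int>>; // cell key -> small-particle indices

static long long cellKey(int ix, int iy, int iz)
{
    return ((long long)(ix & 0xFFFFF) << 40) |
           ((long long)(iy & 0xFFFFF) << 20) |
            (long long)(iz & 0xFFFFF);
}

void collideLS(Particle& large, std::vector<Particle>& smalls,
               const Grid& grid, float cellSize, float stiffness)
{
    // Bounding box of the large particle expressed in grid-cell indices
    // (it could be padded by the small-particle radius if the cell size
    // does not already provide that margin).
    int minX = (int)std::floor((large.x - large.radius) / cellSize);
    int minY = (int)std::floor((large.y - large.radius) / cellSize);
    int minZ = (int)std::floor((large.z - large.radius) / cellSize);
    int maxX = (int)std::floor((large.x + large.radius) / cellSize);
    int maxY = (int)std::floor((large.y + large.radius) / cellSize);
    int maxZ = (int)std::floor((large.z + large.radius) / cellSize);

    // The number of overlapped cells grows with the particle radius, which
    // is why the work granularity of LS collision is non-uniform.
    for (int iz = minZ; iz <= maxZ; ++iz)
    for (int iy = minY; iy <= maxY; ++iy)
    for (int ix = minX; ix <= maxX; ++ix)
    {
        auto it = grid.find(cellKey(ix, iy, iz));
        if (it == grid.end()) continue;
        for (int j : it->second)
        {
            Particle& s = smalls[j];
            float rx = large.x - s.x, ry = large.y - s.y, rz = large.z - s.z;
            float d = std::sqrt(rx * rx + ry * ry + rz * rz);
            float overlap = large.radius + s.radius - d;
            if (overlap > 0.0f && d > 1e-6f)
            {
                // Accumulate equal and opposite repulsion on both particles.
                float c = stiffness * overlap / d;
                large.fx += c * rx; large.fy += c * ry; large.fz += c * rz;
                s.fx -= c * rx;     s.fy -= c * ry;     s.fz -= c * rz;
            }
        }
    }
}
```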