1 Introduction
Particle-based simulations have been used to simulate granular materials,
fluids, and rigid bodies. To achieve realistic behavior, a
large number of particles have to be simulated. Particle-based simulations
are suited for GPUs because the computation of each particle
is almost the same (i.e., the granularity of the computation
is uniform over the particles). This is preferable for GPUs with a
wide SIMD architecture. However, particle-based simulation on the
GPU has been mostly restricted to simulating particles of identical
size [Harada et al. 2007]. This is because the work granularity is
non-uniform if there are particles with different radii, which leads to
inefficient use of the GPU. Heterogeneous CPU/GPU architectures,
such as AMD Fusion® APUs, can handle such a simulation efficiently
by using the CPU and the GPU at the same time. On a PC with
a CPU and a discrete GPU, whenever a computation is dispatched
to the GPU, the data has to be sent over the PCI Express® bus, which
introduces a latency. However, heterogeneous architectures have
a CPU and a GPU on the same die with a tightly coupled shared
memory, so the same memory space can be accessed from the GPU
and the CPU without any copying, which can facilitate a tight collaboration
between the two processors. In this paper, we describe
a particle-based simulation with particles of various sizes running
on a heterogeneous architecture by dispatching work to the GPU and the CPU
according to its granularity and processing it on both simultaneously.
2 Method
The simulation we developed maximizes the use of all the available
resources of a heterogeneous architecture by performing computation
concurrently on both the CPU and the GPU components.
The simulation uses a CPU thread for dispatching work to the GPU
(GPU control thread) and multiple CPU threads for computation
(CPU computation threads), whereas an application using only the
GPU uses one CPU thread. The target architecture was an AMD A-Series
APU with four CPU cores and a GPU.
!"#$
$$$
%"#$
$$$
&'()*$
+,,-$
./0',/'01$
..$
%2))(3(24$
.$
54/1607824$
9$
54/1607824$
99$
%2))(3(24$
&'()*$
+,,-$
./0',/'01$
.:4,;024(<7824$
"23(824$
=1)2,(/
!0(*$
>20,1$
.:4,;024(<7824$
9.$
%2))(3(24$
Figure 2: A step of the simulation using two CPU threads.
Our implementation for this architecture used the GPU, a GPU control thread, and three
CPU computation threads. For simplicity, we first describe a simulation
using the GPU and two CPU threads (a GPU control thread
and a CPU computation thread). Then we describe how to scale the
simulation to the GPU, a GPU control thread, and multiple CPU
computation threads.
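As a rough sketch of this thread layout (the use of std::thread and the work-function names are our own assumptions, not details taken from the actual implementation), one simulation step could be organized as follows:

```cpp
// Hypothetical sketch of the thread layout described above: one GPU
// control thread that dispatches and waits on GPU work, plus several
// CPU computation threads. The work functions are placeholder stubs.
#include <thread>
#include <vector>

void gpuControlWork()        { /* enqueue GPU kernels for the small particles and wait */ }
void cpuComputeWork(int tid) { /* simulate a share of the large particles */ (void)tid; }

void runStep(int numCpuComputeThreads)
{
    // One CPU thread is dedicated to controlling the GPU.
    std::thread gpuControl(gpuControlWork);

    // The remaining CPU threads perform computation directly.
    std::vector<std::thread> cpuThreads;
    for (int tid = 0; tid < numCpuComputeThreads; ++tid)
        cpuThreads.emplace_back(cpuComputeWork, tid);

    // Join everything before moving on to the next simulation step.
    gpuControl.join();
    for (std::thread& t : cpuThreads) t.join();
}
```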
2.1 Simulation using the GPU, a GPU control thread
and a CPU computation thread
A simulation with particles of various sizes, as shown in Fig. 1, is a
coupling of two simulations: a simulation of identically sized particles
(the small particles, colored blue) and a simulation of variously sized
particles (the large particles, colored red and green).
If the interaction between large and small particles is not considered,
the simulation of small particles has a uniform work granularity.
Thus it is well suited to the GPU. On the other hand, using the CPU is a
better choice for the computation of the large particles because the
granularity of their simulation is not uniform. Therefore, the small- and
large-particle simulations are performed on the GPU and the CPU,
respectively; note that the two also run concurrently.
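A minimal sketch of this partitioning, assuming a simple array-of-structs particle layout and a known small-particle radius (both our assumptions), might look like:

```cpp
// Sketch: split the particles by radius so that the uniformly sized small
// particles are handed to the GPU and the variably sized large particles
// are handed to the CPU. The Particle layout is an assumption.
#include <vector>

struct Particle
{
    float x, y, z;   // position
    float radius;
};

void partitionBySize(const std::vector<Particle>& all,
                     float smallRadius,                   // radius shared by all small particles
                     std::vector<Particle>& smallForGpu,  // uniform granularity -> GPU
                     std::vector<Particle>& largeForCpu)  // non-uniform granularity -> CPU
{
    for (const Particle& p : all)
    {
        if (p.radius <= smallRadius)
            smallForGpu.push_back(p);
        else
            largeForCpu.push_back(p);
    }
}
```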
A simulation step consists of three stages: building an acceleration
structure, collision, and integration. Brute-force collision computation
is prohibitively expensive when there are a large number of particles, so
an acceleration structure is built to make the collision stage efficient.
Colliding particles are searched for and repulsion forces are calculated.
The integration stage updates particle velocities and positions.
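The sketch below walks through these three stages for the identically sized particles on the CPU for clarity; the hashed uniform grid, the penalty-style repulsion force, and the explicit Euler integration are illustrative assumptions rather than the exact formulation of the implementation.

```cpp
// Illustrative single-threaded version of one simulation step:
// (1) build a uniform-grid acceleration structure, (2) search for
// colliding particles and accumulate repulsion forces, (3) integrate.
#include <cmath>
#include <unordered_map>
#include <vector>

struct Particle { float x, y, z, vx, vy, vz, fx, fy, fz, radius; };
using Grid = std::unordered_map<long long, std::vector<int>>; // cell key -> particle indices

static long long cellKey(int ix, int iy, int iz)
{
    // Pack three cell coordinates into one 64-bit key (assumes moderate grid extents).
    return ((long long)(ix & 0xFFFFF) << 40) |
           ((long long)(iy & 0xFFFFF) << 20) |
            (long long)(iz & 0xFFFFF);
}

static Grid buildGrid(const std::vector<Particle>& ps, float cellSize)
{
    Grid grid;
    for (int i = 0; i < (int)ps.size(); ++i)
        grid[cellKey((int)std::floor(ps[i].x / cellSize),
                     (int)std::floor(ps[i].y / cellSize),
                     (int)std::floor(ps[i].z / cellSize))].push_back(i);
    return grid;
}

static void collide(std::vector<Particle>& ps, const Grid& grid,
                    float cellSize, float stiffness)
{
    for (Particle& p : ps) p.fx = p.fy = p.fz = 0.0f;
    for (int i = 0; i < (int)ps.size(); ++i)
    {
        int cx = (int)std::floor(ps[i].x / cellSize);
        int cy = (int)std::floor(ps[i].y / cellSize);
        int cz = (int)std::floor(ps[i].z / cellSize);
        // Only the 27 surrounding cells can contain colliding particles
        // (assuming the cell size is at least the particle diameter).
        for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
        {
            auto it = grid.find(cellKey(cx + dx, cy + dy, cz + dz));
            if (it == grid.end()) continue;
            for (int j : it->second)
            {
                if (j == i) continue;
                float rx = ps[i].x - ps[j].x, ry = ps[i].y - ps[j].y, rz = ps[i].z - ps[j].z;
                float d = std::sqrt(rx * rx + ry * ry + rz * rz);
                float overlap = ps[i].radius + ps[j].radius - d;
                if (overlap > 0.0f && d > 1e-6f)
                {
                    // Penalty-style repulsion along the contact normal.
                    float s = stiffness * overlap / d;
                    ps[i].fx += s * rx; ps[i].fy += s * ry; ps[i].fz += s * rz;
                }
            }
        }
    }
}

static void integrate(std::vector<Particle>& ps, float dt)
{
    for (Particle& p : ps)
    {
        // Explicit Euler update of velocity and position (unit mass assumed).
        p.vx += p.fx * dt; p.vy += p.fy * dt; p.vz += p.fz * dt;
        p.x  += p.vx * dt; p.y  += p.vy * dt; p.z  += p.vz * dt;
    }
}

void simulationStep(std::vector<Particle>& ps, float cellSize, float stiffness, float dt)
{
    Grid grid = buildGrid(ps, cellSize);    // 1. acceleration structure
    collide(ps, grid, cellSize, stiffness); // 2. collision
    integrate(ps, dt);                      // 3. integration
}
```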
For a coupled simulation such as the one shown in Fig. 1, we must consider
how to handle collisions between large and small particles (LS
collision). LS collision is performed by searching for colliding
small particles for each large particle and accumulating the forces
on the small and large particles. To improve the efficiency of the
search, we can reuse the data structure built for small-small (SS)
collision. For each large particle, a bounding box in the coordinate
system of the uniform grid is calculated, and small particles are searched
for in the grid cells that overlap the bounding box. Work granularity
for each large particle depends on the size of the particle, because the
number of overlapping cells depends on the size of its bounding box.
Therefore, it is more efficient to perform LS collision on the CPU
computation thread.
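As a rough sketch of this search (the hashed grid, Particle layout, and force accumulation are the same illustrative assumptions as in the previous sketch, not the exact implementation), LS collision for one large particle might look like:

```cpp
// Illustrative LS collision for one large particle: compute its bounding
// box in grid-cell coordinates, then test only the small particles stored
// in the overlapped cells of the grid built for SS collision.
#include <cmath>
#include <unordered_map>
#include <vector>

struct Particle { float x, y, z, fx, fy, fz, radius; };
using Grid = std::unordered_map<long long, std::vector<int>>; // cell key -> small-particle indices

static long long cellKey(int ix, int iy, int iz)
{
    return ((long long)(ix & 0xFFFFF) << 40) |
           ((long long)(iy & 0xFFFFF) << 20) |
            (long long)(iz & 0xFFFFF);
}

void collideLS(Particle& large, std::vector<Particle>& smalls,
               const Grid& grid, float cellSize, float stiffness)
{
    // Bounding box of the large particle expressed in grid-cell indices
    // (it could be padded by the small-particle radius if the cell size
    // does not already provide that margin).
    int minX = (int)std::floor((large.x - large.radius) / cellSize);
    int minY = (int)std::floor((large.y - large.radius) / cellSize);
    int minZ = (int)std::floor((large.z - large.radius) / cellSize);
    int maxX = (int)std::floor((large.x + large.radius) / cellSize);
    int maxY = (int)std::floor((large.y + large.radius) / cellSize);
    int maxZ = (int)std::floor((large.z + large.radius) / cellSize);

    // The number of overlapped cells grows with the particle radius, which
    // is why the work granularity of LS collision is non-uniform.
    for (int iz = minZ; iz <= maxZ; ++iz)
    for (int iy = minY; iy <= maxY; ++iy)
    for (int ix = minX; ix <= maxX; ++ix)
    {
        auto it = grid.find(cellKey(ix, iy, iz));
        if (it == grid.end()) continue;
        for (int j : it->second)
        {
            Particle& s = smalls[j];
            float rx = large.x - s.x, ry = large.y - s.y, rz = large.z - s.z;
            float d = std::sqrt(rx * rx + ry * ry + rz * rz);
            float overlap = large.radius + s.radius - d;
            if (overlap > 0.0f && d > 1e-6f)
            {
                // Accumulate equal and opposite repulsion on both particles.
                float c = stiffness * overlap / d;
                large.fx += c * rx; large.fy += c * ry; large.fz += c * rz;
                s.fx -= c * rx;     s.fy -= c * ry;     s.fz -= c * rz;
            }
        }
    }
}
```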