[Other] T&I Engine: Traversal and Intersection Engine for Hardware Accelerated Ray Traci

晃晃

Posted on 2011-12-28 10:24:54

T&I Engine: Traversal and Intersection Engine for Hardware Accelerated Ray Tracing

Jae-Ho Nah, Jeong-Soo Park, Chanmin Park†, Jin-Woo Kim, Yun-Hye Jung, Woo-Chan Park‡, Tack-Don Han

Yonsei University, Korea; †Samsung Electronics, Korea; ‡Sejong University, Korea

Abstract

Ray tracing naturally supports high-quality global illumination effects, but it is computationally costly. Traversal and intersection operations dominate the computation of ray tracing. To accelerate these two operations, we propose a hardware architecture integrating three novel approaches. First, we present an ordered depth-first layout and a traversal architecture using this layout to reduce the required memory bandwidth. Second, we propose a three-phase ray-triangle intersection architecture that takes advantage of early exit. Third, we propose a latency hiding architecture defined as the ray accumulation unit. Cycle-accurate simulation results indicate our architecture can achieve interactive distributed ray tracing.

CR Categories: Computer Graphics [I.3.7]: Computer Graphics - Three-Dimensional Graphics and Realism - Ray tracing

Keywords: ray tracing, ray tracing hardware, global illumination

1 Introduction

Ray tracing [Whitted 1980; Cook et al. 1984] is the most commonly-used algorithm for photorealistic rendering. Ray tracing generates a more realistic image than does rasterization, but it requires tremendous computational power for traversal and ray-primitive intersections. For this reason, it has been used for offline rendering for most of the last decade.

For real-time ray tracing, many approaches utilizing CPUs, GPUs, or custom hardware have recently been studied. These approaches do not yet provide sufficient performance for processing 1G rays/s for real-time distributed ray tracing [Govindaraju et al. 2008].

Most performance bottlenecks in ray tracing are in traversal and intersection tests [Benthin 2006]. Traversal is the process of searching an acceleration structure (AS), such as a kd-tree or bounding volume hierarchy (BVH), to find a small subset of the primitives to test against the ray. A ray-primitive intersection test determines the visibility of primitives found during the traversal.

We believe a dedicated hardware unit for traversal and the intersection test is a suitable solution for real-time distributed ray tracing. In this paper, we present a custom hardware architecture, called the T&I (traversal and intersection) engine. This architecture can be integrated with existing programmable shaders, as with raster operations pipelines (ROPs) or texture mapping units. It also comprises three novel approaches that are applicable to the traversal and intersection test processes.

First, an ordered depth-first layout (ODFL) and its traversal architecture are presented. The ODFL is an enhancement of an eight-byte kd-tree node layout [Pharr and Humphreys 2010]. It arranges the child node that has a larger surface area than its sibling adjacent to its parent to improve parent-child locality. We apply this layout to our traversal architecture to effectively reduce the miss rate of the traversal cache. The ODFL can also be easily applied to other CPU or GPU ray tracers. This concept was previously announced in the extended abstract [Nah et al. 2010].
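The ordering idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Node` class, the dictionary fields `far_child` and `near_is_left`, and the function names are hypothetical, and the real layout packs each node into eight bytes rather than a Python dictionary. The sketch linearizes a binary tree depth-first and always emits the child with the larger bounding-box surface area directly after its parent.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[Tuple[float, float, float], Tuple[float, float, float]]


def surface_area(box: Box) -> float:
    """Surface area of an axis-aligned box given as (min corner, max corner)."""
    (x0, y0, z0), (x1, y1, z1) = box
    dx, dy, dz = x1 - x0, y1 - y0, z1 - z0
    return 2.0 * (dx * dy + dy * dz + dz * dx)


@dataclass
class Node:
    """Hypothetical pointer-based tree node used only as input to the sketch."""
    name: str
    box: Box
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def ordered_depth_first(root: Node) -> List[dict]:
    """Linearize the tree so the larger-area child sits right after its parent."""
    out: List[dict] = []

    def emit(node: Node) -> int:
        index = len(out)
        entry = {"name": node.name, "far_child": None, "near_is_left": None}
        out.append(entry)
        if node.left is None or node.right is None:
            return index  # leaf: no children to place
        left_first = surface_area(node.left.box) >= surface_area(node.right.box)
        near, far = (node.left, node.right) if left_first else (node.right, node.left)
        entry["near_is_left"] = left_first  # one flag records which child is adjacent
        emit(near)                          # stored at index + 1, next to the parent
        entry["far_child"] = emit(far)      # the sibling needs an explicit index
        return index

    emit(root)
    return out
```

For a root whose right child is larger, the right subtree is stored immediately after the root and only the left child needs an explicit index. The intuition, stated here as an assumption rather than a result from the paper, is that a child with a larger surface area is more likely to be visited by a ray, so it benefits most from sharing a cache line with its parent.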

Second, we propose a three-phase intersection test unit, which divides the intersection test stage into three phases. Phase 1 is the ray-plane test, Phase 2 is the barycentric coordinate test, and Phase 3 is the final hit point calculation. This configuration reduces the need for further computation and memory requests for missed triangles that are identified in either Phase 1 or Phase 2. Phases 1 and 2 are performed in a common module because they use roughly the same arithmetic operations.
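The early-exit structure of the three phases can be illustrated with a software sketch. It assumes a generic plane test followed by a Möller-Trumbore style barycentric test; the excerpt does not give the exact arithmetic of the hardware unit, so the formulas, the function name `intersect_three_phase`, and the return convention are assumptions made for illustration.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect_three_phase(origin, direction, v0, v1, v2,
                          t_max=float("inf"), eps=1e-9):
    """Return (phase_reached, hit); hit is None if the triangle was rejected."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    normal = cross(e1, e2)

    # Phase 1: ray-plane test. Reject parallel rays and hits outside (eps, t_max).
    denom = dot(normal, direction)
    if abs(denom) < eps:
        return 1, None
    t = dot(normal, sub(v0, origin)) / denom
    if t <= eps or t >= t_max:
        return 1, None

    # Phase 2: barycentric coordinate test. Reject points outside the triangle.
    tvec = sub(origin, v0)
    det = -denom  # e1 . (direction x e2) equals -(normal . direction)
    u = dot(tvec, cross(direction, e2)) / det
    if u < 0.0 or u > 1.0:
        return 2, None
    v = dot(direction, cross(tvec, e1)) / det
    if v < 0.0 or u + v > 1.0:
        return 2, None

    # Phase 3: final hit point calculation, reached only by actual hits.
    point = tuple(origin[i] + t * direction[i] for i in range(3))
    return 3, {"t": t, "u": u, "v": v, "point": point}
```

A ray that misses the plane or lies behind the current closest hit returns after Phase 1, and a ray that hits the plane outside the triangle returns after Phase 2, so Phase 3 work is spent only on triangles that are actually hit.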

Third, a ray accumulation unit is proposed for hiding memory latency. This unit manages memory requests and accumulates rays that induce a cache miss. While the missed block is being fetched, other rays can perform their operations. When the missed block arrives, the accumulated rays are flushed to the pipeline.
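The behavior described above can be modeled with a small simulation. This is a sketch under simplifying assumptions, not the hardware design: the class name `RayAccumulationUnit`, its methods, the fixed memory latency, and the unbounded cache are all hypothetical. It shows rays that miss on the same block being parked together, a single memory request being issued per missed block, and the parked rays being released when the block arrives.

```python
from collections import defaultdict


class RayAccumulationUnit:
    """Toy model: park rays on a cache miss, flush them when the block arrives."""

    def __init__(self, memory_latency):
        self.memory_latency = memory_latency  # cycles per fetch (assumed constant)
        self.cache = set()                    # resident blocks (assumed unbounded)
        self.waiting = defaultdict(list)      # block -> ray ids parked on it
        self.in_flight = {}                   # block -> cycle at which it arrives
        self.memory_requests = 0

    def submit(self, ray_id, block, now):
        """Return True on a cache hit; otherwise park the ray and return False."""
        if block in self.cache:
            return True
        if block not in self.in_flight:
            # Only the first miss on a block issues a memory request.
            self.in_flight[block] = now + self.memory_latency
            self.memory_requests += 1
        self.waiting[block].append(ray_id)
        return False

    def tick(self, now):
        """Install blocks that have arrived and return the rays to flush."""
        arrived = [b for b, due in self.in_flight.items() if due <= now]
        flushed = []
        for block in arrived:
            del self.in_flight[block]
            self.cache.add(block)
            flushed.extend(self.waiting.pop(block, []))
        return flushed
```

In this model a caller keeps submitting other rays between `tick` calls, which corresponds to the pipeline staying busy while a fetch is outstanding.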

We verify the performance of our architecture with a cycle-accurate simulator and evaluate resource requirements and performance. We also perform a simulation with three types of rays that have different coherence. The proposed architecture achieves 44-1188 Mrays/s ray tracing performance at 500 MHz on a 65 nm process.

The remainder of this paper is structured as follows. Section 2 describes related work. Section 3 gives an overview of the proposed architecture. In Sections 4 to 6, we cover the details of our three approaches (a traversal unit with the ODFL, a three-phase intersection test unit, and a ray accumulation unit). In Section 7, we describe the experimental results of the proposed architecture simulation. Finally, we conclude the paper in Section 8.

2 Related Work

2.1 Dedicated ray tracing hardware

SaarCOR [Schmittler et al. 2004] is a ray tracing pipeline that consists of a ray generation/shading unit, a 4-wide SIMD traversal unit, a list unit, a transformation unit, and an intersection test unit. Woop et al. [2005] presented the programmable RPU architecture, which performs ray generation, shading, and intersection tests with programmable shaders. For dynamic scenes, D-RPU [Woop et al. 2006a; Woop 2007] adds a node update unit [Woop et al. 2006b], which RPU lacks. RTE [Davidovic et al. 2011] is an optimized version of D-RPU that uses tail recursive shaders with treelets.

For the full text, please download the attachment: