T&I Engine: Traversal and Intersection Engine for Hardware Accelerated Ray Tracing
Jae-Ho Nah Jeong-Soo Park Chanmin Park† Jin-Woo Kim Yun-Hye Jung Woo-Chan Park‡ Tack-Don Han
Yonsei University, Korea †Samsung Electronics, Korea ‡Sejong University, Korea
Abstract
Ray tracing naturally supports high-quality global illumination ef-
fects, but it is computationally costly. Traversal and intersection
operations dominate the computation of ray tracing. To accelerate
these two operations, we propose a hardware architecture integrat-
ing three novel approaches. First, we present an ordered depth-first
layout and a traversal architecture using this layout to reduce the
required memory bandwidth. Second, we propose a three-phase
ray-triangle intersection architecture that takes advantage of early
exit. Third, we propose a latency hiding architecture defined as the
ray accumulation unit. Cycle-accurate simulation results indicate
our architecture can achieve interactive distributed ray tracing.
CR Categories: Computer Graphics [I.3.7]: Computer
Graphics—Three-Dimensional Graphics and Realism–Ray tracing
Keywords: ray tracing, ray tracing hardware, global illumination
1 Introduction
Ray tracing [Whitted 1980; Cook et al. 1984] is the most
commonly used algorithm for photorealistic rendering. Ray tracing
generates a more realistic image than rasterization does, but
it requires tremendous computational power for traversal and ray-
primitive intersections. For this reason, it has been used for offline
rendering for most of the last decade.
For real-time ray tracing, many approaches utilizing CPUs, GPUs,
or custom hardware have recently been studied. These approaches
do not yet provide sufficient performance for processing 1G rays/s
for real-time distributed ray tracing [Govindaraju et al. 2008].
Most performance bottlenecks in ray tracing are in traversal and
intersection tests [Benthin 2006]. Traversal is the process of searching
an acceleration structure (AS), such as a kd-tree or bounding
volume hierarchy (BVH), to find a small subset of the primitives
for testing by the ray. A ray-primitive intersection test determines
the visibility of primitives found during the traversal.
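
The traversal step described above can be sketched in software as follows. This is a minimal illustration and not the paper's hardware design: it assumes a kd-tree, uses the well-known stack-based front-to-back traversal, and simply collects candidate primitives instead of testing them. The names `KdNode` and `traverse` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class KdNode:
    axis: int = -1                      # split axis (0, 1, 2); -1 marks a leaf
    split: float = 0.0                  # split plane position along `axis`
    below: Optional["KdNode"] = None    # child on the lower side of the plane
    above: Optional["KdNode"] = None    # child on the upper side of the plane
    prims: List[str] = field(default_factory=list)  # leaf payload


def traverse(root, orig, dirn, t_min, t_max):
    """Collect primitives from the leaves a ray visits, front to back."""
    candidates = []
    stack = [(root, t_min, t_max)]
    while stack:
        node, t0, t1 = stack.pop()
        while node.axis >= 0:           # descend until a leaf is reached
            o, d = orig[node.axis], dirn[node.axis]
            below_first = o < node.split or (o == node.split and d <= 0)
            if below_first:
                near, far = node.below, node.above
            else:
                near, far = node.above, node.below
            if d == 0.0:                # ray parallel to the split plane
                node = near
                continue
            t_split = (node.split - o) / d
            if t_split > t1 or t_split <= 0:
                node = near             # only the near child is crossed
            elif t_split < t0:
                node = far              # only the far child is crossed
            else:
                stack.append((far, t_split, t1))  # visit the far child later
                node, t1 = near, t_split
        candidates.extend(node.prims)
    return candidates
```

A real tracer would run the ray-primitive intersection test on each leaf as it is reached and stop as soon as a hit lies inside the current interval, rather than gathering every candidate.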
We believe a dedicated hardware unit for traversal and the intersec-
tion test is a suitable solution for real-time distributed ray tracing.
In this paper, we present a custom hardware architecture, called
T&I (traversal and intersection) engine. This architecture can be
integrated with existing programmable shaders, as with raster op-
erations pipelines (ROps) or texture mapping units. Also, it com-
prises three novel approaches that are applicable to the traversal and
intersection test processes.
First, an ordered depth-first layout (ODFL) and its traversal
architecture are presented. The ODFL is an enhancement of an
eight-byte kd-tree node layout [Pharr and Humphreys 2010]. It places
whichever child node has the larger surface area adjacent to its
parent to improve parent-child locality. We apply this
layout to our traversal architecture to effectively reduce the miss
rate of the traversal cache. The ODFL can also be easily applied
to other CPU or GPU ray tracers. This concept was previously
introduced in an extended abstract [Nah et al. 2010].
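
The ordering rule can be illustrated with a short sketch. It assumes each node carries an axis-aligned bounding box from which a surface area can be computed, and that interior nodes always have two children; the names `Node`, `surface_area`, and `ordered_depth_first` are hypothetical, and the actual eight-byte node encoding is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    name: str
    lo: Vec3                          # minimum corner of the node's box
    hi: Vec3                          # maximum corner of the node's box
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def surface_area(node):
    dx, dy, dz = (h - l for l, h in zip(node.lo, node.hi))
    return 2.0 * (dx * dy + dy * dz + dz * dx)


def ordered_depth_first(root):
    """Return nodes in array order: the larger-area child follows its parent."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        if node.left is None or node.right is None:   # leaf
            continue
        adjacent, other = node.left, node.right
        if surface_area(other) > surface_area(adjacent):
            adjacent, other = other, adjacent
        stack.append(other)       # emitted after the adjacent child's subtree
        stack.append(adjacent)    # popped next, so it sits right after the parent
    return order
```

The intuition behind the rule is that a child with a larger surface area is more likely to be hit by a ray, so placing it next to its parent makes the most probable next access fall in nearby memory.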
Second, we propose a three-phase intersection test unit, which
divides the intersection test stage into three phases. Phase 1 is the
ray-plane test, Phase 2 is the barycentric coordinate test, and Phase
3 is the final hit point calculation. This configuration reduces the
need for further computation and memory requests for missed
triangles that are identified in either Phase 1 or Phase 2. Phases 1
and 2 are performed in a common module because they use roughly the
same arithmetic operations.
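
A software sketch of the three phases with early exit is given below. The exact arithmetic of the hardware unit is not stated in this excerpt, so the sketch assumes a generic plane-then-barycentric formulation; the function name `intersect`, its parameters, and the tolerance values are hypothetical.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect(orig, dirn, v0, v1, v2, t_min=1e-6, t_max=float("inf")):
    """Return (phase_reached, hit); hit is None when the ray misses."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    n = cross(e1, e2)                       # plane normal (not normalized)

    # Phase 1: ray-plane test. Exit early if parallel or out of range.
    denom = dot(n, dirn)
    if abs(denom) < 1e-12:
        return 1, None
    t = dot(n, sub(v0, orig)) / denom
    if not (t_min < t < t_max):
        return 1, None

    # Phase 2: barycentric coordinate test. Exit early if outside the triangle.
    w = tuple(orig[i] + t * dirn[i] - v0[i] for i in range(3))
    nn = dot(n, n)
    u = dot(cross(w, e2), n) / nn
    v = dot(cross(e1, w), n) / nn
    if u < 0.0 or v < 0.0 or u + v > 1.0:
        return 2, None

    # Phase 3: final hit point calculation, reached only by actual hits.
    point = tuple(orig[i] + t * dirn[i] for i in range(3))
    return 3, {"t": t, "u": u, "v": v, "point": point}
```

Returning the phase reached makes the early exits visible: a ray that misses the plane's valid range stops in Phase 1, and a ray that hits the plane outside the triangle stops in Phase 2, so neither pays for the final stage.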
Third, a ray accumulation unit is proposed for hiding memory
latency. This unit manages memory requests and accumulates rays
that incur a cache miss. While the missed block is being fetched,
other rays can perform their operations. Once the missed block
arrives, the accumulated rays are flushed to the pipeline.
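
The behaviour described above can be modelled with a small sketch. It is a simplified software analogy under stated assumptions: the cache is an unbounded set with no eviction, the accumulation buffer has no capacity limit, and the class and method names (`RayAccumulator`, `access`, `fill`) are hypothetical rather than taken from the paper.

```python
class RayAccumulator:
    """Toy model: park rays that miss the cache until their block arrives."""

    def __init__(self):
        self.cache = set()     # blocks currently resident (no eviction modelled)
        self.waiting = {}      # missed block -> rays accumulated for it
        self.requests = []     # memory requests issued, in order

    def access(self, ray, block):
        """Return True on a cache hit; otherwise accumulate the ray."""
        if block in self.cache:
            return True
        if block not in self.waiting:
            # First miss on this block: issue exactly one memory request.
            self.waiting[block] = []
            self.requests.append(block)
        self.waiting[block].append(ray)
        return False

    def fill(self, block):
        """Block fetched: mark it resident and flush its accumulated rays."""
        self.cache.add(block)
        return self.waiting.pop(block, [])
```

In this model, rays that miss on the same block share a single memory request, and the pipeline stays free to process other rays until `fill` hands the accumulated rays back.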
We verify the performance of our architecture with a cycle-accurate
simulator and evaluate resource requirements and performance. We
also perform a simulation with three types of rays that differ in
coherence. The proposed architecture achieves 44-1188 Mrays/s
ray tracing performance at 500 MHz on a 65 nm process.
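
To put these figures in per-cycle terms, the short calculation below divides the reported throughput by the clock frequency. It only rearranges the numbers quoted in the text; the helper name `rays_per_cycle` is hypothetical.

```python
CLOCK_HZ = 500e6            # 500 MHz, as stated in the text
TARGET_RAYS_PER_S = 1e9     # the 1G rays/s goal cited in the introduction


def rays_per_cycle(mrays_per_s, clock_hz=CLOCK_HZ):
    """Convert a throughput in Mrays/s into rays completed per clock cycle."""
    return mrays_per_s * 1e6 / clock_hz


low = rays_per_cycle(44)      # roughly 0.09 rays per cycle
high = rays_per_cycle(1188)   # roughly 2.4 rays per cycle
meets_target = 1188 * 1e6 >= TARGET_RAYS_PER_S
```

Only the upper end of the reported range exceeds the 1G rays/s goal; the excerpt does not say which of the three ray types corresponds to which end of the range.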
The remainder of this paper is structured as follows. Section 2
describes related work. Section 3 gives an overview of the proposed
architecture. In Sections 4 to 6, we cover the details of our three
approaches (a traversal unit with the ODFL, a three-phase inter-
section test unit, and a ray accumulation unit). In Section 7, we
describe the experimental results of the proposed architecture sim-
ulation. Finally, we conclude the paper in Section 8.
2 Related Work
2.1 Dedicated ray tracing hardware
SaarCOR [Schmittler et al. 2004] is a ray tracing pipeline that con-
sists of a ray generation/shading unit, a 4-wide SIMD traversal unit,
a list unit, a transformation unit, and an intersection test unit. Woop
et al. [2005] presented the programmable RPU architecture, which
performs ray generation, shading, and intersection tests with
programmable shaders. To support dynamic scenes, D-RPU [Woop et al.
2006a; Woop 2007] adds a node update unit [Woop et al. 2006b],
which RPU lacks. RTE [Davidovic et al. 2011] is an optimized version
of D-RPU that uses tail recursive shaders with treelets.
For the full text, please download the attachment: