A quick look at Apple Silicon's hardware-accelerated ray tracing performance

To develop my main ray tracing renderer project, I have been working on a MacBook Air with an M2 chip. The renderer’s performance on the laptop has not been great, with frame rates below one frame per second at fullscreen resolutions.

The M3 chip was introduced fairly recently (half a year ago at the time of writing) with hardware support for ray tracing. Anecdotally, I have seen reports of 100% improvements in performance. In this post, a simple ray tracing shader is executed on four different MacBooks to gauge what sort of ray tracing performance boost can be obtained in a home-made renderer.

The benchmark

Shader

To try to isolate the ray tracing activity, the benchmark program is a simple Metal renderer which casts only one ray per pixel per frame. The render pass draws a fullscreen quad, and the fragment shader casts a single ray into the scene for each pixel. The albedo texture color at the ray’s intersection with the scene is used as the color output. All the textures are stored in a single buffer, and the texture lookup reads a single pixel from the texture at the UV coordinate without any filtering.

Here is the gist of the fragment shader.

#include <metal_stdlib>
#include <metal_geometric>
#include <metal_raytracing>

using namespace metal;

// Each triangle is associated with a `PrimitiveData` entry.
// These can be looked up using the intersection's `primitive_id`.
struct PrimitiveData {
    float2 uv0;
    float2 uv1;
    float2 uv2;
    TextureDescriptor textureDescriptor;
};

// ...

half4 fragment fragmentMain( VertexOutput in [[stage_in]],
                    constant const Uniforms& uniforms [[buffer(0)]],
                    raytracing::acceleration_structure<> accelerationStructure [[buffer(1)]],
                    device const uint32_t* textureData [[buffer(2)]],
                    device const PrimitiveData* primitiveData [[buffer(3)]] )
{
    half4 color = half4(0.0, 0.0, 0.0, 1.0);
    raytracing::intersector<raytracing::triangle_data> intersector;
    const raytracing::ray ray = generateCameraRay(uniforms.camera, in.uv.x, 1.0 - in.uv.y);
    typename raytracing::intersector<raytracing::triangle_data>::result_type
                                        intersection = intersector.intersect(ray, accelerationStructure);
    if (intersection.type == raytracing::intersection_type::triangle) {
        const uint32_t primitiveIdx = intersection.primitive_id;
        device const PrimitiveData& primitive = primitiveData[primitiveIdx];
        const float2 uv0 = primitive.uv0;
        const float2 uv1 = primitive.uv1;
        const float2 uv2 = primitive.uv2;
        const float2 barycentricCoord = intersection.triangle_barycentric_coord;
        const float2 uv = uv0 * (1.0 - barycentricCoord.x - barycentricCoord.y) +
                                              uv1 * barycentricCoord.x +
                                              uv2 * barycentricCoord.y;
        const half3 rgb = textureLookup(textureData, primitive.textureDescriptor, uv);
        color = half4(rgb, 1.0);
    }
    return color;
}
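The listing above does not show `TextureDescriptor` or `textureLookup`. As a rough illustration of the unfiltered lookup described earlier, here is a CPU-side C++ sketch. The descriptor layout (`width`, `height`, `offset`), the 0xAABBGGRR pixel packing, the repeat-style UV wrapping, and the `Float2` helper type are all my assumptions for the example, not the benchmark's actual code.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>
#include <vector>

struct Float2 { float x, y; };

// Hypothetical layout: where one texture lives inside the shared pixel buffer.
struct TextureDescriptor {
    uint32_t width;   // assumed to be non-zero
    uint32_t height;  // assumed to be non-zero
    uint32_t offset;  // index of the texture's first pixel in the buffer
};

// Blend the three per-vertex UVs with the two barycentric weights of the hit,
// mirroring the interpolation done in the fragment shader.
Float2 interpolateUv(Float2 uv0, Float2 uv1, Float2 uv2, Float2 bary) {
    const float w0 = 1.0f - bary.x - bary.y;
    return { uv0.x * w0 + uv1.x * bary.x + uv2.x * bary.y,
             uv0.y * w0 + uv1.y * bary.x + uv2.y * bary.y };
}

// Read exactly one pixel (nearest neighbor, no filtering) and return RGB in [0, 1].
std::array<float, 3> textureLookup(const std::vector<uint32_t>& textureData,
                                   const TextureDescriptor& desc, Float2 uv) {
    // Wrap the UV into [0, 1); repeat addressing is an assumption.
    const float u = uv.x - std::floor(uv.x);
    const float v = uv.y - std::floor(uv.y);
    const uint32_t x = std::min(static_cast<uint32_t>(u * desc.width), desc.width - 1);
    const uint32_t y = std::min(static_cast<uint32_t>(v * desc.height), desc.height - 1);
    // Assumed packing: 0xAABBGGRR, one 32-bit word per pixel, rows stored contiguously.
    const uint32_t pixel = textureData[desc.offset + y * desc.width + x];
    return { static_cast<float>(pixel & 0xFFu) / 255.0f,
             static_cast<float>((pixel >> 8) & 0xFFu) / 255.0f,
             static_cast<float>((pixel >> 16) & 0xFFu) / 255.0f };
}
```

Keeping every texture in one buffer means the shader only needs a single binding, at the cost of doing the addressing arithmetic by hand.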

Measurement

The fragment stage duration is measured in milliseconds using GPU counters. The method described in “Converting GPU Timestamps into CPU Time” is used to normalize the GPU timestamps into CPU time.

The median fragment stage time from the latest 100 frames is displayed in the UI.

Scene

The benchmark scene in use is the well-known Crytek Sponza scene, with 262267 triangles. The glTF model was sourced from the Khronos Group’s glTF sample assets repository.

[Image: m2-benchmark]

Results

Occasional spikes in fragment stage duration were observed. The number was read from the UI once the frame rate had settled for roughly 100 frames.

Hardware    Median fragment stage duration
M2          25.20 ms (36.69 FPS)
M3          12.66 ms (79.01 FPS)
M1 Pro      17.56 ms (56.93 FPS)
M3 Pro      10 ms to 5 ms (100 to 200 FPS)

The fragment stage duration never settled to a stable number on the M3 Pro, and the results reported for the last 100 frames were difficult to read. The range reported is roughly equal to the lowest and highest frame rates that were visible. The benchmark ran on a colleague's computer, and there may have been something running in the background. Alternatively, the normalization of the GPU timestamps may not have been working exactly correctly, and fluctuations in core frequency might yield inaccurate timestamps. Nevertheless, the upper limit of the fragment stage duration on the M3 Pro (10 ms) is a similar multiple (~2x) away from the baseline M2 performance. The 5 ms duration should probably be taken with a grain of salt.

In summary, it seems that the anecdotal evidence was accurate! Roughly 2x performance gains can be reaped by switching to an M3 chip, assuming that the ray tracer is not running divergent workloads in a loop in the shader.
