Testing wavefront ray tracing across Apple's M-series

Books, articles, and blog posts present the wavefront renderer architecture as a way to improve ray tracing performance on GPUs. In practice, it’s difficult to find information on whether I should expect performance improvements for my simple, monolithic-kernel hobby renderer on Apple’s M1, M2, or M3 chips—the latter having hardware-accelerated ray tracing. On Apple hardware, would switching to a wavefront architecture yield better performance?

There’s only one way to find out whether the wavefront architecture helps: write both a monolithic and a wavefront ray tracer with identical features and measure their performance across different devices. The results were unexpected: the wavefront renderer isn’t a clear winner, and performance varies significantly between chips. It turns out that for simple rendering use cases, a monolithic renderer is actually a better choice on the latest Apple hardware.

What is a wavefront ray tracer?

A monolithic ray tracer uses a for-loop to calculate the path light takes through a scene, bouncing off multiple surfaces. The following pseudocode summarizes my monolithic ray tracer, which is simple—it supports just one material (with Lambertian reflectance) and samples only one light source: the Sun.

def path_trace(primary_ray, scene_geometry, sky_model, sun_direction):
    radiance = 0.0
    throughput = 1.0
    current_ray = primary_ray
    
    for bounce in range(MAX_BOUNCES):
        intersection = intersect(current_ray, scene_geometry)
        
        if intersection.hit:
            # Intersection data
            surface_point = intersection.position
            surface_normal = intersection.normal
            albedo = intersection.color

            # Direct lighting (sun sampling)
            light_direction = sample_sun_disk(sun_direction, SOLAR_RADIUS)
            cosine = dot(surface_normal, light_direction)
            if cosine > 0 and is_shadow_ray_clear(scene_geometry, surface_point, light_direction):
                sun_radiance = evaluate_sky_model(light_direction, sun_direction)
                reflectance = cosine * albedo / π
                pdf = (pdf_sun_disk(light_direction)
                       + pdf_cosine_hemisphere(light_direction))
                radiance += throughput * sun_radiance * reflectance / pdf

            # Indirect lighting (sampling lambertian reflectance)
            new_direction = sample_cosine_hemisphere(surface_normal)
            current_ray = Ray(surface_point, new_direction)
            cosine = dot(surface_normal, new_direction)
            reflectance = cosine * albedo / π
            pdf = pdf_sun_disk(new_direction) + pdf_cosine_hemisphere(new_direction)
            throughput *= reflectance / pdf
        else:
            # Ray hits sky
            sky_radiance = evaluate_sky_model(current_ray.direction, sun_direction)
            radiance += throughput * sky_radiance
            break
    
    return radiance

On a GPU, this for-loop, combined with branching execution during intersection testing, lighting, and material evaluation, can be very inefficient. Physically Based Rendering provides an excellent overview of GPU architecture and the challenges of running a naive monolithic kernel on it.

In a wavefront architecture, the monolithic ray tracer’s loop body becomes its own kernel. Instead of containing a loop, the wavefront kernel is invoked from within a loop. The branches for material and light evaluation can be moved into separate kernels as well. State is transferred between shader invocations by reading and writing to buffers. Jacco Bikker’s article provides a great overview of implementing a wavefront renderer.

The wavefront renderer used in this post is an interpretation of the renderer presented in Bikker’s article, using this subset of phases:

  1. Ray generation
    • Appends primary rays into the ray buffer
  2. Extension
    • Intersects the rays with the scene. My adaptation: appends the next ray into the ray buffer
  3. Connection
    • Intersects the shadow ray with the scene

Material queues are skipped since the renderer supports only one material. Shading is moved into the extension phase to reduce memory transfers.
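
To make the control flow concrete, here’s a sketch of how the host side could chain these phases for one sample per pixel. On the GPU, each inner loop is a compute dispatch over a buffer; the plain loops and the generate_primary_rays helper below are illustrative assumptions, not the renderer’s actual code.

# Host-side wavefront loop (illustrative sketch)

def render_sample(scene_geometry, sky_model, sun_direction, paths):
    # Ray generation: one primary ray per pixel is appended to the ray buffer.
    ray_buffer = generate_primary_rays(paths)

    for bounce in range(MAX_BOUNCES):
        if not ray_buffer:
            break  # every path has terminated

        # Extension: intersect, shade, and append the next rays and shadow rays.
        next_rays, shadow_rays = [], []
        for ray in ray_buffer:
            extend_rays(ray, scene_geometry, sky_model, sun_direction,
                        next_rays, shadow_rays, paths)

        # Connection: resolve shadow rays and accumulate direct lighting.
        for shadow_ray in shadow_rays:
            connect_shadow_rays(shadow_ray, scene_geometry, paths)

        # The appended extension rays feed the next iteration.
        ray_buffer = next_rays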

Here’s pseudocode for the ray extension and connection shaders. The ray generation shader isn’t shown in pseudocode, as it simply appends primary rays to a buffer.

# Extension phase

def extend_rays(input_ray,
                scene_geometry,
                sky_model,
                sun_direction,
                output_rays,
                shadow_rays,
                paths):
    path = paths[input_ray.path_index]

    intersection = intersect(input_ray, scene_geometry)
    if intersection.hit:
        # Intersection data
        surface_point = intersection.position
        surface_normal = intersection.normal
        albedo = intersection.material.color

        # Generate shadow ray for direct lighting
        light_direction = sample_sun_disk(sun_direction, SOLAR_RADIUS)
        cosine = dot(surface_normal, light_direction)
        if cosine > 0:
            reflectance = cosine * albedo / π
            sun_radiance = evaluate_sky_model(light_direction, sun_direction)          
            pdf = pdf_sun_disk(light_direction) + pdf_cosine_hemisphere(light_direction)
            shadow_rays.append(ShadowRay(
                surface_point,
                light_direction,
                path.throughput * reflectance * sun_radiance / pdf,
                input_ray.path_index
            ))

        # Generate indirect ray
        new_direction = sample_cosine_hemisphere(surface_normal)
        cosine = dot(surface_normal, new_direction)
        reflectance = cosine * albedo / π
        pdf = pdf_sun_disk(new_direction) + pdf_cosine_hemisphere(new_direction)
        path.throughput *= reflectance / pdf
  
        output_rays.append(Ray(
            surface_point,
            new_direction,
            input_ray.path_index
        ))            
    else:
        # Ray hits sky - accumulate radiance and terminate path
        sky_radiance = evaluate_sky_model(input_ray.direction, sun_direction)
        path.radiance += path.throughput * sky_radiance

# Connection phase

def connect_shadow_rays(shadow_ray, scene_geometry, paths):
    origin = shadow_ray.origin
    direction = shadow_ray.direction

    if is_shadow_ray_clear(scene_geometry, origin, direction):
        paths[shadow_ray.path_index].radiance += shadow_ray.radiance

Time measurements

Both the monolithic and wavefront renderers execute on the GPU using compute shaders. One sample per pixel is gathered in a single compute pass. GPU execution time is measured using Metal’s timestamp counter, sampling at the beginning and end of the compute pass:

// Sample GPU timestamps into slots 0 and 1 of the counter sample buffer
// at the start and end of the compute encoder, respectively.
compute_sample_buffer_attachment.sampleBuffer = mtl->compute_timestamp_sample_buffer;
compute_sample_buffer_attachment.startOfEncoderSampleIndex = 0;
compute_sample_buffer_attachment.endOfEncoderSampleIndex = 1;

Compute execution timings are gathered by rendering the following image at a fixed resolution across different devices:

(Image: Cornell box test render)

The Cornell box scene consists of 128 × 128 × 128 voxels. The generated mesh is not optimized, making the scene a reasonably complex test case despite its simple appearance.

Performance

Performance was measured on the following devices:

  • M1 Pro (MacBook Pro)
  • M2 (MacBook Air)
  • M3 (iPad Air)

Here are the timings (lower is better):

The plot shows distinct performance clusters for each M-series processor:

  1. M3: Fastest overall—monolithic (~38ms) beats wavefront (~72ms)
  2. M1 Pro: Wavefront (~141ms) outperforms monolithic (~167ms)
  3. M2: Slowest overall, with wavefront significantly faster than monolithic

It’s remarkable that an iPad beats a MacBook Pro. However, the M3’s strong performance isn’t entirely surprising given its hardware-accelerated ray tracing capabilities.

One way to understand why the wavefront renderer may outperform the monolithic renderer is by examining GPU traces.

GPU trace of monolithic execution on the M2

(Image: M2 monolithic kernel GPU trace)

GPU trace of wavefront execution on the M2

(Image: M2 wavefront kernel GPU trace)

Two metrics from the GPU performance traces stand out:

  • Kernel ALU Inefficiency: Percentage of ALU instructions predicated out due to divergent control flow or partial SIMD groups (when the threadgroup size isn’t a multiple of the SIMD-group size); see the worked example after this list
  • Kernel Arithmetic Intensity: ALU operations per byte of memory traffic
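
As a rough, hypothetical illustration of the partial-SIMD-group contribution to ALU inefficiency (the divergence contribution depends on the workload), with assumed sizes:

# Illustrative only: inefficiency from a partial SIMD group, assumed sizes
SIMD_WIDTH = 32          # lanes per SIMD group
threadgroup_size = 100   # not a multiple of the SIMD width

simd_groups = -(-threadgroup_size // SIMD_WIDTH)   # ceil(100 / 32) = 4
issued_lanes = simd_groups * SIMD_WIDTH            # 128 lanes issued
active_lanes = threadgroup_size                    # 100 lanes do useful work
inefficiency = 1.0 - active_lanes / issued_lanes
print(f"ALU inefficiency from the partial group alone: {inefficiency:.1%}")  # 21.9%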

On the M2:

  • Monolithic kernel: Higher ALU inefficiency but better arithmetic intensity
  • Wavefront kernel: Lower ALU inefficiency but reduced arithmetic intensity

These numbers are expected. The monolithic kernel’s ALU inefficiency grows as more loop iterations execute and each thread’s state diverges from its neighbors’. The wavefront kernel, in turn, must read its inputs from memory and write intermediate results back out on every iteration, which lowers its arithmetic intensity. Reading and writing memory doesn’t necessarily hurt performance as long as there are plenty of threads to schedule, but on mobile hardware, excessive memory traffic drains the battery.
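
For a sense of scale, here’s a back-of-the-envelope estimate of that per-bounce buffer traffic. The struct sizes and the 1080p resolution are assumptions for illustration, not values taken from the renderer, and the estimate ignores the connection pass reading the shadow rays back.

# Rough estimate of wavefront buffer traffic per bounce (assumed sizes)
ray_bytes        = 12 + 12 + 4        # origin + direction + path index
shadow_ray_bytes = 12 + 12 + 12 + 4   # origin + direction + radiance + path index
path_bytes       = 12 + 12            # throughput + radiance

pixels = 1920 * 1080
bytes_per_path = (ray_bytes            # read the incoming ray
                  + ray_bytes          # write the extension ray
                  + shadow_ray_bytes   # write the shadow ray
                  + 2 * path_bytes)    # read and update the path state
print(f"~{pixels * bytes_per_path / 2**20:.0f} MiB per bounce")  # ~285 MiB

State that stays in registers for the monolithic kernel becomes buffer traffic in the wavefront renderer, which is why its arithmetic intensity drops.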

A performance counter comparison with the M3 would be interesting, but unfortunately the GPU trace for the M3 appears broken: arithmetic intensity was nowhere to be found, and ALU inefficiency was zero across all captured traces. I’m not sure whether this is user error or whether these metrics simply can’t be captured from an iPad.

Without matching numbers, it’s difficult to say why the monolithic renderer outperforms wavefront on the M3, but it’s clear that hardware-accelerated ray tracing is significantly reducing the impact that intersection testing has on the monolithic kernel.

Which renderer to choose?

For my hobby renderer, the choice is clear. The renderer is intended to run only on Apple’s M-series processors, and when an M1 or M2 chip is detected, the wavefront renderer is chosen. Otherwise, the monolithic renderer is used.
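
Expressed as code, the selection is just a branch on the detected chip; the names below (detect_chip_family, WavefrontRenderer, MonolithicRenderer) are hypothetical stand-ins, not the renderer’s actual API.

def choose_renderer(device):
    # M1 and M2 favour the wavefront path; M3, with hardware ray tracing,
    # runs the monolithic kernel faster in these measurements.
    if detect_chip_family(device) in ("M1", "M2"):
        return WavefrontRenderer(device)
    return MonolithicRenderer(device)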

A renderer with the feature set presented here appears to sit right on the borderline of benefiting from a wavefront architecture. The more features the renderer gains, the more divergence will occur during execution, giving an increasing edge to the wavefront design. When more materials and lights are added in the future, performance will need to be re-evaluated.
