GPU Rendering | 🌵 Federico Forti (fe0437)

In this tutorial we hand the scene’s draw calls to the GPU itself, removing all per-object CPU work from the render loop:

Pack all scene textures into a single MTLHeap so they can be declared resident in one call
Build argument buffers so shaders can index any mesh or texture by integer ID
Run a compute kernel that reads the argument buffers and writes draw commands into an MTLIndirectCommandBuffer
Each frame: the CPU submits one indirect draw; the GPU executes all scene draw calls itself

Prerequisite: Make sure you have completed Tutorial 5 — Tiled Rendering.

⚙️ Setting up the project

In MTMetalTutorialsApp.swift, set the root view to MT6ContentView() and run the app:

@main
struct MetalTutorialsApp: App {
    var body: some Scene {
        WindowGroup {
            MT6ContentView()
        }
    }
}

🔗 Metal API in this tutorial

Object	Docs	Scope	Role
`MTLHeap`	↗	Scene lifetime	Arena allocator; sub-allocates textures/buffers from one large block
`MTLHeapDescriptor`	↗	Creation only	Configures heap size, storage mode, type
`MTLArgumentEncoder`	↗	Scene/frame setup	Fills resource handles (textures, buffers) into argument buffer `MTLBuffer`s
Argument buffers	↗	Pass	A `MTLBuffer` containing resource handles a shader can index into
`MTLIndirectCommandBuffer`	↗	Scene/frame	GPU-writable array of draw/dispatch commands
`MTLIndirectCommandBufferDescriptor`	↗	Creation only	Configures command types, buffer bind counts
`MTLComputeCommandEncoder`	↗	Per compute pass	Records compute kernel dispatches into the command buffer
`MTLComputePipelineState`	↗	Scene lifetime	Compiled compute shader (kernel function)
`MTLBlitCommandEncoder`	↗	Asset loading	Records copy/fill commands (e.g., staging → heap textures)
`useHeap(_:)`	↗	Per encoder	Declares heap residency so all sub-resources are accessible

Scene separation: `MT6Scene`

Tutorial 6 introduces the MT6Scene class — a design pattern that separates scene data from renderer logic. This approach makes it easier to share the same scene across multiple renderers (e.g., forward vs. deferred) without reloading assets.

classDiagram
    class MT6Scene {
        +staticGPUHeap: MTLHeap
        +mtkMeshes: [MTKMesh]
        +init(device, commandQueue)
    }
    class MT6DeferredRenderer {
        -meshesBuffer: MTLBuffer
        -shadowsArgBuffer: MTLBuffer
        -indirectCommandBuffer: MTLIndirectCommandBuffer
        +draw(in: MTKView)
    }
    class IndirectMesh["MT6::Indirect::Mesh"] {
        +vertexBuffer: constant float*
        +texCoordsBuffer: constant float*
        +indexBuffer: constant uint*
        +materialArgBuffer: constant MaterialArgument*
    }
    class MaterialArgument {
        +baseColorTexture: texture2d
        +specularTexture: texture2d
        +normalTexture: texture2d
    }
    MT6DeferredRenderer --> MT6Scene : reads heap + meshes
    MT6DeferredRenderer --> IndirectMesh : encodes meshesBuffer
    IndirectMesh --> MaterialArgument : materialArgBuffer
    MT6Scene --> MTLHeap : staticGPUHeap holds textures

class MT6Scene {
    let staticGPUHeap: MTLHeap   // all scene textures packed here
    let mtkMeshes: [MTKMesh]     // CPU-side mesh data

    init(device: MTLDevice, commandQueue: MTLCommandQueue) {
        // 1. Load USD meshes into mtkMeshes
        // 2. Pack textures into staticGPUHeap
    }
}

The renderer (not MT6Scene) owns and encodes the argument buffers — this keeps scene data separate from the GPU-driven draw machinery.

📦 USD asset loading

With the scene class structure clear, let’s walk through what happens inside MT6Scene.init. The first step is loading the 3D asset. Tutorial 6 switches from OBJ to USD (Universal Scene Description) to get embedded material and texture data:

let assetURL = Bundle.main.url(forResource: "toy_biplane_idle", withExtension: "usdz")!
let allocator = MTKMeshBufferAllocator(device: device)
let asset = MDLAsset(url: assetURL,
                     vertexDescriptor: vertexDescriptor,
                     bufferAllocator: allocator)
let (mdlMeshes, mtkMeshes) = try! MTKMesh.newMeshes(asset: asset, device: device)

USD supports hierarchical scene graphs with materials, so textures are embedded:

// Read the material of the mesh's first submesh to find its textures
if let submesh = mdlMesh.submeshes?.first as? MDLSubmesh,
   let material = submesh.material {

    let baseColor = material.property(with: .baseColor)
    if let url = baseColor?.urlValue {
        // load texture from URL
    }

    let normalMap = material.property(with: .tangentSpaceNormal)
    let specular  = material.property(with: .specular)
}

The vertex descriptor for Tutorial 6 adds tangent + bitangent to support normal mapping:

// Position  — float3  @ attribute 0
// Normal    — float3  @ attribute 1
// Tangent   — float3  @ attribute 2
// Bitangent — float3  @ attribute 3
// TexCoords — float2  @ attribute 4
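The attribute list above could be expressed as a Model I/O vertex descriptor roughly as follows. This is a minimal sketch, not the tutorial's actual descriptor: the function name makeVertexDescriptorSketch is hypothetical, and the layout (four tightly packed float3 attributes interleaved in buffer 0, texture coordinates alone in buffer 1) is an assumption based on the buffer index comments later in this tutorial.

```swift
import ModelIO

// Hypothetical helper: builds a descriptor matching the attribute table above.
// Assumption: buffer 0 interleaves four packed float3 attributes,
// buffer 1 holds only float2 texture coordinates.
func makeVertexDescriptorSketch() -> MDLVertexDescriptor {
    let descriptor = MDLVertexDescriptor()
    let float3Size = 12 // three 4-byte floats, tightly packed

    let interleaved = [
        MDLVertexAttributePosition,   // attribute 0
        MDLVertexAttributeNormal,     // attribute 1
        MDLVertexAttributeTangent,    // attribute 2
        MDLVertexAttributeBitangent,  // attribute 3
    ]
    for (slot, name) in interleaved.enumerated() {
        descriptor.attributes[slot] = MDLVertexAttribute(
            name: name,
            format: .float3,
            offset: slot * float3Size,
            bufferIndex: 0)
    }

    // Attribute 4: texture coordinates in their own buffer
    descriptor.attributes[4] = MDLVertexAttribute(
        name: MDLVertexAttributeTextureCoordinate,
        format: .float2,
        offset: 0,
        bufferIndex: 1)

    descriptor.layouts[0] = MDLVertexBufferLayout(stride: interleaved.count * float3Size)
    descriptor.layouts[1] = MDLVertexBufferLayout(stride: 8) // two 4-byte floats
    return descriptor
}
```

A descriptor like this would be passed as the vertexDescriptor argument of MDLAsset(url:vertexDescriptor:bufferAllocator:) so that Model I/O rearranges the USD vertex data into the expected layout.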

MTLHeap — packing all textures into one allocation

With meshes loaded, we pack all their textures into a single MTLHeap. This lets us make every scene texture resident with a single useHeap call instead of tracking each texture individually, which matters because Metal does not automatically track resources that shaders reach through argument buffers.

MTLHeap — GPU memory arena. A heap is a contiguous block of GPU memory from which you sub-allocate MTLTexture and MTLBuffer objects using heap.makeTexture(descriptor:) and heap.makeBuffer(length:options:). All sub-resources share the heap’s single residency tracking entry, which is the key performance benefit: instead of making N individual textures resident with N useResource calls, one useHeap call covers all of them. Heaps are not strictly required for resources accessed indirectly through argument buffers, but they fit that case well: Metal’s resource tracking doesn’t follow pointer indirections, so each such resource must be declared resident explicitly, either individually with useResource or, for everything in a heap, with a single useHeap call.

To size the heap correctly, call device.heapTextureSizeAndAlign(descriptor:) for each texture. The alignment value is crucial: sub-allocations must be placed at aligned offsets, so the total heap size is the sum of sizeAlign.size rounded up to sizeAlign.align for each resource. Using MTLHeapDescriptor, set storageMode = .private for GPU-only textures and leave type = .automatic (the default) so that Metal chooses each resource's offset inside the heap; a .placement heap would instead require you to supply those offsets yourself.
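The rounding rule can be checked without a GPU. The sketch below is illustrative only: SizeAndAlign is a hypothetical stand-in for Metal's MTLSizeAndAlign, the helper names are made up for this example, and the numbers are invented rather than measured from the biplane asset.

```swift
// Hypothetical stand-in for MTLSizeAndAlign so the sketch runs without Metal.
struct SizeAndAlign {
    let size: Int
    let align: Int
}

/// Rounds `value` up to the next multiple of `alignment` (a power of two).
func alignUp(_ value: Int, to alignment: Int) -> Int {
    precondition(alignment > 0 && (alignment & (alignment - 1)) == 0,
                 "alignment must be a power of two")
    return (value + alignment - 1) & ~(alignment - 1)
}

/// Sums each resource's size after rounding it up to that resource's alignment.
func heapSize(for requirements: [SizeAndAlign]) -> Int {
    requirements.reduce(0) { total, r in total + alignUp(r.size, to: r.align) }
}

// Invented example values, not real texture measurements.
let requirements = [
    SizeAndAlign(size: 100, align: 64),    // rounds up to 128
    SizeAndAlign(size: 4096, align: 256),  // already a multiple of 256
    SizeAndAlign(size: 300, align: 256),   // rounds up to 512
]
print(heapSize(for: requirements)) // 4736
```

In a real renderer the requirements array would be filled from device.heapTextureSizeAndAlign(descriptor:), one entry per texture.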

A MTLHeap is a block of GPU memory you allocate once and then sub-allocate textures from. Benefits:

One backing memory allocation instead of N separate device-level texture allocations
Better memory locality → better cache behavior
One useHeap call makes every texture referenced through argument buffers resident, instead of one useResource call per texture

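// 1. Measure the size each texture would need
var heapSize = 0
for url in textureURLs {
    // textureDescriptor(url:) is assumed to be a project helper that builds an MTLTextureDescriptor
    let desc = textureLoader.textureDescriptor(url: url)
    let sizeAlign = device.heapTextureSizeAndAlign(descriptor: desc)
    // Conservative bound: size + align always leaves room to align the sub-allocation
    heapSize += sizeAlign.size + sizeAlign.align
}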

// 2. Create the heap
let heapDesc = MTLHeapDescriptor()
heapDesc.size = heapSize
heapDesc.storageMode = .private
let heap = device.makeHeap(descriptor: heapDesc)!

// 3. Allocate each texture inside the heap
// 4. Use a blit encoder to copy from staging buffers into heap textures
let blit = commandBuffer.makeBlitCommandEncoder()!
for (staging, heapTex) in zip(stagingTextures, heapTextures) {
    blit.copy(from: staging, to: heapTex)
}
blit.endEncoding()

Argument Buffers

With the heap allocated and all textures packed, we need a way to let the GPU access any of them by index without the CPU issuing one setTexture call per draw. That’s what argument buffers solve — they’re MTLBuffers whose contents are GPU resource handles.

Argument buffers — indirect resource binding. Traditionally in Metal (and in earlier tutorials), each buffer/texture is bound individually with setVertexBuffer(_:offset:index:) or setFragmentTexture(_:index:). Argument buffers replace this with a single MTLBuffer whose contents are resource handles (GPU pointers to textures, buffers, and samplers). The shader sees it as a struct of typed resource references and indexes into it with regular pointer arithmetic. Benefits:

Fewer CPU API calls per draw — bind one argument buffer instead of dozens of individual resources

GPU-driven binding — the GPU itself can write resource handles into argument buffers (no CPU involvement), enabling truly GPU-driven rendering where the CPU doesn’t know which specific resources each draw will need

Tier 2 argument buffers (Apple GPU, A13+, M1+) allow indexing into arrays of unbounded size — the foundation of “bindless” rendering

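MTLArgumentEncoder is the CPU-side tool that fills in the handles. You create it from the MTLFunction that declares the argument buffer parameter, using makeArgumentEncoder(bufferIndex:), so Metal knows the expected argument buffer layout. You then call setArgumentBuffer to choose the destination buffer and setBuffer, setTexture, etc. to write the handles.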
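The encoder workflow can be sketched end to end with a tiny shader compiled at run time. Everything named Demo or demo below is hypothetical and exists only for this example; the tutorial's real structs and functions live in its own shader files. The sketch assumes a Metal-capable device and returns nil if any resource cannot be created.

```swift
import Metal

// Hypothetical shader: one argument-buffer struct holding a single texture.
let demoShaderSource = """
#include <metal_stdlib>
using namespace metal;

struct DemoMaterial {
    texture2d<float> baseColorTexture [[id(0)]];
};

fragment float4 demoFragment(constant DemoMaterial &material [[buffer(0)]]) {
    constexpr sampler linearSampler(filter::linear);
    return material.baseColorTexture.sample(linearSampler, float2(0.5));
}
"""

func encodeDemoMaterial(device: MTLDevice) throws -> (buffer: MTLBuffer, texture: MTLTexture)? {
    let library = try device.makeLibrary(source: demoShaderSource, options: nil)
    guard let function = library.makeFunction(name: "demoFragment") else { return nil }

    // The encoder learns the layout of DemoMaterial from buffer slot 0 of the function.
    let encoder = function.makeArgumentEncoder(bufferIndex: 0)

    let textureDescriptor = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .rgba8Unorm, width: 4, height: 4, mipmapped: false)
    textureDescriptor.usage = .shaderRead

    guard let texture = device.makeTexture(descriptor: textureDescriptor),
          let buffer = device.makeBuffer(length: encoder.encodedLength,
                                         options: .storageModeShared) else { return nil }

    // Choose the destination, then write the texture handle at [[id(0)]].
    encoder.setArgumentBuffer(buffer, offset: 0)
    encoder.setTexture(texture, index: 0)
    return (buffer, texture)
}
```

The returned texture must stay alive, and must be made resident with useResource or useHeap, for as long as a shader reads it through the buffer.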

An argument buffer is a MTLBuffer that contains resource handles (textures, buffers, samplers) that shaders can access by index — like a C struct of pointers.

In this tutorial, MTLArgumentEncoder fills three argument buffers:

Mesh argument buffer — `MT6::Indirect::Mesh`

The Mesh struct (inside the MT6::Indirect namespace) holds pointers to a single submesh’s vertex data plus a pointer to its material argument buffer. It maps directly to MT6VertexBufferIndeces:

namespace MT6 {
namespace Indirect {
    struct Mesh {
        constant float           *vertexBuffer    [[id(MT6VertexBuffer)]];
        constant float           *texCoordsBuffer [[id(MT6TextureCoordinatesBuffer)]];
        constant uint            *indexBuffer     [[id(MT6IndecesBuffer)]];
        constant MaterialArgument *materialArgBuffer [[id(MT6MaterialArgBuffer)]];
    };
} // namespace Indirect
} // namespace MT6

The renderer encodes one Mesh entry per submesh into a MTLBuffer at index MT6MeshesBuffer (14).

Material argument buffer — `MaterialArgument`

Contains only three heap textures — no float constants:

struct MaterialArgument {
    texture2d<float> baseColorTexture [[id(MT6BaseColorTexture)]];
    texture2d<float> specularTexture  [[id(MT6SpecularTexture)]];
    texture2d<float> normalTexture    [[id(MT6NormalTexture)]];
};

Shadow argument buffer — `ShadowsArgBuffer`

A thin wrapper that lets the ICB kernel bind the shadow depth texture by index:

struct ShadowsArgBuffer {
    depth2d<float> shadowTexture [[id(0)]];
};

Encoding on the Swift side

// Encode the meshes buffer (array of Indirect::Mesh)
let meshEncoder = meshFunctionArguments.makeArgumentEncoder(bufferIndex: Int(MT6MeshesBuffer.rawValue))
for (i, mtkMesh) in scene.mtkMeshes.enumerated() {
    // Point the encoder at array element i so each mesh gets its own entry
    meshEncoder.setArgumentBuffer(meshesBuffer, startOffset: 0, arrayElement: i)

    let vertices = mtkMesh.vertexBuffers[Int(MT6VertexBuffer.rawValue)]
    meshEncoder.setBuffer(vertices.buffer,
                          offset: vertices.offset,
                          index: Int(MT6VertexBuffer.rawValue))
    let texCoords = mtkMesh.vertexBuffers[Int(MT6TextureCoordinatesBuffer.rawValue)]
    meshEncoder.setBuffer(texCoords.buffer,
                          offset: texCoords.offset,
                          index: Int(MT6TextureCoordinatesBuffer.rawValue))
    // … index buffer, material argument buffer …
}

MTLIndirectCommandBuffer

The argument buffers describe what to draw. The indirect command buffer is how the GPU issues the actual draw calls. A compute kernel runs one thread per submesh, reads from the argument buffers, and writes a complete drawIndexedPrimitives command into the ICB — no CPU involvement once the dispatch is submitted.

MTLIndirectCommandBuffer — GPU-driven draw call generation. An ICB is a fixed-size array of pre-encoded command objects (draw or dispatch) that can be filled by a GPU compute kernel and later executed by a render encoder with executeCommandsInBuffer(_:range:). This removes the CPU from the draw-call generation loop entirely. The key advantages over CPU-driven rendering:

No CPU-GPU sync per draw — the GPU generates its own draw calls without waiting for CPU readback

GPU culling is possible — a compute kernel can conditionally skip draws for off-screen objects by leaving command slots empty

Parallel command generation — the compute kernel runs one thread per submesh, generating all draw calls in parallel
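The tutorial's kernel draws every submesh, but the culling idea can be modeled on the CPU to show what each thread would decide. This is a hypothetical sketch: the Vec3, Plane, and BoundingSphere types and the sphere-versus-plane test are assumptions chosen for illustration, not part of the tutorial's code.

```swift
// Hypothetical CPU model of a per-submesh visibility test a culling kernel might run.
struct Vec3 { var x, y, z: Double }

// A point p is inside the half-space when normal·p + d >= 0 (normal is unit length).
struct Plane { var normal: Vec3; var d: Double }

struct BoundingSphere { var center: Vec3; var radius: Double }

func dot(_ a: Vec3, _ b: Vec3) -> Double {
    a.x * b.x + a.y * b.y + a.z * b.z
}

/// A sphere is kept unless it lies entirely behind at least one plane.
func isVisible(_ sphere: BoundingSphere, in frustum: [Plane]) -> Bool {
    frustum.allSatisfy { plane in
        dot(plane.normal, sphere.center) + plane.d >= -sphere.radius
    }
}

/// One flag per command slot: true means "encode a draw", false means "leave the slot empty".
func slotsToFill(bounds: [BoundingSphere], frustum: [Plane]) -> [Bool] {
    bounds.map { isVisible($0, in: frustum) }
}

// A single plane that keeps everything with x >= 2.
let demoFrustum = [Plane(normal: Vec3(x: 1, y: 0, z: 0), d: -2)]
let demoBounds = [
    BoundingSphere(center: Vec3(x: 0, y: 0, z: 0), radius: 1),   // fully behind the plane
    BoundingSphere(center: Vec3(x: 1.5, y: 0, z: 0), radius: 1), // straddles the plane
]
print(slotsToFill(bounds: demoBounds, frustum: demoFrustum)) // [false, true]
```

In a GPU version, each compute thread would evaluate the test for its own submesh and encode the draw only when the result is true.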

The ICB descriptor (MTLIndirectCommandBufferDescriptor) must declare the command types (commandTypes = [.drawIndexed]) and the maximum buffer bind counts. The counts are upper bounds on the buffer slots a command may use, so they must cover the highest slot index the kernel binds, not just the number of buffers: too few can cause crashes, while too many wastes memory because each command slot is a fixed-size binary structure.

In Metal Shading Language, the render_command type (from <metal_command_buffer>) wraps one ICB slot. You call cmd.set_vertex_buffer(...), cmd.set_fragment_buffer(...), and cmd.draw_indexed_primitives(...) exactly as you would call them on a CPU MTLRenderCommandEncoder — the kernel just encodes them into the ICB for later replay.

An ICB is a GPU-writable array of command objects. Each entry in the ICB is one draw call.

MTLComputeCommandEncoder — dispatches compute kernels. The compute encoder is the CPU-side interface for launching compute shaders. You set a MTLComputePipelineState (compiled from a kernel function), bind buffers and textures, then call dispatchThreads(_:threadsPerThreadgroup:) to define the grid. In Tutorial 6 each thread in the dispatch corresponds to one submesh — so the kernel runs submeshCount threads in parallel, each writing one command into the ICB. This is the GPU-driven pattern: CPU dispatches a single compute call that generates N draw calls.
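Choosing the grid and threadgroup sizes is plain integer arithmetic, sketched below without any Metal dependency. The names DispatchShape and dispatchShape are hypothetical; executionWidth stands for the value a pipeline reports through MTLComputePipelineState.threadExecutionWidth, and using it as the threadgroup width is a common choice rather than something this tutorial prescribes.

```swift
// Hypothetical helper: picks a 1D threadgroup width and the number of groups
// needed to cover `threadCount` threads (one thread per submesh).
struct DispatchShape: Equatable {
    let threadsPerGroup: Int
    let groupCount: Int
}

func dispatchShape(threadCount: Int, executionWidth: Int) -> DispatchShape {
    precondition(threadCount > 0 && executionWidth > 0)
    // Never make a group wider than the work available.
    let threadsPerGroup = min(threadCount, executionWidth)
    // Ceiling division so the last, partially filled group is still launched.
    let groupCount = (threadCount + threadsPerGroup - 1) / threadsPerGroup
    return DispatchShape(threadsPerGroup: threadsPerGroup, groupCount: groupCount)
}

print(dispatchShape(threadCount: 100, executionWidth: 32))
// DispatchShape(threadsPerGroup: 32, groupCount: 4)
```

With dispatchThreads only threadsPerGroup is needed, because Metal trims the final group to the grid size. With dispatchThreadgroups, the kernel would also need a bounds check so that the extra threads in the last group (28 of them in the example) write nothing.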

CPU setup

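let icbDesc = MTLIndirectCommandBufferDescriptor()
icbDesc.commandTypes         = [.drawIndexed]
icbDesc.inheritBuffers       = false
// Bind counts must cover the highest slot index drawKernel binds, not just the number of buffers
icbDesc.maxVertexBufferBindCount   = Int(MT6VertexUniformsBuffer.rawValue) + 1    // slots 0...11
icbDesc.maxFragmentBufferBindCount = Int(MT6ShadowsArgumentsBuffer.rawValue) + 1  // slots 0...16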

let indirectCommandBuffer = device.makeIndirectCommandBuffer(
    descriptor: icbDesc,
    maxCommandCount: meshes.count,
    options: [])!

GPU compute kernel fills the ICB

The drawKernel receives all per-mesh and per-frame data it needs, wraps the ICB in CommandArgBuffer, and encodes one render_command per thread:

namespace MT6 {
namespace Indirect {
    struct CommandArgBuffer {
        command_buffer indirectCommandBuffer [[id(0)]];
    };

    kernel void drawKernel(
        uint                                              threadId              [[thread_position_in_grid]],
        device CommandArgBuffer                          *pCommandBuffer        [[buffer(MT6IndirectCommandBuffer)]],
        constant Mesh                                    *pMeshes               [[buffer(MT6MeshesBuffer)]],
        constant MT6VertexUniforms                       *pVertexUniformsArray  [[buffer(MT6VertexUniformsBuffer)]],
        constant MT6FragmentUniforms                     &fragmentUniforms      [[buffer(MT6FragmentUniformsBuffer)]],
        constant ShadowsArgBuffer                        *pShadowArgBuffer      [[buffer(MT6ShadowsArgumentsBuffer)]],
        constant MTLDrawIndexedPrimitivesIndirectArguments *drawArgumentsBuffer [[buffer(MT6DrawArgumentsBuffer)]])
    {
        render_command cmd(pCommandBuffer->indirectCommandBuffer, threadId);

        constant Mesh &mesh = pMeshes[threadId];

        cmd.set_vertex_buffer(mesh.vertexBuffer,     MT6VertexBuffer);
        cmd.set_vertex_buffer(mesh.texCoordsBuffer,  MT6TextureCoordinatesBuffer);
        cmd.set_vertex_buffer(mesh.materialArgBuffer, MT6MaterialArgBuffer);
        cmd.set_vertex_buffer(&pVertexUniformsArray[threadId], MT6VertexUniformsBuffer);

        cmd.set_fragment_buffer(&fragmentUniforms,            MT6FragmentUniformsBuffer);
        cmd.set_fragment_buffer(pShadowArgBuffer,             MT6ShadowsArgumentsBuffer);
        cmd.set_fragment_buffer(mesh.materialArgBuffer,       MT6MaterialArgBuffer);

        // References need an explicit address space in MSL; the source buffer is in `constant`
        constant MTLDrawIndexedPrimitivesIndirectArguments &drawArgs = drawArgumentsBuffer[threadId];
        cmd.draw_indexed_primitives(
            primitive_type::triangle,
            drawArgs.indexCount,
            mesh.indexBuffer,
            drawArgs.instanceCount,
            drawArgs.baseVertex,
            drawArgs.baseInstance);
    }
} // namespace Indirect
} // namespace MT6

CPU dispatches compute, then executes ICB

// Step 1: compute pass — GPU fills the ICB
let computeEncoder = commandBuffer.makeComputeCommandEncoder()!
computeEncoder.setComputePipelineState(_drawKernelPSO)
computeEncoder.setBuffer(_commandArgBuffer,    offset: 0, index: Int(MT6IndirectCommandBuffer.rawValue))
computeEncoder.setBuffer(_meshesBuffer,        offset: 0, index: Int(MT6MeshesBuffer.rawValue))
computeEncoder.setBuffer(_vertexUniformsBuffer,offset: 0, index: Int(MT6VertexUniformsBuffer.rawValue))
computeEncoder.setBuffer(_fragmentUniformsBuffer, offset: 0, index: Int(MT6FragmentUniformsBuffer.rawValue))
computeEncoder.setBuffer(_shadowsArgBuffer,    offset: 0, index: Int(MT6ShadowsArgumentsBuffer.rawValue))
computeEncoder.setBuffer(_drawArgumentsBuffer, offset: 0, index: Int(MT6DrawArgumentsBuffer.rawValue))
computeEncoder.useHeap(scene.staticGPUHeap)
computeEncoder.dispatchThreads(
    MTLSize(width: submeshCount, height: 1, depth: 1),
    threadsPerThreadgroup: MTLSize(width: 1, height: 1, depth: 1))
computeEncoder.endEncoding()

// Step 2: render pass executes the ICB
let renderEncoder = commandBuffer.makeRenderCommandEncoder(descriptor: renderPassDescriptor)!
renderEncoder.useHeap(scene.staticGPUHeap)
renderEncoder.executeCommandsInBuffer(_indirectCommandBuffer, range: 0..<submeshCount)
renderEncoder.endEncoding()

The useHeap call is critical: it tells Metal that all textures in the heap may be accessed by this pass, enabling proper residency tracking. Without useHeap, Metal’s resource tracking doesn’t know the heap’s sub-resources are in use — on Apple Silicon this can cause GPU page faults; on macOS with discrete GPUs it can cause incorrect eviction. Both the compute encoder (which writes ICB commands referencing heap textures) and the render encoder (which executes those commands) need useHeap called.

MTLBlitCommandEncoder — copy/fill operations. The blit encoder handles data movement on the GPU: copying textures and buffers, generating mipmaps, and filling buffer regions. In Tutorial 6 it copies texture data from CPU-accessible staging buffers into the heap’s GPU-private textures. You cannot directly write to a .private texture from the CPU — you must upload to a .shared or .managed staging resource first, then use a blit encoder to copy to the final .private destination. This staging pattern is standard for all static assets (textures, meshes) in production Metal renderers.
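The staging pattern can be sketched for a single RGBA texture as follows. This is a minimal illustration under stated assumptions: the function name uploadToHeapTexture is hypothetical, the heap is sized for exactly one texture, and the code waits for the GPU synchronously, which a real loader would avoid by batching all copies into one command buffer.

```swift
import Metal

// Hypothetical helper: uploads tightly packed RGBA8 pixels into a .private heap texture.
func uploadToHeapTexture(device: MTLDevice, queue: MTLCommandQueue,
                         pixels: [UInt8], width: Int, height: Int) -> MTLTexture? {
    let bytesPerRow = width * 4 // rgba8Unorm uses 4 bytes per pixel
    precondition(width > 0 && height > 0 && pixels.count == bytesPerRow * height)

    let descriptor = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .rgba8Unorm, width: width, height: height, mipmapped: false)
    descriptor.storageMode = .private // must match the heap's storage mode
    descriptor.usage = .shaderRead

    // Size a heap for this one texture.
    let sizeAndAlign = device.heapTextureSizeAndAlign(descriptor: descriptor)
    let heapDescriptor = MTLHeapDescriptor()
    heapDescriptor.size = sizeAndAlign.size
    heapDescriptor.storageMode = .private

    guard let heap = device.makeHeap(descriptor: heapDescriptor),
          let texture = heap.makeTexture(descriptor: descriptor),
          // CPU-visible staging buffer holding the pixel bytes
          let staging = device.makeBuffer(bytes: pixels, length: pixels.count,
                                          options: .storageModeShared),
          let commandBuffer = queue.makeCommandBuffer(),
          let blit = commandBuffer.makeBlitCommandEncoder() else { return nil }

    // GPU-side copy: staging buffer -> private heap texture
    blit.copy(from: staging,
              sourceOffset: 0,
              sourceBytesPerRow: bytesPerRow,
              sourceBytesPerImage: bytesPerRow * height,
              sourceSize: MTLSize(width: width, height: height, depth: 1),
              to: texture,
              destinationSlice: 0,
              destinationLevel: 0,
              destinationOrigin: MTLOrigin(x: 0, y: 0, z: 0))
    blit.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted() // simple but blocking; fine for a sketch
    return texture
}
```

Once the copy has completed, the staging buffer can be released; only the heap and its textures need to live for the lifetime of the scene.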

The full CPU→GPU flow for one frame:

sequenceDiagram
    participant CPU
    participant ComputePass as Compute Pass (GPU)
    participant ICB as MTLIndirectCommandBuffer
    participant RenderPass as Render Pass (GPU)

    CPU->>ComputePass: encode drawKernel dispatch
    CPU->>ComputePass: setBuffer(meshesBuffer, MT6MeshesBuffer)
    CPU->>ComputePass: setBuffer(shadowsArgBuffer, MT6ShadowsArgumentsBuffer)
    CPU->>ComputePass: useHeap(staticGPUHeap)
    CPU->>ComputePass: endEncoding
    ComputePass->>ICB: render_command cmd(commandArgBuffer.icb, threadId)
    ComputePass->>ICB: cmd.set_vertex_buffer(mesh.vertexBuffer, ...)
    ComputePass->>ICB: cmd.draw_indexed_primitives(...)
    CPU->>RenderPass: encode executeCommandsInBuffer(ICB)
    CPU->>RenderPass: useHeap(staticGPUHeap)
    CPU->>RenderPass: commit
    RenderPass->>RenderPass: execute all N draw calls from ICB

Buffer index layout

The buffer index constants are defined in MT6Input.h and split into two enums — one for per-mesh vertex data (low indices, packed tightly) and one for pass-level data (starting at 10):

// Per-vertex attribute buffers — used directly in vertex shaders and ICB commands
typedef enum MT6VertexBufferIndeces {
    MT6VertexBuffer             = 0,   // position + normal + tangent + bitangent (interleaved)
    MT6TextureCoordinatesBuffer = 1,   // UV coords (separate buffer)
    MT6IndecesBuffer            = 2,   // index buffer
    MT6MaterialArgBuffer        = 3,   // per-submesh material argument buffer
} MT6VertexBufferIndeces;

// Pass-level buffers — uniforms, GPU-driven drawing structures
typedef enum MT6BufferIndices {
    MT6BufferIndexMeshPositions = 10,  // full-screen quad vertex buffer
    MT6VertexUniformsBuffer     = 11,  // array of per-mesh MT6VertexUniforms
    MT6FragmentUniformsBuffer   = 12,  // MT6FragmentUniforms (light position)
    MT6IndirectCommandBuffer    = 13,  // CommandArgBuffer wrapping the ICB
    MT6MeshesBuffer             = 14,  // array of Mesh argument structs
    MT6DrawArgumentsBuffer      = 15,  // MTLDrawIndexedPrimitivesIndirectArguments[]
    MT6ShadowsArgumentsBuffer   = 16,  // ShadowsArgBuffer (depth2d shadow texture)
} MT6BufferIndices;

Render pass architecture

The render loop keeps the same 3-pass structure as Tutorial 4 (Shadow → GBuffer → Lighting), all encoded into one command buffer, but:

Shadow pass and GBuffer pass are now driven by the ICB (the GPU encodes the draw calls)
The CPU replays those commands with executeCommandsInBuffer instead of encoding each draw itself

For a renderer in which the required passes depend on the requested output, see HdRestir’s initial render-pipeline chapter. It compares the fixed Metal sequence with passes that declare named inputs and outputs and can be pruned for the requested Hydra AOV.

GPU Frame Capture

The Xcode Metal debugger shows the full pass dependency graph for one frame: Shadow Pass → Tiled Render Pass → presentDrawable.

Performance Tips for Apple Silicon

GPU-driven rendering moves draw call generation into a compute kernel, so the CPU-side encoding work per frame stays roughly constant as the number of submeshes grows. The CPU time this frees can go to other tasks, such as physics or AI calculations.

📚 Key concepts recap

Concept	Apple Docs	What it is
`MTLHeap`	↗	GPU memory arena; sub-allocate textures/buffers; one `useHeap` call covers all sub-resources for residency
`MTLArgumentEncoder`	↗	CPU-side tool that writes resource handles (textures, buffers) into an argument buffer `MTLBuffer`
Argument buffers	↗	A `MTLBuffer` whose contents are GPU resource handles; enables bindless and GPU-driven resource access
`MTLIndirectCommandBuffer`	↗	GPU-writable array of draw commands; filled by compute kernel, executed by render encoder
`MTLComputeCommandEncoder`	↗	Records compute kernel dispatches; used here to run the ICB-filling `drawKernel`
`MTLComputePipelineState`	↗	Compiled compute shader; like `MTLRenderPipelineState` but for `kernel` functions
`MTLBlitCommandEncoder`	↗	Copies/fills GPU memory; used to upload staging textures into heap `.private` textures
`useHeap(_:)`	↗	Declares all heap sub-resources resident for the current pass; needed because argument-buffer-accessed resources are not tracked automatically

What GPU-driven rendering enables

Efficient use of CPU and GPU resources by offloading work from the CPU to the GPU.
Parallel draw call generation: one compute thread per submesh writes its command into the ICB.
Simplified scene management through separation of concerns between the renderer and scene data.

🎉 Congrats — you’ve completed all Metal Tutorials!

All source code is on GitHub.