Tiled Rendering | 🌵 Federico Forti (fe0437)

In this tutorial we exploit a hardware feature of Apple GPUs to eliminate the DRAM round trip for the GBuffer. The visual result is identical to Tutorial 4, but the DRAM bandwidth cost of the GBuffer drops to zero. There are three steps:

Mark GBuffer textures as memoryless — they live only in on-chip tile memory, never written to DRAM
Merge the GBuffer pass and the lighting pass into a single MTLRenderPassDescriptor
Rewrite the lighting fragment shader to read GBuffer data directly from tile memory

Prerequisite: Make sure you have completed Tutorial 4 — Shadows.

⚙️ Setting up the project

In MTMetalTutorialsApp.swift set MT5ContentView() and run:

@main
struct MetalTutorialsApp: App {
    var body: some Scene {
        WindowGroup {
            MT5ContentView()
        }
    }
}

🔗 Metal API in this tutorial

Object	Docs	Scope	Role
`MTLStorageMode.memoryless`	↗	Texture creation	Tile-memory-only texture; never written to DRAM
`MTLStoreAction.dontCare`	↗	Pass descriptor	Discard tile contents after pass — mandatory for memoryless
`device.supportsFamily(.apple1)`	↗	Runtime query	Checks if device is a tile-based (Apple Silicon / A-series) GPU
Tile function shaders	↗	Shader stage	Shaders that run directly on tile memory without DRAM roundtrip

💡 What changes from Tutorial 4

The scene is the same as in Tutorial 4: the bunny, the plane, and the shadows are untouched. The only changes are performance-related:

GBuffer textures use MTLStorageMode.memoryless → they live only in GPU tile memory, never written to DRAM
The GBuffer pass and the Lighting pass are merged into a single MTLRenderPassDescriptor derived from view.currentRenderPassDescriptor
The lighting fragment shader receives the GBuffer as a direct GBuffer struct parameter instead of reading from textures

The result: the GBuffer data never leaves the tile — zero bandwidth cost.

MTLStorageMode.memoryless: tile-memory-only allocations. Apple’s A-series and M-series GPUs use a Tile-Based Deferred Rendering (TBDR) architecture: the GPU splits the screen into small tiles (Metal supports 16×16, 32×16, and 32×32 pixels) and processes each tile completely before moving to the next. Each tile is processed in fast on-chip SRAM (“tile memory”), which holds all fragment outputs and depth values for that tile. A memoryless texture is backed only by this tile SRAM and has no DRAM allocation at all. This means:

Reading or writing it costs no DRAM bandwidth — the data never leaves the chip during the frame.

It cannot be read on the CPU or in a separate render pass — the data is discarded when the GPU moves to the next tile.

It requires .storeAction = .dontCare: there is no DRAM to store into. For the same reason, its load action can be .clear or .dontCare but not .load.

The texture size is still required (for validation), but no backing memory is actually allocated.

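On Macs with non-Apple GPUs (AMD, Intel, NVIDIA), MTLStorageMode.memoryless is not supported, and neither are [[color(N)]] inputs to fragment functions. On those devices, allocate the GBuffer as .private and keep the two-pass, texture-based lighting path from Tutorial 4.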
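As a rough illustration of the texture setup described above, here is a minimal sketch. `makeGBufferTexture` is a hypothetical helper, not a function from the tutorial's source, and it assumes a deployment target where `.memoryless` is available (iOS 10 or macOS 11 and later).

```swift
import Metal

/// Creates one render-target texture for the GBuffer.
/// Hypothetical helper: the tutorial's renderer may structure this differently.
func makeGBufferTexture(device: MTLDevice,
                        pixelFormat: MTLPixelFormat,
                        width: Int,
                        height: Int,
                        storageMode: MTLStorageMode = .memoryless) -> MTLTexture? {
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: pixelFormat,
        width: width,
        height: height,
        mipmapped: false)
    // A memoryless texture can only be used as a render target.
    descriptor.usage = .renderTarget
    // The size is still required, but with .memoryless no DRAM is allocated.
    descriptor.storageMode = storageMode
    return device.makeTexture(descriptor: descriptor)
}
```

With the formats mentioned later in this tutorial, the albedo target would be created with `.rgba8Unorm` and the normal and position targets with `.rgba16Float`, all at the drawable size.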

📁 Files

File Name	Description
MT5ContentView.swift	Sets up Metal view and starts rendering.
MT5DeferredMetalView.swift	Manages Metal layer, frame timing, and render loop.
MT5TiledDeferredRenderer.swift	Handles GBuffer creation, texture setup, and command buffer encoding.
MT5DeferredRendering.metal	Contains vertex and fragment shaders for geometry pass and lighting pass.
MT5RenderTargets.h	Defines constants for render targets (e.g., albedo, normal, position).
MT5Uniforms.h	Declares uniform structs used in the Metal shaders.

👨‍💻 Code

Apart from the storage mode of the GBuffer textures, all the changes are in two places: the render pass descriptor (which merges the GBuffer and lighting passes) and the lighting fragment shader (which reads GBuffer data from the tile instead of from DRAM textures). The geometry pass and its shaders are untouched.

GBuffer fragment — unchanged

The geometry-pass shaders (vertex_main, gbuffer_fragment) are identical to Tutorial 4. Metal infers the tile memory write from the [[color(N)]] attributes on the GBuffer struct.

Lighting fragment — new `GBuffer` struct parameter

In Tutorial 4 the lighting shader read from three texture2d bindings. Tutorial 5 replaces that with a direct GBuffer parameter:

struct GBuffer {
    float4 albedo    [[color(MT5RenderTargetAlbedo)]];
    float4 normal    [[color(MT5RenderTargetNormal)]];
    float4 position  [[color(MT5RenderTargetPosition)]];
};

fragment float4 deferred_lighting_fragment(
    QuadInOut                     in       [[stage_in]],
    GBuffer                       gBuffer,                 // ← reads from tile, not DRAM
    constant MT5FragmentUniforms &uniforms [[buffer(1)]])
{
    float4 albedo_vis_at_pix = gBuffer.albedo;
    float4 normal_at_pix     = gBuffer.normal;
    float4 position_at_pix   = gBuffer.position;
    // … same GGX lighting calculation as Tutorial 3/4
}

The GBuffer parameter with [[color(N)]] attributes reads directly from the tile memory render targets. No texture.read() call, no DRAM access.

Why the GBuffer parameter works in MSL. In Metal Shading Language, a fragment function can accept a struct parameter where each field is tagged with [[color(N)]]. This binds the field directly to the Nth color attachment in the current render pass — reading from the tile’s framebuffer at the current pixel position. It’s the same mechanism used to write multiple render targets (return a [[color(N)]] struct), but applied in reverse to read from them. This only works within a single MTLRenderCommandEncoder — once you call endEncoding(), tile memory is committed to DRAM (or discarded if .dontCare) and is no longer accessible. Splitting into two encoders (as in Tutorial 3) forces DRAM roundtrip and requires texture2d bindings instead.

Apple calls this technique programmable blending. Other APIs expose the same idea as framebuffer fetch (the OpenGL ES extension EXT_shader_framebuffer_fetch) or, in Vulkan, through input attachments together with VK_EXT_rasterization_order_attachment_access. In Metal it works natively on Apple GPU hardware, with no extensions needed.
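On the Swift side, the single-encoder requirement could look roughly like the sketch below. `encodeMergedPass`, its parameter names, and the `drawScene` closure are hypothetical stand-ins for the tutorial's own renderer code. The six-vertex full-screen quad is also an assumption about how the lighting pass is drawn.

```swift
import Metal

/// Encodes the geometry and lighting stages in one render command encoder.
/// All names here are hypothetical; `drawScene` stands in for the tutorial's draw calls.
func encodeMergedPass(commandBuffer: MTLCommandBuffer,
                      passDescriptor: MTLRenderPassDescriptor,
                      gBufferPipeline: MTLRenderPipelineState,
                      lightingPipeline: MTLRenderPipelineState,
                      drawScene: (MTLRenderCommandEncoder) -> Void) {
    guard let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: passDescriptor) else {
        return
    }

    // Stage 1: the geometry draws write albedo, normal and position into the tile.
    encoder.setRenderPipelineState(gBufferPipeline)
    drawScene(encoder)

    // Stage 2: same encoder, so the GBuffer is still resident in tile memory.
    encoder.setRenderPipelineState(lightingPipeline)
    // Assumed: a full-screen quad drawn as two triangles (6 vertices).
    encoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: 6)

    // After this call the tile contents are stored or discarded,
    // depending on each attachment's store action.
    encoder.endEncoding()
}
```

Because both pipeline states run inside the same render pass, each of them has to be created with color attachment pixel formats that match the pass descriptor.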

What `.storeAction = .dontCare` actually means

Now that the GBuffer lives only in tile memory, the choice of store action on those attachments is no longer free. The options are:

Store action	What happens	When to use
`.store`	GPU writes tile → DRAM	When you need the texture in a later pass or on the CPU
`.dontCare`	GPU discards tile contents	When the attachment is intermediate (GBuffer, intermediate depth)
`.multisampleResolve`	GPU resolves the MSAA samples into the resolve texture	For MSAA render targets

Using .dontCare on the GBuffer is not just “safe”; it is required for .memoryless textures. If you set .store on a memoryless texture, Metal reports a validation error.
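A minimal sketch of how the GBuffer attachments might be configured on the pass descriptor follows. `attachGBuffer` is a hypothetical helper, and `firstIndex = 1` is an assumption (drawable at index 0). The real indices come from the MT5RenderTarget* constants in MT5RenderTargets.h.

```swift
import Metal

/// Attaches GBuffer textures to consecutive color attachments of a render pass.
/// Hypothetical helper; `firstIndex` must match the MT5RenderTarget* constants.
func attachGBuffer(_ textures: [MTLTexture],
                   to passDescriptor: MTLRenderPassDescriptor,
                   firstIndex: Int = 1) {
    for (offset, texture) in textures.enumerated() {
        guard let attachment = passDescriptor.colorAttachments[firstIndex + offset] else {
            continue
        }
        attachment.texture = texture
        // .load would need previous contents in DRAM, which a memoryless texture lacks.
        attachment.loadAction = .clear
        attachment.clearColor = MTLClearColor(red: 0, green: 0, blue: 0, alpha: 0)
        // Discard the tile contents at the end of the pass.
        attachment.storeAction = .dontCare
    }
}
```

The descriptor passed in would typically be the one derived from view.currentRenderPassDescriptor, so that the drawable and the GBuffer share a single pass.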

Performance comparison

On Apple Silicon, the bandwidth saving from .memoryless GBuffers is significant:

Scenario	T3 GBuffer bandwidth	T5 GBuffer bandwidth
1080p, rgba8 albedo	~8.3 MB/frame written + ~8.3 MB/frame read	0
1080p, rgba16f normals + positions (two targets)	~33 MB/frame written + ~33 MB/frame read	0
Total at 60 fps	~5 GB/s for the GBuffer alone	0

The saved bandwidth also reduces power consumption — important for mobile (iPhone/iPad) targets.
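The arithmetic behind these estimates can be reproduced with a few lines of Swift. This is only a back-of-the-envelope sketch: it assumes one rgba8 target and two rgba16f targets at 1920×1080, each written to DRAM once and read back once per frame, and it ignores any compression the hardware may apply.

```swift
/// Bytes moved per frame when a render target is written to DRAM once and read back once.
func roundTripBytes(width: Int, height: Int, bytesPerPixel: Int) -> Int {
    return width * height * bytesPerPixel * 2
}

let width = 1920, height = 1080
let albedo = roundTripBytes(width: width, height: height, bytesPerPixel: 4)    // rgba8
let normal = roundTripBytes(width: width, height: height, bytesPerPixel: 8)    // rgba16f
let position = roundTripBytes(width: width, height: height, bytesPerPixel: 8)  // rgba16f

// Total GBuffer traffic per frame, then per second at 60 fps.
let bytesPerFrame = albedo + normal + position
let gigabytesPerSecond = Double(bytesPerFrame * 60) / 1e9
print(bytesPerFrame, gigabytesPerSecond)
```

Counting both rgba16f targets, this comes to about 83 MB per frame, or roughly 5 GB/s at 60 fps.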

⚠️ When NOT to use `.memoryless`

Memoryless textures are a great default for intermediate attachments, but they can’t be used when:

You need to read the texture in a later, separate pass (e.g., the shadow map, which is written in the shadow pass and sampled afterwards)
You need to read the texture on the CPU (e.g., for image capture, screenshots)
You’re running on a non-tile GPU (an Intel-based Mac with an AMD, Intel, or NVIDIA GPU): .memoryless is not supported there, so allocate the texture as .private instead

Metal provides device.supportsFamily(.apple1) to detect tile GPU support if you want to conditionally enable memoryless storage.
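One way to act on that check is sketched below. `gBufferStorageMode(for:)` is a hypothetical helper, and the code assumes a deployment target where both `supportsFamily` and `.memoryless` are available (macOS 11 or iOS 13 and later).

```swift
import Metal

/// Picks a storage mode for intermediate render targets such as the GBuffer.
/// Hypothetical helper; the tutorial's own check may differ.
func gBufferStorageMode(for device: MTLDevice) -> MTLStorageMode {
    // Every Apple-family GPU (A7 and later, all M-series) reports support for .apple1.
    return device.supportsFamily(.apple1) ? .memoryless : .private
}
```

Falling back to `.private` only solves the allocation. The lighting shader's [[color(N)]] inputs also need an Apple GPU, so a non-Apple GPU would additionally need the two-pass, texture-based path from the earlier tutorials.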

📚 Key concepts recap

Concept	Apple Docs	What it is
TBDR architecture	↗	GPU renders one tile at a time using fast on-chip memory; basis for all the optimisations below
`MTLStorageMode.memoryless`	↗	Texture backed only by tile SRAM — no DRAM allocation, no bandwidth cost, never readable outside the pass
`MTLStoreAction.dontCare`	↗	Discard tile contents after pass — required for memoryless; also correct for depth when depth is not needed after the pass
Merged render pass	—	GBuffer + Lighting in one `MTLRenderCommandEncoder` — necessary for tile memory reads
`GBuffer [[color(N)]]` parameter	↗	Lighting shader reads GBuffer from tile memory directly — no `texture2d`, no DRAM fetch
Tile memory	—	Fast on-chip SRAM with much higher bandwidth and lower latency than DRAM; holds the attachments of the tile currently being processed
`device.supportsFamily(.apple1)`	↗	Runtime check for TBDR / tile GPU support (A7 and later, all M-series)

🎉 Congratulations

You’ve successfully implemented single-pass deferred rendering in Metal, taking advantage of the Tile-Based Deferred Rendering (TBDR) architecture of Apple GPUs. Now that you have a solid foundation, try experimenting with different tile sizes and observe how they affect performance and memory usage.
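For the tile size experiment, a possible starting point is the sketch below. `setTileSize` is a hypothetical helper. It relies on the `tileWidth` and `tileHeight` properties of MTLRenderPassDescriptor, which are honoured on Apple GPUs only. Metal accepts 16×16, 32×16, and 32×32.

```swift
import Metal

/// Requests a specific tile size for a render pass.
/// Hypothetical helper; call it before creating the render command encoder.
func setTileSize(_ passDescriptor: MTLRenderPassDescriptor, width: Int, height: Int) {
    passDescriptor.tileWidth = width
    passDescriptor.tileHeight = height
}
```

Smaller tiles leave more tile memory per pixel, which matters once the GBuffer grows. Comparing frame times in Xcode's GPU tools for each size would show whether the choice makes a measurable difference for this scene.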