Screen space ambient occlusion is getting a lot of interest around the net and I wanted to add this to our Raz0r engine as an easily toggleable feature. Whilst developing the code, I borrowed a lot of knowledge from other people and this tutorial is my payback for all that help.
I’m not claiming that my method is the best and only way to implement SSAO, so I present this simply as the way I did it. If you find any optimisations or tweaks to improve it, please feel free to point them out.
One of my main goals here is to keep the code fully DirectX9.0c compatible and as fast as possible, which basically means no MRT and no fat format textures. This entire technique works only with bog-standard RGBA 32 bit textures and old-fashioned single target rendering. The code as presented will probably run on shader model 2 if you cut down the number of samples per pixel too, so this method is about as safe and reliable as you’ll find anywhere. It should also run comparatively quickly given the low bandwidth we’re using with our textures.
Right, lets get our hands dirty. Raz0r doesn’t use deferred shading, so the first thing we need to do is produce an RGBA texture of our scene that contains the normals and positions of every point we can see. Given that the trivial way to store both a position and a normal would require 24 bytes per pixel, and I said we’d be using only 4, there’s clearly some compression required here.
Encoding a camera space normal
Given we know that a normal is meant to be, er, normalised, we can throw away the z component and reconstruct it in the shader. All we need to know are the x and y components and the sign of the z and from that we can reconstruct the size of z by hand. There are actually several ways to encode normals into two components and Aras Pranckevicius has a great page presenting these concepts here. I’ve gone for the fastest of the options available to us, which is to just encode the normal’s x and y components into the R and G channels of our target texture and let the shader rebuild the z component. We don’t need to store the sign of z because we’re in camera space and therefore anything we can see must by definition point towards us, so the sign is always negative. (Actually this is not technically accurate. Because of perspective projection, it is possible to see a face on the screen that points away from the camera slightly, but how we’re using these normals later on renders this merely an academic issue)
Okay, we need only store two components of our normal, but that’s still 8 bytes worth as 2 floats. To fix that, we simply don’t store them as floats! In the pixel shader, taking a full fat float3 normal then multiplying it by 0.5 and adding 0.5 will effectively convert the components into 8 bit unsigned values that fit very nicely into the R & G channels of our texture. The resolution is perfectly fine and is not a compromise at all – I store my model normals packed into 3 bytes of a UBYTE4 field and you’d never know.
Encoding camera space linear depth
The next section will explain why we actually need this, but another task we need to perform is to fill the remaining B & A channels of our texture with a linear scene depth. Not only can the depth buffer not be read directly/efficiently in DX9 anway, but we want our depths to be linear, so the accuracy of the 16 bits we have available is spread over the entire visible scene, not just the first metre or so. It is a simple matter to have a vertex shader output a depth at each point (code will follow, don’t worry), but the icky part will be breaking up this float value inside our pixelshader into effectly two unsigned 8 bit values so we can stuff these into the B & A channels.
One detail not to be overlooked here is that a float value called “Depth” could be a very large, practically infinite number. We don’t need full resolution though, as our camera space scene will never show anything beyond the camera’s far clipping point. With this in mind, we normalise our depth value in the vertexshader by dividing the linear depth by the camera’s “far”, and pass that “far” value to the pixelshader so that it can be scaled back up again from the normalised value as it’s being sampled and reconstructed from the B & A texture channels.
Reconstructing position from linear depth
MJP has a great page presenting this, so I’ll gloss over it here. The reason we need a camera space linear depth available is that if we also supply our pixelshader with projection/clip space rays to the corners of our viewing frustrum (which we can pass in the vertices of our clip space quad), we can reconstruct an accurate 3D position anywhere in camera space just by reading this depth value and projecting along those rays.