Compute Shader thread index to 2D coordinate

I often find its useful to have four threads (or lanes, depending on your preferred parlance) collaborate on 2×2 pixels. Traditionally this would be done with group shared memory. However with Shader Model 6, we have the QuadReadAcrossX/Y/Diagonal intrinsics.

The advantage of these intrinsics instead of that group shared memory is that group shared memory is a resource, using too much of it can restrict occupancy and reduce shader performance. Using shared memory may also come with synchronisation requirements and depending on your hardware, subtle performance penalties such as bank conflicts, which I won’t go into here. Generally speaking, I expect the QuadRead intrinsics to outperform group shared memory.

The documentation (see Quad-wide Shuffle Operations) gives the required order of the threads, the expected behaviour being:

[0, 1]
[2, 3]

// QuadReadAcrossX for Thread0 will obtain Thread1's value and visa versa.
// QuadReadAcrossX for Thread2 will obtain Thread3's value and visa versa. 
// QuadReadAcrossY for Thread0 will obtain Thread2's value and visa versa. 
// QuadReadAcrossY for Thread1 will obtain Thread3's value and visa versa.

The question then is for a given threadIndex (obtained via WaveGetLaneIndex()), how do we generate the 2D coordinate the thread is to operate on? There is no default in spec. Of course, you could just do a buffer read to map thread ID to a 2D coordinate, but you might want to use maths instead, depending on what your shader’s bottlenecks are.

I came up with two different layouts. A cheap version with two periods of tiling, and a more expensive version with three periods, which can be more useful for reduction operations, especially creating mip maps with wave64.

(Below: the lane to 2D coord cheap ‘rectangular’ mapping, I’ve highlighted the first 4 quads of a wave64)

In the more expensive version I was able to reduce the number of operations by packing the x.y coordinate into the top and bottom 16bits of a dword, thus allowing me to perform the same operation concurrently on both x and y. The code might take some dissecting, so the unoptimised version is reproduced in the comments.

(Below: the more epxensive lane to 2D coord ‘square’ mapping, I’ve highlighted the first 4 quads of a wave64)

The comment block show what tiling patterns are for wave64, but its easy to extrapolate what the code does for other wave sizes.

uint2 ThreadIndexToQuadCoord(uint threadIndex)
    uint2 coord;

    // could load from buffer instead, depending on what shader bottlenecks are
    /* Sightly more expensive square version produces:

     0  1  4  5 16 17 20 21
     2  3  6  7 18 19 22 23
     8  9 12 13 24 25 28 29
    10 11 14 15 26 27 30 31
    32 33 36 37 48 49 52 53
    34 35 38 39 50 51 54 55
    40 41 44 45 56 57 60 61
    42 43 46 47 58 59 62 63

    // 3 periods of tiling, non optimised for clarity:
    x = ( index & 1      ) + ((index & 4) >> 1) + ((index & 16) >> 2);
    y = ((index & 2) >> 1) + ((index & 8) >> 2) + ((index & 32) >> 3);

    // duplicate index in top 16 bits, but pre-shifted right by 1
    threadIndex |= threadIndex << 15;

    // two bitwise ANDS for the price of one
    coord.x  = threadIndex & 0x10001;
    coord.x |= (threadIndex >> 1) & 0x20002;
    coord.x |= (threadIndex >> 2) & 0x40004;

    coord.y = coord.x >> 16;
    coord.x &= 0x7;
    /* Cheap rectangular version produces:

     0  1  4  5  8  9 12 13
     2  3  6  7 10 11 14 15
    16 17 20 21 24 25 28 29
    18 19 22 23 26 27 30 31
    32 33 36 37 40 41 44 45
    34 35 38 39 42 43 46 47
    48 49 52 53 56 57 60 61
    50 51 54 55 58 59 62 63

    // 2 periods of tiling, non optimised for clarity:
    x = (index & 1) + ((index & 12) >> 1);
    y = ((index & 2) >> 1) + ((index & 48) >> 3);

    uint indexSHR1 = threadIndex >> 1;
    coord.x = (threadIndex & 1) + (indexSHR1 & 6);
    coord.y = (indexSHR1 & 1) + ((threadIndex & 48) >> 3);
    return coord;

You could simplify the math a little for wave16, but these functions handle the common wave32 and wave64 sizes equally well.


PS2 Vector Unit Lighting on Shadowman2

I got my break as a graphics engineer at Acclaim working on the first Shadowman title, starting before graduating, though I did graduate. Shadowman2 was my first full game, and the studio’s first PS2 title. Development was troubled from the outset; a sorry tale of redundancies, management changes, toning down the content to avoid a mature rating, a script rewrite and engine change half way through, senior staff leaving, & etc.. I stuck it out, writing all of the core rendering code, mostly in ASM on the vector units. However I then spent almost a year as lead, a time when the rendering code remained static, due fire fighting so many other fires. We never did implement triangle strips, an essential optimisation for PS2, but the world geometry did support quads. This was a first generation PS2 title, unfortunately delayed by a year.

Early on character artist Robert Nash wanted an increase in poly budget to open up Shadowman’s chest, adding a point light and effect, which enabled the night to be truly dark without being unplayable, but allowing monsters to lurk unseen. Additionally we wanted to support environment and weapon based point lights, a maximum of four lights affecting any one ‘VU packet’ of world geometry at once. The problem was, we could not afford the required divide and sqrt instructions. (And this is per-vertex, the PS2 did not have pixel shaders) Four lights is also a natural number for the PS2 VU’s, since floating point SIMD instructions typically had a latency of 4 cycles between issue and being able to use the result, but you could issue an instruction every cycle.

The Playstation2 VU’s had two pipes, the upper is a SIMD4 single precision floating point unit, which operated on each of .xyzw in parallel. From memory divides were very expensive at 7 or 13 clock cycles (the general divide instruction being faster than the reciprocal instruction!), sqrt and rsqrt were similarly expensive. Worse these operated on a single floating point value, there was no vector4 divide or vector4 sqrt instruction.

The text book point light implementation for 4 point lights would have required well over 100 clock cycles, and of course we had to transform the vertex, apply 1/w to the UV’s etc.. I needed something which ran an order of magnitude faster!

The first ‘trick’ was not store vectors as .xyz, as it was almost impossible to use the w component of the 4 component registers effectively. (and this remained true throughout the next console generation as well, X360/PS3) So the four point light positions were stored in three vector registers by packing all four lights position X components in one register, all Y in the next and all Z in the last. (.xxxx, .yyyy. .zzzz) The PS2 VU’s had a great instruction set which allowed you code the following very efficiently:

float4 Lx = lightPositionX.xyzw - worldPos.x;
float4 Ly = lightPositionY.xyzw - worldPos.y;
float4 Lz = lightPositionZ.xyzw - worldPos.z;
float4 distToLightSqr = Lx * Lx + Ly * Ly + Lz * Lz;

Quick and with no wasted register lanes. The next part was how to produce the attenuation term as cheaply as possible? This is easiest explained with a code fragement (though this isn’t what shipped)

float4 attenuation = 1.0 - distToLightSqr.xyzw * light.invRadius2.xyzw;
vertexColour += attention.x * lightColour0;
vertexColour += attention.y * lightColour1;
vertexColour += attention.z * lightColour2;
vertexColour += attention.w * lightColour3;

Here we pre-calculate 1/radius^2 on the CPU. In fact the above is a simplification for understanding. The final step was to multiply the attenuation calculation by light.colour and negate the last term, so that we can use the multiply-add instruction.

vertexColour += distToLightSqr.x * lightColour0DivRadius2 + lightColour0;
vertexColour += distToLightSqr.y * lightColour1DivRadius2 + lightColour1;
vertexColour += distToLightSqr.z * lightColour2DivRadius2 + lightColour2;
vertexColour += distToLightSqr.w * lightColour3DivRadius2 + lightColour3;

The CPU pre-calculated: -(lightColour / radius^2).

Just for completeness, here’s the whole ‘shader’ run on VU1. This of course doesn’t look anything like modern code for point lighting.

float4 Lx = lightPositionX.xyzw - worldPos.x;
float4 Ly = lightPositionY.xyzw - worldPos.y;
float4 Lz = lightPositionZ.xyzw - worldPos.z;
float4 distToLightSqr = Lx * Lx + Ly * Ly + Lz * Lz;
vertexColour += distToLightSqr.x * lightColour0DivRadius2 + lightColour0;
vertexColour += distToLightSqr.y * lightColour1DivRadius2 + lightColour1;
vertexColour += distToLightSqr.z * lightColour2DivRadius2 + lightColour2;
vertexColour += distToLightSqr.w * lightColour3DivRadius2 + lightColour3;

The eagle eyed will have noticed also there’s no N.L term, I simply couldn’t afford that, so point lights attenuated with distance only, they didn’t attenuate with the surface angle. But that was it, 4 points lights without divide or sqrt instructions, efficently utilising all four SIMD lanes and the multiply-add instruction.

This super cheap attenuation curve is not a great model of reality, somewhat the inverse of what real light does, but it worked well enough for magical ‘voodoo’ effects, with small/medium sized triangles, and the rasterizers non-perspective correct(!) interpolation of these radiance values across the triangle. (Only textures were perspective correct on PS2, vertex colour was not)

I shared the point lighting code with Acclaim Studios Cheltenham who used is in the 60fps racer Extreme G3 on PS2. I’m not aware of many other PS2 games doing point lighting.

A personal bug bear was that the collision system was extremely ropey, and there was no seperate collision skin, so the artists kept the world geometry simple in areas the player could traverse, with a lot of the triangle budget often going to the ceiling! I eventually entirely rewrote the collision system prior to shipping, but far too late to effect the art content.

Dynamic lighting for instanced objects (as opposed to world geometry) worked differently. Here the CPU converted point lights into direction lights before uploading to VU1. That is the CPU computed a single distance attenuation value and a single light direction vector for the whole object. Meaning you could compute N.L angle attenuation very cheaply on the vector unit, but could not do per-vertex distance attenuation. I assume a lot of games did this.

Shadowman’s chest light didn’t look great automatically converted to a direction light, being inside the model, direction vectors flicked widly with the animation played. So I forced the light to point straight down, which helped also with his shadow and the visual aid for making jumps. The undesirable feature of using the render geometry as the collision geometry had a silver lining. I was able to use the collision geometry to project a disc between shadowman’s feet and draw a shadow. The CPU found the subset of collision triangles overlapping the bounds of the shadow, and then rendered these with UV’s computed to render a disk, darkening the underlying geometry between Shadowman’s feet, following any undulations in the geometry perfectly. (Unlike Shadowman1, which rendered shadowman squashed onto a single plane and in black, without any transparency)

Something I wish I’d had time to code was an option to light large instance objects the same way as world gometry, where the conversion of a point light to a direction light didn’t work well. Again I was fire fighting other problems all the way to shipping.

The live side levels (as opposed to dead side) had a sun/moon and a real time day night cycle was essential to the game. Instanced objects periodically raycast to the sun, I think every 16 frames, and would fade their contribution from the sun/moon up or down over time. Live side world geometry vertices had precalculated visibility from the sun for 16 different times of day, stored as individual bits. This same visibility was used at night for the moon. Something which really helped our RAM and storage problems was that the only world geometry that needed vertex normals was liveside outdoor sectors. So deadside levels and indoor sectors didn’t have vertex normals. We also had height fog, which was pretty unusual for the time and I don’t think even possible on PC at that time with DirectX7.

I had a prototype of just-in-time world lighting on VU0 which was a lot faster but didn’t ship, due to there being some edge cases I never found time to code, including adding in the day night cycle shadows. We did however do just-in-time vertex skinning on VU0, so skinning and render ran in parallel on the two vector units.

Something I prototyped after Shadowman2 shipped was ‘dot3’ bump mapping on the PS2. This was not the full screen multi-pass algorithm Sony developed. Instead I uploaded 256 normals to VU0, computed N.L and point rendered the result to a texture palette as the render target. The idea was that I would do distance attenuation per vertex as in Shadowman2, and then modulate this per texel with the N.L term from the texture. This did look good for the time (I wish I had a screenshot!), however the problems here were that because the PS2 could only single texture, you needed a pass per light, and then you had to render the geometry again to blend the albedo texture on top. We were going to try this for Forsaken2, perhaps only for the ships and select models, avoiding organic shapes, so the 256 normal limitation was likely going to work out ok.

Sadly Forsaken 2 was canned, Acclaim folded, and I never found the opportunity to implement this tech in another title. Of course ‘dot3’ didn’t really take off, and was superseded by the far superior and now ubiquitous tangent space normal mapping.

Fast, Near Lossless ‘Compression’ of Normal Floats

Here’s a trick for lossless storage of ‘normal’ floating point numbers I came up with years ago, but was only reminded of recently. Realising I haven’t seen it anywhere else since, time for a blog.

The IEEE754 single precision ‘float’ is ubiquitous in computer graphics, and much better undertood than it used to be, thanks to some great blogs and engineers pushing the envelope being forced to get to grips with its limitations.

In computer graphics, its extremely common to store normal numbers, signed [-1..1] or unsigned [0..1], so much so, we have universal GPU support for SNORM and UNORM formats. Of course its also common to quanatize normal numbers to use less than 32bits, with great research in particular into high quality, compact storage of three dimensional normal vectors, for g-buffers and other applications. These are lossy, but that’s the point.

My technique stores an unsigned normal 32bit floating point number using only 24 bits with a maximum error of 5.96046448e-8 (0.0000000596448), and with zero error at 0.0 and 1.0. This is trivially extended to signed normal numbers.

To give one use case, storing normalised linear depth after rendering, you could pack linear depth into 24 bits and stencil into the other 8 bits. Giving an old school style D24S8 surface, but with negilable loss of precision vs a full 32bit float.

There are plenty of excellent resources on how floating point storage work, I’m not going to repeat these, but I need to cover just a little of how a ‘float’ is stored to explain the technique. This is the simplisitic way I think of the three components of the IEEE754 single precision float:

  • A sign bit – simple
  • An 8 bit exponent – the power of 2 range that contains the number
  • 23 bits of mantissa – an interpolation from the lower power of 2 in this range, up to but not quite including the next power of 2.

So for example, the exponent might specify ranges [0.5..1} or [1..2}, or [4..8} etc.. Its the range [1..2} which is key to this technique, since the delta of the stored numbers in this range is 1, or nearly 1 to be precise.

Dealing with unsigned normal numbers only for a moment, if we add 1 to our number, then we can store off the 23bits of mantissa and discard the rest of the floating point representation. To reconstruct we bitwise OR in the infamous 0x3f800000 (1.0) and then subtract 1 to get back into the original range. Unfortunately we also want to handle the case that the number stored is exactly 1, so we need another bit for that. This then is how we get to 24 bits, move the normal float into the [1..2} range, store the 23bit mantissa and store an extra bit to indicate if the value is exactly 1.

Here’s the code in HLSL, note there’s actually a problem with the compress function, but I’ll come to that in a bit.

// note this function has an edge case where it breaks, see below for why and a fixed version!
uint CompressNormalFloatTo24bits(float floatIn)
    return (floatIn == 1.0) ? (1 << 23) : asuint(floatIn + 1.0) & ((1 << 23) - 1);

// input needs to be low 24 bits, with 0 in the top 8 bits
float DecompressNormalFloatFrom24bits(uint uintIn)
    return (uintIn & (1 << 23)) ? 1.0 : asfloat(uintIn | 0x3f800000) - 1.0;

Clearly both ‘compression’ and ‘reconstruction’ are extremely cheap operations, especially as the compiler can resolve some of the bitwise operations to a constant. Why any error at all? The error creeps in from the fact we are manipulating the floating point number out of the [0..1} range, the storage of which uses one of many different possible exponents, then by adding 1 we move into a single exponent range that covers all of [1..2}, and this is not a lossless operation. However typically in computer graphics, an engineer is unlikely to be put off by a max floating point accuracy error of 5.96046448e-8.

So what’s the problem with the above compression function? There issue is, there is one number which can be stored in the [0..1} range, but when we add one, it cannot be represented in the [1..2} range. This is 0.99999994, the hexidecimal 0x3f7fffff gives a clue as to the problem, all mantissa bits are set. When we add 1.0 to this, we get 2.0, not 1.99999994 (as this number is not representable), 2.0 is not covered by our chosen exponent, and so the above function breaks. Fortunately the fix for our compression function is simple and ordinarily no additional cost, at least on a GPU:

uint CompressNormalFloatTo24bits(float floatIn)
    // careful to ensure correct behaviour for normal numbers < 1.0 which roundup to 2.0 when one is added
    floatIn += 1.0;    
    return (floatIn >= 2.0) ? (1 << 23) : asuint(floatIn) & ((1 << 23) - 1);

The eagle-eyed will have noticed I changed == to >=, this is just a safety feature for bad input and not actually part of the fix, clamping our input for free, which is always nice.

Handling signed normal floats we need to store the sign bit also which is trivial, and then we can use the same functions by taking the abs of the input. Of course you might wish to keep to 24 bits, and so you might sacrifice the least significant mantissa bit.

24bits is of course a bit of an odd size for a computer to deal with, so this is really a tool in your toolbox for packing with other data. The ability to drop least significant mantissa bits gives some flexibility in packing.

I’ve only used this on IEEE754 single precision floats, thinking out loud, there are some interesting possibilities for other floating point representations:

  1. Half precision floats (and NVidia’s TensorFloat) have 10 bits of mantissa. A three component signed normal vector would require 12+12+12 = 36 bits. To get into 32bits you could either drop 1 or 2 mantissa bits from each component, or you might chose to drop the ability to store exactly -1 and 1., saving a bit from each and only having to drop 1 mantissa bit total.
  2. Brain floats have 7 bits of mantissa, this trick for a unsigned normal numbers would only require a byte.

As a bonus, here’s some functionality for C++ guys wanting to run the same functions

union floatint32_u
    float f;
    uint32_t u;
    int32_t s;

uint32_t asuint(float input)
    floatint32_u t;

    t.f = input;
    return t.u;

float asfloat(uint32_t input)
    floatint32_u t;

    t.u = input;
    return t.f;

Min/Max Buffer Precision Improvement

This is a simple trick I came up with years ago. I’ve finally decided to create a blog and share a little with the community, wish me luck!

In computer graphics you sometimes end up storing min/max data pairs. An obvious example is in deferred lighting, where a game engine will compute and store the min and max of all depth values in a screen tile, say 16×16 pixels large. Lights are then culled against this depth range. (or multiple ranges, in the case of clustered schemes)

Of course, graphics engineers are concious of memory use and moreover implications for bandwidth and cache pressure. Therefore its common to quantize data to the smallest type we can get away with. So for example we might chose to store min/max linear depth values as 16bit unorm’s. e.g. using a 2D texture with the format DXGI_FORMAT_R16G16_UNORM. Probably as I do, converting from reverse non-linear Z during rasterization to forward linear Z for deferred passes.

The min/max texture for a terrain scene looks like this:

The red channel stores the minimum depth value, and the green channel the maximum in each screen tile. RG 1,0 (an illegal value) is being used to denote a clear tile, i.e. sky. Where min and max depth are similar, we have some shade of yellow based on distance. When there is a large depth range, the colour tends towards green, as the green channel is storing a max value substantially larger than the min value. Intuitively this occurs on silouettes. Such storage is common, but wasteful of precision, since both min and max channels can store anything in the range 0..1.

Depth was originally a 32bit float and in converting to 16bits we lost a lot of information.

Fortunately we have an exploitable constraint in our data, i.e. min <= max and conversely max >= min. The trick is to make one of these values relative to the other. In the following I choose to gift min more precision, but its just as easy to do the same for max instead.

Using the same texture format, instead of storing min and max directly, I store max and a delta which interpolates between 0 and max. So as long as max is less than 1.0, we have improved the precision of min. This is trivial to code:

// encode for texture write
encodedRG = float2(min / max, max);
// decode after texture read
min = encodedRG.x * encodedRG.y;
max = encodedRG.y;

In this scheme the green channel of the texture looks exactly as it did before, however the red channel is drastically altered.

When there is very little difference between min and max, the encoded delta value is close to 1. Only when a large depth discontinuity exists do we see smaller values in the red channel.

If the max value is 0.25 for example, which the far mountains are in this scene. The minimum value benefits from effecively four times the precision, since the same 16 bits are now being used to store values in the range 0 – 0.25, instead of 0 – 1.

(Note I have modified the histogram of the images slightly to make the colours stand out more)

This results in ~0.4% speed up in my deferred lighting, due to less pixels processed with zero attenuation. Not bad for such a small change, but not earth shattering either. YMMV, and of course improvements in precision are sometimes about quality rather than optimisation.

A future extension would be to stop using a 2 channel texture and instead pack delta min and max into a single R32_UINT. The potential benefit here would be to gift a different number of bits to each of delta min and max. Say giving max 17 bits and delta min 15 bits. This of course requires the shader to perform more operations in packing and unpacking.