I often find it's useful to have four threads (or lanes, depending on your preferred parlance) collaborate on a 2×2 block of pixels. Traditionally this would be done with group shared memory, but with Shader Model 6 we have the QuadReadAcrossX/Y/Diagonal intrinsics.
The advantage of these intrinsics over group shared memory is that group shared memory is a resource: using too much of it can restrict occupancy and reduce shader performance. Shared memory may also come with synchronisation requirements and, depending on your hardware, subtle performance penalties such as bank conflicts, which I won't go into here. Generally speaking, I expect the QuadRead intrinsics to outperform group shared memory.
The documentation (see Quad-wide Shuffle Operations) gives the required order of the threads, the expected behaviour being:
[0, 1]
[2, 3]
// QuadReadAcrossX for Thread0 will obtain Thread1's value and vice versa.
// QuadReadAcrossX for Thread2 will obtain Thread3's value and vice versa.
// QuadReadAcrossY for Thread0 will obtain Thread2's value and vice versa.
// QuadReadAcrossY for Thread1 will obtain Thread3's value and vice versa.
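One way to internalise this ordering: with lanes numbered as above, QuadReadAcrossX exchanges a lane's value with lane ^ 1, QuadReadAcrossY with lane ^ 2, and QuadReadAcrossDiagonal with lane ^ 3. A tiny C model of the lane exchanges (these are my stand-in functions for CPU-side reasoning, not the real intrinsics):

```c
#include <assert.h>

/* Given the quad layout
 *   [0, 1]
 *   [2, 3]
 * each shuffle simply flips lane-index bits: AcrossX flips bit 0,
 * AcrossY flips bit 1, AcrossDiagonal flips both. */
static unsigned QuadAcrossX(unsigned lane)        { return lane ^ 1u; }
static unsigned QuadAcrossY(unsigned lane)        { return lane ^ 2u; }
static unsigned QuadAcrossDiagonal(unsigned lane) { return lane ^ 3u; }
```

Since quads are aligned groups of four lanes, the same XOR works on full wave lane indices, not just on the in-quad index 0–3.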
The question then is, for a given threadIndex (obtained via WaveGetLaneIndex()), how do we generate the 2D coordinate the thread is to operate on? The spec doesn't define a default mapping. Of course, you could just do a buffer read to map thread index to a 2D coordinate, but you might want to use maths instead, depending on where your shader's bottlenecks are.
I came up with two different layouts. A cheap version with two periods of tiling, and a more expensive version with three periods, which can be more useful for reduction operations, especially creating mip maps with wave64.
(Below: the cheap 'rectangular' lane-to-2D-coord mapping; I've highlighted the first 4 quads of a wave64.)

In the more expensive version I was able to reduce the number of operations by packing the x and y coordinates into the bottom and top 16 bits of a dword respectively, thus allowing me to perform the same operation concurrently on both x and y. The code might take some dissecting, so the unoptimised version is reproduced in the comments.
(Below: the more expensive 'square' lane-to-2D-coord mapping; I've highlighted the first 4 quads of a wave64.)

The comment blocks show the tiling patterns for wave64, but it's easy to extrapolate what the code does for other wave sizes.
uint2 ThreadIndexToQuadCoord(uint threadIndex)
{
    uint2 coord;
    // could load from a buffer instead, depending on what the shader's bottlenecks are
#if CSQUADCOORD_SQUARE != 0
    /* Slightly more expensive square version produces:
        0  1  4  5 16 17 20 21
        2  3  6  7 18 19 22 23
        8  9 12 13 24 25 28 29
       10 11 14 15 26 27 30 31
       32 33 36 37 48 49 52 53
       34 35 38 39 50 51 54 55
       40 41 44 45 56 57 60 61
       42 43 46 47 58 59 62 63

       3 periods of tiling, non-optimised for clarity:
       x = ( index & 1)       + ((index &  4) >> 1) + ((index & 16) >> 2);
       y = ((index & 2) >> 1) + ((index &  8) >> 2) + ((index & 32) >> 3);
    */
    // duplicate index in top 16 bits, but pre-shifted right by 1
    threadIndex |= threadIndex << 15;
    // two bitwise ANDs for the price of one
    coord.x = threadIndex & 0x10001;
    coord.x |= (threadIndex >> 1) & 0x20002;
    coord.x |= (threadIndex >> 2) & 0x40004;
    coord.y = coord.x >> 16;
    coord.x &= 0x7;
#else
    /* Cheap rectangular version produces:
        0  1  4  5  8  9 12 13
        2  3  6  7 10 11 14 15
       16 17 20 21 24 25 28 29
       18 19 22 23 26 27 30 31
       32 33 36 37 40 41 44 45
       34 35 38 39 42 43 46 47
       48 49 52 53 56 57 60 61
       50 51 54 55 58 59 62 63

       2 periods of tiling, non-optimised for clarity:
       x = ( index & 1)       + ((index & 12) >> 1);
       y = ((index & 2) >> 1) + ((index & 48) >> 3);
    */
    uint indexSHR1 = threadIndex >> 1;
    coord.x = (threadIndex & 1) + (indexSHR1 & 6);
    coord.y = (indexSHR1 & 1) + ((threadIndex & 48) >> 3);
#endif
    return coord;
}
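The packing trick is plain integer arithmetic, so it can be sanity-checked on the CPU against the unoptimised formulas from the comment block. A C sketch (my naming), asserting agreement for all 64 lane indices:

```c
#include <assert.h>
#include <stdint.h>

/* Unoptimised 'square' mapping, transcribed from the shader comments. */
static uint32_t SquareX(uint32_t i)
{
    return (i & 1) + ((i & 4) >> 1) + ((i & 16) >> 2);
}

static uint32_t SquareY(uint32_t i)
{
    return ((i & 2) >> 1) + ((i & 8) >> 2) + ((i & 32) >> 3);
}

/* Optimised version: y's three bits are assembled in the top 16 bits
   of the same dword as x's, so each AND/OR advances both axes at once. */
static void SquarePacked(uint32_t i, uint32_t *x, uint32_t *y)
{
    i |= i << 15;              /* duplicate index, pre-shifted right by 1 */
    uint32_t c = i & 0x10001;
    c |= (i >> 1) & 0x20002;
    c |= (i >> 2) & 0x40004;
    *y = c >> 16;
    *x = c & 0x7;
}
```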
You could simplify the maths a little for wave16, but these functions handle the common wave32 and wave64 sizes equally well.
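As a final sanity check, a straight C port of the function can be verified exhaustively against the quad layout the intrinsics expect: lanes 4k to 4k+3 should land on a 2×2 block in the [0, 1] / [2, 3] arrangement, with the quad origin on even coordinates. A sketch, with CSQUADCOORD_SQUARE turned into a runtime flag for testing both branches:

```c
#include <assert.h>
#include <stdint.h>

/* CPU port of ThreadIndexToQuadCoord; 'square' stands in for the
   CSQUADCOORD_SQUARE compile-time switch. */
static void ThreadIndexToQuadCoord(uint32_t threadIndex, int square,
                                   uint32_t *x, uint32_t *y)
{
    if (square) {
        threadIndex |= threadIndex << 15;
        uint32_t c = threadIndex & 0x10001;
        c |= (threadIndex >> 1) & 0x20002;
        c |= (threadIndex >> 2) & 0x40004;
        *y = c >> 16;
        *x = c & 0x7;
    } else {
        uint32_t indexSHR1 = threadIndex >> 1;
        *x = (threadIndex & 1) + (indexSHR1 & 6);
        *y = (indexSHR1 & 1) + ((threadIndex & 48) >> 3);
    }
}
```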