Here’s a trick for near-lossless storage of ‘normal’ floating point numbers that I came up with years ago, but was only reminded of recently. Having not seen it anywhere else since, I figured it was time for a blog.

The IEEE754 single precision ‘float’ is ubiquitous in computer graphics, and much better understood than it used to be, thanks to some great blogs and to engineers pushing the envelope and being forced to get to grips with its limitations.

In computer graphics, it’s extremely common to store normal numbers, signed [-1..1] or unsigned [0..1], so much so that we have universal GPU support for SNORM and UNORM formats. Of course it’s also common to quantize normal numbers to use less than 32 bits, with great research in particular into high quality, compact storage of three dimensional normal vectors, for g-buffers and other applications. These are lossy, but that’s the point.

My technique stores an unsigned normal 32bit floating point number using only 24 bits, with a maximum error of 5.96046448e-8 (0.0000000596046448) and zero error at 0.0 and 1.0. This is trivially extended to signed normal numbers.

To give one use case: when storing normalised linear depth after rendering, you could pack linear depth into 24 bits and stencil into the other 8 bits, giving an old school style D24S8 surface but with negligible loss of precision vs a full 32bit float.

There are plenty of excellent resources on how floating point storage works and I’m not going to repeat them, but I need to cover just a little of how a ‘float’ is stored to explain the technique. This is the simplistic way I think of the three components of the IEEE754 single precision float:

- A sign bit – simple
- An 8 bit exponent – the power of 2 range that contains the number
- 23 bits of mantissa – an interpolation from the lower power of 2 in this range, up to but not quite including the next power of 2.

So for example, the exponent might specify the ranges [0.5..1} or [1..2}, or [4..8} etc. It’s the range [1..2} which is key to this technique, since the stored numbers in this range span a width of 1, or nearly 1 to be precise, as 2 itself is excluded.
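As a rough illustration of that layout, here is a minimal C++ sketch that pulls the three fields out of a float. It assumes `float` is a 32bit IEEE754 single; `FloatParts` and `SplitFloat` are names invented for this example only.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type for this sketch: the three fields of a 32bit float.
struct FloatParts
{
    uint32_t sign;     // 1 bit
    uint32_t exponent; // 8 bits, stored with a bias of 127
    uint32_t mantissa; // 23 bits
};

// Assumes float is a 32bit IEEE754 single precision value.
FloatParts SplitFloat(float value)
{
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits)); // copy out the raw bit pattern

    FloatParts parts;
    parts.sign     = bits >> 31;
    parts.exponent = (bits >> 23) & 0xffu;
    parts.mantissa = bits & 0x7fffffu;
    return parts;
}
```

Every value in [1..2} comes back with sign 0 and a stored exponent of 127, so only the mantissa differs from one value to the next. For instance `SplitFloat(1.75f)` gives a mantissa of 0x600000, i.e. three quarters of the way through the range.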

Dealing with unsigned normal numbers only for a moment: if we add 1 to our number, it lands in the [1..2] range, where the sign and exponent bits are always the same, so we can store off the 23 bits of mantissa and discard the rest of the floating point representation. To reconstruct, we bitwise OR in the infamous 0x3f800000 (1.0) and then subtract 1 to get back into the original range. Unfortunately we also want to handle the case where the number stored is exactly 1 (which becomes 2 and so falls outside [1..2}), so we need another bit for that. This, then, is how we get to 24 bits: move the normal float into the [1..2} range, store the 23bit mantissa, and store an extra bit to indicate if the value is exactly 1.

Here’s the code in HLSL. Note there’s actually a problem with the compress function, but I’ll come to that in a bit.

```
// note this function has an edge case where it breaks, see below for why and a fixed version!
uint CompressNormalFloatTo24bits(float floatIn)
{
    return (floatIn == 1.0) ? (1 << 23) : asuint(floatIn + 1.0) & ((1 << 23) - 1);
}

// input needs to be low 24 bits, with 0 in the top 8 bits
float DecompressNormalFloatFrom24bits(uint uintIn)
{
    return (uintIn & (1 << 23)) ? 1.0 : asfloat(uintIn | 0x3f800000) - 1.0;
}
```

Clearly both ‘compression’ and ‘reconstruction’ are extremely cheap operations, especially as the compiler can resolve some of the bitwise operations to constants. So why is there any error at all? Numbers in the [0..1} range are stored using one of many different possible exponents, and the smaller the exponent, the finer the spacing between representable values. Adding 1 moves everything into the single exponent range that covers all of [1..2}, where the spacing is fixed at 2^-23, so any finer detail is rounded away; this is not a lossless operation. The worst case is half that spacing, 2^-24, which is where the 5.96046448e-8 figure comes from. However, typically in computer graphics, an engineer is unlikely to be put off by a maximum absolute error of 5.96046448e-8.

So what’s the problem with the above compression function? The issue is that there is one number which can be stored in the [0..1} range but which, when we add one, cannot be represented in the [1..2} range. This is 0.99999994, and its hexadecimal form 0x3f7fffff gives a clue as to the problem: all mantissa bits are set. When we add 1.0 to this, we get 2.0, not 1.99999994 (as this number is not representable). 2.0 is not covered by our chosen exponent, and so the above function breaks. Fortunately the fix for our compression function is simple and ordinarily comes at no additional cost, at least on a GPU:

```
uint CompressNormalFloatTo24bits(float floatIn)
{
    // careful to ensure correct behaviour for normal numbers < 1.0 which round up to 2.0 when one is added
    floatIn += 1.0;
    return (floatIn >= 2.0) ? (1 << 23) : asuint(floatIn) & ((1 << 23) - 1);
}
```

The eagle-eyed will have noticed I changed == to >=. This is just a safety feature for bad input and not actually part of the fix: it clamps inputs above 1 for free, which is always nice.
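For anyone wanting to poke at the edge case on the CPU, the fixed functions can be ported to C++ roughly as follows. This is a sketch that assumes a 32bit IEEE754 `float` with the default round-to-nearest mode; `RoundTripError` is a helper invented for this example, and the block carries its own `memcpy`-based `asuint` and `asfloat` so it stands alone.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

uint32_t asuint(float input)
{
    uint32_t output;
    std::memcpy(&output, &input, sizeof(output));
    return output;
}

float asfloat(uint32_t input)
{
    float output;
    std::memcpy(&output, &input, sizeof(output));
    return output;
}

// C++ port of the fixed HLSL compress function above.
uint32_t CompressNormalFloatTo24bits(float floatIn)
{
    floatIn += 1.0f; // 0x3f7fffff rounds up to exactly 2.0 here
    return (floatIn >= 2.0f) ? (1u << 23) : asuint(floatIn) & ((1u << 23) - 1u);
}

// Input needs to be the low 24 bits, with 0 in the top 8 bits.
float DecompressNormalFloatFrom24bits(uint32_t uintIn)
{
    return (uintIn & (1u << 23)) ? 1.0f : asfloat(uintIn | 0x3f800000u) - 1.0f;
}

// Hypothetical helper: absolute error after a compress/decompress round trip.
float RoundTripError(float value)
{
    return std::fabs(DecompressNormalFloatFrom24bits(CompressNormalFloatTo24bits(value)) - value);
}
```

With this, 0.0, 0.5 and 1.0 should round-trip exactly, while `asfloat(0x3f7fffff)` should compress to the ‘exactly 1’ flag and come back as 1.0, an error of 2^-24. Note that the error is absolute rather than relative, so inputs below 2^-24 come back as 0.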

To handle signed normal floats we also need to store the sign bit, which is trivial, and then we can use the same functions by taking the abs of the input. Of course you might wish to keep to 24 bits, in which case you might sacrifice the least significant mantissa bit.
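One way the signed extension might look in C++ is sketched below, using 25 bits with the sign in bit 24. The function names and the bit layout are my assumptions for illustration, a 32bit IEEE754 `float` is assumed, and the 24 bit variant that drops the least significant mantissa bit is not shown.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

uint32_t asuint(float input)
{
    uint32_t output;
    std::memcpy(&output, &input, sizeof(output));
    return output;
}

float asfloat(uint32_t input)
{
    float output;
    std::memcpy(&output, &input, sizeof(output));
    return output;
}

// C++ ports of the unsigned functions from above.
uint32_t CompressNormalFloatTo24bits(float floatIn)
{
    floatIn += 1.0f;
    return (floatIn >= 2.0f) ? (1u << 23) : asuint(floatIn) & ((1u << 23) - 1u);
}

float DecompressNormalFloatFrom24bits(uint32_t uintIn)
{
    return (uintIn & (1u << 23)) ? 1.0f : asfloat(uintIn | 0x3f800000u) - 1.0f;
}

// Hypothetical layout: bit 24 = sign, bit 23 = 'exactly 1' flag, bits 0..22 = mantissa.
uint32_t CompressSignedNormalFloatTo25bits(float floatIn)
{
    uint32_t sign = asuint(floatIn) >> 31; // 1 for negative input
    return (sign << 24) | CompressNormalFloatTo24bits(std::fabs(floatIn));
}

float DecompressSignedNormalFloatFrom25bits(uint32_t uintIn)
{
    float magnitude = DecompressNormalFloatFrom24bits(uintIn & 0x00ffffffu);
    return (uintIn & (1u << 24)) ? -magnitude : magnitude;
}
```

Both -1.0 and 1.0 remain exactly representable in this layout, since the ‘exactly 1’ flag and the sign bit are stored independently.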

24 bits is of course a bit of an odd size for a computer to deal with, so this is really a tool in your toolbox for packing with other data. The ability to drop least significant mantissa bits gives some flexibility in packing.
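As an example of packing with other data, here is a hedged C++ sketch of the depth plus stencil use case mentioned earlier. The names `PackDepth24Stencil8` and `UnpackDepth24Stencil8` are invented for this example, and putting depth in the low 24 bits with stencil in the top 8 is my own choice, not a claim about any hardware D24S8 layout.

```cpp
#include <cstdint>
#include <cstring>

uint32_t asuint(float input)
{
    uint32_t output;
    std::memcpy(&output, &input, sizeof(output));
    return output;
}

float asfloat(uint32_t input)
{
    float output;
    std::memcpy(&output, &input, sizeof(output));
    return output;
}

// C++ ports of the unsigned functions from above.
uint32_t CompressNormalFloatTo24bits(float floatIn)
{
    floatIn += 1.0f;
    return (floatIn >= 2.0f) ? (1u << 23) : asuint(floatIn) & ((1u << 23) - 1u);
}

float DecompressNormalFloatFrom24bits(uint32_t uintIn)
{
    return (uintIn & (1u << 23)) ? 1.0f : asfloat(uintIn | 0x3f800000u) - 1.0f;
}

// Hypothetical layout: stencil in bits 24..31, compressed depth in bits 0..23.
uint32_t PackDepth24Stencil8(float linearDepth, uint8_t stencil)
{
    return (uint32_t(stencil) << 24) | CompressNormalFloatTo24bits(linearDepth);
}

void UnpackDepth24Stencil8(uint32_t packed, float& linearDepth, uint8_t& stencil)
{
    // Mask off the stencil so the decompress function sees 0 in the top 8 bits.
    linearDepth = DecompressNormalFloatFrom24bits(packed & 0x00ffffffu);
    stencil = uint8_t(packed >> 24);
}
```

Masking before decompressing matters here, because the decompress function expects the top 8 bits to be 0.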

I’ve only used this on IEEE754 single precision floats but, thinking out loud, there are some interesting possibilities for other floating point representations:

- Half precision floats (and NVIDIA’s TensorFloat) have 10 bits of mantissa. Each signed normal component needs 12 bits (10 mantissa bits, the ‘exactly 1’ bit and the sign bit), so a three component signed normal vector would require 12+12+12 = 36 bits. To get into 32 bits you could either drop 1 or 2 mantissa bits from each component, or you might choose to drop the ability to store exactly -1 and 1, saving a bit from each and only having to drop 1 mantissa bit in total.
- Brain floats have 7 bits of mantissa, so this trick for unsigned normal numbers would only require a byte.

As a bonus, here’s some functionality for C++ guys wanting to run the same functions:

```
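#include <cstdint>
#include <cstring>

// std::memcpy is the portable way to reinterpret the bits: reading an inactive
// union member is undefined behaviour in C++ (C++20 code can use std::bit_cast).
uint32_t asuint(float input)
{
    uint32_t output;
    std::memcpy(&output, &input, sizeof(output));
    return output;
}

float asfloat(uint32_t input)
{
    float output;
    std::memcpy(&output, &input, sizeof(output));
    return output;
}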
```