Compressing 2D grid of categorical data

Suppose we had a 100k by 100k grid of u8. That would take 10GB to store.

Suppose these u8 stored linear data, then we could view this as a 100k by 100k jpeg (slightly beyond the 65536 jpeg boundary), and store it lossily as a jpeg, losing some data, but saving lots of space.

====

Suppose now that we have a 100k by 100k of some enum that takes on 32 different values (can be stored in 5 bits) -- but this is categorical data, i.e. the enum value "2" can not be thought of as the avg of enum 1 and enum 3.

For example: enum 1 = animal, enum 2 = car, enum 3 = ship.

In this case, is there a standard way to store this data in a lossy way?

(Thins like jpeg compression make no sense, in this case, since the data is categories and not linear.)

Must it be lossy?

If not, you could just zip it, which seems like can be done in memory.

It does not have to be lossy, but I think allowing it to be lossy allows better compression ratios than zip / gzip -9 / bzip -9.

This is outside my comfort zone, so pardon me if I say something stupid...

What does it mean to compress categorical data in a lossy way? I guess it means that when you inflate the data again, you sometimes don't get back the same variant of the enum, but instead you get one that is "close enough". But does that make sense? From your example, is an animal close enough to a car?

The best compression algorithms for a given piece of data take advantage of structure/regularities in the data. In particular lossy compression algorithms like JPEG or MP3 are tuned to a model of human perception. They avoid storing parts of the original that don’t affect the perception of the final result much.

I think the best choice for your use case depends upon what kind of losses are considered acceptable, as well as any structure to the data.

If the location in the grid corresponds to something like a physical location such that things that are close in the grid then that suggests you could perhaps store the grid at a smaller resolution, after applying some sort of lossy mapping. Say if a 2 by 2 grid has all the same category in it, store that with fewer bytes than the full small grid.

If most of the cells have the same value, maybe many are category 0 for example, then standard sparse matrix storage may be worth looking into.

If there are lots of runs from left to right in the data then RLE could be effective. Maybe chopping of the last few bits of the run lengths would be acceptable? It’s hard to know without knowing the structure of the data.

If the data truly has no structure then you could just not store a random subset of it. But in that case you might as well make it simpler and say only store the first half. But if there is no structure at all, why are you not just storing 32 totals instead of a grid? Again, if there is any meaning to data being in cell (x,y) vs (x+1,y+1) then you probably to take that into account.

Fair question. I will admit up front -- I don't know the right way to define 'loss' for categorical data -- if I did, I'd have googled that term / put it in the original question. :slight_smile:

That said, one dumb way we can define loss is s follows: what % of the entries are wrong ?

0% would be lossless.
100% would be utterly useless.

5% error, 10% error? If not all clumped together in the same area, would probably still be useful (and "close enough" in many cases).

It depends what the acceptable loss is. The most trivial solution would be to reduce resolution.

If you're gzipping it, then denoising of the data would help (make it have contiguous runs of the same value).

If the data is spatial, it may help to split it into 2d blocks, so that nearby data is more similar (as opposed to row-oriented storage that is more 1d for gzip).

If you're able to estimate/predict the data based on previous values, then compress difference between your prediction and actual values. The better your prediction, the better compression you have. This may also naturally give you a lossy codec if you choose to ignore small differences between your prediction and actual data.