Search json data using GPU


I was wondering whether anyone has used a GPU to parse or search through several terabytes of JSON data. So far I have found only one library, which uses Python/C++:


Parsing is not a task that matches GPU architecture well, even for very big data. A GPU is essentially SIMD: even though it has many cores, the cores within a group (a warp, in NVIDIA terms) have to execute exactly the same instruction at the same time, and only the data they operate on differs. This makes GPUs very good for image processing, rendering, or tensor calculation (so for many scientific purposes, including neural-network-based AI), but parsing is full of branching and looping, so a GPU would be a poor choice for it.

The library you found is a scientific computing library, probably just an alternative to TensorFlow. It has some data-loading functionality, but my guess is that this is only a utility that runs on the CPU, and that only the scientific computation is done on the GPU.

To complement @hashedone's remark, one important thing to keep in mind when programming for GPUs is that the GPU and CPU communicate via the PCI-express bus, which is much slower than the main memory of either chip.

For example, in high-end systems from a couple of years ago, CPU to RAM bandwidth was typically around 100 GB/s, GPU to VRAM was typically around 500 GB/s, and PCIe x16 was around 16 GB/s. The exact numbers are moving constantly, but the ratios don't change much. And here I'm only discussing bandwidth; PCIe latency is also pretty huge compared to anything the CPU normally does.

This means that when offloading an isolated task to the GPU, said GPU always starts at a disadvantage: for GPU processing to be faster overall, the speedup it brings must compensate for the cost of transferring data from the CPU to the GPU, which is very large and doesn't exist in CPU-based processing.

Searching for something in JSON data sounds like a memory-bound task (a task limited by available memory bandwidth, that is, by how fast you can scan through the input string), at least if implemented correctly, so in such a task FLOPs do not matter. The only thing that matters is how fast you can scan through the whole dataset. And here, the GPU has no chance of winning against a well-written CPU program in my opinion, because its advantage in memory bandwidth is completely overshadowed by the slowness of transferring the data through PCIe.
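To put rough numbers on this argument, here is a back-of-the-envelope sketch in Python. It uses only the ballpark bandwidths quoted above (100, 500 and 16 GB/s) and assumes the data already sits in host RAM, which is optimistic for terabytes of input. The function names are made up for illustration.

```python
def scan_seconds(size_gb, bandwidth_gb_s):
    """Time to stream size_gb of data once at the given bandwidth."""
    return size_gb / bandwidth_gb_s


def compare(size_gb, cpu_ram=100.0, gpu_vram=500.0, pcie=16.0):
    """Return (cpu, gpu_serial, gpu_overlapped) scan times in seconds.

    Bandwidths are in GB/s and default to the ballpark figures from
    the discussion; they are not measurements.
    """
    # CPU: one pass over the data straight from RAM.
    cpu = scan_seconds(size_gb, cpu_ram)
    # GPU, naive: copy everything over PCIe, then scan it in VRAM.
    gpu_serial = scan_seconds(size_gb, pcie) + scan_seconds(size_gb, gpu_vram)
    # GPU, ideal pipelining: transfer and scan overlap perfectly,
    # so the slowest stage sets the pace.
    gpu_overlapped = scan_seconds(size_gb, min(pcie, gpu_vram))
    return cpu, gpu_serial, gpu_overlapped


if __name__ == "__main__":
    cpu, serial, overlapped = compare(1000.0)  # 1 TB
    print(f"CPU scan:            {cpu:.1f} s")
    print(f"GPU copy then scan:  {serial:.1f} s")
    print(f"GPU fully pipelined: {overlapped:.1f} s")
```

Even with perfect overlap of transfer and compute, the GPU path is bounded by the PCIe figure, so under these assumptions it cannot beat a CPU that scans at RAM speed.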

People normally work around PCIe slowness by trying to move data to the GPU early on and keep it there as long as possible. But would this pattern apply in your use case?


JSON parsing can be done with the help of SIMD, though. See simdjson.

An interesting question is whether the simdjson approach would apply to GPUs.
SIMD on CPUs means executing the same code on four or eight values in parallel. SIMD on the GPU is a whole different beast. It means executing the same code on thousands of data points in parallel.

Good talk! But I am afraid that SIMD on x86 is not comparable to SIMD on GPUs. On a CPU we are talking about parsing 64 characters at once when using 512-bit registers (and the talk says they are not used, so assume that parsing at most 32 characters at once is the limit). On a GPU there are up to 4k cores, so parsing 4k characters at once. In some cases (checking UTF-8 validity) it might work, but for others (looking for runs of digits) not so much: we do not expect 4k-character numbers in our JSON. The question is whether it would scale that far in practice. It also doesn't solve the problem of passing data around, which might be a serious bottleneck for this use case.
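To make "parsing 64 characters at once" concrete, here is a minimal Python sketch of the kind of branch-free classification that SIMD JSON parsers rely on: build one bitmask per character class for a 64-byte block, then use a prefix XOR over the quote mask to find which bytes are inside strings. Python integers stand in for the 64-bit masks, and the per-byte loop in `char_mask` stands in for what would be a single vector compare on real hardware. This is a simplification, not simdjson's actual code: it assumes no escaped quotes and does not carry string state across blocks.

```python
STRUCTURAL = b'{}[]:,'


def char_mask(block, byte_value):
    """Bitmask with bit i set when block[i] == byte_value."""
    mask = 0
    for i, b in enumerate(block):
        if b == byte_value:
            mask |= 1 << i
    return mask


def prefix_xor(mask, width=64):
    """Bit i of the result is the XOR of bits 0..i of mask."""
    shift = 1
    while shift < width:
        mask ^= mask << shift
        shift *= 2
    return mask & ((1 << width) - 1)


def structural_positions(block):
    """Indices of structural characters that lie outside strings.

    Assumes the block has at most 64 bytes, starts outside a string,
    and contains no escaped quotes.
    """
    if len(block) > 64:
        raise ValueError("block must be at most 64 bytes")
    quotes = char_mask(block, ord('"'))
    # Set from each opening quote up to (not including) its closing quote.
    in_string = prefix_xor(quotes)
    candidates = 0
    for ch in STRUCTURAL:
        candidates |= char_mask(block, ch)
    structural = candidates & ~in_string
    return [i for i in range(len(block)) if (structural >> i) & 1]


if __name__ == "__main__":
    # The comma inside "a,b" is skipped; the one in [1,2] is reported.
    print(structural_positions(b'{"a,b":[1,2]}'))
```

The point of the sketch is that every step is a data-parallel mask operation with no per-character branching, which is what makes the approach SIMD-friendly on a CPU. Whether the same structure maps well onto thousands of GPU threads is exactly the open question here.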


Good point, but my JSON comes as many relatively small lines in a file, so I can probably chunk it and send the chunks to the GPU.
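For line-delimited JSON, the chunking could look something like the sketch below: split the file into byte ranges of roughly a target size, extending each range to the next newline so that no record is cut in half. The function name and the target size are made up for illustration, and what you then do with each range (hand it to a GPU loader, a worker thread, and so on) is left open.

```python
import io


def line_aligned_chunks(f, target_size):
    """Yield (offset, length) byte ranges that each end on a line boundary.

    f must be a seekable file opened in binary mode. Every range is at
    least target_size bytes long, except possibly the last one.
    """
    f.seek(0, io.SEEK_END)
    total = f.tell()
    start = 0
    while start < total:
        # Jump ahead by the target size, then read up to the next newline
        # so the chunk never ends in the middle of a record.
        f.seek(min(start + target_size, total))
        f.readline()
        end = f.tell()
        yield start, end - start
        start = end


if __name__ == "__main__":
    data = b'{"a":1}\n{"b":22}\n{"c":333}\n'
    for offset, length in line_aligned_chunks(io.BytesIO(data), 10):
        print(offset, length, data[offset:offset + length])
```

Each range can be processed independently, which also makes it straightforward to overlap reading the next chunk with processing the current one.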

cuDF (which is part of the rapidsai framework) has this function. I haven't tried it yet.

read_json(path_or_buf, engine='auto', dtype=True, lines=False, compression='infer', byte_range=None, *args, **kwargs)

Load a JSON dataset into a DataFrame.

Also, see this article:
@raphlinus: could you please share if you have any update on this?
