Dirty memory reads and MPMC (low-latency communication between threads)

I have a weird question. There is something I clearly don't know about low level programming and memory.

My experience with comm protocols over defective hardware ports and bad phone lines led to some adaptive, dynamic code.

If I have a producer guard tied to a status bit (two, actually) for the consumer, then all I need to know is:

  • The memory write lag to unlock after consuming the data. The timing is closely tied to setting the in-progress bit and to the producer's processing time.

  • The approximate amount of time in both loops.

  • The cost of the producer checking the guard (again) before writing, since that check is costly. It must do a Next if the block was processed after all.

  • How much "fixed" lag have I created to avoid a consumer memory write timing error?

It's a little trick where both sides know the token (mutex) or lock bit expires after X ms. The producer sets its exclusive data-ready bit. The consumer has exclusive in-progress and completed bits. Normally the time to detect the producer's data-ready bit, and the spinning, are the issue.

Where the consumer fails or is slow, the producer gets control of the buffer again. It can retry or apply more advanced handling. A timeout, right? It can happen.

Therefore... buffer overruns and/or dropped bits (the issue) did not cause a problem. It's a mathematical characteristic of the timing.
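If it helps, here's roughly the idea as a Rust sketch. All the names, and the choice of atomics with Acquire/Release ordering, are my guesses at a modern equivalent, not the original code:

```rust
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};
use std::time::Instant;

// Guessed status bits for a shared buffer slot.
const READY: u8 = 0b001;       // producer: data is ready
const IN_PROGRESS: u8 = 0b010; // consumer: working on it
const COMPLETED: u8 = 0b100;   // consumer: done, producer may reuse

struct Slot {
    status: AtomicU8,
    // Milliseconds since a shared epoch at which the consumer's lease expires.
    lease_expiry_ms: AtomicU64,
}

impl Slot {
    // Producer side: reclaim the buffer if the consumer finished, or if
    // its lease ("expires after X ms") ran out, i.e. the consumer
    // crashed or is too slow. Returns true if the producer owns it again.
    fn try_reclaim(&self, epoch: Instant) -> bool {
        let s = self.status.load(Ordering::Acquire);
        if s & COMPLETED != 0 {
            self.status.store(0, Ordering::Release);
            return true;
        }
        let now_ms = epoch.elapsed().as_millis() as u64;
        if s & (READY | IN_PROGRESS) != 0
            && now_ms > self.lease_expiry_ms.load(Ordering::Acquire)
        {
            // Timeout: take the buffer back, then retry or escalate.
            self.status.store(0, Ordering::Release);
            return true;
        }
        false
    }
}
```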

Well... restarts of crashed consumers were complex, but that's off topic. The algorithm was fine.

SO........ I bet this doesn't work with static memory and the stack, for super-low-level-programmer reasons I know nothing about.

But it SHOULD work. In theory. It worked before.

On the other hand, a mutex seems SO cheap that it would outperform my crappy guard. It seems like I would gain very little my way. I don't know.

Any advice?

I'm not sure if I understand it correctly, but it seems you want to have low-latency communication between threads, via shared memory.

The answers to your questions are deeply in the "it depends" category.

First, there is no "memory". There are multiple incomplete, inconsistent copies of the memory at different cache levels, which are lazily and partially synchronized at some points in time.

Then, there's no "time". CPU cores run several things at a time, asynchronously. Your thread can be suspended at any time and resumed on any other CPU core any time later. Also, unless you explicitly tell it otherwise, the compiler can reorder your code and execute instructions in a different order than you wrote them.

So you can't really expect either hardware or software to reliably behave in any sensible, predictable way.

The way to deal with it is via memory fences/barriers, and atomic instructions with specific ordering guarantees. At that point it's not so much working with the hardware (too many fuzzy layers of abstraction) as solving a multi-dimensional logic puzzle about which conditions must hold where for the overall algorithm to stay correct.
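As a minimal example of the kind of reasoning involved (a sketch of the classic publish pattern, not tied to your protocol): a Release store pairs with an Acquire load, so everything written before the store is guaranteed visible after the load:

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::{sync::Arc, thread};

fn main() {
    let data = Arc::new(AtomicU64::new(0));
    let flag = Arc::new(AtomicBool::new(false));

    let (d, f) = (data.clone(), flag.clone());
    let producer = thread::spawn(move || {
        d.store(42, Ordering::Relaxed);   // write the payload first...
        f.store(true, Ordering::Release); // ...then publish it
    });

    while !flag.load(Ordering::Acquire) {
        std::hint::spin_loop(); // spin until the flag is published
    }
    // The Acquire load synchronizes with the Release store above,
    // so the payload written before the flag is guaranteed visible here.
    assert_eq!(data.load(Ordering::Relaxed), 42);
    producer.join().unwrap();
}
```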

Anyway, for tasks such as efficiently locking a mutex, or sharing a buffer with another thread, we've already got well-tuned algorithms. parking_lot and crossbeam-channel already do all of the tricks you mention.
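For example, a minimal crossbeam-channel sketch (types and capacity made up) that replaces hand-rolled guard bits with a bounded channel, giving you blocking and backpressure for free:

```rust
use crossbeam_channel::bounded;
use std::thread;

fn main() {
    // A bounded channel provides backpressure: `send` blocks when full,
    // `recv` blocks when empty, with no hand-rolled spinning or guard bits.
    let (tx, rx) = bounded::<Vec<u8>>(16);

    let consumer = thread::spawn(move || {
        for buf in rx { // iterates until the channel is closed
            println!("consumed {} bytes", buf.len());
        }
    });

    for i in 0..4 {
        tx.send(vec![0u8; 1024 * i]).unwrap();
    }
    drop(tx); // closing the sender ends the consumer's loop
    consumer.join().unwrap();
}
```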


Kornel describes my use case, and the issue I could see, really well. My mission here is just to find the best practice, language feature, and authority. Thanks; expect to be quoted.

parking_lot:
https://docs.rs/parking_lot/0.1.0/parking_lot/

crossbeam-channel:
https://docs.rs/crossbeam-channel/0.3.9/crossbeam_channel/

Although judicious use of Copy and Clone comes later, my worst case (and a common one) is passing large unknown objects with fat pointers for sequential processing on threads. Type identification and validation (i.e. against an external forensic API) is a part of that.

If you use Box, then "passing" objects between threads won't actually pass much; it boils down to just a decision about which thread uses it. The object will stay where it was, with no copying (not counting very low-level things like separate L1/L2 caches and NUMA, but micromanaging those would require quite specialized cases).
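A small sketch to illustrate (the size is arbitrary): sending a Box through a channel moves only a pointer, not the payload:

```rust
use crossbeam_channel::unbounded;
use std::thread;

fn main() {
    let (tx, rx) = unbounded::<Box<[u8; 1_000_000]>>();

    let worker = thread::spawn(move || {
        let big = rx.recv().unwrap(); // receives a pointer-sized value
        println!("first byte: {}", big[0]);
    });

    // Only the Box pointer crosses the channel; the megabyte of data
    // stays exactly where it was allocated on the heap.
    tx.send(Box::new([7u8; 1_000_000])).unwrap();
    worker.join().unwrap();
}
```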


Yes, I don't know what the (dereferencing) overhead would be on string slices and Box (on the heap) vs. the above solution, which is clearly performant. Most of my data should rightly use Box and at most Arc.

In reading, I was curious about having a prioritized vs. random selection here:

If you need to select over a dynamically created list of channel operations, use Select instead. The select! macro is just a convenience wrapper around Select.

Both return a random selection.

"Dynamically created" suggests some opportunity there to abstract or extend the Rust semantics, but there is a right and a wrong way to do it (or not at all) that might be obvious. Is this discussed anywhere?

I assumed that this is an application-level abstraction (grouped threads) and not a language concern? It's pretty cheap to move a thread pointer per an enum defining priority (including Next); see the sketch below.
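Something like this sketch is what I have in mind, assuming crossbeam-channel (the helper name is mine):

```rust
use crossbeam_channel::Receiver;

// Hypothetical helper (the name is mine): check the high-priority
// channel first, then block on both. select!/Select pick among *ready*
// operations at random, so any priority policy must be layered on top.
fn recv_prioritized<T>(high: &Receiver<T>, normal: &Receiver<T>) -> Option<T> {
    if let Ok(msg) = high.try_recv() {
        return Some(msg);
    }
    // Disconnect handling omitted for brevity.
    crossbeam_channel::select! {
        recv(high) -> msg => msg.ok(),
        recv(normal) -> msg => msg.ok(),
    }
}
```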

I'm confused about what you're trying to do. There aren't different kinds of dereferencing, so whether you use a Box via select, a queue, a task, a channel, a mutex, lazy_static, or whatever, it's still the same bunch of bytes in memory, accessed the same way.

What is the problem that you're trying to solve? Do you have a bottleneck in an existing implementation?

OMG, I wish you were kidding about the stack and heap. This is similar to the disconnect between the OS and the file system. I have object design guidelines and a utility requirement to address that. Seeing that this is the case, I am doing a similar simplification here.

Well, that performance issue is what I need to resolve now. Box vs. static, in a nutshell. I had the impression that static buffers are still essential and that unboxing has a cost, which is why static buffers persist in use. If I say boxed types, questions will arise.

Contradicting that concern, I am pretty sure the Rust compiler optimizes that away if you use explicit enough types, traits, and macros.

This is kind of foundational at the design stage. In particular, I must determine which of these requirements require customized code in Rust.

The reason I am pestering you is that some chunks must (ultimately) manage to run on a very low-performance platform (perhaps even IoT) and in VMs. The perfect storm.

So I am (committed to Rust and) intensely interested in core vs. std, and of course in the extremely rapid compiler evolution here.

I can trust your answer, so if it's Box, then even better. Then I can intelligently quote you.

Time-saving follow-up: would you say I can prototype using Box if I allow for a crate needed later? If needed, gains would be predictable against coding effort, from what I see here on the site. I have had zero concerns about crates addressing these patterns and want to finalize this.

MY USE CASE:
APP: Debate tool.
Users: All. :wink:

For Rust: The meshnet and key graph routines.

My types mainly consist of graph objects, documents, and compressed files. Debates are social media with constraints (and thus extensions) but shared base types. A social media (SM) graph, but concurrently a DAG.

So, at one level I have a Meshnet (GnuNet) and all that implies.

Over at the site crawler utility we have performance considerations, but to a MUCH lesser degree. Static memory is not even an issue there, to my knowledge.

SCOPE:
Huge. It's an SM-integrated debate tool. Prerequisite: a secure and private (by default) platform. GnuNet over Whonix.

More importantly, it's the only user device and must be usable for business too.

The market: normal people, not us! Lol.

So this requires a verified subset of the standard FOSS library covering most domains. Users' annual budgets vary exponentially and start at $200 plus data costs. I.e., I would launch it in Africa.

Another requirement that's tough, eh? So R-Pi (very problematic), to keep it simple here.

LINKS:

An equivalent idea in Rust. A vetted library:

The speed of the memory backing 'static objects, the stack, and Boxes on the heap is identical. It's all the same memory.

It's a bit of a stretch, but it works almost as if the operating system loaded your program and all your static variables/buffers into an equivalent of Box<YourCodeWithStaticVariables>, and gave you Box<YourStack> to run the program in.

The performance differences come from the cost of bookkeeping which memory is in use, e.g. creation and destruction of Box<Foo> needs to update more places than increment/decrement of a stack pointer for Foo, but once the object is allocated, the bytes in memory are all the same. Another performance factor is whether memory is likely to be in cache — stack is used all the time, so it's very likely to be in caches. But if you have an object on the heap, and use it frequently, it'll get cached too.
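A small illustration (buffer size arbitrary): the three kinds of storage differ only in how they're created, not in how they're accessed:

```rust
// Three buffers in three kinds of storage; access cost is identical.
static STATIC_BUF: [u8; 4096] = [0; 4096]; // reserved when the program loads

fn main() {
    let stack_buf = [0u8; 4096];                         // one stack-pointer bump
    let heap_buf: Box<[u8; 4096]> = Box::new([0; 4096]); // allocator bookkeeping

    // Reading any of them is the same plain memory access.
    let total = STATIC_BUF[0] as u32 + stack_buf[0] as u32 + heap_buf[0] as u32;
    println!("{total}");
}
```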


For processing network-based things you probably don't need to worry about any of this. The speed of caches and locks is measured in nano- and microseconds, and networks in milliseconds (it's like receiving data once a year and worrying about processing it within a minute).

For fast, parallel computation on large graphs you might want to use a memory pool for fast creation of nodes, and ensure nodes are sized and aligned to cache line size. However, I'd recommend starting first with some naive solution, like Arc for nodes, because it may turn out to be fast enough, even on RPI.
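A naive starting point might look like this sketch (field names made up; 64 bytes matches common cache-line sizes):

```rust
use std::sync::Arc;

// Naive starting point: reference-counted graph nodes, padded/aligned
// to a 64-byte cache line. Field names are made up for illustration.
#[repr(align(64))]
struct Node {
    id: u64,
    edges: Vec<Arc<Node>>,
}

fn main() {
    let leaf = Arc::new(Node { id: 1, edges: Vec::new() });
    let root = Arc::new(Node { id: 0, edges: vec![leaf.clone()] });
    // Cloning an Arc only bumps a counter; the node itself never moves.
    assert_eq!(root.edges[0].id, leaf.id);
}
```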


That's great, and you might like my only guideline to date, mentioned in an earlier post.

Conceptually, the fields in a struct are ordered by some ranking. Frequency of use, perhaps, but it is use-case specific.

That said, my smaller objects are viewed as being composed of 64-byte alignment segments. Within any segment having multiple fields, the fields are reordered by descending size for optimal padding.

There is no need to order the entire struct by descending size, as I saw suggested here, and it is a bad strategy.
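A sketch of what I mean, with made-up field names (repr(C) keeps the declared order, since rustc may otherwise reorder fields):

```rust
// Hot fields first; within the 64-byte segment, descending size so
// padding only accumulates at the end. Field names are made up.
#[repr(C, align(64))]
struct Record {
    // --- segment 0: hottest fields, descending size ---
    last_access: u64, // 8 bytes
    hits: u32,        // 4 bytes
    flags: u16,       // 2 bytes
    kind: u8,         // 1 byte
    // padded out to 64 bytes; a second segment of colder fields would follow
}

fn main() {
    assert_eq!(std::mem::size_of::<Record>(), 64);
}
```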

LINK: