AFAIK, the only way to use !Send types with async is to confine them to a LocalSet (Tokio's mechanism for keeping a group of tasks on a single thread). Since you use #[tokio::main], you're on the default multi-threaded runtime rather than a LocalSet, so anything a spawned task holds across an .await is required to be Send.
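For reference, the LocalSet route looks roughly like this (a minimal sketch; the Rc and the values in it are just placeholders):

```rust
use std::rc::Rc;
use tokio::task::LocalSet;

#[tokio::main]
async fn main() {
    let local = LocalSet::new();
    local
        .run_until(async {
            // spawn_local accepts !Send futures, because the LocalSet keeps
            // every task it owns on the current thread.
            let handle = tokio::task::spawn_local(async {
                let shared = Rc::new(42);
                let clone = Rc::clone(&shared);
                // Holding an Rc across an .await is fine here.
                tokio::task::yield_now().await;
                *shared + *clone
            });
            println!("{}", handle.await.unwrap());
        })
        .await;
}
```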
If a task is moved from one thread to another, is a memory barrier inserted? I.e., do the CPU registers/L1/L2/L3 caches get flushed to main memory?
If so, the Rc's counter (the thing that makes it Send-unsafe) would also be flushed to main memory and be visible on the next thread this task gets scheduled on.
Provided I don't explicitly spawn another thread and hand it a clone of this Rc (apart from these async/awaits), would it be fine to mark it (the structure containing the Rc) as Send?
When you say "mark it as Send", do you mean this: unsafe impl Send for MyNotSendType {}?
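Spelled out, that would be something like this hypothetical wrapper:

```rust
use std::rc::Rc;

// A hypothetical wrapper around an Rc, only to show the syntax in question.
struct MyNotSendType {
    inner: Rc<String>,
}

// The assertion being asked about: it tells the compiler "trust me, this
// type is safe to move to another thread" even though Rc's reference count
// is not atomic. For a type containing an Rc, that claim is unsound.
unsafe impl Send for MyNotSendType {}

fn main() {
    let value = MyNotSendType { inner: Rc::new(String::from("hello")) };
    println!("{}", value.inner);
}
```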
Don't do this. Stay away from unsafe unless you're purely experimenting and want to learn the language (the hard way). Once unsafe is in the mix, you can no longer rule it out as the cause whenever a bug shows up, which makes debugging a real pain.
This is getting into OS-specific territory, namely how context switching is handled. We can assume the OS ensures each thread's context doesn't change unless the program itself changes it.
In short, if you want single-threaded primitives, use the Rc, RefCell, and Cell types.
If you want multi-threaded primitives, use Arc, RwLock, Mutex, and the atomic types.
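For example, a Send-friendly version of state shared between spawned tasks would look roughly like this (a sketch with placeholder values):

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

#[tokio::main]
async fn main() {
    // Arc's reference count is atomic, so the clones below can be moved
    // into tasks that tokio::spawn may run on any worker thread.
    let counter = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            tokio::spawn(async move {
                // tokio's Mutex guard can be held across .await points.
                let mut guard = counter.lock().await;
                *guard += 1;
            })
        })
        .collect();

    for handle in handles {
        handle.await.unwrap();
    }

    println!("{}", *counter.lock().await); // prints 4
}
```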
Not in any high-performance async runtime. A memory barrier is expensive, and if your task is Send, it's unnecessary, since your task has the right barriers in place already.
Flushing to main memory is almost never guaranteed, and is not part of a memory barrier. A memory barrier only ensures that the system's cache coherency protocol will pick up the right data if needed - but it might do so by a message to the owning cache instead of an access to main memory.
If you're not familiar with cache coherency, then you don't yet have a good mental model of what goes on to make memory accesses from different threads work, and thus of the relative costs and benefits of using Rc versus Arc. If that's the case, I highly recommend buying a copy of Rust Atomics and Locks by Mara Bos and reading it until you fully understand what's going on - the book is incredibly well-written, and is by an expert in the field.
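To make that Rc-versus-Arc difference concrete, here is a rough sketch (the type names are mine, not std's) of the two counting strategies:

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

// Rc-style count: a plain, non-atomic read-modify-write. Nothing orders
// this store with respect to other cores, so an increment done on one
// thread may simply never be observed by another.
struct PlainCount(Cell<usize>);
impl PlainCount {
    fn incr(&self) {
        self.0.set(self.0.get() + 1);
    }
}

// Arc-style count: an atomic read-modify-write. Every core observes one
// consistent sequence of increments, and Arc additionally uses
// Release/Acquire ordering on the final decrement so the pointed-to data
// is synchronized before it is freed.
struct AtomicCount(AtomicUsize);
impl AtomicCount {
    fn incr(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let a = PlainCount(Cell::new(0));
    a.incr();
    let b = AtomicCount(AtomicUsize::new(0));
    b.incr();
    println!("{} {}", a.0.get(), b.0.load(Ordering::Relaxed));
}
```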
No, it wouldn't. While the OS takes care of any required barriers when it migrates a thread between CPU cores, the async runtime deliberately inserts nothing when it migrates a task between worker threads. Your reference count may therefore still be sitting in another core's store buffer, not yet visible to the cache coherency protocol.
In practice, it might work just fine on some systems, but not on others, depending on the details of their load/store buffering, cache and memory subsystems. This is really not a good place to be in, since you've got code that works fine until you upgrade your system.
Strictly speaking, if you keep all clones of the Rc within the same task so that they are all moved together when the task is moved from one worker thread to another, then your code will work. However, this is rather error prone, and I wouldn't recommend it.