I guess one could quickly knock up a hundred-line program that has functions doing the same thing, one with Rc and one with Arc, and cargo bench would very soon tell you something.
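Something like the sketch below is roughly what I have in mind (plain `Instant` timings rather than a proper `cargo bench`/criterion setup; the iteration count and the `u64` payload are arbitrary choices of mine, and the numbers will vary wildly from machine to machine):

```rust
use std::hint::black_box;
use std::rc::Rc;
use std::sync::Arc;
use std::time::{Duration, Instant};

const ITERS: u32 = 10_000_000;

fn bench_rc() -> Duration {
    let value = Rc::new(42u64);
    let start = Instant::now();
    for _ in 0..ITERS {
        // Clone bumps the plain (non-atomic) refcount; the temporary is
        // dropped at the end of the statement, decrementing it again.
        black_box(Rc::clone(&value));
    }
    start.elapsed()
}

fn bench_arc() -> Duration {
    let value = Arc::new(42u64);
    let start = Instant::now();
    for _ in 0..ITERS {
        // Same thing, but the refcount bump/decrement is atomic.
        black_box(Arc::clone(&value));
    }
    start.elapsed()
}

fn main() {
    println!("Rc  clone/drop x{}: {:?}", ITERS, bench_rc());
    println!("Arc clone/drop x{}: {:?}", ITERS, bench_arc());
}
```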
But what would it tell you? Would it have any relevance to any application you want to write?
It is pretty much impossible to predict the performance of modern-day processors, even without threads, what with their multiple cache levels, pipelines, instruction reordering, parallel dispatch, branch predictors and so on. Gone are the days of looking at your assembler code, looking up the number of clocks each instruction took in the data sheet, and hence knowing how long things would take.
Throw threads into the mix and it gets a thousand times worse.
My naive take on all this is that an atomic operation is akin to saying the data in question is not in cache, so it requires at least a trip to main memory and back. That is already massively slower than cache-friendly code. That paper indicates a 30x performance hit, which sounds pretty good to me; I presume a lot of that memory round trip is short-circuited in the cache coherence hardware, or whatever they have in there.
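If you want to see that coherence traffic for yourself, here's a rough sketch I'd try (not the paper's benchmark; the thread and iteration counts are arbitrary, and `Ordering::Relaxed` is chosen just to isolate the cache-line ping-pong from any ordering cost). It contrasts several threads hammering one shared atomic counter against each thread bumping its own plain local counter:

```rust
use std::hint::black_box;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

const THREADS: usize = 4;
const ITERS: u64 = 5_000_000;

fn main() {
    // Contended case: every thread increments the same AtomicU64, so the
    // cache line holding it bounces between cores via the coherence protocol.
    let shared = Arc::new(AtomicU64::new(0));
    let start = Instant::now();
    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                for _ in 0..ITERS {
                    shared.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("contended atomic adds:   {:?}", start.elapsed());

    // Uncontended case: each thread bumps its own local counter, so the data
    // stays in that core's cache the whole time.
    let start = Instant::now();
    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            thread::spawn(move || {
                let mut local = 0u64;
                for _ in 0..ITERS {
                    // black_box stops the compiler collapsing this loop
                    // into a single `local = ITERS`.
                    local = black_box(local) + 1;
                }
                local
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("thread-local plain adds: {:?}", start.elapsed());
}
```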