On modern programming languages and growing hardware complexity


Sorry for the long paragraphs everyone. I do have a mildly verbose writing style due to the unhealthy combination of using paragraphs to separate discussion subjects, and being very bad at keeping text short (especially in a non-native language like English). I’ll try to do better.

I have a similar impression.

There have been many attempts in the past to bring a functional flavour to imperative languages (Scala of course, but also Python, C++11, Java 8…), and functional languages have always had to keep a dirty imperative corner around, because a program with zero side effects cannot send any output out of the CPU and is therefore of little use. But any time you learn one of these languages, you can distinctly feel what the preferred path is, and what has been glued on top.

Rust is quite impressive at bringing the two paradigms together, probably in no small part because it made the hard choice of doing away with inheritance, which adds a lot of complexity if done right (with covariance and contravariance for example), and decidedly does not play so well with functional constructs like pattern matching.
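To make this concrete, here is a sketch (with a made-up `Shape` type) of how Rust models a closed set of cases as an enum plus exhaustive pattern matching, where an inheritance-based language would reach for a class hierarchy and virtual dispatch:

```rust
// Hypothetical example: a closed set of shapes modeled as an enum
// rather than an inheritance hierarchy.
enum Shape {
    Circle { radius: f64 },
    Rectangle { width: f64, height: f64 },
}

fn area(shape: &Shape) -> f64 {
    // Exhaustive match: the compiler rejects the code if a variant is missing.
    match shape {
        Shape::Circle { radius } => std::f64::consts::PI * radius * radius,
        Shape::Rectangle { width, height } => width * height,
    }
}

fn main() {
    let shapes = [
        Shape::Circle { radius: 1.0 },
        Shape::Rectangle { width: 2.0, height: 3.0 },
    ];
    let total: f64 = shapes.iter().map(area).sum();
    println!("total area = {}", total);
}
```

Because the set of variants is closed, adding a new shape forces every `match` to be revisited - the kind of compile-time guarantee that open inheritance hierarchies cannot give.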

Personally, my gateway drug into functional programming was OCaml, but that’s probably because I live in the native country of that language and we have some awesome teachers of it around.

Erlang is definitely an intriguing beast. For example, the error handling and software update strategy seems very unique compared to anything I’ve played with before. From personal experience, it can be hard to get into due to accumulated syntactic warts (like older Fortrans if you’ve played with that). But from a quick PM conversation with @josephDunne from the telecoms thread, Elixir seems to fix that without destroying the overall spirit of the language. So probably that’s what I’m going to try in my next learning attempt.


There are some projects that Oracle is spearheading that should help Java be a bit more mechanically sympathetic. There’s Valhalla, which is looking at value types and generic specialization. There’s Panama, which is looking at FFI, memory layout, and a vector (SIMD) API. The SIMD part has an interesting concept they’re exploring - machine snippets. This is sort of like a macro assembler or inline assembly with a higher-level API.

There’s also a new GC that Oracle has been working on and has recently been publicly disclosed: ZGC. It’s a similar high level design to Azul’s C4 and Red Hat’s Shenandoah.

These are interesting (and long overdue!) projects but time will tell how they impact performance.


Do you mean something like an inline assembler for (SIMD-augmented) JVM bytecode as opposed to native code?


Agreed - I actually think this is the more interesting subject - how do you design for large and complex systems? What languages help in ensuring correctness? How do they lend themselves to refactoring? In other words, how do they scale with project size and complexity?

Take the above and then add performance to the mix - is it “easy” to retain performance as the complexity and size of the software increase?

It’s been noted that C++ doesn’t scale well in the above. Large complex systems are too risky to refactor; ownership isn’t always clear and cloning/copying is done “just to be safe”. Concurrency is trickier to introduce. And so on.

Java scales well in the sense that its IDEs are second to none. The type system is pretty weak, and invariants are enforced via dynamic checks (if at all). Performance may or may not scale - it depends on whether the increase in code also puts pressure on the JVM. Some abstractions may get more expensive - PGO may see polluted profiles, some optimizations that were done with fewer types loaded are no longer done, etc. The one thing that does remain is “cross module” inlining (subject to the JIT’s usual whims about inlining decisions) - if you split your code across jars (or modules in the Java 9 world), it doesn’t change inlining decisions (on the whole). In Rust, you’ll need to remember to add #[inline] if you move functions to separate crates.
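For reference, a minimal sketch of that `#[inline]` hint (the crate split is only implied here, since a single file cannot show two crates):

```rust
// In a real project this function would live in a separate crate.
// Without #[inline] (or generics, or LTO), a non-generic function is
// generally not eligible for cross-crate inlining, because downstream
// crates only see its compiled form, not its body.
#[inline]
pub fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    println!("dot = {}", dot(&[1.0, 2.0], &[3.0, 4.0]));
}
```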


What’s more, inlining can see through virtual dispatch sometimes.


There are no bytecode changes for SIMD itself. Java doesn’t add bytecodes unless absolutely necessary - as one might imagine it’s an expensive thing to do. The next bytecode changes will be for value types.

http://www.oracle.com/technetwork/java/jvmls2016-graves-3125549.pptx is a good overview of the SIMD and snippets stuff (note it’s a joint project between Oracle and Intel).


I’d phrase it as JVMs with a PGO based JIT can see through virtual dispatch - that then allows inlining :slight_smile:

But yes, Java basically requires a PGO based JIT (if performance is at all wanted) given everything is virtual by default.


Probably partially because of my past programming experience, I find Rust far simpler and easier to write/read/understand compared to Haskell.


Just wanted to throw out that I agree the borrow checker is most often an amazing safety net, but, as the improvements promised by non-lexical lifetimes show, there are times in current Rust where you are literally fighting a borrow checker that keeps you from writing valid, safe code.
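As a simplified illustration, here is the classic kind of safe code that lexical borrow checking used to reject; with non-lexical lifetimes the borrow ends at its last use, and the code is accepted (the `demo` helper is made up for the example):

```rust
fn demo() -> usize {
    let mut data = vec![1, 2, 3];
    let first = &data[0];
    println!("first element: {}", first); // last use of the borrow
    // Under lexical lifetimes, the borrow held by `first` extended to the
    // end of the enclosing scope, so this push was a compile error even
    // though the code is perfectly safe. Non-lexical lifetimes end the
    // borrow at its last use above, making the mutation legal.
    data.push(4);
    data.len()
}

fn main() {
    println!("length after push: {}", demo());
}
```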


@matklad So, after fully going through Martin Thompson’s talk and reading @vitalyd’s posts again, I think I just managed to put my finger on what I dislike about runtime sophistication such as concurrent GC and JIT compilation from the point of view of performance optimization.

While these technologies start out from good intentions (e.g. by freeing blocks of memory in a grouped fashion, one can amortize overhead relative to the one-by-one approach of manual memory management, RAII and reference counting), and can have strong performance benefits (like the always-on PGO discussed in the talk), making them fast requires building, on the language runtime side, a software complexity monster that is not too far away from the hardware complexity it is supposed to make accessible.

As a result, you have something that will get your application running with decent performance in an impressively small amount of time, but in exchange you will face a much larger complexity wall when you later try to improve that performance, because now you need to grasp not only the complexity of your code and of the hardware it runs on, but also the huge complexity of your language runtime.

This is why I’m interested in exploring the Julia + Rust combo for scientific computing. I think it maps naturally onto the sociological ecosystem that we have there, where, to oversimplify, we have a heterogeneous population composed of a large number of scientists with relatively weak software training and a small number of software engineers and computing staff with relatively weak science training.

It’s good to keep the scientists focused on the problem space, and to save them from the need to think about the machine 90% of the time. For this, I think Julia comes better armed than its Python / MATLAB / R competitors, because its focus on a fast implementation means that people won’t need to jump through the horrible linear algebra contortions that are the hallmark of “slow interpreter with a fast BLAS” ecosystems. Let’s keep linear algebra language features for linear algebra, and write a loop when we truly need a loop; life is just better that way.

At the same time, when the performance limit of current dynamic language technology is reached (and, let’s be honest, it will always be reached in nontrivial ways by naive code), I would much rather optimize code written in a language whose implementation has an easy mental model, than something of the level of complexity of Julia’s implementation. Similarly, when the code gets large, I think the code organization and compile-time error detection of a correctness-focused statically typed language like Rust are too good to ignore. So the “Start with idiomatic Julia, and partially or fully move to Rust if the code gets too large (say, ~10kLOC) or the performance too low” strategy has a lot of appeal to me.

When I find the time to explore this strategy in more depth, I will see how well it works out in practice.


Yes, this is exactly my experience after programming a while in Kotlin and Rust! Thank you very much for expressing this so eloquently (and I do enjoy long paragraphs :wink: ) :slight_smile:

My feeling is that the majority of programmers don’t face problems that require more than decent performance. Most applications are indeed waiting for the database, for external systems, or for some low-level library to finish computationally expensive tasks.


On this matter, another topic which we haven’t much discussed yet, likely because it is far away from the original angle of the topic, is that there is more to software performance than CPU throughput. Software which seemingly rightfully spends all of its time waiting, such as web servers and GUI programs, can also benefit from memory usage and latency optimizations.

This is an often overlooked area of software performance (people who have ever switched programming environments due to excessive GUI lags, seen a computer feel incredibly faster after installing an SSD, or faced sysadmin opposition to large-scale JVM-based software deployments due to the high price of RAM, raise your hands) where languages which provide a high level of control like Rust can also help when needed.

Another area where the design of programming languages and standard libraries can help performance is efficient IO. Taking the example of async IO in the remainder of this paragraph, a common issue here is that a performance-oriented I/O path exists, but is not yet usable by mere mortals; this is, for example, a problem that Rust’s and Python’s current implementations of asynchronous network IO have. Other languages such as Go or C# manage to make a better-than-average solution (stackful coroutines) readily available, which is already something, but still fail to make the best solution (event loop + state machines) pleasant to use. It seems to me that there is more work to be done in this area in order to make good I/O performance ergonomic and idiomatic.
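To make the “event loop + state machines” idea concrete, here is a deliberately toy, std-only sketch; the names (`Task`, `poll`, `run_to_completion`) are illustrative and not the actual futures-rs API:

```rust
// A pending async operation represented as an explicit state machine.
enum Task {
    AwaitingConnection,
    AwaitingData { buffered: usize },
    Done,
}

impl Task {
    /// Advance the state machine by one step; returns true when finished.
    fn poll(&mut self) -> bool {
        match self {
            Task::AwaitingConnection => {
                *self = Task::AwaitingData { buffered: 0 };
                false
            }
            Task::AwaitingData { buffered } => {
                *buffered += 1; // pretend one chunk of data arrived
                if *buffered >= 3 {
                    *self = Task::Done;
                }
                false
            }
            Task::Done => true,
        }
    }
}

// The "event loop": keep polling until the task reports completion.
// A real reactor would only poll when the OS signals readiness,
// instead of spinning like this.
fn run_to_completion(task: &mut Task) -> usize {
    let mut polls = 0;
    while !task.poll() {
        polls += 1;
    }
    polls
}

fn main() {
    let mut task = Task::AwaitingConnection;
    println!("completed after {} polls", run_to_completion(&mut task));
}
```

The state machine keeps no stack alive between steps, which is exactly what makes this model cheaper than stackful coroutines - and also what makes it harder to write by hand.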


This is something that I am also discovering via Rust; concepts like “cache thrashing” and IO latency (excellently summarised by Coding Horror), and this “state machine / event loop” model, are a bit new to me, and I am finally beginning not just to “be aware” of them, but also to understand them. (Or so I believe :stuck_out_tongue_closed_eyes:)

I think it is quite impressive that Rust as a language exposes sufficiently powerful primitives that the “optimal” solution can be implemented as a crate (futures) instead of having to be a language level built-in. And it even compiles down to zero-overhead optimal assembly!
All other languages I know of required big version bumps and an entirely new way of thinking to be bolted onto the existing language. In Rust it just… fits?
The way I understand it, the core revelation that makes library-level futures possible is that thread-safety is exposed to the compiler (as Send/Sync). Send and Sync basically expose a hardware constraint (atomicity / thread caches / memory fences) to the software. This is, again, something I have never heard of in other languages.
I also understand why no other languages did it before; there was an incredible amount of thinking behind the auto-derive and opt-out-built-in-types (OIBIT) to make it not only correct, but also convenient. Again a core innovation that brings Rust’s high-level type system close to the metal.
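For readers unfamiliar with these markers, a minimal sketch of `Send` being checked by the compiler (the `parallel_sum` helper is made up for illustration):

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical helper: sum a vector on a worker thread.
fn parallel_sum(data: Vec<i32>) -> i32 {
    let shared = Arc::new(data);
    let worker = {
        let shared = Arc::clone(&shared);
        // This compiles only because Arc<Vec<i32>> is Send. Swapping
        // Arc for std::rc::Rc would be a compile-time error, since Rc
        // uses non-atomic reference counts, is therefore !Send, and
        // thread::spawn requires a Send closure.
        thread::spawn(move || shared.iter().sum::<i32>())
    };
    worker.join().unwrap()
}

fn main() {
    println!("sum = {}", parallel_sum(vec![1, 2, 3]));
}
```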


I generally agree with the beginning of your post, but not with this statement (and the following ones based on it). Send and Sync are mostly about making multithreading safe, whereas Rust’s state machine based futures do not spawn extra threads and are usable in a single-threaded context, so in this sense I see them as (mostly) orthogonal.

In my eyes, the killer Rust feature that makes futures-rs possible is how easy it is to write and use high-performance generic code in Rust. I may be missing some aspect of the implementation, but I do not see anything preventing people from building a similar library in a programming language with a similar generics implementation, such as C++. It would just be a lot more painful to implement, maintain and use without the assistance of Rust’s trait system (soon to be taken to the next level by impl Trait).


Valid point! I stand corrected!

That does mean I am now unsure how threadpools interact with event loops. Will the loop on the “main” thread dispatch to threadpools?

Regardless, I believe I am still correct (am I?) in saying that in both the single- and multi-threaded cases, lifetimes will ensure that the abstraction remains valid. The “problem” with juggling all the async state machines is keeping track of what is valid at which time. The futures combinators with closures make it easy, and thanks to ownership and lifetimes, you cannot accidentally close over the wrong type of pointer/variable.

As for c++, the main problem I see there is that users would still be able to smuggle all kinds of unsafe and/or unsynchronized pointers into the abstractions in their usage of the library. Probably accidentally via some global config object…


Not currently. In current Tokio/futures, all future continuations are run inline on the event loop thread. The programmer is entrusted with the task of keeping these continuations short, and offloading heavy CPU work to a dedicated thread pool if needed.

There are discussions about changing this default in the Tokio/Futures reform, and instead offloading the continuations to a thread pool by default. Basically, the tradeoff is that with this design, future continuations would need to be Send and would pay the thread pool offloading overhead by default even if they don’t need it.

I agree with your remaining points about memory safety and the thread safety of multithreaded event loops :slight_smile:


Ah yes, that would be a pretty big limitation on the design space for users of the futures crate.

I expect that less-savvy users will be happy that they get multithreading “for free”/with less boilerplate, but more advanced users will probably miss the control they have in the current design.
Especially if the continuation itself is cheap, the offloading costs might not always be worth it, as you point out.

Ideally, there would be some way to have multiple executor designs to choose from, e.g. singlethreaded, multiple executors each on their own single thread, and a threadpooled single executor. In my naïve understanding, that would allow the current flexibility with “for free” multi-threading for the “simpler” cases.
Where exactly is this discussion taking place? I’d love to read it to learn from it! :man_student:t3:


Not necessarily. Rc usage would need to be replaced with Arc but otherwise I don’t see a huge burden here.

What @HadrienG described is the default mode - you can change it. The gist of the RFC is that currently the Core is both a reactor and an executor - they want to split those responsibilities. In the default mode, there would be a single thread that handles the reactor portion (ie readiness notifications) and that thread would unpark futures waiting on events that became ready. The threadpool would then execute the future. But that’s just the default - IIRC you can continue with a single thread being a reactor and an executor if you desire.



I think people are overlooking several important factors, like the fact that fighting a compiler is way better than debugging issues in production. Having a quality dependency management tool saves much pain in day-to-day work.
Ever worked in a 6-year-old enterprise Node code base? You’d give up your left arm to have a good type system. On the ease of programming in Go (a good illustration of why fighting the compiler might be the better option):

I am a total Rust noob, but I see a ton of potential for wide adoption. You also have to remember that people operate at different scales: when your monthly AWS bill starts hitting a million dollars or more, spending a few million extra in dev time to halve that bill is a very good investment.


Exactly. It should be even more than the ability to choose. It should be possible to implement your own executors, and to specify which executor to use, either as a parameter for a future and its combinators or as a variable in context. Again, I suggest having a look at Scala’s futures API for this.