Was CrowdStrike a Null pointer related C++ bug?

They won't be, for legal liability reasons. Or if they are it'll be years from now, after any and all lawsuits, assuming CrowdStrike hasn't been litigated into the ground by then (6 ft under, to be precise).

But the real cause is a broken engineering culture: no matter the languages or other tools you use, nothing can save a company from a "it's Friday, this is a quick patch, what could go wrong?" type of mentality.

If a change goes into production, it should, without exception, propagate through a whole CI/CD pipeline, which includes a barrage of (applicable, of course) unit and integration tests. If that isn't possible then it should be done manually. But it definitely shouldn't be skipped, even if the change seems trivial.
Because it's never about the change in isolation; rather it's about how that change affects anything else it might interact with.
And the deeply unpleasant truth is that this is about unhappy paths, which grow combinatorially relative to happy paths, which is why companies are inclined to skip investigating/handling them in the first place. And yet, managing those error paths and associated failure modes is essential for robust software.

I have to say, I've never felt any pressure wrt that one way or the other. But, at the same time, there are lots of very high quality extant libraries, and if one of those will do, that of course has my preference.

But just to make the point, if I ever needed to deserialize some random config format that was used by an Ancient™ application, I'd just go ahead and implement that.

Ultimately it's just pragmatism on my part I guess.

I guess that depends on what meaning one associates with that phrase :slight_smile:

Personally my view is:
This forum, and IRLO with it, both is and should be welcoming to all walks of life. No ifs or buts about it.

And that it feels that way (at least to me) is a testament to the good and largely underappreciated work the mods have been doing on both forums. So I'd just like to take 1 sec and give them a shout out. You guys rock!

But I'm equally happy that this was, is, and remains a tech forum first and foremost, rather than a platform for politically-minded people. If it was the latter, personally I'd instantly be much less motivated to engage.

But again, that's just me, and I don't know if the phrase refers to any of what I've said above.

15 Likes

I'm currently working to set up a CI/CD pipeline for embedded products. The company has deep roots in research, and the building and releasing of the artifacts is pretty arcane. I hope I'll get it under control and have complete pipelines running before autumn ^^

2 Likes

This is a bit off topic, but if the build step itself is difficult and error-prone, perhaps using Nix might be helpful. You'll still have to set it up of course, which in the case of nix happens by writing a declarative file somewhat akin to (but more powerful than) JSON.

But once that's done it would be trivial to build the project, on the CI/CD server, or on your own dev machine, someone else's dev machine, or anything in between (including containers). And it would all be reproducible¹ ², and without any DLL hell whatsoever.

So whether it's useful to you is mainly a matter of whether, in context, you think it's worth the investment.

¹ depending on how you use it
² it also has the nice side effect of eliminating "it works for me but not for thee" style issues

The problem with nix is that the config language is particularly arcane and cumbersome. I was looking at it in the context of NixOS, which I like as a concept. Unfortunately it seems too awkward to actually use as a daily-driver Linux distro.

However, for builds there are fortunately options that also create reproducible builds: bazel/buck/etc come to mind for example, though I believe there are other options too.

Side note: As for Linux distros there is Guix, which uses Scheme instead. (I'm not sure that is an improvement, to be honest.) I'm also working on my own approach coming from a different angle: allowing saving the current system state (installed packages, changed config files, including saving as diffs) on a traditional distro. And of course also applying that saved state again.

1 Like

As a daily user of nix, nix-darwin and nixos I can't say I share that opinion. It's different yes. But different isn't necessarily a bad thing.

The issue with these build tools is that they don't generalize to the OS or even user management level. They just build, which on its own for me isn't sufficient. But if your use cases fit those tools, then they're viable.

My understanding of Guix is that it's mostly stuck in the world of academia, which means it will perpetually suffer from similar issues as Haskell, lack of support being a big one for most people, especially newcomers.

There's something to be said for the principle of least power.

If it's essentially just an installer mechanism, then that doesn't solve the issues nix solves:

  • DLL hell is gone for good because it's perfectly possible and normal to have different versions of the same library in the nix store, and those don't fight with each other
  • no need for containerized apps in the style of flatpak or (shiver) snap, which conceptually were a kludge to begin with
  • reproducible builds every time if you so choose (something I understand bazel is also capable of)
  • On the OS level, a system with rollback support. If an "upgrade" (a system rebuild, really) fails, you don't get left with a system in an inconsistent state. Instead you just roll back to a previous build. So it's a bit like STM in that way, but applied to the whole OS
  • as I mentioned earlier, no "works for me, but not for thee" style issues, which in some cases can take literal hours to debug
  • managing multiple NixOS hosts adds only trivial complexity compared to managing one host, i.e. scaling up the number of managed hosts is not all that time- or effort-intensive once you have the configs you want

I use nixos because of all of these reasons, as well as a couple of others that have more to do with use case than technology.

This is a personal thing, but I'm quite happy to be rid of LSB-style OSes. Fewer failure modes in my current setup.

@erelde below: yeah it might be more appropriate to split this off into its own topic.

2 Likes

This takes the topic even further afield, which is why I didn't answer, but ^^ you can't imagine how far behind we are. I'm still trying to gather all the build dependencies for every product branch; we're collectively trying to collapse all those product branches back into master, but some of them have deviated for years. It's good honest work ^^

1 Like

It isn't quite like an installer. I'm heavily basing the design on GitHub - CyberShadow/aconfmgr: A configuration manager for Arch Linux (which is Arch Linux specific and written in bash). I have been a daily user of this for a few years. Unfortunately I have to use Ubuntu at work, so it won't help me manage that install. And it is fairly slow.

I rewrote it in Rust, using Rune as the config language. And I'm currently in early "dog fooding" phase, fixing speed bumps, still need to write docs and many other things. It is way faster, being able to determine that no changes were needed in about 3 seconds rather than close to a minute. Slower on Debian since dpkg doesn't store mtimes of installed files, but I have ideas for that.
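To illustrate the kind of fast path involved (a sketch of my own, not the actual tool's code): if the tool records a file's expected mtime, it can skip comparing file contents entirely whenever the mtime still matches, which is what makes a full "no changes needed" scan cheap.

```rust
use std::fs;
use std::path::Path;
use std::time::SystemTime;

// Hypothetical check: a config manager can skip hashing a file's contents
// entirely if its recorded mtime still matches, which is what makes the
// "no changes needed" fast path cheap.
fn unchanged_by_mtime(path: &Path, recorded: SystemTime) -> std::io::Result<bool> {
    let meta = fs::metadata(path)?;
    Ok(meta.modified()? == recorded)
}

fn main() -> std::io::Result<()> {
    let tmp = std::env::temp_dir().join("mtime_demo.txt");
    fs::write(&tmp, b"hello")?;
    let recorded = fs::metadata(&tmp)?.modified()?;

    // Untouched file: the mtime matches, so no content comparison is needed.
    assert!(unchanged_by_mtime(&tmp, recorded)?);

    fs::remove_file(&tmp)?;
    Ok(())
}
```

This is of course only half the story; when the package manager (like dpkg) doesn't store mtimes, you need a fallback such as hashing on first sight and caching the result.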

Funnily enough, people talk about this, but I never seem to encounter it myself, not since the bad old Windows 9x days at least. It is not a problem in practice on Arch, nor on Ubuntu LTS at work.

Flatpak is also a (weak) security boundary. I'd rather run Spotify and other closed source software in a Flatpak than directly on the host.

Yeah, won't get that, obviously. I don't remember when I last made a Linux system unbootable though. Probably close to 20 years ago. If a system boots after first install it won't break except due to hardware changes (switching from Nvidia to AMD GPU for example). And nix won't help you with that.

For a personal OS config tool (i.e. "I have way too many computers", not "I want to sysadmin a fleet") that is irrelevant. And for building the FOSS I work on, it needs to work with cargo install anyway. And for work we use Yocto sysroots (embedded Linux) and are working on switching from CMake to Bazel (C++ to Rust is also under investigation).

Now that is what my tool aims to solve, and IMO actually solves as well. I have heterogeneous hardware and it should be easy to add a new computer. I do auto discovery as far as possible (e.g. looking at GPU vendor, DMI data etc. to determine what packages need to be installed on a specific host). But also not all computers are used for (or usable for) all purposes. My headless RPi needs different software from my desktop, which needs different software than my laptop, which again needs different software than my retro computers.

If nixos works for you, use it. It is the more thorough alternative. But I can't get along with it, and at work can't use it.

Even on Windows this isn't really true anymore; but partially because everyone just gave up on installing dependencies to the system and just bundles them with the app install, and partially because we just got better at handling compatibility in general.

Honestly, the closest issue I've had to this has been the same as on Linux; some tool breaks on eg python 3.12, so you need to install 3.11 and either make it default or set the env or config for the tool to use it, and of course something else may well require 3.12.

Until we get rid of $PATH as a global concept that's a pretty unavoidable problem, so I've always wanted something like Nix, but despite having a lot of the pieces in place for a long time technically, Microsoft doesn't seem interested.

2 Likes

What replaced "DLL hell" on Windows is the 100s of copies of the Visual Studio Redistributable which accumulate and never seem to get removed, and the automatic installer bundlers which track every DLL that gets loaded and ensures they get bundled in a way such that they'll get loaded after installation.

IIRC, Windows did actually make a push to limit the use of global/implicit dependencies: UWP and the other various restricted build environments. Roughly everyone hated it because it broke compatibility with traditional Win32 API. (Although all of my knowledge here is vague hearsay at best.)


If the CrowdStrike bug was indeed an out of bounds index, then it would easily cause the same crash in Rust. What would differ would be that it would crash on the out of bounds indexing instead of when trying to use the loaded bad value.

An interesting possibility is whether the Rust crash could be more controlled than the C++ one. With a privileged driver causing an access violation, there's not much more the OS can do than a rapid restart, as the driver could have already caused arbitrary problems with the running state. Whereas the process of a panic in Rust could be configured to signal the OS to kill just the one driver process, at least in theory.
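For concreteness, a small sketch of the difference: in safe Rust, an out-of-bounds index is a deterministic panic at the access site, not an arbitrary memory read, and in user space a panic can even be caught at a boundary (in kernel mode there's no such luxury, which is the point above).

```rust
use std::panic;

fn main() {
    let fields = vec![1u32, 2, 3];
    let bad_index = 5; // runtime value, past the end of `fields`

    // Safe Rust bounds-checks every index: this is a well-defined panic,
    // not a read of whatever memory happens to follow the allocation.
    let result = panic::catch_unwind(|| fields[bad_index]);
    assert!(result.is_err());

    // In-bounds access is unaffected.
    assert_eq!(fields[0], 1);
    println!("out-of-bounds index was a controlled panic, not a wild read");
}
```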

5 Likes

I think you're a little out of date; we had DLL hell until the mid 00s, then we had that "WinSXS" (SXS = "side by side") nonsense that was a pain in the butt for everyone until the mid 10s, then we got the "universal C runtime", which was basically Microsoft replacing one complicated pain-in-the-butt system with another, but at least it's only really a pain in the butt for them.

Other than the VC runtime, I don't think anyone ended up using that SXS system other than Microsoft; that was what I meant by everyone just bundling their deps, by simply putting their dlls in the application's directory. The exception might be managing .net runtime stuff? I remember that having some nonsense you needed to deal with in some cases.

2 Likes

I've had issues with this in the past, even on Ubuntu, when compiling custom C code from scratch. I have no problems believing that since then the situation has improved, for a multitude of reasons. But what LSB distros will never be able to do¹ is guarantee that DLL hell is gone for good.

Funny you should mention this.
Before moving to NixOS, I was a user of Ubuntu for 10+ years. In that period of time, especially in the later years, there were multiple times that Ubuntu just completely borked itself, often enough (but not always) after a system upgrade, and a reinstall actually took less effort and time than trying to figure out how to fix it. This is rather annoying on a production machine, to say the least. And every time it happened, my opinion of Ubuntu and Canonical dropped, bit by bit.
Add to that NIH initiatives like Mir and Snaps, and their deal with Amazon, and I started looking at the exit door some years back; that's when I stumbled upon Nix and NixOS.
So in that regard NixOS does have something to offer me: stability. Not just in word, but also in deed, since I've never had any issues like that since migrating to it. That's what I meant by "fewer failure modes relative to LSB distros".

I wouldn't say I agree with that, because the maintenance of each additional machine is a time sink if you let it be, in the form of (nearly) duplicated effort. I know that from experience.
On top of that, for me at least it's not just a personal config tool. Like Ubuntu before it, I use NixOS while tending to my professional duties. And there too, the stability + rollback + easy horizontal scaling value proposition is present, and is still growing the value part.

That said, none of this is meant to convince you to use NixOS. Use whatever you like.
I'm merely pointing out why it's more valuable to me than a mere standalone build system, or an LSB Linux distro :slight_smile:

Yeah that's where the concept of a configuration file (each of which can then be used as a template for different kinds of hosts) becomes really valuable.

Yep, that's definitely true. An easy place to spot such things is with (older) games that want to update some runtime or another.
It's essentially the same hack used by flatpak and snaps: apps supply their own dependencies. It patches over the problem for sure, so the end user can get on with their life. Just not without paying a tradeoff tax, mostly in the form of additional disk space consumed.

Oh I dunno about that. Nix/NixOS uses $PATH rather than abolishing it, and combines that with clean environments for any application you want, to prevent env vars from unduly influencing code they weren't meant to. Add to that being able to install multiple versions of any library side by side in the nix store, and you end up with a solution that tackles the problem at the source. It's boring in the best way possible :slight_smile:

I view that kind of like app stores: none of Apple, Google, Valve or Microsoft pioneered that idea. The Linux community as a whole did, when they invented APT, RPM, pacman and the like. And for a long time after, no commercial entities were interested.
The reason is that those companies aren't interested in better tech as a goal; rather they are interested in how they can parlay that into making more revenue, and more importantly, profit. So it's more of an economic than a technical matter for them.

In addition, when MS tried to improve the dependency situation with (I think it was called) UWP, the response from their dev community was a big "meh", because it broke win32 compatibility. From the position of an app developer this was understandable, but on the whole it didn't help matters.

Apple on the other hand didn't have nearly as large a dev community, and they had more leverage over the devs in that community in terms of migrating to new frameworks, so when they developed the App Store, they were in a position to mandate it without ill effects on the whole in terms of response from those devs.

¹ Without the use of containerized apps at least. Those import their entire dependency tree by design, unnecessarily blowing up the installed application size in the process. This doesn't matter for some tiny GUI app, but for larger applications that can be a difference measured in gigabytes. And that's a very real tradeoff.

Do you know for a fact that there's a broken engineering culture in that specific company?

While I completely agree with the technical flow considerations you mentioned, you have to bear in mind this happens in a time-critical security context. They don't only have to deliver updates; they often have to do it within very short timing constraints.

Since the code has likely been tested before, and the update only contained data to be interpreted by this code, it's not such an unreasonable move to avoid the lengthy WHQL kit validation process and to skip an unrealistic multistage rollout delivery.

As far as I know, the company was well renowned before the two recent accidents, though I've never had to use their software. It doesn't look like a company with a broken engineering culture. It rather looks like imperfect testing and / or code that could be more robust, but those are very difficult to avoid all the time, so without knowing more, I wouldn't judge too hastily.

2 Likes

Let's perform a little Gedankenexperiment.

We'll assume there is a defined engineering culture, and it has defined rules about how to deploy artifacts, what artifacts can be deployed and what criteria they must meet, and where, when and in what time frame to deploy it.
If any of these assumptions are false, their engineering culture has serious holes in it to the point of brokenness.

Assuming those assumptions are true though, then 1 of a couple of things must have happened:

  1. Someone disregarded the rules, which in my eyes is a fireable offense in a security context, and even more so given the damage this has caused to CrowdStrike as a company: a market cap drop from roughly 78 billion down to 65 billion, so around 13 billion dollars. And that doesn't even count the damage done to other companies.

  2. Everybody followed all the rules. This means that the engineering culture needs updating pretty badly, since it obviously failed here.

  3. The layoffs they performed in 2023 (see below) caused serious gaps in knowledge. That too is an issue with engineering culture since it's like taking a sharp knife to that knowledge base within the company, though in this case one caused by the execs.

  4. Time pressure w.r.t. delivering such an update. In this case it's more the engineering itself that suffers, but that's more of an issue with the culture of the company as a whole, rather than specifically the engineering culture, because it means that there are structural factors that actively make the product they deliver worse.

So even if it isn't specifically an engineering culture issue, it's still definitely a culture issue.

I wonder: what could possibly outweigh damage in the order of 10s of billions of dollars?

If the contracts CRWD has with its customers encourages this kind of risk taking with others' infrastructure, I suspect the customers will soon be feeling the need to update those contracts with similar urgency as the crowdstrike devs have felt for quite a while now.

Let's say that's true. Let's also say that it's all automated, which may or may not be true.
All that shows is that their test suite as a whole is in dire need of updating.

Keeping in mind the damage this has caused: is it really reasonable?

Perhaps people felt this way before this incident. But equally, this incident is bound to change a lot of minds on this.

Crowdstrike engaged in mass layoff shenanigans last year, so it seems reasonable that that factored into those incidents.

Business people at the top of corporations should pay attention to this: layoffs are in a sense like a cancer. They slowly but surely destroy the healthy tissue and what's left is less healthy than before. Just because an employee wasn't fired doesn't mean they don't experience ill effects.

Well, crowdstrike has literally made it their business. Difficult it may be, but I'm curious what the conclusion at the end of all the litigation sure to follow will be.

EDIT:
Apparently one big factor was that the test procedures weren't sufficient.

1 Like

A few remarks:

  • (2): not all situations are always taken into account when you make rules. Engineers are supposed to think on their own, and they're not infallible. Also, gaps in the rules don't imply a culture problem. I'd even say rules and culture are orthogonal - I think you often mean methodology, not culture.
  • (3): it did layoff a part of the staff, but do you really know "it caused serious gaps in knowledge"? Where did you get that information?
  • (4) isn't a culture problem; it's a constraint of the job. It's likely part of what led to the problem.

Easy for you to say after the fact, but in general, and with the knowledge they had before it happened, I'd say: the damage caused by a threat that hasn't been plugged in time. It has happened in the past.

If that's what happened and if they thought the kernel code wouldn't crash with an update of the definition files, they weren't taking a risk but rather failing to test a specific use case (likely indirectly).

But I agree that the customers will have to ask themselves whether they want to continue with CrowdStrike or not, and they'll be influenced by all the BS we can read in the media.

It's the same point as the one above regarding the damage. Asking that after the fact is moot, and what we're discussing here is whether the cause is due to a major culture issue. If the kernel code takes all eventualities into account, yes, it's a perfectly reasonable strategy.

It could be the case, but I doubt those systems (software, validation flow, etc.) were not already in place before the layoffs. Have they suddenly changed something significant after the layoffs? Seems far-fetched to me, though I agree that attention is needed when times are difficult in general.

We already had a good clue about that from the analysis above by Tavis Ormandy, which is why I mentioned the possibility in my post. Apparently, they're reacting as they should, but they may have a difficult time ahead. It's interesting to see they promised to release "the full Root Cause Analysis" - a bold move, but they must feel it's necessary.

I'm appalled by all the misinformation people are spreading about this incident by jumping to conclusions from their armchairs, in particular those who claim it's their CEO's fault just because he happened to be working at McAfee when another crash occurred (I'm not talking about this conversation, but in general on Twitter, Reddit, YouTube, etc.). But it's nothing new; Brandolini's law indeed.

1 Like

People are generally not great at weighing known short-term costs against potential long-term risk. In the most empathetic understanding, they're just trying to get through today so that they can deal with tomorrow, tomorrow. In a much less graceful one, it's a hope that they won't be around to suffer the consequences once things inevitably erode to the point of going wrong.

Most anyone here will be more biased to managing long term risk — a major ask of Rust is investing effort up front for benefit later — but this is a documented fallacy of the human condition that we are all subject to.

We can and should strive to do better. But the "move fast, break things" mentality exists for a real reason, and it does no good to overlook that.

To put it pithily: using Rust wouldn't really make the error less likely, but choosing Rust might.

6 Likes

just an update:

CrowdStrike releases root cause analysis of the global Microsoft breakdown - ABC News (confirmed: an out-of-bounds memory access was the cause)

A quote from the article:

Falcon expected the update to have 20 input fields, but it had 21 input fields.

This "count mismatch" is what caused the global crash, CrowdStrike said.

"The Content Interpreter expected only 20 values," the RCA report states.

"Therefore, the attempt to access the 21st value produced an out-of-bounds memory read beyond the end of the input data array and resulted in a system crash.

Rust would have crashed on an out-of-bounds vector or array access too. But I still suspect that Rust's strong type system would have made this less likely to happen, and the crash would have occurred sooner, simplifying the root cause analysis and the fix.
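As a sketch of how that might look (my own illustration, not CrowdStrike's actual interpreter), a slice access via `.get` turns the 21-vs-20 mismatch into a recoverable `None`, and a length check up front rejects the malformed input before any field is read:

```rust
const EXPECTED_FIELDS: usize = 20;

/// Bounds-checked field access: `None` instead of a read past the end.
fn read_field(fields: &[u32], index: usize) -> Option<u32> {
    fields.get(index).copied()
}

/// Validate the record shape before interpreting any of its fields.
fn validate(fields: &[u32]) -> Result<(), String> {
    if fields.len() == EXPECTED_FIELDS {
        Ok(())
    } else {
        Err(format!(
            "expected {EXPECTED_FIELDS} fields, got {}",
            fields.len()
        ))
    }
}

fn main() {
    let update: Vec<u32> = (0..21).collect(); // 21 fields arrive, 20 expected

    // The malformed update is rejected up front instead of crashing later.
    assert!(validate(&update).is_err());

    // Even without the validation step, reading one index past the end
    // yields None here rather than an out-of-bounds memory read.
    assert_eq!(read_field(&update, 20), Some(20));
    assert_eq!(read_field(&update, 21), None);
}
```

None of this needs the type system per se, just the API design: making the fallible access the ergonomic default is what nudges you toward handling the unhappy path.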

We may never know.

3 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.