Could eBPF have spared us the CrowdStrike incident?

Interview The CrowdStrike chaos was caused by software running riot in the Windows kernel after an update tripped up the code. eBPF is a useful tool for kernel tracing and observability, but could it have mitigated the CrowdStrike incident?

"It's interesting," Tom Wilkie, CTO of observability specialist Grafana Labs tell The Register, "because there was a vulnerability in the eBPF runtime that caused a similar outage that was also triggered by CrowdStrike in a certain Red Hat kernel."

Wilkie is referring to an incident in June, when Red Hat warned its customers of an issue related to CrowdStrike's Falcon Sensor. The problem paled into insignificance compared to what happened a few short weeks later, when a CrowdStrike update left 8.5 million Windows computers across the world stuck in a blue screen boot loop.

eBPF allows software to run in a virtual machine (VM) in the Linux kernel, permitting developers to add capabilities at runtime. The theory goes that an eBPF program can't crash the kernel because it runs in a sandbox and is safety-checked by a verifier. Because of the low level at which some programs run, it's a popular way of implementing observability and security.
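To make the verifier idea concrete, here is a toy sketch in Python. It is not how the Linux eBPF verifier works; the three-instruction language, the `verify` function, and the `BUFFER_SIZE` limit are all invented for illustration. It shows only the general principle of rejecting a program before it runs if it could read outside its memory or loop forever.

```python
from typing import List, Tuple

# Hypothetical toy instruction set, invented for this sketch:
#   ("load", offset)   read one byte at `offset` in a fixed-size buffer
#   ("jump", target)   jump to the instruction at index `target`
#   ("exit",)          stop the program
Instruction = Tuple

BUFFER_SIZE = 64  # assumed size of the memory the program may read


def verify(program: List[Instruction]) -> List[str]:
    """Return the reasons to reject `program`; an empty list means accepted."""
    errors = []
    if not program or program[-1][0] != "exit":
        errors.append("program must end with exit")
    for index, instruction in enumerate(program):
        op = instruction[0]
        if op == "load":
            offset = instruction[1]
            if not 0 <= offset < BUFFER_SIZE:
                errors.append(f"instruction {index}: load offset {offset} out of bounds")
        elif op == "jump":
            target = instruction[1]
            # Forward-only jumps rule out loops, so an accepted program terminates.
            if not index < target < len(program):
                errors.append(f"instruction {index}: jump target {target} not allowed")
        elif op != "exit":
            errors.append(f"instruction {index}: unknown op {op!r}")
    return errors
```

A program such as `[("load", 0), ("exit",)]` is accepted, while one containing `("load", 64)` or a backward jump is rejected before it ever executes. As Wilkie notes further down, a real verifier is itself software and can have bugs.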

Work to implement the technology for Windows is ongoing.

"So eBPF might be the solution," Wilkie continued, "but it has also been a historical cause of these problems. I mean, fundamentally, injecting code into running kernels is a risky activity. That was the problem CrowdStrike had. And you can still have bugs in eBPF; the safety guarantees offered by the eBPF runtime and the eBPF verifier are not perfect.

"The concept of eBPF is good, but the implementation - like all implementations - has bugs. Now, could you catch something like the CrowdStrike incident with eBPF? Yes. Probably. But honestly, you could also catch it just by doing better testing, and that would be my advice. Having better software engineering hygiene. And that's the lesson CrowdStrike has already learned."

CrowdStrike CEO George Kurtz said at the Goldman Sachs Communacopia and Technology Conference earlier this month that a freak incident caused the July calamity.

"So, in this particular case, we had a configuration change, which is like there's no code, its just a config that the sensor consumes. And we went through a validation process and we validated all those. They actually worked. The problem is we had 21 of them and the sensor understood 20. And that's the simple explanation of what happened.

"So, what have we changed in terms of the process? Well, we now run the configuration changes through not only the validation but all the various code QA processes we have and then deploy that in a phased rollout manner, as well as giving customers the choice on how they want to deploy that content."

Speaking to us ahead of this week's New York ObservabilityCON, during which Grafana Labs will announce enhancements to its Explore apps and Adaptive features, Wilkie also has thoughts on two other contemporary themes: cloud repatriation and funding open source development.

Having users run in the cloud is central to Grafana's mission. Wilkie says the company continues to see the use of its cloud growing - both in terms of user count and revenue - but is repatriation happening? "I would agree with the sentiment," he concedes.

"It feels like there's been a shift in the market in the last year or two, like post-zero percent interest rates, where people are more critically looking at cloud economics and realizing that a lot of SaaS and Infrastructure-as-a-Service is just not viable from a cost perspective."

In a recent submission to the UK's Competition and Markets Authority, cloud giant AWS warned that it was facing stiff competition from the very on-premises infrastructure it dismissed as obsolete not so many years ago.

According to Wilkie, Grafana Labs' solution is to make its cloud more attractive. It has an on-premises version, but features such as adaptive metrics and logs are only available in the cloud. Wilkie says customers find it more cost-effective to use Grafana Labs' cloud for many applications than try to roll their own - well, he would, we guess.

Which brings us to how Grafana Labs remains a viable business and how it decides which services to make open source and which to keep proprietary.

Wilkie explains: "We call it the 'sniff test.' If a feature is going to be generally usable by a very large group of people, we will make it open source; if it only appeals to a small group of enterprises or large organizations, then we'll consider keeping it as a commercial differentiation."

He provides an example: "Grafana has 200-plus data sources, where you can connect Grafana to pretty much anywhere, and 170-ish are open source. Thirty of them are commercial integrations that we sell as part of Grafana Enterprise.

"A good example of a commercial integration would be with Datadog. One of our most popular enterprise data sources is our Datadog one. If you're paying Datadog to store your metrics and you want to visualize them in Grafana, you can pay us some money as well! It seems like a fair exchange of value."

Wilkie also cites Grafana's open source projects. A customer can build solutions with them, but, echoing comments made to El Reg by Kelsey Hightower, Grafana would be more than happy to sell them a managed service, requiring a credit card to get rolling in minutes. ®
