The SolarWinds Hack and My Apache SOLR Patch

marcussorealheis
Jan 5, 2021

Trap tutorial, ridin’ down Memorial
From the bando to the Waldorf Astoria — “Both Eyes Closed,” Gucci Mane, 2017

SolarWinds, a company that provides a suite of network management and monitoring tools, was recently called out as the vendor compromised in a widespread hacking campaign by a very clever team of hackers, reportedly from Russia. The scope of the SolarWinds hack is seemingly unprecedented. The infiltrators had access to SolarWinds customers’ networks for months via the Orion software. It is estimated that as many as 18,000 customers downloaded malware to their own infrastructure via SolarWinds software updates (move to the cloud, people!). The attack could pose an existential threat to the SolarWinds business. What was designed to be a shield became a sword.

Using a repository of abandoned domains, the sophisticated hackers were able to gain access to some of the world’s most trusted institutions. The malware evaded detection and veiled its identity by using names that looked familiar and legit. Since the compromised software is used to manage networks, we can conclude that the malware was able to move freely from host to host because it had already gained access to a trusted area where only authorized users are permitted. This case brought me back to a security patch I wrote for the Apache Lucene/Solr project back in 2019.

block unknown

San Francisco Summer, 2019

It was a temperate Friday night in the summer of 2019 in San Francisco, and I was with my buddy Moses*, a very talented and extremely secretive security engineer (and a chain smoker of sorts) at a tech company that had recently IPO’d. Then I got a call from our mutual friend Kristoff*.

Moses rushed back to the crib so I could get to my computer. It felt like TV. We discovered a subset of Kristoff’s servers had been compromised in his network and turned into cryptocurrency miners. As a result, the cloud provider had shut all his servers down! If that persisted for too long, he would be in serious trouble.

That’s when I dug in. I hopped on the phone with a group of trusted friends, and charted out a remediation strategy over the next few hours so that he could get back to business. As I started to peel back the layers, bouncing between Google and the command line, it became clear to me that the attack that Kristoff succumbed to could impact billions of users and thousands of companies because the attacker had exploited a very popular software solution and a common misconception in security:

My internal applications are safe and do not require the same level of security as external resources because they are in my network, which is locked down at the perimeter.

Two AK-47s and a blow torch
Couple [bots] knocking hard on my front porch (Huh)
A couple old schools in my backyard
If I don’t know you, I’ma serve you through my burglar bars — “First Day Out,” Gucci Mane, 2009

Kristoff’s network looked secure, with firewall rules, key management, and a few other measures. All of his perimeter precautions were negated by the fact that he had a malicious process running on a device inside his network. When I went to work that Monday, tired from trying to save my buddy’s network from the proprietary self-managed product he had purchased to run alongside his open source software, I determined that I would eliminate that false sense of internal safety wherever it occurred at my job.

At the time, I was leading Developer Relations at Lucidworks, a sponsor for Apache Lucene/Solr. While sitting with a colleague, we stumbled upon some really unintuitive authentication behavior that we probably ignored before because we had grown too accustomed to it. In Apache Lucene/Solr, users can enable the authentication plugin, but if they don’t specify a user, the plugin isn’t actually enabled. It sort of makes sense because you cannot authenticate if you do not have a user. But if you want authentication, you should get it.

I called up a Java legend and mentor to complain at a high level about the frustration, which felt eerily similar to what I had dealt with over the weekend. I knew the default behavior of the RBAC (role-based access control) in the application could lead to all sorts of problems if someone got past a Solr user’s perimeter security. Without my saying a lot about what we were going through, he opined that the design and experience, now that he had thought about it, was really silly.

That was all I needed to hear. Listen, with some of these big open source projects, getting any sort of traction that will result in a merge of your code contributions can require lots of upfront research, digging into code written by someone no longer involved, and a popularity contest that I do not always win because I am myself and it only takes one or two haters. I am not much of a Java developer, so before I dove in and spent at least several weeks on this project, I wanted to be sure I had some level of buy-in, with a respected heavyweight aligned.

I was writing my first breaking change in the project ever, and it didn’t land easily. I was effectively adding a step to the user experience for customers looking to set up a Lucene/Solr cluster with basic auth.

Here’s how basic auth could work before my patch:

  • developer starts a cluster
  • developer issues a command to the administrative API to load the basic authentication plugin without having created any users with administrative privileges and without specifying that the blockUnknown parameter should be true. This optional parameter has historically defaulted to false, even after you “enabled” the authentication plugin.
  • developer and any other process on the trusted network do whatever they want until, if ever, someone creates a user or users with the privileges to issue administrative commands that require a password to execute. Imagine if the clever team is on your network via SolarWinds, acting a fool with your customers’ bank accounts.
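
The steps above can be sketched as a toy model. This is not Solr's code: `BasicAuthConfig` and `admin_request_allowed` are hypothetical names invented for illustration, passwords are kept in plain text only to keep the sketch short, and the only detail taken from the story is that blockUnknown historically defaulted to false.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class BasicAuthConfig:
    """Hypothetical stand-in for an 'enabled' basic auth plugin."""
    # user -> password; plain text only because this is a toy model
    credentials: Dict[str, str] = field(default_factory=dict)
    # mirrors the historical blockUnknown default described above
    block_unknown: bool = False


def admin_request_allowed(config: BasicAuthConfig,
                          user: Optional[str] = None,
                          password: Optional[str] = None) -> bool:
    """Return True if this toy model lets an administrative request through."""
    if user is None:
        # No credentials presented: rejected only when block_unknown is on.
        return not config.block_unknown
    # Credentials presented: they must match a known user exactly.
    return user in config.credentials and config.credentials[user] == password


# Plugin "enabled" with no users and the old default: anonymous callers get in.
pre_patch = BasicAuthConfig()
print(admin_request_allowed(pre_patch))  # True
```

In this model, any process that can reach the server, including malware already inside the perimeter, gets the same answer as the developer.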

Here’s how basic auth would work if the same steps are taken in Apache Lucene/Solr with my patch:

  • developer starts a cluster
  • developer issues a command to the administrative api to load the basic authentication plugin without any users with administrative privileges created
  • the command fails and developer receives a log message informing them that they cannot enable authentication unless there is a user
  • developer creates a user with the appropriate permissions to take administrative actions
  • developer issues a command to the administrative api to load the basic authentication plugin and even if the network is compromised, the search server is safe from unwanted administrative actions barring other vulnerabilities
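
A minimal sketch of the check described in the list above, assuming the configuration arrives as a dictionary shaped like Solr's security.json (the `authentication`, `class`, `credentials`, and `blockUnknown` keys). The function `enable_basic_auth` and the `AuthConfigError` exception are hypothetical names, not the code from the actual patch, and defaulting blockUnknown to true when it is omitted is an assumption made for this sketch.

```python
from typing import Any, Dict


class AuthConfigError(ValueError):
    """Hypothetical error raised when authentication cannot be enabled."""


def enable_basic_auth(security: Dict[str, Any]) -> Dict[str, Any]:
    """Refuse to enable basic auth unless at least one user is defined."""
    auth = security.get("authentication", {})
    credentials = auth.get("credentials") or {}
    if not credentials:
        # Fail loudly instead of silently running with no one to authenticate.
        raise AuthConfigError("cannot enable authentication: no user defined")
    # Assumption for this sketch: block unknown callers unless told otherwise.
    block_unknown = auth.get("blockUnknown", True)
    return {**security, "authentication": {**auth, "blockUnknown": block_unknown}}


# Loading the plugin with no users fails.
try:
    enable_basic_auth({"authentication": {"class": "solr.BasicAuthPlugin"}})
except AuthConfigError as err:
    print(err)

# With a user in place (placeholder hash, not a real credential), it succeeds.
config = enable_basic_auth({
    "authentication": {
        "class": "solr.BasicAuthPlugin",
        "credentials": {"admin": "<salted-hash-placeholder>"},
    }
})
print(config["authentication"]["blockUnknown"])
```

Failing the command up front is the design point: the developer finds out at setup time, not after an incident, that "enabled" authentication was protecting nothing.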

Of course, if you read the documentation, the improved flow requires only three steps (1, 4, and 5) to safely spin up a cluster with authentication enabled.

There are several authentication options in the project but Basic Auth is the simplest one and often the one companies start with when building. I prioritized fixing this option to protect the community’s most vulnerable users.

Several of the more established guys in our community suggested to me that what I was doing was a waste of time because some people like running applications with authentication “enabled,” but with no users to authenticate. While their point was factual, we cannot compromise the security of all for the convenience of some.

🧐

Other users would mention to me that while they liked my Pull Request and agreed with it in principle, they didn’t see it as a necessary patch because their networks were always locked down, so they never used authentication. Nobody’s network is foolproof. In all likelihood, there’s malware on your network already and you just don’t know about it. If you don’t know the software running on your network, keep it on the other side of your “burglar bars.” Open source or open core software is usually more trustworthy because you can inspect the code yourself. It’s also important to take a multi-faceted approach that covers as many attack vectors as you can think of, like network layer, authentication, authorization, sandboxing, malware detection, container verification, secrets management, audit logging, monitoring, application security, and so much more.

Depending on what kind of access a malicious actor has, even authentication won’t save you. If there is a process in your network with escalated privileges and intra-VPC connectivity, you’re still probably toast. In fact, the SolarWinds malware almost certainly had root access to the infected devices it was updating because it was able to update them. In 2021, the risk of managing your own infrastructure is simply too high. Definitely use a SaaS service. Don’t get bit by engineering hubris. Your network will have holes because the network and the software running on it were largely written by humans. And humans make mistakes. I certainly do, all the time.

JIRA Issue: https://issues.apache.org/jira/browse/SOLR-13649

*The names in the story have been changed because they are security engineers and would rather not tell their stories.

Thanks to Erik Hatcher, Jan Høydahl, Tim Potter, Jaren Glover, and Shalin Mangar for helping or motivating me to land this PR.

I am an angel investor, art collector, open source hacker, and a Senior Product Manager at MongoDB in San Francisco.

marcussorealheis

Apache Solr Committer, MongoDB and Weaviate Advisor, Co-Founder at a Futuristic Tools Company