From Johan Stokking, CTO and Co-Founder, The Things Industries:
Early morning Sunday, May 3rd (CEST), our internal monitoring systems observed decreased traffic flows and high CPU usage across many of our managed servers dedicated to customer environments of The Things Industries V2 SaaS. Upon investigation by our tier 1 operations team, it appeared to be an unusual issue related to our server management operations, beyond the scope of what our tier 1 operations team can resolve. It was therefore escalated to our tier 2 operations team.
Upon investigation by our tier 2 operations team, it became apparent that the high CPU usage was caused by the Salt minion. Salt is a system for monitoring and updating managed servers, which we use to manage the servers hosting the customer V2 SaaS environments. We also discovered that the web server was disabled, causing the Console and some integrations to be unavailable. Due to the high CPU usage, traffic throughput dropped to between 1% and 10-20% of normal, depending on the customer infrastructure. This often triggers a domino effect: devices that perform regular LoRaWAN link checks, or that depend on quick message acknowledgements, conclude that there is no (reliable) link and revert to join mode. While in join mode, LoRaWAN devices no longer send telemetry, and they require the network to respond in time. In practice, this meant that some customers were receiving no telemetry at all, while our servers were seeing many LoRaWAN join requests.
Therefore, the operations team temporarily killed the Salt minions on the customer servers. While the downtime for customers was ongoing, the servers were updated and rebooted outside of our regular maintenance windows. This was a manual process that needed to be done per customer environment, as this is exactly the kind of task we normally use Salt for. Traffic came back immediately, but in some cases the forward shadow is quite long: tens of thousands of devices in join mode need to rejoin one-by-one, sometimes in duty-cycle constrained regions, before they start sending telemetry again.
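To give a feel for how long that forward shadow can be, here is a back-of-envelope sketch. The device count and the sustainable join rate below are illustrative assumptions, not measured figures from our infrastructure; the actual rate depends on region, spreading factors, gateway density, and downlink duty-cycle limits.

```python
# Back-of-envelope estimate of the rejoin backlog after an outage.
# Assumption: devices rejoin one-by-one at a fixed sustainable rate,
# limited by downlink duty-cycle constraints on join-accepts.

def rejoin_backlog_seconds(num_devices: int, joins_per_second: float) -> float:
    """Time until the last device has rejoined, assuming a fixed
    sustainable rate of completed join exchanges."""
    return num_devices / joins_per_second

# Hypothetical numbers: 20,000 devices in join mode, and a sustained
# rate of 2 completed join exchanges per second across the site.
devices = 20_000
joins_per_second = 2.0

backlog = rejoin_backlog_seconds(devices, joins_per_second)
print(f"Backlog cleared after {backlog / 3600:.1f} hours")
```

Even under these optimistic assumptions, clearing the backlog takes hours, which is why telemetry can lag well behind the moment service itself is restored.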
The tier 2 operations team discovered ongoing reports from the large Salt user community that aligned well with our experience: high CPU load related to the Salt minion, a shut-down web server, and an automatic restart of the CPU-demanding process. It was concluded that the issue was related to CVE-2020-11651 and CVE-2020-11652, vulnerabilities in Salt that can be exploited for remote code execution via an authentication bypass. Even though a patch had been available since April 30th, roughly three days before the incident, our operations team was unaware of the need to patch, let alone of the availability of a critical patch, as Salt lacks a notification channel for operations teams. See also F-Secure's analysis and advisory on this matter.
Even though service was fully restored Sunday morning (CEST) for most of our customers, we are still investigating the impact on our infrastructure. As a precaution, we are redeploying all customer servers in the coming days, under the emergency maintenance reservation in our service level agreement. The expected downtime is 5 to 15 minutes, which may cause some LoRaWAN devices to revert to join mode.
This was not a targeted attack. On the contrary, many large sites report similar breaches. Based on the timing of events and what we encountered on affected infrastructure, we have no reason to believe that any data has been compromised. Instead, the aim was to gain generic and distributed compute resources: we identified the malicious binary as a Monero miner.
We understand that service interruption is one thing, but lack of pro-active communication is another. While our internal monitoring and alerting worked as it was supposed to, and while our tier 1 and tier 2 operations teams were busy resolving the issue, some of our customers reached out to us asking what was wrong. We understand that this is the reverse of how it should be: we should pro-actively inform you, preferably automatically, of outages. Therefore, we are setting up The Things Industries status page where you can subscribe to updates. We are changing some internal processes to report outages faster, and the status page will be the channel for those reports.
In the short term (this week):
There is no impact on The Things Industries Cloud Hosted running The Things Enterprise Stack V3. In our new Cloud Hosted environment, we do not manage individual server instances. Instead, we use AWS Fargate, a serverless compute service for containers. There is no host to compromise, as the host is not even accessible to us; we only care about containers that run images that we build and verify.