From Johan Stokking, CTO and Co-Founder, The Things Industries:
Early morning Sunday, May 3rd (CEST), our internal monitoring systems observed decreased traffic flows and high CPU usage across many of our managed servers dedicated to customer environments of The Things Industries V2 SaaS. Upon investigation by our tier 1 operations team, it appeared to be an unusual issue related to our server management operations, beyond the scope of what our tier 1 operations team can resolve. It was therefore escalated to our tier 2 operations team.
Upon investigation by our tier 2 operations team, it became apparent that the high CPU usage was caused by the Salt minion. Salt is a system for monitoring and updating managed servers, which we use to manage the servers hosting the customer V2 SaaS environments. We also discovered that the web server was disabled, causing the Console and some integrations to be unavailable. Due to the high CPU usage, traffic throughput dropped to between 1% and 10-20% of normal, depending on the customer infrastructure. This often triggers a domino effect: devices that perform regular LoRaWAN link checks, or that depend on quick message acknowledgements, conclude that there is no (reliable) link and revert to join mode. While in join mode, LoRaWAN devices no longer send telemetry, and they require the network to respond in time. In practice, this meant that some customers were receiving no telemetry at all, while our servers were seeing many LoRaWAN join requests.
Therefore, the operations team temporarily killed the Salt minions on the customer servers. While the downtime for customers was ongoing, the servers were updated and rebooted outside of our regular maintenance windows. This was a manual process that needed to be done per customer environment, as this is exactly the kind of task we normally use Salt for. Traffic came back immediately, but in some cases the forward shadow is quite long: tens of thousands of devices in join mode need to rejoin one-by-one, sometimes in duty-cycle constrained regions, before they start sending telemetry again.
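To give a feel for how long that forward shadow can be, here is a back-of-envelope sketch. The device count and the sustainable join rate below are illustrative assumptions, not measured figures from our infrastructure; the actual rate depends on region, spreading factors, gateway density, and downlink duty-cycle limits.

```python
# Back-of-envelope estimate of the rejoin backlog after an outage.
# Assumption: devices rejoin one-by-one at a fixed sustainable rate,
# limited by downlink duty-cycle constraints on join-accepts.

def rejoin_backlog_seconds(num_devices: int, joins_per_second: float) -> float:
    """Time until the last device has rejoined, assuming a fixed
    sustainable rate of completed join exchanges."""
    return num_devices / joins_per_second

# Hypothetical numbers: 20,000 devices in join mode, and a sustained
# rate of 2 completed join exchanges per second across the site.
devices = 20_000
joins_per_second = 2.0

backlog = rejoin_backlog_seconds(devices, joins_per_second)
print(f"Backlog cleared after {backlog / 3600:.1f} hours")
```

Even under these optimistic assumptions, clearing the backlog takes hours, which is why telemetry can lag well behind the moment service itself is restored.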
The tier 2 operations team discovered ongoing reports from the large Salt user community that aligned well with our experience: high CPU load related to the Salt minion, a shut-down web server, and an automatic restart of the CPU-demanding process. It was concluded that the issue was related to CVE-2020-11651 and CVE-2020-11652, vulnerabilities in Salt that can be exploited for remote code execution via an authentication bypass. Even though a patch had been available since April 30th, roughly three days before the incident, our operations team was unaware of the need to patch, let alone of the availability of a critical patch, as Salt lacks a notification channel for operations teams. See also F-Secure's analysis and advisory on this matter.
Even though service was fully restored Sunday morning (CEST) for most of our customers, we are still investigating the impact on our infrastructure. As a precaution, we are redeploying all customer servers in the coming days, under the emergency maintenance reservation in our service level agreement. The expected downtime is 5 to 15 minutes, which may cause some LoRaWAN devices to revert to join mode.
This was not a targeted attack. On the contrary, many large sites report similar breaches. Based on the timing of events and what we encountered on affected infrastructure, we have no reason to believe that any data has been compromised. Instead, the aim was to gain generic and distributed compute resources: we identified the malicious binary as a Monero miner.
We understand that service interruption is one thing, but lack of pro-active communication is another. While our internal monitoring and alerting worked as it was supposed to, and while our tier 1 and tier 2 operations teams were busy resolving the issue, some of our customers reached out to us asking what was wrong. We understand that this is the reverse of how it should be: we should pro-actively inform you, preferably automatically, of outages. Therefore, we are setting up The Things Industries status page where you can subscribe to updates. We are changing some internal processes to report outages faster, and the status page will be the channel for those reports.
In the short term (this week):
There is no impact on The Things Industries Cloud Hosted running The Things Enterprise Stack V3. In our new Cloud Hosted environment, we do not manage individual server instances. Instead, we use AWS Fargate, a serverless compute service for containers. There is no host to compromise, as the host is not even accessible to us; we only care about containers that run images that we build and verify.