Application data integrity issue
Incident Report for The Things Industries
Postmortem

Today we faced a major operational issue with The Things Stack Cloud regarding the integrity of application data in uplink and downlink messages. The issue lasted from 14:12:52 to 15:39:30 UTC today (February 10, 2023).

Data forwarded by The Things Stack Cloud Application Server to integrations (webhooks, MQTT, etc.), as well as data sent by the Application Server to end devices during this time frame, can be considered “garbage”. Moreover, end devices that joined the network during this time frame may have received a valid join-accept, but the Network Server did not record the correct session keys. Therefore, these devices need to rejoin now that this issue is resolved.

This affects all The Things Stack Cloud regions and all LoRaWAN devices using over-the-air activation (OTAA).

At this moment, we cannot recover application data from this time period: the Application Server does store application data via the Storage Integration, but only in decrypted form. Since that decrypted data is scrambled, there is no way to recover the original payloads from it.

Cause

The root cause of this issue is that we did not correctly cache unwrapped keys in The Things Stack. Key unwrapping happens in various places; for example, when loading session keys that are stored encrypted (wrapped with a key encryption key) in our databases. Key wrapping and unwrapping is computationally expensive. Therefore, we implemented a caching mechanism for the unwrapped key, with the intention of reducing processing time in places where we need to unwrap the same key many times over. An example is when a downlink queue with hundreds of downlink messages needs to be re-encrypted (because the order or FCntDown changes). In that case, we want to unwrap the encrypted AppSKey once, not hundreds of times in a row.
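To illustrate the pattern, and how it can go wrong, here is a minimal, hypothetical sketch of such a cache in Go. This is not the actual The Things Stack code; the names and the unwrap placeholder are made up. If the cache key is too coarse, for example only the key encryption key label, two different wrapped session keys collide and the second caller silently gets back the wrong plaintext key:

    // Hypothetical sketch only; not the actual The Things Stack implementation.
    package main

    import (
        "fmt"
        "sync"
    )

    // unwrap stands in for the (computationally expensive) key-unwrap operation.
    func unwrap(kekLabel string, wrapped []byte) []byte {
        out := make([]byte, len(wrapped))
        for i, b := range wrapped {
            out[i] = b ^ byte(len(kekLabel)) // placeholder transform
        }
        return out
    }

    // badCache caches unwrapped keys by KEK label only. Two different wrapped
    // session keys protected by the same KEK collide in the cache: the second
    // caller gets the first caller's plaintext key back.
    type badCache struct {
        mu   sync.Mutex
        keys map[string][]byte
    }

    func (c *badCache) Unwrap(kekLabel string, wrapped []byte) []byte {
        c.mu.Lock()
        defer c.mu.Unlock()
        if key, ok := c.keys[kekLabel]; ok {
            return key // wrong: ignores which wrapped key was asked for
        }
        key := unwrap(kekLabel, wrapped)
        c.keys[kekLabel] = key
        return key
    }

    func main() {
        c := &badCache{keys: make(map[string][]byte)}
        a := c.Unwrap("cloud-kek", []byte{0x01, 0x02})
        b := c.Unwrap("cloud-kek", []byte{0xAA, 0xBB}) // a different wrapped key
        fmt.Println(a, b)                              // b wrongly equals a
    }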

This caching mechanism, along with other caching mechanisms related to cryptographic operations, was implemented and passed our internal code reviews and unit tests. However, we did not enable it in production until today. Today, we noticed degraded performance and decided to enable this caching mechanism, and while we were at it, to update The Things Stack to a slightly newer version. That turned out to be a mistake: the AppSKey used for decrypting uplink messages was not the correct one, so the Application Server produced garbage payloads. The same happened for downlink messages.

As they say in software engineering:

There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors

Today, we failed at cache invalidation, and we also failed to test for this case.

Resolution

Not knowing the root cause yet, we rolled back the version of The Things Stack that we use in Cloud to the latest stable version. This always takes a while, as we use rolling updates: service instances are upgraded one by one to avoid downtime. Since the faulty caching mechanism was also present in the latest stable version and the new configuration was still in place, the problem continued. It took us a while to realize that enabling the caching mechanism had activated a piece of the codebase that scrambled LoRaWAN application payloads. Once we rolled back that part of the configuration and restarted our services, the problems started to disappear and things got back to normal.

The fix was simple in the end (see the GitHub pull request) and will be deployed shortly after it passes QA testing in our staging environments with caching enabled.
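Without reproducing the actual patch here, the general way to avoid this class of bug is to derive the cache key from everything that determines the unwrap result, that is, both the KEK label and the wrapped ciphertext itself. As a hypothetical sketch only (again, not the real code or the actual fix):

    // Hypothetical sketch only; not the actual fix from the pull request.
    package keycache

    import (
        "crypto/sha256"
        "encoding/hex"
        "sync"
    )

    type keyCache struct {
        mu   sync.Mutex
        keys map[string][]byte
    }

    // cacheKey derives the cache key from the KEK label and a digest of the
    // wrapped ciphertext, so different wrapped session keys can never collide.
    func cacheKey(kekLabel string, wrapped []byte) string {
        sum := sha256.Sum256(wrapped)
        return kekLabel + ":" + hex.EncodeToString(sum[:])
    }

    // Unwrap returns the cached plaintext only for the exact same ciphertext;
    // otherwise it calls the provided unwrap function and caches the result.
    func (c *keyCache) Unwrap(kekLabel string, wrapped []byte, unwrap func(string, []byte) []byte) []byte {
        k := cacheKey(kekLabel, wrapped)
        c.mu.Lock()
        defer c.mu.Unlock()
        if key, ok := c.keys[k]; ok {
            return key
        }
        key := unwrap(kekLabel, wrapped)
        c.keys[k] = key
        return key
    }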

Follow up

What we learned from this is that we need more thorough code reviews. We also need to make sure that our staging environment reflects what we run in production, even if seemingly simple features like caching are not strictly necessary there, since staging does not need the performance win. We also learned that even the slightest configuration change should not be applied on a Friday, even if that means sticking with somewhat degraded processing time and increased operational cost over the weekend.

Please reach out to The Things Stack Cloud support channels if you have any questions or need additional support to remedy the situation on your end.

Our apologies for the inconvenience.


Johan Stokking
CTO & Co-Founder, The Things Industries

Adrian-Stefan Mareș
Head of Engineering, The Things Industries

Posted Feb 10, 2023 - 18:52 CET

Resolved
The data integrity issue has been resolved and the root cause has been identified.
Posted Feb 10, 2023 - 16:44 CET
Update
The issue seems to be resolved; application layer data integrity looks good. We will keep monitoring performance and investigating the root cause.
Posted Feb 10, 2023 - 16:40 CET
Monitoring
We have rolled back a configuration update and we are seeing better data integrity.

For reference: the issue started today, February 10, 2023 at 14:11:30 UTC.
Posted Feb 10, 2023 - 16:31 CET
Update
We are continuing to work on a fix for this issue.
Posted Feb 10, 2023 - 16:14 CET
Identified
We have identified an issue in The Things Stack Cloud regarding the integrity of application data. This results in garbage data being sent as application data. We will keep you posted.
Posted Feb 10, 2023 - 16:03 CET
This incident affected: The Things Stack Cloud (Europe 1 (eu1), Europe 2 (eu2), North America 1 (nam1), Australia 1 (au1)).