Today we faced a minor operational issue with The Things Stack Cloud affecting the inter-component communication of our infrastructure in the eu1 cluster, mainly the transmission of downlink messages. The issue lasted from 04:27 to 05:03 UTC today (July 27, 2023).
The root cause was that AWS Cloud Map, the service we use for service discovery, was experiencing issues with instance status updates and therefore did not return all of the available instances of a given service.
Service discovery is used in distributed systems to dynamically address individual instances of a particular service. In our case, one such service is the Gateway Server, which handles the communication with the LoRaWAN gateways.
Due to the issues experienced by AWS Cloud Map, our Network Server service was unable to detect the Gateway Server service instances and, in turn, was unable to schedule downlinks. Fortunately, a subset of our Network Server instances was still visible, so uplink traffic was still processed successfully, albeit with small delays due to the increased load on the visible instances.
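To illustrate the failure mode described above, here is a minimal sketch using a toy in-memory registry. It is not the AWS Cloud Map API and not The Things Stack code; the `Registry` class, the `pick_instance` helper, the service names, and the instance addresses are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Toy in-memory stand-in for a service discovery backend (hypothetical)."""

    # Maps service name -> {instance address: currently visible?}
    _instances: dict = field(default_factory=dict)

    def register(self, service: str, address: str) -> None:
        # Newly registered instances start out visible.
        self._instances.setdefault(service, {})[address] = True

    def set_visible(self, service: str, address: str, visible: bool) -> None:
        # Simulates a status update (or a missed one) for a single instance.
        self._instances[service][address] = visible

    def discover(self, service: str) -> list:
        # Returns only the instances the registry currently reports.
        entries = self._instances.get(service, {})
        return sorted(addr for addr, visible in entries.items() if visible)


def pick_instance(registry: Registry, service: str, key: str) -> str:
    """Pick one visible instance of `service` for `key`; fail if none is visible."""
    instances = registry.discover(service)
    if not instances:
        raise LookupError(f"no visible instances of {service}")
    # Deterministic choice so the same key maps to the same instance.
    return instances[sum(key.encode()) % len(instances)]


registry = Registry()
for addr in ("gs-0", "gs-1"):
    registry.register("gateway-server", addr)
for addr in ("ns-0", "ns-1", "ns-2"):
    registry.register("network-server", addr)

# Simulate stale status: every Gateway Server instance and one Network Server
# instance disappear from discovery results, although they are still running.
registry.set_visible("gateway-server", "gs-0", False)
registry.set_visible("gateway-server", "gs-1", False)
registry.set_visible("network-server", "ns-2", False)

# Uplinks still find a Network Server, but only two instances share the load.
uplink_target = pick_instance(registry, "network-server", "gateway-a")

# Downlinks cannot be scheduled because no Gateway Server instance is visible.
try:
    pick_instance(registry, "gateway-server", "gateway-a")
    downlink_ok = True
except LookupError:
    downlink_ok = False
```

In this sketch the hidden instances never stopped running; the caller simply cannot address them, which is why the effect disappears as soon as the registry reports them again.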
Once the AWS Cloud Map issues were resolved, downlink traffic was restored and uplink traffic latency returned to normal values.
Head of Engineering, The Things Industries