We have faced a minor operational issue with The Things Stack Cloud with regards to external API availability in all of our clusters, affecting mainly Console usage. Traffic processing and delivery was not affected.
The root cause of this issue is that the service which we use for load balancing and request routing, Envoy, had a bug (1, 2) in their HTTP/2 request processing library.
We have upgraded to the latest release of Envoy at the time, v1.29.0
, as part of our v3.29.0
release, and initially did not experience any elevated timeout or error rates in our load balancer. However, over the past two days more reports of these timeouts occurred and we have decided to rollback our Envoy version upgrade. We have not observed any elevated timeout rates since.
We monitor failed request rates inside a cluster, but not at the edge of the cluster, where the load balancer operates, as we deem these rates to be more accurate near the components which experience failures. We will be looking into possible improvements in our monitoring in order to account for such issues in the future.
We have rolled back to the last known working Envoy version, v1.28.1
, and have no longer been able to reproduce the sporadic timeouts.
Adrian-Ștefan Mareș
Head of Engineering, The Things Industries