Gateway Connectivity Issues

Incident Report for The Things Industries

Postmortem

Summary

On Mar 3, 2026, it was reported that some gateways had lost their connection to the LNS for some tenants in the NAM1 region. This was triggered by an AWS update procedure that affected the Gateway Server component. Although the affected gateways eventually reconnected, recovery took longer than expected (up to 8 hours for some tenants).

Impact

  • Some gateways were disconnected for some tenants in the NAM1 region.
  • Three Gateway Server instances restarted during the incident, disconnecting a large number of gateways, which failed to reconnect immediately to the remaining active instances.
  • The number of connected gateways kept declining gradually.
  • Eventually, the affected gateways reconnected and service recovered without intervention.

Root Cause

The incident was triggered by an AWS infrastructure event (task retirement), which caused several Gateway Server instances in NAM1 to undergo a rolling restart. As instances restarted one by one, gateways began disconnecting gradually. Since the restart was rolling rather than simultaneous, some gateways maintained their connection to instances that remained active throughout the event.

The root cause of the prolonged recovery, however, was a short connection timeout configured on some gateways. With a large number of gateways attempting to reconnect simultaneously, the Gateway Server was operating under unusually high load, and the short timeout was insufficient under these conditions: connections were closed prematurely, before they could be fully established. This cycle repeated until the restarted instances completed their post-restart operations. At that point, server load normalised and Gateway Server caches became available, which significantly sped up the connection process for the remaining disconnected gateways until service was fully restored.

In short: the AWS infrastructure event triggered the gateway disconnects, but the timeout misconfiguration is what made the recovery take up to 8 hours.
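
The retry cycle described above can be illustrated with a toy model. This is only a sketch for intuition, not a description of how the Gateway Server actually behaves: the function name, the latency formula and every number in it (cold and warm handshake cost, per-gateway load penalty, length of the warm-up period) are hypothetical and were not measured during the incident. The model assumes that handshake latency grows with the number of gateways still trying to connect, and that it drops once the restarted instances have finished warming up.

```python
def rounds_to_recover(timeouts_s, warmup_rounds, cold_s=20.0, warm_s=1.0,
                      per_pending_s=0.005, max_rounds=1000):
    """Return the retry round in which the last gateway connects, or None.

    All parameters are illustrative assumptions:
      timeouts_s     connection timeout (seconds) of each disconnected gateway
      warmup_rounds  rounds during which the restarted instances are still cold
      cold_s/warm_s  base handshake latency before/after warm-up
      per_pending_s  extra latency added per gateway still reconnecting
    """
    pending = list(timeouts_s)
    for round_no in range(1, max_rounds + 1):
        base = cold_s if round_no <= warmup_rounds else warm_s
        latency = base + per_pending_s * len(pending)
        # A gateway whose timeout is shorter than the current handshake
        # latency gives up and has to retry in the next round.
        pending = [t for t in pending if t < latency]
        if not pending:
            return round_no
    return None
```

With these made-up numbers, 1000 gateways using a 60-second timeout all reconnect in the first round, while 1000 gateways using a 10-second timeout keep failing until the warm-up is over and only reconnect in round 11.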

Resolution

There was no manual intervention to resolve this incident. The affected gateways reconnected automatically after the downtime.

Prevention / Long-term improvements

Proactive outreach to affected tenant owners regarding gateway connection timeout

A minimum 60-second timeout is necessary for reliable connection establishment under high server load conditions. We will be reaching out to affected tenant owners, recommending that the TC_TIMEOUT setting in their Basic Station configuration is set to at least the default value of 60s. This change will help prevent premature connection drops during periods of elevated reconnect activity.
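
As a rough aid for checking a gateway against this recommendation, the sketch below tests whether an already-parsed Basic Station configuration meets the 60-second minimum. It rests on several assumptions that should be verified against the Basic Station documentation: that TC_TIMEOUT sits inside the "station_conf" object of station.conf, that its value is a string with a unit suffix such as "60s" or "2m", and that an unset value falls back to the 60s default mentioned above. The helper names are hypothetical.

```python
import re

# Assumed unit suffixes for duration strings; check the Basic Station docs.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}


def parse_duration_s(value):
    """Convert a duration string such as '60s' or '2m' to seconds."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(ms|s|m|h|d)\s*", value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    return float(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def tc_timeout_ok(config, minimum_s=60.0, default_s=60.0):
    """Return True if TC_TIMEOUT is at least minimum_s seconds.

    `config` is the station.conf content already parsed into a dict.
    A missing TC_TIMEOUT is treated as the default (assumed to be 60s).
    """
    raw = config.get("station_conf", {}).get("TC_TIMEOUT")
    timeout_s = default_s if raw is None else parse_duration_s(raw)
    return timeout_s >= minimum_s
```

For example, tc_timeout_ok({"station_conf": {"TC_TIMEOUT": "10s"}}) returns False, while a configuration that sets "2m" or leaves the setting out returns True.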

Documentation improvements

Existing documentation will be improved to specifically recommend a longer TC_TIMEOUT setting in the Basic Station configuration.

Infrastructure improvements

Our Cloud infrastructure configuration will be improved to reduce instance load after a restart and to better accommodate higher load when it occurs.

Posted Mar 05, 2026 - 11:19 CET

Resolved

This incident has been resolved.
Posted Mar 05, 2026 - 11:16 CET

Monitoring

Affected gateways have reconnected and service has recovered without intervention. We are monitoring for any recurrence and continuing to investigate the root cause.
Posted Mar 04, 2026 - 03:28 CET

Investigating

We are observing gateway disconnects for some tenants in NAM1. This seems to be triggered by an AWS update procedure that has affected the Gateway Server component. We are currently investigating the issue.
Posted Mar 03, 2026 - 21:08 CET
This incident affected: The Things Stack Cloud (North America 1 (nam1)).