Webhook failures in the TTSC eu1 cluster

Incident Report for The Things Industries

Postmortem

On October 7-8, 2025, the webhook ingestion service in our eu1 cluster experienced a degradation that led to an increase in processing errors. This resulted in delayed data delivery while processing was retried and, in a small number of cases, data loss once the retries were exhausted.

The issue was caused by a lock that, under certain conditions, could be held indefinitely because it was configured without an automatic expiration. This created a chain reaction where subsequent processes would wait for the lock while holding their Redis connections open. The accumulation of these waiting processes likely caused a rapid increase in Redis connections, which in turn impacted the performance of dependent services.
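As an illustration of the pattern only, and not the code of the affected service, the following minimal sketch shows a lock that always carries an expiration and a waiter that gives up after a bounded time instead of waiting indefinitely. It uses an in-memory stand-in rather than a real Redis client; the names `ExpiringLockStore`, `try_acquire`, `release` and `acquire_with_deadline` are hypothetical. With Redis itself, comparable acquire semantics are commonly obtained with `SET key value NX PX <milliseconds>`.

```python
import time
import uuid


class ExpiringLockStore:
    """In-memory stand-in for a lock store (hypothetical, not a Redis client)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._locks = {}  # key -> (owner token, expiry timestamp)

    def try_acquire(self, key, ttl_seconds):
        """Return an owner token if the lock was taken, otherwise None."""
        now = self._clock()
        held = self._locks.get(key)
        if held is not None and held[1] > now:
            return None  # still held and not yet expired
        # Either free or expired: take it, always with an expiry.
        token = uuid.uuid4().hex
        self._locks[key] = (token, now + ttl_seconds)
        return token

    def release(self, key, token):
        """Release the lock only if the caller still owns it."""
        held = self._locks.get(key)
        if held is not None and held[0] == token:
            del self._locks[key]
            return True
        return False


def acquire_with_deadline(store, key, ttl_seconds, wait_seconds,
                          poll_seconds=0.01,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll for the lock, but give up after wait_seconds."""
    deadline = clock() + wait_seconds
    while True:
        token = store.try_acquire(key, ttl_seconds)
        if token is not None:
            return token
        if clock() >= deadline:
            return None  # caller can free its resources and retry later
        sleep(poll_seconds)
```

In this sketch a holder that never releases the lock blocks others only until the expiry passes, and a waiter that returns `None` can hand back its connection rather than keep it open while blocked.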

Service was restored after we deployed an update that adjusted the locking logic. To improve future reliability, our follow-up actions include reviewing similar code patterns and running heavier load tests on the webhook-related services.

Posted Oct 13, 2025 - 18:29 CEST

Resolved

This incident has been resolved.
Posted Oct 09, 2025 - 04:21 CEST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Oct 08, 2025 - 12:57 CEST

Investigating

We are investigating reports about webhook failures in the TTSC eu1 cluster.
Posted Oct 08, 2025 - 12:12 CEST

This incident affected: The Things Stack Cloud (Europe 1 (eu1)).