Webhook failures in the TTSC eu1 cluster

Incident Report for The Things Industries

Postmortem

On October 7-8, 2025, the webhook ingestion service in our eu1 cluster experienced a degradation that led to an increase in processing errors. Processing retries delayed data delivery and, in a small number of cases, data was lost after too many retries occurred.

The issue was caused by a lock that, under certain conditions, could be held indefinitely because it was configured without an automatic expiration. This created a chain reaction where subsequent processes would wait for the lock while holding their Redis connections open. The accumulation of these waiting processes likely caused a rapid increase in Redis connections, which in turn impacted the performance of dependent services.
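
As an illustration of the failure mode described above, the following is a minimal sketch of a lock that always carries an expiration and of a waiter that gives up after a bounded time. It is not the code that runs in The Things Stack Cloud: the `LeaseLock` class, the `acquire_within` helper, and the key and owner names are hypothetical, and an in-memory dictionary stands in for Redis so the example is self-contained. With Redis, a comparable pattern is commonly built on `SET` with the `NX` and `PX` options.

```python
import time
from typing import Callable, Dict, Tuple


class LeaseLock:
    """Hypothetical in-memory stand-in for a lock store (not the real service code).

    Every lock is stored with an expiry time, so a holder that never
    releases cannot block other workers forever.
    """

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # lock name -> (owner, expires_at)
        self._held: Dict[str, Tuple[str, float]] = {}

    def now(self) -> float:
        return self._clock()

    def acquire(self, name: str, owner: str, ttl: float) -> bool:
        """Try once to take the lock; returns immediately instead of blocking."""
        now = self.now()
        current = self._held.get(name)
        if current is not None and current[1] > now:
            return False  # still held and not yet expired
        self._held[name] = (owner, now + ttl)
        return True

    def release(self, name: str, owner: str) -> bool:
        """Release only if the caller is still the recorded owner."""
        current = self._held.get(name)
        if current is None or current[0] != owner:
            return False
        del self._held[name]
        return True


def acquire_within(
    lock: LeaseLock,
    name: str,
    owner: str,
    ttl: float,
    max_wait: float,
    poll: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll for the lock, but give up once max_wait has elapsed.

    A bounded wait lets the caller return its connection to the pool
    instead of holding it open indefinitely.
    """
    deadline = lock.now() + max_wait
    while True:
        if lock.acquire(name, owner, ttl):
            return True
        if lock.now() >= deadline:
            return False
        sleep(poll)
```

In this sketch, a worker that crashes while holding the lock blocks others for at most `ttl` seconds, and a waiting worker fails fast after `max_wait` rather than accumulating alongside other waiters. Both values are assumptions that would need tuning against real processing times.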

Service was restored after we deployed an update that adjusted the locking logic. To improve future reliability, our follow-up actions include reviewing similar code patterns and running heavier load tests on the webhook-related services.

Posted Oct 13, 2025 - 18:29 CEST

Resolved

This incident has been resolved.

Posted Oct 09, 2025 - 04:21 CEST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Oct 08, 2025 - 12:57 CEST

Investigating

We are investigating reports about webhook failures in the TTSC eu1 cluster.

Posted Oct 08, 2025 - 12:12 CEST

This incident affected: The Things Stack Cloud (Europe 1 (eu1)).