On October 7-8, 2025, the webhook ingestion service in our eu1 cluster experienced a degradation that increased processing errors. As a result, data delivery was delayed while processing was retried and, in a small number of cases, data was lost after too many retries occurred.
The issue was caused by a lock that was configured without an automatic expiration and, under certain conditions, could therefore be held indefinitely. This set off a chain reaction: subsequent processes waited for the lock while holding their Redis connections open. The accumulation of these waiting processes likely caused a rapid increase in Redis connections, which in turn degraded the performance of dependent services.
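To illustrate the failure mode described above, the following is a minimal sketch of a lock that expires automatically and whose waiters give up after a bounded time. It is not the code that was deployed. `InMemoryLockStore`, `acquire_lock`, and all parameter names are hypothetical, and the in-memory store only stands in for Redis semantics along the lines of `SET key value NX PX ttl`, so that the example runs without a server.

```python
import time
import uuid


class InMemoryLockStore:
    """Hypothetical stand-in for Redis: set-if-absent with a TTL, plus owner-checked delete."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (owner token, expiry timestamp)

    def set_if_absent(self, key, token, ttl_seconds):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return False  # lock is held and has not expired yet
        # Either no holder exists or the previous holder's TTL has elapsed.
        self._entries[key] = (token, now + ttl_seconds)
        return True

    def delete_if_owner(self, key, token):
        entry = self._entries.get(key)
        if entry is not None and entry[0] == token:
            del self._entries[key]
            return True
        return False  # never release a lock that someone else now owns


def acquire_lock(store, key, ttl_seconds, wait_seconds, poll_seconds=0.05,
                 clock=time.monotonic, sleep=time.sleep):
    """Return an owner token, or None if the lock was not free within wait_seconds."""
    token = uuid.uuid4().hex
    deadline = clock() + wait_seconds
    while True:
        if store.set_if_absent(key, token, ttl_seconds):
            return token
        if clock() >= deadline:
            return None  # give up so the caller can free its connection
        sleep(poll_seconds)
```

With both limits in place, a holder that never releases the lock blocks others for at most `ttl_seconds`, and a waiter occupies its connection for roughly `wait_seconds` at most before returning `None`. Suitable values depend on how long the protected work normally takes, which this summary does not state.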
Service was restored after we deployed an update that adjusted the locking logic. To improve future reliability, our follow-up actions include reviewing similar code patterns and running additional heavy-load tests on the webhook-related services.