sre · chaos-engineering · devops · python · resilience

Embracing the Chaos - Automated Recovery in 45 Seconds

2026-01-16 · 1 min read

Breaking Production on Purpose

"Reliability" isn't about never failing. It's about how fast you recover when you do fail.

Today, I moved from simple "Monitoring" to Chaos Engineering. Instead of crossing my fingers and hoping my Redis queue stays alive, I wrote a Python script (chaos.py) that actively hunts down and kills the Redis container during peak load. The system fixed itself in 45 seconds.
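
For illustration, here is a minimal sketch of what a script like chaos.py could look like. It assumes the attack shells out to the docker CLI and that the target is the redis_store container. The function names (build_kill_command, kill_container) are hypothetical, and the real script in the repo may work differently.

```python
import subprocess


def build_kill_command(container: str, signal: str = "KILL") -> list:
    """Return the docker CLI argv that sends `signal` to `container`."""
    return ["docker", "kill", "--signal", signal, container]


def kill_container(container: str, run=subprocess.run) -> bool:
    """Kill a container and report whether the docker CLI exited cleanly.

    `run` is injectable so the function can be exercised without Docker;
    by default it is subprocess.run.
    """
    cmd = build_kill_command(container)
    result = run(cmd, capture_output=True, text=True, check=False)
    return result.returncode == 0
```

Calling kill_container("redis_store") would run `docker kill --signal KILL redis_store`. Keeping the command builder separate from the call makes it easy to dry-run the attack before pointing it at anything real.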


The Experiment

  1. The Attack - My script took down the redis_store container at 18:32.
  2. The Impact - The Grafana snapshot below shows the queue depth flatlining. The heartbeat stopped.
  3. The Resilience - I sat back and watched. The orchestration layer (Docker) detected the health check failure and automatically spun up a fresh instance.
  4. The Result - The system was back online in 45 seconds with zero human intervention.
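
A recovery time like the one in step 4 can also be measured by the script itself rather than read off a dashboard. The sketch below is one way to do that, not necessarily how the number above was obtained. It assumes Redis is reachable on localhost:6379 and treats an open TCP port as "healthy". The helper names (tcp_port_open, wait_for_recovery) are hypothetical.

```python
import socket
import time


def tcp_port_open(host: str = "localhost", port: int = 6379, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (assumed health signal)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_recovery(is_healthy, timeout: float = 120.0, interval: float = 1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll `is_healthy` until it returns True.

    Returns the elapsed seconds at the first healthy probe, or None if
    `timeout` seconds pass without one. `clock` and `sleep` are injectable
    so the loop can be tested without real waiting.
    """
    start = clock()
    while clock() - start < timeout:
        if is_healthy():
            return clock() - start
        sleep(interval)
    return None
```

Right after the attack, `wait_for_recovery(tcp_port_open)` would block until the port answers again and return the downtime in seconds. The result is only as precise as the polling interval, and an open port is a weaker signal than a successful Redis PING.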

[Figure: Grafana Chaos Recovery Snapshot]


Key Takeaway

Why do this? Because if you wait for a real outage to test your recovery strategy, it is already too late.

Stack: Python, Docker, Redis, Grafana

GitHub: github.com/ntjrrvarma/log-sentinel