Chaos Engineering for Cloud-Native Apps
Chaos engineering is the discipline of proactively experimenting on distributed systems to build confidence in their ability to withstand production failures. In this episode, Chris is joined by Ashish Balgath (Cloud Solution Architect at Thoughtworks) to explore why resilience testing requires a fundamentally different mindset to traditional unit testing.
Ashish explains how chaos engineering can be thought of as a fire drill for software: practising failure scenarios in a controlled environment so that teams build the muscle memory they need to respond quickly when real incidents occur. The conversation covers the key prerequisites (mature observability, health checks, and structured logging) and why they must be in place before chaos experiments can be introduced safely.
The discussion moves into practical tooling, including a live demo of stopping a virtual machine and validating that Azure Traffic Manager correctly routes traffic to a healthy region. Chaos Monkey and other fault-injection simulator tools are also discussed. The episode closes with actionable tips for getting started: begin small, define a clear hypothesis before each experiment, run experiments from development environments through to production-like environments, and always roll back introduced faults at the end of each run.
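The experiment loop described above can be sketched in a few lines of Python: confirm a steady-state hypothesis, inject a fault, re-check the hypothesis while the fault is active, and always roll the fault back. This is a minimal, self-contained simulation under stated assumptions, not the demo from the episode. `Region`, `PriorityRouter`, `run_experiment` and the region names are hypothetical, and the router only loosely mimics priority-based failover; it does not call Azure Traffic Manager or any other real cloud API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Region:
    """Hypothetical region served by a single VM; healthy while the VM runs."""
    name: str
    vm_running: bool = True

    def health_check(self) -> bool:
        # Stand-in for a real health probe against the region's endpoint.
        return self.vm_running


@dataclass
class PriorityRouter:
    """Toy failover router: send traffic to the first healthy region in priority order.

    Loosely mimics priority-based routing; it does not call Azure Traffic Manager.
    """
    regions: list[Region] = field(default_factory=list)

    def route(self) -> str:
        for region in self.regions:
            if region.health_check():
                return region.name
        raise RuntimeError("no healthy region available")


def run_experiment(
    hypothesis: Callable[[], bool],
    inject_fault: Callable[[], None],
    roll_back: Callable[[], None],
) -> bool:
    """Run one chaos experiment; roll_back always runs once injection has started."""
    # Verify the steady-state hypothesis before breaking anything.
    if not hypothesis():
        raise RuntimeError("steady state not met; not injecting fault")
    try:
        inject_fault()
        # Re-check the same hypothesis while the fault is active.
        return hypothesis()
    finally:
        # Runs whether the check passes, fails, or raises.
        roll_back()


# Hypothetical example: stop the primary region's VM and check that
# traffic is still served by a healthy region.
primary = Region("primary-region")
secondary = Region("secondary-region")
router = PriorityRouter([primary, secondary])


def traffic_served_by_healthy_region() -> bool:
    target = router.route()
    return any(r.name == target and r.health_check() for r in router.regions)


held = run_experiment(
    hypothesis=traffic_served_by_healthy_region,
    inject_fault=lambda: setattr(primary, "vm_running", False),
    roll_back=lambda: setattr(primary, "vm_running", True),
)
print(held, router.route())  # True primary-region
```

The `try`/`finally` is the key design choice here: the fault is removed even if the hypothesis check raises, which reflects the advice to always roll back introduced faults at the end of each run. Starting with a single, reversible fault such as one stopped VM also matches the "begin small" tip.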
Related Content

Beyond Monitoring: The Rise of Observability Platform
As systems grow in complexity across distributed architectures and microservices, traditional monitoring is no longer sufficient to maintain reliability and user experience. Observability goes beyond monitoring by correlating logs, metrics, and traces to rapidly pinpoint root causes across hybrid and multi-cloud landscapes. In this episode, Chris is joined by Samir Pradka, Enterprise Architect at Artos, to explore how organisations can build an observability platform incrementally, leverage AIOps for predictive analytics, and implement self-healing infrastructure using tools like Ansible and Azure Resource Manager.

How to be successful with monitoring in Azure
Monitoring is often an afterthought — until something breaks. In this episode, Chris is joined by Vanessa Bruwer, Senior Engineer on Microsoft's FastTrack for Azure team, to explore how organisations can build a structured observability strategy using Azure Monitor, Application Insights, Log Analytics, and distributed tracing. Vanessa shares the FastTrack methodology for taking teams from zero monitoring knowledge to self-sufficient Azure Monitor configuration, covering alerting strategy, metrics, and the differences between monitoring a VM versus a distributed microservice architecture.

CGN2 - Cloud Gaming Notes Episode 2 - Matchmaking Services
Ever thought about what it takes to host a multiplayer game in the cloud? In the second episode of Cloud Gaming Notes, Chris and Lee Williams go hands-on with Halo 5: Guardians to explore the engineering behind matchmaking services. They cover the Actor model and Azure Service Fabric, skill-based matchmaking algorithms, the critical role of latency in competitive gaming, and how live ops and DevOps principles keep a game-as-a-service continuously updated without downtime. Real-world cloud architecture through the lens of AAA gaming.