A Journey in Observability

  • eBPF Tracepoints: Gaining Access to the TCP State Machine

    My current research focus at UTS is the inner workings of TCP Congestion Control, which, as you might guess, requires some detailed insight into the Linux TCP State Machine. While there have been significant improvements in the TCP Congestion Control Modules used by Linux (Reno -> NewReno -> BIC -> CUBIC), the state machine…

  • XDP and eBPF for Network Observability with Python

    I’ve been playing with XDP and eBPF in my lab to see if it might be possible to create NetFlow/IPFIX-style flow logs for network observability purposes. Of course this is possible, but is it achievable in a few hours for the average Joe? In my previous article I discussed what eBPF…

  • Alert Fatigue: Why Too Many Alerts Can be Disastrous!

    Alert fatigue is a problem I’ve encountered many times in IT Operations, especially as monitoring sprawl increases the number of tools we use to gain additional insights into our SLOs (Service Level Objectives). Those on the front line who receive the alerts will begin to drown in the information and overlook the important…

  • Implementing Enterprise Observability for Success Review

    Implementing Enterprise Observability for Success by Manisha Agrawal and Karun Krishnannair takes a novel approach to implementing observability in the enterprise. Unlike other books that focus on how to get data into a specific system, this book looks at it from an architectural and IT management perspective and is largely vendor-agnostic (which I…

  • Termshark: Command Line Wireshark for the Win!

    I was recently working on a headless server trying to troubleshoot an issue with Linux Bridging and IPTables and needed to understand where my packets were getting dropped. Traditionally in this situation I would run a tcpdump (with aggressive filters) and either watch the output or take a PCAP and scp the file to my…

  • Mastering Python Networking Review

    I came across Mastering Python Networking by Eric Chou about a month ago on Twitter and immediately purchased it. I was excited to see a book on programming targeted at people with a networking background, as being able to automate becomes critical to scaling networks and reducing toil. To say I’m a fan of this book…

  • SRE: Five Ways to Build a Blameless Culture

    One of the main pillars of SRE (Site Reliability Engineering) is to introduce a blameless culture; however, building this takes more than just words. You can’t build a blameless culture by talking about culture, because culture is the result of changes in processes and structures within an organisation. Here are my top 5 ways to…

  • Performance Diagnostics Part 6: 5 SRE Practices to Minimise Toil

    One of the core tenets of SRE is to minimise toil to increase resiliency and improve digital experience. SRE (Site Reliability Engineering) is a practice that Google created in the 2000s to improve the performance of the “Site”, with the site being Google Search. Google, being made up of some incredibly smart people, defied…

  • Digital Experience in an Opaque World – MilCIS 2023

    Thanks for attending the Observe Ability talk on Digital Experience in an Opaque World at MilCIS 2023. Observability drives digital experience, as we cannot improve what we cannot measure. Fix our Computers! Link to Twitter Letter. Link to Defense Business Board Report. Convergence of Logs and Traces: Link to Article. Site Reliability Engineering Next Generation…

  • Unlocking Cybersecurity Excellence: A Guide to the Australian Cyber Essential Eight

    The Australian Cyber Essential Eight is a general framework targeted at Australian enterprise and government, and is widely considered one of the better frameworks for hardening systems against, and limiting the damage from, the most common kinds of attack. Most Australian government entities must implement the Essential Eight strategy, developed by the Australian Cyber Security Centre and…