“Is the site up?”
There is something to be said about the value of simple monitoring tools; what may seem initially unremarkable can, in the right situations, quickly convey a great deal of information to a wide variety of people. Working in TechOps, we’re not strangers to the ominous question: “Is the site down right now?” We should be able to quickly filter out false positives and provide non-technical members of the organization with easily discernible information regarding the current state of the system.
Tick tick tick…
A tool that can go a long way in fulfilling these goals is a clock…yes, a clock.
We drew inspiration from stock exchanges like NYSE and NASDAQ. Trading is powered by software applications called “matching engines” that process trading data. Ops teams at exchanges monitor the engines to ensure new messages are being generated via “clocks” that update a few thousand times per second. When potential issues are being investigated, anyone can look at the clocks and verify that the matching engines are still sequencing messages. If a matching engine experiences some kind of critical failure, it is likely that the associated clock will stop updating or update less frequently.
Displaying a clock ticking away on a screen, in a large font, not only provides glanceable information from a distance, but can also answer vague/scary questions like, “Are we down right now?” It can also provide a good starting point of information when deep diving into a larger, more complex issue.
What should we measure?
While candidate metrics might include web traffic or database operations, both have too many other variables that might fog uptime clarity. We wanted to follow the money, so we ended up going with sales order creation on each of our web application frontends, which include our US and Canadian websites as well as our internal retail point of sale application. If any of these frontends are indeed “down,” sales order creation should grind to a halt. However, if new orders continue to come in, it means the bare minimum of our stack is functioning properly enough to accept the orders.
Our first attempt at a sales order clock was written in Python and fetched data from one of our read-only database replicas. This was problematic because the replica would experience lag during parts of the day, causing the clock to display inaccurate data.
On our second attempt, we opted to intercept StatsD (a server and protocol developed by Etsy to transmit and receive application metrics) data as it was being received by our Graphite host (a time series database), search for sales order counters, and display a timestamp for each sales order as it is received. We wrote a Python script to do this which utilized Scapy to intercept the StatsD packets from the network interface and TkInter to display the timestamps.
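As a rough illustration of the "search for sales order counters" step, here is a minimal sketch of picking counters out of a StatsD datagram. StatsD counters arrive as newline-separated lines of the form `name:value|c`, optionally followed by a sample rate such as `|@0.1`. The metric prefix `sales_order.created.` and the function names are hypothetical; the post does not say what the real metrics are called.

```python
# Hypothetical metric-name prefix; the real metric names are not given in the post.
SALES_ORDER_PREFIX = "sales_order.created."


def parse_counters(payload: bytes):
    """Yield (metric_name, value) for every counter line in a StatsD datagram."""
    for line in payload.decode("utf-8", errors="replace").splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "name:value|type" line
        fields = rest.split("|")
        if len(fields) < 2 or fields[1] != "c":
            continue  # timers, gauges, and sets are ignored
        try:
            value = float(fields[0])
        except ValueError:
            continue  # malformed value
        yield name, value


def sales_order_frontends(payload: bytes):
    """Return the frontends that recorded at least one new order in this datagram."""
    return [
        name[len(SALES_ORDER_PREFIX):]
        for name, value in parse_counters(payload)
        if name.startswith(SALES_ORDER_PREFIX) and value > 0
    ]
```

The sample rate is deliberately ignored here: the clock only needs to know that an order arrived, not how many.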
That approach lasted until we realized a “repeater” can be configured in StatsD, which sends a copy of the data to another IP address and port. This eliminated the need to intercept the packets from the network interface using Scapy. So, we scrapped Scapy and added Twisted to create a network server that listens for the redirected StatsD traffic.
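The listening side can be sketched roughly as follows. This is not our Twisted code: to stay dependency-free, the sketch swaps in the standard library's `asyncio.DatagramProtocol`, which plays a similar role to Twisted's datagram protocol class. The port `8126`, the metric prefix, and all names are assumptions for illustration.

```python
import asyncio
import time

# Hypothetical metric-name prefix and port; neither is given in the post.
PREFIX = "sales_order.created."
LISTEN_PORT = 8126


class OrderClockProtocol(asyncio.DatagramProtocol):
    """Record the arrival time of the latest sales order counter per frontend."""

    def __init__(self, last_seen):
        self.last_seen = last_seen  # dict: frontend name -> UNIX timestamp

    def datagram_received(self, data, addr):
        for line in data.decode("utf-8", errors="replace").splitlines():
            name, _, rest = line.partition(":")
            # Keep only counter lines ("value|c") whose name carries our prefix.
            if name.startswith(PREFIX) and rest.split("|")[1:2] == ["c"]:
                self.last_seen[name[len(PREFIX):]] = time.time()


async def listen(last_seen, host="127.0.0.1", port=LISTEN_PORT):
    """Bind a UDP socket that receives the traffic copied by the StatsD repeater."""
    loop = asyncio.get_running_loop()
    transport, _ = await loop.create_datagram_endpoint(
        lambda: OrderClockProtocol(last_seen), local_addr=(host, port)
    )
    return transport


async def main():
    last_seen = {}
    transport = await listen(last_seen)
    try:
        await asyncio.Event().wait()  # run until cancelled
    finally:
        transport.close()
```

Calling `asyncio.run(main())` starts the listener; a display loop would then read `last_seen` to draw the timestamps.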
And thus, we ended up with the current clock script. It displays the current time and the time of the last sales order received from each of our web application frontends. Each timestamp lights up in bright green for four seconds when a new sales order is received and goes gray if no new orders are received for a certain amount of time (a threshold that we occasionally adjust).
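The coloring rule described above can be sketched as a small pure function. The four-second highlight comes from the description; the state names, the `"normal"` fallback color, and the function signature are assumptions made for illustration.

```python
HIGHLIGHT_SECONDS = 4.0  # how long a fresh order stays bright green


def display_state(last_order_at, now, stale_after):
    """Return "green", "normal", or "gray" for one frontend's timestamp.

    last_order_at: UNIX time of the latest order, or None if none seen yet.
    now:           current UNIX time.
    stale_after:   seconds without orders before the timestamp goes gray.
    """
    if last_order_at is None:
        return "gray"  # assumption: nothing received yet is treated as stale
    age = now - last_order_at
    if age >= stale_after:
        return "gray"
    if age < HIGHLIGHT_SECONDS:
        return "green"
    return "normal"
```

For example, with a hypothetical threshold of 300 seconds, an order seen 2 seconds ago yields `"green"`, one seen 10 seconds ago yields `"normal"`, and one seen 400 seconds ago yields `"gray"`.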
The clock has been helpful in all the ways we anticipated, but also in a few ways we didn’t foresee. During actual outages (i.e. when one or more of the timestamps go gray), when actions are taken to triage the system, we can quickly see if those actions were effective in resolving the outage or not. We can also easily quantify the severity of an outage to anyone outside of the TechOps team by glancing at the clocks and stating how long it’s been since we’ve received a new sales order.
The latest iteration of our clock script is serving its purpose admirably, but that doesn’t mean we can’t make it better. Here are some ideas we have for improving it further:
- Apply this idea to other metrics that could be useful. For example, create a timestamp update for whenever we transmit orders to any of our logistics providers.
- Create better thresholds for some of our frontends that regularly experience low sales order activity (e.g. our retail frontend whenever our retail stores are closed).
- Add sound effects whenever a time threshold is exceeded.
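One way the per-frontend threshold idea might look, as a sketch only: the store hours, the default threshold, and the frontend name `"retail"` are all made-up values, and returning `None` to mean "do not gray out" is just one possible convention.

```python
from datetime import datetime, time as dtime

# All values below are hypothetical placeholders.
DEFAULT_THRESHOLD = 300                       # seconds without orders
STORE_OPEN, STORE_CLOSE = dtime(10, 0), dtime(20, 0)


def stale_threshold(frontend, now):
    """Return seconds of silence before graying out, or None to suppress it."""
    if frontend == "retail" and not (STORE_OPEN <= now.time() < STORE_CLOSE):
        return None  # stores are closed, so silence is expected
    return DEFAULT_THRESHOLD
```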
This exercise demonstrates the power of simple tools and shows when a custom tool fits the job. In fact, it is perhaps because the initial idea is so simple that no existing monitoring framework offered it; yet this simple solution proved to be the best way to accomplish our goals.