Evolution of Observability in Software Engineering
Observability is not a new concept. In the driest possible definition, observability, when applied to software engineering, measures the ability to understand what is going on inside a system based on the signals and data points it already emits. In plain English, that means: when something weird happens, can you tell precisely why without adding more console.log() debug code?
At the beginning of my software development career, I was assigned an application implemented using a client-server model. We had a strict rule of "you touch it, you own it," making me responsible for the entire development cycle of any written code, from design, implementation, testing, and deployment to support and maintenance. When something went wrong, my phone rang quickly. Back then, remote tools had not yet gained wide acceptance, and popping into the client's desktop session to see what was going on was impossible. Any problem had to be solved using a mental picture of the entire system, some logging information, and a few answers to my questions supplied by non-technical personnel.
I quickly learned that my modules needed to do a better job of telling me what was going on. The application was easy to debug, and fixes were trivial once the problem was apparent. Determining what was wrong took the longest.
It would have been a bad look to keep providing clients with patches that output more and more debug statements in the logs. Talk about a loss of confidence! Instead, my startup logs were massive, recording all kinds of environmental conditions and marking loading stages. Is the network up? Was the database password changed, and nobody told this user? Did someone change the security on the shared folder? Was loading step 1 completed?
Creating huge logs significantly improved the speed of finding and fixing issues and grew my skills as a developer. If the user reported a frozen application, but my log indicated that the network connection step was attempted and failed, it was reasonable to suspect that perhaps someone had unplugged the Ethernet cable. It also taught me not to assume that the network is always available (see distributed computing fallacy #1).
Later, as I was working on a multilayer architecture, where requests were routed between multiple distributed components, the problem of observability presented itself again. It was no longer possible to use brute-force debugging methods on the entire system - at best, I could debug one component at a time, but doing so changed the timing, and as a result, the problem often migrated elsewhere.
Writing large logs would still work, but it required merging multiple log sources, hoping the clocks were synchronized across all distributed components, and then spending hours looking for patterns. Presented with odd JVM freezes, I added metrics to the suspect modules. Each function in question would print how long it took to run and the amount of available memory at the beginning and end of processing.
Modules wrote metrics to a separate file that could then be analyzed for inconsistencies or memory spikes and graphed to spot the problem quickly. While it was apparent what went wrong (the JVM froze and crashed), it wasn't obvious why and where. Metrics gave me a quicker way of finding bottlenecks and hotspots to review further.
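Those modules were Java, but the idea translates to any stack. Here is a minimal sketch in Python of that kind of per-function metric: a wrapper that appends each call's duration and peak traced memory to a separate metrics file. The file name and the example function are purely illustrative.

```python
import time
import tracemalloc
from functools import wraps

METRICS_FILE = "metrics.log"  # hypothetical destination for the metric lines

def timed(func):
    """Record how long a call took and how much memory it touched."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            with open(METRICS_FILE, "a") as f:
                f.write(f"{time.time():.0f} {func.__name__} "
                        f"elapsed={elapsed:.3f}s peak_mem={peak}B\n")
    return wrapper

@timed
def process_batch(items):
    # Stand-in for the suspect business logic being measured.
    return [item.upper() for item in items]
```

Graphing the resulting file over time is enough to spot the function whose runtime or memory footprint keeps creeping up.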
While the multilayer architecture was complicated, it was still running on two separate pieces of hardware, both easily accessible. I could connect to them via a terminal and inspect CPU load, view and parse the logs, and watch console messages.
Enter serverless, distributed microservices, or even a simple load-balanced system with multiple nodes. Now metrics and logs are not enough. The differences between production and development environments keep growing (it's unlikely that dev would have the same caliber and volume of nodes), so some problems may not be visible until the code change is deployed and used. And distributed architecture means that, mathematically, the number of possible combinations of states leading to strange and disastrous events keeps growing.
When is a disaster about to happen?
One way to measure the maturity of a team (and product) is to ask who discovers the most defects and problems. When clients report the problem, it's already too late: they are already affected by the outage. Instead, application and infrastructure metrics should alert you when something odd is happening and when a functional failure is about to occur. If you are a software developer, "throwing the code over the fence" and blaming QA teams for undiscovered issues is no longer an acceptable approach.
Reaching this maturity means collecting - and analyzing - mountains of data. It is no longer feasible to look through multiple logs by hand, and because of the volume of collected data, pattern recognition by humans becomes almost impossible.
Observability vs. Monitoring
And that's where the main difference between the two concepts lies: dashboards can be filled with all available data (pulled from infrastructure and software execution), but past a certain point of complexity, the amount of data no longer makes sense on its own and cannot answer the question of what exactly happened during the odd event that shows up only 3.75% of the time.
Monitoring refers to the ability to view collected metrics over a long period. Monitoring can alert you that request times are increasing and may be able to answer the question of "where have my requests slowed down," or at least point out a possible culprit.
Observability is an attribute of the entire system. If I am reviewing a slowdown by looking at event logs but am missing network data, database pool stats, and so on, the final determination of why something failed is impossible without going back in time and collecting more data.
It is relatively easy to review a few logs and determine what went wrong in monolithic or straightforward systems. Yet I find myself saying, "Huh? That makes absolutely no sense" more and more as the complexity of the software we develop and deploy increases.
Events for which there is no explanation, even if they occurred only once, are major red flags showing a lack of observability. Such unexplained events indicate there is insufficient data to build a hypothesis, and they can be a symptom of impending failure.
Deciding what to observe
It may be tempting to collect all the data, from infrastructure to data storage to individual software components.
The downsides are the transport costs of moving data across the network; additional processes to test, maintain, update, and configure; and training to ensure software instrumentation is done correctly, analyzed well, and produces the expected results.
I would start with the most critical systems. What keeps you awake at night? What system, if it goes down, would cost your company the most?
Observability implementation workflow
A typical implementation may do the following (a minimal sketch follows the list):
- Collect the data (log events or metrics) for various targets: infrastructure, execution jobs, databases, software modules. Multiple OS and language stacks should be supported.
- Aggregate, filter, enrich the data. Provide the ability to create custom data sets.
- Push or pull data to storage. Have the ability to discover new targets for data collection.
- Allow searching and filtering of data in storage via a query language.
- Visually represent collected data for human review in the dashboard format.
- Have alert functionality that warns the humans when specific data points fall outside predefined limits.
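As a toy illustration only (not a production design, and with purely hypothetical names), the whole workflow can be compressed into a few lines: collect and enrich a data point, store it, query it, and alert when it falls outside a predefined limit.

```python
import time
from collections import defaultdict

STORE = defaultdict(list)               # metric name -> [(timestamp, value, labels)]
LIMITS = {"request_latency_ms": 500}    # predefined alert thresholds

def collect(name, value, **labels):
    """Collect and enrich a single data point, then push it to storage."""
    labels.setdefault("host", "web-01")     # enrichment step
    STORE[name].append((time.time(), value, labels))
    check_alerts(name, value, labels)

def query(name, since):
    """Filter stored points by name and time window (a stand-in for a query language)."""
    return [p for p in STORE[name] if p[0] >= since]

def check_alerts(name, value, labels):
    """Warn a human when a data point falls outside a predefined limit."""
    limit = LIMITS.get(name)
    if limit is not None and value > limit:
        print(f"ALERT: {name}={value} exceeds {limit} ({labels})")

# Usage: a request handler reports its latency.
collect("request_latency_ms", 130, route="/checkout")
collect("request_latency_ms", 820, route="/checkout")   # triggers the alert
print(query("request_latency_ms", since=time.time() - 60))
```

Real platforms split these steps across agents, storage engines, and dashboards, but the shape of the pipeline is the same.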
Terminology
It always helps to describe the basic vocabulary.
Data definition: Logs vs. Metrics vs. Distributed Tracing
Before we start collecting data, it's useful to figure out what kind of data it should be. Logs, metrics, and tracing are often described as the three pillars of observability, and to me, that means you may need to collect all three for high observability.
What are the differences between them?
- Logs are verbose, take a lot of space, and collect large amounts of information emitted by the system. They are typically high cardinality, meaning the percentage of unique data points in the set is high. That allows looking into the specific details of an event but makes aggregation more difficult. Logs record what happens at the application or service level and answer the question of why a specific service is broken.
- Distributed tracing stitches together the data generated by multiple services by associating a unique trace id with each incoming request. While verbose, it provides the most holistic view of the system, especially in a microservices environment, by following requests as they flow between different nodes. Tracing can answer the question of which service is broken or slowing down.
- Metrics are data points, a key-value pair, often summarized and grouped. Metrics are low cardinality (typically a single numeric value per key) and cannot be drilled into to get the specifics of a particular event. Metrics are great for predicting trends, noticing large patterns, formulating strategy, and figuring out where an infrastructure problem may be located.
When looking at a web page request, distributed tracing provides timing information on how the request was routed between different backend services. An application log may record the user-agent, request parameters, time of the request, incoming IP, and server response. A metric may store the number of requests from a given IP or the average response time.
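Here is a small sketch of how the same web request might look through each of the three lenses; the field names and formats are illustrative and not tied to any particular library.

```python
import json, time, uuid

request = {"path": "/search", "ip": "203.0.113.7", "user_agent": "Mozilla/5.0"}
trace_id = str(uuid.uuid4())    # shared by every service touching this request

# Log: verbose, high cardinality, one record per event.
log_line = json.dumps({"ts": time.time(), "trace_id": trace_id,
                       "level": "INFO", "msg": "request handled",
                       "status": 200, **request})

# Trace: one timed span per service hop, stitched together by the trace_id.
span = {"trace_id": trace_id, "service": "frontend",
        "operation": "GET /search", "start": time.time(), "duration_ms": 42}

# Metric: a low-cardinality counter, aggregated rather than stored per event.
requests_total = {("GET", "/search", "200"): 1}

print(log_line, span, requests_total, sep="\n")
```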
Data collection and transport: Telemetry
Telemetry involves recording and transporting data readings from the originating source to the final data storage destination. Data collectors and exporters may store intermediate results in memory, bursting them out to data storage at specific intervals to improve performance.
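A minimal sketch of that buffering behavior, with a hypothetical send_to_storage() standing in for a real exporter backend: readings accumulate in memory and are shipped in bursts, either on a timer or when the buffer fills up.

```python
import time

class BufferedExporter:
    def __init__(self, flush_interval=10.0, max_buffer=500):
        self.buffer = []
        self.flush_interval = flush_interval
        self.max_buffer = max_buffer
        self.last_flush = time.monotonic()

    def record(self, reading):
        """Buffer a reading; flush when the interval elapses or the buffer is full."""
        self.buffer.append(reading)
        due = time.monotonic() - self.last_flush >= self.flush_interval
        if due or len(self.buffer) >= self.max_buffer:
            self.flush()

    def flush(self):
        if self.buffer:
            send_to_storage(self.buffer)   # hypothetical network call
            self.buffer = []
        self.last_flush = time.monotonic()

def send_to_storage(batch):
    print(f"shipping {len(batch)} readings")

exporter = BufferedExporter(flush_interval=5.0)
exporter.record({"metric": "cpu_load", "value": 0.42, "ts": time.time()})
exporter.flush()   # force the final burst on shutdown
```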
Data Storage: Time Series Database
Just like NoSQL databases are optimized for storing documents, Time Series Databases are best suited for storing large amounts of data over a specified time frame, often saving space by keeping the differences between consecutive values rather than the values themselves. Time Series Databases also provide query language support to filter and search the stored data.
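A toy illustration of that space-saving idea (real Time Series Databases use far more sophisticated encodings): store the first value, then only the differences between consecutive readings, which tend to be small numbers that compress well.

```python
readings = [1000, 1002, 1001, 1005, 1005, 1009]

def delta_encode(values):
    # Keep the first value, then only the change from the previous reading.
    deltas = [values[0]]
    deltas += [curr - prev for prev, curr in zip(values, values[1:])]
    return deltas

def delta_decode(deltas):
    values = [deltas[0]]
    for d in deltas[1:]:
        values.append(values[-1] + d)
    return values

encoded = delta_encode(readings)          # [1000, 2, -1, 4, 0, 4]
assert delta_decode(encoded) == readings  # lossless round trip
```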
Observability tools
While creating your own platform is always a possibility, there is a wide variety of tools available that fit the most common use cases.
Open Source: TICK Stack
The TICK stack uses Telegraf for telemetry, InfluxDB as the Time Series Database (which can be queried using InfluxQL), Chronograf for visual dashboards and graph building, and Kapacitor for generating alerts via multiple channels, like Slack or PagerDuty.
InfluxDB uses a push model, which works better for environments behind a firewall reporting to a cloud metrics server.
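For a flavor of the push model, here is a hedged sketch assuming the InfluxDB 1.x Python client (the influxdb package), a local server, and an already-created database named "metrics": the application itself sends each data point to the server.

```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="metrics")

point = {
    "measurement": "cpu_load",
    "tags": {"host": "web-01"},      # low-cardinality labels
    "fields": {"value": 0.64},       # the actual reading
}
client.write_points([point])         # the application pushes to the server
```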
Open Source: Prometheus
Prometheus provides many exporters and integrations for telemetry, the PromQL query language, Alertmanager for notifications, and easy integration with Grafana for visualization.
Prometheus uses a pull model (while providing the ability to push, which is discouraged for most use cases).
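A short sketch of the pull model using the official Python client (prometheus_client); the metric names and port are illustrative. The application only exposes an HTTP /metrics endpoint, and the Prometheus server scrapes it on its own schedule.

```python
import random, time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["route"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency", ["route"])

def handle_request(route):
    with LATENCY.labels(route).time():       # observe how long the work takes
        time.sleep(random.uniform(0.01, 0.1))
    REQUESTS.labels(route).inc()

if __name__ == "__main__":
    start_http_server(8000)                  # serves metrics at :8000/metrics
    while True:
        handle_request("/search")
```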
Open Source: OpenTelemetry
OpenTelemetry is the successor to OpenCensus, which was developed by Google and released as open source (OpenTelemetry also merges in the OpenTracing project). OpenCensus was meant to solve integration complexity by providing a single language-agnostic API for instrumentation. OpenTelemetry is currently in Beta and has limited language support.
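A hedged sketch of what instrumentation with the OpenTelemetry Python API looks like; module paths and setup details have shifted between releases, so treat this as an outline rather than a copy-paste recipe.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire the vendor-neutral API to an SDK that prints spans to the console;
# swapping the exporter is how a different backend gets plugged in.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# The instrumentation code stays the same regardless of the backend in use.
with tracer.start_as_current_span("checkout"):
    with tracer.start_as_current_span("charge-card"):
        pass  # business logic goes here
```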
Commercial: New Relic, Datadog, Honeycomb.io
It's effortless to go from nothing to collecting data in just a few minutes with commercial offerings. All solutions are hosted, meaning no maintenance and updates to the servers - while still allowing for custom integrations. Something to watch out for is cost: while open-source implementations are definitely not free once you take hosting resources and maintenance into account, they make it easier to control costs as more endpoints are needed.
Small-step improvements
The first real goal is to have "some" observability. But throwing more data into storage does not automatically mean your system is now better. If it doesn't answer the question of "what just happened in the field?", then you have simply created a new project to maintain with a very low ROI. Start small, evaluate, and adjust.