“Knowing the quality of your services at any given moment in time before your customers do and using this information to continuously improve customer experience is part of modern software delivery and critical to the success of organisations”
In this post, we explain why it is important to observe and manage our systems and solutions proactively, and we survey the various mechanisms available for observing and reacting. We discuss distributed services observability through monitoring, logging, tracing and contextual heatmaps
TL;DR: 5 key takeaways in this post
- Observing distributed system components is key to managing chaos
- Observation types vary based on context and requirements
- Monitoring, Alerting, Logging, Auditing, Tracing and Historical views are types of observations we can make
- Observability is available out-of-the-box in platforms (AWS, Mulesoft etc) and in 3rd party products (Dynatrace, AppDynamics, New Relic etc). Some capabilities you still need to bake into your API or microservices framework or code to achieve über monitoring
- Use observations not just to react now but also to improve and evolve your technology platform
Why
Thanks to efficient software delivery practices we are delivering more integrated solution features and bolting on more integrated systems to accelerate digital transformation. This means a lot of old internal systems and external services are being wired onto shiny new enterprise services over a standard platform to enable the flow of data back and forth
Keeping the lights on is simply not enough then, we need to know if the fuse is going to blow before the party guests get here!
Businesses therefore need to
- proactively observe their systems for fraud and malicious activity
- watch and act through both active and passive means
- regularly interact with their systems as their users would to discover faults before users do
- track a single interaction end-to-end over simple and complex transactions for faster resolution of complaints and issues
- and evolve their features by listening to their systems over a period of time
Observation contexts and approach
I have, over time, realised that how we observe depends on what we want to observe and when. There are multiple ways to observe; most of us are familiar with terms like Monitoring, Alerting, Logging, Distributed Tracing etc., but these are useful within an observation context. These contexts include real-time (active or passive), incident management, historical analysis etc.
Let us look at some of these contexts in detail:
- If we want to know at any instant whether the platform or services are up or down, then we use a Monitoring approach
- If we want to be notified of our monitored components hitting some threshold (CPU, heap, response time etc.) then we use Alerting
- If we want the system to take some action based on monitoring thresholds (scale-out, deny requests, circuit-break etc.) then we use Alert Actioning
- If we want more contextual, focussed deep-dive for tracking an incident or defect then we use Logging and Tracking (with tracking IDs)
- If we want to track activity (user or system) due to concerns around information security or privacy then we implement Log Auditing
- If we want to detect bottlenecks, find trends, look for hot spots, improve and evolve the architecture etc. then we use Historical Logs, Distributed Tracing and Contextual flow maps
Monitoring
Monitoring enables operators of a system to track metrics and know its status at any given point in time. It can be provided via out-of-the-box plugins or external products and enabled at all levels of an integrated solution: bottom-up from the platform to services, and side-to-side from a client-specific service to domain services, system services etc
A key thing to note here is that monitoring in the traditional sense was driven by simply “watching the moving parts”, but with modern monitoring products we can “interact” with the services as a “hypothetical user” to detect issues before real users do. This is called synthetic transaction monitoring and in my experience it has been invaluable in delivering a proactive response to incidents and improving customer experience (a minimal probe sketch follows the examples below)
For example:
- Cloud Service Provider Monitoring: AWS Monitoring offers monitoring of its cloud platform and the AWS services [ Example: https://docs.aws.amazon.com/step-functions/latest/dg/procedure-cw-metrics.html ]
- Platform As A Service (PaaS) Provider Monitoring: Mulesoft offers an “Integration Services” platform as a service and provides monitoring for its on-prem or cloud-offerings which includes monitoring for the platform and its runtime components (mule applications) [Example: https://www.mulesoft.com/platform/api/monitoring-anypoint]
- Monitoring Products: Products like New Relic, Dynatrace, AppDynamics etc. work great if your enterprise spans a variety of cloud or on-prem services, needs a centralised monitoring solution and requires advanced features such as synthetic transactions, custom plugins etc
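Below is a minimal sketch of a synthetic transaction probe, assuming a hypothetical /health endpoint and an illustrative 500 ms latency budget; commercial products script far richer user journeys, but the principle is the same:

```python
# synthetic_probe.py - a minimal synthetic transaction probe (sketch).
# The endpoint URL and latency budget below are illustrative assumptions.
import time
import urllib.request

ENDPOINT = "https://api.example.com/health"  # hypothetical health endpoint
LATENCY_BUDGET_SECONDS = 0.5                 # assumed response-time SLO

def alert(message: str) -> None:
    # Placeholder: wire this to your paging or alerting channel
    print(f"[ALERT] {ENDPOINT}: {message}")

def probe_once() -> None:
    started = time.monotonic()
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=5) as response:
            elapsed = time.monotonic() - started
            if response.status != 200:
                alert(f"unexpected status {response.status}")
            elif elapsed > LATENCY_BUDGET_SECONDS:
                alert(f"slow response: {elapsed:.3f}s")
    except Exception as exc:  # timeout, DNS failure, connection refused etc.
        alert(f"probe failed: {exc}")

if __name__ == "__main__":
    while True:            # behave like a "hypothetical user" on a schedule
        probe_once()
        time.sleep(60)     # probe every minute
```

Run a probe like this from outside your network as well as inside it, so you see what your customers see and not just what your data centre sees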
Alerting and Actions
Alerting allows users to be notified when monitored resources cross a threshold (or trip some management rule). Alerting depends on monitoring and is a proactive approach to knowing how systems are performing at any point in time
While alerts can be great, they can quickly overwhelm a human if there are too many. One strategy is for the system to take automatic action when an alert threshold is reached and let the human know it has done something to mitigate the situation (see the circuit-breaker sketch after this list). For example:
- If the API is overloaded (504 – Gateway Timeout) but still processing requests, then spin up a new instance of the component to serve the API from a new runtime
- If a downstream service has gone down (503 – Service Unavailable) or is timing out (408 – Request Timeout) then trip the circuit breaker, i.e. return 504 from this API
- If there is a known issue with the runtime heap memory which causes the application to become unresponsive every 20ish hours, then start a new instance when the heap reaches a certain threshold and restart this service
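As a rough illustration, here is a minimal circuit-breaker sketch; the thresholds and the alert hook are assumptions, and a production system would use a battle-tested library rather than hand-rolled code:

```python
# circuit_breaker.py - alert actioning sketch: after repeated downstream
# failures, open the circuit and notify a human, instead of paging on
# every single error. Thresholds here are illustrative assumptions.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failures = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        # While open, short-circuit callers (the API would return 504 here)
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_seconds:
                return False
            # Half-open: allow one request through to probe the downstream
            self.opened_at = None
            self.failures = 0
        return True

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
            # The automatic action has been taken; now tell the human
            print("[ALERT] circuit opened: downstream failing, requests short-circuited")

    def record_success(self) -> None:
        self.failures = 0
```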
A sample Dynatrace dashboard is shown below, with the status of microservices and metrics over time per instance
Logging, Auditing and Transaction Tracking
This tells us about a specific functional context at a point in time and is provided by logging solutions across our microservices and end systems. Generally, this type of information is queried from the logs using a transaction ID or some customer detail, and this happens after an issue or defect is detected in the system. This is achieved through logging or distributed tracing
Logging:
- Use log levels – DEBUG, INFO, ERROR – and at each level log only what you need, to avoid log streams filling up quickly and a call from your friendly enterprise logging team (a small sketch follows this list)
- Avoid logging personally identifiable information (PII) such as name, email, phone, driver’s licence etc – imagine this was your data flowing through someone’s logs, what would you like them to store and see?
- Log HTTP method and path if your framework does not do that by default
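Here is a small sketch of these logging practices using Python’s standard logging module; the logger name and the redaction pattern are illustrative, and a single regex is nowhere near a complete PII filter:

```python
# logging_practices.py - log levels plus a simple PII redaction step (sketch).
# The logger name and redaction pattern are illustrative assumptions.
import logging
import re

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s %(message)s")
log = logging.getLogger("orders-api")

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(text: str) -> str:
    """Mask email addresses before they reach the log stream."""
    return EMAIL_PATTERN.sub("<redacted-email>", text)

log.debug("request payload validated")          # DEBUG: verbose, off in prod
log.info("POST /orders accepted")               # INFO: method and path
log.error("order failed for %s",                # ERROR: failures, PII masked
          redact("customer jane@example.com"))
```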
Auditing:
- Is logging user actions to track access, especially to protected resources
- Involves logging information about “who”, “when” and “which resource”
- Is compact and concise to enable faster detection (the less noise in the logs, the better)
- Is usually separate from functional logs, but can be combined if it suits (a compact sketch follows this list)
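A minimal audit-record sketch is below; the field names and file destination are assumptions, the point being a compact who/when/which-resource entry kept apart from functional logs:

```python
# audit_log.py - compact audit records: who, when, which resource (sketch).
# Field names and the log destination are illustrative assumptions.
import json
import logging
from datetime import datetime, timezone

audit = logging.getLogger("audit")
audit.setLevel(logging.INFO)
audit.addHandler(logging.FileHandler("audit.log"))  # separate from app logs

def audit_access(user_id: str, action: str, resource: str) -> None:
    audit.info(json.dumps({
        "who": user_id,
        "when": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "resource": resource,
    }))

audit_access("user-42", "READ", "/customers/1001")
```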
Tracking:
- Useful for looking at things end-to-end, from the user interface to the backend systems
- Uses trackingIDs to track transactions with each point forwarding the trackingID to the next point downstream
- Each downstream point must respond back with the same trackingID to close the loop
- The entry point, i.e. the service client (mobile app, web app etc), must generate the trackingID. If this is not feasible then the first service accepting the request must generate this unique ID and pass it along (a minimal propagation sketch follows this list)
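The sketch below shows the propagation rule in Python; the header name "X-Tracking-ID" is an assumption, and any header agreed across your services will do:

```python
# tracking_id.py - propagate a tracking ID across service hops (sketch).
# The header name is an assumption; agree on one across your services.
import uuid
import urllib.request

TRACKING_HEADER = "X-Tracking-ID"

def resolve_tracking_id(incoming_headers: dict) -> str:
    """Reuse the caller's tracking ID, or mint one if we are the entry point."""
    return incoming_headers.get(TRACKING_HEADER) or str(uuid.uuid4())

def call_downstream(url: str, tracking_id: str):
    # Forward the same ID so every hop logs against one transaction
    request = urllib.request.Request(url, headers={TRACKING_HEADER: tracking_id})
    response = urllib.request.urlopen(request, timeout=5)
    # The downstream point must echo the ID back to close the loop
    assert response.headers.get(TRACKING_HEADER) == tracking_id
    return response
```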
Heatmaps and historical views
This type of view is constructed by looking at long-term data across a chain of client-microservices-provider interactions. Think of a heatmap of flows and errors which emerges over time through traces in the system. This information is obviously available only after a number of interactions, and it is highly useful in building strategies to detect bottlenecks in the solution and improve service quality for consumers
A historical view with heatmaps is achieved through aggregated logs overlaid on visual flow maps, grouped by some processID or scenarioID
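As a crude illustration, the aggregation behind such a heatmap can be sketched in a few lines; the record shape (source, target, status) is an assumed log schema, and real tools derive it from distributed traces:

```python
# flow_heatmap.py - build a crude flow heatmap from trace records (sketch).
# The (source, target, status) record shape is an assumed log schema.
from collections import Counter

# Each record represents one hop of a traced transaction, from aggregated logs
records = [
    {"source": "web-app", "target": "orders-api", "status": "OK"},
    {"source": "orders-api", "target": "payments-api", "status": "ERROR"},
    {"source": "orders-api", "target": "payments-api", "status": "OK"},
]

traffic = Counter((r["source"], r["target"]) for r in records)
errors = Counter((r["source"], r["target"]) for r in records
                 if r["status"] == "ERROR")

for (source, target), total in traffic.most_common():
    # Edges with high error ratios are the "hot spots" on the flow map
    print(f"{source} -> {target}: {total} calls, {errors[(source, target)]} errors")
```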
One example of this is the view below, from a tool called Camunda Cockpit. Camunda is a lightweight embedded BPMN engine used for orchestrating services in a distributed transaction context (learn more from Bernd Rucker here: https://blog.bernd-ruecker.com/saga-how-to-implement-complex-business-transactions-without-two-phase-commit-e00aa41a1b1b)
Summary
- Observing distributed system components is important for operational efficiency
- Observation types vary based on observation context and need
- Monitoring, Alerting, Logging, Auditing, Tracing and Historical views are types of observations we can make
- Observability is available out-of-the-box in platforms (AWS, Mulesoft etc) and in 3rd party products (Dynatrace, AppDynamics, New Relic etc). Some capabilities you still need to bake into your API or microservices framework or code to achieve über monitoring
- Use observations not just to react now but also to improve and evolve your technology platform