Understanding Error Handling in Integrated Solutions: Making Things Robust


Integration engineering, regardless of the architectural style (EAI, microservices, SOA, accidental, etc.), must ensure the end solution is robust and reliable. These are two key attributes of any solution, and both depend on handling the scenarios where things do not go exactly as planned when two systems talk to each other over a network.

Software Quality Attributes

Planning for unexpected user behaviour, data quality, internal/external system reliability, network reliability, etc. is often difficult when turning a problem statement into a solution. In our daily human-to-human communications we encounter a variety of situations where conversations fail due to issues with the medium, language, context, etc., and we have either been trained to handle them or learn to handle them.

Similarly, we must do the same when we model our human business processes in software and build distributed features that live in various systems across a network and rely on system-to-system or human-to-system interactions.

Systems integration vs general communication

The key to designing distributed software for reliability and robustness is, first, to understand the information produced in these situations. That information tells us about the unexpected behaviour, which we call faults or exceptions in our systems. The second step is to model our software and put the right rules in place to react to those exceptions and handle them in the best possible manner. These reactions must be considered in advance and can be automatic or human-reliant, but they must be observable.
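To make the idea of "rules decided in advance" a little more concrete, here is a minimal sketch in Python. The fault kinds, the `Reaction` values and the `RULES` table are all hypothetical names chosen for illustration; a real solution would derive them from its own faults. Every reaction is logged so that it stays observable, and anything without a rule falls back to a human.

```python
import logging
from dataclasses import dataclass
from enum import Enum

logger = logging.getLogger("integration.faults")


class Reaction(Enum):
    RETRY = "retry"        # automatic reaction
    REJECT = "reject"      # automatic reaction: hand the problem back to the sender
    ESCALATE = "escalate"  # human-reliant reaction


@dataclass(frozen=True)
class Fault:
    kind: str    # hypothetical fault category, e.g. "network_timeout"
    detail: str  # free-text information produced by the failing interaction


# Rules considered in advance: fault kind -> reaction (illustrative values only).
RULES = {
    "network_timeout": Reaction.RETRY,
    "invalid_data": Reaction.REJECT,
}


def react(fault: Fault) -> Reaction:
    """Look up the planned reaction; unknown faults are escalated to a human."""
    reaction = RULES.get(fault.kind, Reaction.ESCALATE)
    # Log every fault and the chosen reaction so the behaviour is observable.
    logger.warning("fault kind=%s reaction=%s detail=%s",
                   fault.kind, reaction.value, fault.detail)
    return reaction
```

Defaulting to `ESCALATE` is a design choice in this sketch: a fault nobody planned for is exactly the one a person should look at.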

Planning for unexpected behaviour

Breaking down the problem space

As mentioned earlier, the problem space for handling unreliability is quite large, and we need a mental model to handle it better. One strategy I have seen integration engineering teams employ is to handle exceptions for real-time communication differently from those for messaging-based communication, i.e. to treat them as two different error handling contexts.

Thus broadly speaking we have two contexts:

  1. Synchronous
  2. Asynchronous

Synchronous exception handling

Synchronous communications happen in real-time and are either request/response or one-way

Synchronous communication in the “real” world

These communications are either a command (create, update, submit, lodge, etc.) or a query (get by ID, search, etc.). Exceptions in real-time communications are more visible than in message-based communications because the feedback to the end user is immediate.

In the real world, we observe these in a range of forms, from a simple access-denied error (“you are not authorised to perform an action”) to data validation errors (“invalid Medicare number”) to more complex and cryptic errors related to systems and data (“unable to process right now”).

One way to think about these types of communication and their exception handling is to imagine speaking to someone over the phone. The caller / callee interaction happens synchronously: there is a request for information followed by a response. The requestor may ask questions (query) or order the other party to do something (command). These types of interactions are a sub-class of synchronous communication called Synchronous Request/Response. Error handling here is typically done by the caller.

Synchronous Request/Response

Another form of synchronous communication is when we submit a form to request something or state a claim (here are my details, and I would like to _____). This is a synchronous interaction between the caller and the callee, yet it results in an overall one-way interaction because the transaction ends when the form is submitted. The person taking the form may not be the one resolving the request immediately. Thus, this is a sub-class called synchronous one-way interaction. If the error is superficial, i.e. related to the form fields and mandatory items, the caller has to fix it. Everything else is handled by the callee, including guaranteeing the delivery of the form, handling information issues with the form, etc.
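The split of responsibility in a one-way interaction can be sketched roughly as follows. This is only an illustration under stated assumptions: `REQUIRED_FIELDS`, `submit_form` and the in-memory `pending` queue are hypothetical, and a real callee would use a durable queue or store to guarantee delivery. Superficial errors go straight back to the caller; anything that passes is accepted and becomes the callee's problem.

```python
import queue
import uuid

# Hypothetical mandatory form fields, for illustration only.
REQUIRED_FIELDS = ("name", "request_type")

# Stand-in for a durable queue owned by the callee; in-memory here for brevity.
pending = queue.Queue()


def submit_form(form: dict) -> tuple[int, dict]:
    """Accept a form for later processing, or reject it for superficial errors."""
    missing = [field for field in REQUIRED_FIELDS if not form.get(field)]
    if missing:
        # Superficial error: the caller has to fix the form and resubmit.
        return 400, {"error": "missing mandatory fields", "fields": missing}
    # Accepted: the transaction ends here for the caller. From this point on,
    # delivery and any deeper information issues are the callee's responsibility.
    receipt = str(uuid.uuid4())
    pending.put({"receipt": receipt, "form": form})
    return 202, {"receipt": receipt}
```

Returning a receipt with the 202 gives the caller something to quote later if the callee needs to follow up about the request.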

In summary, here are some key things to remember when handling synchronous exceptions:

  1. Make all synchronous exceptions observable and trackable. Care about the frequency and amplitude of these over time to measure your reliability
  2. Handle exceptions due to availability of the service or a method: Synchronous interactions happen explicitly between a caller and a callee. The caller establishes a connection and requests something. The caller therefore needs to handle exceptions related to being able to call the service (HTTP 301, 405, 500, 501)
  3. Handle exceptions due to availability of the medium or network: Synchronous interactions are sensitive to the communication medium (noisy room, unreliable internet network, poor phone connection, etc.) and the caller needs to handle these (HTTP 502, 503, 504, etc.)
  4. Handle request/response and one-way exceptions differently: Synchronous request/response and one-way interactions produce different types of exceptions and require different service qualities. Request/response leans on reliability (HTTP 4xx etc.), while one-way interactions depend on robustness (how do we return HTTP 202 and avoid a lot of HTTP 504s?)
  5. Ensure you are clear on who is responsible for handling the exception: caller or callee? In most cases it is the caller, since they initiate the call, unless the connection is established by the caller to publish information and the interaction is a one-way channel. In a one-way interaction the callee handles the exceptions (mostly)
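As a rough illustration of points 1 to 3, the sketch below groups HTTP status codes using the same groupings as the list above and counts every exception so its frequency can be tracked over time. The category names, the `classify` and `record` functions and the in-memory `Counter` are assumptions for the example; in practice the counts would go to whatever metrics tooling the team already uses.

```python
from collections import Counter

# Groupings taken from the list above; the category labels are illustrative.
SERVICE_CODES = {301, 405, 500, 501}  # availability of the service or a method
MEDIUM_CODES = {502, 503, 504}        # availability of the medium or network

# In-memory tally standing in for a real metrics backend.
exception_counts = Counter()


def classify(status: int) -> str:
    """Map an HTTP status code to a coarse exception category."""
    if status in SERVICE_CODES:
        return "service"
    if status in MEDIUM_CODES:
        return "medium"
    if 400 <= status < 500:
        return "request"  # the request itself was rejected
    if status < 400:
        return "ok"
    return "other"


def record(status: int) -> str:
    """Classify a response and count it unless it was successful."""
    category = classify(status)
    if category != "ok":
        exception_counts[category] += 1
    return category
```

Calling `record` on every response gives a simple per-category count, which is the raw material for watching frequency and amplitude over time.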

Putting it together: Handling Synchronous Exceptions

As a Service Consumer

A sample error handling model for a service consumer during a synchronous request/response session
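From the consumer's side, one possible way to sketch this in code is a small retry wrapper. It assumes that the medium-related codes mentioned earlier (HTTP 502, 503, 504) are worth retrying with a backoff, while everything else fails fast so the caller can fix the request or report the problem. The `call_with_retry` function, the `ServiceCallError` type and the shape of the `call` argument are hypothetical, and the network call is injected so the sketch does not depend on any particular HTTP library.

```python
import time
from typing import Callable

# Assumption for this sketch: only medium/network failures are retried.
RETRYABLE = {502, 503, 504}


class ServiceCallError(Exception):
    """Raised when the call fails with a non-retryable status or retries run out."""

    def __init__(self, status: int, attempts: int):
        super().__init__(f"call failed with HTTP {status} after {attempts} attempt(s)")
        self.status = status
        self.attempts = attempts


def call_with_retry(
    call: Callable[[], tuple[int, str]],  # returns (HTTP status, response body)
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = call()
        if 200 <= status < 300:
            return body
        if status not in RETRYABLE or attempt == max_attempts:
            # Fail fast on request or service errors, or give up after the last try.
            raise ServiceCallError(status, attempt)
        # Exponential backoff before the next attempt: 0.5s, 1s, 2s, ...
        sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter is only there to make the wrapper easy to exercise without real delays.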

As a Service Provider

A sample error handling model for a service provider during a synchronous request/response session
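On the provider's side, a minimal sketch might translate internal failures into responses the caller can act on, while logging anything unexpected so it remains observable. The exception types, the `handle_request` wrapper and the status code choices (400, 403, 500) are assumptions for illustration; the messages reuse the examples from earlier in this post.

```python
import logging

logger = logging.getLogger("provider")


class ValidationError(Exception):
    """Hypothetical error: the request data is invalid and the caller can fix it."""


class NotAuthorised(Exception):
    """Hypothetical error: the caller may not perform this action."""


def handle_request(handler, request: dict) -> tuple[int, dict]:
    """Run a handler and map its failures to an HTTP-style (status, body) pair."""
    try:
        return 200, handler(request)
    except ValidationError as exc:
        # The caller's problem: say exactly what to fix.
        return 400, {"error": str(exc)}
    except NotAuthorised:
        return 403, {"error": "you are not authorised to perform this action"}
    except Exception:
        # The provider's problem: log the full detail, return a generic message.
        logger.exception("unhandled provider fault")
        return 500, {"error": "unable to process right now"}
```

Keeping the internal detail in the log rather than in the response is the design choice here: the caller gets a message they can act on, and the provider keeps the information needed to investigate.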

Handling Asynchronous Exceptions

… to be continued

Summary

Distributed solutions involving human-to-system and system-to-system interactions require reliability and robustness as key quality attributes. Exception handling is key, and understanding the contexts helps in designing better exception handling.

Synchronous communication exceptions happen in real time, and the different sub-contexts within synchronous communication lead to different error handling responsibilities.

Along with the contexts, we looked at a couple of samples of how to handle synchronous exceptions.

Next time, we will dive into the world of asynchronous communication and discuss sub-contexts and techniques for building robust and reliable solutions.
