Integration engineering, regardless of the architectural style (EAI, microservices, SOA, accidental etc.), must ensure the end solution is robust and reliable. These two key attributes of any solution depend on handling the scenarios where things do not go exactly as planned when two systems talk to each other over a network.

Planning for unexpected user behaviour, data quality, internal/external system reliability, network reliability etc. is often difficult when turning a problem statement into a solution. In our daily human-to-human communications we encounter a variety of situations where conversations fail due to issues with the medium, language, context etc., and we have been trained to handle them or we learn to handle them.
Similarly, we must do the same when we model our human business processes in software and build distributed features that live in various systems across a network and rely on system-to-system or human-to-system interactions.

The key to designing distributed software for reliability and robustness is, first, to understand the information produced in these situations. This information tells us about the unexpected behaviour, which we call faults or exceptions in our systems. The second is to model our software and put the right rules in place to react to these exceptions and handle them in the best possible manner. These reactions must be considered in advance; they can be automatic or rely on a human, but they must be observable.

Breaking down the problem space
As mentioned earlier, the problem space for handling unreliability is quite large, and we need a mental model to handle it better. One strategy I have seen integration engineering teams employ is to handle exceptions for real-time communication differently from those for messaging-based communication, i.e. they are two different error handling contexts.
Thus broadly speaking we have two contexts:
- Synchronous
- Asynchronous
Synchronous exception handling
Synchronous communications happen in real-time and are either request/response or one-way.
These communications are either a command (create, update, submit, lodge etc.) or a query (get by ID, search etc.). Exceptions in real-time communications are more visible than in message-based communications because the feedback to the end user is immediate.
In the real world, we observe these in a range of ways, from a simple authorisation error (“you are not authorised to perform an action”) to data validation errors (“invalid Medicare number”) to more complex and cryptic errors related to systems and data (“unable to process right now”).
One way to think about these types of communication and their exception handling is to imagine speaking to someone over the phone. The caller / callee interaction happens synchronously: there is a request for information followed by a response. The requestor may ask questions (query) or order the other party to do something (command). This type of interaction is a sub-class of synchronous communication called synchronous request/response. Error handling here is typically done by the caller.

Another form of synchronous communication is when we submit a form to request something or state some claim (here are my details, and I would like to _____). This is a synchronous interaction between the caller and the callee, yet it results in an overall one-way interaction because the transaction ends when the form is submitted. The person taking the form may not be the one resolving the request, and the request may not be resolved immediately. This sub-class is called synchronous one-way interaction. If the error is superficial, i.e. related to the form fields and mandatory items, then the caller has to fix it. Everything else is handled by the callee, including guaranteeing the delivery of the form, handling information issues with the form etc.
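
The split of responsibilities in a one-way interaction can be sketched roughly as follows. This is only an illustration under assumptions: the field names, the in-memory queue and the `submit_form` function are hypothetical, and a real callee would use a durable store rather than `queue.Queue`.

```python
import queue
import uuid

# Hypothetical mandatory fields for the example form.
MANDATORY_FIELDS = ("name", "medicare_number", "request_type")

# In-memory stand-in for a durable queue owned by the callee (assumption).
pending_forms = queue.Queue()


def submit_form(form: dict) -> tuple[int, dict]:
    """Accept a one-way submission and return (HTTP status, body)."""
    missing = [field for field in MANDATORY_FIELDS if not form.get(field)]
    if missing:
        # Superficial error: the caller must fix the form and resubmit.
        return 400, {"error": "missing mandatory fields", "fields": missing}

    # From here on the callee owns the form and any later exceptions.
    receipt = str(uuid.uuid4())
    pending_forms.put({"receipt": receipt, "form": form})
    # 202 Accepted: the form was received but has not been processed yet.
    return 202, {"receipt": receipt}
```

The caller only ever sees the 400 (fix and resubmit) or the 202 with a receipt. Anything that goes wrong while the queued form is processed later is the callee's problem to detect and resolve.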

In summary, here are some key things to remember when handling synchronous exceptions:
- Make all synchronous exceptions observable and trackable. Track the frequency and amplitude of these over time to measure your reliability.
- Handle exceptions due to availability of the service or a method: Synchronous interactions happen explicitly between a caller and a callee. The caller establishes a connection and requests something. The caller therefore needs to handle exceptions related to being able to call the service (HTTP 301, 405, 500, 501).
- Handle exceptions due to availability of the medium or network: Synchronous interactions are sensitive to the communication medium (noisy room, unreliable internet network, poor phone connection etc.) and the caller needs to handle these (HTTP 502, 503, 504 etc.).
- Handle request/response and one-way exceptions differently: Synchronous request/response and one-way interactions produce different types of exceptions and require different service qualities. Request/response leans on reliability (HTTP 4xx etc.) while one-way interactions depend on robustness (how do we return HTTP 202 rather than a lot of HTTP 504s?).
- Ensure you are clear on the responsibility for handling exceptions: caller or callee? In most cases it is the caller, since they initiate the call, unless this is a connection established by the caller to publish information and the interaction is a one-way channel. In a one-way interaction the callee handles the exceptions (mostly).
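
To make the first few points concrete, here is a small sketch that sorts HTTP status codes into the groups used in the list above and counts them over time. The grouping simply mirrors this article's list and is one possible reading, not a standard; the names `classify`, `record` and `exception_counts` are made up for the example.

```python
from collections import Counter

# Grouping taken from the list above (an assumption, not a standard).
SERVICE_CODES = {301, 405, 500, 501}   # cannot call the service or method
MEDIUM_CODES = {502, 503, 504}         # medium or network trouble

# Running tally of exceptions per category, for observability.
exception_counts = Counter()


def classify(status: int) -> str:
    """Map an HTTP status code to an exception category."""
    if status in MEDIUM_CODES:
        return "medium"
    if status in SERVICE_CODES:
        return "service"
    if 400 <= status < 500:
        return "request"   # remaining 4xx: the caller sent something wrong
    if 200 <= status < 300:
        return "ok"
    return "other"


def record(status: int) -> str:
    """Classify a response and count it if it is an exception."""
    category = classify(status)
    if category != "ok":
        exception_counts[category] += 1
    return category
```

Feeding every response through `record` gives a simple frequency view per category, which is the minimum needed to see whether reliability is improving or degrading.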
Putting it together: Handling Synchronous Exceptions
As a Service Consumer
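
A minimal sketch of the consumer side, assuming the call itself is wrapped in a hypothetical `send` function that returns an HTTP status code. Transient, medium-related failures (502, 503, 504) are retried with exponential backoff, while everything else is surfaced to the caller immediately. The names `call_with_retry` and `ServiceCallError` are illustrative only.

```python
import time
from typing import Callable

# Treated here as transient, medium-related failures (assumption).
RETRYABLE = {502, 503, 504}


class ServiceCallError(Exception):
    """Raised when the call fails and should not (or can no longer) be retried."""

    def __init__(self, status: int, attempts: int):
        super().__init__(f"call failed with HTTP {status} after {attempts} attempt(s)")
        self.status = status
        self.attempts = attempts


def call_with_retry(send: Callable[[], int],
                    max_attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it succeeds, retrying only transient failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status = send()
        if 200 <= status < 300:
            return status
        if status not in RETRYABLE or attempt == max_attempts:
            # Not transient, or out of attempts: hand the error to the caller.
            raise ServiceCallError(status, attempt)
        # Exponential backoff: base_delay, then 2x, 4x, ...
        sleep(base_delay * 2 ** (attempt - 1))
```

Retrying a query is usually harmless. Retrying a command is only safe when the provider treats repeated requests as idempotent, so that decision should be made per operation rather than globally.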

As a Service Provider
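
A minimal sketch of the provider side for a request/response interaction, assuming a hypothetical `handle` wrapper around the business operation. It translates known exceptions into responses the caller can act on, hides internal details behind a generic message, and logs everything with a correlation ID so exceptions stay observable. The exception classes and the `correlation_id` field are assumptions made for the example.

```python
import logging
import uuid

logger = logging.getLogger("provider")


class ValidationError(Exception):
    """The request data is invalid; the caller can fix it and resend."""


class NotAuthorised(Exception):
    """The caller is not allowed to perform this action."""


def handle(request: dict, operation) -> tuple[int, dict]:
    """Run operation(request) and translate any exception into a response."""
    correlation_id = request.get("correlation_id") or str(uuid.uuid4())
    try:
        result = operation(request)
        return 200, {"correlation_id": correlation_id, "result": result}
    except ValidationError as exc:
        logger.warning("validation failed [%s]: %s", correlation_id, exc)
        return 400, {"correlation_id": correlation_id, "error": str(exc)}
    except NotAuthorised as exc:
        logger.warning("not authorised [%s]: %s", correlation_id, exc)
        return 403, {"correlation_id": correlation_id,
                     "error": "you are not authorised to perform this action"}
    except Exception:
        # Unexpected fault: log the full traceback, return a generic message.
        logger.exception("unexpected failure [%s]", correlation_id)
        return 500, {"correlation_id": correlation_id,
                     "error": "unable to process right now"}
```

Returning the correlation ID in every response lets a consumer quote it when reporting a problem, which ties the caller's view of the exception to the provider's logs.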

Handling Asynchronous Exceptions
… to be continued
Summary
Distributed solutions involving human-to-system and system-to-system interactions require reliability and robustness as key quality attributes. Exception handling is key, and understanding contexts helps in designing better exception handling.
Synchronous communication exceptions happen in real-time, and there are different sub-contexts in a synchronous communication, leading to different responsibilities when handling errors.
Along with the contexts, we looked at a couple of samples of how to handle synchronous exceptions.
Next time we will dive into the world of asynchronous communication and discuss sub-contexts and techniques for building robust and reliable solutions.