Journal of Distributed Software Engineering, Architecture and Design
Understanding Error Handling in Integrated Solutions: Making things robust
<p class="wp-block-paragraph">Integration engineering, regardless of the architectural style (EAI, microservices, SOA, accidental, etc.), must ensure the end solution is <strong>robust</strong> and <strong>reliable</strong>. These are two key attributes of any solution, and both depend on handling the scenarios where things do not go exactly as planned when two systems talk to each other over a network.</p>
<figure class="wp-block-image size-large"><a href="https://alok-mishra.com/wp-content/uploads/2021/06/screen-shot-2021-06-20-at-4.20.18-pm.png"><img src="https://alok-mishra.com/wp-content/uploads/2021/06/screen-shot-2021-06-20-at-4.20.18-pm.png?w=1024" alt="" class="wp-image-1896" /></a><figcaption>Software Quality Attributes</figcaption></figure>
<p class="wp-block-paragraph">Planning for unexpected user behaviour, poor data quality, unreliable internal or external systems, unreliable networks etc. is often difficult when turning a problem statement into a solution. In our daily human-to-human communication we encounter a variety of situations where conversations fail due to issues with the medium, language, context etc., and we have been trained to handle them or we learn to handle them.</p>
<p class="wp-block-paragraph">We must do the same when we model our human business processes in software and build distributed features that live in various systems across a network and rely on system-to-system or human-to-system interactions.</p>
<figure class="wp-block-image size-large"><a href="https://alok-mishra.com/wp-content/uploads/2021/06/screen-shot-2021-06-20-at-4.13.12-pm.png"><img src="https://alok-mishra.com/wp-content/uploads/2021/06/screen-shot-2021-06-20-at-4.13.12-pm.png?w=1024" alt="" class="wp-image-1892" /></a><figcaption>Systems integration vs general communication</figcaption></figure>
<p class="wp-block-paragraph">The key to designing distributed software for reliability and robustness is, first, understanding the <strong><em>information</em></strong> <em>produced</em> in these situations. This information tells us about the unexpected behaviour, and we call these events <strong>faults or exceptions</strong> in our systems. The second step is to model our software and put the right rules in place to <strong>react</strong> to the exceptions and <strong>handle</strong> them in the best possible manner. These reactions must be considered in advance; they can be automatic or rely on a human, but they must be <em><strong>observable</strong></em>.</p>
<figure class="wp-block-image size-large"><a href="https://alok-mishra.com/wp-content/uploads/2021/06/screen-shot-2021-06-20-at-4.15.30-pm.png"><img src="https://alok-mishra.com/wp-content/uploads/2021/06/screen-shot-2021-06-20-at-4.15.30-pm.png?w=1024" alt="" class="wp-image-1894" /></a><figcaption>Planning for unexpected behaviour</figcaption></figure>
<h2 class="wp-block-heading">Breaking down the problem space</h2>
<p class="wp-block-paragraph">As mentioned earlier, the problem space for handling unreliability is quite large, and we need a mental model to handle it better. One strategy I have seen integration engineering teams employ is to handle exceptions for real-time communication and for messaging-based communication differently, i.e. to treat them as two different error handling contexts.</p>
<p class="wp-block-paragraph">Thus broadly speaking we have two contexts:</p>
<ol class="wp-block-list"><li>Synchronous</li><li> Asynchronous </li></ol>
<h2 class="wp-block-heading">Synchronous exception handling</h2>
<p class="wp-block-paragraph">Synchronous communications happen in real time and are either <strong>request/response</strong> or <strong>one-way</strong>.</p>
<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://alok-mishra.com/wp-content/uploads/2021/06/screen-shot-2021-06-20-at-3.43.28-pm.png"><img src="https://alok-mishra.com/wp-content/uploads/2021/06/screen-shot-2021-06-20-at-3.43.28-pm.png?w=984" alt="" class="wp-image-1886" width="600" height="274" /></a><figcaption>Synchronous communication in the “real” world</figcaption></figure>
</div>
<p class="wp-block-paragraph">These communications are either a <strong>command</strong> (create, update, submit, lodge etc.) or a <strong>query</strong> (get by ID, search etc.). Exceptions in real-time communications are more visible than in message-based communications because the end user gets immediate feedback.</p>
<p class="wp-block-paragraph">In the real world, we observe these in a range of ways, from a simple <em>access denied</em> error (“<em>you are not authorised to perform an action</em>”) to <em>data validation</em> errors (“<em>invalid Medicare number</em>”) to more <em>complex and cryptic</em> errors related to systems and data (“<em>unable to process right now</em>”).</p>
<p class="wp-block-paragraph">One way to think about this type of communication and its exception handling is <em><strong>to imagine speaking to someone over the phone</strong></em>. The caller/callee interaction happens synchronously: there is a request for information followed by a response. The requestor may ask questions (query) or ask the other party to do something (command). This type of interaction is a sub-class of synchronous communication called synchronous request/response. Error handling here is typically done by the caller.</p>
<figure class="wp-block-image size-large"><a href="https://alok-mishra.com/wp-content/uploads/2021/06/screen-shot-2021-06-20-at-3.48.57-pm.png"><img src="https://alok-mishra.com/wp-content/uploads/2021/06/screen-shot-2021-06-20-at-3.48.57-pm.png?w=1024" alt="" class="wp-image-1888" /></a><figcaption>Synchronous Request/Response</figcaption></figure>
<p class="wp-block-paragraph">Another form of synchronous communication is submitting a form to request something or make a claim (“here are my details, and I would like to _____”). This is a synchronous interaction between the caller and the callee, yet overall it is a one-way interaction because the transaction ends when the form is submitted. The person taking the form may not be the one who resolves the request, and the request may not be resolved immediately. This sub-class is called synchronous one-way interaction. If the error is superficial, i.e. related to the form fields and mandatory items, the caller has to fix it. Everything else is handled by the callee, including guaranteeing the delivery of the form and handling information issues with the form.</p>
<figure class="wp-block-image size-large"><a href="https://alok-mishra.com/wp-content/uploads/2021/06/screen-shot-2021-06-20-at-4.00.16-pm.png"><img src="https://alok-mishra.com/wp-content/uploads/2021/06/screen-shot-2021-06-20-at-4.00.16-pm.png?w=1024" alt="" class="wp-image-1890" /></a></figure>
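<p class="wp-block-paragraph">A minimal sketch of the callee side of a synchronous one-way interaction might look like the following. It is an illustration only: the function name <code>submit_form</code>, the field names and the in-memory queue are assumptions made for this example, not part of any particular platform. Superficial problems go straight back to the caller as HTTP 400, and anything that passes is acknowledged with HTTP 202 and becomes the callee's responsibility.</p>

```python
import queue
import uuid
from typing import Tuple

# Hypothetical mandatory form fields, chosen only for illustration.
REQUIRED_FIELDS = ("name", "medicare_number", "request_type")


def submit_form(form: dict, backlog: queue.Queue) -> Tuple[int, dict]:
    """Accept a form in a synchronous one-way interaction.

    Returns an (HTTP status, body) pair for the caller.
    """
    # Superficial errors: missing mandatory fields. The caller must fix these.
    missing = [field for field in REQUIRED_FIELDS if not form.get(field)]
    if missing:
        return 400, {"error": "missing mandatory fields", "fields": missing}

    # From here on the callee owns the request: the form is stored for later
    # processing and the caller only receives an acknowledgement.
    receipt = str(uuid.uuid4())
    backlog.put({"receipt": receipt, "form": form})
    return 202, {"receipt": receipt}
```

<p class="wp-block-paragraph">In a real system the in-memory queue would presumably be replaced by something durable, since the callee is the one guaranteeing delivery once it has answered with a 202.</p>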
<p class="wp-block-paragraph">In summary, here are some key things to remember when handling synchronous exceptions:</p>
<ol class="wp-block-list"><li>Make all <strong>synchronous</strong> exceptions observable and trackable. Track the frequency and amplitude of these over time to measure your reliability</li><li><strong>Handle exceptions due to availability of the service or a method</strong>: Synchronous interactions happen explicitly between a caller and a callee. The caller establishes a connection and requests something. The caller therefore needs to handle exceptions related to being able to call the service (HTTP 301, 405, 500, 501)</li><li><strong>Handle exceptions due to availability of the medium or network</strong>: Synchronous interactions are sensitive to the communication medium (noisy room, unreliable internet network, poor phone connection etc.) and the caller needs to handle these (HTTP 502, 503, 504 etc.)</li><li><strong>Handle request/response and one-way exceptions differently</strong>: Synchronous request/response and one-way interactions produce different types of exceptions and require different service qualities. Request/response leans on reliability (HTTP 4xx etc.), while one-way interactions depend on robustness (how do we return HTTP 202 rather than a lot of HTTP 504s?)</li><li><strong>Be clear on who is responsible for handling the exception: </strong>caller or callee? In most cases it is the caller, since they initiate the call, unless the connection was established by the caller to <em>publish</em> information and the interaction is a one-way channel. In a one-way interaction the callee handles the exceptions (mostly)</li></ol>
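<p class="wp-block-paragraph">The first three points in the list can be sketched as a small classifier plus a counter. The groupings of status codes are taken from the list above. The category names, <code>classify_status</code> and <code>ExceptionTracker</code> are hypothetical names introduced for this example, and a real solution would more likely feed a metrics or logging platform than an in-memory counter.</p>

```python
from collections import Counter

# Status code groupings as listed in the article.
SERVICE_AVAILABILITY = {301, 405, 500, 501}
MEDIUM_OR_NETWORK = {502, 503, 504}


def classify_status(status: int) -> str:
    """Map an HTTP status code to an exception-handling category."""
    if status in MEDIUM_OR_NETWORK:
        return "medium_or_network"
    if status in SERVICE_AVAILABILITY:
        return "service_availability"
    if 400 <= status < 500:
        return "request"  # remaining 4xx: the request itself was rejected
    if 200 <= status < 300:
        return "success"
    return "other"


class ExceptionTracker:
    """Counts exceptions per category so their frequency can be observed."""

    def __init__(self) -> None:
        self.counts = Counter()

    def record(self, status: int) -> str:
        category = classify_status(status)
        if category != "success":
            self.counts[category] += 1
        return category
```

<p class="wp-block-paragraph">Recording every response this way, for example with <code>tracker.record(503)</code>, gives a per-category count that can be charted over time.</p>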
<h2 class="wp-block-heading">Putting it together: Handling Synchronous Exceptions </h2>
<h4 class="wp-block-heading">As a Service Consumer</h4>
<figure class="wp-block-image size-large"><a href="https://alok-mishra.com/wp-content/uploads/2021/06/screen-shot-2021-06-20-at-4.38.44-pm.png"><img src="https://alok-mishra.com/wp-content/uploads/2021/06/screen-shot-2021-06-20-at-4.38.44-pm.png?w=1024" alt="" class="wp-image-1898" /></a><figcaption>A sample error handling model for a service consumer during a synchronous request/response session</figcaption></figure>
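<p class="wp-block-paragraph">One possible consumer-side treatment, sketched under assumptions rather than taken from the diagram: retry only the medium/network errors (HTTP 502, 503, 504 and transport failures) with exponential backoff, and hand everything else straight back to the calling code. The <code>send</code> callable, <code>call_with_retry</code> and the default values are hypothetical and exist only for this example.</p>

```python
import time
from typing import Callable, Tuple

RETRYABLE = {502, 503, 504}  # medium/network errors listed in the article


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `send` and retry only on medium/network errors.

    `send` is a hypothetical callable that performs one request and returns
    (HTTP status, body). It may raise ConnectionError or TimeoutError.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last: Tuple[int, str] = (0, "no attempt made")
    for attempt in range(1, max_attempts + 1):
        try:
            status, body = send()
        except (ConnectionError, TimeoutError) as exc:
            # Status 0 is used here to mean "no HTTP response received".
            status, body = 0, f"transport failure: {exc}"
        last = (status, body)
        if status != 0 and status not in RETRYABLE:
            return last  # success or a non-retryable error: caller decides
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return last
```

<p class="wp-block-paragraph">Retrying is only safe when the request can be repeated without side effects. For commands that are not idempotent, a blind retry may create duplicates, so queries are the more natural fit for this sketch.</p>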
<h4 class="wp-block-heading">As a Service Provider</h4>
<figure class="wp-block-image size-large"><a href="https://alok-mishra.com/wp-content/uploads/2021/06/screen-shot-2021-06-20-at-4.38.58-pm.png"><img src="https://alok-mishra.com/wp-content/uploads/2021/06/screen-shot-2021-06-20-at-4.38.58-pm.png?w=960" alt="" class="wp-image-1900" /></a><figcaption>A sample model for service provider error handling during a synchronous request/response session</figcaption></figure>
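<p class="wp-block-paragraph">On the provider side, one common approach is to translate internal faults into a small set of HTTP responses at the service boundary, keeping details in the logs so the fault stays observable. The sketch below is an assumption about how that could look, not a transcription of the diagram: <code>handle_request</code>, <code>ValidationError</code> and the exception-to-status mapping are hypothetical.</p>

```python
import logging
import uuid
from typing import Callable, Tuple

logger = logging.getLogger("provider")


class ValidationError(Exception):
    """Hypothetical exception raised when request data is invalid."""


def handle_request(operation: Callable[[], dict]) -> Tuple[int, dict]:
    """Run `operation` and translate failures into an (HTTP status, body) pair."""
    try:
        return 200, operation()
    except ValidationError as exc:
        # Data validation errors are safe to describe to the caller.
        return 400, {"error": str(exc)}
    except PermissionError:
        return 403, {"error": "you are not authorised to perform this action"}
    except LookupError:
        return 404, {"error": "not found"}
    except Exception:
        # Unexpected fault: log the details with a reference and keep the
        # reply to the caller generic.
        reference = str(uuid.uuid4())
        logger.exception("unhandled fault, reference=%s", reference)
        return 500, {"error": "unable to process right now", "reference": reference}
```

<p class="wp-block-paragraph">The reference in the 500 response lets support staff find the matching log entry without exposing internal details to the caller.</p>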
<h2 class="wp-block-heading">Handling Asynchronous Exceptions</h2>
<p class="wp-block-paragraph"><strong>… to be continued</strong> </p>
<h2 class="wp-block-heading">Summary</h2>
<p class="wp-block-paragraph">Distributed solutions involving human-to-system and system-to-system interactions require reliability and robustness as key quality attributes. Exception handling is central to both, and understanding the communication context helps in designing better exception handling.</p>
<p class="wp-block-paragraph">Synchronous communication exceptions happen in real time, and the different sub-contexts of synchronous communication (request/response and one-way) lead to different responsibilities for error handling.</p>
<p class="wp-block-paragraph">Along with the contexts, we looked at a couple of sample models for handling synchronous exceptions.</p>
<p class="wp-block-paragraph">Next time we will dive into the world of asynchronous communication and discuss its sub-contexts and techniques for building robust and reliable solutions.</p>
Alok brings experience in engineering and architecting distributed software systems from over 20 years across industry and consulting. His posts focus on Systems Integration, API design, Microservices and Event driven systems, Modern Enterprise Architecture and other related topics