Coordinating Distributed Software: Choreography vs Orchestration

There are 2 ways of coordinating distributed systems and services to achieve an end-to-end outcome

Choreography and Orchestration

Choreography and orchestration are two common patterns for coordinating distributed systems to build an end-to-end solution. Pick between them carefully, because their advantages and disadvantages are contrasting and impactful. Choreography offers a nicely decoupled option, but monitoring events and rolling things back across multiple systems can be a nightmare to maintain. Orchestration offers a nicely sequenced, organised way to invoke services, but it brings a lot of overhead and hidden coupling, which can impact your ability to keep your services nimble over time.

As an example, at one client we saw a long-running flow implemented using events. There were specific choreography rules in just 3 systems, which sent events back and forth on completing various steps in a task (like you and me exchanging letters in the 1860s across the ocean, what could go wrong?). Lost and out-of-sequence events were causing out-of-sequence actions, and those actions generated further events, which led to a cascading chain of actions that eventually left customers receiving partially generated or out-of-sequence documents. It was a customer management disaster and quite hard to find and fix in production!

Okay, now that I have you worried, let's talk about choreography and orchestration in detail.


Choreography, often qualified further as “event choreography”, is a mechanism where the systems in a distributed solution receive messages, and rules in each system interpret the content of each message and react to it (update tables, query an end system, generate a document, emit a message etc). The solution design then consists of where these messages come from, how the systems react to them, and whether the ordering/sequencing of these messages matters when reacting to them.
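To make the idea concrete, here is a minimal sketch of choreography, assuming an in-memory queue in place of a real message broker. The `EventBus` class, the two rule functions and the event names (`OrderPlaced`, `InvoiceCreated`, `DocumentGenerated`) are all hypothetical and only there for illustration. Note that nobody is in charge of the flow: each system only knows its own rule.

```python
from collections import defaultdict, deque


class EventBus:
    """In-memory stand-in for a message broker (an assumption, not a real product)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()
        self.log = []  # every event type that was delivered, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.queue.append((event_type, payload))

    def run(self):
        # Deliver queued events until no system has anything left to say.
        while self.queue:
            event_type, payload = self.queue.popleft()
            self.log.append(event_type)
            for handler in self.handlers[event_type]:
                handler(payload)


bus = EventBus()
generated_documents = []


def billing_rules(payload):
    # Rule owned by a hypothetical billing system: react to an order, emit an event.
    bus.publish("InvoiceCreated", {"orderId": payload["orderId"], "amount": 100})


def document_rules(payload):
    # Rule owned by a hypothetical document system: react to an invoice.
    generated_documents.append(f"invoice-{payload['orderId']}.pdf")
    bus.publish("DocumentGenerated", {"orderId": payload["orderId"]})


bus.subscribe("OrderPlaced", billing_rules)
bus.subscribe("InvoiceCreated", document_rules)

bus.publish("OrderPlaced", {"orderId": "42"})
bus.run()
```

The end-to-end flow exists only as the sum of the subscriptions, which is exactly why it is easy to extend and hard to trace.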

Event choreography is used to de-couple microservices across bounded contexts and was the focus of the “message oriented middleware” architectures. Event choreography solutions are easy to scale to new systems (simply add a new consumer) but complex to own and operate (try finding lost messages in production), and they fail if there is any sequencing or global-transaction requirement. Workarounds such as global locks/mutexes, entity versioning or even a common bulletin board to coordinate the choreography are really attempts to orchestrate it.
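Entity versioning is one of the workarounds mentioned above. A rough sketch of how a consumer might use it, assuming every event carries an `entityId` and a `version` that increases by one per change (both field names, and the `DocumentProjection` class, are hypothetical):

```python
class DocumentProjection:
    """Consumer that applies an event only if it is the next version of the entity."""

    def __init__(self):
        self.versions = {}  # entity id -> last applied version
        self.parked = {}    # (entity id, version) -> event waiting for its turn
        self.applied = []   # (entity id, version) in the order they were applied

    def handle(self, event):
        entity_id, version = event["entityId"], event["version"]
        last = self.versions.get(entity_id, 0)
        if version <= last:
            return "duplicate"  # already seen, ignore it
        if version > last + 1:
            self.parked[(entity_id, version)] = event
            return "parked"     # arrived early, hold it back
        self._apply(event)
        # Drain any parked events that are now in sequence.
        while (entity_id, self.versions[entity_id] + 1) in self.parked:
            self._apply(self.parked.pop((entity_id, self.versions[entity_id] + 1)))
        return "applied"

    def _apply(self, event):
        self.versions[event["entityId"]] = event["version"]
        self.applied.append((event["entityId"], event["version"]))
```

If version 2 arrives before version 1, it is parked and applied right after version 1 shows up. A lost event still blocks everything behind it, so this guards against reordering, not against loss.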



Service orchestration is a way to react to a command or event by starting a long-running process that interacts with one or more services. In this mechanism there is a central coordinator (a system, process or person) that executes a series of steps in a controlled manner, one-by-one or in parallel (scatter/gather), using synchronous (request/response, one-way) and asynchronous (emit an outbound event, wait, and move to the next step on an inbound event) interactions.
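A small sketch of a coordinator that mixes one-by-one and scatter/gather steps. The four service functions are hypothetical stubs standing in for real remote calls, and the thread pool is just one possible way to run the parallel step:

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical service calls; in a real solution these would be remote requests.
def check_credit(order):
    return {"credit": "ok"}


def reserve_stock(order):
    return {"stock": "reserved"}


def book_courier(order):
    return {"courier": "booked"}


def send_confirmation(order, results):
    return {"confirmation": "sent", "based_on": sorted(results)}


def orchestrate(order):
    """Central coordinator: one step first, two in parallel, then a final step."""
    results = {}
    # Step 1: synchronous request/response.
    results.update(check_credit(order))
    # Step 2: scatter two independent calls, then gather both results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(reserve_stock, order), pool.submit(book_courier, order)]
        for future in futures:
            results.update(future.result())
    # Step 3: runs only after everything above has completed.
    results.update(send_confirmation(order, results))
    return results
```

The sequencing lives in one place (`orchestrate`), which is the control you gain and also the coupling you pay for.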

Orchestration is great for handling a complex series of steps, especially those that are part of a single logical transaction. What we could not achieve with events and the chaos, we can now do with coordination and control. This includes rollback and compensating activities, i.e. steps in the backward flow that undo the work of earlier steps if the current step fails.
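Compensation can be sketched as a list of (action, compensation) pairs, where a failure triggers the compensations of the completed steps in reverse order. This is a simplified illustration, not how any particular engine implements it; `run_with_compensation`, `StepFailed` and the ticketing steps are made-up names:

```python
class StepFailed(Exception):
    pass


def run_with_compensation(steps, context):
    """Run (action, compensation) pairs in order; undo completed ones on failure."""
    completed = []
    for action, compensation in steps:
        try:
            action(context)
        except StepFailed:
            # Walk backwards through the steps that did succeed.
            for undo in reversed(completed):
                undo(context)
            return "rolled_back"
        completed.append(compensation)
    return "completed"


def debit_account(ctx):
    ctx["log"].append("debit")


def refund_account(ctx):
    ctx["log"].append("refund")


def reserve_seat(ctx):
    ctx["log"].append("reserve")


def release_seat(ctx):
    ctx["log"].append("release")


def issue_ticket(ctx):
    raise StepFailed("ticket printer offline")


def void_ticket(ctx):
    ctx["log"].append("void")


context = {"log": []}
outcome = run_with_compensation(
    [(debit_account, refund_account), (reserve_seat, release_seat), (issue_ticket, void_ticket)],
    context,
)
```

Here the third step fails, so the seat is released and then the account is refunded. The failed step's own compensation (`void_ticket`) never runs because that step never completed.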

The major problem with orchestration is coupling: the orchestrator ties the services together, and they are all now bound in a solution context. This is why orchestrations should be done with care, against mature endpoints, and avoided when event-based state transfer can work.


A Word about building Orchestration services

There is a whole other blog post about event-driven architecture and choreography; in this post we dive more into orchestrating services. As mentioned, these services exist because the solution sees actions in a group of distributed systems as part of a single global transaction with some strict sequencing requirements.

These services have other hidden attributes or technical requirements, such as:

  1. They need an API to start a new instance
  2. They need an API to read the current state of a given instance
  3. They need to orchestrate a set of actions that may be part of a single end-to-end transaction, although the steps do not have to form a single transaction
  4. They have tasks which wrap callouts to external APIs, DBs, messaging systems etc.
  5. Their tasks can define error handling and rollback conditions (compensations)
  6. They store their current state and details about completed tasks
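The last requirement, storing the current state and the completed tasks, could look roughly like the sketch below. The `ProcessInstance` and `TaskRecord` shapes and the JSON round trip are assumptions made for illustration; a process engine would manage this for you in its own format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TaskRecord:
    name: str
    status: str  # "completed" or "failed"
    output: dict = field(default_factory=dict)


@dataclass
class ProcessInstance:
    instance_id: str
    status: str = "Pending"
    tasks: list = field(default_factory=list)

    def record(self, name, status, output=None):
        # Keep details of every task so the flow can resume or roll back later.
        self.tasks.append(TaskRecord(name, status, output or {}))
        if status == "failed":
            self.status = "Failed"

    def to_json(self):
        # What would be written to the datastore after every task.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw):
        # Rebuild the instance when the flow is picked up again.
        data = json.loads(raw)
        tasks = [TaskRecord(**task) for task in data.pop("tasks")]
        return cls(tasks=tasks, **data)
```

Persisting after each task is what lets a long-running flow survive a restart and answer "where are we?" at any time.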

Why are orchestrations stateful? 

Services, especially integration services, can generally be stateless. They are optimised for short-lived request-response type applications. However, there are scenarios where long-running one-way request handling is required, along with the ability to give the client the status of the request and the ability to perform distributed transaction handling and rollback (because XA sucked!)

So you need stateful because

  • there is a group of tasks that need to be done together as a step, and that step is asynchronous with no guaranteed response time, or asynchronous one-way with a response notification due later
  • or there is a group of tasks where each step individually may have a short response time but the aggregated response time is large
  • or there is a group of tasks which are part of a single distributed transaction, so if one fails you need to roll back all of them

What API endpoints are there in a stateful microservice?

Microservices implementing stateful orchestrations provide a service that is richer than the normal resource query/command. They start a complex long-running process with activities or steps, and therefore we need an interface contract that lets us start, interact with, terminate, investigate and manage the long-running processes/flows

  1. An endpoint to initiate: for example, an HTTP POST which responds with a status code of “Created” or “Accepted” (depending on what you do with the request) and a Location header pointing to the new process instance
  2. An endpoint to query request state: for example, an HTTP GET using the process id from the initiate response. The response is then the current state of the process, with information about its past states

Sample use case: User Signup

  1. The process of signing up or registering a new user requires multiple steps, and the client starts it with a single request [Command]
  2. The client can then check the status of the registration periodically [Query]


POST /registrations HTTP/1.1
Content-Type: application/json
Host:

{ "firstName": "foo", "lastName": "bar", "email": "" }

HTTP/1.1 201 Created
Location: /registrations/12345


GET /registrations/12345 HTTP/1.1
Accept: application/json
Host:

HTTP/1.1 200 OK

{ "id": "12345", "status": "Pending", "data": { "firstName": "foo", "lastName": "bar", "email": "" } }
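The two endpoints can be sketched without any web framework, as plain handler methods returning a status code, headers and a body. `RegistrationService`, its method names and the `Completed` status are assumptions for illustration only; the sequential id simply mirrors the `12345` in the exchange shown here and is not a recommendation.

```python
import itertools


class RegistrationService:
    """In-memory stand-in for the stateful service behind /registrations."""

    def __init__(self):
        self._ids = itertools.count(12345)
        self._instances = {}

    def post_registration(self, body):
        # Command: start a new process instance and point the client at it.
        instance_id = str(next(self._ids))
        self._instances[instance_id] = {"id": instance_id, "status": "Pending", "data": body}
        return 201, {"Location": f"/registrations/{instance_id}"}, None

    def get_registration(self, instance_id):
        # Query: return the current state, or 404 for an unknown instance.
        instance = self._instances.get(instance_id)
        if instance is None:
            return 404, {}, None
        return 200, {"Content-Type": "application/json"}, instance

    def complete(self, instance_id):
        # Called by the orchestration once the last task has finished.
        self._instances[instance_id]["status"] = "Completed"
```

A client would call `post_registration` once, keep the returned location, and poll `get_registration` until the status is no longer `Pending`.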

Orchestration anti-patterns

While the pattern is simple, I have seen implementations vary, with some key anti-patterns. These anti-patterns make the end solution brittle over time, leading to issues with implementing and managing stateful microservices

  1. Enterprise business process orchestration: This makes it complex and couples various contexts. Keep it simple!
  2. Hand rolling your own orchestration solution: Unlike regular services, operating long-running services requires additional tools for end-to-end observability and error handling
  3. Implementing via a stateless service platform and bootstrapping a database: The database can become the bottleneck and prevent your stateful services from scaling. Use available services/products, as they have optimised their datastores to be highly scalable and consistent
  4. Leaking the internal process id: Your end consumer should see a mapped id, not the internal id of the stateful microservice. This abstraction is necessary for security (a malicious user cannot guess other ids and query them) and dependency management
  5. Picking a state machine product without “rollback”: Distributed transaction rollback and error handling are the two big things we need in order to implement this pattern, so it is important to pick a product that lets you do both. A lightweight BPM engine is great for this; otherwise you may need to hack around to achieve it in other tools
  6. Using stateful process microservices for everything: Just don’t! Use the stateless pattern, as it is optimal for short-lived request/response use cases. I have, for example, implemented request/response services with a BPEL engine (holds state) and lived to regret it
  7. Orchestrating when choreography is needed: If the steps do not make sense within a single context, do not require a common transaction boundary/rollback, or have no specific ordering, with the action rules living in other microservices, then use event-driven choreography
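For the anti-pattern about leaking the internal process id, one possible approach is to hand out a random public id and keep the mapping on the server side. A minimal sketch, where `PublicIdMap` is a hypothetical helper and the in-memory dictionaries stand in for a real datastore:

```python
import secrets


class PublicIdMap:
    """Maps unguessable public ids to internal process ids (and back)."""

    def __init__(self):
        self._public_to_internal = {}
        self._internal_to_public = {}

    def issue(self, internal_id):
        # Reuse the existing public id so repeated calls stay stable.
        if internal_id in self._internal_to_public:
            return self._internal_to_public[internal_id]
        public_id = secrets.token_urlsafe(16)  # random, not derived from internal_id
        self._public_to_internal[public_id] = internal_id
        self._internal_to_public[internal_id] = public_id
        return public_id

    def resolve(self, public_id):
        # Returns None for ids this service never issued.
        return self._public_to_internal.get(public_id)
```

The consumer only ever sees the public id, so the process engine (and its id scheme) can be swapped without breaking clients.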


Orchestration and choreography are two choices for coordinating distributed systems when designing an end-to-end solution. Both options have their pros and cons, so knowing them well and understanding your solution's quality attributes (the “ilities”) will help you pick one or the other.

We looked at orchestration in detail and learned that orchestrating requires holding state about the lifecycle of the process (which step, status etc), and many process engines do this: they persist and hydrate the state of the flow. I am a huge Camunda fan as it is a light-weight BPMN engine and supports the orchestration/SAGA pattern quite well. Read about SAGAs, distributed transactions and more from my not-yet-friend Bernd Rucker here


