
Choreography and Orchestration
Choreography and orchestration are two common patterns for coordinating distributed systems to build an end-to-end solution. Choose between them carefully, because their advantages and disadvantages contrast sharply and have lasting impact. Choreography offers a nicely decoupled option, but monitoring events and rolling back changes across multiple systems can be a nightmare to maintain. Orchestration offers a sequenced, organised way to invoke services, but it brings a lot of overhead and hidden coupling, which can hurt your ability to keep services nimble over time.
As an example, at one client we saw a long-running flow implemented using events. Choreography rules in just 3 systems sent events back and forth as each step of a task completed (like you and me exchanging letters in the 1860s across the ocean, what could go wrong?). Lost and out-of-sequence events triggered out-of-sequence actions, which in turn generated more events, and the resulting cascade ended with customers receiving partially generated or out-of-sequence documents. It was a customer management disaster and quite hard to find and fix in production!
Okay, now that I have you worried, let's talk about choreography and orchestration in detail.
Choreography
Choreography, often qualified further as “event choreography”, is a mechanism where the systems in a distributed solution receive messages and each system has rules to interpret the content of a message and react to it (update tables, query an end system, generate a document, emit a message etc). The solution design then consists of where these messages come from, how the systems react to them and whether the ordering/sequencing of these messages matters when reacting to them.
Event choreography is used to de-couple microservices across bounded contexts and was the focus of “message oriented middleware” architectures. Event choreography solutions are easy to extend to new systems (simply add a new consumer) but complex to own and operate (try finding lost messages in production), and they break down when there is any sequencing or global-transaction requirement. Workarounds such as global locks/mutexes, entity versioning or even a common bulletin board to coordinate the choreography are really attempts to orchestrate it.
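To make the sequencing problem concrete, here is a minimal sketch of one choreographed consumer that uses entity versioning to refuse out-of-order events. Everything in it is an assumption for illustration: the `DocumentEvent` shape, the per-entity `version` counter and the "park until the gap is filled" rule are hypothetical, not something taken from the client system described above.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentEvent:
    """Hypothetical event; `version` is the sender's per-entity sequence number."""
    entity_id: str
    version: int
    action: str


@dataclass
class DocumentConsumer:
    """One participant that applies events strictly in version order."""
    last_version: dict = field(default_factory=dict)  # entity_id -> last applied version
    parked: dict = field(default_factory=dict)        # (entity_id, version) -> event
    applied: list = field(default_factory=list)       # actions taken, in order

    def handle(self, event: DocumentEvent) -> None:
        current = self.last_version.get(event.entity_id, 0)
        if event.version <= current:
            return  # duplicate or stale event: ignore it
        if event.version > current + 1:
            # A gap means an earlier event is late or lost: park, do not act.
            self.parked[(event.entity_id, event.version)] = event
            return
        self._apply(event)
        # Drain any parked events that are now next in line.
        next_key = (event.entity_id, event.version + 1)
        while next_key in self.parked:
            nxt = self.parked.pop(next_key)
            self._apply(nxt)
            next_key = (nxt.entity_id, nxt.version + 1)

    def _apply(self, event: DocumentEvent) -> None:
        self.applied.append(event.action)
        self.last_version[event.entity_id] = event.version


consumer = DocumentConsumer()
for ev in [
    DocumentEvent("doc-1", 1, "create"),
    DocumentEvent("doc-1", 3, "send"),    # arrives early, gets parked
    DocumentEvent("doc-1", 2, "render"),
    DocumentEvent("doc-1", 2, "render"),  # duplicate, ignored
]:
    consumer.handle(ev)
print(consumer.applied)
```

Note how much coordination logic each consumer has to carry just to stay in order, and a truly lost event would leave later ones parked forever. That is the operational cost the paragraph above is pointing at.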

Orchestration
Service orchestration is a way to react to a command or event by starting a long-running process that can interact with one or more services. In this mechanism there is a central coordinator (a system, process or person) that executes a series of steps in a controlled manner, one by one or in parallel (scatter/gather), using synchronous (request/response, one-way) and asynchronous (event outbound, wait and move to next step on event inbound) interactions.
Orchestration is great for handling a complex series of steps, especially those that are part of a single logical transaction. What we could not achieve with events and the chaos, we can now do with coordination and control. This includes rollback and compensating activities, i.e. steps in the backward flow that undo the work of earlier steps if the current step fails.
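The forward flow and the backward (compensating) flow can be sketched in a few lines. This is only an illustration under assumptions: `Step`, `run_saga` and the three signup-style steps are hypothetical names, and a real engine would also persist progress between steps.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]      # forward work
    compensate: Callable[[dict], None]  # undoes the forward work


def run_saga(steps: list, context: dict) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    for step in steps:
        try:
            step.action(context)
            completed.append(step)
        except Exception as exc:
            context["log"].append(f"{step.name} failed: {exc}")
            for done in reversed(completed):
                done.compensate(context)
            return "ROLLED_BACK"
    return "COMPLETED"


def log(message: str) -> Callable[[dict], None]:
    """Build a stand-in task that only records what it would have done."""
    return lambda ctx: ctx["log"].append(message)


def fail(ctx: dict) -> None:
    raise RuntimeError("card declined")


context = {"log": []}
status = run_saga(
    [
        Step("create_account", log("account created"), log("account deleted")),
        Step("reserve_username", log("username reserved"), log("username released")),
        Step("charge_card", fail, log("charge refunded")),
    ],
    context,
)
print(status, context["log"])
```

The failed step itself is not compensated, only the steps that completed before it, and they are undone in reverse order.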
The major problem with orchestration is coupling: the orchestrator ties the services together, and they are all now bound in a solution context. This is why orchestrations should be done with care, against mature endpoints, and avoided when event-based state transfer can work.

A Word about building Orchestration services
There is a whole other blog post about event-driven architecture and choreography; in this post we dive deeper into orchestrating services. As mentioned, these services exist because the solution sees actions across a group of distributed systems as part of a single global transaction with some strict sequencing requirements.
These services have other hidden attributes or technical requirements such as
- Needing an API to start a new instance
- Needing an API to read the current state of a given instance
- Needing to orchestrate a set of actions that may be part of a single end-to-end transaction (though the steps do not have to form a single transaction)
- They have tasks which wrap callouts to external APIs, DBs, messaging systems etc.
- Their Tasks can define error handling and rollback conditions (compensations)
- They store their current state and details about completed tasks
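As a rough illustration of the last two points, the state such a service keeps per instance could look like the sketch below. The class and field names (`ProcessInstance`, `TaskRecord`, `dehydrate`, `hydrate`) are assumptions made up for this example; real process engines have their own storage models.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TaskRecord:
    name: str
    outcome: str  # e.g. "COMPLETED", "FAILED", "COMPENSATED"


@dataclass
class ProcessInstance:
    instance_id: str
    status: str = "PENDING"
    completed_tasks: list = field(default_factory=list)

    def record(self, name: str, outcome: str) -> None:
        """Remember what happened to a task so the flow can resume or roll back."""
        self.completed_tasks.append(TaskRecord(name, outcome))

    def dehydrate(self) -> str:
        """Serialise the instance so it can be stored between steps."""
        return json.dumps(asdict(self))

    @classmethod
    def hydrate(cls, raw: str) -> "ProcessInstance":
        """Rebuild an instance from its stored form."""
        data = json.loads(raw)
        tasks = [TaskRecord(**t) for t in data.pop("completed_tasks")]
        return cls(completed_tasks=tasks, **data)


instance = ProcessInstance("12345")
instance.record("validate_email", "COMPLETED")
stored = instance.dehydrate()
restored = ProcessInstance.hydrate(stored)
print(restored)
```

The same record answers the "read the current state" API and tells the error handling which tasks need compensating.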
Why are orchestrations stateful?
Services, especially integration services, can generally be stateless. They are optimised for short-lived request-response interactions. However, there are scenarios where long-running one-way request handling is required, along with the ability to give the client the status of the request and to perform distributed transaction handling and rollback (because XA sucked!).
So you need state because
- there is a group of tasks that must be done together as a step that is asynchronous with no guaranteed response time, or asynchronous one-way with a response notification due later
- or there is a group of tasks where each step may individually have a short response time but the aggregated response time is large
- or there is a group of tasks that are part of a single distributed transaction, so if one fails you need to roll back all of them
What API endpoints are there in a stateful microservice?
Microservices implementing stateful orchestrations provide a service that is richer than the normal resource query/command. They start a complex long-running process with activities or steps, and therefore we need an interface contract that lets us start, interact with, terminate, investigate and manage the long-running process/flow.
- An endpoint to initiate: for example, an HTTP POST which responds with a status code of “Created” or “Accepted” (depending on what you do with the request) and a location for the new process
- An endpoint to query request state: for example, an HTTP GET using the process id from the initiate response. The response is the current state of the process along with information about its past states
Sample use case: User Signup
- Signing up or registering a new user requires multiple steps; the client starts the process with a single request [Command]
- The client can then check the status of the registration periodically [Query]
Command
POST /registrations HTTP/1.1
Content-Type: application/json
Host: myapi.org

{ "firstName": "foo", "lastName": "bar", "email": "foo@bar.com" }

HTTP/1.1 201 Created
Location: /registrations/12345
Query
GET /registrations/12345 HTTP/1.1
Accept: application/json
Host: myapi.org

HTTP/1.1 200 OK
Content-Type: application/json

{
  "id": "12345",
  "status": "Pending",
  "data": { "firstName": "foo", "lastName": "bar", "email": "foo@bar.com" }
}
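For readers who prefer code to wire traces, the same command/query pair can be modelled without any web framework. This is a sketch under assumptions: `start_registration`, `get_registration` and the in-memory `_registrations` dict are hypothetical stand-ins for the real endpoint handlers and the process engine's store.

```python
import uuid

# In-memory stand-in for the orchestrator's state store (assumption).
_registrations = {}


def start_registration(payload: dict) -> tuple:
    """Models POST /registrations: record the request, return status and headers."""
    registration_id = uuid.uuid4().hex
    _registrations[registration_id] = {
        "id": registration_id,
        "status": "Pending",
        "data": payload,
    }
    return 201, {"Location": f"/registrations/{registration_id}"}


def get_registration(registration_id: str) -> tuple:
    """Models GET /registrations/{id}: return the current state of the process."""
    record = _registrations.get(registration_id)
    if record is None:
        return 404, {"error": "not found"}
    return 200, record


status, headers = start_registration(
    {"firstName": "foo", "lastName": "bar", "email": "foo@bar.com"}
)
registration_id = headers["Location"].rsplit("/", 1)[-1]
code, body = get_registration(registration_id)
print(status, code, body["status"])
```

The client only ever needs the location it was handed back; it polls that until the status moves on from "Pending".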
Orchestration anti-patterns
While the pattern is simple, I have seen implementations vary, with some key anti-patterns creeping in. These anti-patterns make the end solution brittle over time, leading to issues with stateful microservice implementation and management.
- Enterprise business process orchestration: Makes it complex, couples various contexts. Keep it simple!
- Hand rolling your own orchestration solution: Unlike regular services, operating long-running services requires additional tools for end-to-end observability and handling errors
- Implementing via a stateless service platform and bootstrapping a database: The database can become the bottleneck and prevent your stateful services from scaling. Use available services/products, as they have optimised their datastores to be highly scalable and consistent
- Leaking internal process id: Your end consumer should see some mapped id not the internal id of the stateful microservice. This abstraction is necessary for security (malicious user cannot guess different ids and query them) and dependency management
- Picking a state machine product without “rollback”: Distributed transaction rollback and error handling are the two big things we need in order to implement this pattern, so it is important to pick a product that lets you do both. A lightweight BPM engine is great for this; otherwise you may need to hack around to achieve it in other tools
- Using stateful process microservices for everything: Just don’t! Use stateless services, as they are optimal for short-lived request/response use cases. I have, for example, implemented request/response services with a BPEL engine (holds state) and lived to regret it
- Orchestrating when choreography is needed: If the steps do not make sense within a single context, do not require a common transaction boundary/rollback, or have no specific ordering and their action rules live in other microservices, then use event-driven choreography
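One way to avoid the "leaking internal process id" anti-pattern from the list above is to hand out a random public id and keep the mapping private. The sketch below assumes a hypothetical `PublicIdMap` class and an in-memory dict; a real service would store the mapping durably.

```python
import secrets


class PublicIdMap:
    """Maps unguessable public ids to internal process ids (hypothetical helper)."""

    def __init__(self) -> None:
        self._public_to_internal = {}

    def issue(self, internal_id: str) -> str:
        """Create a random public id for an internal process id."""
        public_id = secrets.token_urlsafe(16)
        self._public_to_internal[public_id] = internal_id
        return public_id

    def resolve(self, public_id: str):
        """Return the internal id, or None if the public id is unknown."""
        return self._public_to_internal.get(public_id)


ids = PublicIdMap()
public_id = ids.issue("engine-instance-42")
print(public_id, ids.resolve(public_id))
```

Because the consumer only ever sees the public id, the engine behind it can be replaced without breaking any client that stored one.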
Summary
Orchestration and choreography are two choices for coordinating distributed systems when designing an end-to-end solution. Both options have their pros and cons, so knowing them well and understanding your solution's quality attributes (ilities) will help you pick one or the other.
We looked at orchestration in detail and learned that orchestrating requires holding state about the lifecycle of the process (which step, status etc), and many process engines do exactly this: they persist and hydrate the state of the flow. I am a huge Camunda fan as it is a lightweight BPMN engine and supports the orchestration/SAGA pattern quite well. Read about SAGAs, distributed transactions and more from my not-yet-friend Bernd Rucker here