A few days ago, Werner Vogels, Amazon's CTO, shared the Distributed Computing Manifesto, an internal document written by Amazon's senior software engineers back in 1998 to propose a plan for overhauling Amazon's architecture for the next phase of its evolution.
I worked at Amazon as a software engineer a long time ago, and the evolution of its architecture was one of the first things I heard about during onboarding. Even so, I read this document with a lot of interest. Even though it's a document from more than 20 years ago that many software engineers would write off as obsolete, I think there are many lessons to take from it and I would recommend it as a good read. I thought I'd also write down some of my thoughts and observations, even though I acknowledge some of them (or, in the worst case, all of them) might sound obvious to you.
The manifesto focuses on two main concepts:
- a service-based model, consisting of a 3-tier architecture (presentation/business logic/data) where only a single service would access data in a database directly, and other services would need to go through a well-defined interface exposed by that service. This would enable agility by preventing changes at the data layer from impacting all the clients of the corresponding service (see the short sketch after this list).
- a workflow-based model, which could be used to implement asynchronous workflows in these services. At that point, these workflows were implemented by directly accessing data records inside the database. One issue with that approach was that everything was performed against a single database instance, a pattern that is hard to scale.
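To make the service-based model concrete, here is a minimal sketch of the idea that only the owning service touches its data store, while every other component goes through its well-defined interface. The class and method names are illustrative choices of mine, not taken from the manifesto:

```python
class CustomerService:
    """The only component allowed to read or write customer records."""

    def __init__(self, customer_store: dict[str, dict]):
        # Stand-in for the service's private database; nothing outside this
        # class reaches it directly.
        self._store = customer_store

    # Well-defined interface: callers depend on this contract, not on the
    # shape of the underlying tables.
    def get_shipping_address(self, customer_id: str) -> str:
        return self._store[customer_id]["shipping_address"]


class OrderService:
    """A client service: it never queries the customer database itself."""

    def __init__(self, customers: CustomerService):
        self._customers = customers

    def prepare_shipment(self, customer_id: str) -> str:
        # Changes to the customer data layer stay hidden behind the
        # CustomerService interface, which is what preserves agility.
        return self._customers.get_shipping_address(customer_id)


# Example usage with a tiny in-memory "database".
customers = CustomerService({"c-1": {"shipping_address": "123 Example St"}})
orders = OrderService(customers)
print(orders.prepare_shipment("c-1"))
```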
In the context of the service-based model, the authors explain that services can provide synchronous and asynchronous interfaces, depending on the nature of the business case. They mention two main options for an asynchronous interface: polling and callback. In the polling approach, the requester periodically checks whether the request has been completed. In the callback approach, the requester provides a callback routine to be invoked when the request has been completed. The main issue described with the callback approach is that the requester might not be active when the request completes, so the callback cannot be invoked. There is a third option that could mitigate this issue: instead of specifying a callback, the requester provides a persistent queue where a message can be published indicating the request was completed. This doesn't require the requester to be available at the time the request is completed. This pattern is known as asynchronous request-response in enterprise integration patterns parlance.
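As a rough illustration of that third option, here is a sketch of the queue-based request-response pattern, using AWS SQS via boto3 as one possible implementation; the queue URLs and message fields are hypothetical, and neither the manifesto nor the pattern prescribes a particular queueing technology:

```python
import json
import uuid

import boto3

sqs = boto3.client("sqs")

# Hypothetical queues: one for incoming requests, one durable reply queue.
REQUEST_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/requests"
REPLY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/replies"


def submit_request(payload: dict) -> str:
    """Requester side: send the request, naming the durable reply queue."""
    correlation_id = str(uuid.uuid4())
    sqs.send_message(
        QueueUrl=REQUEST_QUEUE_URL,
        MessageBody=json.dumps({
            "correlation_id": correlation_id,
            "reply_to": REPLY_QUEUE_URL,
            "payload": payload,
        }),
    )
    return correlation_id  # used later to match the reply


def serve_one_request() -> None:
    """Service side: process one request and publish the result to the reply queue."""
    response = sqs.receive_message(
        QueueUrl=REQUEST_QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=10
    )
    for message in response.get("Messages", []):
        request = json.loads(message["Body"])
        result = {"status": "completed", "echo": request["payload"]}  # placeholder work
        sqs.send_message(
            QueueUrl=request["reply_to"],
            MessageBody=json.dumps(
                {"correlation_id": request["correlation_id"], "result": result}
            ),
        )
        sqs.delete_message(
            QueueUrl=REQUEST_QUEUE_URL, ReceiptHandle=message["ReceiptHandle"]
        )
```

Because the reply queue is durable, the requester can come back at any later point, drain the queue and match responses by correlation id, which is exactly what removes the requirement of being alive at the moment the request completes.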
Interestingly, after describing all its benefits, the authors explicitly call out the drawbacks of the workflow-based model. That is something you don't always see in a manifesto, which might otherwise paint an unrealistically rosy picture. But software decisions rarely come without trade-offs, and being aware of what you are trading off is what leads you to a better decision in the end.
As described previously, the main challenge with database-backed workflows, and the reason for moving towards the approach proposed by the authors, is the fact that operating against a single database instance is hard to scale and creates a single point of failure. Nowadays, there are distributed databases that have solved these problems. Interestingly, one of these databases, DynamoDB, is built by Amazon itself. Its architecture is based on partitions stored on separate physical machines, which allows for horizontal scaling and high availability to a great extent. For example, DynamoDB provides four nines of availability for regular tables, which goes up to five nines if you use global tables replicated across regions. As you can find in Amazon's documentation and academic papers, DynamoDB can handle tens of millions of requests per second over petabytes of storage, all with predictable single-digit-millisecond latency. This is not to say a database-backed workflow is superior, only that the database technology is not the limitation nowadays.

An alternative choice (enabled by the availability of these database technologies) is to have the individual services of the workflow access core/reference data from the services that own it. This avoids having them listen to updates from a service message bus and maintain a replicated view, which can lead to a lot of overhead, computational waste and all the challenges of eventual consistency. It also avoids requiring every message to contain all the information needed to act on it, which can result in very large messages, the associated overhead and the inability to react to changes in the reference data.
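Here is a small sketch of that "lean message" alternative, where the workflow message carries only identifiers and each step fetches the reference data it needs from the owning service at processing time. The message shape, the CatalogClient class and its get_item method are hypothetical stand-ins for a service interface, not real Amazon APIs:

```python
from dataclasses import dataclass


@dataclass
class OrderMessage:
    order_id: str
    customer_id: str
    item_ids: list[str]  # references only; no embedded catalog or customer data


class CatalogClient:
    """Stand-in for the well-defined interface of the service owning catalog data."""

    def __init__(self, items: dict[str, dict]):
        # In a real system this would be a remote call to the catalog
        # service, not a local dictionary lookup.
        self._items = items

    def get_item(self, item_id: str) -> dict:
        return self._items[item_id]


def handle_fulfillment_step(message: OrderMessage, catalog: CatalogClient) -> list[dict]:
    # Fetch current reference data on demand instead of maintaining a locally
    # replicated view or carrying it inside the message, so the step always
    # acts on the owning service's latest version of the data.
    return [catalog.get_item(item_id) for item_id in message.item_ids]


# Example usage with a tiny in-memory catalog.
catalog = CatalogClient({"b-42": {"title": "Distributed Systems", "weight_kg": 1.2}})
message = OrderMessage(order_id="o-1", customer_id="c-1", item_ids=["b-42"])
print(handle_fulfillment_step(message, catalog))
```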
As part of the workflow-based model, the authors raise the main question of who manages the workflow and how that management is performed. They present two approaches, which they call "directed" and "autonomous". These names might not sound very familiar, but they are essentially what people nowadays refer to as "orchestration" and "choreography". If you are interested, I have written a previous post about orchestration and choreography that you can find here, so I won't repeat myself.
Another interesting thing the authors touch on is the option of separating data domains (e.g. customers/vendors/catalog) per geography and hosting them on separate infrastructure. If you have tried using a single Amazon account across different countries, you might have seen some of the effects of such an architectural decision (e.g. your orders tab showing only the orders completed in that country). The main advantage the authors mention is reduced blast radius: if there is an issue with the infrastructure of one country, other countries can remain unaffected. There are probably more advantages, such as the ability to comply more easily with data residency regulations and other legal discrepancies between countries. However, there are drawbacks to consider as well, such as extra resources1, more moving parts to maintain and support, and the possibility of the software stack diverging between geographies.
---
There can be different reasons why you might need extra resources. One is that different geographies can have workloads with different temporal characteristics (e.g. peaks of one country can coincide with troughs of another), which a shared infrastructure can serve more efficiently. Another reason is that implementing static stability can be costlier. As a simplified example, assume you have 3 countries that can each be served by 5 servers. If you use separate stacks and want to deploy across 3 AZs so that each country tolerates the failure of one datacenter plus one more server (a typical topology), each country needs 3 servers per AZ (losing an AZ plus one server still leaves 5), so you will need (3+3+3)*3 = 27 servers. Instead, if you use one stack for all countries (i.e. 15 servers to handle the overall load) and want the same tolerance, 8 servers per AZ are enough (losing an AZ plus one server still leaves 15), so you will need 8+8+8 = 24 servers.
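For what it's worth, here is a small sketch of that static-stability arithmetic. The function and its parameters are mine, under the assumption that the stack must still carry its full load after losing one availability zone plus one more server:

```python
import math


def servers_needed(load_servers: int, azs: int = 3, extra_failures: int = 1) -> int:
    """Smallest total count (evenly spread across AZs) such that losing one AZ
    plus `extra_failures` more servers still leaves `load_servers` running."""
    per_az = math.ceil((load_servers + extra_failures) / (azs - 1))
    return per_az * azs


# Separate stacks: 3 countries, each needing 5 servers of capacity.
print(3 * servers_needed(5))   # 3 * (3+3+3) = 27
# One shared stack: 15 servers of capacity for the combined load.
print(servers_needed(15))      # 8+8+8 = 24
```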