Distributed Architecture Strategy

Much of our new design is oriented around a far more distributed architecture than ATLAS has traditionally used. While we are a little late to the party, that also means we have the chance to learn from our forebears and avoid repeating their mistakes.

Many people think of microservices when they think of distributed architecture. While microservices are the dominant implementation today, distributed architectures have been around since the dawn of computing. They can take many different forms, each with their own pros and cons. At ATLAS, we should be hesitant to refer to our version as “microservices”—not out of dislike for the pattern, necessarily, but because the term brings several implications with it. Realistically, a full-on microservice-oriented architecture will be challenging for us to achieve simply due to our staffing limitations.

Throughout this document (and any time we discuss architecture going forward), the terms “microservices” and “distributed architecture” will be used fairly interchangeably. Documentation should emphasize the ideal architecture, even if in practice we must make some sacrifices and compromises.

Table of Contents

Important Terms

In the scope of this document, there are some terms we should define up front to avoid confusion.

Microservices Example

A microservice-oriented architecture might propose that the operation “run a report” (as it exists in current ATLAS, newer designs notwithstanding) be decomposed into several distinct services. It may have services for:

Each of these services is currently just a step of the monolithic report pipeline. However, due to the monolithic design, each of these steps has innumerable cross-cutting concerns. Worse, this logic has far-reaching, omniscient dependencies throughout the service. As a result, the entire report pipeline incurs the risk of regression even if only a single step needs to change. Additionally, even a transient failure at any single step causes the entire pipeline to fail.

Exactly how granular your services should be is still a subject of hot debate in the field. One could easily imagine each of those steps being further broken into smaller services: a service for delivering FTP results, a service for delivering email results, a service for executing BI reports, a service for executing AdHoc reports—the list goes on. This is part of the hesitation to use the word “microservices” when describing our distributed architecture. While there could be value in going more granular with these services, that value should be measured during design and balanced against the increase in complexity and overhead.
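As a purely illustrative sketch (these contract names are hypothetical and not taken from the existing pipeline), each candidate service boundary can be expressed as a small, single-purpose contract, which is roughly what “isolating a step” looks like in code:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical contracts showing how two pipeline steps could become independent services.
public record ReportRequest(Guid ReportId, IReadOnlyDictionary<string, string> Parameters);
public record ReportResult(Guid ReportId, byte[] Content, string ContentType);

// Owns only the "execute" step; knows nothing about scheduling or delivery.
public interface IReportExecutionService
{
    Task<ReportResult> ExecuteAsync(ReportRequest request, CancellationToken ct);
}

// Owns only the "deliver" step; needs the finished result and a destination, nothing more.
public interface IReportDeliveryService
{
    Task DeliverAsync(ReportResult result, Uri destination, CancellationToken ct);
}
```

With boundaries like these, a transient failure in delivery could be retried without re-running execution, which is exactly the isolation the monolithic pipeline lacks.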

When to use Distributed Architecture

Distributed architecture is a great solution to many problems, but it does not solve every problem. Certainly, it isn’t the right solution at all for some problems. These distributed service patterns, like microservices, bring with them a decent amount of overhead and complexity. If you select them to solve your design problems, you are making a trade-off. The amount of benefit you gain can be conceptually represented as a “multiplier” that depends on your problem. Some problems have huge multipliers—basically anything that needs to scale quickly and widely, or anything which needs highly isolated parallelism. Some problems have near-zero multipliers, or even negative multipliers.

It is very important, before deciding to go with microservices or similar architectures, that you evaluate your problem’s multiplier. Read the Why Distributed Architecture? section to help figure out how much your particular engineering problem will benefit from distribution. As an example: relatively simple applications, temporary solutions, quick fixes, dirty hacks, and some client applications have very low multipliers. From a mindset perspective, it’s probably wise to always begin with a more monolithic design. Only break it down when you discover the advantages of doing so outweigh the disadvantages (more on that strategy later).

That said, don’t be afraid of choosing distributed architecture where it makes sense. This section exists more to warn you against using it for trivial applications or forcing it to apply to problems which resist it. There is no mandate that says “all projects going forward must be designed using a distributed architecture,” but we now have the infrastructure to support it more robustly than ever before. Instead, this document should serve as a helpful teaching tool and guide for why you might want to use these patterns, and how to use them successfully.

Why Distributed Architecture?

Detractors of microservices often argue that the complexity of the pattern isn’t worth the gains. The claim is that this complexity increases exponentially as the number of services increases, yet in order to “do it right,” one must create highly granular services—which, of course, leads to high complexity.

It’s hard to dispute that system complexity grows, perhaps even exponentially, as the number of services increases. However, the granularity of the system is a variable that can be tuned to keep that complexity in check. As with any architecture, you weigh the strengths against the weaknesses and choose the right one for the job. Distribution is not the correct solution for every problem, but it’s very effective at solving certain types of problems.

Here are some of the benefits of choosing a distributed architecture. If your business problem aligns well with these benefits, it increases that conceptual multiplier. These are the reasons you might choose this architecture over others:

  1. Reduction of friction in development and deployment
    • Improves the development and debugging experience by isolating effort into a single module
    • Tends to make deployments deterministic and far simpler, more robust
    • Enables Continuous Deployment, which ensures a fast stream of improvements and a quick time-to-fix for defects
      • CD is challenging with monolithic applications because deployment takes so long and regressions can be very broad. A single microservice can build and deploy in seconds, and issues are usually isolated to that one service
      • In our ecosystem, we’re looking at time improvements measured in hours for many defects; the test, deploy, re-test cycle for a fix is significantly expedited
      • Zero-downtime CD is possible, if the infrastructure is designed correctly
  2. Isolation of responsibilities and concerns at the module level
    • Leads to better designs, more future-proof APIs, and fewer defects
    • Allows a much greater ability to test modules individually, which increases code quality and robustness
  3. Scalability
    • Makes our business more flexible
    • Allows us to alleviate bottlenecks and improve customer experience as needed with almost no spin-up time
    • Reduces friction between dev and ops teams
  4. Sharing of components and services
    • Greatly speeds development and enhances products with less dev effort
    • Multiplies improvements across all participating products
  5. Infrastructure and language freedom
    • Services can be written in the best language for the job
    • By design, a microservice should be as infrastructure-independent as possible to increase its portability
    • Reduces cost to the business
  6. Separation into distinct services leads to more complete applications
    • Microservices are first-class applications with their own resiliency, high-availability strategies, fault tolerance, error handling, and so on
    • This not only makes applications more resilient, it also tends to improve visibility and logging/metrics
  7. Enables cross-functional, smaller teams to be more productive
    • Without monoliths full of omniscient dependencies, it’s much easier for any developer to work on any piece of the puzzle
    • This reduces costs and increases developer utilization
  8. Reduction in business risk due to better facilitation of agile development
    • Since each piece of an application is discrete, agile teams can produce easily deliverable units
    • Modules that change as a result of agile feedback loops are easily replaced with minimal lost effort

Granularity

This is it: the million-dollar question. How granular should your services be? How far should you break functional units into individual services versus grouping them at higher levels of abstraction? Sadly, there is no easy answer. Instead, for every use case, you’ll need to weigh the advantages and disadvantages to find a good balance point.


Advantages:

Disadvantages:


The guideline for granularity is pretty simple, though it’s more philosophical than scientific. Make every unit as granular as you can, so long as you are able to amplify the advantages and minimize the disadvantages. If going more granular no longer amplifies the advantages, or raises the disadvantages to an uncomfortable level, your unit is granular enough.

One could imagine that, especially as we first dip our toes into the waters of distributed architecture, we’ll be more uncomfortable with the disadvantages than we need to be. While we definitely must be mindful of them, we must also remember this is a new paradigm for us; we’ll need to change the way we think. As you’ll see later, there are tools available to help us mitigate the disadvantages.

Granularization Example

Earlier, the monolithic process of running a report in current ATLAS was broken down into 7 services. What if it was further broken down and made more granular? Here’s how that might look:

We’ve gone from 1 monolith, to 7 services, to 19 services. Is there value in making the process so fine-grained? Let’s revisit the advantages.

Sounds amazing so far. It’s super scalable, robust, smooth to develop, and highly reusable. This design checks all of the boxes, right? Well, let’s take a look at the disadvantages too.

Without any additional tools to help, this sounds rough. However, Kubernetes was created to help mitigate these disadvantages. It is the key to minimizing them. With Kubernetes, we can let go of the need to know and manage where each service lives. We can define how they’re deployed via script and get deterministic deployments. We can easily monitor the health of the deployment. We can minimize network overhead and cap resource overhead. K8s alone helps remove a lot of the pain from this 19-service design. On top of that, there are several other tools and methodologies we can employ to reduce the impact, such as serverless (functional) paradigms.

Is the 19-service design worth choosing, then? Well, realistically, the best design is probably somewhere between the 7-service and 19-service ones. Splitting validation into two services probably does not amplify the microservice advantages in practice. The metadata services aren’t likely to benefit from the scalability of being so granular. Reports today don’t even support FTP or SharePoint delivery. No matter the scale, though, you’re going to run into some of these roadblocks eventually.

The remainder of this article will help you address the disadvantages and roadblocks of microservices at scale. We will cover the tools, methodologies, and patterns that make developing microservices much more foolproof and allow us to squeeze every last drop of value out of them.

Documentation

Unsurprisingly, the number one strategy to deal with the unwieldy nature of microservices is to produce reliable documentation. Lack of documentation is a serious business risk that doesn’t get enough attention no matter the architecture. Not only is the chance of introducing regressions into existing systems higher, but the tribal knowledge that went into developing and maintaining legacy applications is often lost forever. Documentation helps during development as well: the more informed the team members are about the design, APIs, and philosophy behind them, the better decisions they will make during development. It makes developers more efficient, less stressed, and more cohesive.

The ugly, unavoidable truth is that documentation is expensive. Producing it—especially when a product or application is being developed by a small and/or tight-knit team—is often unattractive. Frankly, even going through a formal design process at all is sometimes expensive, as developers spend tens of hours expressing on paper what they feel they already know in their minds. Being given the task of writing documentation is seen as “getting stuck” with it; no one wants to do it. Worst of all, it requires constant maintenance to keep current, as the only thing more harmful than no documentation is bad documentation.

Given that it’s so expensive and cumbersome, yet so important to the success of distributed applications, how can we reduce the cost and still achieve a high quality of documentation? The solution is simple: centralization, automation, and prioritization.

Documentation Centralization

The most aggravating thing about keeping documentation current is that it usually lives in multiple places. Code lives in Git, but documentation often lives in some combination of OneNote, SharePoint, and a Word doc in an e-mail someone sent you 6 months ago. As a result, you find yourself flipping between several different places trying to keep track of them all. This wastes significant time, usually tens of minutes per instance as you have to rediscover where everything lives. If documentation were centralized and organized, however, it would be easier to update and effortless to reference.

To make sure that the documentation is not only discoverable, but free of administrative roadblocks, here are the recommendations for centralizing documentation:

At a company of our scale, it should be reasonable to centralize all technical documentation into a single location. If the company grows significantly, the sheer amount of documentation could become impossible for any organization scheme to manage. At that point (and only at that point) we can consider breaking the documentation out into logical sub-units.

Documentation Automation

You’ll notice that this very document lives in a Git repository. It is automatically published via CD to a documentation site, and it’s written in Markdown. Each of these is critically important to the upkeep of this document.

The particular combination of Git + TFS deployment + Markdown isn’t necessarily the best option. However, it is how this document was published, and it demonstrates the same principles that will make other documentation efforts successful. Many products exist which provide the same advantages (e.g. Confluence); feel free to use them. Just keep centralization in mind: the docs need to live in one place.

Other Automation

Aside from the automatic deployment of documents to a centralized location, there are other things we can automate to make doc maintenance simpler. Some of these tools we’re already familiar with and some of them will be new to us, but each of them is critical to sustainable docs.

Most important, probably, is the documentation of code itself. Self-documenting code is already a staple of good engineering. When documentation must be written, however, putting the documentation in the code itself is natural for developers. It’s more likely to get updated when the functionality changes. Taking it one step further, publishing it automatically can be achieved via several tools (often built into modern IDEs and compilers, like JavaDoc and C#’s XML documentation).
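For example, C#’s XML documentation comments live right next to the code they describe and can be emitted to a file at build time (via the GenerateDocumentationFile project setting). The method below is a hypothetical illustration, not an existing ATLAS API:

```csharp
using System;
using System.Collections.Generic;

public class ReportQueue
{
    /// <summary>
    /// Queues a report for execution and returns a tracking identifier.
    /// </summary>
    /// <param name="reportId">The identifier of the report definition to run.</param>
    /// <param name="parameters">Optional name/value parameters passed to the report.</param>
    /// <returns>A token that callers can use to poll for completion.</returns>
    /// <exception cref="ArgumentException">Thrown when <paramref name="reportId"/> is empty.</exception>
    public Guid QueueReport(Guid reportId, IReadOnlyDictionary<string, string>? parameters = null)
    {
        if (reportId == Guid.Empty)
            throw new ArgumentException("A report id is required.", nameof(reportId));

        // Queueing logic elided; the comments above are the point of this example.
        return Guid.NewGuid();
    }
}
```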

The automatically-generated documentation from code comments is great for internal reference, but what about externally-facing docs? Sometimes the internal docs work for external consumers as well, but often you’ll want to add some additional elements to API documentation, such as sample code or visualizations. This is where tools like Swagger UI come into play. Often, in cases like this where external customers will be consuming the documentation, it’s also wise to treat the docs as a deliverable themselves rather than a byproduct of development. You’ll want to devote some time to polishing them up and making sure they’re accurate; then, rely on automation to keep them up-to-date.
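As a sketch of what that might look like for one of our HTTP services (the controller, route, and DTO names here are hypothetical), the same XML comments plus a few response-type attributes give a tool like Swagger UI enough metadata to render accurate, example-rich API docs:

```csharp
using System;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/reports")]
public class ReportsController : ControllerBase
{
    /// <summary>Returns the current status of a queued report run.</summary>
    [HttpGet("{runId:guid}/status")]
    [ProducesResponseType(typeof(ReportStatusDto), StatusCodes.Status200OK)]
    [ProducesResponseType(StatusCodes.Status404NotFound)]
    public ActionResult<ReportStatusDto> GetStatus(Guid runId)
    {
        // Lookup elided; illustration only.
        return Ok(new ReportStatusDto(runId, "Running"));
    }
}

public record ReportStatusDto(Guid RunId, string State);
```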

Documentation Prioritization

You’re probably already practicing centralization and automation to some degree. The real trick to help documentation efforts be successful is this one: prioritization. Specifically, acknowledging that some forms of documentation are inherently less valuable than others, and should be given less effort.

Before we can discuss priority, let’s look at the distinct varieties of docs. These definitions aren’t exactly standardized, but should certainly be agreeable.

There are certainly more, but these are the most common and most relevant to our discussion. This philosophy prioritizes documentation in this order:

  1. Architectural documentation and API documentation.
  2. Design documentation.
  3. Code documentation.
  4. Implementation design.

Implementation design is the least valuable type of documentation. Implementation can change constantly, quickly, and sometimes unpredictably (e.g. due to hotfixes). Keeping an implementation design current requires essentially the same effort as the corresponding development. It’s not worth spending the time to document the code that closely, even after the fact.

Code documentation is very important, perhaps the most important type of documentation for a product’s technical success. However, it falls low on the list of priorities because it should be happening anyway. No extra effort should be expended to create it. If developers are not documenting their code well, or not writing self-documenting code, actions should be taken to correct the deficiency. When done well, it also serves much the same purpose as the implementation design. Developers should be able to trivially reverse-engineer the meaning and purpose of any component or unit.

Design documentation and architectural documentation are important because they both capture intangible elements of the design that cannot be easily discovered otherwise. Tribal knowledge, business decisions, design philosophies, and historical context are all made available through these documents. Essentially, they act as the rubric by which the implementation is measured. In an imaginary worst case scenario where the working implementation was lost, you should only need these two documents to produce a new one. It might not be exactly the same, but it will meet the same requirements. Of the two, design documentation is less important than the architectural documentation only because the design can change without affecting the architecture. Imagine the design docs as an implementation of an interface in C#, where the interface is the architecture.
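To make that analogy concrete with a trivial, purely hypothetical C# sketch: the interface plays the role of the architecture, and the class is one possible design that satisfies it.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// The "architecture": what the system promises, independent of how it is built.
public interface IReportScheduler
{
    Task ScheduleAsync(Guid reportId, DateTimeOffset runAt, CancellationToken ct);
}

// One "design" that satisfies the architecture. It could be swapped for a
// queue-based or cron-based implementation without touching the contract above.
public sealed class TimerReportScheduler : IReportScheduler
{
    public Task ScheduleAsync(Guid reportId, DateTimeOffset runAt, CancellationToken ct)
    {
        // Scheduling details elided; the point is that they can change freely.
        return Task.CompletedTask;
    }
}
```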

API documentation, then, is the only purely-practical piece of documentation that holds a high priority. Much of this effort is spent making sure that your software, which is meant to be consumable since it has an API, is easy to integrate. Maintenance effort spent here, unlike in implementation design, is always worthwhile as it directly contributes to the quality of the product. Additionally, a well-designed API should change fairly rarely (even though the implementation might change constantly), minimizing the upkeep. Further reducing the upkeep are tools that can automatically generate API documentation or assist in creating it, which is never possible for implementation design.

Application Design Philosophy

In the distributed architecture world, it helps to think of application design a little differently. We often think of applications as expressions of a product or feature set. Especially when coming from a monolithic n-Tier Architecture background, it’s easy to see an application as an engineer in its own right: when given a problem, it figures out how to solve the entire problem and then solves it.

For distributed architectures, it’s better to instead think of each service as part of an assembly line. The service is responsible for taking the input (let’s say a car door) and producing an output (attaching the door to the car). Its only responsibility is to take doors and attach them to cars. In fact, it doesn’t even really care about the car or the door; it just attaches door-like things to objects on the expected attachment points. If the attachment points don’t exist on either the “door” or the “car”, there’s a problem.

As we’ll learn later on, even the attachment points can be somewhat flexible. If the door has fewer attachment points than the car, but they still line up, the service might attach them anyway. If the door has more attachment points than the car, but the most important points still line up and the door still fits, the service might attach them anyway. This ensures that we don’t need to retrain the service (i.e. rewrite code) when the door or car change slightly. The service can continue trucking along.

The 10-Hops Rule

The 10-Hops Rule is a guideline invented to help ensure overhead is minimized in microservice designs. The name makes the philosophy self-evident: try to keep the satisfaction of any given user-level request below 10 service hops.

This is just a guideline. Going over 10 hops doesn’t immediately cause the entire design to fall apart like a bad game of Jenga. However, remember that every hop adds some amount of latency overhead—10ms is a good rule of thumb to help with planning. By the time you get to 10 hops, you’ve added 1/10th of a second to the processing time of every user-level request simply through service communication.

The 10-Hops Rule also helps control complexity. If a single request must pass through 50 services on its way to completion, keeping track of where that request is along the line becomes much more challenging.

Keep in mind that we’re talking about 10 sequential hops. When you’ve got some services talking asynchronously to each other, or one service talking to many services at once, you can determine the overhead using a sort of tiered diagram, as shown below.

Service Hops By Overhead

Note that the third tier’s overhead isn’t 60ms: services 2 and 3 finished communicating before service 4, so only 40ms were spent waiting at that tier. To put it simply, each tier adds as much overhead as the slowest service in that tier. Complexity, however, is still measured by the total number of services required to satisfy the request.
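As a small sketch of that arithmetic (the per-service latencies below are invented for illustration, not measured from any real system), overhead is summed tier by tier using only the slowest service in each tier, while complexity counts every service involved:

```csharp
using System;
using System.Linq;

// Hypothetical latencies (ms) for the services in each sequential tier.
// Services within a tier run in parallel; tiers run one after another.
int[][] tiers =
{
    new[] { 10 },          // tier 1: the entry service
    new[] { 20, 10 },      // tier 2: two services called in parallel
    new[] { 40, 30, 20 },  // tier 3: the slowest service (40ms) sets this tier's overhead
};

int totalOverheadMs = tiers.Sum(tier => tier.Max()); // 10 + 20 + 40 = 70ms
int complexity = tiers.Sum(tier => tier.Length);     // 6 services satisfy the request

Console.WriteLine($"Overhead: {totalOverheadMs}ms across {tiers.Length} tiers; complexity: {complexity} services");
```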

Fault Tolerance

One of the bits of wisdom learned from our forerunners in the microservices space is the importance of building tolerant systems. While tools like Kubernetes can keep a system’s pieces restarting indefinitely through anything short of a severe crash loop, that doesn’t mean a given service’s design should depend on such behavior. Every piece of the system’s puzzle should still be a complete application; that’s one of the benefits of a distributed system.

Measuring the success of a distributed system by looking at single unit uptimes is a mistake. However, in aggregate, the uptime of the fleet contributes to the system’s uptime of scale. In other words, if your services are constantly crashing and having to start up again, you’re wasting processing time and potentially impacting metrics. This is the lesson others have learned: ensuring that the system’s units are tolerant of faults and errors is critical to maintaining a healthy ecosystem.

There are, of course, times when the application must crash. These should usually be limited to unrecoverable infrastructure-level errors. Otherwise, the top priority is making sure that the service is able to recover cleanly from these sorts of concerns:

Tolerance doesn’t just mean recovering from issues without crashing or corrupting internal state. It can also mean, for example, being lenient about how many problems can exist in an input before it is no longer actionable. The objective is to ensure that small changes in service-to-service communication, transient faults, miscommunications, file I/O issues, and the like do not cause misbehaviors in the application.

Additionally, we want to ensure that compatibility remains high. APIs should be designed in such a way to facilitate evolution without requiring changes on consumers. System infrastructure changes (for example, changing the URI of a resource) should be mitigated through tools like Kubernetes, where they can be abstracted from the beginning, or obviated through a no-deploy central configuration mechanism (again, perhaps in Kubernetes).
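As a minimal sketch of the no-redeploy configuration idea (the variable name and fallback URL below are hypothetical), a service can resolve the location of its dependencies from the environment rather than hard-coding them; this is exactly the kind of value a Kubernetes ConfigMap or the cluster’s service DNS can supply:

```csharp
using System;

// Resolve a dependency's base URI from configuration instead of hard-coding it.
// In Kubernetes, this value would typically come from a ConfigMap entry or the
// cluster DNS name of the service; "REPORT_SERVICE_URL" is a hypothetical key.
string configured = Environment.GetEnvironmentVariable("REPORT_SERVICE_URL")
                    ?? "http://report-service"; // hypothetical in-cluster default
var reportServiceUri = new Uri(configured);

Console.WriteLine($"Report service endpoint: {reportServiceUri}");
```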

Avoiding Retry Storms

You might have read the previous section and thought, “Great! I can make sure all my service-to-service requests are tolerant trivially. I’ll just add a retry to the request!” Congratulations, you just introduced a flaw that could take down the entire system—maybe the entire data center.

The difficulty in distributed systems is the scale: you might have 100 instances of a given service running, all getting requests at the same time. These 100 instances may themselves depend on another fleet of services, those on another, and so on. If Service F becomes unresponsive for some reason (say, a network resource temporarily becomes unavailable), you end up with a traffic jam at that service. All of Service E backs up, followed by Service D, C, B, and eventually A. Everything grinds to a standstill.

“That’s OK,” you say. “As soon as the network resource becomes available again, the jam will resolve.” Perhaps that’s true, except that Service E is programmed to retry its accesses to Service F. It turns out that Service D also retries its accesses to Service E, and Service C does the same with Service D. A lot of applications like to use ~1 second as a retry wait between attempts…

Suddenly you have an exponential explosion of network traffic, as every service in the entire ecosystem starts retrying at the exact same time over and over. Services crash entirely because they starve themselves of network resources, infrastructure dies due to excessive load, and you more or less DDoS the data center from the inside. This is a retry storm, a close cousin of the classic Thundering Herd Problem. Thankfully, it can be mitigated.

Principles of avoiding retry storms:

These mitigations cannot completely eliminate retry storms. No matter what, they will occur to some degree any time the system ends up in a bottleneck state and the bottleneck becomes clogged. By following these guidelines, however, the impact can be minimized.
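A minimal sketch of one such mitigation is capped exponential backoff with random jitter, so callers don’t retry in lockstep (libraries like Polly provide similar policies out of the box; the code below is illustrative, not a prescribed implementation):

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class ResilientHttp
{
    public static async Task<HttpResponseMessage> GetWithBackoffAsync(
        HttpClient client, Uri uri, int maxAttempts = 5, CancellationToken ct = default)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                var response = await client.GetAsync(uri, ct);
                if ((int)response.StatusCode < 500 || attempt == maxAttempts)
                    return response; // success, client error, or out of attempts: stop retrying
            }
            catch (HttpRequestException) when (attempt < maxAttempts)
            {
                // Transient transport failure: fall through to the backoff below.
            }

            // Exponential backoff capped at 30s, plus 0-1s of jitter to spread retries out.
            double delaySeconds = Math.Min(Math.Pow(2, attempt), 30) + Random.Shared.NextDouble();
            await Task.Delay(TimeSpan.FromSeconds(delaySeconds), ct);
        }
    }
}
```

The jitter is the important part: without it, every caller that failed at the same moment wakes up and retries at the same moment, recreating the storm on every cycle.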

Communication Contracts

One of the pesky recurring problems of software design is the management of communication contracts between services. Whether it’s an evolving API, a code-level interface, or the response model from a service, it seems like a common source of regressions whenever two applications need to talk to each other. Since distributed architectures are built entirely on the concept of applications talking to each other, they can sometimes be especially prone to problems arising from miscommunication.

There is no “magic bullet” solution to the issue of communication between applications. In each instance, for both the API into an application and the responses coming out, one must exercise caution and diligence in design. As a general philosophy around how to design your communication contracts, try to follow Postel’s Law.

An implementation should be conservative in its sending behavior, and liberal in its receiving behavior.
Jon Postel

To break this philosophy into more digestible chunks:

Obviously, there can be exceptions. This is a set of guidelines, not rules. Just remember that every compromise on this philosophy increases the chance of regression and the cost of maintenance in the communication between applications.
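A minimal sketch of that philosophy using System.Text.Json (the message shape is hypothetical): the receiver tolerates unknown fields, missing optional fields, and casing differences, while the sender emits only the agreed, documented shape.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// The documented contract. Optional fields get safe defaults on the receiving side.
public sealed class RunReportMessage
{
    public Guid ReportId { get; init; }
    public string Format { get; init; } = "pdf"; // optional; defaulted when absent
    public string? RequestedBy { get; init; }    // optional; may legitimately be null
}

public static class ContractExample
{
    // Liberal receiver: tolerate casing differences; unknown JSON properties are skipped by default.
    private static readonly JsonSerializerOptions ReadOptions = new()
    {
        PropertyNameCaseInsensitive = true
    };

    // Conservative sender: emit exactly the documented shape and omit nulls.
    private static readonly JsonSerializerOptions WriteOptions = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull
    };

    public static RunReportMessage? Read(string json) =>
        JsonSerializer.Deserialize<RunReportMessage>(json, ReadOptions);

    public static string Write(RunReportMessage message) =>
        JsonSerializer.Serialize(message, WriteOptions);
}
```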

Deployment Tactics and Service Management

Creating an application is about more than just building the systems. You must also design and implement the deployment and management strategies for the application. Organizations (including ours) often have methodologies for these DevOps tasks whether they have dedicated DevOps engineers or not, but with distributed architectures it’s critically important that these existing strategies be rethought. The operational rigors of these systems are often more complex, though not necessarily more challenging, than those of monolithic systems. It is ultimately more useful to discuss the health of a distributed system in terms of uptime of scale rather than traditional metrics.

It is strongly recommended that each project team have at least one member fulfilling a DevOps role to design, build, manage, and monitor the deployment of complex distributed applications. However, this isn’t always feasible for a number of reasons. As a result, the entire development team for a project backed by microservices should be fully aware of these tactics, strategies, and patterns for successfully administering and managing the application. To be successful, everyone must participate in DevOps.

Service-Level Agreements (SLAs) and Metrics

Especially when dealing with services at the scope typical of microservice architectures, it is necessary to clearly define SLAs for systems. Without solid lines for “success” and “failure,” these states are open to interpretation by both customers and internal stakeholders. Inevitably, your team will find itself constantly failing as the goal posts move based on shifting priorities, customer happiness or unhappiness, and internal perception. Establishing these contracts (in the abstract sense, though sometimes also literally) allows you to take control of the success and failure states and base them on concrete, measurable metrics.

SLAs are composed of one or more service-level objectives (SLOs). Your ability to base the SLOs on actual metrics ties directly to your system’s ability to measure them. Thus, it’s incredibly important to make recording metrics a first-class part of your design. Even if a metric isn’t used as part of an SLA, record it anyway. The more data you have about the operation, state, and throughput of a system, the more information you have to diagnose issues.
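As a toy sketch (the latencies and the 500ms/99% targets below are invented purely for illustration), checking an SLO ultimately reduces to recording a metric and comparing it against an agreed threshold:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical request latencies (ms) recorded over the measurement window.
var latenciesMs = new List<double> { 120, 85, 240, 95, 610, 130, 75, 450, 88, 102 };

// Example SLO: 99% of requests complete within 500ms.
const double thresholdMs = 500;
const double targetFraction = 0.99;

double fractionWithinThreshold = latenciesMs.Count(l => l <= thresholdMs) / (double)latenciesMs.Count;
bool sloMet = fractionWithinThreshold >= targetFraction;

Console.WriteLine($"{fractionWithinThreshold:P1} of requests under {thresholdMs}ms; SLO met: {sloMet}");
```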

Good SLAs vs. Bad SLAs

The successful implementation of SLAs requires that they meet several criteria. The only meaningful SLA is one that actually has value and can be measured; if it can’t be measured or it can’t be met, it’s not useful.

SLA requirements:

  1. It must be achievable through normal maintenance and upkeep.
    • In other words, the SLA must not require over-allocating staff or working overtime in order to meet it.
    • It must be achievable and reasonably future-proof with infrastructure that exists at the time of the agreement.
    • An SLA that requires overtime, over-allocation, or lacks proper infrastructure is doomed to fail, and therefore not a valid SLA.
  2. SLOs must be measurable. The metrics must be mathematically significant and defined within the agreement.
    • “The customer must be kept happy” is not a valid SLA because it’s not measurable.
    • “Customers must rate our service at least a B+ on surveys” is not a valid SLA because it’s not mathematically significant; there is no meaning to the measurement.
    • “The queuing system must be capable of handling peak load” is not a valid SLA because while peak load is measurable, and throughput rates are mathematically significant, neither of these thresholds are defined in the agreement.
      • Not defining boundaries and thresholds in the agreement leads to scalability problems: peak load will naturally increase over time as the company grows, making it inevitable that the system will eventually fail to meet the SLA.
      • This is why you tend to see SLAs built around uptime and MTBF (mean time between failures) measurements: the target number is measurable, significant, and made part of the agreement (a quick worked example follows this list).
  3. It must be honored.
    • Obvious? Yes, but it should be reinforced: defining an SLA that will never be upheld creates friction on all fronts, wastes significant effort, and spends trust.
    • Trust is the most valuable currency we have with both internal and external stakeholders. Spending it frivolously is not wise.
    • It is better to have no SLA than to define one you can never (or will never) honor.
  4. The responsible team must be held accountable for failures to meet SLA.
    • Given rule #1—that the SLA is achievable—if the team fails to meet SLA, actions must be taken.
    • This does not necessarily imply punitive action; rather, a plan must be formulated to bring the service back into compliance and executed immediately.
    • Some sort of post-examination of the violation should be conducted.
      • Perhaps the environment has changed such that the SLA is no longer maintainable, which should be easily validated with metrics.
        • Often, simple scaling or software tweaks can bring a service back into compliance.
      • Possibly, the team doesn’t have enough resources or people to meet SLA.
      • Sometimes, a defect in the software may cause a violation of SLA.
        • A fix should be identified, tested, and fast-tracked into deployment.
        • Depending on the frequency of this type of violation, a broader look should be taken at the team’s quality practices. Adjustments may need to be made to methodology.
      • It is possible that the team simply failed to be responsible.
        • Assuming that the company is hiring strong staff with good priorities, this should be fairly rare. It’s often mixed up with one of the other issues due to a poor root cause analysis, though it certainly does occur on its own.
        • These scenarios will be handled by the unit’s usual method of addressing underperformance.
  5. It must provide business value.
    • Creating an SLA around 99.999% uptime and a six-month MTBF is great if customers are paying to justify that level of service. Adhering to those standards, however, requires herculean effort and resources.
    • If customers don’t need it (or aren’t paying for it), why waste the time, money, and effort providing it?
    • This isn’t to say that only customer-facing SLAs are important. Internal SLAs are just as important, if not more so since they directly prop up the external ones.
    • Instead, focus on determining the level of service you need to guarantee, and build SLAs on that.
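As a quick worked sketch of why the target numbers must be spelled out (the 99.9% figure below is just an example, not a proposed target), an uptime percentage translates directly into a concrete downtime budget:

```csharp
using System;

// Example only: a hypothetical 99.9% uptime target over a 30-day window.
const double uptimeTarget = 0.999;
var window = TimeSpan.FromDays(30);

var downtimeBudget = TimeSpan.FromTicks((long)(window.Ticks * (1 - uptimeTarget)));
Console.WriteLine($"99.9% over 30 days allows ~{downtimeBudget.TotalMinutes:F0} minutes of downtime");
// Roughly 43 minutes; at 99.99% the budget shrinks to about 4 minutes.
```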

SLAs and Microservices

This section on SLAs exists in this document because of the intrinsic link between the upkeep and management of a highly distributed system and the need for clear SLAs. By definition, the parts of a distributed and/or modular system will fail. Applying the usual monolithic architecture mentality to these failures will cause nothing but overreactions and negative perceptions. At sufficient scale, all distributed systems are some degree of “down” at all times.

The good news is: that’s fine. It may be concerning to plan for a system where failures are commonplace, but part of the philosophy here is that failures are commonplace in every system, distributed or not. In microservices, you have the opportunity to plan for them and mitigate them architecturally. The highly-available, widely-scaled, and fault-tolerant nature of these services means that individual failures are of no concern. Instead, you watch for systemic issues and measure health in terms of uptime of scale. This is why we must have strong SLAs: they provide measurable criteria for health and success.

These measurable criteria describe what can be called the steady state of intended behavior and throughput. As it turns out, the steady state and your system’s SLAs go hand-in-hand; they’re intrinsically bound. The primary way to identify systemic issues at scale is to compare the current state against the steady state.

The Stoplight Strategy

To help with the management and monitoring of our distributed systems, a strategy has been devised to amplify the discoverability of systemic issues: the Stoplight Strategy. The idea is simple: a given system’s state is represented by how closely it is currently adhering to its steady state. Our existing monolithic system is either up or down, but distributed architectures have a whole gradient of grey area in between. As alluded to earlier, most distributed systems at scale are always some degree of “down”. The definitions of success and failure have to change. This leads us to a metric-based, SLA-oriented model of defining green, yellow, and red system states.

It is critically important that we define the thresholds, SLOs, and criteria for a system’s green, yellow, and red states early and concretely. They must be reevaluated often in order to keep in sync with the current SLAs and demands.

The green state is when the system meets or exceeds its established expected performance, uptime of scale, latencies, throughputs, and so on. All SLOs are being met or exceeded.

Once we know our green state, we can start to define our yellow state. The yellow state is an operational state, but may be suffering from factors that impact user experience. Examples include high latencies, dangerously low uptime of scale, or an abnormal number of errors across the system in a given time window. The yellow state is incredibly important because it indicates two things:

  1. Failure is imminent
  2. There is an opportunity to improve the state of the system before failure occurs

Red state occurs when some part of the system is fully down, or at least one SLO is not being met.
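A minimal sketch of how those state definitions might be encoded (the metric names and thresholds below are placeholders, to be replaced by whatever SLOs we formally agree on):

```csharp
using System;

public enum SystemState { Green, Yellow, Red }

// Hypothetical snapshot of the metrics we compare against the steady state.
public record HealthSnapshot(double UptimeOfScale, double P95LatencyMs, double ErrorRate, bool AnySloViolated);

public static class Stoplight
{
    public static SystemState Evaluate(HealthSnapshot s)
    {
        // Red: some part of the system is fully down, or at least one SLO is violated.
        if (s.AnySloViolated)
            return SystemState.Red;

        // Yellow: still operational, but drifting away from steady state (placeholder thresholds).
        bool degraded = s.UptimeOfScale < 0.95
                     || s.P95LatencyMs > 750
                     || s.ErrorRate > 0.02;

        return degraded ? SystemState.Yellow : SystemState.Green;
    }
}
```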

The Stoplight Strategy In Practice

Keep in mind that a green state doesn’t necessarily indicate that the system is 100% healthy. It only indicates that the system is meeting or exceeding the criteria we’ve established, and that’s OK. We might still look at other metrics to evaluate whether there are concerns worth investigating, but assuming the green state is well-defined, the system is working as expected. Any concerns found should not be critical, though it is wise to watch for trends that will eventually lead the system into a yellow state.

On the other hand, being in yellow state means that investigations should begin immediately. While the state is not critical yet, it usually indicates that it will become critical. If a system is in the red state, it should be considered completely down, even if it might still be servicing some requests.

The intentions behind this strategy are almost more important than the strategy itself:

You might notice that this system bears many similarities to something like Azure’s Service Status page. This is fully intentional. The two strategies serve the same purpose, though for slightly different audiences. In fact, there’s a lot of value in providing an internal version with greater detail and a version for customers that gives a different, simplified perspective.

A simple stoplight-inspired indicator of state will never be sufficient to thoroughly monitor system states. It is not intended to be the only metric of health, or the only touchpoint of support. However, it does provide an easy-to-understand and quick reaction pivot point for all involved engineers. One of the problems in many of our existing systems is that we have a hard time saying—at even a basic level—whether they’re healthy or not. We almost completely lack visibility. We have difficulty reacting quickly to changes in state, because there is no human-readable way to get state at a glance. This indicator helps by acting as a canary for potential issues.

CI and CD

CI (Continuous Integration) and CD (Continuous Deployment, in this case) are vital practices to maintaining a distributed application. A quick definition for each to help set context:

Together, these processes help us control the complexity of distributed architectures. By employing CI, we ensure that each unit of the application is always in a good state—that is, not completely broken or poorly integrated. Regressions are mitigated by automated testing. Most development teams use some form of CI today; there is little value in preaching its benefits further.

Continuous deployment is a more challenging process, but it too ensures that the application’s units remain in a good state. Continuous delivery is part of continuous deployment: the units must be easily deliverable and always deployable. Going the extra step of automatically deploying units as soon as they’re ready provides a ton of value, assuming your infrastructure and design are capable of zero-downtime deployments.

This has been implied above, but each unit of your distributed application should be a separate CI/CD pipeline. In other words, each unit should be capable of building and deploying independently of other units. Thanks to tools like Kubernetes and Docker, this becomes significantly more straightforward.

Docker allows each unit to dictate its own build process in a Dockerfile, and Kubernetes makes it simple for each unit to declaratively define its deployment in a .yaml file. As these are (mostly) universal concepts, we no longer need to rely as much on extensive custom build and deploy scripts in a system like TFS. For monolithic applications, writing custom build/deploy scripts isn’t the worst thing in the world, but at distributed scale that approach is impossible to maintain.

Example CI and CD Workflow

Here’s an example of how our CI/CD workflow might look. This isn’t exactly how our actual workflow will look, but it should help demonstrate the concepts.

  1. A developer named John Doe makes changes to the code in a branch. Let’s call it john.doe/feature-branch for example’s sake.
  2. John commits code to john.doe/feature-branch during active development.
    • If necessary, John has a path on the Docker registry (e.g. Harbor) for pushing images manually for testing, though they cannot push to certain protected paths like the root.
    • A Test environment exists as well for manually testing deployments and for testing units (or whole applications) at scale without affecting Production resources.
  3. After development and initial testing, a pull request is submitted to merge john.doe/feature-branch into master.
  4. The PR contains not just code changes, but potentially build and deploy changes as well. The repository’s policy requires that the DevOps group approve the PR if the build and/or deployments change, in addition to the usual approvals.
  5. The DevOps group and the usual suspects approve the PR. It merges into master.
  6. The build server detects a change to master and initiates the CI process.
    • The image is built and automated tests are executed.
  7. Assuming the build and tests pass, the new image is pushed to the registry under a specific version number.
  8. After the image is pushed to the registry, it is vulnerability-scanned. Assuming the scan completes successfully, the CI process is complete. The image is tagged as latest in the registry.
  9. Now the CD process begins. The build server initiates a deployment of the Kubernetes .yaml.
  10. Since the DevOps group signed off on the PR, the deployment should be safe. Using k8s mechanisms for upgrading deployments, the existing pods are rolled off one-by-one, with new ones created in their place.
  11. Assuming the deployment enters a good state before a configured timeout passes, the CD process is complete.
  12. The new unit can be considered fully deployed now.

Summary

After a solid amount of research and evaluation, this document captures the current philosophy behind ATLAS’s approach to distributed and microservice-oriented architectures. This philosophy will change with time and experience, but it represents our idealistic strategy at the time of writing. As mentioned in the first section, we always want to design the “correct” implementation first. Make compromises only when the correct design is unachievable, and make them as small as possible. If you begin designing from a mindset of compromise, your design will never reach the heights it might otherwise have reached, even with limitations and setbacks.

The collection of strategies around implementing distributed architectures is a first draft. These, too, will expand and evolve as we gain experience. Perhaps the idea of a “stoplight” system for measuring system health is too naïve to bring value, or perhaps it’s too complex for the target audience even as presented. Maybe it turns out that the Circuit Breaker pattern just doesn’t work for us. It’s possible that continuous deployment turns out to be incompatible with our team structures. Most likely, we will add powerful solutions we discover along the way. At no point should this document be considered final or even authoritative.

The document is, however, a set of guidelines and philosophies that should help everyone ease into this new way of thinking. That is its only true objective. Putting yourself into the mindset for successfully creating a distributed system—that is, avoiding the common pitfalls and long-term pain points that often plague distributed systems built by monolithic teams—requires more than a basic understanding of the tools. It requires a philosophical and fundamental shift in many of the assumptions and foundations we have built up over years of experience. As alluded to before, distributed systems have existed since the birth of networked computing; today they’re just easier than ever to create and manage. The tools we have now make designs practical that were once infeasible.