As a project grows, it’s common that each service communicates with a variety of others. For example, a checkout process to reserve a hotel room might include the following steps:
It would be difficult for all of these steps to live in one service, and frankly, they shouldn’t. Some of these steps could be carried out by third-party libraries.
Let’s suppose that the process that validates our credit card stops working. How should our application respond to this failure? Can it recover from this error? The answers to these questions help us gauge whether our application is resilient. We can say that an application is resilient if it has the capacity to recover from a failure and continue operating.
One way to achieve resiliency is through a design pattern known as the Circuit Breaker.
This pattern is often used in microservices and gets its name from electrical engineering. It refers to a device that can interrupt the flow of current through a circuit. In our case, we use the name for software that cuts communication between processes when one of them stops responding.
Why would we use this design pattern?
This brings us to the question: should we protect all of our resources with a circuit breaker? In my opinion, we only need to worry about those that are critical because they integrate with multiple services.
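A circuit breaker works with three distinct states: closed (calls reach the resource as normal), open (calls are rejected immediately, without reaching the resource), and half-open (a trial call is allowed through to check whether the resource has recovered).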
Note: The names of the states can be confusing, but it helps to remember that the circuit breakers we use here are analogous to electrical circuits: a closed circuit lets current flow, and an open circuit interrupts it.
We use a failure counter and a threshold to determine when to change from one state to another. The counter is updated according to whether each call to a resource succeeds or not. When a call fails, we add 1 to our count of errors. Once the count reaches the threshold, we change the circuit breaker to open, which means we should not execute any more calls to this resource. Instead, we should immediately return an error, without advancing in our execution.
After a certain amount of time (let’s say 5 minutes since the last failure), we can change the state to half-open and allow one request to proceed to the resource. If it doesn’t fail, we reset our failure counter and change the state back to closed. If it does fail, the breaker returns to open and the wait starts again.
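To make these transitions concrete, here is a minimal sketch that uses only the standard library. It is not how any particular library implements the pattern; the names SimpleCircuitBreaker, CircuitOpenError and the injectable clock parameter are assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class SimpleCircuitBreaker:
    """Illustrative breaker: closed -> open -> half-open -> closed."""

    def __init__(self, fail_max=3, reset_timeout=300.0, clock=time.monotonic):
        self.fail_max = fail_max            # failure threshold
        self.reset_timeout = reset_timeout  # seconds to wait while open
        self._clock = clock                 # injectable so it can be tested
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # While the wait period has not elapsed, fail fast.
            if self._clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, call rejected")
            # Wait period is over: let one trial call through.
            self.state = "half-open"
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        if self.state == "half-open":
            # The trial call succeeded: reset the counter and close.
            self.failures = 0
            self.state = "closed"
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial call, or reaching the threshold, opens the breaker.
        if self.state == "half-open" or self.failures >= self.fail_max:
            self.state = "open"
            self.opened_at = self._clock()
```

Any callable can be wrapped with breaker.call(func, ...); once the threshold is reached, callers receive CircuitOpenError right away instead of waiting on a resource that is known to be failing.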
In Python, there are quite a few options; here, we will explore pybreaker as an example.
fail_max indicates the maximum number of errors allowed before the state changes to open.
reset_timeout indicates how long (in seconds) the breaker waits after our last error before changing to half-open.
Next, let’s take a look at how pybreaker implements the states of the circuit breaker.
In the implementation for these states, we see that the request is executed in the following line:
ret = func(*args, **kwargs)
Here, func is validate_credit_card from our example. This execution has two possible outcomes:
If the current state is closed, then we don’t need to do anything in on_success.
If the current state is half_open, then we change the state to closed in on_success.
In the implementation of the open state, we check the timeout: if the wait period specified in timeout has not yet elapsed, we don’t call the resource and we interrupt execution. Otherwise, we change the state to half_open and execute the process (as described for the earlier example).
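The state handling described above can be sketched with one small class per state. This is a simplified illustration written for this article, not pybreaker’s actual source: the class names, the clock parameter, and the exact layout of handle_call, on_success and on_failure are assumptions.

```python
import time


class CircuitBreakerError(Exception):
    """Raised when the breaker refuses to call the resource."""


class Breaker:
    """Holds the shared data; the state objects hold the behavior."""

    def __init__(self, fail_max, reset_timeout, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.fail_counter = 0
        self.opened_at = None
        self.state = ClosedState(self)

    def call(self, func, *args, **kwargs):
        return self.state.handle_call(func, *args, **kwargs)

    def open(self):
        self.opened_at = self.clock()
        self.state = OpenState(self)


class State:
    name = "base"

    def __init__(self, breaker):
        self.breaker = breaker

    def handle_call(self, func, *args, **kwargs):
        try:
            ret = func(*args, **kwargs)  # the protected call
        except Exception:
            self.breaker.fail_counter += 1
            self.on_failure()
            raise
        self.on_success()
        return ret

    def on_success(self):
        pass  # closed state: nothing to do on success

    def on_failure(self):
        pass


class ClosedState(State):
    name = "closed"

    def on_failure(self):
        # Threshold reached: stop calling the resource.
        if self.breaker.fail_counter >= self.breaker.fail_max:
            self.breaker.open()


class HalfOpenState(State):
    name = "half_open"

    def on_success(self):
        # The trial call worked: reset the counter and close again.
        self.breaker.fail_counter = 0
        self.breaker.state = ClosedState(self.breaker)

    def on_failure(self):
        # The trial call failed: go back to open and wait again.
        self.breaker.open()


class OpenState(State):
    name = "open"

    def handle_call(self, func, *args, **kwargs):
        elapsed = self.breaker.clock() - self.breaker.opened_at
        if elapsed < self.breaker.reset_timeout:
            # Wait period not over: interrupt without calling func.
            raise CircuitBreakerError("timeout not elapsed, still open")
        # Wait period over: switch to half_open and run the trial call.
        self.breaker.state = HalfOpenState(self.breaker)
        return self.breaker.state.handle_call(func, *args, **kwargs)
```

Keeping each state in its own class means the breaker itself never needs a chain of if statements; it simply delegates to whichever state object is current.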
The Circuit Breaker pattern allows us to protect the critical processes of our application, making them more resilient and letting them adapt to the current state of the services in use. We should consider using circuit breakers in processes that invoke different services, or that are costly in terms of processing, time, or money. Circuit breakers can also lead to a better user experience, since users won’t have to complete a series of steps only to encounter a fatal error at the end.