The obvious question that every responsible client asks is: "What happens when the service is down?"
We know that Authress is critical production infrastructure, and so we can't afford for Authress to be down. There are rare incidents that can affect both your infrastructure and ours, so even in these circumstances we need a concrete strategy for being alerted and for how we respond.
Potential failure modes
Let's first discuss what it means for Authress to be down. There are a few different levels, so it makes sense to call out each one. For each of these, we review the part of the Authress architecture that prevents issues at that level. Where nothing can be automated, we call out what our strategy looks like.
One of our cloud provider's regions is down
We use multiple regions in our cloud provider. At any time, one of these regions can be down. (A variant of this problem is that the region is actually broken, yet all the provider's health checks still report success.)
The spectrum for this problem ranges from a single availability zone (AZ) being down to the whole region having an outage. While it's unlikely that all availability zones are out simultaneously, they are geographically close, so a single event could affect them all. More frequently, a single-point-of-failure resource for the region is down. Since this happens far too frequently, Authress utilizes both cross-AZ and cross-region redundancy to fail over automatically. This failure mode also includes:
- Routing is down.
- The cloud provider can route requests to the region and to our services in the region, but networking within the region is down.
- Requests are routed correctly, but our services fail to handle them, for example due to corrupted packets.
Countermeasure: Authress runs in multiple regions and uses signals from the cloud provider, as well as our own internal health checks, to know whether a region is down. Some endpoint routes switch automatically to another region; in other cases, whole requests are rerouted to working locations. External and internal DNS are updated as required.
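The key detail above is that a region is only trusted when both signal sources agree. A minimal sketch of that decision, with hypothetical names (the real routing logic is internal to Authress):

```typescript
// Hypothetical sketch: a region is only considered up when both the cloud
// provider's signal and our own internal health check agree, guarding against
// the case where the provider reports healthy but the region is broken.
interface RegionHealth {
  region: string;
  providerHealthy: boolean;  // the cloud provider's health signal
  internalHealthy: boolean;  // our own synthetic-request check
}

function pickRegion(regions: RegionHealth[], preferred: string): string {
  const healthy = regions.filter(r => r.providerHealthy && r.internalHealthy);
  if (healthy.length === 0) {
    throw new Error("no healthy region available");
  }
  // Stay in the preferred region when it is healthy; otherwise fail over.
  const match = healthy.find(r => r.region === preferred);
  return match ? match.region : healthy[0].region;
}
```

The same shape applies whether the failover target is a DNS record, a route table entry, or a per-endpoint override.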
One of the cloud provider's services is down in a region
There are many cloud provider resources that Authress uses. Any one of these could experience an issue.
Countermeasure: Authress uses retries, multi-path execution, and proactive regional circuit breakers to decide which resolution to take. In most cases, alternative resources are utilized. In other situations, Authress marks the region as down and routes requests as if the whole region were down.
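To illustrate the circuit-breaker part of this countermeasure, here is a deliberately simplified sketch (the names and thresholds are hypothetical, not Authress' actual implementation): once a resource in a region fails repeatedly, it is marked down and callers reroute instead of retrying in place.

```typescript
// Hypothetical per-region circuit breaker: after a threshold of consecutive
// failures the region is treated as down, so callers reroute rather than
// continuing to send traffic into a failing region.
class RegionCircuitBreaker {
  private consecutiveFailures = 0;

  constructor(private readonly threshold: number = 5) {}

  recordSuccess(): void {
    // Any success closes the breaker again.
    this.consecutiveFailures = 0;
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
  }

  // When open, treat the region as down and route as if it had a full outage.
  isOpen(): boolean {
    return this.consecutiveFailures >= this.threshold;
  }
}
```

A production breaker would also include a half-open state to probe for recovery; that is omitted here for brevity.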
The cloud provider and routing works, but something is broken
It's possible that there is a transient issue with one of Authress' services or endpoints. During such an incident, clients might see temporary 5XX response status codes. These indicate that retrying makes sense.
Countermeasure: Authress tracks 5XX responses caused by our infrastructure and by core software components. When error rates rise above our thresholds, automated changes are deployed to correct the issue. In some of these cases that means a rollback; in most situations, faulty services are automatically taken out of rotation for secondary investigation.
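On the client side, "retrying makes sense" usually means exponential backoff with jitter. A sketch of what that could look like, where `attempt` is a stand-in for any SDK or HTTP call (the helper names are illustrative, not part of the Authress SDK):

```typescript
// Deterministic part of the schedule: 100ms, 200ms, 400ms, ... per attempt.
function backoffMs(attemptNumber: number, baseMs: number = 100): number {
  return baseMs * 2 ** (attemptNumber - 1);
}

// Hypothetical retry wrapper: retry only on 5XX (transient server-side issues),
// never on 4XX, which indicates the request itself would not succeed on retry.
async function withRetries<T>(
  attempt: () => Promise<{ status: number; body?: T }>,
  maxAttempts: number = 3,
): Promise<{ status: number; body?: T }> {
  let last = await attempt();
  for (let tries = 1; tries < maxAttempts && last.status >= 500; tries++) {
    // Add up to 50ms of jitter so synchronized clients don't retry in lockstep.
    const delay = backoffMs(tries) + Math.random() * 50;
    await new Promise(resolve => setTimeout(resolve, delay));
    last = await attempt();
  }
  return last;
}
```

Most Authress SDKs already handle transient retries internally; this pattern matters mainly for hand-rolled HTTP integrations.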
Authress returns more 4XX than the baseline
Negatively impactful changes may not necessarily cause hard errors. Even in the absence of 5XX responses, functionality might not be operating correctly. An example: Authress returning a 4XX where a 2XX is expected.
Countermeasure: Authress measures the successful performance and execution of endpoint responses. When these deviate from the baseline, our team is automatically alerted to start the investigation process. Given the nature of these issues, no automated resolution can be put in place to fully resolve the problem. However, since Authress uses blue/green deployments, as well as parallel execution in some situations, these issues are significantly reduced. Additionally, production health check tests verify production functionality after all other levels of internal testing have completed.
Spikes in usage
While everything may appear to be working correctly, spikes are monitored because they may indicate faulty infrastructure, faulty core components, or an external malicious agent.
Countermeasure: Authress uses automated analytics, with feedback loops into infrastructure controls, to detect and block malicious requests. Before automated blocking occurs, these requests are inspected for validity to ensure legitimate customer requests are not incorrectly dropped.
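One simple shape such analytics can take is comparing each measurement window against a rolling baseline. This sketch is a hypothetical illustration of the idea, not Authress' detection logic:

```typescript
// Hypothetical spike detector: flags a window whose request count exceeds a
// multiple of the rolling baseline. Non-spike windows feed the baseline via an
// exponential moving average; spike windows are excluded so a sustained attack
// does not silently become the new normal.
class SpikeDetector {
  constructor(
    private baselinePerWindow: number,
    private readonly spikeFactor: number = 3,
  ) {}

  // Returns true when the observed window count looks like a spike.
  observe(windowCount: number): boolean {
    const isSpike = windowCount > this.baselinePerWindow * this.spikeFactor;
    if (!isSpike) {
      this.baselinePerWindow = 0.9 * this.baselinePerWindow + 0.1 * windowCount;
    }
    return isSpike;
  }
}
```

Flagged windows would then go through the validity inspection described above before any blocking occurs.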
Authress has the above countermeasures in place to support the expected SLA. There are additional client-side protections that can be put in place, which also have secondary benefits for performance and for reducing call frequency and load:
Service level: In memory caching via memoization
The first level of optimization is short-lived in-memory caching of responses. The strategy is to call Authress and cache successful authorizations for a short duration (less than a minute). Subsequent authorization requests can go to both, memory and Authress, serving the cached result immediately and then, in the background, storing the updated authorization in the local cache.
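The strategy above can be sketched as a small stale-while-revalidate cache. Here, `AccessCheck` is a stand-in for the real Authress authorization call, and the names and TTL are assumptions for illustration:

```typescript
// Sketch of short-lived in-memory authorization caching. Cache hits are served
// immediately while a background call to Authress refreshes the entry, so the
// cache stays warm without adding latency to the request path.
type AccessCheck = (userId: string, resource: string) => Promise<boolean>;

class AuthorizationCache {
  private cache = new Map<string, { allowed: boolean; expiresAt: number }>();

  constructor(private check: AccessCheck, private ttlMs: number = 30_000) {}

  async isAuthorized(userId: string, resource: string): Promise<boolean> {
    const key = `${userId}:${resource}`;
    const entry = this.cache.get(key);
    if (entry && entry.expiresAt > Date.now()) {
      // Serve from memory, refresh in the background.
      this.check(userId, resource)
        .then(allowed => this.cache.set(key, { allowed, expiresAt: Date.now() + this.ttlMs }))
        .catch(() => { /* keep the cached value on a transient failure */ });
      return entry.allowed;
    }
    const allowed = await this.check(userId, resource);
    // Per the strategy above, only successful authorizations are cached.
    if (allowed) {
      this.cache.set(key, { allowed, expiresAt: Date.now() + this.ttlMs });
    }
    return allowed;
  }
}
```

Keeping the TTL under a minute bounds how long a revoked permission can linger, which is why the strategy deliberately avoids long cache durations.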
Platform level: Set up a proxy
A second level of optimization could be to proxy all the service calls to Authress from multiple services. This ensures freshness through scale, but managing a proxy is not without issues. The problem with a cache at the proxy level is that infrequent authorizations can be cached for too long; going down this path brings the classic challenge of cache invalidation.
First pass authorization + Async secondary verification
In many deployments, requests contain additional information that enables a first-level verification. Such a verification might be checking that the tenantId in the user's access token matches the tenant that owns the resource in the request. For asynchronous requests, the tenant verification can be done synchronously, and the full authorization can be queued with the request, then finally verified, once the status incident has resolved, before the async action is completed.
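The two-phase flow described above could be sketched as follows. Token parsing, the claim names, and the queue shape are all hypothetical stand-ins for your own infrastructure:

```typescript
// Hypothetical first-pass check: compare the tenantId claim in the caller's
// access token against the tenant that owns the requested resource.
interface AccessTokenClaims {
  sub: string;       // the user
  tenantId: string;  // the tenant the token was issued for
}

function firstPassAuthorized(claims: AccessTokenClaims, resourceTenantId: string): boolean {
  return claims.tenantId === resourceTenantId;
}

// For asynchronous work: accept the request on the first pass and enqueue it
// with the full authorization still pending. The deferred Authress check runs
// before the queued action is actually executed.
interface QueuedAction {
  claims: AccessTokenClaims;
  resourceUri: string;
  fullyAuthorized: boolean; // set later by the deferred full check
}

function enqueueIfTenantMatches(
  claims: AccessTokenClaims,
  resourceTenantId: string,
  resourceUri: string,
  queue: QueuedAction[],
): boolean {
  if (!firstPassAuthorized(claims, resourceTenantId)) {
    return false; // fail fast: wrong tenant never even enters the queue
  }
  queue.push({ claims, resourceUri, fullyAuthorized: false });
  return true;
}
```

The first pass is a coarse filter, not a replacement for the full authorization; its job is only to keep obviously invalid cross-tenant requests out of the queue.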
Other mitigation strategies exist as well; however, the ones not listed above come with more downsides than the protection they would provide. We list some common unsuccessful strategies below, with justifications.
Turning off requests to Authress
When Authress is having a status incident, and none of the mitigation strategies above are available, the best solution is to return a 503 to your caller and expose a temporary status incident to your users. It is far better to throw an error than to assume that a user has access to a resource that they should not have. We have found that a small but non-trivial number of requests are 404s, representing both users accessing resources they shouldn't have and services using the failed authorization check to drive user-specific behavior in UIs. Returning a success instead of an error might cause the wrong user flow to be executed.
Another way to think about this is: do your users care more about security or about reliability? Often, legal requirements dictate security over reliability.
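Failing closed, as recommended above, can be sketched in a request handler. The response shape and helper names here are illustrative, not a specific framework's API:

```typescript
// Sketch of failing closed: when the authorization check itself errors out,
// surface a retryable 503 rather than guessing that the caller has access.
interface MinimalResponse {
  statusCode: number;
  body?: string;
}

async function guardedHandler(
  authorize: () => Promise<boolean>,
  handler: () => Promise<string>,
): Promise<MinimalResponse> {
  let allowed: boolean;
  try {
    allowed = await authorize();
  } catch {
    // The authorization service is unreachable: temporary status incident.
    return { statusCode: 503, body: "Temporarily unavailable, please retry" };
  }
  if (!allowed) {
    // Denied requests get a 404 so the resource's existence is not leaked.
    return { statusCode: 404 };
  }
  return { statusCode: 200, body: await handler() };
}
```

The important property is that the error path and the denied path stay distinct: a 503 tells the caller to retry later, while a 404 tells them the request will never succeed as-is.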
Building duplicated infrastructure
One common question is whether or not to add additional technology on the client side to avoid possible downtime. A good rule of thumb is to ask:
- In the hypothetical case that Authress is a service run by just another team in your platform, would this countermeasure still be built?
- Is this something that should not be handled directly by the Authress service?
If the answer to both of these questions is yes, and it isn't listed in the mitigation strategies above, definitely build it, and then let us know so we can update this article to include your solution. Alternatively, if it is something that Authress should do on the service side, also let us know.
However, if such technology would not be built under those circumstances, then you are losing value that Authress provides for your platform. Authress is optimized to work in the normal case as just another service in your platform; adding additional technology against this recommendation actually increases the risk of incidents on the client side, which the Authress team would not be able to help support.