
Ensuring the reliability of Authress



How to build for 5 nines

Authress offers a 99.999% SLA. There can be many reasons for requiring that kind of uptime: regulatory needs, being a core dependency for other services, or carrying life-saving responsibilities. In those circumstances, it's critical that your service never goes down. Here, we'll review how we've architected, built, and maintained a solution with such a high SLA, the frameworks and strategies we've established, and what it takes to keep up that commitment. Presented at AWS Summit Zurich 2024.

Transcript

Okay. Hello and welcome. I'm Warren Parad, and I'm the CTO at Authress. Today, I'm going to be talking about how we built one of the most reliable Auth solutions available. When I say reliability, what I mean is that we offer a five-nines SLA.

That means 99.999% of the time, our service is up and running and working as expected by our customers and their end users. To put this into perspective, in this giant blue rectangle I have up on the screen, the red represents the amount of time in which we can be down. If you can't quite see it, that's because it amounts to just five minutes and fifteen seconds per year of allowed downtime to keep our SLA promise. It very much means that we just can't be down. For the rest of the presentation, I want to walk through some of the challenges that we've had to overcome to make this a reality, as well as how we actually achieve it using AWS as our infrastructure.

And to do that, I want to add a little bit of context about what Authress actually does. Authress is a login and access control solution designed specifically to add that functionality to, and protect, the applications that you write. It offers user management, user identity, authentication, access control, permissions, role- and resource-based authorization, as well as API key management for both your services and your end users to interact with your APIs. It also offers audit trails and a number of other features to help complete the picture. So hopefully, with this said, you can already start to imagine why we have to offer a five-nines SLA and some of the challenges that we may face.

If we forget for a moment about the need to maintain a high uptime and just think about the initial architecture we could use to build our product, we might end up with something that looks a little bit like this. We could have a single region with some HTTP interface, either a gateway or an ALB, to handle requests and pass them to Lambda, EC2, or some other compute, as well as a database, RDS or DynamoDB, and some ancillary services. Now, I'm sure you're thinking: a single region with these components. It's already pretty obvious that there's a problem here that we're going to have to overcome. The Amazon CTO is frequently credited with saying that "everything fails all the time".

And AWS's services are certainly no exception. Over the last decade, there have been more and more incidents in AWS, any one of which would have had a serious impact on our service and on what we offer to our customers. And if we tried to rely on the AWS infrastructure directly and just take AWS's SLAs as our own, we would very quickly run into a problem. The Lambda SLA is below five-nines, the API Gateway SLA is below five-nines, the SQS SLA is below five-nines. We're not really getting anywhere here.

And so, fundamentally, the first kind of problem that we have to overcome is infrastructure failures. These are problems with the infrastructure that we're utilizing in AWS, as it's delivered to us by default. Going back to the initial architecture as an example: there could be an issue with our database, where we can't connect to it, or connections are intermittent, timing out, or returning five hundreds. There could be an issue with our compute resource, either in the control plane spinning up new containers or in getting requests to them. Or there could be an issue with actually handling the incoming requests and returning responses to our customers; networking could be a problem. In any of these cases, if we're only running in a single region, we pretty much have to declare bankruptcy and accept that the region is down.

So this is obviously not sufficient for us. We have to handle this very explicitly, and the way we do that is by duplicating the infrastructure one for one from a primary region to a secondary region. When there's an issue in that primary region, we fail over automatically to the secondary region, so our customers hopefully are never aware that there's actually an incident.

This requires having all the infrastructure and the data duplicated in both places. Okay, so that's what we want to achieve. How do we actually do this in AWS? We use dynamic routing in Route 53.

We designate one of the regions as the primary and one as the failover region, and when there's an issue, Route 53 will automatically fail over to the appropriate region. We have to have the same infrastructure in every single region that we operate in, and since we operate in five different regions total, we actually have ten sets of our infrastructure out there in case of a failover event. The failover region in each set is geographically close to the primary region, so that during a failover an incident is very nearly undetectable by our customers.
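To make the DNS side of this concrete, here is a rough AWS CDK sketch of a failover record pair tied to a health check. The hosted zone ID, domain names, and regions are placeholders rather than our actual configuration.

```typescript
// A minimal CDK sketch of Route 53 failover routing: a PRIMARY and a SECONDARY record
// for the same name, where the primary is only served while its health check passes.
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as route53 from 'aws-cdk-lib/aws-route53';

export class FailoverDnsStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const hostedZoneId = 'Z0000000EXAMPLE'; // hypothetical hosted zone
    const recordName = 'api.example.com';   // hypothetical API domain

    // Health check that probes the primary region's custom health endpoint.
    const primaryHealthCheck = new route53.CfnHealthCheck(this, 'PrimaryHealthCheck', {
      healthCheckConfig: {
        type: 'HTTPS',
        fullyQualifiedDomainName: 'eu-west-1.api.example.com',
        resourcePath: '/health',
        requestInterval: 10,  // seconds between probes from each checker
        failureThreshold: 2,  // consecutive failures before marking unhealthy
      },
    });

    // PRIMARY record: served while the health check is passing.
    new route53.CfnRecordSet(this, 'PrimaryRecord', {
      hostedZoneId,
      name: recordName,
      type: 'CNAME',
      ttl: '60',
      setIdentifier: 'primary-eu-west-1',
      failover: 'PRIMARY',
      healthCheckId: primaryHealthCheck.ref,
      resourceRecords: ['eu-west-1.api.example.com'],
    });

    // SECONDARY record: Route 53 answers with this target when the primary is unhealthy.
    new route53.CfnRecordSet(this, 'SecondaryRecord', {
      hostedZoneId,
      name: recordName,
      type: 'CNAME',
      ttl: '60',
      setIdentifier: 'secondary-eu-central-1',
      failover: 'SECONDARY',
      resourceRecords: ['eu-central-1.api.example.com'],
    });
  }
}
```

The SECONDARY record is only returned once the primary's health check reports unhealthy, which is what makes the failover automatic.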

The way we actually know that there's an incident, and the way we communicate through Route 53 that there's a problem, is Route 53 health checks. Health checks allow you to communicate that there's a problem to the dynamic failover records so that Route 53 routes new requests to the appropriate location. Our health check looks something like this: it comes in and communicates with the gateway and verifies that it's up. That executes a request in our compute, which may interact with the database.

Now, you could be utilizing the standard health checks that come out of the box, but we significantly configure the health checks we're using and run something custom that allows us greater control over when to fail over from one region to another. This is really critical for us, because the last thing you want to do is fail over from an almost-working region to a region that isn't working, or fail over and not actually fix the problem. That can make it worse! So instead of relying on just the health checks that AWS gives us, we have an in-depth execution which verifies whether or not we're down in a particular region. I have a code snippet up here, for example.

It's not exactly what we're doing, but I think it's a pretty good proxy. About 22 times a minute, the Route 53 health check will fire off from different places around the world and verify that this code worked as expected. We also profile our requests to make sure that we're handling all our customer requests in an optimal way. If a response is too slow, the profiler will catch that and also mark the region as down.

Fundamentally, we check a few different things:

The first is that we go to our database and make sure that we can actually fetch most of the resources we need in order to perform authorization checks correctly. We make sure that the core logic in our authorizer is working correctly; this is the primary functionality that most of our customers use. We may also verify secondary services in AWS: if enough of them are down, it could cause an impact that is visible to our customers, and that could be a reason to fail over as well.

And then we have some logic validations to make sure that the code we actually deployed is working correctly. Maybe there was an issue with the deployment package, or what's actually running there isn't correct; this would catch it. Then, of course, we evaluate the result of these requests and return success or failure back to the health check, as sketched below.
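A simplified sketch of what such a deep health-check handler could look like is below. This is not our actual code; the table name, the health-check record, and the evaluateAccess function are hypothetical stand-ins for the real checks.

```typescript
// A sketch of a deep health-check endpoint: verify the database, the core authorizer logic,
// and the overall latency, and return a failure so Route 53 marks the region unhealthy.
import { DynamoDBClient, GetItemCommand } from '@aws-sdk/client-dynamodb';
import type { APIGatewayProxyResultV2 } from 'aws-lambda';

const dynamoDb = new DynamoDBClient({});
const MAX_CHECK_DURATION_MS = 2000; // treat slow responses as failures too

// Placeholder for the real authorizer logic under test.
function evaluateAccess(userId: string, resource: string, permission: string): boolean {
  return userId === 'health-check-user' && resource === 'resources/health' && permission === 'READ';
}

export async function handler(): Promise<APIGatewayProxyResultV2> {
  const startTime = Date.now();
  try {
    // 1. Database check: fetch a well-known record used only by the health check.
    const record = await dynamoDb.send(new GetItemCommand({
      TableName: process.env.CHECK_TABLE ?? 'health-check-table',
      Key: { recordId: { S: 'health-check-record' } },
    }));
    if (!record.Item) {
      throw new Error('Health-check record missing');
    }

    // 2. Logic check: run the core authorizer against a known input and expected output.
    if (!evaluateAccess('health-check-user', 'resources/health', 'READ')) {
      throw new Error('Authorizer returned an unexpected result');
    }

    // 3. Latency check: if the whole probe is too slow, the region is effectively degraded.
    if (Date.now() - startTime > MAX_CHECK_DURATION_MS) {
      throw new Error('Health check exceeded latency budget');
    }

    return { statusCode: 200, body: 'OK' };
  } catch (error) {
    // Any failure marks this region unhealthy; Route 53 then routes to the failover region.
    console.error('Region health check failed', error);
    return { statusCode: 503, body: 'Region unhealthy' };
  }
}
```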

We actually don't stop here, though. Realistically, this configuration requires having a complete duplicated set of all the components running in multiple regions even if you're never utilizing them. And if only one component is down, it's a shame to fail over and run everything in a second region; sometimes that creates unnecessary load and extra complexity, both in writing your service and in the routing that has to take place. So in some cases, where we're afforded an opportunity to change our architecture, we move to a decentralized edge compute solution, using Lambda@Edge through CloudFront distributions. Instead of having a failover, when a request comes in it doesn't go to a region; it gets executed at the edge, on the compute, which has to be a Lambda@Edge function, and that function reads directly from the database in the region closest to that edge node.
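In CDK terms, the edge wiring might look roughly like the sketch below; the origin domain, asset path, and handler name are placeholders, and the interesting part is attaching the Lambda@Edge function to the CloudFront distribution's default behavior.

```typescript
// A minimal sketch of serving requests from Lambda@Edge via a CloudFront distribution.
import { Construct } from 'constructs';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';
import * as lambda from 'aws-cdk-lib/aws-lambda';

export class EdgeApi extends Construct {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // The handler code is replicated by CloudFront and runs at the edge location
    // closest to the caller, instead of in a single home region.
    const edgeHandler = new cloudfront.experimental.EdgeFunction(this, 'EdgeHandler', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('dist/edge-handler'), // hypothetical build output
    });

    new cloudfront.Distribution(this, 'ApiDistribution', {
      defaultBehavior: {
        // Fallback origin for anything the edge function doesn't answer itself.
        origin: new origins.HttpOrigin('origin.api.example.com'), // hypothetical origin
        allowedMethods: cloudfront.AllowedMethods.ALLOW_ALL,
        cachePolicy: cloudfront.CachePolicy.CACHING_DISABLED,
        edgeLambdas: [{
          functionVersion: edgeHandler.currentVersion,
          eventType: cloudfront.LambdaEdgeEventType.ORIGIN_REQUEST,
          includeBody: true, // the edge function needs the request body for writes
        }],
      },
    });
  }
}
```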

Of course, components in that region could still be down, DynamoDB for example, and in that case we automatically fall back to executing the database request against an adjacent region, contacting the database there to read and write as necessary. And if there's a problem there, then we fall back again to a third region. At this point, you may ask: why three, why not four or five? At three, we've identified that it's so rare for such an event to ever occur that going further would be far beyond what our SLA promise actually requires, and it would potentially create a negative impact on the requests coming into our service from our customers. For instance, if you have to fall back across four regions, that's four potential checks in your service, which really increases the execution time, the latency, before returning a successful response. So it's much better in those cases to give up after three, return a failure, and have the client retry as necessary.
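The fallback itself might look roughly like this sketch, assuming the data is replicated to each region (for example via a DynamoDB global table); the region ordering and table name are illustrative.

```typescript
// A sketch of "try the closest region, fall back to an adjacent one, then one more, then give up".
import { DynamoDBClient, GetItemCommand, GetItemCommandOutput } from '@aws-sdk/client-dynamodb';

// Ordered by proximity to this edge location; in practice this list would be
// derived from where the CloudFront edge node is located.
const REGION_PREFERENCE = ['eu-west-1', 'eu-central-1', 'eu-west-2'];

const clientsByRegion = new Map<string, DynamoDBClient>();
for (const region of REGION_PREFERENCE) {
  clientsByRegion.set(region, new DynamoDBClient({ region }));
}

export async function getWithRegionalFallback(recordId: string): Promise<GetItemCommandOutput> {
  let lastError: unknown;
  // Try at most three regions; beyond that the added latency hurts more than it helps,
  // so we surface the failure and let the client retry.
  for (const region of REGION_PREFERENCE) {
    try {
      return await clientsByRegion.get(region)!.send(new GetItemCommand({
        TableName: 'accounts', // hypothetical replicated table
        Key: { recordId: { S: recordId } },
      }));
    } catch (error) {
      lastError = error;
      console.warn(`DynamoDB read failed in ${region}, trying the next region`, error);
    }
  }
  throw lastError;
}
```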

Okay. So at this point, I think we've handled infrastructure failures in AWS, but that's not the only situation we have to deal with as a service provider. It would be really nice if this solved all of our problems, but we also have to deal with application failures.

Bug-free code is obviously something you would want to achieve, but the reality is that at some point an issue is going to end up in production and could affect some of our customers. Suppose we waited for a Discord message, an email, or an on-call incident to trigger one of the engineers on our team to get online and investigate. By the time they're online, we've already violated our SLA. Realistically, getting online in less than five minutes and fifteen seconds is just not going to happen, let alone actually fixing the problem.

So instead, we need to optimize for having automation in place over just alerting. Wherever there's an alert, we really need to think about how we can automatically resolve the problem without involving someone. Naturally, the primary way to handle this is to have a lot of automated testing in place. After all, every single incident we could ever have, or have ever had, could easily have been solved by having just One.More.Test.

We test efficiently before deployment, so that by the time changes get to production, we're as confident as possible.

We don't focus on test coverage, though, but rather on test value: the tests we create are the ones that we believe have the most value. I won't go into that further here. However, having all the right tests all the time is almost impossible for a service that's constantly improving and adding more features every day.

For every additional test we want to write, the time to actually create and maintain it grows exponentially, so getting to 100% complete test coverage would take an infinite amount of time. This is the Pareto principle, or the 80/20 rule, at work: we will never catch everything, and the amount of effort to get even a little bit closer very quickly gets away from us. So instead, we have to optimize both for prevention, by running the right tests, and, more importantly, for recovery when an issue gets to prod.

The only conclusion we can come to is that we need additional tests running against our production environment. An example of what we do is validation tests. A validation test you could run is to use the data you've got within your database and the logs that you have, and verify them against each other: what you're seeing in the logs against what's already in the database. For our service, in some cases we have the same data stored in multiple different places. It's not stored exactly the same way; it gets used for performant searching, quick lookups, and pre-populated caching.

Other times we mutate the data for optimized throughput. On a schedule, we can fire off some compute which goes and verifies the data between these databases, and if there's an issue, sends an alert so someone can go investigate and see what's going on.
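As a rough sketch of that kind of scheduled check, here is what a comparison between a primary table and a derived projection could look like; the table, field, and metric names are made up for illustration.

```typescript
// A hedged sketch of a scheduled validation job: compare a sample of records in the primary
// table against a derived lookup table and publish a metric that alarms on any mismatch.
import { DynamoDBClient, ScanCommand, GetItemCommand } from '@aws-sdk/client-dynamodb';
import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';

const dynamoDb = new DynamoDBClient({});
const cloudWatch = new CloudWatchClient({});

export async function handler(): Promise<void> {
  // Sample a small batch of records rather than reprocessing everything.
  const sample = await dynamoDb.send(new ScanCommand({ TableName: 'records-primary', Limit: 100 }));

  let mismatches = 0;
  for (const item of sample.Items ?? []) {
    const projected = await dynamoDb.send(new GetItemCommand({
      TableName: 'records-search-projection',
      Key: { recordId: item.recordId },
    }));
    // The projection is mutated for fast lookups, so compare only the fields that must agree.
    if (projected.Item?.ownerUserId?.S !== item.ownerUserId?.S) {
      mismatches += 1;
    }
  }

  // Publish the mismatch count; an alarm on this metric is what pages someone to investigate.
  await cloudWatch.send(new PutMetricDataCommand({
    Namespace: 'ServiceValidation',
    MetricData: [{ MetricName: 'DataProjectionMismatches', Value: mismatches, Unit: 'Count' }],
  }));
}
```

A function like this can be invoked on a schedule, for example from an EventBridge rule.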

Not everyone has a situation with multiple databases holding similar data. But if you don't, you can potentially utilize past logs, or the current logs coming in, reprocess them, and verify that the data in your database is exactly what you expect it to be. You can rerun the logs through the same code and see if the same result happens, and you can sample them so that you're not reprocessing all the data. Another thing we do is, of course, incremental rollout. When we create a change that's going to production, we need to make sure that if there's an issue with it, it impacts the lowest possible number of our customers and their users. To do that, we bucket our customers into small groups and deploy changes to one customer group at a time.

If the change is successfully deployed, then we move on to the next group, and the next group, until it's deployed everywhere. That's what makes the rollout incremental. If there's a problem, we stop the deployment, alert our team that there's an issue, and then dive in and see what the problem could be. In outline, the rollout loop looks something like the sketch below.
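This is only an outline of the idea; deployToGroup and groupIsHealthy are placeholders for the real deployment tooling and the metric checks described next.

```typescript
// A simplified sketch of incremental rollout: deploy to one customer group at a time
// and halt if the post-deployment check fails.
const DEPLOYMENT_GROUPS = ['canary', 'group-1', 'group-2', 'group-3', 'remaining'];

async function deployToGroup(version: string, group: string): Promise<void> {
  // e.g. update the routing layer so this group's traffic hits the new version
  console.log(`Deploying ${version} to ${group}`);
}

async function groupIsHealthy(group: string): Promise<boolean> {
  // e.g. query the authorization-ratio metric for this group over a bake period
  console.log(`Checking health of ${group}`);
  return true;
}

export async function rollOut(version: string): Promise<void> {
  for (const group of DEPLOYMENT_GROUPS) {
    await deployToGroup(version, group);
    if (!(await groupIsHealthy(group))) {
      // Stop the rollout and alert the team; remaining groups stay on the old version.
      throw new Error(`Rollout of ${version} halted: anomaly detected in ${group}`);
    }
  }
}
```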

This is a good strategy, but we actually need to figure out what makes sense to alert on. And it wouldn't be a presentation at an AWS conference if I didn't at least mention AI one time. So I'll say that we use anomaly detection in CloudWatch, and a couple of other tools, to identify automatically whether there's an issue with the code we've rolled out to production for one of those deployment groups. The anomaly detection evaluates a specific metric that's relevant for us: what we call the authorization ratio. That's the ratio of successful authorizations or login attempts versus ones that are blocked, cancelled, or never completed. If there's a problem there, the anomaly detection will alert automatically, and someone will jump in and attempt to resolve it. You can see on this graph of alerting data that we got pretty close to breaching the threshold and thus identifying an anomaly.

The detection was one cycle away from flagging a potential problem and alerting us; I just happened to find this when I went and looked at the logs. In this case there was no issue, so we moved on as planned and the deployment went on to the next deployment group. This ratio is specific to our business at Authress, but there are of course relevant business metrics that make sense to alert on in your situation.
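As one hedged example of wiring that up, here is a CDK sketch of a CloudWatch anomaly-detection alarm on a business metric. The 'Business' namespace and 'AuthorizationRatio' metric name are assumptions; the real metric and dimensions would match however your service publishes them.

```typescript
// A sketch of an anomaly-detection alarm: alert when the business metric drops below
// the band that CloudWatch has learned from its history.
import { Construct } from 'constructs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

export class AuthorizationRatioAlarm extends Construct {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    new cloudwatch.CfnAlarm(this, 'RatioAnomalyAlarm', {
      alarmDescription: 'Authorization success ratio fell outside its expected band',
      comparisonOperator: 'LessThanLowerThreshold',
      evaluationPeriods: 2,
      thresholdMetricId: 'ad1',
      metrics: [
        // The anomaly-detection band learned from the metric's history (2 standard deviations wide).
        { id: 'ad1', expression: 'ANOMALY_DETECTION_BAND(m1, 2)' },
        // The business metric itself: successful authorizations versus total attempts.
        {
          id: 'm1',
          metricStat: {
            metric: { namespace: 'Business', metricName: 'AuthorizationRatio' },
            period: 300,
            stat: 'Average',
          },
        },
      ],
    });
  }
}
```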

It doesn't make sense to just use an arbitrary signal like APM data or the number of 200s, 500s, or even 400s that you're getting; it should actually be relevant to your use case. And I think this is a really important point, because it can be quite a challenge to correctly identify whether our service is up and running in a way that also matches what our customers believe. It's very difficult to have an accurate picture of what our customers are seeing.

Let's break that down: the Authress service health versus our customers' perspective.

A. If we correctly identify that our service is up and running, and our customers don't see a problem, then we're all good. Customers think everything is great, we think everything is great. Fantastic.

B. Inversely, maybe we've identified that there's an incident currently happening, and so have our customers. In that case, we've probably deployed some automation which automatically fixes the problem, or at the very least alerts the team to go and investigate. While it's not great that there's an incident, we correctly identified it; that's a huge step forward.

C. One interesting category is the bottom left-hand corner, where our customers think everything is great. There are no alerts from them, but internally we've identified that there may be a problem. You may be thinking: how could this happen? Well, our service alerting could be a little too sensitive. Or we have an issue in production, but the users who care about it or who are utilizing that functionality aren't online. Take our customers in Australia at this moment: they're all asleep.

So we could potentially reduce our alerting and not fire off an alarm in these particular cases. Reducing these is important, because they amount to extra work without necessarily extra value. I think the most interesting quadrant is up here in the top right-hand corner, which is called a gray failure. Our customer says, "Hey, there's a problem with your service," and we look at our dashboards, our metrics, our alerts, and there's nothing happening.

Everything looks fine to us. You may be thinking, how is that possible? How could that happen? Well, realistically, it's possible that we missed an alert.

We don't catch everything. There could be something we're missing, an incident that we are not alerting on. It's very rare, though. More likely, what happens is that our customers' expectations don't match our expectations.

Maybe there's something in our knowledge base which is confusingly worded, and this reported incident gives us an opportunity to review that documentation and improve it. Maybe it's a poorly named variable. Maybe our customer has some sort of alerting that's relevant for their service, and they're alerting on that even though Authress hasn't changed at all. A good example: one of our customers has a different network configuration, or is just using a not-as-good cloud provider. In these circumstances, we really need to care about what's going on from their perspective. They may have additional tests that we're not running, and we can get the benefit from them. So it's critical that we consider our customer support pipeline invaluable. If there's a problem and our customers identify it, we want to know about it as soon as possible.

That means triaging and exposing customer support requests as alerts in our system to our team. If you get too many requests, though, it isn't realistic to alert on them. You may be tempted to add additional support levels into your organization to combat the problem. However, every additional level of support increases the turnaround time for solving support requests. And if one of those requests identifies a potential incident and offers insight into a problem with your service, that means you're actually reducing the SLA that you can offer.

So instead, we spend a lot of time thinking about how to eliminate support requests that aren't related to incidents, or triage them differently. That means we evaluate every single support request that comes in and try to answer the question: why did we get this request? Is there a problem in the documentation? Is there something confusing? Is there a way we can avoid this problem in the future, make something simpler, change something around?

We consider our support to be one of the lifelines of our service, allowing us to maintain the five-nines SLA that we offer. We track these metrics internally to know the status of our service, and in some cases, because our customers have a different view, it helps to expose the same information to them.

So, in the dashboard of our product, we expose some of these metrics specifically so that customers can compare what's going on: the most used resources in Authress from their perspective, the users utilizing their service at that time, as well as specific messages and warnings related to Authress functionality usage that they may want to take a look at. They can utilize that additional information to identify feedback for us in their particular situation, or the additional data may point to an internal incident on their side, in which case it would be great for us to know what's going on and why.

I wish I could say that after implementing everything I've described so far in this talk (automated region failover, incremental rollout, a huge customer support focus), that would be enough. But realistically, we also need to deal with negligence and malice.

We're lucky enough to be such a popular product that we have free security researchers out there on the Internet constantly reminding us, by spamming our email, about a "possible problem" they've identified with our service that we should go and solve. And it's important that we take some of these seriously, because we have a multi-tenant solution, as I'm sure most of you do, and that means we share some infrastructure resources between customers. So if one customer is malicious or negligent and over-utilizes resources, or that customer has a user who is malicious and over-utilizes their resources, that has an impact on our shared infrastructure, which could in turn have a negative impact on our other customers. The way we defend against this problem is by adding rate limiting of a few different kinds: at the application level in the region where the code is executing, and also at the CloudFront distribution, using a web application firewall.
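On the application side, the rate limiting can be as simple as a per-tenant token bucket. The sketch below is a minimal in-memory version with illustrative rates; a real multi-region deployment would need shared or per-instance budgets.

```typescript
// A sketch of per-tenant, application-level rate limiting with an in-memory token bucket.
interface TokenBucket {
  tokens: number;
  lastRefillMs: number;
}

const REFILL_RATE_PER_SECOND = 100; // steady-state requests per second per tenant
const BUCKET_CAPACITY = 200;        // allowed burst size
const buckets = new Map<string, TokenBucket>();

export function allowRequest(tenantId: string): boolean {
  const now = Date.now();
  const bucket = buckets.get(tenantId) ?? { tokens: BUCKET_CAPACITY, lastRefillMs: now };

  // Refill tokens based on elapsed time, capped at the bucket capacity.
  const elapsedSeconds = (now - bucket.lastRefillMs) / 1000;
  bucket.tokens = Math.min(BUCKET_CAPACITY, bucket.tokens + elapsedSeconds * REFILL_RATE_PER_SECOND);
  bucket.lastRefillMs = now;

  if (bucket.tokens < 1) {
    buckets.set(tenantId, bucket);
    return false; // caller should respond with 429 Too Many Requests
  }

  bucket.tokens -= 1;
  buckets.set(tenantId, bucket);
  return true;
}
```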

A Web Application Firewall (WAF), if you're not familiar, is a component from AWS which tracks requests and lets you match on a lot of different things. There are a lot of possible configurations out there, but I want to share exactly what we do. Fundamentally, our WAF configuration includes some AWS managed rules, specifically the IP reputation list. This is a great rule to have because AWS automatically populates it with entities that they've identified as malicious.

That means that, without even thinking about it, you can automatically reduce your attack surface from known problematic entities. If one of these entities is performing a DDoS attack on AWS resources and affecting many different customers, we get the benefit of being protected. We don't stop there, but we also don't use any other managed rules. There are a lot in the marketplace, and from AWS as well: bot control rules, and so on. From our perspective, they're overpriced; they don't really offer a lot, and sometimes they have too many false positives or are challenging to configure. So we don't actually use them.

We do, however, add BLOCK rules with high rate limits.

Any rate that we consider no customer would ever realistically hit, and where hitting it would probably mean there's an issue going on, we just immediately block. We match on IP address as well as some other properties, and we immediately cut those requests off. Not every customer is below this limit; we have some that are above, and we make special exceptions for those, but we know when to expect that.

Additionally, we add a whole bunch of COUNT rules, which are just a way of logging that a particular user is utilizing resources in our service. We add a couple of different denominations: 1000 rps, 2000 rps, 5000 rps, and so on. We automatically track these requests, which can help us identify early that there's an attack on the horizon, something coming, and utilize that information to prepare ourselves and our service, or to get AWS on the line and ask, "Hey, what's going on? Is there something else that we should be doing here?"
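Putting those pieces together, a hedged CDK sketch of such a WAF configuration is below: the AWS IP reputation managed rule, a hard BLOCK rule at a rate no customer should realistically reach, and a COUNT rule at a lower threshold purely for visibility. The limits are illustrative, and note that WAF rate-based limits are counted over a rolling five-minute window, not per second.

```typescript
// A sketch of a WAFv2 WebACL for a CloudFront distribution: managed IP reputation rule,
// a hard BLOCK at an extreme rate, and a COUNT rule for early visibility into heavy usage.
// Note: CLOUDFRONT-scoped WebACLs must be deployed in us-east-1.
import { Construct } from 'constructs';
import * as wafv2 from 'aws-cdk-lib/aws-wafv2';

export class ApiFirewall extends Construct {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    new wafv2.CfnWebACL(this, 'WebAcl', {
      scope: 'CLOUDFRONT',
      defaultAction: { allow: {} },
      visibilityConfig: { cloudWatchMetricsEnabled: true, metricName: 'api-waf', sampledRequestsEnabled: true },
      rules: [
        {
          // AWS-maintained list of known-bad IPs: cheap reduction of the attack surface.
          name: 'aws-ip-reputation',
          priority: 0,
          overrideAction: { none: {} },
          statement: { managedRuleGroupStatement: { vendorName: 'AWS', name: 'AWSManagedRulesAmazonIpReputationList' } },
          visibilityConfig: { cloudWatchMetricsEnabled: true, metricName: 'ip-reputation', sampledRequestsEnabled: true },
        },
        {
          // Hard block: no single IP should ever legitimately exceed this.
          name: 'block-extreme-rate',
          priority: 1,
          action: { block: {} },
          statement: { rateBasedStatement: { limit: 100000, aggregateKeyType: 'IP' } },
          visibilityConfig: { cloudWatchMetricsEnabled: true, metricName: 'block-extreme-rate', sampledRequestsEnabled: true },
        },
        {
          // COUNT only: logs heavy usage so an attack can be spotted before it matters.
          name: 'count-high-rate',
          priority: 2,
          action: { count: {} },
          statement: { rateBasedStatement: { limit: 25000, aggregateKeyType: 'IP' } },
          visibilityConfig: { cloudWatchMetricsEnabled: true, metricName: 'count-high-rate', sampledRequestsEnabled: true },
        },
      ],
    });
  }
}
```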

And of course we have alerts on these WAF configurations. Remember, though, that alerting is not sufficient to actually prevent attacks; we need automation in place. The way we do that is we log the web application firewall requests and then filter that data to try to understand what's happening on the fly. So when we get one of these alerts, we automatically run some automation which evaluates what's actually happening.

Here's an example of some of our query automation, where we pull out the most-used IP addresses over a small period of time, or filter on some other specific parameters. The WAF logs go into CloudWatch, and we query that data using their API. Depending on the results and some of our own internal rules and logic, we'll dynamically update the WAF rules so that they block attacks automatically as they happen.
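That query automation might look something like the following sketch, which runs a CloudWatch Logs Insights query over the WAF log group and returns the most active client IPs from the last five minutes; the log group name is an assumption.

```typescript
// A sketch of querying WAF logs via CloudWatch Logs Insights for the most active client IPs.
import {
  CloudWatchLogsClient,
  StartQueryCommand,
  GetQueryResultsCommand,
} from '@aws-sdk/client-cloudwatch-logs';

const logs = new CloudWatchLogsClient({});

export async function topClientIps(): Promise<Array<{ ip: string; requestCount: string }>> {
  const { queryId } = await logs.send(new StartQueryCommand({
    logGroupName: 'aws-waf-logs-api', // hypothetical WAF logging destination
    startTime: Math.floor(Date.now() / 1000) - 5 * 60,
    endTime: Math.floor(Date.now() / 1000),
    queryString: `
      fields httpRequest.clientIp as ip
      | stats count(*) as requestCount by ip
      | sort requestCount desc
      | limit 20
    `,
  }));

  // Poll until the query completes, then map the rows into a usable shape.
  while (true) {
    const results = await logs.send(new GetQueryResultsCommand({ queryId }));
    if (results.status === 'Complete') {
      return (results.results ?? []).map(row => ({
        ip: row.find(field => field.field === 'ip')?.value ?? '',
        requestCount: row.find(field => field.field === 'requestCount')?.value ?? '0',
      }));
    }
    if (results.status === 'Failed' || results.status === 'Cancelled') {
      throw new Error(`Logs Insights query ${results.status}`);
    }
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
}
```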

And because we're running in multiple regions, we deploy the same technology to all of our regions, and it's extended out to the edge compute as well.

Review

Okay. I said a lot of things. Let's quickly review what we're doing.

  1. We have Route 53 automatic failover and health checks to handle multi-region deployments and deal with infrastructure issues.
  2. We have edge compute to push some functionality out and decentralize it, so not everything is in the same place. It also gives us improved throughput and reduces latency for our customers' end users.
  3. We implement incremental rollout, to reduce the impact of problematic changes to production.
  4. We have a customer support focus.
  5. Lastly, we have a Web Application Firewall (WAF) with AWS managed and custom rules.

Insights

I want to jump into a few quick insights.

There will be failures everywhere. Not only will things fail, everything will fail all the time. It really speaks to the fact that wherever you've got components, code, functionality, or an endpoint exposed, something will happen to it. For us, DNS is still a single point of failure. I don't love this about our architecture, and I wish there was something we could do, but I'm not confident there's more we could be doing at this moment.

The last one: deploying almost-identical infrastructure is actually really hard. It would be one thing if we just deployed our technology to 5 or 10 regions and were done. But some of those regions are backup regions, so they're not primarily used, and the configuration is slightly different.

The failover is slightly different. Some records are primary, some are secondary. So there are actually a lot of minor differences between the regions, and this adds significant complexity to our deployment story.

If anyone has any great suggestions on any of this, I'm happy to discuss them.

info

For help understanding this article or how you can implement improved reliability or a similar architecture in your services, feel free to reach out to the Authress development team or follow along in the Authress documentation.