Describe the bug
The way that the LBC reconciles SG rules is to revoke old rules, then add new ones, which I think is liable to cause downtime in a variety of scenarios. See here for the implementation:
aws-load-balancer-controller/pkg/networking/security_group_reconciler.go
Lines 104 to 114 in c2515fe
```go
permissionsToGrant := diffIPPermissionInfos(desiredPermissions, sgInfo.Ingress)
if len(permissionsToRevoke) > 0 && !reconcileOpts.AuthorizeOnly {
	if err := r.sgManager.RevokeSGIngress(ctx, sgInfo.SecurityGroupID, permissionsToRevoke); err != nil {
		return err
	}
}
if len(permissionsToGrant) > 0 {
	if err := r.sgManager.AuthorizeSGIngress(ctx, sgInfo.SecurityGroupID, permissionsToGrant); err != nil {
		return err
	}
}
```
I believe this is obviously problematic, and I suspect it works this way only because of some technical constraint or an oversight, but either way, it seems dangerous to use for production systems that bear a high amount of load. For example, say we're using EKS and we add a new LoadBalancer Service that happens to get assigned a NodePort higher than any assigned thus far. When using the managed shared backend security group setting (which I believe is the default), the LBC will call RevokeSecurityGroupIngress on the security group that all the EKS nodes use, removing the rule whose source is that shared backend security group and whose port range spans from the lowest provisioned NodePort to the previously highest provisioned NodePort. It will then call AuthorizeSecurityGroupIngress to add a new rule just like it, but with the new highest NodePort as the upper bound. This change occurs in around a second (based on container logs and CloudTrail entries), but during that second there is no rule that allows ingress traffic from your NLBs to your nodes. (Although, I would like to note that sometimes CloudTrail says the AuthorizeSecurityGroupIngress call comes in first, while the container logs never show that order, so maybe there's some eventual consistency in CloudTrail that means I shouldn't rely upon it.)
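To make that window concrete, here is roughly what that pair of EC2 calls looks like when issued directly with aws-sdk-go-v2. This is only an illustrative sketch, not the controller's code path (the controller goes through its sgManager layer); the security group IDs and port numbers below are made up.

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	nodeSG := "sg-0aaaa111122223333"    // hypothetical node security group
	backendSG := "sg-0bbbb444455556666" // hypothetical shared backend security group

	// Old rule: traffic from the shared backend SG over the previously known NodePort range.
	oldRule := types.IpPermission{
		IpProtocol:       aws.String("tcp"),
		FromPort:         aws.Int32(31000),
		ToPort:           aws.Int32(32000),
		UserIdGroupPairs: []types.UserIdGroupPair{{GroupId: aws.String(backendSG)}},
	}
	// New rule: the same, but widened to include the newly assigned NodePort.
	newRule := oldRule
	newRule.ToPort = aws.Int32(32500)

	// 1) The old rule is revoked first...
	if _, err := client.RevokeSecurityGroupIngress(ctx, &ec2.RevokeSecurityGroupIngressInput{
		GroupId:       aws.String(nodeSG),
		IpPermissions: []types.IpPermission{oldRule},
	}); err != nil {
		log.Fatal(err)
	}

	// 2) ...and only then is the new one authorized. Between these two calls the node SG
	// has no rule at all admitting traffic from the NLBs.
	if _, err := client.AuthorizeSecurityGroupIngress(ctx, &ec2.AuthorizeSecurityGroupIngressInput{
		GroupId:       aws.String(nodeSG),
		IpPermissions: []types.IpPermission{newRule},
	}); err != nil {
		log.Fatal(err)
	}
}
```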
I understand that since SGs are stateful, existing client connections won't be cut off, but I was surprised to read in the logs and in the source that it works this way. I expected this to break new connections or at least cause some sort of mayhem, but in my (limited) testing of around ~120 req/s to an existing NLB while I was spinning up a new one that triggered this SG rule reconciliation, I didn't notice any downtime. I asked around on the k8s Slack about it, and people surmised that there might be some request batching that prevents downtime for such short bursts, but no one knew what the contingency plan would be if the AuthorizeSecurityGroupIngress API call failed. So, I asked AWS Support, but their answer wasn't encouraging. I'm still going to work with them on this, but they made it sound like there is no batching occurring, so aside from the inherent eventual consistency of AWS systems and the statefulness of connections under SGs, nothing should be keeping traffic "floating" during what should actually be a blip of some sort. However, they didn't really address my other concerns, so I do believe it's possible they misunderstood me entirely and that there actually is some sort of batching.
Even then, server-side batching doesn't address all the problems here. For example, what if the LBC crashes (due to an OOM or some container runtime issue) after the RevokeSecurityGroupIngress call successfully goes through, but before the AuthorizeSecurityGroupIngress call is made? Or what if there is a networking issue for just long enough at that brief moment? What if some slowdown in the LBC or in the AWS API makes the whole operation take longer, making a connection failure more likely? Another case that AWS Support brought up that I hadn't thought of, and which they seemed to be trying to reassure me about but which actually just added to my list of concerns, is the authorize call getting rate limited. And this is to say nothing of situations like someone misconfiguring their LBC at some point to remove AuthorizeSecurityGroupIngress from its IAM permissions, or some bad actor doing that to them. You can't protect against every footgun in the world, but this seems like it's designed to be one.
So, I have come to raise my concerns here as well. I believe this is a dangerous way of doing things. All it takes is for a single AuthorizeSecurityGroupIngress call to fail, and suddenly your nodes will stop accepting traffic from all the NLBs whose backend SG rules are being managed by the LBC. I do believe this can all be side-stepped by starting the LBC with --disable-restricted-sg-rules=true, though I'm not keen on that because it seems like a smell: there are good reasons to restrict the SG port range, both as a matter of good practice no matter what the deployment environment is like, and as a matter of making the component safer in clusters where one might have host-networking containers exposing ports that are only meant to be reachable from within the cluster, or any other novel arrangement one is likely to find oneself in when doing Kubernetes the Hard Way instead of using EKS.
Steps to reproduce
If you wanted to provoke the problem presented here, you could fiddle with your IAM permissions, or use an outbound proxy that watches the LBC's logs to detect when it is about to make an AuthorizeSecurityGroupIngress call and block it, or, if you want to be a bad customer, trigger API rate limits by automating some system somewhere until AuthorizeSecurityGroupIngress starts getting RequestLimitExceeded errors. I guess you could also try to cause OOMs or kill the container in the middle of the reconciliation process, but those seem harder to accomplish precisely.
Anyway, once you have introduced this chaos, simply create a new Service and the LBC will delete your managed ingress rules and fail to replace them, and now all your NLBs should be in a broken state (see the sketch below for one way to confirm it).
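As a rough way to confirm the broken state, you can check whether the node security group still has any ingress rule sourced from the shared backend SG. The sketch below uses aws-sdk-go-v2 with hypothetical SG IDs; it is not part of the controller, just a manual check under the assumption that you know both group IDs.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	nodeSG := "sg-0aaaa111122223333"    // hypothetical node security group
	backendSG := "sg-0bbbb444455556666" // hypothetical shared backend security group

	out, err := client.DescribeSecurityGroups(ctx, &ec2.DescribeSecurityGroupsInput{
		GroupIds: []string{nodeSG},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Look for an ingress rule whose source is the shared backend SG.
	found := false
	for _, perm := range out.SecurityGroups[0].IpPermissions {
		for _, pair := range perm.UserIdGroupPairs {
			if pair.GroupId != nil && *pair.GroupId == backendSG {
				found = true
				if perm.FromPort != nil && perm.ToPort != nil {
					fmt.Printf("backend SG rule present: ports %d-%d\n", *perm.FromPort, *perm.ToPort)
				}
			}
		}
	}
	if !found {
		fmt.Println("no ingress rule from the shared backend SG: NLB traffic to the nodes is blocked")
	}
}
```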
Expected outcome
If possible, AuthorizeSecurityGroupIngress should happen first. I feel like that should always be safe to do, since I don't think rule overlap triggers the rule duplication logic in backend SG rule reconciliation. Say your rule is going from allowing ports 31000-32000 to allowing 31000-32500: just create the new rule first, then delete the old one. If necessary, you could use rule descriptions to help differentiate the rules. This should work, right?
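To illustrate, here is a rough sketch of the reordering I have in mind, written against the same sgManager calls as the excerpt above. I haven't verified this against the real reconciler; in particular, the permissionsToRevoke assignment is my guess at how the excerpt's surrounding code computes it, so treat the details as assumptions rather than a patch.

```go
// Hypothetical reordering: grant the new (possibly overlapping) rule first, and only
// revoke the old one once the grant has succeeded. If AuthorizeSGIngress fails, the
// old rule is still in place, so the worst case is a temporarily over-broad rule set
// rather than a window with no rule at all.
permissionsToRevoke := diffIPPermissionInfos(sgInfo.Ingress, desiredPermissions) // assumed definition
permissionsToGrant := diffIPPermissionInfos(desiredPermissions, sgInfo.Ingress)
if len(permissionsToGrant) > 0 {
	if err := r.sgManager.AuthorizeSGIngress(ctx, sgInfo.SecurityGroupID, permissionsToGrant); err != nil {
		return err
	}
}
if len(permissionsToRevoke) > 0 && !reconcileOpts.AuthorizeOnly {
	if err := r.sgManager.RevokeSGIngress(ctx, sgInfo.SecurityGroupID, permissionsToRevoke); err != nil {
		return err
	}
}
```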
But if I'm wrong about that, then I'm not sure how to truly fix this besides changing how things work on the backend. Obviously, introducing batching without API changes would ensure that the happy-path scenario (in which AuthorizeSecurityGroupIngress never fails to be called after a RevokeSecurityGroupIngress) never causes a blip of downtime. But introducing batching as a first-class feature of the API, such that the whole operation is atomic from both a client and a server perspective, would be best.
Finally, if those aren't possible, then I would think the best course of action is to remove this feature and replace it with something that lets the user simply specify the port ranges to allow in the managed rules. That would be safest, IMO.
Environment
- AWS Load Balancer controller version: 2.7.2
- Kubernetes version: v1.25.16
- Using EKS (yes/no), if so version?: yes, eks.15
Additional Context:
Once again, here's the link to the k8s slack thread where I asked about this, hoping to get a maintainer or SIG member to respond: https://kubernetes.slack.com/archives/C0LRMHZ1T/p1712080079771249
Also, it should be noted that I am used to the olden days of NLBs not having SGs; using SGs on NLBs is novel to me as of trying to get the LBC working in our new EKS clusters, so I may be making assumptions about how they work that apply to CLBs and EC2 instances, but don't apply to NLBs for some reason. Sorry for being behind the times. If my concerns are based on a fundamental misunderstanding then please, by all means, correct my misunderstanding and I'll be happy to hear that there's no problem here at all.
Thanks in advance for your time and attention.