[bug fix] Make SG deletion more responsive #4216
Merged
Issue
#4207
Description
The crux of the issue linked above is that deleting a resource that references a security group can take up to ~20 hours (or twice the configured sync period). The root cause is the deferred TGB queue, which was initially added to improve LBC boot time. The problem introduced by the deferred queue is that it purposely skips over TGBs that haven't changed. However, in order to garbage collect security groups, the LBC must have each TGB cached within the networking manager:
https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/pkg/targetgroupbinding/networking_manager.go#L216-L220
The cache only gets populated when the TGB reconciler does work here:
https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/pkg/targetgroupbinding/resource_manager.go#L196 [Pods]
https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/pkg/targetgroupbinding/resource_manager.go#L324 [Nodes]
The cache population step happens after the checkpoint check, which means that at boot time we don't populate the cache until the sync period has elapsed since the checkpoint timestamp. In the worst case, the cache might not get populated for twice the sync period, or 20 hours in most cases. It's important to note that long-lived LBCs wouldn't have this issue: once the cache is warmed, it never gets de-populated.
This fix does a couple things:
Testing: