Skip to content

Conversation

zac-nixon
Copy link
Collaborator

Issue

#4207

Description

The crux of the issue linked above is that when deleting a resource that references a security group can take up to ~20 hours (or the configured sync period * 2). The root cause is the deferred TGB queue, which was initially added to improve LBC boot time. The problem introduced by the deferred queue is that it purposely skips over TGBs that haven't been changed. However, in order to garbage collection Security Groups the LBC must have each TGB cached within the network manager:
https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/pkg/targetgroupbinding/networking_manager.go#L216-L220

The cache the only gets populated from the TGB reconciler from doing work here:
https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/pkg/targetgroupbinding/resource_manager.go#L196 [Pods]
https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/pkg/targetgroupbinding/resource_manager.go#L324 [Nodes]

The cache population step happens after the checkpoint check, which means on boot time we don't populate the cache until the sync time has elapsed on the checkpoint timestamp. This means, in the worse case, that the cache might not get populated for sync time * 2 or 20 hours in most cases. It's important to note, that long lived LBCs wouldn't have this issue, as once the cache is warmed, it will never get de-populated.

This fix does a couple things:

  1. Refactors the network manager to go into the networking package.
  2. Adds the network manager into the deployer so it can attempt to trigger SG clean up when it detects an in-use SG.
  3. Refactors the deferred TGB logic to track TGB state in-memory. This means that we will always process TGBs within 20-40 minutes after LBC start up. This ensures that the cache is warmed a lot faster, which means that the SG deletions happen quicker.

Testing:

  • Reproduced the original issue.
  • Ensured that the code here triggers deletion in ~45m. [I think this is ok, as deletion is low priority, and the resources which linger are free to customers]
  • Let the LBC code run for ~2 days to ensure no memory leaks occur.

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the docs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 3, 2025
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 3, 2025
@shraddhabang
Copy link
Collaborator

/lgtm
/approved

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 3, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wweiwei-li, zac-nixon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [wweiwei-li,zac-nixon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 225d31a into kubernetes-sigs:main Jun 3, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants