[bug fix] Make SG deletion more responsive #4216

zac-nixon · 2025-06-03T16:40:36Z

Issue

Description

The crux of the issue linked above is that when deleting a resource that references a security group can take up to ~20 hours (or the configured sync period * 2). The root cause is the deferred TGB queue, which was initially added to improve LBC boot time. The problem introduced by the deferred queue is that it purposely skips over TGBs that haven't been changed. However, in order to garbage collection Security Groups the LBC must have each TGB cached within the network manager:
https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/pkg/targetgroupbinding/networking_manager.go#L216-L220

The cache the only gets populated from the TGB reconciler from doing work here:
https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/pkg/targetgroupbinding/resource_manager.go#L196 [Pods]
https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/pkg/targetgroupbinding/resource_manager.go#L324 [Nodes]

The cache population step happens after the checkpoint check, which means on boot time we don't populate the cache until the sync time has elapsed on the checkpoint timestamp. This means, in the worse case, that the cache might not get populated for sync time * 2 or 20 hours in most cases. It's important to note, that long lived LBCs wouldn't have this issue, as once the cache is warmed, it will never get de-populated.

This fix does a couple things:

Refactors the network manager to go into the networking package.
Adds the network manager into the deployer so it can attempt to trigger SG clean up when it detects an in-use SG.
Refactors the deferred TGB logic to track TGB state in-memory. This means that we will always process TGBs within 20-40 minutes after LBC start up. This ensures that the cache is warmed a lot faster, which means that the SG deletions happen quicker.

Testing:

Reproduced the original issue.
Ensured that the code here triggers deletion in ~45m. [I think this is ok, as deletion is low priority, and the resources which linger are free to customers]
Let the LBC code run for ~2 days to ensure no memory leaks occur.

Checklist

Added tests that cover your change (if possible)
Added/modified documentation as required (such as the README.md, or the docs directory)
Manually tested
Made sure the title of the PR is a good description that can go into the release notes

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

Backfilled missing tests for code in same general area 🎉
Refactored something and made the world a better place 🌟

shraddhabang · 2025-06-03T17:51:10Z

/lgtm
/approved

k8s-ci-robot · 2025-06-03T18:42:48Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wweiwei-li, zac-nixon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [wweiwei-li,zac-nixon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

zac-nixon added 3 commits May 30, 2025 11:36

refactor networking_manager to go into networking package

1af0d0d

refactor network manager creation outside of tgb resource manager

ba3052c

plumb networking manager into deployers, introduce gc fn

dfd4dae

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 3, 2025

k8s-ci-robot requested review from oliviassss and wweiwei-li June 3, 2025 16:40

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 3, 2025

k8s-ci-robot assigned shraddhabang Jun 3, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 3, 2025

wweiwei-li approved these changes Jun 3, 2025

View reviewed changes

k8s-ci-robot merged commit 225d31a into kubernetes-sigs:main Jun 3, 2025
9 checks passed

zac-nixon mentioned this pull request Jun 6, 2025

refactor deferred reconciler to always process tgb on start up #4224

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[bug fix] Make SG deletion more responsive #4216

[bug fix] Make SG deletion more responsive #4216

Uh oh!

zac-nixon commented Jun 3, 2025

Uh oh!

shraddhabang commented Jun 3, 2025

Uh oh!

k8s-ci-robot commented Jun 3, 2025

Uh oh!

Uh oh!

Uh oh!

[bug fix] Make SG deletion more responsive #4216

[bug fix] Make SG deletion more responsive #4216

Uh oh!

Conversation

zac-nixon commented Jun 3, 2025

Issue

Description

Checklist

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

Uh oh!

shraddhabang commented Jun 3, 2025

Uh oh!

k8s-ci-robot commented Jun 3, 2025

Uh oh!

Uh oh!

Uh oh!