feat: treat AWS permission errors as hard errors to fail readiness probes #5794

@u-kai

What would you like to be added:

AWS provider should treat permission/authorization errors as hard errors instead of soft errors, causing external-dns to terminate and readiness probes to fail.
Currently, AWS permission errors (AccessDenied, UnauthorizedOperation) are wrapped as provider.SoftError, so the process continues running and /healthz stays healthy.

Why is this needed:
Permission errors are non-recoverable and require manual intervention.
In real-world setups (IRSA/Pod Identity, EC2 instance profiles), identity association changes often require a fresh pod/instance to pick up the new mapping/credentials. Failing fast accelerates that refresh path and surfaces the error to operators and monitoring.

The current behavior creates silent failures: DNS records aren't updated, but the service appears healthy.
By failing fast on permission errors:

  • Kubernetes restarts the pod automatically
  • Operators get immediate feedback about misconfigurations
  • Monitoring systems properly detect issues

This aligns with "fail fast" principles for non-recoverable errors and improves operational visibility.

If the proposed direction makes sense, I’d be happy to open a PR.

Labels: kind/feature
