Unsuitable histogram buckets for LBC metrics (awslbc_readiness_gate_ready_seconds_bucket) #3987

@frittentheke

Describe the bug
Following the discussion of very slow target registration in #1834, @zac-nixon created PR #3941, which adds metrics for the latency of the readiness gate (podReadinessFlipSeconds). This cool new feature was merged by @wweiwei-li and @shraddhabang and then released in https://github.com/kubernetes-sigs/aws-load-balancer-controller/releases/tag/v2.10.1

Unfortunately, the histogram's buckets are unsuitable for the latency that is observed (and realistic) for AWS NLB target registration. As it stands, the now-improved registration time is about 60 to 70 seconds, while the buckets default to:
{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10} (see https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#pkg-variables). Since the largest finite bucket is 10 seconds, all readiness flips end up in the catch-all bucket, e.g.:

awslbc_readiness_gate_ready_seconds_bucket{le="+Inf"}
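
To illustrate the effect, here is a minimal standalone sketch (standard library only, not the controller's code). It copies the default boundaries quoted above into a local slice and uses a hypothetical helper, `bucketLabel`, to look up the smallest `le` bucket that an observation falls into:

```go
package main

import (
	"fmt"
	"sort"
)

// defBuckets copies the default boundaries quoted in this issue
// (prometheus.DefBuckets in client_golang), in seconds.
var defBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

// bucketLabel is a hypothetical helper for illustration only. It returns the
// `le` label of the smallest bucket containing v, or "+Inf" when v exceeds
// every finite upper bound. bounds must be sorted in ascending order.
func bucketLabel(bounds []float64, v float64) string {
	// SearchFloat64s returns the first index i with bounds[i] >= v.
	i := sort.SearchFloat64s(bounds, v)
	if i == len(bounds) {
		return "+Inf"
	}
	return fmt.Sprint(bounds[i])
}

func main() {
	// 0.3s fits a finite bucket; 60s and 70s (typical NLB registration
	// times reported in this issue) do not.
	for _, seconds := range []float64{0.3, 60, 70} {
		fmt.Printf("%gs -> le=%q\n", seconds, bucketLabel(defBuckets, seconds))
	}
}
```

With these boundaries, any observation above 10 seconds is only counted in `le="+Inf"`, so the histogram carries no information about how the 60 to 70 second flips are distributed.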

Linear buckets (https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#LinearBuckets) covering a range of, e.g., 30s to 5m, which is what can be expected from the API and the processes behind the health check and readiness gate mechanism, would likely make more sense.
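
As a rough sketch of what such boundaries could look like: the local `linearBuckets` function below is a standard-library stand-in that mirrors the semantics of `prometheus.LinearBuckets(start, width, count)`. The 30-second width is only an assumption for illustration; the issue suggests the 30s to 5m range but not a step size.

```go
package main

import "fmt"

// linearBuckets is a stand-in with the same semantics as
// prometheus.LinearBuckets: it returns count upper bounds, starting at
// start and increasing by width each step.
func linearBuckets(start, width float64, count int) []float64 {
	if count < 1 {
		panic("linearBuckets needs a positive count")
	}
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start += width
	}
	return buckets
}

func main() {
	// Assumed layout: 10 buckets, 30s apart, covering 30s up to 5m (300s).
	fmt.Println(linearBuckets(30, 30, 10))
}
```

In the controller itself, the equivalent would presumably be to set the `Buckets` field of the `prometheus.HistogramOpts` used for this metric to `prometheus.LinearBuckets(30, 30, 10)`. With these boundaries, a 60 to 70 second flip lands in `le="60"` or `le="90"` rather than only in `le="+Inf"`.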

Environment

  • AWS Load Balancer controller version: 2.10.1
  • Kubernetes version: 1.31.x
