
vmoperator: set default -http.shutdownDelay for HTTP-serving components#1882

Open
endesapt wants to merge 9 commits into VictoriaMetrics:master from endesapt:default-shutdown-delay

Conversation

@endesapt
Contributor

@endesapt endesapt commented Feb 24, 2026

Fixes: #1834

Created an AddHTTPShutdownDelayArg func that adds -http.shutdownDelay based on these factors, in order of precedence:

  • the user's http.shutdownDelay extra argument
  • readinessProbe.PeriodSeconds (or 5 seconds by default) * readinessProbe.FailureThreshold (or 10 by default)
  • 50 seconds by default

Applied it to every component that has the -http.shutdownDelay flag.
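The precedence described above can be sketched roughly like this (a minimal illustration with assumed default constants and a hypothetical helper name, not the operator's actual code):

```go
package main

import (
	"fmt"
	"strconv"
)

// Defaults assumed for illustration; the real values live in the operator code.
const (
	defaultReadinessPeriodSeconds    = 5
	defaultReadinessFailureThreshold = 10
	fallbackShutdownDelaySeconds     = 50
)

// computeShutdownDelaySeconds is a hypothetical sketch of the precedence
// described above: the user's extraArgs win, then a readiness-derived value,
// then the 50s fallback.
func computeShutdownDelaySeconds(extraArgs map[string]string, periodSeconds, failureThreshold int32) string {
	if v, ok := extraArgs["http.shutdownDelay"]; ok {
		return v // user override always wins
	}
	period := int32(defaultReadinessPeriodSeconds)
	threshold := int32(defaultReadinessFailureThreshold)
	if periodSeconds > 0 {
		period = periodSeconds
	}
	if failureThreshold > 0 {
		threshold = failureThreshold
	}
	delay := period * threshold
	if delay <= 0 {
		delay = fallbackShutdownDelaySeconds
	}
	return strconv.Itoa(int(delay)) + "s"
}

func main() {
	// With the built-in readiness defaults, 5 * 10 yields the 50s default.
	fmt.Println(computeShutdownDelaySeconds(nil, 0, 0))
}
```

Note that with the assumed defaults the readiness-derived value (5s * 10) coincides with the 50-second fallback.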


Summary by cubic

Set a default -http.shutdownDelay for all HTTP-serving components equal to the pod’s termination grace period. Applies only to new resources, defaults to 30s when the pod value isn’t set, and respects extraArgs overrides.

  • New Features
    • Added AddHTTPShutdownDelayArg: delay = terminationGracePeriodSeconds (fallback 30s); skipped if extraArgs["http.shutdownDelay"] is set.
    • Defaulted pod TerminationGracePeriodSeconds to 30s in Deployment/StatefulSet/DaemonSet when nil; avoids mutating CR specs.
    • Applied to VMAgent, VLAgent, VMAlert, VMAuth (incl. LB), and VM/VL/VT Single and Cluster (insert/select/storage); tests cover default and custom values; freshness check reverted to keep changes limited to new resources.

Written for commit d91d083. Summary will update on new commits.
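Based on the summary above, the final behavior can be sketched like this (hypothetical signature; the real helper operates on the component's CR and container args):

```go
package main

import "fmt"

// shutdownDelayArg is a sketch of the summarized behavior: the delay mirrors
// the pod's terminationGracePeriodSeconds, falls back to 30s when that is nil,
// and is skipped entirely when the user already set http.shutdownDelay via
// extraArgs.
func shutdownDelayArg(extraArgs map[string]string, terminationGracePeriodSeconds *int64) string {
	if _, ok := extraArgs["http.shutdownDelay"]; ok {
		return "" // respect the user's override; add nothing
	}
	grace := int64(30) // fallback when the pod value is not set
	if terminationGracePeriodSeconds != nil {
		grace = *terminationGracePeriodSeconds
	}
	return fmt.Sprintf("-http.shutdownDelay=%ds", grace)
}

func main() {
	tgps := int64(120)
	fmt.Println(shutdownDelayArg(nil, &tgps)) // derived from the grace period
	fmt.Println(shutdownDelayArg(nil, nil))   // 30s fallback
}
```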

}

func TestAddHTTPShutdownDelayArg(t *testing.T) {
	t.Run("adds default derived from built-in readiness defaults", func(t *testing.T) {
Contributor

Please rewrite this using the f-test pattern, like the other tests in this file.

Contributor Author

Thanks, done

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

No issues found across 23 files

@AndrewChubatiuk
Contributor

@endesapt @vrutkovs
should it also depend on terminationGracePeriodSeconds?

}

	delaySeconds := defaultReadinessPeriodSeconds * defaultReadinessFailureThreshold
	if probes != nil && probes.ReadinessProbe != nil {
Collaborator

I don't think it's a good idea - this makes the results less predictable. terminationGracePeriodSeconds would be yet another factor influencing the result.

Instead, I propose to set a valid default value (possibly separate for each component?) and set readinessPeriod / terminationGracePeriodSeconds unless they are already defined.

Also, we should apply this to new instances only - we should avoid unnecessary rollouts after operator upgrades

Contributor Author

I investigated a little bit and I am now wondering whether we even need the readiness probe for that. As I read the Pod Termination Flow, when a pod is terminated it is immediately marked ready: false in its EndpointSlices, which stops all traffic to it. So we only need to rely on terminationGracePeriodSeconds to understand how much time we have to process the traffic we have already received.
So I guess the final scheme will be without the readinessProbe and will be:

  1. if spec.terminationGracePeriodSeconds is not set, set it to a default (per component or unified)
  2. http.shutdownDelay is inherited from it if not defined by the user

Contributor Author

@endesapt endesapt Feb 25, 2026

(possibly separate for each component?)

If I didn't miss anything, I can change the code accordingly, but I don't know what the values should be for each component (or even what the unified value should be), because I don't have much experience maintaining VM components. I would be glad if you could help me with that.

I can just start with setting a unified default of 30s, and then we can change it to a more appropriate value.

Collaborator

i am now concerned, if we even need readiness probe for that?

No, not necessarily, but my understanding is that the readinessProbe needs to align with shutdownDelay - when the container goes down, we'll run x readiness checks + terminationGracePeriodSeconds. We want to minimize the downtime, so the readinessProbe needs to run quite often (it's cheap), wait for 2 failures and only afterwards mark the container as failed.
I don't think we need to adjust probe timings here, but if we do - we should do that here and preferably for new resources only.

I can just start with setting unified default to 30s and then we will change it to appropriate value

That sounds reasonable; make sure the implementation makes it easy to tweak it for, say, vmstorage later.

Contributor Author

That sounds reasonable, make sure the implementation makes it easy to tweak it for, say, vmstorage later easily

Understood.

we'll run x readiness checks + terminationGracePeriodSeconds. We want to minimize the downtime, so the readinessProbe needs to run quite often (it's cheap), wait for 2 failures and only afterwards mark the container as failed.

So for shutdownDelay we will still have the scheme I described earlier, right? The readiness probe will fail during termination because of the flag, and the pod will be marked as ready: false; serving: "false" in the EndpointSlice, stopping the traffic.

I am talking about this:

  1. if spec.terminationGracePeriodSeconds is not set, set it to a default (per component or unified)
  2. http.shutdownDelay is inherited from it if not defined by the user

for new resources only.

I will try to do that, but it would be great if you could give some direction on how. How do we distinguish new resources from older ones? By their annotations, maybe?

Collaborator

we will have the scheme i described earlier right?

Yes, I think we're on the same page. I don't think readiness probes need to be changed for now.

How to distinct new resources from older ones?

Most reconcile procedures have a prevCR var - if it's nil, it's an empty CR.
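The suggested gate can be sketched like this (spec and the helper are illustrative stand-ins; the operator's real reconcile code passes the previously applied CR):

```go
package main

import "fmt"

// spec stands in for a component CR spec; the real type differs per component.
type spec struct{ Name string }

// applyShutdownDelayDefault sketches the gate discussed above: only inject the
// default flag when prevCR is nil, i.e. the resource is being created for the
// first time, so existing workloads do not get an unexpected rollout.
func applyShutdownDelayDefault(prevCR *spec, args []string) []string {
	if prevCR != nil {
		return args // existing resource: leave args untouched
	}
	return append(args, "-http.shutdownDelay=30s")
}

func main() {
	fmt.Println(applyShutdownDelayDefault(nil, nil))          // new resource gets the default
	fmt.Println(applyShutdownDelayDefault(&spec{"old"}, nil)) // existing resource is untouched
}
```

(As discussed further down in this thread, the "freshness" check was later dropped in favor of handling unexpected rollouts by other means.)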

Contributor

Also, we should apply this to new instances only - we should avoid unnecessary rollouts after operator upgrades

@vrutkovs I don't get the idea - why do we need it?

Contributor Author

If we add a new argument like -http.shutdownDelay to the Deployment spec, it will force the Pods through a rolling update. And since we are adding this to almost every vm operator component, it will trigger a lot of rolling updates.
I guess that is a lot of pressure on a cluster with lots of VM deployments, which is why we are trying to avoid it.

wantArgs: nil,
})
f(opts{
extraArgs: nil,
Collaborator

Test to add: shutdownDelay and probe both set
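One possible shape for that case, in the table-driven f-style used by the surrounding tests (the opts fields and the helper are simplified stand-ins for the real test harness and AddHTTPShutdownDelayArg):

```go
package main

import "fmt"

// Simplified stand-ins: the real test uses the operator's opts struct and
// AddHTTPShutdownDelayArg implementation.
type opts struct {
	extraArgs        map[string]string
	probePeriod      int32
	failureThreshold int32
	wantArgs         []string
}

func addHTTPShutdownDelayArg(extraArgs map[string]string) []string {
	if _, ok := extraArgs["http.shutdownDelay"]; ok {
		return nil // user-supplied flag wins; nothing to add
	}
	return []string{"-http.shutdownDelay=30s"}
}

func main() {
	// f-style case: shutdownDelay and probe both set -> extraArgs wins and
	// no derived default is appended.
	f := func(o opts) {
		got := addHTTPShutdownDelayArg(o.extraArgs)
		fmt.Println(len(got) == len(o.wantArgs))
	}
	f(opts{
		extraArgs:        map[string]string{"http.shutdownDelay": "15s"},
		probePeriod:      2,
		failureThreshold: 3,
		wantArgs:         nil,
	})
}
```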

@endesapt
Contributor Author

Changed AddHTTPShutdownDelayArg to work as we discussed and to apply only to new resources.
But for now I didn't change terminationGracePeriodSeconds, because the default value is still 30 seconds.

I wanted to set it explicitly in the DaemonSet, Deployment and StatefulSet like this:

f1e4300#diff-e7d6e0f45db6d4efd51be09d7ff37703298db2a37b4b0ee87054f1c549d77176R26-R28

But then I realized it would make old deployments restart. I can try passing the check into those functions, but I think it will be messy. Do we need it?

As for changing the defaults for individual components, it will be pretty easy. As httpShutdownDelay relies on terminationGracePeriod, we can change the default value for each component like this:

if cr.Spec.TerminationGracePeriodSeconds == nil {
	cr.Spec.TerminationGracePeriodSeconds = ptr.To[int64](120)
}

Collaborator

@vrutkovs vrutkovs left a comment

LGTM on the approach, few comments on the code

@endesapt
Contributor Author

Hope I got you right.

@endesapt endesapt requested a review from vrutkovs February 27, 2026 11:02
Collaborator

@vrutkovs vrutkovs left a comment

LGTM, @AndrewChubatiuk could you have a look? I'm not sure if the defaulting style is correct.

- change terminationGracePeriodSeconds default to 30 seconds
- refactor tests and add a test for custom terminationGracePeriodSeconds
- change default shutdownDelay to 30 seconds
- delete setting default terminationGracePeriodSeconds to 30s
@endesapt endesapt force-pushed the default-shutdown-delay branch from 5f4806c to 7cf7c8b on February 27, 2026 13:11
@endesapt
Contributor Author

Brought the branch up to date with master and edited LastAppliedSpec according to the review.

@endesapt
Contributor Author

@vrutkovs @AndrewChubatiuk Can you approve the requested changes if everything is alright?

@vrutkovs
Collaborator

This change looks good, so no worries about that, but I'm afraid I misinformed you: we would like to ensure that an operator update will not cause unnecessary pod rollouts. This is a longstanding task and not limited to this PR (this is why I asked you to update the PR and apply this change to new resources only).

However, it appears this decision would later bite us - we'd have to handle removing this argument too, so it's getting more and more complicated. The radical solution is to implement a "dry run mode" - #1832 - where users would be able to verify whether an operator update will cause rollouts and plan their maintenance accordingly.

As a result, we'll have to change direction again - there is no need to take the "freshness" of the resource into account; we'll tackle the unexpected rollouts by different means. Sorry about this taking so long - we accidentally tripped a long-standing problem; there is nothing wrong with your code.

I understand if you want to leave the PR as is - we'll update it accordingly in that case, no problem with that.

@endesapt
Contributor Author

I understand if you want to leave the PR as is

I like contributing to VictoriaMetrics - it is my way to learn and to say thanks to your project (I hope it helps). Changes are part of it, so it is okay. I will revert the "freshness" check when I have time.



Development

Successfully merging this pull request may close these issues.

vmoperator: set default -http.shutdownDelay for HTTP-serving components

3 participants