Skip to content

Conversation

@siavashs
Copy link
Contributor

@siavashs siavashs commented Feb 8, 2026

Add a dedicated cache to silencer to drop the dependency on the global marker.
For now this should make get API calls non-blocking for silencer.
This will also enable us to eventually deprecate and remove the global marker.

The new silencer cache is evicted by hooking into provider PostDelete callback.

This change also removes the logic to recreate silencer on reload which was not needed.

Signed-off-by: Siavash Safi siavash@cloudflare.com

@siavashs siavashs force-pushed the feat/dedicated-silencer-cache branch 3 times, most recently from 0da0d55 to 56adf7d Compare February 9, 2026 12:15
@siavashs siavashs added the keepalive Exempt from stale checks label Feb 11, 2026
@siavashs siavashs force-pushed the feat/dedicated-silencer-cache branch 4 times, most recently from 69b5a5a to 2021ab1 Compare February 12, 2026 10:53
Add a dedicated cache to silencer to drop the dependency on the global marker.
For now this should make get API calls non-blocking for silencer.
This will also enable us to eventually deprecate and remove the global marker.

The new silencer cache is evicted by hooking into provider PostDelete callback.

This change also remove the logic to recreate silencer on reload which
was not needed.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@siavashs siavashs force-pushed the feat/dedicated-silencer-cache branch from 2021ab1 to 4f121a3 Compare February 12, 2026 11:55
@siavashs siavashs marked this pull request as ready for review February 12, 2026 12:07
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@siavashs siavashs requested a review from ultrotter February 12, 2026 15:20
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Copy link
Contributor

@Spaceman1701 Spaceman1701 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this generally looks very good. I don't think anything I comment on needs to be changed.

The one thing I'm worried about is the delete callback from the mem provider: it's a bit scary that this holds a lock on the entire provider while the callback executes. I don't think there's any way to deadlock, but it does seem like we could easily cause deadlock bugs in the future.

@@ -1,4 +1,4 @@
// Copyright 2016 Prometheus Team
// Copyright The Prometheus Authors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this meant to have changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is the official new copyright header for Prometheus.

*alertGCInterval,
*perAlertNameLimit,
nil,
silencer,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine, but one thing to note is that the alert deleted hook is called while we have a mutex on the entire alert provider:

func (a *Alerts) gc() {
a.mtx.Lock()
defer a.mtx.Unlock()
deleted := a.alerts.GC()
for _, alert := range deleted {
// As we don't persist alerts, we no longer consider them after
// they are resolved. Alerts waiting for resolved notifications are
// held in memory in aggregation groups redundantly.
a.marker.Delete(alert.Fingerprint())
a.callback.PostDelete(&alert)
}

We should be careful that this doesn't deadlock

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @grobinson-grafana advised against using the callbacks but we do not have a better option yet.
We could spin up a go routine in silencer and return the callback quickly so we do not block the provider.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code again, same thing is happening with marker, so we are just adding an extra call for a map delete which should be fast.
Silencer itself does not hold locks long on its cache so I think this is going to be fine for now.
One thing we can improve is to batch deletes to both marker and the silence cache.
I can open a separate PR for that as a follow up.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah just be careful with callbacks, we fixed I think a total of 3-4 race condition and deadlock bugs related to callbacks either calling itself (attempting to re-aquire its own mutex) or callbacks from A calling B and callback from B calling A (hence deadlock).

Copy link
Contributor Author

@siavashs siavashs Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a draft change which removes locks from callbacks since the caller doesn't really need to hold its internal lock while calling the callback.
This should improve the performance and avoid any potential deadlocks in future.

Maybe when we get the new EventInterface, we could potentially hook into that to eventually delete alerts marked as resolved in different components.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/silences keepalive Exempt from stale checks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants