-
Notifications
You must be signed in to change notification settings - Fork 2.4k
feat(silence): add dedicated silencer cache #4980
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
0da0d55 to
56adf7d
Compare
69b5a5a to
2021ab1
Compare
Add a dedicated cache to silencer to drop the dependency on the global marker. For now this should make get API calls non-blocking for silencer. This will also enable us to eventually deprecate and remove the global marker. The new silencer cache is evicted by hooking into provider PostDelete callback. This change also remove the logic to recreate silencer on reload which was not needed. Signed-off-by: Siavash Safi <siavash@cloudflare.com>
2021ab1 to
4f121a3
Compare
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Spaceman1701
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this generally looks very good. I don't think anything I comment on needs to be changed.
The one thing I'm worried about is the delete callback from the mem provider: it's a bit scary that this holds a lock on the entire provider while the callback executes. I don't think there's any way to deadlock, but it does seem like we could easily cause deadlock bugs in the future.
| @@ -1,4 +1,4 @@ | |||
| // Copyright 2016 Prometheus Team | |||
| // Copyright The Prometheus Authors | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this meant to have changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is the official new copyright header for Prometheus.
| *alertGCInterval, | ||
| *perAlertNameLimit, | ||
| nil, | ||
| silencer, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine, but one thing to note is that the alert deleted hook is called while we have a mutex on the entire alert provider:
alertmanager/provider/mem/mem.go
Lines 172 to 183 in a3a156d
| func (a *Alerts) gc() { | |
| a.mtx.Lock() | |
| defer a.mtx.Unlock() | |
| deleted := a.alerts.GC() | |
| for _, alert := range deleted { | |
| // As we don't persist alerts, we no longer consider them after | |
| // they are resolved. Alerts waiting for resolved notifications are | |
| // held in memory in aggregation groups redundantly. | |
| a.marker.Delete(alert.Fingerprint()) | |
| a.callback.PostDelete(&alert) | |
| } |
We should be careful that this doesn't deadlock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @grobinson-grafana advised against using the callbacks but we do not have a better option yet.
We could spin up a go routine in silencer and return the callback quickly so we do not block the provider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code again, same thing is happening with marker, so we are just adding an extra call for a map delete which should be fast.
Silencer itself does not hold locks long on its cache so I think this is going to be fine for now.
One thing we can improve is to batch deletes to both marker and the silence cache.
I can open a separate PR for that as a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah just be careful with callbacks, we fixed I think a total of 3-4 race condition and deadlock bugs related to callbacks either calling itself (attempting to re-aquire its own mutex) or callbacks from A calling B and callback from B calling A (hence deadlock).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a draft change which removes locks from callbacks since the caller doesn't really need to hold its internal lock while calling the callback.
This should improve the performance and avoid any potential deadlocks in future.
Maybe when we get the new EventInterface, we could potentially hook into that to eventually delete alerts marked as resolved in different components.
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Add a dedicated cache to silencer to drop the dependency on the global marker.
For now this should make get API calls non-blocking for silencer.
This will also enable us to eventually deprecate and remove the global marker.
The new silencer cache is evicted by hooking into provider PostDelete callback.
This change also removes the logic to recreate silencer on reload which was not needed.
Signed-off-by: Siavash Safi siavash@cloudflare.com