Fix crash upon reading empty config for notifier#4979
Fix crash upon reading empty config for notifier#4979SoloJacobs wants to merge 1 commit intoprometheus:mainfrom
Conversation
Previously, `alertmanager` and `amtool` would dereference a null
pointer upon encountering a configuration such as:
```yaml
receivers:
- name: "example"
webhook_configs:
-
```
Here is an example traceback:
```sh
$ ./amtool check-config config/testdata/conf.crash.yml
Checking 'config/testdata/conf.crash.yml'panic: runtime error: invalid memory address or nil pointer dereference [recovered, repanicked]
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xba8d25]
goroutine 1 [running]:
gopkg.in/yaml%2ev2.handleErr(0xc000239a28)
/home/solomonjacobs/go/pkg/mod/gopkg.in/yaml.v2@v2.4.0/yaml.go:249 +0x68
panic({0xea0000?, 0x19a8260?})
/usr/local/go/src/runtime/panic.go:783 +0x132
github.com/prometheus/alertmanager/config.(*Config).UnmarshalYAML(0xc0004281e0, 0x773fed0393b0?)
/home/solomonjacobs/get/f-alertmanager/config/config.go:523 +0xce5
gopkg.in/yaml%2ev2.(*decoder).callUnmarshaler(0xc0002c71a0, 0xc00042a070, {0x773fed0393b0, 0xc0004281e0})
/home/solomonjacobs/go/pkg/mod/gopkg.in/yaml.v2@v2.4.0/decode.go:270 +0x94
gopkg.in/yaml%2ev2.(*decoder).prepare(0xc0002c71a0, 0xc00042a070, {0xfb4200?, 0xc0004281e0?, 0xc000072e08?})
/home/solomonjacobs/go/pkg/mod/gopkg.in/yaml.v2@v2.4.0/decode.go:313 +0x256
gopkg.in/yaml%2ev2.(*decoder).unmarshal(0xc0002c71a0, 0xc00042a070, {0xfb4200?, 0xc0004281e0?, 0xaa506b?})
/home/solomonjacobs/go/pkg/mod/gopkg.in/yaml.v2@v2.4.0/decode.go:364 +0x13b
gopkg.in/yaml%2ev2.(*decoder).document(...)
/home/solomonjacobs/go/pkg/mod/gopkg.in/yaml.v2@v2.4.0/decode.go:384
gopkg.in/yaml%2ev2.(*decoder).unmarshal(0xea7a00?, 0xc0004281e0?, {0xfb4200?, 0xc0004281e0?, 0x46467e?})
/home/solomonjacobs/go/pkg/mod/gopkg.in/yaml.v2@v2.4.0/decode.go:360 +0x110
gopkg.in/yaml%2ev2.unmarshal({0xc0003ea800, 0x34, 0x40}, {0xea7a00, 0xc0004281e0}, 0x1)
/home/solomonjacobs/go/pkg/mod/gopkg.in/yaml.v2@v2.4.0/yaml.go:148 +0x371
gopkg.in/yaml%2ev2.UnmarshalStrict(...)
/home/solomonjacobs/go/pkg/mod/gopkg.in/yaml.v2@v2.4.0/yaml.go:89
github.com/prometheus/alertmanager/config.Load({0xc0003ea7c0, 0x34})
/home/solomonjacobs/get/f-alertmanager/config/config.go:287 +0x55
github.com/prometheus/alertmanager/config.LoadFile({0x7fff2ce3c111, 0x1e})
/home/solomonjacobs/get/f-alertmanager/config/config.go:313 +0x37
github.com/prometheus/alertmanager/cli.CheckConfig({0xc0003c1e80?, 0xc000239ca8?, 0x419c1f?})
/home/solomonjacobs/get/f-alertmanager/cli/check_config.go:67 +0x1d4
github.com/prometheus/alertmanager/cli.(*checkConfigCmd).checkConfig(...)
/home/solomonjacobs/get/f-alertmanager/cli/check_config.go:48
github.com/alecthomas/kingpin/v2.(*actionMixin).applyActions(...)
/home/solomonjacobs/go/pkg/mod/github.com/alecthomas/kingpin/v2@v2.4.0/actions.go:28
github.com/alecthomas/kingpin/v2.(*Application).applyActions(0xc0001a4e00?, 0xc0002bc7e0)
/home/solomonjacobs/go/pkg/mod/github.com/alecthomas/kingpin/v2@v2.4.0/app.go:568 +0xd0
github.com/alecthomas/kingpin/v2.(*Application).execute(0xc0001a4e00, 0xc0002bc7e0, {0xc0003c1e50, 0x1, 0x1})
/home/solomonjacobs/go/pkg/mod/github.com/alecthomas/kingpin/v2@v2.4.0/app.go:398 +0x65
github.com/alecthomas/kingpin/v2.(*Application).Parse(0xc0001a4e00, {0xc000034370?, 0xc000034370?, 0xc00040ea80?})
/home/solomonjacobs/go/pkg/mod/github.com/alecthomas/kingpin/v2@v2.4.0/app.go:230 +0x13f
github.com/prometheus/alertmanager/cli.Execute()
/home/solomonjacobs/get/f-alertmanager/cli/root.go:184 +0x12c9
main.main()
/home/solomonjacobs/get/f-alertmanager/cmd/amtool/main.go:19 +0xf
```
Related-to: prometheus#4912
Signed-off-by: Solomon Jacobs <solomonjacobs@protonmail.com>
0bb15fd to
c92f9e3
Compare
| for _, wh := range rcv.WebhookConfigs { | ||
| if wh == nil { | ||
| wh = &WebhookConfig{} | ||
| return wh.UnmarshalYAML(func(any) error { return nil }) |
There was a problem hiding this comment.
Why do we need to call wh.UnmarshalYAML() and return?
Doesn't this mean we simply stop processing the rest of the config?
Also since most of these fields are required, should we just log an error and exit the process?
There was a problem hiding this comment.
Why do we need to call wh.UnmarshalYAML() and return?
The primary motivation for me was to preserve the errors, that it would have encountered, if it did parse an empty config. A different way to accomplish this is by defining an error, and share it between the UnmarshalYAML and this call site.
Doesn't this mean we simply stop processing the rest of the config?
For this specific notifier, this call always terminates here. In the case of some of the error is only exposed later.
Also since most of these fields are required, should we just log an error and exit the process?
We should definitely just return the error: If we exit the process, this might break the reloading mechanism. On initial startup, the process will log a message and then exit. I will not touch the error handling in that function in this PR, although I do have some cleanups in mind. I imagine you meant something different, tough.
I hope that clears up your questions. I would then propose the following: Have a validation function per WebhookConfig, EmailConfig etc. This will be source of truth, rather then logic being scattered across (c *Config) UnmarshalYAML , (c *WebhookConfig) UnmarshalYAML. @siavashs do you like this proposal?
@TheMeier Would such a refactoring interfere with your current work?
In the future, I would like to
- move the validation out of the
UnmarshalYAML, - use a parse, don't validate approach,
- have consistent error messages, if required fields are missing.
But that is for different PR.
There was a problem hiding this comment.
Actually I thought about validation functions per notifier-config too, would be a nice addtion. I don't think it would interfer too much wih my experiment. But maybe it might be worthwhile to wait a bit more and see if a path emerges for splitting out notifier-specific code before adding more per-notifier code.
There was a problem hiding this comment.
The primary motivation for me was to preserve the errors, that it would have encountered, if it did parse an empty config. A different way to accomplish this is by defining an error, and share it between the UnmarshalYAML and this call site.
We must use either an existing error type or create a custom one without needing to make unnecessary method calls to produce an error.
For example you can construct TypeErrors, or just a plain err.Error which says what failed.
For this specific notifier, this call always terminates here. In the case of some of the error is only exposed later.
There are following code blocks which need to process the rest of the config, an early return here means that we never process those.
I think we need to gather all errors and return them to the caller, then the caller decides what to do with them.
Otherwise the user will become aware of the errors one by one, forcing them to go through iterations rather than receiving a full validation result from the config.
I would then propose the following: Have a validation function per WebhookConfig, EmailConfig etc. This will be source of truth, rather then logic being scattered across (c *Config) UnmarshalYAML , (c *WebhookConfig) UnmarshalYAML.
During the maintainers meet up we discussed about breaking up the config file into multiple parts and having independent load and validations per part, for example:
- if you have 10 receivers and you add a new broken one we just skip loading that and show an error in the logs, ui, api, etc.
This should be done both during startup and reload, which should help Alertmanager to start and let the user fix the broken part of the config rather than having complete downtime.
Reloads are similarly handled, in a scenario when multiple users inject their receiver configs, a broken new receiver should not block the receiver added by another user to load, etc.
So I like the idea of decoupling the config load mechanism and also braking up the config file into smaller pieces.
There was a problem hiding this comment.
There are following code blocks which need to process the rest of the config, an early return here means that we never process those.
Yes, but this behavior was already present, it is not introduced in this change. We can collect errors, but that is a much larger change, which affects the control in many places. For now, all this change should do is change crashes into error messages, the same way it is done everywhere else. I will do follow-up with other suggestions of yours.
I created an issue to track the other findings, which we have discussed here. @TheMeier @siavashs Let me know what you think: #4989
There was a problem hiding this comment.
Let's create a missing <section> config error and return that instead of calling unmarshal on empty config.
Otherwise LGTM.
There was a problem hiding this comment.
I think it is fine to delay error collection to a later point in time. Creating a missing <section> config error also sounds good to me
Previously,
alertmanagerandamtoolwould dereference a null pointer upon encountering a configuration such asHere is an example traceback.