-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Handle MOVED error pointing to same endpoint. #3003
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…on before retrying the request.
948a601 to
9c32b44
Compare
| { | ||
| if (bridge is null) | ||
| { | ||
| // already toast | ||
| } | ||
| else if (bridge.Multiplexer.TryResend(hashSlot, message, endpoint, isMoved)) | ||
| // MOVED to same endpoint detected. | ||
| // This occurs when Redis/Valkey servers are behind DNS records, load balancers, or proxies. | ||
| // The MOVED error signals that the client should reconnect to allow the DNS/proxy/load balancer | ||
| // to route the connection to a different underlying server host, then retry the command. | ||
| bridge?.TryConnect(null)?.Dispose(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it relies on timing between the connection being detected as closed, and the parser - but the parser is IIRC "inline" here, i.e. we haven't finished reading yet. I need to think very carefully here about whether this is reliable ... I"m not sure either way, and I invite your input!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
additional concern: from memory (@NickCraver may remember more), DNS resolution in .NET has a habit of being cached for the process duration, unless explicit steps are taken - and I'm not seeing any explicit steps. I'm concerned that this may result in instant reconnection on the old cached (in-proc) DNS entry; definitely something to check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@philon-msft tells me that the DNS part might be safer than I recall - will discuss, but: might not be a problem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgravell regarding your first concern: I added a _needsReconnect flag to the bridge so that we only mark the connection as requiring a reconnect. This ensures that a retried command is queued rather than buffered to the old connection, deferring the actual reconnection to the reader loop in the case of a MOVED-to-same-endpoint scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: the DNS caching concern:
I created a simple test using a Route53 DNS record:
- Connect to host A (via the Route53 DNS hostname), SET a key, verify it's stored
- Change the DNS record to point to a different host B
- Create a new connection (via same hostname), verify the key doesn't exist
The test passed - the second connection correctly resolved to the new host. I don't see DNS caching issues. Additionally, as I mentioned in #2990, if DNS caching were an issue with SE.Redis clients, we at ElastiCache would have already encountered it with our customers - our managed clusters logic relies heavily on DNS changes for failover and scaling operations.
…D-to-same-endpoint
|
@mgravell ping:) |
|
Ping is fair - I've been eyes-down wrapping up 8.6 tweaks; on my list to look deeper after that. |
Ack. Thanks! |
Handle MOVED error pointing to same endpoint
Problem
When Redis/Valkey servers are deployed behind DNS records, load balancers, or proxies (as is common in managed environments), a MOVED error may be returned with a target node endpoint that is the same endpoint from which the error originated. Before the proposed change, when StackExchange.Redis received a MOVED error pointing to the same endpoint, it would fail the request and propagate the MOVED exception back to the application.
Issue ref: #2990
Solution
When a MOVED error points to the same endpoint, the client now triggers a reconnection before retrying the command. This allows the DNS record, proxy, or load balancer to route the new connection to a different underlying server host, enabling the retry to succeed.
Why proactive reconnection is necessary
The server closes the connection on its end immediately after sending the MOVED-to-same-endpoint error. However, due to the multiplexed nature of StackExchange.Redis connections, the library wouldn't notice the disconnection when it retries the request—all Write and Flush operations return successfully even though the underlying connection is broken. The disconnection is only detected when attempting to read the response.
At that point, it isn't safe to retry on the connection failure: we don't know if the request was actually sent to the server, so the connection error is raised back to the application, which also cannot determine if it's safe to retry. By initiating a reconnect immediately following the MOVED-to-same-endpoint error, we ensure the retry occurs on a fresh connection to a new server host.
Changes
Core fix:
ResultProcessor.cs: Detect when MOVED endpoint matches the current server endpoint and trigger reconnection by disposing the existing connection.Tests:
MovedToSameEndpointTests.cs: Integration test verifying the reconnect-and-retry behaviorMovedTestServer.cs: Test server helper that simulates MOVED responses pointing to the same endpoint