132 changes: 132 additions & 0 deletions scripts/mitigation-scripts/stale-lb-dsr-rules/README.md
@@ -0,0 +1,132 @@
# Stale LB DSR Rules Cleanup

## Overview

This mitigation script automatically detects and removes stale Load Balancer Direct Server Return (LB DSR) rules from VFP (Virtual Filtering Platform) that reference non-existent backend endpoints. It runs continuously to maintain network health by cleaning up orphaned rules that can cause connectivity issues.

## Problem Statement

When backend endpoints are removed or become unavailable, the corresponding LB DSR rules in VFP may not be cleaned up properly. These stale rules can:
- Cause packet routing failures
- Lead to connection timeouts
- Create unnecessary overhead in the networking stack
- Result in traffic being sent to non-existent endpoints

## Solution

The `cleanup-stale-lb-rules.ps1` script:
1. Checks and sets the required registry configuration for LB DSR feature management
2. Continuously monitors VFP LB DSR rules (both IPv4 and IPv6)
3. Compares rule destination IPs (DIPs) against active HNS endpoints
4. Automatically removes rules that reference non-existent endpoints

## Prerequisites

- Windows Server with HNS (Host Network Service) enabled
- VFP control utilities (`vfpctrl.exe`) available
- PowerShell with administrator privileges
- HNS PowerShell module

## Usage

### Running the Script on a Single Node

```powershell
.\cleanup-stale-lb-rules.ps1
```

The script will:
1. Check registry key `HKLM:\SYSTEM\CurrentControlSet\Policies\Microsoft\FeatureManagement\Overrides\140377743`
2. If the key value is 1, set it to 0 and restart the node. This disables the change from PR 13179278, which causes delete LB RPC calls from KubeProxy to fail with an Invalid IP error (ICM: 719903780)
3. Start a continuous monitoring loop with 10-second intervals
4. Clean up any stale LB DSR rules found

**Note:** This approach fixes issues on a single node. If the issue is widespread across the cluster, deploy the solution using a DaemonSet:

```powershell
kubectl create -f cleanup-stale-lb-rules.yaml
```

This runs the mitigation script in HostProcess container (HPC) pods on all affected nodes.

### Configuration

You can modify these parameters at the top of the script:

- **`$groups`**: VFP groups to monitor (default: `LB_DSR_IPv4_OUT`, `LB_DSR_IPv6_OUT`)
- **`$refreshIntervalSeconds`**: Time between cleanup iterations (default: 10 seconds)

## How It Works

### 1. Registry Check
The script first checks that the feature flag registry key (140377743) is set to 0. If it is not, the script sets the value to 0 and restarts the node.

### 2. Endpoint Collection
- Retrieves all HNS policies
- Extracts endpoint references
- Builds a dictionary of valid endpoint IP addresses

### 3. Rule Validation
For each VFP port and LB DSR group:
- Lists all rules in the `LB_DSR` layer
- Extracts DIP (Destination IP) ranges from each rule
- Compares DIPs against the valid endpoint dictionary

### 4. Cleanup
- Rules with DIPs not found in active endpoints are flagged as stale
- Stale rules are automatically deleted using `vfpctrl /remove-rule`
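
The comparison in steps 2 through 4 can be modelled in a few lines. The sketch below uses Python rather than the script's PowerShell so that the set logic can be run anywhere. The function name `find_stale_rules` and the rule dictionary keys are hypothetical, and treating a rule as stale when any one of its DIPs is missing is an assumption, not confirmed behavior of the script.

```python
from ipaddress import ip_address


def find_stale_rules(rules, endpoint_ips):
    """Return the rules that reference a DIP with no active endpoint.

    rules: list of dicts with hypothetical keys 'rule_id', 'port',
           'group' and 'dips' (a list of IP strings).
    endpoint_ips: iterable of IP strings for the active HNS endpoints.
    """
    # Parse addresses so equivalent IPv6 spellings compare equal.
    valid = {ip_address(ip) for ip in endpoint_ips}
    stale = []
    for rule in rules:
        missing = [dip for dip in rule["dips"] if ip_address(dip) not in valid]
        if missing:
            # Assumption: a single missing DIP is enough to flag the rule.
            stale.append({**rule, "missing_dips": missing})
    return stale


rules = [
    {"rule_id": "ABC123", "port": "Port1", "group": "LB_DSR_IPv4_OUT",
     "dips": ["10.244.0.25"]},
    {"rule_id": "DEF456", "port": "Port1", "group": "LB_DSR_IPv4_OUT",
     "dips": ["10.244.0.7"]},
]
for rule in find_stale_rules(rules, ["10.244.0.7"]):
    print(rule["rule_id"], rule["missing_dips"])
```

Parsing the addresses before comparing them avoids flagging a rule only because an IPv6 address is written in expanded form on one side and compressed form on the other.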

## Output Examples

### Healthy State
```
All DIP ranges are present in the dictionary.
```

### Stale Rules Detected
```
Missing DIP ranges:
- 10.244.0.25
- fdf5:5d67:b9ce:b28f::13f
Deleting rule : ruleId: ABC123, port: Port1, group: LB_DSR_IPv4_OUT
```

## Monitoring

The script provides color-coded output:
- **Green**: Healthy state, all rules valid
- **Yellow**: Configuration changes or rule deletion in progress
- **Red**: Stale rules detected
- **Cyan**: Status updates and iteration markers

## Important Notes

- The script runs indefinitely until manually stopped (Ctrl+C)
- Node restart may occur on first run if registry configuration is incorrect
- Avoid running cleanup while legitimate endpoint updates are in progress; a rule whose endpoint is still being registered can be misread as stale (a false positive)
- The script requires elevated privileges to modify VFP rules and registry settings
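
One way to reduce the false positives mentioned above is to act only on rules that are flagged in two consecutive passes, which is the approach `cleanup-orphan-rules.ps1` in this change takes with its `PASS2_DELAY_SECONDS` delay. The following is a minimal Python sketch of that intersection step, not the script itself. The names `confirm_stale` and `collect_stale` are hypothetical, and `collect_stale` stands in for whatever produces the per-pass map of stale rules.

```python
import time


def confirm_stale(collect_stale, delay_seconds=10, sleep=time.sleep):
    """Return only the rules reported stale by two consecutive passes.

    collect_stale: hypothetical callable returning a dict that maps a
                   key such as 'port::ruleId' to the rule details.
    delay_seconds: pause between the passes (10 mirrors the script's
                   default for PASS2_DELAY_SECONDS).
    sleep: injectable for testing; defaults to time.sleep.
    """
    before = collect_stale()
    sleep(delay_seconds)
    after = collect_stale()
    # A rule that disappears from the second pass was probably caught
    # mid-update, so it is dropped rather than treated as an orphan.
    return {key: after[key] for key in before if key in after}


passes = iter([
    {"Port1::A": {"group": "LB_DSR_IPv4_OUT"}, "Port1::B": {"group": "LB_DSR_IPv4_OUT"}},
    {"Port1::B": {"group": "LB_DSR_IPv4_OUT"}},
])
print(confirm_stale(lambda: next(passes), delay_seconds=0))
```

A rule that shows up only in the second pass is also ignored for now; it is picked up by the next iteration if it is still stale then.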

## Troubleshooting

### Script doesn't detect stale rules
- Verify VFP and HNS are functioning correctly
- Check that `vfpctrl.exe` is accessible in the system PATH
- Ensure HNS endpoints are properly registered

### Node restarts unexpectedly
- This is expected behavior if the registry key is not set to 0
- After restart, the script will continue normal operation

### Permission errors
- Run PowerShell as Administrator
- Verify account has rights to modify VFP rules and registry

## Related Documentation

- [VFP Documentation](../../helper/VFP.psm1)
- [HNS Module](../HNS/)
- [Network Health Monitoring](../../networkhealth/)

## Support

For issues or questions, please refer to the main repository documentation or open an issue.
195 changes: 195 additions & 0 deletions scripts/mitigation-scripts/stale-lb-dsr-rules/cleanup-orphan-rules.ps1
@@ -0,0 +1,195 @@
function Write-KustoLog {
    param(
        [string]$Level,
        [string]$Event,
        [string]$Message,
        [hashtable]$Properties = @{}
    )
    $log = [ordered]@{
        Timestamp = (Get-Date -Format 'o')  # ISO 8601 round-trip format
        Level     = $Level
        Event     = $Event
        Message   = $Message
        Computer  = $env:COMPUTERNAME
    }
    foreach ($k in $Properties.Keys) {  # merge caller-supplied properties into the record
        $log[$k] = $Properties[$k]
    }
    $log | ConvertTo-Json -Compress | Write-Output  # one compressed JSON record per call
}

$nw = Get-HnsNetwork | Where-Object Name -eq 'azure'
$nodeIpv4 = $nw.ManagementIP
$nodeIpv6 = $nw.ManagementIPv6

Write-KustoLog -Level 'Info' -Event 'ScriptStarted' -Message 'Cleanup orphan rules script started.' -Properties @{
    NodeIPv4 = $nodeIpv4
    NodeIPv6 = $nodeIpv6
}


function Get-ExpectedVfpRuleIds {
    param(
        [string]$nodeIPv4,
        [string]$nodeIPv6,
        [array]$hnsPolicies
    )

    $expVfpRuleIds = @{}

    foreach ($policy in $hnsPolicies) {
        $pol = $policy.Policies[0]
        $alloc = $policy.Resources.Allocators[0]

        # Skip non-DSR policies
        if (-not $alloc.IsDSR) { continue }

        $extPort = $pol.ExternalPort
        $intPort = $pol.InternalPort
        $protocol = $pol.Protocol
        $idPrefix = $alloc.ID.Substring(0,5).ToUpper()

        # Determine nodeIP based on IPv6 flag
        if ($pol.IPv6 -or $alloc.IsIpv6) {
            if (-not $nodeIPv6) { continue }
            $nodeIP = $nodeIPv6
        } else {
            $nodeIP = $nodeIPv4
        }

        # VIP: use VIPs[0] if present, otherwise nodeIP
        if ($pol.VIPs -and $pol.VIPs.Count -gt 0) {
            $vip = $pol.VIPs[0]
        } else {
            $vip = $nodeIP
        }

        $expVfpRuleId = "LB_DSR_${nodeIP}_${vip}_${extPort}_${intPort}_${protocol}_${idPrefix}"
        $expVfpRuleIds[$expVfpRuleId] = $true
    }

    return $expVfpRuleIds
}

function Get-StaleVfpRuleIds {
    param(
        [hashtable]$expVfpRuleIds
    )

    $staleVfpRuleIds = @{}
    $groups = @("LB_DSR_IPv4_OUT", "LB_DSR_IPv6_OUT")
    $ports = (vfpctrl.exe /list-vmswitch-port /format 1 | ConvertFrom-Json).Ports.Name
    foreach ($port in $ports) {
        foreach ($group in $groups) {
            $rules = (vfpctrl /port $port /layer LB_DSR /group $group /list-rule /format 1 | ConvertFrom-Json).Rules
            foreach ($rule in $rules) {
                $ruleId = $rule.Id
                Write-Host "Checking VFP rule: Port=$port, Group=$group, RuleId=$ruleId" -ForegroundColor Gray
                if (-not $expVfpRuleIds.ContainsKey($ruleId)) {  # no HNS policy produced this rule ID
                    $key = "${port}::${ruleId}"
                    $staleVfpRuleIds[$key] = @{
                        Port   = $port
                        Layer  = "LB_DSR"
                        Group  = $group
                        RuleId = $ruleId
                    }
                }
            }
        }
    }

    return $staleVfpRuleIds
}

function Test-VfpRuleHasMatchingPolicy {
    param(
        [string]$port,
        [string]$layer,
        [string]$group,
        [string]$ruleId,
        [array]$hnsPolicies
    )

    $obj = (vfpctrl /get-rule-info /port $port /layer $layer /group $group /rule $ruleId /format 1 | ConvertFrom-Json).Rules

    $protocol = $obj.Conditions | Where-Object { $_.ProtocolList } | ForEach-Object { $_.ProtocolList[0] }

    $destIP = $obj.Conditions | Where-Object { $_.DestinationIPv4RangeList -or $_.DestinationIPv6RangeList } | ForEach-Object {
        if ($_.DestinationIPv4RangeList) { $_.DestinationIPv4RangeList[0].H }
        elseif ($_.DestinationIPv6RangeList) { $_.DestinationIPv6RangeList[0].H }
    }

    $destPort = $obj.Conditions | Where-Object { $_.DestinationPortList } | ForEach-Object { $_.DestinationPortList[0] }

    $found = $hnsPolicies | Where-Object {
        $_.Policies[0].VIPs -contains $destIP -and
        $_.Policies[0].ExternalPort -eq $destPort -and
        $_.Policies[0].Protocol -eq $protocol
    }

    return [bool]$found
}

$iterationIntervalSeconds = if ($env:ITERATION_INTERVAL_SECONDS) { [int]$env:ITERATION_INTERVAL_SECONDS } else { 30 }
$pass2DelaySeconds = if ($env:PASS2_DELAY_SECONDS) { [int]$env:PASS2_DELAY_SECONDS } else { 10 }

Write-KustoLog -Level 'Info' -Event 'ConfigLoaded' -Message 'Configuration loaded.' -Properties @{
    IterationIntervalSeconds = $iterationIntervalSeconds
    Pass2DelaySeconds        = $pass2DelaySeconds
}

while ($true) {
    $hnsPolicies = Get-HnsPolicyList
    Write-KustoLog -Level 'Info' -Event 'Pass1Started' -Message 'First pass: collecting expected and stale VFP rule IDs.' -Properties @{ PolicyCount = $hnsPolicies.Count }
    $expVfpRuleIdsBefore = Get-ExpectedVfpRuleIds -nodeIPv4 $nodeIpv4 -nodeIPv6 $nodeIpv6 -hnsPolicies $hnsPolicies
    $staleVfpRuleIdsBefore = Get-StaleVfpRuleIds -expVfpRuleIds $expVfpRuleIdsBefore
    Write-KustoLog -Level 'Info' -Event 'Pass1Completed' -Message 'First pass completed.' -Properties @{
        ExpectedRuleCount = $expVfpRuleIdsBefore.Count
        StaleRuleCount    = $staleVfpRuleIdsBefore.Count
    }

    Start-Sleep -Seconds $pass2DelaySeconds

    $hnsPolicies = Get-HnsPolicyList
    Write-KustoLog -Level 'Info' -Event 'Pass2Started' -Message 'Second pass: collecting expected and stale VFP rule IDs.' -Properties @{ PolicyCount = $hnsPolicies.Count }
    $expVfpRuleIdsAfter = Get-ExpectedVfpRuleIds -nodeIPv4 $nodeIpv4 -nodeIPv6 $nodeIpv6 -hnsPolicies $hnsPolicies
    $staleVfpRuleIdsAfter = Get-StaleVfpRuleIds -expVfpRuleIds $expVfpRuleIdsAfter
    Write-KustoLog -Level 'Info' -Event 'Pass2Completed' -Message 'Second pass completed.' -Properties @{
        ExpectedRuleCount = $expVfpRuleIdsAfter.Count
        StaleRuleCount    = $staleVfpRuleIdsAfter.Count
    }

    $orphanVfpRules = @{}  # rules flagged stale in both passes
    foreach ($key in $staleVfpRuleIdsBefore.Keys) {
        if ($staleVfpRuleIdsAfter.ContainsKey($key)) {
            $orphanVfpRules[$key] = $staleVfpRuleIdsAfter[$key]
        }
    }

    Write-KustoLog -Level 'Info' -Event 'OrphanDetection' -Message 'Orphan detection completed.' -Properties @{ OrphanRuleCount = $orphanVfpRules.Count }

    if ($orphanVfpRules.Count -eq 0) {
        Write-KustoLog -Level 'Info' -Event 'NoOrphanRules' -Message 'No orphan VFP rules found.'
    } else {
        foreach ($entry in $orphanVfpRules.Values) {
            $ruleProps = @{
                Port   = $entry.Port
                Layer  = $entry.Layer
                Group  = $entry.Group
                RuleId = $entry.RuleId
            }
            Write-KustoLog -Level 'Warning' -Event 'OrphanRuleDetected' -Message "Orphan VFP rule detected." -Properties $ruleProps

            if (-not (Test-VfpRuleHasMatchingPolicy -port $entry.Port -layer $entry.Layer -group $entry.Group -ruleId $entry.RuleId -hnsPolicies $hnsPolicies)) {
                Write-KustoLog -Level 'Warning' -Event 'OrphanRuleNoPolicy' -Message "Rule has no matching HNS policy and is a candidate for removal." -Properties $ruleProps
                # Remove the orphan rule
                # vfpctrl.exe /port $entry.Port /layer $entry.Layer /group $entry.Group /delete-rule /rule $entry.RuleId
            } else {
                Write-KustoLog -Level 'Info' -Event 'OrphanRuleHasPolicy' -Message "Rule has a matching HNS policy and should not be removed." -Properties $ruleProps
            }
        }
    }

    Write-KustoLog -Level 'Info' -Event 'IterationCompleted' -Message "Iteration completed. Sleeping for $iterationIntervalSeconds seconds."
    Start-Sleep -Seconds $iterationIntervalSeconds
}