[SPARK-55075][K8S] Track executor pod creation errors with ExecutorFailureTracker #53840
parthchandra wants to merge 4 commits into apache:master
Conversation
JIRA Issue Information: === Improvement SPARK-55075 === This comment was automatically generated by GitHub Actions

@dongjoon-hyun, please take a look.

Thank you for pinging me, @parthchandra !
dongjoon-hyun left a comment:
It would be great to clarify your target error case examples in the PR description more clearly, because ExecutorPodsAllocator already has spark.kubernetes.allocation.maxPendingPods management.
If there are unrecoverable pod creation errors, Spark continues trying to create pods instead of failing.
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
@parthchandra, do you think we can merge this case into

@dongjoon-hyun @pan3793 let me get some more information from the customer that reported the issue about whether

Got it. Thank you, @parthchandra .
Sorry for taking so long to get back on this - the Splunk log for this has the following repeated for every attempt -
I've added a new API in the
@parthchandra, yeah, I think this is sufficient to fix your problem.
This does not help in your case or, more generally, for permanent errors. I would rather not add such logic, because:
If you really want separate configuration for pod creation errors, maybe you can enhance the
@pan3793 I think what you say makes sense. Now that we've tied the failure tracker in, the retry logic is somewhat redundant. Removing it will also reduce the number of configuration options, making it easier to use.

Updated to remove the retry logic.
...netes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala
@dongjoon-hyun @pan3793 ptal.
There were some infra changes recently; rebasing your patch onto the latest master branch might help.
Force-pushed 8005dba to 8f9944c
@parthchandra, thanks for the update. Overall LGTM, only small nits.
Would like @dongjoon-hyun and @attilapiros (who modified ExecutorFailureTracker last time) to have a look.
BTW, please update the PR title and description to reflect the final state.
    test("Pod creation failures are tracked immediately without retries") {
      // Make all pod creation attempts fail
      when(podResource.create()).thenThrow(new KubernetesClientException("Simulated failure"))

small nit: make the error message more accurate

- when(podResource.create()).thenThrow(new KubernetesClientException("Simulated failure"))
+ when(podResource.create()).thenThrow(new KubernetesClientException("Simulated pod creation failure"))

      k8sConf.resourceProfileId.toInt), Seq.empty)
    }
    test("Pod creation failures are tracked immediately without retries") {

as we discussed before, the retry mechanism does not help here; we don't need to mention that. The case name might be:

- test("Pod creation failures are tracked immediately without retries") {
+ test("Pod creation failures are tracked by ExecutorFailureTracker") {
      log"Total failures: ${MDC(LogKeys.TOTAL, failureCount)}", e)
      None
    }
    if (optCreatedExecutorPod.nonEmpty) {

for cases without an else branch, you can write it as:

optCreatedExecutorPod.foreach { createdExecutorPod =>
  ...
}
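A runnable sketch of the suggested style (the Option value, the pod name, and the ListBuffer are placeholders for illustration, not the PR's actual code):

```scala
import scala.collection.mutable.ListBuffer

// Placeholder standing in for the PR's optCreatedExecutorPod value.
val optCreatedExecutorPod: Option[String] = Some("exec-pod-1")
val registered = ListBuffer.empty[String]

// Instead of: if (optCreatedExecutorPod.nonEmpty) { ... optCreatedExecutorPod.get ... }
optCreatedExecutorPod.foreach { createdExecutorPod =>
  registered += createdExecutorPod // runs only when a pod was actually created
}

// With an empty Option, the body is skipped entirely, so no null/isEmpty check is needed.
Option.empty[String].foreach(registered += _)
```

This keeps the happy path flat and avoids the `.get` call that a `nonEmpty` guard usually leads to.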
@pan3793 addressed your comments.

Ack, @pan3793 and @parthchandra . Will take a look today.
Unfortunately, this seems to introduce a breaking change to the exposed developer API since 3.3.0, @parthchandra .
Please note that we documented it as "a full class name of a class implementing AbstractPodsAllocator", which has been unchanged for the last 5 years. I believe we can change it at Apache Spark 5.0.0 in 2027 (if we need this).
Please introduce a new way to hand over ExecutorPodsLifecycleManager instead of modifying the existing StatefulSetPodsAllocator or DeploymentPodsAllocator. If we need to change them, it means the user-implemented classes are broken already.
To avoid touching the constructor of the @DeveloperApi class:

abstract class AbstractPodsAllocator {
  ...
  def setExecutorPodsLifecycleManager(lifecycleManager: ExecutorPodsLifecycleManager): Unit = {}
  ...
}
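A self-contained sketch of that direction (ExecutorPodsLifecycleManager is stubbed here as a plain trait, and trackPodCreationFailure is an illustrative helper name; this is not the merged code): the setter has a concrete body on the abstract class, so older custom allocators that never call it simply keep the field empty and failure tracking becomes a no-op for them.

```scala
// Stub for illustration only; the real ExecutorPodsLifecycleManager is a Spark class.
trait ExecutorPodsLifecycleManager {
  def registerPodCreationFailure(): Unit
}

abstract class AbstractPodsAllocator {
  // Option-typed field instead of a null-initialized var.
  protected var executorPodsLifecycleManager: Option[ExecutorPodsLifecycleManager] = None

  // Concrete setter on the abstract class: pre-existing custom allocators keep
  // compiling and running because nothing forces them to use it.
  def setExecutorPodsLifecycleManager(manager: ExecutorPodsLifecycleManager): Unit = {
    executorPodsLifecycleManager = Some(manager)
  }

  // Call-site pattern: no null check; a no-op when the manager was never wired in.
  protected def trackPodCreationFailure(): Unit =
    executorPodsLifecycleManager.foreach(_.registerPodCreationFailure())
}

class CountingManager extends ExecutorPodsLifecycleManager {
  var failures = 0
  def registerPodCreationFailure(): Unit = failures += 1
}

class TestAllocator extends AbstractPodsAllocator {
  def simulateCreationFailure(): Unit = trackPodCreationFailure()
}
```

With this shape, only the cluster manager that knows about the new API needs to wire the lifecycle manager in; everything else is untouched.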
+1 for @pan3793 's direction. BTW, it should be invoked only for the new classes which support it. The existing class might be compiled against the old
@dongjoon-hyun Thank you for catching this! Let me go with the direction from @pan3793 and also ensure that it is backward compatible. |
Force-pushed a4a3b25 to 907cf42
@dongjoon-hyun @pan3793 Would you be able to look at this one more time? Thanks
      None
    }
    optCreatedExecutorPod.foreach { createdExecutorPod =>
      try {

Why is an error during pod creation handled differently from an error that occurs while adding the owner reference and creating the PVC?
In both cases the nonfatal error ends in deleting the pod, so why is the 2nd case not tracked as a failure?
Errors from owner reference and PVC creation are handled differently, it appears (since they throw an exception). I've added them to the tracking by the lifecycle manager but also retained the current behaviour of throwing an exception in these cases.
Would you prefer we stop throwing an exception and have everything handled by the lifecycle manager?
Actually I had missed that line. Could you please run a manual test and check what happens with the exception thrown here (by throwing an exception here directly and running one of the integration tests)?
I am afraid it can go all the way up to CoarseGrainedSchedulerBackend.
I think the exception being thrown is fine. It is propagated up through requestNewExecutors() → onNewSnapshots(). The exception is caught at ExecutorPodsSnapshotStoreImpl.processSnapshotsInternal and logged as a warning.
The test test("SPARK-41410: An exception during PVC creation should not increase PVC counter") explicitly checks that an exception is thrown in this case. I modified the code to throw an Exception and the test passed (as long as the exception is a KubernetesClientException).
Ack, @parthchandra . I'll take a look again today. Thank you!
      k8sConf.resourceProfileId.toInt), Seq.empty)
    }

    test("Pod creation failures are tracked by ExecutorFailureTracker") {

In addition, it would be great to add a JIRA ID.

- test("Pod creation failures are tracked by ExecutorFailureTracker") {
+ test("SPARK-55075: Pod creation failures are tracked by ExecutorFailureTracker") {

      sc: SparkContext,
      kubernetesClient: KubernetesClient,
      snapshotsStore: ExecutorPodsSnapshotsStore,
      lifecycleManager: ExecutorPodsLifecycleManager) = {
Shall we use the following style to minimize the change? For example, KubernetesClusterManagerSuite:

- lifecycleManager: ExecutorPodsLifecycleManager) = {
+ lifecycleManager: Option[ExecutorPodsLifecycleManager] = None) = {

Good suggestion. Done.

Thank you for updating, @parthchandra .
   * Optional lifecycle manager for tracking executor pod lifecycle events.
   * Set via setExecutorPodsLifecycleManager for backward compatibility.
   */
  protected var executorPodsLifecycleManager: ExecutorPodsLifecycleManager = _

Like the comment says, "Optional lifecycle manager", we had better follow the Scala style.

- protected var executorPodsLifecycleManager: ExecutorPodsLifecycleManager = _
+ protected var executorPodsLifecycleManager: Option[ExecutorPodsLifecycleManager] = None

   * This method is optional and may not exist in custom implementations based on older versions.
   */
  def setExecutorPodsLifecycleManager(manager: ExecutorPodsLifecycleManager): Unit = {
    executorPodsLifecycleManager = manager

Maybe,

- executorPodsLifecycleManager = manager
+ executorPodsLifecycleManager = Some(manager)

    val failureCount = totalFailedPodCreations.incrementAndGet()
    if (executorPodsLifecycleManager != null) {
      executorPodsLifecycleManager.registerPodCreationFailure()
    }

Instead of a null check, Scala prefers the following.

- if (executorPodsLifecycleManager != null) {
-   executorPodsLifecycleManager.registerPodCreationFailure()
- }
+ executorPodsLifecycleManager.foreach(_.registerPodCreationFailure())

      executorPodsLifecycleManager.registerPodCreationFailure()
    }
    logError(log"Failed to create executor pod ${MDC(LogKeys.EXECUTOR_ID, newExecutorId)}. " +
      log"Total failures: ${MDC(LogKeys.TOTAL, failureCount)}", e)

Lines 473-479 seem to be repeated twice, here and at 506-513. Could you extract a method to remove the code duplication?
    } catch {
      case _: NoSuchMethodException =>
        logInfo("Allocator does not support setExecutorPodsLifecycleManager method. " +
          "Pod creation failures will not be tracked.")

For K8s Deployment and StatefulSet, the following would be correct. Could you revise the message?

- logInfo("Allocator does not support setExecutorPodsLifecycleManager method. " +
-   "Pod creation failures will not be tracked.")
+ logInfo("No need to track pod creation failure because allocator does not require it.")

Method setExecutorPodsLifecycleManager is added in AbstractPodsAllocator, not the derived class. Is reflection really required here?

I was following the previous comment about needing to use reflection to maintain backwards compatibility, but you're right, reflection is not needed. Removed.
To @parthchandra , from my side, the PR looks almost ready. I left a few more comments. Have a safe travel, 🛬 !
...ore/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala
Force-pushed fe26091 to 99b2112
parthchandra left a comment:
Also rebased on latest.
.../core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocatorSuite.scala
dongjoon-hyun left a comment:
+1, LGTM (Pending CIs). Thank you, @parthchandra.
Let's wait a few more days to make sure we have addressed all other reviewers' comments.
Thank you all! Merged to master for Apache Spark 4.2.0.

Thank you @dongjoon-hyun, @pan3793, @attilapiros !
What changes were proposed in this pull request?
Adds tracking of executor pod creation failures with the ExecutorFailureTracker.
Why are the changes needed?
If there are unrecoverable pod creation errors, Spark continues trying to create pods instead of failing. An example is where a notebook server is constrained to a maximum number of pods and the user tries to start a notebook with twice the number of executors as the limit. In this case the user gets an 'Unauthorized' message in the logs, but Spark will keep trying to spin up new pods. By tracking pod creation failures we can stop retrying after reaching the maximum number of executor failures.
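The intended counting behaviour can be sketched as follows; SimpleFailureTracker and its members are illustrative stand-ins, not Spark's actual ExecutorFailureTracker API:

```scala
// Illustrative sketch: count pod creation failures and signal that the
// application should stop retrying once a configured maximum is reached,
// instead of looping forever against an unrecoverable error.
class SimpleFailureTracker(maxFailures: Int) {
  private var failures = 0

  // Called each time an executor pod fails to be created.
  def registerPodCreationFailure(): Unit = failures += 1

  // True once the failure count reaches the limit, e.g. to abort the app
  // rather than keep spinning up pods that will never be authorized.
  def reachedMaxFailures: Boolean = failures >= maxFailures
}
```

The real tracker also ages out old failures over a validity window; this sketch only shows the stop-after-N-failures idea.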
Does this PR introduce any user-facing change?
No
How was this patch tested?
New unit tests added
Was this patch authored or co-authored using generative AI tooling?
Unit tests generated using Claude Code.