
feat: restore sequencer#3061

Merged
julienrbrt merged 7 commits into main from alex/2521_sync
Feb 19, 2026

Conversation

@alpe
Contributor

@alpe alpe commented Feb 10, 2026

Resolves #2521

Overview

This PR implements the Sequencer Recovery mode, enabling a node to restore its state from the Data Availability (DA) layer and P2P network before assuming the role of a sequencer.

Key Changes:

  • Configuration: added node.sequencer_recovery start param, allowing configurable timeouts for the catch-up phase.
  • Recovery Logic: Implemented sequencerRecoveryElector to manage the lifecycle (sketched after this list):
    • Starts the node in Sync/Follower mode.
    • Monitors DA and P2P sync progress using monitorCatchup.
    • Ensures the local store is populated (storeHeight > 0) before transition.
    • Promotes the node to Aggregator mode once fully caught up.
  • Integration Test: Added TestSequencerRecoveryFromDA to verify the end-to-end recovery flow, ensuring the node successfully syncs historical blocks and resumes block production.
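
Roughly, the elector's flow looks like the sketch below (method names are approximate, pieced together from this description and the changelog further down; not the exact signatures):

func (e *sequencerRecoveryElector) Run(ctx context.Context) error {
    // 1. Start the node in Sync/Follower mode.
    if err := e.startSyncPhase(ctx); err != nil {
        return err
    }
    // 2. Monitor DA and P2P sync progress until the node is caught up and
    //    the local store is populated (storeHeight > 0).
    if err := e.monitorCatchup(ctx); err != nil {
        return err
    }
    // 3. Promote the node to Aggregator mode and resume block production.
    return e.startAggregatorPhase(ctx)
}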

@github-actions
Contributor

github-actions bot commented Feb 10, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build | Format | Lint | Breaking | Updated (UTC)
✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Feb 18, 2026, 1:27 PM

@claude
Contributor

claude bot commented Feb 10, 2026

Claude finished @alpe's task in 2m 9s.


Code Review


Summary

This PR implements a catchup/recovery phase for aggregator (sequencer) nodes — allowing them to sync from DA and P2P before starting block production. The implementation is clean and the integration tests are well-structured. Below are findings grouped by severity.


Architecture & Design

1. Rename clarity: CatchupTimeout vs SequencerRecovery

The PR description and original issue refer to "sequencer recovery", but the implementation uses CatchupTimeout/catchup_timeout. This is fine and arguably better (it's used for the P2P wait timeout specifically, not the whole feature). The rename from SequencerRecoveryCatchupTimeout (visible in the git diff) is a good improvement — worth confirming the naming is settled.

2. newAggregatorMode always calls NewAggregatorWithCatchupComponents

In node/failover.go:90, newAggregatorMode now unconditionally calls NewAggregatorWithCatchupComponents, which always creates a catchupSyncer. Then in setupFailoverState (line 162-165), when catchup is NOT enabled, bc.Syncer is immediately set to nil, effectively discarding the syncer that was just allocated.

// block/components.go:349-379 - always creates a Syncer...
func NewAggregatorWithCatchupComponents(...) (*Components, error) { ... }

// node/failover.go:162-165 - ...then discards it if not needed
if isAggregator && !catchupEnabled {
    bc.Syncer = nil
}

This means every aggregator node allocates a syncer unnecessarily. Consider restoring the original newAggregatorComponents (lowercase, unexported) for the non-catchup path, or restructuring setupFailoverState to call the appropriate constructor. Fix this →

3. Incomplete comment at node/failover.go:162

// Catchup only applies to aggregator nodes that need to sync before

The comment is truncated — missing "before producing blocks" or similar.

4. Stale comment at node/sequencer_recovery_integration_test.go:305

// stopNodeAndCleanup stops a running node and calls its cleanup function.

This comment references a function that doesn't exist. This was flagged by Gemini but remains in the latest commit.


Correctness & Concurrency

5. P2P client not started in failoverState.Run

In the original code, Run called f.p2pClient.Start(ctx). The diff shows this was removed. Now p2p client startup is presumably handled by the caller (the FullNode). This is fine if confirmed, but it's a behavioral change that should be explicitly noted/verified — especially since headerSyncService and dataSyncService both depend on the P2P client being live.

6. runCatchupPhase defers bc.Syncer.Stop() but also sets bc.Syncer = nil

// node/failover.go:257-268
func (f *failoverState) runCatchupPhase(ctx context.Context) error {
    if err := f.bc.Syncer.Start(ctx); err != nil { ... }
    defer f.bc.Syncer.Stop()  // captures bc.Syncer at this point

    caughtUp, err := f.waitForCatchup(ctx)
    ...
    f.bc.Syncer = nil  // ← sets to nil after defer captures non-nil value
    return nil
}

This works: the receiver f.bc.Syncer is evaluated when the defer statement executes, so the deferred Stop() still runs on the original, non-nil syncer even though the field is set to nil afterwards. It is subtle, though, and worth a comment for clarity.
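
A standalone illustration of that subtlety (hypothetical types, not the PR's code):

package main

import "fmt"

type syncer struct{}

func (s *syncer) Stop() { fmt.Println("syncer stopped") }

type components struct{ Syncer *syncer }

func main() {
    bc := &components{Syncer: &syncer{}}
    func() {
        defer bc.Syncer.Stop() // receiver bc.Syncer is evaluated here, while still non-nil
        bc.Syncer = nil        // does not affect the already-registered deferred call
    }()
    // prints "syncer stopped": the deferred Stop ran on the original syncer
}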

7. waitForCatchup returns (false, nil) on context cancellation

// node/failover.go:292-294
case <-ctx.Done():
    return false, nil

Then at the call site:

// node/failover.go:259-265
caughtUp, err := f.waitForCatchup(ctx)
if err != nil { return err }
if !caughtUp {
    return ctx.Err()  // ← could be nil if ctx.Err() == nil but still !caughtUp
}

If the context is done, ctx.Err() will be non-nil. But there's a subtle edge case: if waitForCatchup returns (false, nil) for any other reason (which currently it can't, but the signature allows it), ctx.Err() would return nil and the error would be swallowed. Consider returning ctx.Err() directly from waitForCatchup or returning a sentinel error.
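
A sketch of the first suggestion (fragments only, mirroring the excerpts above; errCatchupIncomplete is a hypothetical sentinel):

// in waitForCatchup: surface cancellation directly
case <-ctx.Done():
    return false, ctx.Err()

// at the call site, a single error check then covers cancellation as well
caughtUp, err := f.waitForCatchup(ctx)
if err != nil {
    return err
}
if !caughtUp {
    return errCatchupIncomplete // hypothetical sentinel for "no error, but not caught up"
}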

8. p2pCaughtUp logic edge case when catchupTimeout == 0

// node/failover.go:312-315
p2pCaughtUp := ignoreP2P || (maxP2PHeight > 0 && storeHeight >= maxP2PHeight)
if !ignoreP2P && f.catchupTimeout == 0 && maxP2PHeight == 0 {
    p2pCaughtUp = true
}

When catchupTimeout == 0 and maxP2PHeight == 0, P2P is considered "caught up." But if catchupTimeout == 0, timeoutCh is never set (line 281-287), so ignoreP2P will never become true. This means the node would stall indefinitely if there are no P2P peers and no timeout is configured. Is this intentional? If catchupTimeout == 0 means "DA-only mode", the code path should probably short-circuit differently. Fix this →
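
If a zero timeout is indeed meant as DA-only mode, one possible short-circuit (a sketch reusing the variables above) would be to force ignoreP2P up front:

// Treat catchupTimeout == 0 as "DA-only": never gate promotion on P2P height.
if f.catchupTimeout == 0 {
    ignoreP2P = true
}
p2pCaughtUp := ignoreP2P || (maxP2PHeight > 0 && storeHeight >= maxP2PHeight)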


Error Handling

9. nolint:errcheck on syncer stop

defer f.bc.Syncer.Stop() // nolint:errcheck // not critical

The Syncer.Stop() error is suppressed. While marked as "not critical", errors during stop (e.g., drain timeouts) should at least be logged. Consider defer func() { if err := f.bc.Syncer.Stop(); err != nil { f.logger.Warn().Err(err).Msg("catchup syncer stop") } }().
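
Formatted, that suggested replacement reads:

defer func() {
    if err := f.bc.Syncer.Stop(); err != nil {
        f.logger.Warn().Err(err).Msg("catchup syncer stop")
    }
}()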

10. daHeadReached is only set to true, never reset

// block/internal/syncing/syncer.go:411
s.daHeadReached.Store(true)

The field is documented as "stays true once DA head is seen." This is intentional but means a second call to Syncer.Start() on the same instance (if ever reused) would immediately report HasReachedDAHead() == true. The Stop() method does not reset this flag. This is acceptable given the current usage pattern (start once, stop once, discard), but worth documenting.
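
A doc comment along these lines would make the one-shot contract explicit (sketch; field and method names taken from the review above):

// daHeadReached records whether the DA worker has ever observed the DA head.
// It is intentionally never reset, so a reused Syncer instance reports
// HasReachedDAHead() == true immediately after a second Start().
daHeadReached atomic.Bool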


Testing

11. Magic number 1 for StatusSuccess in verifyBlobsInDA

// node/sequencer_recovery_integration_test.go:329,333
if res.Code == 1 { // StatusSuccess

Gemini flagged this and it's still present. Use datypes.StatusSuccess for type-safe comparison. Fix this →

12. peerAddress uses Loggable() map access

// node/sequencer_recovery_integration_test.go:253
return fmt.Sprintf("%s/p2p/%s", listenAddr, peerID.Loggable()["peerID"].(string))

Gemini flagged this. peer.ID.Loggable() is for logging; peer.ID.String() is the stable public API. This could panic if the map structure changes. Fix this →

13. TestSequencerRecoveryFromP2P has a soft hash assertion

// node/sequencer_recovery_integration_test.go:205-219
if allMatch {
    t.Log("recovery node synced original blocks from P2P — all hashes verified")
} else {
    t.Log("recovery node produced its own blocks (P2P sync was not completed in time)")
}

This test doesn't fail when P2P hashes don't match — it merely logs. The intent seems to be that P2P sync is best-effort, but this means the test does not actually verify P2P recovery. At minimum, consider adding a t.Logf("WARNING: P2P hash check skipped...") or a stricter assertion guarded by a longer timeout.

14. time.Sleep in requireNodeStartedSuccessfully

// node/sequencer_recovery_integration_test.go:311
time.Sleep(waitTime)

Using time.Sleep in tests is fragile. Consider polling the error channel with a ticker or using require.Eventually to make this more robust.
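
A sketch of the polling approach with require.Eventually (errCh and nodeStarted are hypothetical stand-ins for whatever the helper already has access to):

var startupErr error
require.Eventually(t, func() bool {
    select {
    case startupErr = <-errCh: // node reported an error while starting
        return true
    default:
        return nodeStarted() // hypothetical readiness predicate
    }
}, waitTime, 50*time.Millisecond, "node did not start in time")
require.NoError(t, startupErr)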


Cleanup Items (from prior reviews, still outstanding)

15. da_visualization.go refactoring included in this PR

The latest commit (956cfdb) converts daVisualizationHTML from an inline string constant to a go:embed directive. This is a functional change unrelated to sequencer recovery. It should be in a separate PR to keep this PR focused.

16. Stale comment at line 305 (mentioned in item 4 above — the stopNodeAndCleanup comment)


Minor Nits

  • pkg/config/defaults.go: SequencerRecovery: DurationWrapper{Duration: 0} → default zero is fine but could be omitted for brevity since zero is the Go default.
  • The block/components.go comment at lines 327-333 documents the caller's 3-step protocol for NewAggregatorWithCatchupComponents. It leaks some implementation detail, but since runCatchupPhase in failoverState is the only caller the pattern stays encapsulated, and the comment remains useful for maintainability.

Overall Assessment

The core implementation is solid. The catchup phase logic in waitForCatchup is thoughtfully designed with the P2P timeout fallback. The integration tests cover the primary recovery scenarios. The main concerns are:

  1. Unconditional syncer allocation when catchup is disabled (item 2)
  2. peerID.Loggable() type assertion that could panic (item 12)
  3. Magic number for StatusSuccess (item 11)
  4. Stale comment at line 305 (item 4)
  5. da_visualization.go unrelated change should be in its own PR (item 15)

@gemini-code-assist
Contributor

Summary of Changes

Hello @alpe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the resilience of sequencer nodes by implementing a dedicated recovery mechanism. It enables a sequencer to gracefully recover from data loss by first synchronizing its state from the Data Availability layer and the P2P network, and then seamlessly resuming its role as an aggregator. This ensures continuous operation and data integrity, particularly in environments where traditional consensus mechanisms like Raft are not employed.

Highlights

  • Sequencer Recovery Mechanism: Introduced a new sequencerRecoveryElector that allows a sequencer node to start in a synchronization mode, catching up from both the Data Availability (DA) layer and the peer-to-peer (P2P) network before transitioning to its aggregator role. This is designed for single-sequencer setups without Raft.
  • DA Head Tracking: The Syncer now includes a daHeadReached atomic boolean and a HasReachedDAHead method to signal when the DA worker has successfully caught up to the DA head, which is crucial for the recovery process.
  • Configuration and Validation: A new configuration option, Node.SequencerRecovery, has been added to enable this recovery mode. Validation ensures that this mode can only be activated when the node is configured as an aggregator.
  • Integration Tests: Comprehensive integration tests have been added to verify the sequencer recovery functionality from both DA and P2P sources, ensuring robustness in disaster recovery scenarios.


Changelog
  • block/internal/syncing/syncer.go
    • Added daHeadReached atomic boolean to track if the DA worker has reached the DA head.
    • Updated daWorkerLoop to set daHeadReached to true when caught up.
    • Exposed HasReachedDAHead method for external checking.
  • node/failover.go
    • Introduced sequencerRecoveryElector to manage the recovery process.
    • Implemented Run, startSyncPhase, startAggregatorPhase, and waitForCatchup methods within sequencerRecoveryElector to orchestrate the sync-then-aggregate flow.
  • node/full.go
    • Modified newFullNode to conditionally use newSequencerRecoveryElector when Node.Aggregator is true, Node.SequencerRecovery.Duration is greater than zero, and Raft is not enabled.
  • node/sequencer_recovery_integration_test.go
    • Added new integration tests to verify sequencer recovery from DA and P2P.
  • pkg/config/config.go
    • Defined FlagSequencerRecovery for the new configuration option.
    • Added SequencerRecovery field to NodeConfig struct.
    • Added validation to ensure SequencerRecovery mode requires Aggregator mode.
    • Added FlagSequencerRecovery to Cobra flags.
  • pkg/config/config_test.go
    • Updated expected flag count.
  • pkg/config/defaults.go
    • Set default value for SequencerRecovery to 0.
  • test/testda/dummy.go
    • Added BlobCount method to DummyDA for testing purposes.

@codecov

codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 49.51456% with 52 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.03%. Comparing base (cf5565e) to head (956cfdb).
⚠️ Report is 3 commits behind head on main.

Files with missing lines | Patch % | Lines
block/components.go | 0.00% | 27 Missing ⚠️
node/failover.go | 71.21% | 11 Missing and 8 partials ⚠️
pkg/config/config.go | 33.33% | 2 Missing and 2 partials ⚠️
block/internal/syncing/syncer.go | 33.33% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3061      +/-   ##
==========================================
- Coverage   61.16%   61.03%   -0.14%     
==========================================
  Files         113      113              
  Lines       11444    11544     +100     
==========================================
+ Hits         7000     7046      +46     
- Misses       3655     3698      +43     
- Partials      789      800      +11     
Flag | Coverage Δ
combined | 61.03% <49.51%> (-0.14%) ⬇️


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a sequencer recovery mechanism, a valuable feature for disaster recovery. The implementation allows a sequencer to start in sync mode, catch up from DA and P2P, and then transition to an aggregator. The logic is well-structured within the new sequencerRecoveryElector, and the catch-up conditions are robust, including a P2P timeout. The addition of comprehensive integration tests for both DA-only and P2P-assisted recovery scenarios is commendable. I've identified a few minor areas for improvement in the test code to enhance maintainability and robustness.

t.Helper()
peerID, err := peer.IDFromPrivateKey(nodeKey.PrivKey)
require.NoError(t, err)
return fmt.Sprintf("%s/p2p/%s", listenAddr, peerID.Loggable()["peerID"].(string))

medium

The peer.ID.Loggable() method is primarily for logging purposes, and its output format may not be stable across library versions. It's more robust to use peer.ID.String(), which is the public API for getting the canonical string representation of a peer ID.

Suggested change
return fmt.Sprintf("%s/p2p/%s", listenAddr, peerID.Loggable()["peerID"].(string))
return fmt.Sprintf("%s/p2p/%s", listenAddr, peerID.String())

return recNode.(*FullNode), recStopTicker
}

// stopNodeAndCleanup stops a running node and calls its cleanup function.

medium

This comment refers to a function stopNodeAndCleanup that doesn't exist in the file. It appears to be a leftover and should be removed to avoid confusion.

Comment on lines +305 to +313
if res.Code == 1 { // StatusSuccess
blobsFound += len(res.Data)
t.Logf("DIAG: found %d header blob(s) at DA height %d (Success)", len(res.Data), h)
}
res = sharedDA.Retrieve(context.Background(), h, sharedDA.GetDataNamespace())
if res.Code == 1 { // StatusSuccess
blobsFound += len(res.Data)
t.Logf("DIAG: found %d data blob(s) at DA height %d (Success)", len(res.Data), h)
}

medium

The magic number 1 is used to check for success status. For better readability and maintainability, please use the datypes.StatusSuccess constant. You'll need to add the following import:

import (
	// ... other imports
	datypes "github.com/evstack/ev-node/pkg/da/types"
)
Suggested change
if res.Code == 1 { // StatusSuccess
blobsFound += len(res.Data)
t.Logf("DIAG: found %d header blob(s) at DA height %d (Success)", len(res.Data), h)
}
res = sharedDA.Retrieve(context.Background(), h, sharedDA.GetDataNamespace())
if res.Code == 1 { // StatusSuccess
blobsFound += len(res.Data)
t.Logf("DIAG: found %d data blob(s) at DA height %d (Success)", len(res.Data), h)
}
if res.Code == datypes.StatusSuccess { // StatusSuccess
blobsFound += len(res.Data)
t.Logf("DIAG: found %d header blob(s) at DA height %d (Success)", len(res.Data), h)
}
res = sharedDA.Retrieve(context.Background(), h, sharedDA.GetDataNamespace())
if res.Code == datypes.StatusSuccess { // StatusSuccess
blobsFound += len(res.Data)
t.Logf("DIAG: found %d data blob(s) at DA height %d (Success)", len(res.Data), h)
}

Member


Nit: can we make nodeConfig.Node.SequencerRecovery.Duration and Raft.Enable mutually exclusive? Cobra has an option for this.
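
Cobra supports this directly via MarkFlagsMutuallyExclusive; a sketch (FlagSequencerRecovery is the constant added in this PR, the raft flag name is an assumption):

// After both flags are registered on the start command:
cmd.MarkFlagsMutuallyExclusive(config.FlagSequencerRecovery, "raft.enable") // "raft.enable" name assumed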

alpe added 4 commits February 17, 2026 10:06
* main:
  refactor: reuse P2P client across node modes (#3065)
  refactor: store pending block separately (#3073)
  chore: Fix mismatched comment in TestCache_WithNilStore function (#3074)
  build(deps): Bump github.com/pion/dtls/v3 from 3.0.6 to 3.0.11 (#3068)
  feat: block Pruning (#2984)
  refactor: omit unnecessary reassignment (#3067)
  refactor: use WaitGroup.Go to simplify code (#3059)
  chore: prep rc.4 (#3062)
  fix(syncing): skip draining when exec client unavailable (#3060)
  build(deps): Bump the all-go group across 4 directories with 5 updates (#3058)
* main:
  build(deps): Bump the all-go group across 8 directories with 6 updates (#3076)
  refactor(block): allow pruning when da disabled (#3075)
Member

@julienrbrt julienrbrt left a comment


nice! lgtm. left some cleanup nits.

@julienrbrt julienrbrt added this pull request to the merge queue Feb 19, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 19, 2026
@julienrbrt julienrbrt merged commit 057c388 into main Feb 19, 2026
28 checks passed
@julienrbrt julienrbrt deleted the alex/2521_sync branch February 19, 2026 11:23


Development

Successfully merging this pull request may close these issues.

Allow running a sequencer node that syncs from DA first in full node mode

3 participants
