
feat: restore sequencer#3061

Merged
julienrbrt merged 7 commits into main from alex/2521_sync
Feb 19, 2026

Conversation

@alpe
Contributor

@alpe alpe commented Feb 10, 2026

Resolves #2521

Overview

This PR implements the Sequencer Recovery mode, enabling a node to restore its state from the Data Availability (DA) layer and P2P network before assuming the role of a sequencer.

Key Changes:

  • Configuration: added node.sequencer_recovery start param, allowing configurable timeouts for the catch-up phase.
  • Recovery Logic: Implemented sequencerRecoveryElector to manage the lifecycle (sketched after this list):
    • Starts the node in Sync/Follower mode.
    • Monitors DA and P2P sync progress using monitorCatchup.
    • Ensures the local store is populated (storeHeight > 0) before transition.
    • Promotes the node to Aggregator mode once fully caught up.
  • Integration Test: Added TestSequencerRecoveryFromDA to verify the end-to-end recovery flow, ensuring the node successfully syncs historical blocks and resumes block production.
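
Roughly, the elector's flow looks like the sketch below (method names are approximate, pieced together from this description and the changelog further down; not the exact signatures):

func (e *sequencerRecoveryElector) Run(ctx context.Context) error {
    // 1. Start the node in Sync/Follower mode.
    if err := e.startSyncPhase(ctx); err != nil {
        return err
    }
    // 2. Monitor DA and P2P sync progress until the node is caught up and
    //    the local store is populated (storeHeight > 0).
    if err := e.monitorCatchup(ctx); err != nil {
        return err
    }
    // 3. Promote the node to Aggregator mode and resume block production.
    return e.startAggregatorPhase(ctx)
}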

@github-actions
Contributor

github-actions bot commented Feb 10, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build | Format | Lint | Breaking | Updated (UTC)
✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Feb 18, 2026, 1:27 PM

@claude
Contributor

claude bot commented Feb 10, 2026

Claude finished @alpe's task in 2m 9s.


Code Review


Summary

This PR implements a catchup/recovery phase for aggregator (sequencer) nodes — allowing them to sync from DA and P2P before starting block production. The implementation is clean and the integration tests are well-structured. Below are findings grouped by severity.


Architecture & Design

1. Rename clarity: CatchupTimeout vs SequencerRecovery

The PR description and original issue refer to "sequencer recovery", but the implementation uses CatchupTimeout/catchup_timeout. This is fine and arguably better (it's used for the P2P wait timeout specifically, not the whole feature). The rename from SequencerRecoveryCatchupTimeout (visible in the git diff) is a good improvement — worth confirming the naming is settled.

2. newAggregatorMode always calls NewAggregatorWithCatchupComponents

In node/failover.go:90, newAggregatorMode now unconditionally calls NewAggregatorWithCatchupComponents, which always creates a catchupSyncer. Then in setupFailoverState (line 162-165), when catchup is NOT enabled, bc.Syncer is immediately set to nil, effectively discarding the syncer that was just allocated.

// block/components.go:349-379 - always creates a Syncer...
func NewAggregatorWithCatchupComponents(...) (*Components, error) { ... }

// node/failover.go:162-165 - ...then discards it if not needed
if isAggregator && !catchupEnabled {
    bc.Syncer = nil
}

This means every aggregator node allocates a syncer unnecessarily. Consider restoring the original newAggregatorComponents (lowercase, unexported) for the non-catchup path, or restructuring setupFailoverState to call the appropriate constructor. Fix this →

3. Incomplete comment at node/failover.go:162

// Catchup only applies to aggregator nodes that need to sync before

The comment is truncated — missing "before producing blocks" or similar.

4. Stale comment at node/sequencer_recovery_integration_test.go:305

// stopNodeAndCleanup stops a running node and calls its cleanup function.

This comment references a function that doesn't exist. This was flagged by Gemini but remains in the latest commit.


Correctness & Concurrency

5. P2P client not started in failoverState.Run

In the original code, Run called f.p2pClient.Start(ctx). The diff shows this was removed. Now p2p client startup is presumably handled by the caller (the FullNode). This is fine if confirmed, but it's a behavioral change that should be explicitly noted/verified — especially since headerSyncService and dataSyncService both depend on the P2P client being live.

6. runCatchupPhase defers bc.Syncer.Stop() but also sets bc.Syncer = nil

// node/failover.go:257-268
func (f *failoverState) runCatchupPhase(ctx context.Context) error {
    if err := f.bc.Syncer.Start(ctx); err != nil { ... }
    defer f.bc.Syncer.Stop()  // captures bc.Syncer at this point

    caughtUp, err := f.waitForCatchup(ctx)
    ...
    f.bc.Syncer = nil  // ← sets to nil after defer captures non-nil value
    return nil
}

This works: the receiver f.bc.Syncer is evaluated when the defer statement executes, so the deferred Stop() still runs on the original, non-nil syncer even though the field is set to nil afterwards. It is subtle, though, and worth a comment for clarity.
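
A standalone illustration of that subtlety (hypothetical types, not the PR's code):

package main

import "fmt"

type syncer struct{}

func (s *syncer) Stop() { fmt.Println("syncer stopped") }

type components struct{ Syncer *syncer }

func main() {
    bc := &components{Syncer: &syncer{}}
    func() {
        defer bc.Syncer.Stop() // receiver bc.Syncer is evaluated here, while still non-nil
        bc.Syncer = nil        // does not affect the already-registered deferred call
    }()
    // prints "syncer stopped": the deferred Stop ran on the original syncer
}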

7. waitForCatchup returns (false, nil) on context cancellation

// node/failover.go:292-294
case <-ctx.Done():
    return false, nil

Then at the call site:

// node/failover.go:259-265
caughtUp, err := f.waitForCatchup(ctx)
if err != nil { return err }
if !caughtUp {
    return ctx.Err()  // ← could be nil if ctx.Err() == nil but still !caughtUp
}

If the context is done, ctx.Err() will be non-nil. But there's a subtle edge case: if waitForCatchup returns (false, nil) for any other reason (which currently it can't, but the signature allows it), ctx.Err() would return nil and the error would be swallowed. Consider returning ctx.Err() directly from waitForCatchup or returning a sentinel error.
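
A sketch of the first suggestion (fragments only, mirroring the excerpts above; errCatchupIncomplete is a hypothetical sentinel):

// in waitForCatchup: surface cancellation directly
case <-ctx.Done():
    return false, ctx.Err()

// at the call site, a single error check then covers cancellation as well
caughtUp, err := f.waitForCatchup(ctx)
if err != nil {
    return err
}
if !caughtUp {
    return errCatchupIncomplete // hypothetical sentinel for "no error, but not caught up"
}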

8. p2pCaughtUp logic edge case when catchupTimeout == 0

// node/failover.go:312-315
p2pCaughtUp := ignoreP2P || (maxP2PHeight > 0 && storeHeight >= maxP2PHeight)
if !ignoreP2P && f.catchupTimeout == 0 && maxP2PHeight == 0 {
    p2pCaughtUp = true
}

When catchupTimeout == 0 and maxP2PHeight == 0, P2P is considered "caught up." But if catchupTimeout == 0, timeoutCh is never set (line 281-287), so ignoreP2P will never become true. This means the node would stall indefinitely if there are no P2P peers and no timeout is configured. Is this intentional? If catchupTimeout == 0 means "DA-only mode", the code path should probably short-circuit differently. Fix this →
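
If a zero timeout is indeed meant as DA-only mode, one possible short-circuit (a sketch reusing the variables above) would be to force ignoreP2P up front:

// Treat catchupTimeout == 0 as "DA-only": never gate promotion on P2P height.
if f.catchupTimeout == 0 {
    ignoreP2P = true
}
p2pCaughtUp := ignoreP2P || (maxP2PHeight > 0 && storeHeight >= maxP2PHeight)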


Error Handling

9. nolint:errcheck on syncer stop

defer f.bc.Syncer.Stop() // nolint:errcheck // not critical

The Syncer.Stop() error is suppressed. While marked as "not critical", errors during stop (e.g., drain timeouts) should at least be logged. Consider defer func() { if err := f.bc.Syncer.Stop(); err != nil { f.logger.Warn().Err(err).Msg("catchup syncer stop") } }().
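
Formatted, that suggested replacement reads:

defer func() {
    if err := f.bc.Syncer.Stop(); err != nil {
        f.logger.Warn().Err(err).Msg("catchup syncer stop")
    }
}()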

10. daHeadReached is only set to true, never reset

// block/internal/syncing/syncer.go:411
s.daHeadReached.Store(true)

The field is documented as "stays true once DA head is seen." This is intentional but means a second call to Syncer.Start() on the same instance (if ever reused) would immediately report HasReachedDAHead() == true. The Stop() method does not reset this flag. This is acceptable given the current usage pattern (start once, stop once, discard), but worth documenting.
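
A doc comment along these lines would make the one-shot contract explicit (sketch; field and method names taken from the review above):

// daHeadReached records whether the DA worker has ever observed the DA head.
// It is intentionally never reset, so a reused Syncer instance reports
// HasReachedDAHead() == true immediately after a second Start().
daHeadReached atomic.Bool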


Testing

11. Magic number 1 for StatusSuccess in verifyBlobsInDA

// node/sequencer_recovery_integration_test.go:329,333
if res.Code == 1 { // StatusSuccess

Gemini flagged this and it's still present. Use datypes.StatusSuccess for type-safe comparison. Fix this →

12. peerAddress uses Loggable() map access

// node/sequencer_recovery_integration_test.go:253
return fmt.Sprintf("%s/p2p/%s", listenAddr, peerID.Loggable()["peerID"].(string))

Gemini flagged this. peer.ID.Loggable() is for logging; peer.ID.String() is the stable public API. This could panic if the map structure changes. Fix this →

13. TestSequencerRecoveryFromP2P has a soft hash assertion

// node/sequencer_recovery_integration_test.go:205-219
if allMatch {
    t.Log("recovery node synced original blocks from P2P — all hashes verified")
} else {
    t.Log("recovery node produced its own blocks (P2P sync was not completed in time)")
}

This test doesn't fail when P2P hashes don't match — it merely logs. The intent seems to be that P2P sync is best-effort, but this means the test does not actually verify P2P recovery. At minimum, consider adding a t.Logf("WARNING: P2P hash check skipped...") or a stricter assertion guarded by a longer timeout.

14. time.Sleep in requireNodeStartedSuccessfully

// node/sequencer_recovery_integration_test.go:311
time.Sleep(waitTime)

Using time.Sleep in tests is fragile. Consider polling the error channel with a ticker or using require.Eventually to make this more robust.
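
A sketch of the polling approach with require.Eventually (errCh and nodeStarted are hypothetical stand-ins for whatever the helper already has access to):

var startupErr error
require.Eventually(t, func() bool {
    select {
    case startupErr = <-errCh: // node reported an error while starting
        return true
    default:
        return nodeStarted() // hypothetical readiness predicate
    }
}, waitTime, 50*time.Millisecond, "node did not start in time")
require.NoError(t, startupErr)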


Cleanup Items (from prior reviews, still outstanding)

15. da_visualization.go refactoring included in this PR

The latest commit (956cfdb) converts daVisualizationHTML from an inline string constant to a go:embed directive. This is a functional change unrelated to sequencer recovery. It should be in a separate PR to keep this PR focused.

16. Stale comment at line 305 (mentioned in item 4 above — the stopNodeAndCleanup comment)


Minor Nits

  • pkg/config/defaults.go: SequencerRecovery: DurationWrapper{Duration: 0} → default zero is fine but could be omitted for brevity since zero is the Go default.
  • The block/components.go comment at lines 327-333 documents the caller's 3-step protocol for NewAggregatorWithCatchupComponents. It leaks some implementation detail, but since runCatchupPhase in failoverState is the only caller the pattern stays encapsulated, and the comment remains useful for maintainability.

Overall Assessment

The core implementation is solid. The catchup phase logic in waitForCatchup is thoughtfully designed with the P2P timeout fallback. The integration tests cover the primary recovery scenarios. The main concerns are:

  1. Unconditional syncer allocation when catchup is disabled (item 2)
  2. peerID.Loggable() type assertion that could panic (item 12)
  3. Magic number for StatusSuccess (item 11)
  4. Stale comment at line 305 (item 4)
  5. da_visualization.go unrelated change should be in its own PR (item 15)

@gemini-code-assist
Contributor

Summary of Changes

Hello @alpe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the resilience of sequencer nodes by implementing a dedicated recovery mechanism. It enables a sequencer to gracefully recover from data loss by first synchronizing its state from the Data Availability layer and the P2P network, and then seamlessly resuming its role as an aggregator. This ensures continuous operation and data integrity, particularly in environments where traditional consensus mechanisms like Raft are not employed.

Highlights

  • Sequencer Recovery Mechanism: Introduced a new sequencerRecoveryElector that allows a sequencer node to start in a synchronization mode, catching up from both the Data Availability (DA) layer and the peer-to-peer (P2P) network before transitioning to its aggregator role. This is designed for single-sequencer setups without Raft.
  • DA Head Tracking: The Syncer now includes a daHeadReached atomic boolean and a HasReachedDAHead method to signal when the DA worker has successfully caught up to the DA head, which is crucial for the recovery process.
  • Configuration and Validation: A new configuration option, Node.SequencerRecovery, has been added to enable this recovery mode. Validation ensures that this mode can only be activated when the node is configured as an aggregator.
  • Integration Tests: Comprehensive integration tests have been added to verify the sequencer recovery functionality from both DA and P2P sources, ensuring robustness in disaster recovery scenarios.


Changelog
  • block/internal/syncing/syncer.go
    • Added daHeadReached atomic boolean to track if the DA worker has reached the DA head.
    • Updated daWorkerLoop to set daHeadReached to true when caught up.
    • Exposed HasReachedDAHead method for external checking.
  • node/failover.go
    • Introduced sequencerRecoveryElector to manage the recovery process.
    • Implemented Run, startSyncPhase, startAggregatorPhase, and waitForCatchup methods within sequencerRecoveryElector to orchestrate the sync-then-aggregate flow.
  • node/full.go
    • Modified newFullNode to conditionally use newSequencerRecoveryElector when Node.Aggregator is true, Node.SequencerRecovery.Duration is greater than zero, and Raft is not enabled.
  • node/sequencer_recovery_integration_test.go
    • Added new integration tests to verify sequencer recovery from DA and P2P.
  • pkg/config/config.go
    • Defined FlagSequencerRecovery for the new configuration option.
    • Added SequencerRecovery field to NodeConfig struct.
    • Added validation to ensure SequencerRecovery mode requires Aggregator mode.
    • Added FlagSequencerRecovery to Cobra flags.
  • pkg/config/config_test.go
    • Updated expected flag count.
  • pkg/config/defaults.go
    • Set default value for SequencerRecovery to 0.
  • test/testda/dummy.go
    • Added BlobCount method to DummyDA for testing purposes.

@codecov

codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 49.51456% with 52 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.03%. Comparing base (cf5565e) to head (956cfdb).
⚠️ Report is 3 commits behind head on main.

Files with missing lines | Patch % | Lines
block/components.go | 0.00% | 27 Missing ⚠️
node/failover.go | 71.21% | 11 Missing and 8 partials ⚠️
pkg/config/config.go | 33.33% | 2 Missing and 2 partials ⚠️
block/internal/syncing/syncer.go | 33.33% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3061      +/-   ##
==========================================
- Coverage   61.16%   61.03%   -0.14%     
==========================================
  Files         113      113              
  Lines       11444    11544     +100     
==========================================
+ Hits         7000     7046      +46     
- Misses       3655     3698      +43     
- Partials      789      800      +11     
Flag | Coverage Δ
combined | 61.03% <49.51%> (-0.14%) ⬇️


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a sequencer recovery mechanism, a valuable feature for disaster recovery. The implementation allows a sequencer to start in sync mode, catch up from DA and P2P, and then transition to an aggregator. The logic is well-structured within the new sequencerRecoveryElector, and the catch-up conditions are robust, including a P2P timeout. The addition of comprehensive integration tests for both DA-only and P2P-assisted recovery scenarios is commendable. I've identified a few minor areas for improvement in the test code to enhance maintainability and robustness.

t.Helper()
peerID, err := peer.IDFromPrivateKey(nodeKey.PrivKey)
require.NoError(t, err)
return fmt.Sprintf("%s/p2p/%s", listenAddr, peerID.Loggable()["peerID"].(string))

medium

The peer.ID.Loggable() method is primarily for logging purposes, and its output format may not be stable across library versions. It's more robust to use peer.ID.String(), which is the public API for getting the canonical string representation of a peer ID.

Suggested change
return fmt.Sprintf("%s/p2p/%s", listenAddr, peerID.Loggable()["peerID"].(string))
return fmt.Sprintf("%s/p2p/%s", listenAddr, peerID.String())

return recNode.(*FullNode), recStopTicker
}

// stopNodeAndCleanup stops a running node and calls its cleanup function.

medium

This comment refers to a function stopNodeAndCleanup that doesn't exist in the file. It appears to be a leftover and should be removed to avoid confusion.

Comment on lines +305 to +313
if res.Code == 1 { // StatusSuccess
blobsFound += len(res.Data)
t.Logf("DIAG: found %d header blob(s) at DA height %d (Success)", len(res.Data), h)
}
res = sharedDA.Retrieve(context.Background(), h, sharedDA.GetDataNamespace())
if res.Code == 1 { // StatusSuccess
blobsFound += len(res.Data)
t.Logf("DIAG: found %d data blob(s) at DA height %d (Success)", len(res.Data), h)
}

medium

The magic number 1 is used to check for success status. For better readability and maintainability, please use the datypes.StatusSuccess constant. You'll need to add the following import:

import (
	// ... other imports
	datypes "github.com/evstack/ev-node/pkg/da/types"
)
Suggested change
if res.Code == 1 { // StatusSuccess
blobsFound += len(res.Data)
t.Logf("DIAG: found %d header blob(s) at DA height %d (Success)", len(res.Data), h)
}
res = sharedDA.Retrieve(context.Background(), h, sharedDA.GetDataNamespace())
if res.Code == 1 { // StatusSuccess
blobsFound += len(res.Data)
t.Logf("DIAG: found %d data blob(s) at DA height %d (Success)", len(res.Data), h)
}
if res.Code == datypes.StatusSuccess { // StatusSuccess
blobsFound += len(res.Data)
t.Logf("DIAG: found %d header blob(s) at DA height %d (Success)", len(res.Data), h)
}
res = sharedDA.Retrieve(context.Background(), h, sharedDA.GetDataNamespace())
if res.Code == datypes.StatusSuccess { // StatusSuccess
blobsFound += len(res.Data)
t.Logf("DIAG: found %d data blob(s) at DA height %d (Success)", len(res.Data), h)
}

Member


Nit: can we make nodeConfig.Node.SequencerRecovery.Duration and Raft.Enable mutually exclusive? Cobra has an option for this.
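
Cobra supports this directly via MarkFlagsMutuallyExclusive; a sketch (FlagSequencerRecovery is the constant added in this PR, the raft flag name is an assumption):

// After both flags are registered on the start command:
cmd.MarkFlagsMutuallyExclusive(config.FlagSequencerRecovery, "raft.enable") // "raft.enable" name assumed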

alpe added 4 commits February 17, 2026 10:06
* main:
  refactor: reuse P2P client across node modes (#3065)
  refactor: store pending block separately (#3073)
  chore: Fix mismatched comment in TestCache_WithNilStore function (#3074)
  build(deps): Bump github.com/pion/dtls/v3 from 3.0.6 to 3.0.11 (#3068)
  feat: block Pruning (#2984)
  refactor: omit unnecessary reassignment (#3067)
  refactor: use WaitGroup.Go to simplify code (#3059)
  chore: prep rc.4 (#3062)
  fix(syncing): skip draining when exec client unavailable (#3060)
  build(deps): Bump the all-go group across 4 directories with 5 updates (#3058)
* main:
  build(deps): Bump the all-go group across 8 directories with 6 updates (#3076)
  refactor(block): allow pruning when da disabled (#3075)
Member

@julienrbrt julienrbrt left a comment


nice! lgtm. left some cleanup nits.

@julienrbrt julienrbrt added this pull request to the merge queue Feb 19, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 19, 2026
@julienrbrt julienrbrt merged commit 057c388 into main Feb 19, 2026
28 checks passed
@julienrbrt julienrbrt deleted the alex/2521_sync branch February 19, 2026 11:23


Development

Successfully merging this pull request may close these issues.

Allow running a sequencer node that syncs from DA first in full node mode

3 participants
