Add agent docs eval: test that AI can build transfer scripts #73

gakonst · 2026-02-05T16:26:29Z

Summary

Adds an end-to-end eval that uses the Amp SDK to test whether AI agents can successfully build working code using Tempo docs.

What it does

Prompts an Amp agent to: "Build a TypeScript CLI that transfers 0.01 pathUSD on Tempo testnet"
Agent reads docs, writes the script, and executes it
Test verifies the output transaction hash exists on-chain
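
The verification step could be sketched roughly as follows. This is a minimal illustration, not the code in e2e/agent-transfer-funds.test.ts: `extractTxHash`, `buildReceiptRequest`, and `receiptExists` are hypothetical helper names, and the check assumes the testnet RPC answers the standard `eth_getTransactionReceipt` JSON-RPC method.

```typescript
// Hypothetical helpers for illustration; not taken from the PR's test file.

/** Returns the first 32-byte hash (0x + 64 hex chars) in the agent's output, or null. */
function extractTxHash(output: string): string | null {
  // The negative lookahead rejects longer hex strings that merely start with 64 hex chars.
  const match = output.match(/0x[0-9a-fA-F]{64}(?![0-9a-fA-F])/);
  return match ? match[0] : null;
}

/** Builds a standard JSON-RPC request body for eth_getTransactionReceipt. */
function buildReceiptRequest(hash: string, id = 1) {
  return { jsonrpc: "2.0", id, method: "eth_getTransactionReceipt", params: [hash] };
}

/** A null or missing result means the node has no receipt (unknown or still pending). */
function receiptExists(response: { result?: unknown }): boolean {
  return response.result !== null && response.result !== undefined;
}

// Usage sketch (rpcUrl is an assumption; take the real endpoint from the Tempo docs):
//   const hash = extractTxHash(agentOutput);
//   const res = await fetch(rpcUrl, {
//     method: "POST",
//     headers: { "content-type": "application/json" },
//     body: JSON.stringify(buildReceiptRequest(hash!)),
//   });
//   const onChain = receiptExists(await res.json());
```

Keeping the parsing and the response check as pure functions lets them be unit-tested without network access; only the `fetch` call itself depends on the testnet.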

Why

Per the discussion in #product-docs, we're seeing agents (such as Opus 4.5) get confused about:

Chain ID (using 12890 instead of 42431)
Token (using USDC instead of pathUSD)
Missing network details
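
One way to surface the first two confusions early would be a static scan of the generated script before it runs. The sketch below is an assumption about how that could look, not part of this PR: `findKnownConfusions` is a hypothetical name, and the check is a plain text heuristic built only from the values listed above.

```typescript
// Values taken from the list above; the function itself is a hypothetical sketch.
const EXPECTED_CHAIN_ID = 42431;
const WRONG_CHAIN_ID = 12890;

/** Returns a human-readable problem list for an agent-generated script's source text. */
function findKnownConfusions(source: string): string[] {
  const problems: string[] = [];
  // Word boundaries avoid matching the digits inside a longer number.
  if (new RegExp(`\\b${WRONG_CHAIN_ID}\\b`).test(source)) {
    problems.push(`uses chain ID ${WRONG_CHAIN_ID} instead of ${EXPECTED_CHAIN_ID}`);
  }
  if (/\bUSDC\b/.test(source)) {
    problems.push("references USDC instead of pathUSD");
  }
  return problems;
}
```

A scan like this cannot prove a script is correct; it only flags the specific mistakes already observed, which may be enough to produce a clearer failure message than a reverted transaction.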

This eval will help us iterate on docs until agents succeed consistently.

Files changed

e2e/agent-transfer-funds.test.ts - The eval test
package.json - Added @sourcegraph/amp-sdk dependency

Manual step needed

After merging, add this to .github/workflows/verify.yml to run the eval on a schedule (daily at 09:00 UTC):

# Add to the "on:" section:
  schedule:
    - cron: '0 9 * * *'

# Add this job:
  agent-eval:
    name: Agent Docs Eval
    runs-on: ubuntu-latest
    timeout-minutes: 15
    if: github.event_name == 'workflow_dispatch' || github.event_name == 'schedule'
    permissions:
      contents: read
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-node@v6
      - run: corepack enable pnpm
      - run: pnpm install
      - run: pnpm exec playwright install chromium --with-deps
      - run: pnpm exec playwright test agent-transfer-funds.test.ts
        env:
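          AMP_API_KEY: ${{ secrets.AMP_API_KEY }}
          # The playwright config skips agent-*.test.ts unless AGENT_EVAL is set.
          # The value '1' is an assumption; any value the config treats as set should work.
          AGENT_EVAL: '1'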

Also add AMP_API_KEY to repository secrets.

- Uses Amp SDK to prompt an agent to build a TypeScript CLI
- Agent must use tempo.ts SDK to transfer pathUSD on testnet
- Verifies the output tx hash exists on-chain

Note: Workflow changes need to be added separately (see PR description)

Amp-Thread-ID: https://ampcode.com/threads/T-019c2e9b-8e68-703a-841f-92dc4d4910ef
Co-authored-by: Amp <amp@ampcode.com>

vercel · 2026-02-05T16:26:33Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
tempo-docs	Ready	Preview, Comment	Feb 5, 2026 4:32pm

- Import tempoModerato from 'viem/chains' (not tempo.ts/chains)
- Add testIgnore to playwright config to skip agent-*.test.ts unless AGENT_EVAL env is set
- Regular E2E tests now run without the agent eval interfering

Amp-Thread-ID: https://ampcode.com/threads/T-019c2e9b-8e68-703a-841f-92dc4d4910ef
Co-authored-by: Amp <amp@ampcode.com>
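
The `testIgnore` gating described in this commit could look something like the sketch below. The PR's actual playwright config is not shown in this conversation, so `agentTestIgnore` is a hypothetical helper and the glob pattern is an assumption derived from the `agent-*.test.ts` naming above.

```typescript
// Hypothetical helper; the real playwright config in this PR may be structured differently.
function agentTestIgnore(env: Record<string, string | undefined>): string[] {
  // Ignore agent evals unless AGENT_EVAL is set to a non-empty value.
  return env.AGENT_EVAL ? [] : ["**/agent-*.test.ts"];
}

// In playwright.config.ts this could be wired up as:
//   import { defineConfig } from "@playwright/test";
//   export default defineConfig({ testIgnore: agentTestIgnore(process.env) });
```

With this shape, a plain `pnpm exec playwright test` skips the eval, while a run with `AGENT_EVAL` set in the environment includes it.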

vercel bot deployed to Preview February 5, 2026 16:28 View deployment

vercel bot deployed to Preview February 5, 2026 16:32 View deployment
