
PrismBench: A comprehensive framework for evaluating Large Language Model capabilities through Monte Carlo Tree Search. Systematically maps model strengths, automatically discovers challenging concept combinations, and provides detailed performance analysis with containerized deployment and OpenAI-compatible API support.


🔬 PrismBench

An LLM capability mapping framework for systematic evaluation of language models in computer science problem-solving


This branch contains the updated PrismBench framework. For the paper's replication package, please switch to the Replication Package branch.

What is PrismBench?

PrismBench systematically evaluates LLMs through a three-phase Monte Carlo Tree Search approach:

  • Phase 1: Maps initial capabilities across CS concepts
  • Phase 2: Discovers challenging concept combinations
  • Phase 3: Conducts comprehensive evaluation of weaknesses
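At each phase the search loop follows the standard MCTS pattern of selection, expansion, simulation, and backpropagation. As an illustrative sketch only (the node structure and field names here are hypothetical, not PrismBench's actual API), UCT-style selection over CS-concept nodes looks like:

```python
import math


def uct_score(node_value, node_visits, parent_visits, c=1.41):
    """UCB1 score balancing exploitation (mean reward) and exploration."""
    if node_visits == 0:
        return float("inf")  # always try unvisited concepts first
    exploit = node_value / node_visits
    explore = c * math.sqrt(math.log(parent_visits) / node_visits)
    return exploit + explore


def select_child(children, parent_visits):
    """Pick the child concept node with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["value"], ch["visits"], parent_visits),
    )


# Hypothetical concept nodes with accumulated reward and visit counts
children = [
    {"concept": "sorting", "value": 8.0, "visits": 10},
    {"concept": "dynamic programming", "value": 2.0, "visits": 4},
    {"concept": "graphs", "value": 0.0, "visits": 0},
]
best = select_child(children, parent_visits=14)  # "graphs" wins: never visited
```

The exploration term is what drives Phase 2's discovery of challenging concept combinations: rarely visited or poorly performing branches keep getting revisited until their difficulty estimate stabilizes.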

Quick Start

git clone https://github.com/PrismBench/PrismBench.git
cd PrismBench

# See all available commands
make help

# Set up the development environment
make setup

# Configure your API keys in apis.key, then start services
make start

Once running, the web interface is available at http://localhost:3000.

📖 See detailed setup guide →

Key Features

  • Systematic Evaluation: MCTS-driven exploration of model capabilities
  • Challenge Discovery: Automatically identifies model weaknesses
  • Comprehensive Analysis: Detailed performance metrics and insights
  • Containerized: Easy deployment with Docker
  • API Compatible: Works with any OpenAI-compatible API
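Because the LLM interface speaks the OpenAI chat-completions format, any compatible backend can serve as the model under test. A minimal request body (the endpoint URL and model name below are illustrative placeholders, not PrismBench defaults) has this shape:

```python
import json

# Hypothetical endpoint and model name -- substitute your deployment's values.
url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "gpt-4o-mini",
    "messages": [
        {"role": "system", "content": "You are a CS problem solver."},
        {"role": "user", "content": "Implement binary search in Python."},
    ],
    "temperature": 0.2,
}
body = json.dumps(payload)  # POST this as the JSON request body
```

Any server implementing this schema can be plugged in by pointing the interface at its base URL and supplying the matching key in apis.key.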

Architecture

PrismBench/
├── src/services/           # Core framework components
│   ├── llm_interface/     # LLM communication layer
│   ├── environment/       # Challenge execution environment  
│   ├── search/           # MCTS implementation
│   └── gui/              # shadcn-based interface for interacting with the framework
├── configs/              # Configuration files
├── docs/                # Comprehensive documentation wiki
└── Dockerfile.base      # Shared base image for Python services

🐳 Docker Build Optimization

PrismBench uses a shared base image approach to optimize Docker builds:

  • Base Image: Dockerfile.base contains common Python setup, system dependencies, and uv installation
  • Service Images: Each Python service inherits from the base image, reducing build time and image size
  • Layer Caching: Common dependencies are cached in the base image, speeding up subsequent builds

Build Commands

# Build only the base image
make build-base

# Build all services (uses cached base image)
make build

# Rebuild base image from scratch
make rebuild-base

# Rebuild all services from scratch
make rebuild

📚 Documentation

🌟 Visit our wiki →

🚀 Getting Started

🎯 Core Concepts

🔧 Component Documentation

| Component | Purpose | Documentation |
|---|---|---|
| 🎨 GUI | Web interface for the framework | 📖 README |
| 🤖 LLM Interface | Model communication | 📖 README |
| 🌍 Environment | Code execution | 📖 README |
| 🔍 Search | MCTS implementation | 📖 README |
| 📊 Analysis | Results processing | 📖 README |

Contributing

We welcome contributions! Please see our Contributing Guide for details.

License

MIT License - see LICENSE file for details.

Citation

If you use PrismBench in your research, please cite:

@software{prismbench,
  title={PrismBench: LLM Capability Mapping Framework},
  author={anonymous},
  year={2025},
  url={https://github.com/PrismBench/PrismBench}
}
