🔬 PrismBench

An LLM capability mapping framework for systematic evaluation of language models in computer science problem-solving

This branch contains the updated PrismBench framework. For the replication package of the paper, please switch to Replication Package branch.

What is PrismBench?

PrismBench systematically evaluates LLM models through a three-phase Monte Carlo Tree Search approach:

Phase 1: Maps initial capabilities across CS concepts
Phase 2: Discovers challenging concept combinations
Phase 3: Conducts comprehensive evaluation of weaknesses

Quick Start

git clone https://github.com/PrismBench/PrismBench.git
cd PrismBench

# See all available commands
make help

# Set up the development environment
make setup

# Configure your API keys in apis.key, then start services
make start

Once running, the web interface is available at http://localhost:3000.

📖 See detailed setup guide →

Key Features

Systematic Evaluation: MCTS-driven exploration of model capabilities
Challenge Discovery: Automatically identifies model weaknesses
Comprehensive Analysis: Detailed performance metrics and insights
Containerized: Easy deployment with Docker
API Compatible: Works with any OpenAI-compatible API

Architecture

PrismBench/
├── src/services/           # Core framework components
│   ├── llm_interface/     # LLM communication layer
│   ├── environment/       # Challenge execution environment  
│   ├── search/           # MCTS implementation
│   └── gui/              # shadcdn-based interface for interacting with the framework
├── configs/              # Configuration files
├── docs/                # Comprehensive documentation wiki
└── Dockerfile.base      # Shared base image for Python services

🐳 Docker Build Optimization

PrismBench uses a shared base image approach to optimize Docker builds:

Base Image: Dockerfile.base contains common Python setup, system dependencies, and uv installation
Service Images: Each Python service inherits from the base image, reducing build time and image size
Layer Caching: Common dependencies are cached in the base image, speeding up subsequent builds

Build Commands

# Build only the base image
make build-base

# Build all services (uses cached base image)
make build

# Rebuild base image from scratch
make rebuild-base

# Rebuild all services from scratch
make rebuild

📚 Documentation

🌟 Visit our wiki →

🚀 Getting Started

📖 Complete Wiki - Comprehensive documentation
⚡ Quick Start Guide - Get running in 5 minutes
🏗️ Architecture Overview - System design
⚙️ Configuration Guide - Setup and customization

🎯 Core Concepts

🧠 MCTS Algorithm - Core algorithm details
🤖 Agent System - Multi-agent architecture
🌍 Environment System - Evaluation environments
📊 Results Analysis - Understanding outputs

🔧 Component Documentation

Component	Purpose	Documentation
🎨 GUI	Web interface for the framework	📖 README
🤖 LLM Interface	Model communication	📖 README
🌍 Environment	Code execution	📖 README
🔍 Search	MCTS implementation	📖 README
📊 Analysis	Results processing	📖 README

Contributing

We welcome contributions! Please see our Contributing Guide for details.

License

MIT License - see LICENSE file for details.

Citation

If you use PrismBench in your research, please cite:

@software{prismbench,
  title={PrismBench: LLM Capability Mapping Framework},
  author={anonymous},
  year={2025},
  url={https://github.com/PrismBench/PrismBench}
}

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.devcontainer		.devcontainer
configs		configs
docker		docker
docs		docs
src		src
.dockerignore		.dockerignore
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
Makefile		Makefile
README.md		README.md
apis.key.template		apis.key.template
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🔬 PrismBench

What is PrismBench?

Quick Start

Key Features

Architecture

🐳 Docker Build Optimization

Build Commands

📚 Documentation

🚀 Getting Started

🎯 Core Concepts

🔧 Component Documentation

Contributing

License

Citation

About

Uh oh!

Releases

Uh oh!

Contributors

Uh oh!

Languages

CommissarSilver/PrismBench

Folders and files

Latest commit

History

Repository files navigation

🔬 PrismBench

What is PrismBench?

Quick Start

Key Features

Architecture

🐳 Docker Build Optimization

Build Commands

📚 Documentation

🚀 Getting Started

🎯 Core Concepts

🔧 Component Documentation

Contributing

License

Citation

About

Topics

Resources

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Uh oh!

Contributors

Uh oh!

Languages