An LLM capability mapping framework for systematic evaluation of language models in computer science problem-solving
This branch contains the updated PrismBench framework. For the replication package of the paper, please switch to Replication Package branch.
PrismBench systematically evaluates LLM models through a three-phase Monte Carlo Tree Search approach:
- Phase 1: Maps initial capabilities across CS concepts
- Phase 2: Discovers challenging concept combinations
- Phase 3: Conducts comprehensive evaluation of weaknesses
git clone https://github.com/PrismBench/PrismBench.git
cd PrismBench
# See all available commands
make help
# Set up the development environment
make setup
# Configure your API keys in apis.key, then start services
make startOnce running, the web interface is available at http://localhost:3000.
- Systematic Evaluation: MCTS-driven exploration of model capabilities
- Challenge Discovery: Automatically identifies model weaknesses
- Comprehensive Analysis: Detailed performance metrics and insights
- Containerized: Easy deployment with Docker
- API Compatible: Works with any OpenAI-compatible API
PrismBench/
├── src/services/ # Core framework components
│ ├── llm_interface/ # LLM communication layer
│ ├── environment/ # Challenge execution environment
│ ├── search/ # MCTS implementation
│ └── gui/ # shadcdn-based interface for interacting with the framework
├── configs/ # Configuration files
├── docs/ # Comprehensive documentation wiki
└── Dockerfile.base # Shared base image for Python services
PrismBench uses a shared base image approach to optimize Docker builds:
- Base Image:
Dockerfile.basecontains common Python setup, system dependencies, and uv installation - Service Images: Each Python service inherits from the base image, reducing build time and image size
- Layer Caching: Common dependencies are cached in the base image, speeding up subsequent builds
# Build only the base image
make build-base
# Build all services (uses cached base image)
make build
# Rebuild base image from scratch
make rebuild-base
# Rebuild all services from scratch
make rebuild
|
|
| Component | Purpose | Documentation |
|---|---|---|
| 🎨 GUI | Web interface for the framework | 📖 README |
| 🤖 LLM Interface | Model communication | 📖 README |
| 🌍 Environment | Code execution | 📖 README |
| 🔍 Search | MCTS implementation | 📖 README |
| 📊 Analysis | Results processing | 📖 README |
We welcome contributions! Please see our Contributing Guide for details.
MIT License - see LICENSE file for details.
If you use PrismBench in your research, please cite:
@software{prismbench,
title={PrismBench: LLM Capability Mapping Framework},
author={anonymous},
year={2025},
url={https://github.com/PrismBench/PrismBench}
}