MineDraft: A Framework for Batch Parallel Speculative Decoding

Environment Requirements

System Requirements

OS: Linux (tested on Ubuntu 22.04)
Python: 3.9 - 3.12 (tested with 3.12)
CUDA: >= 11.8 (tested with 12.8)
GPU: 5× NVIDIA GPUs with sufficient VRAM recommended (e.g., A100 80GB, H100, or L40)

Dependencies

vLLM: 0.9.2
PyTorch: 2.7.0
torch-scatter: 2.1.2
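
One way to confirm that an environment matches the versions listed above is to compare installed package metadata against the pins. The following is a minimal sketch and not part of MineDraft: the helper names python_supported and check_pins are hypothetical, and the distribution names (vllm, torch, torch-scatter) are assumed to match the entries above.

```python
import sys
from importlib import metadata

# Versions copied from the requirements listed above.
SUPPORTED_PYTHON = ((3, 9), (3, 12))
PINNED = {"vllm": "0.9.2", "torch": "2.7.0", "torch-scatter": "2.1.2"}


def python_supported(version=None):
    """Return True if the (major, minor) version lies in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED_PYTHON[0] <= major_minor <= SUPPORTED_PYTHON[1]


def check_pins(pins=PINNED):
    """Map each package to (wanted, found, matches); found is None if missing."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Drop local build tags such as "+cu128" before comparing.
        matches = found is not None and found.split("+")[0] == wanted
        report[name] = (wanted, found, matches)
    return report


if __name__ == "__main__":
    print("Python OK:", python_supported())
    for name, (wanted, found, ok) in check_pins().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {found}, {status}")
```

Running the file after installation prints one line per pinned package, which makes a mismatch easy to spot before starting a long experiment.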

Installation

1. Create a Virtual Environment

python -m venv venv
source venv/bin/activate

Or using uv:

uv venv --python 3.12 --seed
source .venv/bin/activate

Or using conda:

conda create -n minedraft python=3.12 -y
conda activate minedraft

2. Install vLLM

pip install vllm==0.9.2 --extra-index-url https://download.pytorch.org/whl/cu128

3. Install MineDraft

# Install the package with benchmark dependencies
pip install -e ".[benchmark]"

This will install:

Core dependencies: torch-scatter==2.1.2
Benchmark dependencies: datasets, nvitop, pandas, numpy, matplotlib, IPython, tqdm

Dataset Preparation

Before running experiments, prepare the benchmark datasets:

# Create datasets directory
mkdir -p benchmarks/datasets

# Download and convert datasets
python scripts/convert_datasets.py

This will download and prepare the following datasets:

ShareGPT.json: ShareGPT_V3_unfiltered_cleaned_split
arena.json: LMSYS Chatbot Arena Conversations
spec_bench.json: Spec-Bench
tough.json: Domain-specific tough questions
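
After the conversion script finishes, it can be useful to confirm that all four files are present before launching an experiment. The sketch below assumes the files are written to benchmarks/datasets (the directory created above); the helper name missing_datasets is hypothetical and is not part of the repository.

```python
from pathlib import Path

# File names taken from the dataset list above.
EXPECTED_DATASETS = ("ShareGPT.json", "arena.json", "spec_bench.json", "tough.json")


def missing_datasets(directory="benchmarks/datasets"):
    """Return the expected dataset files that are absent from the directory."""
    root = Path(directory)
    return [name for name in EXPECTED_DATASETS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_datasets()
    if missing:
        print("Missing datasets:", ", ".join(missing))
    else:
        print("All datasets are present.")
```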

Experiment Structure

The experiments are organized in the scripts/ directory:

| Script | Description |
| --- | --- |
| `experiment_1_*.sh` | Qwen3-32B with various draft models (0.6B, 1.7B, 4B) |
| `experiment_2_eagle_*.sh` | EAGLE models (Vicuna-33B, Vicuna-13B) |
| `experiment_2_llama_*.sh` | Llama-3.3-70B-AWQ with Llama-3.1-8B |
| `experiment_3_n_*.sh` | Multi-sample ablation |
| `experiment_4_bs_*.sh` | Batch size ablation (8, 16, 32, 64) |
| `experiment_5_tetris_*.sh` | Tetris VSR analysis experiments |
| `experiment_6_qwen8b.sh` | Qwen3-32B with Qwen3-8B |
| `experiment_7_qwen235b.sh` | Qwen3-235B-A22B-Instruct-2507-FP8 with Qwen3-14B |
| `experiment_8_nsys.sh` | NVIDIA Nsight profiling |

Most experiments have two subsets:

*_parallel.sh: Parallel speculative decoding (requires 5 GPUs)
*_sequential.sh: Sequential speculative decoding (requires 4 GPUs)

Running Experiments

Run All Experiments

cd scripts

# Run all experiments (both parallel and sequential)
bash run_all.sh

# Or run only parallel experiments
bash run_parallel.sh

# Or run only sequential experiments
bash run_sequential.sh

Run Individual Experiments

cd scripts

# Example: Run parallel SD subset in Experiment 1 (Qwen3-32B)
bash experiment_1_parallel.sh

# Example: Run sequential SD subset in Experiment 2 (EAGLE models)
bash experiment_2_eagle_sequential.sh

(Optional) Using the GPU Bootstrap Script

For environments with shared GPU resources, you may use the bootstrap script to automatically wait for available GPUs.

Before using the bootstrap script, remove or comment out the export CUDA_VISIBLE_DEVICES= line in the experiment script, because the bootstrap script sets this variable itself.

Example usage:

python scripts/bootstrap.py bash scripts/experiment_1_parallel.sh

The bootstrap script will:

Monitor GPU availability
Wait until 5 GPUs are available, each with less than 1% memory usage and less than 1% utilization (the required GPU count and the thresholds can be adjusted in the main function)
Automatically set CUDA_VISIBLE_DEVICES and run the experiment
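
The selection logic described above can be sketched as follows. This is an illustration under stated assumptions, not the code in scripts/bootstrap.py: the function names, the shape of the stats tuples, and the injected read_stats callable are all hypothetical, and how the per-GPU numbers are obtained is left to the caller.

```python
import time


def pick_idle_gpus(stats, required=5, max_memory=0.01, max_utilization=0.01):
    """Return a CUDA_VISIBLE_DEVICES string, or None if too few GPUs are idle.

    stats is a list of (index, memory_fraction, utilization_fraction) tuples,
    with both fractions in the range 0.0 to 1.0 (hypothetical format).
    """
    idle = [
        index
        for index, memory, utilization in stats
        if memory < max_memory and utilization < max_utilization
    ]
    if len(idle) < required:
        return None
    return ",".join(str(index) for index in idle[:required])


def wait_for_gpus(read_stats, required=5, poll_seconds=30.0, sleep=time.sleep):
    """Poll read_stats() until enough idle GPUs are found, then return them."""
    while True:
        devices = pick_idle_gpus(read_stats(), required=required)
        if devices is not None:
            return devices
        sleep(poll_seconds)
```

The returned string can be assigned to the CUDA_VISIBLE_DEVICES environment variable of the child process that runs the experiment script.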

Speculative Decoding Configuration

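The experiments pass their speculative decoding settings via --speculative-config. The fields are annotated below; the // comments are explanatory only and are not valid JSON, so omit them in an actual configuration: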

{
    "method": null,
    // null for standard SD, "eagle" for EAGLE
    "model": "<draft_model>",
    // HuggingFace model ID for draft model
    "draft_tensor_parallel_size": 1,
    // TP size for draft model, should always be 1
    "num_speculative_tokens": 5,
    // Number of draft tokens (k)
    "is_parallel": true,
    // Enable PSD (and MineDraft)
    "force_pearl": false,
    // Enable PEARL (requires is_parallel to be true; disables MineDraft)
    "tetris": true,
    // Enable Tetris
    "tetris_turn_on_batch_size": 1,
    // Batch size threshold for Tetris
    "tetris_capacity": 0,
    // Tetris capacity (0 means it is computed as num_speculative_tokens * max_num_seqs, i.e., k * m)
    "tetris_extra_proposals": 3
    // Extra draft tokens for Tetris
}
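
Because JSON does not allow comments, it can be convenient to build the configuration in Python and serialize it. The sketch below uses only the field names and default values shown above; the helper names build_speculative_config and effective_tetris_capacity are hypothetical, and the draft model ID is left as a placeholder.

```python
import json
import shlex


def build_speculative_config(
    draft_model,
    method=None,
    num_speculative_tokens=5,
    is_parallel=True,
    force_pearl=False,
    tetris=True,
    tetris_turn_on_batch_size=1,
    tetris_capacity=0,
    tetris_extra_proposals=3,
):
    """Return the configuration as a dict using the field names listed above."""
    if force_pearl and not is_parallel:
        raise ValueError("force_pearl requires is_parallel to be true")
    return {
        "method": method,
        "model": draft_model,
        "draft_tensor_parallel_size": 1,  # the annotations say this stays 1
        "num_speculative_tokens": num_speculative_tokens,
        "is_parallel": is_parallel,
        "force_pearl": force_pearl,
        "tetris": tetris,
        "tetris_turn_on_batch_size": tetris_turn_on_batch_size,
        "tetris_capacity": tetris_capacity,
        "tetris_extra_proposals": tetris_extra_proposals,
    }


def effective_tetris_capacity(config, max_num_seqs):
    """Resolve a capacity of 0 to k * m, as described in the annotations."""
    capacity = config["tetris_capacity"]
    if capacity == 0:
        return config["num_speculative_tokens"] * max_num_seqs
    return capacity


if __name__ == "__main__":
    config = build_speculative_config("<draft_model>")
    # json.dumps writes None as null and True/False as true/false.
    print("--speculative-config", shlex.quote(json.dumps(config)))
```

The printed argument is shell-quoted, so it can be pasted into a command line once the placeholder is replaced with a real draft model ID.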

Hardware Configuration Example

Parallel mode: 5 GPUs (4 for target model TP, 1 for draft model)
Sequential mode: 4 GPUs (all for target model TP, draft shares resources)
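
The GPU counts above follow from the target model's tensor-parallel size plus one dedicated draft GPU in parallel mode. As a small sketch (the function names are hypothetical and not part of MineDraft), the arithmetic and a check against a CUDA_VISIBLE_DEVICES value look like this:

```python
def required_gpus(target_tensor_parallel_size=4, parallel=True):
    """Parallel mode adds one draft GPU; sequential mode shares the target GPUs."""
    if target_tensor_parallel_size < 1:
        raise ValueError("target_tensor_parallel_size must be at least 1")
    return target_tensor_parallel_size + (1 if parallel else 0)


def visible_gpu_count(cuda_visible_devices):
    """Count the entries in a CUDA_VISIBLE_DEVICES value such as "0,1,2,3,4"."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


if __name__ == "__main__":
    needed = required_gpus(target_tensor_parallel_size=4, parallel=True)
    print("GPUs needed:", needed)
    print("Enough visible:", visible_gpu_count("0,1,2,3,4") >= needed)
```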

Results and Analysis

Benchmark trace results are stored in benchmarks/trace/ as JSON lines (jsonl).

NVIDIA Nsight Systems profiling reports are stored in the project root directory as .nsys-rep files.

For trace analysis, see the Jupyter notebook benchmarks/trace/analyze_plots.ipynb and its dependency benchmarks/trace/analyze_traces.py.
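
For a quick look at a trace outside the notebook, a JSON lines file can be read one record per line. The sketch below makes no assumption about the record fields, since the trace schema is not described here; the helper names load_trace and field_names are hypothetical.

```python
import json
from pathlib import Path


def load_trace(path):
    """Read a JSON lines file and return its records as a list."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def field_names(records):
    """Return the sorted set of keys that appear in dict records."""
    keys = set()
    for record in records:
        if isinstance(record, dict):
            keys.update(record)
    return sorted(keys)
```

Calling field_names(load_trace(path)) on one of the files in benchmarks/trace/ lists the available fields, which is a reasonable starting point before opening the analysis notebook.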

Troubleshooting

Out of Memory (OOM)

Reduce --gpu-memory-utilization (default: 0.65)
Reduce --max-num-seqs (batch size)
Use smaller models

CUDA Version Mismatch

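Ensure a compatible CUDA toolkit and driver are installed and properly configured (the requirements above list CUDA >= 11.8, tested with 12.8, and the vLLM install command uses CUDA 12.8 wheels):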

nvcc --version
nvidia-smi
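
The release number printed by nvcc --version can also be compared against a minimum programmatically. The sketch below assumes the output contains a phrase of the form "release 12.8", which is how nvcc usually reports its version; the function names are hypothetical and the minimum is supplied by the caller.

```python
import re
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from text such as '... release 12.8, V12.8.61'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cuda_at_least(nvcc_output, minimum):
    """Return True if the parsed release is at least the (major, minor) minimum."""
    release = parse_nvcc_release(nvcc_output)
    return release is not None and release >= minimum


def read_nvcc_output():
    """Run nvcc --version and return its standard output (requires nvcc on PATH)."""
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, cuda_at_least(read_nvcc_output(), (11, 8)) checks the installed toolkit against the minimum listed in the system requirements.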

Model Download Issues

Models are automatically downloaded from HuggingFace. Ensure you have:

Sufficient disk space or quota
HuggingFace access tokens for gated models (e.g., Llama)

huggingface-cli login

NVIDIA Nsight Systems Issues

If you use an older version of Nsight Systems (<2024.2), you may see the following error while exporting the profiling report:

Wrong event order has been detected when adding events to the collection

To resolve this issue, update to Nsight Systems >=2024.2, which can be downloaded from https://developer.nvidia.com/tools-downloads#?search=nsight%20systems%202024.2&tx=$development_platform,linux.
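
To check an installed version before profiling, the year and release number can be parsed from a version string and compared against 2024.2. The sketch below assumes the string contains a "YYYY.N" token (for example, as printed by nsys --version); that output format is an assumption, and the function names are hypothetical.

```python
import re

MINIMUM_NSYS = (2024, 2)  # minimum version named in the troubleshooting note


def parse_nsys_version(text):
    """Extract (year, release) from text containing a token such as '2024.2.1'."""
    match = re.search(r"(\d{4})\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def nsys_is_recent_enough(text, minimum=MINIMUM_NSYS):
    """Return True if the parsed version is at least the minimum."""
    version = parse_nsys_version(text)
    return version is not None and version >= minimum
```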
