LLM Interactive Proxy

A Swiss Army knife proxy that sits between your LLM client, such as a coding agent, and any supported provider, giving you a universal adapter, cost optimization, and full visibility with zero code changes.

Quick Start

1. Installation

git clone https://github.com/matdev83/llm-interactive-proxy.git
cd llm-interactive-proxy
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e .[dev]

2. Start the Proxy

export OPENAI_API_KEY="your-key-here"
python -m src.core.cli --default-backend openai:gpt-4o

3. Point Your Client at the Proxy

# Instead of direct API calls:
from openai import OpenAI
client = OpenAI(api_key="your-key")

# Use the proxy (base_url only):
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy-key"  # Proxy handles real authentication
)

# Now use normally - requests go through the proxy
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

That's it. All your existing code works unchanged—the proxy handles routing, translation, and monitoring transparently.

See Quick Start Guide for detailed configuration.

Why Use LLM Interactive Proxy?

One configuration. Any client. Any provider.

Stop rewriting your code every time you want to try a different LLM. Stop managing API keys in a dozen different tools. Stop wondering why your agent is stuck in an infinite loop or why your API bill suddenly spiked.

Solve Real Problems

Tired of juggling multiple LLM subscriptions?
Connect all your premium accounts—GPT Plus/Pro, Gemini Advanced, Qwen, GLM Code, and more—through one endpoint. Use them all without switching tools.

Worried about agent misbehavior?
Fix stuck agents with automatic loop detection. Reduce token costs with intelligent context compression. Get a second opinion mid-conversation by switching models seamlessly.

Need more control over what LLMs actually do?
Rewrite prompts and responses on-the-fly without touching client code. Block dangerous git commands before they execute. Add a "guardian angel" model that monitors and helps when your primary model drifts off track.

Want visibility into what's happening?
Capture every request and response in CBOR format. Debug issues, audit usage, and understand exactly what your LLM apps are doing.

Zero changes to your client code. Just point it at the proxy and gain control.

Key Capabilities

Universal Connectivity

  • Protocol Translation — Use OpenAI SDK with Anthropic, Claude client with Gemini, any combination
  • Subscription Consolidation — Leverage all your premium LLM accounts through one endpoint
  • Flexible Deployment — Single-user mode for development, multi-user mode for production

Cost & Performance Optimization

  • Smart Routing — Rotate API keys to maximize free tiers and automatically fall back to cheaper models
  • Context Window Compression — Reduce token usage and improve inference speed without losing quality
  • Full Observability — Wire capture, usage tracking, token counting, performance metrics

Intelligent Session Control

  • Loop Detection — Automatically detect and resolve infinite loops and repetitive patterns
  • Dynamic Model Switching — Change models mid-conversation for diverse perspectives without losing context
  • Quality Verifier — Deploy a secondary model to verify responses when the primary model struggles

Behavioral Customization

  • Prompt & Response Rewriting — Modify content on-the-fly to fine-tune agent behavior
  • Tool Call Reactors — Override and intercept tool calls to suppress unwanted behaviors
  • Usage Limits — Enforce quotas and control resource consumption

Security & Safety

  • Key Isolation — Configure API keys once, never expose them to clients
  • Directory Sandboxing — Restrict LLM tool access to designated safe directories
  • Command Protection — Block harmful operations like aggressive git commands
  • Tool Access Control — Fine-grained control over which tools LLMs can invoke

Enterprise Features

  • B2BUA Session Isolation — Internal session identity generation and strict trust boundaries (enabled by default; use --disable-b2bua-session-handling to opt out)

See User Guide for the complete feature list.

Routing Selector Semantics

  • backend:model selects an explicit backend family.
  • backend-instance:model (for example openai.1:gpt-4o) targets a concrete backend instance.
  • model and vendor/model are model-only selectors.
  • vendor/model:variant remains model-only (the : suffix is part of the model payload unless : appears before the first /).
  • URI-style parameters in selectors (for example model?temperature=0.5) are parsed and propagated through routing metadata.
  • Explicit-backend configuration and command surfaces (for example --static-route, replacement targets, and one-off routing) require strict backend:model format.
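
As an illustration, the sketch below supplies each selector style through the model field of an ordinary request against the proxy. This is a minimal sketch: the concrete backend and model names are placeholders, and whether a given selector resolves depends on which backends you have configured.

# Illustrative only: backend and model names below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy-key")

selectors = [
    "openai:gpt-4o",           # backend:model - explicit backend family
    "openai.1:gpt-4o",         # backend-instance:model - concrete instance
    "gpt-4o",                  # model-only selector
    "openai/gpt-4o",           # vendor/model - still model-only
    "gpt-4o?temperature=0.5",  # URI-style parameter, propagated via routing metadata
]

for selector in selectors:
    response = client.chat.completions.create(
        model=selector,
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(selector, "->", response.choices[0].message.content)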

Architecture

graph TD
    subgraph "Clients"
        A[OpenAI Client]
        B[OpenAI Responses API Client]
        C[Anthropic Client]
        D[Gemini Client]
        E[Any LLM App]
    end

    subgraph "LLM Interactive Proxy"
        FE["Front-end APIs<br/>(OpenAI, Anthropic, Gemini)"]
        Core["Core Proxy Logic<br/>(Routing, Translation, Safety)"]
        BE["Back-end Connectors<br/>(OpenAI, Anthropic, Gemini, etc.)"]
        FE --> Core --> BE
    end

    subgraph "Providers"
        P1[OpenAI API]
        P2[Anthropic API]
        P3[Google Gemini API]
        P4[OpenRouter API]
    end

    A --> FE
    B --> FE
    C --> FE
    D --> FE
    BE --> P1
    BE --> P2
    BE --> P3
    BE --> P4

Documentation

Supported Front-end Interfaces

The proxy exposes multiple standard API surfaces, allowing you to use your favorite clients with any backend:

  • OpenAI Chat Completions (/v1/chat/completions) - Compatible with OpenAI SDKs and most tools.
  • Reasoning-model token floor guard - For reasoning-first models (e.g. openrouter:stepfun/step-3.5-flash:free, kimi-code:kimi/kimi-for-coding), explicit low max_tokens/max_completion_tokens values are raised to a configurable minimum (default 512) to prevent empty assistant messages. Configure via reasoning_model_token_floor in app config.
  • OpenAI Responses (/v1/responses) - Optimized for structured output generation.
  • OpenAI Models (/v1/models) - Canonical backend-agnostic model discovery from the capability index (canonical vendor/model IDs only).
  • Anthropic Messages (/anthropic/v1/messages) - Native support for Claude clients/SDKs.
  • Dedicated Anthropic Server (http://host:8001/v1/messages) - Drop-in replacement for Anthropic API on a separate port (default: 8001).
  • Google Gemini v1beta (/v1beta/models, :generateContent) - Native support for Gemini tools.
  • Routing Error Parity - Dynamic routing failures are emitted in protocol-native error envelopes while preserving canonical details.code and details.retryable semantics across OpenAI, Anthropic, and Gemini surfaces.

See Front-End APIs Overview for more details.
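
For example, the official anthropic Python SDK can talk to the proxy's Anthropic Messages surface directly. This is a minimal sketch, assuming a proxy running locally on port 8000 and that the SDK appends /v1/messages to its base URL; the model name is a placeholder.

# Minimal sketch: point the Anthropic SDK at the proxy's /anthropic prefix
# so requests land on /anthropic/v1/messages. Model name is a placeholder.
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:8000/anthropic",
    api_key="dummy-key",  # the proxy holds the real provider credentials
)

message = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello from the proxy!"}],
)
print(message.content[0].text)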

  • Diagnostics Endpoint (/v1/diagnostics) includes bounded routing metadata: availability status per backend instance (active, rate_limited, disabled), canonical model-to-eligible-instance summaries, preference/tie-set diagnostics, and deterministic truncation metadata.
  • Reactivation Control Endpoint (/v1/diagnostics/backends/{backend_instance}/reactivate) explicitly reactivates disabled backend instances and can optionally clear permanent unsupported (instance, model) state.
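
A quick way to inspect that routing metadata is to query the diagnostics endpoint directly. This is a minimal sketch, assuming a single-user proxy on localhost with authentication disabled; it simply pretty-prints whatever the endpoint returns.

# Minimal sketch: fetch routing/diagnostics metadata from a locally running proxy.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/diagnostics") as resp:
    data = json.load(resp)

# The exact payload shape is documented elsewhere; just pretty-print it here.
print(json.dumps(data, indent=2))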

Supported Backends

See Backends Overview for full details and configuration.

Access Modes

The proxy supports two operational modes to enforce appropriate security boundaries:

  • Single User Mode (default): For local development. Allows OAuth connectors, optional authentication, localhost-only binding.
  • Multi User Mode: For production/shared deployments. Blocks OAuth connectors, requires authentication for remote access, allows any IP binding.

Quick Examples

# Single User Mode (default) - local development
./.venv/Scripts/python.exe -m src.core.cli

# Multi User Mode - production deployment
./.venv/Scripts/python.exe -m src.core.cli --multi-user-mode --host=0.0.0.0 --api-keys key1,key2
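
On the client side, multi-user deployments authenticate with one of the configured proxy keys instead of a placeholder value. This is a minimal sketch, assuming the proxy accepts a configured key wherever an OpenAI-compatible client normally sends its API key; host, port, and key are placeholders.

# Minimal sketch: an OpenAI-compatible client authenticating against a
# multi-user proxy started with --api-keys key1,key2. Host and key are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://proxy.example.internal:8000/v1",
    api_key="key1",  # one of the keys passed via --api-keys
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)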

See Access Modes User Guide for detailed documentation.

Support

License

This project is licensed under the GNU AGPL v3.0 or later.

Development

# Run tests
python -m pytest

# Run linter
python -m ruff check --fix .

# Format code
python -m black .

# Validate unified outbound routing compliance (same check as CI gate)
python dev/scripts/check_routing_unification_compliance.py

See Development Guide for more details.