# Scientific A/B Testing for LLM Prompts
Run controlled experiments on your prompts with statistical rigor. Compare variants with hypothesis testing, effect sizes, and AI-powered insights.
Install from a local clone of the repository in editable mode:

```bash
pip install -e .
```
Quick start:

```python
from abprompt import (
    Variant,
    Experiment,
    ExperimentRunner,
    LLMJudge,
    StatisticalAnalyzer,
    Provider,
)

# Define prompt variants
variant_a = Variant(
    name="concise",
    prompt_template="Answer briefly: {question}",
    model="claude-3-5-haiku-20241022",
    provider=Provider.ANTHROPIC,
)
variant_b = Variant(
    name="detailed",
    prompt_template="Explain thoroughly: {question}",
    model="claude-3-5-haiku-20241022",
    provider=Provider.ANTHROPIC,
)

# Create experiment
experiment = Experiment(
    id="prompt_test",
    name="Concise vs Detailed",
    variants=[variant_a, variant_b],
    test_inputs=[{"question": "What is machine learning?"}],
    sample_size=30,
)

# Run with LLM-as-Judge (inside an async context, e.g. via asyncio.run)
runner = ExperimentRunner(judge=LLMJudge())
trials = await runner.run(experiment)

# Analyze statistically: trials_a / trials_b are the per-variant subsets of `trials`
analyzer = StatisticalAnalyzer()
results = analyzer.compare_all_metrics(trials_a, trials_b)
winner, confidence, reason = analyzer.determine_winner(results)
```
Run the live demo with your Anthropic API key:

```bash
export ANTHROPIC_API_KEY=your-key-here
python live_demo.py
```
The analyzer applies the following statistical methods (see the sketch after this table):

| Method | Purpose |
|---|---|
| Welch's t-test | Compare means with unequal variances |
| Mann-Whitney U | Non-parametric alternative |
| Cohen's d | Effect size measurement |
| 95%/99% CI | Confidence intervals |
| Bonferroni | Multiple comparison correction |
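
For readers who want to see what these tests compute, here is a minimal, self-contained sketch of the same analyses using NumPy and SciPy directly. The score arrays and the `k_metrics` count are made-up illustrative values, not output from abprompt; the library's `StatisticalAnalyzer` encapsulates the equivalent logic.

```python
# Illustrative sketch of the statistical methods above using SciPy/NumPy.
# scores_a / scores_b stand in for per-trial judge scores of two variants.
import numpy as np
from scipy import stats

scores_a = np.array([7.2, 6.8, 7.5, 8.0, 6.9, 7.1])  # hypothetical scores, variant A
scores_b = np.array([7.9, 8.3, 7.7, 8.5, 8.1, 7.8])  # hypothetical scores, variant B

# Welch's t-test: compares means without assuming equal variances
t_stat, p_welch = stats.ttest_ind(scores_a, scores_b, equal_var=False)

# Mann-Whitney U: non-parametric alternative when normality is doubtful
u_stat, p_mwu = stats.mannwhitneyu(scores_a, scores_b, alternative="two-sided")

# Cohen's d: standardized effect size based on the pooled standard deviation
n_a, n_b = len(scores_a), len(scores_b)
var_a, var_b = scores_a.var(ddof=1), scores_b.var(ddof=1)
pooled_sd = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
cohens_d = (scores_b.mean() - scores_a.mean()) / pooled_sd

# 95% confidence interval for the mean difference (Welch-Satterthwaite dof)
diff = scores_b.mean() - scores_a.mean()
se = np.sqrt(var_a / n_a + var_b / n_b)
dof = (var_a / n_a + var_b / n_b) ** 2 / (
    (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
)
ci_low, ci_high = stats.t.interval(0.95, dof, loc=diff, scale=se)

# Bonferroni correction: with k metrics compared, test each at alpha / k
alpha, k_metrics = 0.05, 3
significant = p_welch < alpha / k_metrics

print(f"Welch t={t_stat:.2f} (p={p_welch:.4f}), MWU p={p_mwu:.4f}, d={cohens_d:.2f}")
print(f"95% CI for mean difference: [{ci_low:.2f}, {ci_high:.2f}], significant after Bonferroni: {significant}")
```

As a rule of thumb, |d| ≈ 0.2 is read as a small effect, 0.5 as medium, and 0.8 as large.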
Run the test suite:

```bash
pytest tests/ -v  # 178 tests, 100% pass rate
```
Project layout:

```
abprompt/
    core/
        types.py          # Data models
        runner.py         # Experiment execution
    judge/
        llm_judge.py      # LLM-as-Judge evaluation
    telemetry/
        collector.py      # Metrics collection
    analysis/
        statistics.py     # Statistical tests
        insights.py       # AI insights generation
    visualization/
        reports.py        # Report generation
tests/
    test_*.py             # Comprehensive tests
```
This project is part of my 30-day AI challenge.
License: MIT