# Scientific A/B Testing for LLM Prompts
Run controlled experiments on your prompts with statistical rigor. Compare variants with hypothesis testing, effect sizes, and AI-powered insights.
Install from a local clone of the repository in editable mode:

```bash
pip install -e .
```
Quick start:

```python
from abprompt import (
    Variant,
    Experiment,
    ExperimentRunner,
    LLMJudge,
    StatisticalAnalyzer,
    Provider,
)

# Define prompt variants
variant_a = Variant(
    name="concise",
    prompt_template="Answer briefly: {question}",
    model="claude-3-5-haiku-20241022",
    provider=Provider.ANTHROPIC,
)
variant_b = Variant(
    name="detailed",
    prompt_template="Explain thoroughly: {question}",
    model="claude-3-5-haiku-20241022",
    provider=Provider.ANTHROPIC,
)

# Create experiment
experiment = Experiment(
    id="prompt_test",
    name="Concise vs Detailed",
    variants=[variant_a, variant_b],
    test_inputs=[{"question": "What is machine learning?"}],
    sample_size=30,
)

# Run with LLM-as-Judge (inside an async context, e.g. via asyncio.run)
runner = ExperimentRunner(judge=LLMJudge())
trials = await runner.run(experiment)

# Analyze statistically: trials_a / trials_b are the per-variant subsets of `trials`
analyzer = StatisticalAnalyzer()
results = analyzer.compare_all_metrics(trials_a, trials_b)
winner, confidence, reason = analyzer.determine_winner(results)
```
Run the live demo with your Anthropic API key:

```bash
export ANTHROPIC_API_KEY=your-key-here
python live_demo.py
```
The analyzer applies the following statistical methods (see the sketch after this table):

| Method | Purpose |
|---|---|
| Welch's t-test | Compare means with unequal variances |
| Mann-Whitney U | Non-parametric alternative |
| Cohen's d | Effect size measurement |
| 95%/99% CI | Confidence intervals |
| Bonferroni | Multiple comparison correction |
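
For readers who want to see what these tests compute, here is a minimal, self-contained sketch of the same analyses using NumPy and SciPy directly. The score arrays and the `k_metrics` count are made-up illustrative values, not output from abprompt; the library's `StatisticalAnalyzer` encapsulates the equivalent logic.

```python
# Illustrative sketch of the statistical methods above using SciPy/NumPy.
# scores_a / scores_b stand in for per-trial judge scores of two variants.
import numpy as np
from scipy import stats

scores_a = np.array([7.2, 6.8, 7.5, 8.0, 6.9, 7.1])  # hypothetical scores, variant A
scores_b = np.array([7.9, 8.3, 7.7, 8.5, 8.1, 7.8])  # hypothetical scores, variant B

# Welch's t-test: compares means without assuming equal variances
t_stat, p_welch = stats.ttest_ind(scores_a, scores_b, equal_var=False)

# Mann-Whitney U: non-parametric alternative when normality is doubtful
u_stat, p_mwu = stats.mannwhitneyu(scores_a, scores_b, alternative="two-sided")

# Cohen's d: standardized effect size based on the pooled standard deviation
n_a, n_b = len(scores_a), len(scores_b)
var_a, var_b = scores_a.var(ddof=1), scores_b.var(ddof=1)
pooled_sd = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
cohens_d = (scores_b.mean() - scores_a.mean()) / pooled_sd

# 95% confidence interval for the mean difference (Welch-Satterthwaite dof)
diff = scores_b.mean() - scores_a.mean()
se = np.sqrt(var_a / n_a + var_b / n_b)
dof = (var_a / n_a + var_b / n_b) ** 2 / (
    (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
)
ci_low, ci_high = stats.t.interval(0.95, dof, loc=diff, scale=se)

# Bonferroni correction: with k metrics compared, test each at alpha / k
alpha, k_metrics = 0.05, 3
significant = p_welch < alpha / k_metrics

print(f"Welch t={t_stat:.2f} (p={p_welch:.4f}), MWU p={p_mwu:.4f}, d={cohens_d:.2f}")
print(f"95% CI for mean difference: [{ci_low:.2f}, {ci_high:.2f}], significant after Bonferroni: {significant}")
```

As a rule of thumb, |d| ≈ 0.2 is read as a small effect, 0.5 as medium, and 0.8 as large.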
Run the test suite:

```bash
pytest tests/ -v  # 178 tests, 100% pass rate
```
Project layout:

```
abprompt/
    core/
        types.py          # Data models
        runner.py         # Experiment execution
    judge/
        llm_judge.py      # LLM-as-Judge evaluation
    telemetry/
        collector.py      # Metrics collection
    analysis/
        statistics.py     # Statistical tests
        insights.py       # AI insights generation
    visualization/
        reports.py        # Report generation
tests/
    test_*.py             # Comprehensive tests
```
This project is part of my 30-day AI challenge.
License: MIT