Purpose: Quick reference for working on Arbiter
- Arbiter: Production-grade LLM evaluation framework (v0.1.2)
- Stack: Python 3.11+, PydanticAI, provider-agnostic (OpenAI/Anthropic/Google/Groq)
- Coverage: 96% test coverage, strict mypy, comprehensive examples
- Pricing: LiteLLM bundled database (consistent with Conduit)
Design Philosophy: Simplicity wins, use good defaults, YAML config where needed, no hardcoded assumptions.
New to this repo? Run these 5 commands first:
```bash
# 1. Verify you're on a feature branch (NEVER work on main)
git status && git branch

# 2. Run all quality checks
make all

# 3. Run specific evaluator test to verify environment
pytest tests/unit/test_semantic.py -v

# 4. Check for any TODOs or placeholders (should be NONE)
grep -r "TODO\|FIXME\|NotImplementedError" arbiter/ || echo "✅ No placeholders found"

# 5. Verify coverage is >80%
make test-cov | tail -1
```
- make test, pytest tests/, pytest -v
- make format (runs black)
- make lint (runs ruff)
- make type-check (runs mypy in strict mode)
- tests/unit/ and examples/ for new user-facing features
- Exports in __init__.py files
- make all before committing (format + lint + type-check + test)

Core Architecture (Why: Breaks all evaluators):
- arbiter/evaluators/ - Must follow template pattern
- arbiter/api.py (evaluate, compare functions) - Breaking change for all users
- BasePydanticEvaluator - All evaluators inherit from this
- arbiter/core/middleware.py - Affects all evaluations
- arbiter/core/llm_client.py - Provider-agnostic guarantee at risk

Dependencies & Config (Why: Security and maintenance burden):
- pyproject.toml - Increases attack surface
- README.md - User-facing documentation
- arbiter/storage/ - Data persistence implications

Monitoring & Observability (Why: Production debugging):
- arbiter/core/monitoring.py - Breaks observability

Security (CRITICAL):
Other Prohibitions:
- .env files or API keys (use environment variables)
- ~/.claude/ configuration files
- arbiter/ repository

Detection Commands (Run before committing):
```bash
# Check for security violations
grep -r "API_KEY\|SECRET\|PASSWORD" arbiter/ tests/ examples/ && echo "🚨 CREDENTIALS FOUND" || echo "✅ No credentials"

# Check for code quality violations
grep -r "TODO\|FIXME" arbiter/ && echo "🚨 TODO comments found" || echo "✅ No TODOs"

# Check for incomplete features
grep -r "NotImplementedError\|pass # TODO" arbiter/ && echo "🚨 Placeholder code found" || echo "✅ No placeholders"

# Verify on feature branch
git branch --show-current | grep -E "^(main|master)$" && echo "🚨 ON MAIN BRANCH - CREATE FEATURE BRANCH" || echo "✅ On feature branch"

# Verify coverage >80%
make test 2>&1 | grep "TOTAL" | awk '{if ($NF+0 < 80) print "🚨 COVERAGE " $NF " < 80%"; else print "✅ Coverage " $NF}'
```
Don't flatter me. I know what AI sycophancy is and I don't want your praise. Be concise and direct. Don't use emdashes ever.
When to Analyze (Multiple Triggers):
Identify Failures:
Analyze Each Failure:
Update AGENTS.md (In Real-Time):
Priority Levels:
Example Pattern:
- Failure: Committed TODO comments in production code (violated "No Partial Features" rule)
- Detection: `grep -r "TODO" src/` before commit
- Rule Update: Add pre-commit check pattern to Boundaries section
- Priority: IMPORTANT
- Action Taken: Proposed rule update to user mid-session, updated AGENTS.md
Proactive Analysis:
```
arbiter/
├── arbiter_ai/
│   ├── api.py        # Public API (evaluate, compare)
│   ├── core/         # Infrastructure (llm_client, middleware, monitoring, registry, cost_calculator)
│   ├── evaluators/   # Semantic, CustomCriteria, Pairwise, Factuality, Groundedness, Relevance
│   ├── storage/      # Storage backends (PostgreSQL, Redis)
│   └── verifiers/    # Claim verification (Search, Citation, KnowledgeBase)
├── examples/         # 25+ comprehensive examples
├── tests/            # Unit + integration tests (583 tests, 96% coverage)
└── pyproject.toml    # Dependencies and config
```
The cost calculator uses LiteLLM's bundled pricing database (same source as Conduit):
```python
from arbiter_ai import get_cost_calculator

calc = get_cost_calculator()
cost = calc.calculate_cost("gpt-4o-mini", input_tokens=1000, output_tokens=500)
print(f"Cost: ${cost:.6f}")  # Cost: $0.000450
```
To update pricing:
```bash
uv lock --upgrade-package litellm
```
All evaluators extend BasePydanticEvaluator and implement the name property plus 4 methods:
```python
from typing import Optional, Type, cast

from pydantic import BaseModel

# BasePydanticEvaluator and Score come from Arbiter; MyEvaluatorResponse is sketched below.


class MyEvaluator(BasePydanticEvaluator):
    @property
    def name(self) -> str:
        return "my_evaluator"

    def _get_system_prompt(self) -> str:
        return "You are an expert evaluator..."

    def _get_user_prompt(self, output: str, reference: Optional[str], criteria: Optional[str]) -> str:
        return f"Evaluate '{output}' against: {criteria}"

    def _get_response_type(self) -> Type[BaseModel]:
        return MyEvaluatorResponse  # Pydantic model

    async def _compute_score(self, response: BaseModel) -> Score:
        resp = cast(MyEvaluatorResponse, response)
        return Score(
            name=self.name,
            value=resp.score,
            confidence=resp.confidence,
            explanation=resp.explanation,
        )
```
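MyEvaluatorResponse is referenced above but not defined in this guide. A minimal sketch of what such a response model could look like, assuming only the score, confidence, and explanation fields read in _compute_score plus the ConfigDict conventions described below; the real Arbiter models may differ:

```python
from pydantic import BaseModel, ConfigDict, Field


class MyEvaluatorResponse(BaseModel):
    """Hypothetical structured response parsed from the evaluator's LLM call."""

    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    score: float = Field(..., ge=0.0, le=1.0, description="Evaluation score in [0, 1]")
    confidence: float = Field(..., ge=0.0, le=1.0, description="Confidence in the score")
    explanation: str = Field(..., min_length=1, description="Why this score was assigned")
```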
Must work with ANY LLM provider (OpenAI, Anthropic, Google, Groq, Mistral, Cohere).
```python
# GOOD
client = await LLMManager.get_client(provider="anthropic", model="claude-3-5-sonnet")

# BAD
from openai import OpenAI
client = OpenAI()  # Hardcoded to OpenAI
```
All functions require type hints, no Any without justification.
Production-grade code only. Complete implementations or nothing.
If you start, you finish:
- Exports in __init__.py

All evaluators use PydanticAI for type-safe LLM responses.
All Pydantic models follow strict validation patterns:
ConfigDict (Required for ALL models):
```python
from pydantic import BaseModel, ConfigDict, Field, field_validator, model_validator


class MyModel(BaseModel):
    model_config = ConfigDict(
        extra="forbid",             # Reject unknown fields (no backward compatibility)
        str_strip_whitespace=True,  # Auto-strip whitespace from strings
        validate_assignment=True,   # Validate on attribute changes
    )
```
Computed Fields (for serializable properties):
```python
from pydantic import computed_field


@computed_field  # type: ignore[prop-decorator]
@property
def total_tokens(self) -> int:
    """Total tokens across input and output."""
    return self.input_tokens + self.output_tokens
```
Field Validators (validate individual fields):
```python
@field_validator("explanation")
@classmethod
def validate_explanation_quality(cls, v: str) -> str:
    """Ensure explanation is meaningful and not empty."""
    if len(v.strip()) < 1:
        raise ValueError("Explanation cannot be empty")
    return v.strip()
```
Model Validators (cross-field validation):
```python
@model_validator(mode="after")
def validate_high_confidence_scores(self) -> "MyModel":
    """Ensure high-confidence scores have supporting details."""
    if self.confidence > 0.9:
        if not self.supporting_details:
            raise ValueError(
                "High confidence scores (>0.9) require supporting details"
            )
    return self
```
Testing with Validators:
- Detection: New evaluator doesn't implement all 4 required methods
- Prevention: Copy existing evaluator (semantic.py) as template
- Fix: Implement name, _get_system_prompt, _get_user_prompt, _get_response_type, _compute_score
- Why It Matters: Template pattern ensures all evaluators work consistently
- Detection: from openai import OpenAI in evaluator code
- Prevention: Use LLMManager.get_client() for provider abstraction
- Fix: Replace direct provider imports with LLMManager
- Why It Matters: Provider-agnostic design is a core feature
- Detection: New evaluator not importable from arbiter
- Prevention: Add to __init__.py exports in both evaluators/ and root
- Fix: Add from .my_evaluator import MyEvaluator and update __all__
- Why It Matters: Users can't use an evaluator that isn't exported
- Detection: Functions without Args/Returns/Example sections
- Prevention: Write docstring before implementation
- Fix: Add complete docstring with all sections
- Why It Matters: Docstrings are user documentation
Any Type:
- Detection: grep -r "from typing import Any" arbiter/
- Prevention: Use specific Pydantic models for type safety
- Fix: Create a Pydantic model for the response structure (see the MyEvaluatorResponse sketch above)
- Why It Matters: Type safety prevents bugs
- Detection: New evaluator without an example in examples/
- Prevention: Create an example file showing usage
- Fix: Add examples/my_evaluator_example.py (see the sketch below)
- Why It Matters: Examples are how users learn the API
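A minimal sketch of such an example file, assuming the arbiter_ai import path from the repository layout and using only the documented evaluate(output, reference) API and result.overall_score; how a specific evaluator gets wired in is project-specific and not shown:

```python
# examples/my_evaluator_example.py (hypothetical sketch)
import asyncio

from arbiter_ai import evaluate  # package name per the repository layout above


async def main() -> None:
    # Uses only the documented public API: evaluate(output, reference)
    result = await evaluate(
        output="Paris",
        reference="Paris is the capital of France",
    )
    print(f"Overall score: {result.overall_score}")


if __name__ == "__main__":
    asyncio.run(main())
```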
- Detection: make test shows coverage <80%
- Prevention: Write tests as you code
- Fix: Add unit tests until coverage >80%
- Why It Matters: Untested evaluators will break
- Detection: ValidationError in tests, changing extra="forbid" to extra="ignore", or removing validators
- Prevention: Write realistic test data that would pass production validation
- Fix: Update test fixtures with valid data (non-empty explanations, supporting details for high confidence); see the sketch below
- Why It Matters: Validators enforce data quality; weakening them hides bugs
- Rule: Fix test data, never fix validators
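A minimal sketch of this rule in practice, assuming a compact hypothetical MyModel that combines the explanation and high-confidence validators shown earlier; field names are illustrative, not the real Arbiter models:

```python
import pytest
from pydantic import BaseModel, ConfigDict, Field, ValidationError, field_validator, model_validator


class MyModel(BaseModel):
    """Compact, hypothetical version of the MyModel used in the validator examples."""

    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    explanation: str
    confidence: float = Field(..., ge=0.0, le=1.0)
    supporting_details: list[str] = Field(default_factory=list)

    @field_validator("explanation")
    @classmethod
    def validate_explanation_quality(cls, v: str) -> str:
        if len(v.strip()) < 1:
            raise ValueError("Explanation cannot be empty")
        return v.strip()

    @model_validator(mode="after")
    def validate_high_confidence_scores(self) -> "MyModel":
        if self.confidence > 0.9 and not self.supporting_details:
            raise ValueError("High confidence scores (>0.9) require supporting details")
        return self


def test_high_confidence_fixture_uses_realistic_data():
    # GOOD: fix the fixture by giving it valid, production-like data
    model = MyModel(
        explanation="Output matches the reference city exactly.",
        confidence=0.95,
        supporting_details=["Exact string match on 'Paris'"],
    )
    assert model.confidence > 0.9


def test_validator_still_rejects_incomplete_data():
    # The validator is doing its job; don't loosen extra="forbid" or delete it
    with pytest.raises(ValidationError):
        MyModel(explanation="Looks fine.", confidence=0.95)
```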
When to Mock:
- datetime.now()

When to Use Real Dependencies:
Example:
```python
# ✅ GOOD - Mock LLM call
@pytest.mark.asyncio
async def test_semantic_evaluator_mocked(mocker):
    mocker.patch("arbiter.core.llm_client.LLMManager.get_client")
    evaluator = SemanticEvaluator()
    # Test logic without hitting real API

# ✅ GOOD - Real Pydantic validation
def test_score_validation():
    score = Score(name="test", value=0.95, confidence=0.9)
    assert score.value == 0.95  # Real validation

# ❌ BAD - Using real API in tests
async def test_evaluator():
    result = await evaluate("test", model="gpt-4")  # Costs money!
```
- Check git status and git branch
- Create a feature branch: git checkout -b feature/my-feature
- Run make test frequently

Pre-Commit Validation (Run ALL these checks):
```bash
# 1. Format code
make format

# 2. Lint code
make lint
if [ $? -ne 0 ]; then echo "🚨 LINT ERRORS - FIX BEFORE COMMIT"; exit 1; fi

# 3. Type check
make type-check
if [ $? -ne 0 ]; then echo "🚨 TYPE ERRORS - FIX BEFORE COMMIT"; exit 1; fi

# 4. Run tests with coverage
make test
if [ $? -ne 0 ]; then echo "🚨 TESTS FAILED OR COVERAGE <80%"; exit 1; fi

# 5. No TODOs or placeholders
grep -r "TODO\|FIXME\|NotImplementedError" arbiter/ && echo "🚨 REMOVE TODOs" && exit 1

# 6. No credentials
grep -r "API_KEY\|SECRET\|PASSWORD" arbiter/ tests/ examples/ && echo "🚨 CREDENTIALS FOUND" && exit 1

# 7. Verify exports
python -c "from arbiter import *; print('✅ All exports work')" || echo "🚨 EXPORT ERROR"

# All checks passed
echo "✅ All checks passed - ready to commit"
git add <files>
git commit -m "Clear message"
```
- Add an example in examples/ if user-facing

```
# 1. Create evaluator file
touch arbiter/evaluators/my_evaluator.py

# 2. Implement template methods (see Critical Rules #1)

# 3. Export in arbiter/evaluators/__init__.py
from .my_evaluator import MyEvaluator
__all__ = [..., "MyEvaluator"]

# 4. Export in arbiter/__init__.py
from .evaluators import MyEvaluator
__all__ = [..., "MyEvaluator"]

# 5. Write tests
touch tests/unit/test_my_evaluator.py

# 6. Add example
touch examples/my_evaluator_example.py
```
```bash
make test                               # Run all tests with coverage (requires >80% coverage to pass)
pytest tests/unit/                      # Run unit tests only (fast, mocked dependencies)
pytest -v                               # Run all tests with verbose output (shows test names and results)
make test-cov                           # Generate detailed coverage report with missing lines
pytest tests/unit/test_semantic.py -v   # Run specific test file with verbose output
pytest -k "test_evaluate" -v            # Run tests matching pattern "test_evaluate"
```
```python
async def evaluate(output: str, reference: Optional[str] = None) -> EvaluationResult:
    """Evaluate LLM output against reference or criteria.

    Args:
        output: The LLM output to evaluate
        reference: Optional reference text for comparison

    Returns:
        EvaluationResult with scores, metrics, and interactions

    Raises:
        ValidationError: If output is empty
        EvaluatorError: If evaluation fails

    Example:
        >>> result = await evaluate(output="Paris", reference="Paris is the capital of France")
        >>> print(result.overall_score)
        0.92
    """
```
```bash
make test        # Run pytest with coverage (requires >80%, shows missing lines)
make type-check  # Run mypy in strict mode (all functions must have type hints)
make lint        # Run ruff linter (checks code style and potential bugs)
make format      # Run black formatter (line length 88, modifies files in place)
make all         # Run all checks in order: format → lint → type-check → test
```
Check evaluators/semantic.py (reference implementation) or README.md (user docs)