This file provides context for AI coding agents (Codex, etc.) when working on this repository.
ARC-AGI benchmarking framework for evaluating LLMs on ARC (Abstraction and Reasoning Corpus) pattern recognition tasks. Supports multiple providers (OpenAI, Anthropic, Gemini, Fireworks, Grok, etc.) with built-in rate limiting, retries, and scoring.
Common development commands (via the Makefile):

```bash
make install       # Install package in editable mode
make test          # Run all tests
make test-verbose  # Run tests with verbose output
make run-sample    # Run random baseline on a single sample task
make run-batch     # Run random baseline on all sample tasks
make score         # Score submissions against ground truth
make clean         # Remove generated files and caches
```
Or manually:
```bash
# Install
pip install -e .

# Run tests
pytest -q   # Quick
pytest -v   # Verbose
pytest -x   # Stop on first failure

# Run sample benchmark (no API key needed)
python main.py --data_dir data/sample/tasks --config random-baseline --task_id 66e6c45b --save_submission_dir submissions/test

# Run batch benchmark
python cli/run_all.py --data_dir data/sample/tasks --config random-baseline --save_submission_dir submissions/test

# Score submissions
python src/arc_agi_benchmarking/scoring/scoring.py --task_dir data/sample/tasks --submission_dir submissions/test

# View task visually
python -m arc_agi_benchmarking.utils --task data/sample/tasks/66e6c45b.json
```
Repository layout:

```
├── main.py                      # Single-task runner
├── cli/
│   ├── run_all.py               # Batch runner (main entry point for benchmarks)
│   └── submission_cli.py        # Submission management CLI
├── src/arc_agi_benchmarking/
│   ├── adapters/                # Provider adapters (one per API)
│   │   ├── provider.py          # Base ProviderAdapter interface
│   │   ├── openai_base.py       # Shared OpenAI-compatible base class
│   │   ├── open_ai.py           # OpenAI adapter
│   │   ├── anthropic.py         # Anthropic adapter
│   │   ├── gemini.py            # Google Gemini adapter
│   │   └── ...                  # Other providers
│   ├── scoring/
│   │   └── scoring.py           # Submission scoring logic
│   ├── utils/
│   │   ├── preflight.py         # Pre-run validation & cost estimation
│   │   ├── viewer.py            # Terminal task visualization
│   │   ├── rate_limiter.py      # API rate limiting
│   │   └── metrics.py           # Metrics collection
│   ├── tests/                   # All tests live here
│   └── models.yml               # Model configurations
├── data/
│   ├── sample/tasks/            # Sample ARC tasks for testing
│   └── v2/                      # Full evaluation set (clone separately)
└── provider_config.yml          # Rate limits per provider
```
Key configuration files:

- `src/arc_agi_benchmarking/models.yml`: model configurations, including pricing, token limits, and temperature (a hypothetical entry is sketched below)
- `.env`: API keys (see `.env.example` for the expected variables)
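A minimal sketch of what a `models.yml` entry might look like, based only on the pricing/token/temperature settings mentioned above; the field names and units are illustrative assumptions, so check `models.yml` itself for the actual schema.

```yaml
# Hypothetical models.yml entry -- field names are assumptions, not the verified schema
models:
  - name: gpt-4o-example        # config name, assumed to match the --config flag
    provider: openai            # which adapter handles requests
    pricing:
      input: 2.50               # assumed USD per 1M input tokens
      output: 10.00             # assumed USD per 1M output tokens
    kwargs:
      max_tokens: 4096
      temperature: 0.0
```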
All providers implement the `ProviderAdapter` interface from `adapters/provider.py`:

- `get_response(system_prompt, user_prompt) -> str`: main API call
- OpenAI-compatible providers extend `OpenAIBaseAdapter` for shared logic

To add a new provider (a minimal adapter sketch follows this list):

- Create `adapters/<provider>.py` extending `ProviderAdapter` or `OpenAIBaseAdapter`
- Register it in `adapters/__init__.py`
- Update `main.py` to recognize the provider
- Add its model configuration to `models.yml`
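A minimal sketch of a new adapter, assuming only the `get_response(system_prompt, user_prompt) -> str` contract described above; the provider name, endpoint, HTTP client choice, and response parsing are all placeholders, and any constructor arguments or extra abstract methods required by `adapters/provider.py` are not shown.

```python
# adapters/my_provider.py -- illustrative sketch, not the repository's actual adapter code
import os

import requests  # hypothetical HTTP client choice for this example

from .provider import ProviderAdapter  # base interface per adapters/provider.py


class MyProviderAdapter(ProviderAdapter):
    """Hypothetical adapter for a provider called 'my_provider'."""

    def get_response(self, system_prompt: str, user_prompt: str) -> str:
        # Assumed REST shape; a real adapter follows the provider's API docs
        resp = requests.post(
            "https://api.example.com/v1/chat",  # placeholder endpoint
            headers={"Authorization": f"Bearer {os.environ['MY_PROVIDER_API_KEY']}"},
            json={
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt},
                ]
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```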
ARC tasks are JSON with:

- `train`: list of input/output grid pairs (examples)
- `test`: list of test cases with input grids (and expected outputs for scoring)

Grids are 2D arrays of integers 0-9 representing colors.
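For orientation, a small Python sketch that loads the bundled sample task used in the commands above and prints its structure; it assumes the standard ARC `input`/`output` keys inside each pair, which is worth verifying against the sample files.

```python
import json

# Load the bundled sample task referenced in the commands above
with open("data/sample/tasks/66e6c45b.json") as f:
    task = json.load(f)

# Each train pair has "input" and "output" grids (2D lists of ints 0-9)
for i, pair in enumerate(task["train"]):
    print(f"train[{i}]: {len(pair['input'])}x{len(pair['input'][0])} -> "
          f"{len(pair['output'])}x{len(pair['output'][0])}")

# Test cases always carry an "input"; expected outputs may be present for scoring
for i, case in enumerate(task["test"]):
    print(f"test[{i}]: input {len(case['input'])}x{len(case['input'][0])}, "
          f"has expected output: {'output' in case}")
```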
Tests are in `src/arc_agi_benchmarking/tests/`. Run them with pytest:
```bash
pytest                           # All tests
pytest tests/test_preflight.py   # Specific file
pytest -k "test_cost"            # By name pattern
```
Required API keys (set in `.env`; a placeholder sketch follows the notes below):
- `OPENAI_API_KEY` - OpenAI
- `ANTHROPIC_API_KEY` - Anthropic
- `GOOGLE_API_KEY` - Google Gemini
- `XAI_API_KEY` - X.AI / Grok
- `FIREWORKS_API_KEY` - Fireworks
- `GROQ_API_KEY` - Groq
- `OPENROUTER_API_KEY` - OpenRouter
- `HUGGING_FACE_API_KEY` - HuggingFace

Additional notes:

- Use `python -m pytest` for running tests
- Pre-run preflight validation can be skipped with `--skip-preflight`
- Submissions use `attempt_1`, `attempt_2` keys per task
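For reference, a placeholder `.env` using the variable names above; the values are dummies, and `.env.example` remains the authoritative template.

```
# .env -- placeholder values only; copy from .env.example and fill in real keys
OPENAI_API_KEY=sk-your-key-here
ANTHROPIC_API_KEY=your-key-here
GOOGLE_API_KEY=your-key-here
XAI_API_KEY=your-key-here
FIREWORKS_API_KEY=your-key-here
GROQ_API_KEY=your-key-here
OPENROUTER_API_KEY=your-key-here
HUGGING_FACE_API_KEY=your-key-here
```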