Markdown Converter
Agent skill for markdown-converter
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This project uses Docker containers for development. The primary setup command is:
```bash
./setup.sh   # installs Docker, builds container, installs CLI hooks
```
For local development without Docker, use a Python virtual environment:
```bash
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate      # On macOS/Linux
# or: venv\Scripts\activate   # On Windows

# Install the package in development mode
pip install -e .

# Install test dependencies
pip install pytest pytest-asyncio
```
Useful container and environment commands:

```bash
docker exec -it $(docker ps -qf "name=aiagents") bash   # shell into the running container
devcontainer exec --workspace-folder . bash             # or use the devcontainer CLI
make dev       # or: make cpu
make dev-gpu   # requires CUDA host
make rebuild
make clean
```

| Task | Command |
|---|---|
| Lint/Format | `make lint` |
| Run tests | `pytest -q` |
| Launch JupyterLab | |
| Preview docs | |
| Run benchmarks | `python run_marble.py ...` (see below) |
To run experiments or custom training scripts:
```bash
# Activate virtual environment first
source venv/bin/activate

# Run your experiment script
python your_experiment.py
```
The benchmarking system uses a pluggable architecture in `benchmarks/core.py`.
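As a rough illustration only, a pluggable design like this usually pairs an abstract benchmark interface with a registry; the class and function names below are hypothetical and not taken from the actual `benchmarks/core.py`:

```python
# Hypothetical sketch of a pluggable benchmark architecture.
# Names are illustrative; see benchmarks/core.py for the real interface.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class BenchmarkResult:
    name: str
    metrics: dict = field(default_factory=dict)


class Benchmark(ABC):
    """Base class that each benchmark plugin implements."""

    name: str = "base"

    @abstractmethod
    def load_tasks(self, max_tasks: int | None = None) -> list:
        """Load tasks from the official dataset (never synthetic data)."""

    @abstractmethod
    def run(self, agent, tasks: list) -> BenchmarkResult:
        """Execute the tasks with the given agent and collect metrics."""


# Registry so new benchmarks can be plugged in by name.
BENCHMARK_REGISTRY: dict[str, type[Benchmark]] = {}


def register_benchmark(cls: type[Benchmark]) -> type[Benchmark]:
    BENCHMARK_REGISTRY[cls.name] = cls
    return cls
```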
Run benchmarks using the unified MARBLE runner:
```bash
# Quick MARBLE test (5 tasks, single vs APEX)
python run_marble.py --scenarios coding --max-tasks 5 --agent-types single apex_a2a

# Full MARBLE evaluation
python run_marble.py --scenarios coding reasoning --max-tasks 100 --agent-types single multi apex_a2a

# Resume interrupted benchmark
python run_marble.py --scenarios coding reasoning --max-tasks 100 --resume

# List available agent types
python run_marble.py --list-agents

# Compare agent types with baseline
python run_marble.py --compare apex_a2a --baseline single --scenarios coding

# Use specific LLM model
python run_marble.py --llm-model gpt-4 --llm-provider openai --scenarios reasoning
```
Key features of the MARBLE runner (reflected in the flags above):
- Scenario selection via `--scenarios` and task limits via `--max-tasks`
- Multiple agent types (`single`, `multi`, `apex_a2a`) via `--agent-types`, listable with `--list-agents`
- Resuming interrupted runs via `--resume`
- Baseline comparison via `--compare` / `--baseline`
- Configurable LLM model and provider via `--llm-model` / `--llm-provider`
Packaging is defined in `pyproject.toml` with setuptools. Weights & Biases logging lives in `utils/wandb_logger.py`. Large files are managed with DVC:
```bash
dvc add checkpoints/new_model.pt
git add checkpoints/new_model.pt.dvc
dvc push   # uploads to Google Drive
```
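To fetch those files on another machine or a fresh clone (assuming the Google Drive remote is already configured in `.dvc/config`):

```bash
dvc pull   # downloads DVC-tracked files (e.g. checkpoints/) from the remote
```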
Tests include both unit tests and notebook execution:
```bash
pytest -q            # run all tests
pytest tests/        # unit tests only
pytest notebooks/    # execute notebooks
```
Use the `--runslow` flag to include slow/long-running tests.
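For context, a `--runslow` flag is typically wired up in `conftest.py` using the standard pytest pattern shown below; this is a generic sketch, and the repository's actual `conftest.py` may differ:

```python
# conftest.py -- generic pytest pattern for a --runslow option
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--runslow", action="store_true", default=False, help="run slow tests"
    )


def pytest_configure(config):
    config.addinivalue_line("markers", "slow: mark test as slow to run")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return  # --runslow given: do not skip slow tests
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```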
Refer to APEX_Research_Design.txt (also available as APEX_Research_Design.pdf) for context on the design of this repository. Your tasks are listed in IMPLEMENTATION.md; use that file both to find tasks to complete now and to pick up tasks for the next section. Write very clean code, run the tests often to confirm everything works, and write good tests as well.
When creating design documents:
`.github/workflows/ci_cpu_gpu.yml` is used for CI/CD. These workflows must always pass.
Always update IMPLEMENTATION.md with your changes and keep it as current as possible.
Keep requirements.txt up to date, and create unit and integration tests for all significant features.
After each prompt, run an alignment check to make sure the right problem is being solved.
Ensure you are always executing commands in the virtual environment.
When working on tasks, I follow a pull request-based workflow:
- Create a branch: `git checkout -b username/task-description`
- Run tests (`pytest -q`) and linting (`make lint`) before committing
- Push the branch: `git push -u origin branch-name`
- The GitHub CLI (`gh`) needs to be installed and authenticated for PR creation
- After the PR is merged: `git checkout main && git pull`
- Use `gh pr comment` to acknowledge what was done

Example PR comment workflow:
```bash
# After addressing review comments and pushing changes
gh pr comment 17 --body "Addressed review comments:
- Consolidated TESTING_GUIDE.md and QUICK_START.md into README.md
- Deleted the separate markdown files as requested
- Organized content into logical sections
- All tests pass after changes"
```
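Putting the steps together, one full cycle might look like this (illustrative only; the branch name, commit message, and PR title are placeholders):

```bash
git checkout -b username/task-description
pytest -q && make lint                         # must pass before committing
git add -A && git commit -m "Implement task"
git push -u origin username/task-description
gh pr create --title "Implement task" --body "Summary of the change"   # requires authenticated gh
```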
When asked to consolidate code or reduce duplication:
Run `git diff --stat` to ensure you're actually simplifying.

```
# BAD: Created 800+ lines of test helpers to save 100 lines
#   fixtures.py (336 lines)
#   helpers/assertions.py (192 lines)
#   helpers/builders.py (318 lines)
#   Only used in 2-3 places each

# GOOD: Merged two 200-line services into one 400-line service
#       Deleted the originals after updating imports
#       Net reduction: ~200 lines with cleaner architecture
```
The goal is to make the codebase cognitively simpler, not to showcase abstraction skills. Every abstraction adds mental overhead - only add them when they provide significant value.
IMPORTANT: When implementing benchmarks, always use official datasets - never create synthetic or hallucinated data.
All major multi-agent benchmarks have been successfully integrated:
MARBLE: Full integration with official UIUC benchmark
TaskBench: Microsoft dataset via HuggingFace
AgentVerse: OpenBMB framework integration
When asked to implement or integrate a benchmark:
MARBLE (Multi-Agent Reasoning Benchmark): https://github.com/ulab-uiuc/MARBLE
6 official scenarios: coding, database, reasoning, research, werewolf, world
Installation instructions:
```bash
# Option 1: Install via pip (when available)
pip install marble-bench

# Option 2: Local installation
git clone https://github.com/ulab-uiuc/MARBLE.git
cd MARBLE
poetry install   # or pip install -e .

# Option 3: Use with path
# When running benchmarks, specify --marble-path /path/to/MARBLE
```
TaskBench: https://huggingface.co/datasets/microsoft/Taskbench
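For illustration, the TaskBench data can typically be pulled with the `datasets` library; the call below is a sketch, and the exact config/split names should be taken from the dataset card:

```python
# Illustrative only -- check https://huggingface.co/datasets/microsoft/Taskbench
# for the actual config and split names.
from datasets import load_dataset

# If this raises because the dataset defines multiple configs, re-run with the
# config name from the dataset card: load_dataset("microsoft/Taskbench", "<config>")
taskbench = load_dataset("microsoft/Taskbench")
print(taskbench)
```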
AgentVerse: https://github.com/OpenBMB/AgentVerse
Current Year: 2025 (July) - When searching for benchmarks or documentation, use recent years (2024-2025) in search queries to find the latest versions. The project has been under active development throughout 2025.
IMPORTANT: If LLM tasks seem unclear, check IMPLEMENTATION.md TODO section for current status and details.
The codebase is being refactored to use a unified LLM client architecture:
UnifiedLLMClient: Single interface for all LLM providers (Ollama, OpenAI, Anthropic, Mock)
Location: `src/ai_agents_research/llm/unified_client.py`

Current LLM Patterns Being Replaced:
Key Features:
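To make the intended shape concrete, a provider-agnostic client of this kind is often structured roughly as follows; this is a hypothetical sketch, not the contents of `unified_client.py`:

```python
# Hypothetical sketch of a unified LLM client; names are illustrative.
from abc import ABC, abstractmethod


class LLMProvider(ABC):
    """Common interface each backend (Ollama, OpenAI, Anthropic, Mock) implements."""

    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str: ...


class MockProvider(LLMProvider):
    """Deterministic provider for tests."""

    def complete(self, prompt: str, **kwargs) -> str:
        return f"[mock response to: {prompt[:40]}]"


class UnifiedLLMClient:
    """Single entry point that dispatches to whichever provider is configured."""

    def __init__(self, provider: LLMProvider):
        self._provider = provider

    def complete(self, prompt: str, **kwargs) -> str:
        return self._provider.complete(prompt, **kwargs)


# Call sites stay the same regardless of the underlying provider.
client = UnifiedLLMClient(MockProvider())
print(client.complete("Summarize the benchmark results"))
```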
All LLM integration tasks are tracked in IMPLEMENTATION.md under the "Current TODO Status" section, which includes the current status and details of each task.
If you encounter issues or need to understand the current state of LLM integration, always refer to IMPLEMENTATION.md first.
When completing any significant task, follow this cleanup checklist:
```bash
source venv/bin/activate
PYTHONPATH=/path/to/project/src:$PYTHONPATH pytest tests/ -q
```
After completing a task, clean up:
Before marking a task complete:
```bash
# Remove common temporary files
rm -f fix_*.py test_*.py mock_*.py *_temp.* *_old.*

# Find and remove empty directories
find . -type d -empty -delete

# Find potentially obsolete markdown files
find . -name "*PLAN*.md" -o -name "*STATUS*.md" -o -name "*SUMMARY*.md"

# Check for unused imports (requires flake8)
flake8 src/ benchmarks/ --select=F401,F841
```
Every task should leave the codebase in a better state than you found it. This means:
Do what has been asked; nothing more, nothing less. NEVER create files unless they're absolutely necessary for achieving your goal. ALWAYS prefer editing an existing file to creating a new one. NEVER proactively create documentation files (*.md) or README files. Only create documentation files if explicitly requested by the User. ALWAYS refactor and update existing documents or code rather than creating brand new ones when possible.
We MUST make official benchmarks work. We CANNOT skimp on benchmarks or create our own hallucinated/synthetic benchmarks. This is a hard line. We must find models that can actually complete the official benchmark tasks, not create easier alternatives.
CRITICAL: When working with benchmarks (MARBLE, SWE-bench, etc.):
Use subagents liberally to save context window, and delegate them tasks of reasonable scope. Always present a plan to me before moving on to implementation.
Prefer solutions that fix the root cause of a problem over patches or mitigations that will be painful to fix later. We want robust code that works well. Prefer updating existing code over creating brand-new files when possible.
Delete unnecessary or dead code to keep the codebase as clean as possible.