Markdown Converter
Agent skill for markdown-converter
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This is a SkyRL-based reinforcement learning implementation for training language models on multi-turn tool use with MCP (Model Context Protocol) servers. The project uses Group Relative Policy Optimization (GRPO) to fine-tune models for real environment rollouts with actual tool execution.
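For intuition, GRPO scores each rollout relative to the other rollouts sampled from the same prompt: rewards are normalized against the group's mean and standard deviation to form advantages. The snippet below is a minimal sketch of that group-relative computation, not the project's trainer code.

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Illustrative group-relative advantage: normalize each rollout's reward
    against the mean/std of its own group (all rollouts share one prompt)."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# Example: four rollouts for one prompt, scored by the environment reward.
rewards = torch.tensor([0.0, 1.0, 0.5, 1.0])
print(grpo_advantages(rewards))  # above-average rollouts get positive advantage
```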
Install dependencies:
pip install -r requirements.txt
Setup validation:
python training/validate_setup.py
vLLM GPU Training (recommended for speed):
ENABLE_VLLM=true VLLM_MAX_MODEL_LEN=4096 VLLM_GPU_MEMORY_UTILIZATION=0.3 ./training/scripts/launch_real_env_gpu_vllm.sh
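For reference, the environment variables above correspond roughly to vLLM engine arguments as sketched below; the model name is a placeholder and the launch script's exact wiring may differ.

```python
import os
from vllm import LLM, SamplingParams

# Rough mapping of the launch-script environment variables onto vLLM engine args.
# The model name is a placeholder; the script selects the actual checkpoint.
llm = LLM(
    model="Qwen/Qwen3-0.6B",
    max_model_len=int(os.environ.get("VLLM_MAX_MODEL_LEN", "4096")),
    gpu_memory_utilization=float(os.environ.get("VLLM_GPU_MEMORY_UTILIZATION", "0.3")),
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```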
Standard GPU Training:
./training/scripts/launch_real_env_training.sh
CPU Training (for debugging/development):
./training/scripts/launch_real_env_cpu.sh
LoRA Training (memory-efficient single GPU):
./training/scripts/launch_qwen3_training.sh
Multi-GPU Training (DeepSpeed):
./training/scripts/launch_distributed.sh
Run comprehensive tests:
python training/tests/run_comprehensive_tests.py
Test real tool execution:
python training/scripts/test_real_tool_execution.py
Test MCP integration:
python training/tests/test_mcp_integration.py
Smoke test (quick validation):
python training/tests/smoke_test.py
Debug with detailed logging:
export PYTHONPATH="$(pwd):$(pwd)/.."
export CUDA_LAUNCH_BLOCKING=1
python training/scripts/train_qwen3_grpo_real_env.py --debug
Debug vLLM training:
ENABLE_VLLM=true VLLM_MAX_MODEL_LEN=4096 VLLM_GPU_MEMORY_UTILIZATION=0.3 timeout 300 ./training/scripts/launch_real_env_gpu_vllm.sh
Test individual MCP servers:
cd mcp_tools/limited && python fmp_limited_server.py
Test critical fixes:
python test_critical_fixes.py
Test latest critical fixes (V2):
python test_critical_fixes_v2.py
Monitor GPU usage:
./monitor_gpu.sh
Watch training logs:
tail -f outputs/real-env-grpo-*/training.log
The training system follows this flow:
- training/core/qwen_policy_with_value.py - Manages the Qwen model with value head
- environments/mcp_tool_environment.py - Handles real MCP tool execution
- training/data/trajectory_collector.py - Collects parallel rollouts
- training/core/grpo_trainer_gradient_fix.py - Implements the RL training loop
- environments/simple_shared_manager.py - Manages MCP server connections

Key locations and configuration:

- MCP servers live in ../mcp_tools/limited/ relative to this directory
- API keys are read from a .env file
- Files in training/configs/ control training hyperparameters
- Training data is loaded from data/processed/train.json or data/inputs/train.json
- training/configs/training_config_qwen3_0.6b.yaml - Main training hyperparameters
- training/configs/grpo_config_fixed.yaml - GRPO algorithm settings
- training/configs/deepspeed_config.json - Multi-GPU training configuration
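Conceptually these pieces compose into a rollout-then-update loop. The sketch below is illustrative only: the argument names and loop wiring are assumptions, though trainer.train_step() is the method referenced later in this document.

```python
# Illustrative wiring of the components listed above (names are assumptions,
# not the repository's exact API; only trainer.train_step() is cited elsewhere here).
def training_loop(policy, environment, collector, trainer, num_steps: int) -> None:
    for step in range(num_steps):
        # 1. Collect parallel rollouts, executing real MCP tools in the environment.
        trajectories = collector.collect(policy, environment)
        # 2. Run one GRPO update on the collected group of trajectories.
        metrics = trainer.train_step(trajectories)
        # 3. Surface rollout and training metrics (e.g. std_ratio) for monitoring.
        print(f"step={step}: {metrics}")
```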
Before running training:

git clone https://github.com/Sky-T/SkyRL.git
cd SkyRL && pip install -e .
cd skyagent && pip install -e .  # For multi-GPU support
pip install -r requirements.txt
Create a .env file with your API keys:
# Required API Keys
OPENAI_API_KEY=sk-your-openai-key-here
POLYGON_API_KEY=your-polygon-api-key
FMP_API_KEY=your-fmp-api-key
TAVILY_API_KEY=tvly-your-tavily-key
SLACK_BOT_TOKEN=xoxb-your-slack-bot-token
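These keys are typically loaded into the process environment at startup; a minimal sketch using python-dotenv (an assumption — the project's actual loading code may differ):

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

# Load API keys from the .env file in the working directory.
load_dotenv()

# Fail fast if a required key is missing.
openai_key = os.environ["OPENAI_API_KEY"]
fmp_key = os.environ["FMP_API_KEY"]
```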
Ensure the MCP servers exist in ../mcp_tools/limited/ and place training data at data/processed/train.json or data/inputs/train.json.

MPS memory issues on macOS:
export DEVICE_TYPE="cpu"
export DISABLE_BITSANDBYTES=1
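A hedged sketch of how such flags might be consumed when picking a device; the helper name is hypothetical and the project's actual handling of DEVICE_TYPE and DISABLE_BITSANDBYTES may differ.

```python
import os
import torch

def resolve_device() -> str:
    """Hypothetical helper: honor DEVICE_TYPE if set, otherwise pick the best backend."""
    forced = os.environ.get("DEVICE_TYPE")
    if forced:
        return forced
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

# bitsandbytes quantization can be switched off entirely via the flag above.
use_bitsandbytes = os.environ.get("DISABLE_BITSANDBYTES", "0") != "1"
print(resolve_device(), use_bitsandbytes)
```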
CUDA out of memory:
BitsAndBytes issues:
export DISABLE_BITSANDBYTES=1
SkyRL import errors:
# Ensure SkyRL is properly installed
cd /path/to/SkyRL && pip install -e .
export PYTHONPATH="${PYTHONPATH}:$(pwd):$(pwd)/.."
MCP server connection failures:
Training instabilities:
Enable verbose logging:
export PYTHONPATH="$(pwd):$(pwd)/.."
export CUDA_LAUNCH_BLOCKING=1  # For CUDA debugging
python training/scripts/train_qwen3_grpo_real_env.py --debug
Performance profiling:
python training/tests/memory_profile.py
Root Issue: PPO ratios stuck at 1.0 (degenerate), preventing policy learning for over a month.
Final Solution: Parameter perturbation in grpo_trainer_gradient_fix.py:180-190:
# Add tiny noise to LoRA weights to break degeneracy
for param in trainable_params[:10]:
    if param.requires_grad and param.numel() > 0:
        param.add_(torch.randn_like(param) * 1e-6)
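The PPO ratio is exp(new_logprob − old_logprob), so when update-time logprobs are bitwise identical to sample-time logprobs every ratio is exactly 1.0 and there is no spread for the clipped objective to use. A small standalone illustration, not the trainer's code:

```python
import torch

old_logprobs = torch.tensor([-1.2, -0.7, -2.3])

# Degenerate case: recomputed logprobs identical to sample-time logprobs.
ratios = torch.exp(old_logprobs - old_logprobs)
print(ratios.std())  # tensor(0.) -> the "std_ratio: 0.000000" symptom

# After a tiny perturbation the recomputed logprobs differ slightly,
# so the ratios spread around 1.0 and the policy can actually move.
new_logprobs = old_logprobs + torch.randn(3) * 1e-2
ratios = torch.exp(new_logprobs - old_logprobs)
print(ratios.std())  # small but non-zero, matching a healthy std_ratio
```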
Expected Results After Fix:
Before: std_ratio: 0.000000 (no learning)
After: std_ratio: 0.00587 (proper learning)

Phase 1 - Basic Training Flow:
- Set rl_mode=True and force_rate=0.0 to prevent off-policy contamination during RL training
- Verify that trainer.train_step() is called and returns valid metrics

Phase 2 - Tool Use & Training Stability:

6. Tool Call Argument Aliasing: Added normalization for common argument aliases (e.g., python_code → code) to fix "No tool calls found" issues (see the sketch after this list)
7. Constrained Prompting: Simplified system prompt to generate tool calls immediately without <think> blocks
8. Optimized Generation: Lower temperature (0.2), shorter responses (512 tokens), additional stop sequences, repetition penalty
9. Reference Policy Sync Filter: Exclude bitsandbytes quantization buffers from reference policy updates
10. Rollout Metrics Logging: Always log environment metrics before training step to ensure WandB visibility
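A sketch of the argument aliasing mentioned in item 6; the alias table and function name below are illustrative, not the repository's exact mapping.

```python
# Illustrative alias table: map common model-emitted argument names onto the
# names the MCP tools actually expect. The exact entries in the repo may differ.
ARG_ALIASES = {
    "python_code": "code",
    "query_string": "query",
}

def normalize_tool_args(args: dict) -> dict:
    """Rename aliased argument keys so tool calls are recognized and executed."""
    return {ARG_ALIASES.get(key, key): value for key, value in args.items()}

print(normalize_tool_args({"python_code": "print(1 + 1)"}))  # {'code': 'print(1 + 1)'}
```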
- MCP servers: ../mcp_tools/limited/ (financial data, web search, Python execution, Slack)
- Standard GPU training logs: outputs/real-env-grpo-*/training.log
- vLLM training logs: outputs/real-env-grpo-vllm-*/training.log

Watch for these log patterns during training:
✅ Healthy (Fixed):
✅ Collected N sample-time logprobs: mean=X.X, std=Y.Y
✅ Added tiny perturbation to 10 parameters to break PPO degeneracy
PPO Ratio Check - mean: 1.000, std: 0.006, count: N
'std_ratio': 0.00586725166067481
❌ Broken (Will halt training):
❌ CRITICAL: PPO ratios are degenerate! std=0.000000
RuntimeError: PPO ratios are degenerate. Training cannot proceed.
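The halt corresponds to a guard along these lines (a hedged sketch of the check, not the trainer's literal code):

```python
import torch

def assert_ratios_not_degenerate(ratios: torch.Tensor, min_std: float = 1e-8) -> None:
    """Abort the training step if every PPO ratio is identical (no learning signal)."""
    std = ratios.std().item()
    if std < min_std:
        raise RuntimeError("PPO ratios are degenerate. Training cannot proceed.")
```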
GPU memory pressure during vLLM training can typically be relieved by lowering the VLLM_GPU_MEMORY_UTILIZATION setting.