# Markdown Converter

Agent skill for markdown-converter
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This is a REST API service that provides OpenAI-compatible endpoints for serving Large Language Models (LLMs). It supports both llama.cpp (GGUF format) and MLX backends for model inference.
```bash
# Initial setup (creates .venv, installs dependencies)
./setup.sh

# Activate virtual environment
source .venv/bin/activate
```
```bash
# Start the service (recommended - handles venv activation)
./start_service.sh

# Start with development reload (activate venv first)
uvicorn server:app --reload

# Start with custom host/port
PORT=8080 HOST=127.0.0.1 python server.py
```
```bash
# Run API tests
python test_api.py

# Test specific endpoint manually
curl http://localhost:8000/v1/models
```
```bash
# Copy and edit configuration
cp .env.example .env
```
- `server.py`: FastAPI application implementing OpenAI-compatible endpoints
  - `/v1/chat/completions` - Chat completions with streaming support
  - `/v1/completions` - Text completions (legacy format)
  - `/v1/embeddings` - Generate embeddings with BGE-M3 dense/sparse support
  - `/v1/rerank` - Rerank documents using cross-encoder models
  - `/v1/models` - List available models
- `model_manager.py`: Handles model loading and inference
- `embedders.py`: Modular embedding implementations
- `rerankers.py`: Modular reranker implementations
- `cache.py`: Response caching with optional disk persistence
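As a sketch, a minimal client for the chat endpoint above could look like the following. The base URL matches the `curl` example elsewhere in this file; the `"light"` model alias is an assumption drawn from the tier names, so adjust both to your deployment:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local default


def build_chat_request(prompt: str, model: str = "light", stream: bool = False) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": model,  # tier alias -- an assumption, check your .env
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }


def chat(prompt: str, model: str = "light") -> str:
    """POST the payload and return the first choice's message content."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the endpoints follow the OpenAI wire format, official OpenAI client libraries pointed at `BASE_URL` should also work.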
The service uses a five-tier model system configured in .env:
- `LIGHT_MODEL_*`: For fast, simple tasks
- `MEDIUM_MODEL_*`: For balanced performance
- `HEAVY_MODEL_*`: For complex reasoning tasks
- `EMBEDDING_MODEL_*`: For embedding models (BGE-M3, etc.)
- `RERANKER_MODEL_*`: For reranking models (BGE-reranker, etc.)

Each tier has its own independent settings.
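A hypothetical `.env` fragment for two generation tiers might look like this. Only the `*_MODEL_PATH`, `*_BACKEND`, and `*_N_CTX` suffixes appear in the embedding and reranker examples in this file; everything else here is an assumption, so treat `.env.example` as the authoritative list of variables:

```bash
# Hypothetical tier configuration -- verify variable names against .env.example
LIGHT_MODEL_PATH=/path/to/light-model.gguf
LIGHT_BACKEND=llamacpp
LIGHT_N_CTX=4096

HEAVY_MODEL_PATH=/path/to/heavy-model
HEAVY_BACKEND=mlx
HEAVY_N_CTX=16384
```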
To use BGE-M3 or other embedding models:
```bash
EMBEDDING_MODEL_PATH=/path/to/bge-m3-Q8_0.gguf
EMBEDDING_BACKEND=llamacpp
EMBEDDING_EMBEDDING_MODE=true
EMBEDDING_EMBEDDING_DIMENSION=1024
EMBEDDING_N_CTX=8192
```
The service supports BGE-M3's advanced embedding capabilities:
Use the `/v1/embeddings` endpoint with:
```json
{
  "model": "embedding",
  "input": "Your text here",
  "embedding_type": "dense",  // or "sparse", "colbert"
  "return_sparse": true       // Return both dense and sparse
}
```
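A minimal Python sketch that posts the request body above to a local instance (the base URL is an assumption; `urllib` keeps it dependency-free):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local default


def build_embedding_request(text: str, embedding_type: str = "dense",
                            return_sparse: bool = False) -> dict:
    """Build the /v1/embeddings payload documented above."""
    return {
        "model": "embedding",
        "input": text,
        "embedding_type": embedding_type,  # "dense", "sparse", or "colbert"
        "return_sparse": return_sparse,
    }


def embed(text: str, **kwargs) -> dict:
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/embeddings",
        data=json.dumps(build_embedding_request(text, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```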
To use BGE-reranker or other cross-encoder models:
```bash
RERANKER_MODEL_PATH=/path/to/bge-reranker-v2-m3-Q8_0.gguf
RERANKER_BACKEND=llamacpp
RERANKER_N_CTX=512
RERANKER_TEMPERATURE=0.01
```
The structured output feature for MLX models uses the Outlines library for constrained JSON generation:
- Uses `outlines.models.from_mlxlm()` to wrap MLX models
- Uses `outlines.generate.json()` with a flexible schema for JSON object generation
- Requires the full model path (e.g. `/path/to/mlx-community/gemma-3-4b-it-qat-4bit`), not just metadata
- When using `response_format: {"type": "json_object"}`, be explicit about the JSON structure you want
- The model may return `{}` if the prompt doesn't clearly specify what fields to include
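Putting the guidance above into practice, a sketch of a JSON-mode request payload; the `"heavy"` model alias and the example field names (`answer`, `confidence`) are assumptions, while `response_format` follows the OpenAI convention the endpoint mirrors:

```python
def build_json_mode_request(prompt: str, model: str = "heavy") -> dict:
    """Chat-completions payload requesting constrained JSON output.

    Per the notes above, the prompt should spell out the exact fields
    wanted; otherwise the model may fall back to an empty object.
    """
    return {
        "model": model,  # tier alias -- assumed, check your .env
        "messages": [{
            "role": "user",
            # Be explicit about the JSON structure you want:
            "content": (
                f"{prompt}\n"
                'Respond as JSON with keys "answer" and "confidence".'
            ),
        }],
        "response_format": {"type": "json_object"},
    }
```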