This is a sophisticated AI-powered terminal chatbot with multi-backend inference support designed for Windows on ARM with Snapdragon X Elite NPU acceleration. The project implements a three-phase intelligent pipeline (SelfAI) with fallback mechanisms, memory management, and agent-based task execution.
Key Purpose: Enable efficient local AI inference with automatic fallback from NPU hardware acceleration to CPU execution, all managed through a configuration-driven system with optional planning and merge phases.
The system implements a three-phase pipeline:
1. PLANNING PHASE (Ollama-based)
   - Accepts user goal/request
   - Generates DPPM plan (Distributed Planning Problem Model)
   - Creates subtasks with dependencies & merge strategy
2. EXECUTION PHASE (Multi-backend LLM inference)
   - Executes subtasks sequentially/parallel (per plan)
   - Uses AgentManager to route to specialized agents
   - Falls back between backends: AnythingLLM → QNN → CPU
   - Saves results and tracks status
3. MERGE PHASE (Result synthesis)
   - Collects all subtask outputs
   - Synthesizes into coherent final answer
   - Falls back gracefully with internal summary
The system supports three execution backends in priority order:
1. AnythingLLM (NPU) - Primary: Hardware-accelerated inference via Snapdragon X NPU (configured in `config.yaml` under `npu_provider`)
2. QNN (Qualcomm Neural Network) - Secondary: Direct NPU model execution (models in the `models/` directory)
3. CPU Fallback - Tertiary: Local CPU inference via llama-cpp-python
Automatic Failover: If AnythingLLM fails, system automatically tries QNN, then CPU.
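A minimal sketch of that failover order (illustrative only; the real chain lives in the execution dispatcher, and the backend objects and `generate()` method here are hypothetical stand-ins for the project's interface classes):

```python
# Illustrative failover sketch: try each backend in priority order.
# The backend objects and generate() method are stand-ins, not the project's API.
def generate_with_fallback(prompt: str, backends: list) -> str:
    errors = []
    for backend in backends:  # e.g. [anythingllm, qnn, cpu] in priority order
        try:
            return backend.generate(prompt)
        except Exception as exc:  # any backend failure triggers the next fallback
            errors.append(f"{backend.__class__.__name__}: {exc}")
    raise RuntimeError("All backends failed:\n" + "\n".join(errors))
```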
```
AI_NPU_AGENT_Projekt/
├── CLAUDE.md                   # This file - architecture documentation
├── README.md                   # User-facing project overview
├── UI_GUIDE.md                 # Terminal UI features & customization
├── config.yaml.template        # Configuration template
├── config_extended.yaml        # Extended configuration example
├── .env.example                # Environment variables template
├── requirements.txt            # Main dependencies
├── requirements-core.txt       # Core CPU dependencies
├── requirements-npu.txt        # NPU-specific dependencies
│
├── config_loader.py            # Configuration loading & validation
├── main.py                     # Entry point: Agent initialization
├── llm_chat.py                 # QNN-based chat interface
│
├── selfai/                     # Main SelfAI package
│   ├── __init__.py
│   ├── selfai.py               # Main CLI loop with full pipeline
│   ├── core/
│   │   ├── agent.py                      # Basic agent with tool-calling loop
│   │   ├── agent_manager.py              # AgentManager: manages multiple agents
│   │   ├── model_interface.py            # Base interface for LLM models
│   │   ├── anythingllm_interface.py      # AnythingLLM HTTP client
│   │   ├── npu_llm_interface.py          # QNN/NPU model interface
│   │   ├── local_llm_interface.py        # CPU fallback (llama-cpp-python)
│   │   ├── planner_ollama_interface.py   # Ollama planner client
│   │   ├── merge_ollama_interface.py     # Ollama merge provider
│   │   ├── execution_dispatcher.py       # Subtask execution orchestrator
│   │   ├── memory_system.py              # Conversation & plan storage
│   │   ├── context_filter.py             # Smart context relevance filtering
│   │   ├── planner_validator.py          # Plan schema validation
│   │   └── smolagents_runner.py          # Smolagents integration
│   │
│   ├── tools/
│   │   ├── tool_registry.py    # Tool catalog & management
│   │   ├── filesystem_tools.py # File/directory operations
│   │   └── shell_tools.py      # Shell command execution
│   │
│   └── ui/
│       └── terminal_ui.py      # Terminal UI with animations
│
├── models/                     # Model storage directory
│   ├── Phi-3-mini-4k-instruct.Q4_K_M.gguf   # CPU fallback model
│   └── [other GGUF/QNN models]
│
├── memory/                     # Conversation & plan storage
│   ├── plans/                  # Saved execution plans
│   └── [memory categories]/    # Memory organized by agent categories
│
├── agents/                     # Agent configurations
│   ├── [agent_key]/
│   │   ├── system_prompt.md        # Agent system prompt
│   │   ├── memory_categories.txt   # Memory categories for this agent
│   │   ├── workspace_slug.txt      # AnythingLLM workspace
│   │   └── description.txt         # Agent description
│   └── [other agents]
│
├── data/                       # Additional data/resources
├── docs/                       # Extended documentation
├── scripts/                    # Setup & utility scripts
├── archive/                    # Old/archived code
└── Learings_aus_Problemen/     # Learning notes & problems
```
`config_loader.py::load_configuration()` performs the following steps:

1. Load `.env` file (secrets)
2. Load `config.yaml` (main settings)
3. Normalize config (support both formats)
4. Resolve environment variables (`${VAR_NAME}`)
5. Validate required fields
6. Create structured dataclasses
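For illustration, the same six steps in condensed form; a sketch assuming PyYAML, python-dotenv, and a simple regex for `${VAR_NAME}` (the real `config_loader.py` handles more sections and validation):

```python
import os
import re
from dataclasses import dataclass

import yaml
from dotenv import load_dotenv


@dataclass
class NPUConfig:
    base_url: str
    workspace_slug: str
    api_key: str


def load_configuration(path: str = "config.yaml") -> NPUConfig:
    load_dotenv()                           # 1. secrets from .env
    with open(path, encoding="utf-8") as fh:
        raw = yaml.safe_load(fh)            # 2. main settings
    npu = raw["npu_provider"]               # 3./5. normalization & validation omitted here

    def resolve(value: str) -> str:         # 4. replace ${VAR_NAME} with env values
        return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)

    return NPUConfig(                       # 6. structured dataclass
        base_url=resolve(npu["base_url"]),
        workspace_slug=npu["workspace_slug"],
        api_key=resolve(npu.get("api_key", "${API_KEY}")),
    )
```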
config.yaml contains:
npu_provider - AnythingLLM backend
```yaml
npu_provider:
  base_url: "http://localhost:3001/api/v1"
  workspace_slug: "main"
  api_key: "loaded-from-.env"
```
cpu_fallback - Local GGUF model
```yaml
cpu_fallback:
  model_path: "Phi-3-mini-4k-instruct.Q4_K_M.gguf"
  n_ctx: 4096          # Context window size
  n_gpu_layers: 0      # GPU offload layers
```
system - General settings
```yaml
system:
  streaming_enabled: true    # Enable word-by-word output
  stream_timeout: 60.0       # Streaming timeout in seconds
```
agent_config - Agent management
```yaml
agent_config:
  default_agent: "code_helfer"   # Default agent to load
```
planner - Optional Ollama-based planning
```yaml
planner:
  enabled: false               # Enable/disable planning
  execution_timeout: 120.0     # Timeout per subtask
  providers:                   # Multiple planner backends
    - name: "local-ollama"
      type: "local_ollama"
      base_url: "http://localhost:11434"
      model: "gemma3:1b"
      timeout: 180.0
      max_tokens: 768
```
merge - Optional result synthesis
```yaml
merge:
  enabled: false
  providers:                   # Multiple merge backends
    - name: "merge-ollama"
      type: "local_ollama"
      base_url: "http://localhost:11434"
      model: "gemma3:3b"
      timeout: 180.0
      max_tokens: 2048
```
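These per-backend blocks feed the corresponding interface constructors. For example, the `cpu_fallback` values map directly onto llama-cpp-python's `Llama` constructor; a minimal sketch (the `models/` path join and the chat call are assumptions, not the project's actual loader):

```python
import yaml
from llama_cpp import Llama

with open("config.yaml", encoding="utf-8") as fh:
    cfg = yaml.safe_load(fh)["cpu_fallback"]

# n_ctx and n_gpu_layers come straight from the YAML shown above.
llm = Llama(
    model_path=f"models/{cfg['model_path']}",   # assumed to live under models/
    n_ctx=cfg.get("n_ctx", 4096),
    n_gpu_layers=cfg.get("n_gpu_layers", 0),
)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}]
)
print(reply["choices"][0]["message"]["content"])
```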
Required (in .env file):
- `API_KEY`: AnythingLLM API key (required if using AnythingLLM)

Optional:
- `OLLAMA_CLOUD_API_KEY`: For cloud-based Ollama providers

Values in `config.yaml` can reference these variables via `${VAR_NAME}`.

Configuration Loader (`config_loader.py`)
Purpose: Centralized, validated configuration management
Key Classes:
- `NPUConfig`: AnythingLLM backend settings
- `CPUConfig`: Local model configuration
- `SystemConfig`: General system settings
- `PlannerConfig`: Planning phase configuration
- `MergeConfig`: Merge phase configuration
- `AppConfig`: Complete application config

Key Functions:
- `load_configuration()`: Load, validate, and structure config
- `_normalize_config()`: Support both simple and extended formats
- `_resolve_env_template()`: Replace `${VAR_NAME}` with env values

AgentManager (`selfai/core/agent_manager.py`)
Purpose: Manage multiple specialized AI agents with memory
Agent Properties:
- `key`: Unique identifier (e.g., "code_helfer")
- `display_name`: Human-readable name
- `description`: What this agent does
- `system_prompt`: Agent personality/instructions
- `memory_categories`: Conversation storage categories
- `workspace_slug`: AnythingLLM workspace

AgentManager Responsibilities:
Model Interfaces

Base: `ModelInterface` in `model_interface.py`, with methods `chat_completion()`, `generate_response()`, and `stream_generate_response()` (sketched after the implementations list below).

Implementations:
- `AnythingLLMInterface` (`anythingllm_interface.py`)
- `NpuLLMInterface` (`npu_llm_interface.py`)
- `LocalLLMInterface` (`local_llm_interface.py`)
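A hedged sketch of what such a base interface typically looks like; the method names come from the list above, while the signatures and return types are assumptions:

```python
from abc import ABC, abstractmethod
from typing import Iterator


class ModelInterface(ABC):
    """Common contract all backends implement (sketch; real signatures may differ)."""

    @abstractmethod
    def chat_completion(self, messages: list[dict]) -> str:
        """Chat-style completion over a full message list."""

    @abstractmethod
    def generate_response(self, prompt: str) -> str:
        """Blocking single-prompt generation."""

    @abstractmethod
    def stream_generate_response(self, prompt: str) -> Iterator[str]:
        """Yield response chunks as they arrive (for streaming UI output)."""
```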
Planner (`planner_ollama_interface.py`)
Purpose: Generate task decomposition plans (DPPM format)
PlannerOllamaInterface:
Plan Structure:
{ "subtasks": [ { "id": "S1", "title": "Task title", "objective": "What to do", "agent_key": "agent_name", "engine": "anythingllm", "parallel_group": 1, "depends_on": [] } ], "merge": { "strategy": "How to combine results", "steps": [...] } }
Planning Flow:
- `/plan <goal>` triggers plan generation
- Generated plans are saved under `memory/plans/`

Execution Dispatcher (`execution_dispatcher.py`)
Purpose: Execute planned subtasks with fault tolerance
ExecutionDispatcher:
Execution Pipeline:
For each subtask:

1. Try Backend 1 (AnythingLLM)
2. On failure → Try Backend 2 (QNN)
3. On failure → Try Backend 3 (CPU)
4. On all failures → Abort plan with error
5. Save result to memory
6. Update plan JSON with result path
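A compact sketch of that per-subtask loop, wired to the retry settings described below (illustrative only; result saving, status tracking, and whether retries wrap the whole chain are simplifications):

```python
import time


def run_subtask(subtask: dict, backends: list, retry_attempts: int = 2,
                retry_delay: float = 5.0) -> str:
    """Try each backend in priority order; retry the whole chain before giving up."""
    for attempt in range(retry_attempts + 1):
        for backend in backends:  # AnythingLLM -> QNN -> CPU
            try:
                return backend.generate(subtask["objective"])  # stand-in method name
            except Exception:
                continue          # fall through to the next backend
        if attempt < retry_attempts:
            time.sleep(retry_delay)
    raise RuntimeError(f"Subtask {subtask['id']} failed on all backends")
```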
Retry Strategy:
- `retry_attempts`: Number of retries (default 2)
- `retry_delay`: Wait between retries (default 5s)

Memory System (`memory_system.py`)
Purpose: Persistent conversation and plan storage
Structure:
```
memory/
├── plans/                     # Saved execution plans (JSON)
│   └── 20250101-120000_goal-name.json
├── code_helfer/               # Agent memory (categories)
│   ├── agent1_20250101-120000.txt
│   └── agent1_20250101-120001.txt
├── projektmanager/
│   └── ...
└── general/                   # Default category
    └── ...
```
File Format (text-based conversations):
```
---
Agent: Code Helper
AgentKey: code_helfer
Workspace: main
Timestamp: 2025-01-01 12:00:00
Tags: python, debugging
---
System Prompt:
[system instructions]
---
User:
[user question]
---
SelfAI:
[ai response]
```
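A small sketch of writing an entry in this format; the filename pattern and fields follow the examples above, while the real `memory_system.py` stores more metadata (tags, system prompt) and is only approximated here:

```python
from datetime import datetime
from pathlib import Path


def save_conversation(category: str, agent_name: str, agent_key: str,
                      workspace: str, user_msg: str, ai_msg: str,
                      root: str = "memory") -> Path:
    """Append one conversation turn as a text entry in the memory directory."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    entry = (
        "---\n"
        f"Agent: {agent_name}\n"
        f"AgentKey: {agent_key}\n"
        f"Workspace: {workspace}\n"
        f"Timestamp: {datetime.now():%Y-%m-%d %H:%M:%S}\n"
        "---\n"
        f"User:\n{user_msg}\n"
        "---\n"
        f"SelfAI:\n{ai_msg}\n"
    )
    path = Path(root) / category / f"{agent_key}_{stamp}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(entry, encoding="utf-8")
    return path
```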
Key Features:
Context Filter (`context_filter.py`)
Purpose: Smart retrieval of relevant conversation history
Algorithms:
Integration: Used by `load_relevant_context()` to populate chat history
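As one illustration of such filtering, a simple keyword-overlap score could rank stored entries before they are handed to `load_relevant_context()`; this is purely a sketch, and the actual `context_filter.py` algorithms may differ:

```python
def score_relevance(query: str, entry_text: str) -> float:
    """Crude relevance: share of query words that also appear in the entry."""
    query_words = {w.lower() for w in query.split() if len(w) > 3}
    if not query_words:
        return 0.0
    entry_words = {w.lower() for w in entry_text.split()}
    return len(query_words & entry_words) / len(query_words)


def rank_context(query: str, entries: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k entries with a non-zero relevance score."""
    ranked = sorted(entries, key=lambda e: score_relevance(query, e), reverse=True)
    return [e for e in ranked[:top_k] if score_relevance(query, e) > 0]
```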
Terminal UI (`ui/terminal_ui.py`)
Purpose: Rich terminal interface with animations
Features:
Status Levels:
"success" - Green ✓"info" - Blue ⓘ"warning" - Yellow ⚠"error" - Red ✗selfai/selfai.py (Recommended)Complete 3-phase pipeline:
```bash
python /path/to/selfai/selfai.py
```
Flow:
- `/plan <goal>` → Planning phase
- `/memory` → Manage memory
- `/switch <agent>` → Switch agents
- `quit` → Exit

Key Commands:
- `/plan <goal>` - Create and execute task decomposition plan
- `/planner list` - List available planner backends
- `/planner use <name>` - Switch planner provider
- `/memory` - List memory categories
- `/memory clear <category>` - Clear memory
- `/switch <agent_name|number>` - Switch active agent
- `quit` - Exit program

`main.py`
Simple agent initialization:
```bash
python main.py
```
Flow:
- Initializes a basic agent (`smolagents`)

Use Case: Basic testing without complex infrastructure
`llm_chat.py`
Direct QNN/NPU chat:
```bash
python llm_chat.py
```
Features:
```
selfai.py (Main Loop)
  ├─ Planning Phase  → PlannerOllamaInterface
  ├─ Execution Phase → ExecutionDispatcher
  └─ Merge Phase     → MergeOllamaInterface

Execution backends (automatic fallback in priority order):
  1. AnythingLLMInterface → AnythingLLM Server (NPU)
  2. NpuLLMInterface      → QNN Models (.qnn, NPU)
  3. LocalLLMInterface    → GGUF Models (CPU)

Supporting Systems:
  AgentManager  → Agent instances & switching
  MemorySystem  → Conversation & plan persistence
  ConfigLoader  → Centralized configuration
  TerminalUI    → Rich terminal interface
  ContextFilter → Smart context retrieval
```
Core Dependencies (`requirements-core.txt`)

```
PyYAML              # Config file parsing
python-dotenv       # Environment variable loading
openai              # OpenAI API compatibility
llama-cpp-python    # CPU model inference (GGUF)
numpy               # Numerical computing
pyarrow             # Data serialization
tabulate            # Table formatting
smmap               # Fast file mapping
psutil              # System monitoring
qai-hub-models      # Qualcomm AI Hub models
smolagents          # Agent toolkit
```
NPU Dependencies (`requirements-npu.txt`)

```
httpx==0.28.1       # HTTP client for API calls
qai_hub_models      # QNN model support
```
Hardware:
Software:
```bash
# Clone repository
git clone <repository-url>
cd AI_NPU_AGENT_Projekt

# Create virtual environment
python -m venv .venv
source .venv/bin/activate
# On Windows CMD: .\.venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
```bash
# Copy and configure
cp config.yaml.template config.yaml
cp .env.example .env

# Edit config.yaml with your settings
# Edit .env with your AnythingLLM API key
```
```bash
# Create models directory
mkdir -p models

# Download GGUF model for CPU fallback
# Place in models/ directory
# e.g., Phi-3-mini-4k-instruct.Q4_K_M.gguf
```
```bash
# Create agents directory
mkdir -p agents

# Create agent directories with:
# agents/agent_key/
# ├── system_prompt.md
# ├── memory_categories.txt
# ├── workspace_slug.txt
# └── description.txt
```
```bash
# If using Ollama planner/merge
ollama serve

# In another terminal, pull models
ollama pull gemma3:1b
ollama pull gemma3:3b
```
```bash
# If using AnythingLLM for primary inference
# Launch AnythingLLM Desktop and configure workspace
```
```
python selfai/selfai.py

> You: What is Python?
AI: [Response from available backend]
```
```
python selfai/selfai.py

> You: /plan Create a Python web crawler for news sites
[Planner decomposes into subtasks]
[System executes each subtask]
[Merge synthesizes final solution]
```
```
python selfai/selfai.py

> You: /switch projektmanager
Switched to: Project Manager

> You: Analyze the project requirements
AI: [Response from project manager agent]
```
```
python selfai/selfai.py

> You: /memory
Active memory categories:
- code_helfer
- projektmanager

> You: /memory clear code_helfer
Memory 'code_helfer' cleared completely (15 entries).
```
Create directory:
```
agents/my_agent/
├── system_prompt.md        (Agent personality)
├── memory_categories.txt   (One per line)
├── workspace_slug.txt      (AnythingLLM workspace)
└── description.txt         (What agent does)
```
Reference in config.yaml:
```yaml
agent_config:
  default_agent: "my_agent"
```
Create in `selfai/tools/`:
```python
class MyTool:
    @property
    def name(self):
        return "my_tool"

    @property
    def description(self):
        return "Tool description"

    @property
    def inputs(self):
        return {
            "param1": {"description": "..."}
        }

    def run(self, param1: str) -> str:
        result = f"processed {param1}"  # Implementation goes here
        return result
```
Register in `selfai/tools/tool_registry.py`:
```python
from selfai.tools.my_tool import MyTool
# Add to registry
```
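The exact registration API lives in `tool_registry.py`; the call below is a hypothetical sketch, and `ToolRegistry`, `register()`, and `list_tools()` are assumed names rather than the project's actual interface:

```python
# Hypothetical registration - the real tool_registry.py API may differ.
from selfai.tools.my_tool import MyTool
from selfai.tools.tool_registry import ToolRegistry  # assumed class name

registry = ToolRegistry()
registry.register(MyTool())                          # assumed method name
print([tool.name for tool in registry.list_tools()])  # assumed helper
```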
Create a new interface in `selfai/core/`:
```python
class MyLLMInterface:
    def generate_response(self, *args, **kwargs):
        ...

    def stream_generate_response(self, *args, **kwargs):
        ...
```
Instantiate in `selfai/selfai.py`:
```python
interface, label = _load_my_llm(models_root, ui)
execution_backends.append({
    "interface": interface,
    "label": label,
    "name": "my_backend",
})
```
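What a loader such as `_load_my_llm` might return, sketched under the assumption that it simply builds the interface and a display label (the function name comes from the snippet above; the body is illustrative):

```python
def _load_my_llm(models_root, ui):
    """Build the backend interface plus a human-readable label (illustrative)."""
    interface = MyLLMInterface()        # the class sketched in the previous step
    label = "MyLLM (custom backend)"
    return interface, label
```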
Troubleshooting

Configuration files missing
Solution:
- Copy `.env.example` to `.env`
- Ensure `config.yaml` is created from the template

AnythingLLM not reachable
Solution:
- Check `npu_provider.base_url` in `config.yaml`

Responses hit token or context limits
Solution:
- Adjust `max_output_tokens` in config
- Adjust `n_ctx` (context window)

Memory grows too large
Solution:
- Use `/memory clear <category>` to manage
- Use `/memory clear <category> 5` to keep only the last 5 entries

Planner does not respond
Solution:
- Start `ollama serve`
- Set `planner.enabled: true` in `config.yaml`
- Run `ollama pull gemma3:1b`
- Make sure `planner.providers[0].base_url` points to Ollama

Streaming (default): Better UX, lower latency perception
- Set `system.streaming_enabled: true`

Blocking: Simple, predictable latency
- Set `system.streaming_enabled: false`
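How the two modes differ for a caller, sketched against the interface methods listed earlier (illustrative only):

```python
def answer(backend, prompt: str, streaming: bool = True) -> str:
    """Streaming prints chunks as they arrive; blocking waits for the full text."""
    if streaming:
        chunks = []
        for chunk in backend.stream_generate_response(prompt):
            print(chunk, end="", flush=True)   # word-by-word output
            chunks.append(chunk)
        print()
        return "".join(chunks)
    return backend.generate_response(prompt)
```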
| Backend | Speed | Quality | Hardware | Notes |
|---|---|---|---|---|
| AnythingLLM (NPU) | Fast | High | Snapdragon X Elite | Recommended primary |
| QNN | Very Fast | High | Snapdragon X Elite | Direct NPU access |
| CPU (GGUF) | Slow | Medium | Any | Fallback guarantee |
- `max_tokens: 768` (plan generation)
- `max_tokens: 1536` (result synthesis)
- `max_output_tokens: 512` (regular response)

The planner generates plans in DPPM format:
{ "subtasks": [ { "id": "S1", "title": "Analyze Requirements", "objective": "Understand what user needs", "agent_key": "analyst", "engine": "anythingllm", "parallel_group": 1, "depends_on": [], "result_path": "memory/plans/results/S1.txt" }, { "id": "S2", "title": "Design Solution", "objective": "Create architecture", "agent_key": "architect", "engine": "anythingllm", "parallel_group": 2, "depends_on": ["S1"], "result_path": "memory/plans/results/S2.txt" } ], "merge": { "strategy": "Combine analysis and design", "steps": [ { "title": "Synthesis", "description": "Unite results", "depends_on": ["S2"] } ] }, "metadata": { "planner_provider": "local-ollama", "planner_model": "gemma3:1b", "goal": "Create a web application", "merge_agent": "projektmanager" } }
AnythingLLM and Ollama use Server-Sent Events (SSE):
```
event: message
data: {"content": "Hello"}

event: message
data: {"content": " world"}

event: end
data: {"done": true}
```
System automatically decodes and displays streaming chunks.
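A minimal sketch of decoding such a stream with httpx (already listed in requirements-npu.txt); the endpoint, payload, and field names follow the example above and are assumptions rather than the project's exact client code:

```python
import json

import httpx


def stream_sse(url: str, payload: dict):
    """Yield content chunks from an SSE response shaped like the example above."""
    with httpx.stream("POST", url, json=payload, timeout=60.0) as response:
        for line in response.iter_lines():
            if not line.startswith("data:"):
                continue                      # skip "event:" lines and keep-alives
            data = json.loads(line[len("data:"):].strip())
            if data.get("done"):
                break
            if "content" in data:
                yield data["content"]
```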
Separation of Concerns: Each module has a single responsibility

- `*_interface.py` → Backend communication
- `*_system.py` → Persistent state
- `execution_*` → Task orchestration
- `ui/` → User interface

Dependency Injection: Core business logic independent of I/O
Graceful Degradation: System continues with reduced capability
Configuration-Driven: Behavior changes without code modification
- Never commit the `.env` file or API keys
- Use `.env.example` as a template
- Secrets are loaded via `python-dotenv`
- Paths are handled with `pathlib` for safety

Potential improvements:
See also:

- `config.yaml.template`, `config_extended.yaml`
- `README.md`, `UI_GUIDE.md`
- `Learings_aus_Problemen/` directory

Last Updated: January 2025
Maintained By: AI NPU Agent Project Team
License: [Check LICENSE file]