This is a sophisticated AI-powered terminal chatbot with multi-backend inference support designed for Windows on ARM with Snapdragon X Elite NPU acceleration. The project implements a three-phase intelligent pipeline (SelfAI) with fallback mechanisms, memory management, and agent-based task execution.
Key Purpose: Enable efficient local AI inference with automatic fallback from NPU hardware acceleration to CPU execution, all managed through a configuration-driven system with optional planning and merge phases.
The system implements a three-phase pipeline:
1. PLANNING PHASE (Ollama-based)
   - Accepts user goal/request
   - Generates DPPM plan (Distributed Planning Problem Model)
   - Creates subtasks with dependencies & merge strategy
2. EXECUTION PHASE (Multi-backend LLM inference)
   - Executes subtasks sequentially/parallel (per plan)
   - Uses AgentManager to route to specialized agents
   - Falls back between backends: AnythingLLM → QNN → CPU
   - Saves results and tracks status
3. MERGE PHASE (Result synthesis)
   - Collects all subtask outputs
   - Synthesizes into coherent final answer
   - Falls back gracefully with internal summary
The system supports three execution backends in priority order:
1. AnythingLLM (NPU) - Primary: Hardware-accelerated inference via Snapdragon X NPU (configured in `config.yaml` under `npu_provider`)
2. QNN (Qualcomm Neural Network) - Secondary: Direct NPU model execution (models in the `models/` directory)
3. CPU Fallback - Tertiary: Local CPU inference via llama-cpp-python
Automatic Failover: If AnythingLLM fails, system automatically tries QNN, then CPU.
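A minimal sketch of that failover order (illustrative only; the real chain lives in the execution dispatcher, and the backend objects and `generate()` method here are hypothetical stand-ins for the project's interface classes):

```python
# Illustrative failover sketch: try each backend in priority order.
# The backend objects and generate() method are stand-ins, not the project's API.
def generate_with_fallback(prompt: str, backends: list) -> str:
    errors = []
    for backend in backends:  # e.g. [anythingllm, qnn, cpu] in priority order
        try:
            return backend.generate(prompt)
        except Exception as exc:  # any backend failure triggers the next fallback
            errors.append(f"{backend.__class__.__name__}: {exc}")
    raise RuntimeError("All backends failed:\n" + "\n".join(errors))
```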
```
AI_NPU_AGENT_Projekt/
├── CLAUDE.md                   # This file - architecture documentation
├── README.md                   # User-facing project overview
├── UI_GUIDE.md                 # Terminal UI features & customization
├── config.yaml.template        # Configuration template
├── config_extended.yaml        # Extended configuration example
├── .env.example                # Environment variables template
├── requirements.txt            # Main dependencies
├── requirements-core.txt       # Core CPU dependencies
├── requirements-npu.txt        # NPU-specific dependencies
│
├── config_loader.py            # Configuration loading & validation
├── main.py                     # Entry point: Agent initialization
├── llm_chat.py                 # QNN-based chat interface
│
├── selfai/                     # Main SelfAI package
│   ├── __init__.py
│   ├── selfai.py               # Main CLI loop with full pipeline
│   ├── core/
│   │   ├── agent.py                      # Basic agent with tool-calling loop
│   │   ├── agent_manager.py              # AgentManager: manages multiple agents
│   │   ├── model_interface.py            # Base interface for LLM models
│   │   ├── anythingllm_interface.py      # AnythingLLM HTTP client
│   │   ├── npu_llm_interface.py          # QNN/NPU model interface
│   │   ├── local_llm_interface.py        # CPU fallback (llama-cpp-python)
│   │   ├── planner_ollama_interface.py   # Ollama planner client
│   │   ├── merge_ollama_interface.py     # Ollama merge provider
│   │   ├── execution_dispatcher.py       # Subtask execution orchestrator
│   │   ├── memory_system.py              # Conversation & plan storage
│   │   ├── context_filter.py             # Smart context relevance filtering
│   │   ├── planner_validator.py          # Plan schema validation
│   │   └── smolagents_runner.py          # Smolagents integration
│   │
│   ├── tools/
│   │   ├── tool_registry.py    # Tool catalog & management
│   │   ├── filesystem_tools.py # File/directory operations
│   │   └── shell_tools.py      # Shell command execution
│   │
│   └── ui/
│       └── terminal_ui.py      # Terminal UI with animations
│
├── models/                     # Model storage directory
│   ├── Phi-3-mini-4k-instruct.Q4_K_M.gguf   # CPU fallback model
│   └── [other GGUF/QNN models]
│
├── memory/                     # Conversation & plan storage
│   ├── plans/                  # Saved execution plans
│   └── [memory categories]/    # Memory organized by agent categories
│
├── agents/                     # Agent configurations
│   ├── [agent_key]/
│   │   ├── system_prompt.md        # Agent system prompt
│   │   ├── memory_categories.txt   # Memory categories for this agent
│   │   ├── workspace_slug.txt      # AnythingLLM workspace
│   │   └── description.txt         # Agent description
│   └── [other agents]
│
├── data/                       # Additional data/resources
├── docs/                       # Extended documentation
├── scripts/                    # Setup & utility scripts
├── archive/                    # Old/archived code
└── Learings_aus_Problemen/     # Learning notes & problems
```
`config_loader.py::load_configuration()` performs the following steps:

1. Load `.env` file (secrets)
2. Load `config.yaml` (main settings)
3. Normalize config (support both formats)
4. Resolve environment variables (`${VAR_NAME}`)
5. Validate required fields
6. Create structured dataclasses
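For illustration, the same six steps in condensed form; a sketch assuming PyYAML, python-dotenv, and a simple regex for `${VAR_NAME}` (the real `config_loader.py` handles more sections and validation):

```python
import os
import re
from dataclasses import dataclass

import yaml
from dotenv import load_dotenv


@dataclass
class NPUConfig:
    base_url: str
    workspace_slug: str
    api_key: str


def load_configuration(path: str = "config.yaml") -> NPUConfig:
    load_dotenv()                           # 1. secrets from .env
    with open(path, encoding="utf-8") as fh:
        raw = yaml.safe_load(fh)            # 2. main settings
    npu = raw["npu_provider"]               # 3./5. normalization & validation omitted here

    def resolve(value: str) -> str:         # 4. replace ${VAR_NAME} with env values
        return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)

    return NPUConfig(                       # 6. structured dataclass
        base_url=resolve(npu["base_url"]),
        workspace_slug=npu["workspace_slug"],
        api_key=resolve(npu.get("api_key", "${API_KEY}")),
    )
```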
config.yaml contains:
npu_provider - AnythingLLM backend
```yaml
npu_provider:
  base_url: "http://localhost:3001/api/v1"
  workspace_slug: "main"
  api_key: "loaded-from-.env"
```
cpu_fallback - Local GGUF model
```yaml
cpu_fallback:
  model_path: "Phi-3-mini-4k-instruct.Q4_K_M.gguf"
  n_ctx: 4096          # Context window size
  n_gpu_layers: 0      # GPU offload layers
```
system - General settings
```yaml
system:
  streaming_enabled: true    # Enable word-by-word output
  stream_timeout: 60.0       # Streaming timeout in seconds
```
agent_config - Agent management
```yaml
agent_config:
  default_agent: "code_helfer"   # Default agent to load
```
planner - Optional Ollama-based planning
```yaml
planner:
  enabled: false               # Enable/disable planning
  execution_timeout: 120.0     # Timeout per subtask
  providers:                   # Multiple planner backends
    - name: "local-ollama"
      type: "local_ollama"
      base_url: "http://localhost:11434"
      model: "gemma3:1b"
      timeout: 180.0
      max_tokens: 768
```
merge - Optional result synthesis
```yaml
merge:
  enabled: false
  providers:                   # Multiple merge backends
    - name: "merge-ollama"
      type: "local_ollama"
      base_url: "http://localhost:11434"
      model: "gemma3:3b"
      timeout: 180.0
      max_tokens: 2048
```
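These per-backend blocks feed the corresponding interface constructors. For example, the `cpu_fallback` values map directly onto llama-cpp-python's `Llama` constructor; a minimal sketch (the `models/` path join and the chat call are assumptions, not the project's actual loader):

```python
import yaml
from llama_cpp import Llama

with open("config.yaml", encoding="utf-8") as fh:
    cfg = yaml.safe_load(fh)["cpu_fallback"]

# n_ctx and n_gpu_layers come straight from the YAML shown above.
llm = Llama(
    model_path=f"models/{cfg['model_path']}",   # assumed to live under models/
    n_ctx=cfg.get("n_ctx", 4096),
    n_gpu_layers=cfg.get("n_gpu_layers", 0),
)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}]
)
print(reply["choices"][0]["message"]["content"])
```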
Required (in .env file):
- `API_KEY`: AnythingLLM API key (required if using AnythingLLM)

Optional:
- `OLLAMA_CLOUD_API_KEY`: For cloud-based Ollama providers

Values in `config.yaml` can reference these variables via `${VAR_NAME}`.

Configuration Loader (`config_loader.py`)
Purpose: Centralized, validated configuration management
Key Classes:
- `NPUConfig`: AnythingLLM backend settings
- `CPUConfig`: Local model configuration
- `SystemConfig`: General system settings
- `PlannerConfig`: Planning phase configuration
- `MergeConfig`: Merge phase configuration
- `AppConfig`: Complete application config

Key Functions:
- `load_configuration()`: Load, validate, and structure config
- `_normalize_config()`: Support both simple and extended formats
- `_resolve_env_template()`: Replace `${VAR_NAME}` with env values

AgentManager (`selfai/core/agent_manager.py`)
Purpose: Manage multiple specialized AI agents with memory
Agent Properties:
- `key`: Unique identifier (e.g., "code_helfer")
- `display_name`: Human-readable name
- `description`: What this agent does
- `system_prompt`: Agent personality/instructions
- `memory_categories`: Conversation storage categories
- `workspace_slug`: AnythingLLM workspace

AgentManager Responsibilities:
Model Interfaces

Base: `ModelInterface` in `model_interface.py`, with methods `chat_completion()`, `generate_response()`, and `stream_generate_response()` (sketched after the implementations list below).

Implementations:
- `AnythingLLMInterface` (`anythingllm_interface.py`)
- `NpuLLMInterface` (`npu_llm_interface.py`)
- `LocalLLMInterface` (`local_llm_interface.py`)
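A hedged sketch of what such a base interface typically looks like; the method names come from the list above, while the signatures and return types are assumptions:

```python
from abc import ABC, abstractmethod
from typing import Iterator


class ModelInterface(ABC):
    """Common contract all backends implement (sketch; real signatures may differ)."""

    @abstractmethod
    def chat_completion(self, messages: list[dict]) -> str:
        """Chat-style completion over a full message list."""

    @abstractmethod
    def generate_response(self, prompt: str) -> str:
        """Blocking single-prompt generation."""

    @abstractmethod
    def stream_generate_response(self, prompt: str) -> Iterator[str]:
        """Yield response chunks as they arrive (for streaming UI output)."""
```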
Planner (`planner_ollama_interface.py`)
Purpose: Generate task decomposition plans (DPPM format)
PlannerOllamaInterface:
Plan Structure:
{ "subtasks": [ { "id": "S1", "title": "Task title", "objective": "What to do", "agent_key": "agent_name", "engine": "anythingllm", "parallel_group": 1, "depends_on": [] } ], "merge": { "strategy": "How to combine results", "steps": [...] } }
Planning Flow:
- `/plan <goal>` triggers plan generation
- Generated plans are saved under `memory/plans/`

Execution Dispatcher (`execution_dispatcher.py`)
Purpose: Execute planned subtasks with fault tolerance
ExecutionDispatcher:
Execution Pipeline:
For each subtask:

1. Try Backend 1 (AnythingLLM)
2. On failure → Try Backend 2 (QNN)
3. On failure → Try Backend 3 (CPU)
4. On all failures → Abort plan with error
5. Save result to memory
6. Update plan JSON with result path
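A compact sketch of that per-subtask loop, wired to the retry settings described below (illustrative only; result saving, status tracking, and whether retries wrap the whole chain are simplifications):

```python
import time


def run_subtask(subtask: dict, backends: list, retry_attempts: int = 2,
                retry_delay: float = 5.0) -> str:
    """Try each backend in priority order; retry the whole chain before giving up."""
    for attempt in range(retry_attempts + 1):
        for backend in backends:  # AnythingLLM -> QNN -> CPU
            try:
                return backend.generate(subtask["objective"])  # stand-in method name
            except Exception:
                continue          # fall through to the next backend
        if attempt < retry_attempts:
            time.sleep(retry_delay)
    raise RuntimeError(f"Subtask {subtask['id']} failed on all backends")
```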
Retry Strategy:
- `retry_attempts`: Number of retries (default 2)
- `retry_delay`: Wait between retries (default 5s)

Memory System (`memory_system.py`)
Purpose: Persistent conversation and plan storage
Structure:
```
memory/
├── plans/                     # Saved execution plans (JSON)
│   └── 20250101-120000_goal-name.json
├── code_helfer/               # Agent memory (categories)
│   ├── agent1_20250101-120000.txt
│   └── agent1_20250101-120001.txt
├── projektmanager/
│   └── ...
└── general/                   # Default category
    └── ...
```
File Format (text-based conversations):
```
---
Agent: Code Helper
AgentKey: code_helfer
Workspace: main
Timestamp: 2025-01-01 12:00:00
Tags: python, debugging
---
System Prompt:
[system instructions]
---
User:
[user question]
---
SelfAI:
[ai response]
```
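A small sketch of writing an entry in this format; the filename pattern and fields follow the examples above, while the real `memory_system.py` stores more metadata (tags, system prompt) and is only approximated here:

```python
from datetime import datetime
from pathlib import Path


def save_conversation(category: str, agent_name: str, agent_key: str,
                      workspace: str, user_msg: str, ai_msg: str,
                      root: str = "memory") -> Path:
    """Append one conversation turn as a text entry in the memory directory."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    entry = (
        "---\n"
        f"Agent: {agent_name}\n"
        f"AgentKey: {agent_key}\n"
        f"Workspace: {workspace}\n"
        f"Timestamp: {datetime.now():%Y-%m-%d %H:%M:%S}\n"
        "---\n"
        f"User:\n{user_msg}\n"
        "---\n"
        f"SelfAI:\n{ai_msg}\n"
    )
    path = Path(root) / category / f"{agent_key}_{stamp}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(entry, encoding="utf-8")
    return path
```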
Key Features:
Context Filter (`context_filter.py`)
Purpose: Smart retrieval of relevant conversation history
Algorithms:
Integration: Used by `load_relevant_context()` to populate chat history
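As one illustration of such filtering, a simple keyword-overlap score could rank stored entries before they are handed to `load_relevant_context()`; this is purely a sketch, and the actual `context_filter.py` algorithms may differ:

```python
def score_relevance(query: str, entry_text: str) -> float:
    """Crude relevance: share of query words that also appear in the entry."""
    query_words = {w.lower() for w in query.split() if len(w) > 3}
    if not query_words:
        return 0.0
    entry_words = {w.lower() for w in entry_text.split()}
    return len(query_words & entry_words) / len(query_words)


def rank_context(query: str, entries: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k entries with a non-zero relevance score."""
    ranked = sorted(entries, key=lambda e: score_relevance(query, e), reverse=True)
    return [e for e in ranked[:top_k] if score_relevance(query, e) > 0]
```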
Terminal UI (`ui/terminal_ui.py`)
Purpose: Rich terminal interface with animations
Features:
Status Levels:
"success" - Green ✓"info" - Blue ⓘ"warning" - Yellow ⚠"error" - Red ✗selfai/selfai.py (Recommended)Complete 3-phase pipeline:
```bash
python /path/to/selfai/selfai.py
```
Flow:
- `/plan <goal>` → Planning phase
- `/memory` → Manage memory
- `/switch <agent>` → Switch agents
- `quit` → Exit

Key Commands:
- `/plan <goal>` - Create and execute task decomposition plan
- `/planner list` - List available planner backends
- `/planner use <name>` - Switch planner provider
- `/memory` - List memory categories
- `/memory clear <category>` - Clear memory
- `/switch <agent_name|number>` - Switch active agent
- `quit` - Exit program

`main.py`
Simple agent initialization:
```bash
python main.py
```
Flow:
- Initializes a basic agent (`smolagents`)

Use Case: Basic testing without complex infrastructure
`llm_chat.py`
Direct QNN/NPU chat:
```bash
python llm_chat.py
```
Features:
```
selfai.py (Main Loop)
  ├─ Planning Phase  → PlannerOllamaInterface
  ├─ Execution Phase → ExecutionDispatcher
  └─ Merge Phase     → MergeOllamaInterface

Execution backends (automatic fallback in priority order):
  1. AnythingLLMInterface → AnythingLLM Server (NPU)
  2. NpuLLMInterface      → QNN Models (.qnn, NPU)
  3. LocalLLMInterface    → GGUF Models (CPU)

Supporting Systems:
  AgentManager  → Agent instances & switching
  MemorySystem  → Conversation & plan persistence
  ConfigLoader  → Centralized configuration
  TerminalUI    → Rich terminal interface
  ContextFilter → Smart context retrieval
```
Core Dependencies (`requirements-core.txt`)

```
PyYAML              # Config file parsing
python-dotenv       # Environment variable loading
openai              # OpenAI API compatibility
llama-cpp-python    # CPU model inference (GGUF)
numpy               # Numerical computing
pyarrow             # Data serialization
tabulate            # Table formatting
smmap               # Fast file mapping
psutil              # System monitoring
qai-hub-models      # Qualcomm AI Hub models
smolagents          # Agent toolkit
```
NPU Dependencies (`requirements-npu.txt`)

```
httpx==0.28.1       # HTTP client for API calls
qai_hub_models      # QNN model support
```
Hardware:
Software:
```bash
# Clone repository
git clone <repository-url>
cd AI_NPU_AGENT_Projekt

# Create virtual environment
python -m venv .venv
source .venv/bin/activate
# On Windows CMD: .\.venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
```bash
# Copy and configure
cp config.yaml.template config.yaml
cp .env.example .env

# Edit config.yaml with your settings
# Edit .env with your AnythingLLM API key
```
```bash
# Create models directory
mkdir -p models

# Download GGUF model for CPU fallback
# Place in models/ directory
# e.g., Phi-3-mini-4k-instruct.Q4_K_M.gguf
```
```bash
# Create agents directory
mkdir -p agents

# Create agent directories with:
# agents/agent_key/
# ├── system_prompt.md
# ├── memory_categories.txt
# ├── workspace_slug.txt
# └── description.txt
```
```bash
# If using Ollama planner/merge
ollama serve

# In another terminal, pull models
ollama pull gemma3:1b
ollama pull gemma3:3b
```
```bash
# If using AnythingLLM for primary inference
# Launch AnythingLLM Desktop and configure workspace
```
```
python selfai/selfai.py

> You: What is Python?
AI: [Response from available backend]
```
```
python selfai/selfai.py

> You: /plan Create a Python web crawler for news sites
[Planner decomposes into subtasks]
[System executes each subtask]
[Merge synthesizes final solution]
```
```
python selfai/selfai.py

> You: /switch projektmanager
Switched to: Project Manager

> You: Analyze the project requirements
AI: [Response from project manager agent]
```
```
python selfai/selfai.py

> You: /memory
Active memory categories:
- code_helfer
- projektmanager

> You: /memory clear code_helfer
Memory 'code_helfer' cleared completely (15 entries).
```
Create directory:
```
agents/my_agent/
├── system_prompt.md        (Agent personality)
├── memory_categories.txt   (One per line)
├── workspace_slug.txt      (AnythingLLM workspace)
└── description.txt         (What agent does)
```
Reference in config.yaml:
```yaml
agent_config:
  default_agent: "my_agent"
```
Create in `selfai/tools/`:
```python
class MyTool:
    @property
    def name(self):
        return "my_tool"

    @property
    def description(self):
        return "Tool description"

    @property
    def inputs(self):
        return {
            "param1": {"description": "..."}
        }

    def run(self, param1: str) -> str:
        result = f"processed {param1}"  # Implementation goes here
        return result
```
Register in `selfai/tools/tool_registry.py`:
```python
from selfai.tools.my_tool import MyTool
# Add to registry
```
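The exact registration API lives in `tool_registry.py`; the call below is a hypothetical sketch, and `ToolRegistry`, `register()`, and `list_tools()` are assumed names rather than the project's actual interface:

```python
# Hypothetical registration - the real tool_registry.py API may differ.
from selfai.tools.my_tool import MyTool
from selfai.tools.tool_registry import ToolRegistry  # assumed class name

registry = ToolRegistry()
registry.register(MyTool())                          # assumed method name
print([tool.name for tool in registry.list_tools()])  # assumed helper
```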
Create a new interface in `selfai/core/`:
```python
class MyLLMInterface:
    def generate_response(self, *args, **kwargs):
        ...

    def stream_generate_response(self, *args, **kwargs):
        ...
```
Instantiate in `selfai/selfai.py`:
```python
interface, label = _load_my_llm(models_root, ui)
execution_backends.append({
    "interface": interface,
    "label": label,
    "name": "my_backend",
})
```
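What a loader such as `_load_my_llm` might return, sketched under the assumption that it simply builds the interface and a display label (the function name comes from the snippet above; the body is illustrative):

```python
def _load_my_llm(models_root, ui):
    """Build the backend interface plus a human-readable label (illustrative)."""
    interface = MyLLMInterface()        # the class sketched in the previous step
    label = "MyLLM (custom backend)"
    return interface, label
```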
Troubleshooting

Configuration files missing
Solution:
- Copy `.env.example` to `.env`
- Ensure `config.yaml` is created from the template

AnythingLLM not reachable
Solution:
- Check `npu_provider.base_url` in `config.yaml`

Responses hit token or context limits
Solution:
- Adjust `max_output_tokens` in config
- Adjust `n_ctx` (context window)

Memory grows too large
Solution:
- Use `/memory clear <category>` to manage
- Use `/memory clear <category> 5` to keep only the last 5 entries

Planner does not respond
Solution:
- Start `ollama serve`
- Set `planner.enabled: true` in `config.yaml`
- Run `ollama pull gemma3:1b`
- Make sure `planner.providers[0].base_url` points to Ollama

Streaming (default): Better UX, lower latency perception
- Set `system.streaming_enabled: true`

Blocking: Simple, predictable latency
- Set `system.streaming_enabled: false`
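How the two modes differ for a caller, sketched against the interface methods listed earlier (illustrative only):

```python
def answer(backend, prompt: str, streaming: bool = True) -> str:
    """Streaming prints chunks as they arrive; blocking waits for the full text."""
    if streaming:
        chunks = []
        for chunk in backend.stream_generate_response(prompt):
            print(chunk, end="", flush=True)   # word-by-word output
            chunks.append(chunk)
        print()
        return "".join(chunks)
    return backend.generate_response(prompt)
```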
| Backend | Speed | Quality | Hardware | Notes |
|---|---|---|---|---|
| AnythingLLM (NPU) | Fast | High | Snapdragon X Elite | Recommended primary |
| QNN | Very Fast | High | Snapdragon X Elite | Direct NPU access |
| CPU (GGUF) | Slow | Medium | Any | Fallback guarantee |
- `max_tokens: 768` (plan generation)
- `max_tokens: 1536` (result synthesis)
- `max_output_tokens: 512` (regular response)

The planner generates plans in DPPM format:
{ "subtasks": [ { "id": "S1", "title": "Analyze Requirements", "objective": "Understand what user needs", "agent_key": "analyst", "engine": "anythingllm", "parallel_group": 1, "depends_on": [], "result_path": "memory/plans/results/S1.txt" }, { "id": "S2", "title": "Design Solution", "objective": "Create architecture", "agent_key": "architect", "engine": "anythingllm", "parallel_group": 2, "depends_on": ["S1"], "result_path": "memory/plans/results/S2.txt" } ], "merge": { "strategy": "Combine analysis and design", "steps": [ { "title": "Synthesis", "description": "Unite results", "depends_on": ["S2"] } ] }, "metadata": { "planner_provider": "local-ollama", "planner_model": "gemma3:1b", "goal": "Create a web application", "merge_agent": "projektmanager" } }
AnythingLLM and Ollama use Server-Sent Events (SSE):
```
event: message
data: {"content": "Hello"}

event: message
data: {"content": " world"}

event: end
data: {"done": true}
```
System automatically decodes and displays streaming chunks.
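A minimal sketch of decoding such a stream with httpx (already listed in requirements-npu.txt); the endpoint, payload, and field names follow the example above and are assumptions rather than the project's exact client code:

```python
import json

import httpx


def stream_sse(url: str, payload: dict):
    """Yield content chunks from an SSE response shaped like the example above."""
    with httpx.stream("POST", url, json=payload, timeout=60.0) as response:
        for line in response.iter_lines():
            if not line.startswith("data:"):
                continue                      # skip "event:" lines and keep-alives
            data = json.loads(line[len("data:"):].strip())
            if data.get("done"):
                break
            if "content" in data:
                yield data["content"]
```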
Separation of Concerns: Each module has a single responsibility

- `*_interface.py` → Backend communication
- `*_system.py` → Persistent state
- `execution_*` → Task orchestration
- `ui/` → User interface

Dependency Injection: Core business logic independent of I/O
Graceful Degradation: System continues with reduced capability
Configuration-Driven: Behavior changes without code modification
- Never commit the `.env` file or API keys
- Use `.env.example` as a template
- Secrets are loaded via `python-dotenv`
- Paths are handled with `pathlib` for safety

Potential improvements:
See also:

- `config.yaml.template`, `config_extended.yaml`
- `README.md`, `UI_GUIDE.md`
- `Learings_aus_Problemen/` directory

Last Updated: January 2025
Maintained By: AI NPU Agent Project Team
License: [Check LICENSE file]