# Markdown Converter

Agent skill for markdown-converter
> **TL;DR**
> `anml-exp` is a pip-installable anomaly-detection playground with locked dependencies (`uv.lock`), an artefact registry, strict dataset hashing, and a structured hardware descriptor in benchmark outputs.
This repository remains a rapid-prototyping and benchmarking framework for anomaly-scoring / detection algorithms.
Out of scope: …

LLM-powered agents collaborate with human maintainers to …
```
.
├── src/
│   └── anml_exp/
│       ├── __init__.py
│       ├── models/
│       ├── data/
│       ├── benchmarks/
│       ├── registry/      # NEW: model artefact versioning (#43)
│       ├── resources/
│       └── cli.py
├── tests/
├── docs/
├── pyproject.toml
├── uv.lock                # NEW: reproducible dependency lockfile (#42)
└── README.md
```
Historic note – the hidden `.agents/` folder mentioned in spec v0.2 has been retired; helpers live in `anml_exp/benchmarks/` and `anml_exp/registry/`.
Every model must subclass `anml_exp.models.base.BaseAnomalyModel` and implement:
| Method / Property | Signature | Notes |
|---|---|---|
| | | |
| | | Higher ⇒ more anomalous. |
| | | |
| | | |
| / | optional | Use for artefact versioning. |
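To make the contract concrete, here is a minimal, self-contained sketch of what a conforming model might look like. The base class below is only a stand-in for `anml_exp.models.base.BaseAnomalyModel`, and the method names `fit` and `score` are assumptions, since the table above does not spell them out:

```python
from abc import ABC, abstractmethod
from typing import Sequence


class BaseAnomalyModel(ABC):
    """Stand-in for anml_exp.models.base.BaseAnomalyModel (names assumed)."""

    @abstractmethod
    def fit(self, X: Sequence[Sequence[float]]) -> "BaseAnomalyModel": ...

    @abstractmethod
    def score(self, X: Sequence[Sequence[float]]) -> list[float]:
        """Return anomaly scores; higher ⇒ more anomalous."""


class MeanDistanceModel(BaseAnomalyModel):
    """Toy model: score = Euclidean distance from the per-feature training mean."""

    def fit(self, X):
        n = len(X)
        # Per-feature mean of the training data.
        self.mean_ = [sum(col) / n for col in zip(*X)]
        return self

    def score(self, X):
        return [
            sum((x - m) ** 2 for x, m in zip(row, self.mean_)) ** 0.5
            for row in X
        ]


model = MeanDistanceModel().fit([[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]])
print(model.score([[0.0, 0.0], [5.0, 5.0]]))  # the outlier scores higher
```

The toy scoring rule is illustrative only; what matters for the spec is the subclass-plus-override shape and the "higher ⇒ more anomalous" convention.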
anml_exp.registry stores model binaries and metadata under a semantic version
(MAJOR.MINOR.PATCH) with SHA-256 digests.
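As a sketch of how such a digest string might be produced (the helper name `artefact_digest` is hypothetical, not part of the documented `anml_exp.registry` API):

```python
import hashlib
import os
import tempfile
from pathlib import Path


def artefact_digest(path: Path) -> str:
    """Hypothetical helper: stream a model binary and return the
    'sha256:<hex>' string stored alongside its semantic version."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks; model artefacts can be large.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return f"sha256:{h.hexdigest()}"


# Example: digest a small stand-in artefact.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"model-bytes")
print(artefact_digest(Path(tmp.name)))
os.unlink(tmp.name)
```

Recording the digest next to the `MAJOR.MINOR.PATCH` version lets benchmark results (see `artefact_digest` in the result JSON) be traced back to an exact binary.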
⸻

```python
from anml_exp.data import load_dataset

X_train, y_train = load_dataset("kddcup99", split="train")
```

- Each dataset module must declare SHA-256 hashes for every file.
- `load_dataset` verifies each hash before extraction; a mismatch raises `HashError` (#44).
- Deterministic splits (seed = 42).

⸻

### 3.3 Metrics & Result Schema

Benchmarks report:

- ROC-AUC
- PR-AUC (Average Precision)
- F1 @ best Youden threshold
- Mean wall-time per 1 000 samples

Each run is saved to `results/{exp_name}/{model_name}.json` and must validate against `anml_exp/resources/results-schema.json`.

Structured hardware descriptor (#45):

```json
"hardware": {
  "device_type": "GPU",
  "vendor": "NVIDIA",
  "model": "RTX A6000",
  "driver": "535.104",
  "num_devices": 1,
  "notes": "desktop workstation"
}
```

Minimal example:

```json
{
  "$schema": "./results-schema.json",
  "dataset": "kddcup99",
  "model": "isolation_forest",
  "model_version": "0.1.0",
  "n_samples": 145586,
  "seed": 42,
  "hardware": {
    "device_type": "CPU",
    "vendor": "Intel",
    "model": "i7-1185G7",
    "driver": "N/A",
    "num_devices": 1,
    "notes": "laptop"
  },
  "roc_auc": 0.921,
  "pr_auc": 0.604,
  "f1": 0.432,
  "threshold": 0.79,
  "fit_time": 1.23,
  "score_time": 0.02,
  "params": {"n_estimators": 100, "max_samples": "auto"},
  "artefact_digest": "sha256:13f0…"
}
```

⸻

## 4 · Agent Roles

| Agent | Intent | Success Criteria |
|---|---|---|
| Builder | Generate / extend code (models, loaders, registry). | API compliance, passes tests, artefact registered. |
| Evaluator | Run benchmarks & aggregate metrics. | JSON validates, hardware descriptor correct. |
| Reviewer | Static analysis, typing, docs, tests, perf. | CI green (ruff, mypy, pytest, hash check, lock diff). |

⸻

## 5 · Contribution Workflow

```mermaid
flowchart TD
    draft["Builder → Draft PR"]
    review["Reviewer → CI checks"]
    maintainer["Human → Merge / Request changes"]
    draft --> review --> maintainer
```

CI additionally ensures:

- `uv sync --frozen` produces an identical environment (#42).
- Dataset SHA-256s match declared values (#44).

⸻

## 6 · Coding Standards

- Dependency lock: `uv.lock` is the single source of truth.
- PEP 8 via `ruff`; PEP 561 typing (`mypy --strict`).
- Speed up `mypy` in CI by caching `.mypy_cache` and installing `mypy[faster-cache]` via `uv pip`, so the environment runs the optimized cache wheels.
- NumPy-style docstrings.
- `pyproject.toml` + `uv.lock` define mandatory and optional extras.

⸻

## 7 · Testing Strategy

- Unit + property tests.
- Hash-verification tests for every dataset file.
- CI fails if `uv lock --check` detects drift.
- Perf suite (`tests/perf/`) skipped in CI.

⸻

## 8 · Installation & Quick-Start

```bash
# Reproducible dev install
uv sync --frozen
pip install -e ".[torch,plot]"

# After release:
pip install anml-exp[torch,plot]
```

CLI:

```bash
anml-exp benchmark --dataset toy-blobs \
    --model isolation_forest \
    --output results/demo.json
```

⸻

## 9 · Road-Map

| Milestone | Owner | Exit Criteria |
|---|---|---|
| M0 – Skeleton | Builder | Base class, dataset registry (SHA-256), artefact registry, CI, `uv.lock`. |
| M1 – Classical Benchmark | Evaluator | 3 tabular datasets; JSON outputs pass new schema. |
| M2 – Deep Models | Builder | AutoEncoder, DeepSVDD, USAD registered & versioned. |
| M3 – Time-Series Support | Builder + Evaluator | Loader + STOMP baseline + benchmarks. |

⸻

## 10 · Open Questions

1. Unified config system (`omegaconf`) – still pending.
2. Preferred experiment tracker (`mlflow`, `wandb`, plain JSON).
3. CPU vs GPU determinism in CI.
4. Sandboxing policy for code-gen agents.

⸻

## 11 · Meta

- `_spec_version` bumped → 0.3.3 (adds #42–#45).
- See `CONTRIBUTING.md` for human-targeted guidelines.
- `results-schema.json` is the machine-readable contract.

Last updated – 2025-07-20 @ 20:55 AEST