<h1 align="center">
<a href="https://prompts.chat">
<p align="center">
Sign in to like and favorite skills
A comprehensive tool for evaluating AI-powered scanners in detecting various types of content using CalypsoAI API.
Major Release: v2.0.0 - October 2025
This major release introduces significant improvements to dataset handling, format support, and tooling. Key highlights include multiple dataset formats (JSONL, CSV, TSV, Parquet), enhanced migration tools, improved Hugging Face integration, and comprehensive documentation updates.
📋 View Complete Release Notes - Detailed changelog, migration guide, and new features
prompt_evaluator_concurrent.py: Concurrent evaluator for load testing and high-speed processingreport_generator.py: Generate professional PDF reports from evaluation resultstools/improved_dataset_converter.py: Convert between all formats with validationtools/enhanced_dataset_reader.py: Unified reader that auto-detects formatstools/migrate_datasets.py: Batch migration tool for existing datasetstools/demo_improved_formats.py: Complete demonstration of new features# Sequential evaluation with performance metrics and auto-generated report python prompt_evaluator.py --input datasets/test_dataset.jsonl # Concurrent evaluation for faster processing (10 workers) python prompt_evaluator_concurrent.py -i datasets/test_dataset.jsonl -c 10 # Migrate datasets to JSONL format python tools/migrate_datasets.py --list # See current formats python tools/migrate_datasets.py --all --output-format jsonl # Migrate all # Generate PDF report manually python report_generator.py --dataset test_dataset
Install dependencies:
pip install -r requirements.txt
Set up environment variables: Create a
.env file with your CalypsoAI API credentials:
CALYPSOAI_URL=https://calypsoai.app CALYPSOAI_TOKEN=your_api_token_here
Run evaluation:
python prompt_evaluator.py --input datasets/prompt_inject_dataset.csv
PDF report is automatically generated! Check
results/ directory
Generate reports manually (optional):
# Use just the dataset name (not the full path) python report_generator.py --dataset pii_dataset
make commands| Tool | Description | Speed | Best For |
|---|---|---|---|
| Sequential evaluation with full metrics | Baseline | Accuracy testing, small datasets, establishing baseline |
| Advanced evaluation with LLM responses | Baseline | Detailed analysis requiring full response text |
| Parallel evaluation with workers | 5-20x faster | Large datasets, load testing, production scenarios |
| Re-calculate metrics without API calls | Instant | Re-analyzing existing results, metric verification |
| Tool | Purpose | Output |
|---|---|---|
| Generate professional PDF reports | PDF with charts, confusion matrices, performance metrics |
| Download datasets from Hugging Face | Dataset files in local directory |
tools/ directory)| Tool | Purpose | Key Features |
|---|---|---|
| Read any dataset format | Auto-detection, unified interface, metadata support |
| Convert between formats | Validation, proper escaping, metadata preservation |
| Batch migrate datasets | Automatic backups, format detection, safety checks |
| Interactive demonstration | Shows all formats, migration examples, comparisons |
⚠️ Important: The concurrent evaluator can trigger infrastructure auto-scaling. See Concurrent Evaluation Documentation for safety guidelines before use.
# Run evaluation on a dataset (auto-generates PDF report) python prompt_evaluator.py --input datasets/pii_dataset.jsonl # Limit number of prompts processed python prompt_evaluator.py -i datasets/large_dataset.jsonl -l 100 # Specify output format for false positive/negative files python prompt_evaluator.py -i datasets/test.jsonl --format csv
# Run with default concurrency (10 workers) python prompt_evaluator_concurrent.py -i datasets/pii_dataset.jsonl # Adjust concurrency level python prompt_evaluator_concurrent.py -i datasets/large_dataset.jsonl -c 20 # Add rate limiting (50 requests/second) python prompt_evaluator_concurrent.py -i datasets/test.jsonl --rate-limit 50 # Quick test with small sample python prompt_evaluator_concurrent.py -i datasets/test.jsonl -l 50 -c 5
# 1. Establish baseline with sequential python prompt_evaluator.py -i datasets/test.jsonl # 2. Compare with concurrent python prompt_evaluator_concurrent.py -i datasets/test.jsonl -c 10 # 3. Review both reports to compare throughput and latency # Check results/test_results.jsonl vs results/test_concurrent_results.jsonl
# Check current dataset formats python tools/migrate_datasets.py --list # Migrate all datasets to JSONL python tools/migrate_datasets.py --all --output-format jsonl # Migrate specific dataset python tools/migrate_datasets.py --input datasets/old_format.csv --output-format jsonl # Convert without backups (use with caution) python tools/migrate_datasets.py --input datasets/test.csv --output-format jsonl --no-backup
# Reports are auto-generated after evaluation, but you can also generate manually: # Generate from dataset name (looks in results/ directory) python report_generator.py --dataset pii_dataset # Auto-detect if only one results file exists python report_generator.py
# Re-calculate metrics from existing results (no API calls) python evaluate_existing_results.py --input results/pii_dataset_results.jsonl # Useful after fixing dataset labels or for metric verification
# Advanced evaluation with full LLM responses python prompt_evaluator_prompts.py --input datasets/test.jsonl # Download datasets from Hugging Face python download_hf_datasets.py # Convert dataset formats with validation python tools/improved_dataset_converter.py input.csv output.jsonl --validate # Run full demo of dataset features python tools/demo_improved_formats.py
# Run concurrent evaluation make run-concurrent DATASET=datasets/pii_dataset.jsonl # With custom settings make run-concurrent DATASET=datasets/test.jsonl CONCURRENCY=20 LINES=100 # Run tests make test-core # Core functionality tests make test-concurrent # Test concurrent evaluator # Code quality make lint # Run linting make format-fix # Format code with black make pre-commit # Run all pre-commit checks
The project includes several sample datasets in the
datasets/sample-datasets/ folder:
codesagar_malicious_llm_prompts_v4_test.jsonl - Prompt injection examplespii_dataset.jsonl - Personally identifiable information examplesfin_advice_dataset.jsonl - Financial advice promptseu-ai-act-prompts.jsonl - EU AI Act compliance promptsxTRam1_safe_guard_prompt_injection_test.jsonl - Additional prompt injection test casesThroughput: Measures how many prompts you can process per second. Essential for capacity planning.
Latency Percentiles: Show the distribution of response times, not just averages.
See Evaluation Metrics Guide for detailed explanations.
Use Sequential When:
Use Concurrent When:
Important: Always check with infrastructure teams before running concurrent evaluation against shared environments.
See Sequential vs Concurrent Comparison for detailed guidance.
Use JSONL (Recommended):
Use CSV/TSV When:
Use Parquet When:
See Improved Dataset Formats for migration guide.
--validate flagFor common issues and solutions, see the Troubleshooting Guide.
Quick fixes:
.env file with API credentials--format-hint parameter or check file extensionThis project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please see our Contributing Guide for details.