This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Multimodal video analysis system that performs character-dialogue matching by processing audio and visual streams in parallel, then fusing results using advanced heuristic techniques with confidence auto-calibration.
Core Technologies: Whisper ASR, InsightFace (RetinaFace + ArcFace), SORT tracking, wav2vec2, Pyannote, DeepFace
Current Performance: 74-83% dialogue-character match rate (achieved through optimized robust pipeline)
Project Type: Research/Analysis tool (MIT Licensed)
Repository: https://github.com/material-lab-io/Hendrix_Character_Dialogue_Analysis
Language: Python 3.8+
Hardware: GPU with CUDA recommended (5-10x speedup)
```
Video Input
├── Audio Branch  → Schema A (transcriptions + emotions)
│                 → Schema B (speaker diarization)
└── Visual Branch → Schema C (character detection + tracking)
        ↓
      Fusion      → Schema D (character-dialogue matches)
```
- `audio_processing_branch/src/audio/` - Whisper, emotion, diarization processors
- `visual_processing_branch/src/visual/` - InsightFace detection, SORT tracking, adaptive frame extraction
- `scene_detector.py` - Detects scene boundaries using scenedetect
- `scene_aware_character_clustering.py` - Clusters characters within scenes
- `visual_processing_branch/src/fusion/` - Advanced character matcher with continuity tracking

The fusion system (`advanced_character_matcher.py`) employs multiple scoring factors to decide which character speaks each dialogue segment.
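A minimal sketch of how multi-factor scoring of this kind can be combined into a single match score. The factor names and weights below are illustrative assumptions for explanation only, not the actual logic in `advanced_character_matcher.py`:

```python
# Illustrative only: hypothetical factor names and weights, not the real matcher.
from typing import Dict

def score_candidate(factors: Dict[str, float],
                    weights: Dict[str, float]) -> float:
    """Combine per-factor scores (each in [0, 1]) into one weighted match score."""
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * factors.get(name, 0.0) for name in weights)
    return weighted / total_weight if total_weight else 0.0

# Example: score one character against one dialogue segment.
candidate_factors = {
    "temporal_overlap": 0.9,    # character visible during the dialogue segment
    "lip_activity": 0.7,        # lip movement detected near the segment
    "speaker_continuity": 0.8,  # same speaker cluster as the previous match
}
weights = {"temporal_overlap": 0.5, "lip_activity": 0.3, "speaker_continuity": 0.2}
print(score_candidate(candidate_factors, weights))  # ~0.82
```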
```bash
# Clone repository
git clone https://github.com/material-lab-io/Hendrix_Character_Dialogue_Analysis.git
cd Hendrix_Character_Dialogue_Analysis

# Environment setup (required)
export HF_TOKEN=your_huggingface_token  # Required for speaker diarization
export TF_USE_LEGACY_KERAS=1            # Required for DeepFace compatibility

# Virtual environment (from project root)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies (includes both audio and visual)
cd visual_processing_branch
pip install -r requirements.txt

# Download models (optional - will auto-download on first use)
python scripts/setup_models.py

# Verify setup
cd ../audio_processing_branch
python scripts/test_setup.py
```
```bash
# OPTIMIZED ROBUST PIPELINE - Best performance (74-83% match rate) - RECOMMENDED
cd visual_processing_branch
python scripts/run_optimized_robust_pipeline.py video.mp4

# Legacy pipeline (older version)
python scripts/run_complete_pipeline_v2.py video.mp4 --target-frames 600

# EXPERIMENTAL: Scene-aware pipeline (still in development)
# python scripts/run_scene_aware_pipeline.py video.mp4 --scene-threshold 30.0

# Evaluate results
python scripts/evaluate_pipeline_output.py output/optimized_robust/session_*
```
```bash
# Audio only
cd audio_processing_branch
python scripts/complete_audio_pipeline.py video.mp4 --whisper-model base

# Visual only
cd visual_processing_branch
python scripts/tracked_visual_pipeline.py video.mp4 --target-frames 300
```
```bash
# Verify setup
cd audio_processing_branch && python scripts/test_setup.py

# Convert problematic video formats
cd visual_processing_branch && python scripts/video_converter.py input.mp4 output_h264.mp4

# Test performance
python scripts/test_tracking_performance.py video.mp4

# Download test videos
cd visual_processing_branch && python scripts/download_test_videos.py

# Download required models
cd visual_processing_branch && python scripts/setup_models.py

# Validate schemas
python scripts/validate_all_schemas.py output/optimized_robust/session_*

# Display schemas in human-readable format
python scripts/display_all_schemas.py output/optimized_robust/session_*

# Fix schema issues in existing files
python scripts/fix_schema_issues.py output/optimized_robust/session_*

# Test long video simulation
python scripts/test_long_video_simulation.py

# Test scene-aware clustering
python scripts/test_scene_aware_clustering.py --video video.mp4
```
```bash
# Audio testing
cd audio_processing_branch
python scripts/test_whisper_component.py video.mp4
python scripts/test_emotion_component.py video.mp4
python scripts/test_diarization_component.py video.mp4

# Visual testing
cd visual_processing_branch
python scripts/test_insightface_detection.py video.mp4
python scripts/test_complete_fusion.py video.mp4
python scripts/test_heuristic_matching.py video.mp4
python scripts/test_pipeline_performance.py video.mp4
```
```bash
# Use production configurations
cd visual_processing_branch
python scripts/run_with_config.py video.mp4 --config default

# Analyze multiple videos
python scripts/test_multiple_videos.py test_videos/  # Process directory

# Preprocess videos
python scripts/video_preprocessor.py input.mp4 output/  # Extract frames/audio
```
- `audio_processing_branch/src/audio/emotion_processor.py`
- `scene_detector.py` - Uses scenedetect library for boundary detection
- `scene_aware_character_clustering.py` - Two-stage clustering (intra-scene then inter-scene)

```
output/optimized_robust/session_YYYYMMDD_HHMMSS/
├── audio_output/
│   └── {video_name}_YYYYMMDD_HHMMSS/
│       └── schemas/
│           ├── schema_a_transcription.json    # Transcriptions with emotions
│           └── schema_b_speakers.json         # Speaker diarization
├── visual_output/
│   ├── character_data_schemaC.json            # Character detections with embeddings
│   ├── tracking_data.json                     # SORT tracking data
│   ├── visual_pipeline_report.txt             # Processing summary
│   ├── extraction_stats.json                  # Frame extraction statistics
│   └── lip_movement_data.pkl                  # Lip sync data
└── fusion_output/
    ├── schema_d_matches.json                  # Character-dialogue matches
    ├── optimized_matching_report.md           # Human-readable summary
    └── character_profiles.json                # Enhanced character profiles

# Scene-aware pipeline adds:
output/scene_aware/scene_aware_session_YYYYMMDD_HHMMSS/
├── scene_data/
│   └── detected_scenes.json                   # Scene boundaries
├── visual_output/
│   ├── scene_clustering_results.json          # Scene-based clustering
│   └── character_data_schemaC_scenes.json     # Enhanced with scene info
└── fusion_output/
    └── scene_aware_matching_report.md         # Scene-aware analysis
```
- `fusion_output/optimized_matching_report.md` - Start here! Human-readable summary
- `fusion_output/schema_d_matches.json` - Detailed character-dialogue matches
- `visual_output/visual_pipeline_report.txt` - Character detection summary

See OUTPUT_FILES_GUIDE.md for detailed instructions on viewing results.
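For inspecting matches programmatically rather than through `display_all_schemas.py`, here is a minimal sketch. It assumes `schema_d_matches.json` holds a list of match objects using the Schema D field names listed further below; the real JSON layout may differ, so verify against an actual output file:

```python
# Hedged sketch: field names follow the Schema D description later in this file.
import json
from pathlib import Path

sessions = list(Path("output/optimized_robust").glob("session_*"))
latest = max(sessions, key=lambda p: p.name)  # newest session by timestamp suffix
matches_file = latest / "fusion_output" / "schema_d_matches.json"

data = json.loads(matches_file.read_text())
matches = data if isinstance(data, list) else data.get("matches", [])
for match in matches[:5]:
    print(match.get("character_id"), match.get("matching_score"), match.get("dialogue"))
```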
- `--target-frames`: More frames = better matching (default: 600)
- `--extraction-mode`: adaptive (recommended), uniform, intelligent
- `--min-appearances`: Filter noise by requiring multiple detections

The project uses predefined configurations in `configs/production_configs.py`.
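The contents of `configs/production_configs.py` are not reproduced here. As a rough illustration of how named configurations tie the parameters above together, a hedged sketch follows; the dictionary keys and values are assumptions, and only the config names (`default`, `dialogue`, `action`, `dark_scenes`, `crowd_scenes`) come from the CLI reference below:

```python
# Hypothetical shape of a production config entry; check configs/production_configs.py
# for the real keys and values.
PRODUCTION_CONFIGS = {
    "default": {
        "target_frames": 600,         # more frames generally improves matching
        "extraction_mode": "adaptive",
        "min_appearances": 3,         # filter out spurious one-off detections
    },
    "dialogue": {
        "target_frames": 800,         # denser sampling for dialogue-heavy videos
        "extraction_mode": "adaptive",
        "min_appearances": 2,
    },
}

def get_config(name: str) -> dict:
    """Return a named config, falling back to the default profile."""
    return PRODUCTION_CONFIGS.get(name, PRODUCTION_CONFIGS["default"])
```

At the command line these profiles are selected by name, e.g. `python scripts/run_with_config.py video.mp4 --config dialogue`.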
- `visual_processing_branch/src/schemas.py` - All data structures (now with scene support)
- `visual_processing_branch/src/fusion/advanced_character_matcher.py` - Character continuity tracking
- `visual_processing_branch/src/visual/adaptive_frame_extractor.py` - Dialogue-aware sampling
- `visual_processing_branch/src/visual/robust_frame_extractor.py` - Multi-level fallback
- `visual_processing_branch/src/visual/scene_detector.py` - Scene boundary detection
- `visual_processing_branch/src/visual/scene_aware_character_clustering.py` - Scene-aware character grouping
- `visual_processing_branch/scripts/run_optimized_robust_pipeline.py` - Best performance (74-83% match rate)
- `visual_processing_branch/scripts/run_scene_aware_pipeline.py` - Experimental scene clustering

Each schema uses Pydantic models with built-in validation:
- Schema A (transcription): `segment_id`, `text`, `start_time`, `end_time`, `emotion`, `emotion_confidence`
- Schema B (speakers): `segment_id`, `speaker_id`, `start_time`, `end_time`, `confidence`
- Schema C (characters): `character_id`, `num_appearances`, `representative_embeddings`, `appearance_segments`
- Schema D (matches): `match_id`, `character_id`, `dialogue`, `matching_score`, `metadata`

An illustrative Pydantic sketch of these schemas follows the troubleshooting commands below.

```bash
# Missing HF_TOKEN error
echo "HF_TOKEN=your_token" >> .env

# DeepFace Keras error
export TF_USE_LEGACY_KERAS=1

# CUDA not detected
python -c "import torch; print(torch.cuda.is_available())"

# Model download failures
rm -rf ~/.cache/huggingface/hub/models--*  # Clear cache and retry
```
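As referenced above, a minimal Pydantic sketch of the schema fields listed in this section. Field types and optionality are assumptions for illustration; the authoritative definitions live in `visual_processing_branch/src/schemas.py`:

```python
# Illustrative models only: field types are assumptions,
# see visual_processing_branch/src/schemas.py for the real definitions.
from typing import List, Optional
from pydantic import BaseModel

class TranscriptionSegment(BaseModel):  # Schema A
    segment_id: str
    text: str
    start_time: float
    end_time: float
    emotion: str
    emotion_confidence: float

class SpeakerSegment(BaseModel):        # Schema B
    segment_id: str
    speaker_id: str
    start_time: float
    end_time: float
    confidence: float

class CharacterRecord(BaseModel):       # Schema C
    character_id: str
    num_appearances: int
    representative_embeddings: List[List[float]]
    appearance_segments: List[dict]

class DialogueMatch(BaseModel):         # Schema D
    match_id: str
    character_id: str
    dialogue: str
    matching_score: float
    metadata: Optional[dict] = None
```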
```bash
# Best quality analysis (recommended)
cd visual_processing_branch && python scripts/run_optimized_robust_pipeline.py video.mp4

# Quick analysis with parameters
python scripts/run_optimized_robust_pipeline.py video.mp4 --whisper-model tiny

# Batch processing
for video in test_videos/*.mp4; do
    python scripts/run_optimized_robust_pipeline.py "$video"
done

# Check results
python scripts/display_all_schemas.py output/optimized_robust/session_* | less

# Validate schemas
python scripts/validate_all_schemas.py output/optimized_robust/session_*
```
```bash
# Audio pipeline parameters
--whisper-model {tiny,base,small,medium,large,large-v2,large-v3}
--emotion-model MODEL_NAME
--num-speakers N       # Force N speakers for diarization
--min-speakers N       # Minimum speakers to detect
--max-speakers N       # Maximum speakers to detect

# Visual pipeline parameters
--target-frames N      # Number of frames to extract
--min-appearances N    # Minimum detections for character
--extraction-mode {adaptive,uniform,intelligent}
--config {default,dialogue,action,dark_scenes,crowd_scenes}

# General parameters
--output OUTPUT_DIR    # Custom output directory
--verbose              # Enable detailed logging
```
- Component and integration tests: `scripts/test_*.py` (e.g. `test_multiple_videos.py`, `test_long_video_simulation.py`)
- `validate_all_schemas.py` ensures data integrity
- `audio_processing_branch/` - Audio analysis pipeline
- `visual_processing_branch/` - Visual processing and fusion
- `visual_processing_branch/src/visual/` - Adaptive and robust frame extractors
- `visual_processing_branch/src/fusion/` - Advanced character matching
- `visual_processing_branch/test_videos/` - Sample videos for testing
- `output/` - All pipeline outputs organized by session
- `test_*.py` scripts are for testing individual components
- `run_*.py` scripts are for running full pipelines
- `evaluate_*.py` scripts are for analyzing results
- `validate_*.py` scripts are for data validation
- `display_*.py` scripts are for human-readable output

```bash
# Use pytest for any unit tests (if created)
pytest tests/

# Most testing is done via standalone scripts
python scripts/test_*.py   # Component-specific tests
```
| Metric | Current | Target | Status |
|---|---|---|---|
| Match Rate | 74-83% | 80%+ | ✓ Achieved |
| Character Count | ~22 | <15 | In progress |
| Processing Speed | 3x realtime | 5x realtime | In progress |
| Memory Usage | 4-8GB | <4GB | Optimized |
| GPU Utilization | 60-80% | 90%+ | Good |
| Schema Validity | 100% | 100% | ✓ Achieved |
When contributing to this project:
- Target branch: `master`
- Add component tests as standalone scripts (`test_*.py` in `scripts/`)
- Models are cached in `~/.cache/huggingface/` (~20GB total) and `~/.insightface/models/`
- A `.env` file in the project root is auto-loaded
- Set `HF_TOKEN` for Pyannote diarization
- Set `TF_USE_LEGACY_KERAS=1` for DeepFace
- Monitor GPU usage with `nvidia-smi`
- Use `run_optimized_robust_pipeline.py`, not v2
- Convert problematic video formats with `video_converter.py`

For detailed output file documentation, see OUTPUT_FILES_GUIDE.md
For long video performance analysis, see LONG_VIDEO_ANALYSIS.md