# Markdown Converter
Agent skill for markdown-converter
This is an automated Franchise Disclosure Document (FDD) processing pipeline that scrapes, downloads, processes, and extracts structured data from FDD documents filed with state franchise portals.
## Architecture

### Web Scrapers (NOTE: Architecture in transition)

Planned modular structure (`scrapers/` - not yet implemented; a sketch follows the lists below):

- `base/base_scraper.py`: Abstract base class with common scraping functionality
- `base/exceptions.py`: Custom exception hierarchy
- `states/minnesota.py`: `MinnesotaScraper` class
- `states/wisconsin.py`: `WisconsinScraper` class
- `utils/scraping_utils.py`: Common utilities

Current implementation (`franchise_scrapers/`):

- `MN_Scraper.py`: Standalone Minnesota scraper
- `WI_Scraper.py`: Standalone Wisconsin scraper
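Since the planned base class is not implemented yet, the following is only a sketch of how `base/base_scraper.py` and a state scraper might fit together; every method name here is an assumption rather than the repo's actual API.

```python
# Hypothetical sketch of the planned scrapers/ layout; nothing here is
# implemented in the repo yet, and all names are assumptions.
from abc import ABC, abstractmethod


class ScraperError(Exception):
    """Root of the custom exception hierarchy (base/exceptions.py)."""


class BaseScraper(ABC):
    """Common scraping functionality shared by the state portal scrapers."""

    def __init__(self, portal_url: str) -> None:
        self.portal_url = portal_url

    @abstractmethod
    def fetch_filings(self) -> list[dict]:
        """Return metadata for FDD filings listed on the state portal."""

    def download(self, filing: dict) -> bytes:
        """Shared download/retry logic would live here."""
        raise NotImplementedError


class MinnesotaScraper(BaseScraper):
    """Portal-specific listing logic (states/minnesota.py)."""

    def fetch_filings(self) -> list[dict]:
        # Parse the Minnesota portal's search results here.
        return []
```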
### Document Processing (`processing/`)

- `mineru/mineru_processing.py`: Task wrapper for MinerU Web API
- `mineru/mineru_web_api.py`: MinerU Web API client with browser authentication
- `segmentation/document_segmentation.py`: FDD section detection and boundary extraction (see the fuzzy-matching sketch after this list)
- `segmentation/enhanced_detector.py`: Advanced section detection using Claude
- `pdf/pdf_extractor.py`: Basic PDF text extraction utilities
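Section detection has to locate the standard FDD item headers in noisy extracted text; the `ENHANCED_DETECTION_MIN_FUZZY_SCORE` setting (see Configuration below) suggests fuzzy string matching. Here is a minimal sketch of that idea using `rapidfuzz`; the header list and helper are illustrative, not the module's real API.

```python
# Fuzzy FDD item-header matching, assuming rapidfuzz; the real
# segmentation/document_segmentation.py interface may differ.
from rapidfuzz import fuzz

# FDDs follow a fixed 23-item outline; two headers shown for brevity.
ITEM_HEADERS = {
    1: "ITEM 1 THE FRANCHISOR AND ANY PARENTS PREDECESSORS AND AFFILIATES",
    5: "ITEM 5 INITIAL FEES",
}

MIN_FUZZY_SCORE = 80  # mirrors ENHANCED_DETECTION_MIN_FUZZY_SCORE


def match_item_header(line: str) -> int | None:
    """Return the FDD item number a line most likely introduces."""
    for item_no, header in ITEM_HEADERS.items():
        if fuzz.partial_ratio(line.upper(), header) >= MIN_FUZZY_SCORE:
            return item_no
    return None
```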
### Data Extraction (`processing/extraction/`)

- `llm_extraction.py`: Multi-model LLM framework with routing and fallback (sketched after this list)
- `multimodal.py`: Handles image and table extraction
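The environment variables below mark OpenAI as an optional fallback and Ollama as the local option, which implies a try-in-order router. A minimal sketch of that pattern follows, with a hypothetical `call_provider` dispatcher standing in for the real per-provider clients.

```python
# Routing-with-fallback sketch; llm_extraction.py's real interfaces and
# error handling may differ, and call_provider is a hypothetical stub.
import os


def call_provider(name: str, prompt: str) -> str:
    """Stand-in for the per-provider clients (Gemini, OpenAI, Ollama)."""
    raise NotImplementedError(f"wire up the {name} client here")


def extract_with_fallback(prompt: str) -> str:
    """Try Gemini first, then OpenAI, then a local Ollama model."""
    providers = [
        ("gemini", os.getenv("GEMINI_API_KEY")),
        ("openai", os.getenv("OPENAI_API_KEY")),   # optional, for fallback
        ("ollama", os.getenv("OLLAMA_BASE_URL")),  # local models
    ]
    last_error: Exception | None = None
    for name, configured in providers:
        if not configured:
            continue
        try:
            return call_provider(name, prompt)
        except Exception as exc:  # real code should catch narrower errors
            last_error = exc
    raise RuntimeError("all configured LLM providers failed") from last_error
```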
### Data Models (`models/`)

### Workflows (`workflows/`)
- `base_state_flow.py`: Generic state scraping flow (unified implementation; its composition with the configs is sketched after this list)
- `state_configs.py`: State-specific configurations
- `process_single_pdf.py`: Single PDF processing flow
- `complete_pipeline.py`: End-to-end orchestration flow
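A sketch of how the generic flow and per-state configs might compose; the signatures are assumptions inferred from the file names, and the `StateConfig` fields match the example in "Adding a New State" below.

```python
# Hypothetical composition of base_state_flow.py and state_configs.py;
# the actual signatures in the repo may differ.
from dataclasses import dataclass


@dataclass
class StateConfig:
    state_code: str
    state_name: str
    scraper_class: type
    folder_name: str
    portal_name: str


def run_state_flow(config: StateConfig, limit: int | None = None) -> None:
    """Generic flow: scrape listings, then download and process each filing."""
    scraper = config.scraper_class(config)  # assumed constructor shape
    for filing in scraper.fetch_filings()[:limit]:
        ...  # download, upload to Drive, queue MinerU processing
```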
### Storage (`storage/`)

- `google_drive.py`: Google Drive integration for document storage (a standalone upload sketch follows this list)
- `database/manager.py`: Database connection and operations management
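The `GDRIVE_CREDS_JSON` setting below points at a service-account JSON file, so a bare-bones upload with `google-api-python-client` looks roughly like this; the filename is a placeholder, and `storage/google_drive.py` itself may expose a different interface.

```python
# Minimal service-account Drive upload, independent of storage/google_drive.py.
import os

from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

creds = service_account.Credentials.from_service_account_file(
    os.getenv("GDRIVE_CREDS_JSON", "gdrive_cred.json"),
    scopes=["https://www.googleapis.com/auth/drive"],
)
drive = build("drive", "v3", credentials=creds)

media = MediaFileUpload("fdd.pdf", mimetype="application/pdf")
uploaded = drive.files().create(
    body={"name": "fdd.pdf", "parents": [os.environ["GDRIVE_FOLDER_ID"]]},
    media_body=media,
    fields="id",
).execute()
print(f"uploaded file id: {uploaded['id']}")
```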
### Validation (`validation/`)

- `schema_validation.py`: Pydantic schema validation (a minimal example follows this list)
- `business_rules.py`: Business logic validation
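A minimal illustration of Pydantic-based schema validation; `InitialFee` is a made-up model, since the real schemas live in `models/`.

```python
# Schema validation with Pydantic v2; InitialFee is hypothetical,
# not one of the repo's actual schemas.
from pydantic import BaseModel, ValidationError, field_validator


class InitialFee(BaseModel):
    """Hypothetical model for an Item 5 (Initial Fees) entry."""

    franchisor: str
    amount_cents: int

    @field_validator("amount_cents")
    @classmethod
    def non_negative(cls, value: int) -> int:
        if value < 0:
            raise ValueError("fee cannot be negative")
        return value


try:
    InitialFee.model_validate({"franchisor": "Acme", "amount_cents": -100})
except ValidationError as exc:
    print(exc)  # business_rules.py would layer domain checks on top of this
```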
### API Layer (`src/api/`)

- `main.py`: FastAPI application with REST endpoints
- `run.py`: Uvicorn server runner

## Usage

```bash
# Run for all configured states
python main.py run-all

# Run for specific state (now uses unified base flow)
python main.py scrape --state minnesota
python main.py scrape --state wisconsin

# With options
python main.py scrape --state all --limit 10 --test-mode
```
```bash
# Process a single PDF
python main.py process-pdf --path /path/to/fdd.pdf

# Check system health
python main.py health-check

# Deploy and schedule the orchestration flow
python main.py orchestrate --deploy --schedule
```
## Testing

```bash
# Unit tests
pytest tests/ -m unit

# Integration tests
pytest tests/ -m integration

# All tests
pytest tests/
```
## Database Setup

```bash
# Apply migrations in order
psql -d your_database -f migrations/001_initial_schema.sql
psql -d your_database -f migrations/002_structured_data_tables.sql
# etc...
```
## Configuration

Set the following environment variables:

```bash
# Database
SUPABASE_URL=your_supabase_url
SUPABASE_ANON_KEY=your_anon_key
SUPABASE_SERVICE_KEY=your_service_key

# LLM APIs
GEMINI_API_KEY=your_gemini_key
OPENAI_API_KEY=your_openai_key  # Optional, for fallback
OLLAMA_BASE_URL=http://localhost:11434  # For local models

# Google Drive
GDRIVE_CREDS_JSON=gdrive_cred.json  # Path to service account JSON
GDRIVE_FOLDER_ID=root_folder_id

# MinerU Web API
MINERU_AUTH_FILE=mineru_auth.json  # Browser auth storage

# Section Detection
USE_ENHANCED_SECTION_DETECTION=true
ENHANCED_DETECTION_CONFIDENCE_THRESHOLD=0.7
ENHANCED_DETECTION_MIN_FUZZY_SCORE=80

# Application Settings
DEBUG=false
LOG_LEVEL=INFO
MAX_CONCURRENT_EXTRACTIONS=5
```
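If these live in a `.env` file, they can be loaded into a typed settings object. This is only a sketch assuming `pydantic-settings` is used (and shows a subset of the variables); the project may read its config differently.

```python
# Sketch of typed settings loading with pydantic-settings; whether the
# project actually reads its config this way is an assumption.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    supabase_url: str
    supabase_anon_key: str
    gemini_api_key: str
    debug: bool = False
    log_level: str = "INFO"
    max_concurrent_extractions: int = 5


settings = Settings()  # field names match the env var names case-insensitively
```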
## Recent Changes

### State Scraping Flows Consolidation

- Added `workflows/base_state_flow.py` with generic state scraping logic
- Added `workflows/state_configs.py` for state-specific configurations
- Removed references to the `scrapers/` module structure that doesn't exist yet
- Noted that the standalone scrapers in `franchise_scrapers/` don't match the expected architecture

### Files Cleaned Up
## Adding a New State

To add a new state (e.g., California):
### Option 1: Using the current implementation

Create `franchise_scrapers/CA_Scraper.py` following the MN/WI pattern; a skeletal sketch is below.
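Everything in this sketch - the portal URL, the use of `requests`, the function names - is a placeholder; the real MN/WI scrapers define the pattern to copy.

```python
# Hypothetical skeleton for franchise_scrapers/CA_Scraper.py; copy the real
# structure from MN_Scraper.py / WI_Scraper.py, which may differ from this.
import requests

PORTAL_URL = "https://example.com/ca-franchise-search"  # placeholder URL


def fetch_filings() -> list[dict]:
    """Query the California portal and return filing metadata."""
    response = requests.get(PORTAL_URL, timeout=30)
    response.raise_for_status()
    return []  # parse the search results here


if __name__ == "__main__":
    for filing in fetch_filings():
        print(filing)
```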
### Option 2: Using the planned architecture (requires setup)

1. Create the `scrapers/` directory structure
2. Implement `scrapers/states/california.py` extending `BaseScraper`
3. Add a configuration to `workflows/state_configs.py`:
```python
CALIFORNIA_CONFIG = StateConfig(
    state_code="CA",
    state_name="California",
    scraper_class=CaliforniaScraper,
    folder_name="California FDDs",
    portal_name="CA DBO",
)
```
4. Register it in the `STATE_CONFIGS` dictionary in `state_configs.py` (sketched below)
5. Update `scrapers/states/__init__.py` to export the new scraper
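The registration step might look like this; the dictionary key format and the existing `*_CONFIG` names are assumptions, not verified against the repo.

```python
# In workflows/state_configs.py -- key format and existing config names
# are assumptions.
STATE_CONFIGS = {
    "minnesota": MINNESOTA_CONFIG,
    "wisconsin": WISCONSIN_CONFIG,
    "california": CALIFORNIA_CONFIG,  # new entry
}
```

After this, `python main.py scrape --state california` should pick up the new config.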
## Troubleshooting

### MinerU Authentication Failed

- Delete `mineru_auth.json` and re-authenticate
- Make sure Playwright's browser is installed: `playwright install chromium`
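Since `mineru_auth.json` stores browser auth state (see `MINERU_AUTH_FILE` above), re-authentication presumably captures a fresh session. Here is a sketch of doing that manually with Playwright's `storage_state`; the URL and the flow are assumptions about what `mineru_web_api.py` does.

```python
# Manually capture fresh browser auth state with Playwright; the actual
# re-auth flow in processing/mineru/mineru_web_api.py may differ.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://mineru.net")  # assumed login page; log in manually
    page.pause()                     # resume from the inspector once logged in
    context.storage_state(path="mineru_auth.json")  # MINERU_AUTH_FILE
    browser.close()
```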
### Database Connection Issues

- Verify credentials and connectivity with `python main.py health-check`

### Scraping Failures
- Re-run with `DEBUG=true` to capture detailed logs

### LLM Extraction Errors
- For the local fallback, make sure the model is pulled: `ollama pull llama3`

### Google Drive Upload Issues
### Debugging

Enable detailed logging:
```bash
DEBUG=true LOG_LEVEL=DEBUG python main.py scrape --state minnesota
```