# DarwinG-Crawl

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview

DarwinG-Crawl is a web scraping project focused on crawling and extracting product data from the Forest Market website (forestmarket.net). The project uses multiple Python-based crawling approaches with different libraries and strategies.
## Architecture

The project is organized with crawler utilities in `src/crawler/` and takes a modular approach, with a separate script for each crawling strategy (see the crawler scripts and entry points below).
The project relies on these main libraries:

- `crawl4ai` - Primary crawling framework with browser automation
- `playwright` - Alternative browser automation for manual intervention scenarios
- `asyncio` - Asynchronous programming for efficient crawling
- Standard library modules: `csv`, `json`, `re`, `urllib.parse`, `datetime`
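For orientation, the crawl4ai flow the scripts build on looks roughly like this (a minimal sketch using crawl4ai's documented quickstart API, not code from this repo; the URL is just an example):

```python
# Minimal crawl4ai sketch, not a script from this repo.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # arun() drives a headless browser and returns the rendered page
        result = await crawler.arun(url="https://www.forestmarket.net/en-US")
        print(result.markdown)  # markdown rendering of the crawled page

asyncio.run(main())
```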
## Development Commands

This project uses `uv` for fast dependency management. Install dependencies:
```bash
# Install dependencies
uv sync

# Install with development dependencies
uv sync --extra dev

# Install browser for Playwright
uv run playwright install chromium
```
### Crawl Product URLs

```bash
# Get URLs for all countries → ../fm_data/fm_url_YYYYMMDD_HHMMSS.csv
uv run crawl-multi-url

# Get URLs for specific countries
uv run crawl-multi-url --locations "United States" Japan
uv run crawl-multi-url --locations en-US en-SG en-HK

# List available countries
uv run crawl-multi-url --list-locations

# Custom output location
uv run crawl-multi-url --output custom_urls.csv
```
### Extract Product Details

```bash
# Extract details from CSV with URLs
uv run python crawler/crawl_fm_detailed.py --input fm_data/url/fm_url_20240129_143052.csv

# Extract from specific countries only
uv run python crawler/crawl_fm_detailed.py --input fm_data/url/fm_url_20240129_143052.csv --countries en-US en-SG

# Limit number of products
uv run python crawler/crawl_fm_detailed.py --input fm_data/url/fm_url_20240129_143052.csv --max-products 10

# Custom output filename
uv run python crawler/crawl_fm_detailed.py --input fm_data/url/fm_url_20240129_143052.csv --output detailed_products.json

# Extract single product URL (NEW)
uv run python crawler/crawl_fm_detailed.py --url "https://www.forestmarket.net/en-US/product/ABC123"
uv run python crawler/crawl_fm_detailed.py --url "https://www.forestmarket.net/product/XYZ789" --output single_product.json
```
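To spot-check a detailed-extraction run, something like the following works, assuming the output JSON is a list of product records that each carry a `url` field (the actual schema is whatever `crawl_fm_detailed.py` writes):

```python
# Hypothetical spot-check of a detailed-crawl JSON file.
# Assumes a list of product dicts, each with a "url" key; adjust to the real schema.
import json

with open("detailed_products.json", encoding="utf-8") as f:
    products = json.load(f)

print(f"{len(products)} products extracted")
print("first product URL:", products[0].get("url"))
```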
### All Crawler Entry Points

```bash
# URL crawlers
uv run crawl-url          # Basic URL crawler (legacy)
uv run crawl-multi-url    # Enhanced multi-location URL crawler

# Detailed data crawlers
uv run crawl-detailed     # Detailed product extraction

# Legacy crawlers
uv run crawl-basic        # Basic crawl4ai crawler
uv run crawl-fm           # Forest Market comprehensive crawler
uv run crawl-manual       # Interactive crawler with manual intervention
```
### Crawler Scripts

- `crawl_fm_url_enhanced.py` - Enhanced multi-location URL crawler; writes `../fm_data/fm_url_{timestamp}.csv` by default
- `crawl_fm_detailed.py` - Detailed product extraction
- `legacy/crawl.py` - Simple URL extraction with "View More" automation
- `crawl_fm.py` - Combined URL discovery and product extraction
- `legacy/crawl_manual_pause.py` - Playwright-based with user interaction

### Output Files

- URL crawler (`crawl_fm_url_enhanced.py`):
  - `../fm_data/fm_url_YYYYMMDD_HHMMSS.csv` with header `product_id,en-US,en-SG,en-HK,en-KR,en-JP`
  - `../fm_data/fm_url_YYYYMMDD_HHMMSS_failed_locations.csv` for locations that could not be crawled
- Detailed crawler (`crawl_fm_detailed.py`):
  - `forest_market_detailed_products.csv` (or custom name)
  - `forest_market_detailed_products_failed_urls.csv` for URLs that could not be extracted
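Given the header above, a minimal sketch for loading a URL CSV might look like this (the per-locale cells are assumed to hold that locale's product URL, empty when a locale is missing):

```python
# Read the URL crawler's CSV and group product URLs by locale.
# Assumes locale columns hold the product URL; empty cells mean no URL for that locale.
import csv
from collections import defaultdict

LOCALES = ("en-US", "en-SG", "en-HK", "en-KR", "en-JP")
urls_by_locale = defaultdict(list)

with open("fm_data/url/fm_url_20240129_143052.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for locale in LOCALES:
            if row.get(locale):
                urls_by_locale[locale].append(row[locale])

for locale, urls in urls_by_locale.items():
    print(locale, len(urls))
```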
## Helper Tools

The project includes utility tools in the `helper/` directory for data processing and analysis:
### JSON to Markdown Converter (`helper/json_to_markdown.py`)

Converts JSON crawler output to readable markdown format with complete data preservation.
```bash
# Convert JSON to markdown with all data preserved
uv run python helper/json_to_markdown.py --input fm_data/json/fm_detail_20250730_170122.json

# Specify custom output file
uv run python helper/json_to_markdown.py --input fm_data/json/fm_detail_20250730_170122.json --output products.md

# Limit to first N products for testing
uv run python helper/json_to_markdown.py --input fm_data/json/fm_detail_20250730_170122.json --max-products 10

# Print to console instead of file
uv run python helper/json_to_markdown.py --input fm_data/json/fm_detail_20250730_170122.json --print
```
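Conceptually, "complete data preservation" means walking every field of every product so nothing is dropped; a sketch of that idea (illustrative, not the helper's actual code, and the list-of-dicts schema is an assumption):

```python
# Conceptual sketch of JSON-to-markdown conversion (not json_to_markdown.py's actual code).
# Assumes a list of product dicts; every key/value pair is emitted so no data is lost.
import json

with open("fm_data/json/fm_detail_20250730_170122.json", encoding="utf-8") as f:
    products = json.load(f)

lines = []
for i, product in enumerate(products, 1):
    lines.append(f"## Product {i}")
    for key, value in product.items():
        lines.append(f"- **{key}**: {value}")

print("\n".join(lines))
```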
### URL Checker (`helper/check_urls.py`)

Compares URLs between CSV input and JSON output to verify completeness.
```bash
# Check if all CSV URLs are present in JSON output
uv run python helper/check_urls.py --input fm_data/json/fm_detail_20250730_170122.json

# Results show:
# - Total URLs in each file
# - Common URLs count
# - Missing URLs (if any)
# - Extra URLs in JSON
```
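The comparison itself boils down to set operations; a sketch of the idea (illustrative, not `check_urls.py`'s actual internals; the URLs are examples):

```python
# Illustrative core of a URL completeness check (not check_urls.py's actual code).
csv_urls = {"https://www.forestmarket.net/en-US/product/k5GQN8LRDaHj"}   # from the input CSV
json_urls = {"https://www.forestmarket.net/en-US/product/k5GQN8LRDaHj"}  # from the JSON output

common = csv_urls & json_urls
missing = csv_urls - json_urls   # crawled URLs absent from the JSON
extra = json_urls - csv_urls     # JSON URLs not present in the CSV

print(f"common: {len(common)}, missing: {len(missing)}, extra: {len(extra)}")
```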
## Site-Specific Notes

The crawlers are specifically designed for forestmarket.net:
- Supported locales:
  - `en-US` (https://forestmarket.net/en-US)
  - `en-SG` (https://forestmarket.net/en-SG)
  - `en-HK` (https://forestmarket.net/en-HK)
  - `en-KR` (https://forestmarket.net/en-KR)
  - `en-JP` (https://forestmarket.net/en-JP)
- Product URL pattern: `/product/{product_id}` or `/{locale}/product/{product_id}`
- Product IDs are alphanumeric strings (e.g., `k5GQN8LRDaHj`, `jocY8fxxfRzl`)
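Given those patterns, locale and product ID can be pulled out of a URL with a small regex (a sketch; the character classes are assumptions based on the locale codes and example IDs above):

```python
# Extract locale and product ID from Forest Market product URLs.
# Pattern inferred from /product/{product_id} and /{locale}/product/{product_id}.
import re

PRODUCT_RE = re.compile(r"/(?:(?P<locale>en-[A-Z]{2})/)?product/(?P<product_id>[A-Za-z0-9]+)")

for url in (
    "https://www.forestmarket.net/en-US/product/k5GQN8LRDaHj",
    "https://www.forestmarket.net/product/jocY8fxxfRzl",
):
    m = PRODUCT_RE.search(url)
    if m:
        print(m.group("locale"), m.group("product_id"))
```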