Python utilities for building the search index and managing songs.
The build pipeline now uses works/ as the primary data source:
- PRIMARY (current): `works/*/work.yaml` + `lead-sheet.pro` → `build_works_index.py` → `index.jsonl`
- LEGACY (migration complete): `sources/*/parsed/*.pro` → `migrate_to_works.py` → `works/`
Key files:
- `build_works_index.py` - PRIMARY: Builds index from `works/` directory
- `work_schema.py` - Defines `work.yaml` schema and validation
- `build_index.py` - LEGACY: Builds from `sources/` (kept for reference)

Some operations require external APIs/databases and only run locally. Others run everywhere.
| Operation | Where | Cache File | Notes |
|---|---|---|---|
| Build index | Everywhere | - | Core build, always runs |
| Harmonic analysis | Everywhere | - | Computes JamFriendly, Modal tags from chords |
| MusicBrainz tags | Local only | `artist_tags.json` | Requires local MB database on port 5440 |
| Grassiness scores | Local only | `grassiness_scores.json` | Song-level bluegrass detection |
| Strum Machine URLs | Local only | `strum_machine_cache.json` | API rate limited (10 req/sec) |
| TuneArch fetch | Local only | - | Fetches new instrumentals |
How caching works:
Results from local-only operations are written to cache files, refreshed via the local commands (`refresh-tags`, `strum-machine-match`).

Cache files (commit these after updating):
- `docs/data/artist_tags.json` - MusicBrainz artist → genre mappings
- `docs/data/strum_machine_cache.json` - Song title → Strum Machine URL mappings
- `docs/data/bluegrass_recordings.json` - Recordings by curated bluegrass artists
- `docs/data/bluegrass_tagged.json` - Recordings with MusicBrainz bluegrass tags
- `docs/data/grassiness_scores.json` - Computed grassiness scores per song

```
scripts/lib/
├── build_works_index.py         # PRIMARY: Build index.jsonl from works/
├── work_schema.py               # work.yaml schema definition and validation
├── migrate_to_works.py          # Migrate sources/ → works/ structure
├── build_index.py               # LEGACY: Build index from sources/*.pro
├── build_posts.py               # Build blog posts manifest (posts.json)
├── enrich_songs.py              # Enrich .pro files (provenance, chord normalization)
├── tag_enrichment.py            # Tag enrichment (MusicBrainz + harmonic analysis)
├── query_artist_tags.py         # Optimized MusicBrainz artist tag queries
├── strum_machine.py             # Strum Machine API integration
├── fetch_tune.py                # Fetch tunes from TuneArch by URL
├── search_index.py              # Search index utilities and testing
├── add_song.py                  # Add a song to manual/parsed/
├── process_submission.py        # GitHub Action: process song-submission issues
├── process_correction.py        # GitHub Action: process song-correction issues
├── chord_counter.py             # Chord statistics utility
├── loc_counter.py               # Lines of code counter for analytics
├── export_genre_suggestions.py  # Export genre suggestions for review
├── batch_tag_songs.py           # Batch tag songs using Claude API
├── fetch_tag_overrides.py       # Fetch trusted user tag votes from Supabase
└── tagging/                     # Song-level tagging system
    ├── CLAUDE.md                # Detailed docs for grassiness scoring
    ├── build_artist_database.py # Build curated bluegrass artist database
    └── grassiness.py            # Bluegrass detection based on covers/tags
```
```bash
# Full pipeline: build index from works/
./scripts/bootstrap --quick

# Build index with tag refresh (local only, requires MusicBrainz)
./scripts/bootstrap --quick --refresh-tags

# Add a song manually
./scripts/utility add-song /path/to/song.pro

# Count chord usage across all songs
./scripts/utility count-chords

# Refresh tags from MusicBrainz (LOCAL ONLY - requires MB database)
./scripts/utility refresh-tags

# Match songs to Strum Machine (LOCAL ONLY - ~30 min for 17k songs)
./scripts/utility strum-machine-match
```
Bootstrap now shows elapsed time and per-stage breakdown:
```
Bootstrap complete! (45s total)

Timing breakdown:
- Enrichment: 12s
- Build index: 33s
```
The build pipeline uses pre-computed lookup dicts to avoid O(n*m) nested loops:
| Operation | Before | After | Savings |
|---|---|---|---|
| Strum Machine "the" matching | 17k × 52k = 884M | 52k + 17k = 69k | 12,800× faster |
| Grassiness title lookup | 17k × 56k = 952M | 56k + 17k = 73k | 13,000× faster |
These lookups are built once before the main song loop, then used for O(1) dict access.
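A minimal illustration of the pattern (names and data are invented for the example, not taken from the actual scripts):

```python
# Sketch of the precomputed-lookup pattern: build a dict keyed by a
# normalized title once (O(m)), then use O(1) dict lookups inside the
# main song loop instead of rescanning every cache entry per song.

def normalize(title: str) -> str:
    """Toy normalization: lowercase and trim."""
    return title.lower().strip()

# Pretend this cache holds 52k entries; here just one:
cache = {"Shady Grove ": "match-1"}

# Built once, before the song loop:
lookup = {normalize(k): v for k, v in cache.items()}

# Inside the song loop, each check is a single dict access:
songs = ["shady grove", "Unknown Tune"]
matches = {s: lookup.get(normalize(s)) for s in songs}
```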
Enriches `.pro` files with provenance metadata and normalized chord patterns.
Adds provenance metadata (`x_source`, `x_source_file`, `x_enriched`). Ensures consistent chord counts across verses/choruses of the same type:
```
Before:                          After:
Verse 1: [G]Your cheating...     Verse 1: [G]Your cheating...
Verse 2: When tears come...      Verse 2: [G]When tears come...
                                          ↑ Added from canonical
```
Algorithm:
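The full algorithm isn't reproduced here; a minimal sketch of the chord-copying idea shown in the before/after example, using a toy regex and a hypothetical helper (not the actual `enrich_songs.py` code):

```python
import re

# Matches ChordPro inline chord markers like [G], [C#m], [D7]
CHORD = re.compile(r"\[[A-G][#b]?[^\]]*\]")

def copy_leading_chord(canonical: str, line: str) -> str:
    """If `line` has no chord markers, copy the canonical line's first
    chord onto it; otherwise leave the line untouched."""
    if CHORD.search(line):
        return line  # already has chords
    m = CHORD.search(canonical)
    return (m.group(0) + line) if m else line

fixed = copy_leading_chord("[G]Your cheating heart", "When tears come down")
```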
```bash
# Enrich all sources
uv run python scripts/lib/enrich_songs.py

# Dry run (show what would change)
uv run python scripts/lib/enrich_songs.py --dry-run

# Single source only
uv run python scripts/lib/enrich_songs.py --source classic-country

# Single file (for testing)
uv run python scripts/lib/enrich_songs.py --file path/to/song.pro
```
Files listed in `sources/{source}/protected.txt` are skipped. These are human-corrected files that should not be auto-modified.
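A sketch of the skip logic (`load_protected` is a hypothetical helper; the real implementation lives in `enrich_songs.py` and may differ):

```python
import tempfile
from pathlib import Path

def load_protected(source_dir: Path) -> set[str]:
    """Read protected.txt (one filename per line); a missing file
    means nothing in this source is protected."""
    p = source_dir / "protected.txt"
    if not p.exists():
        return set()
    return {ln.strip() for ln in p.read_text().splitlines() if ln.strip()}

# Usage against a throwaway source directory:
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "protected.txt").write_text("song-a.pro\nsong-b.pro\n")
    protected = load_protected(Path(d))
    # In the enrichment loop: if f.name in protected: skip the file
```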
Generates `docs/data/index.jsonl` from the `works/` directory.
Reads `works/*/work.yaml` for all works, plus each work's `lead-sheet.pro`.

Songs are grouped by `group_id` for the version picker. The grouping algorithm:
Strips version suffixes such as "(Live)", "(C)", "(D)" from titles before grouping.

When determining the work's source for attribution:
The lead sheet's source directive (e.g., `{meta: x_source tunearch}`) takes precedence. This ensures works with both a TuneArch lead sheet and a Banjo Hangout tab show "tunearch" as the source.
Tablature parts include provenance for frontend attribution:
```json
"tablature_parts": [{
  "instrument": "banjo",
  "file": "data/tabs/red-haired-boy-banjo.otf.json",
  "source": "banjo-hangout",
  "source_id": "1687",
  "author": "schlange",
  "source_page_url": "https://www.banjohangout.org/tab/browse.asp?m=detail&v=1687",
  "author_url": "https://www.banjohangout.org/my/schlange"
}]
```
Matches songs to Strum Machine backing tracks using cached results:
This handles cases like "Angeline Baker (C)" matching "angeline the baker" in the cache.
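A sketch of that fuzzy matching (helper names and the exact normalization rules are assumptions, not the actual `strum_machine.py` code):

```python
import re

def normalize_title(title: str) -> str:
    """Hypothetical cache-key normalization: lowercase, drop
    parenthetical suffixes like (C) or (Live), keep only word chars."""
    t = re.sub(r"\([^)]*\)", " ", title.lower())
    t = re.sub(r"[^a-z0-9 ]", " ", t)
    return re.sub(r"\s+", " ", t).strip()

def candidate_keys(title: str):
    """Yield the normalized title plus variants with 'the' inserted
    between words, so 'Angeline Baker (C)' can hit 'angeline the baker'."""
    base = normalize_title(title)
    yield base
    words = base.split()
    for i in range(1, len(words)):
        yield " ".join(words[:i] + ["the"] + words[i:])

cache = {"angeline the baker": "https://strummachine.com/app/songs/example"}
match = next((cache[k] for k in candidate_keys("Angeline Baker (C)") if k in cache), None)
```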
```bash
uv run python scripts/lib/build_works_index.py            # Full build
uv run python scripts/lib/build_works_index.py --no-tags  # Skip tag enrichment
```
```json
{
  "id": "blue-moon-of-kentucky",
  "title": "Blue Moon of Kentucky",
  "artist": "Patsy Cline",
  "composers": ["Bill Monroe"],
  "key": "C",
  "tags": ["ClassicCountry", "JamFriendly"],
  "content": "{meta: title...}[full ChordPro]",
  "tablature_parts": [
    {"type": "tablature", "instrument": "banjo", "path": "data/tabs/..."}
  ]
}
```
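Since `index.jsonl` is JSON Lines (one object per line), a consumer can parse it line by line; a minimal sketch with inline sample data:

```python
import json

# Hypothetical two-record index.jsonl content:
jsonl = (
    '{"id": "blue-moon-of-kentucky", "key": "C"}\n'
    '{"id": "red-haired-boy", "key": "A"}\n'
)
records = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
```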
Defines the `work.yaml` schema and validation.
```python
from dataclasses import dataclass

@dataclass
class Part:
    type: str          # 'lead-sheet', 'tablature', 'abc-notation'
    format: str        # 'chordpro', 'opentabformat', 'abc'
    file: str          # Relative path to file
    default: bool      # Is this the default part?
    instrument: str    # Optional: 'banjo', 'fiddle', 'guitar'
    provenance: dict   # Source info (source, source_file, imported_at)

@dataclass
class Work:
    id: str            # Slug (e.g., 'blue-moon-of-kentucky')
    title: str
    artist: str
    composers: list[str]
    default_key: str
    tags: list[str]
    parts: list[Part]
```
Generates `docs/data/index.jsonl` from all `.pro` files in `sources/`.
Reads `sources/*/parsed/*.pro` for all songs.

```python
def parse_chordpro_metadata(content) -> dict:
    """Extract {meta: key value} and {key: value} directives.
    Includes version fields: x_version_label, x_version_type, etc."""

def detect_key(chords: list[str]) -> tuple[str, str]:
    """Detect key from chord list. Returns (key, mode)."""

def to_nashville(chord: str, key_name: str) -> str:
    """Convert chord to Nashville number given a key."""

def extract_lyrics(content: str) -> str:
    """Extract plain lyrics without chord markers."""

def normalize_for_grouping(text: str) -> str:
    """Normalize text for grouping comparison.
    Lowercases, removes accents, strips common suffixes."""

def compute_group_id(title: str, artist: str) -> str:
    """Compute base group ID from normalized title + artist."""

def compute_lyrics_hash(lyrics: str) -> str:
    """Hash first 200 chars of normalized lyrics.
    Used to distinguish different songs with same title."""
```
```json
{
  "songs": [
    {
      "id": "songfilename",
      "title": "Song Title",
      "artist": "Artist Name",
      "composer": "Writer Name",
      "first_line": "First line of lyrics...",
      "lyrics": "Lyrics for search (500 chars)",
      "content": "Full ChordPro content",
      "key": "G",
      "mode": "major",
      "nashville": ["I", "IV", "V"],
      "progression": ["I", "I", "IV", "V", "I"],
      "group_id": "abc123def456_12345678",
      "chord_count": 3,
      "version_label": "Simplified",
      "version_type": "simplified",
      "arrangement_by": "John Smith"
    }
  ]
}
```
Songs are grouped by `group_id`, which combines:

- A hash of the normalized title + artist (`compute_group_id`)
- A hash of the first 200 characters of normalized lyrics (`compute_lyrics_hash`)
This ensures songs with the same title but different lyrics (different songs) get different group_ids, while true versions (same lyrics, different arrangements) share a group_id.
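A sketch of how such a `group_id` could be computed; the hash function and component lengths here are guesses based on the sample ID format (`abc123def456_12345678`), not the actual implementation:

```python
import hashlib
import unicodedata

def normalize_for_grouping(text: str) -> str:
    """Lowercase and strip accents (simplified)."""
    t = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in t if not unicodedata.combining(c)).strip()

def group_id(title: str, artist: str, lyrics: str) -> str:
    """Base hash from normalized title+artist, suffix hash from the
    first 200 chars of normalized lyrics."""
    key = f"{normalize_for_grouping(title)}|{normalize_for_grouping(artist)}"
    base = hashlib.sha1(key.encode()).hexdigest()[:12]
    lyr = hashlib.sha1(normalize_for_grouping(lyrics)[:200].encode()).hexdigest()[:8]
    return f"{base}_{lyr}"

a = group_id("Blue Moon of Kentucky", "Patsy Cline", "I said blue moon...")
b = group_id("Blue Moon of Kentucky", "Patsy Cline", "Completely different lyrics")
```

Same title and artist but different lyrics share the base hash yet get different suffixes, so they land in different groups.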
Exact duplicates (identical content) are removed at build time. The first occurrence is kept.
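The dedup step can be sketched as (hypothetical helper; the real build script's logic may differ):

```python
import hashlib

def dedupe(songs: list[dict]) -> list[dict]:
    """Drop exact content duplicates, keeping the first occurrence."""
    seen: set[str] = set()
    out = []
    for s in songs:
        h = hashlib.sha256(s["content"].encode()).hexdigest()
        if h in seen:
            continue
        seen.add(h)
        out.append(s)
    return out

songs = [
    {"id": "a", "content": "X"},
    {"id": "b", "content": "X"},  # exact duplicate of "a"
    {"id": "c", "content": "Y"},
]
kept = dedupe(songs)
```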
Scores each possible key by:
Tags are added to songs during index build via `tag_enrichment.py`.
| Category | Tags |
|---|---|
| Genre | Bluegrass, ClassicCountry, OldTime, Gospel, Folk, HonkyTonk, Outlaw, Rockabilly, etc. |
| Vibe | JamFriendly, Modal, Jazzy |
| Structure | Instrumental, Waltz |
Genre tags come from LLM tagging (`llm_tags.json`); vibe tags are computed by harmonic analysis:

- JamFriendly: ≤5 unique chords, has I-IV-V, no complex extensions
- Modal: Has bVII chord (e.g., F in key of G)
- Jazzy: Has 7th, 9th, dim, aug, or slash chords

| File | Purpose |
|---|---|
| `docs/data/llm_tags.json` | LLM-generated tags (primary source, checked into git) |
| `docs/data/tag_overrides.json` | Trusted user tag exclusions (checked into git) |
| `docs/data/artist_tags.json` | Cached MusicBrainz artist tags (fallback) |
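The vibe-tag rules above can be sketched as follows (`vibe_tags` is a hypothetical, simplified helper, not the actual `tag_enrichment.py` code):

```python
import re

# Chords with 7ths/9ths, diminished, augmented, or slash bass notes
COMPLEX = re.compile(r"(7|9|dim|aug|/)")

def vibe_tags(nashville: list[str], chords: list[str]) -> set[str]:
    """Apply the three vibe rules to a song's Nashville numbers and chords."""
    tags = set()
    has_complex = any(COMPLEX.search(c) for c in chords)
    if len(set(chords)) <= 5 and {"I", "IV", "V"} <= set(nashville) and not has_complex:
        tags.add("JamFriendly")
    if "bVII" in nashville:
        tags.add("Modal")  # e.g., an F chord in the key of G
    if has_complex:
        tags.add("Jazzy")
    return tags

tags = vibe_tags(["I", "IV", "V", "bVII"], ["G", "C", "D", "F"])
```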
Tags are applied automatically during every index build:
| Where | What happens |
|---|---|
| Local or CI | `llm_tags.json` is read → applies genre tags |
| Local or CI | Harmonic analysis runs → applies vibe tags (JamFriendly, Modal) |
| Local or CI | `tag_overrides.json` exclusions remove bad tags |
Normal flow: LLM tags are pre-computed and checked into git. CI uses them directly.
Re-tagging all songs (local only, requires Anthropic API key):
```bash
# Submit batch job (takes ~2 hours to process)
uv run python scripts/lib/batch_tag_songs.py

# Check status
uv run python scripts/lib/batch_tag_songs.py --status <batch_id>

# Fetch results when complete
uv run python scripts/lib/batch_tag_songs.py --results <batch_id>

# Rebuild index and commit
./scripts/bootstrap --quick
git add docs/data/llm_tags.json && git commit -m "Refresh LLM tags"
```
Syncing trusted user votes (local only, requires Supabase credentials):
```bash
./scripts/utility sync-tag-votes
git add docs/data/tag_overrides.json && git commit -m "Sync tag overrides"
```
Optimized MusicBrainz queries using LATERAL joins with indexed lookups:
```python
# Query tags for artists (0.9s for 900 artists)
from query_artist_tags import query_artist_tags_batch

results = query_artist_tags_batch(['Bill Monroe', 'Hank Williams'])
# Returns: {'Bill Monroe': [('bluegrass', 45), ('country', 12), ...], ...}
```
Adds a `.pro` file to `sources/manual/parsed/` and rebuilds the index.
```bash
./scripts/utility add-song ~/Downloads/my_song.pro
./scripts/utility add-song song.pro --skip-index-rebuild
```
Called by GitHub Actions when issues are approved.
Trigger: Issue labeled `song-submission` + approved (or `song-correction`)
Process:
- Writes the song to `sources/manual/parsed/{id}.pro`
- Adds the file to `protected.txt` (for corrections)

The build script handles both formats:
```
# Our format
{meta: title Song Name}
{meta: artist Artist}

# Standard ChordPro format
{title: Song Name}
{artist: Artist}
```
Both are extracted and normalized.
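A minimal sketch of parsing both directive styles (a simplified stand-in for `parse_chordpro_metadata`, not the real implementation):

```python
import re

# First alternative matches {meta: key value}, second matches {key: value}
META = re.compile(r"\{meta:\s*(\w+)\s+([^}]*)\}|\{(\w+):\s*([^}]*)\}")

def parse_metadata(content: str) -> dict:
    """Collect metadata from both directive formats; first value wins."""
    out = {}
    for m in META.finditer(content):
        key = m.group(1) or m.group(3)
        value = (m.group(2) or m.group(4) or "").strip()
        out.setdefault(key, value)
    return out

meta = parse_metadata("{meta: title Song Name}\n{artist: Artist}")
```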
To add songs from a new source:
1. Create a `sources/{source-name}/parsed/` directory
2. Add `.pro` files there
3. Run `./scripts/bootstrap --quick` to rebuild the index

The build script automatically scans all `sources/*/parsed/` directories.