Python utilities for building the search index and managing songs.
The build pipeline now uses works/ as the primary data source:
- PRIMARY (current): `works/*/work.yaml` + `lead-sheet.pro` → `build_works_index.py` → `index.jsonl`
- LEGACY (migration complete): `sources/*/parsed/*.pro` → `migrate_to_works.py` → `works/`
Key files:
- `build_works_index.py` - PRIMARY: Builds index from `works/` directory
- `work_schema.py` - Defines `work.yaml` schema and validation
- `build_index.py` - LEGACY: Builds from `sources/` (kept for reference)

Some operations require external APIs/databases and only run locally. Others run everywhere.
| Operation | Where | Cache File | Notes |
|---|---|---|---|
| Build index | Everywhere | - | Core build, always runs |
| Harmonic analysis | Everywhere | - | Computes JamFriendly, Modal tags from chords |
| MusicBrainz tags | Local only | `artist_tags.json` | Requires local MB database on port 5440 |
| Grassiness scores | Local only | `grassiness_scores.json` | Song-level bluegrass detection |
| Strum Machine URLs | Local only | `strum_machine_cache.json` | API rate limited (10 req/sec) |
| TuneArch fetch | Local only | - | Fetches new instrumentals |
How caching works:
Results from local-only operations are written to cache files, refreshed via the local commands (`refresh-tags`, `strum-machine-match`).

Cache files (commit these after updating):
- `docs/data/artist_tags.json` - MusicBrainz artist → genre mappings
- `docs/data/strum_machine_cache.json` - Song title → Strum Machine URL mappings
- `docs/data/bluegrass_recordings.json` - Recordings by curated bluegrass artists
- `docs/data/bluegrass_tagged.json` - Recordings with MusicBrainz bluegrass tags
- `docs/data/grassiness_scores.json` - Computed grassiness scores per song

```
scripts/lib/
├── build_works_index.py         # PRIMARY: Build index.jsonl from works/
├── work_schema.py               # work.yaml schema definition and validation
├── migrate_to_works.py          # Migrate sources/ → works/ structure
├── build_index.py               # LEGACY: Build index from sources/*.pro
├── build_posts.py               # Build blog posts manifest (posts.json)
├── enrich_songs.py              # Enrich .pro files (provenance, chord normalization)
├── tag_enrichment.py            # Tag enrichment (MusicBrainz + harmonic analysis)
├── query_artist_tags.py         # Optimized MusicBrainz artist tag queries
├── strum_machine.py             # Strum Machine API integration
├── fetch_tune.py                # Fetch tunes from TuneArch by URL
├── search_index.py              # Search index utilities and testing
├── add_song.py                  # Add a song to manual/parsed/
├── process_submission.py        # GitHub Action: process song-submission issues
├── process_correction.py        # GitHub Action: process song-correction issues
├── chord_counter.py             # Chord statistics utility
├── loc_counter.py               # Lines of code counter for analytics
├── export_genre_suggestions.py  # Export genre suggestions for review
├── batch_tag_songs.py           # Batch tag songs using Claude API
├── fetch_tag_overrides.py       # Fetch trusted user tag votes from Supabase
└── tagging/                     # Song-level tagging system
    ├── CLAUDE.md                # Detailed docs for grassiness scoring
    ├── build_artist_database.py # Build curated bluegrass artist database
    └── grassiness.py            # Bluegrass detection based on covers/tags
```
```bash
# Full pipeline: build index from works/
./scripts/bootstrap --quick

# Build index with tag refresh (local only, requires MusicBrainz)
./scripts/bootstrap --quick --refresh-tags

# Add a song manually
./scripts/utility add-song /path/to/song.pro

# Count chord usage across all songs
./scripts/utility count-chords

# Refresh tags from MusicBrainz (LOCAL ONLY - requires MB database)
./scripts/utility refresh-tags

# Match songs to Strum Machine (LOCAL ONLY - ~30 min for 17k songs)
./scripts/utility strum-machine-match
```
Bootstrap now shows elapsed time and per-stage breakdown:
```
Bootstrap complete! (45s total)

Timing breakdown:
- Enrichment: 12s
- Build index: 33s
```
The build pipeline uses pre-computed lookup dicts to avoid O(n*m) nested loops:
| Operation | Before | After | Savings |
|---|---|---|---|
| Strum Machine "the" matching | 17k × 52k = 884M | 52k + 17k = 69k | 12,800× faster |
| Grassiness title lookup | 17k × 56k = 952M | 56k + 17k = 73k | 13,000× faster |
These lookups are built once before the main song loop, then used for O(1) dict access.
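A minimal illustration of the pattern (names and data are invented for the example, not taken from the actual scripts):

```python
# Sketch of the precomputed-lookup pattern: build a dict keyed by a
# normalized title once (O(m)), then use O(1) dict lookups inside the
# main song loop instead of rescanning every cache entry per song.

def normalize(title: str) -> str:
    """Toy normalization: lowercase and trim."""
    return title.lower().strip()

# Pretend this cache holds 52k entries; here just one:
cache = {"Shady Grove ": "match-1"}

# Built once, before the song loop:
lookup = {normalize(k): v for k, v in cache.items()}

# Inside the song loop, each check is a single dict access:
songs = ["shady grove", "Unknown Tune"]
matches = {s: lookup.get(normalize(s)) for s in songs}
```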
Enriches `.pro` files with provenance metadata and normalized chord patterns.
Adds provenance metadata (`x_source`, `x_source_file`, `x_enriched`). Ensures consistent chord counts across verses/choruses of the same type:
```
Before:                          After:
Verse 1: [G]Your cheating...     Verse 1: [G]Your cheating...
Verse 2: When tears come...      Verse 2: [G]When tears come...
                                          ↑ Added from canonical
```
Algorithm:
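The full algorithm isn't reproduced here; a minimal sketch of the chord-copying idea shown in the before/after example, using a toy regex and a hypothetical helper (not the actual `enrich_songs.py` code):

```python
import re

# Matches ChordPro inline chord markers like [G], [C#m], [D7]
CHORD = re.compile(r"\[[A-G][#b]?[^\]]*\]")

def copy_leading_chord(canonical: str, line: str) -> str:
    """If `line` has no chord markers, copy the canonical line's first
    chord onto it; otherwise leave the line untouched."""
    if CHORD.search(line):
        return line  # already has chords
    m = CHORD.search(canonical)
    return (m.group(0) + line) if m else line

fixed = copy_leading_chord("[G]Your cheating heart", "When tears come down")
```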
```bash
# Enrich all sources
uv run python scripts/lib/enrich_songs.py

# Dry run (show what would change)
uv run python scripts/lib/enrich_songs.py --dry-run

# Single source only
uv run python scripts/lib/enrich_songs.py --source classic-country

# Single file (for testing)
uv run python scripts/lib/enrich_songs.py --file path/to/song.pro
```
Files listed in `sources/{source}/protected.txt` are skipped. These are human-corrected files that should not be auto-modified.
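A sketch of the skip logic (`load_protected` is a hypothetical helper; the real implementation lives in `enrich_songs.py` and may differ):

```python
import tempfile
from pathlib import Path

def load_protected(source_dir: Path) -> set[str]:
    """Read protected.txt (one filename per line); a missing file
    means nothing in this source is protected."""
    p = source_dir / "protected.txt"
    if not p.exists():
        return set()
    return {ln.strip() for ln in p.read_text().splitlines() if ln.strip()}

# Usage against a throwaway source directory:
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "protected.txt").write_text("song-a.pro\nsong-b.pro\n")
    protected = load_protected(Path(d))
    # In the enrichment loop: if f.name in protected: skip the file
```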
Generates `docs/data/index.jsonl` from the `works/` directory.
Reads `works/*/work.yaml` for all works, plus each work's `lead-sheet.pro`.

Songs are grouped by `group_id` for the version picker. The grouping algorithm:
Strips version suffixes such as "(Live)", "(C)", "(D)" from titles before grouping.

When determining the work's source for attribution:
The lead sheet's source directive (e.g., `{meta: x_source tunearch}`) takes precedence. This ensures works with both a TuneArch lead sheet and a Banjo Hangout tab show "tunearch" as the source.
Tablature parts include provenance for frontend attribution:
```json
"tablature_parts": [{
  "instrument": "banjo",
  "file": "data/tabs/red-haired-boy-banjo.otf.json",
  "source": "banjo-hangout",
  "source_id": "1687",
  "author": "schlange",
  "source_page_url": "https://www.banjohangout.org/tab/browse.asp?m=detail&v=1687",
  "author_url": "https://www.banjohangout.org/my/schlange"
}]
```
Matches songs to Strum Machine backing tracks using cached results:
This handles cases like "Angeline Baker (C)" matching "angeline the baker" in the cache.
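A sketch of that fuzzy matching (helper names and the exact normalization rules are assumptions, not the actual `strum_machine.py` code):

```python
import re

def normalize_title(title: str) -> str:
    """Hypothetical cache-key normalization: lowercase, drop
    parenthetical suffixes like (C) or (Live), keep only word chars."""
    t = re.sub(r"\([^)]*\)", " ", title.lower())
    t = re.sub(r"[^a-z0-9 ]", " ", t)
    return re.sub(r"\s+", " ", t).strip()

def candidate_keys(title: str):
    """Yield the normalized title plus variants with 'the' inserted
    between words, so 'Angeline Baker (C)' can hit 'angeline the baker'."""
    base = normalize_title(title)
    yield base
    words = base.split()
    for i in range(1, len(words)):
        yield " ".join(words[:i] + ["the"] + words[i:])

cache = {"angeline the baker": "https://strummachine.com/app/songs/example"}
match = next((cache[k] for k in candidate_keys("Angeline Baker (C)") if k in cache), None)
```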
```bash
uv run python scripts/lib/build_works_index.py            # Full build
uv run python scripts/lib/build_works_index.py --no-tags  # Skip tag enrichment
```
```json
{
  "id": "blue-moon-of-kentucky",
  "title": "Blue Moon of Kentucky",
  "artist": "Patsy Cline",
  "composers": ["Bill Monroe"],
  "key": "C",
  "tags": ["ClassicCountry", "JamFriendly"],
  "content": "{meta: title...}[full ChordPro]",
  "tablature_parts": [
    {"type": "tablature", "instrument": "banjo", "path": "data/tabs/..."}
  ]
}
```
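Since `index.jsonl` is JSON Lines (one object per line), a consumer can parse it line by line; a minimal sketch with inline sample data:

```python
import json

# Hypothetical two-record index.jsonl content:
jsonl = (
    '{"id": "blue-moon-of-kentucky", "key": "C"}\n'
    '{"id": "red-haired-boy", "key": "A"}\n'
)
records = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
```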
Defines the `work.yaml` schema and validation.
```python
from dataclasses import dataclass

@dataclass
class Part:
    type: str          # 'lead-sheet', 'tablature', 'abc-notation'
    format: str        # 'chordpro', 'opentabformat', 'abc'
    file: str          # Relative path to file
    default: bool      # Is this the default part?
    instrument: str    # Optional: 'banjo', 'fiddle', 'guitar'
    provenance: dict   # Source info (source, source_file, imported_at)

@dataclass
class Work:
    id: str            # Slug (e.g., 'blue-moon-of-kentucky')
    title: str
    artist: str
    composers: list[str]
    default_key: str
    tags: list[str]
    parts: list[Part]
```
Generates `docs/data/index.jsonl` from all `.pro` files in `sources/`.
Reads `sources/*/parsed/*.pro` for all songs.

```python
def parse_chordpro_metadata(content) -> dict:
    """Extract {meta: key value} and {key: value} directives.
    Includes version fields: x_version_label, x_version_type, etc."""

def detect_key(chords: list[str]) -> tuple[str, str]:
    """Detect key from chord list. Returns (key, mode)."""

def to_nashville(chord: str, key_name: str) -> str:
    """Convert chord to Nashville number given a key."""

def extract_lyrics(content: str) -> str:
    """Extract plain lyrics without chord markers."""

def normalize_for_grouping(text: str) -> str:
    """Normalize text for grouping comparison.
    Lowercases, removes accents, strips common suffixes."""

def compute_group_id(title: str, artist: str) -> str:
    """Compute base group ID from normalized title + artist."""

def compute_lyrics_hash(lyrics: str) -> str:
    """Hash first 200 chars of normalized lyrics.
    Used to distinguish different songs with same title."""
```
```json
{
  "songs": [
    {
      "id": "songfilename",
      "title": "Song Title",
      "artist": "Artist Name",
      "composer": "Writer Name",
      "first_line": "First line of lyrics...",
      "lyrics": "Lyrics for search (500 chars)",
      "content": "Full ChordPro content",
      "key": "G",
      "mode": "major",
      "nashville": ["I", "IV", "V"],
      "progression": ["I", "I", "IV", "V", "I"],
      "group_id": "abc123def456_12345678",
      "chord_count": 3,
      "version_label": "Simplified",
      "version_type": "simplified",
      "arrangement_by": "John Smith"
    }
  ]
}
```
Songs are grouped by `group_id`, which combines:

- A hash of the normalized title + artist (`compute_group_id`)
- A hash of the first 200 characters of normalized lyrics (`compute_lyrics_hash`)
This ensures songs with the same title but different lyrics (different songs) get different group_ids, while true versions (same lyrics, different arrangements) share a group_id.
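A sketch of how such a `group_id` could be computed; the hash function and component lengths here are guesses based on the sample ID format (`abc123def456_12345678`), not the actual implementation:

```python
import hashlib
import unicodedata

def normalize_for_grouping(text: str) -> str:
    """Lowercase and strip accents (simplified)."""
    t = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in t if not unicodedata.combining(c)).strip()

def group_id(title: str, artist: str, lyrics: str) -> str:
    """Base hash from normalized title+artist, suffix hash from the
    first 200 chars of normalized lyrics."""
    key = f"{normalize_for_grouping(title)}|{normalize_for_grouping(artist)}"
    base = hashlib.sha1(key.encode()).hexdigest()[:12]
    lyr = hashlib.sha1(normalize_for_grouping(lyrics)[:200].encode()).hexdigest()[:8]
    return f"{base}_{lyr}"

a = group_id("Blue Moon of Kentucky", "Patsy Cline", "I said blue moon...")
b = group_id("Blue Moon of Kentucky", "Patsy Cline", "Completely different lyrics")
```

Same title and artist but different lyrics share the base hash yet get different suffixes, so they land in different groups.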
Exact duplicates (identical content) are removed at build time. The first occurrence is kept.
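The dedup step can be sketched as (hypothetical helper; the real build script's logic may differ):

```python
import hashlib

def dedupe(songs: list[dict]) -> list[dict]:
    """Drop exact content duplicates, keeping the first occurrence."""
    seen: set[str] = set()
    out = []
    for s in songs:
        h = hashlib.sha256(s["content"].encode()).hexdigest()
        if h in seen:
            continue
        seen.add(h)
        out.append(s)
    return out

songs = [
    {"id": "a", "content": "X"},
    {"id": "b", "content": "X"},  # exact duplicate of "a"
    {"id": "c", "content": "Y"},
]
kept = dedupe(songs)
```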
Scores each possible key by:
Tags are added to songs during index build via `tag_enrichment.py`.
| Category | Tags |
|---|---|
| Genre | Bluegrass, ClassicCountry, OldTime, Gospel, Folk, HonkyTonk, Outlaw, Rockabilly, etc. |
| Vibe | JamFriendly, Modal, Jazzy |
| Structure | Instrumental, Waltz |
Genre tags come from LLM tagging (`llm_tags.json`); vibe tags are computed by harmonic analysis:

- JamFriendly: ≤5 unique chords, has I-IV-V, no complex extensions
- Modal: Has bVII chord (e.g., F in key of G)
- Jazzy: Has 7th, 9th, dim, aug, or slash chords

| File | Purpose |
|---|---|
| `docs/data/llm_tags.json` | LLM-generated tags (primary source, checked into git) |
| `docs/data/tag_overrides.json` | Trusted user tag exclusions (checked into git) |
| `docs/data/artist_tags.json` | Cached MusicBrainz artist tags (fallback) |
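The vibe-tag rules above can be sketched as follows (`vibe_tags` is a hypothetical, simplified helper, not the actual `tag_enrichment.py` code):

```python
import re

# Chords with 7ths/9ths, diminished, augmented, or slash bass notes
COMPLEX = re.compile(r"(7|9|dim|aug|/)")

def vibe_tags(nashville: list[str], chords: list[str]) -> set[str]:
    """Apply the three vibe rules to a song's Nashville numbers and chords."""
    tags = set()
    has_complex = any(COMPLEX.search(c) for c in chords)
    if len(set(chords)) <= 5 and {"I", "IV", "V"} <= set(nashville) and not has_complex:
        tags.add("JamFriendly")
    if "bVII" in nashville:
        tags.add("Modal")  # e.g., an F chord in the key of G
    if has_complex:
        tags.add("Jazzy")
    return tags

tags = vibe_tags(["I", "IV", "V", "bVII"], ["G", "C", "D", "F"])
```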
Tags are applied automatically during every index build:
| Where | What happens |
|---|---|
| Local or CI | `llm_tags.json` is read → applies genre tags |
| Local or CI | Harmonic analysis runs → applies vibe tags (JamFriendly, Modal) |
| Local or CI | `tag_overrides.json` exclusions remove bad tags |
Normal flow: LLM tags are pre-computed and checked into git. CI uses them directly.
Re-tagging all songs (local only, requires Anthropic API key):
```bash
# Submit batch job (takes ~2 hours to process)
uv run python scripts/lib/batch_tag_songs.py

# Check status
uv run python scripts/lib/batch_tag_songs.py --status <batch_id>

# Fetch results when complete
uv run python scripts/lib/batch_tag_songs.py --results <batch_id>

# Rebuild index and commit
./scripts/bootstrap --quick
git add docs/data/llm_tags.json && git commit -m "Refresh LLM tags"
```
Syncing trusted user votes (local only, requires Supabase credentials):
```bash
./scripts/utility sync-tag-votes
git add docs/data/tag_overrides.json && git commit -m "Sync tag overrides"
```
Optimized MusicBrainz queries using LATERAL joins with indexed lookups:
```python
# Query tags for artists (0.9s for 900 artists)
from query_artist_tags import query_artist_tags_batch

results = query_artist_tags_batch(['Bill Monroe', 'Hank Williams'])
# Returns: {'Bill Monroe': [('bluegrass', 45), ('country', 12), ...], ...}
```
Adds a `.pro` file to `sources/manual/parsed/` and rebuilds the index.
```bash
./scripts/utility add-song ~/Downloads/my_song.pro
./scripts/utility add-song song.pro --skip-index-rebuild
```
Called by GitHub Actions when issues are approved.
Trigger: Issue labeled `song-submission` + approved (or `song-correction`)
Process:
- Writes the song to `sources/manual/parsed/{id}.pro`
- Adds the file to `protected.txt` (for corrections)

The build script handles both formats:
```
# Our format
{meta: title Song Name}
{meta: artist Artist}

# Standard ChordPro format
{title: Song Name}
{artist: Artist}
```
Both are extracted and normalized.
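A minimal sketch of parsing both directive styles (a simplified stand-in for `parse_chordpro_metadata`, not the real implementation):

```python
import re

# First alternative matches {meta: key value}, second matches {key: value}
META = re.compile(r"\{meta:\s*(\w+)\s+([^}]*)\}|\{(\w+):\s*([^}]*)\}")

def parse_metadata(content: str) -> dict:
    """Collect metadata from both directive formats; first value wins."""
    out = {}
    for m in META.finditer(content):
        key = m.group(1) or m.group(3)
        value = (m.group(2) or m.group(4) or "").strip()
        out.setdefault(key, value)
    return out

meta = parse_metadata("{meta: title Song Name}\n{artist: Artist}")
```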
To add songs from a new source:
1. Create a `sources/{source-name}/parsed/` directory
2. Add `.pro` files there
3. Run `./scripts/bootstrap --quick` to rebuild the index

The build script automatically scans all `sources/*/parsed/` directories.