[API Request] songLyrics v2: Token-level karaoke timing and lyric kind (translation/pronunciation) support #213
2 comments · 7 replies
---
@Tolriq — I'd love your input on this. Symfonium already has excellent TTML lyrics rendering (I use it daily via Google Drive as a media provider specifically for this). This proposal would bring that same experience to Subsonic/OpenSubsonic providers like Navidrome, so users wouldn't need to rely on Google Drive/WebDAV workarounds just for rich lyrics support. The API changes are additive and backward-compatible.
---
This just moves the problem to the client side, with different clients that can handle it differently. Especially if the current line has no end and the next line is far away, or there are multiple voices and it's up to the client to parse everything to find which line actually continues that voice. Symfonium does all that, plus uses timing derivation from the previous cues to estimate durations from the number of letters, with heuristics and max delays. For word highlighting this is less important, but it matters a lot for syllable highlighting. I won't enforce this if you don't agree, but IMO consistency between the clients is best, even if we can't 100% ensure consistency between the servers for all the strange eLRC files. Maybe at least ensure that if there's one cue with an end, then all cues need an end. This ensures that clients that do syllable highlighting work consistently, and those that only do word highlighting (so usually without ends, because estimating the end of the previous cue from the next cue's start really gives bad results 60% of the time) are not impacted too much, as they only highlight the whole word and probably stop highlighting when displaying the next line anyway. While Apple Music TTMLs are clean, the number of eLRC variations in the wild is wild ;)
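The letter-count estimation described here could be sketched roughly as follows. This is an illustrative guess at that kind of heuristic, not Symfonium's actual code; `MAX_DELAY_MS` and the fallback per-character rate are invented constants.

```python
MAX_DELAY_MS = 800  # assumed cap on how long a cue may stay highlighted

def estimate_ends(cues):
    """Fill missing 'end' times on cues of shape {'start', 'value'[, 'end']}.

    Derives an average per-character duration from cues that do have ends,
    then caps each guess by the next cue's start and a max delay.
    """
    timed = [c for c in cues if 'end' in c and c['value'].strip()]
    if timed:
        per_char = sum(c['end'] - c['start'] for c in timed) / max(
            1, sum(len(c['value']) for c in timed))
    else:
        per_char = 150.0  # fallback guess, ms per character
    out = []
    for i, c in enumerate(cues):
        c = dict(c)
        if 'end' not in c:
            guess = c['start'] + len(c['value']) * per_char
            if i + 1 < len(cues):
                guess = min(guess, cues[i + 1]['start'])
            c['end'] = int(min(guess, c['start'] + MAX_DELAY_MS))
        out.append(c)
    return out
```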
---
## Type of change

API extension
## Proposal description

Extend the existing `structuredLyrics` response from `getLyricsBySongId` with optional fields to support word/syllable-level timing (karaoke) and lyric layer classification (main vocals, translations, pronunciations), gated behind an `enhanced` query parameter for backward compatibility.

## Motivation

The current `structuredLyrics` model is line-level only — each `line` has a `start` timestamp and a `value` string. This was sufficient for LRC-era lyrics, but modern lyrics formats and services provide much richer data (e.g., word timestamps such as `<hh:mm:ss.ms>` within lines).

Modern music clients (Apple Music, Spotify) display lyrics with word-by-word highlighting (karaoke mode), not just line-by-line scrolling. CJK music frequently includes pronunciation guides (furigana/romaji for Japanese, pinyin for Chinese) and translations as parallel tracks.

Without API support for these features, OpenSubsonic clients cannot match the lyrics experience of commercial streaming apps, even when servers have access to rich lyrics data.
## Proposed Changes

### 1. New query parameter: `enhanced` on `getLyricsBySongId`

| Parameter | Type | Default | Description |
|---|---|---|---|
| `enhanced` | boolean | `false` | When `true`, the response includes `cueLine` arrays and non-main `kind` tracks (translations, pronunciations). When `false` or omitted, only `kind="main"` entries are returned with no `cueLine` data. |

This parameter ensures full backward compatibility: clients that don't pass `enhanced=true` receive exactly the same response as today (only main lyrics, line-level only). This avoids breaking any existing client.

### 2. New optional field: `kind` on `structuredLyrics`
| Field | Type | Default | Allowed values |
|---|---|---|---|
| `kind` | string | `"main"` | `"main"`, `"translation"`, `"pronunciation"` |

- `"main"` — Primary vocals lyrics (default if omitted, backward-compatible)
- `"translation"` — A translation of the main lyrics into another language
- `"pronunciation"` — A phonetic/romanized rendering of the main lyrics (e.g., romaji for Japanese, pinyin for Chinese)

Translation and pronunciation tracks are only returned when `enhanced=true`.

### 3. New optional field: `cueLine` on `structuredLyrics`
A `cueLine` array provides word/syllable-level timing as a parallel structure to `line`. Each `cueLine` corresponds to a `line` by index. Only returned when `enhanced=true`.

The server pre-splits different vocal roles into separate `cueLine` entries with the same `index`. For example, if line 0 contains both main vocals and background vocals, the server returns two `cueLine` objects for index 0 — one with `role=""` (or omitted) for the main vocals, and one with `role="bg"` for the background vocals.

`cueLine` object:

| Field | Type | Description |
|---|---|---|
| `index` | integer | Index of the entry in the `line` array this cueLine corresponds to |
| `start` | integer | Line start time, in milliseconds |
| `end` | integer | Line end time, in milliseconds |
| `value` | string | Full text of the line |
| `role` | string | `"bg"` (background vocals), `"voice"` (individual voice part — use `voiceIndex` to distinguish), `"group"` (group/chorus) |
| `voiceIndex` | integer | Only meaningful when `role` is `"voice"`. Used to distinguish individual singers in multi-voice tracks (e.g. duets, ensembles). The numbering is server-assigned per track and has no cross-track meaning |
| `displayRole` | string | Human-readable label for the role (e.g. from TTML `ttm:name`, or a descriptive tag like `"Tenor"`). Servers include this when the source data provides a name or label; omit otherwise |
| `cue` | array[cue] | The word/syllable cues of this line |

`cue` object:

| Field | Type | Description |
|---|---|---|
| `start` | integer | Cue start time, in milliseconds |
| `end` | integer | Cue end time, in milliseconds. Within a `cueLine`, `end` MUST be either present on all cues or none. When the source provides partial end times, servers MUST fill missing values (e.g., using the next cue's `start`, or the cueLine's `end` for the final cue). When no cues have end times (e.g., Enhanced LRC with start-only timing), `end` is omitted from all cues |
| `value` | string | Text of the word/syllable |
Note: Individual `cue` objects do not carry a `role`. The role is defined at the `cueLine` level, and all cues within a cueLine share that role. This design means the server does the work of splitting by role, so clients can simply iterate cueLines without needing to group or filter.
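The fill-in rule for partial end times could be implemented like this. A minimal server-side sketch; dict keys mirror the JSON field names.

```python
def normalize_cue_ends(cue_line):
    """Enforce the all-or-nothing `end` invariant on one cueLine dict.

    If some cues have ends and some do not, fill the gaps from the next
    cue's start (or the cueLine's own end for the final cue). If no cue
    has an end (start-only source such as Enhanced LRC), leave all bare.
    """
    cues = cue_line['cue']
    if not any('end' in c for c in cues):
        return cue_line  # start-only timing: omit `end` everywhere
    for i, c in enumerate(cues):
        if 'end' not in c:
            if i + 1 < len(cues):
                c['end'] = cues[i + 1]['start']
            else:
                c['end'] = cue_line['end']
    return cue_line
```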
### Example Response (JSON)

```json
{
  "subsonic-response": {
    "status": "ok",
    "version": "1.16.1",
    "type": "navidrome",
    "serverVersion": "0.55.0",
    "openSubsonic": true,
    "lyricsList": {
      "structuredLyrics": [
        {
          "kind": "main",
          "lang": "ko",
          "synced": true,
          "line": [
            { "start": 2747, "value": "눈을 뜬 순간" },
            { "start": 6214, "value": "모든 게 달라졌어" }
          ],
          "cueLine": [
            {
              "index": 0, "start": 2747, "end": 6214, "value": "눈을 뜬 순간",
              "cue": [
                { "start": 2747, "end": 3018, "value": "눈" },
                { "start": 3018, "end": 3179, "value": "을" },
                { "start": 3179, "end": 3582, "value": " " },
                { "start": 3582, "end": 4100, "value": "뜬" },
                { "start": 4100, "end": 4500, "value": " " },
                { "start": 4500, "end": 5200, "value": "순" },
                { "start": 5200, "end": 6214, "value": "간" }
              ]
            },
            {
              "index": 1, "start": 6214, "end": 9000, "value": "모든 게 달라졌어",
              "cue": [
                { "start": 6214, "end": 6800, "value": "모" },
                { "start": 6800, "end": 7200, "value": "든" },
                { "start": 7200, "end": 7600, "value": " " },
                { "start": 7600, "end": 8000, "value": "게" },
                { "start": 8000, "end": 8400, "value": " " },
                { "start": 8400, "end": 9000, "value": "달라졌어" }
              ]
            }
          ]
        },
        {
          "kind": "translation",
          "lang": "eng",
          "synced": true,
          "line": [
            { "start": 2747, "value": "The moment I opened my eyes" },
            { "start": 6214, "value": "Everything had changed" }
          ]
        },
        {
          "kind": "pronunciation",
          "lang": "ko-Latn",
          "synced": true,
          "line": [
            { "start": 2747, "value": "nuneul tteun sungan" },
            { "start": 6214, "value": "modeun ge dallajyeosseo" }
          ],
          "cueLine": [
            {
              "index": 0, "start": 2747, "end": 6214,
              "cue": [
                { "start": 2747, "end": 3179, "value": "nuneul" },
                { "start": 3582, "end": 4100, "value": "tteun" },
                { "start": 4500, "end": 6214, "value": "sungan" }
              ]
            },
            {
              "index": 1, "start": 6214, "end": 9000,
              "cue": [
                { "start": 6214, "end": 7200, "value": "modeun" },
                { "start": 7600, "end": 8000, "value": "ge" },
                { "start": 8400, "end": 9000, "value": "dallajyeosseo" }
              ]
            }
          ]
        }
      ]
    }
  }
}
```
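A client consuming a response shaped like this only needs a linear scan to find the token under the playhead. A minimal lookup sketch, assuming cues are sorted and non-overlapping within each cueLine, as the proposal requires:

```python
def active_cue(cue_lines, position_ms):
    """Return (line index, cue value) for the cue under the playhead, or None.

    A cue without an `end` stays active until the next cue in the same
    cueLine starts (or until the cueLine's own end for the last cue).
    """
    for cl in cue_lines:
        if not (cl['start'] <= position_ms < cl.get('end', float('inf'))):
            continue
        cues = cl['cue']
        for i, c in enumerate(cues):
            end = c.get('end', cues[i + 1]['start'] if i + 1 < len(cues)
                        else cl.get('end', float('inf')))
            if c['start'] <= position_ms < end:
                return cl['index'], c['value']
    return None
```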
### Example with background vocals (role on cueLine)

When a line contains both main vocals and background vocals, the server splits them into separate cueLines with the same index:
```json
{
  "cueLine": [
    {
      "index": 0, "start": 1000, "end": 3000, "value": "Hello echo",
      "cue": [
        { "start": 1000, "end": 1400, "value": "He" },
        { "start": 1400, "end": 1800, "value": "llo" }
      ]
    },
    {
      "index": 0, "start": 1000, "end": 3000, "value": "Hello echo", "role": "bg",
      "cue": [
        { "start": 2000, "end": 2500, "value": "echo" }
      ]
    }
  ]
}
```
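Since parallel roles share an index, a renderer can bucket cueLines before layout. A minimal client-side sketch, relying on the rule that the server emits the main-role cueLine first:

```python
from collections import defaultdict

def group_by_index(cue_lines):
    """Group cueLines sharing an index so a renderer can stack the
    parallel roles (main + "bg", voices, ...) of one display line.

    Preserves server order, so the main-role cueLine stays first
    within each bucket."""
    grouped = defaultdict(list)
    for cl in cue_lines:
        grouped[cl['index']].append(cl)
    return dict(grouped)
```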
### Without `enhanced=true` (default behavior)

When `enhanced` is not set or `false`, the response is identical to songLyrics v1:

- Only `kind="main"` entries are returned
- No `cueLine` arrays are included
- The `kind` field may still be present but will always be `"main"`

```json
{
  "lyricsList": {
    "structuredLyrics": [
      {
        "kind": "main",
        "lang": "ko",
        "synced": true,
        "line": [ { "start": 2747, "value": "눈을 뜬 순간" } ]
      }
    ]
  }
}
```

## Backward compatibility impact
Fully backward-compatible. The `enhanced` query parameter gates all new functionality:

- Without `enhanced=true`, the response is identical to songLyrics v1
- `kind` defaults to `"main"` if omitted — existing lyrics entries are implicitly "main" vocals
- `cueLine` is only included when `enhanced=true`
- The `line` array is always present and unchanged
- `cueLine` is a parallel structure, not a replacement for `line`

Servers that don't support TTML or word-level timing simply never include these fields. Clients that don't support karaoke display simply ignore them.
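The gating rule could be sketched server-side like this. A hypothetical helper, not Navidrome's actual code; entry dicts mirror the `structuredLyrics` JSON shape:

```python
def downgrade_response(structured_lyrics, enhanced):
    """With enhanced=False, keep only kind="main" entries and strip
    cueLine, reproducing the songLyrics v1 response shape."""
    if enhanced:
        return structured_lyrics
    out = []
    for entry in structured_lyrics:
        if entry.get('kind', 'main') != 'main':
            continue  # drop translation/pronunciation tracks
        out.append({k: v for k, v in entry.items() if k != 'cueLine'})
    return out
```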
This would be reported as `songLyrics` extension version 2 via `getOpenSubsonicExtensions`, so clients can detect support.
## API details

### Affected endpoint

`getLyricsBySongId`

New query parameter:

| Parameter | Type | Default |
|---|---|---|
| `enhanced` | boolean | `false` |

### Extension versioning

- **Version 1** — the existing `structuredLyrics` response (unchanged)
- **Version 2** — adds the `enhanced` parameter, plus `kind`, `cueLine`, and `cue` as optional response fields

Servers report `"songLyrics": [1, 2]` via `getOpenSubsonicExtensions` when they support the new fields.
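Client-side detection could look like this. A sketch that assumes the `openSubsonicExtensions` list returned by `getOpenSubsonicExtensions` is a list of `{name, versions}` objects:

```python
def supports_song_lyrics_v2(extensions):
    """Check the parsed `openSubsonicExtensions` array, e.g.
    [{"name": "songLyrics", "versions": [1, 2]}], for v2 support."""
    for ext in extensions:
        if ext.get('name') == 'songLyrics' and 2 in ext.get('versions', []):
            return True
    return False
```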
### Role enumeration

The `role` field on `cueLine` is a finite enum:

- `bg` — background vocals
- `voice` — an individual voice part; use `voiceIndex` (0-based integer) to distinguish between voices
- `group` — group/chorus vocals

Additional fields:

- `voiceIndex` — only meaningful when `role` is `"voice"`. Distinguishes individual singers in multi-voice tracks
- `displayRole` — human-readable label for the role

## Security impacts
None. This is read-only lyrics data — no new inputs, no authentication changes, no new endpoints.
## Potential issues

- **`cueLine`/`line` index correspondence** — Servers must ensure `cueLine[i].index` maps validly to `line[j]`. Clients should gracefully handle missing or out-of-range indices.
- **Multiple cueLines per `index`** — Servers MUST emit the main vocals cueLine (empty/omitted role) first, followed by other roles. Clients should group by index if needed for display.
- **Partial end times** — Within a `cueLine`, `cue.end` is all-or-nothing: either every cue has an `end` or none do. Servers must normalize partial end times from source formats. Clients can rely on this invariant for consistent highlighting behavior (syllable-level when ends are present, word-level when absent).
- **`kind` extensibility** — Currently three values. Future values (e.g., `"background"`, `"description"`) should be handled gracefully by clients (treat unknown kinds as displayable text).
- **Response size** — Word/syllable cues make responses larger; the `enhanced` parameter ensures this cost is opt-in.
- **Multi-voice TTML agents** (e.g., `ttm:agent`) — Servers SHOULD map these to `role: "voice"` with sequential `voiceIndex` values, and populate `displayRole` from `ttm:name` when available. When agent information is unavailable or unmapped, all cues appear under the main role. This is a progressive enhancement — not all servers will support it initially.
- **`synced` and `cueLine` relationship** — `cueLine` data is only meaningful when `synced=true`. Servers MUST NOT emit `cueLine` arrays for unsynced lyrics. Clients may assume cueLines always have timing.
- **Overlapping cues** — Cues within a `cueLine` MUST NOT overlap (`cue[n].end` ≤ `cue[n+1].start`). Servers MUST normalize any source overlaps so clients can iterate cues sequentially. Overlapping timing across different cueLines (different `role`/`voiceIndex`) is expected, since those represent parallel vocal layers.
- **Zero-duration cues** — `start == end` may occur in source data. Clients should treat these as instantaneous markers rather than skipping them.
- **Line count mismatches** — A translation/pronunciation `structuredLyrics` entry may have fewer `line` entries than the main entry (e.g., translator omitted some lines). Clients should not assume line arrays are the same length across `kind` tracks.
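Several of these invariants are mechanically checkable. A hypothetical validator sketch, useful on either side of the API; it is illustrative, not part of the spec:

```python
def validate_cue_lines(cue_lines, line_count):
    """Collect violations of three invariants: valid index range,
    all-or-nothing `end` presence, and non-overlapping cues."""
    errors = []
    for n, cl in enumerate(cue_lines):
        if not (0 <= cl['index'] < line_count):
            errors.append(f'cueLine {n}: index {cl["index"]} out of range')
        cues = cl['cue']
        ends = ['end' in c for c in cues]
        if any(ends) and not all(ends):
            errors.append(f'cueLine {n}: mixed end presence')
        for a, b in zip(cues, cues[1:]):
            if 'end' in a and a['end'] > b['start']:
                errors.append(f'cueLine {n}: overlapping cues')
                break
    return errors
```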
## Alternative solutions

- **Extending `line` with inline cues** — would complicate the existing `line` model for clients that don't want cue data. A parallel `cueLine` keeps `line` clean.
- **A separate endpoint (e.g. `getKaraokeLyrics`)** — unnecessary; the `enhanced` parameter makes it opt-in on the existing endpoint.
## Reference implementation

A working implementation exists in Navidrome (PR #5076) that:
- Extends the `getLyricsBySongId` response format with the `enhanced` parameter
## Source format support

The proposed API structure is source-agnostic but designed to accommodate data from:
- **TTML** — `<span>` timing, `ttm:role`, iTunes translation/transliteration metadata
- **ASS/SSA karaoke tags** (`\k`, `\kf`) — can be mapped to cues

I have a working reference implementation in Navidrome and am happy to contribute the spec PR documentation once there's consensus on the approach.
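As an illustration of such a mapping, inline Enhanced LRC word tags (mentioned earlier as a start-only source) could be converted to the proposal's cue list like this. A sketch assuming the `<mm:ss.xx>` tag syntax; per the all-or-nothing rule, `end` is omitted from every cue:

```python
import re

TAG = re.compile(r'<(\d+):(\d+(?:\.\d+)?)>')  # matches e.g. <00:12.34>

def elrc_line_to_cues(text):
    """Convert one Enhanced LRC line body into start-only cue dicts."""
    cues = []
    parts = TAG.split(text)
    # parts: [prefix, minutes, seconds, word, minutes, seconds, word, ...]
    for i in range(1, len(parts), 3):
        start = int(round((int(parts[i]) * 60 + float(parts[i + 1])) * 1000))
        value = parts[i + 2]
        if value:
            cues.append({'start': start, 'value': value})
    return cues
```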