[API Request] songLyrics v2: Token-level karaoke timing and lyric kind (translation/pronunciation) support #213
2 comments · 7 replies
---
@Tolriq — I'd love your input on this. Symfonium already has excellent TTML lyrics rendering (I use it daily via Google Drive as a media provider specifically for this). This proposal would bring that same experience to Subsonic/OpenSubsonic providers like Navidrome, so users wouldn't need to rely on Google Drive/WebDAV workarounds just for rich lyrics support. The API changes are additive and backward-compatible.
---
This just moves the problem to the client side, with different clients that can handle it differently. Especially if the current line has no end and the next line is far away, or there are multiple voices and it's up to the client to parse everything to find which line actually continues that voice. Symfonium does all that, plus uses timing derivation from the previous cues to estimate durations from the number of letters, with heuristics and max delays. For word highlighting this is less important, but it matters a lot for syllable highlighting. I won't enforce this if you don't agree, but IMO consistency between the clients is best, even if we can't 100% ensure consistency between the servers for all the strange eLRC files. Maybe at least ensure that if there's one cue with an end, then all cues need an end. This ensures that clients that do syllable highlighting work consistently, and those that only do word highlighting (so usually without ends, because estimating the end of the previous cue from the next cue's start really gives bad results 60% of the time) are not impacted too much, as they only highlight the whole word and probably stop highlighting when displaying the next line anyway. While Apple Music TTMLs are clean, the number of eLRC variations in the wild is wild ;)
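The letter-count estimation described here could be sketched roughly as follows. This is an illustrative guess at that kind of heuristic, not Symfonium's actual code; `MAX_DELAY_MS` and the fallback per-character rate are invented constants.

```python
MAX_DELAY_MS = 800  # assumed cap on how long a cue may stay highlighted

def estimate_ends(cues):
    """Fill missing 'end' times on cues of shape {'start', 'value'[, 'end']}.

    Derives an average per-character duration from cues that do have ends,
    then caps each guess by the next cue's start and a max delay.
    """
    timed = [c for c in cues if 'end' in c and c['value'].strip()]
    if timed:
        per_char = sum(c['end'] - c['start'] for c in timed) / max(
            1, sum(len(c['value']) for c in timed))
    else:
        per_char = 150.0  # fallback guess, ms per character
    out = []
    for i, c in enumerate(cues):
        c = dict(c)
        if 'end' not in c:
            guess = c['start'] + len(c['value']) * per_char
            if i + 1 < len(cues):
                guess = min(guess, cues[i + 1]['start'])
            c['end'] = int(min(guess, c['start'] + MAX_DELAY_MS))
        out.append(c)
    return out
```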
---
## Type of change

API extension
## Proposal description

Extend the existing `structuredLyrics` response from `getLyricsBySongId` with optional fields to support word/syllable-level timing (karaoke) and lyric layer classification (main vocals, translations, pronunciations), gated behind an `enhanced` query parameter for backward compatibility.

## Motivation

The current `structuredLyrics` model is line-level only — each `line` has a `start` timestamp and a `value` string. This was sufficient for LRC-era lyrics, but modern lyrics formats and services provide much richer data (e.g., word timestamps such as `<hh:mm:ss.ms>` within lines).

Modern music clients (Apple Music, Spotify) display lyrics with word-by-word highlighting (karaoke mode), not just line-by-line scrolling. CJK music frequently includes pronunciation guides (furigana/romaji for Japanese, pinyin for Chinese) and translations as parallel tracks.

Without API support for these features, OpenSubsonic clients cannot match the lyrics experience of commercial streaming apps, even when servers have access to rich lyrics data.
## Proposed Changes

### 1. New query parameter: `enhanced` on `getLyricsBySongId`

| Parameter | Type | Default | Description |
|---|---|---|---|
| `enhanced` | boolean | `false` | When `true`, the response includes `cueLine` arrays and non-main `kind` tracks (translations, pronunciations). When `false` or omitted, only `kind="main"` entries are returned with no `cueLine` data. |

This parameter ensures full backward compatibility: clients that don't pass `enhanced=true` receive exactly the same response as today (only main lyrics, line-level only). This avoids breaking any existing client.

### 2. New optional field: `kind` on `structuredLyrics`
| Field | Type | Default | Allowed values |
|---|---|---|---|
| `kind` | string | `"main"` | `"main"`, `"translation"`, `"pronunciation"` |

- `"main"` — Primary vocals lyrics (default if omitted, backward-compatible)
- `"translation"` — A translation of the main lyrics into another language
- `"pronunciation"` — A phonetic/romanized rendering of the main lyrics (e.g., romaji for Japanese, pinyin for Chinese)

Translation and pronunciation tracks are only returned when `enhanced=true`.

### 3. New optional field: `cueLine` on `structuredLyrics`
A `cueLine` array provides word/syllable-level timing as a parallel structure to `line`. Each `cueLine` corresponds to a `line` by index. Only returned when `enhanced=true`.

The server pre-splits different vocal roles into separate `cueLine` entries with the same `index`. For example, if line 0 contains both main vocals and background vocals, the server returns two `cueLine` objects for index 0 — one with `role=""` (or omitted) for the main vocals, and one with `role="bg"` for the background vocals.

`cueLine` object:

| Field | Type | Description |
|---|---|---|
| `index` | integer | Index of the entry in the `line` array this cueLine corresponds to |
| `start` | integer | Line start time, in milliseconds |
| `end` | integer | Line end time, in milliseconds |
| `value` | string | Full text of the line |
| `role` | string | `"bg"` (background vocals), `"voice"` (individual voice part — use `voiceIndex` to distinguish), `"group"` (group/chorus) |
| `voiceIndex` | integer | Only meaningful when `role` is `"voice"`. Used to distinguish individual singers in multi-voice tracks (e.g. duets, ensembles). The numbering is server-assigned per track and has no cross-track meaning |
| `displayRole` | string | Human-readable label for the role (e.g. from TTML `ttm:name`, or a descriptive tag like `"Tenor"`). Servers include this when the source data provides a name or label; omit otherwise |
| `cue` | array[cue] | The word/syllable cues of this line |

`cue` object:

| Field | Type | Description |
|---|---|---|
| `start` | integer | Cue start time, in milliseconds |
| `end` | integer | Cue end time, in milliseconds. Within a `cueLine`, `end` MUST be either present on all cues or none. When the source provides partial end times, servers MUST fill missing values (e.g., using the next cue's `start`, or the cueLine's `end` for the final cue). When no cues have end times (e.g., Enhanced LRC with start-only timing), `end` is omitted from all cues |
| `value` | string | Text of the word/syllable |
Note: Individual `cue` objects do not carry a `role`. The role is defined at the `cueLine` level, and all cues within a cueLine share that role. This design means the server does the work of splitting by role, so clients can simply iterate cueLines without needing to group or filter.
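The fill-in rule for partial end times could be implemented like this. A minimal server-side sketch; dict keys mirror the JSON field names.

```python
def normalize_cue_ends(cue_line):
    """Enforce the all-or-nothing `end` invariant on one cueLine dict.

    If some cues have ends and some do not, fill the gaps from the next
    cue's start (or the cueLine's own end for the final cue). If no cue
    has an end (start-only source such as Enhanced LRC), leave all bare.
    """
    cues = cue_line['cue']
    if not any('end' in c for c in cues):
        return cue_line  # start-only timing: omit `end` everywhere
    for i, c in enumerate(cues):
        if 'end' not in c:
            if i + 1 < len(cues):
                c['end'] = cues[i + 1]['start']
            else:
                c['end'] = cue_line['end']
    return cue_line
```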
### Example Response (JSON)

```json
{
  "subsonic-response": {
    "status": "ok",
    "version": "1.16.1",
    "type": "navidrome",
    "serverVersion": "0.55.0",
    "openSubsonic": true,
    "lyricsList": {
      "structuredLyrics": [
        {
          "kind": "main",
          "lang": "ko",
          "synced": true,
          "line": [
            { "start": 2747, "value": "눈을 뜬 순간" },
            { "start": 6214, "value": "모든 게 달라졌어" }
          ],
          "cueLine": [
            {
              "index": 0, "start": 2747, "end": 6214, "value": "눈을 뜬 순간",
              "cue": [
                { "start": 2747, "end": 3018, "value": "눈" },
                { "start": 3018, "end": 3179, "value": "을" },
                { "start": 3179, "end": 3582, "value": " " },
                { "start": 3582, "end": 4100, "value": "뜬" },
                { "start": 4100, "end": 4500, "value": " " },
                { "start": 4500, "end": 5200, "value": "순" },
                { "start": 5200, "end": 6214, "value": "간" }
              ]
            },
            {
              "index": 1, "start": 6214, "end": 9000, "value": "모든 게 달라졌어",
              "cue": [
                { "start": 6214, "end": 6800, "value": "모" },
                { "start": 6800, "end": 7200, "value": "든" },
                { "start": 7200, "end": 7600, "value": " " },
                { "start": 7600, "end": 8000, "value": "게" },
                { "start": 8000, "end": 8400, "value": " " },
                { "start": 8400, "end": 9000, "value": "달라졌어" }
              ]
            }
          ]
        },
        {
          "kind": "translation",
          "lang": "eng",
          "synced": true,
          "line": [
            { "start": 2747, "value": "The moment I opened my eyes" },
            { "start": 6214, "value": "Everything had changed" }
          ]
        },
        {
          "kind": "pronunciation",
          "lang": "ko-Latn",
          "synced": true,
          "line": [
            { "start": 2747, "value": "nuneul tteun sungan" },
            { "start": 6214, "value": "modeun ge dallajyeosseo" }
          ],
          "cueLine": [
            {
              "index": 0, "start": 2747, "end": 6214,
              "cue": [
                { "start": 2747, "end": 3179, "value": "nuneul" },
                { "start": 3582, "end": 4100, "value": "tteun" },
                { "start": 4500, "end": 6214, "value": "sungan" }
              ]
            },
            {
              "index": 1, "start": 6214, "end": 9000,
              "cue": [
                { "start": 6214, "end": 7200, "value": "modeun" },
                { "start": 7600, "end": 8000, "value": "ge" },
                { "start": 8400, "end": 9000, "value": "dallajyeosseo" }
              ]
            }
          ]
        }
      ]
    }
  }
}
```
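A client consuming a response shaped like this only needs a linear scan to find the token under the playhead. A minimal lookup sketch, assuming cues are sorted and non-overlapping within each cueLine, as the proposal requires:

```python
def active_cue(cue_lines, position_ms):
    """Return (line index, cue value) for the cue under the playhead, or None.

    A cue without an `end` stays active until the next cue in the same
    cueLine starts (or until the cueLine's own end for the last cue).
    """
    for cl in cue_lines:
        if not (cl['start'] <= position_ms < cl.get('end', float('inf'))):
            continue
        cues = cl['cue']
        for i, c in enumerate(cues):
            end = c.get('end', cues[i + 1]['start'] if i + 1 < len(cues)
                        else cl.get('end', float('inf')))
            if c['start'] <= position_ms < end:
                return cl['index'], c['value']
    return None
```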
### Example with background vocals (role on cueLine)

When a line contains both main vocals and background vocals, the server splits them into separate cueLines with the same index:
```json
{
  "cueLine": [
    {
      "index": 0, "start": 1000, "end": 3000, "value": "Hello echo",
      "cue": [
        { "start": 1000, "end": 1400, "value": "He" },
        { "start": 1400, "end": 1800, "value": "llo" }
      ]
    },
    {
      "index": 0, "start": 1000, "end": 3000, "value": "Hello echo", "role": "bg",
      "cue": [
        { "start": 2000, "end": 2500, "value": "echo" }
      ]
    }
  ]
}
```
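Since parallel roles share an index, a renderer can bucket cueLines before layout. A minimal client-side sketch, relying on the rule that the server emits the main-role cueLine first:

```python
from collections import defaultdict

def group_by_index(cue_lines):
    """Group cueLines sharing an index so a renderer can stack the
    parallel roles (main + "bg", voices, ...) of one display line.

    Preserves server order, so the main-role cueLine stays first
    within each bucket."""
    grouped = defaultdict(list)
    for cl in cue_lines:
        grouped[cl['index']].append(cl)
    return dict(grouped)
```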
### Without `enhanced=true` (default behavior)

When `enhanced` is not set or `false`, the response is identical to songLyrics v1:

- Only `kind="main"` entries are returned
- No `cueLine` arrays are included
- The `kind` field may still be present but will always be `"main"`

```json
{
  "lyricsList": {
    "structuredLyrics": [
      {
        "kind": "main",
        "lang": "ko",
        "synced": true,
        "line": [ { "start": 2747, "value": "눈을 뜬 순간" } ]
      }
    ]
  }
}
```

## Backward compatibility impact
Fully backward-compatible. The `enhanced` query parameter gates all new functionality:

- Without `enhanced=true`, the response is identical to songLyrics v1
- `kind` defaults to `"main"` if omitted — existing lyrics entries are implicitly "main" vocals
- `cueLine` is only included when `enhanced=true`
- The `line` array is always present and unchanged
- `cueLine` is a parallel structure, not a replacement for `line`

Servers that don't support TTML or word-level timing simply never include these fields. Clients that don't support karaoke display simply ignore them.
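The gating rule could be sketched server-side like this. A hypothetical helper, not Navidrome's actual code; entry dicts mirror the `structuredLyrics` JSON shape:

```python
def downgrade_response(structured_lyrics, enhanced):
    """With enhanced=False, keep only kind="main" entries and strip
    cueLine, reproducing the songLyrics v1 response shape."""
    if enhanced:
        return structured_lyrics
    out = []
    for entry in structured_lyrics:
        if entry.get('kind', 'main') != 'main':
            continue  # drop translation/pronunciation tracks
        out.append({k: v for k, v in entry.items() if k != 'cueLine'})
    return out
```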
This would be reported as `songLyrics` extension version 2 via `getOpenSubsonicExtensions`, so clients can detect support.
## API details

### Affected endpoint

`getLyricsBySongId`

New query parameter:

| Parameter | Type | Default |
|---|---|---|
| `enhanced` | boolean | `false` |

### Extension versioning

- **Version 1** — the existing `structuredLyrics` response (unchanged)
- **Version 2** — adds the `enhanced` parameter, plus `kind`, `cueLine`, and `cue` as optional response fields

Servers report `"songLyrics": [1, 2]` via `getOpenSubsonicExtensions` when they support the new fields.
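Client-side detection could look like this. A sketch that assumes the `openSubsonicExtensions` list returned by `getOpenSubsonicExtensions` is a list of `{name, versions}` objects:

```python
def supports_song_lyrics_v2(extensions):
    """Check the parsed `openSubsonicExtensions` array, e.g.
    [{"name": "songLyrics", "versions": [1, 2]}], for v2 support."""
    for ext in extensions:
        if ext.get('name') == 'songLyrics' and 2 in ext.get('versions', []):
            return True
    return False
```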
### Role enumeration

The `role` field on `cueLine` is a finite enum:

- `bg` — background vocals
- `voice` — an individual voice part; use `voiceIndex` (0-based integer) to distinguish between voices
- `group` — group/chorus vocals

Additional fields:

- `voiceIndex` — only meaningful when `role` is `"voice"`. Distinguishes individual singers in multi-voice tracks
- `displayRole` — human-readable label for the role

## Security impacts
None. This is read-only lyrics data — no new inputs, no authentication changes, no new endpoints.
## Potential issues

- **`cueLine`/`line` index correspondence** — Servers must ensure `cueLine[i].index` maps validly to `line[j]`. Clients should gracefully handle missing or out-of-range indices.
- **Multiple cueLines per `index`** — Servers MUST emit the main vocals cueLine (empty/omitted role) first, followed by other roles. Clients should group by index if needed for display.
- **Partial end times** — Within a `cueLine`, `cue.end` is all-or-nothing: either every cue has an `end` or none do. Servers must normalize partial end times from source formats. Clients can rely on this invariant for consistent highlighting behavior (syllable-level when ends are present, word-level when absent).
- **`kind` extensibility** — Currently three values. Future values (e.g., `"background"`, `"description"`) should be handled gracefully by clients (treat unknown kinds as displayable text).
- **Response size** — Word/syllable cues make responses larger; the `enhanced` parameter ensures this cost is opt-in.
- **Multi-voice TTML agents** (e.g., `ttm:agent`) — Servers SHOULD map these to `role: "voice"` with sequential `voiceIndex` values, and populate `displayRole` from `ttm:name` when available. When agent information is unavailable or unmapped, all cues appear under the main role. This is a progressive enhancement — not all servers will support it initially.
- **`synced` and `cueLine` relationship** — `cueLine` data is only meaningful when `synced=true`. Servers MUST NOT emit `cueLine` arrays for unsynced lyrics. Clients may assume cueLines always have timing.
- **Overlapping cues** — Cues within a `cueLine` MUST NOT overlap (`cue[n].end` ≤ `cue[n+1].start`). Servers MUST normalize any source overlaps so clients can iterate cues sequentially. Overlapping timing across different cueLines (different `role`/`voiceIndex`) is expected, since those represent parallel vocal layers.
- **Zero-duration cues** — `start == end` may occur in source data. Clients should treat these as instantaneous markers rather than skipping them.
- **Line count mismatches** — A translation/pronunciation `structuredLyrics` entry may have fewer `line` entries than the main entry (e.g., translator omitted some lines). Clients should not assume line arrays are the same length across `kind` tracks.
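Several of these invariants are mechanically checkable. A hypothetical validator sketch, useful on either side of the API; it is illustrative, not part of the spec:

```python
def validate_cue_lines(cue_lines, line_count):
    """Collect violations of three invariants: valid index range,
    all-or-nothing `end` presence, and non-overlapping cues."""
    errors = []
    for n, cl in enumerate(cue_lines):
        if not (0 <= cl['index'] < line_count):
            errors.append(f'cueLine {n}: index {cl["index"]} out of range')
        cues = cl['cue']
        ends = ['end' in c for c in cues]
        if any(ends) and not all(ends):
            errors.append(f'cueLine {n}: mixed end presence')
        for a, b in zip(cues, cues[1:]):
            if 'end' in a and a['end'] > b['start']:
                errors.append(f'cueLine {n}: overlapping cues')
                break
    return errors
```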
## Alternative solutions

- **Extending `line` with inline cues** — would complicate the existing `line` model for clients that don't want cue data. A parallel `cueLine` keeps `line` clean.
- **A separate endpoint (e.g. `getKaraokeLyrics`)** — unnecessary; the `enhanced` parameter makes it opt-in on the existing endpoint.
## Reference implementation

A working implementation exists in Navidrome (PR #5076) that:
- Extends the `getLyricsBySongId` response format with the `enhanced` parameter
## Source format support

The proposed API structure is source-agnostic but designed to accommodate data from:
- **TTML** — `<span>` timing, `ttm:role`, iTunes translation/transliteration metadata
- **ASS/SSA karaoke tags** (`\k`, `\kf`) — can be mapped to cues

I have a working reference implementation in Navidrome and am happy to contribute the spec PR documentation once there's consensus on the approach.
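As an illustration of such a mapping, inline Enhanced LRC word tags (mentioned earlier as a start-only source) could be converted to the proposal's cue list like this. A sketch assuming the `<mm:ss.xx>` tag syntax; per the all-or-nothing rule, `end` is omitted from every cue:

```python
import re

TAG = re.compile(r'<(\d+):(\d+(?:\.\d+)?)>')  # matches e.g. <00:12.34>

def elrc_line_to_cues(text):
    """Convert one Enhanced LRC line body into start-only cue dicts."""
    cues = []
    parts = TAG.split(text)
    # parts: [prefix, minutes, seconds, word, minutes, seconds, word, ...]
    for i in range(1, len(parts), 3):
        start = int(round((int(parts[i]) * 60 + float(parts[i + 1])) * 1000))
        value = parts[i + 2]
        if value:
            cues.append({'start': start, 'value': value})
    return cues
```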