New metrics for Language models: corpus coverage #44

@Trondtr

We want to add corpus coverage as an evaluation measure for our language models; it is perhaps the best evaluation measure we have.

I suggest doing the following for FSTs:

  1. Count the number of words or tokens (probably tokens) in the open corpus (possibly both open and closed, but open is more transparent)
  2. Run them through the descriptive analyser and record how many receive an analysis
  3. In the table under Language Models, add the coverage as a percentage
  4. On the documentation page (next to the map), add the same percentage and also the size of the corpus
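
A minimal sketch of the calculation in steps 1 to 3, assuming the corpus is already tokenised. The function name `corpus_coverage` and the `analyse` callback are hypothetical: `analyse` stands in for whatever wrapper calls the descriptive analyser and returns a (possibly empty) list of analyses for a token. The toy lexicon at the bottom is made up for illustration only.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def corpus_coverage(
    tokens: Iterable[str],
    analyse: Callable[[str], List[str]],
) -> Dict[str, float]:
    """Token-level coverage: share of running tokens with at least one analysis.

    `analyse` is a hypothetical wrapper around the descriptive analyser;
    it must return an empty list for tokens the analyser does not recognise.
    """
    # Count each distinct token once, so the analyser is called per type, not per token.
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = sum(n for tok, n in counts.items() if analyse(tok))
    pct = 100.0 * covered / total if total else 0.0
    return {"tokens": total, "covered": covered, "coverage_pct": round(pct, 1)}


if __name__ == "__main__":
    # Toy stand-in for the analyser (made-up data, not real analyser output).
    toy_lexicon = {"a": ["a+X"], "b": ["b+Y"]}
    result = corpus_coverage(
        ["a", "b", "a", "zzz"],
        lambda tok: toy_lexicon.get(tok, []),
    )
    print(result)  # {'tokens': 4, 'covered': 3, 'coverage_pct': 75.0}
```

The returned dictionary carries both the percentage (for the table) and the token count (for the per-language page), so both numbers come from the same run.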

The table under Language Models is crowded. To save space, I suggest the following:

  1. Remove the version number column from the table and keep the version number only next to the map.
  2. If needed: publish the corpus coverage in the table only as a percentage, not as a percentage plus the number of words in the corpus.
  3. On the page for each language there is more space, so there we can show both the percentage and the number.
