Description
I built my own dataset (~9.5 hours) for the Bahraini dialect of Arabic.
My validation loss is around 1.5.
I think this is partly due to how I defined the Arabic symbols.
Is my implementation correct?
Could someone please help?
_pad = '_'
_punctuation = '.!,؟*: '
_special = '-'
# Phonemes
_vowels = 'واي'
_non_pulmonic_consonants = ''
_pulmonic_consonants = 'لإإلأابتثجحخدذرزسشصضطظعغفقكلمنهويءؤآ'
_suprasegmentals = 'ˈˌːˑ'
_other_symbols = ''
_diacrilics = 'ّ'
_extra_phons = []  # some extra symbols that I found in Wiktionary IPA annotations
# _extra_phons = ['g', 'ɝ', '̃', '̍', '̥', '̩', '̯', '͡']  # some extra symbols that I found in Wiktionary IPA annotations
phonemes = list(
    _pad + _punctuation + _special + _vowels + _non_pulmonic_consonants
    + _pulmonic_consonants + _suprasegmentals + _other_symbols + _diacrilics) + _extra_phons
phonemes_set = set(phonemes)
silent_phonemes_indices = [i for i, p in enumerate(phonemes) if p in _pad + _punctuation]
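
One thing that may be worth checking in the table above: some characters appear more than once. For example, `و`, `ا` and `ي` are listed in both `_vowels` and `_pulmonic_consonants`, so `phonemes` ends up containing repeated entries. The sketch below is a possible sanity check, under the assumption that the tokenizer maps each symbol to its index in the list, in which case duplicates would leave some IDs unused. The helper names `find_duplicates` and `deduplicate` are hypothetical, and the sample string is only a small illustrative subset, not the full symbol set.

```python
from collections import Counter


def find_duplicates(symbols):
    """Return {symbol: count} for every symbol that occurs more than once."""
    counts = Counter(symbols)
    return {s: n for s, n in counts.items() if n > 1}


def deduplicate(symbols):
    """Drop repeated symbols while keeping first-occurrence order."""
    return list(dict.fromkeys(symbols))


# Illustrative subset only (assumption): three vowels plus a few consonants,
# some of which repeat the vowel characters.
_vowels = 'واي'
_some_consonants = 'لبتلويا'
symbols = list(_vowels + _some_consonants)

duplicates = find_duplicates(symbols)
unique_symbols = deduplicate(symbols)

print("duplicated symbols:", duplicates)
print("table size before/after:", len(symbols), len(unique_symbols))
```

Whether duplicates actually contribute to the validation loss is not something this check can show. It only confirms if the symbol table is one-to-one.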