This project is organized into several packages and directories, each contributing to different aspects of the sound law derivation process. The main components of the project are:
- /src/auxiliary: This package contains classes responsible for handling various auxiliary tasks, such as sequence comparison, string preprocessing, and parallel processing. It includes algorithms like Levenshtein distance and Needleman-Wunsch, which are critical for sequence alignment.
- /src/mapping: This package manages the conversion and manipulation of linguistic symbols and orthographies. It includes classes for handling the International Phonetic Alphabet (IPA) and other phonetic and orthographic representations.
- /src/soundsystem: This package is dedicated to modeling phonetic features and sound changes. It includes classes representing individual phonetic elements, such as consonants and vowels, and utilities for manipulating these elements in the context of sound law analysis.
- /src/naive: This package contains the naive algorithm developed as part of this thesis.
- /src/list: This package contains a modified variant of List's algorithm \cite{list_2019}.
- !!!TODO: /visualization: This ...
In the following subsections, we provide detailed descriptions of the classes within the soundsystem package, which are the core of the model.
The FFE (FractionFieldElement) class is designed to handle arithmetic operations on fractional values, which constitute the phone vectors. It implements the FieldElement<T> interface from the Apache Commons Math library as FieldElement<FFE>.
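To illustrate the kind of exact arithmetic such a class provides, the following is a minimal sketch of a fraction type. It is not the project's FFE class and deliberately does not implement the Apache Commons Math FieldElement interface, so that it stays self-contained. The name FractionSketch and all of its methods are hypothetical.

```java
import java.math.BigInteger;

/** Hypothetical sketch of an exact fraction type; not the project's FFE class. */
final class FractionSketch {
    final BigInteger num;
    final BigInteger den;

    FractionSketch(long n, long d) {
        this(BigInteger.valueOf(n), BigInteger.valueOf(d));
    }

    FractionSketch(BigInteger n, BigInteger d) {
        if (d.signum() == 0) {
            throw new ArithmeticException("zero denominator");
        }
        // Normalize the sign so that the denominator is always positive.
        if (d.signum() < 0) {
            n = n.negate();
            d = d.negate();
        }
        // Reduce to lowest terms; gcd(0, d) = d, so zero is stored as 0/1.
        BigInteger g = n.gcd(d);
        this.num = n.divide(g);
        this.den = d.divide(g);
    }

    FractionSketch add(FractionSketch o) {
        return new FractionSketch(num.multiply(o.den).add(o.num.multiply(den)), den.multiply(o.den));
    }

    FractionSketch subtract(FractionSketch o) {
        return new FractionSketch(num.multiply(o.den).subtract(o.num.multiply(den)), den.multiply(o.den));
    }

    FractionSketch multiply(FractionSketch o) {
        return new FractionSketch(num.multiply(o.num), den.multiply(o.den));
    }

    /** Throws ArithmeticException if o is zero. */
    FractionSketch divide(FractionSketch o) {
        return new FractionSketch(num.multiply(o.den), den.multiply(o.num));
    }

    @Override
    public boolean equals(Object other) {
        // Normalization in the constructor makes field-wise comparison sufficient.
        return other instanceof FractionSketch
                && num.equals(((FractionSketch) other).num)
                && den.equals(((FractionSketch) other).den);
    }

    @Override
    public int hashCode() {
        return 31 * num.hashCode() + den.hashCode();
    }

    @Override
    public String toString() {
        return num + "/" + den;
    }
}
```

Because every instance is stored in lowest terms with a positive denominator, two fractions are equal exactly when their fields are equal, which avoids the rounding issues of floating-point feature values.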
The Vowel class is the current implementation for representing vowels within the sound system. It inherits from the Phone class.
The Consonant class represents consonants within the phonetic system. In analogy to the Vowel class, it represents the phonetic features of consonants, such as place and manner of articulation. It inherits from the Phone class.
The Phone class represents generic phones within the sound system. Both Vowel and Consonant inherit from it.
The PlaceholderPhone class is used to represent temporary or placeholder sounds. These are EMPTY_PHONE, UNKNOWN_PHONE, and UNDEFINED_PLACEHOLDER. This class is employed during the preprocessing stage of the algorithm, where certain phonetic elements may need to be temporarily substituted.
The SoundLawMatrix class is responsible for managing the matrix representation of sound laws. It contains several utilities for matrix manipulation.
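The FFEField class is needed for technical reasons in the implementation of FFE: the FieldElement<T> interface requires each element to report the Field<T> to which it belongs.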
The mapping package contains classes for translating between different linguistic representations, like the mapping between orthographic representations and their corresponding phonetic symbols. It defines necessary classes for linguistic abstractions.
The IPA class implements the PhoneticAlphabet interface, specifically mapping between IPA symbols and their corresponding phonetic representations within the system. This class utilizes two primary mappings:
- symbolToPhoneMap: Maps each IPASymbol to a corresponding Phone object, allowing the system to translate IPA symbols into computational representations.
- phoneToSymbolMap: Provides the reverse mapping, converting Phone objects back into their IPA symbol equivalents.
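The two mappings can be pictured as a pair of maps that are always updated together. The following sketch assumes a toy symbol enum and a toy phone type with value semantics. The names IpaMappingSketch, Symbol, Phone, and register are hypothetical and stand in for the project's IPA, IPASymbol, and Phone classes, whose actual fields are not shown here.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a bidirectional symbol/phone mapping. */
final class IpaMappingSketch {
    /** Toy stand-in for an enum of IPA symbols. */
    enum Symbol { P, B, A }

    /** Toy stand-in for a phone; equality is defined by its feature description. */
    static final class Phone {
        final String features;

        Phone(String features) {
            this.features = features;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Phone && ((Phone) o).features.equals(features);
        }

        @Override
        public int hashCode() {
            return features.hashCode();
        }
    }

    private final Map<Symbol, Phone> symbolToPhoneMap = new EnumMap<>(Symbol.class);
    private final Map<Phone, Symbol> phoneToSymbolMap = new HashMap<>();

    IpaMappingSketch() {
        register(Symbol.P, new Phone("voiceless bilabial plosive"));
        register(Symbol.B, new Phone("voiced bilabial plosive"));
        register(Symbol.A, new Phone("open front unrounded vowel"));
    }

    /** Adds one pair to both maps and rejects duplicates, keeping the mapping one-to-one. */
    private void register(Symbol symbol, Phone phone) {
        if (symbolToPhoneMap.containsKey(symbol) || phoneToSymbolMap.containsKey(phone)) {
            throw new IllegalStateException("mapping must be one-to-one");
        }
        symbolToPhoneMap.put(symbol, phone);
        phoneToSymbolMap.put(phone, symbol);
    }

    Phone toPhone(Symbol symbol) {
        return symbolToPhoneMap.get(symbol);
    }

    Symbol toSymbol(Phone phone) {
        return phoneToSymbolMap.get(phone);
    }
}
```

Registering both directions in one place means a round trip from symbol to phone and back returns the original symbol.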
The IPASymbol class defines enums of IPA symbols, which are used for the mappings in the IPA class.
The LatinOrthography class implements the Orthography interface, handling the mapping between Latin orthographic representations (letters and letter combinations) and their corresponding IPASymbol lists. This class maintains two mappings:
- representativeToSymbolMap: Maps Latin orthographic sequences to lists of corresponding IPASymbols.
- symbolToRepresentativeMap: Provides the inverse mapping, allowing the conversion of lists of IPASymbols back into their Latin orthographic forms.
The Orthography interface defines the contract for any class that needs to handle the mapping between orthographic representations and phonetic symbols. Implementations of this interface, such as LatinOrthography, are required to provide methods for converting between strings (or lists of symbols) and their corresponding phonetic representations. This interface ensures consistency across different orthographic systems that might be implemented in the future.
The PhoneticAlphabet interface specifies the methods required for any phonetic alphabet implementation. It defines methods to:
- Map an IPASymbol to a Phone.
- Map a Phone back to an IPASymbol.
- Retrieve the set of all IPASymbols and Phones that the system recognizes.
The IPA class is an implementation of this interface.
The SigmaMapper class serves as a bridge between orthographic sequences and their corresponding phonetic representations. This class provides methods to:
- Convert a string of orthographic symbols into a sequence of IPASymbols.
- Map this sequence to a list of Phone objects.
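The first of these steps can be sketched as a table lookup over letters and letter combinations. The example below assumes a greedy longest-match strategy, which is one plausible way to handle multi-letter sequences and is not confirmed to be what SigmaMapper or LatinOrthography do. The class OrthographySketch, the toy table, and the use of plain strings in place of IPASymbol values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: orthography-to-symbol conversion by greedy longest match. */
final class OrthographySketch {
    // Toy table (assumption): orthographic sequence -> list of symbols, written as strings.
    private static final Map<String, List<String>> TABLE = new HashMap<>();
    private static final int MAX_KEY_LENGTH = 2;

    static {
        TABLE.put("qu", List.of("k", "w"));
        TABLE.put("x", List.of("k", "s"));
        TABLE.put("q", List.of("k"));
        TABLE.put("a", List.of("a"));
        TABLE.put("e", List.of("e"));
        TABLE.put("r", List.of("r"));
        TABLE.put("u", List.of("u"));
    }

    /** Converts a word by always consuming the longest table key that matches at the current position. */
    static List<String> toSymbols(String word) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < word.length()) {
            boolean matched = false;
            for (int len = Math.min(MAX_KEY_LENGTH, word.length() - i); len >= 1; len--) {
                List<String> symbols = TABLE.get(word.substring(i, i + len));
                if (symbols != null) {
                    out.addAll(symbols);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("no mapping at position " + i + " of " + word);
            }
        }
        return out;
    }
}
```

Trying the longer key first ensures that a digraph such as "qu" is consumed as a unit rather than as "q" followed by "u". The second step, mapping the resulting symbols to Phone objects, is then a per-element lookup in a phonetic alphabet.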
The NaiveDerivationAlgorithm class, located in the src/naive package, is responsible for deriving potential sound law candidates by comparing aligned sequences of phonetic symbols. It processes both phonetic symbols (IPASymbol) and character-based sequences, identifying differences and generating candidate sound laws. The algorithm simplifies the derivation process by providing a baseline method for identifying changes between corresponding segments in the aligned sequences. Additionally, it aggregates these sound law candidates and ranks them based on their frequency of occurrence.
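The idea of collecting and ranking candidates can be sketched as follows. This is a simplified illustration under stated assumptions, not the NaiveDerivationAlgorithm itself: sequences are assumed to be already aligned to equal length, "-" is assumed to mark a gap, context is ignored, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: count segment differences in aligned pairs and rank them by frequency. */
final class NaiveDerivationSketch {
    /**
     * Returns candidates of the form "source > target" with their counts,
     * ordered from most to least frequent (ties broken alphabetically).
     */
    static Map<String, Integer> rankCandidates(List<String[]> sources, List<String[]> targets) {
        Map<String, Integer> counts = new HashMap<>();
        for (int k = 0; k < sources.size(); k++) {
            String[] src = sources.get(k);
            String[] tgt = targets.get(k);
            if (src.length != tgt.length) {
                throw new IllegalArgumentException("pair " + k + " is not aligned");
            }
            for (int i = 0; i < src.length; i++) {
                // Every differing position yields one candidate sound law.
                if (!src[i].equals(tgt[i])) {
                    counts.merge(src[i] + " > " + tgt[i], 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        Map<String, Integer> ranked = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            ranked.put(e.getKey(), e.getValue());
        }
        return ranked;
    }
}
```

For the toy aligned pairs l u p u / l o b o, s a p e r e / s a b e r -, and c a p r a / c a b r a, this yields "p > b" three times, "u > o" twice, and "e > -" once, in that order.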
The CorrespondencePatternDetectionAlgorithm class, located in the src/list package, implements a modified version of List’s algorithm \cite{list_2019} for detecting correspondence patterns in aligned linguistic sequences. This algorithm iterates through alignment sites, merges compatible patterns into larger groups, and identifies patterns of sound correspondences.
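The merging step can be illustrated with a deliberately simplified sketch. It assumes that an alignment site is an array with one entry per language, that null marks missing data, and that two sites are compatible when they never disagree where both have data. The single greedy pass shown here is a stand-in chosen for brevity and is not the partitioning procedure of the cited algorithm or of the project's class. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: greedy merging of compatible alignment sites into patterns. */
final class PatternMergeSketch {
    /** Two sites are compatible if they agree in every language where both have data. */
    static boolean compatible(String[] a, String[] b) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] != null && b[i] != null && !a[i].equals(b[i])) {
                return false;
            }
        }
        return true;
    }

    /** Combines two compatible sites, filling gaps in one with data from the other. */
    static String[] merge(String[] a, String[] b) {
        String[] merged = new String[a.length];
        for (int i = 0; i < a.length; i++) {
            merged[i] = a[i] != null ? a[i] : b[i];
        }
        return merged;
    }

    /** Assigns each site to the first compatible pattern, or opens a new pattern. */
    static List<String[]> mergeGreedily(List<String[]> sites) {
        List<String[]> patterns = new ArrayList<>();
        for (String[] site : sites) {
            boolean merged = false;
            for (int p = 0; p < patterns.size(); p++) {
                if (compatible(patterns.get(p), site)) {
                    patterns.set(p, merge(patterns.get(p), site));
                    merged = true;
                    break;
                }
            }
            if (!merged) {
                patterns.add(site.clone());
            }
        }
        return patterns;
    }
}
```

Note that a greedy pass depends on the order of the sites and does not, in general, produce the smallest possible number of patterns.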
The auxiliary package provides utilities and algorithms that support the core functionalities of the project. These classes include sequence comparison algorithms, string preprocessing utilities, and input/output operations.
The LevenshteinDistance class computes the Levenshtein distance \cite{levenshtein_1965}, a method used for calculating the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another.
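For reference, the standard dynamic programming formulation of this distance can be sketched as below. This is a textbook version operating on Java char values and is not taken from the project's LevenshteinDistance class. The class name LevenshteinSketch is hypothetical.

```java
/** Hypothetical sketch of the textbook Levenshtein distance using two rows of the DP table. */
final class LevenshteinSketch {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Transforming the empty prefix of a into b[0..j) takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            // Transforming a[0..i) into the empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            // Reuse the two arrays instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, LevenshteinSketch.distance("kitten", "sitting") returns 3: two substitutions and one insertion.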
The NeedlemanWunschAlgorithm class implements the Needleman-Wunsch algorithm \cite{needleman_wunsch_1970}, a dynamic programming algorithm for global sequence alignment that is commonly used in bioinformatics.
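A compact sketch of the algorithm is given below. The scoring scheme (match +1, mismatch -1, gap -1), the gap character "-", and the class name NeedlemanWunschSketch are assumptions made for illustration; the project's NeedlemanWunschAlgorithm class may use different scores and operate on symbol lists rather than strings.

```java
/** Hypothetical sketch of Needleman-Wunsch global alignment with a simple linear gap penalty. */
final class NeedlemanWunschSketch {
    static final int MATCH = 1;
    static final int MISMATCH = -1;
    static final int GAP = -1;

    private static int substitution(char x, char y) {
        return x == y ? MATCH : MISMATCH;
    }

    /** Fills the score matrix; entry [i][j] is the best score for aligning a[0..i) with b[0..j). */
    static int[][] fill(String a, String b) {
        int[][] s = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            s[i][0] = i * GAP;
        }
        for (int j = 1; j <= b.length(); j++) {
            s[0][j] = j * GAP;
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diagonal = s[i - 1][j - 1] + substitution(a.charAt(i - 1), b.charAt(j - 1));
                int up = s[i - 1][j] + GAP;
                int left = s[i][j - 1] + GAP;
                s[i][j] = Math.max(diagonal, Math.max(up, left));
            }
        }
        return s;
    }

    static int score(String a, String b) {
        return fill(a, b)[a.length()][b.length()];
    }

    /** Returns one optimal alignment as {alignedA, alignedB}, recovered by traceback. */
    static String[] align(String a, String b) {
        int[][] s = fill(a, b);
        StringBuilder x = new StringBuilder();
        StringBuilder y = new StringBuilder();
        int i = a.length();
        int j = b.length();
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0
                    && s[i][j] == s[i - 1][j - 1] + substitution(a.charAt(i - 1), b.charAt(j - 1))) {
                x.append(a.charAt(--i));
                y.append(b.charAt(--j));
            } else if (i > 0 && s[i][j] == s[i - 1][j] + GAP) {
                x.append(a.charAt(--i));
                y.append('-');
            } else {
                x.append('-');
                y.append(b.charAt(--j));
            }
        }
        // The traceback runs from the end, so both strings are built in reverse.
        return new String[] { x.reverse().toString(), y.reverse().toString() };
    }
}
```

With these scores, aligning "ac" with "abc" gives the alignment a-c / abc with a score of 1. Unlike the Levenshtein distance, the result is an explicit alignment with gaps, which is what position-wise comparison of segments requires.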
The SequenceComparator class is a core utility for comparing sequences, utilizing both the LevenshteinDistance and NeedlemanWunschAlgorithm classes.
The StringPreprocessor class is responsible for preparing strings from input files by normalizing, tokenizing, and mapping them to IPASymbol sequences.
The TextReaderWriter class is used for reading files containing textual data.
The XMLParser class is used for parsing XML files.
The parallelization subdirectory within the auxiliary package contains classes designed to optimize the computational efficiency of the sound law derivator by distributing tasks across multiple threads or processes.
The LevenshteinWorker class is a specialized worker class designed to perform Levenshtein distance calculations in parallel.
The NeedlemanWunschWorker class is analogous to the LevenshteinWorker, but instead, it executes the Needleman-Wunsch algorithm.
The RepresentativeWorkerOrganizer class is responsible for coordinating the activities of worker instances involved in sequence comparison tasks. This class ensures that the workload is evenly distributed among workers and that the results are aggregated.
The SymbolWorkerOrganizer class is analogous to the RepresentativeWorkerOrganizer class but processes IPASymbol lists instead of strings.
The WorkerOrganizer class provides methods for launching, monitoring, and coordinating worker threads. Both the RepresentativeWorkerOrganizer and SymbolWorkerOrganizer classes inherit from this class.
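The general pattern of distributing comparison tasks and collecting their results can be sketched with the standard java.util.concurrent executor framework. Whether the project's organizer classes use an ExecutorService or manage threads directly is not stated here, so this is an assumption. The class WorkerOrganizerSketch, the method compareAll, and the metric parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToIntBiFunction;

/** Hypothetical sketch: run one comparison task per pair on a fixed thread pool. */
final class WorkerOrganizerSketch {
    /** Returns the metric value for each pair, in the same order as the input list. */
    static int[] compareAll(List<String[]> pairs, ToIntBiFunction<String, String> metric, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Launch: each pair becomes one task; the pool distributes tasks over its threads.
            List<Future<Integer>> futures = new ArrayList<>();
            for (String[] pair : pairs) {
                Callable<Integer> task = () -> metric.applyAsInt(pair[0], pair[1]);
                futures.add(pool.submit(task));
            }
            // Aggregate: wait for every task and store its result at the pair's index.
            int[] results = new int[pairs.size()];
            for (int i = 0; i < results.length; i++) {
                try {
                    results[i] = futures.get(i).get();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for workers", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("worker failed", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Passing the metric as a parameter lets the same organizer drive different comparison algorithms, mirroring the way separate worker classes exist for the Levenshtein and Needleman-Wunsch computations.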