tumarkin edited this page Sep 14, 2015 · 2 revisions

yente is designed for iterative matching. Because the program runs more slowly as the algorithm is customized to allow for less precise matches, the recommended sequence starts with the fastest, strictest configuration and loosens it one step at a time. This description assumes you are matching names in the FROM-FILE to those in the TO-FILE.

  1. Run an initial match using the default yente configuration. This is a word-order-insensitive, case-insensitive match that rewards pairs of names that share "rare" words. Review these matches using your data tool (Excel, R, SAS, Stata, etc.). Then, create a new file, FROM-FILE-V2, containing only those names in the original FROM-FILE that did not have a match in the TO-FILE.

  2. Using FROM-FILE-V2, add phonetic pre-processing to allow for approximate matching. The phonetic algorithms are considerably faster than the misspelling-based algorithms because they do not require comparing every possible pairing of words. For example, assume you are comparing a three-word name to another three-word name. The misspelling algorithm must compare all possible permutations of the individual words, which in this case is 3! = 6 possibilities. Hence, the misspelling-based algorithm will be approximately 6 times slower than the phonetic algorithm for such names. As before, create a new file, FROM-FILE-V3, containing only those names in FROM-FILE-V2 that did not have a match in the TO-FILE.

  3. Using FROM-FILE-V3, run yente allowing for misspellings. Do not use a phonetic pre-processor. As with the other iterations, create a new file containing only those names in FROM-FILE-V3 that did not have a match in the TO-FILE.

  4. Finally, iteratively run yente with phonetic pre-processing and misspelling algorithms to pick up any remaining matches.
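
Each step above ends by building the next FROM-FILE from the names that are still unmatched. The following is a minimal Python sketch of that filtering step only. It does not call yente, and it assumes you have already pulled the FROM-FILE names you accepted as matches into a list; the function name `unmatched_names` and the sample company names are hypothetical.

```python
from typing import Iterable, List


def unmatched_names(from_names: Iterable[str],
                    matched_from_names: Iterable[str]) -> List[str]:
    """Return the FROM-FILE names that have no accepted match yet.

    Original order is preserved and duplicate names are dropped, so the
    result can be written out as the next FROM-FILE version.
    """
    matched = set(matched_from_names)
    seen = set()
    remaining = []
    for name in from_names:
        if name not in matched and name not in seen:
            seen.add(name)
            remaining.append(name)
    return remaining


# Hypothetical data: names from the original FROM-FILE, and the subset
# that was accepted as matched after reviewing the first yente run.
from_file = ["Acme Corp", "Globex Inc", "Initech LLC"]
accepted = ["Globex Inc"]

# These names would be written out as FROM-FILE-V2 for the next pass.
from_file_v2 = unmatched_names(from_file, accepted)
print(from_file_v2)
```

The same function can be reused after each pass, feeding FROM-FILE-V2 in to produce FROM-FILE-V3, and so on. How the accepted matches are read from yente's output depends on how you export and review them, so that part is left to your data tool.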
