09:36 pm: not-so-obvious observations about record linkage and deduplication
1) You ought to treat pairs of observations with at least one missing values differently from pairs with non-matching values. This may seem obvious, but most peoples' naive attempts at record linkage miss this point. It's often good to do something like "Matching birth dates: +12 points; non-matching birth dates: -6 points; at least one missing birth date: +0 points."
1a) If you want some theoretical basis for your weighting: for variables where most matching records have the same value and most non-matching records do not, a good weight is often +(-log2(chance that the two values would agree if the records were
not true matches)) for matches and -(-log2(chance that the two values would disagree even if the records
were true matches)) for non-matches. In most cases this means the better your data, the more heavily you weight disagreement, and the more unique your values, the more heavily you weight agreement. In practice for the data I'm dealing with, for widely dispersed variables like first name, last name, and date of birth, that often seemed to mean the weight of agreement was roughly twice the weight of disagreement.
2) The most common names are many orders of magnitude more frequent than the least common names. Trust me, it's more than you think. In many data set a pair of records with the names "Maria Gonzalez" that have the same birth date is significantly less like to be a true match than a pair with the names "Winifred Hope" that have contradictory birth dates. For many large data sets, (-log2(f)) is a good weighting scheme, where "f" is the fraction of the data set that the name in question makes up (that's if your de-duplicating within one data set; if you're comparing two data sets, add the negative log2s of the two frequencies up and divide them in half.) The weight of disagreement, on the other hand, barely changes from (1a) above.
2a) Different languages have different distributions of name frequencies. That means if you don't weight by name frequency, your results might be biased differentially by ethnicity.
3a) For all its popularity, SOUNDEX (at least in its original version, or whatever version SAS uses) misses a lot of true matches and has a very high rate of false matches. The Levenshtein edit distance, normalized by the length of the longer of the two strings, works much better. A good cut-off point seems to be 0.3 (that is, three-tenths of the letters need to be modified to transform one string into the other); past that you get the number of false matches skyrockets. There's also something called Jaro-Winkler that I haven't tried yet.
3b) Assuming that you
are looking at approximate matches (such as Levenshtein distance for names or transposed day and month in birth date), you should weight them almost as heavily as exact matches; 90% of the full weight is often appropriate. It's hard for me to make myself do this, because it
feels wrong, but I've done the calculations and it's the way to go.
4) Nicknames are a many-to-many, non-transitive relation, which means you can't handle them well via standardization. For example, by definition you can't standardize PAT to both PATRICK and PATRICIA, and you don't want to standardize both PATRICK and PATRICIA to PAT.
These are things I've concluded by fiddling around with the Fellegi-Sunter record linkage method, stripping it down, rounding it off, and simplifying the math as far as I can.