You are viewing nfnitperplexity's journal

Infinite Perplexity

June 11th, 2011

11:14 am: "Invalid node structure" and "incorrect block count for file..." on my iMac hard drive. Google seems to consistently recommend third-party data recovery software. Any recommendations? I do have a backup from two months ago and I don't keep much that's important on my desktop. Ballpark, I would pay $100 to guarantee recovery but I wouldn't pay $500.

April 8th, 2011

09:36 pm: not-so-obvious observations about record linkage and deduplication
1) You ought to treat pairs of observations with at least one missing value differently from pairs with non-matching values. This may seem obvious, but most people's naive attempts at record linkage miss this point. It's often good to do something like "Matching birth dates: +12 points; non-matching birth dates: -6 points; at least one missing birth date: +0 points."

1a) If you want some theoretical basis for your weighting: for variables where most matching records have the same value and most non-matching records do not, a good weight is often +(-log2(chance that the two values would agree if the records were not true matches)) for matches and -(-log2(chance that the two values would disagree even if the records were true matches)) for non-matches. In most cases this means the better your data, the more heavily you weight disagreement, and the more unique your values, the more heavily you weight agreement. In practice for the data I'm dealing with, for widely dispersed variables like first name, last name, and date of birth, that often seemed to mean the weight of agreement was roughly twice the weight of disagreement.

2) The most common names are many orders of magnitude more frequent than the least common names. Trust me, it's more than you think. In many data sets a pair of records with the names "Maria Gonzalez" that have the same birth date is significantly less likely to be a true match than a pair with the names "Winifred Hope" that have contradictory birth dates. For many large data sets, (-log2(f)) is a good weight for agreement on a name, where "f" is the fraction of the data set that the name in question makes up (that's if you're de-duplicating within one data set; if you're comparing two data sets, add up the negative log2s of the two frequencies and divide the sum by two.) The weight of disagreement, on the other hand, barely changes from (1a) above.

2a) Different languages have different distributions of name frequencies. That means if you don't weight by name frequency, your results might be biased differentially by ethnicity.

3a) For all its popularity, SOUNDEX (at least in its original version, or whatever version SAS uses) misses a lot of true matches and has a very high rate of false matches. The Levenshtein edit distance, normalized by the length of the longer of the two strings, works much better. A good cut-off point seems to be 0.3 (that is, up to three-tenths of the letters need to be modified to transform one string into the other); past that, the number of false matches skyrockets. There's also something called Jaro-Winkler that I haven't tried yet.

3b) Assuming that you are looking at approximate matches (such as Levenshtein distance for names or transposed day and month in birth date), you should weight them almost as heavily as exact matches; 90% of the full weight is often appropriate. It's hard for me to make myself do this, because it feels wrong, but I've done the calculations and it's the way to go.

4) Nicknames are a many-to-many, non-transitive relation, which means you can't handle them well via standardization. For example, by definition you can't standardize PAT to both PATRICK and PATRICIA, and you don't want to standardize both PATRICK and PATRICIA to PAT.
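
The weighting in (1), (1a), and (2) can be sketched in R. This is a minimal sketch, not the code I actually run: the function names are made up for illustration, and the probabilities in the usage lines are toy numbers picked only so that the weights come out to the +12/-6 example from (1).

```r
# Hypothetical helper names; the probabilities used below are illustrative only.

# Agreement weight: -log2 of the chance that two records which are NOT
# true matches would agree on this field anyway.
agree.weight <- function(p.agree.if.nonmatch) -log2(p.agree.if.nonmatch)

# Disagreement weight: log2 of the chance that two records which ARE true
# matches would disagree on this field (negative, since the chance is < 1).
disagree.weight <- function(p.disagree.if.match) log2(p.disagree.if.match)

# Score one field for one pair of records. A missing value on either side
# contributes nothing in either direction.
score.field <- function(a, b, w.agree, w.disagree) {
	if(is.na(a) || is.na(b)) return(0)
	if(a == b) w.agree else w.disagree
}

# Frequency-based agreement weight for a name when de-duplicating within
# one data set: f is the fraction of records carrying that name.
name.weight <- function(name, all.names) {
	f <- mean(all.names == name, na.rm=TRUE)
	-log2(f)
}

# Comparing two data sets: average the negative log2s of the two frequencies.
name.weight.two <- function(name, names.a, names.b) {
	(name.weight(name, names.a) + name.weight(name, names.b)) / 2
}

# Toy usage.
w.a <- agree.weight(1/4096)	# 12
w.d <- disagree.weight(1/64)	# -6
score.field("1970-03-01", "1970-03-01", w.a, w.d)	# 12
score.field("1970-03-01", "1970-01-03", w.a, w.d)	# -6
score.field("1970-03-01", NA, w.a, w.d)	# 0

last.names <- c(rep("GONZALEZ", 8), "HOPE", rep("SMITH", 7))
name.weight("GONZALEZ", last.names)	# 1, since 8 of 16 records share it
name.weight("HOPE", last.names)	# 4, since 1 of 16 records has it
```

The total score for a pair would be the sum of the field scores, compared against whatever threshold suits the data.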

These are things I've concluded by fiddling around with the Fellegi-Sunter record linkage method, stripping it down, rounding it off, and simplifying the math as far as I can.
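
For the string comparison in (3a), the 90% rule in (3b), and the nickname problem in (4), here is a rough sketch in R. It assumes base R's adist() for the Levenshtein distance; the function names, the weights in the usage lines, and the tiny nickname table are all hypothetical.

```r
# Levenshtein distance divided by the length of the longer string.
# adist() lives in base R's utils package.
norm.lev <- function(a, b) {
	as.vector(utils::adist(a, b)) / max(nchar(a), nchar(b))
}

# Score a pair of names: full weight for exact agreement, 90% of the full
# weight for an approximate match (normalized distance of at most 0.3),
# the disagreement weight otherwise, and nothing if either name is missing.
score.name <- function(a, b, w.agree, w.disagree, cutoff=0.3, approx.share=0.9) {
	if(is.na(a) || is.na(b)) return(0)
	if(a == b) return(w.agree)
	if(norm.lev(a, b) <= cutoff) return(approx.share * w.agree)
	w.disagree
}

# Nicknames kept as a table of pairs instead of a standardization map, so
# PAT can pair with both PATRICK and PATRICIA without making those two
# names equivalent to each other.
nick.pairs <- data.frame(
	nick=c("PAT", "PAT", "BILL"),
	full=c("PATRICK", "PATRICIA", "WILLIAM"),
	stringsAsFactors=FALSE
)
is.nickname.pair <- function(a, b) {
	any((nick.pairs$nick == a & nick.pairs$full == b) |
	    (nick.pairs$nick == b & nick.pairs$full == a))
}

# Toy usage.
norm.lev("GONZALEZ", "GONZALES")	# 0.125: one substitution over 8 letters
norm.lev("SMITH", "SMYTHE")	# 2/6, which is past the 0.3 cut-off
is.nickname.pair("PAT", "PATRICIA")	# TRUE
is.nickname.pair("PATRICK", "PATRICIA")	# FALSE
```

Because the nickname check is a lookup on the pair rather than a rewrite of either name, it stays non-transitive, which is the whole point of (4).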

February 15th, 2011

03:53 pm: Xander is not able to work on the hula hoop project so I'm taking it over. That includes designing the custom circuit board, which I don't currently know how to do. I installed Eagle and I've been fussing with existing schematics and board layouts for the Arduino Mini Pro and the Arduino Nano. Impressions of circuit design from the ignorant:

1) Circuit boards seem to have simple nervous systems and complex circulatory systems, maybe because the real brainpower is hidden in the integrated circuits. Or it may be that I'm just confused because the schematics tend to abstract away a lot of the power stuff by waving toward the edges of the map and saying, "Here be 5 volts! There be ground!"

2) Capacitors are everywhere! I ask Xander what one of them does and he says, "That one smooths voltage so the chip doesn't get all messed up." And then I ask him what another one does, and he says, "Oh, that one smooths voltage so the chip doesn't get all messed up." So I ask whether they are in fact the same thing, and he says, "No, of course not; this one blah blah blah and that one bleh bleh bleh." Just kidding. He's very helpful; I just haven't managed to wrap my head around this stuff yet. I've noticed that some capacitors interpose themselves between sources of power and delicate things, whereas others simply dangle their toes in the stream.

3) What bugs me most is that I may be learning a very complex skill that I use only for one project. Oh well. I guess after we discover an antidote for the zombie plague I can help get the power plants back online.

February 8th, 2011

09:11 am: book title suggestion
Hard-on For Housing: The Magical Investment That's Always A Good Idea No Matter Your Lifestyle, Your Finances, Or Your Local Market.

There are still lots of people who believe this. Their justification is invariably that not only can housing appreciate but you can also live in it. Do these people also believe that sporks are always the best consumer product you can buy, because they can do two things?

December 3rd, 2010

03:00 pm: Maybe I can just skip from Season 2 to Season 5 of Angel? I mean, obviously there will be huge continuity errors, but (1) I didn't really find Season 2 very interesting, nor did I like what I remember of Season 3, and (2) people just rave about Season 5.

October 29th, 2010

08:53 am: supply-side economics
Sometimes we overlook the simple, economic causes for things:

http://www.newyorker.com/online/blogs/susanorlean/2010/10/haunted.html

This is a matter of supply more than demand. No Chinese manufacturing boom, no Zombie Arm Lawn Stakes. No Moblinky LED Hula Hoop either, for that matter :/

October 21st, 2010

10:49 am: I'd certainly like to believe that many people who will vote Republican this November nevertheless trust the GOP even less than they trust the Democrats, and thus won't support the GOP agenda. But what the numbers I've seen actually say is that in the aggregate, fewer people trust Republicans in Congress than trust Democrats in Congress. These numbers might miss something important about the intensity of distrust - many of those who say they distrust both the Democrats and the Republicans might be hardcore Republican voters who say they don't trust Republicans in Congress 'cuz that's the cool thing for conservatives to say right now. These people certainly trust Democrats even less than they trust Republicans, but that won't show up in aggregate numbers.

That said, I think the mainstream-political-scientist theory of this election cycle is almost certainly right - it's about unemployment, not policy.

October 14th, 2010

03:21 pm: helping my coworkers design programs
"So, let's say Kevin Bacon has syphilis. We INNER JOIN him to the partner database, and pull the control numbers of all his partners. We'll call them the KB1 group. Then we INNER JOIN the KB1 group to the partner database, and pull the control numbers for their partners; those guys are the KB2 group. Then we INNER JOIN the KB2 group to the partner database..."

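The repeated join above is the same step run in a loop until nothing new turns up. A minimal sketch in R, assuming a hypothetical partners table with one row per reported partnership (the column names id and partner.id, and the toy control numbers, are made up):

```r
# Hypothetical partner table, keyed by control number. Each row says that
# "id" named "partner.id" as a partner.
partners <- data.frame(
	id=c("KB", "KB", "A", "B", "C"),
	partner.id=c("A", "B", "C", "C", "D"),
	stringsAsFactors=FALSE
)

# Join the current group to the partner table, keep only control numbers
# not seen before, and repeat until a round finds nobody new.
trace.partners <- function(index.id, partners, max.depth=10) {
	seen <- index.id
	current <- data.frame(id=index.id, stringsAsFactors=FALSE)
	groups <- list()
	for(depth in seq_len(max.depth)) {
		joined <- merge(current, partners, by="id")	# merge() is an inner join by default
		found <- sort(setdiff(unique(joined$partner.id), seen))
		if(length(found) == 0) break
		groups[[paste0("KB", depth)]] <- found
		seen <- c(seen, found)
		current <- data.frame(id=found, stringsAsFactors=FALSE)
	}
	groups
}

kb.groups <- trace.partners("KB", partners)
kb.groups$KB1	# "A" "B"
kb.groups$KB2	# "C"
kb.groups$KB3	# "D"
```

Tracking the control numbers already seen is what keeps the loop from going in circles when two people name each other.
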
October 7th, 2010

03:21 pm: Reads the ANES cumulative raw text file into R, and creates two functions: anes.find, which takes a search phrase and returns the codes and one-line descriptions of all ANES variables whose (one-line) descriptions contain that phrase, and anes.code, which takes a code and returns the multi-line description of the ANES variable with that code. It might be good to do a function that searches the multi-line description as well, but that's more annoying.

~
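rm(list=ls())
setwd('C:/ANES')

# Load the data file and the raw codebook text (one element per line).
anes <- read.csv("anes_cdf_dat.txt")
raw.text <- readLines("anes_cdf_var.txt", n=-1)

# Lookup tables: one-line description -> code, and code -> full description.
name.to.code <- new.env(hash=TRUE)
code.to.var <- new.env(hash=TRUE)
n.lines <- length(raw.text)
i <- 5
while(i<=n.lines) {
	# A variable entry starts with a code-and-name line between two "==" rules.
	if((i+2)<=n.lines && substring(raw.text[i],1,2)=="==" && substring(raw.text[i+2],1,2)=="==") {
		var.code <- substring(raw.text[i+1],1,7)
		var.name <- substring(raw.text[i+1],16)
		# Advance j until the line after i+j opens the next entry (or the file ends).
		j <- 3
		while((i+j+1)<=n.lines && substring(raw.text[i+j+1],1,2)!="==") {
			j <- j+1
		}
		var.desc <- raw.text[(i+3):(i+j-1)]
		assign(var.name,var.code,name.to.code)
		assign(var.code,c(var.name,var.desc),code.to.var)
		i <- i+j
	}
	else i <- i+1
}
# Print the code and one-line description of every variable whose one-line
# description matches my.phrase (a regular expression, case-insensitive).
anes.find <- function(my.phrase) {
	var.names <- ls(name.to.code)
	matches <- grep(my.phrase,var.names,ignore.case=TRUE)
	if(length(matches)==0) {
		print("Search phrase not found.")
	} else for(i in seq_along(matches)) {
			if(i>1) {
				print(" ")
			}
			my.code <- get(var.names[matches[i]],name.to.code)
			print(my.code)
			print(get(my.code,code.to.var)[1])
	}
}
# Print the full multi-line description for a variable code.
anes.code <- function(my.code) {
	print(get(my.code,code.to.var))
}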
~

Example:
> anes.find("abortion")
[1] "VCF0838"
[1] "R Opinion: By Law, When Should Abortion Be Allowed"
[1] " "
[1] "VCF0837"
[1] "R Opinion: When Should Abortion Be Allowed"
[1] " "
[1] "VCF0230"
[1] "Thermometer: Anti-Abortionists"

> anes.code("VCF0837")
 [1] "R Opinion: When Should Abortion Be Allowed"                            
 [2] ""                                                                      
 [3] "QUESTION:"                                                             
 [4] "---------"                                                             
 [5] "(1972,1976: Still on the subject of women's rights,)"                  
 [6] "There has been some discussion about abortion during recent years."    
 [7] "Which one of the opinions on this page (1972: card) best agrees with"  
 [8] "your view?  You can just tell me the number of the opinion you choose."
 [9] ""                                                                      
[10] "VALID_CODES:"                                                          
[11] "------------"                                                          
[12] "1.  Abortion should never be permitted."                               
[13] "2.  Abortion should be permitted only if the life and"                 
[14] "      health of the woman is in danger."                               
[15] "3.  Abortion should be permitted if, due to personal"                  
[16] "      reasons, the woman would have difficulty in caring"              
[17] "      for the child."                                                  
[18] "4.  Abortion should never be forbidden, since one should"              
[19] "      not require a woman to have a child she doesn't"                 
[20] "      want."                                                           
[21] "9.  DK; other"                                                         
[22] ""                                                                      
[23] "MISSING_CODES:"                                                        
[24] "--------------"                                                        
[25] "0.  NA; no Post IW"                                                    
[26] "Inap. question not used"                                               
[27] ""                                                                      
[28] "NOTES:"                                                                
[29] "------"                                                                
[30] "See also VCF0838."                                                     
[31] ""                                                                      
[32] "WEIGHT:"                                                               
[33] "-------"                                                               
[34] "VCF0009/VCF0009a"                                                      
[35] ""                                                                      
[36] "TYPE:"                                                                 
[37] "-----"                                                                 
[38] "Numeric  Dec 0"                                                        
[39] ""                                                                      
[40] "SOURCE_VARS:"                                                          
[41] "------------"                                                          
[42] "1972:  720238 "                                                        
[43] "1976:  763796  "                                                       
[44] "1978:  780450 "                                                        
[45] "1980:  801136"    
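
The function that searches the multi-line descriptions could look roughly like the sketch below. The name anes.find.full is hypothetical, and the block builds a two-entry toy stand-in for code.to.var so it runs on its own; the TOY0001 entry is invented, not a real ANES variable.

```r
# Toy stand-in for the code.to.var environment built by the script:
# each entry is c(one-line description, multi-line description lines...).
code.to.var <- new.env(hash=TRUE)
assign("VCF0837", c("R Opinion: When Should Abortion Be Allowed",
	"", "QUESTION:", "---------",
	"There has been some discussion about abortion during recent years."),
	code.to.var)
assign("TOY0001", c("Made-up variable for illustration",
	"", "NOTES:", "------", "Placeholder text."),
	code.to.var)

# Return the codes of all variables whose one-line or multi-line description
# contains my.phrase (treated as a regular expression, case-insensitive).
anes.find.full <- function(my.phrase, env=code.to.var) {
	codes <- ls(env)
	hit <- vapply(codes, function(code) {
		any(grepl(my.phrase, get(code, env), ignore.case=TRUE))
	}, logical(1))
	codes[hit]
}

anes.find.full("discussion")	# "VCF0837"
anes.find.full("placeholder")	# "TOY0001"
```

Unlike anes.find, this returns the codes rather than printing them, so the result can be passed straight to anes.code.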


September 23rd, 2010

10:07 am: Good to see Brad DeLong is on board with my "dropping money from aircraft" stimulus plan! (He suggests helicopters; I was thinking of crop dusters, but the principle is the same.)
