Sunday, April 01, 2012

Predictors of anti-Semitism

GSS respondents were asked to give a number between 0 and 100 that describes their feelings toward Jews. They were told that 0 to 50 indicates not feeling favorable toward and not caring for Jews, while 50 to 100 indicates feeling favorable and warm. Among a sample of 4,487, 12.1% gave a number under 50. Multiple regression reveals which factors predict liking Jews:

Standardized OLS Regression Coefficients

Age .098
Sex .069
Race .059
Size of place -.037
South -.021
Education .071
Income .057
Church attendance .144
Political conservatism -.009


All predictors are significantly related to liking Jews with the two exceptions of conservatism and living in the South. Focusing on the negative, here are the characteristics of people more likely to dislike Jews: young, male, non-white, urban, uneducated, poor, and irreligious. The strongest predictor is religiosity. Age and education are also important. It surprises me that people who are young and urban are more likely to be anti-Semitic.

I didn't include IQ as a predictor since it is collinear with education, but if it takes the place of educational level, it is a comparatively strong predictor of liking Jews (the coefficient is .177).
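
For readers who want to see how standardized coefficients like the ones above are obtained, here is a minimal sketch in R. It runs on made-up data: the variable names (age, education, attend, feeling) and every number in it are assumptions for illustration only, not GSS values or the actual model.

```r
# Synthetic stand-in data; none of these values come from the GSS.
set.seed(1)
n <- 500
d <- data.frame(
  age       = rnorm(n, 45, 15),
  education = rnorm(n, 13, 3),
  attend    = sample(0:8, n, replace = TRUE)
)
# Hypothetical outcome clipped to the 0-100 thermometer scale.
d$feeling <- pmin(100, pmax(0,
  50 + 0.1 * d$age + 0.8 * d$education + 1.5 * d$attend + rnorm(n, 0, 15)))

# Standardize every column (mean 0, sd 1), then fit OLS on the z-scores.
z <- as.data.frame(scale(d))
fit <- lm(feeling ~ age + education + attend, data = z)

# Drop the intercept, which is essentially 0 once everything is standardized.
std_beta <- coef(fit)[-1]
print(round(std_beta, 3))
```

Because every variable is on the same scale, the coefficients can be compared with one another directly, which is what makes a statement like "the strongest predictor is religiosity" possible.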

19 comments:

IHTG said...

Shouldn't this post be titled "Predictors of philo-Semitism"?

DR said...

Ron, two statistics-related things:

1) I don't believe you should be using OLS here. The Ys cannot be normally distributed since they are ordinal 0-100 data. OLS is only the MLE when the residuals are normally distributed.

When you think about it, what you're doing is closer to logit regression (e.g. >50 being 1 and <50 being 0). The paper below suggests several GLS loss functions for ranked ordinal regression:

http://ttic.uchicago.edu/~nati/Publications/RennieSrebroIJCAI05.pdf
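
The dichotomized logit idea mentioned above can be sketched roughly as follows. This is only an illustration on simulated data: the predictor name attend, the cutoff handling (a response of exactly 50 is coded 0 here) and all numbers are assumptions, not anything taken from the GSS or from the linked paper.

```r
# Simulated data; 'attend' and all parameter values are hypothetical.
set.seed(5)
n <- 400
attend  <- sample(0:8, n, replace = TRUE)
feeling <- round(pmin(100, pmax(0, 55 + 2 * attend + rnorm(n, 0, 20))))

# Collapse the 0-100 score into a binary outcome: 1 = above 50, 0 = otherwise.
warm <- as.integer(feeling > 50)

# Logistic regression of the binary outcome on the predictor.
fit <- glm(warm ~ attend, family = binomial)
print(coef(summary(fit)))
```

Collapsing to 0/1 throws away the distinction between, say, 55 and 95, which is one reason an ordinal loss or the quantile normalization below may be preferable.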

Personally, though, what I would do is normalize the Ys anyway. Simply map each number to its quantile, then map the quantiles onto a standard normal distribution (mean 0, SD 1), and then regress. E.g., if the median response is 78 and 95% of the responses fall between 46 and 90, then 46 maps to -2, 78 maps to 0, and 90 maps to 2 for purposes of the response variable.

If you use R here's simple code to do this:

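> y = c(60:20, 1:50, 20:30, 30:35) # Example data
> # rank() gives each value's position in the sorted data (ties are averaged); order() would return the sorting permutation instead
> percentiles = (rank(y) - 0.5) / length(y)
> y.normalized = qnorm(percentiles) # map the quantiles onto a standard normal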


2) More generally, you should consider trying cross-validation for variable selection. There's nothing magic about a significant t-stat, especially when you're outside the bounds of OLS, as in this scenario.

In general what works well for me is stepwise regression with cross validation to select my optimal number of steps. Boosted regression trees also work very well but the output's harder to interpret.

If you use R there are a ton of packages to explore on the subject.
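
As a rough illustration of cross-validation for choosing how many predictors to keep, here is a base-R sketch. It is deliberately simplified: the data are simulated, the names x1 to x6 are hypothetical, and the candidate models are added in a fixed order rather than by a true stepwise search.

```r
# Simulated data: only x1 and x2 actually affect y.
set.seed(2)
n <- 300
x <- matrix(rnorm(n * 6), n, 6, dimnames = list(NULL, paste0("x", 1:6)))
y <- 1.0 * x[, 1] + 0.5 * x[, 2] + rnorm(n)
d <- data.frame(y = y, x)

# Assign each row to one of k folds at random.
k <- 5
fold <- sample(rep(1:k, length.out = n))

# Mean out-of-fold squared error for a model using the given predictors.
cv_mse <- function(vars) {
  f <- reformulate(vars, response = "y")
  errs <- sapply(1:k, function(i) {
    fit  <- lm(f, data = d[fold != i, ])
    pred <- predict(fit, newdata = d[fold == i, ])
    mean((d$y[fold == i] - pred)^2)
  })
  mean(errs)
}

# Nested candidates: x1, then x1 + x2, ..., up to x1 + ... + x6.
sizes <- 1:6
mse <- sapply(sizes, function(m) cv_mse(paste0("x", 1:m)))
print(round(mse, 3))
best <- sizes[which.min(mse)]
```

The model size with the lowest held-out error is the one to keep; adding predictors beyond that point tends to fit noise rather than signal.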

Jim Bowery said...

When I ran tens of thousands of ecological correlations at the state level, two variables turned out to be the most predictive of other variables:

1) HIV/AIDS.

2) Jews (with Jewish percent of whites higher than Jews alone).

The signs on the correlations with Jews were almost uniformly those predicted by "antisemites".

One might conclude, from this, that rationality is a strong predictor of "antisemitism". However, it seems that "rationality" can be very selectively destroyed. Perhaps the fact that there are hundreds of movies about the horrors of The Holocaust and not a single movie about The Holodomor is instructive here.

Anonymous said...

Ah, Jim Bowery shows up. Jim, I've been wanting to ask you something for a while. I recently got DNA and haplogroup tests done. I was wondering, how strongly are mtDNA haplogroups correlated with yDNA haplogroups? That is, for a characteristic male DNA bloodline, how strong are the correlations with female haplogroups? I assume some exist. Particularly, I want to know what the "characteristic" female haplogroups are for my yDNA haplogroup.

Anonymous said...

The idea being that, to form a race, male and female haplogroups evolved together.

Anonymous said...

When I ran tens of thousands of ecological correlations

Correlation is not causation.

Jim Bowery said...

The wisely anonymous "Anonymous" farts: Correlation is not causation.

Hey, guy, get your sophomoric bromides straight. It's: "Correlation doesn't imply causation."

If you want to graduate with BS, you should have said: "Ecological correlations are invalid." That's the better "go to" line for people who don't like data since just about anything can be considered an ecology, including an individual organism.

SFG said...

"young, male, non-white, urban, uneducated, poor, and irreligious. The strongest predictor is religiosity. Age and education are also important. It surprises me that people who are young and urban are more likely to be anti-Semitic."

Not if they're also black. It's in your model already, but how does the correlation structure look?

Antisemitism is usually more popular among the poor--the richer you are, the less likely you are to hate rich groups such as Jews or bankers or such.

It's also amusing that religiosity is inversely related to anti-Semitism, much as my terrified compatriots might think otherwise. I'd actually be interested in spreading this result if at all possible; it might decrease anti-evangelical attitudes among Jewry, which I think would be good for all concerned. (You might see a decrease in anti-evangelical propaganda if they didn't think you were trying to drive them into the sea, etc.)

Also, if they knew white people were less anti-Semitic, you might have a few more Jewamongyous and a few less Tim Wises, which would be very good for the cause of race realism indeed.

While I know you'd argue blue-collar people are more likely to suffer the adverse effects of Jewish-inspired policies such as immigration...I'm inclined to think it's old-fashioned class antagonism, because the connections are too well obscured. Any way we could test this hypothesis?

DR said...

"When I ran tens of thousands of ecological correlations at the state level, two variables turned out to be the most predictive of other variables"

There's a name for doing unsupervised learning on a ten-thousand-dimensional feature space with fifty data points. It's called overfitting. None of your data can be trusted in the slightest, especially with regard to what constitutes the most significant variables.

SFG said...

Doesn't overfitting refer to actual model generation? I mean, he's going one-by-one, so even if his significance levels aren't listed, the correlations should be valid. Granted if he correlates 20 variables he should get one that's significant just due to chance, but we should be able to see what the trends are.

And isn't the GSS a lot bigger than 50 people?

DR said...

@SFG

1) Jim Bowery said "When I ran tens of thousands of ecological correlations at the state level." This strongly implies that what he was looking at was correlations between values aggregated at the state level. If he were using individual GSS data, it would be much less of an issue. State-level data only gives you 50 data points.

2) It's overfitting in the following sense. Consider the scenario where I sample 10,000 independent random variables 50 times, such that I have a 10000x50 matrix. Then I run cross correlations for each variable pair, and rank the variables by which ones have the highest average correlation with the other 9999 variables.

In actuality the true correlation is of course 0, because all the variables are independent. In practice, our random samples will show some correlation because of random noise. Therefore some random variables will look like better predictors of others, or more consistently correlated across the board.

One might look at this data and say "A-ha! I've found a few key variables that seem to be related to everything else" (i.e. unsupervised learning). But this is purely an artifact of overfitting, because we know all the variables are independent and any non-zero correlation structure is the result of noise.
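
The thought experiment above is easy to simulate. The sketch below uses 500 variables instead of 10,000 to keep it fast, and ranks variables by mean absolute correlation with all the others; both choices are assumptions made for illustration.

```r
set.seed(3)
n_obs  <- 50     # e.g. one observation per state
n_vars <- 500    # kept small for speed; the scenario above imagines 10,000

# Independent standard normal draws: every true correlation is exactly 0.
x <- matrix(rnorm(n_obs * n_vars), n_obs, n_vars)

r <- cor(x)          # n_vars x n_vars matrix of sample correlations
diag(r) <- NA        # ignore each variable's correlation with itself
mean_abs_r <- rowMeans(abs(r), na.rm = TRUE)

# Sample correlations are far from 0 even though the true ones are 0.
print(summary(abs(r[upper.tri(r)])))

# The variables that "look" most connected to everything else, purely by chance.
print(head(sort(mean_abs_r, decreasing = TRUE), 3))
```

With only 50 observations, some pairs of completely unrelated variables will show sizable sample correlations, and some variables will top the ranking for no reason other than noise.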

Going back to it, Jim Bowery made no effort to distinguish overfitted noise from actual signal. He just looked at raw correlations and declared some variables to be important. In traditional statistics you absolutely must have some metric to distinguish noise from signal. Common approaches include looking at t-stat significance or out-of-sample/cross-validation performance.

Jim Bowery said...

First of all, DR is being pedantic. To make the claim that out of hundreds of by-state ecological variables there is going to be insignificant structure flies in the face not only of common sense, but just about everything that has been done in the social sciences on related data.

Secondly, I did, in fact, run the same program (sum the coefficients of determination with all other variables) but just to be perverse (and try to measure exactly what DR is claiming) all the variables were assigned normally distributed random numbers. I even kept their goddamn names just to be silly. The numbers did not come out nearly the same. I didn't keep the results around because it was just on a whim that I ran it -- knowing what the outcome would be -- and not imagining anyone would posit the absurdity that DR has. It just wasn't interesting and neither is DR's "critique".

SFG said...

OK, I was referring to Inductivist's work, not Bowery's.

That said, I looked at Laboratory of the States back when he had it up, and it was as he described. Of course, I don't remember if he had population density in there--big cities have both HIV and Jews.

Tiger said...

Jim Bowery, I have respected your work ever since you posted your correlations on Kuro5hin 10 years ago. I am the anonymous that asked about correlations of haplogroups.

I'm sure by now you have moved on past DNA and haplogroups. But with your ability to crunch numbers, is this something you could do? I'd like to see which female haplogroups correspond with which male haplogroups (especially my own, R1b1)

I saw your wedding announcement years ago and hope that is going well.

DR said...

@Jim

Of course I'm not suggesting that there's zero correlation between all "ecological variables" (I'm not even sure what that specifically refers to) at the state level. You're attacking a strawman.

The point of the simple hypothetical in my scenario was to demonstrate that when you calculate a correlation, what you get out is a combination of two components: the "true" correlation structure and a noise component. Furthermore, the noise component is significantly large when you have a 10,000+ feature space and 50 data points.

You made a claim about the relative importance of particular variables with regard to the correlation structure. Specifically, that two variables are more strongly correlated with the rest than any of the others are.

My point is that since you have made no attempt to measure the error rate from the noise component, your claim about the relative importance of these variables cannot be trusted.

If we knew the "true" correlation structure then we could get some true ranking of the variables using whatever metric you did. However once we introduce the noise component it's going to significantly alter the sampled rankings of the variables.


A simple approach to measuring this error would be to use bootstrapping. Do the following, say, 1000 times: randomly select, say, 25 of the 50 states and apply whatever correlation ranking metric you used. (I'm still not sure exactly what this is. The average magnitude of a variable's cross-correlations?)

Rank the variables for each iteration's sub-sample. Then measure the proportion of times that HIV/Jews comes out in the top two. If you have the dataset available I can do this for you with 4 lines of code in R.
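
A sketch of that resampling check, on synthetic data since the real dataset isn't at hand. Everything here is an assumption for illustration: the variable names v1 to v40, the planted structure that makes v1 and v2 the "hub" variables, and the ranking metric (sum of squared correlations with all other variables).

```r
set.seed(4)
n_states <- 50
n_vars   <- 40
latent <- rnorm(n_states)
x <- matrix(rnorm(n_states * n_vars), n_states, n_vars,
            dimnames = list(NULL, paste0("v", 1:n_vars)))

# Hypothetical structure: v1..v10 share a latent factor; v1 and v2 load most heavily.
x[, 1:2]  <- x[, 1:2]  + 3 * latent
x[, 3:10] <- x[, 3:10] + 1 * latent

# Assumed ranking metric: sum of squared correlations with all other variables.
score <- function(m) {
  r2 <- cor(m)^2
  diag(r2) <- 0
  rowSums(r2)
}

# Repeatedly rank the variables on a random half of the states.
n_rep <- 1000
top2_hits <- replicate(n_rep, {
  rows <- sample(n_states, 25)
  top2 <- names(sort(score(x[rows, ]), decreasing = TRUE))[1:2]
  all(c("v1", "v2") %in% top2)
})

# Share of subsamples in which v1 and v2 are both in the top two.
print(mean(top2_hits))
```

A share close to 1 would suggest the ranking is stable under resampling; a share close to chance would suggest the top spots are mostly noise.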

Jim Bowery said...

DR, there were obviously-correlated variables like AIDS Prevalence (an obvious HIV positivity correlate), income disparity, immigration, etc. all coming up at the high end, and over the years as I added variables, sometimes large groups of variables, the rank order was reasonably stable. These variables never left the 95th percentile.

But it would be interesting to get more information on the structure, so if you're willing to do the work I'll see if I can locate a backup.