Tuesday, December 6, 2011

Positive Affect, Web Searches, and Poll Support: Herman Cain and Barack Obama Edition

In our paper, Desperately Seeking Sonia?: Latino Heterogeneity and Geographic Variation in Web Searches for Judge Sonia Sotomayor, my coauthor, Sylvia Manzano, and I investigated heterogeneity in Latinos' support for President Obama's nomination of then-Judge Sonia Sotomayor to the Supreme Court. In particular, we were curious whether Latinos of Puerto Rican ancestry were more supportive of her nomination than Latinos of other national origins. The structure of Latino support for Sotomayor's nomination promised to provide some insight into an important debate about Latino politics in the United States.
In the context of the literature on Latino identity politics, Judge Sotomayor’s nomination to the Supreme Court suggests two competing hypotheses about identity-based attachment to the nominee. One stream of literature suggests interest in the nomination should be relatively consistent across the larger Latino community in the US. The second suggests shared origin-specific attachments to Judge Sotomayor should catalyze greater interest in the nomination among Puerto Ricans compared to other Latino origin groups.
The issue also relates to the usefulness of the President's political calculations in nominating a Latina of Puerto Rican heritage, in part, to generate support among a Latino population that is principally Mexican-American.

The case study is perfect for these issues, but data were a problem. There were no commercial or academic surveys, of which we were or are aware, that included the kind of Latino sample we would have needed to investigate the issue directly at the individual level. What to do? We climbed out on a limb:
Leveraging interest as a proxy for positive affect allows us to utilize publicly available data on state-level web search volumes to explore the relationship between population composition and orientations towards Judge Sotomayor. Here we measure interest in Sotomayor’s nomination by the relative volume of web searches in each state involving the term “Sotomayor” using data provided by Google’s Insights service. Google Insights tracks web searches that use the company’s search engine and provides users with comparative data on the relative volume of searches involving designated terms by city, U.S. state, or country (Google Insights for Search 2009). The system eliminates repeated queries from a single user over a short period of time, so that the level of interest is not artificially influenced by these searches. Geographical units are assigned search term-specific search density scores which theoretically range from zero to one hundred. The unit with the highest search volume for a designated term—relative to the volume of all Google searches, in all languages—is assigned a score of 100. Other units receive scores that reflect their relative search volumes proportionally to the observed maximum. So, a score of 50 indicates that a geographical unit produced half as many Google searches for a specified term than the unit with the greatest search volume relative to the number of total searches in each unit.

Data on internet search volumes are increasingly used by social scientists in fields such as public health and economics to investigate phenomena for which standard survey data may not feasibly be collected (Askitas and Zimmerman 2009). Internet search volumes for relevant search terms have been found to correspond to disease prevalence, home sales, auto sales, and unemployment (see Goel et al 2010 for a review). For example, Choi and Varian (2009) have found a correspondence between Google search values for terms such as “jobs” and “unemployment & benefits” are strongly associated with claims for initial unemployment benefits. This research suggests real-world events, such as a spike in unemployment or the nomination of a Supreme Court justice, produce some predictable or equilibrium level of web search volume among the internet-using population. Our claim is that variance in relative web search volumes related to the Sotomayor nomination across geographical units is an indicator of differential interest in and attachment to her elevation to the Supreme Court. To the extent that this variance is positively associated with the relative sizes Latino and Puerto Rican populations (controlling for other factors related to web searches related to the Sotomayor nomination), we may infer that Judge Sotomayor’s ethnicity and national origin primed heightened affective responses among co-ethnics.
That's right. We used state-level data on Google search volumes as a measure of some state-level aggregation of positive affective attachment to Sotomayor.
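
For readers who want the scoring rule from the quoted passage spelled out, here is a minimal sketch in Python. It is only my reading of the description above, not Google's actual computation; the function name, the state labels, and the search counts are all invented for illustration.

```python
def search_index(term_searches, total_searches):
    """Scale each unit's share of searches so the top unit scores 100."""
    # Share of all searches in each unit that involve the term.
    shares = {
        unit: term_searches[unit] / total_searches[unit]
        for unit in term_searches
    }
    top = max(shares.values())
    if top == 0:
        return {unit: 0 for unit in shares}
    # Every unit is scored in proportion to the observed maximum.
    return {unit: round(100 * share / top) for unit, share in shares.items()}


# Hypothetical counts, invented for illustration only.
term = {"State A": 200, "State B": 50, "State C": 0}
total = {"State A": 10_000, "State B": 5_000, "State C": 1_000}
scores = search_index(term, total)
print(scores)  # {'State A': 100, 'State B': 50, 'State C': 0}
```

The point of the example is that the score reflects searches for the term relative to all searches in the unit, so a small state can outscore a large one.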

As many reviewers in various venues pointed out, that's a hefty substantive assumption. Fair enough. We do our best to justify it in terms of the limited literature on web searches and searching for information more generally, as well as by being as careful as we can internally, trying to introduce statistical controls for raw political support and opposition, for example. I think the assumption is fair (correct even), but the larger issue of whether web searches mean support is still out there.

It has also occurred to me lately though that the Republican presidential primary gives me a nifty chance to explore the meaning of web search volumes in terms of political support and, by extension, the assumption in my collaborative analysis of the Sotomayor nomination.

So, on a whim this morning, I took Herman Cain's poll numbers from the various survey organizations listed on pollingreport.com, from August 31 through the most recent GOP primary survey on November 20 (listed under the website's "GOP field"), and ran them through Stimson's dyadic ratio extraction algorithm (WCALC5) to produce an aggregate measure of Herman Cain's daily national GOP poll support. I also went over to Google Insights and captured the national web search volume index scores for Herman Cain (which are available at weekly intervals). The two time series are illustrated below.

The blue line shows Cain's Stimsonized poll standings, and the red line shows his Google Insights index score for the same period. The series correlate at 0.85 (treating the unsmoothed weekly Google Insights index scores as daily data). Taking account of only one lag, there is no evidence that either series Granger causes the other. Taking account of two lags, there is some evidence that the Stimsonized poll series Granger causes the web search series. It seems more plausible to me that people would search for information about a candidate they newly supported than that they would develop support for a candidate on the basis of web searches. Nevertheless, the funky time structure of these data makes it hard to come to strong conclusions about whether one series causes the other.
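
Treating weekly scores as daily data amounts to repeating each weekly value for every day of its week and then correlating the result with the daily poll series. A minimal sketch, assuming a simple seven-day repeat with no smoothing; the numbers are invented and are not the Cain data, and the function names are my own.

```python
from math import sqrt


def expand_weekly(weekly_values, days_per_week=7):
    """Repeat each weekly value once per day, with no smoothing."""
    return [value for value in weekly_values for _ in range(days_per_week)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("series must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented numbers: two weeks of a weekly index and 14 days of poll support.
weekly_index = [40, 80]
daily_polls = [10, 11, 12, 12, 13, 14, 15, 18, 19, 20, 21, 22, 23, 24]
daily_index = expand_weekly(weekly_index)
r = pearson(daily_index, daily_polls)
print(round(r, 2))
```

Because the expanded series is flat within each week, this kind of correlation mostly picks up week-to-week movement, which is part of why the time structure of the data is awkward for lag-based tests.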

The important thing up front is that the two series are basically the same series. Herman Cain's poll support and web search interest during the period of his rapid rise and decline in the GOP primary field are substantially the same. Whether support causes searches (the claim in the Sotomayor paper) or searches cause support, support and searches are closely tied at the aggregate level.

We win, right?

Sadly, not so fast.

The Cain poll series dries up (for now) on November 20. Cain's Google Insights index, however, keeps going. The Cain poll series (for as far as it goes) and the Cain Google Insights time series (through last week) are illustrated below.

In the period of Cain's rise and initial descent, web search volumes were a very good indicator of his public standing. Over the last two weeks, as interest in Cain has shifted from curiosity about his quirky tax plan to interest in his sex life, web search volumes have taken on a new meaning and will be, it is fair to guess, not positively related to his poll standing during the period in which his scandal is "alive."

As a way to investigate the intuition that web search volumes indicate approval except when they don't (i.e., when some event temporarily attracts attention to a politician that is separate from "normal" attention to him or her), I compared President Obama's approval ratings with Google Insights index scores for "Barack Obama."

So, I compared Obama's average weekly approval ratings from Gallup (which were in my spreadsheet) for the weeks of April 6-12, 2009 through November 28-December 4, 2011 with weekly Google Insights search index scores for "Barack Obama" for April 2009 through December 2011. The period observed is the entire Obama administration minus the first quarter of 2009, when both his approval ratings and search index scores are zany because of his recent inauguration. (I'll probably get the data later, but my first instinct was to drop the honeymoon.) The two time series are illustrated below.

For those of you keeping score at home, those time series correlate at 0.69. To get at the idea of "normal" web search activity versus "unusual" activity, I will work with the two unusually visible spikes in the Google scores. The first occurs in early September 2009 around the President's speech to a joint session of Congress at the height of the debate over the proposed health care reform legislation. The second, even more evident spike is coincident with the death of Osama bin Laden in May 2011.

The correlation between the President's Gallup approval rating and his Google Insights index score for the eleven weeks that fall, in whole or in part, in September 2009 and May 2011 is 0.23. Bootstrapping a correlation coefficient for 500 random draws of 11 weeks out of the remaining time returns an estimate of r=0.75, with a standard error of 0.16. Thus, the correlation between the Google data and the Gallup data of 0.23 for the "unusual weeks" is well outside of the 95% confidence interval around the estimated correlation of 0.75 for the "normal weeks."
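
The comparison above can be sketched in a few lines of Python. This is a rough illustration rather than the exact procedure I ran: it assumes each draw picks 11 distinct weeks from the "normal" weeks, and it assumes the 95% interval is the usual normal approximation of the estimate plus or minus 1.96 standard errors. The function names are hypothetical, and the Gallup and Google series themselves are not reproduced here.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def bootstrap_correlation(xs, ys, draw_size=11, n_draws=500, seed=0):
    """Mean and standard deviation of r across random draws of weeks.

    Assumption: weeks are distinct within a draw (no replacement).
    """
    rng = random.Random(seed)
    indices = range(len(xs))
    estimates = []
    for _ in range(n_draws):
        picked = rng.sample(indices, draw_size)
        estimates.append(
            pearson([xs[i] for i in picked], [ys[i] for i in picked])
        )
    mean_r = sum(estimates) / n_draws
    se = sqrt(sum((r - mean_r) ** 2 for r in estimates) / (n_draws - 1))
    return mean_r, se


def outside_interval(value, estimate, se, z=1.96):
    """True if value lies outside estimate +/- z * se."""
    return abs(value - estimate) > z * se


# Using the figures reported above: 0.75 +/- 1.96 * 0.16 is roughly
# 0.44 to 1.06, so 0.23 falls below the lower bound.
print(outside_interval(0.23, 0.75, 0.16))  # True
```

With the real data, `xs` and `ys` would be the weekly Gallup approval ratings and Google Insights scores for the weeks outside September 2009 and May 2011.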

Once again, though, it is clear that there is a close correspondence between aggregate positive affect and aggregate web search activity, provided one stays sensitive to changes in political context that may, for some periods of time, alter the meaning of expressions of interest in a topic through web searching.

The takeaway here is that web search volumes can be a useful source of information about people's political attachments and interests, but they must be used with great caution since the impetus for searching can change (quickly) through time. Web searches as data may therefore be easiest to use in cross-sectional analyses like those in the Sotomayor paper, where analyzing a short time frame minimizes the danger that the meaning of web search volumes changes over time.

