Thursday, May 15, 2014

Data for Insight

Big Data. It's the newest Big Thing in business. Delve into your company's massive database of consumer actions and extract valuable insights - insights that can influence future shopping behaviour.

Aiden and Michel, the authors of Uncharted, turn to a different dataset, Google's database of 30 million digitized books (about one quarter of all books ever published), and a different kind of analysis they call culturomics.

They examine patterns of word usage over the history of all these books to shed light on word origins and usages, politics, history and culture. Actually, to be perfectly accurate, they examine Ngrams: a 1-gram is any string of characters unbroken by a space, such as a word or a number, and an n-gram is a sequence of n of those, such as a phrase. Then they plot the frequency of those Ngrams over time across all books in the database.
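For the curious, here is a rough sketch in Python of what "plot the frequency of an Ngram over time" could look like on a tiny scale. It is only an illustration, not how Aiden, Michel or Google actually do it: the function names and the three-book toy corpus are invented for this example, and I am assuming that "frequency" means the Ngram's count in a given year divided by the total number of same-length Ngrams published that year. It also ignores punctuation and simply splits text on spaces.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def yearly_frequency(books, phrase):
    """Relative frequency of phrase per year.

    books is a list of (year, text) pairs. The frequency for a year is the
    number of times the phrase occurs in that year's books divided by the
    total number of n-grams of the same length in those books (an assumption
    made for this sketch).
    """
    target = " ".join(phrase.lower().split())
    n = len(target.split())
    hits = Counter()
    totals = Counter()
    for year, text in books:
        grams = ngrams(text, n)
        totals[year] += len(grams)
        hits[year] += grams.count(target)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Hypothetical toy corpus: (publication year, text). Invented for illustration.
books = [
    (1900, "tea and more tea"),
    (1900, "coffee or tea"),
    (2000, "coffee and more coffee"),
]

for word in ("tea", "coffee"):
    print(word, yearly_frequency(books, word))
```

Plotting those yearly numbers for each word gives you a miniature version of an Ngram chart. The real thing does the same kind of counting over millions of books.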

This is a simply incredible database to explore. Of course, it doesn't capture all cultural shifts: the dataset includes no publications other than books, and it totally misses the increasing dissemination of information through video. Nevertheless, it's a pretty powerful lens on history.

Obviously, the dataset is a treasure trove of insights for the linguist. Consider the graph for chortle, galumphing and frumious, three words introduced by Lewis Carroll in Jabberwocky, published in 1871.

But such analysis yields insights way beyond mere linguistics. Uncharted discusses who gets famous and the idea of the half-life of fame. One chapter explores the revealing disappearance of artists' names during periods when they were politically suppressed, whether by the Nazis or by the Hollywood blacklist. The Ngram viewer confirms the suppression and provides evidence of it: history, sort of, as demonstrated by a quant!

The tool used by Aiden and Michel is available to anyone as Google's Ngram viewer. Here are just a few charts, some from the book and others that caught my fancy. (Be careful, this site can be addictive!)

Consider the trend in our environmental thinking, as our terminology shifted from greenhouse effect, to global warming, to climate change.

The Ngram viewer makes clear our shift from tea to coffee.

If you didn't already know it, the Ngrams would demonstrate the collapse of Detroit's hegemony in the automobile business.

And what do you think of this chart?

There's been a lot of controversy over Google's book digitization project - impassioned arguments about copyright issues versus the value of such a database. This application skirts the issue of copyright by restricting itself to meta-analysis of the books.

Have fun playing with the Ngram viewer yourself. By the way, the response time for a search like the one above, "Plot me the frequency of the words men and women in 30 million books published since 1800", is truly amazing when you think about it. We sometimes take Google's extraordinary search capability for granted, but this just highlights the powerhouse in those Googleplexes.

