The Google Books N-gram corpus contains an enormous volume of digitized data, which, to the best of our knowledge, sociologists have yet to fully utilize. In this paper, we mine this data to shed light on the discipline itself by conducting the first empirical study to map the disciplinary advancement of sociology from the mid-nineteenth century to 2008. We analyse the usage frequency of the most common terms in five major sociology categories: disciplinary advancement, scholars of sociology, theoretical dimensions, fields of sociology, and research methodologies. We also construct an overall index deriving from all sociology-related key words using the principal component method to demonstrate the overall influence of sociology as a discipline. Charting the historical evolution of the examined terms provides rich insights regarding the emergence and development of sociological norms, practices, and boundaries over the past two centuries. This novel application of massive content analysis using data of unprecedented size helps unpack the transformation of sociocultural dynamics over a long-term temporal scale.


The emergence of big data has opened many research opportunities and topics for the field of social science. As a lens on human culture (Aiden and Michel, 2013), big data offer enormous possibilities to detect historical trajectories, human interactions, social transformations and political practices with rich spatial and temporal dynamics. Forecasting the next five decades of social science research, King (2009: 91) has predicted a ‘historic change’ in which the profusion of gigantic databases and their investigation will promote ‘our knowledge of and practical solutions for problems of government and politics to grow at an enormous rate’.

One particularly promising new tool for massive content analysis is the Google N-gram corpus, a repository containing enormous volumes of digitized book text. Michel et al. (2011) described the construction of the first edition of the Google N-gram corpus, drawn from approximately 5 million books, and examined the usage frequency of words in order to quantitatively analyse trends in human culture in ways unimaginable even a decade ago. Following this seminal study, the Google N-gram corpus has been used to explore the politics of disaster (Guggenheim, 2014), the language of contention (Tarrow, 2013), the transformation of economic life (Bentley et al., 2014; Roth, 2014), patterns of poverty and anti-poverty policy (Ravallion, 2011), linguistic and written language development (Twenge et al., 2012), and the psychology of culture (Greenfield, 2013; Zeng and Greenfield, 2015).

Notwithstanding this recent profusion of academic work employing digitized texts, sociologists have yet to fully explore the possibilities offered by this new dataset. Almost a decade ago, the ‘coming crisis of empirical sociology’ was linked to sociologists’ failure to engage with the vast proliferation of social data (Savage and Burrows, 2007); today, sociologists need to think seriously about the challenges and opportunities posed by big data. As Burrows and Savage (2014: 2) recently pointed out:

Sociologists generally used and refined rather familiar methods, talked mainly to each other about esoteric theoretical pre-occupations, and had not caught up with the fact that sociology was no longer an avant-garde discipline which had attracted legions of critical students and scholars in the 1960s and 1970s but had become fully part of the academic machine.

This absence is particularly striking given that the establishment, expansion, and influence of sociology rely far more heavily on words and phrases than on figures, functions, equations or other mathematical expressions, in contrast to any natural science. Books serve as one of the most telling embodiments of a society’s knowledge over time, and the majority of sociology’s most canonical achievements have seen publication in book form. It seems only appropriate, then, to seize upon the opportunity provided by the Google N-gram corpus to identify and examine the long-term trends and themes that have characterized the field of sociology itself.

Sociology, as one of the core disciplines of the social sciences, is ‘like a caravansary on the Silk Road, filled with all sorts and types of people and beset by bandit gangs of positivists, feminists, interactionists, and Marxists, and even by some larger, far-off states like Economics and the Humanities, all of whom are bent on reducing the place to vassalage’ (Abbott, 2001: 6). Yet, notwithstanding this statement on the complexities of the disciplinary advancement of sociology, there is virtually no empirical sociological research that can attest to the development of different ‘sorts and types’ of sociological norms, practices and boundaries. In the current study, we conduct the first empirical analysis, to our knowledge, in the field of sociology to use the corpus of digitized books. We analyse the evolution of the usage of the most common words and phrases in terms of disciplinary advancement, sociology scholars, sociology theories, sociology fields and sociology research methodologies between the 1850s and 2008. We also employ the data extracted from the corpus to quantitatively test theories of the development of sociology. Our results show that the annual usage frequency count of a particular term based on big-data strategy not only gives clues as to the historical emergence and progress of sociology – indicating, for example, the longevity or popularity of a particular sociology field or method – but also sheds light on the linkage between the development of sociology and broader sociocultural dynamics over centuries.

Data and method

Since 2004, Google has been engaged in digitizing books printed as early as 1473 and representing 478 languages from 40 top universities worldwide (Michel et al., 2011). The first edition of the Google corpus consists of about 5 million volumes published between 1550 and 2008, excluding journals and serial publications (around 40 per cent of all scanned publications), which represent a different aspect of culture than do books. To avoid data duplication, the Google corpus team converted billions of book records from over 100 sources of metadata provided by libraries, retailers, and publishers in order to generate a single non-redundant database of book editions (Michel et al., 2011, Supplementary Online Material).

Following exactly the same procedure described in Michel et al. (2011), the second edition of the Google corpus (2012) consists of about 8 million books, representing 6 per cent of all the books printed from the 1500s onward (Lin et al., 2012). Compared to the first edition, the 2012 Google corpus has a larger underlying book collection and higher-quality digitization (Lin et al., 2012). The English corpus alone comprises 4.5 million volumes and around half a trillion words (Table 1).

Table 1. The composition of the Google Books corpus (a)

Notes: (a) No information is available regarding the book counts of the first edition of Google Books.
Source: Lin et al. (2012).

| Language | First edition 2009 (5.2 million books): word count | Second edition 2012 (8.11 million books): book count | Second edition 2012 (8.11 million books): word count |
| --- | --- | --- | --- |
| English | 361 billion | 4.54 million | 468.5 billion |
| French | 45 billion | 0.86 million | 102.2 billion |
| Spanish | 45 billion | 0.79 million | 84 billion |
| German | 37 billion | 0.66 million | 64.7 billion |
| Chinese (Simplified) | 13 billion | 0.3 million | 26.9 billion |
| Russian | 35 billion | 0.59 million | 67 billion |
| Hebrew | 2 billion | 0.07 million | 8 billion |
| Italian | | 0.3 million | 40 billion |
| Total | 538 billion | 8.11 million | 861.3 billion |

The Google Books corpus provides information about how many times per year an ‘n-gram’ appears in all the books included in the corpus. A 1-gram is a string of characters uninterrupted by a space; it could be a single word, for example ‘sociology’, or a number such as ‘1.234’. An n-gram is a sequence of n 1-grams, such as the phrases ‘sociology theory’ (a 2-gram) and ‘field of sociology’ (a 3-gram). Punctuation and capitalization are preserved in the data set. By searching the Google corpus for a key word or phrase, one can obtain information about the annual occurrence of that key word or phrase for a given time period. Although the absolute percentage of any individual word is, of necessity, small, the traces of such words, their rise and fall, can help index the most robust sociocultural trends over a long-term timeline.
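To make the n-gram definition concrete, the following is a minimal sketch of per-year n-gram counting. It is not the pipeline used to build the Google corpus: the function names `ngrams` and `count_ngrams` and the two-sentence toy corpus are hypothetical, and splitting on whitespace alone is a simplifying assumption.

```python
from __future__ import annotations

from collections import Counter


def ngrams(text: str, n: int) -> list[str]:
    """Return every n-gram in a text, treating whitespace as the only separator."""
    tokens = text.split()  # capitalization and punctuation are left untouched
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(texts_by_year: dict[int, list[str]], n: int) -> dict[int, Counter]:
    """Count how often each n-gram occurs in the texts published in each year."""
    counts: dict[int, Counter] = {}
    for year, texts in texts_by_year.items():
        counter: Counter = Counter()
        for text in texts:
            counter.update(ngrams(text, n))
        counts[year] = counter
    return counts


# Hypothetical toy corpus keyed by publication year (assumption, not real data).
toy_corpus = {
    1900: ["the field of sociology is young"],
    1950: ["sociology theory and the field of sociology"],
}
one_grams = count_ngrams(toy_corpus, 1)
three_grams = count_ngrams(toy_corpus, 3)
```

Here `one_grams[1950]["sociology"]` gives the annual count of a 1-gram and `three_grams[1950]["field of sociology"]` that of a 3-gram, which is the kind of yearly occurrence information the corpus exposes.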

In the present analysis, we focus on the English-language books corpus. We also analyse some specific terms in both American English and British English books to make a further comparison across different social contexts.1 In terms of time frame, we restrict our research to the period between 1850 and 2008 (inclusive) for two reasons. First, sociology emerged as a scholarly discipline in the early part of the nineteenth century and only really started to flourish in the mid-1850s,2 when Karl Marx, Herbert Spencer, and other early-generation scholars began to publish their works in the field (Boudon, 1989). Second, digitization of written texts is a cumulative process. Contemporary holdings of books published in the early 1800s are often incomplete and scant, meaning that information extracted from books before the 1850s could come from a biased sample. At the other end of the timeline, books published after 2008 are still being digitized and added to the Google Books corpus; thus far, no data are available beyond the year 2008 (Lin et al., 2012).

This language and year restriction can substantially alleviate the potential problem of data accuracy, because more than 98 per cent of words are correctly digitized for modern English books (Michel et al., 2011, Supplementary Online Material). Still, two concerns may be raised regarding the representativeness of the Google corpus analysed in the present paper.

First, the corpus was constructed using OCR (optical character recognition) technology. As Michel et al. (2011) mention, books with poor OCR quality (due to size, paper quality, or physical condition) were filtered out. This could lead to a potential sampling problem. Second, the corpus is most likely to be biased towards recent books, since more books are published in more recent years, leading to skewed results of word usage. Regarding the first issue, however, books filtered out due to poor OCR quality only accounted for around 4 per cent of all scanned volumes (Michel et al., 2011, Supplementary Online Material) – a relatively small fraction. As for the second concern, we normalized the total number of appearances of a key word using the frequency of ‘the’ in the same year rather than the total number of all words.3 Thus, we obtained the normalized annual frequency of the word usage of our search terms as:

$$R_{it} = \frac{C_{it}}{C_{t}}$$

where $R_{it}$ denotes the normalized word usage of the key word $i$ in year $t$, $C_{it}$ represents the total number of appearances of the word $i$ in year $t$, and $C_{t}$ is the total number of times ‘the’ appeared in all books published in year $t$. Conceptually, a higher $R_{it}$ indicates higher frequency of word usage and thus higher cultural and social influence for the time period in question.
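As an illustration of this normalization, the sketch below divides a key word's yearly count by the yearly count of ‘the’. The function name `normalized_frequency` and all counts are hypothetical values chosen for the example, not figures from the corpus.

```python
from __future__ import annotations


def normalized_frequency(
    term_counts: dict[int, int], the_counts: dict[int, int]
) -> dict[int, float]:
    """Return the term's count divided by the count of 'the' for each year."""
    return {
        year: term_counts[year] / the_counts[year]
        for year in sorted(term_counts)
        if the_counts.get(year, 0) > 0  # skip years without a usable denominator
    }


# Hypothetical yearly counts (assumption): raw usage grows throughout,
# but so does the number of books, proxied by the count of 'the'.
sociology_counts = {1900: 50, 1950: 400, 2000: 900}
the_counts = {1900: 1_000_000, 1950: 4_000_000, 2000: 12_000_000}

ratios = normalized_frequency(sociology_counts, the_counts)
```

In this toy case the raw count rises between 1950 and 2000 while the normalized frequency falls, which is exactly the recency bias the normalization is meant to remove.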

Drawing on various sociology textbooks, including A Dictionary of Sociology (Scott and Marshall, 2009) and Sociology (Giddens and Sutton, 2013), we conducted a panoramic search of the disciplinary advancement of sociology in five major categories: academic significance, masters of sociology, theoretical dimensions, fields of sociology, and analytical methodologies. ‘Academic significance’ refers to the historical position of sociology in human knowledge as a subject related and compared to other subjects; the key words for this are ‘sociology’ and ‘sociological’. For ‘masters of sociology’, sociologists’ full names serve as the search terms and the goal is to chart key figures’ rise to fame and their academic reputations. The key words for ‘theoretical dimensions’ are the names of relevant sociological theories and schools; ‘fields of sociology’ focuses on the sub-branches of sociology and popular research topics; and ‘analytical methodologies’ focuses mainly on the comparison of qualitative and quantitative research methodologies in sociology. Finally, we constructed an overall index deriving from all sociology-related key words using the principal component method to demonstrate the overall sociocultural influence of sociology in books across two centuries.

Academic significance of sociology

We first counted the appearance of the key word ‘Sociology’ in the corpus since 1850. As a control group we also ran a similar search on the four subjects of ‘Philosophy’, ‘Economics’, ‘Anthropology’ and ‘Psychology’. It is worth noting that we did not run a test on ‘Political Science’ due to the fact that ‘Political’ or ‘Politics’ could be interpreted in numerous ways and thus would likely include non-academic related materials in the results.

The x-axis of Figure 1 shows the year, from 1850 to 2008, while the y-axis shows the word frequency of the different subjects. From Figure 1, one can observe that the word ‘Philosophy’ accounts for approximately 0.007 per cent of the total word count. Compared to other subjects, phrases associated with ‘Philosophy’ appeared earlier and more frequently. However, around the turn of the nineteenth to the twentieth century, the curve for ‘Philosophy’ plunged drastically and did not rise again until the early twentieth century. This finding corresponds with the collapse of classical German philosophy, especially the Hegelian school of philosophy in history (Solomon, 1988). It is noteworthy that from 1890 to 1920, as the word frequency curve for ‘Philosophy’ dropped, the respective curves for the other subjects rose.

Figure 1. Temporal distribution of subjects in Google N-gram Corpus, 1850–2008

In fact, the word frequencies for ‘Sociology’, ‘Economics’ and ‘Anthropology’ rose steadily between the mid-to-late nineteenth century and the 1930s, especially in the case of ‘Economics’, which saw the most drastic uptick in frequency, developing a wide lead over ‘Sociology’, ‘Psychology’ and ‘Anthropology’.

Our analysis yields interesting insights regarding the impact of major world events. For example, during World War I (1914–1918), the frequencies for ‘Sociology’, ‘Psychology’ and ‘Economics’ did not drop, but in World War II (1939–1945) they dropped dramatically and only began to increase again with the end of the war. This seems to indicate that WWII had a much greater impact on these disciplines than did WWI. The effect of WWII was reversed, however, in the case of ‘Anthropology’, which saw no decline during WWII; indeed, if anything, it saw a slight rise in frequency. We believe this can be linked to the expansion of conflict beyond Europe to include Asia, Africa and Oceania, thus increasing states’ demand for strategic knowledge about non-Western countries. On the one hand, a broader war secured government funding for anthropology, for strategic purposes, to study nationalism, internationalism, racial supremacy and anti-totalitarianism; on the other hand, anthropologists themselves were able to shift their research horizon from traditional subjects such as African and Indian tribes to Eastern Europe and Southeast Asia (Price, 2002). Anthropologist Ruth Benedict’s 1946 study of Japan, The Chrysanthemum and the Sword, stands as arguably one of the best-known examples of such state-driven academic research.

The curves for ‘Sociology’, ‘Economics’, ‘Psychology’ and ‘Anthropology’ all peaked during the 1970s and 1980s, then began another round of slow descent in the 1990s. The descent for each subject might simply represent the dilution of knowledge in a constantly expanding corpus: with the total amount of knowledge possessed by human beings constantly on the rise, the percentage increase year to year for each subject or field might understandably be decreasing. However, for ‘sociology’, the decreasing word frequency does not necessarily mean the decline of the importance of sociology as a discipline. We will analyse this further in a later section.

Sociology scholars

We conducted searches for the full English names of 30 major Western sociologists in the Google N-gram corpus. Figure 2 illustrates the top 12 sociologists by word frequency.4 They are (chronologically): Karl Marx, Herbert Spencer, Max Weber, Emile Durkheim, Georg Simmel, Herbert Marcuse, Talcott Parsons, Erving Goffman, Zygmunt Bauman, Jürgen Habermas, Pierre Bourdieu and Anthony Giddens. From Figure 2, we draw three major findings.

Figure 2. Temporal distribution of famous sociologists, 1850–2008

Dilution effect: From Karl Marx to Anthony Giddens, it seems that each new sociologist is destined never to surpass his predecessors’ academic significance. This does not mean that the influence of an individual sociologist can never surpass that of a particular predecessor. For instance, the influence of Pierre Bourdieu after the 1980s exceeded that of his predecessors Georg Simmel and Emile Durkheim and reached 0.00005 per cent around 2003, second only to Karl Marx and Max Weber. However, if we group sociologists by generation, we can see that the later generation peaked at 0.00008 per cent in the 1970s, represented by Talcott Parsons, and none of their successors has passed that point, let alone reached the levels of earlier sociologists such as Herbert Spencer and Karl Marx. Thus conceived, it is almost impossible for later-generation sociologists to surpass the fame of the earlier generation.

This phenomenon is due to the explosive growth in the total amount and categories of human knowledge. In other words, sociology constituted a bigger share of given knowledge during the nineteenth century, as that body of knowledge was still being amassed. When it comes to the twentieth and twenty-first centuries, in contrast, though sociology itself has continued to develop and more and more people have become professional sociologists, the discipline’s relative influence in human knowledge has decreased – not unlike the dilution of a substance mixed with ever larger quantities of water. To the extent that Talcott Parsons appears to be the last sociologist with the same level of influence as the generations that came before him, this may well have as much to do with the changing size of the ‘reservoir’ of all human knowledge as it does with Parsons’ work itself.

Exogenous effect: Compared to other sociologists, the word frequency curves with the highest average upward slope were those of Herbert Spencer and Karl Marx. In other words, Spencer and Marx enjoyed the most rapid ascent to positions of authority within the field in terms of influence. The speed of their rise, however, was supported by strong exogenous forces other than academic factors. Herbert Spencer was a generalist – a combination of philosopher, biologist, anthropologist, sociologist, political theorist, and a classic man of letters. He interacted with social elites throughout his life and was connected to many important ideologists and dignitaries. Spencer utilized his high-status social network to gain authority and audience as a generalist, enabling him to become extremely influential in the late nineteenth century, when the total amount of knowledge was still limited. Karl Marx, in comparison, did not enjoy such success in his lifetime; instead, his influence peaked between the 1920s and 1940s, and then again in the 1960s to the 1970s – precisely when Marxism and Communism were becoming influential beyond the academic world and actually changing the course of twentieth-century history.

Acceleration effect: Whereas most of the first generation of sociologists only achieved their fame posthumously, twentieth-century sociologists have become influential much earlier in their careers. With the exception of Herbert Spencer, all of the great names of sociology born in the nineteenth century reached the height of their reputation after their deaths. Karl Marx became most famous some 20 years after his death; Max Weber’s name began to rise immediately after his death in 1920; and, likewise, none of Emile Durkheim, Georg Simmel or Herbert Marcuse lived to see the years in which their numbers truly blossomed. In contrast, sociologists born in the twentieth century were much luckier. For instance, when Talcott Parsons began to gain fame in the 1940s, he was no more than 40 years old. Anthony Giddens became famous at the same age. Jürgen Habermas and Pierre Bourdieu became highly influential slightly later, but both began their ascent when they were in their fifties, around the 1980s–1990s, and Habermas is still alive today.

This acceleration effect can be ascribed to the development and standardization of sociology as a subject. In the late nineteenth century, as the discipline was still being established, there were fewer scholars and academic standards were, if not lower per se, at the very least less formalized, with greater room for flexibility. Sociology, too, was still in the process of legitimating its claim as a science. All these factors contributed to a longer ‘wait time’, so to speak, for a sociology scholar to reach notable fame. Today, both the discipline and the academic field in general are well established, enabling sociologists to make use of better disciplinary infrastructure and pre-existing channels to increase their influence.

Sociological theories

The contribution of sociology towards human knowledge lies in a series of inspiring and explanatory concepts and theories. As such, we conducted key word searches for classic theories of sociology in order to explore their relative impact. Because most nineteenth-century sociological works are more general in nature – concerned as they were with establishing the basic parameters and goals of the discipline – we focused on the most famous, more specific sociological theories of the twentieth century. As Figure 3 illustrates, we concentrated on the ten most famous sociological theories: Conflict Theory, Social Exchange Theory, Structural Functionalism, Structuration Theory, Symbolic Interactionism, Rational Choice Theory, Ethnomethodology, Neo Functionalism, Strength of Weak Ties, and Structural Holes.

Figure 3. Temporal distribution of sociological theories, 1950–2008

Lifetime trajectory of a theory: We noticed that each theory, from its birth to maturity, from its peak popularity to its point of diminishing returns, has its own life trajectory. In the mid-to-late twentieth century, the majority of the theories reached a peak in their growth rate and usage about 30–40 years after their introduction. After that point, their influence began to diminish. Interestingly, even though the sample of theories is relatively small, this life-cycle pattern fits that found for words more generally by researchers in linguistics. For example, Petersen et al. (2012) have identified universal growth-rate fluctuations in the birth and death rates of words: new words reach a pronounced peak about 30–50 years after they originate, after which point they either enter the long-term lexicon or fall into disuse.
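One simple way to read such a lag off a frequency series is to smooth it and locate the maximum. The sketch below does this with a centred moving average; the function name `years_to_peak`, the window size, and the triangular toy series are all assumptions made for illustration and do not reproduce the procedure behind Figure 3.

```python
from __future__ import annotations


def years_to_peak(series: dict[int, float], intro_year: int, half_window: int = 2) -> int:
    """Years from a term's introduction to the peak of its smoothed frequency."""
    years = sorted(y for y in series if y >= intro_year)
    smoothed: dict[int, float] = {}
    for i, year in enumerate(years):
        # Centred moving average; the window is truncated at both ends of the series.
        window = years[max(0, i - half_window): i + half_window + 1]
        smoothed[year] = sum(series[y] for y in window) / len(window)
    peak_year = max(smoothed, key=smoothed.get)
    return peak_year - intro_year


# Hypothetical series (assumption): usage rises linearly until 1995, then declines.
toy_series = {year: float(35 - abs(year - 1995)) for year in range(1960, 2009)}
lag = years_to_peak(toy_series, intro_year=1960)
```

For this toy series the smoothed peak falls in 1995, so `lag` is 35 years, inside the 30–40 year range described above.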

The metabolism of a theory: We also noticed that the influence of earlier theories was superseded by that of newer theories. For instance, the growth rate of Structural Functionalism began to decrease in the mid-1990s, while the usage of Structural Holes, a theory 20 years younger, superseded it. Ethnomethodology and Symbolic Interactionism also appear to be on their way out. Meanwhile, Rational Choice Theory is still increasing in frequency, but now at a slower rate. Furthermore, when we grouped Strength of Weak Ties and Structural Holes together, we found that their total influence had already surpassed that of Structuration Theory and Social Exchange Theory around 2008. In other words, the cultural influence and academic significance of newly developed social capital and social network approaches have already gone beyond those of ‘classical’ sociological theories. Whether they will continue this growth, however, remains to be seen.

Explanatory scale of a theory: Generally speaking, a grand theory possesses stronger generalization ability and a larger scale of utilization. Yet, we found that since at least the mid-twentieth century, the theoretical world is no longer dominated by grand theories. For instance, Anthony Giddens’ Structuration Theory and Talcott Parsons’ Structural Functionalism have fallen significantly below Ethnomethodology, Symbolic Interactionism and Rational Choice Theory, all of which focus on micro-level interactions in society rather than large-scale macro functions of societal structures and institutions. Moreover, as time progresses, there seems to be less and less room reserved for grand theories: theories that thrived after the 1970s, such as Strength of Weak Ties and Structural Holes, all adopt micro or meso perspectives in order to understand human behaviour. While the relative pros and cons of ‘micro’ versus ‘macro’ theories are still the subject of much debate today, we speculate that the ambitious nature of grand theories may have, over time, become a disadvantage, actually limiting their appeal for contemporary theorists. Indeed, it may well be as many postmodern theorists have already declared, that sociology has entered a ‘post grand theories’ era.

Fields of sociology

Sociology is subdivided into many specialized fields, and these fields are constantly changing over time. For this analysis, we looked at the shifting pattern of these fields in order to capture the social changes associated with the larger discipline. We conducted a key word search for eight of the most prominent fields, namely: Educational Sociology (Sociology of Education), Rural Sociology, Urban Sociology, Political Sociology, Economic Sociology, Sociology of Law, Sociology of Religion and Historical Sociology.

A few interesting findings can be observed in Figure 4. First, Educational Sociology emerged early as the most prominent field, but was replaced by Sociology of Education in the late 1960s. The shift was not merely semantic. Educational Sociology focused primarily on the social and cultural factors affecting relatively smaller social groups, thus neglecting larger societal influences on education in the post-industrial period. The Sociology of Education, on the other hand, turns its interest to the social function of education and thus investigates the role of education as a social institution (Shimbori, 1972). Second, after the 1990s, both the Sociology of Religion and Historical Sociology progressed at a relatively aggressive pace, particularly when compared with all the other fields, which showed signs of decline. Third, Rural Sociology emerged as a sub-field of the discipline in the early twentieth century and exhibited a very high growth rate from the 1950s to the 1980s. This reflects the fact that Rural Sociology is the earliest and most prominent sub-discipline of American sociology, an outgrowth of the response to the pronounced differentials in rural and urban social organization of the late nineteenth century, with its development peaking around the 1950s to 1960s (Brunner, 1957; Nelson, 1969).

Figure 4. Temporal distribution of sociology fields, 1900–2008

In addition to the various fields within sociology, we were also interested in shifts in substantive research topics, that is, which subjects were deemed ‘hot’ and when. In Figure 5 we compare eight representative terms within the social stratification and mobility, and social capital and network areas: Social Identity, Social Movement, Social Mobility, Social Stratification, Social Capital, Social Network, Social Class and Social Strata.

Figure 5. Temporal distribution of sociological research topics, 1900–2008

From Figure 5, we can observe that the growth-rate fluctuation of Social Mobility and Social Stratification peaked around 1975 and then started to decline. The popularity of Social Network rose rapidly from the late 1980s and surpassed Social Mobility around 1997. As Freeman (2004) argues, with the development of desktop computers and computer programs to manage network data, social network research finally took off from the mid-1980s onwards, shifting from ‘network as metaphor’ to ‘network as a mathematical expression’. Around the same time, research on Social Capital exceeded Social Mobility and finally surpassed Social Class around 2003. In other words, research on both Social Capital and Social Networks is currently on the rise, while research on both Social Mobility and Social Stratification is declining. Meanwhile, research on Social Movements started proliferating around the mid-1960s, when waves of new movements organized around race and gender emerged in both America and Western Europe (Kriesi et al., 1995; Lovenduski, 1986).

Research methodologies of sociology

Which methods are used most by sociologists – quantitative or qualitative methods? To answer this question, we focus on shifts in the relative balance between the two major research methodologies in sociology over the past century.

We first calculated the average of the annual frequencies of the methods within each approach, quantitative and qualitative, from 1950 to 1980. We then normalized the two groups of average scores into Z values and used $Z_{QN}$ and $Z_{QL}$ (the Z values for the quantitative and qualitative groups, respectively) to obtain an index of quantitative analysis for each year. Figure 6 shows a plot of this index.
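The construction of such an index can be sketched as follows. The text does not spell out how the two Z values are combined, so taking their difference (quantitative minus qualitative) is an assumption of this example, as are the use of the population standard deviation, the function names, the method terms and the toy frequencies.

```python
from __future__ import annotations

from statistics import mean, pstdev


def z_scores(values: list[float]) -> list[float]:
    """Standardize a series to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def quantitative_index(
    quant_terms: dict[str, list[float]], qual_terms: dict[str, list[float]]
) -> list[float]:
    """Per-year index: Z of the mean quantitative frequency minus Z of the mean qualitative one."""
    quant_avg = [mean(year_values) for year_values in zip(*quant_terms.values())]
    qual_avg = [mean(year_values) for year_values in zip(*qual_terms.values())]
    # ASSUMPTION: the two Z values are combined by subtraction.
    return [qn - ql for qn, ql in zip(z_scores(quant_avg), z_scores(qual_avg))]


# Hypothetical normalized frequencies for three consecutive years (assumption).
quant = {"regression analysis": [1.0, 2.0, 3.0], "survey data": [3.0, 4.0, 5.0]}
qual = {"participant observation": [3.0, 2.0, 1.0], "case study": [5.0, 4.0, 3.0]}

index = quantitative_index(quant, qual)
```

Under this convention a positive value means quantitative methods are relatively more prominent in that year and a negative value means qualitative methods are; in the toy data the index moves from negative to positive across the three years.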

Figure 6. Index of quantitative analysis, 1950–2008

From Figure 6, we can see that both methods took turns ‘in the lead’ across different time periods. From 1950 to 1980, qualitative methods were more prominent, while the usage of quantitative methods surpassed that of qualitative approaches in the 1980s and 1990s, except for a short period around 1995–1997. After 2000, quantitative methods dominate in the majority of scholarship. It is noteworthy that scholars who utilize qualitative methods are also more likely to publish their research in book format, in contrast to quantitative researchers, who are more likely to publish in journals and other formats; therefore, if anything, it is likely that our calculation underestimates the ‘lead’ of quantitative over qualitative methods.

An overall index: influence of sociology

In this section we use the word usage of relevant sociology-related key words in the above categories (except for methodology) to generate an overall measure for the sociocultural influence of sociology in millions of books. We carry out a Principal Components Analysis (PCA) to extract as much information as possible from the corpus while preserving degrees of freedom. We prefer the PCA method to applying the average score of normalized annual frequencies because PCA can ‘concentrate’ much of the sociological signals into the first few factors by ‘screening’ the later factors that are dominated by noise. This is important given that we generate the list of sociology-related words without establishing any theory about how closely the selected signals capture the meaning of ‘sociology’. The factor-predicted score S is calculated by:

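$$S = \sum_{j=1}^{m} w_{j} F_{j}$$

where $F_{j}$ is the score on factor $j$, $w_{j}$ is the proportion of variance explained by factor $j$, $m$ denotes the number of factors with eigenvalues larger than 1, and $\sum_{j=1}^{m} w_{j}$ is the cumulative proportion of explained variance, which is larger than 90 per cent.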

We report the factor loadings, variances, as well as correlation of signals in Table 2. The KMO measure of sampling adequacy, and the SMC between each signal and all other signals strongly suggest that these signals pick up sociology-related dynamics in the corpus. The first three principal components account for around 91 per cent of the variance. Using the first three factors and their respective proportion of variance, we can predict the index for influence of sociology.
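A minimal sketch of this kind of index, using NumPy, is given below. It is not the estimation routine behind Table 2: it runs PCA on the correlation matrix, keeps components with eigenvalues above 1, and weights each component score by its share of the variance. The function name `pca_index`, the sign convention and the simulated signals are assumptions, and the KMO and SMC diagnostics are not computed.

```python
from __future__ import annotations

import numpy as np


def pca_index(X: np.ndarray) -> tuple[np.ndarray, int]:
    """Variance-weighted sum of retained principal-component scores.

    X has one row per year and one column per signal.
    Returns the index per year and the number of retained components.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each signal
    corr = np.corrcoef(Z, rowvar=False)        # signal-by-signal correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Eigenvector signs are arbitrary; as a convention (assumption),
    # flip each component so that its loadings sum to a positive number.
    signs = np.sign(eigvecs.sum(axis=0))
    signs[signs == 0] = 1.0
    eigvecs = eigvecs * signs

    m = int(np.sum(eigvals > 1.0))             # keep components with eigenvalue > 1
    weights = eigvals[:m] / eigvals.sum()      # proportion of variance per component
    scores = Z @ eigvecs[:, :m]                # component scores for every year
    return scores @ weights, m


# Simulated data (assumption): four noisy signals sharing one upward trend over 50 years.
rng = np.random.default_rng(0)
trend = np.linspace(0.0, 1.0, 50)
X = np.column_stack([trend + rng.normal(0.0, 0.05, 50) for _ in range(4)])

index, m = pca_index(X)
```

Because the simulated signals share a single trend, only one component is retained and the resulting index tracks that trend closely; with real key-word frequencies, more components may pass the eigenvalue threshold.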

Table 2. Factor loadings on and correlations of sociology signals[a]
Sociology Signal           Factor 1 Factor 2 Factor 3      KMO      SMC
Eigenvalue                   21.437    5.835    1.068
Cumulative Fraction           0.692    0.880    0.914
Sociology(ical)              0.1748  –0.2043   0.1223   0.9583   0.9692
Karl Marx                    0.1338  –0.2437   0.2394   0.9410   0.9378
Herbert Spencer             –0.1148   0.0639  –0.0639   0.9508   0.4585
Max Weber                    0.1922  –0.1698   0.0939   0.9552   0.9875
Emile Durkheim               0.2042  –0.1225  –0.0003   0.9584   0.9886
Georg Simmel                 0.2016  –0.1026   0.1082   0.9765   0.9764
Herbert Marcuse              0.1790  –0.1686  –0.1605   0.8929   0.9684
Talcott Parsons              0.1445  –0.2773   0.1678   0.9112   0.9854
Erving Goffman               0.2024  –0.0946  –0.1104   0.9038   0.9882
Zygmunt Bauman               0.1672   0.2349   0.2237   0.9302   0.9929
Jürgen Habermas              0.1915   0.1264  –0.2372   0.9456   0.9871
Pierre Bourdieu              0.1765   0.2294   0.1142   0.9514   0.9936
Anthony Giddens              0.1851   0.2035   0.0019   0.9456   0.9946
Conflict Theory              0.2019  –0.0749  –0.1934   0.9627   0.9668
Structural Functionalism     0.1893  –0.1305  –0.2645   0.9554   0.9609
Structuration Theory         0.1673   0.2102  –0.0135   0.9560   0.9365
Social Exchange Theory       0.1982   0.0482  –0.2378   0.9783   0.9390
Symbolic Interactionism      0.1981   0.0063  –0.3075   0.9704   0.9635
Rational Choice Theory       0.1732   0.2355   0.0893   0.9217   0.9933
Ethnomethodology             0.1981   0.0099  –0.3143   0.9375   0.9675
Neo Functionalism            0.1770   0.1410  –0.1785   0.9437   0.9069
Strength of Weak Ties        0.1861   0.1825   0.0395   0.8835   0.9916
Structural Holes             0.1430   0.2273   0.2804   0.8861   0.9114
Social Identity              0.2001   0.1493   0.0098   0.9605   0.9940
Social Stratum(ta)           0.1731  –0.2016   0.2464   0.9367   0.9833
Social Movement(s)           0.2009   0.1344   0.0091   0.9528   0.9917
Social Mobility              0.1726  –0.2358   0.0777   0.9386   0.9938
Social Stratification        0.1681  –0.2455   0.1011   0.9154   0.9872
Social Capital               0.1423   0.2256   0.3957   0.8800   0.9815
Social Network(s)            0.1958   0.1576  –0.0132   0.9022   0.9944
Social Class(es)             0.1713  –0.2424   0.0792   0.9595   0.9943
Notes:
  1. KMO reports the Kaiser-Meyer-Olkin measure of sampling adequacy, and SMC reports the squared multiple correlations between each signal and all other signals.
  2. Factors with an eigenvalue less than 1 are not presented.
  3. [a] In a robustness check, we added more sociology-related words to the list (e.g., middle class, working class, social status, etc.) and obtained almost identical PCA results.

In Figure 7, we further present the time series of the z-score equivalents of the overall index for sociology, as well as the time series of the word usage of ‘sociology/sociological’. As the figure shows, the influence of sociology as a discipline took off in the 1970s. Although the word usage of ‘sociology/sociological’ began to decline in the 1980s, the overall usage of sociological terms, including sociological theories and topics, began to skyrocket in all other respects. This reflects the extent to which sociology has come to penetrate and influence other domains and disciplines. For example, theories of weak ties and structural holes have been widely applied in the study of business management, while social capital has become a popular topic in research on economic development, political participation and public health. Further, we believe that the impact of sociology will continue to expand in the foreseeable future.

Figure 7. Overall Index for Sociology, 1850–2008 (Z-Score)

A research case beyond description

With the help of the Google corpus, we are able to conduct more substantial research into the development of sociology, beyond simply describing the rise and fall of the usage of sociology-related words. We use the early development of sociology in the USA as an example to illustrate how data extracted from the Google corpus can be used in quantitative studies.

Upon the creation of American sociology as a professional discipline circa the 1890s (Cortese, 1995; Young, 2009), the tenets of the social gospel movement made sociology an acceptable course of study in many American denominational colleges. This has led to considerable debate among students of the history of sociology regarding the nature of the connection between sociology and social gospelism (Henking, 1993; Morgan, 1969; Williams and MacLean, 2012). As Morgan (1969: 42) has indicated, ‘the Social Gospel and early sociology were often indistinguishable in terms of both ideas and leading personnel. This close parallelism is seen as a major factor in the early acceptance of sociology as an academic discipline in the nineteenth century universities.’ Research on this question, however, has relied only on individual case studies and thus lacks the support of hard data.

Digitized written texts provide a statistical solution to this dilemma. We searched using the key words ‘Sociology’, ‘Social Gospel’ and ‘Hull House’,5 with ‘Anthropology’ as a control group, and compared the results from the American English corpus and the British English corpus. As demonstrated in Figure 8, ‘Social Gospel’ and ‘Sociology’ both show signs of growth from 1890 to 1930 in America, at similar growth rates; meanwhile, ‘Anthropology’ shows no visible signs of growth. By contrast, the correlation between the growth of ‘Sociology’ and ‘Social Gospel’ was far less obvious in England.

Figure 8. The social gospel movement and development of sociology in the late nineteenth and early twentieth centuries

The above findings based on visual inspection of the data provide only preliminary evidence of the effects of the social gospel movement on the development of sociology in America. We thus proceed to use the time series (1890–1930) of ‘Sociology’, ‘Social Gospel’, ‘Hull House’ and ‘Anthropology’ to perform a Granger causality test to formally test the proposed connection between sociology and social gospelism. In the language of time series analysis, X is the Granger-cause of Y in the sense that Y can be better predicted using the histories of both X and Y than it can be predicted using the history of Y alone.
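In its standard bivariate form, this definition can be written as a regression with a joint restriction on the lags of X. The notation below ($c$, $a_i$, $b_i$, lag length $p$, error $\varepsilon_t$) is ours and is meant only as a textbook sketch, not as the exact specification estimated in this paper.

```latex
y_t = c + \sum_{i=1}^{p} a_i \, y_{t-i} + \sum_{i=1}^{p} b_i \, x_{t-i} + \varepsilon_t ,
\qquad
H_0 : \; b_1 = b_2 = \dots = b_p = 0 .
```

Rejecting $H_0$ means that past values of X improve the prediction of Y beyond what Y's own history provides, which is the sense in which X is said to Granger-cause Y.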

Using time series that display the persistence of a unit root process in a standard ordinary least squares regression can produce spurious correlations. Therefore, we first performed stationarity tests for all four time series using the Dickey–Fuller Generalized Least Squares (DF-GLS) method and the Phillips–Perron (P-P) method. We found that all of them are integrated of the first order. We therefore used their first differences to fit a vector autoregressive (VAR) model to examine the relationships among them. The results from the American English corpus in Table 3 clearly show that ‘Social Gospel’ is the Granger-cause of ‘Sociology’ at a 0.05 alpha level, and that ‘Hull House’ is also the Granger-cause of ‘Sociology’ at a 0.1 alpha level (p = 0.095). In addition, the identified time lag suggests that the social gospel movement within the past 4 years can effectively affect the development of sociology at any given time. However, neither of the two terms is the Granger-cause of ‘Anthropology’, even at a 0.1 alpha level. Furthermore, results from the British English corpus show no Granger-causal relationship at all among the time series of ‘Social Gospel’, ‘Hull House’, ‘Sociology’, and ‘Anthropology’. In general, our findings based on time series analyses lend support to the argument that there was a close relationship between the early development of sociology and the social gospel movement in the USA.
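The workflow in this paragraph (difference the series, regress on lags, test the joint restriction) can be sketched as follows. This is a simplified, hypothetical illustration: it uses a two-variable regression rather than the four-variable VAR, it omits the unit root tests, and the helper names `lag_matrix` and `granger_wald` and all data are invented for the example.

```python
import numpy as np


def lag_matrix(series, lags):
    """Stack the series lagged by 1..lags, row-aligned with series[lags:]."""
    n = len(series)
    return np.column_stack([series[lags - k:n - k] for k in range(1, lags + 1)])


def granger_wald(y, x, lags):
    """Wald chi-square statistic for H0: lags of x add nothing to predicting y."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    target = y[lags:]
    const = np.ones((len(target), 1))
    restricted = np.hstack([const, lag_matrix(y, lags)])        # own lags only
    unrestricted = np.hstack([restricted, lag_matrix(x, lags)])  # plus lags of x

    def rss(design):
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    # Asymptotically chi-square with `lags` degrees of freedom under H0.
    return len(target) * (rss_r - rss_u) / rss_u


# Synthetic stand-ins (NOT the N-gram data): changes in x drive changes in y one step later.
rng = np.random.default_rng(0)
dx = rng.normal(size=200)
dy = np.empty(200)
dy[0] = rng.normal()
dy[1:] = 0.8 * dx[:-1] + 0.5 * rng.normal(size=199)
x_levels, y_levels = np.cumsum(dx), np.cumsum(dy)   # integrated (I(1)) levels

# First-difference the levels before fitting, as described in the text.
d_x, d_y = np.diff(x_levels), np.diff(y_levels)
stat_x_to_y = granger_wald(d_y, d_x, lags=4)
stat_y_to_x = granger_wald(d_x, d_y, lags=4)
```

In this simulated setup `stat_x_to_y` far exceeds the 5 per cent critical value of a chi-square with 4 degrees of freedom (about 9.49), so the null that x does not Granger-cause y is rejected. A full replication would use an established time series package and the lag-selection criteria listed in the notes to Table 3.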

Table 3. Granger causality tests for the potential connections between sociology and the social gospel movement using two different corpora
American English Corpus (Lag = 4)
Null Hypothesis                       Observations   Chi2 statistic   p-value
Sg does not Granger cause Soci        47             22.678***        0.000
Hull does not Granger cause Soci      47             7.896*           0.095
Anthr does not Granger cause Soci     47             2.019            0.732
Sg does not Granger cause Anthr       47             2.981            0.561
Hull does not Granger cause Anthr     47             1.195            0.879
Soci does not Granger cause Anthr     47             2.581            0.630
British English Corpus (Lag = 1)
Null Hypothesis                       Observations   Chi2 statistic   p-value
Sg does not Granger cause Soci        50             1.112            0.292
Hull does not Granger cause Soci      50             0.155            0.694
Anthr does not Granger cause Soci     50             0.006            0.936
Sg does not Granger cause Anthr       50             0.079            0.779
Hull does not Granger cause Anthr     50             0.119            0.729
Soci does not Granger cause Anthr     50             0.594            0.441
Notes:
  1. The lag length was chosen according to the Schwarz Bayesian information criterion (SBIC), the Hannan and Quinn information criterion (HQIC), and the Akaike information criterion (AIC).
  2. *** p<0.001, ** p<0.05, * p<0.1
  3. Sg = ‘Social Gospel’; Hull = ‘Hull House’; Soci = ‘Sociology’; Anthr = ‘Anthropology’.
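As an arithmetic check on Table 3, the reported p-values can be recovered from the chi-square statistics. The sketch below assumes that each statistic has degrees of freedom equal to the lag length (4 for the American corpus, 1 for the British corpus); the helper `chi2_sf` is our own and only covers the cases needed here.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square variable (df = 1 or any even df)."""
    half = stat / 2.0
    if df == 1:
        # For one degree of freedom the tail equals erfc(sqrt(x / 2)).
        return math.erfc(math.sqrt(half))
    if df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        terms = (half ** j / math.factorial(j) for j in range(df // 2))
        return math.exp(-half) * sum(terms)
    raise ValueError("this sketch only handles df = 1 or even df")


p_hull_us = chi2_sf(7.896, 4)   # 'Hull does not Granger cause Soci', American corpus
p_sg_us = chi2_sf(22.678, 4)    # 'Sg does not Granger cause Soci', American corpus
p_sg_uk = chi2_sf(1.112, 1)     # 'Sg does not Granger cause Soci', British corpus
```

Under that assumption, `p_hull_us` rounds to 0.095 and `p_sg_uk` to 0.292, matching the table, while `p_sg_us` falls below 0.001.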


This paper is the first of its kind to use the Google Books N-gram corpus, perhaps the largest electronic corpus yet constructed, to map out the disciplinary advancement of sociology in terms of the discipline in general and its major scholars, theories and research fields from the mid-nineteenth century to 2008. The intention of this research is in no way to suggest an evaluative ranking of the theories, scholars, schools, or methodologies that make up sociology. Instead, our goal has been to respond to Back and Puwar’s (2012) call for a ‘live sociology’ to deal with ‘lively data’, or the challenge posed by big data, the knowledge economy and the digitization of everyday life. As such, the aim and, it is hoped, the contribution of this study has been to show that massive content analysis from digitized books can provide rich insights regarding the historical evolution of professional disciplines and long-term sociocultural changes at a macro level.

Conceptually, examining the high-frequency use of a specific term in a representative sample of written texts is particularly important because it helps ‘identify the dynamics of historical emergence, decline, and comparative significance of a political concept’ (Hassanpour, 2013: 299). This gives corpus methodology significant advantages over traditional survey methods in which the sheer quantity of data and the availability of data are limited (Beer and Burrows, 2013; Lin et al., 2012). The use of newspaper data from one or more localities also tends to produce validity and reliability problems and there is no standard solution to correct for potential description and selection bias (Earl et al., 2004; Oliver and Myers, 1999). So far, however, sociologists have barely begun to use corpus data analysis. With the exploding scale of digitization, more and more materials will be included in the historical corpus in the years to come. This will fundamentally change our scope of research and open avenues for sociologists to employ new and creative approaches to social research.

Of course, there is still room for improvement in the present research. First, the full dataset analysed here only accounts for around 6 per cent of all books ever published from 1500 onwards. This means that it may be biased relative to the ensemble of all surviving books. The scanned and digitized books in particular were mainly borrowed from university or public libraries, retailers and publishers, and thus the composition of the corpus reflects the acquisition practices of the participating institutions. Although the assembled collections of books from various participating institutions could still be argued to be representative, the results here are tentative and should be treated with some caution.

Second, there are many searchable sociology terms, and we cover only a small proportion of the sociologists, theories and research fields. Therefore, the phenomena and patterns observed may not be fully representative of the discipline. For example, we have only addressed some classic, traditional and established research fields of sociology such as social class, social movements or social capital; other important new fields such as globalization, migration, gerontology, gender, and race or ethnicity are increasingly popular among contemporary sociologists but may be under-represented here. The goal of this study has been to use novel data and visualization methods to shed light on the history of sociology itself, not by any means to summarize over a hundred years of sociological research.

Third, the advanced search function of the database was still limited and, therefore, the accuracy of the search results was far from perfect. For instance, different names may be attached to the same sociological term, and words are sometimes used in ways that do not convey the single sociological concept intended in the analysis. Even though we used Google search engines as a control group and chose the version with the highest level of representation, the accuracy of the results may still be lacking.

Despite its drawbacks, our research strategy is sufficient to show that written literary data in human history can help reinvigorate a sociological imagination able to extrapolate the historical trajectory of a sociological practice. Michel et al. (2011) proposed the concept of ‘culturomics’ to refer to the use of high-throughput digitized resources to study sociocultural trends and the human cultural genome. Similarly, we suggest opening up a new field, ‘socialomics’, to study the current state of a dynamic, fluid social world with massive digitized data collection and analysis. The value of establishing such an energetic and forward-thinking approach lies in the fact that the amount of human knowledge accessible to sociologists via physical reading is, in fact, very limited. This glass ceiling of academic research could result in a form of myopia, blinding us to the development of social science within and across media and forums not limited to the book format. With ‘genetic’ analysis of word frequency usage in a digitized era, we are likely to achieve theoretical inspirations and academic knowledge that earlier generations of sociologists could not even have imagined.


  1. We also examined whether the pattern we find in the main analysis holds for the narrative-of-event corpora of newspapers. We searched the same key words in the field of sociology in the corpus of the New York Times and the results show similar general trends. Results of the relevant tests are available from the authors upon request.
  2. Although sociology’s exact timeline as a field/profession/discipline remains contested, this general time period works for the purposes of the current paper.
  3. Here we follow Bentley et al. (2014) and Acerbi et al. (2013), both of which use this strategy. According to Acerbi et al. (2013), the word ‘the’ stably accounts for around 6 per cent of all words per year, and is thus a good representative of real writing and real sentences.
  4. The curves for the other 18 sociologists were all beneath the curve for Jürgen Habermas. They are: Herbert Blumer, Charles Cooley, Alfred Schutz, George Mead, Harold Garfinkel, Max Horkheimer, Niklas Luhmann, György Lukács, C. Wright Mills, Robert Merton, Ralf Dahrendorf, Gerhard Lenski, Peter Blau, Randall Collins, Jeffrey Alexander, James Coleman, Immanuel Wallerstein and Norbert Elias.
  5. Hull House was the most famous ‘good-neighbor’ centre in the social gospel movement. Its founder, Jane Addams, later won a Nobel Peace Prize.