Other Than Scale: Abstract Signs in the Digital Archive

[Delivered as part of the “Mid-Range Reading: Manifesto Edition” panel, organized by Alison Booth, of the DH2018 Conference in Mexico City]

A great deal of digital humanities work over the past decade or so has employed scale as the concept that distinguishes it from other methods of literary and cultural study. Quantitative scholars in particular have quite naturally chosen scale as the specific difference of their method. They speak of the computer as a “macroscope” that permits “macroanalysis.” Critics counted words and documents before computers, but computers let them count and compute lots of them. Contrasting themselves with close readers, “distant readers” propose, with the help of machines, to step back from the individual pages and books to see more and see bigger. When the popular press sees fit to feature DH, it is scale that gets touted and scale that gets maligned.

Claims of scalar difference are often apparently precise. Instead of offering a reading of a single novel, distant readers study the titles of 7,000 British novels from 1740-1850, or ask how not to read a million books, or search through (as of last count) the 60,237 full texts in EEBO TCP I and II. For nearly all quantitative analyses of texts, the authors tell (or could tell) the reader exactly how many words they are counting in exactly how many documents over how many years, since these numbers are the basis of more sophisticated metrics and models.

The concept of scale is not wrong or misguided in any simple sense, and I plan to issue no prohibitions on its use. Nor do I plan to offer a brief for the micro in opposition to the macro (as Roopika Risam and Susan Edwards did at DH2017). I want instead to argue that we should displace scale from its marquee role in differentiating data- and corpus-based digital inquiry from other approaches. That displacement has perhaps already begun. Surveying recent work by a range of scholars in an attempt to forestall attacks on the use of data in literary study, Ted Underwood observes that “None of them, as far as I can tell, have stopped doing close reading.” “We also do close reading” is a totally sensible line of defense, albeit one that fortifies distant reading at the expense of its distinctiveness. This is all to the good.

My argument is twofold. First, I want to suggest that in spite of quantitative precision – so many words, so many documents, so many years – we often don’t have a clear idea of what we talk about when we talk about scale. Even when bag-of-words approaches are forthright about discarding word order and syntax, they rarely operate with even a rudimentary account of the range of phenomena they are discarding. Individual texts, as competent readers make even basic sense of them, are much bigger and more informative than is usually acknowledged by even the most sophisticated quantitative approaches. What has been characterized as an increase in scale can usually be more accurately described as the sacrifice of one sort of information for another. I happen to think (contra Wai Chee Dimock and all talk of fractals) that this sacrifice is very often worthwhile – and it is in any case inevitable. But we ought to know what has been sacrificed, what hecatombs digital approaches to literary history have placed on the altar of scale. As it turns out, digital methods and tools are increasingly well suited to this task too.

My second argument is that the dominance of scale in accounts of digital methods has occluded other, non-scalar distinctions that may, in the long run, prove no less consequential for digital humanities research, including quantitative research. Those include notions like explicitness, falsifiability, reproducibility (!), modeling, prediction, gradualness, sampling, and more, but I want to focus today on one in particular: abstraction, specifically the abstraction characteristic of the linguistic sign. I turn to the insights of construction grammar and corpus linguistics to suggest possibilities for qualitative and quantitative investigation that have so far been overlooked by digital humanities work operating under the rubric of scale.

Let me illustrate using a relatively simple example, the bigram thought leader. The thought leader has assumed a particular preeminence in the Age of TEDx, but the OED gives its earliest appearance as 1887, when it was used to describe Henry Ward Beecher,[i] and some quick Googling antedates this use by a decade.[ii] If the bigram were part of a corpus modeled as topics or vectors, it would be counted as two words – a token each of the types thought and leader – though it functions as a single semantic and syntactic unit, a compound noun, which is why, for example, we pluralize it as thought leaders, not thoughts leaders. The loss of the compound, however, is only a small part of what gets left out of just about any bag-of-words approach.

What we actually know, when we understand an English noun compound like thought leader, is a hierarchy of abstract signs, pairings of form and meaning. So in addition to being familiar with the conventional expression thought leader, we also know the partially abstract form NOUN leader, as in group leader, squad leader, ring leader, house leader, student leader, and so on. These partially unspecified compounds have a built-in under-determination: leaders can be included in or excluded from the groups they lead. A student leader may or may not be a student herself. The compound thought leader has an additional quirk. Presumably a thought leader is a thinker who leads other thinkers, rather than thoughts per se. But the compound thinker leader is blocked by singer songwriter, hunter gatherer, and other [VERB-er]N [VERB-er]N expressions, which indicate coordination rather than compounding.[iii]

The partially schematic NOUN leader is itself an instance of the still more schematic construction NOUN [VERB-er]N, which finds ample use in Richard Scarry’s classic children’s book Busy, Busy Town.


What kids learn when reading, if they haven’t learned it before, is that someone who empties the wastebasket is a wastebasket emptier, someone who makes beds is a bed maker, and, by extension, someone (or something) who wugs wigs is a wig wugger.

And NOUN [VERB-er]N is, in turn, an instance of a still more schematic double noun construction, NOUN NOUN, like media lab, and of nominal compounds in general, X NOUN, where the X is any part of speech (and perhaps even of a fully general compounding schema X Y, about which I won’t say more). Here then is the hierarchy of constructions for thought leader:


X X > snow white (adj), freeze-dry (verb), stormcloud (noun)

X NOUN > digital humanities (adj n), downdraft (adv n), flyboy (verb n)

NOUN NOUN > railroad, party bus, textbook, fire drill, media lab

NOUN [VERB-er]N > table setter, cherry picker, cake baker, motherfucker (wig wugger = someone or something who wugs wigs)

NOUN leader > majority leader, party leader, squad leader, Senate leader, team leader, student leader, ringleader

thought leader

Geert Booij (2010) gives the following formulation of the nominal compound construction, drawing on the notation of Ray Jackendoff (2002):

[[a]Xk [b]Ni]Nj <> [SEMi with relation R to SEMk]j

Don’t get intimidated by the variables and symbols. If you are a fluent English speaker, it’s already part of your linguistic knowledge, something you know even if you don’t know that you know it. Humanists trained in the structuralist tradition that looks back to Saussure think of the linguistic sign as looking like this:


But the nominal compound construction is also a sign – a conventional pairing of form and meaning – that looks unfamiliar only because of its abstraction.


Elements of form and meaning are left blank, unspecified, so that they can be filled in with new words to produce an open-ended set of utterances, including ones no one has ever said or written before.

Basically what the notation says is that in a compound noun the rightmost noun (b) is semantically primary, while the word to the left (a) bears a semantic relationship (R) to the noun (b).  The letter X means that the left word can be any part of speech (adj, verb, prep, etc.).  The letters i, j, and k map elements of the signifier to elements of the signified.  It’s because you possess this construction that you know that a gun show is a show and not a gun, an oven mitt is a mitt and not an oven, and a wig wug (whatever that might turn out to be) is a wug and not a wig. It’s because this form is part of your linguistic knowledge that you might have noticed something strange – I mean morphosyntactically strange – about the “Squatty Potty” – namely that its order is wrong, since if it is anything, it is a squatty and not a potty.
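As a rough illustration of how the schema leaves slots open, here is a minimal Python sketch. The class name `NominalCompound`, its fields, and the function `head_of` are hypothetical names invented for this example, not part of Booij’s or Jackendoff’s notation. The sketch models only one point from the paragraph above: the rightmost noun is semantically primary, while the relation R stays unspecified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NominalCompound:
    """One filled-in instance of the schema [[a]X [b]N]N (toy model)."""
    left: str            # slot a: a word of any part of speech (X)
    head: str            # slot b: the rightmost noun, semantically primary
    relation: str = "R"  # the schema itself leaves this relation open

    def gloss(self) -> str:
        # All the schema guarantees is that the whole is a kind of `head`.
        return (f"{self.left} {self.head}: a kind of {self.head}, "
                f"with relation {self.relation} to {self.left}")


def head_of(compound: NominalCompound) -> str:
    """Return the noun that names what the compound is a kind of."""
    return compound.head


if __name__ == "__main__":
    for left, head in [("gun", "show"), ("oven", "mitt"), ("wig", "wug")]:
        print(NominalCompound(left, head).gloss())
```

On this toy model a gun show comes out as a show and a wig wug as a wug, while the relation between the two parts waits to be filled in by convention and context.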


The nominal compound construction is language specific, though many languages have a cognate construction. In Spanish the order is reversed, with the noun head (the semantically determinate noun) occupying the left position, as in El abrebotellas (lit. open bottles, bottle opener) or El comeflor (lit. eat flower, a derogatory term for hippy). There’s no explicit morpheme –er to indicate agent or instrument, as in the English, but the compound as a whole contributes that meaning, so there is no need to posit an empty or “deep” morpheme without surface realization.

The most interesting part of the English compound noun schema is the variable R, which stands for the relationship between the two concepts specified by the two nouns. The compositional meaning of noun compounds, at least at this level of abstraction, is significantly underdefined by convention (Downing 1977). If you didn’t already know that a party bus is a bus in which parties occur, rather than a bus that takes you to parties, or a bus that looks like a party (here I have a stage direction to push my glasses up my nose), you would have to infer it using a complicated combination of linguistic convention, situational knowledge, and world knowledge (about parties, about busses, etc.). If someone referred to the bus taking you to a party as a “party bus,” it would be the most unobtrusive and minimal sort of pun, the sort of pun that we pass over all the time without conscious notice.

Nominal compounds are not only productive – you can make new ones that no one has ever used before – they are also recursive. Because a nominal compound is a noun made of nouns, you can compound compounds, and compound the compounds of compounds. On the corporate side, there’s an entire section of glassdoor.com for “[[[ThoughtN LeaderN]N liaisonN]N JobsN]N,” with, the last time I checked, 491 listings, more than half as many jobs as last year’s MLA Job List. One ad invites you to “Join us as the Leader of our [[[ThoughtN LeaderN]N LiaisonN]N TeamN]N.”

But compounds aren’t just for corporate speak. Poets sometimes use recursive compounding to special effect, as in George Herbert’s “Prayer 1”: “Reversed thunder, [[[ChristN-sideN]N-piercingADJ] ADJ spearN]N,” or the first line of Gerard Manley Hopkins’s “The Caged Skylark”: “As a [[dareV-galeN]N [skyNlarkN]N]N scanted in a dull cage.” These examples have different compounding patterns:

The first is left branching, like this:  [[[[  ]  ]  ]  ], while the second is formed through adjunction, pairing like this: [[ ][ ]] [[ ] [ ]].  (If there are more precise technical terms for these kinds of compounding patterns, I’d be glad to learn them.)

So far I’ve been telling you bits of knowledge that, if you are an English speaker, you know implicitly and use virtually every day, when you understand the utterances you hear and read and when you produce new ones. Yet the role that abstract signs like nominal compounds play in culture and history has gone virtually without study by humanists. Alphabetical print tools like dictionaries, indexes, and concordances are great for finding words, but they are nearly useless for finding abstract constructions that have little or no fixed alphabetical content. That’s where digital tools come in, especially those built by corpus and computational linguists. Using corpus search tools like those at CQPweb or corpus.byu.edu, you can retrieve instances of NOUN leader or NOUN NOUN.


Using the morphological analyzer features that are part of most NLP packages, you can sort the NOUN NOUN results for instances where the second noun is a VERB stem plus the suffix –er, yielding only instances of NOUN [VERB-er]N.  It also wouldn’t take too long to sort through the list of double nouns by hand to exclude false positives like bell pepper, which is a nominal compound but not a NOUN [VERB-er]N compound.
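The filtering step just described can be sketched in a few lines of Python. This is only a toy under stated assumptions: the input is taken to be already part-of-speech tagged as (word, tag) pairs, the tag name "NOUN" is assumed, and `VERB_STEMS` is a tiny hand-made list standing in for the morphological analyzer a real study would use. The function names are hypothetical.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical stand-in for a real verb lexicon or morphological analyzer.
VERB_STEMS = {"set", "pick", "bake", "make", "empty", "open", "lead"}


def agentive_stem(noun: str) -> Optional[str]:
    """Return the verb stem if `noun` looks like VERB + -er, else None."""
    w = noun.lower()
    if len(w) < 4 or not w.endswith("er"):
        return None
    candidates = [w[:-2], w[:-1]]        # pick+er, bake+r
    if len(w) >= 5 and w[-3] == w[-4]:
        candidates.append(w[:-3])        # set+t+er (doubled consonant)
    if w.endswith("ier"):
        candidates.append(w[:-3] + "y")  # empty -> emptier
    for stem in candidates:
        if stem in VERB_STEMS:
            return stem
    return None


def noun_er_compounds(
    tagged: Iterable[Tuple[str, str]]
) -> Iterator[Tuple[str, str, str]]:
    """Yield (noun, er_noun, stem) for adjacent NOUN NOUN pairs whose
    second member is analyzable as a verb stem plus -er."""
    tokens = list(tagged)
    for (w1, t1), (w2, t2) in zip(tokens, tokens[1:]):
        if t1 == "NOUN" and t2 == "NOUN":
            stem = agentive_stem(w2)
            if stem is not None:
                yield (w1, w2, stem)


sample = [
    ("the", "DET"), ("cake", "NOUN"), ("baker", "NOUN"), ("ate", "VERB"),
    ("a", "DET"), ("bell", "NOUN"), ("pepper", "NOUN"), ("and", "CCONJ"),
    ("a", "DET"), ("table", "NOUN"), ("setter", "NOUN"),
]

if __name__ == "__main__":
    for hit in noun_er_compounds(sample):
        print(hit)
```

Here bell pepper is excluded only because "pep" is not in the toy stem list; with a fuller lexicon, spurious matches of this kind are exactly what the hand-check would need to catch.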


In diachronic corpora, like the Corpus of Historical American English, or the texts of EEBO TCP, which run from 1473-1700, you can study the way these schematic constructions change and vary over time. You can see how they instantiate in a variety of  distinct expressions. Our ability to study abstract constructions will improve as NLP and corpus query tools do, and this happens to be precisely the sort of semantically important feature of language that computational linguists have recently been hard at work on.

My work on abstract constructions, as in my book, Cyberformalism, has been primarily qualitative – I find all the instances and tell philological stories about them – but with sufficient recall and precision in the right corpora, we could study the quantitative distribution of their instantiations, chart how they change over time, and formulate hypotheses about their role in culture and society based on the trends that are revealed. I’d hypothesize, for example, that the type variety of NOUN [VERB-er]N tokens correlates with labor and instrumental specialization. In other words, the type-to-token ratio of the construction would increase dramatically with the division of labor characteristic of a post-Fordist society, and increase still more in a digital economy that has not only computer engineers, computer programmers, and software developers, but information systems security engineers, Java database application programmers, multimedia web application developers, and so on. If the trendline for the type-to-token ratio doesn’t rise with the rise of labor and technological specialization, or if it reveals more complex patterns, then there would be more hypothesizing to do.
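To make the measure concrete, here is a minimal Python sketch of a type-to-token ratio computed per period. It assumes the retrieval step has already produced (year, compound) pairs; the function name `ttr_by_period`, the `span` parameter, and the sample hits are hypothetical and invented for illustration, not drawn from any corpus.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def ttr_by_period(hits: Iterable[Tuple[int, str]],
                  span: int = 10) -> Dict[int, float]:
    """Map each period's starting year to (distinct compounds / all hits)."""
    tokens: Dict[int, int] = defaultdict(int)
    types: Dict[int, Set[str]] = defaultdict(set)
    for year, compound in hits:
        period = year - year % span          # e.g. 1905 -> 1900 when span=10
        tokens[period] += 1                  # every hit is one token
        types[period].add(compound.lower())  # distinct strings are the types
    return {p: len(types[p]) / tokens[p] for p in sorted(tokens)}


# Invented hits, for illustration only.
hits = [
    (1901, "bed maker"), (1905, "bed maker"), (1908, "cake baker"),
    (1995, "software developer"), (1999, "web developer"),
]

if __name__ == "__main__":
    for period, ratio in ttr_by_period(hits).items():
        print(period, round(ratio, 2))
```

One caution for any real trendline: raw type-to-token ratios tend to fall as the number of tokens grows, so periods with very different amounts of text would need some normalization, such as equal-sized samples, before being compared.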

Obviously finding, counting, and computing nominal compounds is more complicated than finding words. Their abstraction means that they won’t be matched by any fixed string.  Their orthography is unpredictable: one text’s sky lark is another’s skylark.  Their recursive, matryoshka-like potential means that we don’t even know how many words long, how many constituents, they will have.  We’d have to make judgements about how to count recursive compounds: presumably “[[dareV-galeN]N [skyNlarkN]N]N” would count as three instances rather than just one.  But here is the question: do we study only the aspects of language that are easy to find and count, like words, or do we seek also to make our methods adequate to the nature of the language we study?  I ask the question without offering an answer.
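The counting decision for recursive compounds can be stated as a short Python sketch. It assumes, purely for illustration, that a bracketing has already been assigned by hand or by a parser and is represented as nested tuples, with plain strings as the individual words; `count_compounds` is a hypothetical helper name.

```python
from typing import Tuple, Union

# A node is either a single word or a compound made of further nodes.
Node = Union[str, Tuple["Node", ...]]


def count_compounds(node: Node) -> int:
    """Count every bracketed compound in a nested structure,
    including compounds embedded inside larger compounds."""
    if isinstance(node, str):
        return 0  # a bare word is not itself a compound
    return 1 + sum(count_compounds(child) for child in node)


# [[dare-gale] [sky lark]]: two inner compounds plus the whole.
hopkins = (("dare", "gale"), ("sky", "lark"))

# [[[thought leader] liaison] jobs]: left-branching, one compound per layer.
glassdoor = ((("thought", "leader"), "liaison"), "jobs")

if __name__ == "__main__":
    print(count_compounds(hopkins))
    print(count_compounds(glassdoor))
```

Under this convention both examples count as three instances, whatever their branching pattern; counting only the outermost bracket would instead give one each, which is the judgement call at issue.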

Abstract constructions like noun compounds are constitutive of everything that we write, say, read, or hear. They contribute to the meaning of complex utterances and provide a basis for both everyday linguistic creativity and the extraordinary creativity exemplified by poets. Understanding them is essential to understanding form and meaning at the level of the sentence, the utterance, the line of verse. Studying them expands the possibilities of close reading in a way that, so far as I have seen, identifying large-scale lexical trends does not. I could also imagine it being helpful for a writer or poet: making unconscious knowledge conscious makes new possibilities for use and misuse. Abstract constructions are constitutive of what we read when we close read, but measuring words at scale – even the small or function words that signify grammatical relationships rather than lexical concepts – flattens out or discards just these abstractions. In topic models or word embedding models, thought leader becomes just two more words – two tokens of two types – conveniently separated by spaces. In the age of print, it made sense that cultural studies didn’t attend to the history or cultural significance of abstract constructions: we didn’t have the tools or the digital texts to study them. Increasingly we do.

Of course, questions of scale won’t go away if we take a fuller account of the information even in the shortest bits of text. We will always have to make decisions about the proper scale of inquiry, to ask what archive, set of texts, or sample subset of texts is the proper evidentiary basis for our claims, and to determine what scale of analysis fits the scale of the phenomenon. But these will be primarily technical questions, taking their place alongside many others. Scale will cease to serve as the banner concept that sets digital inquiry apart or defines its promise for humanistic understanding.


[i] OED: 1887   L. Abbott & S. B. Halliday Henry Ward Beecher i. ii. 56   Mr. Beecher retains his position as the most eminent preacher and one of the great thought-leaders in America.


[iii] Thanks to Amir Zeldes for this point. My interest in noun compounds began with reading Livio Gaeta and Amir Zeldes (2017) “Between VP and NN: On the Constructional Types of German -er Compounds.” Constructions and Frames 9(1).

