Searching for Love is Love

In a previous post I tried to show why scholars of literature, culture, and intellectual history might want to direct their attention beyond the words in the lexicon to more complex and abstract signs. Specifically, I argued that tautological forms (tautaulogies, following Joyce) are conventional, Saussurean signs that do consequential political work in the domain of what Gramsci called “common sense.”

The post drew on a range of examples, such as Love is love, It is what it is, and The law is the law. But it didn’t attempt anything like the kind of longue durée history that is performed in the chapters of Cyberformalism (which, did I mention, is out now with JHUP?). The meaning and use of the form varies across languages, cultures, and history. How can we study that variation? How would we begin to piece together a history of tautaulogies, just as we now have abundant histories of words?

These questions are why there is a cyber- in Cyberformalism – why the book falls under the umbrella of the digital humanities. To study linguistic forms, you need to be able to find them. And for that task print finding tools are unsatisfactory. An index, card catalog, concordance, or dictionary will be of little or no use in locating “equative” tautaulogies such as Love is love, Poems are poems, or when dogs were dogs, which share no word forms in common. All three include inflections of the verb be, but that fact, on its own, does almost nothing to winnow them from a text archive of any size.

Let’s think a bit about how digital methods might allow us to search for Love is love and other equative tautaulogies. I notated their general form as NP1 Be NP1. How closely our search criteria approximate that form depends on the capabilities of our finding tools. As we will see, search can get technical and complex pretty quickly, even running right up against the current limits of Natural Language Processing. But we can make some progress nonetheless.

 

Let’s say that all you have is a plain text archive and the ability to search it using regular expressions. You can test out which REGEXES match text strings with a tool like regex101.com.

Let’s start with (\w*) (were|was|is|be|are) (\w*). This string says, in effect, “find strings of three words, where the middle word will be a form of Be.” It matches love is love, but it also returns false positives such as dogs are mammals. And there are many other tautaulogies, like Boys will be boys or The law is the law, that it won’t capture.
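To make this concrete, here is a minimal sketch in Python (my choice of language; a tester like regex101.com would show the same thing). It runs the pattern over the four examples just mentioned and reports what, if anything, is matched. One assumption of mine: I have written \w+ where the pattern above has \w*, so that a "word" can never be empty. The helper name first_match is hypothetical.

```python
import re

# The three-word pattern from the text, with \w+ in place of \w*
# (my assumption) so that a "word" must contain at least one character.
THREE_WORDS = re.compile(r"(\w+) (were|was|is|be|are) (\w+)")

samples = [
    "love is love",        # the tautology we want
    "dogs are mammals",    # matched too, although it is no tautology
    "Boys will be boys",   # only a three-word fragment is matched
    "The law is the law",  # only a three-word fragment is matched
]


def first_match(text):
    """Return the first substring the pattern matches, or None."""
    m = THREE_WORDS.search(text)
    return m.group(0) if m else None


for s in samples:
    print(f"{s!r:24} -> {first_match(s)!r}")
```

For the last two samples the pattern does find something, but only a fragment (will be boys, law is the) rather than the tautaulogy as a whole, which is another way of saying that it fails to capture them.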

In some interfaces, regex has a capability called backreferencing (or backref for short), which allows it to match the same text that it has already captured. (In language, as in cognition generally, the notion of “the same again” goes very deep, but backref offers a simple, mechanical version of it.) So the string (\w*) (were|was|is|be|are) \1 means “find a word, followed by a form of be, and then the same as the first word again.” It will match love is love and Let Bartlett be Bartlett, but not dogs are mammals. Adding an optional will, as in the string (\w*) (will )?(were|was|is|be|are) \1, also captures Boys will be boys. (The space belongs inside the optional group; otherwise the pattern would demand two spaces whenever will is absent.) But backref has limitations. Regex’s matching of letters is entirely, well, literal. If the text reads Love is love, the backreference will be sensitive to the case of Love and won’t match the lowercase love. If there is a workaround available, I would be glad to know it. Otherwise, more powerful tools are needed.
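One candidate workaround, sketched here on the assumption that the search is run in Python's re module rather than in a corpus interface: the case-insensitive flag (re.IGNORECASE) also applies to backreferences, so Love and love count as "the same again." Other regex engines and search interfaces may or may not behave this way. The \b word boundaries and the use of \w+ are additions of mine, and the function name find_tautologies is hypothetical.

```python
import re

# Forms of "be", as in the patterns above (non-capturing, so \1 stays group 1).
BE = r"(?:were|was|is|be|are)"

# \b stops "war is warfare" from matching on the shared prefix "war".
# (?:will )? makes "will" and its trailing space optional together.
# re.IGNORECASE makes the backreference \1 ignore case as well.
TAUTOLOGY = re.compile(rf"\b(\w+) (?:will )?{BE} \1\b", re.IGNORECASE)


def find_tautologies(text):
    """Return every substring of the form 'X (will) be-form X'."""
    return [m.group(0) for m in TAUTOLOGY.finditer(text)]


for s in ["Love is love", "Boys will be boys", "Let Bartlett be Bartlett",
          "dogs are mammals", "war is warfare"]:
    print(f"{s!r:28} -> {find_tautologies(s)}")
```

This only relaxes case. It still treats "the same again" as the same string of letters, so boy and boys, or law and the law, remain different as far as the pattern is concerned.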

What if we are searching in Part of Speech (POS) tagged corpora like those at CQPweb or the BYU corpora? Then additional capacities are available to us. We can search not just for words in general but for categories of words. In CQPweb the search string _N* _VB* _N* reads “any noun followed by any form of be followed by any noun.” If backreferencing were available, you could search for (_N*) _VB* \1, which would match War is war, love is love, girls are girls, etc. And there are additional levels of flexibility. Instead of capturing just a single noun, you could capture an optional article as well with ((_AT*)? _N*) _VB* \1. While I don’t have a way of testing this out, it should match instances like A deal is a deal and The law is the law, as well as Love is love. It’s also worth pointing out that a search doesn’t have to capture every kind of equative tautaulogy in a single search string. It will often be easier to run a number of distinct searches, each of which retrieves a subset of true positives.
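Since I can't test the backreferenced query in CQPweb itself, here is a rough stand-in in plain Python that applies the same logic to text already tagged in a word_TAG format. Everything about the setup is an assumption for illustration: the tags are CLAWS-style labels typed in by hand (AT* for articles, N* for nouns, VB* for forms of be), not the output of a real tagger, and the function names are hypothetical.

```python
def parse_tagged(text):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs."""
    return [tuple(tok.rsplit("_", 1)) for tok in text.split()]


def is_np(span):
    """True for a bare noun, or for an article followed by a noun."""
    tags = [tag for _, tag in span]
    if len(tags) == 1:
        return tags[0].startswith("N")
    if len(tags) == 2:
        return tags[0].startswith("AT") and tags[1].startswith("N")
    return False


def words(span):
    """Lowercased words of a span, so that 'The law' equals 'the law'."""
    return [w.lower() for w, _ in span]


def find_equatives(tagged):
    """Return each stretch of text matching ((article)? noun) be (same again)."""
    hits = []
    for i in range(len(tagged)):
        for np_len in (2, 1):  # try article + noun before a bare noun
            np1 = tagged[i:i + np_len]
            verb_at = i + np_len
            np2 = tagged[verb_at + 1:verb_at + 1 + np_len]
            if (is_np(np1) and is_np(np2)
                    and tagged[verb_at][1].startswith("VB")  # CLAWS-style be
                    and words(np1) == words(np2)):
                end = verb_at + 1 + np_len
                hits.append(" ".join(w for w, _ in tagged[i:end]))
                break
    return hits


for line in ["A_AT1 deal_NN1 is_VBZ a_AT1 deal_NN1",
             "Love_NN1 is_VBZ love_NN1",
             "dogs_NN2 are_VBR mammals_NN2"]:
    print(find_equatives(parse_tagged(line)))
```

The comparison is made on lowercased words rather than on raw strings, which sidesteps the case problem that plain backreferencing ran into.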

REGEX and POS tags are never going to be fully up to the task of identifying abstract signs in text archives. In the terms of Chomsky’s Hierarchy, a Type 3 (or “regular”) grammar will always lack the power to fully generate (or match) human languages, which are generated by a Type 2 grammar (with occasional Type 1, or “context sensitive,” exceptions). But more powerful, Type-2-ish search tools are available.

Annotated by a full constituency parser, like the one from Stanford (http://nlp.stanford.edu:8080/parser/), Love is love would look like this:

(ROOT
  (S
    (NP (NN Love))
    (VP (VBZ is)
      (NP (NN love)))))

And a tautaulogy with a more complex Noun Phrase, such as The sanctity of the law is the sanctity of the law, would be annotated like this:

(ROOT
  (S
    (NP
      (NP (DT The) (NN sanctity))
      (PP (IN of)
        (NP (DT the) (NN law))))
    (VP (VBZ is)
      (NP
        (NP (DT the) (NN sanctity))
        (PP (IN of)
          (NP (DT the) (NN law)))))))

These embedded, hierarchical, labeled sentences are rich but unwieldy digital objects. There are, at present, few parsed-text archives of a size or accuracy that would be useful for cultural or historicist inquiry. Manipulating or searching them adequately would involve using a search language like Tregex or TGrep2, probably in conjunction with a scripting language like Python. This falls well beyond my technical capabilities. I would need to partner with a good computational or corpus linguist to search parsed texts for tautaulogies or other similarly abstract forms.
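For a sense of what such a script might involve, here is a deliberately small sketch in plain Python. It does not use Tregex or TGrep2. Instead it reads a bracketed parse with a hand-rolled reader (read_tree, a hypothetical helper of mine) and asks one narrow question: does some S consist of an NP followed by a VP made up of a form of be plus an NP with the same words? It assumes the parse is well formed and in the bracketed style shown above, and it would miss anything more elaborate, such as Boys will be boys.

```python
BE_FORMS = {"be", "is", "are", "was", "were"}


def read_tree(text):
    """Turn a bracketed parse into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Lowercased words under a node, in order."""
    if isinstance(node, str):
        return [node.lower()]
    return [w for child in node[1] for w in leaves(child)]


def subtrees(node):
    """Yield every labeled node in the tree."""
    if isinstance(node, str):
        return
    yield node
    for child in node[1]:
        yield from subtrees(child)


def is_equative(tree):
    """True if some S is NP + VP, where VP = be-form + NP with the same words."""
    for label, kids in subtrees(tree):
        if label != "S":
            continue
        phrases = [k for k in kids if not isinstance(k, str)]
        for subj, vp in zip(phrases, phrases[1:]):
            if subj[0] != "NP" or vp[0] != "VP" or len(vp[1]) != 2:
                continue
            verb, obj = vp[1]
            if (" ".join(leaves(verb)) in BE_FORMS
                    and obj[0] == "NP"
                    and leaves(subj) == leaves(obj)):
                return True
    return False


love = "(ROOT (S (NP (NN Love)) (VP (VBZ is) (NP (NN love)))))"
print(is_equative(read_tree(love)))
```

Because the comparison is made between whole Noun Phrases rather than single words, the same check applies unchanged to the longer sanctity example.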

Humanities inquiry can sometimes appear to proceed as if it were a-technical, without material or instrumental support. But that is an illusion; as our tools grow familiar and get taken for granted, they become invisible. The supports of lexical research are ubiquitous, present in virtually every kind of reference work, every book with an alphabetical index. One task of scholars of media and book history (as an early modern scholar, I am an especial fan of work by Ann Blair) is to remind us of the role that such reference technologies play in the production of knowledge. Another is to show us that the reference technologies that seem second nature to us now (like alphabetization) were never obvious or easy when they were first being built and used.

It should be no surprise that studying new objects of philological, linguistic, cultural, and historical knowledge will involve grappling with research tools that are difficult, technical, complicated, unwieldy, and only partially adequate. Formulating a search is not extrinsic to the forms that are searched for; it involves thinking through the very nature of those forms. Our conceptions of language develop in tandem with the material supports for those conceptions.
