How to Measure Text?

…the words we join have been joined before, and continue to be joined daily. So writing is largely quotation, quotation newly energized, as a cyclotron augments the energies of common particles circulating.

– Hugh Kenner, The Pound Era

This month marks the beginning of the complicated process of starting up the Large Hadron Collider, the world’s largest particle accelerator (Kenner would have called it a “cyclotron”), buried beneath the Franco-Swiss border. Near the top of the LHC’s agenda is having a peek into the fabric of space-time to see about the Higgs boson, the theorized source of mass.

But to do so they’ll need data–lots of data. According to CERN, the event summary data extracted from the collider’s sensors will amount to around 10 terabytes daily. That is, to use the cliché, something like a Library of Congress’s worth of data every day (the raw data is much, much greater).

The physics involved is obviously too complicated for a mere humanities major to discuss in any intelligent way. The interesting thing is the disparity between the sheer amount of data with which the LHC deals and the scale of the (textual) data of the humanities. How can the LHC, in a single day, focused on a highly specific set of questions, produce as much information as the literary output of humans represented by the Library of Congress? Why, in short, is the textual data of the humanities so much smaller than the data produced by the LHC?

It is, of course, in some ways a silly, completely naive question. But the differences, in size alone, of these two datasets are nevertheless instructive and worthy of consideration. We might oversimplify the matter, and say that the LHC’s data, collected from its sensors and culled by its arrays of servers, is fundamentally information-poor data. The challenge faced by the LHC project is sorting through the complexities of the data to find the relevant information that will allow physicists to answer the questions they have. Language, by contrast, is information rich–so rich that our challenge is not how to separate the wheat from the chaff, but how to deal with the sheer flood of information compressed in text.

It is this fact that explains the disparity in size between the LHC’s data and the textual record of the humanities. The textual data of the humanities comes “preorganized” by language. While our digital texts encode only strings, language fills texts with syntactic and semantic information of which our systems of markup are completely oblivious.

Martin Wattenberg at IBM’s Watson Research Center puts it well in his interview with Wired when he describes language’s ability to compress information:

Language is one of the best data-compression mechanisms we have. The information contained in literature or email encodes our identity as human beings. The entire literary canon may be smaller than what comes out of particle accelerators or models of the human brain, but the meaning coded into words can’t be measured in bytes. It’s deeply compressed. Twelve words from Voltaire can hold a lifetime of experience.

What happens if we take this understanding of language seriously? How would it change the way we deal with textual data?

Right now we have plenty of digital texts available, but to get the information out of that textual data we have to read it: only by reading do we attend to its specifically linguistic nature. Existing text-analysis technologies and techniques remain largely quantitative, relying on machine learning to classify texts represented as vectors of frequency counts. Key sources of linguistic information, however, like syntax, remain fundamentally unexploited. We are still, in effect, discarding some of the most basic sources of textual information–such as the order in which the words occur (seriously).
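To make the point concrete, here is a minimal sketch (in Python, with illustrative sentences of my own) of the frequency-count representation described above. Two sentences with opposite meanings become indistinguishable once word order is discarded:

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text as word-frequency counts, discarding order."""
    return Counter(text.lower().split())

# Two sentences with very different meanings...
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# ...collapse into the same vector of frequency counts.
print(a == b)  # True
```

This is exactly the loss the rest of the essay worries about: the representation keeps the counts and throws away the syntax.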

One avenue, though admittedly crude, is to use a technique like part-of-speech tagging to supplement raw text with part-of-speech tags, providing a fuller, more information-rich digital representation of the linguistic data. By analyzing such tags, taking them in pairs, or looking at where in a sentence they occur, we get some sense of how a writer uses language. We step, in short, over the threshold from a purely quantitative view of language use (e.g. how many times does “of” occur per thousand words? what are the most frequently occurring terms?) to a mode of analysis that begins to extract the sort of information that we humans extract when we read. Such techniques begin to recapture the fundamentally linguistic nature of textual data, which is too easily discarded in representations of natural languages. To truly capitalize on the information contained in textual data, we must find more ways to attend digitally to its specifically linguistic nature.
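A sketch of the tag-pair idea, with the caveat that the lookup “tagger” below is a toy stand-in for a real POS tagger (such as the ones shipped with NLTK), and the lexicon and tag set are purely illustrative:

```python
from collections import Counter

# A toy lookup tagger stands in for a real POS tagger; the
# lexicon and tag set here are illustrative only.
TOY_LEXICON = {
    "the": "DET", "a": "DET", "old": "ADJ",
    "dog": "NOUN", "man": "NOUN", "bit": "VERB",
}

def pos_tags(words):
    return [TOY_LEXICON.get(w, "UNK") for w in words]

def tag_pairs(words):
    """Count adjacent pairs of part-of-speech tags: a crude
    fingerprint of how a writer strings word classes together."""
    tags = pos_tags(words)
    return Counter(zip(tags, tags[1:]))

pairs = tag_pairs("the old dog bit the man".split())
print(pairs[("DET", "ADJ")])   # 1
print(pairs[("NOUN", "VERB")]) # 1
```

Unlike the bag of words, these counts are sensitive to order: “the old dog” and “the dog old” yield different tag pairs even though they contain the same words.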

We are trying to read the finely wrought braille of language through the burlap sack that current digital tools offer. With the combination of natural language processing tools (such as POS taggers and parsers) and ever-more sophisticated machine learning techniques, we may be able to get closer. Humanities data is not necessarily smaller–it is just more compressed.

Former NINES Fellow, Scholars' Lab Fellow, and HASTAC Scholar. Currently Assistant Professor of English at Syracuse University. I completed my PhD at the University of Virginia in August, 2011.


  1. Wow, so the bag of words being influenced into one entity takes place in a time where inspiration controls it to form this bag of figurative words in the best way found, as it’s the final (KJV?) translation whose mere translation is an act of grace and perfection of all that’s said in it.

    I would like to talk about this phenomenon with any of the above three of you, or the author, if anyone feels so inclined to email me at johnjciii, as I find it very interesting the way all of you have spoken ambiguously and in such detail that an untrained eye would see no understanding in what was said; I know exactly what is meant by all of you and would love to discuss these cosmological concepts with someone who understands them as I do.
    I welcome any and all discussion anyone is willing to have with me about these obviously very complex worldly concepts, as I know their relevance is approaching within an unpredictably short number of weeks or months. Phenomenal page!

  2. Thanks very much to both commenters for taking the time to respond.

    @Allison Booth:
    Indeed! Tagging software tends to be highly language-specific, and still has a rather high failure rate.

    More pressingly, when trying to measure authorial style, how one tags (or otherwise processes one’s input texts) is only half of the design of the experiment. If you’re trying to answer the question, “Did Oscar Wilde (for example) write this text?”, how do you select a corpus of texts that is sufficiently representative of what you are trying to measure (namely Oscar Wilde’s style versus “everything else”)? This situation is decidedly more complicated than the case where one has a large set of texts by author A, a large set of texts by author B, and an unknown text known to be written by either A or B (as in the canonical case of statistical authorship attribution–the Federalist Papers).

    I emphasize design of the entire corpus as a way of resisting the temptation to drift toward a naive positivism in which we imagine that we are actually measuring, or quantifying, something called “style” that exists. It is worth remembering that there is absolutely no linguistic “fact” that grounds this work. Plenty of work has managed to successfully discriminate texts based on authorship; but the idea that individuals use language in an identifiably unique way remains a necessary assumption rather than a proven fact (Pierre Menard might, after all, manage to reproduce the Quixote).

    @Brad Pasanek:
    Wattenberg indeed waxes a little sentimental in imagining Voltaire compressing “life experience” into “twelve words.” But I think his broader point about language compressing information remains worthy of consideration. (I can’t help thinking of Ezra Pound’s note on “The Jewel Stairs’ Grievance” as a particularly apposite example.)

    And yet, asking “what” is compressed (beyond a deliberately evasive term like “information”) may be less productive than asking what our models are able to successfully demonstrate. Indeed, rather than metaphorically imagining ourselves as “mining” information out of textual data (as if the information were sitting there, waiting for us), it seems safer to assume that we are building models that may prove more successful than previous ones at answering certain questions we ask (about authorship, for example). Wattenberg’s observation is helpful not as a goad to get closer to the “truth” of language, but insofar as it draws our attention to the purely pragmatic opportunities that remain unexploited when we lose sight of the distinctly linguistic nature of textual data.

    The “bag of words” model indeed delivers excellent results for certain tasks. And the attempt to better approximate the linguistic structure of textual data has serious costs (among them the time it takes to tag and parse the texts). When successful, though, it also has benefits. While my more technically able comrade and I are still working with the data, we have already seen a marked improvement in our success rate in authorship discrimination by using POS tags.

    The same has been shown in Hirst and Feiguina’s “Bigrams of Syntactic Labels for Authorship Discrimination of Short Texts,” published last year in Literary and Linguistic Computing and available here:

    In addition to simply tagging parts of speech, they use a partial parser (which identifies not only parts of speech, but larger syntactic units like noun or verb phrases) to convert novels into a stream of tags representing (crudely) the syntactic structure of the document. They then analyze bigrams for purposes of authorship attribution. Intuitively, as Allison pointed out, we imagine that “the most important compression is figurative.” While this method does not allow anything like a complete insight into an author’s use of figurative language, by taking account of the frequency with which an author uses certain types of words in certain patterns, it represents an improvement over the “bag of words.”

    I have some reservations about the design of the specific experiments described in this essay. (They claim to be able to achieve a very high rate of success in identifying text samples of 200 words, which from my experience seems almost impossibly small). The concept, however, seems to be a reasonable next (baby) step beyond the “bag of words” model.

  3. What sort of information do we think is being compressed in a string of words? Martin Wattenberg seems to suggest that it is identity. I’m teaching Borges’s “Pierre Menard” today and doubt that what we find in twelve words of Voltaire is something like “life experience” — but I suppose, yes, it is pretty to think so.
    Menard, author of Quixote, rejects the “biographical” method at the outset: “To be, in some way, Cervantes and to arrive at Don Quixote seemed to him less arduous–and consequently less interesting–than to continue being Pierre Menard and to arrive at Don Quixote through the experiences of Pierre Menard” (Ficciones 49).
    Finally, as a literary critic, I’ve been chastened by the results of the machine learning experiments my CS collaborator and I have designed (many of which concern metaphors). Over and over again we find that just the “bag of words” will serve to train a classifier. The bag of words horrifies me: no word order, punctuation, syntax, grammar, or type-token distinctions, no recognition of allusion, style, idiom, figure, or trope. And still the bag of words delivers excellent results.
    Should we take a moment to think about something even more compressed than a string of words?

  4. This is really interesting to me. Some immediate thoughts: syntactical tagging will only be as good as the linguist creating the tags; word order differs not only between languages but also between historical periods, regions, etc., so determining what is individual style (supposing this is the purpose of the labor going into such tagging) will be tricky. And really, the most important compression in language is figurative, and getting at figures of speech is surely more elusive than other measurable patterns. I’m sure someone will get there…
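The partial-parse approach discussed in the second comment above (converting a text into a stream of syntactic labels and then counting bigrams of those labels) can be sketched roughly as follows. The regex “chunker” here is a deliberately crude stand-in for a real partial parser, and the input tag sequence is assumed to come from a POS tagger:

```python
import re
from collections import Counter

def chunk_stream(tag_seq):
    """Collapse DET (ADJ)* NOUN runs into a single NP label,
    yielding a crude stream of syntactic labels."""
    s = " ".join(tag_seq)
    s = re.sub(r"DET(?: ADJ)* NOUN", "NP", s)
    return s.split()

# e.g. POS tags for "the old dog bit the man"
stream = chunk_stream(["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"])
print(stream)  # ['NP', 'VERB', 'NP']

# Bigrams over the label stream become features for an
# authorship classifier.
bigrams = Counter(zip(stream, stream[1:]))
print(bigrams[("NP", "VERB")])  # 1
```

Note how the stream abstracts away both the particular words and the internal makeup of each phrase, leaving only the coarse syntactic skeleton as a feature set.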