I’m torn: I’m very interested in stylometry, but I have issues with the fundamental questions that are asked in this field, in particular authorship attribution.

In my research, I’ve thought and written quite a bit about authorship. My dissertation looked at changing concepts of authorship – the singular, cohesive, Romantic genius author as established in collected editions in Japan at the turn of the 20th century – and also at actual practices of writing and authorship that preceded and accompanied these developments. My conclusion about authorship was that it is a kind of performance, embedded in and never preceding the text, and is not coextensive with the historical writer(s) behind the performance – pseudonymous, collective, anonymous, or otherwise.

These performances are necessarily contextualized by space, time, society, culture, literary trends, place of publication, audience. They are more or less without meaning if one doesn’t take context into account, even if not all relevant contexts at once. For a performance takes place within a historic, cultural, and literary moment, and does not exist independently of it. I see that place of performance as both the text and its place of publication, its material manifestation; and it is a performance that is inextricably linked to reader reception.

I also don’t see these performances as necessarily creating a unified authorial identity or unified author-function across space, time, and texts. This may sound extremely counterintuitive given that many performances of authorship share appellations and can be “attributed” to the “same” author, and I recognize that my argument is downright bizarre at times. I blame it on having spent too much time thinking about the implications of this topic. But in a way, our linking of these performances after the fact is artificial, and these different author-functions are, for me, so linked to the time and place of both publication and reading – whether contemporary or not – that they can be seen as separate as well. This is why I concluded that collected literary anthologies are constructing – inventing – an entirely artificial “author” out of the works associated, after the fact, with a historical, individual writer, whose identity and name may not have coincided with that of the authorial performance at all in the first place.

So, that said, let me get to my disagreement with authorship attribution. It’s fundamentally asking the wrong question of authors and authorship: who “really” wrote this text? My argument is that the hand of the historical writer “behind” the authorial performance is a moot point; what matters is the name, or lack of a name, attributed to the text when it is published, republished, read, and reread over time. It’s the performance that takes place at the site of the text, coinciding with and following the creation of the text, deeply associated with and embedded in the text, and located within reception rather than intention. It takes place at a different site than the hand of the historical writer holding a pen or the mind creating an idea. And so the search for the “real” identity of the author is beside the point; what is happening here is really “writership attribution,” which is something separate from authorship.

A colleague recently asked me, too, what the greater goal of authorship attribution is – what is it beyond finding out the person behind the text? What is it besides deciding that the entity constructed with the name Shakespeare “really” wrote an unattributed or mysterious text? And I couldn’t answer this question, which brings me to my second fundamental problem with authorship attribution. I don’t see an overarching research question guiding methodology, besides the narrow goal of establishing writership of a text. This could be my own ignorance, and I’d be happy to be corrected on it.

I’m interested to hear your thoughts!

  1. I think authorship attribution is less interesting in that it attributes things to authors, and more interesting in that you’re testing things that may be important for style. So it may be less exciting that one person wrote something, but having an algorithm that can assign a text to the author because it looks at X element means that X element may be interesting or important.

    Also being able to quantitatively talk about word choice, like Henry James abandoning “cried” halfway through his career, is kind of cool to me.

    TL;DR I think the “style” part of stylometry is most interesting and authorship attribution is one way of testing hypotheses or revealing information about style.

  2. (I think I’m willing to accept the distinction between authorship and writership you suggest, though for the sake of clarity, and habit, I’ll just say “authorship” below.)

    For a time I worked on an authorship attribution project looking at Oscar Wilde and the pornographic novel *Teleny*, which is sometimes attributed to him, and I frequently found myself wishing for someone to make a point very similar to the one you’re making. In doing authorship attribution work one often finds, and finds oneself, employing metaphors of *signatures* or *fingerprints* or even *DNA* (we want to “discover” Wilde’s unique “stylistic fingerprint,” for instance). The basis of these metaphors, and their origin in criminality, are themselves probably worthy of scrutiny (graphology is notoriously unreliable, DNA gets used in troublesome ways, and the role of fingerprints has a more complicated history than we sometimes allow). They also elide a number of crucial differences between trying to measure an individual, writerly style and the materiality of DNA/fingerprints/handwriting (and there are important differences between those three things, of course). “Style,” if we can provisionally call it that, is a process as much as a single thing. Moreover, there is (and *please* correct me if I’m wrong) absolutely no reason for the intuitively seductive, but theoretically baseless, assumption that each writer has an absolutely unique and identifiable way of using language.

    And yet, despite this lack of theoretical grounding, the techniques, in some sense, seem to *work*. How to reconcile this?

    In large part, I think it is simply a problem in how we talk *about* attribution; and indeed, this is one of those areas where I like to think that greater commerce between “digital humanities” and “theory” (scare quotes all around!) could be beneficial. You write that your disagreement with authorship attribution is that it asks “who ‘really’ wrote this text?” And I agree that, when we use metaphors of fingerprints and stylistic DNA, we are doing this. But I think this is a misleading way to describe what authorship attribution does.

    You mention performance; and indeed, I think it is intellectually healthy and methodologically salutary to imagine authorship, as conceptualized by these methods, as performative in the same way that *gender* is famously described as a performance by Judith Butler. It is something not fully within conscious control; it is iterable; and it creates, through performance, the illusion of some reality “behind” it (gender/the author). And, like gender, just because it is not grounded in some deeper metaphysical reality, it is no less “real.”

    Where this comparison may begin to feel strained is that attribution relies on something Butler *never* talks about: quantification.

    What authorship attribution *does* (regardless of what it says) is quantify that performance, and create a system where it is possible to evaluate the likelihood that one author wrote something rather than another. While in developing this system (what set of authors; how is text pre-processed; what features to extract; what classifier to use) one may end up (perhaps merely incidentally) making claims about the nature of authorship *qua* authorship, I like to imagine that such claims are inessential to what such studies actually *do*. We might say in some cases that the hacking and the yacking aren’t well aligned. *The Federalist Papers* case remains, for me, helpful because it shows simultaneously the enormous power of these methods (the success in determining what is now generally accepted as the “right” answer) and their limitation (a case with lots of text and a closed set of possible authors).

    I hesitate to speak to your closing provocation because I am not someone who has invested a significant portion of his career in authorship attribution methods. I would, however, resist it. The suggestion that there is something “only” or “mere” about determining whether a particular individual wrote a particular work undervalues this knowledge. Being able to offer strong statistical evidence to attribute a work to an individual is valuable. I think one can grant entirely the view of authorship you elaborate above and still think that knowing (even in a tenuous, provisional way) whether a particular individual wrote a particular work is something worth knowing, and (within reason; aye, there’s the rub) worth our effort to try to learn in those cases where it is uncertain. (From my understanding, much of the Shakespeare stuff seems a bit like FUD… but…). The assumption that there is, or needs to be, some “greater” goal may unnecessarily, or unfairly, derogate authorship attribution. (One can imagine a similar question about textual scholarship, critical editing, etc.; what is the “greater goal” beyond establishing a text and finding its variants?)

    Whoops. That was a long comment.

  3. Erin is spot on – what is important in stylometry is the quantification of style. Stylometry is not purely authorship attribution any more (if it ever was); it’s tough to say that what we’re doing is attributing texts when we’re looking at 1,000s of texts, at the ‘shape’ of entire genres.

    Not to trivialise the project of authorship attribution at all – we’ve classified texts by author since ancient times.

    The point of rigorously modeling style is to gain fundamental insight into the texts at hand. Hoover’s work on Henry James, for example, charts James’ progressive changes in style and demonstrates that his work measurably changes at a syntactic level. This lends credence to the work of his biographers – James was spinning gradually out of control as his career progressed.

    TL;DR: We measure and model style to get a better perspective on the text. Sometimes it’s authorship, and other times it’s psychology or social history.

  4. I quite like this challenge you present to the more traditional focus on identifying the “actual” writer/author of a text. Thinking about the construction of the conception of an author, as his or her reputation grows and changes and develops over the centuries, as the result of the actions & thoughts of readers, editors, publishers, critics, is very interesting, and very important. As so much has been developed over the centuries about the *idea* of Shakespeare, we cannot truly say that this fictionalized, imagined Shakespeare is irrelevant, unimportant, or not worthy of study. In fact, the imagined Shakespeare has probably had far more influence upon our culture and society than the historical Shakespeare.

    This goes, too, to a large extent, for just about any historical figure, an idea discussed recently on PBS’ Idea Channel, in a video entitled “Are There TWO Nikola Teslas?”

    The notion of “performance” of authorship is also dramatically visible in the visual arts. (I feel like I’ve written about this before, and probably in a comment on this same blog somewhere… pardon me if I’m repeating myself.) By choosing pseudonyms, seals, and signatures, and indeed in their artistic style itself, artists present themselves as certain characters – they create, or try to create, their own reputations. Ike no Taiga (1723-1776) is a fine example of this, as he changed his surname from the two-character Ikeno 池野 to the one-character Ike (no) 池 in order to better perform the identity of a Chinese-style literatus; he also performed the literati identity in other ways, succeeding at creating for himself a reputation as a pure “man of culture,” a literati painter who shunned commercialism, even as he, in fact, sold his paintings to make a living.

    All that said, though, I will say that there must be some value to determining who “really” created a given work. The differentiation between a “real” name and a pseudonym is somewhat meaningless and arbitrary – I’m not sure it matters that Hokusai, Hiroshige, and Taiga were all pseudonyms, and that in many cases we don’t know the “real” (birth/given) name of an artist – but surely there is value in connecting multiple pseudonyms and recognizing them as belonging to the same individual, or to different individuals. Knowing (or, believing, as best as we can based on the evidence) that Western pictures pioneer Shiba Kôkan, who experimented with Western perspective, realism, light & shading, and oil painting, and who was the first Japanese to produce copperplate engravings, was the same person as woodblock print artist Suzuki Harushige, whose works were so fully in the style of his master Suzuki Harunobu that they were successfully passed off as the master’s works… knowing that these two were the same person tells us a lot about the individual, his interests and abilities, and about the arts world of Edo more broadly: that someone could belong to multiple very different schools or styles, and was not restricted to just one. Finally, knowing that Harushige was not Harunobu – identifying, that is, who “really” designed these prints – tells us a lot about the cultural conceptions of copying, forgery, and intellectual property at that time.
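
For readers who have not seen one of these systems, the pipeline described in comment 2 (a closed set of candidate authors, some pre-processing, a set of features, a classifier) can be sketched in a few lines of Python. This is only an illustration, not anything used by the commenters: it assumes Burrows's Delta as the distance measure, which nobody above names, and the function names, the toy texts, and the choice of three features are all hypothetical. Real studies use far more text and far more features.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Pre-processing: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word within one text."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]


def burrows_delta(candidates, disputed, n_features=3):
    """Rank a closed set of candidates by Burrows's Delta to a disputed text.

    candidates: dict mapping a name to a known text (hypothetical toy data)
    disputed:   the text whose writership is uncertain
    Returns (name, delta) pairs sorted from closest to farthest.
    """
    cand_tokens = {name: tokenize(t) for name, t in candidates.items()}

    # Features: the most frequent words across all candidate texts.
    pooled = Counter(tok for toks in cand_tokens.values() for tok in toks)
    vocab = [w for w, _ in pooled.most_common(n_features)]

    profiles = {n: relative_freqs(t, vocab) for n, t in cand_tokens.items()}
    target = relative_freqs(tokenize(disputed), vocab)

    # Standardize each feature using its mean and spread over the candidates.
    columns = list(zip(*profiles.values()))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]  # guard against zero spread

    def z(profile):
        return [(x - m) / s for x, m, s in zip(profile, mus, sigmas)]

    z_target = z(target)
    # Delta: mean absolute difference between z-scored profiles.
    scores = {
        name: mean(abs(p - q) for p, q in zip(z(prof), z_target))
        for name, prof in profiles.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1])


# Toy example with invented sentences, far too short for a real study.
known = {
    "A": "the cat sat on the mat and the dog sat by the door",
    "B": "and then she cried and he laughed and they left and the rain fell",
}
mystery = "the bird sat on the wall by the gate and the fox"
for name, delta in burrows_delta(known, mystery):
    print(name, round(delta, 3))
```

Note what the output is: a ranking of stylistic distances within a closed set, not a verdict about who “really” wrote anything. Whichever candidate comes out closest, the method can only ever choose among the names it was given, which is the limitation the *Federalist Papers* case illustrates.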

