graycardinal ([personal profile] graycardinal) wrote in [community profile] victorian221b, 2020-07-19 05:59 pm

On "Shadow Woman": Did ACD not create Sherlock?

Shadow Woman
John Allen (Allen & Allen Semiotics; $5.99 Kindle)

also: Stylometric Analysis of the Sherlock Holmes Canon

A few posts upstream, [personal profile] cimorene dropped a link in the comments to a Web site (and associated ebook) with a highly provocative premise. The author's thesis is straightforward: he argues that Arthur Conan Doyle neither created Sherlock Holmes nor wrote most of the Holmes stories. Rather, Holmes was invented by ACD's first wife, Louise (née Hawkins), who wrote most of the Holmes canon up through "The Final Problem"; after Louise's death, most of the post-Reichenbach stories were penned by ACD's second wife, Jean (née Leckie).

This is, to say the least, a controversial assertion. However, Allen is far from your usual conspiracy theorist - a good deal of his argument is backed by analysis of hard data (and he's honest about the limits of what hard data is available), and the more speculative parts (a) are mostly acknowledged as such, and (b) are extremely interesting in the way they look at the pre- and post-Reichenbach Holmesian canon, and at the differences in tone and character that Holmes scholars have long recognized without ever questioning the matter of authorship.

I'm still in the midst of reading Shadow Woman - which I was persuaded to buy by the analysis in the second referenced document, downloadable as a free PDF from the author's Web site. Stylometric analysis, very briefly, is a technique whereby a set of texts is examined for the presence and frequency of a carefully selected set of core words (mostly "function words" such as prepositions and articles, rather than "content words"), so that comparisons of the resulting frequencies can be used to establish prospective authorship of a body of work. In Allen's case, he ran an analysis on the sixty works of the Holmes canon plus virtually all of the rest of the published work issued over ACD's byline; the technique has also famously been used to address the disputed authorship of several of the Federalist Papers of the late 1780s.
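
(For the non-mathematicians in the audience: the counting step at the heart of the technique is genuinely simple. Here's a toy Python sketch of it - my own illustration, with an invented word list, not anything taken from Allen's paper.)

    # Tally how often each function word appears, normalized per 10,000 words.
    import re
    from collections import Counter

    FUNCTION_WORDS = ["the", "of", "and", "to", "in", "upon", "that", "which"]  # invented list

    def rates_per_10k(text):
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(words)
        total = len(words) or 1
        return {w: 10_000 * counts[w] / total for w in FUNCTION_WORDS}

    # Compute these rates for each candidate author's known work and for the
    # disputed text; the authorship question becomes one of comparing rates.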

I am not enough of a mathematician or linguist to properly defend stylometrics in itself, but Allen's conclusions in that analysis were more than persuasive enough to reel me in for the broader case. (That said, I would be delighted to hear [personal profile] sanguinity's opinions on the chapter in the stylometrics paper that Allen says only real math geeks should actually read.) And Shadow Woman does not disappoint: Allen makes a multi-faceted case, resting on three broad premises. The first is that ACD's own words and actions over his medical and literary career establish him as consistently unreliable where facts are concerned. The second is that the very character of the stories written by Louise - as Allen proposes - differs sharply from ACD's stated worldview (and later Jean's), particularly with respect to racism and depth of content. And the third is that ACD and his younger sons went to a fair amount of trouble later on to suppress or dispose of as much documentation of Louise's life as they could.

Does Allen prove his case? As to the first premise, I am persuaded at the least that Conan Doyle was absolutely capable of shading the truth to his own benefit. I would be happier here if Allen had included a separate bibliography of his sources at the back of the book, but he gives more than enough information for interested Holmesian scholars to check and critique his work in this regard.

As to the second: I find myself intrigued on several levels. Allen, I think, is sharper than he realizes in bringing up the early Rex Stout argument that "Watson Was A Woman" (and the reactions to Stout's pronouncement). It isn't innately surprising that no one at the time leapt to the conclusion that ACD might be passing a woman's writing off as his own. But I find myself a bit startled that no one else in this century, prior to Allen, seems to have realized the implications of the argument as Stout framed it. The discussion of underlying racism (or its opposite) in early and late canon is both timely and telling. And Allen is certainly right that (a) there's a lot of subtext in early Holmes, which (b) doesn't fit at all with ACD's consistent insistence that no, there wasn't, isn't, and never will be.

On the third point: at this distance in time, here's where we get into more traditional conspiracy theory (and specifically, into potential misbehavior on the part of both ACD himself and Adrian Conan Doyle). But at the same time, this is also where Allen relies least on pure speculation - there's an extensive and closely detailed study of an original handwritten version of "A Scandal in Bohemia" in which samples of two distinct sets of handwriting appear. There is very clearly something hinky going on with that manuscript, and in light of the rest of the case Allen makes, his theories on the matter are far from unreasonable.

I am not quite ready to say that I accept everything Allen asserts as gospel; I'm not well-read enough in the real-world biographies to make that judgment at this stage. But I absolutely think he's done a solid job of making a credible case, and one that deserves to be taken seriously by mainstream ACD and Holmesian scholarship.


Because you asked...

[personal profile] sanguinity 2020-07-25 11:14 pm (UTC)(link)
I can't comment on stylometry itself, having no particular expertise in that field. Re the math: a lot of statistics is invented to a purpose, and the word-score and text-score measures described in the paper do have some face validity. In the word score's favour, it gives a larger response when a word's frequency is highly consistent within an author's work (i.e. when an author ALWAYS uses a word fifty times per ten thousand words); also in its favour, it gives a more definitive response when the two authors being compared have very different frequencies for a given word.

However, the measure also misses some properties I'd ideally like to see. For example, I'd like to see it return a value of 1 or -1 when a word's use perfectly matches the frequencies of one author or the other, and then tail off when the text moves away from being a perfect match. Right now, you can have a scenario where H uses a word 100 times and M uses the same word 10 times, whereas your mystery-text uses it 1,000 times -- that would give a very loud response as "similar to H," when it is arguably nothing like either of them. Worse, that very loud "similar to H" response on one word would dwarf the scores of a complete slate of otherwise perfect frequency matches. (Should an outlandishly dissimilar response like the one I just hypothesized be considered stronger evidence than a perfectly similar response when you're adding up word scores to get a text score? I don't think it should. And yet in this third example on this spreadsheet -- "Almost-perfect M with a big outlier in the H direction" -- that's exactly what happens: one outlandish frequency outweighs 19 perfect frequencies, to give an overall text score nearly identical to the spot-on perfect score of the almost-certainly-wrong author.)
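
(To make that concrete with a toy version: the sketch below scores a word count with a Poisson log-likelihood ratio, which is one standard word-score construction in the Mosteller-and-Wallace lineage. I'm not claiming it's Allen's exact formula - but it exhibits exactly the failure mode I just described.)

    # Toy word score: log-likelihood ratio under Poisson models for H and M.
    # Positive => "looks like H"; negative => "looks like M".
    import math

    def word_score(x, rate_h, rate_m):
        return x * math.log(rate_h / rate_m) - (rate_h - rate_m)

    # H uses the word ~100 times per 10k words; M uses it ~10 times.
    print(word_score(100, 100, 10))   # ~ +140: a perfect match to H
    print(word_score(10, 100, 10))    # ~ -67: a perfect match to M
    print(word_score(1000, 100, 10))  # ~ +2213: like NEITHER author, yet it screams "H!"
    # That single outlier (~ +2213) swamps nineteen perfect M-matches (~ -67 apiece).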

Those are my critiques of the measure off the top of my head. I'd like to see what critiques/improvements have appeared in the literature since Mosteller and Wallace first published this test in 1963. (I say that, but I also notice my utter lack of arsedness to look them up myself.)

That's my critique of the measure -- now as to the test. (Um. "Measure" is the actual number you're looking at when you're trying to make a decision. The average height of a random sample of Americans, say, versus the average height of a random sample of Brits: "average height" is the measure. The test is the procedure by which you make the DECISION as to whether Americans or Brits are taller. Sure, the average height is 1cm different, but is that enough to say that one group or the other is actually taller? That decision-making procedure is the test.)

First off, credit where credit is due: it's a nonparametric test. That is, it doesn't make any silly assumptions that word scores/text scores are normally distributed or anything. However, the accuracy procedure does make the assumption that the text scores are as likely to skew big as small, and... they're not. Compare the third and fourth scenarios in the spreadsheet I linked. Because H uses each tested word more frequently than M does, the individual word scores can skew infinitely large in H's direction; unfortunately, they can't skew anywhere near as heavily in M's direction. (Basically, any big number counts toward H, and the more outlandishly big it is, the more outlandishly it counts toward H. Whereas there's NOTHING that counts toward M with similar strength.) This means that, depending on the frequencies of the selected words in the reference sets, you might have a measure that naturally skews toward one reference set or the other. (To be more concrete about this: if Arthur hates prepositional-phrase constructions, and thus his chosen function words tend to always appear at lower rates than Jean's or Louise's, the test will have a tendency to skew toward Jean and Louise.)
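
(Same toy score as above, same invented rates, to show the asymmetry at a glance:)

    # The M-ward side of the toy score has a floor; the H-ward side has no ceiling.
    import math
    score = lambda x: x * math.log(100 / 10) - (100 - 10)  # H rate 100, M rate 10

    print(score(0))     # -90.0: the floor; no count can look MORE like M than this
    print(score(5000))  # ~ +11423: there is no ceiling at all in the H direction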

Which means that whole section about measuring the accuracy of the test by counting how many of the eight possible results would be definitively wrong for a text, and then expressing surprise that the wrong results didn't appear at a random-chance kind of frequency... it's hard to evaluate that claim without knowing the frequencies of each of the twenty words for each of the three authors. Furthermore, Allen's accuracy/validity procedure is inherently unbalanced in that it looks at presumed-Arthur texts more than other kinds of texts, so we can't really get a fair assessment of how it does on each of the three kinds of texts.

(In contrast, the way I was trained to test a model's predictive validity: keep back ten percent of your reference data, build the model with the other ninety percent of your reference data, and then see whether the model correctly predicts authorship on the remaining ten percent. That way, you get performance data on all three bodies of texts, AND you don't have to make assumptions about who met whom when.)
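
(Sketched out, with made-up function names and a made-up data layout, that procedure looks something like this:)

    # Hypothetical hold-out validation: reserve 10% of each author's texts,
    # fit the model on the other 90%, then check the model's attributions
    # on the reserved texts -- for Arthur, Louise, AND Jean alike.
    import random

    def holdout_split(texts_by_author, frac=0.1, seed=1963):
        rng = random.Random(seed)
        train, test = {}, {}
        for author, texts in texts_by_author.items():
            shuffled = texts[:]
            rng.shuffle(shuffled)
            k = max(1, int(len(shuffled) * frac))
            test[author], train[author] = shuffled[:k], shuffled[k:]
        return train, test

    # Fit reference frequencies on `train`, classify everything in `test`,
    # and report a per-author accuracy.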

Second, the test is a bit odd in that it doesn't follow the well-respected null/alternative hypothesis structure: rather than saying "We'll presume Arthur's authorship, unless there's strong evidence otherwise," it says "If there's the scantest WHISPER of more evidence that the authorship is Jean's rather than Arthur's, then we'll say it's definitively Jean's." It's not as bad as it could be on that front -- it has to whisper Jeanward twice to count as definitively Jean -- but it does not ask for a good strong signal before declaring a decision. Basically, the results could be a whole bunch of murk -- but murk that is ALMOST IMPERCEPTIBLY more one thing than another -- and it gives the same result as a clear strong definitive signal.
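
(In toy form -- the margin below is a number I invented, and I've collapsed the two-whispers rule down to a single score for simplicity:)

    # A sign-based rule decides on the faintest whisper; a threshold rule
    # demands a real signal and is willing to say "inconclusive".
    def decide_sign(score):
        return "Jean" if score > 0 else "Arthur"

    def decide_threshold(score, margin=50.0):  # margin invented for illustration
        if score > margin:
            return "Jean"
        if score < -margin:
            return "Arthur"
        return "inconclusive"  # murk stays murk

    print(decide_sign(0.3))       # "Jean" -- from an almost imperceptible whisper
    print(decide_threshold(0.3))  # "inconclusive"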

None of this is to say that the paper's results are worthless and should be tossed. (And the results it gives seem to show something.) But I don't find the paper fully convincing, either. If nothing else, there are a couple of ways this analysis could have gone wrong that I'd like to see addressed.

(But, sadly, not enough to re-run the analysis myself.)


Re: Because you asked...

[personal profile] sanguinity 2020-07-26 05:53 pm (UTC)(link)
Whee, cookies! I like cookies!

Heh, you lucked out in your request -- I've done a bunch of grad work on statistical and predictive modeling, so I'm very familiar with the kinds of questions that need to be asked and answered for this kind of task. I might have been a number theorist, and then what would you have done? :-P

> I remain curious as to how or whether a test like this could be used to compare a half dozen known pastiches to Holmesian canon for "consistency of Watsonian style", but that's clearly a different proposition than the authorship question Allen raises.

Yup. The first task is to figure out exactly what you mean by "consistency of Watsonian style" -- and once you know what you mean by it, then you can sit down and figure out how to measure it. (Which might turn out to be an iterative process, going back and forth between measure and question and test, until you have all three aligned.) This measure and test are something like "Is the pastiche more like Holmes canon or more like (second reference text) in how it uses prepositions and such?" Maybe you could have the second reference text be the pastiche-author's non-pastiche work? "Is The Peerless Peer more like Holmes canon or more like the other Wold Newton books with respect to the frequency with which [list of reference words] (and to a lesser extent, the linguistic structures that can be inferred from those words) are used in the text?" All the critiques I wrote above would still apply -- you'd have to pick your reference words so that exactly ten of them are more frequently used in Holmes canon and the other ten are more frequently used in Wold Newton canon, etc. -- but you could use this test pretty much as written for that, especially if you were doing it just for fun.
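
(A toy version of that pastiche test, reusing the Poisson-ratio word score from my earlier comment, with rates per 10k words that I invented on the spot:)

    # Sum per-word scores for the pastiche against two reference corpora.
    # Positive total => reads more like Holmes canon than like the author's
    # other work; negative => the reverse.
    import math

    def text_score(pastiche_rates, canon_rates, other_rates):
        total = 0.0
        for word, x in pastiche_rates.items():
            total += (x * math.log(canon_rates[word] / other_rates[word])
                      - (canon_rates[word] - other_rates[word]))
        return total

    canon = {"upon": 35.0, "whilst": 2.0}
    other = {"upon": 12.0, "whilst": 0.4}   # the author's non-pastiche work
    pastiche = {"upon": 30.0, "whilst": 1.5}
    print(text_score(pastiche, canon, other))  # ~ +9.9 here: canon-ish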

Yes, exactly: there is some merit to the technique, but room to question the results. If I were designing a test from scratch for this stylometric question, I don't think I'd use this measure and test at all, but I have a lot more computational power at my disposal than Mosteller and Wallace had in 1963. (Not that I feel like putting in the work of actually designing the measure and test -- and that's assuming, of course, that someone else hasn't already invented and published one!)

You're right, that might make an enjoyable LCSS presentation! Certainly would spark an animated discussion in the corridors, I would think.