I can't comment on stylometry itself, having no particular expertise in that field. Re the math: a lot of statistics is invented for a purpose, and the word-score and text-score measures described in the paper do have some face validity. In the word score's favour, it gives a larger response when a word's frequency is highly consistent within an author's work (i.e. when an author ALWAYS uses a word fifty times per ten thousand words); also in its favour, it gives a more definitive response when the two authors being compared have very different frequencies for a given word.
However, the measure also misses some properties I'd ideally like to see. For example, I'd like to see it return a value of 1 or -1 when a word's use perfectly matches the frequencies of one author or the other, and then tail off as the text moves away from being a perfect match. Right now, you can have a scenario where H uses a word 100 times and M uses the same word 10 times, whereas your mystery text uses it 1,000 times -- that would register as a very loud "similar to H" response, when the text is arguably nothing like either of them. Worse, that very loud "similar to H" response on one word would dwarf the scores of a complete slate of otherwise perfect frequency matches. (Should an outlandishly dissimilar response like the one I just hypothesized be considered stronger evidence than a perfectly similar response when you're adding up word scores to get a text score? I don't think it should. And yet in the third example on this spreadsheet -- "Almost-perfect M with a big outlier in the H direction" -- that's exactly what happens: one outlandish frequency outweighs 19 perfect frequencies, giving an overall text score nearly identical to the spot-on perfect score of the almost-certainly-wrong author.)
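(If you want that failure in miniature, here's a toy version in Python. I don't know the paper's exact word-score formula, so I'm standing in a simple linear score -- centred on the midpoint of the two authors' rates, scaled by half the gap between them -- which has the same qualitative shape: a perfect match to H or M lands at +1 or -1, and a wild frequency can sail right past both. It doesn't reproduce the spreadsheet's numbers, just the mechanism.)

    # Hypothetical stand-in for the paper's word score -- NOT the real formula,
    # just something with the same shape: +1 when the text's rate matches H
    # exactly, -1 when it matches M exactly, unbounded beyond either.
    def toy_word_score(x, h_rate, m_rate):
        midpoint = (h_rate + m_rate) / 2
        half_gap = (h_rate - m_rate) / 2
        return (x - midpoint) / half_gap

    # Nineteen words where the mystery text matches M's rate dead on...
    scores = [toy_word_score(10, 100, 10) for _ in range(19)]   # each -1.0

    # ...plus one word the text uses 1,000 times, where H uses it 100 and M 10.
    scores.append(toy_word_score(1000, 100, 10))                # +21.0

    # Text score = sum of word scores: the lone outlier (+21) outweighs
    # nineteen perfect M matches (-19) and drags the verdict H-ward.
    print(sum(scores))   # 2.0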
Those are my critiques of the measure off the top of my head. I'd like to see what critiques/improvements have appeared in the literature since Mosteller and Wallace first published this test in 1963. (I say that, but I also notice my utter lack of arsedness to look them up myself.)
That's my critique of the measure -- now as to the test. (Um. "Measure" is the actual number you're looking at in the hope of making a decision. The average height of a random sample of Americans, say, versus the average height of a random sample of Brits. "Average height" is the measure. The test is the procedure by which you make the DECISION as to whether Americans or Brits are taller. Sure, the average height is 1cm different, but is that enough to say that one group or the other is actually taller? That decision-making procedure is the test.)
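(To make that concrete with the heights example -- nothing to do with the paper's own nonparametric test, this is just the textbook version with a bog-standard t-test and made-up numbers:)

    # The heights example, made concrete. The heights are fabricated; the
    # t-test is just the usual decision procedure for this kind of comparison.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    americans = rng.normal(176.0, 7.0, 200)   # fake sample of heights, in cm
    brits = rng.normal(175.0, 7.0, 200)       # fake sample of heights, in cm

    # The MEASURE: the number you look at.
    print(americans.mean() - brits.mean())    # built to differ by about 1 cm, give or take noise

    # The TEST: the procedure that decides whether that difference is enough
    # to say one group really is taller (here, a p-value from a t-test).
    print(ttest_ind(americans, brits).pvalue)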
First off, credit where credit is due: it's a nonparametric test. That is, it doesn't make any silly assumptions that word scores/text scores are normally distributed or anything. However, the accuracy procedure does make the assumption that the text scores are as likely to skew big as small, and... they're not. Compare the third and fourth scenarios in the spreadsheet I linked. Because H uses each tested word more frequently than M does, the individual word scores can skew infinitely large in H's direction; unfortunately, they can't skew similarly heavily in M's direction. (Basically, any big number counts toward H, and the more outlandishly big it is, the more outlandishly it counts toward H. Whereas there's NOTHING that counts toward M with similar strength.) This means that, depending on the frequencies of the selected words in the reference sets, you might have a measure that naturally skews toward one reference set or the other. (To be more concrete about this: if Arthur hates prepositional-phrase constructions, and thus his chosen function words tend to appear at lower rates than Jean's or Louise's, the test will have a tendency to skew toward Jean and Louise.)
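(Same caveat as before -- the real formula may differ from my toy stand-in -- but the toy makes the asymmetry concrete: a frequency can run off to infinity above H's rate, while it can't drop below zero, so the M-ward tail is capped.)

    # Same hypothetical linear word score as in the earlier sketch, redefined
    # here so this snippet stands alone. H uses the word at 100, M at 10.
    def toy_word_score(x, h_rate, m_rate):
        midpoint = (h_rate + m_rate) / 2
        half_gap = (h_rate - m_rate) / 2
        return (x - midpoint) / half_gap

    h_rate, m_rate = 100, 10

    # H-ward tail: text frequencies can grow without limit, and so can the score.
    for x in (100, 500, 5000):
        print(x, round(toy_word_score(x, h_rate, m_rate), 1))   # 1.0, 9.9, 109.9

    # M-ward tail: frequencies bottom out at zero, so the score bottoms out too.
    print(0, round(toy_word_score(0, h_rate, m_rate), 1))       # -1.2, and that's the floor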
Which means that whole section about measuring the accuracy of the test -- counting how many of the eight possible results would be definitively wrong for a text, then expressing surprise that the wrong results didn't appear at anything like random-chance frequency -- is hard to evaluate without knowing the frequencies of each of the twenty words for each of the three authors. Furthermore, Allen's accuracy/validity procedure is inherently unbalanced in that it looks at presumed-Arthur texts more than other kinds of texts, so we can't really get a fair assessment of how it does on each of the three kinds of texts.
(In contrast, the way I was trained to test a model's predictive validity: keep back ten percent of your reference data, build the model with the other ninety percent of your reference data, and then see whether the model correctly predicts authorship on the remaining ten percent. That way, you get performance data on all three bodies of texts, AND you don't have to make assumptions about who met whom when.)
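(A sketch of that hold-out check, with everything about the data made up -- fake word rates, fake author labels, and a nearest-centroid classifier standing in for whatever model you'd actually build:)

    # Hypothetical sketch of the 90/10 hold-out check described above.
    # Assumes a table of function-word rates per text (rows = texts,
    # columns = the 20 words) plus author labels; all data here is fake.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import NearestCentroid

    rng = np.random.default_rng(0)
    X = rng.poisson(lam=20, size=(300, 20))            # fake word rates for 300 texts
    y = rng.choice(["Arthur", "Jean", "Louise"], 300)  # fake author labels

    # Keep back 10% of the reference texts; build the model on the other 90%.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, stratify=y, random_state=0)

    model = NearestCentroid().fit(X_train, y_train)

    # Did the model get the held-back 10% right, for all three authors?
    # With random fake labels this hovers around chance (~1/3); the point
    # here is the procedure, not the number.
    print((model.predict(X_test) == y_test).mean())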
Second, the test is a bit odd in that it doesn't follow the well-respected null/alternative hypothesis structure: rather than saying "We'll presume Arthur's authorship unless there's strong evidence otherwise," it says "If the evidence tilts even the scantest WHISPER more toward Jean than toward Arthur, we'll say it's definitively Jean's." It's not as bad as it could be on that front -- it has to whisper Jeanward twice to count as definitively Jean -- but it does not ask for a good strong signal before declaring a decision. Basically, the results could be a whole bunch of murk -- but murk that is ALMOST IMPERCEPTIBLY more one thing than another -- and it gives the same result as a clear, strong, definitive signal.
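(Here's the contrast in miniature. I'm guessing at the paper's exact setup, so treat "two scores, positive means Jean-ward" as a placeholder -- the point is the difference between deciding on the bare sign and demanding a real margin:)

    # Placeholder decision rules -- not the paper's actual procedure, just
    # the shape of the complaint: sign-only vs. requiring a clear margin.

    def whisper_rule(score_a, score_b):
        """Decide 'Jean' on the faintest signal: both scores merely above zero."""
        return "Jean" if score_a > 0 and score_b > 0 else "Arthur"

    def threshold_rule(score_a, score_b, threshold=0.5):
        """Demand a clear margin before deciding; otherwise admit the murk."""
        if score_a > threshold and score_b > threshold:
            return "Jean"
        if score_a < -threshold and score_b < -threshold:
            return "Arthur"
        return "too murky to call"

    print(whisper_rule(0.001, 0.002))      # "Jean", on almost nothing
    print(threshold_rule(0.001, 0.002))    # "too murky to call"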
None of this is to say that the paper's results are worthless and should be tossed. (And the results it gives seem to show something.) But I don't find the paper fully convincing, either. If nothing else, there are a couple of ways this analysis could have gone wrong that I'd like to see addressed.
(But, sadly, not enough to re-run the analysis myself.)