I started with a list of 1574 database entries for phylogenetic studies. Each had a free text bibliographic reference and some other information. The job was to get DOIs for as many of these as possible.

Why DOIs? Well, they are likely to be more stable than the URL of the PDF; they are canonical, aiding comparison and database joins; and you can get their metadata in machine-readable form via a simple web service.
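
For the curious, "simple web service" here means DOI content negotiation: you ask the DOI resolver for citation metadata instead of letting it redirect you to the landing page. A minimal sketch of the idea (the endpoint and Accept header are the standard ones, but don't read this as code I actually ran):

```python
import requests

def doi_metadata(doi):
    """Fetch machine-readable metadata for a DOI via content negotiation."""
    resp = requests.get(
        "http://dx.doi.org/" + doi,
        # Ask for CSL JSON rather than the usual redirect to the publisher's landing page.
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    resp.raise_for_status()
    return resp.json()

meta = doi_metadata("10.2307/25065369")
print(meta.get("title"), meta.get("container-title"))
```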

242 of them already had DOIs in their DOI field, and 18 more had DOIs embedded in the reference.
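
Pulling out the embedded ones is easy to script. A rough sketch - the regular expression is a common heuristic for DOIs, not an official grammar, so it may miss oddly formed ones:

```python
import re

# Heuristic: a DOI starts with "10.", a registrant code, then a slash and a suffix.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

def extract_doi(reference_text):
    """Return the first DOI-like string in a free text reference, or None."""
    match = DOI_PATTERN.search(reference_text)
    if not match:
        return None
    # Trim punctuation that often trails a DOI in running text.
    return match.group(0).rstrip('.,;)')

print(extract_doi("Some reference text ... doi:10.2307/25065369."))
```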

Of the 1314 remaining, 1093 had entries in a linked database (Treebase), and of these 400 had DOIs in the corresponding Treebase metadata.
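
The Treebase step amounts to a simple join. As a sketch only - it assumes my entries and the Treebase metadata have both been dumped to CSV files with the (hypothetical) column names shown:

```python
import csv

# Hypothetical layout: treebase.csv has columns "treebase_id" and "doi".
with open("treebase.csv") as f:
    treebase_doi = {row["treebase_id"]: row["doi"]
                    for row in csv.DictReader(f) if row["doi"]}

# Hypothetical layout: entries.csv has columns "entry_id", "treebase_id", "doi".
filled = 0
with open("entries.csv") as f:
    for row in csv.DictReader(f):
        if not row["doi"] and treebase_doi.get(row["treebase_id"]):
            filled += 1  # record treebase_doi[row["treebase_id"]] against this entry
print(filled, "entries gained a DOI from Treebase metadata")
```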

There were 914 remaining. I fed all of these through Crossref's SimpleTextQuery form, which gave me another 592 DOIs. (No, I didn't script this step; I had to do it in chunks of about 100 at a time and manually copy the results from the browser window. I tried the bulk query form, where it emails you back the results of larger-volume requests, but for some unknown reason it didn't work for me. And then I tried to give feedback, and the feedback form was broken...)
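
Had I been able to script it, one obvious route would be Crossref's REST API - a different interface from the SimpleTextQuery form, and not something I actually used. A sketch, with a made-up score threshold:

```python
import requests

def guess_doi(reference_text, min_score=80):
    """Ask the Crossref REST API for the best match to a free text reference."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": reference_text, "rows": 1},
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    # Only trust the match if Crossref's relevance score clears an (arbitrary) threshold.
    if items and items[0].get("score", 0) >= min_score:
        return items[0]["DOI"]
    return None
```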

I was left with the prospect of finding DOIs for the remaining 320 articles by surfing the web. I don't have a surefire method for finding DOIs, but my general method is to find a landing page for the article, and/or the article itself, since you can usually find the DOI there (not always, see below). I variously try Google web search, Google Scholar, and the search functions at the journal publishers' web sites (many of which are quite awful). This is complicated by the fact that journals change hands over time, so you often think you're searching in the right place, only to realize that the source you're plumbing doesn't cover the year in which the article was published.

Of course this is very labor-intensive, so I will put the task aside, to dip into in those idle moments when I might otherwise have done a crossword puzzle (if I were the sort of person who did crossword puzzles).

Crossref conveniently gave me PubMed or PubMed Central IDs for 11 of them, which made DOIs pretty easy to find for 9 of those. (This is sort of odd - Crossref knows PubMed IDs but not DOIs for publications that have DOIs? Interesting.)
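
Going from a PubMed or PubMed Central ID to a DOI is itself scriptable. A sketch using NCBI's ID converter service, which at least covers the PMC side of things (the tool and email parameters are placeholders you're meant to fill in with your own):

```python
import requests

def pmid_to_doi(pmid):
    """Look up the DOI for a PubMed ID via NCBI's ID converter service."""
    resp = requests.get(
        "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/",
        params={"ids": str(pmid), "format": "json",
                "tool": "doi-hunt", "email": "me@example.org"},
    )
    resp.raise_for_status()
    records = resp.json().get("records", [])
    # The converter includes a "doi" field when it knows one.
    return records[0].get("doi") if records else None
```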

An alarming case is http://dx.doi.org/10.2307/25065369, which SimpleTextQuery told me about. The DOI doesn't occur in the article or on either of its two landing pages. So how am I supposed to figure out what it is? In this case I might have been lucky enough to find one of six web pages that link to the article using the DOI (and how did *they* find it?), but how reliable is that?

I would recommend to JSTOR that if they provide a DOI for an article, they should put the DOI on the article's landing page. In fact, they are [[http://www.crossref.org/08downloads/2012/2012_PILA_Membership_Agreement.pdf|contractually bound]] to do so - this is one of the rules of the Crossref game.

I found a problem with the way 90 of the 320 references were written (an extraneous word "null" at the end, an error propagated from Treebase), leading to SimpleTextQuery failure in each case. 4 out of a sample of 10 of these became hits when I deleted the "null" and tried again, which projects to roughly 36 recoverable references out of the 90, so perhaps the number of negatives is more like 280 than 320.
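
The cleanup itself is one line of string handling; a sketch, assuming the references are sitting around as plain strings:

```python
def strip_trailing_null(reference_text):
    """Remove the spurious 'null' that got appended to some Treebase references."""
    cleaned = reference_text.rstrip()
    if cleaned.endswith("null"):
        cleaned = cleaned[:-len("null")].rstrip()
    return cleaned

print(strip_trailing_null("Author, A. 2001. A title. Some Journal 1:1-10. null"))
```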

Anyhow, I took a sample (not random enough, unfortunately) of the SimpleTextQuery misses to estimate how many of the failures are true negatives vs. false negatives. Out of 33 in the sample, 22 were false negatives - their DOIs could be found by scouring the web. So we can project that there will be about 186 false negatives and 93 true negatives.
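
For anyone checking my arithmetic, I read those figures as the sample rate applied to the roughly 280 misses left after the "null" correction; spelled out, with rough rounding:

```python
sample_size = 33
sample_false_negatives = 22
remaining_misses = 280  # the ~320 misses, less the ~36 expected to be rescued by the "null" fix

est_false_negatives = int(remaining_misses * sample_false_negatives / sample_size)                  # about 186
est_true_negatives = int(remaining_misses * (sample_size - sample_false_negatives) / sample_size)   # about 93
print(est_false_negatives, est_true_negatives)
```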

The true negatives were things you'd expect: journals that don't use DOIs (or didn't at the time), book chapters, conference proceedings, and articles that don't seem to be online at all.

In any case I was a bit surprised that SimpleTextQuery did so poorly (the projected 186 false negatives come to about 12% of my total original set of 1574, and about 20% of the 914 queries I gave it). But the problem of parsing and matching free text bibliographic references is nontrivial, and I'm thankful for what I got.

Disclaimer: This whole report should be taken as suggestive - I'm just giving data on experience trying to get something done, not on a planned experiment. Many things might be wrong with the process, such as mistakes in the way references were written, sample biases of various sorts, lack of statistical strength, various shortcuts I took, and so on. You shouldn't generalize from here and say that X% of academic articles have DOIs, or that SimpleTextQuery gives the wrong answer Y% of the time. This is just what happened in this particular case.