Information extraction is evaluated by two measures--recall and precision. Recall is a measure of completeness, precision of correctness. When you promise to tell the whole truth, you are promising 100% recall. When you promise to tell nothing but the truth, you are promising 100% precision.
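In the usual scoring of extraction output, each extracted item is counted as correct, spurious (extracted but not in the answer key), or missing (in the answer key but not extracted); the official MUC scoring is more elaborate, but as a sketch under this simplified accounting,
\[
\text{recall} = \frac{\text{correct}}{\text{correct} + \text{missing}}, \qquad
\text{precision} = \frac{\text{correct}}{\text{correct} + \text{spurious}}.
\]
Promising the whole truth means leaving nothing missing; promising nothing but the truth means producing nothing spurious.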
In the Message Understanding Conference (MUC) evaluations of the 1990s, systems performing name recognition achieved about 95% recall and precision, which is nearly human-level performance, and achieved very much faster than a human could do it. In event recognition, performance plateaued at about 60% recall and precision.
There are several possible reasons for this. One lies in the process of merging: our analysis of our results showed that merging was implicated in a majority of our errors, and we need better ways of doing event and relationship coreference. A second possibility is that 60% is roughly how much information texts ``wear on their sleeves''. Current technology can only extract what is explicit in texts; getting the rest of the information requires inference. A third possibility is that the distribution of linguistic phenomena simply has a very long tail. Handling the most common phenomena gets you to 60% relatively quickly; getting to 100% then requires handling increasingly rare phenomena. A month's work gets you to 60%. Another year's work gets you to 65%. A fourth possibility is that errors multiply. If you can recognize an entity with 90% accuracy, and recognizing a clause-level pattern requires recognizing four entities, then the accuracy of pattern recognition should be about $0.9^4$, or roughly 65%.
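As a small worked version of the fourth possibility, suppose each entity is recognized correctly with probability $p$ and, idealizing, that the recognitions needed for a clause-level pattern succeed or fail independently. For a pattern requiring four entities,
\[
\text{accuracy} \approx p^{4} = 0.9^{4} \approx 0.66,
\]
which is in the neighborhood of the observed 60% plateau.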
This raises the interesting question of what utility there is in a 60% technology. Obviously you would not be happy with a bank statement that is 60% accurate. On the other hand, 60% accuracy in web search would be a distinct improvement. It is best to split this question into two parts--recall and precision.
If you have 60% recall, you are missing 40% of the mentions of relevant information. But there are half a million biomedical articles a year, and keeping up with them requires massive curatorial effort; 60% recall is an improvement if you would otherwise have access to much less. Moreover, recall is measured not on facts but on mentions of facts; if a fact is mentioned multiple times, there are multiple opportunities to capture it.
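To see why multiple mentions help, suppose as a rough idealization that each mention of a fact is captured independently with per-mention recall $r$. A fact mentioned $k$ times is then captured with probability
\[
1 - (1 - r)^{k},
\]
so with $r = 0.6$ a fact mentioned three times is captured with probability $1 - 0.4^{3} \approx 0.94$. Recall measured on facts can therefore be substantially higher than recall measured on mentions.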
With 60% precision in a fully automatic system, 40% of the information in your database will be wrong, so you need a human in the loop. This is not necessarily a disaster: a person extracting sparse information from a massive corpus will have a much easier time discarding the 40% of entries that are wrong than locating and entering the other 60% by hand. Good tools would help in this as well. In addition, it may be that the usage of language in biomedical text is tightly enough constrained that precision will be higher than in the domains that have so far been the focus of efforts in information extraction.