Michael Fagan undertook a review of various entity and term extraction tools this past weekend.
While the use case (content types, goals, etc.) of the test is not clear, we are heartened to see folks beginning to look at the growing variety of tools and services in the space.
You can read the full posting - as well as a commentary from our Tom Tague - here: http://faganm.com/blog/2010/01/02/1009/
Tom's commentary is also excerpted below:
Michael:
Tom Tague from OpenCalais here.
Wow - you covered a lot of territory here and I could probably spend at least as much space responding to some of the points you've raised - but I’ll try to stay focused on just a few key items.
First of course is the use case - which you don’t reveal. And of course it’s difficult to evaluate tools without the intended use well understood. Are you “tagging” news? Analyzing a large corpus of documents for network effects?
For example - if your use case is simply entity extraction, then the volume of entities is rarely the goal - but rather a mixture of recall and precision. You can have perfect recall and low precision - or the reverse. The goal is to find the appropriate balance. If your use case requires named (i.e. typed) entities, then tools such as Yahoo should not be in the mix - they are term and not entity extraction engines.
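To make the recall/precision trade-off concrete, here is a minimal Python sketch. The entity sets and the `precision_recall` helper are hypothetical, made up for illustration; they are not output from any of the tools discussed.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for a collection of extracted entities."""
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    # Precision: what share of the tool's output is correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: what share of the correct entities the tool found.
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical hand-labelled entities for one document.
expected = {"Bank of America", "Thomson Reuters", "Yahoo"}

# A high-volume tool: finds every expected entity, but half its output is noise.
noisy = {"Bank of America", "Thomson Reuters", "Yahoo", "bank", "news", "story"}

# A conservative tool: everything it returns is right, but it misses two entities.
cautious = {"Bank of America"}

print(precision_recall(noisy, expected))     # perfect recall, precision 0.5
print(precision_recall(cautious, expected))  # perfect precision, recall 1/3
```

Counting entities alone would rank the noisy tool first, which is why volume by itself says little about quality.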
Facts & Events: I was also a little surprised that you stopped at entity/term extraction. Most real-world use cases want to understand what’s happening in the text and are heavily dependent on facts and events - an area in which OpenCalais shines. Facts and events reveal the relationships between entities, and make up the core elements of “aboutness,” which are key values / benefits that many use cases for semantic technology seek to derive.
Semantic Links: It appears you missed our connection to the Linked Data Cloud in your “Semantic Links” section. For a growing number of entities that we return, we also return links to a rapidly growing set of Calais-provided, Thomson Reuters information assets that follow the semantic Linked Data standard. These dynamically generated pages also provide relevant ‘sameAs’ links to key resources in the Linking Open Data cloud.
To see these, open our demonstration tool at http://viewer.opencalais.com, copy and paste in a news story that features a number of company names, hit submit, and view the extracted entities, facts and events in the left-hand rail. Then expand the companies and click on one to find the Calais asset. (For instance, see the Bank of America asset here: http://d.opencalais.com/er/company/ralg-tr1r/e80e12df-622c-3c3e-86dc-a3ffdcc39e25.html) Traditionally, we have also included sameAs links to DBPedia and Freebase, and those will be back. Right now we are adapting to a new format.
RDF: While the general developer population (yourself included, as you note) may lack familiarity with RDF at this time, developers who work with large textual content sets are moving to learn it now. A variety of alternative representations are available, but RDF is the W3C standard and provides the right transport layer for rich knowledge representation. The text/simple format you chose is designed to support simple tagging / entity extraction use cases and leaves much of the extracted richness behind.
Length: OpenCalais supports entry of text documents up to 100K in length. We’ve found this supports the vast majority of our users well while conserving system resources.
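For documents that exceed the limit, one option is to split the text before submission. The sketch below is only an illustration: it assumes "100K" means 100,000 characters (the unit is not stated here), and `chunk_text` is a hypothetical helper, not part of any OpenCalais client.

```python
MAX_CHARS = 100_000  # assumption: "100K" read as 100,000 characters


def chunk_text(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters,
    breaking at blank-line paragraph boundaries where possible."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # A single paragraph longer than the limit is hard-split.
        pieces = [para[i:i + limit] for i in range(0, len(para), limit)] or [""]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= limit:
                current = candidate      # still fits: keep accumulating
            else:
                chunks.append(current)   # flush and start a new chunk
                current = piece
    if current:
        chunks.append(current)
    return chunks


# Small limit used here only to keep the demonstration readable.
print(chunk_text("aaaa\n\nbbbb\n\ncc", limit=10))
```

Splitting at paragraph boundaries keeps sentences intact, though relationships that span chunks may no longer be detected together.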
Usage: We welcome the use of OpenCalais for commercial or non-commercial purposes and allow users to submit up to 50,000 documents per day at no charge. After that we need to discuss some sort of value exchange.
We could probably name several other tools - but you absolutely should include Zemanta in any entity extraction test case.
Again - thanks for putting in the time to compare tools. I’d encourage you to come back and revisit the subject in the future.
Regards,
Tom Tague, OpenCalais Initiative Lead
(@TomTague)
- KristaThomas's blog





