Need some help getting going with Calais
Posted on: Mon, 07/04/2011 - 13:19
Hey all,
I'm investigating the use of Open Calais to simplify the process of adding meaningful metadata to our news stories. As an experiment (and to help us with other manual tasks later), I've whipped together a short Ruby script that takes my collection of 800-odd archived stories (in HTML) and passes them through Open Calais. The goal is to dump out an HTML page with a table containing a few columns of immediately interesting information (story name, date, categories, social tags, companies, people, maybe a few others) so that the guys who actually wrote the articles can validate the results quickly.
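To make the intended report concrete, here is a minimal sketch of the table-building step. The `Story` struct, its field names, and the column titles are assumptions made for illustration; they are not the actual script or the Open Calais response format.

```ruby
# Hypothetical record for one processed story; field names are assumptions.
Story = Struct.new(:name, :date, :categories, :social_tags, :companies, :people)

# Column title => Story field shown in that column.
COLUMNS = {
  "Story"       => :name,
  "Date"        => :date,
  "Categories"  => :categories,
  "Social tags" => :social_tags,
  "Companies"   => :companies,
  "People"      => :people
}.freeze

HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Escape the characters that would otherwise break the generated markup.
def escape_html(text)
  text.to_s.gsub(/[&<>"]/, HTML_ESCAPES)
end

# Render one cell; list-valued fields are joined with commas.
def cell(value)
  text = value.is_a?(Array) ? value.join(", ") : value.to_s
  "<td>#{escape_html(text)}</td>"
end

# Build the whole table: one header row, then one row per story.
def report_table(stories)
  header = COLUMNS.keys.map { |title| "<th>#{escape_html(title)}</th>" }.join
  rows = stories.map do |story|
    "<tr>" + COLUMNS.values.map { |field| cell(story[field]) }.join + "</tr>"
  end
  "<table>\n<tr>#{header}</tr>\n#{rows.join("\n")}\n</table>"
end
```

Writing the result of `report_table(stories)` into a file gives the reviewers a single page to scan.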
I need some help understanding a few things:
- I'm assuming this is allowed, provided I'm not doing more than 4 requests per second
- If I paste the URIs returned in the response into a browser, the pages are typically 404. Should I simply treat these as exceedingly precise, but otherwise uninteresting, identifiers?
- What should I do when I encounter errors? For example, it confused a reference to "King Edward Street" with a position of "king". This one I can ignore since it is irrelevant, but elsewhere it confused Fort St. John, BC with St. John, NB, so if I actually care to use that information, what do I need to do to correct it?
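For the first question, one way to stay under a per-second ceiling is to space the calls out on the client side. The sketch below assumes the limit can be respected by leaving at least 1/4 of a second between submissions; the `Throttle` class and the `submit` call in the usage note are hypothetical names, not part of any Open Calais client.

```ruby
# Client-side throttle: makes sure calls to #wait are at least
# 1/per_second seconds apart. Clock and sleeper are injectable so the
# behaviour can be checked without really sleeping.
class Throttle
  def initialize(per_second,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @interval  = 1.0 / per_second
    @clock     = clock
    @sleeper   = sleeper
    @next_slot = nil # earliest time the next call may proceed
  end

  # Blocks until the next slot is free, then reserves the following one.
  def wait
    now = @clock.call
    if @next_slot && now < @next_slot
      @sleeper.call(@next_slot - now)
      now = @next_slot
    end
    @next_slot = now + @interval
  end
end
```

Usage would look like `throttle = Throttle.new(4)` followed by `throttle.wait` before each `submit(story)`, where `submit` stands in for whatever actually posts the document.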
Thanks!

Hello FranSan,
I managed to get my script working and things are fine with one exception.
Open Calais misclassifies a lot of content. Those are the errors I meant; I understand the error codes that come back from the API.
For example, running my collection of 800 or so stories through Open Calais, I came up with several references to the Bank of Canada (not surprising, they're all business articles). Leaving aside the problems in the content itself, there are two separate entities for the Bank of Canada that were returned:
http://d.opencalais.com/genericHasher-1/186f69b0-b837-3764-8a94-2ccc0c9ae646.html --> which is an organization
and
http://d.opencalais.com/genericHasher-1/186f69b0-b837-3764-8a94-2ccc0c9ae646.html --> which is a CITY but clearly wrong (there is no city called Bank of Canada that I am aware of, and it is never referred to in our stories if there is).
Given that I want to rely on Open Calais to tag the stories written by my reporters, getting correct entities back is very important. So knowing how I can take information I've gathered and corrected manually and providing the information back to Open Calais would be VERY valuable.
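Cases like the Bank of Canada one can at least be surfaced automatically before anyone reviews them by hand. Here is a rough sketch that flags any name returned under more than one entity type. The `Entity` struct and the type strings are assumptions for illustration, not the actual response schema.

```ruby
# Hypothetical flattened view of one entity from a response.
Entity = Struct.new(:name, :type, :uri)

# Returns a hash of normalised name => sorted list of types, keeping only
# the names that were reported under more than one type.
def conflicting_types(entities)
  entities
    .group_by { |entity| entity.name.strip.downcase }
    .transform_values { |group| group.map(&:type).uniq.sort }
    .select { |_name, types| types.length > 1 }
end
```

Running this over the entities collected from all the stories gives a short list of suspicious names to hand to the reporters. It only catches type conflicts, so a mix-up such as Fort St. John versus St. John would still need a human eye.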
thanks for the help!
Adam van den Hoven.
Good morning Glacier_Media,
Glad to hear that you're up and running with OpenCalais.
As for the Bank of Canada, it certainly looks like a bug, unless the current financial crisis has forced it to reinvent itself as a city (doubtful).
I'd like to refer your problem to a technical team member. Could you send us the problem document(s) - that's usually the first request the technical team member has so that she/he can replicate the problem. Use the email questions@opencalais.com.
Regards,
Hello Glacier Media,
The 4/sec processing applies to the maximum number of docs that the server will begin *processing* in any second. Some documents are more complicated than others; we have not specified the actual turnaround time because of varying document complexity, length, etc.
The limit on *submissions* (50,000/day) pertains to the number of documents the user submits.
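As a quick sanity check of what these two figures mean for a batch, here is a back-of-the-envelope sketch. The constant and method names are made up for illustration, and the time it reports is only a lower bound on submission time, since actual turnaround depends on document complexity.

```ruby
MAX_PER_SECOND = 4       # documents the server begins processing per second
MAX_PER_DAY    = 50_000  # documents a user may submit per day

# Rough plan for submitting doc_count documents under both limits.
def batch_plan(doc_count)
  {
    fits_daily_limit: doc_count <= MAX_PER_DAY,
    min_seconds:      doc_count.fdiv(MAX_PER_SECOND).ceil,
    days_needed:      doc_count.fdiv(MAX_PER_DAY).ceil
  }
end
```

For a collection of 800 stories, `batch_plan(800)` reports that the batch fits in one day and needs at least 200 seconds of submission time.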
You'll find error messages listed at the following URL:
http://www.opencalais.com/documentation/calais-web-service-api/error-messages
I also suggest that you look at the FAQs.
Hope this helps.