Hello,
I noticed in the docs of the Calais Pipes service that Calais Pipes retrieves the full content of the item and sends this to OpenCalais rather than just the description text that was in the feed.
I want to do something similar, except using the SOAP API rather than the Yahoo Pipes service. My question is how you are doing this, as there can be lots of other content in the retrieved HTML page (navigation, recent posts, etc.) besides the article itself. Are you doing a lot of intelligent processing that is only available through the Pipes service, which I would then need to replicate in order to use the SOAP API? Or would the SOAP API be intelligent enough to do this if I just passed you the full HTML retrieved from the link in the RSS?

Hi,
Yes, we do intelligent processing on the HTML page. Our next version will include an improved version of that processing.
The OpenCalais API will process the HTML better if you specify the contentType as HTML.
Ofer
Hi Ofer,
We've tried downloading the HTML of the item linked from the RSS and passing it as TEXT/HTML to OpenCalais. One issue we've found is that for blog posts that show comments, OpenCalais will tag words that appear in the comments, which can give a misleading picture of what the content is about. For example, there may be a story on problems with New York City's educational system, and someone will write a long comment about how things work in Seattle, which in some cases can cause Seattle to get a higher relevancy score. Ideally, we'd want to exclude comments entirely.
Would the next version deal with problems like this? Also, any idea when it will come out?
JP
Hi,
Yes. We will handle it better in the next version late in August. Stay tuned.
Thanks,
Ofer
Ok, great! Just one more thing to consider, if you're not doing this already: in the following page, "San Jose" matches with relevancy 1.0 for several reasons:
1. "San Jose Mercury News" resulted in San Jose city matches
2. "San Jose" in the navigation bar for selecting your city matched
3. "San Jose" matched when displaying the local weather
4. "San Jose" appeared in the headlines of other stories being promoted on this page (at least at the time I write this)
The page is here: http://www.mercurynews.com/green/ci_9793548?source=rss
We'll hack around this and/or avoid troublesome feeds for now and look forward to your next release, as well as the day when you get content providers to mark up their own content ;)
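For anyone attempting the same workaround, here is a rough sketch of the kind of pre-filtering we have in mind before submitting a page. It is only an illustration, not how OpenCalais or Calais Pipes does it: the tag list, the id/class hints ("comment", "nav", "weather", etc.) and the function name extract_main_text are all our own guesses and would need tuning per feed. It also assumes reasonably well-formed HTML; pages with unclosed tags inside the skipped regions will confuse the depth counting.

```python
from html.parser import HTMLParser

# Hypothetical heuristics (assumptions, not anything OpenCalais documents).
SKIP_TAGS = {"script", "style", "nav", "footer", "aside", "form"}
SKIP_HINTS = ("comment", "nav", "sidebar", "weather", "related")
# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MainTextExtractor(HTMLParser):
    """Collects text, skipping elements that look like page furniture."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def _is_boilerplate(self, tag, attrs):
        if tag in SKIP_TAGS:
            return True
        for name, value in attrs:
            if name in ("id", "class") and value:
                lowered = value.lower()
                if any(hint in lowered for hint in SKIP_HINTS):
                    return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1          # nested element inside a skipped one
        elif self._is_boilerplate(tag, attrs):
            self.skip_depth = 1           # start skipping at this element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html):
    """Return the text of the page with likely boilerplate removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        '<html><body>'
        '<div id="nav"><a href="/sj">San Jose</a><br></div>'
        '<div class="story"><p>New York City schools face problems.</p></div>'
        '<div class="comments"><p>In Seattle things work differently.</p></div>'
        '</body></html>'
    )
    print(extract_main_text(sample))
```

The output of extract_main_text could then be submitted as plain text instead of TEXT/HTML. Note that this does nothing for case 1 above (the publication name itself matching as a city); that would need separate handling.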