User offline. Last seen 26 weeks 4 days ago. Offline
Joined: 08/25/2008

Submitting a fragment that contains some HTML (but is not well-formed) without specifying a contentType seems to result in an unsupported-language error. I get similar results if I submit the fragment with contentType="text/raw", whereas if I submit it with contentType="text/html" or "text/htmlraw" I get the results I want. An example fragment follows:

-----------------------

STAR-TELEGRAM/MAX FAULKNER<br />Cowboys Stadium in Arlington at sundown

 STAR-TELEGRAM/MAX FAULKNER
------------------------

Note that changing the <br /> to a space and submitting without specifying contentType also gives good results.

Should I conclude that letting Calais auto-detect the content type (by not setting contentType in the XML params) will produce errors such as unsupported language for documents that contain HTML but are not well-formed? If so, is the only way to get Calais to process such documents for my code to detect the presence of HTML and set contentType accordingly? In other words, is Calais' auto-detect feature only useful when any HTML documents are well-formed?
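
For reference, the client-side workaround I have in mind would look roughly like the sketch below. It is only a sketch: guess_content_type is a hypothetical helper of my own, not part of any Calais client, and the regular expression is a rough heuristic for "something that looks like an HTML tag", not a real HTML parser. The only Calais-specific assumption is that "text/html" and "text/raw" are the contentType values to choose between, as described above.

```python
import re

# Heuristic: matches anything shaped like an HTML tag, e.g. <br />, <p>, </div>.
# This is an assumption about what counts as "contains HTML", not a parser.
_TAG_RE = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(\s[^<>]*)?/?>")


def guess_content_type(text: str) -> str:
    """Return "text/html" if the text contains a tag-like token, else "text/raw"."""
    return "text/html" if _TAG_RE.search(text) else "text/raw"


fragment = "STAR-TELEGRAM/MAX FAULKNER<br />Cowboys Stadium in Arlington at sundown"

# Choose the contentType to send explicitly instead of relying on auto-detection.
print(guess_content_type(fragment))
print(guess_content_type(fragment.replace("<br />", " ")))
```

The value returned would then be set as contentType in the params before submitting, so the fragment never goes through auto-detection at all.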


Will Calais' automatic detection of contentType (when it is not specified in the params) be adjusted to assume "text/html" rather than "text/raw" when it encounters HTML tags like <br />? For now, Calais apparently treats the document as "text/raw" and interprets the <br /> as coming from an unsupported language. Could someone indicate whether I should expect it to stay that way, or whether the automatic detection will change to handle such cases?

Thanks,

John