Input Content
Language
OpenCalais currently supports content in English, French and Spanish. OpenCalais applies a Language Identification module before processing the text for entities, events and facts. This module might fail to recognize the language if the submitted text is short.
If the submitted content is less than 100 characters long and the language cannot be recognized, OpenCalais assumes the language is English by default, and the English module processes the text for entities, events and facts. In such cases, OpenCalais also returns "Input Text Too Short" as the language code in the RDF.
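The fallback rule above can be sketched as a small function. This is only an illustration of the described behavior, not OpenCalais code: the name resolve_language, the use of None for "not recognized", and the (None, None) result for longer unrecognized text are assumptions, since the text above does not say what happens in that last case.

```python
from typing import Optional, Tuple

SHORT_TEXT_LIMIT = 100  # threshold stated in the text above
SUPPORTED_LANGUAGES = {"English", "French", "Spanish"}


def resolve_language(text: str, detected: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (processing_language, reported_language_code).

    `detected` is the hypothetical output of a language identifier,
    or None when the language could not be recognized.
    """
    if detected in SUPPORTED_LANGUAGES:
        return detected, detected
    if len(text) < SHORT_TEXT_LIMIT:
        # Short, unrecognized input: processed as English,
        # reported as "Input Text Too Short".
        return "English", "Input Text Too Short"
    # Behavior for longer unrecognized text is not described in the source.
    return None, None
```

For example, resolve_language("Hola", None) yields ("English", "Input Text Too Short"), while a recognized language is passed through unchanged.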
HTML Cleaning
A document coded in HTML usually contains much more information than what is perceived as “new and relevant textual content”. This information might be:
- The HTML tags
- Various metadata
- Browsing menus
- Advertisements
- Links to other service providers like Facebook or Digg
- Links to other HTML pages on the same site that are not part of the main content frame.
- General information about the hosting site, including copyright notice, etc.
To optimize extraction and categorization results (or the results of any other semantic algorithm operating on the text), the text should be as free as possible of the objects listed above.
The HTML cleaner aims to clean the text as much as possible of these objects, leaving only a basic HTML structure that contains only the new and relevant data. It does so using both HTML-structural and machine-learning-based heuristics.
Format
OpenCalais supports four formats of content: TEXT/HTML, TEXT/HTMLRAW, TEXT/XML and TEXT/RAW.
When no content type is specified, Calais attempts to auto-detect the type (one of: TEXT/XML, TEXT/HTML or TEXT/RAW).
Note: TEXT/TXT is no longer used. If this type is given, it is processed as though the contentType value were TEXT/RAW, and the results appear as TEXT/RAW.
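A client could mirror these rules before submitting content. The sketch below is a hypothetical helper, not part of any OpenCalais library: the name normalize_content_type, the case-insensitive matching, and the choice to raise an error for unknown values are all assumptions made for illustration.

```python
from typing import Optional

SUPPORTED_TYPES = {"TEXT/HTML", "TEXT/HTMLRAW", "TEXT/XML", "TEXT/RAW"}


def normalize_content_type(content_type: Optional[str]) -> Optional[str]:
    """Map a requested contentType to the value that will actually be used.

    None means "not specified", in which case the type is auto-detected
    by the service; this helper simply passes None through.
    """
    if content_type is None:
        return None
    value = content_type.strip().upper()  # case-insensitivity is an assumption
    if value == "TEXT/TXT":
        # Retired type: treated as TEXT/RAW.
        return "TEXT/RAW"
    if value not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported contentType: {content_type!r}")
    return value
```

Passing "TEXT/TXT" returns "TEXT/RAW", which matches the note above.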
TEXT/HTML: The submitted content is not converted, but irrelevant text content, HTML tags and scripts are removed. Entity and event detection is relative to the cleansed text. For optimal results, use this contentType when submitting HTML content.
TEXT/HTMLRAW: Applies OpenCalais' legacy converter with limited cleansing, removing only HTML tags and scripts. This is equivalent to TEXT/HTML in previous versions.
TEXT/XML: Applies the XML converter, which escapes the necessary characters; entity and event detection is therefore relative to the cleansed text. For optimal results, use this contentType when submitting XML content. The XML converter also supports the NewsML standard.
If the content is submitted as TEXT/XML, OpenCalais will process the following XML nodes:
| Document Section | Supported XML Tag Names |
| Document Title | TITLE, HEADLINE, HEADER |
| Document Body | BODY, DESCRIPTION, CONTENT |
| Document Date | DATE, DATETIME, DATEANDTIME, PUBDATE |
Document Title and Document Body should contain the content that will be processed by Calais.
Document Date is important: once detected, it is used to resolve relative date mentions (e.g., "yesterday") when such mentions appear in Calais's events and facts. If Document Date is not provided, relative dates are resolved against the current date.
For optimal extraction of metadata, please make sure your XML content conforms to these tag names.
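As an illustration, a document using the supported tag names could be assembled with Python's standard library as follows. This is a minimal sketch: the root element name DOCUMENT and the ISO date format are assumptions, because the text above lists only the supported child tags and does not specify a root element or a date format.

```python
import xml.etree.ElementTree as ET
from datetime import date


def build_document(title: str, body: str, doc_date: date) -> str:
    """Build an XML string with TITLE, BODY and DATE nodes."""
    root = ET.Element("DOCUMENT")  # root name is hypothetical
    ET.SubElement(root, "TITLE").text = title
    ET.SubElement(root, "BODY").text = body
    # ISO 8601 is used here only as an example date format.
    ET.SubElement(root, "DATE").text = doc_date.isoformat()
    # ElementTree escapes characters such as & and < in text nodes.
    return ET.tostring(root, encoding="unicode")


xml_content = build_document(
    "Mergers & acquisitions update",
    "The deal was announced yesterday.",
    date(2010, 1, 15),
)
```

With a DATE node present, a mention such as "yesterday" in the body can be resolved against the document date rather than the current date.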
TEXT/RAW: No conversion of the submitted content; entity and event detection (offset/length) will match the submitted content exactly. You can use this contentType when submitting plain text. Note that this is the only contentType option that works exactly on the submitted input content without modifying/cleansing it at all.
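Because TEXT/RAW offsets match the submitted content exactly, a reported offset and length can be used to slice the original string directly. The sketch below assumes a hypothetical mention record with offset and length fields; the values are computed locally for illustration and are not real OpenCalais output.

```python
def mention_text(submitted: str, offset: int, length: int) -> str:
    """Return the substring that a mention's offset/length points to."""
    return submitted[offset:offset + length]


submitted = "Alice Smith joined Example Corp yesterday."

# Hypothetical mention, built here only to demonstrate the slicing.
name = "Example Corp"
mention = {"offset": submitted.find(name), "length": len(name)}

extracted = mention_text(submitted, mention["offset"], mention["length"])
```

With the other content types, the same slice would have to be taken from the cleansed text rather than from the submitted input.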
