RDF
This page gives general guidance on interpreting the Calais response in RDF.
For extracted metadata elements (Entities, Events and Facts) the RDF includes the following artifacts:
| RDF Artifact | Description |
|---|---|
| Element Description | For each unique entity, event or fact, the description includes the type (for example, Company, Person, Acquisition), attribute values, and an ID (hash) of this unique element. If the Relevance feature is not turned off, the description also includes the relevance score for this unique entity. When an attribute value is referred to by its ID (hash) and not as a literal string, it includes a comment containing the actual value for easier readability. |
Examples:
The entity Company for "ClearForest Ltd." may look like this:
<rdf:Description rdf:about="http://d.opencalais.com/comphash-1/9dd2192a-4cd2-3b9a-ac2f-b6a0d1fed773">
<rdf:type rdf:resource="http://s.opencalais.com/1/type/em/e/Company"/>
<c:name>ClearForest Ltd.</c:name>
</rdf:Description>
The event Acquisition between "Reuters" and "ClearForest Ltd." may look like this:
<rdf:Description rdf:about="http://d.opencalais.com/genericHasher-1/e83cd693-2146-32a2-b1fe-c4a73615dbf0">
<rdf:type rdf:resource="http://s.opencalais.com/1/type/em/r/Acquisition"/>
<!--Reuters-->
<c:company_acquirer rdf:resource="http://d.opencalais.com/comphash-1/48344864-ce62-3064-ae05-a3b41fab186c"/>
<!--ClearForest Ltd.-->
<c:company_beingacquired rdf:resource="http://d.opencalais.com/comphash-1/9dd2192a-4cd2-3b9a-ac2f-b6a0d1fed773"/>
<c:status>planned</c:status>
</rdf:Description>
| RDF Artifact | Description |
|---|---|
| Element Instances | One or more individual instances (mentions) for each unique metadata element. Each element instance includes the Doc ID, the ID (hash) of the unique element, the detection of the instance (snippet of the input content where the metadata element was identified), a 'prefix' tag (snippet of the input content that precedes the current instance), an 'exact' tag (snippet of the input content in the matched portion of text), a 'suffix' (snippet of the input content that follows the current instance), and the offset and length values of the instance (relative to the input content after it has been converted into XML). Note that for readability purposes each hash reference is followed by a comment that shows the actual element value. |
Examples:
An instance for the unique company ClearForest Ltd. may look like this:
<rdf:Description rdf:about="http://d.opencalais.com/dochash-1/37042f2e-8a72-30c7-afe2-2710efcb4d5f/Instance/11">
<rdf:type rdf:resource="http://s.opencalais.com/1/type/sys/InstanceInfo"/>
<c:docId rdf:resource="http://d.opencalais.com/dochash-1/37042f2e-8a72-30c7-afe2-2710efcb4d5f"/>
<c:subject rdf:resource="http://d.opencalais.com/comphash-1/9dd2192a-4cd2-3b9a-ac2f-b6a0d1fed773"/>
<!--Company: ClearForest Ltd.-->
<c:detection>[Reuters to acquire text search firm ]ClearForest[ &lt;/TITLE&gt;
&lt;DATE&gt; Mon Apr 30, 2007 7:00am EDT]</c:detection>
<c:offset>55</c:offset>
<c:length>11</c:length>
</rdf:Description>
An instance for the unique Acquisition element between "Reuters" and "ClearForest Ltd." may look like this:
<rdf:Description rdf:about="http://d.opencalais.com/dochash-1/37042f2e-8a72-30c7-afe2-2710efcb4d5f/Instance/22">
<rdf:type rdf:resource="http://s.opencalais.com/1/type/sys/InstanceInfo"/>
<c:docId rdf:resource="http://d.opencalais.com/dochash-1/37042f2e-8a72-30c7-afe2-2710efcb4d5f"/>
<c:subject rdf:resource="http://d.opencalais.com/genericHasher-1/e83cd693-2146-32a2-b1fe-c4a73615dbf0"/>
<!--Acquisition: company_acquirer: Reuters; company_beingacquired: ClearForest Ltd.; status: planned; -->
<c:detection>[&lt;DOCUMENT&gt;
&lt;TITLE&gt; ]Reuters to acquire text search firm ClearForest[ &lt;/TITLE&gt;
&lt;DATE&gt; Mon Apr 30, 2007 7:00am EDT]</c:detection>
<c:offset>19</c:offset>
<c:length>47</c:length>
</rdf:Description>
In addition, the RDF response includes general document and transaction information, such as the document language, submission date and time, and request ID. It also includes the input content after it has been converted into valid XML for the actual processing by the Calais backend server (except for the TEXT/RAW option).
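The relationship between element descriptions and element instances can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the `rdf:RDF` wrapper element and the namespace URI bound to the `c:` prefix are not shown in the examples above and are assumed here, and `group_instances` is a hypothetical helper name. The sample data is trimmed from the ClearForest examples above.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
C_NS = "http://s.opencalais.com/1/pred/"  # assumption: namespace bound to the c: prefix

# Trimmed sample built from the examples above; the rdf:RDF wrapper is assumed.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:c="http://s.opencalais.com/1/pred/">
<rdf:Description rdf:about="http://d.opencalais.com/comphash-1/9dd2192a-4cd2-3b9a-ac2f-b6a0d1fed773">
<rdf:type rdf:resource="http://s.opencalais.com/1/type/em/e/Company"/>
<c:name>ClearForest Ltd.</c:name>
</rdf:Description>
<rdf:Description rdf:about="http://d.opencalais.com/dochash-1/37042f2e-8a72-30c7-afe2-2710efcb4d5f/Instance/11">
<rdf:type rdf:resource="http://s.opencalais.com/1/type/sys/InstanceInfo"/>
<c:docId rdf:resource="http://d.opencalais.com/dochash-1/37042f2e-8a72-30c7-afe2-2710efcb4d5f"/>
<c:subject rdf:resource="http://d.opencalais.com/comphash-1/9dd2192a-4cd2-3b9a-ac2f-b6a0d1fed773"/>
<c:offset>55</c:offset>
<c:length>11</c:length>
</rdf:Description>
</rdf:RDF>"""


def q(ns, name):
    """Build the {namespace}name form that ElementTree expects."""
    return "{%s}%s" % (ns, name)


def group_instances(rdf_text):
    """Map each unique element URI to its type, name and (offset, length) mentions."""
    root = ET.fromstring(rdf_text)
    elements = {}
    instances = []
    for node in root.findall(q(RDF_NS, "Description")):
        type_el = node.find(q(RDF_NS, "type"))
        if type_el is None:
            continue
        type_uri = type_el.get(q(RDF_NS, "resource"), "")
        if type_uri.endswith("/sys/InstanceInfo"):
            # An instance points back to its unique element through c:subject.
            subject = node.find(q(C_NS, "subject")).get(q(RDF_NS, "resource"))
            offset = int(node.findtext(q(C_NS, "offset")))
            length = int(node.findtext(q(C_NS, "length")))
            instances.append((subject, offset, length))
        elif "/type/em/" in type_uri:
            # Entities (em/e) and events or facts (em/r) are the unique elements.
            elements[node.get(q(RDF_NS, "about"))] = {
                "type": type_uri.rsplit("/", 1)[-1],
                "name": node.findtext(q(C_NS, "name")),
                "mentions": [],
            }
    for subject, offset, length in instances:
        if subject in elements:
            elements[subject]["mentions"].append((offset, length))
    return elements
```

Calling `group_instances(SAMPLE)` returns a dictionary keyed by the element URI, so every mention of ClearForest Ltd. ends up listed under the single Company entry.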
There are several changes and enhancements related to the RDF output format. Users asked for an option to exclude the original body from the returned RDF output. OpenCalais now supports this option through a new parameter in the processing directives section of paramsXML:
c:omitOutputtingOriginalText="TRUE"
By default, the original body is returned in any RDF response.
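As a rough illustration, the attribute could be set when building paramsXML programmatically. This is a sketch only: the element names `c:params` and `c:processingDirectives`, the namespace URI bound to `c:`, and the helper name `build_params_xml` are assumptions made for this example; only the `c:omitOutputtingOriginalText="TRUE"` attribute comes from the text above.

```python
import xml.etree.ElementTree as ET

C_NS = "http://s.opencalais.com/1/pred/"  # assumption: namespace bound to the c: prefix
ET.register_namespace("c", C_NS)


def build_params_xml(omit_original_text=True):
    """Return a minimal paramsXML string (element names are assumed)."""
    params = ET.Element("{%s}params" % C_NS)
    directives = ET.SubElement(params, "{%s}processingDirectives" % C_NS)
    if omit_original_text:
        # Ask Calais to leave the original body out of the RDF response.
        directives.set("{%s}omitOutputtingOriginalText" % C_NS, "TRUE")
    return ET.tostring(params, encoding="unicode")
```

A real request would normally carry further directives (content type, output format and so on) alongside this attribute; they are left out here to keep the sketch focused.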
Another change in RDF output is that the RDF header, which includes a summary of all entities extracted from the text, is now sorted alphabetically based on the entity type (the same sorting used in Simple Format).
Lastly, disambiguation results are integrated in RDF output in the following manner.
For Companies: Resolution nodes are added to the output RDF. Each such node contains the following information:
- c:subject - URI of the referred company entity. A resolution node may contain multiple subject properties, one for each company entity that was resolved to this single company.
- c:docId - URI of the document in which this resolution was created.
- c:score - a score representing the certainty with which the company was resolved.
- c:name - formal English legal name of the resolved company.
- c:ticker - the company's ticker symbol.
The RDF example below shows a company entity (top RDF node) and the respective resolution node.
<rdf:Description rdf:about="http://d.opencalais.com/comphash-1/64136b2b-cb4e-36ac-9f32-f58f4c1f1c8a">
<rdf:type rdf:resource="http://s.opencalais.com/1/type/em/e/Company" />
<c:name>British Airways</c:name>
<c:nationality>British</c:nationality>
</rdf:Description>
<rdf:Description rdf:about="http://d.opencalais.com/er/company/ralg-tr1r/58ad4ecb-2df0-3d46-8333-2d25dcb364d9">
<rdf:type rdf:resource="http://s.opencalais.com/1/type/er/Company" />
<c:docId rdf:resource="http://d.opencalais.com/dochash-1/88096fc6-9ea2-3c9f-a0a0-c29a0a5fdced" />
<c:subject rdf:resource="http://d.opencalais.com/comphash-1/64136b2b-cb4e-36ac-9f32-f58f4c1f1c8a" />
<c:score>1.0</c:score>
<c:name>British Airways PLC</c:name>
<c:ticker>BAY</c:ticker>
</rdf:Description>
For Geographies: Resolution nodes are added to the output RDF. Each such node contains the ID of the referred entity (a city, province, state or country found in the input) in its "c:subject" property, the resolved name in its "c:name" property, the latitude of the location in its "c:lat" property and the longitude of the location in its "c:long" property.
The RDF example below shows a city entity (top RDF node) and the respective resolution node.
<rdf:Description rdf:about="http://d.opencalais.com/genericHasher-1/96e9e28b-f95c-3f9c-a374-b3bcfbc02cfd">
<rdf:type rdf:resource="http://s.opencalais.com/1/type/em/e/City"/>
<c:name>Golden</c:name>
</rdf:Description>
<rdf:Description rdf:about="http://d.opencalais.com/er/geo/ralg-geo1/e3f5b88c-f2f2-6e4f-7e2c-f0452221c341">
<rdf:type rdf:resource="http://s.opencalais.com/1/type/er/Geo"/>
<c:docId rdf:resource="http://d.opencalais.com/dochash-1/3508bef0-f669-3dec-829c-d3344507f857"/>
<!--Golden-->
<c:subject rdf:resource="http://d.opencalais.com/genericHasher-1/96e9e28b-f95c-3f9c-a374-b3bcfbc02cfd"/>
<c:name>Golden,Colorado,United States</c:name>
<c:lat>39.7556</c:lat>
<c:long>-105.2206</c:long>
</rdf:Description>
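One way to consume the company and geography resolution nodes is to index them by the entity URIs listed in their `c:subject` properties. The Python sketch below shows this under stated assumptions: the `rdf:RDF` wrapper and the namespace URI for `c:` are assumed rather than taken from the examples, and `index_resolutions` is a hypothetical helper name. The sample holds only the two resolution nodes shown above.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
C_NS = "http://s.opencalais.com/1/pred/"  # assumption: namespace bound to the c: prefix

# The two resolution nodes from the examples above; the rdf:RDF wrapper is assumed.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:c="http://s.opencalais.com/1/pred/">
<rdf:Description rdf:about="http://d.opencalais.com/er/company/ralg-tr1r/58ad4ecb-2df0-3d46-8333-2d25dcb364d9">
<rdf:type rdf:resource="http://s.opencalais.com/1/type/er/Company"/>
<c:subject rdf:resource="http://d.opencalais.com/comphash-1/64136b2b-cb4e-36ac-9f32-f58f4c1f1c8a"/>
<c:score>1.0</c:score>
<c:name>British Airways PLC</c:name>
<c:ticker>BAY</c:ticker>
</rdf:Description>
<rdf:Description rdf:about="http://d.opencalais.com/er/geo/ralg-geo1/e3f5b88c-f2f2-6e4f-7e2c-f0452221c341">
<rdf:type rdf:resource="http://s.opencalais.com/1/type/er/Geo"/>
<c:subject rdf:resource="http://d.opencalais.com/genericHasher-1/96e9e28b-f95c-3f9c-a374-b3bcfbc02cfd"/>
<c:name>Golden,Colorado,United States</c:name>
<c:lat>39.7556</c:lat>
<c:long>-105.2206</c:long>
</rdf:Description>
</rdf:RDF>"""


def q(ns, name):
    """Build the {namespace}name form that ElementTree expects."""
    return "{%s}%s" % (ns, name)


def index_resolutions(rdf_text):
    """Map each referred entity URI to the data of its resolution node."""
    root = ET.fromstring(rdf_text)
    resolved = {}
    for node in root.findall(q(RDF_NS, "Description")):
        type_el = node.find(q(RDF_NS, "type"))
        if type_el is None:
            continue
        type_uri = type_el.get(q(RDF_NS, "resource"), "")
        if "/type/er/" not in type_uri:
            continue  # not a resolution node
        info = {
            "kind": type_uri.rsplit("/", 1)[-1],
            "name": node.findtext(q(C_NS, "name")),
        }
        ticker = node.findtext(q(C_NS, "ticker"))
        if ticker is not None:
            info["ticker"] = ticker
        for field in ("score", "lat", "long"):
            value = node.findtext(q(C_NS, field))
            if value is not None:
                info[field] = float(value)
        # A node may list several subjects; each one resolves to the same info.
        for subject in node.findall(q(C_NS, "subject")):
            resolved[subject.get(q(RDF_NS, "resource"))] = info
    return resolved
```

Looking up the URI of the "Golden" city entity in the returned dictionary then yields the resolved name together with its latitude and longitude.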
A timeout of 20 seconds is applied if large input content is submitted to Calais. However, instead of dropping the transaction, Calais returns the metadata extracted so far and indicates that a timeout occurred for the submitted content. The message is as follows:
<c:message>
<rdf:Description>
<rdf:type rdf:resource="http://s.opencalais.com/1/type/sys/Message" />
<c:messageCode>201</c:messageCode>
<c:text>Partial metadata extraction due to timeout </c:text>
</rdf:Description>
</c:message>
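A client may want to check for this message before treating the results as complete. The following Python sketch shows one possible check. It assumes the message sits somewhere inside an `rdf:RDF` wrapper and that `c:` is bound to the namespace URI used below; neither detail is confirmed by the example above, and `find_messages` and `timed_out` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
C_NS = "http://s.opencalais.com/1/pred/"  # assumption: namespace bound to the c: prefix

# The timeout message from above, placed in an assumed rdf:RDF wrapper.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:c="http://s.opencalais.com/1/pred/">
<c:message>
<rdf:Description>
<rdf:type rdf:resource="http://s.opencalais.com/1/type/sys/Message"/>
<c:messageCode>201</c:messageCode>
<c:text>Partial metadata extraction due to timeout </c:text>
</rdf:Description>
</c:message>
</rdf:RDF>"""


def q(ns, name):
    """Build the {namespace}name form that ElementTree expects."""
    return "{%s}%s" % (ns, name)


def find_messages(rdf_text):
    """Return (code, text) pairs for every sys/Message node in the response."""
    root = ET.fromstring(rdf_text)
    messages = []
    # iter() walks the whole tree, so the nesting depth of the message does not matter.
    for node in root.iter(q(RDF_NS, "Description")):
        type_el = node.find(q(RDF_NS, "type"))
        if type_el is None:
            continue
        if not type_el.get(q(RDF_NS, "resource"), "").endswith("/sys/Message"):
            continue
        code = node.findtext(q(C_NS, "messageCode"))
        if code is None:
            continue
        text = (node.findtext(q(C_NS, "text")) or "").strip()
        messages.append((int(code), text))
    return messages


def timed_out(rdf_text):
    """True if the response carries message code 201 (partial extraction)."""
    return any(code == 201 for code, _ in find_messages(rdf_text))
```

When `timed_out` returns true, the metadata in the response is still usable but covers only the portion of the content processed before the timeout.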
Examples
Attached are two example files: input in the form of a TEXT/XML document submitted to Calais, and its resulting RDF output file.
| Attachment | Size |
|---|---|
| RDF-input_08Oct30.txt | 3.79 KB |
| RDF-output_08Nov09.rdf | 100.17 KB |
