
Hello, I am processing some texts with enlighten called via Perl thusly:
$calais->enlighten($buffer, contentType => 'TEXT/TXT', outputFormat => "application/json", enableMetadataType => 'GenericRelations,SocialTags');
The socialTags response in some cases contains unprintable characters even though the input text is plain ASCII. One example is a text containing "fatwa." The name value of the socialTag returned looks like this, broken down into bytes:

Char            Decimal Byte Value
F               70
a               97
t               116
w               119
Ã               195
„               132
Â               194
(unprintable)   129

The associated references are:
id: http://d.opencalais.com/dochash-1/50093719-3bea-3028-9bf8-1dde9d930d65/SocialTag/5
socialTag: http://d.opencalais.com/genericHasher-1/461dee2e-0177-3b46-9d10-384939c504f
I suppose the last four bytes could represent two Unicode characters, but I don't know why that would be, given that there is only one more letter in the word, the first "a" was encoded fine, and the original text was sent in plain ASCII.
Does anyone know what is going on here?
Thanks,
Steve
 


Hi Steve,
Open Calais uses UTF-8 encoding for Unicode characters. Since UTF-8 is backward compatible with ASCII (see more at http://en.wikipedia.org/wiki/UTF-8), ASCII characters appear as their usual single-byte values, while non-ASCII characters appear as multi-byte sequences.

To preserve character encoding when sending requests (in case your text has Unicode characters) and when interpreting responses (when the Open Calais response has Unicode characters), it's best to use UTF-8 encoding. In Java this is done by specifying the encoding for the input and output streams used to communicate with the web service; it should be similar for Perl.
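
As a rough illustration of the point above, here is a minimal Python sketch that decodes the byte values quoted in the question as UTF-8. The function names are hypothetical. The second step is only an assumption: the four trailing bytes are consistent with text that was UTF-8 encoded twice somewhere along the way, and nothing in this thread confirms where that would have happened. The same byte-level steps should carry over to Perl.

```python
# Byte values reported in the question for the socialTag name.
RAW = bytes([70, 97, 116, 119, 195, 132, 194, 129])


def decode_once(raw: bytes) -> str:
    """Decode the raw bytes as UTF-8, as suggested in the reply."""
    return raw.decode("utf-8")


def undo_double_encoding(text: str) -> str:
    """Assumption: the text was UTF-8 encoded twice.

    Map each code point back to one byte (Latin-1), then decode the
    resulting bytes as UTF-8 a second time.
    """
    return text.encode("latin-1").decode("utf-8")


def code_points(text: str) -> list:
    """List code points as U+XXXX strings so the output is ASCII-safe."""
    return [f"U+{ord(ch):04X}" for ch in text]


once = decode_once(RAW)
twice = undo_double_encoding(once)

print(code_points(once))   # code points after a single UTF-8 decode
print(code_points(twice))  # code points after undoing the assumed second encoding
```

After a single decode, the four trailing bytes become two code points (U+00C4 and U+0081). If the double-encoding assumption holds, the second step collapses them into the single character U+0101 (a with macron), which would make the tag a five-letter word ending in one non-ASCII letter.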

sumit