User offline. Last seen 26 weeks 3 days ago. Offline
Joined: 06/17/2009

Hi,
I am trying to extract entities from articles written in Spanish and French, both of which use accented (non-English) characters. In all of my tests, the accents are not processed correctly.
I don't know what I am doing wrong.
For test purposes I used this sentence:  Édouard Manet was a French painter. It is written in English but contains a French accent. I tried the query with both the URL-encoded sentence and the unencoded sentence, without success.
The query returns in both cases:
�douard Manet was a French painter.
So only the surname "Manet" is extracted, but not the first name "Édouard". It might be a character encoding issue, but how can I fix this?
Thanks in advance,
Maxime.

Joined: 06/17/2009

Well, it seems to be a problem on my side... Sorry.

Best regards,
Maxime.

Joined: 06/17/2009

Well, actually it worked just for a couple of days!

I don't know why, but recently (maybe since the release of the new API version) the characters returned by the API are encoded in a way I don't understand.
For example, the entity "Sporting de Gijón" (a Spanish football club) is extracted correctly, but I get it in this form:
Sporting de Gij\x{FFFD}n

I don't understand: I changed nothing in my code, and it used to work well.

What is this encoding?

Thank you,
Maxime.
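
A note on `\x{FFFD}`: this is an escape notation (used by Perl, for instance) for U+FFFD, the Unicode replacement character, which decoders emit when they meet bytes that are not valid in the expected encoding. The Java sketch below shows one way it can appear, assuming the text was encoded as ISO-8859-1 somewhere along the way and then decoded as UTF-8. Whether that is what happened in this thread is not known, and the class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: reproduces a U+FFFD by decoding with the wrong charset.
class ReplacementCharDemo {

    // Encode text as ISO-8859-1 bytes, then (wrongly) decode them as UTF-8.
    static String misdecode(String text) {
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1Bytes, StandardCharsets.UTF_8);
    }

    // Encode and decode with the same charset: the text survives intact.
    static String roundTrip(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Sporting de Gij\u00F3n"; // \u00F3 is the accented o
        String broken = misdecode(original);
        // The lone byte 0xF3 is not valid UTF-8, so the decoder substitutes U+FFFD.
        System.out.println("contains U+FFFD: " + (broken.indexOf('\uFFFD') >= 0));
        System.out.println("round trip intact: " + roundTrip(original).equals(original));
    }
}
```

Once a character has been replaced by U+FFFD the original byte is gone, so the fix has to be made at the point where the bytes are decoded, not afterwards.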

Joined: 06/17/2009

Thanks for your answer, it seems to work now.

Best regards,
Maxime.

User offline. Last seen 1 year 27 weeks ago. Offline
Joined: 12/15/2008

Hi Maxime,
Please make sure that the content you send to Open Calais is passed with UTF-8 encoding. For example, if you are using the HTTP REST API from Java, the content should be URL encoded using UTF-8.
Here's some sample code:
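// Requires: import java.net.URLEncoder;
// Note: URLEncoder.encode(String, String) throws UnsupportedEncodingException.
StringBuilder sb = new StringBuilder(documentBody.length() + 1024);
sb.append("licenseID=").append(licenseId);
// Percent-encode the document as UTF-8 bytes so accented characters survive.
sb.append("&content=").append(URLEncoder.encode(documentBody, "UTF-8"));
if (paramsXML != null) {
    sb.append("&paramsXML=").append(URLEncoder.encode(paramsXML, "UTF-8"));
}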

This should solve the extraction issue.

sumit
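
As a quick way to check what is actually being sent, the sketch below wraps the sample above in a small class and compares the UTF-8 encoding of "España" with the ISO-8859-1 one. The class and method names are made up for illustration, and `licenseID`, `content` and `paramsXML` are simply the parameter names from the sample. Encoding the license ID as well is an extra precaution, not something the sample does.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative helper, not part of any Open Calais client library.
class CalaisFormBody {

    // Percent-encodes a value using the named charset.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("Unsupported charset: " + charsetName, e);
        }
    }

    // Builds a form-urlencoded request body with every value encoded as UTF-8.
    static String build(String licenseId, String documentBody, String paramsXML) {
        StringBuilder sb = new StringBuilder(documentBody.length() + 1024);
        sb.append("licenseID=").append(encode(licenseId, "UTF-8"));
        sb.append("&content=").append(encode(documentBody, "UTF-8"));
        if (paramsXML != null) {
            sb.append("&paramsXML=").append(encode(paramsXML, "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00F1 (n with tilde) is two bytes in UTF-8 but one byte in ISO-8859-1.
        System.out.println(encode("Espa\u00F1a", "UTF-8"));      // Espa%C3%B1a
        System.out.println(encode("Espa\u00F1a", "ISO-8859-1")); // Espa%F1a
        // "my-license-id" is a placeholder, not a real key.
        System.out.println(build("my-license-id", "\u00C9douard Manet", null));
    }
}
```

If the request body shows `%F1` where `%C3%B1` is expected, the content was not encoded as UTF-8 before being sent.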

Joined: 06/17/2009

Hi Sumit,

I was trying to improve our use of OpenCalais and I realized that we are still facing this problem, even when the content is URL encoded. For example:

-> java.net.URLEncoder.encode("España es el mejor país del mundo","UTF8");
no entity found.

-> java.net.URLEncoder.encode("Espana es el mejor país del mundo","UTF8");
country: espana

The first spelling is the correct one, but it was not extracted, I guess because of the accent.

Regards,
Maxime.

Joined: 06/17/2009

More investigation:

It seems the word "España" is not extracted from short texts, but it is extracted from longer ones.

That could be the explanation.

Maxime.

User offline. Last seen 2 days 14 hours ago. Offline
Joined: 04/30/2008

Hello Tamax,

Open Calais generally performs better when it receives more than a single brief sentence as input. Users have more success when input is at least 100 characters. See http://www.opencalais.com/documentation/calais-web-service-api/forming-api-calls/input-content in the documentation for more info.

Regards,
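
One way to act on the 100-character guideline is to check the length before calling the service. The sketch below is only an illustration: the class name, the method name, and the choice to count Unicode code points are assumptions, and the threshold is the figure quoted above, not a documented hard limit of the API.

```java
// Illustrative pre-flight check; not part of the Open Calais API.
class InputLengthCheck {

    // Guideline figure taken from the reply above.
    static final int SUGGESTED_MIN_CHARS = 100;

    // Returns true when the text has at least SUGGESTED_MIN_CHARS code points.
    static boolean isLongEnough(String text) {
        if (text == null) {
            return false;
        }
        return text.codePointCount(0, text.length()) >= SUGGESTED_MIN_CHARS;
    }

    public static void main(String[] args) {
        // The 33-character test sentence from earlier in the thread.
        String shortText = "Espa\u00F1a es el mejor pa\u00EDs del mundo";
        System.out.println("long enough: " + isLongEnough(shortText));
    }
}
```

A short test sentence such as the one above falls under the threshold, which would fit the observation that "España" was only picked up in longer texts.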