Non-English accents?
Posted on: Fri, 01/15/2010 - 11:45
Hi,
I am trying to extract entities from articles written in Spanish and French, both of which use accented characters. In all of my tests the accents are not processed correctly.
I don't know what I am doing wrong.
I used this sentence for testing: Édouard Manet was a French painter. It is written in English but contains a French accent. I tried the query with both the URL-encoded sentence and the unencoded sentence, without success.
The query returns in both cases:
�douard Manet was a French painter.
So only the surname "Manet" is extracted, but not the first name "Édouard". It might be a character encoding issue, but how can I fix this?
Thanks in advance,
Maxime.

Well, it seems to be a problem on my side... Sorry.
Best regards,
Maxime.
Well, actually it worked just for a couple of days!
I don't know why, but recently (maybe since the release of the new API version) the characters returned by the API are encoded in a way I don't understand.
For example, the entity "Sporting de Gijón" (Spanish football club) is extracted correctly, but I get it this way:
Sporting de Gij\x{FFFD}n
I don't understand: I changed nothing in my code, and it used to work well.
What is this encoding?
Thank you,
Maxime.
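For what it's worth, \x{FFFD} is most likely not a separate encoding but U+FFFD, the Unicode replacement character, which decoders substitute for bytes they cannot interpret. Here is a minimal Java sketch of how it can appear, assuming Latin-1 bytes are decoded as UTF-8 somewhere along the way. That is a guess about the cause, not something confirmed in this thread, and the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any Open Calais client library.
final class ReplacementCharDemo {

    // Encodes the text as ISO-8859-1 and then (wrongly) decodes those bytes as UTF-8.
    // Assumption: this kind of charset mismatch is what produced the U+FFFD above.
    static String misdecode(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u00F3 is the accented "o" in the club name.
        String garbled = misdecode("Sporting de Gij\u00F3n");
        // The lone Latin-1 byte 0xF3 is not valid UTF-8, so it becomes U+FFFD.
        System.out.println(garbled.equals("Sporting de Gij\uFFFDn")); // prints "true"
    }
}
```

If both sides read and write the same charset (UTF-8 throughout), the accented character should survive the round trip.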
Thanks for your answer, it seems to work now.
Best regards,
Maxime.
Hi Maxime,
Please make sure that the content you send to Open Calais is UTF-8 encoded. For example, if you are using the HTTP REST API from Java, the content should be URL-encoded using UTF-8.
Here's some sample code:
StringBuilder sb = new StringBuilder(documentBody.length() + 1024);
sb.append("licenseID=").append(licenseId);
sb.append("&content=").append(URLEncoder.encode(documentBody, "UTF-8"));
if (paramsXML != null) {
    sb.append("&paramsXML=").append(URLEncoder.encode(paramsXML, "UTF-8"));
}
This should solve the extraction issue.
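For reference, here is the same idea as a self-contained sketch that can be run on its own. The class name, method names and the license value are hypothetical; only the parameter names licenseID, content and paramsXML come from the snippet above. It assumes Java 10 or later for the Charset overload of URLEncoder.encode:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of any Open Calais client library.
final class CalaisFormBody {

    // URL-encodes one value as UTF-8 so accented characters survive the request.
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Builds the form body using the parameter names from the snippet above.
    static String build(String licenseId, String documentBody, String paramsXML) {
        StringBuilder sb = new StringBuilder(documentBody.length() + 1024);
        sb.append("licenseID=").append(encode(licenseId));
        sb.append("&content=").append(encode(documentBody));
        if (paramsXML != null) {
            sb.append("&paramsXML=").append(encode(paramsXML));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "my-license-id" is a placeholder; \u00C9 is the accented capital E.
        String body = build("my-license-id", "\u00C9douard Manet was a French painter.", null);
        System.out.println(body);
        // prints "licenseID=my-license-id&content=%C3%89douard+Manet+was+a+French+painter."
    }
}
```

The accented first letter becomes %C3%89, which is its two-byte UTF-8 form. Seeing a single %C9 there instead would suggest the text was encoded as Latin-1.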
sumit
Hi Sumit,
I was trying to improve our use of OpenCalais and I realized that we are still facing this problem, even when the content is URL-encoded. For example:
-> java.net.URLEncoder.encode("España es el mejor país del mundo","UTF8");
no entity found.
-> java.net.URLEncoder.encode("Espana es el mejor país del mundo","UTF8");
country: espana
The correct spelling is the first one, but (I guess) because of the accent it was not extracted.
Regards,
Maxime.
After more investigation:
it seems the word "España" is not extracted in short texts, but it is in longer texts.
That could be the explanation.
Maxime.
Hello Tamax,
Open Calais generally performs better when it receives more than a single brief sentence as input. Users have more success when input is at least 100 characters. See http://www.opencalais.com/documentation/calais-web-service-api/forming-api-calls/input-content in the documentation for more info.
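As a rough sketch, a client could flag very short input before sending it. The class and method names below are hypothetical, and the 100-character threshold is only the rule of thumb mentioned above, not an enforced limit:

```java
// Hypothetical client-side check; not part of any Open Calais library.
final class InputLengthCheck {

    // Rule of thumb from the guidance above, not a documented hard limit.
    static final int SUGGESTED_MIN_CHARS = 100;

    // Returns true when the content has fewer characters (code points) than the suggested minimum.
    static boolean isLikelyTooShort(String content) {
        return content.codePointCount(0, content.length()) < SUGGESTED_MIN_CHARS;
    }

    public static void main(String[] args) {
        // The short test sentence from earlier in the thread (\u00F1 and \u00ED are the accented letters).
        String shortText = "Espa\u00F1a es el mejor pa\u00EDs del mundo";
        System.out.println(isLikelyTooShort(shortText)); // prints "true"
    }
}
```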
Regards,