
Hey, 
After some more testing, I've noticed that your entity recognition system seems to rely on capitalization in the text. For example, for "Lady Gaga Bikes on the Street"* the entity "Lady Bikes" (Person) is extracted. When I lowercase that sentence, no entity is extracted.
Could you confirm this? Does the system take into account "alternative" names of entities (available via DBpedia)? What is the source for entities (Wikipedia?)? How often is this reference set updated?
Thanks,
 
Yvo
* = I understand this example also raises the issue of alternative names for entities, but I also see the same behaviour every time an entity has persontype = unknown.


Hi yvoschapp,
Calais uses natural language processing to extract entities from text. In other words, given a piece of text, it parses the sentences, determines the part of speech of each word, and then applies linguistic rules to decide whether something is a person, a company, etc. Sometimes it uses lexicons along with the rules to help determine the entity.
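
To make the "rules plus lexicon" idea concrete, here is a deliberately tiny sketch in Python. It is not how Calais is implemented: the capitalization rule, the `PERSON_TITLES` lexicon and the `extract_entities` function are all hypothetical and only illustrate why casing matters to this kind of approach.

```python
import re

# Hypothetical mini-lexicon of honorifics; not taken from Calais.
PERSON_TITLES = {"Mr", "Mrs", "Ms", "Dr", "Lady", "Sir"}

def extract_entities(text):
    """Toy extractor: each run of capitalized tokens is a candidate,
    and a lexicon hit on its first token labels the run as Person."""
    tokens = re.findall(r"[A-Za-z]+", text)
    entities, run = [], []
    for token in tokens + [""]:  # empty sentinel flushes the last run
        if token[:1].isupper():
            run.append(token)
            continue
        if run:
            label = "Person" if run[0] in PERSON_TITLES else "Unknown"
            entities.append((" ".join(run), label))
            run = []
    return entities

print(extract_entities("Lady Gaga Bikes on the Street"))
print(extract_entities("lady gaga bikes on the street"))
```

With the headline-style sentence, this toy rule groups every capitalized word into one candidate, and with the lowercased sentence it finds no candidates at all. A real system uses far richer linguistic rules, but it depends on the same kind of signal.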

When you submit lower-case text, part-of-speech recognition breaks. For example, it cannot tell whether a given word or phrase is a proper noun, and hence cannot tell whether the phrase is a person (as in your case) or a company. Text submitted with all words in upper case will have a similar problem. Regular sentence case is ideal.
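
If you want to catch problematic input before submitting it, a simple client-side check along these lines may help. This is only a sketch: `case_profile` and `needs_case_review` are made-up helper names, not part of any Calais API, and the check only flags text that is entirely lower case or entirely upper case.

```python
def case_profile(text):
    """Classify the casing of a text as 'lower', 'upper' or 'mixed'."""
    if text.islower():   # every cased character is lower case
        return "lower"
    if text.isupper():   # every cased character is upper case
        return "upper"
    return "mixed"       # sentence case, headline case, or no letters

def needs_case_review(text):
    """True when the text is all lower case or all upper case."""
    return case_profile(text) != "mixed"

for sample in ("lady gaga bikes on the street",
               "LADY GAGA BIKES ON THE STREET",
               "Lady Gaga bikes on the street."):
    print(case_profile(sample), needs_case_review(sample))
```

Note that headline-style text, where every word is capitalized, is reported as "mixed" here, so it would need a separate check.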

Regarding your question about updating reference sets: as mentioned above, Calais doesn't rely completely on lexicons or lists from Wikipedia or other sources; it sometimes uses them in combination with rules to enhance extraction. The lexicon files that we have are updated with every release.

sumit