Entity recognition
Posted on: Thu, 07/01/2010 - 11:24
Hey,
after some more testing, I've noticed that your entity recognition system relies on capitalization in the text. For example, from "Lady Gaga Bikes on the Street"* the entity "Lady Bikes" (Person) is extracted. When I lowercase that sentence, no entity is extracted.
Could you confirm this? Does the system take into account "alternative" names of entities (available via DBpedia)? What is the source for entities (Wikipedia?), and how often is this reference set updated?
Thanks,
Yvo
* = I understand this example also raises the issue of alternative names of entities, but I also see it every time entities have persontype = unknown.

Hi yvoschapp,
Calais uses natural language processing to extract entities from text. In other words, given a piece of text, it parses the sentences, determines the part of speech of each word, and then applies linguistic rules to decide whether something is a person, a company, etc. Sometimes it uses lexicons along with the rules to help determine the entity.
When you submit lowercase text, part-of-speech recognition breaks down: for example, it cannot tell whether a given word or phrase is a proper noun, and hence cannot tell whether the phrase is a person (as in your case) or a company. Text submitted with all words in upper case has a similar problem. Regular sentence case is ideal.
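To make the casing point concrete, here is a rough Python sketch. It is not how Calais works internally; `classify_case` and `capitalized_runs` are hypothetical helpers, and the regular expression is only a toy stand-in for a real part-of-speech tagger. It shows how a capitalization cue disappears once the text is all lower case or all upper case, and how you might check your text before submitting it.

```python
import re
from typing import List


def classify_case(text: str) -> str:
    """Return 'lower', 'upper', or 'mixed' based on the letters in text."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "mixed"
    if all(c.islower() for c in letters):
        return "lower"
    if all(c.isupper() for c in letters):
        return "upper"
    return "mixed"


def capitalized_runs(text: str) -> List[str]:
    """Toy proper-noun cue: runs of consecutive Capitalized words.

    Assumption: this only imitates the kind of signal a tagger gets
    from casing; a real system uses far richer rules and lexicons.
    """
    return re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b", text)


sentence = "Lady Gaga rides a bike in Paris"
print(capitalized_runs(sentence))          # cue present in sentence case
print(capitalized_runs(sentence.lower()))  # cue lost in lower case
print(capitalized_runs(sentence.upper()))  # cue lost in upper case
print(classify_case(sentence.lower()))     # flags text worth re-casing
```

With a title-case headline such as "Lady Gaga Bikes on the Street", the same toy cue lumps "Lady Gaga Bikes" into one run, which hints at why headline-style capitalization can also confuse extraction.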
Regarding your question about updating reference sets: as mentioned above, Calais doesn't rely completely on lexicons or lists from Wikipedia or other sources; it sometimes uses them in combination with rules to enhance extraction. The lexicon files that we have are updated with every release.
sumit