Joined: 06/08/2008

Hi,

I would like to know how Calais "learns" new things. For instance, it recognizes "Chelsea Clinton" as a person, but it knows "Chelsea" neither as a district of London nor as a football (soccer) club (yet).

I imagine that there's a staff of people working on feeding it new stuff - is this correct?

If so, how many people are doing this?

Is it all done "by hand", or is the process partially (or even completely) automatable? Or are you using your journalistic workforce to do it?

There are many potential entities to recognize. Currently, it doesn't seem to recognize Manchester City as a football club, distinguish it from Manchester United (another football club), or know that football clubs in England are the same as soccer clubs in the USA. And then there are nicknames and common abbreviations (like ManU :-)... How far are you going to go in making entities like this available, and what are your priorities?

I'm particularly curious about how you're going to tackle disambiguation ("football", "Chelsea", "Paris", etc.). Is there a paper available that explains the Calais developers' strategy for this common difficulty in NLP applications?

I hope you don't take my curiosity for disrespect - I'm just fascinated by your work, and I'm trying to figure out where it might lead.

Best regards, Dirk

Joined: 10/31/2009

I am still not sure what functionality OpenCalais offers. I think I need to read more.

Joined: 01/22/2009

Hello,

I see a good and interesting opportunity to try Calais on a new Drupal site. I've installed the modules, as well as ARC RDF, and everything is in place and working, but I get no suggestions for any tag. Maybe that is because my content is mostly in Spanish. Is this the reason?

Thanks

Gustavo

Joined: 05/16/2008

Hi Gustavo,

Indeed, currently Calais supports English and French only.

However, we do plan to roll out additional languages during 2009, and Spanish is high on the list, so keep visiting.

Regards,

Michal

Joined: 01/22/2009

Would you mind creating a "Spanish roll-out" node so that I can subscribe to it? I'd be glad to participate from the very beginning.

Thanks Gustavo

Joined: 06/15/2008

Hi Dirk,

Well, being one of the people behind Calais learning, I first want to say that I am glad you are curious - it means we're doing our work pretty well (although, as you've noticed, there is always room for improvement).
The processing is completely automatic, and only around a dozen people (plus a few past, yet important, contributors) are responsible for the NLP code.
Calais learns from rules that we "teach" it - we try to mimic the way a human reader identifies or disambiguates an entity (or a relation) while reading a text, using clues within the entity itself, its close context and the context of the entire text.
We have developed (and are still developing) a sophisticated rule-based system with our own (and, if I may say so, cool) programming language. The rules draw on elements from several NLP levels (from text tokenization, morphological analysis and POS tagging to shallow parsing and the identification of nominal and verbal phrases), and we also use lexicons. This combined lexicons+discovery approach allows Calais to identify an entity even if most of the world (including the people writing the rules) has never heard of it, and to disambiguate an entity's meaning according to the context it appears in.
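To make the lexicons+discovery idea a bit more concrete, here is a rough Python sketch. It is not Calais's rule language (which is not shown in this thread) and not how Calais is implemented; the lexicon entries, cue words, type labels and function names are all made up for illustration. It shows two things only: picking an entity type for a known name from cue words in its close context, and "discovering" a name that is in no lexicon from a title cue such as "Dr.".

```python
import re

# Hypothetical lexicon: surface form -> candidate types, each with context cue words.
LEXICON = {
    "Chelsea": {
        "SportsTeam": {"club", "match", "league", "goal", "manager"},
        "Place": {"london", "district", "borough", "street"},
    },
}

# Hypothetical discovery rule: a title followed by capitalized words is a person name.
TITLE_RULE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)")


def tokenize(text):
    """Very crude tokenizer: runs of ASCII letters only."""
    return re.findall(r"[A-Za-z]+", text)


def disambiguate(text, window=5):
    """Label lexicon entries by counting cue words within `window` tokens."""
    tokens = tokenize(text)
    results = []
    for i, tok in enumerate(tokens):
        candidates = LEXICON.get(tok)
        if candidates is None:
            continue
        nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        context = {t.lower() for t in nearby}
        scores = {etype: len(cues & context) for etype, cues in candidates.items()}
        best = max(scores, key=scores.get)
        # With no supporting cue at all, do not guess a type.
        results.append((tok, best if scores[best] > 0 else "Unknown"))
    return results


def discover_people(text):
    """Find person names that are in no lexicon, using the title cue alone."""
    return TITLE_RULE.findall(text)


print(disambiguate("Chelsea won the league match yesterday"))
print(disambiguate("She rented a flat in Chelsea, a district of London"))
print(discover_people("Dr. Jane Doe spoke at the event"))
```

A real system would of course use far richer clues (POS tags, phrase structure, the whole document), but the split between "look it up, then let the context decide" and "let a pattern find something new" is the point of the sketch.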
Our sports team identification was just released in the latest Calais update and will probably improve in the next versions (especially if we get feedback about it), so the soccer teams will get their attention...
Regarding common abbreviations - we try to identify them and map them to their full names using methods that model how abbreviations and acronyms are created, as well as thesauruses.
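As a rough illustration of mapping abbreviations back to full names, the Python sketch below generates candidate short forms from a full name and merges them with a small hand-written alias table standing in for a thesaurus. The two generation rules, the alias entry and all function names are assumptions made for this example; the actual methods Calais uses are not described in this thread.

```python
def abbreviation_candidates(full_name):
    """Generate plausible short forms of a multi-word name (illustrative rules only)."""
    words = full_name.split()
    candidates = set()
    # Rule 1: initials, e.g. "Manchester United" -> "MU".
    candidates.add("".join(w[0].upper() for w in words))
    # Rule 2: truncated first word plus initials of the rest, e.g. "ManU".
    if len(words) > 1 and len(words[0]) >= 3:
        rest = "".join(w[0].upper() for w in words[1:])
        candidates.add(words[0][:3].capitalize() + rest)
    return candidates


def build_index(names, thesaurus=None):
    """Map each lower-cased short form to the set of full names it may stand for."""
    index = {}
    for name in names:
        for abbr in abbreviation_candidates(name):
            index.setdefault(abbr.lower(), set()).add(name)
    # Hand-maintained aliases that no generation rule would produce.
    for alias, name in (thesaurus or {}).items():
        index.setdefault(alias.lower(), set()).add(name)
    return index


def expand(abbr, index):
    """Return the full names an abbreviation could refer to, sorted for stable output."""
    return sorted(index.get(abbr.lower(), ()))


index = build_index(
    ["Manchester United", "Manchester City"],
    thesaurus={"Man Utd": "Manchester United"},  # hypothetical thesaurus entry
)
print(expand("ManU", index))
print(expand("Man Utd", index))
```

Short forms that map to more than one full name would still need context to resolve, in the same way as any other ambiguous entity.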
We hope to make as many entities as possible available in the future. As for prioritization - that is not for me to answer...
I hope I have managed to satisfy at least some of your curiosity, and that we'll be able to keep you fascinated for a long time :)

Regards,
Naama

Joined: 12/31/1969

If there is such a thing as a paper describing the inner workings of Open Calais, I'd second that request :)

Tom
Joined: 05/07/2008

There's no single paper that provides a good overview of the complete Open Calais technology stack - though there probably should be.

Let us think about this for a while and get back to the forum. While I'd like to get this put together, the people who would need to take time out to write it are, of course, the same people who are building the system.

Regards,

Joined: 06/19/2008

Hi,

I'd certainly also be interested in learning more about how Calais learns :)

One specific question: have you ever tried deep parsing to improve Calais's relation extraction capabilities and, if so, what was your experience?

Thanks and best regards

Markus

Tom
Joined: 05/07/2008

Markus:

Take a look at the R3 release notes on "Exhaustive Extraction" (http://www.opencalais.com/R3Overview). You might find them interesting.

Joined: 01/27/2009

Is it possible to use Calais offline? People working in academia and doing research would be interested in this.

I would love to be able to use Calais as part of my research, but using the online API is not practical (I will need to perform a very large number of queries).

Joined: 12/31/1969

Hi,

Our service can handle a large number of queries per day and is extensible.

How many requests per day do you think you will need to send? Can you provide more information about the type of input you need to process (how big are the documents, and what type of documents are they - news articles or some other specific format)?

You can reply by private message, or you can contact questions@opencalais.com. We can discuss this further.