Life in the Linked Data Cloud: Calais Release 4

The Gist: Release 4 of Calais will be a big deal. In that release we’ll go beyond the ability to extract semantic data from your content. We will link that extracted semantic data to datasets from dozens of other information sources, from Wikipedia to Freebase to the CIA World Factbook. In short – instead of being limited to the contents of the document you’re processing, you’ll be able to develop solutions that leverage a large and rapidly growing information asset: the Linked Data Cloud.

The goal of this post is just to give our community a heads-up to start thinking and planning.

During the course of 2008 we’ve had three significant releases of Calais, with additional point releases nearly each month along the way. We’ve added new knowledge domains, improved performance, delivered integration with a range of tools and developed new user-facing applications. It’s been a year of amazing growth in our developer community and the capabilities of the Calais service.

While every previous release has accomplished something significant, Release 4 is going to introduce something that we think is game-changing – and that’s life in the Linked Data Cloud. It’s important enough that we want to give all the members of our community time to think about it, prepare for it and get their brains in gear on how they might use it.

Every release of Calais up to this point has focused on meeting the need to extract semantic information from text. Release 4 builds on this by creating the ability to harvest the Linked Data cloud using that semantic data.

For this all to make sense we need to introduce a few things. If you already know about de-referenceable URIs and the Linked Data cloud – skim ahead. If not – please take a moment to ingest the background you need.

When you send text to Calais it returns several things: entities, facts, events and categories. For purposes of today’s discussion we’re going to focus in on entities. Entities are just what they sound like – they are things. Some specific examples are people, companies, organizations, geographies, sports teams and music albums.

When Calais extracts an entity from your text it returns (at least) a few things. It tells you the name of the entity and it tells you what type of entity it is. Unlike other extraction services we don’t just return a list of things – Calais tells you it found a thing of type=Company and value=IBM, or type=Person and value=Jane Doe. But – there’s something else Calais returns that hasn’t meant very much up until now: it returns a Uniform Resource Identifier (URI) for that entity. There’s nothing magic about URIs – each one is simply a unique identifier for an entity that Calais discovers. Here’s an example (it’s not pretty) of what the URI for the Company IBM looks like:

Well, that doesn’t look very useful, does it? If you were to pull up that URI (when Release 4 is out) all you’d see is RDF with links to places called DBpedia and Freebase and Reuters. But keep those links in mind: they’re the key to a whole new world.
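
To make "pulling up a URI" a little more concrete, here is a minimal Python sketch of what a program might do. It is an illustration under assumptions, not the Calais API: we assume the RDF comes back as RDF/XML, that the outbound links are expressed with the standard owl:sameAs property, and that the entity URI is the example.org placeholder shown below. SAMPLE_RDF is an invented stand-in for a response, and the function names are ours.

```python
import urllib.request
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL_NS = "http://www.w3.org/2002/07/owl#"

# Invented sample, NOT real Calais output. Both example.org URIs are
# placeholders standing in for an entity URI and a second data source.
SAMPLE_RDF = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <rdf:Description rdf:about="http://example.org/entity/company/ibm">
    <owl:sameAs rdf:resource="http://dbpedia.org/resource/IBM"/>
    <owl:sameAs rdf:resource="http://example.org/other-source/ibm"/>
  </rdf:Description>
</rdf:RDF>
"""


def build_rdf_request(entity_uri: str) -> urllib.request.Request:
    """Prepare a GET request that asks the server for RDF/XML, not HTML."""
    return urllib.request.Request(
        entity_uri, headers={"Accept": "application/rdf+xml"}
    )


def fetch_rdf(entity_uri: str, timeout: float = 10.0) -> str:
    """De-reference the URI over HTTP and return the response body as text."""
    request = build_rdf_request(entity_uri)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def extract_same_as_links(rdf_xml: str) -> list[str]:
    """Collect the target of every owl:sameAs statement in an RDF/XML document."""
    root = ET.fromstring(rdf_xml)
    links = []
    for element in root.iter(f"{{{OWL_NS}}}sameAs"):
        target = element.get(f"{{{RDF_NS}}}resource")
        if target:
            links.append(target)
    return links


if __name__ == "__main__":
    # Parse the canned sample so the sketch runs without network access.
    for link in extract_same_as_links(SAMPLE_RDF):
        print(link)
```

In a real program you would pass the result of fetch_rdf(uri) to extract_same_as_links instead of the canned sample. A full RDF library would be the more robust choice; the standard-library XML parser is used here only to keep the sketch self-contained.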

Linked Data is the name of a movement underway (not too surprisingly, initiated by Sir Tim Berners-Lee) that sets a standard and expected behavior for publishing and connecting data on the web. This isn’t about publishing web pages – this is about turning those web pages into data that’s accessible to programs to work with. We’ll give you a quick example to make it real: Wikipedia is one of the single largest sets of information across a broad range of topics in the world. It’s really great if I'm a person who's casually looking for information on a particular topic – but it’s not so great if I’m a computer program that wants to use that data. Why? Because it’s formatted and organized for people – not computers – to read.

But Wikipedia has a twin – in fact a Linked Data twin – called DBpedia. DBpedia has the same structured information as Wikipedia – but translated into a machine-readable format called RDF and accessible via the Linked Data standards. And Wikipedia is not alone. A growing cloud of information sets, from DBpedia to the CIA World Factbook to U.S. Census data to MusicBrainz – and many others – is becoming available. What’s important is that this cloud is 1) growing, and 2) interoperable. There are “pointers” from entries in DBpedia to entries in MusicBrainz and back to entries in GeoNames – it’s another big Web – but this time it’s a Web of Data.

So – lots of words and arcane concepts. Let’s try to bring it all together into something that makes sense. We’ll put one sentence out there – and then we’ll give a few examples.

Beginning with Calais Release 4 you and the programs you develop will be able to go from many of the entities Calais extracts directly to the Linked Data Cloud.

A simple example:

I want to process today’s business news. For each article I want to extract all of the companies mentioned – but only if the article also mentions a merger or acquisition. I am only interested in consulting companies whose headquarters (or those of their subsidiaries) are located in New York State. Do all of that and give me a widget for my news site titled “Merger Activity for NY Consulting Companies”. And oh, by the way, this isn’t a research project – I want you to do it in real time for the 10,000 pieces of news I process every day.

How would you do that? Option 1 is to hire a bunch of researchers, give them a fast internet connection and teach them to type very, very fast. Option 2 is to write some code that looks like this:

For each Article
    Submit to Calais, get response
    If MergerAcquisition exists then
        For each Company
            Retrieve Calais Company URI, extract DBpedia link
            Send Linked Data query to DBpedia, get response
            If CompanyIndustry contains “Consulting”
                If CompanyHeadquarters = “New York”
                    Put the company on the list
                For each subsidiary
                    Send Linked Data query to DBpedia, get response
                    If CompanyHeadquarters = “New York”
                        Put the parent company on the list

(lots of endif’s)

Print the list
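
For readers who prefer something they can run, the same logic can be sketched in Python. This is only a sketch under assumptions: the extraction step and the Linked Data lookup are passed in as functions (extract and lookup), and the Extraction and CompanyFacts shapes are hypothetical stand-ins for whatever your Calais client and DBpedia query code return. None of these names come from the Calais API.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class CompanyFacts:
    """Facts about one company from the Linked Data cloud (hypothetical shape)."""
    industry: str
    headquarters: str
    subsidiary_links: list[str] = field(default_factory=list)


@dataclass
class Extraction:
    """What the extraction step found in one article (hypothetical shape)."""
    has_merger_or_acquisition: bool
    company_links: dict[str, str]  # company name -> DBpedia link


def ny_consulting_companies(
    articles: Iterable[str],
    extract: Callable[[str], Extraction],
    lookup: Callable[[str], CompanyFacts],
    state: str = "New York",
    industry: str = "Consulting",
) -> list[str]:
    """Return companies that match the widget's criteria, in order of appearance."""
    matches: list[str] = []
    for article in articles:
        extraction = extract(article)
        # Only articles that mention a merger or acquisition are of interest.
        if not extraction.has_merger_or_acquisition:
            continue
        for name, link in extraction.company_links.items():
            facts = lookup(link)
            if industry not in facts.industry:
                continue
            # Headquartered in the state directly, or through any subsidiary.
            in_state = facts.headquarters == state or any(
                lookup(sub).headquarters == state
                for sub in facts.subsidiary_links
            )
            if in_state and name not in matches:
                matches.append(name)
    return matches
```

Passing the two steps in as functions keeps the sketch independent of any particular client library, and it means you can try the filtering logic against a handful of canned articles before wiring in live calls.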

That really is a pretty straightforward example. How about companies in the news with at least one subsidiary doing business in an area that the CIA World Factbook considers dangerous? Or books released by authors who attended Harvard and live in Ohio? Or ... We think you get the idea.

So. The summary. The combination of semantic data extraction (generic extraction, tags, keywords won’t do the trick) + de-referenceable URIs (entity identifiers you and your programs can retrieve) + the Linked Data Cloud = amazing stuff.

We’d like you to start thinking about it.


How will the

How will the de-referenceable URIs be ordered/ranked when they are returned as a list, or will this be a problem for the program at the other end of the request? Will you apply a type of PageRank algorithm to the list prior to returning it to the requesting program?

no PageRank

Please see the comment below for a similar question.

To clarify, each entity (company, product, etc.) has a dereferenceable URI, and these URIs are part of the returned RDF (they do not appear as a list).

The dereferenceable URI for a company, product, etc. may carry a list of DBpedia and Freebase links, and we do not order them using PageRank or similar algorithms (that is left to the requesting program). It is a nice idea, though, and we will keep it on our back burner.


Fascinating service

I came to this site after seeing the presentation at

I would like to say that this is a fantastic service. Keep up the good work 

Very interesting, but still lots of issues are unclear...

This Release 4 sounds like a very interesting idea. However, some issues remain unclear to me. For example, some text (such as a web page) about IBM is submitted to Calais, which successfully identifies the following entity (among others):

IBM is a company, and has the following URI:

However, notice that the URI is completely home-made by the Calais engine, and DBpedia is surely not using the same URI for IBM. How does Calais decide that this IBM is exactly the same thing as the IBM described in DBpedia? In other words, how does Calais create the links in the first place?

IBM might be an easy case – it is pretty much true that we only have one IBM. But what about other cases? For example, Calais can identify "Peking" (China's capital city, which is usually called Beijing) as a city, and again, it will assign an ugly-looking URI to Peking. How does Calais know that the Beijing described in DBpedia is exactly the same entity as this Peking?

I might have missed some points, but please help me understand this.


De-referenceable URIs

Each entity identified by Calais has a URI. Today there's nothing behind it, but in Release 4 this URI will be de-referenceable: you will be able to access it (through an HTTP request) and get a response that provides additional information about the entity and relevant links to other Linked Data resources or to other web pages.

So in essence de-referenceable URIs aren't returned as a list; rather, a lot of background work is done to unambiguously resolve the right identity for each extracted entity.

"relevant" links

@MichalF Thanks for the response. Admittedly, my understanding of the Linked Data cloud may be hazy (pun intended), but bear with me. OK, the URI is de-referenceable and accessible via HTTP request. It is the response content that I'm asking about. Regarding the 'additional information about the entity and relevant links' in Release 4: assuming there are many relevant links in the response, how will Calais order these links? Or will this be left for the calling program to sort through and determine the most 'relevant' links?

response content

OK, I think the question is clear now.

For a given URI, the returned content provides the single most relevant link from each source of information about the entity in question. In other words, there could be many relevant links but only one per source, so the program (or human) doesn't need to determine which link is the most relevant – all of them are relevant.

To give an example, given a URI for a disambiguated city name, the response content could include information like the lat/long as well as links to DBpedia, GeoNames, etc. But there's only one link to DBpedia, one to GeoNames and so on, so you don't need to choose between several links coming from a single source.

Hope this clarifies things.
