Well, it’s been over two weeks since we released something cool (www.semanticproxy.com) – time to get cracking on some new stuff.

We’ve placed Release 3.1 of Calais into technology preview status. Just as a reminder, technology preview is a separate instance of Calais that allows developers to evaluate new features and test their software prior to our moving the release to production. You can access the Preview by simply pointing your tool to http://beta.opencalais.com rather than http://api.opencalais.com. Just like Calais, the preview version requires that you have a developer API key – your existing key will work just fine.
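As a rough sketch of the switch described above, the snippet below picks the host from a single flag so the rest of a client stays unchanged. The two host names come from this post; the function names and the `/example/path` placeholder are made up for illustration and are not part of the Calais API.

```python
# Hosts named in this post.
PRODUCTION_HOST = "http://api.opencalais.com"
PREVIEW_HOST = "http://beta.opencalais.com"


def calais_host(use_preview: bool) -> str:
    """Return the Preview host when use_preview is True, else production."""
    return PREVIEW_HOST if use_preview else PRODUCTION_HOST


def build_endpoint(path: str, use_preview: bool = False) -> str:
    """Join a service path onto the chosen host with exactly one slash."""
    return calais_host(use_preview).rstrip("/") + "/" + path.lstrip("/")


if __name__ == "__main__":
    # "/example/path" is a placeholder, not a documented Calais path.
    print(build_endpoint("/example/path", use_preview=True))
```

Keeping the host in one place means moving back to production after the Preview ends is a one-flag change; the same API key is sent either way.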

This will be a relatively extended Preview – most likely lasting throughout October 2008. We want to give everyone the opportunity to test some significant new features and make sure we have adequate time to respond to any issues you discover. That being said – please don’t wait until the last minute to give things a spin.

As you may have noticed, our releases are getting significantly larger and incorporating substantial new functionality on a monthly basis. Release 3.1 is no different – it contains everything from major new capabilities such as company and geography disambiguation to performance improvements to new output formats to some significant expansions of the types of information it can extract.

So, in our tradition of lengthy blog postings – here’s an overview of what’s new in 3.1. I’ve broken this up into a few high-level focus areas. You can also visit the release notes right here.

New and Significant (at least to some of us)

Release 3.1’s big new functionality is disambiguation of company names and geographies. One of the big challenges of automated entity recognition is how to deal with ambiguity – for example “IBM”, “IBM Corp” and “International Business Machines”. For the vast majority of use cases you want each of these variations resolved to a single entity called “IBM”. There are similar challenges around geography, such as Calais, Maine vs. Calais, France.

For companies we’ve implemented a sophisticated disambiguation capability that is driven by a reference database of tens of millions of company names and their variations. This database is primarily focused on public companies – but we’ll be expanding it to contain a broader range of companies in the future. In addition to variations on a company name, we also use hints that may exist in the text, such as location or industry, as additional evidence.
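To make the idea of resolving name variations concrete, here is a toy sketch of an alias lookup. It is not how Calais works internally: the one-entry reference table, the normalization rule and the function names are all assumptions for illustration, and it ignores the location and industry hints mentioned above.

```python
import re

# Hypothetical, tiny stand-in for a reference database:
# canonical name -> known variations of that name.
REFERENCE = {
    "IBM": ["IBM", "IBM Corp", "International Business Machines"],
}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation and collapse runs of whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(cleaned.split())


# Reverse index: normalized variation -> canonical name.
ALIAS_INDEX = {
    normalize(variation): canonical
    for canonical, variations in REFERENCE.items()
    for variation in variations
}


def resolve_company(mention: str):
    """Return the canonical name for a mention, or None if it is unknown."""
    return ALIAS_INDEX.get(normalize(mention))


if __name__ == "__main__":
    for mention in ["IBM", "IBM Corp.", "International Business Machines"]:
        print(mention, "->", resolve_company(mention))
```

A plain lookup like this only handles variations that are already in the table; using surrounding text as extra evidence is what separates real disambiguation from alias matching.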

For geography we’re utilizing elements of Freebase and other public data assets to dive in and figure out which Calais or Paris or wherever the text is really talking about. We base this disambiguation not just on the name itself but also on hints in the surrounding text (for example, longhorns are seldom discussed in the same article as Paris, France – but Paris, Texas is another story). To jumpstart mapping applications we also return the geo coordinates of the geography we’ve detected.
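The longhorns example can be sketched as a simple scoring exercise: count how many hint words for each candidate place appear in the text and pick the best match. Everything here is an assumption for illustration (the candidate list, the hint words, the field names and the approximate coordinates); it is not the Calais algorithm or its output format.

```python
import re

# Hypothetical candidates for an ambiguous place name, each with
# approximate coordinates and a few hand-picked context hint words.
CANDIDATES = {
    "Paris": [
        {"name": "Paris, France", "lat": 48.8566, "lon": 2.3522,
         "hints": {"france", "seine", "louvre", "eiffel"}},
        {"name": "Paris, Texas", "lat": 33.6609, "lon": -95.5555,
         "hints": {"texas", "longhorns", "dallas"}},
    ],
}


def disambiguate_place(name: str, text: str):
    """Pick the candidate whose hint words overlap most with the text.

    Returns None for an unknown name. On a tie, the first candidate
    in the list wins.
    """
    candidates = CANDIDATES.get(name, [])
    if not candidates:
        return None
    words = set(re.findall(r"[a-z]+", text.lower()))
    return max(candidates, key=lambda c: len(c["hints"] & words))


if __name__ == "__main__":
    place = disambiguate_place(
        "Paris", "The longhorns were driven through Paris last week.")
    print(place["name"], place["lat"], place["lon"])
```

Returning the coordinates alongside the resolved name is what lets a mapping application drop a pin without a second lookup.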

Efficiency and Scalability

We’ve implemented a couple of changes to make life easier for our higher-volume users. First, you now have the option to tell Calais you do not want a copy of the original text returned to you. If your application doesn’t care about offsets of detected items in the text you might consider turning this option on to reduce your bandwidth utilization.

Second, Calais now supports HTTP traffic compression. Given that we’re dealing with text on the input and output sides of the transaction, this can dramatically reduce the size of your transaction, again reducing your bandwidth utilization.
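To get a feel for how much compression can save on text, the sketch below gzips a repetitive sample string locally and compares sizes. This post does not say which encodings Calais accepts, so gzip is an assumption here, and the sample text and function name are invented for the demonstration. In HTTP, compressed responses are normally requested with the standard `Accept-Encoding` header.

```python
import gzip


def compression_summary(text: str):
    """Return (raw_bytes, gzipped_bytes) for a piece of text."""
    raw = text.encode("utf-8")
    packed = gzip.compress(raw)
    return len(raw), len(packed)


# Invented sample: real documents are less repetitive, so real-world
# savings will be smaller than this example suggests.
sample = "Calais extracts entities, facts and events from text. " * 200
raw_size, packed_size = compression_summary(sample)

if __name__ == "__main__":
    print(f"raw: {raw_size} bytes, gzipped: {packed_size} bytes")
```

Actual savings depend on the document, so it is worth measuring with your own content before counting on a particular ratio.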

New Output Formats and Integrations

Please take a look at the Release Notes for details on a number of small changes to the RDF, MicroFormats and Simple format outputs. We’ve also added a JSON output format that’s covered in more detail here.

Calais now also talks PopFly! Microsoft’s PopFly is an interesting mashup building platform with a visual development interface. You can now directly integrate Calais within your PopFly mashups. Our documentation for this capability is available here.

Getting Smarter

In keeping with prior releases, Calais is also getting smarter. We’ve added a number of new elements to the Calais vocabulary. These include PatentFiling, PatentIssuance, FDAPhase, PersonEmailAddress, PersonEmployment, new elements for PersonAttributes, and SecondaryIssuance. In addition to these elements, we have one particularly interesting one: PersonRelation. The PersonRelation entity extracts references to symmetric relationships between people in the areas of business, friends, academic, military service or politics. This is one you’ll have to play with to get an idea of – but here’s a simple example:

The text:

The two served together in combat, and McDonald said Odierno was an “absolute joy to work with”.

Would result in:

Person1:  Mark McDonald
Person2:  Ray Odierno
PersonRelationType: Military Service
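Because the relationship is symmetric, an application consuming it may want to treat (Person1, Person2) and (Person2, Person1) as the same fact. One way to do that, sketched below, is to store the two people as an unordered set. The class and field names are assumptions for illustration and do not reflect the actual Calais output schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonRelation:
    """A symmetric relation: the order of the two people is irrelevant."""
    people: frozenset
    relation_type: str


def make_relation(person1: str, person2: str, relation_type: str) -> PersonRelation:
    """Build a relation whose equality and hash ignore person order."""
    return PersonRelation(frozenset({person1, person2}), relation_type)


if __name__ == "__main__":
    a = make_relation("Mark McDonald", "Ray Odierno", "Military Service")
    b = make_relation("Ray Odierno", "Mark McDonald", "Military Service")
    # Both orderings collapse to a single entry in a set.
    print(a == b, len({a, b}))
```

Storing relations this way means that the same pair reported in either order across several documents is deduplicated automatically.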

That’s it for R3.1. Any questions, please feel free to post to the forums or drop us a note at questions@opencalais.com. I’ll be posting an update on what’s in the pipeline for R4 in the next few days – lots of interesting stuff is on the way.

Attachment: RelNotes_08Oct3.doc (210 KB)

Company stock tickers

I notice in the examples that the stock ticker is given as a single term, i.e. "British Airways" => "BAY". Are there any plans to include the name of the exchange for which that symbol is valid, for example LSE, DAX, NASDAQ, etc.? On Yahoo Finance, symbols are prefixed with the exchange name, so LSE.BAY would mean British Airways on the London Stock Exchange.

Of course this would bring up the issue of companies that are listed on multiple exchanges with possibly different symbols.


Exchanges

We'll look into it and get back to you as soon as possible. This sounds like a good idea.