Calais Pipes Service - Installation and Configuration

Introduction

The OpenCalais Web service analyzes submitted text and returns rich semantic data about it. Yahoo! Pipes lets users easily create custom RSS feeds. The Calais Pipes Service allows Yahoo! Pipes users to enrich their custom RSS feeds with semantic metadata.

This document guides you through the installation and configuration of the Calais Pipes Service on your server. Download the necessary files in the format you desire (Zip or Gzip), and follow the instructions provided in this documentation.


Your Own Server

The OpenCalais Web service is available for use on the Calais Web server for users with a valid license ID. In some cases you might prefer to run it on your own server. Doing so allows you to:

  • Monitor service use by your audience
  • Modify the Web service's behavior (the code is open source)


License

The Calais Pipes Service is open source, distributed under a BSD-style license. See details in the LICENSE file in the package.


Technical Background

The Calais Pipes Service is implemented as a Java servlet. The package you downloaded includes a WAR file for deployment of the package without changes. The Java source code is included as well, allowing you to alter the behavior of the servlet to suit your needs.

Configuration of the servlet is done using several init-params in the web.xml file.


System Requirements

In order to run the service, your Web server will need to meet the following requirements:

  • Ability to run Java servlets (using a Servlet/JSP container such as Apache Tomcat)
  • Access to the Java servlet from the public Internet (e.g., by diverting traffic for a certain directory from the Apache Web server to Apache Tomcat)
  • Unrestricted outbound access to the Web (port 80) for the servlet (restrictions are often applied by a firewall)
  • Sufficient permissions for the servlet to establish outbound connections (permissions are usually inherited from the user running the Servlet/JSP container on your server)

The servlet was tested in the following environment:

  • Linux Fedora Core 6
  • Apache 2.2.3
  • Tomcat 5.5
  • GNU Java 4.1.1 (Java version 1.4.2)


Servlet Description

Understanding how the Calais Pipes Service works is important: it tells you what processing is done on your server and which resources the servlet consumes.


Servlet Interface

The servlet is invoked via the HTTP POST method. It expects certain parameters in the URL itself, and a parameter called 'data' in the body of the POST message.

The servlet will take the following two parameters from the URL of the call:

  • licenseID - a valid OpenCalais API key
  • richLinks (optional) - true/false - whether to replace feed links with links to the Calais Semantic Proxy

If licenseID is not valid, the servlet returns the feed unchanged.

If richLinks is missing, it defaults to false (links are left unchanged).

The body of the message is expected to hold a single parameter called 'data'. It contains feed data in the format provided by Yahoo! Pipes. Note that the servlet is specifically tailored to Yahoo! Pipes, including non-standard behavior. Should the interface of Yahoo! Pipes change, the servlet will have to be modified.
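
The interface described above can be sketched from the caller's side. The following Java fragment is only an illustration: the class and method names are hypothetical, the sample feed data is a placeholder (the actual Yahoo! Pipes format is not reproduced here), and it assumes the 'data' parameter is sent as a URL-encoded form field.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the package: builds the URL and POST body
// that a client would send to the servlet.
final class CalaisPipesRequestSketch {

    /** URL-encodes a value as UTF-8 for use in a query string or form body. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Builds the request URL; richLinks is omitted when null (the servlet then defaults to false). */
    static String buildUrl(String servletUrl, String licenseId, Boolean richLinks) {
        StringBuilder url = new StringBuilder(servletUrl)
                .append("?licenseID=").append(encode(licenseId));
        if (richLinks != null) {
            url.append("&richLinks=").append(richLinks);
        }
        return url.toString();
    }

    /** Builds the POST body holding the single 'data' parameter (assumed form-encoded). */
    static String buildBody(String feedData) {
        return "data=" + encode(feedData);
    }

    public static void main(String[] args) {
        String servlet = "http://www.example.com/tomcat/CalaisPipes/CalaisPipes";
        System.out.println(buildUrl(servlet, "12345", true));
        // Placeholder feed data; the real payload comes from Yahoo! Pipes.
        System.out.println(buildBody("{\"items\":[]}"));
    }
}
```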


Servlet Configuration

The servlet will take the following configuration values from the web.xml file configuring the servlet (init-params):

  • SemanticProxyURL – the URL to use when replacing links. It will be used as is except for {L}, which will be replaced with the provided licenseID, and {U}, which will be replaced with the URL. The special value 'disabled' will disable link replacement altogether, causing the servlet to ignore the richLinks parameter in POST requests
  • InvokeEnlightenTimeout – the timeout in seconds for the OpenCalais Web service to respond to each servlet’s request
  • MaxEntitiesPerDescription – the maximum number of entities to include within a single item’s description
  • MaxEventsFactsPerDescription – the maximum number of events/facts to include within a single item’s description
  • LogFileName - full path to the servlet's log file
  • PageFetchTimeout – the timeout in seconds for fetching a single page listed in the feed
  • DoHTMLCleanup - true/false - whether to extract text from each HTML page before sending to OpenCalais (see HTML Text Extraction)

All parameters are mandatory. The web.xml file provided in the WAR file provides reasonable defaults for all.
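
The placeholder substitution described for SemanticProxyURL can be sketched as follows. This is an illustration under stated assumptions, not the servlet's actual code: the class and method names are hypothetical, and both substituted values are URL-encoded as UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical illustration of {L}/{U} substitution in the SemanticProxyURL value.
final class SemanticProxyLinkSketch {

    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /**
     * Returns the link to place in the feed item. The original link is kept when
     * richLinks is false or when the configured template is the value 'disabled'.
     */
    static String richLink(String template, boolean richLinks, String licenseId, String originalLink) {
        if (!richLinks || "disabled".equals(template)) {
            return originalLink;
        }
        // String.replace substitutes literally, so the braces need no escaping.
        return template
                .replace("{L}", encode(licenseId))
                .replace("{U}", encode(originalLink));
    }

    public static void main(String[] args) {
        String template = "http://pipes.opencalais.com?lic={L}&url={U}&ot=html";
        System.out.println(richLink(template, true, "12345", "http://www.example.com"));
        // prints http://pipes.opencalais.com?lic=12345&url=http%3A%2F%2Fwww.example.com&ot=html
    }
}
```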


Servlet Logic

The logic of the servlet, executed for each request, is as follows:

  • Go over all objects within the provided feed data (sent by Yahoo! Pipes)
  • For each feed item (an object with both "link" and "description" fields present)
    • Retrieve the HTML page using the value of the "link" field
    • If DoHTMLCleanup is true, extract the visible text from the HTML page
    • Call the OpenCalais Web service providing the HTML page or text, requesting output in Simple Output Format
    • In case of errors (timeout expiration, etc.) - leave the feed item unchanged
    • For a valid response append text of the following structure at the end of the "description" field
      • <p>(OpenCalais found: <Entity1 Type> - <Entity1 Name> [N1 times], <Entity2 Type> - <Entity2 Name> [N2 times], ..., Event - <Event1 Name> [M1 times], Event - <Event2 Name> [M2 times], ...)
      • The number of entities in the appended text will not exceed servlet configuration parameter MaxEntitiesPerDescription
      • The number of events in the appended text will not exceed servlet configuration parameter MaxEventsFactsPerDescription
      • The entities will be ordered by their count (items with higher count appearing first, items with lower count last or not at all)
      • The events will follow the entities and will be ordered by their count (events with higher count appearing first, events with lower count last or not at all)
    • If richLinks in the POST request is 'true' and the servlet configuration parameter SemanticProxyURL is not 'disabled'
      • Replace the “link” field with a link to the Semantic Proxy, using the licenseID parameter as the API key and the original value of “link” as the URL
    • For each request, write a line to the log file (as specified in configuration) including: time, licenseID, richLinks, remote IP address and error text in case of errors
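
The ordering and capping rules for the appended text can be sketched as below. The class and its members are hypothetical and only mirror the rules listed above; the sketch assumes ties keep their original order, renders a count of 1 as "once" (as in the example later in this document), and leaves out the leading <p> tag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of how the "(OpenCalais found: ...)" text could be assembled.
final class DescriptionSummarySketch {

    /** One extracted entity or event together with its occurrence count. */
    static final class Mention {
        final String type;
        final String name;
        final int count;

        Mention(String type, String name, int count) {
            this.type = type;
            this.name = name;
            this.count = count;
        }
    }

    /** Sorts by count, highest first (stable for ties), and keeps at most max mentions. */
    static List<Mention> top(List<Mention> mentions, int max) {
        List<Mention> sorted = new ArrayList<>(mentions);
        sorted.sort((a, b) -> Integer.compare(b.count, a.count));
        return sorted.subList(0, Math.min(max, sorted.size()));
    }

    static String format(Mention m) {
        String times = m.count == 1 ? "once" : m.count + " times";
        return m.type + " - " + m.name + " [" + times + "]";
    }

    /** Entities come first, then events, each group capped by its own limit. */
    static String summarize(List<Mention> entities, List<Mention> events,
                            int maxEntities, int maxEvents) {
        List<String> parts = new ArrayList<>();
        for (Mention m : top(entities, maxEntities)) {
            parts.add(format(m));
        }
        for (Mention m : top(events, maxEvents)) {
            parts.add(format(m));
        }
        return "(OpenCalais found: " + String.join(", ", parts) + ")";
    }

    public static void main(String[] args) {
        List<Mention> entities = Arrays.asList(
                new Mention("Company", "Microsoft", 1),
                new Mention("Person", "Steve Jobs", 12),
                new Mention("Company", "Apple", 10));
        System.out.println(summarize(entities, new ArrayList<Mention>(), 2, 5));
        // prints (OpenCalais found: Person - Steve Jobs [12 times], Company - Apple [10 times])
    }
}
```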


Example

Consider the following text:

Apple: Interoperability through DRM-Free Music

In a letter seemingly directed at major music labels and European legislators, Apple's CEO Steve
 Jobs suggests removing copyright protection from music sold online

Apple's iTunes store dominates the online music market along with the iPod, its portable digital
music player. Now, Steve Jobs, Apple's chief executive officer calls upon major music labels to
allow the sale of unprotected music in online stores such as iTunes. Mr. Jobs published his
suggestion in a letter titled Thoughts on Music published on Apple's Web site.

Digital Rights Management, the technology limiting the illegal distribution of music bought
on the Internet, was a prerequisite made by music labels for offering their catalog online. A
byproduct of DRM is that music purchased legally on Apple's iTunes store cannot be played on
portable devices other than the iPod, in much the same way as Sony's Connect music store
provides songs that can only be played on Sony's players.

In his letter Jobs suggests making DRM-free music available for download on music stores such as
iTunes. The result would be that music purchased legally can be played on any portable device or
computer. Consumers will benefit as they will be able to keep their music collection intact as
they shift from one portable player to another.

A Legal Front

The letter may be seen as a step in a legal battle that Apple might be brought into as its
market dominance in digital music makes it a de-facto monopoly. The company that suffered heavily
under the Microsoft monopoly in the field of operating systems, an issue addressed frequently
through litigation, now finds itself on the other end. Norwegian legislators have already ruled
that limiting music bought legally on iTunes for play only on iPods is illegal, and Apple is to
fix that by October, or face litigation.

Other European legislators may follow suit, as they examine the legality of continuing to sell
music online the way it is done today. In his letter Jobs suggests that those 'unhappy with the
current situation should redirect their energies towards persuading the music companies to sell
their music DRM-free'. This would ensure the interoperability that would benefit both consumers
and technology companies. In his letter Jobs notes that DRM failed to prevent music piracy, which
is one of the reasons it should no longer be used.

A Likely Future

Some consider the sale of DRM-free music online to be imminent if not inevitable, and criticize
Jobs for taking credit for an idea that is not his own. However, DRM-free music will also make
it easier to illegally distribute music. While Apple's more concerned in increasing the share
of the online music market and reducing its costs of developing and maintaining DRM technology,
music companies are interested in their own bottom lines.

Sales of CDs have been dropping consistently, and the increase in online music sales has so far
failed to make up for the lost income. In addition, setting a precedent for the emerging online
video-content market might also deter music companies from adopting Apple's view. For the time
being, it appears, online music shoppers will have to stick to the device they have, while those
interested in illegal distribution will continue to purchase the DRM-free CDs. 

Let’s assume that this text is returned by the URL http://www.example.com, and that the following item appears in an RSS feed sent from Yahoo! Pipes to the Web service:

  • link: http://www.example.com
  • description:
    In a letter seemingly directed at major music labels and European
    legislators, Apple's CEO Steve Jobs suggests removing copyright
    protection from music sold online

Calais identifies the following entities in the text:

Type | Name | Count
Person | Steve Jobs | 12
Company | Apple | 10
Company | Sony | 2
IndustryTerm | online music market | 2
Company | Microsoft | 1
Facility | Connect music store | 1
IndustryTerm | online stores | 1
IndustryTerm | portable devices | 1
IndustryTerm | portable device | 1
IndustryTerm | online music shoppers | 1
IndustryTerm | online music sales | 1
IndustryTerm | technology | 1
Technology | operating systems | 1
Technology | DRM technology | 1

Assume the servlet is configured with MaxEntitiesPerDescription = 5 and SemanticProxyURL = http://pipes.opencalais.com?lic={L}&url={U}&ot=html

If the call is with richLinks=false (or no richLinks parameter), the returned value will include the item with:

  • link: http://www.example.com
  • description: In a letter seemingly directed at major music labels and European legislators, Apple's CEO Steve Jobs suggests removing copyright protection from music sold online (OpenCalais found: Person – Steve Jobs [12 times], Company – Apple [10 times], Company – Sony [2 times], IndustryTerm – online music market [2 times], Company – Microsoft [once])

Note that no events are listed as Calais did not identify any.

If the call is with licenseID=12345 and richLinks=true (assuming 12345 is a valid licenseID), the returned value will include the item with:

  • link: http://pipes.opencalais.com?lic=12345&url=http://www.example.com&ot=html
  • description: In a letter seemingly directed at major music labels and European legislators, Apple's CEO Steve Jobs suggests removing copyright protection from music sold online (OpenCalais found: Person – Steve Jobs [12 times], Company – Apple [10 times], Company – Sony [2 times], IndustryTerm – online music market [2 times], Company – Microsoft [once])

NOTE: The actual link will have the values URL encoded (http%3A%2F...), but is shown here decoded for readability.


Implementation Notes

It is important to remember the following about the implementation:

Outbound Access

The servlet will access the Web to retrieve each item in the feed. Since there is no way of knowing which URLs will be included in feeds submitted to the servlet, outbound access must be open to any Web address.

If your Web server is hosted in a shared environment, you should ask your hosting provider to enable this access. Note that in some cases your provider may consider this a security hazard.

Bandwidth

The servlet retrieves each item in the feed. A simple news story may weigh as much as 50-60 kilobytes. If the feed data includes 10 items, bandwidth consumption can exceed 1 megabyte: the servlet retrieves 10 pages of 50-60 kilobytes each (500-600 kilobytes in total), then sends them to Calais (another 500-600 kilobytes). The Calais responses and the servlet's own response consume additional bandwidth.

This means the servlet can potentially consume a lot of bandwidth. It is advisable to caution your users about this and suggest they send no more than 5-6 items in each call. The user guide in the package explains how to do that using the Yahoo! Pipes editor.
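
The arithmetic behind the bandwidth estimate can be written out as a small calculation. The class below is a hypothetical illustration; it counts only the page fetches and the submissions to Calais, so the real figure is somewhat higher once the Calais responses and the servlet's own response are included.

```java
// Hypothetical back-of-the-envelope estimate of per-request bandwidth.
final class BandwidthEstimateSketch {

    /** Kilobytes moved when each page is fetched once and then sent once to Calais. */
    static long estimateKilobytes(int items, int pageKilobytes) {
        long fetched = (long) items * pageKilobytes;      // pages retrieved from the Web
        long sentToCalais = (long) items * pageKilobytes; // the same pages submitted to Calais
        return fetched + sentToCalais;
    }

    public static void main(String[] args) {
        System.out.println(estimateKilobytes(10, 50)); // prints 1000
        System.out.println(estimateKilobytes(10, 60)); // prints 1200
        System.out.println(estimateKilobytes(5, 60));  // prints 600
    }
}
```

With 5-6 items per call, the same estimate stays well under 1 megabyte.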

Time

Due to the servlet's bandwidth consumption, processing a single request may take a minute or more. The servlet outputs a space character every 3 seconds so that the Yahoo! Pipes implementation does not time out in the meantime. For this reason as well, you should advise your users to send no more than 5 or 6 items in a request.

Note that the servlet does little CPU-intensive processing, so the bottleneck will likely be the bandwidth.


Installation

Now that you have an understanding of the servlet, this section guides you through the installation.

In the downloaded package under the deploy directory you will find a single file called CalaisPipes.war. Copy this file to the Web applications directory of your Servlet/JSP container.

For example, if you are using Apache Tomcat and the Web applications directory is /usr/share/tomcat5/webapps, copy the file to this directory. You will now have a file called /usr/share/tomcat5/webapps/CalaisPipes.war.

Examine the directory's contents after a few seconds. If your Servlet/JSP container is configured to automatically deploy WAR files, you should see a new directory called CalaisPipes. Continuing with the example, you should have a directory called /usr/share/tomcat5/webapps/CalaisPipes.

If you do not see such a directory, or you know your Servlet/JSP container is configured not to deploy WAR files automatically, you will need to deploy it manually. Because jar unpacks into the current directory, create the CalaisPipes directory first and unpack the file inside it. From within the Web applications directory, execute:

mkdir CalaisPipes
cd CalaisPipes
jar -xvf ../CalaisPipes.war

The servlet is now installed.


Configuration

Under your Web applications directory you should have a file called CalaisPipes/WEB-INF/web.xml. In the example above it would be located at /usr/share/tomcat5/webapps/CalaisPipes/WEB-INF/web.xml. Edit this file in a text editor or an XML editor.

Normally, you should be able to run the servlet without configuration changes on a Unix-based system. If you're using a Windows system, you will need to change the log file path (LogFileName parameter) at a minimum.

Consult the table below regarding the semantics of each servlet parameter.

Name | Default | Description
SemanticProxyURL | disabled | URL of the Semantic Proxy, where {U} represents the URL submitted to the Semantic Proxy and {L} represents the API key. The value 'disabled' means links are not replaced even if richLinks is true
InvokeEnlightenTimeout | 60 | Timeout in seconds for each invocation of the OpenCalais Web service. Note that a single servlet request usually calls OpenCalais more than once
MaxEntitiesPerDescription | 5 | Maximum number of entities (persons, countries, etc.) that will be appended to the description of a single feed item
MaxEventsFactsPerDescription | 5 | Maximum number of events (M&As, management changes, etc.) that will be appended to the description of a single feed item
LogFileName | /usr/tmp/calaispipes_log.txt | The name of the servlet log file. The default must be changed for Windows systems. The servlet will log errors to the Servlet/JSP container's log as well
PageFetchTimeout | 60 | Timeout in seconds for fetching a feed item's text from the Web. Note that a single servlet request usually fetches more than one page
DoHTMLCleanup | false | Whether to extract visible text from HTML pages before sending to OpenCalais - see HTML Text Extraction

NOTES

  • If you are editing the web.xml file as a text file make sure you encode special characters correctly. This concerns the SemanticProxyURL parameter, which is likely to include ampersands ['&']. An ampersand is encoded as &amp; in an XML file, so instead of http://aaa.com?lic={L}&url={U}... the value should be http://aaa.com?lic={L}&amp;url={U}
  • If you wish to enable link replacement (by setting SemanticProxyURL to a value other than disabled) consult the Semantic Proxy description in http://www.opencalais.com on the required URL format
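
As a rough illustration, the init-param entries might look like the fragment below. It is a sketch rather than a copy of the shipped web.xml: the entries are assumed to sit inside the servlet element, the SemanticProxyURL value is the one used in the Example section (shown with its ampersands encoded), and the other values are the defaults listed in the table.

```xml
<!-- Sketch only: init-param entries assumed to sit inside the <servlet> element -->
<init-param>
    <param-name>SemanticProxyURL</param-name>
    <!-- Each & of the URL is written as &amp; in XML -->
    <param-value>http://pipes.opencalais.com?lic={L}&amp;url={U}&amp;ot=html</param-value>
</init-param>
<init-param>
    <param-name>InvokeEnlightenTimeout</param-name>
    <param-value>60</param-value>
</init-param>
<init-param>
    <param-name>MaxEntitiesPerDescription</param-name>
    <param-value>5</param-value>
</init-param>
<init-param>
    <param-name>MaxEventsFactsPerDescription</param-name>
    <param-value>5</param-value>
</init-param>
<init-param>
    <param-name>LogFileName</param-name>
    <param-value>/usr/tmp/calaispipes_log.txt</param-value>
</init-param>
<init-param>
    <param-name>PageFetchTimeout</param-name>
    <param-value>60</param-value>
</init-param>
<init-param>
    <param-name>DoHTMLCleanup</param-name>
    <param-value>false</param-value>
</init-param>
```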


System Configuration

Now that the servlet itself is installed and configured, additional system configuration may be
required, depending on your Web server's initial configuration. You may
examine each of the following sections and determine which ones are
relevant to you, or you may skip to the Verification section and consult this section in case of errors.


Public Access

The servlet is now installed and configured. It should be accessible by appending CalaisPipes/CalaisPipes to
the root public folder of your Servlet/JSP container. If this is the
first servlet installed on your Web server, you might not have a public
path for servlets. In this case you should configure your Web server to
forward requests for a certain folder to your Servlet/JSP container.

If you are using the Apache Web server and Tomcat as the Servlet/JSP container, you would need to add the following lines to the Apache configuration file:

<Location /tomcat/>
ProxyPass ajp://localhost:8009/ flushpackets=on
</Location>

This would divert requests for folder /tomcat
to the Tomcat container. This assumes a Connector on port 8009 is
configured in Tomcat to use AJP (for more information consult Apache
documentation on the ProxyPass directive and the Tomcat documentation on the Connector element).

In the example above, public HTTP URLs can now be constructed to point to the servlet. Assuming your Web server's root is http://www.example.com, URLs should be of the form:

http://www.example.com/tomcat/CalaisPipes/CalaisPipes?licenseID=12345

Before publishing the user guide document to your audience, search it for http://pipes.opencalais.com. Each occurrence marks a place to enter the exact URL of your particular servlet.

For Web servers and containers other than Apache and Tomcat, consult the Web server and the Servlet/JSP container's documentation in order to configure public access to the servlet.

Remember that configuration changes to the Web server and Servlet/JSP container may require a restart of the server or container.


Flushing Output

As mentioned earlier the servlet outputs a space character every 3 seconds to the calling client in order to avoid timeout
expiration in the Yahoo! Pipes implementation. The timeout in Yahoo!
Pipes expires if a Web service does not respond within 5 seconds.
However, the periodic space character tells Yahoo! Pipes that a
response is being prepared.

Note that processing a request, even for a single-item feed, is likely to take more than 5 seconds. Therefore, it is necessary that the space characters actually reach the client. The servlet takes care of flushing its own buffer, but along the way the space characters may be swallowed (or buffered).

The space character is flushed using the servlet's getWriter().flush() method.
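
The keep-alive technique can be sketched independently of the servlet API. The class below is a hypothetical illustration, not the servlet's code: it writes to any java.io.Writer (in a servlet this would be the response writer), takes the interval as a parameter (the servlet uses 3 seconds), and emits and flushes a space on a background timer until the slow work completes.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch of the "space character as keep-alive" technique.
final class KeepAliveSketch {

    /** Writes a space every intervalMillis while work runs, then writes the result. */
    static void respondWithKeepAlive(Writer out, long intervalMillis, Supplier<String> work) {
        ScheduledExecutorService ticker = Executors.newSingleThreadScheduledExecutor();
        ticker.scheduleAtFixedRate(() -> {
            try {
                out.write(' ');
                out.flush(); // push the space to the client immediately
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);

        String payload;
        try {
            payload = work.get(); // the slow part: fetching pages, calling the Web service
        } finally {
            // Stop the timer and wait for any in-flight tick before writing the payload.
            ticker.shutdown();
            try {
                ticker.awaitTermination(intervalMillis + 1000, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        try {
            out.write(payload);
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringWriter out = new StringWriter();
        respondWithKeepAlive(out, 50, () -> {
            try {
                Thread.sleep(200); // stands in for slow processing
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "<feed/>";
        });
        System.out.println("[" + out + "]"); // leading spaces followed by the payload
    }
}
```

The leading spaces are harmless only if the client tolerates leading whitespace in the response, which is what this approach relies on.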

If you are using Apache and Tomcat this is ensured by adding the flag flushpackets with the value on to the ProxyPass directive that diverts a folder from Apache to Tomcat.

If you've just added this directive in the previous section, you will note that it already includes this flag. If you are relying on an existing ProxyPass directive, you need to modify it. Simply add the words flushpackets=on at the end of the ProxyPass directive.

Remember that configuration changes to the Web server and Servlet/JSP container may require a restart of the server or container.

If you still experience timeout expiration within your Yahoo! Pipes, it is likely that your Web server or container swallows the space characters flushed by the servlet. Consult your Web server and container documentation and locate information related to writing responses to the client, buffering of responses, known bugs about buffering, etc.


Encoding Issues

The servlet's output is written in UTF-8. Depending on the configuration of your Web server and Servlet/JSP container, some characters in the servlet output may be converted to fit a different encoding. You would notice this problem when certain 'special' characters (e.g. a slanted apostrophe [‘]) are replaced by a single character (normally a question mark [?]).

This problem results from your Web server or Servlet/JSP container being configured to work with a particular character set, particularly a character set other than UTF-8. You are more likely to experience this problem on Windows servers as modern Linux servers default to UTF-8.
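
The symptom can be reproduced in isolation. The hypothetical class below encodes text with a given character set and decodes it again; a slanted apostrophe survives UTF-8 but becomes a question mark in a character set that cannot represent it (US-ASCII is used here purely as an example of such a set).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration of how a non-UTF-8 character set produces '?' characters.
final class EncodingSymptomSketch {

    /** Encodes text with the given charset, then decodes it with the same charset. */
    static String roundTrip(String text, Charset charset) {
        // getBytes replaces characters the charset cannot represent with its replacement byte.
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "It\u2018s"; // \u2018 is the slanted apostrophe
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // prints true
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII));           // prints It?s
    }
}
```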

To fix the problem when using Tomcat under Windows, add the following line to the catalina.properties file (located in the conf directory of your Tomcat installation; in the Linux example above this would be /usr/share/tomcat5/conf):

file.encoding=UTF-8

Then restart Tomcat.

On Linux systems this problem is more complicated to fix (although you are less likely to have it in the first place). The encoding is normally taken from your system's default locale. Consult your OS documentation on how to change the locale to UTF-8, but be aware that changing your Web server's locale settings may have additional undesirable effects.


HTML Text Extraction

The servlet can be configured to attempt an optimization that reduces bandwidth consumption. This is done by extracting the visible text from each HTML page before sending it to Calais for processing. A 50-60 kilobyte HTML page may hold as little as 5 kilobytes of text. To turn on this optimization, set the DoHTMLCleanup servlet init-param to true.

This feature is disabled by default due to the following reasons:

  • The extraction of the text is time-consuming, so the reduction in bandwidth is offset by longer processing time
  • The extraction's behavior varies between JVMs (Java Virtual Machines), causing undesirable results in some cases
  • The extraction's behavior varies between different HTML pages

If you wish to test this feature in your particular environment, set the DoHTMLCleanup parameter to true in the web.xml file (found under CalaisPipes/WEB-INF in your Web applications directory).

Test this thoroughly using different pipes, and make sure the results suit your needs.

The code that extracts the visible text uses the EditorKit class in the javax.swing.text package and the HTMLEditorKit class in the javax.swing.text.html package.
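
A common way to extract text with these classes is sketched below. It is an assumption about the general approach, not a copy of the servlet's source: the class and method names are hypothetical, and the exact whitespace in the result may differ between JVMs, in line with the caveats above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical sketch of visible-text extraction with the Swing HTML classes.
final class VisibleTextSketch {

    /** Parses the HTML into a Swing document and returns the document's text content. */
    static String visibleText(String html) {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();
        // Avoids an exception when the page declares its own charset in a meta tag.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            kit.read(new StringReader(html), doc, 0);
            return doc.getText(0, doc.getLength());
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("Could not extract text from HTML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(visibleText(html).trim());
    }
}
```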


Verification & Use

In order to verify the installation, construct a pipe in the Yahoo! Pipes editor containing the Fetch Feed, Web Service and Pipe Output elements.

Enter the URL of an RSS feed you wish to use for verification in the Fetch Feed edit box. Enter the appropriate URL, with a valid Calais API key, in the Web Service element's top edit box.

Click the Pipe Output element and examine the output's description field. You should see information appended by the servlet at the end of the description. Consult the Calais Yahoo! Pipes Web Service document in the package for more uses of the servlet.


Troubleshooting

This section addresses common problems.

HTTP 404 - File Not Found: If invocation of the servlet returns this error, the reason may be:

  • Typos in URL - verify its correctness
  • Invalid configuration of public access - see Public Access

HTTP 403 - User Agent Timeout: If invocation of the servlet returns this error, the reason is probably related to flushing of the servlet response. See Flushing Output

Connection Refused / Connection Timeout: These errors usually indicate the servlet is unable to connect to Web sites and retrieve the pages listed in RSS feeds. If this problem happens rarely, it is not a cause for concern. However, if invocations repeatedly cause this error, a firewall or similar network element is likely blocking outbound connections by the servlet. Make sure the servlet is allowed to establish outbound connections to any address on port 80.

Permission Denied: This error indicates that the servlet is attempting operations not permitted to the user running the Servlet/JSP container on your server. For example, if you are using Tomcat, the user running Tomcat (often called tomcat) does not have the appropriate permissions to perform operations required by the servlet. The most likely culprit is establishing outbound connections. Make sure the OS user is allowed to open sockets and establish outbound connections.

Characters Replaced with ?: If characters are replaced with ? (or another character) in the servlet's output, this is probably related to character encoding configuration. See Encoding Issues.

No Entities or Events: If you see the text '(OpenCalais found: )' attached to the description of each item (indicating that Calais did not identify any entities or events), this is likely the result of HTML Text Extraction problems. Make sure the DoHTMLCleanup servlet parameter in the web.xml file is set to false. For more information see HTML Text Extraction.


Attachment | Size
Calaispipes_08Sep21.rar | 106.3 KB
Calaispipes_08Sep21-Gzip_tar.gz | 113.64 KB