English Semantic Metadata: Entity Disambiguation
Entity disambiguation is the task of resolving the identity of an entity instance.
The Challenge of Disambiguation
Ambiguity in a name can arise from variations in how an entity may be referenced (e.g., IBM, IBM Corp., International Business Machines) or from the existence of several entities with the same name (e.g., George Bush and George W. Bush), or even from spelling mistakes in the name.
When performing entity disambiguation we use a finite reference set of unique entity identifiers (for instance, a list of all the known companies when trying to disambiguate company names). An entity instance is considered disambiguated or resolved when we have mapped the instance string to a single entry within the reference set. Sometimes it is not possible to resolve an entity with high certainty. When there are several possible resolutions, each with its own "score" or level of certainty, we will prefer the resolution with the highest score. To resolve an entity, we will use the entity name itself, and possibly additional contextual clues appearing in the text around it.
For example, when resolving a company name, clues in the text about what the company does or where it is located may help to choose among several possible identities of the company.
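The selection rule described above (score each candidate in the reference set, then prefer the highest score) can be sketched roughly as follows. This is a minimal illustration, not the OpenCalais algorithm: the Candidate class, the scoring formula, the threshold, the identifiers and the keyword lists are all assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    entity_id: str        # hypothetical identifier within the reference set
    name: str             # formal name of the entity
    keywords: frozenset   # hypothetical context clues, invented for illustration


def score(candidate, mention, context_words):
    """Toy score: 1.0 for a name match plus the fraction of matching keywords."""
    name_score = 1.0 if mention.lower() in candidate.name.lower() else 0.0
    if candidate.keywords:
        overlap = len(candidate.keywords & context_words) / len(candidate.keywords)
    else:
        overlap = 0.0
    return name_score + overlap


def resolve(mention, context, reference_set, threshold=1.0):
    """Return the highest-scoring candidate, or None if no score exceeds the threshold."""
    context_words = {w.strip(".,").lower() for w in context.split()}
    scored = [(score(c, mention, context_words), c) for c in reference_set]
    if not scored:
        return None
    best_score, best = max(scored, key=lambda pair: pair[0])
    # A name match alone (score 1.0) is treated as too uncertain to resolve.
    return best if best_score > threshold else None


# Illustrative reference set; the identifiers and keywords are made up.
REFERENCE_SET = [
    Candidate("company/olympus-optical", "Olympus Optical Co. Ltd.",
              frozenset({"camera", "optical", "lens"})),
    Candidate("company/olympus-lmse", "Olympus Life and Material Science Europa",
              frozenset({"microscope", "laboratory", "europe"})),
]

resolved = resolve("Olympus",
                   "Olympus announced a new camera with improved optical zoom.",
                   REFERENCE_SET)
unresolved = resolve("Olympus", "Olympus reported earnings.", REFERENCE_SET)
```

In this sketch the first call resolves to the optical company because of the context words, while the second call returns None: with no contextual clues, both candidates tie on the name alone and neither is certain enough.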
OpenCalais now supports three types of entity disambiguation: Company disambiguation, Geographical disambiguation and Product (Electronics) disambiguation.
The RDF output format provides the most comprehensive information about disambiguated companies, geographies and electronic products.
Disambiguation is turned on by default. To disable this feature, include the following parameter in the processing directives in paramsXML:
c:discardMetadata="er/Company;er/Geo;er/Product"
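As a rough sketch of where this attribute might sit in a request, the snippet below builds a minimal paramsXML string in Python. Only the c:discardMetadata attribute and its value come from the text above; the namespace URI, the params and processingDirectives element names, and the build_params_xml helper are assumptions and should be checked against the API reference.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the "c" prefix; verify against the API reference.
CALAIS_NS = "http://s.opencalais.com/1/pred/"


def build_params_xml(discard=("er/Company", "er/Geo", "er/Product")):
    """Build a minimal paramsXML string that discards the given metadata types."""
    ET.register_namespace("c", CALAIS_NS)
    # Element names are assumptions about the paramsXML layout.
    params = ET.Element(f"{{{CALAIS_NS}}}params")
    directives = ET.SubElement(params, f"{{{CALAIS_NS}}}processingDirectives")
    # Metadata types are joined with semicolons, as in the directive above.
    directives.set(f"{{{CALAIS_NS}}}discardMetadata", ";".join(discard))
    return ET.tostring(params, encoding="unicode")


params_xml = build_params_xml()
```

Passing a shorter tuple, for example ("er/Geo",), would presumably disable only geographical disambiguation, assuming the service accepts a partial list.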
Company Disambiguation
Company disambiguation resolves company names, for example determining whether a mention of Olympus refers to Olympus Optical Co. Ltd. or Olympus Life and Material Science Europa. The resolution output for a given company mention includes:
- A URI that is unique and uniform across documents
- The formal English legal name of the company
- The company's ticker symbol (for public companies)
For company names that cannot be disambiguated, the returned results will include no resolution information.
Company disambiguation uses a proprietary Thomson Reuters database of companies as its reference set and is tuned for public companies. In future versions we will incorporate additional publicly available reference sets and fine-tune the disambiguation process for non-public companies as well.
Geographical Disambiguation
Geographical disambiguation applies to location entities (City, ProvinceOrState, and Country) and determines, for example, whether "Calais" refers to Calais, Maine or Calais, France. The resolution output for a given geography mention includes:
- A URI that is unique and uniform across documents
- The full resolved name:
  - City: the resolved name will include the city name, the province/state name (if applicable), and the country name.
  - ProvinceOrState: the resolved name will include the province/state name and the country name.
- The geographical coordinates (latitude, longitude) of the resolved geography
For geographies that cannot be disambiguated, the returned results will include no resolution information.
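The composition of the full resolved name can be illustrated with a small helper. This is a sketch under stated assumptions: the format_resolved_name function is hypothetical, the comma separator is a guess, and the Country case (which the list above does not describe) is assumed to be the country name alone.

```python
def format_resolved_name(entity_type, city=None, province_or_state=None, country=None):
    """Compose a resolved name from its parts, skipping parts that do not apply."""
    if entity_type == "City":
        parts = [city, province_or_state, country]
    elif entity_type == "ProvinceOrState":
        parts = [province_or_state, country]
    elif entity_type == "Country":
        # Assumption: a country resolves to its own name only.
        parts = [country]
    else:
        raise ValueError(f"unsupported entity type: {entity_type}")
    # The ", " separator is an assumption about the output format.
    return ", ".join(p for p in parts if p)


maine = format_resolved_name("City", city="Calais",
                             province_or_state="Maine", country="United States")
france = format_resolved_name("City", city="Calais", country="France")
```

Here the two readings of "Calais" yield distinct resolved names, one with a state component and one without.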
Geographical disambiguation uses Freebase as its reference set. Future versions will incorporate additional publicly available reference sets.
Product (Electronics) Disambiguation
Disambiguation of electronic products helps identify the full name of products that are often referred to by short, ambiguous names such as SD100 (e.g., "I love my SD100 camera" refers to the "Canon PowerShot SD100 / IXUS II Digital Camera").
The resolution output for a given electronic product mention includes the full product name. For products that cannot be disambiguated, the returned results will include no resolution information.
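One simple way to picture this mapping from a short name to a full product name is a token lookup against a catalogue. The sketch below is purely illustrative and is not how the service works: the resolve_product helper and the second catalogue entry are hypothetical, and a real catalogue lookup would need fuzzier matching.

```python
import re


def resolve_product(mention, catalogue):
    """Return the single catalogue name containing the mention as a token, else None."""
    wanted = mention.upper()
    matches = []
    for full_name in catalogue:
        # Split the full name into alphanumeric tokens, ignoring case.
        tokens = {t.upper() for t in re.split(r"[^0-9A-Za-z]+", full_name) if t}
        if wanted in tokens:
            matches.append(full_name)
    # Resolve only when exactly one entry matches; otherwise report nothing.
    return matches[0] if len(matches) == 1 else None


# The first entry comes from the example above; the second is hypothetical.
CATALOGUE = [
    "Canon PowerShot SD100 / IXUS II Digital Camera",
    "Example Brand SD200 Digital Camera",
]

full_name = resolve_product("SD100", CATALOGUE)
ambiguous = resolve_product("Camera", CATALOGUE)
```

A mention that matches several entries, or none, returns None, mirroring the case where the results include no resolution information.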
Disambiguation of electronic products currently uses the Shopping.com catalogue as its reference set. Future versions will incorporate additional sets.
