finally a bnode with a uri

Posts tagged with: zemanta

Linked Data Entity Extraction with Zemanta and OpenCalais

A comparison of the NER APIs by Zemanta and OpenCalais.
I had another look at the Named Entity Extraction APIs by Zemanta and OpenCalais for some product launch demos. My first test from last year concentrated more on the Zemanta API. This time I had a closer look at both services, trying to identify the "better one" for "BlogDB", a semi-automatic blog semantifier.

My main need is a service that receives a cleaned-up plain text version of a blog post and returns normalized tags and reusable entity identifiers. So, the findings in this post are rather technical and just related to the BlogDB requirements. I ignored features which could well be essential for others, such as Zemanta's "related articles and photos" feature, or OpenCalais' entity relations ("X hired Y" etc.).

Terms and restrictions of the free API

  • The API terms are pretty similar (the wording is actually almost identical). You need an API key and both services can be used commercially as long as you give attribution and don't proxy/resell the service.
  • OpenCalais gives you more free API calls out of the box than Zemanta (50,000 vs. 1,000 per day). You can get a free upgrade to 10,000 Zemanta calls via a simple email, though (or via excessive API use; Andraž auto-upgraded my API limit when he noticed my crazy HDStreams test back then ;-).
  • OpenCalais lets you process larger content chunks (up to 100K, vs. 8K at Zemanta).

Calling the API

  • Both interfaces are simple and well-documented. Calls to the OpenCalais API are a tiny bit more complicated, as you have to encode certain parameters in an XML string; Zemanta uses simple query string arguments. I've added the respective PHP snippets below; the complexity difference is negligible.
    // OpenCalais: processing/user directives are wrapped in an XML string (paramsXML)
    function getCalaisResult($id, $text) {
      $parms = '
        <c:params xmlns:c="http://s.opencalais.com/1/pred/"
                  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
          <c:processingDirectives
            c:contentType="TEXT/RAW"
            c:outputFormat="XML/RDF"
            c:calculateRelevanceScore="true"
            c:enableMetadataType="SocialTags"
            c:docRDFaccessible="false"
            c:omitOutputtingOriginalText="true"
            ></c:processingDirectives>
          <c:userDirectives
            c:allowDistribution="false"
            c:allowSearch="false"
            c:externalID="' . $id . '"
            c:submitter="http://semsol.com/"
            ></c:userDirectives>
          <c:externalMetadata></c:externalMetadata>
        </c:params>
      ';
      $args = array(
        'licenseID' => $this->a['calais_key'],
        'content' => urlencode($text),
        'paramsXML' => urlencode(trim($parms))
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.opencalais.com/enlighten/rest/';
      return $this->getAPIResult($url, $qs);
    }
    
    // Zemanta: everything goes into plain query string arguments
    function getZemantaResult($id, $text) {
      $args = array(
        'method' => 'zemanta.suggest',
        'api_key' => $this->a['zemanta_key'],
        'text' => urlencode($text),
        'format' => 'rdfxml',
        'return_rdf_links' => '1',
        'return_articles' => '0',
        'return_categories' => '0',
        'return_images' => '0',
        'emphasis' => '0',
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.zemanta.com/services/rest/0.0/';
      return $this->getAPIResult($url, $qs);
    }
    
  • The actual API call is then a simple POST:
    // POST the prepared query string to the given endpoint via ARC2's HTTP reader
    function getAPIResult($url, $qs) {
      ARC2::inc('Reader');
      $reader = new ARC2_Reader($this->a, $this);
      $reader->setHTTPMethod('POST');
      $reader->setCustomHeaders("Content-Type: application/x-www-form-urlencoded");
      $reader->setMessageBody($qs);
      $reader->activate($url);
      $r = '';
      while ($d = $reader->readStream()) {
        $r .= $d;
      }
      $reader->closeStream();
      return $r;
    }
    
  • Both APIs are fast.

API result processing

  • The APIs return rather verbose data, as they have to stuff in a lot of meta-data such as confidence scores, text positions, internal and external identifiers, etc. But they also offer RDF as one possible result format, so I could store the response data as a simple graph and then use SPARQL queries to extract the relevant information (tags and named entities). Below is the query code for Linked Data entity extraction from Zemanta's RDF. As you can see, the graph structure isn't trivial, but still understandable:
    SELECT DISTINCT ?id ?obj ?cnf ?name
    FROM <' . $g . '> WHERE {
      ?rec a z:Recognition ;
           z:object ?obj ;
           z:confidence ?cnf .
      ?obj z:target ?id .
      ?id z:targetType <http://s.zemanta.com/targets#rdf> ;
          z:title ?name .
      FILTER(?cnf >= 0.4)
    } ORDER BY ?id
    

Extracting normalized tags

  • OpenCalais results contain a section with so-called "SocialTags" which are directly usable as plain-text tags.
  • The tag structures in the Zemanta result are called "Keywords". In my tests they only contained a subset of the detected entities, and so I decided to use the labels associated with detected entities instead. This worked well, but the respective query is more complex.
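The SocialTag extraction from the stored OpenCalais graph is then a short SPARQL query. A sketch (the SocialTag type and the c:name/c:importance predicates are taken from OpenCalais' RDF output as I understand it; double-check them against the actual response):

```sparql
PREFIX c: <http://s.opencalais.com/1/pred/>
SELECT DISTINCT ?tag ?label ?importance
FROM <' . $g . '> WHERE {
  ?tag a <http://s.opencalais.com/1/type/tag/SocialTag> ;
       c:name ?label ;
       c:importance ?importance .
} ORDER BY ?importance
```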

Extracting entities

  • In general, OpenCalais results can be directly utilized more easily. They contain stable identifiers and the identifiers come with type information and other attributes such as stock symbols. The API result directly tells you how many Persons, Companies, Products, etc. were detected. And the URIs of these entity types are all from a single (OpenCalais) namespace. If you are not a Linked Data pro, this simplifies things a lot. You only have to support a simple list of entity types to build a working semantic application. If you want to leverage the wider Linked Open Data cloud, however, the OpenCalais response is just a first entry point. It doesn't contain community URIs. You have to use the OpenCalais website to first retrieve disambiguation information, which may then (often involving another request) lead you to the decentralized Linked Data identifiers.
  • Zemanta responses, in contrast, do not (yet, Andraž told me they are working on it) contain entity types at all. You always need an additional request to retrieve type information (unless you are doing nasty URI inspection, which is what I did with detected URIs from Semantic CrunchBase). The retrieval of type information is done via Open Data servers, so you have to be able to deal with the usual down-times of these non-commercial services.
  • Zemanta results are very "webby" and full of community URIs. They even include sameAs information. This can be a bit overwhelming if you are not an RDFer, e.g. looking up a DBPedia URI will often give you dozens of entity types, and you need some experience to match them with your internal type hierarchy. But for an open data developer, the hooks provided by Zemanta are a dream come true.
  • With Zemanta associating shared URIs with all detected entities, I noticed network effects kicking in a couple of times. I used RWW articles for the test, and in one post, for example, OpenCalais could detect the company "Starbucks" and "Howard Schultz" as their "CEO", but their public RDF (when I looked up the "Howard Schultz" URI) didn't persist this linkage. The detection scope was limited to the passed snippet. Zemanta, on the other hand, directly gave me Linked Data URIs for both "Starbucks" and "Howard Schultz", and these identifiers make it possible to re-establish the relation between the two entities at any time. This is a very powerful feature.
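To give an idea of that additional request: a minimal type lookup against the DBpedia SPARQL endpoint could look like this (the Starbucks URI is just an example entity):

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?type WHERE {
  <http://dbpedia.org/resource/Starbucks> rdf:type ?type .
}
```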

Summary

Both APIs are great. The quality of the entity extractors is awesome. For the RWW posts, which deal a lot with Web topics, Zemanta seemed to have a couple of extra detections (such as "ReadWriteWeb" as a company). As usual, some owl:sameAs information is wrong, and Zemanta uses incorrect Semantic CrunchBase URIs (".rdf#self" instead of "#self" // Update: to be fixed in the next Zemanta API revision), but I blame us (the RDF community), not the API providers, for not making these things easier to implement.

In the end, I decided to use both APIs in combination, with an optional post-processing step that builds a consolidated, internal ontology from the detected entities (OpenCalais has two Company types which could be merged, for example). Maybe I can make a Prospect demo from the RWW data public, not sure if they would allow this. It's really impressive how much value the entity extraction services can add to blog data, though (see the screenshot below, which shows a pivot operation on products mentioned in posts by Sarah Perez). I'll write a bit more about the possibilities in another post.
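The type consolidation mentioned above can be done with a couple of SPARQL+ INSERT queries. A sketch for merging the two OpenCalais company types into a single internal type (the internal my: namespace and the exact OpenCalais type URIs are illustrative, not taken from the API docs):

```sparql
INSERT INTO <mappings> {
  ?entity a my:Company .
}
WHERE {
  { ?entity a <http://s.opencalais.com/1/type/em/e/Company> . }
  UNION
  { ?entity a <http://s.opencalais.com/1/type/er/Company> . }
}
```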

RWW posts via BlogDB

Connecting the LOD dots with Calais 4.0 and Zemanta

A fun experiment using open data, RSS, Open Calais and Zemanta.
A couple of weeks ago I wrote about the exciting possibilities of LOD-enabled NLP APIs and whether they could bring another power-up to RDF developers by simplifying the creation of DIY Semantic Web apps. When Thomson Reuters released Calais 4.0 two days ago, I had a go.

The idea: Create a simple tool that aggregates bookmarks and microposts for a given set of tags (from Twitter, identi.ca, Delicious, and ma.gnolia), pumps them through Calais and Zemanta, and then lets me browse the incoming stream of items based on typed entities, not just keywords. Something like a poor man's Twine, but with a fully-fledged SPARQL API and content automagically enhanced from LOD sources. Check out this month's Semantic Web Gang podcast for more details about Calais.

I set myself a time limit of one person day, so I ended up with just a very basic prototype, but it already shows the network effect kicking in when distributed data fragments can be connected through shared identifiers. Each of the discovered facets can be used as a smart filter (e.g. "Show me only items related to the Person Tim Berners-Lee"), and we could also pull in more information about the entities, as we know their respective LOD URI.

Wish I had funds to explore this a little more, but below is a screenshot showing the "HD Streams" test app in action. It basically sends each micropost and bookmark to the APIs and then does lookups to DBPedia, Semantic CrunchBase, and Freebase to retrieve additional type information, plus a set of SPARQL+ INSERT queries to later accelerate the filtering.

There are some false positives (e.g. the Calais NLP service is typed as a place), but the APIs offer a score for each detection and I've set the barrier for inclusion very low. The interesting thing is that the grouping of items in the facets column is actually done via LOD information. The APIs only return IDs (or URIs), say, for Berlin, but this reference allows HD Streams to pull in more information and then associate Berlin with the "Place" filter.

This, however, is only the most simple use. The really exciting next step would be smart facets based on the aggregated information. Thanks to SPARQL, I could easily add filters that dive deeper into the LOD-enhanced graph. Like "Filter by posts related to Capitals in Europe", or related to places within a certain lat/long boundary, or with a population larger than x, or about products by competitors of y.
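A facet like that would be a single graph query on top of the hds:relatedEntity shortcuts. A sketch for the "Capitals in Europe" filter (the DBpedia property and category URIs are illustrative; the actual terms would have to be checked against the dataset):

```sparql
SELECT DISTINCT ?item WHERE {
  ?item hds:relatedEntity ?place .
  # ?place is a LOD URI, so we can walk into the DBpedia data we pulled in
  ?country <http://dbpedia.org/ontology/capital> ?place .
  ?country <http://purl.org/dc/terms/subject>
           <http://dbpedia.org/resource/Category:Countries_in_Europe> .
}
```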

Something the prototype is not doing yet is expanding shortened URLs; those could be normalized, too. Calais 4.0 already does URL extraction, so this would just be another SPARQL query and a little PHP loop. Then we could add a simple ranking algorithm based on the number of tweets about a certain URL. The current app took just about 12 hours of work; RDF's extensible data model accelerated development through all stages of the process (well, ok, not during the design/theming phase ;). I didn't have to analyze the data coming from the two APIs at all. No pre-coding schema considerations. I just loaded everything into my schema-free RDF store and then used incrementally improved graph queries to identify the paths I needed.

For geeks: Below is the SPARQL+ snippet that injects LOD entity and label shortcuts from Zemanta results directly into the item descriptions ($res is the URL of an RSS item or bookmark; hds is the namespace prefix used by HD Streams):
INSERT INTO <' . $res . '> {
  <' . $res . '> hds:relatedEntity ?lod_entity .
  ?lod_entity hds:label ?label .
}
WHERE {
  <' . $res . '> hds:zemantaDoc ?z_doc .
  ?z_result z:doc ?z_doc ; z:confidence ?conf ; z:object ?z_entity .
  ?z_entity owl:sameAs ?lod_entity .
  ?lod_entity z:title ?label .
  FILTER(?conf > 0.2)
  FILTER(REGEX(str(?lod_entity), "(freebase|dbpedia|cb.semsol)"))
}

I've said it before, but it's worth repeating: RDF and SPARQL are great solutions for today's (and tomorrow's) data integration problem, but they are equally impressive as productivity boosters for software developers.

HD Streams

Zemanta releases LOD-connected NLP API

Results from Zemanta's new Open Semantic API are interlinked with DBPedia, Freebase, MusicBrainz, Semantic CrunchBase, etc.
When I read OpenCalais' pre-announcement of Calais 4 a couple of weeks ago, I got pretty excited about their plan to offer an NLP API that can be combined with entities from the LOD cloud. It seems we don't have to wait any longer: Today, Zemanta beat the Calais team to it with the release of a new Semantic API (More details on TC). Andraž Tori already (and kindly) sent me a file with hundreds of mappings for Semantic CrunchBase which I'm going to include in the coming days.

I think these APIs have the potential to sweep away feature-poor or closed services in favor of personal DIY SemWeb apps (my first reaction to the Calais 4 post was "bye bye Twine"). Think of a simple RDF/SPARQL tool that, based on a set of tags, subscribes to feeds from the major bookmarking (or other) services, pumps the links through Zemanta's API, and then delivers all the things that might interest you. It wouldn't require a new bookmarking service, it could let you filter by company or product, or even limit results to suggestions by people in your social network. Such an app could provide rich add-on information from LOD datasets like DBPedia. In a very light-weight, loosely coupled, "On Demand" fashion.
