finally a bnode with a uri

Data Web UX Challenges: Supporting different Roles

The challenge of offering the best tools for each task context.
Some hasty User Experience (UX) thoughts inspired by books I'm currently reading, and adapted to the Data Web context:

Despite a blurry separation between roles in software creation, there often is a personal tendency either towards Information Architecture and Business Analysis or towards Interaction and User Interface Design. In the Data Web space, tool developers and demonstrator creators still seem to be stronger in the former than the latter.

Now how do you get from here to there (where "there" equals more user-friendliness)? Maybe getting a better picture of the different roles your tool has to satisfy can help, as well as deciding on the target audience in the case of an end-user-facing application. For example:
  • Data Providers: main focus is getting a schema and related data published without loss of quality.
  • Data Engineers: care deeply about schema mappings.
  • Information Architects: agile schema change management.
  • Interaction Designers: simple APIs to integrate real data into highly customizable templates at the post-mockup stage.
  • Content and Data Editors: convenient editing tools.
  • ...

You should need fewer roles to describe the target user group for a particular application (otherwise this may be a sign that your focus is too wide):
  • Curator: convenient editing tools.
  • Citizen: wants to be heard, wants browsing tools.
  • Data Journalist: wants navigation and convenient extraction tools.
  • Data Analyst: comparisons and insight generation.
  • ...

For an end-user application, you can often narrow down the target audience and then create a highly tailored user experience. Less so for a tool. And it gets even harder when your tool implements a specification whose purpose is to broaden the reach of a technology. Like the Semantic Web and its younger relative Linked Data. They both want to simplify Knowledge Management to a level where it can work across the Web. The early adopters in this space range from AI pros to front-end enhancers.

I think it's worth spending the time to ensure the best possible experience for each of the roles you can identify for your app or tool. Even if this means that you have to create several tools that operate against the same source. Let's say you use RDFa or microdata in your HTML templates. You may then need separate visual access methods for Data Engineers and Interaction Designers. Similarly, you may please your product development team with a bespoke API, while the labs team appreciates a bleeding-edge SPARQL endpoint to explore new options.

Moving forward back to Self-Employment

I'm self-employed again after an inspiring year at Talis.
My time at Talis Systems officially ended last week. I joined the team during painful times, but I'm glad (and proud) to have been a Talisian at least for one year. I have had a few freelance gigs with Talis before, but being part of the team was a whole different thing. And I could frequently travel to the UK, immprooff my inklish, and discover the nice city of Birmingham. There's a reason why they have that G in GB.

Work-wise, I probably learned more in the last 12 months than during the previous 5 years combined - hat tip to Julian, Leigh and all the other (Ex-)Talis folks. And much of that goes beyond just technical skills. I don't want to bore you, but you can definitely learn a lot about your path through life when you get the opportunity to look at it from a different perspective. Apparently, I first had to become an employee working in a foreign city to see the bigger picture around why I boarded that Semantic Web roller coaster in the first place and where it overlaps with my own ideas and interests.

So I am going back to self-employment. And I am also going to stay in the emerging Data Web market. But I'll approach some things differently this time.

First, change of attitude. To contribute in a personally healthier way again. I won't argue about technical details and specifications any more. That just turns me into a grumpy person (belated apologies). I doubt that promoting products by advertising their underlying technologies is the best way to establish and grow a market anyway. That's like trying to heat a room by just burning a lot of matches. Promising, with renewed anticipation after each match, but useless without some larger fire in the end. I would like to help spark off these larger fires. Without constantly burning my fingers (OK, enough fire imagery ;-).

The second change is related, and it is about focus. While I still see many people using the ARC2 toolkit, I have had more encouraging feedback and signs of demand recently around my work for end users (including app developers, in a sense). So my new mission is to improve "information interaction" on the Web, and I'll be offering services in that area.

And it looks like I'm off to a good start. I am already fully booked for the next few months.

I'm joining Talis!

I'll start working for Talis' Kasabi team
KASABI data marketplace
I received a number of very interesting job offers when I began searching for something new last month, but there was one company that stood out, and that is Talis. Not only do I know many people there already, I also find Talis' new strategic focus and products very promising. In addition, they know and use some of my tools already, and I've successfully worked on Talis projects with Leigh and Keith before. The job interview almost felt like coming home (and the new office is just great).

So I'm very happy to say that I'm going to become part of the Kasabi data marketplace team in September where I'll help create and productise data management and data market tools.

BeeNode
I will have to get up to speed with a lot of new things, and the legal and travel cost overhead for Talis is significant, so I hope I can turn this into a smart investment for them as quickly as possible. I'll even rename my blog if necessary... ;-) For those wondering about the future of my other projects, I'll write about them in a separate post soon.

Can't wait to start!

Want to hire me?

Seriously. I am looking for a full-time job.
I have been happily working as a self-employed semantic web developer for the last seven years. With steady progress, I dare say, but the market is still evolving a little bit too slowly for me (well, at least here in Germany) and I can't keep investing any longer. So I am looking for new challenges and an employer who would like to utilize my web technology experience (semantic or not). I have created a new personal online profile with detailed information about me, my skills, and my work.

My dream job would be in the social and/or data web area, I'm particularly interested in front-end development for data-centric or stream-oriented environments. I also love implementing technical specifications (probably some gene defect).

The potential show-stopper: I can't really relocate, for private reasons. I am happy to (tele)commute or travel, though. And I am looking for full-time employment (or a full-time, longer-term contract). I am already applying for jobs, mainly here in Düsseldorf so far, but I thought I'd send out this post as well. You never know :)

Schema.org - Threat or Opportunity?

Some thoughts about the impact of Schema.org
I only wanted to track SemTech chatter but it seems all semantics-related tweet streams are discussing just one thing right now: Schema.org. So I apparently will have to build a #semtech filtering app, but I couldn't resist and had a close look at Schema.org, too. And just like everybody else, I'll join the fun of polluting the web with yet another opinion about its potential impact on the Semantic Web initiative and related efforts.

What exactly is Schema.org?

  • It is a list of instructions for adding structured data to HTML pages.
  • Webmasters can choose from a long, but finite list of types and properties.
  • Data-enhanced web pages trigger richer displays in Google/Bing/Yahoo search result pages.

Why the uproar?

  • Schema.org proposes the use of Microdata, a rather new structured data format that was not developed by the RDF community.
  • Schema.org introduces a new vocabulary which doesn't re-use terms from existing RDF schemas.

Who can benefit from it?

  • The web, because the simple template-like instructions on schema.org will boost the amount of structured data, similar to Facebook's Open Graph Protocol.
  • The semantic web market, by offering complementary as well as extending/competing solutions.
  • SEO people, because they can offer their service with less effort.
  • Website owners, who can more reliably customize their search engine displays and increase CTRs.
  • Possibly HTML5 (doctype) deployment, because the supported structures are based on HTML5's Microdata.
  • Verticals around popular topics (Music, Food, ...) because the format shakeout will make their parser writers' lives easier.
  • Verticals who manage to successfully establish a schema.org extension (e.g. Job Offers).
  • The search engine companies involved, because extracting (known) structures can be less expensive and more accurate than NLP and statistical analysis. Controlling the vocabulary also means being able to tailor it to semantic advertising needs; integrating the schema.org taxonomy with AdWords would make a lot of (business) sense. And finally, the search engines can more easily generate their own verticals now (as Google has already successfully done with shopping and recipe browsers), making it harder for specialized aggregators to gain market share.
  • Spammers, unless the search engines manage to integrate the structured markup with their existing stats-based anti-spam algorithms.

Who might be threatened and how could they respond?

  • Microformats and overlapping RDF vocabularies such as FOAF (unlikely) or GoodRelations, which Schema.org already calls "earlier work". Even if they continue to be supported for the time being, implementers will switch to schema.org vocabulary terms. One opportunity for RDF schema providers lies in grounding their terms in the schema.org taxonomy and highlighting use cases beyond the simple SEO/Ad objectives of Schema.org. RDF vocabs excel in the long tail, and there are many opportunities left (especially for non-motorcycle businesses ;-). This will best work out if there are finally going to be applications that utilize these advanced data structures. If the main consumers continue to be search engines, there is little incentive to invest in higher granularity.
  • The RDFa community. They think they are under attack here, and I wonder if Manu is overreacting perhaps? Hey, if they had listened to me they wouldn't have this problem now, but they had several reasons to stick to their approach and I don't think these arguments get simply wiped away by Schema.org. They may have to spend some energy now on keeping Facebook on board, but there are enough other RDFa adopters that they shouldn't be worried too much. And, like the RDF vocab providers, they should highlight use cases beyond SEO. The good news is that potential spam problems, which are more likely to occur in the SEO context, will now get associated with Microdata, not RDFa. And the Schema.org graph can be manipulated by any site owner while Facebook's interest graph is built by authenticated users. Maybe the RDFa community shouldn't have taken the SEO train in the first place anyway. Now Schema.org simply stole the steam. After all, one possible future of the semantic web was to creatively destroy centralized search engines, and not to suck up to them. So maybe Schema.org can be interpreted as a kick in the back to get back on track.
  • The general RDF community, but unnecessarily so. RDFers kicked off a global movement which they can be proud of, but they will have to accept that they no longer dictate what the semantic web is going to look like. Schema.org seems to be a syntax fight, but Microdata maps nicely to RDF, which RDFers often ignore (that's why schema.rdfs.org was so easy to set up). The real wake-up call is less obvious. I'm sure that until now, many RDFers didn't notice that a core RDF principle is dying. RDFers used to think that distinct identifiers for pages and their topics are needed. This assumption was already proved wrong when Facebook started their page-based OGP effort. Now, with Schema.org's canonical URLs, we have a second, independent group that is building a semantic web truly layered on top of the existing web, without identifier mirrors (and so far without causing any URI identity crisis). This evolving semantic web is closer to the existing web than the current linked data layer, and probably even more compatible with OWL, too. There is a lot we can learn. Instead of becoming protective, the RDF community should adapt and simplify their offerings if they want to keep their niches relevant. Luckily, this is already happening, as e.g. the Linked Data API demonstrates. And I'm very happy to see Ivan Herman increasingly speaking/writing about the need to finally connect web developers with the semantic web community.
  • Early adopters in the CMS market. Projects like Drupal and IKS have put non-trivial resources into integrating machine-readable markup, and most of them are using RDFa. Microdata, in my experience, is easier to tame in a CMS than RDFa, especially when it comes to JavaScript operations. But whether semantic CMSs should add support for (or switch to) Schema.org microdata and their vocabulary depends more on whether they want/need to utilize SEO as a (short-term) selling proposition. Again, this will also depend on application developer demands.

What about Facebook?

Probably the more interesting aspect of this story: what will Facebook do? Their interest graph combined with linked data has big potential, not only for semantic advertising. And Facebook is interested in getting as many of their hooks into websites as possible. Switching to Microdata and/or aligning their types with Schema.org's vocabulary could make sense. Webmasters would probably welcome such a consolidation step as well. On the other hand, Facebook is known for wanting to keep things under their own control, too, so the chance of them adopting Schema.org and Microdata is rather low. This could well turn into an RSS déjà vu with a small set of formats (OGP-RDFa, full RDFa, Schema.org-Microdata, full Microdata) fighting for publisher and developer attention.

Conclusion

I am glad that Microdata finally gets some deserved attention and that someone acknowledged the need for a format that is easy to write and to consume. Yes, we'll get another wave of "see, RDF is too complicated" discussions, but we should be used to them by now. I expect RDF toolkits to simply integrate Microdata parsers soon-ish (if we're good at one thing then it's writing parsers), and the Linked Data community gets just another taxonomy to link to. Schema.org owns the SEO use case now, but it's also a nice starting point for our more distributed vision. The semantic web vision is bigger than data formats and it's definitely bigger than SEO. The enterprise market which RDF has mainly been targeting recently is a whole different beast anyway. No kittens killed. Now go build some apps, please ;-)

How to add a missing SQLite extension to an existing PHP system (CentOS)

Creating an sqlite3.so module for your existing PHP setup.
So I happily managed to upgrade my VServer to PHP 5.3. However, this did not enable SQLite 3 to be accessible from PHP. Luckily, I was not the only one facing this problem, and Mike Creuzer's answer on Stack Overflow worked for me almost without any changes. Here is the variation that worked on my 1&1 VServer:

The server didn't have the needed compiler, so I had to install gcc first:
yum install gcc
I had to install/update the php-devel package to enable extension building:
rpm -Uvh http://repo.webtatic.com/yum/centos/5/latest.rpm
yum --enablerepo=webtatic update php-devel
I needed a different PHP version ("php --version" will tell you which):
wget http://de.php.net/get/php-5.3.6.tar.gz/from/this/mirror
tar zxvf php-5.3.6.tar.gz
I wanted SQLite version 3, not 2:
cd php-5.3.6/ext/sqlite3
The rest was basically identical then:
phpize
./configure
make
make install
I created the extension file for SQLite3 (/etc/php.d/sqlite3.ini), added a pointer to the sqlite3.so, and now (after restarting Apache) SQLite 3 is available via PHP's PDO interface. Stack Overflow++ :)
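If you want to double-check from PHP itself, a few lines like these should do (a quick sketch; the database path is a throwaway example, and the PDO route assumes the pdo_sqlite driver is present as well):

  <?php
  // Sanity check: is the sqlite3 extension loaded, and does PDO list a SQLite driver?
  var_dump(extension_loaded('sqlite3'));
  var_dump(in_array('sqlite', PDO::getAvailableDrivers()));

  // Throwaway database file, just to prove that reads and writes work.
  $db = new PDO('sqlite:/tmp/sqlite-check.db');
  $db->exec('CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, label TEXT)');
  $db->exec("INSERT INTO t (label) VALUES ('it works')");
  print_r($db->query('SELECT * FROM t')->fetchAll(PDO::FETCH_ASSOC));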

SeeAlso: CentOS locale issues.

Upgrading an outdated CentOS-VServer to PHP 5.3 with JSON enabled

A very easy way to switch from a restricted PHP 5.2.6 to PHP 5.3.6 without destroying Plesk.
I'm slowly migrating my servers away from 1&1 to DomainFactory, a more modern hosting provider here in Germany. Slowly, because 1&1 has long-running contracts and I will have to keep a few machines until the end of the year.

So I still have a spare 1&1 VServer running a slightly outdated CentOS 5.4 or so which I would like to use for site previews and tests. Among other things, I'm exploring the possibilities of a simplified lib for Semantic Web development, which involves some of the newer features of PHP 5.3 in combination with JSON-based data processing (and possibly SQLite). Unfortunately, my VServer didn't provide either, and the server initialization panel doesn't offer any current OS, possibly due to the bundled Plesk.

Although I did manage to upgrade either CentOS or PHP via yum, each time I ended up with a broken Plesk. Day-long story shorter: I eventually found a nice and simple guide at webtatic.com that worked just fine. The trick is to keep the CentOS as-is (Plesk apparently has issues with certain versions of SSL) and to update PHP with 5.3 packages that still use the pre-5.3 naming scheme (i.e. "php-common" instead of "php53-common" etc.). This way you won't get module and dependency conflicts; the new PHP is just handled as a basic update :)

Here is a quick summary of the commands needed (via webtatic):
rpm -Uvh http://repo.webtatic.com/yum/centos/5/latest.rpm
yum --enablerepo=webtatic update php
And that's it. The only thing still missing is SQLite, which I didn't manage to get to work. The PDO extension is there, but PHP itself was initially configured as '--without-sqlite3'. For SQLite experiments, I will probably try JiffyBox, which I played with a little yesterday and which looks very promising.

Update: SQLite is now working, too.

2011 Resolutions and Decisions

I'm shifting focus from infrastructure and research to solutions and customer projects.
All right, this post could easily have become another rant about the ever-growing complexity of RDF specifications, but I'll turn it into a big shout-out to the Semantic Web community instead. After announcing the end of investing further time into ARC's open-source branch, I received so many nice tweets and mails that I was reminded of why I started the project in the first place: The positive vibe in the community, and the shared vision. Thank you very much everybody for the friendly reactions, I'm definitely very moved.

Some explanations: I still share the vision of machine-readable, integration-ready web content, but I have to face the fact that the current approach is getting too expensive for web agencies like mine. Luckily, I could spot a few areas where customer demands meet the cost-efficient implementation of certain spec subsets. (Those don't include comprehensive RDF infrastructure and free services here, though. At least not yet, and I just won't make further bets). The good news: I will continue working with semantic web technologies, and I'm personally very happy to switch focus from rather frustrating spec chasing to customer-oriented solutions and products with defined purposes. The downside: I have to discontinue a couple of projects and services in order to concentrate my energy and reduce (opportunity) costs. These are:
  • The ARC website, mailing list, and other forms of free support. The code and documentation get a new home on GitHub, though. The user community is already thinking about setting up a mailing list on their own. Development of ARC is going to continue internally, based on client projects (it's not dying).
  • Trice as an open-source project (lesson learned from ARC)
  • Semantic CrunchBase. I had a number of users but no paying ones. It was also one of those projects that happily burn your marketing budget while at the same time having only negative effects on the company's image because the funds are too small to provide a reliable service (similar to the flaky DBPedia SPARQL service which makes the underlying RDF store look like a crappy product although it is absolutely not).
  • Knowee, Smesher and similar half-implemented and unfunded ideas.
Looking forward to a more simplified and streamlined 2011. Lots of success to all of you, and thanks again for the nice mails!

Semantic WYSIWYG in-place editing with Swipe

Introducing Swipe, Paggr's Microdata editor.
Several months ago (ugh, time flies) I posted a screencast demo'ing a semantic HTML editor. Back then I used a combination of client-side and server-side components, which I have to admit led to quite a number of unnecessary server round-trips.

In the meantime, others have shown that powerful client-side editors can be implemented on top of HTML5, and so I've now rewritten the whole thing and turned it into a pure JavaScript tool as well. It now supports inline WYSIWYG editing and HTML5 Microdata annotations.

The code is still at beta stage, but today I put up an early demo website which I'll use as a sandbox. The editor is called Swipe (like the dance move, but it's an acronym, too). What makes Swipe special is its ability to detect the caret coordinates even when the cursor is inside a text node, which is usually not possible with W3C range objects. This little difference enables several new possibilities, like precise in-place annotations or "linked-data-as-you-type" functionality for user-friendly entity suggestions. More to come soon...

Swipe - Semantic WYSIWYG in-place editor

Is the Semantic Web Layer Cake starting to crumble?

Some thoughts about the ever-growing number of RDF specs.
I recently read an article about how negative assertions about something are automatically getting associated with the person who made them. For example, if you say negative things about your competitor's products, people will subconsciously link these negative sentiments directly with you. A psychology thing. So, my recent rants about the RDF spec mania at the W3C have already led to an all-time low karma level in the RDF community, and I'm trying hard to keep away from discussions about RDFa 1.1 or RDF Next Steps etc. to not make things worse. (Believe it or not, not all Germans enjoy being negative ;)

Now, why another post on this topic? ARC2's development is currently on hold as my long-time investor/girlfriend pulled the plug on it and (rightly so) wants me to focus on my commercial products. With ARC spreading, the maintenance costs are rising, too. There are some options around paid support, sponsoring and donations that I'm pondering, but for now the mails in my inbox are piling up, and one particular question people keep asking is whether ARC is going to support upcoming SPARQL 1.1 or if I'm going to boycott it and perhaps think that the W3C specs are preventing the semantic web from gaining momentum. Short answer (both times): To a certain extent, yes.

Funnily enough, this isn't so much a question of developers wanting to implement SPARQL 1.1, but rather of whether they actually can implement it in an efficient way. SPARQL 1.1 standardizes a couple of much-needed features that we had in ARC's proprietary SPARQL+ for a couple of years. Things like aggregates and full CRUD, which I managed to implement in a fast-enough way for my client projects. But when it comes to all the other features in SPARQL 1.1, the suggestions coming out of the "RDF 2.0" initiative, and the general growth of the stack, I do wonder if the RDF community is about to overbake its technology layer cake.
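(For those who haven't used SPARQL+: aggregates and CRUD there look roughly like the sketch below. The store settings and URIs are made up, it's just meant to show the flavor.)

  <?php
  // Sketch: aggregates and CRUD via ARC2's SPARQL+ extensions.
  // Store settings and all URIs are placeholders.
  include_once('arc/ARC2.php'); /* adjust the path to your ARC2 copy */
  $store = ARC2::getStore(array(
    'db_name' => 'mydb', 'db_user' => 'user', 'db_pwd' => 'pass',
    'store_name' => 'demo',
  ));
  if (!$store->isSetUp()) $store->setUp();

  // CRUD: add triples to a named graph
  $store->query('INSERT INTO <http://example.com/posts> {
    <http://example.com/posts/1> <http://purl.org/dc/terms/creator> "bengee" .
  }');

  // Aggregates: posts per creator
  $rows = $store->query('
    SELECT ?creator COUNT(?post) AS ?posts WHERE {
      ?post <http://purl.org/dc/terms/creator> ?creator .
    } GROUP BY ?creator
  ', 'rows');

  // CRUD: drop the graph again
  $store->query('DELETE FROM <http://example.com/posts>');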

Not that any particular spec was bad or useless, but it is becoming increasingly hard for implementors to keep up. Who can honestly justify the investment in the layer cake if it takes a year to digest it, another year to implement a reasonable portion of it, and then a new spec obsoletes the expensive work? The main traction the Semantic Web effort is seeing happens around Linked Data, which uses only a fraction of the stack, and interestingly in a way non-compliant with other W3C recommendations such as OWL, because the latter doesn't provide the needed means for actual symbol linking (or didn't explain them well enough).

A central problem could be lack of targeting, and lack of formulating the target audience of a particular spec. 37signals once said that good software is opinionated. The RDF community is doing the exact opposite and seems to desperately try to please everyone. The groups follow the throw-it-out-and-see-what-sticks approach. And every new spec is thrown on the stack, with none of them having a helpful description for orientation. No one is taking the time to reduce confusion, to properly explain who is meant to implement the spec, who is meant to use the spec, and how the spec relates to other ones. Sure, new specs raise the market entrance barrier and thus help the few early vendors to keep competition away. But if the market growth gets delayed this way, it may die, or at least an unnecessary number of startups do. (Siderean is one example, their products were amazing. Another one is Radar Networks, which suffered from management issues, but they might have survived if they had spent less money trying to implement an OWL engine for Twine.)

For the fun of it, here are some micro-summaries for RDF specs, how I as a web developer understand them:
  • RDF: "A schema-less key-value system that integrates with the web." (Oha!)
  • RSS 1.0: "Rich data streams." (This is the stuff the thought leaders then said would never be needed, and which now inefficiently have to be squeezed into Atom extensions. Deppen!)
  • OWL 1: "Dumbing down KR-style modeling and inference to the web coder level" (I really liked that approach, it attracted me to the SemWeb idea in the first place, even though I later discovered that RDF Schema is sufficient in many cases.)
  • SPARQL 1.0: "A webby SQL with support for remote databases and without complex JOIN syntax." (Love it!)
  • GRDDL: "For HTML developers who are also XSLT nerds." (A failure, possibly because the target audience was too small, or because HTML creators didn't care for XML processing requirements. Or the chained processing of remote documents was simply too complex.)
  • OWL 2: "Made for the people who created it, and maybe AI students." (Never needed any of its features that I couldn't have more easily with simple SPARQL scripts. I think some people need and use it, though.)
  • RIF: "Even more features than OWL2, and yet another syntax". Alternative summary (for a good ROFL): "Perfect for Facebook's Open Graph". (No use case here. Again, YMMV.)
  • RDFa 1.1: I actually stopped following it, but here is one by Scott Gilbertson: "a bit like asking what time it is and having someone tell you how to build a watch"
  • SPARQL 1.1: "Getting on par with enterprise databases, at any cost." (A slap in the face of web developers. Too many features that can't be implemented in any reasonable time, in their entirety, or with user-satisfying performance. Profiles for feature subsets could still save it, though).
  • Microdata: "RDF-in-HTML made easy for CMS developers and JavaScript coders" (Not sure if it'll succeed, but it works well for me.).
  • SKOS: "An interesting alternative to RDFS and OWL and a possible bridge to the Web 2.0 world." (Wish I had time to explore SKOS-centric app development, the potential could be huge.)

I still believe that the lower-end adoption issue could be solved by a set of smaller layer cakes, each baked for and marketed to a defined and well-understood target audience. If the W3C groups continue to add to the same cake, it's going to crumble apart sooner or later, and the higher layers are going to bury the foundations. Nobody is going to taste from it at all then.

Ben Lavender formulated his concerns already several months ago.

Picture: "The truth about the Semantic Web..." by Dan Brickley

And to answer the ARC-related question in more detail, too: The next step is collecting enough funds to test and release a PHP 5.3 E_STRICT version (Thanks so much to all donors so far, we'll get there!). SPARQL 1.1 compatibility will come, but only for those parts that can be mapped to relational DB functionality. The REST API is on my list, too. Empty graphs, don't think so (which app would need them?). Sub-queries, most probably not. Federated queries, sure, as soon as someone figures out how to do production-ready remote JOINs ;-)

Update: This article has been called unfair and misleading, and I have to agree. I know that spec work is hard, that it's easy to complain from the sideline, and that frustration is part of compromise-driven specifications. Wake-up calls have to be a little louder to be heard, but I apologize for the toe-stepping. It is not directed against any person in particular.

Dynamic Semantic Publishing for any Blog (Part 2: Linked ReadWriteWeb)

A DSP proof of concept using ReadWriteWeb.com data.
The previous post described a generic approach to BBC-style "Dynamic Semantic Publishing", where I wondered if it could be applied to basically any weblog.

Over the last few days I spent some time on a test evaluation and demo system using data from the popular ReadWriteWeb tech blog. The application is not public (I don't want to upset the content owners and don't have any spare server anyway), but you can watch a screencast (embedded below).

The application I created is a semantic dashboard which generates dynamic entity hubs and allows you to explore RWW data via multiple dimensions. To be honest, I was pretty surprised myself by the dynamics of the data. When I switched back to the official site after using the dashboard for some time, I totally missed the advanced filtering options.



In case you are interested in the technical details, fasten your data seatbelt and read on.

Behind the scenes

As mentioned, the framework is supposed to make it easy for site maintainers and should work with plain HTML as input. Direct access to internal data structures of the source system (database tables, post/author/commenter identifiers etc.) should not be needed. Even RDF experts don't have much experience with side effects of semantic systems directly hooked into running applications. And with RDF encouraging loosely coupled components anyway, it makes sense to keep the semantification on a separate machine.

In order to implement the process, I used Trice (once again), which supports simple agents out of the box. The bot-based approach already worked quite nicely in Talis' FanHubz demonstrator, so I followed this route here, too. For "Linked RWW", I only needed a very small number of bots, though.

Trice Bot Console

Here is a quick re-cap of the proposed dynamic semantic publishing process, followed by a detailed description of the individual components:
  • Index and monitor the archives pages, build a registry of post URLs.
  • Load and parse posts into raw structures (title, author, content, ...).
  • Extract named entities from each post's main content section.
  • Build a site-optimized schema (an "ontology") from the data structures generated so far.
  • Align the extracted data structures with the target ontology.
  • Re-purpose the final dataset (widgets, entity hubs, semantic ads, authoring tools)

Archives indexer and monitor

The archives indexer fetches the by-month archives, extracts all link URLs matching the "YYYY/MM" pattern, and saves them in an ARC Store.
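Stripped down to its essence, the bot does something like this (a simplified sketch; the archives URL, graph URI, and type URI are placeholders, and $config is the usual ARC2 store configuration):

  <?php
  // Sketch of the archives indexer: fetch a by-month archives page,
  // collect link URLs matching the "YYYY/MM" pattern, and store them as RDF.
  include_once('arc/ARC2.php');
  $store = ARC2::getStore($config);

  $html = file_get_contents('http://example.com/archives/2010/11/');
  preg_match_all('/href="([^"]+\/\d{4}\/\d{2}\/[^"]+)"/', $html, $m);

  $triples = '';
  foreach (array_unique($m[1]) as $url) {
    $triples .= '<' . $url . '> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/ns#Post> . ';
  }
  $store->query('INSERT INTO <http://example.com/archives-index> { ' . $triples . ' }');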

The implementation of this bot was straightforward (less than 100 lines of PHP code, including support for pagination); this is clearly something that can be turned into a standard component for common blog engines very easily. The result is a complete list of archives pages (so far still without any post URLs) which can be accessed through the RDF store's built-in SPARQL API:

Archives triples via SPARQL

A second bot (the archives monitor) receives either a not-yet-crawled index page (if available) or the most current archives page as a starting point. Each post link of that page is then extracted and used to build a registry of post URLs. The monitoring bot is called every 10 minutes and keeps track of new posts.

Post loader and parser

In order to later process post data at a finer granularity than the page level, we have to extract sub-structures such as title, author, publication date, tags, and so on. This is the harder part because most blogs don't use Linked Data-ready HTML in the form of Microdata or RDFa. Luckily, blogs are template-driven and we can use DOM paths to identify individual post sections, similar to how tools like the Dapper Data Mapper work. However, given the flexibility and customization options of modern blog engines, certain extensions are still needed. In the RWW case I needed site-specific code to expand multi-page posts, to extract a machine-friendly publication date, Facebook Likes and Tweetmeme counts, and to generate site-wide identifiers for authors and commenters.
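To give an idea of the DOM path approach, the extraction core can look roughly like this (a sketch; the XPath expressions are invented examples, not RWW's actual template paths, and $postUrl stands for a registered post URL):

  <?php
  // Sketch: pulling post sub-structures out of the HTML via DOM paths.
  $doc = new DOMDocument();
  @$doc->loadHTML(file_get_contents($postUrl)); /* suppress warnings caused by real-world markup */
  $xpath = new DOMXPath($doc);

  $post = array(
    'title' => trim($xpath->evaluate('string(//h1[@class="post-title"])')),
    'author' => trim($xpath->evaluate('string(//span[@class="author"]/a)')),
    'content' => '',
    'tags' => array(),
  );
  foreach ($xpath->query('//div[@class="post-body"]//p') as $node) {
    $post['content'] .= $doc->saveXML($node);
  }
  foreach ($xpath->query('//a[@rel="tag"]') as $node) {
    $post['tags'][] = trim($node->textContent);
  }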

Writing this bot took several hours and almost 500 lines of code (after re-factoring), but the reward is a nicely structured blog database that can already be explored with an off-the-shelf RDF browser. At this stage we could already use the SPARQL API to easily create dynamic widgets such as "related entries" (via tags or categories), "other posts by same author", "most active commenters per category", or "most popular authors" (as shown in the example in the image below).

Raw post structures
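A widget like "other posts by the same author", for example, boils down to a query along these lines (a sketch; the property URIs just stand in for whatever the parser bot actually generated, and $postUri is the current post's identifier):

  <?php
  // Sketch: "other posts by the same author" against the raw post structures.
  $rows = $store->query('
    PREFIX blog: <http://example.com/ns#>
    SELECT DISTINCT ?other ?title WHERE {
      <' . $postUri . '> blog:author ?author .
      ?other blog:author ?author ;
             blog:title ?title .
      FILTER(?other != <' . $postUri . '>)
    } LIMIT 5
  ', 'rows');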

Named entity extraction

Now, the next bot can take each post's main content and enhance it with Zemanta and OpenCalais (or any other entity recognition tool that produces RDF). The result of this step is a semantified, but rather messy dataset, with attributes from half a dozen RDF vocabularies.

Schema/Ontology identification

Luckily, RDF was designed for working with multi-source data, and thanks to the SPARQL standard, we can use general purpose software to help us find our way through the enhanced assets. I used a faceted browser to identify the site's main entity types (click on the image below for the full-size version).

RWW through Paggr Prospect

Although spotting inconsistencies (like Richard MacManus appearing multiple times in the "author" facet) is easier with a visual browser, a simple, generic SPARQL query can alternatively do the job, too:

RWW entity types
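In essence, something as small as this already produces a first list of the types in the store (sketch):

  <?php
  // Sketch: list the distinct entity types in the enhanced dataset.
  $rows = $store->query('SELECT DISTINCT ?type WHERE { ?s a ?type . }', 'rows');
  foreach ($rows as $row) {
    echo $row['type'] . "\n";
  }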

Specifying the target ontology

The central entity types extracted from RWW posts are Organizations, People, Products, Locations, and Technologies. Together with the initial structures, we can now draft a consolidated RWW target ontology, as illustrated below. Each node gets its own identifier (a URI) and can thus be a bridge to the public Linked Data cloud, for example to import a company's competitor information.

RWW ontology

Aligning the data with the target ontology

In this step, we are again using a software agent and breaking things down into smaller operations. These sub-tasks require some RDF and Linked Data experience, but basically, we are just manipulating the graph structure, which can be done quite comfortably with a SPARQL 1.1 processor that supports INSERT and DELETE commands (a sketch of one such operation follows the list below). Here are some example operations that I applied to the RWW data:
  • Consolidate author aliases ("richard-macmanus-1 = richard-macmanus-2" etc.).
  • Normalize author tags, Zemanta tags, OpenCalais tags, and OpenCalais "industry terms" to a single "tag" field.
  • Consolidate the various type identifiers into canonical ones.
  • For each untyped entity, retrieve typing and label information from the Linked Data cloud (e.g. DBPedia, Freebase, or Semantic CrunchBase) and try to map them to the target ontology.
  • Try to consolidate "obviously identical" entities (I cheated by merging on labels here and there, but it worked).
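To illustrate the flavor of these operations, the tag normalization could be expressed as a pair of updates roughly like this (a sketch for ARC2's SPARQL+ or a SPARQL 1.1 Update processor; the predicate and graph URIs are placeholders, not the exact terms returned by the extraction APIs):

  <?php
  // Sketch: copy social tag labels to a single canonical "tag" property,
  // then remove the source triples. All URIs are placeholders.
  $store->query('
    INSERT INTO <http://example.com/rww-aligned> {
      ?post <http://example.com/ns#tag> ?label .
    } WHERE {
      ?post <http://example.com/ns#socialTag> ?tagNode .
      ?tagNode <http://www.w3.org/2000/01/rdf-schema#label> ?label .
    }
  ');
  $store->query('
    DELETE {
      ?post <http://example.com/ns#socialTag> ?tagNode .
    } WHERE {
      ?post <http://example.com/ns#socialTag> ?tagNode .
    }
  ');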
Data alignment and QA is an iterative process (and a slightly slippery slope). The quality of public linked data varies, but the cloud is very powerful. Each optimization step adds to the network effects and you constantly discover new consolidation options. I spent just a few hours on the inferencer, after all, the Linked RWW demo is just meant to be a proof of concept.

After this step, we're basically done. From now on, the bots can operate autonomously and we can (finally) build our dynamic semantic publishing apps, like the Paggr Dashboard presented in the video above.

Dynamic RWW Entity Hub

Conclusion

Dynamic Semantic Publishing on mainstream websites is still new, and there are no complete off-the-shelf solutions on the market yet. Many of the individual components needed, however, are available. Additionally, the manual effort to integrate the tools is no longer incalculable research, but is getting closer to predictable "standard" development effort. If you are interested in a solution similar to the ones described in this post, please get in touch.

Dynamic Semantic Publishing for any Blog (Part 1)

Bringing automated semantic page generation a la BBC to standard web environments.
"Dynamic Semantic Publishing" is a new technical term which was introduced by the BBC's online team a few weeks ago. It describes the idea of utilizing Linked Data technology to automate the aggregation and publication of interrelated content objects. The BBC's World Cup website was the first large mainstream website to use this method. It provides hundreds of automatically generated, topically composed pages for individual football entities (players, teams, groups) and related articles.

Now, the added value of such linked "entity hubs" would clearly be very interesting for other websites and blogs as well. They are multi-dimensional entry points to a site and provide a much better and more user-engaging way to explore content than the usual flat archives pages, which normally don't have dimensions beyond date, tag, and author. Additionally, HTML aggregations with embedded Linked Data identifiers can improve search engine rankings, and they enable semantic ad placement, both attractive by-products.

Entity hub examples

The architecture used by the BBC is optimized for their internal publishing workflow and thus not necessarily suited for small and medium-scale media outlets. So I've started thinking about a lightweight version of the BBC infrastructure, one that would integrate more easily with typical web server environments and widespread blog engines.

What could a generalized approach to dynamic semantic publishing look like?

We should assume setups where direct access to a blog's database tables is not available. Working with already published posts requires a template detector and custom parsers, but it lowers the entry barrier for blog owners significantly. And content importers can be reused to a large extent when sites are based on standard blog engines such as WordPress or Movable Type.

The graphic below (large version) illustrates a possible, generalized approach to dynamic semantic publishing.
Dynamic Semantic Publishing

Process explanation:
  • Step 1: A blog-specific crawling agent indexes articles linked from central archives pages. The index is stored as RDF, which enables the easy expansion of post URLs to richly annotated content objects.
  • Step 2: Not-yet-imported posts from the generated blog index are parsed into core structural elements such as title, author, date of publication, main content, comments, Tweet counters, Facebook Likes, and so on. The semi-structured post information is added to the triple store for later processing by other agents and scripts. Again, we need site (or blog engine)-specific code to extract the various possible structures. This step could be accelerated by using an interactive extractor builder, though.
  • Step 3: Post contents are passed to APIs like OpenCalais or Zemanta in order to extract stable and re-usable entity identifiers. The resulting data is added to the RDF Store.
  • After the initial semantification in step 3, a generic RDF data browser can be used to explore the extracted information. This simplifies general consistency checks and the identification of the site-specific ontology (concepts and how they are related). Alternatively, this could be done (in a less comfortable way) via the RDF store's SPARQL API.
  • Step 4: Once we have a general idea of the target schema (entity types and their relations), custom SPARQL agents process the data and populate the ontology. They can optionally access and utilize public data.
  • After step 4, the rich resulting graph data allows the creation of context-aware widgets. These widgets ("Related articles", "Authors for this topic", "Product experts", "Top commenters", "Related technologies", etc.) can now be used to build user-facing applications and tools.
  • Use case 1: Entity hubs for things like authors, products, people, organizations, commenters, or other domain-specific concepts.
  • Use case 2: Improving the source blog. The typical "Related articles" sections in standard blog engines, for example, don't take social data such as Facebook Likes or re-tweets into account. Often, they are just based on explicitly defined tags. With the enhanced blog data, we can generate aggregations driven by rich semantic criteria.
  • Use case 3: Authoring extensions: After all, the automated entity extraction APIs are not perfect. With the site-wide ontology in place, we could provide content creators with convenient annotation tools to manually highlight some text and then associate the selection with a typed entity from the RDF store. Or they could add their own concepts to the ontology and share it with other authors. The manual annotations help increase the quality of the entity hubs and blog widgets.

Does it work?

I explored this approach to dynamic semantic publishing with nearly nine thousand articles from ReadWriteWeb. In the next post, I'll describe a "Linked RWW" demo which combines Trice bots, ARC, Prospect, and the handy semantic APIs provided by OpenCalais and Zemanta.

Linked Data Entity Extraction with Zemanta and OpenCalais

A comparison of the NER APIs by Zemanta and OpenCalais.
I had another look at the Named Entity Extraction APIs by Zemanta and OpenCalais for some product launch demos. My first test from last year concentrated more on the Zemanta API. This time I had a closer look at both services, trying to identify the "better one" for "BlogDB", a semi-automatic blog semantifier.

My main need is a service that receives a cleaned-up plain text version of a blog post and returns normalized tags and reusable entity identifiers. So, the findings in this post are rather technical and just related to the BlogDB requirements. I ignored features which could well be essential for others, such as Zemanta's "related articles and photos" feature, or OpenCalais' entity relations ("X hired Y" etc.).

Terms and restrictions of the free API

  • The API terms are pretty similar (the wording is actually almost identical). You need an API key and both services can be used commercially as long as you give attribution and don't proxy/resell the service.
  • OpenCalais gives you more free API calls out of the box than Zemanta (50,000 vs. 1,000 per day). You can get a free upgrade to 10,000 Zemanta calls via a simple email, though (or excessive API use; Andraž auto-upgraded my API limit when he noticed my crazy HDStreams test back then ;-).
  • OpenCalais lets you process larger content chunks (up to 100K, vs. 8K at Zemanta).

Calling the API

  • Both interfaces are simple and well-documented. Calls to the OpenCalais API are a tiny bit more complicated as you have to encode certain parameters in an XML string. Zemanta uses simple query string arguments. I've added the respective PHP snippets below, the complexity difference is negligible.
    function getCalaisResult($id, $text) {
      $parms = '
        <c:params xmlns:c="http://s.opencalais.com/1/pred/"
                  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
          <c:processingDirectives
            c:contentType="TEXT/RAW"
            c:outputFormat="XML/RDF"
            c:calculateRelevanceScore="true"
            c:enableMetadataType="SocialTags"
            c:docRDFaccessible="false"
            c:omitOutputtingOriginalText="true"
            ></c:processingDirectives>
          <c:userDirectives
            c:allowDistribution="false"
            c:allowSearch="false"
            c:externalID="' . $id . '"
            c:submitter="http://semsol.com/"
            ></c:userDirectives>
          <c:externalMetadata></c:externalMetadata>
        </c:params>
      ';
      $args = array(
        'licenseID' => $this->a['calais_key'],
        'content' => urlencode($text),
        'paramsXML' => urlencode(trim($parms))
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.opencalais.com/enlighten/rest/';
      return $this->getAPIResult($url, $qs);
    }
    
    function getZemantaResult($id, $text) {
      $args = array(
        'method' => 'zemanta.suggest',
        'api_key' => $this->a['zemanta_key'],
        'text' => urlencode($text),
        'format' => 'rdfxml',
        'return_rdf_links' => '1',
        'return_articles' => '0',
        'return_categories' => '0',
        'return_images' => '0',
        'emphasis' => '0',
      );
      $qs = substr($this->qs($args), 1);
      $url = 'http://api.zemanta.com/services/rest/0.0/';
      return $this->getAPIResult($url, $qs);
    }
    
  • The actual API call is then a simple POST:
    function getAPIResult($url, $qs) {
      ARC2::inc('Reader');
      $reader = new ARC2_Reader($this->a, $this);
      $reader->setHTTPMethod('POST');
      $reader->setCustomHeaders("Content-Type: application/x-www-form-urlencoded");
      $reader->setMessageBody($qs);
      $reader->activate($url);
      $r = '';
      while ($d = $reader->readStream()) {
        $r .= $d;
      }
      $reader->closeStream();
      return $r;
    }
    
  • Both APIs are fast.

API result processing

  • The APIs return rather verbose data, as they have to stuff in a lot of meta-data such as confidence scores, text positions, internal and external identifiers, etc. But they also offer RDF as one possible result format, so I could store the response data as a simple graph and then use SPARQL queries to extract the relevant information (tags and named entities). Below is the query code for Linked Data entity extraction from Zemanta's RDF. As you can see, the graph structure isn't trivial, but still understandable:
    SELECT DISTINCT ?id ?obj ?cnf ?name
    FROM <' . $g . '> WHERE {
      ?rec a z:Recognition ;
           z:object ?obj ;
           z:confidence ?cnf .
      ?obj z:target ?id .
      ?id z:targetType <http://s.zemanta.com/targets#rdf> ;
          z:title ?name .
      FILTER(?cnf >= 0.4)
    } ORDER BY ?id
    

Extracting normalized tags

  • OpenCalais results contain a section with so-called "SocialTags" which are directly usable as plain-text tags.
  • The tag structures in the Zemanta result are called "Keywords". In my tests they only contained a subset of the detected entities, and so I decided to use the labels associated with detected entities instead. This worked well, but the respective query is more complex (see the sketch below).
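A label-based query can look roughly like this (a sketch that builds on the Zemanta graph structure shown above, not the exact production query; prefix declarations are omitted as in the other snippets, and the confidence threshold is arbitrary):

  <?php
  // Sketch: derive plain-text tags from the labels of detected Zemanta entities
  // (instead of the sparser "Keywords" section).
  $q = '
    SELECT DISTINCT ?name FROM <' . $g . '> WHERE {
      ?rec a z:Recognition ;
           z:object ?obj ;
           z:confidence ?cnf .
      ?obj z:target ?target .
      ?target z:title ?name .
      FILTER(?cnf >= 0.4)
    }
  ';
  $tags = array();
  foreach ($store->query($q, 'rows') as $row) {
    $tags[] = $row['name'];
  }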

Extracting entities

  • In general, OpenCalais results can be directly utilized more easily. They contain stable identifiers and the identifiers come with type information and other attributes such as stock symbols. The API result directly tells you how many Persons, Companies, Products, etc. were detected. And the URIs of these entity types are all from a single (OpenCalais) namespace. If you are not a Linked Data pro, this simplifies things a lot. You only have to support a simple list of entity types to build a working semantic application. If you want to leverage the wider Linked Open Data cloud, however, the OpenCalais response is just a first entry point. It doesn't contain community URIs. You have to use the OpenCalais website to first retrieve disambiguation information, which may then (often involving another request) lead you to the decentralized Linked Data identifiers.
  • Zemanta responses, in contrast, do not (yet, Andraž told me they are working on it) contain entity types at all. You always need an additional request to retrieve type information (unless you are doing nasty URI inspection, which is what I did with detected URIs from Semantic CrunchBase). The retrieval of type information is done via Open Data servers, so you have to be able to deal with the usual down-times of these non-commercial services.
  • Zemanta results are very "webby" and full of community URIs. They even include sameAs information. This can be a bit overwhelming if you are not an RDFer, e.g. looking up a DBPedia URI will often give you dozens of entity types, and you need some experience to match them with your internal type hierarchy. But for an open data developer, the hooks provided by Zemanta are a dream come true.
  • With Zemanta associating shared URIs with all detected entities, I noticed network effects kicking in a couple of times. I used RWW articles for the test, and in one post, for example, OpenCalais could detect the company "Starbucks" and "Howard Schultz" as their "CEO", but their public RDF (when I looked up the "Howard Schultz" URI) didn't persist this linkage. The detection scope was limited to the passed snippet. Zemanta, on the other hand, directly gave me Linked Data URIs for both "Starbucks" and "Howard Schultz", and these identifiers make it possible to re-establish the relation between the two entities at any time. This is a very powerful feature.

Summary

Both APIs are great. The quality of the entity extractors is awesome. For the RWW posts, which deal a lot with Web topics, Zemanta seemed to have a couple of extra detections (such as "ReadWriteWeb" as company). As usual, some owl:sameAs information is wrong, and Zemanta uses incorrect Semantic CrunchBase URIs (".rdf#self" instead of "#self" // Update: to be fixed in the next Zemanta API revision), but I blame us (the RDF community), not the API providers, for not making these things easier to implement.

In the end, I decided to use both APIs in combination, with an optional post-processing step that builds a consolidated, internal ontology from the detected entities (OpenCalais has two Company types which could be merged, for example). Maybe I can make a Prospect demo from the RWW data public, not sure if they would allow this. It's really impressive how much value the entity extraction services can add to blog data, though (see the screenshot below, which shows a pivot operation on products mentioned in posts by Sarah Perez). I'll write a bit more about the possibilities in another post.

RWW posts via BlogDB

Contextual configuration - Semantic Web development for visually minded webmasters

A short screencast demonstrating contextual configuration via widgets in semsol's RDF CMS.
Let's face it, building semantic web sites and apps is still far from easy. And to some extent, this is due to the configuration overhead. The RDF stack is built around declarative languages (for simplified integration at various levels), and as a consequence, configuration directives often end up in some form of declarative format, too. While fleshing out an RDF-powered website, you have to declare a ton of things. From namespace abbreviations to data sources and API endpoints, from vocabularies to identifier mappings, from queries to object templates, and what have you.

Sadly, many of these configurations are needed to style the user interface, and because of RDF's open world context, designers have to know much more about the data model and possible variations than usually necessary. Or webmasters have to deal with design work. Not ideal either. If we want to bring RDF to mainstream web developers, we have to simplify the creation of user-optimized apps. The value proposition of semantics in the context of information overload is pretty clear, and some form of data integration is becoming mandatory for any modern website. But the entry barrier caused by large and complicated configuration files (Fresnel anyone?) is still too high. How can we get from our powerful, largely generic systems to end-user-optimized apps? Or the other way round: How can we support frontend-oriented web development with our flexible tools and freely mashable data sets? (Let me quickly mention Drupal here, which is doing a great job at near-seamlessly integrating RDF. OK, back to the post.)

Enter RDF widgets. Widgets have obvious backend-related benefits like accessing, combining and re-purposing information from remote sources within a manageable code sandbox. But they can also greatly support frontend developers. They simplify page layouting and incremental site building with instant visual feedback (add a widget, test, add another one, re-arrange, etc.). And, more importantly in the RDF case, they can offer a way to iteratively configure a system with very little technical overhead. Configuration options could not only be scoped to the widget at hand, but also to the context where the widget is currently viewed. Let's say you are building an RDF browser and need resource templates for all kinds of items. With contextual configuration, you could simply browse the site and at any position in the ontology or navigation hierarchy, you would just open a configuration dialog and define a custom template, if needed. Such an approach could enable systems that worked out of the box (raw, but usable) and which could then be continually optimized, possibly even by site users.

A lot of "could" and "would" in the paragraphs above, and the idea may sound quite abstract without actually seeing it. To illustrate the point I'm trying to make I've prepared a short video (embedded below). It uses Semantic CrunchBase and Paggr Prospect (our new faceted browser builder) as an example use case for in-context configuration.

And if you are interested in using one of our solutions for your own projects, please get in touch!



Paggr Prospect (part 1)


Paggr Prospect (part 2)

Trice' Semantic Richtext Editor

A screencast demonstrating the structured RTE bundled with the Trice CMS
In my previous post I mentioned that I'm building a Linked Data CMS. One of its components is a rich-text editor that allows the creation (and embedding) of structured markup.

An earlier version supported limited Microdata annotations, but now I've switched the mechanism and use an intermediate, but even simpler approach based on HTML5's handy data-* attributes. This lets you build almost arbitrary markup with the editor, including Microformats, Microdata, or RDFa. I don't know yet when the CMS will be publicly available (3 sites are under development right now), but as mentioned, I'd be happy about another pilot project or two. Below is a video demonstrating the editor and its easy customization options.
