finally a bnode with a uri

Posts tagged with: arc

2011 Resolutions and Decisions

I'm shifting focus from infrastructure and research to solutions and customer projects.
All right, this post could easily have become another rant about the ever-growing complexity of RDF specifications, but I'll turn it into a big shout-out to the Semantic Web community instead. After announcing that I'd stop investing further time in ARC's open-source branch, I received so many nice tweets and mails that I was reminded of why I started the project in the first place: the positive vibe in the community, and the shared vision. Thank you very much, everybody, for the friendly reactions; I'm definitely very moved.

Some explanations: I still share the vision of machine-readable, integration-ready web content, but I have to face the fact that the current approach is getting too expensive for web agencies like mine. Luckily, I could spot a few areas where customer demands meet the cost-efficient implementation of certain spec subsets. (Those areas don't include comprehensive RDF infrastructure or free services, though. At least not yet, and I won't make any further bets.) The good news: I will continue working with semantic web technologies, and I'm personally very happy to switch focus from rather frustrating spec chasing to customer-oriented solutions and products with defined purposes. The downside: I have to discontinue a couple of projects and services in order to concentrate my energy and reduce (opportunity) costs. These are:
  • The ARC website, mailing list, and other forms of free support. The code and documentation get a new home on GitHub, though. The user community is already thinking about setting up a mailing list on their own. Development of ARC is going to continue internally, based on client projects (it's not dying).
  • Trice as an open-source project (lesson learned from ARC)
  • Semantic CrunchBase. I had a number of users but no paying ones. It was also one of those projects that happily burn your marketing budget while having only negative effects on the company's image, because the funds are too small to provide a reliable service (similar to the flaky DBPedia SPARQL service, which makes the underlying RDF store look like a crappy product although it is absolutely not).
  • Knowee, Smesher and similar half-implemented and unfunded ideas.
Looking forward to a simpler and more streamlined 2011. Lots of success to all of you, and thanks again for the nice mails!

Is the Semantic Web Layer Cake starting to crumble?

Some thoughts about the ever-growing number of RDF specs.
I recently read an article about how negative assertions about something are automatically getting associated with the person who made them. For example, if you say negative things about your competitor's products, people will subconsciously link these negative sentiments directly with you. A psychology thing. So, my recent rants about the RDF spec mania at the W3C have already led to an all-time low karma level in the RDF community, and I'm trying hard to keep away from discussions about RDFa 1.1 or RDF Next Steps etc. so as not to make things worse. (Believe it or not, not all Germans enjoy being negative ;)

Now, why another post on this topic? ARC2's development is currently on hold, as my long-time investor/girlfriend pulled the plug on it and (rightly so) wants me to focus on my commercial products. With ARC spreading, the maintenance costs are rising, too. There are some options around paid support, sponsoring, and donations that I'm pondering, but for now the mails in my inbox are piling up, and one particular question people keep asking is whether ARC is going to support the upcoming SPARQL 1.1, or whether I'm going to boycott it and perhaps think that the W3C specs are preventing the semantic web from gaining momentum. Short answer (both times): to a certain extent, yes.

Funnily enough, this isn't so much a question of whether developers want to implement SPARQL 1.1, but whether they actually can implement it in an efficient way. SPARQL 1.1 standardizes a couple of much-needed features that we have had in ARC's proprietary SPARQL+ for a couple of years, things like aggregates and full CRUD, which I managed to implement in a fast-enough way for my client projects. But when it comes to all the other features in SPARQL 1.1, the suggestions coming out of the "RDF 2.0" initiative, and the general growth of the stack, I do wonder if the RDF community is about to overbake its technology layer cake.

Not that any particular spec is bad or useless, but it is becoming increasingly hard for implementors to keep up. Who can honestly justify the investment in the layer cake if it takes a year to digest it, another year to implement a reasonable portion of it, and then a new spec obsoletes the expensive work? The main traction the Semantic Web effort is seeing happens around Linked Data, which uses only a fraction of the stack, and interestingly in a way that is non-compliant with other W3C recommendations such as OWL, because the latter doesn't provide the needed means for actual symbol linking (or didn't explain it well enough).

A central problem could be a lack of targeting, and a failure to spell out the target audience of a particular spec. 37signals once said that good software is opinionated. The RDF community is doing the exact opposite and seems to be desperately trying to please everyone. The groups follow a throw-it-out-and-see-what-sticks approach, and every new spec is thrown onto the stack, with none of them having a helpful description for orientation. No one is taking the time to reduce confusion, to properly explain who is meant to implement the spec, who is meant to use it, and how it relates to other ones. Sure, new specs raise the market entry barrier and thus help the few early vendors keep competition away. But if market growth gets delayed this way, the market may die, or at least an unnecessary number of startups will. (Siderean is one example; their products were amazing. Another one is Radar Networks, which suffered from management issues, but they might have survived if they had spent less money trying to implement an OWL engine for Twine.)

For the fun of it, here are some micro-summaries for RDF specs, how I as a web developer understand them:
  • RDF: "A schema-less key-value system that integrates with the web." (Oha!)
  • RSS 1.0: "Rich data streams." (This is the stuff the thought leaders said would never be needed, and which now has to be inefficiently squeezed into Atom extensions. Idiots!)
  • OWL 1: "Dumbing down KR-style modeling and inference to the web coder level" (I really liked that approach, it attracted me to the SemWeb idea in the first place, even though I later discovered that RDF Schema is sufficient in many cases.)
  • SPARQL 1.0: "A webby SQL with support for remote databases and without complex JOIN syntax." (Love it!)
  • GRDDL: "For HTML developers who are also XSLT nerds." (A failure, possibly because the target audience was too small, or because HTML creators didn't care for XML processing requirements. Or the chained processing of remote documents was simply too complex.)
  • OWL 2: "Made for the people who created it, and maybe AI students." (Never needed any of its features that I couldn't have more easily with simple SPARQL scripts. I think some people need and use it, though.)
  • RIF: "Even more features than OWL2, and yet another syntax". Alternative summary (for a good ROFL): "Perfect for Facebook's Open Graph". (No use case here. Again, YMMV.)
  • RDFa 1.1: I actually stopped following it, but here is one by Scott Gilbertson: "a bit like asking what time it is and having someone tell you how to build a watch"
  • SPARQL 1.1: "Getting on par with enterprise databases, at any cost." (A slap in the face of web developers. Too many features that can't be implemented in any reasonable time, in their entirety, or with user-satisfying performance. Profiles for feature subsets could still save it, though.)
  • Microdata: "RDF-in-HTML made easy for CMS developers and JavaScript coders" (Not sure if it'll succeed, but it works well for me.)
  • SKOS: "An interesting alternative to RDFS and OWL and a possible bridge to the Web 2.0 world." (Wish I had time to explore SKOS-centric app development, the potential could be huge.)

I still believe that the lower-end adoption issue could be solved by a set of smaller layer cakes, each baked for and marketed to a defined and well-understood target audience. If the W3C groups continue to add to the same cake, it's going to crumble sooner or later, and the higher layers are going to bury the foundations. Nobody is going to taste it at all then.

Ben Lavender already formulated his concerns several months ago.

Picture: "The truth about the Semantic Web..." by Dan Brickley

And to answer the ARC-related question in more detail, too: The next step is collecting enough funds to test and release a PHP 5.3 E_STRICT version (thanks so much to all donors so far, we'll get there!). SPARQL 1.1 compatibility will come, but only for those parts that can be mapped to relational DB functionality. The REST API is on my list, too. Empty graphs, I don't think so (which app would need them?). Sub-queries, most probably not. Federated queries, sure, as soon as someone figures out how to do production-ready remote JOINs ;-)

Update: This article has been called unfair and misleading, and I have to agree. I know that spec work is hard, that it's easy to complain from the sidelines, and that frustration is part of compromise-driven specifications. Wake-up calls have to be a little louder to be heard, but I apologize for the toe-stepping. It is not directed against any person in particular.

Dynamic Semantic Publishing for any Blog (Part 2: Linked ReadWriteWeb)

A DSP proof of concept using ReadWriteWeb.com data.
The previous post described a generic approach to BBC-style "Dynamic Semantic Publishing" and wondered whether it could be applied to basically any weblog.

During the last few days I spent some time on a test evaluation and demo system using data from the popular ReadWriteWeb tech blog. The application is not public (I don't want to upset the content owners and don't have a spare server anyway), but you can watch a screencast (embedded below).

The application I created is a semantic dashboard which generates dynamic entity hubs and allows you to explore RWW data via multiple dimensions. To be honest, I was pretty surprised myself by the dynamics of the data. When I switched back to the official site after using the dashboard for some time, I totally missed the advanced filtering options.



In case you are interested in the technical details, fasten your data seatbelt and read on.

Behind the scenes

As mentioned, the framework is supposed to make things easy for site maintainers and should work with plain HTML as input. Direct access to internal data structures of the source system (database tables, post/author/commenter identifiers, etc.) should not be needed. Even RDF experts don't have much experience with the side effects of semantic systems directly hooked into running applications. And with RDF encouraging loosely coupled components anyway, it makes sense to keep the semantification on a separate machine.

In order to implement the process, I used Trice (once again), which supports simple agents out of the box. The bot-based approach already worked quite nicely in Talis' FanHubz demonstrator, so I followed this route here, too. For "Linked RWW", I only needed a very small number of bots, though.

Trice Bot Console

Here is a quick re-cap of the proposed dynamic semantic publishing process, followed by a detailed description of the individual components:
  • Index and monitor the archives pages, build a registry of post URLs.
  • Load and parse posts into raw structures (title, author, content, ...).
  • Extract named entities from each post's main content section.
  • Build a site-optimized schema (an "ontology") from the data structures generated so far.
  • Align the extracted data structures with the target ontology.
  • Re-purpose the final dataset (widgets, entity hubs, semantic ads, authoring tools)

Archives indexer and monitor

The archives indexer fetches the by-month archives, extracts all link URLs matching the "YYYY/MM" pattern, and saves them in an ARC Store.

The implementation of this bot was straightforward (less than 100 lines of PHP code, including support for pagination); this is clearly something that can be turned into a standard component for common blog engines very easily. The result is a complete list of archives pages (so far still without any post URLs) which can be accessed through the RDF store's built-in SPARQL API:

Archives triples via SPARQL
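For illustration, here is a minimal sketch of what such an indexer bot could look like on top of ARC2. It is not the actual Trice bot; the store config values, the archives URL, the link pattern, and the vocabulary/graph URIs are placeholder assumptions:

// Minimal archives indexer sketch (assumed ARC2 store; placeholder config,
// placeholder vocabulary and graph URIs).
include_once 'arc/ARC2.php';

$config = array(
  'db_host' => 'localhost', 'db_name' => 'linked_rww',
  'db_user' => 'user', 'db_pwd' => 'pass',
  'store_name' => 'rww',
);
$store = ARC2::getStore($config);
if (!$store->isSetUp()) $store->setUp();

// fetch the archives overview and keep every link that matches the YYYY/MM pattern
$html = file_get_contents('http://www.readwriteweb.com/archives/');
preg_match_all('/href="([^"]*\/\d{4}\/\d{2}[^"]*)"/', $html, $m);

// register each archives page in a dedicated graph
foreach (array_unique($m[1]) as $url) {
  $store->query('INSERT INTO <urn:rww:archives> {
    <' . $url . '> a <http://example.org/vocab#ArchivesPage> .
  }');
}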

A second bot (the archives monitor) receives either a not-yet-crawled index page (if available) or the most current archives page as a starting point. Each post link of that page is then extracted and used to build a registry of post URLs. The monitoring bot is called every 10 minutes and keeps track of new posts.

Post loader and parser

In order to later process post data at a finer granularity than the page level, we have to extract sub-structures such as title, author, publication date, tags, and so on. This is the harder part because most blogs don't use Linked Data-ready HTML in the form of Microdata or RDFa. Luckily, blogs are template-driven and we can use DOM paths to identify individual post sections, similar to how tools like the Dapper Data Mapper work. However, given the flexibility and customization options of modern blog engines, certain extensions are still needed. In the RWW case I needed site-specific code to expand multi-page posts, to extract a machine-friendly publication date, Facebook Likes and Tweetmeme counts, and to generate site-wide identifiers for authors and commenters.
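As a rough illustration of the DOM-path idea (the XPath expressions and the post URL below are placeholders, not RWW's actual template paths):

// Template-driven extraction sketch: pull a post's core fields via DOM paths.
$post_url = 'http://www.readwriteweb.com/archives/example_post.php'; // placeholder
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents($post_url));
$xpath = new DOMXPath($doc);

// illustrative paths; a real parser would detect or configure them per template
$paths = array(
  'title'   => '//h1[@class="post-title"]',
  'author'  => '//span[@class="author"]/a',
  'content' => '//div[@class="entry-body"]',
);
$post = array('url' => $post_url);
foreach ($paths as $field => $path) {
  $node = $xpath->query($path)->item(0);
  $post[$field] = $node ? trim($node->textContent) : null;
}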

Writing this bot took several hours and almost 500 lines of code (after re-factoring), but the reward is a nicely structured blog database that can already be explored with an off-the-shelf RDF browser. At this stage we could already use the SPARQL API to easily create dynamic widgets such as "related entries" (via tags or categories), "other posts by same author", "most active commenters per category", or "most popular authors" (as shown in the example in the image below).

Raw post structures
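To give an idea, an "other posts by the same author" widget could be driven by a query along these lines (the property URIs are placeholders, not the parser bot's actual output; $store as in the indexer sketch above):

// Hypothetical widget query against the raw post structures.
$post_uri = 'http://example.org/posts/123'; // placeholder
$rows = $store->query('
  PREFIX ex: <http://example.org/vocab#>
  SELECT DISTINCT ?other ?title WHERE {
    <' . $post_uri . '> ex:author ?author .
    ?other ex:author ?author ;
           ex:title ?title .
    FILTER (?other != <' . $post_uri . '>)
  }
  LIMIT 5
', 'rows');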

Named entity extraction

Now, the next bot can take each post's main content and enhance it with Zemanta and OpenCalais (or any other entity recognition tool that produces RDF). The result of this step is a semantified, but rather messy dataset, with attributes from half a dozen RDF vocabularies.
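In code, such an enhancement bot essentially posts the content to an extraction service and merges the returned RDF into the store. The endpoint URL, parameters, and API-key handling below are placeholders; the real OpenCalais/Zemanta calls differ in their details:

// Enhancement sketch: send a post's text to an entity extraction service that
// returns RDF, then add the result to a per-post enhancement graph.
// Endpoint and parameters are placeholders, not the real Calais/Zemanta API.
$api_key = 'YOUR_API_KEY';
$ctx = stream_context_create(array('http' => array(
  'method'  => 'POST',
  'header'  => 'Content-Type: application/x-www-form-urlencoded',
  'content' => http_build_query(array('text' => $post['content'], 'apikey' => $api_key)),
)));
$rdf = file_get_contents('http://extraction.example.com/api', false, $ctx);

// parse the returned RDF and store it ($store and $post as in the sketches above)
$parser = ARC2::getRDFParser();
$parser->parse('http://extraction.example.com/api', $rdf);
$store->insert($parser->getTriples(), $post['url'] . '#enhancements');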

Schema/Ontology identification

Luckily, RDF was designed for working with multi-source data, and thanks to the SPARQL standard, we can use general purpose software to help us find our way through the enhanced assets. I used a faceted browser to identify the site's main entity types (click on the image below for the full-size version).

RWW through Paggr Prospect

Although spotting inconsistencies (like Richard MacManus appearing multiple times in the "author" facet) is easier with a visual browser, a simple, generic SPARQL query can do the job, too:

RWW entity types
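Such a query might look roughly like this, using ARC's SPARQL+ aggregate syntax (the results obviously depend on the crawl):

// List the entity types in the enhanced dataset, with instance counts.
$rows = $store->query('
  SELECT ?type COUNT(?s) AS ?instances WHERE {
    ?s a ?type .
  }
  GROUP BY ?type
', 'rows');
foreach ($rows as $row) {
  echo $row['type'] . ': ' . $row['instances'] . "\n";
}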

Specifying the target ontology

The central entity types extracted from RWW posts are Organizations, People, Products, Locations, and Technologies. Together with the initial structures, we can now draft a consolidated RWW target ontology, as illustrated below. Each node gets its own identifier (a URI) and can thus be a bridge to the public Linked Data cloud, for example to import a company's competitor information.

RWW ontology

Aligning the data with the target ontology

In this step, we are again using a software agent, breaking things down into smaller operations. These sub-tasks require some RDF and Linked Data experience, but basically we are just manipulating the graph structure, which can be done quite comfortably with a SPARQL 1.1 processor that supports INSERT and DELETE commands. Here are some example operations that I applied to the RWW data (one of them is sketched in code after the list):
  • Consolidate author aliases ("richard-macmanus-1 = richard-macmanus-2" etc.).
  • Normalize author tags, Zemanta tags, OpenCalais tags, and OpenCalais "industry terms" to a single "tag" field.
  • Consolidate the various type identifiers into canonical ones.
  • For each untyped entity, retrieve typing and label information from the Linked Data cloud (e.g. DBPedia, Freebase, or Semantic CrunchBase) and try to map them to the target ontology.
  • Try to consolidate "obviously identical" entities (I cheated by merging on labels here and there, but it worked).
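Here is what the tag normalization could look like as a store query; the property and graph URIs are placeholders, not the vocabulary terms actually used in the demo:

// Alignment sketch: copy tags from several source properties into a single
// canonical "tag" property in a dedicated alignment graph (placeholder URIs).
$store->query('
  PREFIX ex: <http://example.org/vocab#>
  INSERT INTO <urn:rww:aligned>
  CONSTRUCT { ?post ex:tag ?tag } WHERE {
    { ?post ex:authorTag ?tag }
    UNION { ?post ex:calaisTag ?tag }
    UNION { ?post ex:zemantaTag ?tag }
  }
');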
Data alignment and QA is an iterative process (and a slightly slippery slope). The quality of public linked data varies, but the cloud is very powerful. Each optimization step adds to the network effects, and you constantly discover new consolidation options. I spent just a few hours on the inferencer; after all, the Linked RWW demo is just meant to be a proof of concept.

After this step, we're basically done. From now on, the bots can operate autonomously and we can (finally) build our dynamic semantic publishing apps, like the Paggr Dashboard presented in the video above.

Dynamic RWW Entity Hub

Conclusion

Dynamic Semantic Publishing on mainstream websites is still new, and there are no complete off-the-shelf solutions on the market yet. Many of the individual components needed, however, are available. Additionally, the manual effort to integrate the tools is no longer incalculable research, but is getting closer to predictable "standard" development effort. If you are interested in a solution similar to the ones described in this post, please get in touch.

Dynamic Semantic Publishing for any Blog (Part 1)

Bringing automated semantic page generation a la BBC to standard web environments.
"Dynamic Semantic Publishing" is a new technical term which was introduced by the BBC's online team a few weeks ago. It describes the idea of utilizing Linked Data technology to automate the aggregation and publication of interrelated content objects. The BBC's World Cup website was the first large mainstream website to use this method. It provides hundreds of automatically generated, topically composed pages for individual football entities (players, teams, groups) and related articles.

Now, the added value of such linked "entity hubs" would clearly be very interesting for other websites and blogs as well. They are multi-dimensional entry points to a site and provide a much better and more user-engaging way to explore content than the usual flat archives pages, which normally don't have dimensions beyond date, tag, and author. Additionally, HTML aggregations with embedded Linked Data identifiers can improve search engine rankings and enable semantic ad placement, two attractive by-products.

Entity hub examples

The architecture used by the BBC is optimized for their internal publishing workflow and thus not necessarily suited for small and medium-scale media outlets. So I've started thinking about a lightweight version of the BBC infrastructure, one that would integrate more easily with typical web server environments and widespread blog engines.

What could a generalized approach to dynamic semantic publishing look like?

We should assume setups where direct access to a blog's database tables is not available. Working with already published posts requires a template detector and custom parsers, but it lowers the entry barrier for blog owners significantly. And content importers can be reused to a large extent when sites are based on standard blog engines such as WordPress or Movable Type.

The graphic below (large version) illustrates a possible, generalized approach to dynamic semantic publishing.
Dynamic Semantic Publishing

Process explanation:
  • Step 1: A blog-specific crawling agent indexes articles linked from central archives pages. The index is stored as RDF, which enables the easy expansion of post URLs to richly annotated content objects.
  • Step 2: Not-yet-imported posts from the generated blog index are parsed into core structural elements such as title, author, date of publication, main content, comments, Tweet counters, Facebook Likes, and so on. The semi-structured post information is added to the triple store for later processing by other agents and scripts. Again, we need site (or blog engine)-specific code to extract the various possible structures. This step could be accelerated by using an interactive extractor builder, though.
  • Step 3: Post contents are passed to APIs like OpenCalais or Zemanta in order to extract stable and re-usable entity identifiers. The resulting data is added to the RDF Store.
  • After the initial semantification in step 3, a generic RDF data browser can be used to explore the extracted information. This simplifies general consistency checks and the identification of the site-specific ontology (concepts and how they are related). Alternatively, this could be done (in a less comfortable way) via the RDF store's SPARQL API.
  • Step 4: Once we have a general idea of the target schema (entity types and their relations), custom SPARQL agents process the data and populate the ontology. They can optionally access and utilize public data.
  • After step 4, the rich resulting graph data allows the creation of context-aware widgets. These widgets ("Related articles", "Authors for this topic", "Product experts", "Top commenters", "Related technologies", etc.) can now be used to build user-facing applications and tools.
  • Use case 1: Entity hubs for things like authors, products, people, organizations, commenters, or other domain-specific concepts.
  • Use case 2: Improving the source blog. The typical "Related articles" sections in standard blog engines, for example, don't take social data such as Facebook Likes or re-tweets into account. Often, they are just based on explicitly defined tags. With the enhanced blog data, we can generate aggregations driven by rich semantic criteria (a possible query is sketched after this list).
  • Use case 3: Authoring extensions: After all, the automated entity extraction APIs are not perfect. With the site-wide ontology in place, we could provide content creators with convenient annotation tools to manually highlight some text and then associate the selection with a typed entity from the RDF store. Or they could add their own concepts to the ontology and share it with other authors. The manual annotations help increase the quality of the entity hubs and blog widgets.
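As a sketch of use case 2, a "related articles" widget could rank shared-tag candidates by social signals instead of tags alone. The property URIs are placeholders for whatever the import bots produce, and $store is an ARC store handle like in the Part 2 sketches further up the page:

// Hypothetical "related articles" query, ranked by Facebook Likes
// (placeholder property URIs, not an actual schema).
$current_post = 'http://example.org/posts/123'; // placeholder
$rows = $store->query('
  PREFIX ex: <http://example.org/vocab#>
  SELECT DISTINCT ?post ?title ?likes WHERE {
    <' . $current_post . '> ex:tag ?tag .
    ?post ex:tag ?tag ;
          ex:title ?title ;
          ex:facebookLikes ?likes .
    FILTER (?post != <' . $current_post . '>)
  }
  ORDER BY DESC(?likes)
  LIMIT 5
', 'rows');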

Does it work?

I explored this approach to dynamic semantic publishing with nearly nine thousand articles from ReadWriteWeb. In the next post, I'll describe a "Linked RWW" demo which combines Trice bots, ARC, Prospect, and the handy semantic APIs provided by OpenCalais and Zemanta.

Microdata, semantic markup for both RDFers and non-RDFers

RDF-in-HTML could have been so simple.
There's been a whole lot of discussion around Microdata, a new approach for embedding machine-readable information into forthcoming HTML5. What I find most attractive about Microdata is the fact that it was designed by HTMLers, not RDFers. It's refreshingly pragmatic, free of other RDF spec legacy, but still capable of expressing most of RDF.

Unfortunately, RDFa lobbyists on the HTML WG mailing list forced the spec out of HTML5 core for the time being. This manoeuvre was understandable (a lot of energy went into RDFa, after all), but in my opinion very short-sighted. How many uphill battles did we have, trying to get RDF to the broader developer community? And how many were successful? Atom, microformats, OpenID, Portable Contacts, XRDS, Activity Streams (well, not really), these are examples where RDFers tried, but failed, to promote some of their infrastructure into the respective solutions. Now: HTML5, where the initial RDF lobbying actually had an effect and led to a native mechanism for RDF-in-HTML. Yes, native, not in some separate spec. This would have become part of every HTML5 book; any HTML developer on this planet would have learned about it. Finally a battle won. And what a great one. HTML.

But no, Microdata wasn't developed by an RDF group, so they voted it out again. Now, the really sad thing is, there could have been a solution that would have served everybody sufficiently well, both HTMLers and RDFers. The RDFa group recently realized that RDFa needs to be revised anyway; there is going to be an RDFa 1.1 which will require new parsers. If they'd swallowed their pride, they would most probably have been able to define RDFa 1.1 as a proper superset of Microdata.

Here is a short overview of RDF features supported by Microdata:
  • Explicit resource containers, via @itemscope (in RDFa, the boundaries of a resource are often implicitly defined by @rel or @typeof)
  • Subject declaration, via @itemid (RDFa uses @about)
  • Main subject typing, via @itemtype (RDFa uses @typeof)
  • Predicate declaration, via @itemprop (RDFa uses @property, @rel, and @rev)
  • Literal objects, via node values (RDFa also allows hidden values via @content)
  • Non-literal objects, via @href, @src, etc. (RDFa also allows hidden values via @resource)
  • Object language, via @lang
  • Blank nodes
I won't go into the details of why hiding semantics in RDFa will be penalized by search engines as soon as spammers discover the possibilities, why reusing RDF/XML's attribute names was probably not a smart move with regard to attracting non-RDFers, why the new @vocab idea is impractical, or why namespace prefixes, as handy as they are in other RDF formats, are not too helpful in an HTML context. Let's simply state that there is a trade-off between extended features (RDFa) and simplicity (Microdata). So, what are the core features that an RDFer would really need beyond Microdata?
  • the possibility to preserve markup, but probably not necessarily as an explicit rdf:XMLLiteral
  • datatypes for literal objects (I personally never used them in practice in the last 6 years that I've been developing RDF apps, but I can see some use cases)
Markup preservation is currently turned on by default in RDFa and can be disabled through @datatype, so an RDFer-satisfying RDFa 1.1 spec could probably just be Microdata + @datatype + a few extended parsing rules to end up with the intended RDF. My experience with watching RDF spec creation tells me that the RDFa group won't pick this route (there simply is no "Kill a Feature" mentality in the RDF community), but hey, hope dies last.

I've been using Microdata in two of my recent RDF apps and the CMS module of (ahem, still not documented) Trice, and it's been a great experience. ARC is going to get a "microRDF" extractor that supports the RDF-in-Microdata markup below (Note: this output still requires a 2nd extraction process, as the current Microdata draft's RDF mechanism only produces intermediate RDF triples, which then still have to be post-processed. I hope my related suggestion will become official, but I seem to be the only pro-Microdata RDFer on the HTML list right now, so it may just stay as a convention):

Microdata:
<div itemscope itemtype="http://xmlns.com/foaf/0.1/Person">

  <!-- plain props are mapped to the itemtype's context -->
  <img itemprop="img" src="mypic.jpg" alt="a pic of me" />
  My name is <span itemprop="name"><span itemprop="nick">Alec</span> Tronnick</span>
  and I blog at <a itemprop="weblog" href="http://alec-tronni.ck/">alec-tronni.ck</a>.

  <!-- other RDF vocabs can be used via full itemprop URIs -->
  <span itemprop="http://purl.org/vocab/bio/0.1/olb">
    I'm a crash test dummy for semantic HTML.
  </span>
</div>
Extracted RDF:
@base <http://host/path/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix bio: <http://purl.org/vocab/bio/0.1/> .
_:bn1 a foaf:Person ;
      foaf:img <mypic.jpg> ;
      foaf:name "Alec Tronnick" ;
      foaf:nick "Alec" ;
      foaf:weblog <http://alec-tronni.ck/> ;
      bio:olb "I'm a crash test dummy for semantic HTML." .

Code.semsol.org - A central home for semsol code

Semsol gets code repositories and browsers
The code bundles on the ARC website are generated in an inefficient manual process, and each patch has to wait for the next to-be-generated zip file. The developer community is growing (there are now 600 ARC downloads each month), I'm increasingly receiving patches and requests for a proper repository, and the Trice framework is about to get online as well. So I spent last week on building a dedicated source code site for all semsol projects at code.semsol.org.

So far, it's not much more than a directory browser with source preview and a little method navigator. But it will simplify code sharing and frequent updates for me, and hopefully also for ARC and Trice developers. You can check out various Bazaar code branches and generate a bundle from any directory. The app can't display repository messages yet (the server doesn't have bzr installed; I'm just deploying branches using the handy FTP option), but I'll try to come up with a workaround or an alternative when time permits.

Code Browser

Back from New York "Semantic Web for PHP Developers" trip

Gave a talk and a workshop in NYC about SemWeb technologies for PHP developers
/me at Times Square
I'm back from New York, where I was given the great opportunity to talk about two of my favorite topics: Semantic Web Development with PHP, and (not necessarily semantic) Software Development using RDF Technology. I was especially looking forward to the second one, as that perspective is not only easier to understand for people from a software engineering context, but also because it is still a much-neglected marketing "back-door": If RDF simplifies working with data in general (and it does), then we should not limit its use to semantic web apps. Broader data distribution and integration may naturally follow in a second or third step once people use the technology (so much for my contribution to Michael Hausenblas' list of RDF Mal-/Best Practices ;)

The talk on Thursday at the NY Semantic Web Meetup was great fun. But the most impressive part of the event was the people there. A lot to learn from on this side of the pond. Not only very practical and professional, but also extremely positive and open. It almost felt like being invited to a family party.

The positive attitude was even true for the workshop, which I clearly could have made more effective. I didn't expect (but should have) that many people would come w/o a LAMP stack on their laptops, so we lost a lot of time setting up MAMP/LAMP/WAMP before we started hacking ARC, Trice, and SPARQL.

Marco brought up a number of illustrative use cases. He maintains an (unofficial, sorry, can't provide a pointer) RDF wrapper for any group on meetup.com, so the workshop participants could directly work with real data. We explored overlaps between different Meetup groups, the order in which people joined selected groups, inferred new triples from combined datasets via CONSTRUCT, and played with not-yet-standard SPARQL features like COUNT and LOAD.

And having done the workshop should finally give me the final push to launch the Trice site. The code is out, and it's apparently not too tricky to get started even while the documentation is still incomplete. Unfortunately, I have a strict "no more non-profits" directive, but I think Trice, despite being FOSS, will help me get some paid projects, so I'll squeeze an official launch in sometime soon-ish.

Below are the slides from the meetup. I added some screenshots, but they are probably still a bit boring without the actual demos (I think a video will be put up in a couple of days, though).

ARC Graph Gear Serializer Plugin

Patrick Murray-John created an ARC2 converter for Graph Gear visualizations
Patrick Murray-John (who is currently Semantifying the University of Mary Washington) just released a first version of an ARC2 converter for Graph Gear visualizations. Looks pretty cool.
Graph Gear visualization from RDF via ARC

ARC now also GPL-licensed

ARC is now available under the W3C Software or the GPL license
Arto Bendiken and Stéphane Corlosquet asked me to provide ARC also under the GPL (for Drupal, in addition to the current W3C Software License), so here you are.

ARC is already used by several modules that help turn Drupal into an RDF-powered CMS, for example the RDF API, the SPARQL extension, or the Calais module. The new license will make it easier for the Drupal community to directly bundle ARC with their RDF extensions. I guess that Drupal will have its own complete RDF toolkit one day, but it's great to see ARC being utilized for accelerating the development progress.

New ARC version (DB changes!)

ARC revision 2009-03-04 introduces some low-level changes.
I've just uploaded a new ARC revision (rev 2009-03-04). In preparation for a Store Optimizer (which will improve the RDF store's scalability and performance), I slightly changed the underlying MySQL table structure. Please back up your data before you upgrade. Here's sample code; it's not too complicated:
/* old ARC version */
$store->createBackup($backup_path); // make sure the directory is write-enabled
// on success: $store->drop();
/* new ARC */
$store->query('LOAD <' . $backup_path . '>');
(Note: Store settings, if used, are not part of the dump and have to be manually copied over)

Main changes in this version:
  • the (crappy) inferencer was removed; I'll add a rule-based system at some later stage
  • the store indexes can be defined via a "store_indexes" config option now (see the config sketch after this list). Default: array('sp (s,p)', 'os (o,s)', 'po (p,o)'). The previous stores had an additional index 'spo (s,p,o)'.
  • The store got an "extendColumns" method which changes the column types from MEDIUMINT to INT. You don't have to call this method explicitly; the tables will be auto-upgraded should your store reach ARC's previous 16M triples limit.
  • There is a new "store_write_buffer" config option (default: 2500, changed from 5000). This option lets you set the batch size of triples written to the MySQL tables. In certain situations, esp. with shared hosts or large literal objects, the "5000" was too much and led to MySQL rejecting the queries.
  • the toTurtle and toRDFXML methods (and associated methods in the Serializers) accept a 2nd "raw" parameter now, in case you don't want a full RDF document, but just the triples. (thx to Claudia Wagner for the suggestion)
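For reference, a store configuration using the new options could look like this (the connection values are placeholders; the index and buffer values shown are the defaults mentioned above):

// Example store setup with the new config options (placeholder DB credentials).
$config = array(
  'db_host' => 'localhost', 'db_name' => 'mydb',
  'db_user' => 'user', 'db_pwd' => 'pass',
  'store_name' => 'mystore',
  'store_indexes' => array('sp (s,p)', 'os (o,s)', 'po (p,o)'),
  'store_write_buffer' => 2500,
);
$store = ARC2::getStore($config);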

If you have questions, just send them to the mailing list and I'll try to help.

RPointer - The resource described by this XPointer element

URIs for resources described in microformatted or poshRDF'd content
I'm often using/parsing/supporting a combination of different in-HTML annotations. I started with eRDF and microformats, more recently RDFa and poshRDF. Converting HTML to RDF usually leads to a large number of bnodes or local identifiers (RDFa is an exception; it allows the explicit specification of a triple's subject via an "about" attribute). Additionally, parsing a document in multiple passes (e.g. for microformats and then for eRDF) will produce different identifiers for the same objects.

I've searched for a way to create more stable, URI-based IDs. Mainly for two use cases: Technically, for improved RDF extraction, and practically for being able to subscribe to certain resource fragments in HTML pages, like the main hCard on a person's Twitter profile. The latter is something I need for Knowee.

The closest I could find (and thanks to Leigh Dodds for pointing me at the relevant specs) is the XPointer Framework and its XPointer element() scheme, which is "intended to be used with the XPointer Framework to allow basic addressing of XML elements".
Here is an example XPointer element and the associated URI for my Twitter hCard:
element(side/1/2/1)
http://twitter.com/bengee#element(side/1/2/1)
We can't, however, use this URI to refer to me as a person (unless I redefine myself as an HTML section ;-). It would work in this particular case, as I could treat the hCard as a piece of document and not as a person. But in most situations (for example events, places, or organizations), we may want to separate resources from their respective representations on the web (and RDFers can be very strict in this regard). This effectively means that we can't use element(), but given the established specification, something similar should work.

So, instead of element(), I tweaked ARC to generate resource() URIs from XPointers. In essence:
The RPointer resource() scheme allows basic addressing of resources described in XML elements. The hCard mentioned above as an RPointer:
resource(side/1/2/1)
http://twitter.com/bengee#resource(side/1/2/1)
There is still a certain level of ambiguity, as we could argue about the exact resource being described. Also, as HTML templates change, RPointers are only as stable as their context. But in practice, they have worked quite well for me so far.
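For the curious, deriving such a pointer from a DOM node roughly means walking up to the nearest ancestor with an @id and recording 1-based element positions along the way. The sketch below follows that reading of the element() scheme; it is not ARC's actual implementation:

// Sketch: build an element()/resource()-style child sequence for a DOM node,
// relative to its nearest ancestor carrying an @id (not ARC's real code).
function pointer_path(DOMElement $node) {
  $steps = array();
  while ($node && !$node->getAttribute('id')) {
    $pos = 1; // 1-based position among element siblings
    for ($sib = $node->previousSibling; $sib; $sib = $sib->previousSibling) {
      if ($sib->nodeType === XML_ELEMENT_NODE) $pos++;
    }
    array_unshift($steps, $pos);
    $parent = $node->parentNode;
    $node = ($parent instanceof DOMElement) ? $parent : null;
  }
  $id = $node ? $node->getAttribute('id') : '';
  return $id . ($steps ? '/' . implode('/', $steps) : '');
}
// a node nested below <div id="side"> might yield "side/1/2/1", giving
// $doc_url . '#resource(' . pointer_path($n) . ')' as the resource URI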

Note: The XPointer spec provides an extension mechanism, but it would have led to very long URIs including a namespace definition for each pointer. Introducing the non-namespace-qualified resource() scheme unfortunately means dropping out of the XPointer Framework ("This specification reserves all unqualified scheme names for definition in additional XPointer schemes"), so I had to give it a new name (hence "RPointer") and have to hope that the W3C doesn't create a resource() scheme for the XPointer framework.

RPointers are implemented in ARC's poshRDF and microformats extractors.

poshRDF - RDF extraction from microformats and ad-hoc markup

poshRDF is a new attempt to extract RDF from microformats and ad-hoc markup
I've been thinking about this since Semantic Camp where I had an inspiring dialogue with Keith Alexander about semantics in HTML. We were wondering about the feasibility of a true microformats superset, where existing microformats could be converted to RDF without the need to write a dedicated extractor for each format. This was also about the time when "scoping" and context issues around certain microformats started to be discussed (What happens for example with other people's XFN markup, aggregated in a widget on my homepage? Does it affect my social graph as seen by XFN crawlers? Can I reuse existing class names for new formats, or do we confuse parsers and authors then? Stuff like that).

A couple of days ago I finally wrote up this "poshRDF" idea on the ESW wiki and started with an implementation for paggr widgets, which are meant to expose machine-readable data from RDFa and microformats, but also from user-defined, ad-hoc formats, in an efficient way. PoshRDF can enable single-pass RDF extraction for a set of formats. Previously, my code had to walk through the DOM multiple times, once for each format.

A poshRDF parser is going to be part of one of the next ARC revisions. I've just put up a site at poshrdf.org to host the dynamic posh namespace. For now the site links to a possibly interesting by-product: a unified RDF/OWL schema for the most popular microformats: xfn, rel-tag, rel-bookmark, rel-nofollow, rel-directory, rel-license, hcard, hcalendar, hatom, hreview, xfolk, hresume, address, and geolocation. It's not 100% correct; poshRDF is, after all, still a generic mechanism and doesn't cover format-specific interpretations. But it might be interesting for implementors. The schema could be used to generate dedicated parser configurations. It also describes the typical context of class names so that you can work around scoping issues (e.g. the XFN relations are usually scoped to the document or embedded hAtom entries).

I hope to find some time to build a JSON exporter and microformats validator on top of poshRDF in the not too distant future. Got to move on for now, though. Dear Lazyweb, feel free to jump in ;)

Writing Inference Rules with SPARQLScript

SPARQLScript can be used for forward chaining, including string manipulations on the run.
In order to keep data structures in Semantic CrunchBase close to the source API, I used a 1-to-1 mapping between CrunchBase JSON keys and RDF terms (with only a few exceptions). This was helpful for people who know the JSON API, but it wasn't easy to interlink the converted information with existing SemWeb data such as FOAF or the various LOD sources.

SPARQLScript is already heavily used by the Pimp-My-API tool and the TwitterBot, but yesterday I added a couple of new features and finally had a go at implementing a (forward chaining) rule evaluator (for the reasons mentioned some time ago).

A first version ("LOD Linker") is installed on Semantic CB, with initially 9 rules (feel free to leave a comment here if you need some additional mappings). With SPARQLScript being a superset of SPARQL+, most inference scripts are not much more than a single INSERT + CONSTRUCT query (you can click on the form's "Show inference scripts" button to see the source code):
$ins_count = INSERT INTO <${target_g}>
  CONSTRUCT {?res a foaf:Organization } WHERE {
    { ?res a cb:Company }
    UNION { ?res a cb:FinancialOrganization }
    UNION { ?res a cb:ServiceProvider }
    # prevent dupes
    OPTIONAL { GRAPH ?g { ?res a foaf:Organization } }
    FILTER(!bound(?g))
  }
  LIMIT 2000
But with the latest SPARQLScript processor (ARC release 2008-09-12) you can run more sophisticated scripts, such as the one below, which infers DBPedia links from Wikipedia URLs:
$rows = SELECT ?res ?link WHERE {
    { ?res cb:web_presence ?link . }
    UNION { ?res cb:external_link ?link . }
    FILTER(REGEX(?link, "wikipedia.org/wiki"))
    # prevent dupes
    OPTIONAL { GRAPH ?g { ?res owl:sameAs ?v2 } . }
    FILTER(!bound(?g))
  }
  LIMIT 500

$triples = "";
FOR ($row in $rows) {
  # extract the wikipedia identifier
  $id = ${row.link.replace("/^.*\/([^\/\#]+)(\#.*)?$/", "\1")};
  # construct a dbpedia URI
  $res2 = "http://dbpedia.org/resource/${id}";
  # append to triples buffer
  $triples = "${triples} <${row.res}> owl:sameAs <${res2}> . "
}

#insert
if ($triples) {
  $ins_count = INSERT INTO <${target_g}> { ${triples} }
}

(I'm using a similar script to generate foaf:name triples by concatenating cb:first_name and cb:last_name.)

Inferred triples are added to a graph directly associated with the script. Apart from a destructive rule that removes all email addresses, the reasoning can easily be undone again by running a single DELETE query against the inferred graph.
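Undoing a rule's inferences then boils down to clearing its target graph, e.g. with a SPARQL+ call like this (the graph URI is a placeholder):

// remove everything a rule script inferred by dropping its target graph
// (SPARQL+ syntax; placeholder graph URI)
$store->query('DELETE FROM <http://example.com/graphs/lod-linker-inferred>');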

I'm quite happy with the functionality so far. What's still missing is a way to rewrite bnodes; I don't think that's possible yet. But INSERT + CONSTRUCT will leave bnode IDs unchanged, so the inference scripts don't necessarily require URI-denoted resources.

Another cool aspect of SPARQLScript-based inferencing is the possibility to use a federated set of endpoints, each processing only a part of a rule. The initial DBPedia mapper above, for example, uses locally available Wikipedia links. However, CrunchBase only provides very few of those. So I created a second script which can retrieve DBPedia identifiers for local company homepages, using a combination of local queries and remote ones against the DBPedia SPARQL endpoint (in small iterations and only for companies with at least one employee, but it works).

A Faceted Browser for ARC

One of the first Trice components is probably going to be a faceted browser for ARC
I'm going on vacation in a couple of days, and before that, I'm trying to tick off at least a few of the bigger items on my ToDo list. I was hoping for a first Trice preview (now that ARC is slowly getting stable), but this will have to wait until September. However, I managed to get another component that's been on my list for ages into a demo-able state today: A SPARQL/ARC-based faceted browser (test installation at Semantic CrunchBase).

faceted browser

It's an early, but working (I think ;) version. A template mechanism for the item previews is still missing, but I'm already quite happy with the facet column. The facets are auto-generated (based on statistical info and scope-detection), but it's also possible to define custom filters (for more complicated graph patterns, see screenshot below). Once again, SPARQLScript simplified development, thanks to its placeholders for parameterized queries.

faceted browser administration

I think I'm going to use the browser for a first Trice bundle. It's not too sophisticated, but builds on several core features such as request dispatching, RDF/SPARQL-based views and forms, basic AJAX calls, and cached template sections.

ARC Triples Visualizer Plugin

Luis Paulo implemented a Graphviz plugin for ARC
Luis Paulo created a graphviz plugin for ARC that generates .dot files and also SVG or bitmap graphics from triple sets (available features depend on the graphviz libraries installed on your machine). Thanks, Luis, great stuff!

ARC TriplesVisualizer Plugin output
