Saturday 21 March 2009

The "poor man's" semantic web

From the very beginning, the quest for smart queries over the vast amount of unstructured data we call the Web has been an elusive endeavor, yet one of great importance.

The problem is that the Web has developed mostly as a medium of unstructured documents meant for people, rather than as information that machines can manipulate automatically.

The very creator of the Web as we know it, Tim Berners-Lee, has discussed this limitation at length and has also proposed a solution: the Semantic Web. The main idea is to augment Web pages with hyperlinks to definitions of key terms and with rules for reasoning about them logically. This metadata is targeted at computers, and the resulting infrastructure would enable complex, knowledge-intensive tasks such as highly functional personal agents.
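
To give a flavor of what that metadata looks like, here is a minimal sketch using Python's rdflib library. All the URIs and term names are made up for illustration; real annotations would point at shared vocabularies and ontologies:

    from rdflib import Graph, Namespace, URIRef

    # A toy semantic-web annotation: machine-readable statements about a page,
    # linking a key term to its definition. Every URI here is hypothetical;
    # real pages would reuse shared vocabularies (RDFS, FOAF, etc.).
    EX = Namespace("http://example.org/terms/")

    g = Graph()
    page = URIRef("http://example.org/articles/jaguar")
    g.add((page, EX.mentions, EX.Jaguar))
    g.add((EX.Jaguar, EX.kindOf, EX.Animal))  # the cat, not the car
    g.add((EX.Jaguar, EX.definedAt, URIRef("http://example.org/defs/jaguar-cat")))

    # A software agent can now reason over triples instead of parsing prose:
    print(g.serialize(format="turtle"))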

But this idea has not yet taken off. The main reason is that adding such metadata is cumbersome for the people who create the pages. It is not a natural part of content creation, at least not beyond the simple tagging or categorization of pages.

More recently, another, even more radical approach has emerged: answer engines. Here the goal is not just to enrich content with semantics, but to produce pure knowledge content, such as complex conceptual maps, basic facts, and inference rules, so that questions can be answered by computing results instead of just searching for existing documents that contain the desired information. Two examples of this approach are True Knowledge and Wolfram.
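
To make the contrast with search concrete, here is a toy sketch of my own (certainly not how True Knowledge or Wolfram actually work): answering a question by inference over stored facts and a rule, rather than by retrieving any document:

    # Toy "answer engine": derive new facts from a rule instead of
    # searching documents. Facts are (subject, predicate, object) triples.
    facts = {
        ("eiffel_tower", "located_in", "paris"),
        ("paris", "located_in", "france"),
        ("france", "located_in", "europe"),
    }

    def close_located_in(facts):
        """Forward-chain the transitivity rule:
        located_in(a, b) and located_in(b, c) => located_in(a, c)."""
        derived = set(facts)
        changed = True
        while changed:
            changed = False
            for (a, _, b) in list(derived):
                for (b2, _, c) in list(derived):
                    if b == b2 and (a, "located_in", c) not in derived:
                        derived.add((a, "located_in", c))
                        changed = True
        return derived

    all_facts = close_located_in(facts)
    # The answer is computed, not looked up in any document:
    print(("eiffel_tower", "located_in", "europe") in all_facts)  # True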

I'm really skeptical about this model. First, because I don't believe it is scalable. The amount of human knowledge doubles every few years (two or five, depending on who you ask), so how can the feeding of such knowledge databases keep pace with that? Second, I don't think it is economically feasible. Even if you could enter all that knowledge, could you make a business out of it? Unless you provide enough intelligence to actually create new knowledge in non-trivial ways, I doubt it.

But more importantly, do we really need such an intelligent web?

Let's consider how Google works. Its tremendous success comes, to a great extent, from a simple yet powerful idea: use the implicit knowledge that already exists in the web. Google's search algorithm not only uses the content of a page but also considers the keywords used in the hyperlinks that point to that page. This is basically a sort of poor man's semantic annotation (well, if you can call Google poor).
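
A minimal sketch of the idea (a toy with made-up page names, not Google's actual pipeline): index each page under the anchor text of the links pointing at it, so the rest of the web effectively writes the semantic annotations for you:

    from collections import defaultdict

    # (source_page, anchor_text, target_page) -- hypothetical crawled links
    links = [
        ("blog.example.com", "great tutorial on monads", "wiki.example.org/monads"),
        ("forum.example.net", "monads explained simply", "wiki.example.org/monads"),
        ("news.example.com", "python release notes", "python.example.org/news"),
    ]

    # Index each target page under the words other authors use to link to it.
    index = defaultdict(set)
    for _, anchor, target in links:
        for word in anchor.lower().split():
            index[word].add(target)

    def search(query):
        """Return pages whose inbound anchor text matches every query word."""
        results = [index.get(w, set()) for w in query.lower().split()]
        return set.intersection(*results) if results else set()

    print(search("monads tutorial"))  # {'wiki.example.org/monads'}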

Now, if you have, as Google does, access to billions of pages and hundreds of thousands of processors to mine them, can't you arrive at the same "intelligent" answers to questions as with an answer engine? I guess the results would be difficult to tell apart. To start with, if a question is worth asking (and very likely, even if it isn't), surely someone has already put the answer in a web page. Consider this question from Wolfram's web page: "What is the 307th digit of Pi?". A quick search on Google retrieves, as its first result, a page on which you can search for any digit of Pi.
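
And if that page didn't exist, computing the answer yourself takes a few lines anyway. Here is a sketch with Python's mpmath library, assuming "digit" means a digit after the decimal point:

    from mpmath import mp

    # mpmath's "dps" counts significant digits, so 320 comfortably
    # covers the 307th decimal place.
    mp.dps = 320
    decimals = str(mp.pi).split(".")[1]  # the digits after the decimal point

    # The 307th digit after the decimal point (counting from 1):
    print(decimals[306])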

Paraphrasing what Garry Kasparov famously said after being defeated by Deep Blue, "quantity can become quality": a brute-force approach, given enough resources, can actually achieve results that are indistinguishable from actual intelligence.

Update: It is clear that the guys at Google Research share this view: The Unreasonable Effectiveness of Data.

Update: Tim O'Reilly on the difference between the Semantic Web and pattern recognition approaches.
