Saturday, 21 March 2009

The "poor man's" semantic web

From its very beginning, the quest for smart queries over the vast amount of unstructured data we call the Web has been an elusive endeavor, yet one of great importance.

The problem is that the Web has developed mostly as a medium of unstructured documents meant for people, rather than of information that can be manipulated automatically.

The very creator of the Web as we know it, Tim Berners-Lee, has discussed this limitation at length and has also proposed a solution: the Semantic Web. The main idea is to augment Web pages with hyperlinks to definitions of key terms and with rules for reasoning about them logically. This metadata is targeted at computers, and the resulting infrastructure would allow complex, knowledge-intensive tasks such as highly functional personal agents.

But this idea has not yet taken off. The main reason is that adding such metadata is cumbersome for the people who create the pages. It is not a natural part of content creation, at least not beyond simple tagging or categorization of pages.

More recently, another, even more radical approach has emerged: the answer engines. Here the goal is not just to enrich content with semantics, but to produce pure knowledge content (complex conceptual maps, basic facts, and inference rules) so that questions can be answered by computing results, instead of just searching for existing documents that contain the desired information. Two examples of this approach are True Knowledge and Wolfram.

I'm really skeptical about this model. First, I don't believe it is scalable. The amount of human knowledge doubles every few years (two or five, depending on who you ask), so how can such knowledge bases keep pace when they have to be fed by hand? Second, I don't think it is economically feasible. Even if you could enter all that knowledge, could you make a business out of it? Unless you provide enough intelligence to actually create new knowledge in non-trivial ways, I doubt it.

But more importantly, do we really need such an intelligent Web?

Let's consider how Google works. Its tremendous success comes, to a great extent, from a simple yet powerful idea: use the implicit knowledge that already exists in the Web. Google's search algorithm not only uses the content of a page but also considers the keywords used in the hyperlinks that point to that page. This is basically a sort of poor man's semantic annotation, if you can call Google poor.
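
To make the idea concrete, here is a toy sketch of such an index: a page is scored both by its own words and by the anchor text of the links pointing at it. The names and weights are illustrative assumptions, not Google's actual algorithm.

    # Toy "poor man's semantic annotation" index (Python).
    # A page is described not only by its own content but by how
    # other pages link to it; weights are arbitrary assumptions.
    from collections import defaultdict

    index = defaultdict(lambda: defaultdict(float))

    def add_page(url, words):
        for w in words:
            index[w][url] += 1.0        # evidence from the page itself

    def add_link(target, anchor_words):
        for w in anchor_words:
            index[w][target] += 3.0     # evidence from how others describe it

    def search(word):
        return sorted(index[word].items(), key=lambda kv: -kv[1])

    add_page("example.com/pi", ["digits", "table"])
    add_page("example.com/misc", ["digits"])
    add_link("example.com/pi", ["pi", "digits"])
    print(search("digits"))             # the well-described page ranks first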

Now, if you have, as Google does, access to billions of pages and hundreds of thousands of processors to mine them, can't you arrive at the same "intelligent" answers to questions as with an answer engine? I guess the results would be difficult to tell apart. To start with, if a question is worth asking (and very likely, even if it is not), surely someone has already put the answer on a web page. Consider this question from Wolfram's web page: "What is the 307th digit of Pi?". A quick Google search retrieves as its first result a page on which you can search for any digit of pi.
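
The "compute the answer" side is not mysterious either. Here is a minimal sketch, assuming the third-party mpmath library for arbitrary-precision arithmetic, and counting digits after the decimal point (conventions vary):

    # Answer "What is the 307th digit of Pi?" by computing it
    # rather than searching for it.
    from mpmath import mp, nstr

    mp.dps = 320                   # work with some digits of slack
    digits = nstr(mp.pi, 315)      # "3.14159..."
    fractional = digits.split(".")[1]
    print(fractional[306])         # the 307th digit after the point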

Paraphrasing what Garry Kasparov famously said after being defeated by Deep Blue, "quantity can become quality": a brute-force approach, given enough resources, can achieve results that are indistinguishable from actual intelligence.

Update: It is clear that the guys at Google Research share this view: The Unreasonable Effectiveness of Data.

Update: Tim O'Reilly on the difference between Semantic Web and pattern-recognition approaches.

Sunday, 15 March 2009

Can Erlang become the next Java?

Erlang is a functional programming language that has been steadily gaining attention in the industry. Some even advocate it as "the next Java", but can this actually happen? Can a single language attract enough support to become a de facto standard?

In the face of the current fragmentation of the programming-language landscape (Ruby, Python, F#, Scala, and a long etcetera), this seems very unlikely at first. However, the very existence of such fragmentation indicates a search for new programming paradigms to tackle the needs of today's applications.

Java's success in the early web days was due mostly to two key features that were essential for the nascent web applications: dynamic code loading from the network and platform independence. From that point its acceptance grew dramatically in both desktop and server applications, despite criticism of its poor performance.

Similarly, Erlang offers several key characteristics for the next generation of distributed applications:
  • Its suitability for parallel and distributed programming, thanks to its simple message-based concurrency model (sketched below).
  • It allows hot swapping of code (the code of a running process can be changed while it is still running!).
  • It also has a reputation for being rock solid, running on Ericsson's ATM switches with a reported reliability of 99.9999999% (nine nines, which is a fraction of a second of downtime a year!).
Such characteristics have not gone unnoticed: recent open-source projects like Scalaris, CouchDB, and RabbitMQ are built on Erlang, and it has caught the attention of enterprise application developers.
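
To give a flavor of that first point, here is a rough analogue of Erlang's mailbox-style message passing, sketched in Python with threads and queues. It is only an illustration of the model: real Erlang processes are far lighter than OS threads, and hot code swapping has no simple equivalent here.

    # Each "process" owns a mailbox and communicates only by sending
    # messages; no shared mutable state is touched directly.
    import queue
    import threading

    def echo_actor(mailbox):
        while True:
            reply_to, msg = mailbox.get()   # block until a message arrives
            if msg == "stop":
                break
            reply_to.put(("echo", msg.upper()))

    mailbox, replies = queue.Queue(), queue.Queue()
    threading.Thread(target=echo_actor, args=(mailbox,)).start()

    mailbox.put((replies, "hello"))         # asynchronous send
    print(replies.get())                    # ('echo', 'HELLO')
    mailbox.put((replies, "stop"))          # let the actor terminate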

If the recent history of Java is any indication, Erlang has all the credentials to take off as a mainstream (if not dominant) programming language for next-generation web applications.

Wednesday, 4 March 2009

Why should Google be worried about Twitter?

It is rather uncommon to see a Google officer attack competitors, but this is exactly what Eric Schmidt, Google's CEO, did. And he did it in a rather surprising way: calling Twitter "a poor man's email system" because it lacks the basic capabilities of a full-fledged email service like Gmail. It is inconceivable that he mistakes Twitter for email; what he did was simply spread some FUD about Twitter. But why does Google, with over 100 million email accounts, even bother to acknowledge the existence of Twitter, with its modest 5 million users?

In this declining economy, investment in advertising is decreasing and advertisers are looking for better channels. Twitter has demonstrated that it can be used in very innovative ways for marketing: for viral campaigns, for personalized advertising, for brand building, and for receiving user feedback. And, at least for now, the price is quite low: free!

Also, Twitter users seem to handle commercial messages rather well, maybe because they are easily integrated into the message flow and conveniently filtered, searched, aggregated, and so on. Moreover, advertisers use Twitter as a truly bidirectional channel: not only to advertise, but also to gather feedback.

It is to be expected that if Twitter wants to stay in business, it must start charging for commercial usage. It is not clear how, but there are already some interesting ideas.

Twitter still has a long way to go before it threatens Google's dominance. But it wouldn't be the first time something like this happened; one case comes easily to mind: when Google came from nowhere and displaced Yahoo. And you know, history has a tendency to repeat itself.

Tuesday, 3 March 2009

What does Twitter change?

The tremendous success of first the Internet and later the Web is mainly due to their core design principle: build a basic, unsophisticated, yet flexible infrastructure and let the applications at the edges provide the intelligence (the so-called end-to-end principle).

Moreover, the Internet/Web infrastructure is based on open standards like TCP/IP, HTTP, and HTML, preventing provider lock-in and lowering the entry barriers for newcomers.

Such an infrastructure allows open-ended innovation, supporting application architectures and usage patterns no one had considered when it was designed. For example, who could ever have dreamed of Ajax back in the early '90s?

Twitter follows the same principle and therefore has the same potential as the technologies it is built upon. It offers a basic infrastructure for publish/subscribe communication of short text messages, with an open API that allows others to develop sophisticated applications such as search, aggregation (TweetDeck), and trend analysis (hashtags.org, Twist, TweetStats).
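
The underlying pattern is simple enough to sketch. Here is a minimal publish/subscribe model in Python; the names are hypothetical, and the real Twitter API is of course an HTTP interface built on essentially this idea:

    # Followers subscribe to an author; each short message is pushed
    # to all of that author's subscribers.
    from collections import defaultdict

    followers = defaultdict(set)    # author -> subscriber callbacks

    def follow(author, callback):
        followers[author].add(callback)

    def tweet(author, text):
        assert len(text) <= 140, "short messages only"
        for deliver in followers[author]:
            deliver(author, text)

    follow("alice", lambda who, txt: print(f"{who}: {txt}"))
    tweet("alice", "hello, subscribers")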

One obvious difference is that Twitter is neither an open infrastructure nor based on open standards. However, neither is Google Maps, and that has not prevented it from becoming a de facto standard. Moreover, there are other technologies, like the venerable IRC and more recently XMPP, that have similar capabilities for instant communication and are based on open standards, but they haven't had the same impact as the Web. Why, then, should Twitter be different?

What makes Twitter so powerful, and gives it its tremendous potential, is how it is actually used. Users post anything they find relevant, interesting, or funny. They post about themselves, friends, hobbies, work. They expose preferences and dislikes. In other words, Twitter opens up people's thoughts and feelings and makes them available to others instantaneously, creating a continuous conversation that you can enter and leave. Participate. Watch.

This is the closest we will ever be to telepathy. And that surely will change the way we communicate.