
Designing a better Web 3.0 search engine

In Uncategorized on January 7, 2007 at 7:09 pm

Author: Marc Fawzi

Twitter: http://twitter.com/#!/marcfawzi

License: Attribution-NonCommercial-ShareAlike 3.0

This post discusses the significant drawbacks of current quasi-semantic search engines (e.g., hakia.com and ask.com) and examines the potential future intersection of Wikipedia, Wikia Search (the recently announced search engine in development by Wikipedia’s founder), a future semantic version of Wikipedia (aka Wikipedia 3.0), and Google’s PageRank algorithm, to shed some light on how to design a better semantic search engine (aka a Web 3.0 search engine).

Query Side Improvements

Semantic “understanding” of search queries (or questions) determines the quality of relevant search results (or answers).

However, current quasi-semantic search engines like hakia and ask.com can barely understand users’ queries, because they’ve chosen free-form natural language as the query format. Reasoning about natural language search queries can be accomplished by: a) Artificial General Intelligence or b) statistical semantic models (which introduce some inaccuracy when constructing internal semantic queries). But a better approach at this early stage may be to guide the user through selecting a domain of knowledge and staying consistent within the semantics of that domain.

The proposed approach implies an interactive search process rather than a one-shot search query. Once the search engine confirms the user’s “search direction,” it can formulate an ontology (on the fly) that specifies a range of concepts the user could draw on in formulating the semantic search query. Only a minimal amount of input should be needed to arrive at the desired result (or answer), the endpoint being determined by the user when they declare “I’ve found it!”
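To make the interactive flow concrete, here is a minimal sketch of the confirm-direction-then-refine loop described above. The domains and concept sets are hypothetical stand-ins; a real engine would derive them from curated ontologies rather than a hard-coded table.

```python
# Hypothetical domain -> concept map standing in for real, curated ontologies.
DOMAINS = {
    "medicine": {"symptom", "treatment", "dosage"},
    "astronomy": {"orbit", "magnitude", "spectral class"},
}

def confirm_direction(user_terms):
    """Pick the domain whose concept set best overlaps the user's initial terms."""
    scores = {d: len(concepts & user_terms) for d, concepts in DOMAINS.items()}
    return max(scores, key=scores.get)

def refine(domain, chosen_concepts):
    """Offer the domain's remaining concepts as the next refinement step."""
    return sorted(DOMAINS[domain] - chosen_concepts)

# One step of the interactive loop: confirm the "search direction," then
# present further concepts until the user declares "I've found it!"
domain = confirm_direction({"orbit", "planet"})
next_steps = refine(domain, {"orbit"})
```

The point of the sketch is that each round of user input is an explicit, semantically constrained choice, not a free-form string the engine has to guess about.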

Information Side Improvements

We are beginning to see search engines that claim they can semantic-ize arbitrary unstructured “Wild Wild Web” information. Wikipedia pages, constrained to the Wikipedia knowledge management format, may be easier to semantic-ize on the fly. However, at this early stage, a better approach may be to use human-directed crawling that associates the information sources with clearly defined domains/ontologies. An explicit publicized preference for those information sources (including a future semantic version of Wikipedia, a la Wikipedia 3.0) that have embedded semantic annotations (using, e.g., RDFa http://www.w3.org/TR/xhtml-rdfa-primer/ or microformats http://microformats.org) will lead to improved semantic search.
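As an illustration of how a crawler could express that publicized preference, the sketch below counts embedded RDFa attributes and common microformat root class names in a page, using only the Python standard library. The scoring itself is invented for illustration; the attribute and class names come from the RDFa and microformats specifications linked above.

```python
from html.parser import HTMLParser

RDFA_ATTRS = {"property", "typeof", "about", "resource"}  # RDFa attributes
MICROFORMAT_CLASSES = {"vcard", "hcard", "hentry"}        # common microformat roots

class SemanticMarkupDetector(HTMLParser):
    """Counts RDFa attributes and microformat class names in an HTML page."""
    def __init__(self):
        super().__init__()
        self.hits = 0

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in RDFA_ATTRS:
                self.hits += 1
            elif name == "class" and value:
                self.hits += len(MICROFORMAT_CLASSES & set(value.split()))

def semantic_score(html):
    """Crude crawl-priority signal: annotated pages get preferred treatment."""
    detector = SemanticMarkupDetector()
    detector.feed(html)
    return detector.hits
```

For example, `semantic_score('<div typeof="foaf:Person" property="foaf:name">Marc</div>')` counts two annotation hits, while a plain-HTML page scores zero and would sit lower in the crawl queue.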

How can we adapt the currently successful Google PageRank algorithm (for ranking information sources) to semantic search?

One answer is that we would need to design a ‘ResourceRank’ algorithm (referring to RDF resources) to manage the semantic search engine’s “attention bandwidth.” A less radical option would be to design a ‘FragmentRank’ algorithm, which would rank at the page-component level (e.g., paragraph, image, Wikipedia page section).
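One way to read ‘ResourceRank’ is as ordinary PageRank power iteration where the nodes are RDF resources (or, for ‘FragmentRank’, page fragments) rather than whole pages. The sketch below runs that iteration over a tiny, invented resource graph; it is a sketch of the adaptation, not a description of Google’s actual algorithm.

```python
# PageRank-style power iteration, but over RDF resources instead of pages.
def resource_rank(links, damping=0.85, iterations=50):
    """links: dict mapping each resource to the list of resources it references."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Illustrative resource graph (hypothetical identifiers).
graph = {
    "wikipedia:Ontology": ["wikipedia:Semantic_Web"],
    "wikipedia:Semantic_Web": ["wikipedia:Ontology", "wikipedia:RDF"],
    "wikipedia:RDF": ["wikipedia:Semantic_Web"],
}
ranks = resource_rank(graph)
```

Since every node in this graph links out, the ranks sum to one, and the most-referenced resource ends up with the largest share of the engine’s “attention bandwidth.”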

Related

  1. Wikipedia 3.0: The End of Google?
  2. Search By meaning

Update

  1. See relevant links under comments

Tags:

web 3.0, semantic web, ontology, reasoning, artificial intelligence, AI, hakia, ask.com, pagerank, google, semantic search, RDFa, ResourceRank, RDF, Semantic MediaWiki, microformats

  1. I found the following links at http://wiki.ontoworld.org/index.php/SemWiki2006

    1) http://wiki.ontoworld.org/wiki/Harvesting_Wiki_Consensus_-_Using_Wikipedia_Entries_as_Ontology_Elements
    “The English version of Wikipedia contains now more than 850,000 entries and thus the same amount of URIs plus a human-readable description. While this collection is on the lower end of ontology expressiveness, it is likely the largest living ontology that is available today. In this paper, we (1) show that standard Wiki technology can be easily used as an ontology development environment for named classes, reducing entry barriers for the participation of users in the creation and maintenance of lightweight ontologies, (2) prove that the URIs of Wikipedia entries are surprisingly reliable identifiers for ontology concepts, and (3) demonstrate the applicability of our approach in a use case.”

    2) http://wiki.ontoworld.org/wiki/Extracting_Semantic_Relationships_between_Wikipedia_Categories
    “We suggest that semantic information can be extracted from Wikipedia by analyzing the links between categories. The results can be used for building a semantic schema for Wikipedia which could improve its search capabilities and provide contributors with meaningful suggestions for editing the Wikipedia pages. We analyze relevant measures for inferring the semantic relationships between page categories of Wikipedia.”

    3) http://wiki.ontoworld.org/wiki/From_Wikipedia_to_Semantic_Relationships:_a_Semi-automated_Annotation_Approach

  2. Thanks for the relevant links.

    Marc

  3. What if you had an AI which used stochastic models and had feedback mechanisms so that it could use evolutionary programming to learn which results were best? Combining Yahoo and Google (people and robots)…?

  4. > What if you had an AI which used stochastic models…

    in a way, the data set (wikipedia pages + wild-wild-web pages) is itself stochastic.

    re feedback mechanism: if google knows what search results you visit, then they can feedback visited pages into pagerank. but in a directed, multi-step search process, the way the user narrows results is explicit, yielding a _much richer_ feedback loop. not just in terms of which results are chosen, but in the _particular way_ sets of results answer the search ‘problem’.

    re evolutionary programming: useful (along with neural networks) as a possible method that the search-engine uses to optimize its operating parameters, in the crawl or result-fetching stages.

    merging/unifying the crawl and results processes together, you can imagine a human supervised-learning process where the engine learns how to crawl _and_ fetch/present results for randomly-generated, historical, or real-time queries. this way, everyone that uses the engine unknowingly trains it.

    “Using the knowledge linked to by URL u, I can answer search ‘directions’ according to Ontology o”

  5. My line of thought precisely. Although I wonder if that would open it up to a whole new realm of blackhat SEO with click farms in china or on zombie armies? Something for Google et al to try to work out, I guess.

  6. Google has no future.

    Money does not buy the future. It only glues you to the present, and the present becomes the past.

    The future is not for sale. It’s for those who can claim it.

    Money obeys the future, not vice versa.

    Marc

  7. Well, there’s a saying that goes: money talks, bullshit walks.

    However, the problem with Google is bigger than money can fix.

    Google is stuck with a technology and a business model that are less optimal than what is possible today (never mind what will be possible in two or three years), so they either distribute all their profits as dividends and start over with Google 3.0 using a new technology and a new business model (i.e. disrupt themselves) or submit to the fact that their technology and business model are, like all technologies and business models, not immune to disruption.

    But that’s just one view. Another view is that they will last forever, or for a very long time. They may well do so, but definitely not as the dominant search engine. Anyone who thinks they will is contradicting nature and idolizing Google.

    Nature is all about survival of the fittest.

    Google’s technology and business model are not the fittest, by design.

    Who will undermine Google?

    That’s the $300B question.

    My answer is: Google itself.

    It’s like being on a seesaw, over a cliff. For now, the mountain side is weighed down by mass misconception and by the competitors’ sub-mediocre execution.

    Speaking of execution, let me inject the word “Saddam” here so Google starts associating this blog with Saddam execution videos. Do you see how dumb Google is???

    It’s not about semantic vs non-semantic design. It’s about bad design vs good design. You can undermine a bad design a lot easier than a good design.

    It’s time to come up with a good one!

    There are private companies competing with NASA (the organization that put a man on the moon 38 years ago) and they’re succeeding at it … Why shouldn’t we have an X Prize for the first company to come up with a P2P search engine that beats Google (i.e. The People’s Google)?

    Time for breakfast, again.

    Marc
    P.S. I do have to believe in breakfast in order to exist.

  8. I agree with your vision. But there are many technical difficulties. For example, on-the-fly ontology generation is a very hard problem. Especially if you want to do it on the user side, I doubt whether it would work. We will have new search models (other than Google and Yahoo) for the Semantic Web. But the time is not ready for the revolution yet.

    Anyway, I believe your thoughts are great. I will soon post a new article about web evolution. I think you might be interested in reading it. ;-)

  9. No one can say the “time is not ready,” especially not a semantic web researcher. The time is always ready. The question is whether or not we’re ready. I believe we are :) …

    Things already in motion.

  10. > But there are many technical difficulties. For example, on-the-fly ontology generation is a very hard problem.

    Any elementary algorithm can generate on-the-fly ontologies, the question is how useful, reusable, and accurate they are.

    Thinking along the lines of “fluid ontologies,” “fluid knowledge,” or “evolving ontologies” may be a killer app for the semantic web, because ‘rigidly’ binding OWL (or OWL-like) ontologies to data yields a relatively narrow range of expression.

    > But the time is not ready for the revolution yet.

    The time has always been “ready for the revolution yet”, but it has never been ready for people to state that it hasn’t. ;-)

  11. http://blog.wired.com/monkeybites/2007/01/wikiseek_launch.html
    Tuesday, 16 January 2007
    SearchMe Launches Wikiseek, A Wikipedia Search Engine
    Topic: search

    The search engine company SearchMe has launched a new service, Wikiseek, which indexes and searches the contents of Wikipedia and those sites which are referenced within Wikipedia. Though not officially a part of Wikipedia, TechCrunch reports that Wikiseek was “built with Wikipedia’s assistance and permission.”

    Because Wikiseek only indexes Wikipedia and sites that Wikipedia links to, the results are less subject to the spam and SEO schemes that can clutter up Google and Yahoo search listings.

    According to the Wikiseek pages, the search engine “utilizes Searchme’s category refinement technology, providing suggested search refinements based on user tagging and categorization within Wikipedia, making results more relevant than conventional search engines.”

    Along with search results Wikiseek displays a tag cloud which allows you to narrow or broaden your search results based on topically related information.

    Wikiseek offers a Firefox search plugin as well as a Javascript-based extension that alters actual Wikipedia pages to add a Wikiseek search button (see screenshot below). Hopefully similar options will be available for other browsers in the future.

    SearchMe is using Wikiseek as a showcase product and is donating a large portion of the advertising revenue generated by Wikiseek back to Wikipedia. The company also claims to have more niche search engines in the works.

    If Wikiseek is any indication, SearchMe will be one to watch. The interface has the simplicity of Google, but searches are considerably faster — lightning fast, in fact. Granted, Wikiseek is indexing far fewer pages than Google or Yahoo. But if speed is a factor, niche search engines like Wikiseek may pose a serious threat to the giants like Google and Yahoo.

    Steve Rubel of Micro Persuasion has an interesting post about the growing influence of Wikipedia and how it could pose a big threat to Google in the near future. Here are some statistics from his post:

    The number of Wikipedians who have edited ten or more articles continues its hockey stick growth. In October 2006 that number climbed to 158,000 people. Further, media citations rose 300% last year, according to data compiled using Factiva. Last year Wikipedia was cited 11,000 times in the press. Traffic is on the rise too. Hitwise says that Wikipedia is the 20th most visited domain in the US.

    While Wikiseek will probably not pose a serious threat to the search giants, Wikipedia founder Jimmy Wales is looking to compete with the search giants at some point. While few details have emerged, he has announced an as-yet-unavailable new search engine, dubbed Search Wikia, which aims to be a people-powered alternative to Google.

    With numbers like the ones cited above, Wikipedia may indeed pose a threat to Google, Yahoo and the rest.

  12. Copying the Wikipedia 3.0 vision in a half assed way is more about leveraging the hype to make a buck than moving us forward.

    However, I’d give any effort a huge benefit of the doubt just for trying.

    :)

  13. […] Jan 7, ‘07: Also make sure to check out “Designing a Better Web 3.0 Search Engine.” […]

  14. […] turned up a short counter-point blog post about their approach by Marc Fawzi and […]

  15. […] Now see this Evolving Trends article that preceded the description from the above. Designing a Better Web 3.0 Search Engine. […]

  16. […] Designing a Better Semantic Search Engine* […]
