
08 March 2016

Search 3.0, transformative big data and the road ahead

Introduction

You may be wondering about the significance of the three Scottish flags in the image. I took this picture a few weeks ago. I'm in Edinburgh, there are three flags and this article is about Search 3.0, and I'd prefer Silicon Glen (the Scottish IT sector) to take this idea forward rather than Silicon Valley being the home of the best search engine. So my aim is for a Search 3.0 search engine to be based in Scotland. However, I'm open to ideas. First of all, some history to explain what Search 3.0 is and what transformative big data is all about.

Search 1.0

Although I have been on the Internet since the 1980s, my first experience of the web was in 1993, when I was studying for my Masters in Large Software Systems Development at Napier University and wrote a research paper on cooperative working. I had email at home from 1988 and used Usenet a great deal, so I was an early adopter of the web: I downloaded Mosaic and used early search engines such as Yahoo, Excite and Lycos when they first came out. I was particularly interested in AltaVista when it launched in 1995, as it had the biggest search index at the time and was built by my former employer, Digital. I had floated the idea of a web browser to them in 1989, but that was rather ahead of its time. The early search engines were interesting, and their job was a lot easier than it is now because there were so few sites; as the web grew, though, the unstructured web needed some order so that relevant results came to the fore.

Search 2.0

Search 2.0 came about when the founders of Google realised that a ranking of pages would help produce more relevant results. Their January 1998 paper on search is available. The basis for this was that the human element of embedding links in pages could be used to deduce that the pages being linked to were more important, because people had chosen to link to them. In effect, the human act of adding links allowed a computer algorithm to assign a rank to the pages and produce results which people found more valuable. It was also (somewhat) hard to spam the search index, as it required the manual effort of changing the links. Trusted sites on .edu and .ac.uk domains also scored higher. Search 2.0 has evolved since then, with ever better and more sophisticated algorithms trying to make more sense of the data out there on the web to produce even more useful results. Despite 17 years of Google, search is still pretty poor.
Try these difficult searches:
  1. You are flying soon. Your airline allows you to take a 2nd bag 40cm x 30cm x 15cm. Your task is to buy a bag that fits. As a secondary task, find a site that allows you to search by bag size.
  2. You are travelling alone searching for a hotel room in London. You require a double bed, en-suite and breakfast. You want a 3 star hotel or better. That in itself is quite hard because by stating one adult, you sometimes get twin rooms returned. HotelsCombined and others, when you rank the results by price, give you hostels. However, try combining this search to add "within 10 minutes' walk of a London tube station on the Piccadilly line or the number 133 bus" or some other transport requirement and you're stuck. A bit tricky if you're disabled and want accommodation near a bus route without having to change buses, or near a tube station accessible by wheelchair.
  3. You need to be at a meeting. Find a travel planner site which allows a portion of your journey by public transport to be swapped with a taxi provided it saves you time and doesn't cost more than £15.
  4. You are a single mother returning to work. You seek a part-time job that allows you to balance childcare and work, from 9am to 3pm Monday to Friday or from home. Your challenge is to find the website that allows you to search for this.
  5. You're looking to move house. The smallest bedroom must be at least 6ft by 8ft. Find the matching houses. You would prefer the house to be within a 5 minute walk of a bus stop.
  6. Tell me the flights which, allowing for connections at the other end, get me to my meeting in London on time. In London you have a choice of five airports, all with different prices and onward travel times. Let me know the total journey cost too (including public transport).
  7. You have forgotten a friend's birthday but know what would be the ideal present. Find all the local shops open now within a 30 minute drive which have it in stock.
  8. Find me all the used cars in the UK which comfortably take three adults in the back seat for a long journey. 
  9. Find me all the events on in my area. Surprisingly there isn't a predominant global website which does this. 
  10. Find me a job, such that duplicate postings by multiple agencies for the same position are eliminated. Also, show on the job advert the time expected to complete the job application process as I favour jobs without application forms. Let me know the commute time to the job.
(The above list is not meant to be exhaustive and I welcome additions for things that we would find useful, but which can't be searched for via a primary search engine such as Bing, Google, DuckDuckGo or Wolfram Alpha.)
There are lots of data driven searches for products and services that are simply impossible on the current web. There are three reasons for this.
1. The data is not published at all because it is not gathered in the first place. A bit like in 1993, when I was campaigning for more smoke-free areas in pubs: the first step was to get pub guides to survey pubs so we had a current view of the market, some data to work with and actual pubs to speak to about how smoke-free areas affected them.
2. The data is gathered but sits in a database somewhere that you have to query via an intermediate website. Such sites usually charge you to list there, unlike Google, which indexes pages for free. This is the likes of lastminute.com, Auto Trader, Zoopla, etc.
3. The data is published but is not structured in any useful way - instead you get a page of content, somewhere on that page is the info you need, and you have to scour for it manually. For example, Amazon lists the size of luggage on the product page but gives me no filter to search for luggage under a certain size. We could attempt to solve this problem by applying AI and a deep knowledge of human language to interpret each page, but that is a hard job to do error-free and extremely hard to do for all the world's languages. As a Gaelic speaker, I support minority languages and I wouldn't want their speakers to be sidelined. Data, ideas and feelings are our universal language and speech is only an interpretation of these.
So this is where clever Google algorithms run out of steam: the lack of quality data. And so to Search 3.0 and transformative data.
What we've seen is that the old business model of a newspaper listing advertisements hasn't really changed much for the internet age. eBay, Zoopla, Auto Trader - they simply sit at the same position in the sales cycle as a newspaper used to, selling adverts and making money from the advertiser based on their readership. What's changed in 20 years?

Search 3.0

This idea isn't new, but I have been promoting it and winning attention for it, just not financial backing. In 2000 I entered the Scottish Enterprise "Who wants to be an entrepreneur" competition with an early version of the idea, and it was recommended for a feasibility study by Ian Ritchie, the leading Scottish entrepreneur and TED speaker. I also submitted it to a computer magazine, which named it one of the top e-commerce ideas in the UK in February 2000. The issue then was funding, due to the dotcom crash: great idea, no funding climate. I suggested it to a crowdsourcing site in 2006, where it was called "the next Google". I blogged about it in 2008 and did a Google hangout with Google's Product Management Director for Search in 2013. Still no traction. Becoming rather fed up with the huge mountain to climb in order to get funding, I feel rather like Queen being told of Bohemian Rhapsody that "the band had no hope of it ever being played on radio". It is the most played song in Radio 1's history. Even Steve Jobs was ridiculed when the iPod launched - the product that paved the way to the world's most valuable company. Laugh away now. Sometimes the critics get it wrong.
To counter that, I'm putting some of the idea out there, because back in the early 90s on the Internet that's exactly what people used to do. For free. I did it with the UK Internet List in 1992 and the first online guide to Scotland in 1994, and Tim Berners-Lee did it with the web in 1991. Link to original post (might not work on mobile browsers). Why do this? To advance the Internet. To encourage debate. To drive forward standards. To recognise that this is the first time in the history of the planet where we have a free global platform we can converse on to exchange ideas, and to make that a better place for future generations. This only happens once in a planet's history and we are lucky to live in that time. It would be great if we got it right for future generations.
Why not? 

Search 3.0 - Layer 1. Data enrichment

I listed a few examples above of searches I've found frustrating. However, this could just be me. I don't know what you find frustrating about the web, what you are looking for that you can't find and what you would like to do to change it. Google probably has an idea, because it can track sessions, and long sessions repeatedly searching for similar subjects might be a good indicator that the data is poor; but in order to open this up democratically I suggest the following approach.
I'll begin by assuming the refinement of search is based around improving the quality of search across related high-volume e-commerce sites. The reason for this is that if you approach the idea from a VC perspective, this is where you might build the greatest economic value for the search first. However, you needn't necessarily follow this approach if you're being altruistic.
Step 1: Identify the top search categories you want to specialise in initially, e.g. hotel rooms, job listings, rooms to let, restaurants. Cross-reference these search terms against the first page of results in Google for these terms, for each of the world's top cities. Just record the domains returned (including from adverts, because the adverts are ranked for relevance). Store this in a database. Rank it by city size if you want a priority order. You now have a list of the top second-level search engines by product and locality. You no longer need Google.
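A minimal sketch of what this step might look like, assuming a simple SQLite table and a placeholder fetch_top_domains function (the categories, cities and table layout here are purely illustrative):

import sqlite3

# Illustrative seed lists; in practice these would be much longer.
CATEGORIES = ["hotel rooms", "job listings", "rooms to let", "restaurants"]
CITIES = ["London", "Edinburgh", "New York"]

def fetch_top_domains(category, city):
    """Placeholder: return the domains (including adverts) from the first
    page of results for '<category> <city>' from your chosen source."""
    return []  # to be replaced with a real lookup

db = sqlite3.connect("search3.db")
db.execute("""CREATE TABLE IF NOT EXISTS top_sites
              (category TEXT, city TEXT, domain TEXT, rank INTEGER)""")

for category in CATEGORIES:
    for city in CITIES:
        for rank, domain in enumerate(fetch_top_domains(category, city), start=1):
            db.execute("INSERT INTO top_sites VALUES (?, ?, ?, ?)",
                       (category, city, domain, rank))
db.commit()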
For each of the queries used to discover top categories and locations, the same query is also sent to dmoz.org (the Open Directory Project). The categories of the results returned are what is relevant here, as opposed to the actual pages. So for a query on travel and London, the top category returned from dmoz would be: Regional: Europe: United Kingdom: England: London: Travel and Tourism
Now you can correlate the Google results by category using the two results above. Furthermore as the dmoz directory is hierarchical, you can build up a hierarchy of websites to allow users to refine their search results. You now have a hierarchical product and geography driven database which references the top websites in the product category and geography. It's still only a list of websites though, no products or services yet.
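A minimal sketch of how the dmoz category path could be folded into a browsable hierarchy, with the Step 1 domains attached at the leaf (the path and domains are just the examples above):

# Build a nested dict from a dmoz-style category path and attach the
# domains found in Step 1 at the leaf, giving a browsable hierarchy.
def add_to_hierarchy(tree, category_path, domains):
    node = tree
    for part in (p.strip() for p in category_path.split(":")):
        node = node.setdefault(part, {})
    node.setdefault("_domains", []).extend(domains)
    return tree

hierarchy = {}
add_to_hierarchy(hierarchy,
                 "Regional: Europe: United Kingdom: England: London: Travel and Tourism",
                 ["lastminute.com", "hotelscombined.com"])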
Step 2: Since we have no data at present for more sophisticated searching, the user is presented with the option to refine the results by keywords against the results returned. Something like "To refine the listings on the webpages below, please indicate what is important to you..."
In the example above, the user could specify grading, price, address etc.
The next user who comes along with a similar search sees the keywords the first user added, and can then vote for them and/or add their own. Over time, the keywords entered by users would be shown in each category of search results in order of decreasing popularity. This seeding of the database would occur during the product alpha/beta stages so that there was already a dataset at the time of formal launch. What the site is doing is learning, in a Web 2.0 sense, what criteria are actually important to people in order to drive the next phase - a popularity contest for "how would you like to extend the search capability", something most websites never ask: they just give you options and it's "take it or leave it".
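A minimal sketch of that keyword store, assuming nothing more than a per-category tally of user-suggested refinements (the category and keywords are the examples used above):

from collections import Counter

# One tally of refinement keywords per search category, e.g. "hotel rooms".
refinements = {}

def suggest_keyword(category, keyword):
    """A user proposes, or votes for, a refinement keyword."""
    refinements.setdefault(category, Counter())[keyword.lower()] += 1

def top_keywords(category, n=10):
    """Keywords shown to the next user, in decreasing order of popularity."""
    return [kw for kw, _ in refinements.get(category, Counter()).most_common(n)]

suggest_keyword("hotel rooms", "price")
suggest_keyword("hotel rooms", "grading")
suggest_keyword("hotel rooms", "price")
print(top_keywords("hotel rooms"))  # ['price', 'grading']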
Step 3: You have a database of accommodation websites categorised in a directory from Step 1, and you have a list of how you'd like to search them from Step 2. Next, you send out a search engine spider to these sites, as Google does, and spider them. Websites are usually built from templates in content management systems, and even complex sites might only have around 20 unique templates. So once you have figured out the templates, the data on them is usually just repeating patterns.
Some might call this site scraping but it's no different to what a search engine does when it indexes content. You should respect the rules from the robots.txt file and behave in a considerate way when indexing other people's content.
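As a minimal sketch of that considerate behaviour, a spider built on the Python standard library might check robots.txt and throttle itself before every fetch (the bot name and delay are arbitrary):

import time
import urllib.parse
import urllib.request
import urllib.robotparser

USER_AGENT = "Search3Bot"  # illustrative bot name

def polite_fetch(url, delay=2.0):
    """Fetch a page only if the site's robots.txt allows it, with a crawl delay."""
    parts = urllib.parse.urlsplit(url)
    robots = urllib.robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    if not robots.can_fetch(USER_AGENT, url):
        return None  # respect the site owner's wishes
    time.sleep(delay)  # be considerate about server load
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")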
We're not dealing with a vast number of websites here and they are based on repeating patterns so joining them up is not so hard. 
You open up a programmable interface to the site for markup editors, possibly in return for a share of the advertising revenue generated from those listings. The markup editors would use a tool such as XRAY (http://westciv.com/xray/) to examine the contents of pages on the site to see where the relevant info occurs that people want to search for. This is effectively an open API for a site scraper tool. mysupermarket.com demonstrates that site scraping works and, rather than running into copyright issues, all that is happening is an intelligent parsing of the site, rather like a search engine robot. There is nothing particularly revolutionary here - besides mySupermarket for groceries, the same concept has been applied by Workhound for jobs and Globrix for housing - however, these sites are all narrow vertical markets, limited by geography, and they do not interact with users to extend their search capabilities.
Sites could ban this parsing if they wanted via the standard robots.txt, and instructions on how to do this would be available for site owners. The site editors, guided by the top search terms from Step 2, then indicate where the relevant content is on the page. For instance, if you were parsing lastminute.com, the price information is after the text "total stay from". If you were parsing job listings, the salary on Jobserve is next to the bold "Rate" keyword, and so on. Although this is a non-trivial job, the existence of site scrapers shows it can be done, plus XML/RSS feeds from the site provide additional scope to help with the parsing. The spider would only be sent to sites with a certain minimum number of pages (1,000?) as seen by Google, to ensure that only content-rich sites were indexed. The volume of pages returned also gives you a good data set to teach the parsing technique.
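In the simplest case, a markup editor's contribution boils down to a pattern anchored on the label text they identified; a rough sketch, assuming regular expressions are good enough for the two examples above (real templates would need more robust parsing):

import re

# Each rule maps a field name to a pattern anchored on the label text
# a markup editor spotted in the site's template (examples from above).
EXTRACTION_RULES = {
    "lastminute.com": {"price": re.compile(r"total stay from\s*£?\s*([\d,.]+)", re.I)},
    "jobserve.com": {"salary": re.compile(r"<b>\s*Rate\s*</b>[:\s]*([^<]+)", re.I)},
}

def extract_fields(domain, html):
    """Apply the site's rules to a fetched page and return whatever fields matched."""
    fields = {}
    for name, pattern in EXTRACTION_RULES.get(domain, {}).items():
        match = pattern.search(html)
        if match:
            fields[name] = match.group(1).strip()
    return fields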
Step 4: Once the top sites have been parsed in this way, the parsed information can then be used to drive subsequent searches. Supposing the price info had been parsed, the price keyword would show as bold on the search results, indicating that the data was available; this would then allow the user to refine further on that option. So in this example we have built a search engine that allows the user to search for hotels by price across all the top relevant accommodation search engines. Exactly the same pattern could be used to write a search engine for jobs, real estate or electronic goods for sale, ultimately arriving at a search engine that is like eBay in terms of refining listings down to the level the user wants, e.g. mobile phones with wifi, a 5 megapixel camera, and so on. The difference, however, is that eBay charges for a listing, whereas this is a general search engine that points off to the original site, allowing the product to be listed on its own site for nothing. You might call this the professional consumer who knows what they want to search for (I have prosume.com for this purpose).
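Once a field like price has been parsed out, the refinement itself is straightforward; a sketch with invented listings and an assumed upper-bound filter:

# Parsed listings from Steps 3 and 4; the entries here are invented for illustration.
listings = [
    {"site": "example-hotels.com", "name": "Hotel A", "price": 95.0, "stars": 3},
    {"site": "example-stays.com", "name": "Hotel B", "price": 140.0, "stars": 4},
]

def refine(results, **maximums):
    """Keep listings where every named parsed field is present and below the given cap."""
    def keep(item):
        return all(item.get(field) is not None and item[field] <= cap
                   for field, cap in maximums.items())
    return [item for item in results if keep(item)]

print(refine(listings, price=100))  # only Hotel A survives the price cap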
Step 5: Complete world domination! (only kidding)
Having targeted the big sites for useful listings and built a really useful product and service search engine, categorised by product type and location and searchable by top keywords, we now get to the bit where the Internet as we've known it can really change massively. VCs interested in buzzwords might call this disruption. I suppose it is, because you no longer need to pay to advertise.
Until now, if you had a specialised product or service then in order to get it really noticed you had to submit it (usually at cost) to a specialised search site. If you have a property to sell in Edinburgh, you put it on ESPC.com. If you have a job, you add it to S1Jobs.com, and so on. However, just as Google can index individual sites and list them, the same should be true for products and services. It shouldn't be necessary to list them on some other site; you should be able to list the products and services effectively on your own site and have them searchable for free, just as Google indexes simple web pages for free. Why not?
How is this achieved? Having followed Steps 1 to 4 above, let us assume that we want to allow people to list a job without having to pay to do it on a job search site. Jobs come from agencies and employers, so in the search category listing for jobs in the UK derived at Step 1, you publish a "get listed here" guide. The guide would refer to the top parsed search terms (derived from Step 4) and the format they need to appear in on the webpage for the listing to be successfully parsed. So for a job listing you could require that there are bold fields "Min salary:" and "Max salary:" with the salary information next to them (alternatively this info could come through in the site's RSS feed). Thus any site can be added provided it can be easily parsed. What is especially exciting is that the search terms are of course driven by users, so there is scope here to go well beyond the searchable terms on existing sites. For instance, users might want to search for jobs that are accessible by public transport, yet no job search site offers this. Disabled people might want to search for jobs that they can access from a level entrance (an option already available for tourist accommodation searches). Part-time mums might want to search for jobs by specific working hours, and so on. Asking users how to improve search is a unique feature of this site. By specifying the enhanced template for listing against new criteria, sites would have an incentive to provide this information to make their listings more relevant and searchable.
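A rough sketch of what the "get listed here" guide might ask a site to publish, and how the Step 3 parser could read it back (the HTML fragment and field names are illustrative, built around the "Min salary:" / "Max salary:" convention suggested above):

import re

# The kind of fragment the guide would ask an employer's own site to include.
sample_listing = """
<div class="job">
  <h2>Software Engineer, Edinburgh</h2>
  <p><b>Min salary:</b> 35000 <b>Max salary:</b> 45000</p>
  <p><b>Hours:</b> 9am-3pm Mon-Fri</p>
</div>
"""

# Any bold label ending in a colon, followed by its value.
FIELD_PATTERN = re.compile(r"<b>\s*([^<:]+):\s*</b>\s*([^<]+)")

def parse_listing(html):
    """Return the label/value pairs the guide asked sites to publish."""
    return {label.strip(): value.strip() for label, value in FIELD_PATTERN.findall(html)}

print(parse_listing(sample_listing))
# {'Min salary': '35000', 'Max salary': '45000', 'Hours': '9am-3pm Mon-Fri'}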
How do users generate this data? Since the data format is open source, the tools will be freely available and could take the form of web apps, WordPress plug-ins, CMS extensions and so on. They would be updated in real time to deal with updates to the agreed schema.
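As one illustration of the kind of open, versioned schema such tools might share (every field name and value here is hypothetical):

import json

# A hypothetical shared schema that a WordPress plug-in or CMS extension could
# publish alongside a listing, so the spider knows which fields to expect.
JOB_LISTING_SCHEMA_V1 = {
    "schema": "job-listing",
    "version": 1,
    "fields": {
        "min_salary": {"type": "number", "unit": "GBP"},
        "max_salary": {"type": "number", "unit": "GBP"},
        "hours": {"type": "string"},
        "level_entrance": {"type": "boolean"},          # user-driven criterion
        "nearest_bus_stop_metres": {"type": "number"},  # user-driven criterion
    },
}

print(json.dumps(JOB_LISTING_SCHEMA_V1, indent=2))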
Where is the value? With open data, there is opportunity for competition - sites can bring the data together in new and interesting ways as we've seen where this has happened with government data. There would be competition for the data in terms of who built the best sites around it. There would be entrepreneurship in taking the data forward, rather than the world of jobseeking where the schema hasn't moved in over 20 years. There would be integration of the data with existing apps to make them more useful. There are lots of opportunities.
The net result is a search engine with the power of eBay's searches, the breadth of Google, the profitability of PlentyOfFish or Jobserve (85% profit) scaled up, and the usefulness of Amazon, driving and expanding search according to user preference. With more openness there is also more decentralisation, and less need for the high-end, expensive centralised data centres which the consumer usually ends up paying for in some form.
Besides products and services, there is also my data. I want to control what I share and with whom. As with driving a search, the more I share the more relevant information I might get in terms of services, but ultimately it's my choice rather than whatever mandatory fields the people in your business have put up.
This isn't intended to be a polished article. It's open to inspection, adaptation and improvement. As the Internet's data should be.
Craig
Original article at https://www.linkedin.com/pulse/search-30-transformative-big-data-road-ahead-craig-cockburn please also feel free to comment there.
