John Smart, 2013: exponential increase in search query length

Fascinating theories. Is search query length still showing signs of exponential growth, empirically speaking?

Here is some raw data which could be analyzed to see whether it’s truly exponential or just increasing linearly.
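One rough way to tell the two apart is to fit both a straight line and an exponential curve to the yearly averages and see which leaves the smaller error. The sketch below is only an illustration: the function names are my own, the exponential fit is done by fitting a line to the logarithm of the values (a common shortcut, not the only option), and the numbers at the bottom are synthetic sanity checks, not the raw data.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sse_linear(xs, ys):
    """Sum of squared errors of the best straight line."""
    slope, intercept = fit_line(xs, ys)
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

def sse_exponential(xs, ys):
    """Sum of squared errors of an exponential curve fitted in log space."""
    slope, intercept = fit_line(xs, [math.log(y) for y in ys])
    return sum((y - math.exp(slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Synthetic sanity check (made-up series, NOT the search data):
years = [0, 1, 2, 3, 4, 5]
doubling_series = [2 ** t for t in years]       # exactly exponential
steady_series = [1 + 0.5 * t for t in years]    # exactly linear

print(sse_linear(years, doubling_series), sse_exponential(years, doubling_series))
print(sse_linear(years, steady_series), sse_exponential(years, steady_series))
```

With only a handful of yearly points the two fits can end up very close, so a smaller error for one curve is a hint rather than proof.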

Disclaimer: the analysis below is extremely rough, and I recommend looking at the raw data if you need this for anything important. Also, I can’t vouch for the raw data’s accuracy; it’s just information I found randomly on the web.

This is a back-of-the-envelope analysis using only the USA data for January of each year, except where there was no January data, in which case I used the February data. It seems to conflict with other data. For example, this site lists the average query length to Google as 4.29 words as of 2012. I treated 10+ word queries as 10-word queries. Note also that the Y axis is truncated.

For some reason this data seems to show a drop-off in 2013, aside from which the average query length in words seems to be steadily growing. For many searches, of course, we can find precisely what we’re looking for with one or two words, so we should certainly expect a leveling off at some point. And one of Google’s latest innovations is to allow a person to ask follow-up questions, such as “Who is the current U.S. President?” followed by “How old is he?” This could shorten query lengths slightly. In conversation with other humans we often ask one-word follow-up questions such as “really?” or “right?”, which we would not currently ask of Google.
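For anyone who wants to redo the averaging, here is a minimal sketch of the calculation described above, with every 10+ word query counted as exactly 10 words. The function name and the shares in the example are hypothetical placeholders, not the raw data; plug in the real percentages for a given month.

```python
def mean_query_length(share_by_words, cap=10):
    """Weighted mean query length in words.

    share_by_words maps a word count to its share of searches (percent or
    raw counts both work). Anything longer than `cap` is counted as `cap`,
    which mirrors treating the "10+" bucket as 10-word queries.
    """
    total = sum(share_by_words.values())
    if total <= 0:
        raise ValueError("shares must sum to a positive number")
    weighted = sum(min(words, cap) * share for words, share in share_by_words.items())
    return weighted / total

# Hypothetical distribution for illustration only (NOT the raw data).
# Key 10 stands for the "10+ words" bucket.
example_shares = {1: 25.0, 2: 25.0, 3: 20.0, 4: 12.0, 5: 8.0,
                  6: 4.0, 7: 2.0, 8: 2.0, 9: 1.0, 10: 1.0}

print(mean_query_length(example_shares))
```

Because the longest queries are capped at 10 words, this can only understate the true average, which may be part of why my numbers come out lower than other sources.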

On a related note, a recent study found that the “optimum length for an email is 50 to 125 words”. If we had true AI, we might make a 50 to 125 word request, or series of requests, of Google, fully explaining what we are looking for as we might do with a friend or colleague.

I would be curious to hear anyone’s opinion on whether search length really dropped off in 2013 and, if so, why. There are a lot of intersecting factors. Google has no true competitors and is constantly being “gamed” and adjusting its algorithm. The data or my analysis could be wrong or affected by random variance. The rise of interconnectivity and easier access to other humans could also be a factor. For example, we might now send an email knowing it will be answered quickly, when in the past we would have spent time constructing a complex search. We might also take our more complicated questions to other humans on interest-based social platforms such as Reddit or Twitter. For example, if you have a 10+ word question on programming, you’re probably better off emailing a friend or posting on Stack Overflow.

However, if you look at the raw data, even 10+ word searches are growing. And currently most searches cluster at 4 words or fewer. We have a long way to go before we are talking conversationally with our computers.

At the end of the above clip, John Smart noted that the average length of a human-to-human question was 11 to 14 words. He also stated in the video that he thought we would reach this point by 2019, and that by then every child would have a cell phone because they’d be dirt cheap. The latter prediction definitely looks like it will be correct.

[Chart: words per search]

The quote below from his essay seems to disagree with my analysis. Emphasis added.

Predicting the CI Emergence

When can we expect the CI’s emergence? In March 2005 Google’s director of search Peter Norvig noted that their average query is now about 2.5 words per query, by comparison to 1.3 on Alta Vista in its heyday, circa 1998. In subsequent email conversation with him he has told me that the actual number is “closer to 2.6 or 2.7.” This is an initial doubling time of only seven years, if this is a quasiexponential function.

It appears that the growth of the CI as a complex adaptive technological system is in the early phase of an S-curve, well before the inflection point, and thus its growth will continue to look exponential for some time to come.

[2008 Note: Average query length to Google now exceeds 4 words, apparently just this month. This is more early evidence that this phase of search query length growth will remain exponential up to the inflection point.]

In my opinion this average search query length, averaged across all the leading search engines of the day (Google, Yahoo!, Bing, etc.) will be one of the key numbers to watch to gauge the growing effectiveness of statistical natural language processing (statistical NLP) in creating a conversational front end for the internet and all our other complex technologies in the 21st century.
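The doubling-time arithmetic in the quoted passage is easy to check. The sketch below assumes clean exponential growth between two observations, which is exactly the assumption in question; the function names are mine, and the inputs are the figures quoted above (1.3 words circa 1998, 2.6 to 2.7 in 2005, about 4 in 2008) plus the 11-word low end of the human-to-human range from the talk.

```python
import math

def doubling_time(start_value, end_value, elapsed_years):
    """Years needed to double, assuming exponential growth between two points."""
    if start_value <= 0 or end_value <= start_value or elapsed_years <= 0:
        raise ValueError("need positive growth over a positive time span")
    return elapsed_years * math.log(2) / math.log(end_value / start_value)

def years_to_reach(current_value, target_value, doubling_years):
    """Years until target_value is reached at a fixed doubling time."""
    return doubling_years * math.log2(target_value / current_value)

# Figures from the quoted essay: 1.3 words (circa 1998) to 2.6 or 2.7 (2005).
print(doubling_time(1.3, 2.6, 7))
print(doubling_time(1.3, 2.7, 7))

# From about 4 words in 2008 to 11 words at a seven-year doubling time.
print(years_to_reach(4, 11, 7))
```

At a seven-year doubling time, going from 4 words to 11 takes a bit over ten years, which lands in the neighborhood of the 2019 date from the talk. That only holds if the growth stays exponential, which the data above may or may not support.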

Is Microsoft’s new email, Hotmail, trying to hide email addresses or is it just horrible?

The new Hotmail makes it nearly impossible to cut and paste an email address from a message received at a Hotmail address. Perhaps it’s just horribly designed. Or perhaps it’s an attempt to keep people on that platform. To anyone considering migrating off that site: it is worth the effort. Hiding addresses, as the new Hotmail seems to do so well, strikes me as anticompetitive, or at the least not in the decentralized spirit of email.

Question for media members obsessed with Theranos

What concerns you most?

1. VCs losing money that they can afford to lose? Accredited investors who are certainly big boys and can take care of themselves?

2. Elizabeth Holmes getting attention she doesn’t deserve? (Who cares?)

3. People getting ineffective blood tests for a short while? In no way could their ineffectiveness be kept a secret for long.

4. You think medicine is too lightly regulated?

Take a look at the clip below and realize that biology is hard. If we make a gigantic deal out of every failure, we’ll get more heavy-handed regulation and drive away people who actually can solve health problems.

Notes to: Machine Learning 101 Episode 10

This podcast series is an excellent overview of machine learning.

  • Nature vs Nurture
  • Deterministic vs probabilistic view of reality
  • Environments are not intrinsically probabilistic or non-probabilistic
  • Goal is to determine probability weights of events
  • Bayesian priors (dog genetic wiring example)
  • Maximum likelihood estimation: choose the probabilistic law that makes the observed data most likely
  • Math estimation problem
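To make the maximum likelihood note concrete, here is a small sketch of my own (it is not from the podcast). It estimates the probability of a yes/no event from made-up observations in two ways: with the closed-form answer (the observed fraction) and with a brute-force search for the probability that makes the data most likely. Both should agree.

```python
import math

def log_likelihood(p, observations):
    """Log-probability of the observed 0/1 outcomes if each 1 has probability p."""
    return sum(math.log(p) if x else math.log(1 - p) for x in observations)

def mle_bernoulli(observations):
    """Closed-form maximum likelihood estimate: the fraction of 1s."""
    return sum(observations) / len(observations)

# Made-up observations for illustration: 7 successes out of 10 trials.
observations = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

p_hat = mle_bernoulli(observations)

# Brute-force check: try p = 0.01, 0.02, ..., 0.99 and keep the most likely one.
grid = [i / 100 for i in range(1, 100)]
best_on_grid = max(grid, key=lambda p: log_likelihood(p, observations))

print(p_hat, best_on_grid)
```

The grid search is only there to show what "most likely" means; in practice the closed form (or a numerical optimizer for harder models) does the work.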
How times change

In Bill Gates’s deposition, he seems to have regarded video on demand as a cute failed idea.