Q&A Platforms Evaluated
Using the Butler University Q&A Intelligence Index
A Study by the Butler Business Accelerator
Executive Summary
A new study using the Butler University Q&A Intelligence
Index measures how well various mobile Q&A platforms deliver
high-quality, accurate answers in a timely manner to a broad variety of
questions. Based on the results of our analysis, ChaCha led all
Q&A platforms on mobile devices.
[Figure: *Mean accuracy of responses, originally graded on a 5-point scale.]
Results of the study are based on a review of a large set of
responses from each of the major Q&A platforms, coupled with a
comparison of disparate Q&A platforms that serve answers in
different ways. Our methodology included the creation of a new
metric, the Butler University Q&A Intelligence
Index, which measures the likelihood that a user will
receive a correct answer in a timely manner to any random
question asked in natural language. We asked questions via
mobile services and randomized them to cover both popular
and long-tail knowledge requests.
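The index is described here only in prose, so the following is a minimal sketch of how such a score could be computed, assuming each question is recorded with a response time (or no response) and a correct/incorrect judgment. It also assumes that the 3-minute window mentioned in the Quora discussion serves as the timeliness cutoff. The names `GradedResponse`, `qa_intelligence_index`, and `TIME_LIMIT_SECONDS` are hypothetical, and the study's 5-point grades are collapsed to a simple boolean for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical timeliness cutoff: the report mentions a 3-minute window
# when discussing Quora; using it as the index's cutoff is an assumption.
TIME_LIMIT_SECONDS = 180.0


@dataclass
class GradedResponse:
    """One question posed to a platform (hypothetical record layout)."""
    seconds_to_answer: Optional[float]  # None if no answer was returned
    correct: bool                       # grader's judgment, simplified to yes/no


def qa_intelligence_index(responses: Sequence[GradedResponse],
                          time_limit: float = TIME_LIMIT_SECONDS) -> float:
    """Share of all questions (0-100) that received a correct, timely answer.

    Unanswered and late questions count as misses, so the score reflects
    both response rate and accuracy.
    """
    if not responses:
        raise ValueError("need at least one graded response")
    hits = sum(
        1
        for r in responses
        if r.seconds_to_answer is not None
        and r.seconds_to_answer <= time_limit
        and r.correct
    )
    return 100.0 * hits / len(responses)


if __name__ == "__main__":
    sample = [
        GradedResponse(seconds_to_answer=45.0, correct=True),
        GradedResponse(seconds_to_answer=30.0, correct=False),
        GradedResponse(seconds_to_answer=600.0, correct=True),  # too late
        GradedResponse(seconds_to_answer=None, correct=False),  # no answer
    ]
    print(qa_intelligence_index(sample))  # 25.0
```

Because unanswered questions count as misses in this sketch, a platform that answered only 24% of questions could score at most 24, however good its individual answers were.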
ChaCha consistently delivered the highest-quality responses
across the largest number of categories and question types, but it
had occasional difficulty with Objective|Temporal and Sports
questions. Ask.com performed best in the single category of
questions tagged as Objective|Temporal.
Quora was proficient at answering difficult questions that
require expert and extensive explanations, but it was generally
unable to deliver answers within 3 minutes for most information
searches on mobile devices. Quora answered only 24% of the
questions at all, and the matches it found often did not include a
viable answer.
Siri did not perform nearly as well on this random sampling of
popular and long-tail questions as it did in a recent Piper Jaffray
study, in which Siri correctly answered 77% of
questions (Elmer-DeWitt, 2012). Our study found that Siri
accurately answered only 16% of the questions posed. The difference may be
due to the types of questions asked and the testing conditions.
Piper Jaffray notes that Siri's biggest strengths are in "local
discovery and OS (operating system) commands," which were not well
represented among the more mainstream questions in our study.
Google's response rate was 100%, but the first non-sponsored
result on the search results page (which often was not fully
visible on the page as first presented on a mobile device)
presented an accurate answer only about 50% of the time, according
to the Butler University Q&A Intelligence Index. On a mobile phone,
given the clutter of ads and the likely need for extra clicks to
reach the answer, accepting an answer found in the first
non-sponsored search result might be considered generous. Again,
these results differ from those of the Piper Jaffray study, likely
because of differences in methodology. For example, Piper Jaffray
found that Google scored highest in terms of navigation and
information (Elmer-DeWitt, 2012).
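The grading rule applied to Google, in which only the first non-sponsored result counts, can be sketched as follows. This is an illustrative assumption rather than the study's procedure: the `SearchResult` type and the `first_organic_result` and `answered_by_first_organic` helpers are hypothetical, and the case-insensitive substring check merely stands in for the human graders, who scored answers on a 5-point scale.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SearchResult:
    """One entry on a results page (hypothetical layout)."""
    snippet: str
    sponsored: bool


def first_organic_result(results: Sequence[SearchResult]) -> Optional[SearchResult]:
    """Return the first non-sponsored result, skipping ads; None if there is none."""
    for result in results:
        if not result.sponsored:
            return result
    return None


def answered_by_first_organic(results: Sequence[SearchResult],
                              expected_answer: str) -> bool:
    """Crude stand-in for a human grader: does the first organic snippet
    contain the expected answer text (case-insensitive)?"""
    top = first_organic_result(results)
    return top is not None and expected_answer.lower() in top.snippet.lower()


if __name__ == "__main__":
    page = [
        SearchResult("Buy tickets now!", sponsored=True),
        SearchResult("Indianapolis is the capital of Indiana.", sponsored=False),
        SearchResult("Indiana travel guide", sponsored=False),
    ]
    print(answered_by_first_organic(page, "Indianapolis"))  # True
```

Only the first organic result is inspected, so an answer that appears in a later organic result still counts as a miss, which matches the rule described above.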
This study's results support the hypothesis that Q&A
platforms cannot rely on algorithmic search results alone to
deliver quality answers. Search engine results pages (SERPs) lack
deep semantic understanding of human questions posed in natural
language and therefore cannot effectively handle long-tail questions
like those posed in this study (De Virgilio, Guerra, &
Velegrakis, 2012).
To achieve a score above 50% or 60% on the Butler University
Q&A Intelligence Index, it would appear that Q&A
platforms must supplement algorithmic document indexing with
either:
- Utilization of structured data
- Semantic understanding via artificial intelligence (AI) or real
humans
To handle structured data more effectively, Google is
promoting direct answers through its new Knowledge Graph and Google
Now technologies, "which tap into the collective intelligence of the
web and understand the world a bit more like people do" (Google,
2012). Nevertheless, the limits of Google's algorithmic technologies
are evident in the empirical results of this study and in users'
actual experiences. Other Q&A platforms in this study are also
incorporating similar algorithmic solutions.
Improved machine learning may eventually push past the
algorithmic limitations of document analysis. IBM's DeepQA project
(Watson on Jeopardy!) demonstrated that intensive semantic
processing can help, although not in a way that scales
cost-efficiently on today's systems. While Ferrucci notes that
"Computers cannot ground words to human experiences to derive
meaning," Watson showed potential. The DeepQA project claimed 90%
accuracy when answering 60% of questions and 70% accuracy when
answering 100% of questions (Ferrucci, 2010). These results
translate to Butler University Q&A Intelligence Index scores of 54
and 70, respectively.
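The conversion from DeepQA's reported figures to index scores appears to multiply accuracy by the share of questions attempted. The worked arithmetic below assumes that reading; the symbols $I$ (index score), $a$ (accuracy on attempted questions), and $c$ (share of questions attempted) are introduced here for illustration and are not defined in the original study.

```latex
\begin{aligned}
I &= 100 \cdot a \cdot c \\
I_{\text{partial}} &= 100 \times 0.90 \times 0.60 = 54 \\
I_{\text{full}} &= 100 \times 0.70 \times 1.00 = 70
\end{aligned}
```

Under this reading, attempting fewer questions with higher accuracy can still score below attempting every question with lower accuracy.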
We conclude that, absent large advances in semantic processing
and better use of knowledge graphs, Q&A platforms can
benefit from the timely injection of human semantic understanding
into the Q&A experience. ChaCha's top score on the Butler
University Q&A Intelligence Index indicates that
just-in-time human-assisted Q&A can outperform algorithm-only
solutions.