Did IBM Watson Pass the Turing Test?
This post summarizes some talks that I gave on Watson and the Jeopardy challenge.
Recently, millions of viewers witnessed computing history being made as IBM’s Watson question answering system defeated human champions on the Jeopardy! TV show. Watching Watson’s outstanding performance, many of us pondered the “intelligence” of the entity performing in front of us. In this post I ask whether Watson is indeed intelligent and discuss the meaning of machine intelligence in general.
The Turing Test
Artificial intelligence (AI) is the branch of computer science that aims to create intelligent machines. What does it mean for a machine to be intelligent? Alan Turing, a founding father of computer science whose centenary was recently celebrated around the world, was one of the first people to ask that question. In his seminal 1950 paper, Computing Machinery and Intelligence, Turing asked to what extent a machine can think for itself. One of the arguments raised in this work was that since humans are the only “intelligent” entities we are aware of, any intelligent entity should “behave like” a human, i.e., an outside observer should not be able to classify its behavior as artificial. Turing suggested a test, now called the Turing test, to examine the intelligence of an unknown entity. An external interrogator observes the behavior of the entity and of a human, where both are asked to solve the same set of problems simultaneously. The interrogator sees only the actions taken by the two entities and the outcomes of those actions. If the interrogator is unable to distinguish correctly between the human and the artificial solutions, the machine has passed the test and should be considered intelligent. At that time, Turing anticipated that in about 100 years (forty years from now) artificial entities would be able to pass the test for the majority of the tasks examined.
The Turing test has many shortcomings. One clear limitation is that humans are not ideal problem solvers. On tasks that machines solve far more effectively than humans, the artificial entity is easily identified and therefore fails the test. For example, a calculating machine asked to multiply two eight-digit numbers will answer much more quickly and accurately than any human being, and so will be easily revealed. To pass the test, the calculating machine would have to perform sub-optimally, working slowly and making a few mistakes in order to fool the interrogator. This does not make any sense from an engineering point of view and raises doubts about the importance of mimicking human behavior.
An alternative, practical approach to AI, called weak AI, does not bother too much with the precise definition of intelligence and focuses on building machines that are effective problem solvers for specific tasks. The goal of AI, accordingly, is to advance the technology by identifying tasks at which humans are currently better than machines and looking for solutions to those tasks, probably by observing and learning from humans, to ultimately arrive at an artificial solution that catches up to and even surpasses humans. Some examples of domains in which humans are still superior to machines are game playing, natural language understanding, speech recognition, image analysis, and autonomous driving. The artificial solution need not be identical to the human solution, as our goal is not to mimic humans but rather to achieve human-level performance and eventually to outperform humans.
How can we determine that the artificial program outperforms the human solution? An evaluation test bed is needed to compare the performance of the two problem solvers. In contrast to the Turing test, the evaluation criteria should focus only on the effectiveness of the competing solutions, whether artificial or human.
Chess playing is an ideal domain for exemplifying the weak AI approach. In the early days of AI, many researchers designed and built chess-playing programs, but those programs were not able to challenge master-level human players. For many years, AI experts believed that high-level chess playing would remain infeasible for artificial programs in the near future (and some even argued that it would never be possible). The main argument raised by those experts was that an artificial expert-level playing system needs to be able to formalize the sophisticated heuristics used by human experts, a task that is most likely infeasible.
In 1997, Deep Blue, a chess program developed at IBM Research, defeated Garry Kasparov, the human chess champion at that time, in a six-game match. In contrast to the expectations of specialists, Deep Blue did not apply any sophisticated heuristics to evaluate the board positions. The strength of its playing strategy came mostly from advanced computational machinery and a clever distributed search algorithm that was able to search the game tree many moves ahead within the limited time allocated per move. This brute-force computational power was sufficient to beat Kasparov, one of the best human chess players of all time. Since then, many new chess programs have been developed with high-level playing performance, even surpassing Deep Blue with far fewer computational resources.
Did Deep Blue pass the Turing test? The answer is: certainly not. Any chess expert can easily identify a game played by an artificial program because such artificial players generally follow a typical strategy that assumes the worst-case scenario, i.e., the opponent is assumed to play perfectly from the player’s perspective. This assumption leads to a typically conservative playing strategy, which is easily revealed. For example, such a strategy will never set traps, since a perfect opponent is not expected to fall for them. However, from a weak AI perspective, chess playing is no longer a challenge. Computer scientists and practitioners might still find some interest in building machines that surpass Deep Blue. The same is true for strong AI researchers and other cognitive scientists who are still interested in modeling human strategies for chess playing. In contrast, weak AI researchers should turn their interests and efforts to new, unexplored areas where humans are still superior, for example, the game of Go.
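The worst-case assumption described above can be illustrated with a toy minimax search. This is a minimal textbook sketch, not Deep Blue's actual search algorithm; the tiny game tree and the names minimax and best_move are made up for illustration. Leaves hold the payoff for the program, and the opponent is assumed to always pick the reply that is worst for the program.

```python
from typing import List, Tuple, Union

# A game tree is either a leaf payoff (int) or a list of child subtrees.
Tree = Union[int, List["Tree"]]


def minimax(node: Tree, maximizing: bool) -> int:
    """Return the value of `node` assuming both sides play perfectly."""
    if isinstance(node, int):
        return node  # leaf: payoff from the program's point of view
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


def best_move(moves: List[Tree]) -> Tuple[int, List[int]]:
    """Pick the move with the best worst-case value for the program."""
    # After the program moves, it is the opponent's (minimizing) turn.
    scores = [minimax(child, maximizing=False) for child in moves]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical position with two candidate moves:
#   move 0 ("safe"): the opponent's replies lead to payoffs 1 or 2
#   move 1 ("trap"): payoff 10 if the opponent blunders, 0 otherwise
toy_tree: List[Tree] = [[1, 2], [10, 0]]

move, scores = best_move(toy_tree)
print(move, scores)
```

In this toy position the search values the trap at 0, because it assumes the opponent finds the refutation, and so it prefers the safe move. A human player who expects a fallible opponent might well choose the trap, which is the kind of difference an expert observer can pick up on.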
IBM Watson and the Jeopardy! Challenge
Natural language understanding (NLU) has always been a challenging domain for AI. While a lot of progress has been made in this area during the last 50 years, current technology is still unable to handle a free discussion with humans. Question answering (QA) is a specific NLU sub-task in which questions posed in natural language are automatically answered. QA has been extensively studied over the years, and many QA systems have been developed that can answer a wide range of question types, including facts, lists, definitions, and others.
Can an artificial QA system outperform humans in answering natural language questions? This is a typical weak AI challenge that was raised recently by a team of IBM researchers. In order to answer such a question we first need an evaluation system in which the performance of humans and artificial systems can be objectively tested and compared. The Jeopardy! TV show was selected as an evaluation platform. Jeopardy! is a well-known television quiz show that first aired in the United States in 1964. It features trivia in a wide variety of topics, including history, language, literature, arts, science, popular culture, geography, and many more. Three human contestants compete against one another in a competition that requires understanding and answering questions rapidly, with penalties for incorrect answers.
IBM Watson, a QA system developed at IBM Research, was built specifically to play the game of Jeopardy! [2,3]. Watson was designed to answer Jeopardy! questions within a response time of less than three seconds. Watson’s main innovation is its ability to quickly execute more than 100 different language analysis techniques to analyze the question, find and generate candidate answers, and ultimately score and rank them. Watson’s knowledge base contains 200 million pages of structured and unstructured content, consuming 4 terabytes of disk storage. The hardware for Watson includes a cluster of 2,880 POWER7 processor cores and 16 terabytes of RAM, with massively parallel processing capability.
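The analyze, generate, score, and rank flow described above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not Watson's implementation: the three-entry evidence table, the two scorers, their weights, and the confidence threshold are all hypothetical, and the idea of declining to answer below a threshold is an assumption motivated by the game's penalty for wrong answers.

```python
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical toy "knowledge base": candidate answer -> evidence text.
TOY_EVIDENCE = {
    "Alan Turing": "british mathematician who proposed the imitation game in 1950",
    "Garry Kasparov": "russian chess champion who lost a match to deep blue in 1997",
    "Ken Jennings": "jeopardy contestant with the longest winning streak",
}


def tokens(text: str) -> Set[str]:
    """Lower-case the text and split it into a set of words."""
    return set(text.lower().replace("?", "").split())


def overlap_score(clue: str, evidence: str) -> float:
    """Fraction of clue words that also appear in the evidence."""
    clue_words = tokens(clue)
    return len(clue_words & tokens(evidence)) / len(clue_words)


def year_score(clue: str, evidence: str) -> float:
    """1.0 if a year mentioned in the clue appears in the evidence, else 0.0."""
    years = {word for word in tokens(clue) if word.isdigit()}
    return 1.0 if years & tokens(evidence) else 0.0


# Each scorer is paired with an assumed, hand-picked weight.
SCORERS: List[Tuple[Callable[[str, str], float], float]] = [
    (overlap_score, 0.7),
    (year_score, 0.3),
]


def rank_candidates(clue: str) -> List[Tuple[str, float]]:
    """Score every candidate with all scorers and sort by combined confidence."""
    ranked = [
        (candidate, sum(weight * scorer(clue, evidence) for scorer, weight in SCORERS))
        for candidate, evidence in TOY_EVIDENCE.items()
    ]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


def answer(clue: str, threshold: float = 0.5) -> Optional[str]:
    """Return the top candidate, or None when confidence is below the threshold."""
    best, confidence = rank_candidates(clue)[0]
    return best if confidence >= threshold else None


print(answer("This chess champion lost to Deep Blue in 1997"))
print(answer("Name a river in Peru"))
```

The first clue matches the Kasparov evidence strongly and is answered; the second has no good candidate in the toy table, so the sketch stays silent rather than risk a penalty. A real system would combine far more scorers and learn the weights from data rather than fixing them by hand.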
In an official Jeopardy! two-game match, broadcast in three episodes during February 2011, Watson beat Brad Rutter, the biggest all-time money winner on Jeopardy!, and Ken Jennings, the record holder for the longest championship streak. The match ended with a clear win for Watson. After the first game, Jennings had $4,800, Rutter had $10,400, and Watson had $35,734; the final two-game totals were $24,000 for Jennings, $21,600 for Rutter, and $77,147 for Watson.
Did Watson pass the Turing test? As in the case of Deep Blue, the answer is ‘no’. Clearly, there are many questions that are more difficult for humans than for machines, and vice versa. For example, factoid questions about obscure, little-known events in history might be easier for machines, since the human brain is much more limited in remembering such events. In contrast, puzzle questions are much easier for humans. For example, Jeopardy! ‘common bonds’ questions (e.g., “what do butter, carving, and steak have in common?” answer: knife) are much more difficult for machines because the expected answer type cannot be easily formalized.
Given the discrepancy between human and machine capabilities in answering specific types of questions, a smart interrogator can select questions that are difficult for humans while easy for machines, and vice versa. By analyzing the performance of the two entities being tested on these questions, the interrogator will presumably detect the machine. However, from a weak AI perspective, the research focus should now move to answering more challenging questions (e.g., puzzles) and to other unresolved tasks in the NLU domain.
Watson’s success in winning Jeopardy! was a tremendous advancement in the QA domain. Watson’s technology, based on deep natural language analysis and strong computational power, demonstrated again that tasks considered infeasible for machines can be solved effectively and efficiently, even beyond human performance, when studied by a motivated team that is given appropriate and sufficient resources. The Deep Blue and Watson success stories strengthen our belief that other research areas can benefit from the weak AI approach of challenging human superiority as a platform for advancing computational technology. The question of whether such technologies can pass the Turing test seems to be an interesting thought exercise, but one of little significance from a technological perspective.
1. Because the IBM team was concerned that the Jeopardy! show’s writers would exploit Watson’s deficiencies, it was agreed that questions for the match would be randomly selected from a pool of Jeopardy! clues that had never been broadcast.
[1] Murray Campbell, A. Joseph Hoane, and Feng-hsiung Hsu. Deep Blue. Artificial Intelligence, Volume 134 (1), 2002, pages 57-83.
[2] Dave A. Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty. Building Watson: An overview of the DeepQA project. AI Magazine, Volume 31 (3), 2010, pages 59-79.
[3] Dave A. Ferrucci. Introduction to “This is Watson”. IBM Journal of Research and Development, Volume 56 (3, 4), 2012, pages 1-15.
[4] David Levy and Monty Newborn. How Computers Play Chess. Computer Science Press, 1991.
[5] Alan M. Turing. Computing machinery and intelligence. Mind, Volume 59, October 1950, pages 433-460.