This story is a summary of a lecture given by Yejin Choi, an associate professor at the Paul G. Allen School of Computer Science & Engineering at the University of Washington.
Are we done solving AI?
- (2015) super-human performance on object recognition
- (2016) Google neural machine translation
- (2017) super-human performance on speech recognition
- (2018) human-level performance on reading comprehension on SQuAD (Stanford QA dataset)
Of course not. What we have with deep learning today is models that solve a “dataset” without solving the underlying “task”. To bridge the gap between dataset-oriented machine learning and the way humans understand the world, we have to think about two questions.
- Where are we with respect to system-1 vs system-2 reasoning?
- Do current datasets encourage narrow learning?
Where are we with respect to system-1 vs system-2 reasoning?
The terms system-1 and system-2 come from the book <Thinking, Fast and Slow> by Daniel Kahneman, a Nobel laureate. System-1 refers to ‘intuition’ or ‘instinct’: unconscious, fast, associative, and automatic. System-2, on the other hand, refers to ‘rational thinking’: effortful, slow, logical, lazy, and indecisive. So, where are we with deep learning from this perspective?
Much earlier, Kahneman also suggested “three cognitive systems”.
An interesting point is that both intuition (system 1) and reasoning (system 2) in the content layer require conceptual representations; temporal reasoning, for example, requires the concept of time.
According to Kahneman, perception corresponds to tasks such as object recognition and image segmentation, which is what computer vision and speech recognition have become good at so far.
Shown the classic picture of two monsters in a tunnel, current computer vision says the two faces are identical, while missing that the chaser has hostile intentions and the one being chased is afraid. That is far below the level at which humans make such inferences without effort (intuition, system-1).
None of these inferences is absolutely true. The inferences are stochastic in nature, and everything is defeasible with additional context. Many intuitive inferences are commonsense inferences, a great deal of which are best described in natural language: the full scope of language, not just words or graphs of words.
- VisualCOMET: Reasoning about the Dynamic Context of a Still Image
- 1.4 million inferences over 60K images: the first large-scale repository of visual commonsense graphs.
Given an image as input, using vision in addition to language gives the model a better understanding of the context of the situation along three dimensions: what happened before, why (because), and what happens after the moment in the image. In the original VQA (Visual Question Answering) benchmark, the language-only baseline did suspiciously well precisely because it could ignore the image.
For reasoning, especially intuitive (commonsense) inference, we need to embrace all of language as the symbolic vocabulary, not just words or graphs of words. Reasoning also needs to be pursued as a generative task rather than a discriminative one, because the space of possible inferences in language is infinite.
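To make the distinction concrete, here is a minimal, illustrative sketch (not from the lecture) that contrasts the two setups with an off-the-shelf GPT-2 model from the Hugging Face transformers library. The context and answer choices are invented for illustration: in the discriminative setup the model only has to score a fixed, finite set of choices, while in the generative setup it must produce the inference itself from the open-ended space of language.

```python
# Sketch: discriminative vs. generative commonsense inference.
# The context and choices below are illustrative, not drawn from any benchmark.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "The man dropped the glass on the floor, so"

# Discriminative: pick the highest-scoring option from a fixed, finite set.
choices = ["it shattered.", "it started to rain.", "he won the lottery."]

def sequence_score(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # Negative mean cross-entropy, scaled by length: a rough total log-probability.
    return -out.loss.item() * ids.size(1)

best = max(choices, key=lambda c: sequence_score(context + " " + c))
print("discriminative pick:", best)

# Generative: produce the inference itself from the infinite space of language.
ids = tokenizer(context, return_tensors="pt").input_ids
gen = model.generate(ids, max_new_tokens=12, do_sample=False,
                     pad_token_id=tokenizer.eos_token_id)
print("generative completion:", tokenizer.decode(gen[0], skip_special_tokens=True))
```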
Do current datasets encourage narrow learning?
Datasets have been the fuel for the success of deep learning. However, the current dataset and evaluation paradigm might be encouraging “dataset solvers” instead of true “task solvers”, and this leads to serious over-estimation of true AI capabilities.
Imagine a human trying to learn something from nothing but a pile of exam questions, with no declarative description of the knowledge that is supposed to be learned. Could they arrive at the correct concept by generalizing over question-answer pairs in a robotic manner? Humans cannot learn this way, and we should not expect machines to either. The learning paradigm should be designed differently: we should start providing declarative knowledge directly to AI, instead of only feeding it more exam problems to solve. What humans cannot do, machines cannot do.
A rainbow of commonsense challenges:
- Physical IQA (AAAI 2020): Test knowledge of affordances and physical attributes
- Social IQA (EMNLP 2019)
- Visual Commonsense Reasoning (CVPR 2019)
- Abductive Commonsense Reasoning (ICLR 2020)
- HellaSwag (ACL 2019): Can a Machine Really Finish Your Sentence?
- WinoGrande (AAAI 2020): Adversarial Winograd Schema Challenge at Scale
- Cosmos QA (EMNLP 2019)
Humans can no longer look at a dataset and identify its biases with our own cognitive power, because we simply cannot see what machines can see. Therefore, datasets must evolve together with the evolving SOTA (state-of-the-art) models.
Commonsense inference still proves difficult even for SOTA models. On HellaSwag, whose questions are trivial for humans (>95% accuracy), SOTA models struggle (<48%). This difficulty is achieved via AF (Adversarial Filtering), a data collection paradigm in which a series of discriminators iteratively selects an adversarial set of machine-generated wrong answers. Benchmarks must therefore evolve: through adversarial evaluation, we can check whether a model has only solved a “dataset” without solving the underlying task.
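The actual HellaSwag pipeline uses strong neural generators and discriminators; the sketch below only illustrates the shape of the AF loop, with hypothetical `generate_candidates` and `Discriminator` components standing in for them.

```python
# Illustrative sketch of Adversarial Filtering (AF), not the authors' code.
# `generate_candidates` and `Discriminator` are hypothetical components.
import random

def adversarial_filtering(contexts, gold_endings, generate_candidates,
                          Discriminator, n_distractors=3, n_rounds=10):
    # Start with arbitrary machine-generated wrong endings for each context.
    distractors = [random.sample(generate_candidates(ctx), n_distractors)
                   for ctx in contexts]
    for _ in range(n_rounds):
        # Train a fresh discriminator on one half of the data...
        idx = list(range(len(contexts)))
        random.shuffle(idx)
        half = len(idx) // 2
        train_idx, held_idx = idx[:half], idx[half:]
        disc = Discriminator()
        disc.fit([(contexts[i], gold_endings[i], distractors[i]) for i in train_idx])
        # ...then, on the held-out half, replace every wrong ending the
        # discriminator finds easy with the candidate it finds hardest to reject.
        for i in held_idx:
            for j, ending in enumerate(distractors[i]):
                if disc.easily_rejects(contexts[i], ending):
                    candidates = generate_candidates(contexts[i])
                    distractors[i][j] = max(
                        candidates, key=lambda c: disc.score(contexts[i], c))
    return distractors
```

The result is a set of wrong answers that fool machines while remaining obviously wrong to humans, which is exactly the gap the benchmark is designed to expose.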
WinoGrande: An Adversarial Winograd Schema Challenge at Scale (Outstanding Paper Award at AAAI’20) can serve as a new benchmark for commonsense reasoning. Recent advances in neural language models have already reached around 90% accuracy on variants of WSC (the Winograd Schema Challenge). This raises an important question:
Whether these models have truly acquired robust commonsense capabilities, or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense.
To investigate, WinoGrande was constructed as a large-scale dataset inspired by the original WSC design but adjusted to improve both its scale and hardness. The key steps of the dataset construction are (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AFLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations.
The best SOTA methods on WinoGrande achieve 59.4–79.1%, depending on the amount of training data allowed, which is 15–35% below the human performance of 94.0%. Furthermore, using WinoGrande as a resource yields new SOTA results on five related benchmarks: WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%). These results have dual implications:
- They demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning.
- They are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks.
Large neural models have demonstrated human-level performance on language and vision benchmarks, while their performance degrades considerably on adversarial or out-of-distribution samples. This raises the question of whether these models have learned to solve a dataset rather than the underlying task by overfitting to spurious dataset biases. Algorithmic bias reduction in existing and future benchmarks is really important to mitigate such overestimation.
AFLite adversarially filters such dataset biases as a means to mitigate the prevalent overestimation of machine performance. In the follow-up research, Adversarial Filters of Dataset Biases (ICML 2020), a theoretical understanding of AFLite is proposed as a generalized framework for optimum bias reduction. AFLite is broadly applicable to the reduction of measurable dataset biases, and models trained on the filtered datasets generalize better to out-of-distribution tasks.
Filtering results in a large drop in model performance (e.g., from 92% to 62% on SNLI, the Stanford Natural Language Inference benchmark), while human performance remains high. This shows that filtered datasets can pose new research challenges for robust generalization by serving as upgraded benchmarks.
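As a rough illustration of the idea (not the authors' implementation), the sketch below repeatedly trains simple linear classifiers on precomputed instance embeddings and discards the instances they find easiest to predict; the hyperparameter names and values are placeholders.

```python
# Illustrative AFLite-style bias reduction over precomputed embeddings.
# X: (n, d) float array of instance embeddings, y: (n,) integer labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite(X, y, n_ensembles=64, train_frac=0.8,
           cutoff=0.75, slice_size=500, min_size=5000):
    keep = np.arange(len(y))                       # indices still in the dataset
    while len(keep) > min_size:
        correct = np.zeros(len(keep))
        counts = np.zeros(len(keep))
        for _ in range(n_ensembles):
            # Random train/eval split over the currently kept instances.
            perm = np.random.permutation(len(keep))
            n_train = int(train_frac * len(keep))
            tr, ev = perm[:n_train], perm[n_train:]
            clf = LogisticRegression(max_iter=1000)
            clf.fit(X[keep[tr]], y[keep[tr]])
            preds = clf.predict(X[keep[ev]])
            correct[ev] += (preds == y[keep[ev]])
            counts[ev] += 1
        # Predictability: how often a simple linear model classifies an
        # instance correctly without having seen it during training.
        score = np.divide(correct, counts,
                          out=np.zeros_like(correct), where=counts > 0)
        easy = np.argsort(-score)[:slice_size]
        easy = easy[score[easy] > cutoff]
        if len(easy) == 0:                         # nothing predictable remains
            break
        keep = np.delete(keep, easy)               # drop the most biased instances
    return keep
```

The filtered benchmark is whatever remains: instances that cannot be solved by shallow correlations alone.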
Going back to the underlying question about the current paradigm of QA datasets: the deep learning community has invested heavily in scaling up models and collecting ever-larger datasets, but far less effort has gone into the quality and implications of what goes into those datasets.
What I cannot create, I do not understand. — Richard Feynman
We need to change the practice of teaching AI. Can you imagine learning only by solving a lot of exam problems? It is hard even for humans. We need to evaluate machines generatively and teach machines concepts more directly. In real-life scenarios, we are not handed a convenient set of clear choices to pick from.
Common Sense
Common sense is the basic level of practical knowledge and reasoning, concerning everyday situations and events, that is commonly shared among most people. It is essential for AI to understand human needs and actions better. In <The Book of Why>, Judea Pearl maintains that we should teach intelligent machines “cause and effect” well enough to build truly intelligent machines. A lot of existing knowledge is knowledge of “what”, but knowledge of “why” and “how” (inferential: causes and effects) is lacking.
- ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning (AAAI 2019)
- COMET generalizes well to out-of-domain examples by combining self-supervised learning of observed knowledge with (semi-)supervised learning of declarative knowledge. It can reason about compositional situations (a minimal training sketch is shown after this list). Demo page of Mosaic knowledge graphs
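As a rough idea of how a COMET-style model is trained (a sketch under stated assumptions, not the actual ATOMIC/COMET code), one can serialize (event, relation, inference) triples as text and fine-tune a pretrained language model on them; here GPT-2 stands in for the real pretrained model and the triples are invented examples.

```python
# Sketch of COMET-style training: fine-tune a pretrained LM on serialized
# (event, relation, inference) triples. The triples here are invented examples.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

triples = [
    ("PersonX chases PersonY through a tunnel", "xIntent", "to catch PersonY"),
    ("PersonX chases PersonY through a tunnel", "oReact", "afraid"),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for head, relation, tail in triples:
    text = f"{head} {relation} {tail}{tokenizer.eos_token}"
    batch = tokenizer(text, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])  # standard LM loss on the triple
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# At inference time, the model completes the tail for an unseen event/relation pair.
model.eval()
prompt = "PersonX walks into a dark tunnel xReact"
ids = tokenizer(prompt, return_tensors="pt").input_ids
gen = model.generate(ids, max_new_tokens=8, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(gen[0], skip_special_tokens=True))
```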
Closing Remarks
When Yejin Choi tried to revisit common sense in AI, she was told not to even speak the word “commonsense” because of the past failures in the 70s and 80s, which stemmed from weak computing power, scarce data, no crowdsourcing, weaker computational models, and less-than-ideal conceptualization.
But she still believes something can be achieved in this field, guided by the metaphor of moon-shot thinking.
We cannot reach the moon by building the tallest building in the world. To get to the moon, we have to play by entirely different rules. — Yejin Choi
Other recommended books
- <The Enigma of Reason>: Reasons are used primarily not to guide oneself, but to justify oneself in the eyes of others and to convince others. Reasoning serves the purpose of communication. Human reason is a mechanism of intuitive inference in which logic plays at best a marginal role.
- <How we reason>
- <Mental models>