Contents

1 Introduction . 1
  1.1 Motivation . 1
  1.2 Visual Question Answering in AI Tasks . 4
  1.3 Categorisation of VQA . 6
    1.3.1 Classified by Data Settings . 6
    1.3.2 Classified by Task Settings . 7
    1.3.3 Others . 8
  1.4 Book Overview . 8
  References . 9

Part I Preliminaries

2 Deep Learning Basics . 15
  2.1 Neural Networks . 15
  2.2 Convolutional Neural Networks . 17
  2.3 Recurrent Neural Networks and Variants . 18
  2.4 Encoder-Decoder Structure . 20
  2.5 Attention Mechanism . 21
  2.6 Memory Networks . 21
  2.7 Transformer Networks and BERT . 23
  2.8 Graph Neural Networks Basics . 24
  References . 26

3 Question Answering (QA) Basics . 29
  3.1 Rule-based Methods . 29
  3.2 Information Retrieval-based Methods . 30
  3.3 Neural Semantic Parsing for QA . 31
  3.4 Knowledge Base for QA . 31
  References . 32

Part II Image-based VQA

4 The Classical Visual Question Answering . 37
  4.1 Introduction . 37
  4.2 Datasets . 38
  4.3 Generation vs. Classification: Two Answering Policies . 39
  4.4 Joint Embedding Methods . 40
    4.4.1 Sequence-to-Sequence Encoder-Decoder Models . 40
    4.4.2 Bilinear Encoding for VQA . 42
  4.5 Attention Mechanisms . 44
    4.5.1 Stacked Attention Networks . 44
    4.5.2 Hierarchical Question-Image Co-attention . 47
    4.5.3 Bottom-Up and Top-Down Attention . 48
  4.6 Memory Networks for VQA . 50
    4.6.1 Improved Dynamic Memory Networks . 50
    4.6.2 Memory-Augmented Networks . 52
  4.7 Compositional Reasoning for VQA . 54
    4.7.1 Neural Module Networks . 54
    4.7.2 Dynamic Neural Module Networks . 56
  4.8 Graph Neural Networks for VQA . 57
    4.8.1 Graph Convolutional Networks . 58
    4.8.2 Graph Attention Networks . 60
    4.8.3 Graph Convolutional Networks for VQA . 62
    4.8.4 Graph Attention Networks for VQA . 63
  References . 65

5 Knowledge-based VQA . 69
  5.1 Introduction . 69
  5.2 Datasets . 70
  5.3 Knowledge Bases Introduction . 72
    5.3.1 DBpedia . 72
    5.3.2 ConceptNet . 73
  5.4 Knowledge Embedding Methods . 73
    5.4.1 Word-to-Vector Representation . 73
    5.4.2 BERT-based Representation . 75
  5.5 Question-to-Query Translation . 76
    5.5.1 Query-mapping-based Methods . 77
    5.5.2 Learning-based Methods . 78
  5.6 How to Query Knowledge Bases . 79
    5.6.1 RDF Query . 79
    5.6.2 Memory Network Query . 81
  References . 82

6 Vision-and-Language Pre-training for VQA . 87
  6.1 Introduction . 87
  6.2 General Pre-training Models . 88
    6.2.1 Embeddings from Language Models . 88
    6.2.2 Generative Pre-Training Model . 89
    6.2.3 Bidirectional Encoder Representations from Transformers . 89
  6.3 Popular Vision-and-Language Pre-training Methods . 93
    6.3.1 Single-Stream Methods . 94
    6.3.2 Two-Stream Methods . 96
  6.4 Fine-tuning on VQA and Other Downstream Tasks . 98
  References . 101

Part III Video-based VQA

7 Video Representation Learning . 105
  7.1 Hand-crafted Local Video Descriptors . 105
  7.2 Data-driven Deep Learning Features for Video Representation . 108
  7.3 Self-supervised Learning for Video Representation . 109
  References . 110

8 Video Question Answering . 113
  8.1 Introduction . 113
  8.2 Datasets . 114
    8.2.1 Multi-step Reasoning Datasets . 114
    8.2.2 Single-step Reasoning Datasets . 118
  8.3 Traditional Video Spatio-Temporal Reasoning Using Encoder-Decoder Framework . 119
  References . 126

9 Advanced Models for Video Question Answering . 129
  9.1 Attention on Spatio-Temporal Features