MASTER THESIS IN COMPUTER SCIENCE

Author:

Afroze Ibrahim Baqapuri
afroze.baqapuri@epfl.ch

Supervisors:

Dr. François Fleuret
francois.fleuret@idiap.ch

Dr. Eric Cosatto
cosatto@nec-labs.com

January 15, 2016

# Contents

1 Introduction
1.1 Organization of report
2 Literature Review
2.1 Deep learning in images and computer vision
2.2 Deep learning in text and NLP
2.3 Deep learning in image and text multimodal models
2.3.1 Introduction
2.3.2 Evaluation Metrics
2.3.3 Image and text mapping
2.3.4 Sentence generation from images
3 Resources
3.1 Flickr8K
3.2 Flickr30K
3.3 OverFeat
3.4 Word2vec
4 Research Methodology
5 Proposed Model
5.1 Abstract Architecture
5.2 Details Visual and Textual models
5.3 Unique training methodology
5.4 Comparison to other models
6 Experiments and Results
6.1 Preprocessing
6.2 Evaluation Metrics
6.3 Training Details
6.4 Experimental Results
6.4.1 I2T and T2I training approach
6.4.2 Results of BoW model
6.4.3 Word2Vec results
6.4.4 Results of n-gram models
6.5 Are deep textual models necessary
6.6 Comparison with existing systems
7 Conclusion
7.1 Future work
Appendices
A Acronyms
B Extra Resources
B.1 MNIST
B.2 Caltech256
B.3 ImageNet
B.4 Pascal VOC 2008
C Software
D Further results of I2T and T2I training methodologies
Bibliography

# Abstract

The ability to describe images with natural language sentences is the hallmark of image and language understanding. Such a system has wide-ranging applications, such as annotating images and using natural sentences to search for images. In this project we focus on the task of bidirectional image retrieval: such a system is capable of retrieving an image based on a sentence (image search) and retrieving a sentence based on an image query (image annotation). We present a system based on a global ranking objective function which uses a combination of convolutional neural networks (CNN) and multi-layer perceptrons (MLP). It takes an image-sentence pair and processes the two inputs in different channels, finally embedding them into a common multimodal vector space. These embeddings encode abstract semantic information about the two inputs and can be compared using traditional information retrieval approaches. For each such pair, the model returns a score which is interpreted as a similarity metric. If this score is high, the image and sentence are likely to convey similar meaning, and if the score is low then they are likely not to.

The visual input is modeled via a deep convolutional neural network. For the textual module, we explore three models. The first is bag of words with an MLP. The second uses n-grams (bigrams, trigrams, and a combination of trigrams and skip-grams) with an MLP. The third is a more specialized deep network designed for modeling variable-length sequences (SSE). We report comparable performance to recent work in the field, even though our overall model is simpler. We also show that the training-time choice of how we generate our negative samples has a significant impact on performance, and can be used to specialize the bi-directional system in one particular task.

# Acknowledgments

I would like to thank NEC and EPFL for giving me the opportunity to work on such an interesting project. Special thanks go to Bing Bai, Iain Melvin, Eric Cosatto, Igor Durdanovic, and Martin Renqiang Min for the many fruitful discussions.

# Chapter 1

Introduction

The ability to describe images with natural language sentences is the hallmark for image and language understanding. It has significant applications such as searching images with natural sentences, and automatic captioning of images. Currently, commercially available image searching systems search the text adjacent to an image, instead of searching the content of the image itself. This is a severe bottleneck on the performance of these systems. Advances in this field will have far-reaching effects since images are ubiquitous on the Internet and we are always interacting with them.

Humans can describe an image with relative ease. However, for computers this is not a trivial task. The difficulty arises mainly because the two input modalities have very different statistical properties, and cannot be directly compared. For example, images have spatial dependency, while sentences have temporal (sequential) dependency. Even for humans, the task is highly subjective, and various forms of description exist. Different descriptions could focus on different aspects or different objects in the image; they could have different levels of detail in describing the content; and they could be abstract, like describing the mood, or concrete, like describing objects. All of these could be simultaneously correct, yet very different. Modeling this variance is very important for machine learning systems, and it is achieved in high-quality data sets by having multiple humans describe the same image.

Recent work in this field has focused on two approaches: multimodal retrieval and sentence generation. Multimodal retrieval deals with the tasks of retrieving sentences from image queries (image captioning) and retrieving images from sentence queries (image search). On the other hand, image-based sentence generation systems focus on creating natural and fluent sentences directly from the image, which describe the contents of the image.

In our work, we focus on the former approach: multimodal retrieval. We design our system taking motivation from traditional retrieval systems, in particular the supervised semantic indexing (SSI) [Bai et al., 2009] system at NEC used for document retrieval. We expand on the system by using separate, specialized textual and visual models. We use these models to extract semantic information from the input sentence and image, and then transform it into a common vector space, where we can easily compare the two modalities using traditional information retrieval approaches.

The goal of our project is to explore and understand the significance of deep learning techniques in this task. We use deep learning architectures and concepts within the individual text and image modules. For the visual component we experiment with deep convolutional neural networks, which have become synonymous with image recognition [Krizhevsky et al., 2012, LeCun et al., 1998]. For the textual component we experiment with different features and network architectures. We try a bag-of-words approach, n-gram models, tf-idf features, and a convolutional neural network for sequence embedding. In contrast to previous work, our model is simpler since we do not use more complicated models like recurrent neural networks [Karpathy and Fei-Fei, 2014] or recursive neural networks [Socher et al., 2014]. However, even with the simpler approach, our model achieves comparable performance to some recent work in the field.

1.1 Organization of report

The rest of the thesis report is organized as follows. In Chapter 2 we present a thorough overview of the available scientific literature in the field. We start with deep learning from a historical perspective and discuss its main ideas and significance, first in the image community and then in text processing. Finally we discuss recent work in the combined field of images and text as a multimodal input.

Following this, in Chapter 3 we briefly describe some of the important resources used in our project. This includes the data sets we trained and tested our model on and some publicly available tools we used in our larger model.

In Chapter 4 we provide an overview of the research methodology and the timeline of the experiments we conducted. This chapter is optional, so the reader can skip it without losing information necessary for understanding our final models and experiments (reference to the chapter will be made whenever we make a point which needs support from there). The material covers our experience with the problem and how it influenced the way we approached the task (including the setbacks we faced).

Chapter 5 presents an overview and description of the model we use in our retrieval system, including flow charts and a mathematical description. It also explains some interesting features of the model, and how it compares with recent work.

Moving forward, Chapter 6 deals with the bulk of the experimentation done on our model, and the results we get. We start off by discussing the preprocessing, then give details on the evaluation metrics used and meta parameters selected. Finally we conduct several experiments and report their results.

In Chapter 7 we sum up our results by comparing them with other recent models. We then discuss the significance of our results and possible future work in this direction.

Finally, in the Appendix, we give a brief overview of the software environment we used to train and test our models, and some additional resources we used which did not become part of our final model.

# Chapter 2

Literature Review

In this chapter we provide a summary of important research conducted in the field of deep learning. The literature review comprises three parts:

    1. Deep learning in images and computer vision.
    2. Deep learning in text and NLP.
    3. Deep learning in image and text multimodal modeling.

The last part is most relevant to our task, but we feel a background discussion is important to understand and appreciate the workings of it.

2.1 Deep learning in images and computer vision

An important aspect of deep learning is to use end-to-end automatically trainable systems which do not rely on human-designed heuristics. Traditional machine learning systems are divided into two modules. First, the feature extractor module transforms the input data into low-dimensional vectors which can be easily matched and compared, and which are relatively invariant to distortions. These are then fed into the classifier module, which is general-purpose and trainable. A major problem with this approach is that performance is largely determined by human input, and the feature extractor is task-specific, so it needs to be redesigned for every new task.

Countering this traditional approach, [LeCun et al., 1998] showed that hand-crafted feature extraction can be advantageously replaced with automatic learning algorithms which operate directly on raw input data. The individual modules can thus be replaced by a unified system which optimizes a global performance criterion.

Computer vision, and especially object recognition, has a long history of using deep multi-layered neural networks. [LeCun et al., 1995] did a comparison of hand-written digit recognition on the MNIST data set, in which they compare the performance of a multi-layered (deep) convolutional neural network (CNN) with traditional machine learning approaches like linear classifier (logistic regression), principal component analysis, and nearest neighbour classifier. The comparison shows that deep CNNs outperform the traditional algorithms on object recognition tasks, and reach state-of-the-art.

Convolutional neural networks are among the earliest deep learning models, and they are biologically inspired variants of the ANN. [Hubel and Wiesel, 1968] found the existence of receptive fields (small sub-regions in the visual field) in the visual cortex of the brain, and CNNs (among other models) try to emulate this behaviour. In other words, they are specialized network architectures specifically designed to recognize two-dimensional objects, while being invariant to the exact position of the pattern and to distortions. In the CNN each unit takes its input from a local receptive field on the layer below, forcing it to extract local features. Furthermore, units within a plane or feature map are constrained to share a single set of weights, which makes the operation performed by a feature map shift invariant [Le Cun et al., 1990]. The weight-sharing technique also reduces the number of free parameters, thereby reducing the memory requirements and training complexity of the networks.

Complete CNNs are formed by stacking together multiple convolutional layers (each with feature map planes and local receptive fields). Sub-sampling layers are also added, improving invariance to shift and distortions. The entire network is trainable with gradient descent using the back-propagation procedure. [LeCun et al., 1998] popularized LeNet-5, a CNN architecture which achieved state-of-the-art performance on MNIST handwriting recognition at the time it was published.
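The two ideas above, a shared kernel applied to every local receptive field and a sub-sampling layer, can be sketched in a few lines of plain Python. This is only an illustrative toy under our own assumptions (a hand-picked 3x3 edge kernel, a 6x6 binary image, average sub-sampling); it is not the LeNet-5 implementation. It shows that shifting the input by one pixel shifts the feature map by one pixel, which is the property that weight sharing provides.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def subsample2x2(feature_map):
    """Average non-overlapping 2x2 blocks (a simple sub-sampling layer)."""
    return [
        [
            (feature_map[r][c] + feature_map[r][c + 1]
             + feature_map[r + 1][c] + feature_map[r + 1][c + 1]) / 4.0
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]


# Toy 6x6 image with a vertical edge, and the same image shifted right by one pixel.
image = [[1 if c >= 3 else 0 for c in range(6)] for _ in range(6)]
shifted = [[1 if c >= 4 else 0 for c in range(6)] for _ in range(6)]

# One hand-picked 3x3 vertical-edge kernel: 9 weights shared by the whole feature map.
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

fmap = conv2d_valid(image, kernel)            # 4x4 feature map
fmap_shifted = conv2d_valid(shifted, kernel)  # same response, moved one column right
pooled = subsample2x2(fmap)                   # 2x2 map after sub-sampling
```

Here the feature map has 16 units but only 9 free parameters, whereas a fully connected layer from the 36 input pixels to 16 units would need 576 weights.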

Even during the 1990s it was evident that deeper and larger networks tend to perform better. However, this potential was overshadowed by the prohibitive time and memory required to train larger networks, which was not feasible at that time. Moreover, larger networks, being more powerful at modeling, also overfit the training data easily, and there were no efficient techniques to combat this. Even though near-human performance was reached for simple tasks like handwritten digit recognition, this did not carry over to object recognition in realistic settings, which exhibit considerable variability.

In 2009 [Deng et al., 2009] released the ImageNet data set comprising over 15 million high-resolution images labeled into more than 22,000 categories. It has been well understood that training on complicated data with high variability using very few examples leads to severe overfitting in CNNs, so the release of this large-scale data set was a big step for the object recognition problem. By this time computer hardware had also progressed enough to train much larger and deeper networks in a reasonable amount of time. [Krizhevsky et al., 2012] used graphics processing units (GPUs) for a very fast and efficient implementation of their CNN, which has 650,000 neurons and 60 million parameters (in contrast, LeNet-5 had 60,000 free parameters). They entered their neural network in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, which was considerably better than the state of the art (the second-best entry had a 26.2% error rate).

Besides a larger data set and larger networks, they also empirically showed the importance of some useful techniques for avoiding overfitting. The most easily illustrated one is data augmentation, which increases the size of the data set. They take five different patches of the original image (along with their horizontal reflections), thus increasing the training data by a factor of 10, although the additional images are very close to the original one. They also alter the intensities of the colour channels for further data augmentation. These techniques are useful for adding more shift, inversion, illumination, and colour invariance to the model. Dropout [Hinton et al., 2012] is another technique for combating overfitting, which works by reducing complex co-adaptations of neurons. A single neuron is therefore forced to learn more robust features without relying on other neighbouring neurons. Pretraining the neural network in a greedy layer-by-layer fashion with an unsupervised objective function is another popular technique [Hinton et al., 2006, Schölkopf et al., ]. The intuition behind this idea is that unsupervised training will give a good initialization of the weights based on the actual statistical properties of the data the network will be used for (e.g. object images, human speech, etc.), instead of random initializations, which often get stuck in poor local minima. Following this, the network can be fine-tuned on a supervised task such as object recognition.
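The tenfold patch augmentation described above can be sketched as follows. The choice of the four corners plus the centre as the five patch positions is our assumption for illustration, and the function names are hypothetical; the image is a plain nested list rather than a real image tensor.

```python
def crop(image, top, left, size):
    """Return the size x size patch whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def hflip(image):
    """Horizontal reflection: reverse every row."""
    return [row[::-1] for row in image]


def five_crops_with_flips(image, size):
    """Four corner patches and the centre patch, plus their reflections (10 in total)."""
    h, w = len(image), len(image[0])
    offsets = [
        (0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
        ((h - size) // 2, (w - size) // 2),
    ]
    patches = [crop(image, top, left, size) for top, left in offsets]
    return patches + [hflip(p) for p in patches]


# Toy 5x5 "image" whose pixel values encode their own position.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
augmented = five_crops_with_flips(image, 3)
```

Each training image thus yields ten strongly overlapping samples, which is why the added images are close to the original.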

Mathematically speaking, the CNN transforms the image into a low-dimensional feature vector representation. In this way a good CNN model can also act as a good feature extractor for images, and the resulting features can be used in more complicated tasks. [Sermanet et al., 2014] demonstrate this concept for object localization. They train a CNN for classification on the ImageNet data set, and for localization they replace the final classification layer with a regression network. This regression network is simply an MLP with two hidden layers of 4,096 and 1,024 units, connected finally to an output layer of 4 units which predicts the coordinates of the bounding box. The final layer is class-specific, having 1,000 versions (one for each class), while the rest of the regression network shares weights. During localization training only the regression network's weights are updated, and the remaining larger network only acts as a feature extractor. They run the classifier and the regressor simultaneously since they share most of the network; in this way they get a bounding box for each class along with a confidence number based on the classification confidence.

2.2 Deep learning in text and NLP

A seminal paper in the domain of statistical learning applied to natural language processing (NLP) was written by [Bengio et al., 2003]. Classical machine learning approaches to NLP calculate n-gram conditional probabilities based simply on the co-occurrence frequencies of words in a document. These models construct tables of conditional probabilities for the next word given a fixed context (of the previous $n - 1$ words). A fundamental problem with this is the curse of dimensionality, meaning that the number of possible contexts grows exponentially with $n$, thus making the models quickly intractable. Another problem is that as the size of the context increases, the occurrence of each sequence gets extremely rare in a document, thus undermining the statistical relevance of their probability distributions.
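The count-based n-gram estimate described above can be sketched as below. The toy corpus and the function name `ngram_model` are our own illustrative choices, and no smoothing is applied, so unseen sequences receive probability zero.

```python
from collections import Counter


def ngram_model(tokens, n):
    """Estimate P(word | previous n-1 words) from raw co-occurrence counts."""
    context_counts = Counter()
    ngram_counts = Counter()
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        ngram_counts[(context, tokens[i + n - 1])] += 1
        context_counts[context] += 1

    def prob(word, context):
        context = tuple(context)
        if context_counts[context] == 0:
            return 0.0  # context never observed: no estimate available
        return ngram_counts[(context, word)] / context_counts[context]

    return prob


corpus = "the cat sat on the mat the cat ran".split()
bigram = ngram_model(corpus, 2)

p_cat = bigram("cat", ["the"])   # "the" occurs 3 times as a context, twice followed by "cat"
p_dog = bigram("dog", ["the"])   # never observed, so the estimate is zero

# Number of possible contexts for a vocabulary of size V grows as V ** (n - 1).
vocab_size = len(set(corpus))
contexts_for_trigram = vocab_size ** 2
```

Even this six-word vocabulary has 36 possible trigram contexts, most of which never occur in the corpus, which illustrates both problems mentioned above.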

The authors of the paper attempt to solve this problem by using a neural network to learn a distributed representation of words. This distributed representation, which is sometimes termed word embedding in modern literature, is a feature vector which conveys the semantic and syntactic meaning of the word. In other words, all the words in the vocabulary can be transformed into a vector representation of a fixed dimension (usually much smaller than the size of the vocabulary). Furthermore, the joint probability of the entire sequence (of arbitrary length) can be expressed as a function of these feature vectors. In this paper they propose to use feed-forward neural networks to compute the probability of the next word in the sequence given the previous $n$ words (in the form of these word vectors). The advantage of neural networks is that they can be trained to jointly learn the word feature vectors (first layer of the neural network) and the parameters of the probability function (free parameters of the network) using a common global objective function (maximizing log likelihood).

The advantage of using these word vectors is that they escape the curse of dimensionality. The words are now represented by fixed, real-valued, (comparatively) low-dimensional vectors, and words with similar meanings have similar vectors. An example given by the paper is that the sentence "A cat is walking in the bedroom" would have a similar representation to "The dog was running in a room", since the model would learn the similarity of the individual words in the sequence: (dog, cat), (the, a), (bedroom, room), etc. It is noteworthy that they report 24% and 8% improvements in terms of perplexity over the best n-gram results for the task of predicting the next word in a sequence, performed on two large-scale data sets. The idea of using neural networks for language modeling in fact dates back to [Miikkulainen and Dyer, 1991]; however, the authors of the paper under discussion were the first to propose a large-scale statistical model which learns distributed dense representations of words in a sequence and uses them to automatically estimate the joint probability function.

Building on this, [Collobert and Weston, 2008] propose a single unified convolutional neural network architecture that performs well for various challenging NLP tasks, such as part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. Traditional approaches analyze these tasks separately, using hand-crafted features specific to each task, which makes this approach intractable for complicated tasks. The model they propose, in contrast, has a deep architecture composed of many layers which can be trained in an end-to-end fashion. The first layer extracts features for each word (word embeddings, or distributed word representations). The second layer extracts features from the sentence, treating it as a sequence with structure. Variable-length inputs are handled by passing a convolutional layer over the word features and then performing max-pooling over each resulting dimension to give a fixed-length vector. The following layers are classical (fully connected) NN layers.
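How convolution followed by max-pooling turns a variable-length sentence into a fixed-length vector can be sketched as below. The embedding size, window size, and filter weights are arbitrary toy values chosen by us; this is not the architecture or the parameters of [Collobert and Weston, 2008], and the sentence is assumed to contain at least `window` words.

```python
def conv1d_over_words(word_vectors, filters, window):
    """Apply every filter to each window of consecutive word vectors."""
    outputs = []
    for start in range(len(word_vectors) - window + 1):
        # Concatenate the vectors in the window into one flat input.
        flat = [x for vec in word_vectors[start:start + window] for x in vec]
        outputs.append([sum(w * x for w, x in zip(f, flat)) for f in filters])
    return outputs


def max_over_time(feature_rows):
    """Keep the maximum of each filter's response over all window positions."""
    return [max(column) for column in zip(*feature_rows)]


# Toy setup: 2-dimensional word vectors, window of 2 words, 3 filters of 4 weights each.
filters = [[1, 0, 0, 0], [0, 1, 0, 1], [1, 1, 1, 1]]
short_sentence = [[1, 0], [0, 1], [1, 1]]                  # 3 words
long_sentence = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 1]]   # 5 words

short_vec = max_over_time(conv1d_over_words(short_sentence, filters, 2))
long_vec = max_over_time(conv1d_over_words(long_sentence, filters, 2))
```

Both sentences map to a vector with one entry per filter, regardless of their length, so the fully connected layers that follow can have a fixed input size.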

Since all of these tasks are related, they argue that it makes sense to share some features between them in order to improve the generalization of the network, and so they propose a multitask learning approach to jointly train the model on all these tasks. They design their network architecture to share the layers closer to the input (these layers encode word embeddings, which should intuitively be common to all language-related tasks). As we go deeper into the network, the features extracted become more complex and abstract, and so the last layers of the network are task-specific.

They also pretrain their network with unlabeled training data, since it is available in much larger quantities than labeled data. They train a language model with an unsupervised ranking criterion, which predicts whether the middle word in a sequence is related to its context or not. For positive examples they took fixed-length phrases from Wikipedia, and they generated the negative examples by substituting the middle word in a valid phrase with a random word. They demonstrated that this approach trains a powerful language model by showing that word vectors which are close to one another in the embedding space are also close in semantic meaning. For example, "France", "Spain" and "Italy" have vector representations close to one another, as do "scratched", "smashed" and "ripped".

[Mikolov et al., 2013a] propose yet another unsupervised approach to learning vector representations of words. They propose a skip-gram model, which predicts the surrounding words in the window given the centre word. This is the direct opposite of earlier approaches, which predict the centre word given the context window. Given a sequence of training words, the objective is to maximize the average log probability of the surrounding words, conditional on the centre word. One interesting feature of their model is that it preserves linear regularities among the learned representations. This makes it possible to perform interesting analogical reasoning using simple vector arithmetic. For example, the result of the vector calculation $\text{vec}(\text{"king"}) - \text{vec}(\text{"man"}) + \text{vec}(\text{"woman"})$ is closest to $\text{vec}(\text{"queen"})$. Also, $\text{vec}(\text{"russia"}) + \text{vec}(\text{"river"})$ is close to $\text{vec}(\text{"volga\_river"})$.
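The analogy test above amounts to a nearest-neighbour search around the result of the vector arithmetic. The sketch below uses hand-made two-dimensional vectors that we invented purely for illustration (roughly a "royalty" axis and a "gender" axis); they are not learned word2vec embeddings, and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hand-made toy vectors (NOT learned embeddings): [royalty, gender].
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "prince": [0.8, 0.7],
    "apple": [-0.7, 0.1],
}

answer = analogy(toy_vectors, "king", "man", "woman")
```

With real embeddings the same search is run over the whole vocabulary, and the query words are excluded in the same way.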

Furthermore, [Mikolov et al., 2013b] extend the previous model to include vector representations for phrases as well. They based their work on the insight that the meaning of idiomatic phrases like "Boston Globe" and "Air Canada" cannot be obtained simply by combining the individual words within the phrases. Therefore they treated phrases as individual tokens (just like words), but limited their vocabulary to only those phrases whose words appear frequently together and infrequently in other contexts. They train vector embeddings of dimensionality 300 for these words and phrases, and release them on the internet for public use (footnote 1).

2.3 Deep learning in image and text multimodal models

There has been a lot of progress in multi-label classification problem of associating images with individual words or tags. However, the more challenging problem of associating images with complete natural sentences has only recently started to gain attention. The research in this area has focused primarily on two tasks, namely:

    1. Mapping images and sentences into a combined space.
    2. Generating descriptions of images in terms of complete and natural sentences.

The first poses it as an information retrieval problem, while the latter treats it as a natural language generation problem.

2.3.1 Introduction

[Hodosh et al., 2013] have written one of the seminal papers in this field, providing an interesting discussion on the problem statement, an in-depth comparison of the available data sets (including what kind of data is required for good modeling), an analysis of modeling techniques employed in the early stages of this field, and a detailed discussion of the various evaluation metrics used in different previous related works.

1 https://code.google.com/p/word2vec/

[Hodosh et al., 2013] go into the philosophy of image description by arguing that there are three different kinds of image descriptions:

    1. Conceptual descriptions identify what is being depicted in the image. They are concerned with the concrete descriptions of the depicted scenes and entities, their attributes and relations, as well as events they participate in.
    2. Non-visual descriptions provide additional background information that cannot be obtained from the image alone, e.g. about the situation, time or location in which the picture was taken.
    3. Perceptual descriptions capture low-level visual properties of images, e.g. whether it is a photograph or a sketch, or what colors or shades dominate.

They argue that out of these three, conceptual descriptions are the most relevant for image understanding tasks. They observe that user-generated captions uploaded with images to popular image-sharing websites (such as Flickr.com) do not serve as good training data, because people tend to provide information that cannot be easily obtained just from looking at the image itself. For example, the kind of description our models require is "Three people setting up a tent", while people tend to provide captions like "Our trip to the Olympic Peninsula". Hence they establish the need for data collected purposefully for this specific task.

Most good-quality data sets are collected via crowdsourcing (for example using Amazon Mechanical Turk), in which multiple descriptive captions are assigned to each image. Pascal1K [Farhadi et al., 2010], Flickr8K [Rashtchian et al., 2010], Flickr30K [Young et al., 2014], and MS-COCO [Lin et al., 2014] are examples of such good-quality data sets.

Some of the earliest work in this field used shallow learning techniques and fixed (as opposed to learnable) image and text features. [Makadia et al., 2010, Ordonez et al., 2011] use k-nearest neighbour (kNN) search for image annotation and image description respectively. On the other hand, [Hodosh et al., 2013] use kernel canonical correlation analysis (KCCA). KCCA [Bach and Jordan, 2003] is a technique which takes training data consisting of pairs of corresponding items drawn from two different modalities and finds maximally correlated linear projections of each set of items (by first mapping the original items into higher-order spaces) into a newly induced common space. Similarly, popular shallow image features include SIFT descriptors [Lowe, 2004] and a simple bag of words (BoW) kernel.

### 2.3.2 Evaluation Metrics

Since image description is a subjective task, the ideal setting for evaluating a system would be averaged human judgement. However, since human judgement is expensive and slow, there have been a number of metrics employed in evaluating these systems. These metrics can be divided into two categories:

  • metrics for text generation systems.
  • metrics for ranking systems.

BLEU [Papineni et al., 2002] and ROUGE [Lin, 2004] scores are popular metrics for automatic image description generation systems. Originally, they were standard metrics for machine translation and summarization respectively, but they have been used to evaluate multiple caption generation systems [Vinyals et al., 2014, Ordonez et al., 2011, Gupta et al., 2012]. Given a caption $s$ and an image $i$ with a set of reference captions $R_i$, the BLEU score of a proposed image-caption pair $(i, s)$ is based on the n-gram precision of $s$ against $R_i$, while ROUGE is based on the corresponding n-gram recall. If $c_s(w)$ is the number of times word $w$ occurs in $s$, they are defined as:

$$\text{BLEU}(i, s) = \frac{\sum_{w \in s} \min\left(c_s(w), \max_{r \in R_i} c_r(w)\right)}{\sum_{w \in s} c_s(w)}$$

$$\text{ROUGE}(i, s) = \frac{\sum_{r \in R_i} \sum_{w \in r} \min\left(c_s(w), c_r(w)\right)}{\sum_{r \in R_i} \sum_{w \in r} c_r(w)}$$

[Hodosh et al., 2013] compare BLEU and ROUGE scores against human judgements, and examine to what extent the two agree. Based on the results, they question the usefulness of these metrics for evaluating caption generation systems. [Reiter and Belz, 2009] also argue that they are more useful as metrics of fluency, but are poor measures of the content quality of language generation. However, until a better-suited metric is devised, these scores continue to be used for evaluating modern caption generation systems.
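The unigram forms of the BLEU and ROUGE definitions above can be sketched directly from the word counts. We assume here that the sums run over the distinct words of a caption, that captions are already tokenized into lists of words, and that neither the candidate nor the references are empty; the example captions are invented for illustration.

```python
from collections import Counter


def bleu_unigram(candidate, references):
    """Clipped unigram precision of the candidate against the reference set."""
    cand = Counter(candidate)
    refs = [Counter(r) for r in references]
    clipped = sum(min(count, max(r[w] for r in refs))
                  for w, count in cand.items())
    return clipped / sum(cand.values())


def rouge_unigram(candidate, references):
    """Unigram recall of the candidate, pooled over all references."""
    cand = Counter(candidate)
    refs = [Counter(r) for r in references]
    overlap = sum(min(cand[w], count) for r in refs for w, count in r.items())
    total = sum(sum(r.values()) for r in refs)
    return overlap / total


candidate = "a dog runs on the green grass".split()
references = [
    "a dog runs across the grass".split(),
    "the dog is running on grass".split(),
]

bleu = bleu_unigram(candidate, references)    # 6 of the 7 candidate words are matched
rouge = rouge_unigram(candidate, references)  # 9 of the 12 reference words are matched
```

The example shows why the two can disagree: the candidate is almost entirely supported by the references (high precision), yet it misses a quarter of the reference words (lower recall).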

Next, we will touch upon metrics which can be used to evaluate the quality of a ranked list in information retrieval tasks. Recall@K ($R@K$) is the percentage of test queries for which a model returns the correct result among the top $k$ results. It is especially useful in the context of search, where a user may be satisfied with the first $k$ results containing a single relevant item. Conversely, median rank is equal to the value of $k$ for which $R@K$ is equal to 50%. The $k$ in $R@K$ typically varies between $k = 1$ and $k = 10$. [Hodosh et al., 2013] consider $k = 1$ a very strict threshold and, after comparing it with human judgement, view it as a lower bound on actual performance.

In our systems, queries can have multiple (a variable number of) relevant answers, since each test image may be associated with multiple relevant captions, and each test caption may be deemed fit for multiple images besides the one it was originally written for. R-precision [] is the metric of choice in these conditions, since it is a single number which allows us to rank models according to their overall performance (with no threshold like $k$). The R-precision of a system for a query $q_i$ with $r_i$ known relevant results is defined as its precision at rank $r_i$. In simpler terms, it is the percentage of relevant items among the top $r_i$ responses returned by the system.
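The three ranking metrics above can be sketched as follows. We assume that, for $R@K$ and median rank, each query is summarized by the 1-based rank at which its correct item was returned; the example ranks and item identifiers are invented for illustration.

```python
import statistics


def recall_at_k(ranks, k):
    """Percentage of queries whose correct item appears within the top k."""
    return 100.0 * sum(rank <= k for rank in ranks) / len(ranks)


def median_rank(ranks):
    """Median position of the correct item over all queries."""
    return statistics.median(ranks)


def r_precision(ranked_items, relevant):
    """Percentage of relevant items among the top r results, r = number of relevant items."""
    r = len(relevant)
    hits = sum(item in relevant for item in ranked_items[:r])
    return 100.0 * hits / r


# Rank of the correct item for eight hypothetical test queries.
ranks = [1, 3, 2, 12, 7, 1, 25, 4]
r_at_1 = recall_at_k(ranks, 1)
r_at_5 = recall_at_k(ranks, 5)
r_at_10 = recall_at_k(ranks, 10)
med_r = median_rank(ranks)

# One query with three relevant captions; two of them are in the top three.
r_prec = r_precision(["c3", "c1", "c9", "c2", "c7"], {"c1", "c2", "c3"})
```

Higher values are better for $R@K$ and R-precision, while a lower median rank is better.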

2.3.3 Image and text mapping

[Frome et al., 2013] were among the first to use deep learning for ranking images with text. They call their model DeViSE, or deep visual-semantic embedding. Their original task was to improve the performance of their image classification system for a large number of object categories and for labels on which the visual system was not trained (zero-shot prediction). They propose to leverage the textual information associated with images to improve their object classification performance.

They begin by pre-training an efficient deep convolutional neural network (CNN) for visual object recognition, based on the architecture by [Krizhevsky et al., 2012]. In parallel, they pretrain a simple neural language model well-suited for learning semantically meaningful vector representations of individual words (word embeddings) using the skip-gram text modeling architecture [Mikolov et al., 2013a, Mikolov et al., 2013b]. Following this, they construct the DeViSE model by chopping off the top layers of the CNN and re-training it to predict the word embedding vector of the corresponding image label.

They use a hinge margin ranking loss criterion in the second phase of their training, and observe significant improvements over using an $L_2$ loss criterion. Although they never trained or tested their system on natural image descriptions, they influenced a lot of work in this field. [Karpathy et al., 2014] implemented their model for image and sentence mapping in order to compare its performance against their own system.
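
To make the difference between the two criteria concrete, the sketch below contrasts a generic hinge margin ranking loss with an $L_2$ loss on toy vectors. It is an illustration under stated assumptions, not the exact DeViSE formulation: the function names, the margin value of 0.1 and the two-dimensional embeddings are all chosen for this example.

```python
import numpy as np


def hinge_rank_loss(image_vec, label_vecs, true_idx, margin=0.1):
    """Sum of margin violations of wrong labels against the true label.

    A wrong label j is penalised when its dot product score with the image
    comes within `margin` of the true label's score (margin is an assumption).
    """
    scores = label_vecs @ image_vec
    violations = np.maximum(0.0, margin - scores[true_idx] + scores)
    violations[true_idx] = 0.0  # the true label is not compared with itself
    return float(violations.sum())


def l2_loss(image_vec, label_vecs, true_idx):
    """Squared Euclidean distance between the image and its true label."""
    diff = image_vec - label_vecs[true_idx]
    return float(diff @ diff)


# Toy label embeddings (hypothetical), one row per label.
labels = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])

good_image = np.array([0.9, 0.1])   # close to label 0
vague_image = np.array([0.5, 0.5])  # ambiguous between labels

loss_good = hinge_rank_loss(good_image, labels, true_idx=0)
loss_vague = hinge_rank_loss(vague_image, labels, true_idx=0)
l2_vague = l2_loss(vague_image, labels, true_idx=0)
```

The ranking loss is zero as soon as the true label outscores every other label by the margin, whereas the $L_2$ loss keeps pulling the image vector towards the label vector regardless of how the other labels rank.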

[Gong et al., 2014] use a CNN for modeling images in their image based sentence retrieval system. They first embed the image and the sentence into a common space and then use it to rank the pair. They compare 4,096 CNN activations (trained on ImageNet [Deng et al., 2009]) as image features against KCCA with fixed features of 4,000 dimensions. They report a 9.5% improvement in Recall@10 for the CNN over KCCA, even though the CNN activations remained fixed and the network was not fine-tuned on the image-text data set.

Unlike previous work, [Karpathy et al., 2014] work at a finer level and map fragments of images (objects) and fragments of sentences (dependency tree relations) into a common embedding space. Their model works for bi-directional retrieval: image given a text query, and text given an image query. They interpret an image as being made up of multiple entities, and therefore propose to break it down into more manageable fragments. Artificial neural networks (ANNs) are used to compute vector representations of these image and sentence fragments in a multimodal embedding space, and the dot product between a pair of these vectors (one image fragment and one sentence fragment) is interpreted as a compatibility score. The global image-sentence compatibility score is computed as a fixed function of the scores of their fragments.

Their image model consists of a Region convolutional neural network (RCNN) which detects objects in the images and returns their bounding boxes. For image fragments they use the top 19 locations detected by the RCNN together with the complete image, so each image is broken down into 20 fragments. Following this, another CNN is applied to each of these fragments to return a 4,096 dimensional embedding vector representing the image fragment. The architecture of this CNN closely resembles that of [Krizhevsky et al., 2012]. The sentence fragments, on the other hand, are the edges of the sentence's dependency tree. Therefore, each sentence fragment consists of two words which are joined at any stage of the dependency tree. Each word in the dictionary of 400,000 words is represented using 1-of-k encoding, and vector embeddings for the words are obtained through an unsupervised objective and kept fixed throughout the training. The sentence fragment score is calculated using these word embeddings as well as separate embeddings which represent the type of relation between the words.

They plot a matrix where the rows represent the image fragments and the columns represent the sentence fragments. Each element of the matrix (or box) shows the dot product score between those two multimodal fragments. They define two kinds of objective functions to train their models:

  • Fragment alignment objective: This objective explicitly learns the representation of sentence fragments in the visual domain. It encodes the intuition that if a sentence fragment is contained in an image, at least one of the boxes should give a high score with that sentence fragment, while all other boxes corresponding to images which do not contain this sentence fragment (in their descriptions) should have a low score with this fragment. It also favours that there should be at least one high scoring box in each column of the matrix (meaning every sentence fragment should be matched by at least one image fragment).
  • Global ranking objective: This objective tries to enforce that the image-sentence similarities are consistent with the ground truth. First the global similarity score is computed by averaging the pairwise fragment scores. After this the entire system is trained in an end-to-end fashion with the margin ranking loss criterion.
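
The fragment score matrix and the global ranking objective can be sketched as follows. This is a simplified illustration, assuming that fragments are already embedded in the common space, that the global score is a plain average of the pairwise scores as described above, and that the ranking loss is applied in both retrieval directions with a margin of 1. The function names and toy values are assumptions for this example.

```python
import numpy as np


def fragment_scores(image_frags, sentence_frags):
    """Matrix of dot products: rows are image fragments, columns are sentence fragments."""
    return image_frags @ sentence_frags.T


def global_score(image_frags, sentence_frags):
    """Image-sentence score as the average of all pairwise fragment scores."""
    return float(fragment_scores(image_frags, sentence_frags).mean())


def global_ranking_loss(S, margin=1.0):
    """Margin ranking loss over a batch of image-sentence pairs.

    S[k, l] is the global score of image k with sentence l; the matching
    pairs are assumed to lie on the diagonal.
    """
    diag = np.diag(S)
    # For each image (row), wrong sentences should score below the true one.
    cost_sent = np.maximum(0.0, S - diag[:, None] + margin)
    # For each sentence (column), wrong images should score below the true one.
    cost_img = np.maximum(0.0, S - diag[None, :] + margin)
    np.fill_diagonal(cost_sent, 0.0)
    np.fill_diagonal(cost_img, 0.0)
    return float(cost_sent.sum() + cost_img.sum())


# Toy fragments (hypothetical): 2 image fragments and 2 sentence fragments in 2-D.
img = np.array([[1.0, 0.0], [0.0, 1.0]])
sen = np.array([[1.0, 0.0], [1.0, 1.0]])
score = global_score(img, sen)

# Toy batch score matrices: one well separated, one with margin violations.
loss_separated = global_ranking_loss(np.array([[2.0, 0.0], [0.0, 2.0]]))
loss_violated = global_ranking_loss(np.array([[1.0, 1.0], [0.0, 1.0]]))
```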

The model performs best when using a combination of these two objective functions. They also report that dividing the image into fragments improves performance over treating the entire image as a single fragment, and that dependency tree sentence fragments perform consistently better than bag of words (BoW) and bigram features. Finally, they found that fine-tuning the CNN which computes the image fragment embeddings on image-text data further improves results.

[Socher et al., 2014] go one step further and use the entire sentence dependency tree - as opposed to only its edges as sentence fragments - to model the sentences using DT-RNN (dependency tree recursive neural networks). They argue this is important for accurately representing complicated sentences in the visual domain. They test their model against other recursive and recurrent neural networks, KCCA and a BoW baseline, concluding that DT-RNN outperforms all of these in the task of image and sentence mapping. DT-RNN also gives similar vector representations to multiple captions which describe the same image, which lends further support to their model. The DT-RNN differs from previous recursive neural network models [Socher et al., 2011], which are based on the constituency tree (CT-RNN). They report that the sentence vectors computed by DT-RNN are better at capturing the meaning of the sentence in terms of "visual representation". Moreover, DT-RNN vector embeddings are more robust to changes in syntactic structure or word order than those of CT-RNNs or recurrent neural networks. The final sentence embedding is a vector of length 50.

For the image side, they first train a deep CNN using an unsupervised objective (reconstructing the input while keeping the neurons sparse), followed by supervised learning on classifying 14 million images of ImageNet into 22,000 categories. The CNN used was particularly large, with 1.36 billion parameters, and they achieved 18.3% precision@1 on the full ImageNet data set. Following the CNN training, they chop off the last layer to get 4,096 dimensional vector embeddings for the images.

During the multimodal mapping, they transform the 4,096 dimensional image representation vector to the same size as the sentence vector. Following this - like other similar works - they take the dot product of the two to produce a compatibility score and back propagate the error using the margin ranking loss criterion. However, the 4,096 dimensional image vector and the 50 dimensional sentence vector are fixed and are not updated during this phase. They report improved results on the Pascal1K data set [Farhadi et al., 2010] compared to bag of words (BoW), CT-RNN, recurrent ANN and KCCA models.

[Karpathy and Fei-Fei, 2014] build upon their prior work on image-text matching using fragments of images and sentences [Karpathy et al., 2014] (discussed above), adding new features to increase the modeling capacity of their model. Their initial approach translated words directly into vector embeddings, and did not consider word ordering and context. To address this problem they use a bi-directional recurrent neural network (BRNN) to model the input text sequence and convert it into a complete sentence embedding. They report significant improvements using this approach.

Taking a somewhat different approach, [Srivastava and Salakhutdinov, 2012] use multimodal deep Boltzmann machines (DBM) for this task. The model learns a joint probability density over the space of multimodal inputs. Missing modalities can be filled in by sampling from the conditional distributions over them. For example, we can learn the joint probability $P(v_{img}, v_{txt} | \theta)$ , where $v_{img}$ is the vector representation of an image and $v_{txt}$ is the vector representation of a text sentence. Once this distribution is estimated, we can draw samples from the conditional probabilities to fill in the missing modalities: drawing samples from $P(v_{txt} | v_{img}, \theta)$ gives the missing text sentence, and vice versa.

A multimodal DBM is an extension of a DBM to model multimodal inputs. A DBM is formed by stacking together RBMs (in other words, it is a multilayer RBM), which are usually trained with a greedy layer-wise strategy. They construct two independent DBMs: an image-specific DBM and a text-specific DBM. Next, these two are connected together via another RBM layer to construct the multimodal DBM. In other words, the outputs of the two DBMs are connected to another layer of binary hidden units on top of them, called the "joint representation". The intuition behind this model is that each data modality has very different statistical properties, which makes it difficult for a single hidden layer model to directly find correlations across modalities. This difficulty is overcome by putting layers of hidden units between the inputs of both modalities (the separate DBMs).

They evaluate their model on information retrieval for multimodal and unimodal queries. In a multimodal query, the aim is to give a higher similarity score to image and text pairs belonging to the same instance than to false pairings. In a unimodal query, either text or image is provided, and the model predicts the missing modality out of a pool of possible options.

2.3.4 Sentence generation from images

Now we will briefly touch upon the related task of generating natural sentences from images. Although it is not the focus of our project, it is still interesting to see the approaches taken in that area, especially since there has been growing interest in it recently. [Kiros et al., 2014] present a model in which they use a convolutional neural network (CNN) for this task. The CNN has four input pathways, one for an image and the remaining three for words. The idea is that, given an image and the three previous words, the model learns to predict the next word in the sequence. In this way they learn a joint "multimodal language model".

Recurrent neural networks (RNN) are quite popular for text generation, and so many researchers use them in this task, albeit in different settings [Karpathy and Fei-Fei, 2014, Vinyals et al., 2014]. [Vinyals et al., 2014] are influenced by modern ANN based machine translation systems, and they employ an encoder-decoder type architecture for their model. Specifically, they use a CNN as an encoder and an RNN as a decoder. The CNN encodes the input image into vector space feature embeddings which are input to the RNN. The RNN takes the image encoding and the word generated at the current time step as input, and generates a complete sentence one word at a time. Special words mark the start of the sentence and allow the model to predict its end. They also use long short term memory (LSTM) inside the RNN so that it has some memory of older generated words. They train their model to directly optimize the log likelihood of the target sentence given the input image.
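
The word-by-word generation loop described above can be sketched as follows. This is only a schematic illustration: the `next_word` callback stands in for a trained decoder, and the lookup table, the special words `<start>` and `<end>`, and the length limit are all hypothetical choices made for this example.

```python
def greedy_decode(next_word, image_encoding, start="<start>", end="<end>", max_len=20):
    """Generate a sentence one word at a time until the end word is predicted.

    next_word(image_encoding, word, state) must return (predicted_word, new_state);
    it is a placeholder for the decoder network.
    """
    words = []
    state = None
    current = start
    for _ in range(max_len):
        current, state = next_word(image_encoding, current, state)
        if current == end:
            break
        words.append(current)
    return words


def toy_next_word(image_encoding, word, state):
    """Stand-in decoder: a fixed lookup table instead of a trained RNN."""
    table = {"<start>": "a", "a": "dog", "dog": "runs", "runs": "<end>"}
    return table[word], state


sentence = greedy_decode(toy_next_word, image_encoding=None)
```

A real decoder would replace the lookup table with a network that conditions on the image encoding and carries its hidden state from one step to the next.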

Using a similar approach, [Karpathy and Fei-Fei, 2014] also use an RNN trained on multimodal inputs. Given an input image feature vector and the currently generated word (starting at a fixed word), it learns to predict the next word in the sequence (ending at a fixed word).

# Chapter 3

Resources

We employed various data sets and resources in our research and experimentation. Below we provide a brief description of them. Most high quality data sets are collected through crowdsourcing methods and labeled by humans. Some large-scale automatically generated data sets also exist (for example, using the annotations provided by users when uploading images to photo-sharing websites). However, [Hodosh et al., 2013] make a convincing case that these captions are not pertinent to the objectives of our task, and hence we do not use them (like many researchers in the domain).

3.1 Flickr8K

Flickr8K$^1$ is a data set comprising 8,092 images collected from the Flickr website [Rashtchian et al., 2010]. Each image comes with five captions which describe its contents. The images in this data set focus on people or animals (mainly dogs) performing some action. The images tend not to contain well known locations, but are manually selected to depict a variety of scenes and situations. The images are captioned by human captioners (five for each image) using crowdsourcing via Amazon's Mechanical Turk$^2$.

In order to avoid grammar mistakes, the captioners (who were only from the US) had to pass a brief quality control test based on spelling and grammar. Following this, they were asked to write sentences that describe the depicted scenes, situations, events and entities (people, animals, other objects). Multiple captions were collected to model the inherent variability of different humans in describing the same image. As a consequence, the captions of the same image are often not direct paraphrases of each other: the same entity or event or situation can be described in multiple ways (man vs. bike rider, doing tricks vs. jumping), and while everybody mentions the bike rider, not everybody mentions the crowd or the ramp.

$^1$ http://nlp.cs.illinois.edu/HockenmaierGroup/8k-pictures.html

$^2$ https://www.mturk.com/

It is worth mentioning that when accessing the data set, we found that several of the images had been removed from Flickr (the authors gave links redirecting to the original images$^3$), so we could only download 6,793 out of the total 7,678 images. Therefore, although we can use it to judge the performance of our algorithm on the data set, we cannot use it for comparison with previous work.

3.2 Flickr30K

Flickr30K$^4$ is an extension of the Flickr8K data set, comprising 31,783 images collected by [Young et al., 2014]. Similar to its counterpart, each image is associated with five captions written by humans using Amazon's Mechanical Turk. The images consist of everyday activities, events, and scenes.

It is important that the annotators are not familiar with the specific entities and circumstances depicted in the images, so that they do not give overly specific annotations. For example, the annotation "Three people setting up a tent" is more relevant to the task than "Our trip to the Olympic peninsula" (for the same image - the latter is likely an annotation given by a person familiar with the significance of the image). Moreover, different annotators use different levels of specificity, from describing the overall situation (performing a musical piece) to specific actions (bowing on a violin). This variety of descriptions associated with each image is necessary to induce similarities between expressions that are not trivially related by syntactic rewrite rules.

3.3 OverFeat

OverFeat$^5$ is a publicly available visual feature extractor based on the convolutional neural network submitted by [Sermanet et al., 2014] to the ILSVRC-2013 large-scale object recognition competition. At the time of its release it ranked 4th in the classification, 1st in the localization, and 1st in the detection tasks on the ILSVRC-13 data sets. We use the accurate version of their model, which has 144 million free parameters and 5.4 billion connections, and reaches an error rate of 14.18% on the competition validation set. Their network architecture is similar to the architecture by [Krizhevsky et al., 2012], with the additions that it is trained on images at multiple scales and that it has an improved inference step.


$^3$ http://nlp.cs.illinois.edu/HockenmaierGroup/8k-pictures.html

$^4$ http://shannon.cs.illinois.edu/DenotationGraph/

$^5$ http://cilr.nyu.edu/doku.php?id=software:overfeat:start

3.4 Word2vec

Word2vec$^6$ is a publicly available tool which provides an efficient implementation for learning continuous bag-of-words and skip-gram based vector representations of words. The model and implementation are based on the work of [Mikolov et al., 2013b], who also released the tool. In addition to the implementation, they provide vector representations of words and phrases which they learned by training this model on the Google News Dataset (about 100 billion words). These are 300-dimensional vectors for 3 million words and phrases. An interesting feature of these vector representations is that they capture linear regularities in the language. For example, the result of the vector calculation $\text{vec}(\text{"Madrid"}) - \text{vec}(\text{"Spain"}) + \text{vec}(\text{"France"})$ is closest to $\text{vec}(\text{"Paris"})$ .
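
The vector arithmetic above can be illustrated with a small sketch. The four-dimensional vectors below are hand-made toy values chosen so that the regularity holds exactly; they are not the released 300-dimensional embeddings, and the helper name `nearest_word` is an assumption for this example.

```python
import numpy as np


def nearest_word(query, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to query."""
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = float(query @ vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word


# Toy embeddings (hypothetical): one axis per country plus a "capital" axis.
vectors = {
    "Spain":   np.array([1.0, 0.0, 0.0, 0.0]),
    "France":  np.array([0.0, 1.0, 0.0, 0.0]),
    "Germany": np.array([0.0, 0.0, 1.0, 0.0]),
    "Madrid":  np.array([1.0, 0.0, 0.0, 1.0]),
    "Paris":   np.array([0.0, 1.0, 0.0, 1.0]),
    "Berlin":  np.array([0.0, 0.0, 1.0, 1.0]),
}

query = vectors["Madrid"] - vectors["Spain"] + vectors["France"]
# The words used to build the query are excluded from the search.
result = nearest_word(query, vectors, exclude=("Madrid", "Spain", "France"))
```

With learned embeddings the match is only approximate, which is why the nearest neighbour under cosine similarity is taken rather than an exact equality.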


$^6$ https://code.google.com/p/word2vec/
