Vectorizing Images and Text in One Vector Space with CLIP

by Roman Egger

Machine learning is making great strides, and especially in natural language processing (NLP) and image processing, groundbreaking developments can be observed from one year to the next.

Deep learning approaches in particular have led to revolutionary developments; however, the preparation of vision datasets remains very labor- and cost-intensive, which is a serious problem. Classical computer vision algorithms recognize patterns in the pixels of an image (the features) by analyzing shapes, distances between shapes, colors, contrast ratios, and much more. Millions of images must therefore be laboriously labeled: a photo of a beach with a palm tree needs to be labeled as such before it can be used as data input.

At the beginning of 2021, OpenAI presented a groundbreaking development with CLIP (Contrastive Language–Image Pre-training). CLIP is a neural network trained on 400 million image–text pairs collected from the Internet. Because the images were trained with natural language supervision, CLIP has “zero-shot” capabilities similar to GPT-2 or GPT-3. Thanks to this multi-modal training, CLIP’s zero-shot performance on ImageNet can be compared with ResNet-50 without needing any of the 1.28 million labeled training examples, making CLIP a game-changer for visual classification tasks. In short, CLIP pre-trains an image encoder and a text encoder so that images and texts are represented in the same vector space. CLIP was evaluated as a zero-shot image classifier, but there are numerous other applications.
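To make this concrete, here is a minimal sketch using OpenAI’s open-source clip package, with a hypothetical local file beach.jpg, showing how an image and a sentence both end up as vectors in the same 512-dimensional space:

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("beach.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["This is a beach in Greece"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)   # shape (1, 512)
    text_features = model.encode_text(text)      # shape (1, 512)

# Because both encoders map into the same space, cosine similarity is meaningful
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
print((image_features @ text_features.T).item())
```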

To try out the features and benefits of CLIP, I considered the following scenario in a tourism context. Visual communication is becoming increasingly important in tourism marketing. Instagram and co. have given classic destination marketing a hard time, because the relevance of user-generated content (UGC) is beyond doubt: UGC and the messages of destination management organizations (DMOs) are fighting for attention. In this respect, the analysis of UGC is becoming increasingly important for destinations — on the one hand, to get a feel for the image perceived by tourists and to shape the development of offers accordingly (read our paper about clustering destination image using ML); on the other hand, to post tailored information via social media channels that generates high engagement.

Study & Method
Pictures are worth a thousand words, but without context they can become interchangeable. For my example, I chose images of beaches. I crawled 600 Instagram posts with the hashtag #wonderfulbeaches, and Geoapify was used to extract the geo-location (lat/long) from the location descriptions of the posts. I then selected 10 photos of beaches (landscape only, without people or buildings) for each of 14 countries (Australia, Brazil, Croatia, Cuba, France, Greece, Indonesia, Italy, Mexico, Philippines, Portugal, Spain, Thailand, and Turkey). These 140 pictures were downloaded and saved together with texts. For each image, there were three textual descriptions: a neutral one, “This is a beach in [country]”, for example “This is a beach in Greece”, and two further sentences carrying a positive or negative sentiment, “This is a wonderful beach in Greece” and “This is a terrible beach in Greece”. This was done to see the impact of the country names as well as of the sentiment on the embeddings of the images. Since CLIP uses the same vector space (512 dimensions) for both images and texts, vectors were created for the images alone, for the texts alone, and as the vector sum of image and text. The idea was to extend the image vector with a textual “context vector”.
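A minimal sketch of this combination step, assuming a hypothetical file name and the same CLIP model as above, could look like this:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def combined_vector(image_path: str, caption: str):
    """Encode an image and its caption with CLIP and return the summed 512-d vector."""
    img = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tok = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        img_vec = model.encode_image(img).squeeze(0)
        txt_vec = model.encode_text(tok).squeeze(0)
    return (img_vec + txt_vec).cpu().numpy()   # image vector plus textual "context vector"

# e.g. one of the 140 downloaded pictures with its neutral caption (hypothetical file name)
vec = combined_vector("greece_beach_01.jpg", "This is a beach in Greece")
```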

The vectors were then reduced with t-SNE. Figure 1 shows the two-dimensional text-vector space for the countries used. It is interesting to see how the semantic proximity of two terms partly reflects their geographical proximity, for example Turkey and Greece or Spain and Portugal. (If you are not familiar with embeddings, read here.)
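A sketch of this reduction step, assuming `embeddings` is an (n, 512) NumPy array of CLIP vectors and `labels` holds the matching country names:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# embeddings: (n, 512) CLIP vectors, labels: one country name per row (assumed to exist)
# perplexity kept small because only 14 country texts are embedded here
coords = TSNE(n_components=2, perplexity=10, init="pca", random_state=42).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), label in zip(coords, labels):
    plt.annotate(label, (x, y), fontsize=8)
plt.show()
```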

Figure 1: Word-Embedding (Countries)

The image embeddings from CLIP alone already provided interesting insights. Using Louvain clustering (a community detection method; I normalized the data and used 20 PCA components and 15 k-neighbors in pre-processing), similar types of beaches are already grouped together. Figure 2 shows that the clustering of the beaches already corresponds to geographic regions. For example, beaches in the Philippines and Thailand (purple, at the top) share the same characteristics, as do beaches in Croatia and Italy (light blue, on the right) (Figure 3).
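A sketch of this clustering pipeline with scikit-learn and the python-louvain package, where `image_vectors` is assumed to be the (140, 512) matrix of CLIP image embeddings:

```python
import networkx as nx
import community as community_louvain        # pip install python-louvain
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph

# Normalize, reduce to 20 PCA components, build a 15-nearest-neighbor graph
X = StandardScaler().fit_transform(image_vectors)
X = PCA(n_components=20).fit_transform(X)
adjacency = kneighbors_graph(X, n_neighbors=15, mode="connectivity")

# Louvain community detection on the neighborhood graph (networkx >= 2.7)
graph = nx.from_scipy_sparse_array(adjacency)
partition = community_louvain.best_partition(graph)   # maps image index -> cluster id
```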

Figure 2: Image Embedding (Beaches)
Similar beaches – Philippines, Thailand
Similar beaches – Croatia, Italy

However, it became really interesting when I looked at the combined image-text vectors. Figure 4 shows the geographical clustering of the images, and Figure 5 the clustering of the image-text vectors on a map. It is clearly visible that the individual beaches can now be assigned much better to their actual countries. Of course, the clusters are not perfect when the regions lie very close together (e.g., Italy and Croatia).

Figure 4: Only Image-Vectors – resulting in mixed clusters
Figure 5: Image-Text vectors – resulting in good clusters

Now I tried to evaluate the results with a classification task. Three models were developed for this purpose: a neural network, an SVM, and a random forest. The model parameters are shown in Table 1.

Neural Network: Hidden layers: 20; Activation: ReLU; Solver: Adam; Alpha: 0.0003; Max iterations: 200; Replicable training: True
SVM: SVM type: ν-SVM (ν=0.25, C=1.0); Kernel: RBF, exp(-auto|x-y|²); Numerical tolerance: 0.001; Iteration limit: 100
Random Forest: Number of trees: 200; Maximal number of considered features: unlimited; Replicable training: No; Maximal tree depth: unlimited; Stop splitting nodes with maximum instances: 5
Table 1: Model Parameters
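As a rough scikit-learn equivalent of these settings (a sketch only; the parameters merely approximate Table 1, and `X` and `y` are assumed to hold the 140 embedding vectors and their country labels):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import NuSVC

models = {
    "Neural Network": MLPClassifier(hidden_layer_sizes=(20,), activation="relu",
                                    solver="adam", alpha=0.0003, max_iter=200,
                                    random_state=0),
    "SVM": NuSVC(nu=0.25, kernel="rbf", gamma="auto", tol=0.001, max_iter=100),
    "Random Forest": RandomForestClassifier(n_estimators=200, max_features=None,
                                            min_samples_split=5, random_state=0),
}

# X: (140, 512) embedding matrix, y: country label per image (assumed to exist)
for name, clf in models.items():
    acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: CA = {acc:.3f}")
```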

In the following, I classified the images alone and the images combined with text information, using the country as the target variable. As expected, the results based purely on the images are relatively poor. The best results are obtained with the Random Forest model.

Model            AUC     CA      F1      Precision   Recall
SVM              0.685   0.271   0.277   0.300       0.271
Random Forest    0.773   0.307   0.309   0.326       0.307
Neural Network   0.683   0.186   0.191   0.207       0.186
Model Scores: Images only (CA = classification accuracy)

In order not to encode the country too explicitly in the combined image-text vectors, an alternative combination was generated with CLIP in the next step. For this purpose, not the country but the continent in which the country is located was embedded as text information: for a beach in Spain, Portugal, or Italy, “Europe” was vectorized as text, and for countries like Thailand or Indonesia, “Asia”. It turns out that the Random Forest model again performs best, followed by the SVM. I must say, however, that I did not try to tune the models extensively.
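A hypothetical sketch of this remapping (the continent assignments below are my own reading; the post does not spell out, for example, whether Turkey was counted as Europe or Asia):

```python
# Hypothetical country-to-continent mapping used to build the alternative captions
continent = {
    "Australia": "Australia", "Brazil": "South America", "Croatia": "Europe",
    "Cuba": "North America", "France": "Europe", "Greece": "Europe",
    "Indonesia": "Asia", "Italy": "Europe", "Mexico": "North America",
    "Philippines": "Asia", "Portugal": "Europe", "Spain": "Europe",
    "Thailand": "Asia", "Turkey": "Europe",   # could equally be mapped to "Asia"
}

# country_labels: list with the country of each of the 140 images (assumed to exist)
captions = [f"This is a beach in {continent[c]}" for c in country_labels]
```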

Model            AUC     CA      F1      Precision   Recall
SVM              0.826   0.436   0.444   0.468       0.436
Random Forest    0.912   0.529   0.523   0.544       0.529
Neural Network   0.791   0.257   0.255   0.278       0.257
Model Scores: Images + Text (continents)

The confusion matrix also shows that the assignment to the correct countries was already quite good in some cases.

Figure 6: Confusion Matrix – Continents

Finally, the image-country vectors were classified, and here almost all beach images could be assigned to the correct country.

Model            AUC     CA      F1      Precision   Recall
SVM              0.957   0.807   0.806   0.816       0.807
Random Forest    0.991   0.950   0.949   0.954       0.950
Neural Network   0.894   0.521   0.522   0.552       0.521
Model Scores: Image + Text (countries)
Figure 7: Confusion Matrix – Countries

It was also interesting to see how the positions in the vector space shifted when a positive or negative connotation was included in the sentence (wonderful vs. terrible). The shift is noticeable, but not strong enough to displace the dominance of the country information. This study thus shows that it is possible to extend image embeddings with text embeddings using CLIP and thereby make them more precise. In the future, for example, the images of Instagram posts could be enriched with their text information, or photos on review platforms could be combined with their textual descriptions. Projects like “Concept” by Maarten Grootendorst use CLIP to develop a kind of topic modeling for images. In the future, the combination of images and text could provide much better insights than classical topic modeling approaches with LDA, NMF, etc. can today.
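To quantify the sentiment shift described above, one could compare the neutral caption with its positive and negative variants via cosine similarity, as in this small sketch:

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

sentences = ["This is a beach in Greece",
             "This is a wonderful beach in Greece",
             "This is a terrible beach in Greece"]

with torch.no_grad():
    vecs = model.encode_text(clip.tokenize(sentences).to(device))
    vecs = vecs / vecs.norm(dim=-1, keepdim=True)

# Cosine similarity of the neutral sentence to the positive and the negative variant
print("neutral vs wonderful:", (vecs[0] @ vecs[1]).item())
print("neutral vs terrible: ", (vecs[0] @ vecs[2]).item())
```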
