Computer Vision and NLP for Smart Image Processing

Moscow, city of mystery … (this photo is kindly provided by Igor Shabalin)

Nowadays, anyone with a smartphone can become a photographer. As a result, tons of new photos appear every day on social media, websites, blogs, and in personal photo libraries. And although taking photos can be quite exciting, sorting them out and manually writing a description for each one afterwards can be pretty boring and time consuming.

This article discusses how you can use both Computer Vision (CV) and Natural Language Processing (NLP) technologies together to obtain a set of descriptive tags for a photo and then generate a meaningful description based on those tags, thus saving valuable time.

What’s Inside the Photo?

We humans can answer that question in a moment, once the photo is in our hands. Machines can answer it too, provided they are familiar with CV and NLP. Look at the following photo:

[Photo: hay bales in a rural field]

How would your application know what is in the above photo? With tools like Clarifai’s Predict API, this can be a breeze. Below is the set of descriptive tags this API gives you after processing the above photo:

‘straw’, ‘hay’, ‘pasture’, ‘wheat’, ‘cereal’, ‘rural’, ‘bale’, …

As you can see, these tags give you appropriate information about what can be seen in the picture. If all you need is to automatically classify visual content, these tags will be quite enough to get the job done. For the task of image description generation, however, you’ll need to take it one step further and take advantage of some NLP techniques.
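To see why tags alone suffice for classification, consider a naive classifier that scores each candidate category by how many of its keywords appear in the tag list. The category names and keyword sets below are invented for illustration; a minimal sketch:

```python
def classify_by_tags(tags, categories):
    """Score each category by how many of its keywords appear in the tags."""
    scores = {name: len(keywords & set(tags))
              for name, keywords in categories.items()}
    # Return the name of the best-matching category
    return max(scores, key=scores.get)

# Hypothetical category definitions for illustration
categories = {
    'countryside': {'straw', 'hay', 'pasture', 'rural', 'bale'},
    'city': {'street', 'building', 'traffic', 'skyline'},
}

tags = ['straw', 'hay', 'pasture', 'wheat', 'cereal', 'rural', 'bale']
print(classify_by_tags(tags, categories))  # countryside
```

Five of the seven tags overlap with the 'countryside' keyword set and none with 'city', so the photo is classified accordingly.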

In this article, you’ll see a simplified example of how this can be implemented, showing how some words from the generated tag list can be woven into simple phrases. For a conceptual discussion on this topic, you might also want to check out my post on the Clarifai blog: Generate Image Descriptions with Natural Language Processing.

Getting Ready

To follow along with the script discussed in this article, you’ll need to have the following software components:

Python 2.7+/3.4+

spaCy v2.0+

A pretrained English model for spaCy

Clarifai API Python client

Clarifai API key

You’ll find installation instructions on the respective sites. Apart from that, you’ll also need the wikipedia Python library, which allows you to obtain and parse data from Wikipedia.

Automate Tagging of Your Photo

To begin with, let’s look at the code that you might use to automate tagging of a photo. In the implementation below, we’re using Clarifai’s general image recognition model to obtain descriptive tags for a submitted photo.

from clarifai.rest import ClarifaiApp, Image

def what_is_photo(photofilename):
    app = ClarifaiApp(api_key='Your Clarifai API key here')
    model = app.models.get("general-v1.3")
    image = Image(file_obj=open(photofilename, 'rb'))
    result = model.predict([image])
    tags = ''
    items = result['outputs'][0]['data']['concepts']
    for item in items:
        # Skip the unhelpful 'no person' concept; keep the rest
        if item['name'] != 'no person':
            tags = tags + "{}, ".format(item['name'])
    return tags
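If you’re curious what the function is navigating, here is the general shape of the Predict API’s JSON response (concept names and confidence values abridged and invented for illustration), and how the parsing loop above walks it:

```python
# A simplified, hypothetical Predict API response
result = {
    'outputs': [{
        'data': {
            'concepts': [
                {'name': 'straw', 'value': 0.99},
                {'name': 'no person', 'value': 0.98},
                {'name': 'hay', 'value': 0.97},
            ]
        }
    }]
}

tags = ''
for item in result['outputs'][0]['data']['concepts']:
    if item['name'] != 'no person':  # drop the unhelpful concept
        tags += "{}, ".format(item['name'])

print(tags)  # straw, hay,
```

Each concept comes with a confidence value, and the concepts arrive sorted by confidence, which is why taking the first few tags later in the script picks the most likely ones.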

To test out the above function, you might append the following main block to the script:

if __name__ == "__main__":
    # Strip the trailing ', ' before splitting the tag string into a list
    tag_list = what_is_photo("country.jpg").strip(', ').split(', ')

In this particular example, we pick up the first seven descriptive tags generated for the submitted photo. Thus, for the photo provided in the What’s Inside the Photo? section earlier, this script generates the following list of descriptive tags:

['straw', 'hay', 'pasture', 'wheat', 'cereal', 'rural', 'bale']

That would be quite enough for the purpose of classification, and might be used as the source data for NLP to generate a meaningful description, as discussed in the next section.

Turning Descriptive Tags into a Description with NLP

They told us at school that to master a language you need to read a lot. In other words, you have to train on the best examples of its use. Turning back to our discussion, we need some text that uses the words from the tag list. Of course, you could obtain a huge corpus, say, a Wikipedia database dump containing an enormous number of articles. However, in the era of AI-powered search, you can narrow your corpus down to only those texts that are most relevant to the words in your tag list. The following code illustrates how you might obtain the content of a single Wikipedia article related to the tag list (append it to the code in the main block from the previous section):

import wikipedia
# Build a space-separated search query from the first seven tags
query = " ".join(tag_list[:7])
# Take the top search hit and fetch the corresponding article
wiki_resp = wikipedia.page(wikipedia.search(query)[0])
print("Article url: ", wiki_resp.url)

Now that you have some text data to work on, it’s time for NLP to come into play. Below are the initial steps, where you initialize spaCy’s text-processing pipeline and then apply it to the text (append this to the previous code snippet):

import spacy

nlp = spacy.load('en')
doc = nlp(wiki_resp.content)

In the following code, you iterate over the sentences in the submitted text, analyzing the syntactic dependencies in each sentence. In particular, you look for phrases that contain words from the submitted tag list. In such a phrase, two words from the list are supposed to be syntactically related with a head/child relationship. If you’re confused by the terminology used here, I would recommend checking out Natural Language Processing Using Python, which explains NLP concepts in detail and contains a lot of easy-to-follow examples. You can start reading right now: Chapter 2 and Chapter 12 are free. Also, an example of where syntactic dependency analysis might be used in practice can be found in the Generating Intents and Entities for an Oracle Digital Assistant Skill article I recently wrote for Oracle Magazine.
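To make the head/child idea concrete, here is a toy illustration using hand-built stand-in tokens rather than real spaCy ones (the Token class and the parse of the sentence are invented for illustration); the filtering condition is the same one used in the code below:

```python
class Token:
    """Minimal stand-in for a spaCy token: lemma, position, syntactic head."""
    def __init__(self, lemma, i):
        self.lemma_ = lemma
        self.i = i
        self.head = self  # spaCy root tokens point to themselves

# Hand-built dependency parse of the toy sentence "hay in bales":
hay, in_, bales = Token('hay', 0), Token('in', 1), Token('bale', 2)
in_.head = hay      # 'in' attaches to 'hay'
bales.head = in_    # 'bales' attaches to 'in'

tag_list = ['straw', 'hay', 'pasture', 'wheat', 'cereal', 'rural', 'bale']
sent = [hay, in_, bales]

# Find a token that is a tag word whose head's head is also a tag word:
# here 'bales' -> 'in' -> 'hay' forms such a three-word chain
matches = [t for t in sent
           if t.lemma_ in tag_list and t.head.head.lemma_ in tag_list
           and t.head.lemma_ != t.lemma_
           and t.head.head.lemma_ != t.head.lemma_]
print(matches[0].lemma_)  # bale
```

Walking from a matched token up through its heads recovers the words of the phrase in dependency order, which is exactly what the script below does with real spaCy tokens.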

Turning back to the code below, be warned that it is a simplification; real-world code would be a bit more complicated, of course. (Append the code below to the previous code in the main script.)

x = []
for sent in doc.sents:
    # Two-word phrases: a tag word whose head is another tag word
    pairs = [t for t in sent if t.lemma_ in tag_list[:7]
             and t.head.lemma_ in tag_list[:7]
             and t.head.lemma_ != t.lemma_]
    if pairs:
        t = pairs[0]
        y = [(t.i, t), (t.head.i, t.head)]
        y.sort(key=lambda tup: tup[0])
        x.append((y[0][1].text + ' ' + y[1][1].text, 2))
    # Three-word phrases: a tag word whose head's head is another tag word
    triples = [t for t in sent if t.lemma_ in tag_list[:7]
               and t.head.head.lemma_ in tag_list[:7]
               and t.head.lemma_ != t.lemma_
               and t.head.head.lemma_ != t.head.lemma_]
    if triples:
        t = triples[0]
        if t.i > t.head.i > t.head.head.i:
            y = [(t.head.head.i, t.head.head), (t.head.i, t.head), (t.i, t)]
            x.append((y[0][1].text + ' ' + y[1][1].text + ' ' + y[2][1].text, 3))
# Prefer longer phrases and print the best candidate
x.sort(key=lambda tup: tup[1], reverse=True)
if len(x) != 0:
    print(x[0][0])

This code gives me the following phrase for the photo provided in the What’s Inside the Photo? section earlier in this article:

Hay in bales

This looks like a relevant description for that photo.

Written by

is the author of Natural Language Processing with Python and spaCy (No Starch Press, 2020).
