Now save the API key and the endpoint since we'll need those to access the API in the code.
Using the API
Now we get to the fun stuff where we can get into some code. I'll be using Python, but you're welcome to use the language of your choice since this is a simple API call. I'm also using Azure ML since it makes it easy to get a Jupyter Lab instance running with most machine learning and data science packages preinstalled.
Imports
First, we need to import some modules. We have five that we will need.

- `json`: Used to read in a config file for the API key and endpoint.
- `requests`: Used to make the API calls. This is preinstalled in an Azure ML Jupyter instance, so you may need to run `pip install requests` if you are using another environment.
- `time`: Used to delay API calls so the server doesn't get hit with too many requests.
- `os`: Used to save and help clean image data on the local machine.
- `pprint`: Used to format JSON when printing.
```python
import json
import requests
import time
import os
import pprint
```
The API Call
Now, we can start building and making the API call to get the image data.
Building the Endpoint
To start building the call, we need the API key, which is kept in a JSON file for security reasons. We'll use the `open` function to open the file for reading and the `json` module to load the JSON file. This creates a dictionary where the JSON keys become the dictionary keys you use to look up the values.

```python
config = json.load(open("config.json"))
api_key = config["apiKey"]
```
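In case it helps, here's the layout this snippet assumes for config.json. The `apiKey` key name matches the one read above, but the value here is only a placeholder, not a real subscription key:

```python
import json

# Create a config.json with the expected layout
# (placeholder value, not a real subscription key).
with open("config.json", "w") as f:
    json.dump({"apiKey": "YOUR-SUBSCRIPTION-KEY"}, f)

config = json.load(open("config.json"))
api_key = config["apiKey"]
```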
Now that we have the API key, we can build up the URL to make the API call, starting from the endpoint that we got from the Azure Portal.

```python
endpoint = "https://api.bing.microsoft.com/"
```

With the endpoint in hand, we have to append a path to tell it that we want the Image Search API. To learn more about the exact endpoints we're using here, this doc has a lot of good information.

```python
url = f"{endpoint}v7.0/images/search"
```
Building the Headers and Query Parameters
Some more information we need to add to our call are the headers and the query parameters. The headers are where we supply the API key, and the query parameters detail what images we want returned.

Requests makes it easy to specify the headers, which is done as a dictionary. We need to supply the `Ocp-Apim-Subscription-Key` header for the API key.

```python
headers = { "Ocp-Apim-Subscription-Key": api_key }
```

The query parameters are also given as a dictionary. We'll supply the license, image type, and safe search parameters here. Those are optional, but the `q` parameter, which holds the query we want to use to search for images, is required. For our query here, we'll search for Aston Martin cars.

```python
params = {
    "q": "aston martin",
    "license": "public",
    "imageType": "photo",
    "safeSearch": "Strict",
}
```
Making the API Call
With everything ready, we can now make the API call and get the results. With `requests` we can just call the `get` method, passing in the URL, the headers, and the parameters. We use the `raise_for_status` method to throw an exception if the status code isn't successful. Then, we get the JSON of the response and store it in a variable. Finally, we use `pprint` to print the JSON response.

```python
response = requests.get(url, headers=headers, params=params)
response.raise_for_status()
result = response.json()
pprint.pprint(result)
```
And here's a snapshot of the response. There's quite a bit here but we'll break it down some later in this post.
```
{'_type': 'Images',
 'currentOffset': 0,
 'instrumentation': {'_type': 'ResponseInstrumentation'},
 'nextOffset': 38,
 'totalEstimatedMatches': 475,
 'value': [{'accentColor': 'C6A105',
            'contentSize': '1204783 B',
            'contentUrl': 'https://www.publicdomainpictures.net/pictures/380000/velka/aston-martin-car-1609287727yik.jpg',
            'creativeCommons': 'PublicNoRightsReserved',
            'datePublished': '2021-02-06T20:45:00.0000000Z',
            'encodingFormat': 'jpeg',
            'height': 1530,
            'hostPageDiscoveredDate': '2021-01-12T00:00:00.0000000Z',
            'hostPageDisplayUrl': 'https://www.publicdomainpictures.net/view-image.php?image=376994&picture=aston-martin-car',
            'hostPageFavIconUrl': 'https://www.bing.com/th?id=ODF.lPqrhQa5EO7xJHf8DMqrJw&pid=Api',
            'hostPageUrl': 'https://www.publicdomainpictures.net/view-image.php?image=376994&picture=aston-martin-car',
            'imageId': '38DBFEF37523B232A6733D7D9109A21FCAB41582',
            'imageInsightsToken': 'ccid_WTqn9r3a*cp_74D633ADFCF41C86F407DFFCF0DEC38F*mid_38DBFEF37523B232A6733D7D9109A21FCAB41582*simid_608053462467504486*thid_OIP.WTqn9r3aKv5TLZxszieEuQHaF5',
            'insightsMetadata': {'availableSizesCount': 1,
                                 'pagesIncludingCount': 1},
            'isFamilyFriendly': True,
            'name': 'Aston Martin Car Free Stock Photo - Public Domain '
                    'Pictures',
            'thumbnail': {'height': 377, 'width': 474},
            'thumbnailUrl': 'https://tse2.mm.bing.net/th?id=OIP.WTqn9r3aKv5TLZxszieEuQHaF5&pid=Api',
            'webSearchUrl': 'https://www.bing.com/images/search?view=detailv2&FORM=OIIRPO&q=aston+martin&id=38DBFEF37523B232A6733D7D9109A21FCAB41582&simid=608053462467504486',
            'width': 1920}]}
```
A few things to note from the response:

- `nextOffset`: This will help us page through items across multiple requests.
- `value.contentUrl`: This is the actual URL of the image. We will use this URL to download the images.
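To make that concrete, here's a small sketch of pulling the `contentUrl` values out of a single response into a list. The `result` dictionary is a made-up, trimmed-down stand-in for the parsed JSON shown above:

```python
# A minimal stand-in for the parsed JSON response
# (only the fields we care about; the URLs are made up).
result = {
    "nextOffset": 38,
    "value": [
        {"contentUrl": "https://example.com/car-1.jpg", "width": 1920},
        {"contentUrl": "https://example.com/car-2.jpg", "width": 1280},
    ],
}

contentUrls = [item["contentUrl"] for item in result["value"]]
```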
Paging Through Results
A single API call returns around 30 items or so by default. How do we get more images with the API? We page through the results. The way to do this is to take the `nextOffset` item from the API response and pass its value back in another query parameter, `offset`, to get the next page of results.
So if I only want at most 200 images, I can use the below code to page through the API results.
```python
contentUrls = []
new_offset = 0

while new_offset <= 200:
    print(new_offset)
    params["offset"] = new_offset

    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    result = response.json()

    time.sleep(1)
    new_offset = result["nextOffset"]

    for item in result["value"]:
        contentUrls.append(item["contentUrl"])
```
We initialize the offset to 0 so the initial call returns the first page of results, and the `while` loop caps the offset at 200 to limit how many images we collect. Within the loop we set the `offset` parameter to the current offset (0 initially), make the API call, sleep for one second so we don't flood the server, set the offset to the `nextOffset` from the results, and save the `contentUrl` items from the results into a list. Then, we do it again until we reach the limit of our offset.
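The paging logic can also be pulled into a small helper so it's easy to exercise without hitting the API. This is just a sketch, not from the original post: `fetch_page` stands in for the real request, and the fake below only mimics the `nextOffset`/`value` shape of the response:

```python
import time

def collect_content_urls(fetch_page, max_offset=200, delay=0.0):
    """Page through results until the offset passes max_offset.

    fetch_page(offset) must return a dict shaped like the Bing
    response: {"nextOffset": int, "value": [{"contentUrl": str}, ...]}.
    """
    urls = []
    offset = 0
    while offset <= max_offset:
        result = fetch_page(offset)
        time.sleep(delay)  # be polite to the server between calls
        offset = result["nextOffset"]
        urls.extend(item["contentUrl"] for item in result["value"])
    return urls

# A fake fetch_page that mimics the API's paging, for demonstration.
def fake_fetch(offset, page_size=38):
    return {
        "nextOffset": offset + page_size,
        "value": [{"contentUrl": f"https://example.com/img-{offset + i}.jpg"}
                  for i in range(page_size)],
    }

urls = collect_content_urls(fake_fetch)
# Offsets visited: 0, 38, 76, 114, 152, 190 -- six pages in total.
```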
Downloading the Images
In the previous API calls all we did was capture the `contentUrl` items for each of the images. In order to use the images as training data we need to download them. Before we do that, let's set up the directory the images will be downloaded to. First we set the path, then we use the `os` module to check whether the path exists. If it doesn't, we'll create it.

```python
dir_path = "./aston-martin/train/"

if not os.path.exists(dir_path):
    os.makedirs(dir_path)
```
Generally, we could just loop through all of the content URLs and, for each one, take the file name at the end of the URL, build the full path with the `os.path.join` method (which gives the correct separators for the system we're on), download the image with `requests.get`, and write its contents to the file opened with `open`.

```python
for url in contentUrls:
    file_name = url.split("/")[-1]
    path = os.path.join(dir_path, file_name)
    try:
        with open(path, "wb") as f:
            image_data = requests.get(url)
            f.write(image_data.content)
    except OSError:
        pass
```
However, this is a bit more complicated than we would hope it would be.
Cleaning the Image Data
If we print the image URLs for all that we get back it would look something like this:
```
https://www.publicdomainpictures.net/pictures/380000/velka/aston-martin-car-1609287727yik.jpg
https://images.pexels.com/photos/592253/pexels-photo-592253.jpeg?auto=compress&cs=tinysrgb&h=750&w=1260
https://images.pexels.com/photos/2811239/pexels-photo-2811239.jpeg?cs=srgb&dl=pexels-tadas-lisauskas-2811239.jpg&fm=jpg
https://get.pxhere.com/photo/car-vehicle-classic-car-sports-car-vintage-car-coupe-antique-car-land-vehicle-automotive-design-austin-healey-3000-aston-martin-db2-austin-healey-100-69398.jpg
https://get.pxhere.com/photo/car-automobile-vehicle-automotive-sports-car-supercar-luxury-expensive-coupe-v8-martin-vantage-aston-land-vehicle-automotive-design-luxury-vehicle-performance-car-aston-martin-dbs-aston-martin-db9-aston-martin-virage-aston-martin-v8-aston-martin-dbs-v12-aston-martin-vantage-aston-martin-v8-vantage-2005-aston-martin-rapide-865679.jpg
https://c.pxhere.com/photos/5d/f2/car_desert_ferrari_lamborghini-1277324.jpg!d
```
Do you notice anything in the URLs? While most of them end in a `.jpg` or `.jpeg` extension, a few have some extra parameters on the end. If we try to download with those URLs as file names we won't get usable images, so we need to do a little bit of data cleaning here.

Luckily, there are two patterns we can check for: a `?` in the URL and a `!` in the URL. With those patterns we can update our download loop as below to get the correct file names for all images.
```python
for url in contentUrls:
    # The file name is the last segment of the URL.
    split = url.split("/")
    last_item = split[-1]

    # Strip anything after a "?" (query parameters).
    second_split = last_item.split("?")
    if len(second_split) > 1:
        last_item = second_split[0]

    # Strip anything after a "!" (e.g. "!d" suffixes).
    third_split = last_item.split("!")
    if len(third_split) > 1:
        last_item = third_split[0]

    print(last_item)

    path = os.path.join(dir_path, last_item)
    try:
        with open(path, "wb") as f:
            image_data = requests.get(url)
            image_data.raise_for_status()
            f.write(image_data.content)
    except OSError:
        pass
```
With this cleaning of the URLs, we can download the full set of images.
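As an aside (not part of the original approach), the standard library's `urllib.parse` can handle the `?` case for us, since `urlparse` separates the query string from the path; only the `!` suffix still needs a manual split:

```python
import os
from urllib.parse import urlparse

def filename_from_url(url):
    # urlparse splits off the query string (everything after "?"),
    # so os.path.basename gives the bare file name; the trailing
    # "!d"-style suffix still has to be stripped by hand.
    name = os.path.basename(urlparse(url).path)
    return name.split("!")[0]

filename_from_url("https://images.pexels.com/photos/592253/pexels-photo-592253.jpeg?auto=compress&cs=tinysrgb&h=750&w=1260")
# -> "pexels-photo-592253.jpeg"
```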