Update 2: At some point in 2019, the Met released https://metmuseum.github.io/ which documents an API which can actually be used! I was able to get Appraisal Bot up and working again. The responses from the API are slightly different from what was documented in the original post here, but it’s generally similar to what was available. The documentation provided by the Met is pretty good so I don’t plan on writing a new post documenting the new API.
As far as I can tell, Incapsula is a resource hosting service that obfuscates responses if a request doesn’t come with the right header or cookie or something. If you hit the API from a web browser it’s all fine but anything else doesn’t work. I don’t understand how they expect anyone to use their API like this. It seems fundamentally broken.
I emailed email@example.com with a detailed description of the problem and never got a response. Once the API stopped working, I abandoned Appraisal Bot, which depended on it.
Another option that I looked at was their GitHub repo, which is regularly updated with snapshots of their data: https://github.com/metmuseum/openaccess. It notably doesn’t have any image links, which makes it just about useless. They also don’t seem to respond to any issues.
Original Post: The Met has an Open Access Policy, under which the museum makes public domain art images in its collection easily available for download. And that’s great if you want to use their website.
But that’s not great if you want an automated way to get at the data and images. There are two ways to fetch the data: one is downloading the full dataset from a regularly updated GitHub repo and the other is a web API. The GitHub data notably doesn’t include references to the images. It would also require manually updating a local copy of the data at regular intervals.
But the web API exists and as far as I can tell it’s completely undocumented. The only mentions of the exact syntax I could find were in some comments on some social media posts like this.
I ended up using the API and stumbling around trying to figure it out, so I’m going to document the API for anyone else who has a similar need.
Calling the API
The API for the Met returns the results from a search, very similarly to performing a search on the normal website. It even uses the same url structure as the website. If you do a simple search through the collection on the Met website for the word “test”, you’ll get a url that looks something like this:
The part of the url after the question mark is a query string. It’s used to pass along parameters from the search. In this case it’s mostly default values except for q=test, which is the field for what we’re actually searching for.
You can play around with this url pretty easily by removing fields from the url by hand or by using the UI on the website to create a different search.
If we check the box for “Show Only: Public Domain Artworks” we get a new parameter,
And if we click the box for “Object / Material: Musical Instruments”, we get another parameter,
One search parameter that is not obvious is the offset parameter. It encodes the page selection on the search results page. That is, if you perform a search and your perPage is 20, then offset is 0 on page one, 20 on page two, 40 on page three, and so on.
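To make the arithmetic concrete, here is a small Python sketch of the page-to-offset mapping described above. The helper name offset_for_page is made up for illustration; only offset and perPage come from the search url.

```python
def offset_for_page(page, per_page=20):
    """Return the offset value for a 1-based results page.

    offset_for_page is a hypothetical helper name. The relationship itself
    (page one -> 0, page two -> perPage, and so on) is the one described
    in the text.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * per_page


# With a perPage of 20, pages one, two and three map to 0, 20 and 40.
offsets = [offset_for_page(p) for p in (1, 2, 3)]
```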
Note that none of the parameters are required. They’ll be added with default values if they’re missing. For example, you don’t need to search for any text at all. You could perform a search for any item with the Armor material.
So how does all this transfer over to the API? Well, all you have to do is take the query string and append it onto metmuseum.org/api/collection/collectionlisting? like so:
Go ahead and try out that url in your browser. You should get a page of JSON back.
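If you would rather build these urls in code than by hand, a minimal Python sketch might look like the following. It only assembles the url and does not make a request. The https scheme and the helper name build_api_url are assumptions on my part; q, offset and perPage are the parameter names discussed above.

```python
from urllib.parse import urlencode

# Endpoint from the text above; the "https://" scheme is an assumption.
API_BASE = "https://metmuseum.org/api/collection/collectionlisting"


def build_api_url(**params):
    """Append a query string to the collection listing endpoint.

    build_api_url is a hypothetical helper. Parameters set to None are
    skipped, since the text notes that missing parameters fall back to
    default values.
    """
    query = {key: value for key, value in params.items() if value is not None}
    return API_BASE + "?" + urlencode(query)


# The simple "test" search, first page, 20 results per page.
example_url = build_api_url(q="test", offset=0, perPage=20)
```

Using urlencode rather than string concatenation means search text with spaces or punctuation gets escaped properly.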
At this point I’d recommend playing around with making some different calls by hand by editing the query string parameters. To better understand your results I’d recommend putting the JSON in a visualizer like https://jsonvisualizer.com/. It’ll help you see the structure of the data, the different objects and the fields they have on them.
At a high level, the response JSON is broken up into a request object, a results array, a facets array and some assorted simple fields on the top level object.
The request object contains the parameters used to make the request. I didn’t find this useful but if you wanted to do some validation this could help.
The results array is the main body of the data. Its length is at most the value of the perPage parameter. It is an array of items returned by the search. It could be empty if your search had no results.
The facets array contains a lot of information about the possible values of fields on the results objects. I haven’t found a use for this.
The last piece of useful data is the totalResults field on the top level object. This field is the total number of search results for your search query. For example, a very specific query will have very few total results.
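As a rough illustration of that layout, here is a Python sketch that pulls the useful pieces out of an already-parsed response. The sample payload is made up and far smaller than a real response. Only the top level names (request, results, facets, totalResults) come from the description above, and summarize_response is a hypothetical helper.

```python
def summarize_response(payload):
    """Summarize a parsed response dict (hypothetical helper).

    Uses .get() with defaults so a missing field does not raise, which is
    a defensive assumption rather than documented API behavior.
    """
    results = payload.get("results", [])
    return {
        "total_results": payload.get("totalResults", 0),
        "returned": len(results),
        "is_empty": len(results) == 0,
    }


# Made-up payload shaped like the structure described above.
sample = {
    "request": {"q": "test", "offset": 0, "perPage": 1},
    "results": [{"url": "/placeholder-item-path"}],
    "facets": [],
    "totalResults": 42,
}
summary = summarize_response(sample)
```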
Using the Data From an Item
An element in the results array looks like this:
Most of it is pretty self-explanatory, but there are a few non-obvious bits.
The url field has to be appended to https://metmuseum.org to actually form a valid url. And you don’t necessarily need the query string for the url to work. The query string just adds a “Back to Search Results” link on the item’s page. I usually print the item url to a log after I’ve picked a random item and I found it much more readable to chop off the query string.
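A small Python sketch of that step, assuming the url field is a path beginning with a slash. The helper name item_url and the example path are made up for illustration.

```python
SITE_BASE = "https://metmuseum.org"


def item_url(url_field, keep_query=False):
    """Turn an item's relative url field into a full url.

    item_url is a hypothetical helper. By default everything from the
    first "?" onward is dropped, which only removes the
    "Back to Search Results" link on the item page.
    """
    path = url_field if keep_query else url_field.split("?", 1)[0]
    return SITE_BASE + path


# Placeholder path, not a real item.
short_url = item_url("/placeholder/item/123?offset=0&perPage=20")
```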
The image field is a valid url, but the regularImage and largeImage fields are not. To get full urls out of them, you’ll have to use the first couple of characters of the largeImage or regularImage string to find where to chop off the image string, and then append the largeImage or regularImage string. In the example above, image is https://images.metmuseum.org/CRDImages/rl/mobile-large/SLP0129.jpg and largeImage is rl/web-large/SLP0129.jpg, so we find where the rl is in image and substitute the ending from largeImage to get the final result of https://images.metmuseum.org/CRDImages/rl/web-large/SLP0129.jpg.
Note that the image url does NOT always have the same format. For example, you can’t rely on every url containing the phrase CRDImages. You have to use the first few characters of the large and regular image urls to find where to substitute.
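The substitution can be sketched in Python roughly like this. It uses the first path segment of the relative string (rl in the example) as the marker, which is my reading of “the first few characters” and may not hold for every item. full_image_url is a hypothetical helper name.

```python
def full_image_url(image, relative):
    """Build a full url from the valid image url and a relative image path.

    Hypothetical helper. Assumes the first path segment of `relative`
    (for example "rl") also appears as a path segment in `image`.
    """
    marker = "/" + relative.split("/", 1)[0] + "/"
    cut = image.find(marker)
    if cut == -1:
        raise ValueError("could not find %r in %r" % (marker, image))
    # Keep everything up to and including the slash before the marker
    # segment, then append the relative path.
    return image[: cut + 1] + relative


# The example values from the text above.
large_url = full_image_url(
    "https://images.metmuseum.org/CRDImages/rl/mobile-large/SLP0129.jpg",
    "rl/web-large/SLP0129.jpg",
)
```

Raising an error when the marker is missing, instead of guessing, makes it easier to spot items whose image url follows a different format.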
Randomly Selecting an Item From a Category
At this point your usage will vary based on your own needs. I’ll explain how I used it for Appraisal Bot.
I needed to pull random public domain images from the collection to process with Appraisal Bot. I also decided that I only wanted to pull in art from hand picked categories. That way I could curate the results to be mostly items that could show up on an Antiques Roadshow style show. Unfortunately the API doesn’t provide an endpoint for randomly picking an item, so I came up with my own method.
First, we randomly pick a material to search for from a curated list that I made. Then we call the API with that material, an offset of 0 and a perPage of 1. We don’t actually care about results at this point. All we really want is the value of totalResults, which gives us the total number of items with that material.
Next, we roll a random number from 0 up to, but not including, totalResults. Then we call the API with the material, an offset of the random number and a perPage of 1. Essentially we’re making a search with 1 item per page and randomly selecting the page we’re on. That’ll return us one item in the results array. And that’s our item!
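Here is a Python sketch of that two-call approach. To keep it runnable without the network, the API call is passed in as a function. The search callable, its material keyword, and the material names in the stand-in are all placeholders rather than real API parameter names or values. Only offset, perPage, results and totalResults come from the description above.

```python
import random


def pick_random_item(search, materials, rng=random):
    """Pick one random item using two calls to `search`.

    `search` is a hypothetical callable that takes material, offset and
    perPage and returns the parsed response as a dict.
    """
    material = rng.choice(materials)

    # First call: only totalResults matters here.
    total = search(material=material, offset=0, perPage=1)["totalResults"]
    if total == 0:
        return None

    # Second call: randrange(total) yields 0 .. total - 1, so the offset
    # always lands on an existing one-item "page".
    offset = rng.randrange(total)
    results = search(material=material, offset=offset, perPage=1)["results"]
    return results[0] if results else None


def fake_search(material, offset, perPage):
    """Stand-in for the real API: pretends each material has 5 items."""
    return {"totalResults": 5, "results": [{"material": material, "index": offset}]}


item = pick_random_item(fake_search, ["Material A", "Material B"], rng=random.Random(0))
```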
Limitations of the API
As far as I can tell, there is no way to perform a binary OR search on any search parameters. For example, I can’t search for Paintings OR Prints in the same search. This is limiting to me because of the wide range in tagging quality for materials. I’d love to be able to search for Drinking Vessels OR Vessels because they’re basically the same sort of item and not every applicable item is tagged with both. But since I can’t, in order to achieve my intended distribution, I’d have to implement a more complicated weighted random table for materials that accounts for the fact that most items are both Drinking Vessels and Vessels but not every one is.
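For what it’s worth, the basic mechanism of a weighted random table is short in Python. This sketch does not solve the overlap problem described above. It only shows how hand-tuned weights could steer which material gets searched, and the weights here are made-up placeholders, not real counts.

```python
import random


def pick_material(weights, rng=random):
    """Pick a material name with probability proportional to its weight.

    Hypothetical helper. `weights` maps material name -> non-negative
    weight, and at least one weight must be positive.
    """
    materials = list(weights)
    return rng.choices(materials, weights=[weights[m] for m in materials], k=1)[0]


# Placeholder weights: down-weight one of two heavily overlapping tags
# so the pair is not drawn twice as often as intended.
material_weights = {"Vessels": 3, "Drinking Vessels": 1, "Paintings": 4}
choice = pick_material(material_weights, rng=random.Random(1))
```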