About 12 interns and researchers at Microsoft Research have created an automatic image-captioning system. Take the above image as an example. Their new system can automatically caption the image as ‘A woman holding a camera in a crowd’.
Not only was the project great to work on, but I’m also proud of the results, which are described in a preprint. You can think of a captioning system as a machine translation system, from pixels to (e.g.) English. Machine translation experts use the BLEU metric to compare the output of a system to a human translation. BLEU breaks each caption into chunks (n-grams) of 1 to 4 words, and then measures how much the chunks in the system caption overlap with those in the human captions. It also penalizes system captions that are too short.
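To make the BLEU description concrete, here is a minimal sketch of a sentence-level BLEU score in Python. It is an illustration under stated assumptions, not the evaluation code used for the results in this post: the function names, the whitespace tokenization, and the lack of smoothing are all choices made here for brevity, and published BLEU numbers are normally computed over a whole test set rather than one caption at a time.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous chunk of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU (illustrative sketch only).

    Assumptions: lowercased whitespace tokenization, uniform weights over
    chunk lengths 1..max_n, and a score of 0.0 when any length has no overlap.
    """
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand or not refs:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate is shorter than n words

        # For each chunk, the most times it appears in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)

        # Clip candidate counts so repeating a chunk cannot inflate the score.
        clipped = sum(min(count, max_ref[gram])
                      for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: compare against the reference closest in length.
    c = len(cand)
    r = min((len(ref) for ref in refs), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)

    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

A human baseline of the kind discussed in this post can be mimicked with the same function by passing one human-written caption as `candidate` and the remaining human captions as `references`.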
To understand the highest possible BLEU score we could attain, we tested one human-written caption (as a hypothetical “system”) vs. four others. I’m happy to report that, in terms of BLEU score, we actually beat humans! Our system achieved a BLEU score of 21.05%, while the human “system” scored 19.32%.
Now, you should take this superhuman BLEU score with a gigantic boulder of salt. BLEU has many limitations that are well-known in the machine translation community. We also tried testing with the METEOR metric, and got somewhat below human performance (20.71% vs 24.07%).
The real gold standard is to conduct a blind test and ask people which caption is better (sort of like what I asked you above). We used Amazon’s Mechanical Turk to ask people to compare pairs of captions: is one better, is the other one better, or are they about the same? For 23.3% of test images, people thought that the system caption was as good as or better than a human caption.
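As a rough sketch of how such pairwise judgments could be turned into a single percentage, the Python snippet below tallies votes per image. Everything in it is hypothetical: the vote labels, the function name, and especially the aggregation rule (an image counts for the system when "system" plus "same" votes outnumber "human" votes) are assumptions for illustration, since this post does not say how the judgments were combined.

```python
from collections import Counter


def fraction_same_or_better(judgments):
    """Share of images where raters judged the system caption same or better.

    `judgments` maps an image id to a list of votes, each one of
    'system', 'human', or 'same'. The aggregation rule is an assumption.
    """
    if not judgments:
        raise ValueError("no judgments to aggregate")

    same_or_better = 0
    for votes in judgments.values():
        counts = Counter(votes)
        # Assumed rule: 'system' and 'same' votes together must outnumber 'human'.
        if counts["system"] + counts["same"] > counts["human"]:
            same_or_better += 1
    return same_or_better / len(judgments)


# Made-up toy votes for four images, purely to show the input format.
toy_judgments = {
    "img1": ["system", "same", "human"],
    "img2": ["human", "human", "same"],
    "img3": ["same", "same", "human"],
    "img4": ["human", "human", "human"],
}
toy_fraction = fraction_same_or_better(toy_judgments)
```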
Read more about it here.