AI Telephone — A Battle of Multimodal Models
Generative AI is on fire right now. The past few months especially have seen an explosion in multimodal machine learning — AI that connects concepts across different “modalities” such as text, images, and audio. As an example, Midjourney is a multimodal text-to-image model, because it takes in natural language and outputs images. The magnum opus of this recent renaissance in multimodal synergy was Meta AI’s ImageBind, which can take inputs of 6(!) varieties and represent them in the same “space”.
With all of this excitement, I wanted to put multimodal models to the test and see how good they actually are. In particular, I wanted to answer three questions:
Telestrations is much like the game of telephone: players sit in a circle, each taking in communication from the person on one side and, in turn, communicating their interpretation to the person on the other side. As the game proceeds, the original message is invariably altered, if not lost entirely. Telestrations differs, however, by adding bimodal communication: players alternate between illustrating (drawing) a description and describing (in text) a drawing.
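This alternating loop — text to image, then image back to text — can be sketched in a few lines of Python. Here `illustrate` and `describe` are hypothetical stand-ins for a real text-to-image model and an image captioning model, not actual APIs:

```python
def illustrate(description: str) -> str:
    """Hypothetical stand-in for a text-to-image model (e.g. a Midjourney-style generator)."""
    return f"[drawing of: {description}]"

def describe(drawing: str) -> str:
    """Hypothetical stand-in for an image-to-text (captioning) model."""
    return f"a picture showing {drawing}"

def play_telestrations(seed: str, rounds: int = 3) -> list[str]:
    """Alternate between illustrating a description and describing a drawing,
    recording each step so we can watch the message drift from the original."""
    history = [seed]
    message = seed
    for _ in range(rounds):
        message = illustrate(message)  # one player draws the description
        history.append(message)
        message = describe(message)    # the next player describes the drawing
        history.append(message)
    return history

game = play_telestrations("a cat wearing a top hat", rounds=2)
print(game[-1])
```

With real models substituted in, each pass through the loop is where the telephone-style drift accumulates: the caption never perfectly captures the image, and the image never perfectly realizes the caption.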