It seems like every few months, someone publishes a machine learning paper or demo that makes my jaw drop. This month, it’s OpenAI’s new image-generating model, DALL·E.
The 12-billion-parameter neural network takes a text caption (e.g. “an armchair in the shape of an avocado”) and generates images to match it:

I think its pictures are pretty inspiring (I’d buy one of those avocado chairs), but what’s even more impressive is DALL·E’s ability to understand and render concepts of space, time, and even logic (more on that in a second).
In this post, I’ll give you a quick overview of what DALL·E can do, how it works, how it fits in with recent trends in ML, and why it matters. Let’s dive in.
What is DALL·E and what can it do?
In July, OpenAI, the company behind DALL·E, released a similarly enormous model called GPT-3 that wowed the world with its ability to generate human-like text, including op-eds, poems, sonnets, and even computer code. DALL·E is a natural extension of GPT-3 that parses text prompts and then responds not with words but with pictures. In one example from OpenAI’s blog, for instance, the model renders images from the prompt “a living room with two white armchairs and a painting of the Colosseum. The painting is mounted above a modern fireplace”:

Pretty clever, right? You can probably already see how this could be useful for designers. Note that DALL·E can generate a large set of images from a single prompt. The images are then ranked by a second OpenAI model, called CLIP, that tries to determine which of them best match the prompt.
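If you want a feel for how that “generate many, then rerank” step works, here’s a toy sketch in Python. The `embed_text` and `embed_image` functions are just stand-ins for CLIP’s two encoders (they return random vectors so the snippet runs), and the candidate “images” are placeholder strings; this is an illustration of the idea, not OpenAI’s actual API.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_text(prompt: str) -> np.ndarray:
    """Stand-in for CLIP's text encoder: returns a random unit vector."""
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def embed_image(image) -> np.ndarray:
    """Stand-in for CLIP's image encoder: returns a random unit vector."""
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def rerank(prompt: str, candidates: list, top_k: int = 32) -> list:
    """Keep the candidate images whose embeddings best match the prompt's embedding."""
    text_vec = embed_text(prompt)
    scores = [float(embed_image(img) @ text_vec) for img in candidates]  # cosine similarity
    order = np.argsort(scores)[::-1][:top_k]
    return [candidates[i] for i in order]

# Pretend the generative model drew 512 candidates for one prompt, then keep the best 8.
candidates = [f"candidate_image_{i}" for i in range(512)]
best = rerank("an armchair in the shape of an avocado", candidates, top_k=8)
```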
How was DALL·E built?
Unfortunately, we don’t have a ton of details on this yet because OpenAI has yet to publish a full paper. But at its core, DALL·E uses the same new neural network architecture that’s responsible for a ton of recent advances in ML: the Transformer. Transformers, introduced in 2017, are an easy-to-parallelize type of neural network that can be scaled up and trained on huge datasets. They’ve been particularly revolutionary in natural language processing (they’re the basis of models like BERT, T5, GPT-3, and others), improving the quality of Google Search results, translation, and even the prediction of protein structures.
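If you’ve never peeked inside a Transformer, the key ingredient is self-attention: every token in a sequence gets to look at every other token when computing its new representation. Here’s a bare-bones NumPy sketch of single-head scaled dot-product attention, purely for intuition; it isn’t DALL·E’s actual code, and the toy dimensions are made up.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x of shape (seq_len, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project tokens into queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: attention weights per token
    return weights @ v                               # each output is a weighted mix of values

# Toy usage: a "sequence" of 6 tokens with 16-dimensional embeddings.
rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(6, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)               # shape: (6, 16)
```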
Most of these big language models are trained on enormous text datasets (like all of Wikipedia or crawls of the web). What makes DALL·E unique, though, is that it was trained on sequences that were a combination of words and pixels. We don’t yet know exactly what that dataset was (it probably contained images and captions), but I can guarantee you it was most likely enormous.
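To make the “words and pixels in one sequence” idea concrete, here’s a hedged sketch of what a single training example might look like: caption tokens followed by discrete image codes, all in one stream for an autoregressive model to predict token by token. The tokenizers, vocabulary sizes, and grid size below are my own made-up placeholders, since OpenAI hasn’t published those details.

```python
import random

# Illustrative vocabulary sizes and grid shape -- assumptions, not OpenAI's numbers.
TEXT_VOCAB = 16_384     # assumed text (BPE-style) vocabulary size
IMAGE_VOCAB = 8_192     # assumed number of discrete image codes
GRID = 32               # assume each image compresses to a 32x32 grid of codes

def encode_text(caption: str) -> list:
    """Placeholder tokenizer: one id per word, hashed into the text vocabulary."""
    return [hash(word) % TEXT_VOCAB for word in caption.lower().split()]

def encode_image(image) -> list:
    """Placeholder image encoder: a real system would use a learned discrete encoder."""
    rng = random.Random(0)
    return [rng.randrange(IMAGE_VOCAB) for _ in range(GRID * GRID)]

def training_sequence(caption: str, image) -> list:
    """Caption tokens first, then image tokens (offset so the two vocabularies don't collide).
    An autoregressive Transformer is trained to predict each token from the ones before it."""
    return encode_text(caption) + [TEXT_VOCAB + code for code in encode_image(image)]

seq = training_sequence("an armchair in the shape of an avocado", image=None)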
How “smart” is DALL·E?
While these results are impressive, whenever we train a model on a huge dataset, the skeptical machine learning engineer is right to ask whether the results are merely high-quality because they were copied or memorized from the source material.
To prove that DALL·E isn’t just regurgitating images from its training set, the OpenAI authors pushed it to render some pretty unusual prompts:
“A professional high quality illustration of a giraffe turtle chimera.”

“A snail made of a harp.”

It’s hard to imagine the model came across many giraffe-turtle hybrids in its training dataset, which makes the results all the more impressive.
What’s more, these weird prompts hint at something even more fascinating about DALL·E:
Zero-shot visual reasoning
Typically, in machine learning, we train models by giving them thousands or millions of examples of the task we want them to perform and hope they pick up on the pattern.
For example, to train a model that identifies dog breeds, we might show a neural network thousands of images of dogs labeled by breed and then test its ability to tag new pictures of dogs. It’s a task with limited scope that seems almost quaint compared to OpenAI’s latest tricks.
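For contrast, here’s roughly what that conventional supervised setup looks like in code: a tiny PyTorch model trained on (image, breed-label) pairs, with random tensors standing in for the labeled dog photos. The dataset, architecture, and hyperparameters here are arbitrary; the point is that the resulting model can only ever do this one narrow thing.

```python
import torch
from torch import nn

NUM_BREEDS = 10
images = torch.randn(256, 3, 64, 64)            # stand-in for 256 labeled dog photos
labels = torch.randint(0, NUM_BREEDS, (256,))   # stand-in breed label for each photo

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, NUM_BREEDS),                  # one score per breed
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)       # penalize wrong breed predictions
    loss.backward()
    optimizer.step()

# The trained model does exactly one thing: guess a breed for a new photo.
predicted = model(images[:1]).argmax(dim=1)
```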
Zero-shot learning, on the other hand, is the ability of models to perform tasks they weren’t specifically trained to do. For example, DALL·E was trained to generate images from captions. But with the right text prompt, it can also transform images into sketches:

DALL·E can also render custom text on street signs:

In this way, DALL·E can act almost like a Photoshop filter, even though it wasn’t specifically designed to behave that way.
The model even shows an understanding of visual concepts (i.e. “macroscopic” or “cross-section” pictures), places (i.e. “a photo of the food of China”), and time (“a photo of Alamo Square, San Francisco, from a street at night”; “a photo of a phone from the 20s”). For example, here’s what it produced in response to the prompt “a photo of the food of China”:

In other words, DALL·E can do more than just paint a pretty picture for a caption; in a sense, it can also answer questions visually.
To test DALL·E’s visual reasoning ability, the authors gave it a visual IQ test. In the examples below, the model had to complete the lower-right corner of the grid, following the test’s hidden pattern.

“DALL·E is often able to solve matrices that involve continuing simple patterns or basic geometric reasoning,” the authors write, but it did better at some problems than others. When the puzzles’ colors were inverted, DALL·E did worse, “suggesting its capabilities may be brittle in unexpected ways.”
What does this mean?
What strikes me most about DALL·E is its ability to perform surprisingly well on so many different tasks, ones the authors didn’t even anticipate:
“We find that DALL·E […] is able to perform several kinds of image-to-image translation tasks when prompted in the right way. We did not anticipate that this capability would emerge, and made no modifications to the neural network or training procedure to encourage it.”
This is amazing, but not entirely unexpected; DALL·E and GPT-3 are two examples of a larger theme in deep learning: that extraordinarily large neural networks trained on unlabeled internet data (an example of “self-supervised learning”) can turn out to be highly versatile, able to do lots of things they weren’t specifically designed for.
Of course, don’t mistake this for general intelligence. It’s not hard to trick these sorts of models into looking pretty dumb. We’ll know more when they’re openly accessible and we can start playing with them. But that doesn’t mean I can’t be excited in the meantime.
This article was written by Dale Markowitz, an applied AI engineer at Google based in Austin, Texas, where she works on applying machine learning to new fields and industries. She also likes solving her own life problems with AI, and talks about it on YouTube.
Published January 10, 2021 – 11:00 UTC