How does Midjourney work?
AI-produced art? (A look at MidJourney)
As a developer, the one thing I’m most uncomfortable with is producing graphical content. However, I have good news.
AI is progressing in leaps and bounds. It’s able to do stuff that just 2 or 3 years ago I would have loudly declared totally impossible. And now it is being made available to the public. New initiatives from Google, OpenAI and MidJourney are pushing the boundaries of the art AI can produce.
And throughout this video I’ll be showing examples of what I’ve created on MidJourney, with the prompts, to allow you to see what is possible.
But it raises a few questions…
The test for whether a computer can think was formulated by Alan Turing in 1950. The test is basically: can you tell, by chatting with it, whether the answers are provided by a human or by a computer?
Can we apply the same thing to art? If we show the images to people, can they say: this was drawn by a human, and that was created by a computer? Could that answer the question of whether computers can create art? Or should the question be more basic: does this convey or produce an emotion?
And if so, is the new technology coming out of Google, OpenAI and MidJourney able to meet either test? I’ll let you be the judge of where you stand, but I know I’ve been seriously impressed by what I’ve seen come out of MidJourney. I don’t really have the answer. What I can talk about is:
- how we can create things with it,
- and how it works, technically.
And what is nice is that you can test them for yourself. OpenAI and MidJourney have both opened up their betas to anyone, although there is a waiting list for OpenAI’s Dall-E 2.
MidJourney’s interface is unusual: you interact with a Discord bot. And they have recently released a new version of their model, which is showing even more impressive results.
I’ve run a few tests to show you how it works. Basically, once you’re on a channel, you type “/imagine” and start expressing what you want. You can type in pretty much anything.
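To give you an idea, the simplest form looks something like this (a prompt of my own, just as an illustration):

```
/imagine prompt: a rocket lifting off
```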
I’ve played around with images of a rocket lifting off, and with giving it different styles. Here are a few examples.
For this one, I specified that I wanted an octane render and volumetric lighting.
For this next one, I asked for the style of the French graphic novel artists Enki Bilal and Moebius.
I’ve also done versions in the style of 60s sci-fi posters and tarot cards.
The range of things you can do with this is wide. But more than anything else, it’s fun!
There are of course times when it doesn’t understand what you are on about. It seems to be trained on art more than anything else, so, for example, just asking it to draw a computer didn’t produce much.
And you can specify stuff like the aspect ratio using the --ar flag, and the level of quality using the --q flag.
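Put together, a full prompt with those flags might look something like this (the flag values here are just my example, not anything the tool requires):

```
/imagine prompt: a rocket lifting off, octane render, volumetric lighting --ar 16:9 --q 2
```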
Now that is all well and good, but how does it go about doing this? Well, let me try to explain.
You’ve probably already heard of deep fakes. Those used adversarial neural networks: one network was trying to produce a fake, and the other was attempting to detect the fake.
And in a sense the image generation going on here is doing something similar: it also relies on neural networks, but it uses them differently.
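For context, here is what that adversarial setup looks like in heavily simplified code. This is my own toy sketch in PyTorch, on random data, just to show the two networks pulling against each other; it is not the code behind any particular deep fake tool.

```python
# A toy sketch of adversarial training (the idea behind deep fakes), written by
# me as an illustration: a generator learns to fool a discriminator, while the
# discriminator learns to spot the fakes. Random tensors stand in for real images.
import torch
import torch.nn as nn

latent_dim, image_dim, batch = 16, 784, 32

generator = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, image_dim), nn.Tanh()
)
discriminator = nn.Sequential(
    nn.Linear(image_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid()
)
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.rand(batch, image_dim)               # stand-in for real images
    fake = generator(torch.randn(batch, latent_dim))  # the "faker" at work

    # The detector: label real images 1 and generated images 0.
    d_loss = bce(discriminator(real), torch.ones(batch, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(batch, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # The faker: try to make the detector answer 1 ("real") on its fakes.
    g_loss = bce(discriminator(fake), torch.ones(batch, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```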
Allow me to explain.
The first step is to teach the model to understand the link between words and images. If we have enough images, and they are all correctly labelled, we can teach a neural network to recognise what looks like a banana, or a koala bear, or a motorbike. That part is fairly easy — or at least we’ve already been doing it for a while.
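To make that first step concrete, here is a small sketch of what "linking words and images" can look like in practice. MidJourney has not published its exact pipeline, so I am using the publicly available CLIP model (through the Hugging Face transformers library) purely as a stand-in: it scores how well each caption matches an image.

```python
# A sketch of a model that links words and images. This is my illustration
# using the public CLIP model, not MidJourney's actual internals.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("banana.jpg")  # placeholder path: any labelled image
captions = ["a banana", "a koala bear", "a motorbike"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher probability = the model thinks that caption matches the image best.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```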
The next step is to create a process that takes our images and transforms them into random noise. So I have a system that can take all these images of bananas, or koala bears, or motorbikes, and turn them into random noise. This is called the diffusion process.
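Here is a toy version of that noising step, following the standard diffusion recipe: a schedule that gradually mixes the image with Gaussian noise over many steps. The schedule values are common textbook defaults, not anything specific to MidJourney.

```python
# A toy sketch of the forward diffusion process: mix an image with more and
# more Gaussian noise until, at the last step, almost nothing of it remains.
import torch

T = 1000                                   # number of noising steps
betas = torch.linspace(1e-4, 0.02, T)      # noise schedule (a common default)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(image, t):
    """Return a noisy copy of `image` at step t (t=0 almost clean, t=T-1 almost pure noise)."""
    noise = torch.randn_like(image)
    a = alphas_cumprod[t]
    return a.sqrt() * image + (1.0 - a).sqrt() * noise, noise

image = torch.rand(3, 64, 64)              # stand-in for one labelled training image
half_noised, _ = add_noise(image, t=500)   # halfway along the schedule
```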
Now comes the fun part.
If we reverse the system that creates the noise, we can input random noise, and produce an image. That’s what we call Reverse Diffusion.
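Here is what that looks like as a toy sampling loop, assuming we already have a trained network (I call it noise_predictor, a placeholder name) that guesses the noise present in an image at a given step, for a given prompt. The update rule below is the textbook DDPM sampler; real systems like MidJourney, Dall-E 2 and Imagen add many refinements on top of it.

```python
# A toy sketch of reverse diffusion: start from pure random noise and remove a
# little of the predicted noise at each step, until an image appears.
import torch

@torch.no_grad()
def sample(noise_predictor, prompt, shape=(3, 64, 64), T=1000):
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                        # start from random noise
    for t in reversed(range(T)):
        eps = noise_predictor(x, t, prompt)       # the network's guess of the noise in x
        coef = betas[t] / (1.0 - alphas_cumprod[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()   # peel away a bit of noise
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn(shape)  # re-inject a little randomness
    return x                                      # a brand-new image
```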
And with our labelled data and Reverse Diffusion, we can teach the model what it means to produce something that looks like a banana. Or a koala bear. Or a koala bear on a motorbike. Or even a banana on a motorbike. And we can use our first model to check whether it recognises what is being produced, and keep only the most valid versions.
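One simple reading of "keep only the most valid versions" is re-ranking: generate several candidates with the sampler above, then let the word-and-image model from the first step pick the best match. A sketch of that idea, reusing sample from the previous snippet; clip_score is a hypothetical helper wrapping the earlier CLIP scoring code.

```python
# A sketch of re-ranking candidates with the text-image model (my illustration).
# `sample` comes from the previous snippet; `clip_score(image, prompt)` is a
# hypothetical helper that wraps the CLIP scoring code shown earlier.
def best_of(noise_predictor, clip_score, prompt, n=4):
    candidates = [sample(noise_predictor, prompt) for _ in range(n)]
    scores = [clip_score(img, prompt) for img in candidates]
    return candidates[scores.index(max(scores))]
```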
I’m curious as to what you think about all this. Is it art? The images I’ve shown you are mainly from MidJourney (because it is what I have easy access to), but OpenAI’s Dall-E 2 and Google’s Imagen models look like they have a lot of promise.
But Google’s systems are not available. Why? For two reasons.
The first is that with great power comes great responsibility. The models are producing images that have the potential to be at least as credible as those produced by the adversarial networks, with a lot more ease.
The other problem is that these models contain and encapsulate our society’s biases. If all the images that the model has seen of CEOs are photos of white men in suits… well, the model will equate being a CEO with being white and male.
And that underlines the difficulty with AI, and with computer programs in general.
They are only ever as good as the data they have been provided with. Garbage in, garbage out.
And correcting for those biases first means identifying them.
And that is probably the greatest challenge facing AI today.