2023-09-11 | 10 mins read

Why AI Sucks at Text in Images

This article explains why AI image models struggle with legible text in images, and how we partly worked around the problem.

Why is A.I. so bad at writing text in images?

If you have used an A.I. image generator before, such as Midjourney, DALL-E or Image GPT, you will have noticed that they are really bad at writing text in images. Why is this?

Well, first, let's take a quick look at Midjourney. We will start with our first prompt, using the string: the letter "D" cute, adorable, octane renderer, 8K, high detail, --aspect 3:2

Well great... that seemed to work OK! Then again, it's just one letter. Let's try something else, like "Hello".

As you can see, the first frame has something resembling the word "Hello", but the rest are either unreadable or missing letters, so we can already see it struggling with text in these images.

OK, so let's try something else: "Steve Jones", a common enough sounding name (well, if you're from the UK anyhow!). Let's see what we get:

Again, the result sort of resembles the text, but it's not really readable, and it's missing letters.

What's going on?

So why do the images above look pretty amazing as long as you disregard the spelling? I mean, I love the lighting, the 3D effects are just awesome, and the quality and colors are great, but just what the hell is going on with that spelling?

Well, the answer is that the A.I. does not really understand the text. It just sees it as a bunch of pixels, and it tries to recreate those pixels in the image.

Imagine you ran a (very horrible) experiment where you raised 5 children from infancy and taught them how to paint, sculpt and draw, but never taught them how to read or write. I'm sure they could all be excellent artists, and if you asked them to draw a portrait or a landscape they would impress you with their awesome skills. But if you asked them for a painting that had some words in it, they would probably struggle. To them the words and letters are just a combination of shapes, and they don't understand what they mean.

This is also true for the A.I. image generators we see right now. They have largely been trained on huge numbers of ordinary images, i.e. cars, animals, objects, actions, colors etc. And although they are now mind-blowing at creating these images, they have not had the equivalent training on another few billion images containing the same kinds of objects plus text!

So the text they add is an approximation of what they think it should look like, which is why we get the results we see above.

How can we fix this?

So what did we do at 3D Names to help the A.I. with text? Well, we don't actually try to fix the fact that the A.I. is bad at text. Instead, we take advantage of the fact that it's really amazing at creating images, and we combine that with text we render ourselves.

That's why our image generator tool gets you to first create the text you want to use, edit the font etc. Then, once you're happy with it, we instruct our A.I. tool to focus on drawing around the text mask we have given it!
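
To make the text mask idea a bit more concrete, here is a minimal sketch in plain Python. It is not the 3D Names implementation. The tiny bitmap glyphs, the function names and the fake "generated" image are all assumptions made up for illustration. A real tool would rasterize a proper font and call an actual image model. The point is only that pixels covered by the mask come from text we rendered ourselves, so the model never gets the chance to misspell them.

```python
# Hypothetical 5x5 bitmap glyphs; a real tool would rasterize the user's chosen font.
GLYPHS = {
    "H": ["#...#", "#...#", "#####", "#...#", "#...#"],
    "I": ["#####", "..#..", "..#..", "..#..", "#####"],
}
GLYPH_HEIGHT = 5


def text_mask(text, gap=1):
    """Return a 2D list of 0/1 values where 1 marks a pixel belonging to the text."""
    rows = [[] for _ in range(GLYPH_HEIGHT)]
    for i, ch in enumerate(text):
        glyph = GLYPHS[ch]
        for y in range(GLYPH_HEIGHT):
            if i > 0:
                rows[y].extend([0] * gap)  # blank columns between letters
            rows[y].extend(1 if c == "#" else 0 for c in glyph[y])
    return rows


def composite(generated, mask, text_value=255):
    """Keep text pixels fixed; take every other pixel from the generated image."""
    return [
        [text_value if m else g for g, m in zip(g_row, m_row)]
        for g_row, m_row in zip(generated, mask)
    ]


def render(mask):
    """Draw the mask as ASCII art so it can be checked by eye."""
    return "\n".join("".join("#" if m else "." for m in row) for row in mask)


mask = text_mask("HI")

# Stand-in for model output: an arbitrary pattern, NOT a real generator call.
generated = [
    [(x * 7 + y * 13) % 200 for x in range(len(mask[0]))]
    for y in range(len(mask))
]

result = composite(generated, mask)
print(render(mask))
```

Whatever the stand-in generator produces, the masked pixels in result stay exactly as rendered, which is why the spelling survives.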

As you can see, the results are pretty amazing, and we are able to get the A.I. to create some really cool looking images. The best part is that the text is readable 99% of the time! The only times it's not readable are when the text is too small, the font is too thin, or you ask the A.I. to do something that is just way too complicated, but we are working on improving this all the time.

Here are some examples of the images we have created using this method: