Let’s Peek Inside an AI’s Mind! 🧠 The Secrets of the Data Generative AI “Learns” From

Summary of this Article ✨

Hey everyone! You’ve probably been hearing a lot about “Generative AI” lately, and it’s truly amazing, isn’t it? It can create essays or draw beautiful pictures in the blink of an eye, almost like magic.

But where does all that “smartness” come from? 🤔 The secret lies in the massive amount of “data” that the AI learns from.

In this article, we’ll take a fun journey together to find out what exactly AI is “eating” to get so smart, how up-to-date its knowledge is, and whether our data is safe. Let’s go behind the scenes of AI!

What Does AI “Eat” to Get Smarter? 📚

The training data, which is like “food” for AI, can be broadly classified into three types.

Information from the Internet

A Giant Library!

This includes publicly available information from all over the world, like websites, news articles, and blogs. For example, a massive dataset called Common Crawl is often used.

Licensed Data

Special Textbooks!

This is data that specific companies or organizations have given permission (a license) to use. It’s useful for learning specialized knowledge or high-quality text.

User Data

Conversations with Everyone!

The conversations we have with AI may be used to improve its performance. To respect your privacy, many services let you change your settings so that your conversations aren’t used for training!

Different Models, Different Specialties? 🎨

Even though we just say “AI,” the training data varies depending on the company that develops it.
This is what creates each AI’s unique “personality” and “strengths.”

The Honor Student All-Rounder 📖 (OpenAI)

OpenAI, famous for ChatGPT, trains its models on a wide range of data, including information from the internet and books. The recent GPT-4o is a multimodal AI that can understand not just text but also images and audio, making it even more versatile.

The Popular Kid Who Knows Social Media Trends 😎 (Meta)

Meta, the company that runs Facebook and Instagram, includes public social media posts in the training data for its AI, “Llama.” This might be why it’s so good at more natural, human-like conversations. However, this has also led to debates about its data sources.

The Open-Source Genius Painter 🖼️ (Stability AI)

This is the company known for the image generation AI “Stable Diffusion.” It was primarily trained on subsets of LAION-5B, a dataset of about 5.85 billion image-text pairs collected from the internet. While famous for using an open dataset, it has also faced issues with inappropriate images being included.

The Secretive, Solitary Artist 🤫 (Midjourney)

Midjourney is popular for generating incredibly beautiful images. But what data it’s trained on is mostly a secret. This has sparked major debates over copyright issues, with questions like, “Did it train on artists’ work without permission?”

AI’s Knowledge Has an “Expiration Date”!? 📅

AI might seem like it knows everything, but its knowledge is actually frozen at a specific point in time.
This is called a “knowledge cutoff.” Let’s look at it on a timeline!

January 2022

GPT-3.5

The model behind the original ChatGPT, which caused such a sensation. Its knowledge stops around this date (the launch version’s cutoff was even earlier, September 2021).

April 2023

GPT-4 / Gemini Pro

Smarter models, but their knowledge still stops around this time. (For GPT-4, that means the later GPT-4 Turbo update; the original GPT-4 stopped at September 2021.)

December 2023

Llama 3

Meta’s model also updated its knowledge!

2024 and beyond 🚀

GPT-4o / Gemini (Latest)

Finally, a weakness overcome! These latest models can now search the internet when needed to provide you with real-time information!

Key Point 💡

Even an AI with a “knowledge cutoff” can become a kind of hybrid: by adding a real-time search feature, it combines static knowledge (memory) with up-to-date information (search)!
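Curious what “memory plus search” might look like in code? Here is a tiny Python sketch. Everything in it is made up for illustration: the cutoff date, the function names, and the simple “is the topic newer than the cutoff?” rule are all assumptions. Real AI systems decide when to search in far more sophisticated ways.

```python
from datetime import date

# Hypothetical cutoff, chosen only for this example.
KNOWLEDGE_CUTOFF = date(2023, 4, 30)


def answer_from_memory(question: str) -> str:
    """Stand-in for the model answering from its training data."""
    return f"[memory] {question}"


def answer_with_search(question: str) -> str:
    """Stand-in for a web search step followed by an answer."""
    return f"[search] {question}"


def answer(question: str, topic_date: date) -> str:
    """Use search only when the topic is newer than the knowledge cutoff."""
    if topic_date > KNOWLEDGE_CUTOFF:
        return answer_with_search(question)
    return answer_from_memory(question)
```

For instance, a question about something that happened in July 2024 would go down the search path, while a question about 2020 would be answered from memory.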

“Don’t Learn This!” AI’s Filter Function 🗑️

The internet has good information and bad information, right?
Developers are working hard to apply “filters” so that AI doesn’t learn strange things. Let’s check with a quiz!

🤔 Quiz Time!

Does AI learn all information from the internet (including personal info and discriminatory language) as is?

A. No, that’s not the case!

Developers apply various filters to remove things like hate speech and personal information from the training data. They also work hard to reduce bias so that AI doesn’t develop prejudiced views. These filters aren’t perfect, though, and challenges remain.
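To make the idea of a “filter” a bit more concrete, here is a toy Python sketch. It is only an illustration under simple assumptions, not how any particular company does it: the regular expressions, the blocklist words, and the function names are all hypothetical, and real pipelines rely on much more robust detectors and classifiers.

```python
import re

# Toy patterns for illustration only; real detectors are far more robust.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3,4}[-.\s]\d{4}\b")

# Hypothetical blocklist; a stand-in for a real toxicity classifier.
BLOCKLIST = {"badword1", "badword2"}


def scrub_personal_info(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def is_acceptable(text: str) -> bool:
    """Reject a document if any of its words is on the blocklist."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return words.isdisjoint(BLOCKLIST)


def filter_corpus(docs: list[str]) -> list[str]:
    """Drop blocked documents, then scrub personal info from the rest."""
    return [scrub_personal_info(d) for d in docs if is_acceptable(d)]
```

Notice the two steps: whole documents containing blocked words are thrown away, while the remaining ones are kept with personal details masked. A simple word list like this misses a lot, which is one reason filtering is never perfect.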

Is Our Data Safe? 🛡️

“Is my conversation with AI used for training?” “Who owns the copyright of AI-generated art?”
Let’s explore these questions about privacy and copyright in a chat!

Aoi: Hey, Dr. AI! The stuff I write in ChatGPT, can other people see it or is it used for training without my permission? I’m a little worried… 😥

Dr. AI: Great question, Aoi! Many AI services let the user choose what happens.

Dr. AI: For example, OpenAI and Google provide a setting that lets you opt out (refuse), telling them “don’t use my data for model training.” On business plans, data is usually excluded from training by default.

Aoi: I see, so I can change the settings! But what about copyright? If I ask Midjourney to create something in the style of a famous artist, is that okay?

Dr. AI: Mmm, that’s the hottest topic of debate around the world right now! 🔥

Dr. AI: AI development companies argue, “It’s the same as a human learning from various works of art; this is fair use.” On the other hand, artists are filing lawsuits, claiming, “They’re copying our work for profit without permission!” This is a very difficult problem that no one has the right answer to yet.

Today’s Summary 📝

AI gets smart by “eating” massive amounts of data from the internet, licensed sources, and user data.
The type of data an AI eats determines its personality (specialty), like OpenAI’s GPT or Meta’s Llama.
AI’s knowledge has an “expiration date,” but the latest models can now get real-time information by searching the internet.
Developers try to filter out harmful data, but it’s not perfect and challenges remain.
For privacy, you can often opt out of data usage in the settings. As for copyright, it’s currently a major global debate.

Is the inside of an AI’s mind a little clearer now? Understanding what AI learns from and the rules it operates by is very important for us to get along well with it in the future. We can’t take our eyes off the evolution of AI! ✨