My journey into the depths of AI: From data to evaluation
In my previous article, I discussed the irreplaceable value of human experts and the challenge we face as many knowledge bearers retire. I proposed that rather than simply replacing human expertise with AI, we should seek optimal ways to combine the strengths of both. Here, I’ll share my technical journey into understanding AI systems from the inside out.
I developed a roadmap for myself. Looking at the major elements of an artificial intelligence system, a number of questions arise:
- How do I get at the data and information, how do I process it, and how do I make its quality good enough that the subsequent steps succeed?
- What do I actually want to generate from the collected data and content, why and for whom?
- How can I find out whether what is generated meets our quality standards, is output consistently enough to be relied upon, and is technically and ethically traceable?
Building from scratch: The data journey begins
Of course, I could have relied on various frameworks. For Retrieval Augmented Generation (RAG), for example, I could have used LlamaIndex, for agents CrewAI or Microsoft Autogen, or one of the visual solutions like N8N. This approach builds on my earlier work building practical AI tools for real business needs, where I similarly prioritized understanding the underlying mechanisms rather than using off-the-shelf solutions. But it was important to me to build it all myself. I really wanted to understand what’s behind it, what’s tricky and where problems can arise.
So I approached it like this: I transferred various projects I had already worked on – for example, automated crawling and parsing of websites and PDFs, and converting them into machine-readable text and data – into my own blueprint framework. And so it continued: Deeper and deeper. My ambition was to be able to extract texts from all websites (many operators make this difficult, either intentionally or unintentionally) and then store these texts with meta information in a database.
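To make that step concrete, here is a minimal sketch of the crawl-and-parse idea. The library choices (requests and BeautifulSoup) and the record fields are my illustrations, not the exact setup of my own framework:

```python
# Minimal sketch of fetching a page and turning it into machine-readable
# text plus meta information. requests and BeautifulSoup are illustrative
# tool choices; the framework described in the text is custom-built.
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timezone

def fetch_page_as_record(url: str) -> dict:
    """Download a page and return its plain text with basic metadata."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    # Drop markup that carries no content.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()

    return {
        "url": url,
        "title": soup.title.string.strip() if soup.title and soup.title.string else "",
        "text": soup.get_text(separator="\n", strip=True),
        "fetched_at": datetime.now(timezone.utc).isoformat(),
    }

# record = fetch_page_as_record("https://example.com")
```

In practice, the hard part is not this happy path but all the sites that block or obscure their content, which is exactly where it gets tricky.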
In order to access the knowledge in a targeted way later, the website text or documents such as Word, Excel, CSV, PPT or PDF files are divided into very small pieces. You can imagine it as if you were cutting up five A4 pages with scissors. However, it must always remain clear which original collection the snippets belong to.
The challenge of chunking
This alone is a big challenge: how do I ensure that the connection between the snippets is not lost? You have to work with overlapping snippets and think carefully about how small they should be. Are a few paragraphs enough, or does it have to be at sentence level, perhaps even semantic, by saying, “Okay, semantically a new topic begins here”? AI experts call this process “chunking”. This chunking process is conceptually similar to how I’ve explored reverse content creation through AI conversations, where breaking down and reassembling information creates new perspectives.
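Here is a minimal sketch of the overlap idea, assuming plain text input. The chunk size and overlap values are illustrative, not the ones tuned in my framework:

```python
# Overlapping chunking: each snippet shares some text with its neighbour so
# that context is not lost at the cut, and remembers where it came from.
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list[dict]:
    """Split text into overlapping snippets that keep their source position."""
    chunks = []
    step = chunk_size - overlap  # must stay positive: overlap < chunk_size
    for start in range(0, len(text), step):
        snippet = text[start:start + chunk_size]
        if not snippet.strip():
            continue
        chunks.append({
            "text": snippet,
            "start": start,                # position in the original document
            "end": start + len(snippet),
        })
    return chunks
```

A semantic chunker would instead cut where the topic changes, which is far harder to get right than this fixed-size version.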
Next comes “meta-enrichment”: once you’ve collected all the data, made it machine-readable and broken it into snippets, you still need to capture what each snippet belongs to and what it’s about. For that, you summarise meta information with the help of artificial intelligence.
At the beginning, this seemed too time-consuming to me. But by now I’ve understood that everything here has a purpose. If you enrich each snippet with AI, you can attach a summary and perhaps even ten questions that the snippet answers. This later helps to identify the right snippet for a search query. The idea behind it is fascinating: you can check which of the attached questions is closest in content to the user’s question and then serve the matching answer. This makes retrieval faster and the answers better.
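A hedged sketch of what such an enrichment step could look like. The OpenAI client and the model name are illustrative assumptions; any capable model, including a local one, could fill this role, and the prompt wording is mine, not the one from my framework:

```python
# Sketch of meta-enrichment: attach a summary and candidate questions to a chunk.
# Model name and prompt are illustrative choices, not a recommendation.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def enrich_chunk(chunk: dict) -> dict:
    prompt = (
        "Summarise the following snippet in two sentences, then list ten questions "
        "that the snippet answers, one per line:\n\n" + chunk["text"]
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # low temperature: we want stable, factual meta information
    )
    chunk["meta_enrichment"] = response.choices[0].message.content
    return chunk
```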
The power of embeddings
And now we come to “embedding”: if I want my search queries to find not only keywords but also related content that answers my question or task, then it’s worth setting aside the world of traditional databases and looking at vector databases. This is about “vectorising” each chunk, i.e. each text snippet.
At this point, I don’t want to go into too much detail, because that’s another rabbit hole. Just for a rough understanding: the contents of the snippets are mapped onto vectors. The result is a kind of fingerprint for each chunk, which can then be compared with other snippets. The finer and more detailed the fingerprint (the larger the embedding dimension), the more precise the later comparisons become, but the longer they take.
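The fingerprint comparison itself is surprisingly small. Here is the core of it, with toy vectors standing in for real embeddings, which have hundreds or thousands of dimensions:

```python
# Cosine similarity: how close two chunk "fingerprints" point in the same direction.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

chunk_a = np.array([0.12, -0.40, 0.88, 0.05])  # toy embedding of one snippet
chunk_b = np.array([0.10, -0.35, 0.90, 0.00])  # toy embedding of another snippet
print(cosine_similarity(chunk_a, chunk_b))     # close to 1.0 means "semantically similar"
```

A vector database essentially does this comparison very quickly across millions of chunks.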
The next step is querying the information. There are various options: a web chat agent, an intelligent email robot, or a direct connection to another system via an API. Many possibilities unfold here for which there isn’t really a “best practice” yet, and new solutions are constantly emerging.
An example of the complexity: Which provider (OpenAI, Google, Anthropic, or preferably local on my own system)? Which model (Gemini 2.5, Sonnet, GPT-4o, …)? What creativity temperature do I set? How should I search the vector database and the other database? How many hits do I pass to the prompt for generation? That’s a science in itself.
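To show how many knobs there really are, here they are gathered in one place. Every value is an illustrative assumption, not a recommendation; in practice each one gets tuned during evaluation:

```python
# The tuning knobs of a RAG pipeline, collected in a single config object.
from dataclasses import dataclass

@dataclass
class RagConfig:
    provider: str = "openai"         # or "google", "anthropic", or a local model
    model: str = "gpt-4o"            # which model generates the final answer
    temperature: float = 0.2         # lower = more deterministic answers
    embedding_dimension: int = 1536  # fingerprint size in the vector database
    top_k: int = 5                   # how many retrieved hits feed the prompt
    use_keyword_search: bool = True  # combine vector search with classic search

config = RagConfig()
```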
The art of prompting
The topic of prompting also comes into play here. If I’m planning a chat application, for example, I have to consider what personality my chatbot has – that’s the system prompt. And what prompts can I expect from the user? Depending on how I set the system prompt, the result is completely different again, because in the system prompt I give clear instructions: how a request is processed, how and what information is searched for, and how it’s then turned into an answer. This builds on my earlier exploration of ChatGPT communication strategies, where I discovered how defining clear interaction rules significantly improves AI performance.
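As an example of what I mean, here is a hedged sketch of how a system prompt might pin down the chatbot’s personality and its rules for using retrieved snippets. The wording is my illustration, not the prompt from my actual framework:

```python
# A sketch of a system prompt plus the message structure sent to the model.
SYSTEM_PROMPT = """You are a friendly assistant for our internal knowledge base.
Answer only from the retrieved snippets provided in the context.
If the snippets do not contain the answer, say so instead of guessing.
Always cite the source document of every snippet you use."""

def build_messages(user_question: str, retrieved_snippets: list[str]) -> list[dict]:
    """Combine the system prompt, the retrieved context and the user's question."""
    context = "\n\n".join(retrieved_snippets)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_question}"},
    ]
```

Change two lines in that system prompt and the whole behaviour of the application shifts, which is exactly why it needs testing.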
That, roughly, is RAG – a nice example of insane complexity, even though it could all look completely different tomorrow. One thing should be clear, though: a lot can go wrong in such an AI system. And that’s why we finally come to the secret star: testing, or “evaluation”.
The three critical components
And now people finally come back into play. The first part – the provision and processing of data – is super important for every project. Here we need people with clear heads who decide what is actually needed and which data and information are relevant and current.
Of course, you can already prepare this a little with artificial intelligence. But experts are urgently needed who can assess what is missing and what is needed. I find the first part particularly exciting because it seems boring from the outside. It’s work that isn’t particularly fun. But that’s precisely why it’s interesting, because if you do it right once, you’ll have much more success in the long term.
The middle part – generating one thing from another, transforming content, or finding a needle in a virtual haystack and producing something from it, i.e. everything that also runs through ChatGPT on a small scale – is in incredible motion. This is where all the small projects happen, with many developers calling out, like first-graders, “Hey, look what I can do!”
Here we’re also moving largely in the realm of feasibility studies, where people simply test what’s possible, especially in combinations of sound, image, text and so on. In the last six months alone, I have the feeling that if you picked a model and started something a few months ago, it’s all different again now. The area is sexy, but so fast-moving that I actually prefer to stay away from it, although of course I use it too.
The critical role of evaluation
It’s the third part that particularly appeals to me: evaluation. This also sounds very boring. But from personal experience, I can say that it’s incredibly important to test even perfectly planned and trained systems thoroughly. These evaluation challenges matter all the more as AI systems like Google’s AI-generated search summaries reshape entire industries, making rigorous testing and assessment essential. What I find interesting is that every decision made beforehand leads to a result that needs to be tested and, based on that, adjusted. I now understand where the levers are.
As an evaluator, though, I would ultimately only be interested in the result that reaches the person who is supposed to do something with it. And it’s not just about the result always being consistent and of high quality; it must also be understandable and transparent how and where the underlying data comes from. That’s the topic of responsible artificial intelligence.
Finding the golden answer
I’ve already played around a bit with evaluation, but it’s the field I clearly want to move into, because we definitely need people here. Just briefly – I’ll go into it in more depth in later articles – it’s always about the golden answer. That means I can play through the output for a user scenario a hundred times and see whether the answers differ. I can test this against a one hundred percent golden answer that I’ve worked out together with an expert. Then you can see how to keep improving your prompt and your data basis, so that you really get consistent – perhaps even verifiable – results.
But to get there, you have to work closely with experts and get input from them. Probably the whole company has to participate and help improve the answers every day for 10 to 15 minutes. You can do this, for example, by doing a test run, getting 100 answers and having everyone read through ten of them and click a thumbs up or thumbs down.
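To make the golden-answer idea tangible, here is a minimal sketch: run the same scenario many times and measure how close each output comes to the expert-approved answer. The `generate_answer` function stands in for the whole RAG pipeline described above, and the similarity measure is deliberately simple for illustration; real evaluations use richer metrics and human review:

```python
# Golden-answer evaluation sketch: repeat a scenario and score consistency.
from difflib import SequenceMatcher
from statistics import mean

def similarity(a: str, b: str) -> float:
    """Crude textual similarity between a generated answer and the golden answer."""
    return SequenceMatcher(None, a, b).ratio()

def evaluate_scenario(generate_answer, question: str, golden_answer: str, runs: int = 100) -> dict:
    scores = [similarity(generate_answer(question), golden_answer) for _ in range(runs)]
    return {
        "mean_score": mean(scores),
        "worst_score": min(scores),  # consistency matters: the worst run counts too
        "runs": runs,
    }
```

The thumbs-up/thumbs-down round described above then adds the human judgement that no automatic score can replace.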
The future of AI evaluation
What’s interesting to me about evaluation is that I really only look at the output and can then give feedback, together with the people involved earlier, on where it still falters and what needs to change. At the same time, I could imagine that evaluation shouldn’t only come at the end, but already at the beginning, when you’re thinking about and planning something: how can you really test it in depth and document what flows in, what comes out and that it worked? That’s also part of evaluation. So I’d like to be there right at the beginning too, but for now I see my focus at the end, with evaluation. And that’s where I’m moving now.
In conclusion, my journey through the technical depths of AI has brought me full circle – back to the importance of human expertise. While AI systems can process vast amounts of data and generate impressive outputs, it’s the human touch in evaluation that ensures these systems deliver consistent, trustworthy, and truly useful results. The future belongs not to AI alone, but to the thoughtful integration of human wisdom and machine capability.
tl;dr: To better understand how AI can augment retiring human experts, I built an AI system (like RAG) from scratch, focusing on data processing, chunking, embeddings, and prompting. This hands-on approach highlighted the system’s complexity. While data handling and AI generation are important, the most critical part is Evaluation. Human experts are essential here to define quality, test outputs against “golden answers,” ensure consistency, and make AI trustworthy and transparent.