Public AI tools like ChatGPT are powerful aids for content creation. But as with any technology, there are limits, and we are starting to feel them: from the quality of the output to legal risks such as copyright infringement. So, what now? Welcome to the fascinating world of private AI: artificial intelligence trained only on your own data. We have explored this in depth over the past six months and are happy to share our findings.
This article is a translation of an article on private AI that I wrote for FrankWatching.com, a leading Dutch marketing blog.
The challenges of public AI
Artificial intelligence has now established itself as a powerful tool for content creation. But traditional AI also brings challenges:
- Outdated data: most AI systems work with datasets that age quickly. For example, a model trained on data from September 2021 misses more recent developments, which can lead to outdated or irrelevant content.
- Dilution by low-quality information: public models like the ones behind ChatGPT are trained on a very large share of the public Internet, so much of the training data is of low quality. Because the output tends toward an average of all that data, you often end up with inauthentic content.
- Quality concerns: the well-known phrase “rubbish in, rubbish out” is very applicable to artificial intelligence. If an AI system is fed low-quality data, it is likely to produce low-quality output.
- Copyright risks: without clear source attribution or transparency, AI can generate content that infringes copyright. This can expose companies to legal risks.
- Black box: many people consider traditional AI a “black box,” because it is not always clear how decisions are made. This lack of transparency can create trust issues.
The solution: private AI
That’s where private AI comes in. To address many of these issues, you can set up your own private AI trained only on your own data. With real-time training data and at least 1,000 articles per topic, you can produce authentic and accurate content.
With open source components, you can build a transparent private AI solution. It is even possible to trace from the output which original sources led to the generated piece of content. After much experimentation, we have arrived at private AI pilots that meet the following specifications and functionality:
Real-time training data
In addition to the original training data, the model can be re-trained daily on newly published content, if desired. As a result, you always generate content based on up-to-date information, which is very relevant for news, for example. Which content you do or do not use for re-training is a strategic choice per use case. For example, you can decide not to use the AI-generated articles for re-training, but to include the content that was largely written by hand.
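The re-training choice described above can be sketched as a simple filter. This is a minimal illustration, not our production pipeline: the `Article` class, the `origin` labels and the `select_for_retraining` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    title: str
    published: date
    origin: str  # hypothetical label: "manual", "ai_assisted" or "ai_generated"


def select_for_retraining(articles, since, allowed_origins=("manual", "ai_assisted")):
    """Keep articles published on or after `since` whose origin is allowed."""
    return [
        a for a in articles
        if a.published >= since and a.origin in allowed_origins
    ]


articles = [
    Article("Market update", date(2023, 11, 2), "manual"),
    Article("Auto summary", date(2023, 11, 2), "ai_generated"),
    Article("Old background piece", date(2023, 10, 1), "manual"),
]

# Daily batch: only new content that was (largely) written by hand.
batch = select_for_retraining(articles, since=date(2023, 11, 1))
print([a.title for a in batch])
```

Keeping the policy in one parameter (`allowed_origins`) makes the strategic choice explicit and easy to change per use case.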
Targeted training
Instead of overwhelming the AI with all available information, you train it on specific, factual input.
Transparency for sources and SEO on steroids
Users can see exactly which part of which article led to a particular output. Because the content is traceable, a well-designed private AI offers the chance to take your SEO to the next level: if you know which sources on your own site(s) led to the new content, you can link from those original articles to the new articles. This makes your private AI an extremely powerful tool for building strong, relevant internal links.
Authentic AI is a fitting solution for striking the fine balance between quality and quantity in content creation. It answers your questions based on your own carefully built archive.
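To make the internal-linking idea concrete, here is a minimal sketch. It assumes the private AI returns, for each generated article, the list of source URLs that contributed to it. The function name, the URL paths and the dictionary layout are hypothetical.

```python
def internal_link_suggestions(new_article_url, source_urls):
    """Suggest one link from each distinct source article to the new article."""
    seen = set()
    suggestions = []
    for src in source_urls:
        # Skip self-links and sources that were already handled.
        if src == new_article_url or src in seen:
            continue
        seen.add(src)
        suggestions.append({"from": src, "to": new_article_url})
    return suggestions


# Hypothetical attribution: the same source can contribute several passages.
sources = [
    "/archive/heat-pumps-explained",
    "/archive/energy-labels",
    "/archive/heat-pumps-explained",
]
links = internal_link_suggestions("/articles/heat-pump-buying-guide", sources)
for link in links:
    print(link["from"], "->", link["to"])
```

An editor or a CMS plugin could then review these suggestions before the links are actually placed.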
In practice: setting up private AI
Here is an in-depth look at how to set up private AI in practice and the building blocks involved.
Building Block 1. Training data
- Training with your own archive: by training the AI with your own archive, the output can be tailored specifically to your corporate culture, terminology and style guide. This ensures seamless integration with existing content.
- Training with external information: this involves feeding the AI with data from reliable external sources, such as scientific articles, professional journals or even other sources that are your property. This expands the AI’s knowledge base and allows it to cover a wider range of topics.
- Real-time retraining: this means continuously updating the AI model with new information. While this keeps the model's knowledge up to date, it is essential to be careful. Overtraining or training with inaccurate data can lead to “hallucinations” or inaccurate output.
Building Block 2. Content formulas – powerful tools for scalable output
- Customized content formulas: private AI allows you to use customized formulas that can answer specific questions or “prompts” at scale. By using formulas fed from a database, large-scale prompts can be answered efficiently.
- Prompt Query Language: a specially developed language for posing precise queries to both private and public AI and for combining multiple prompts into in-depth long-reads. This can be enormously useful for complex data requests or for generating specific content formats.
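As an illustration of a content formula fed from a database, the sketch below fills one prompt template with several rows. It uses plain Python string templates and is not the Prompt Query Language itself. The formula text, the field names and the rows are made up for the example.

```python
from string import Template

# Hypothetical content formula with three placeholders.
FORMULA = Template(
    "Write a $length-word introduction to $topic for $audience, "
    "using only our own archive."
)

# In practice these rows would come from a database query.
rows = [
    {"topic": "heat pumps", "audience": "homeowners", "length": 150},
    {"topic": "solar panels", "audience": "landlords", "length": 200},
]


def expand_formula(formula, rows):
    """Return one prompt per row; raises KeyError if a field is missing."""
    return [formula.substitute(row) for row in rows]


prompts = expand_formula(FORMULA, rows)
for prompt in prompts:
    print(prompt)
```

Using `substitute` rather than `safe_substitute` means an incomplete database row fails loudly instead of silently producing a prompt with a gap.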
Building Block 3. Feeding Prompts
Once you’ve trained your model and settled on various content formulas, it’s time to put your model and formulas to work. This can be done manually, article by article. But with a smart setup, you can also do this at scale, for example by using databases that already exist behind your website(s). For inspiration:
- Example company website: use company-specific databases to provide relevant and contextual information to the AI.
- Comparing companies: using various data sources, the AI can compare companies on various parameters such as revenue, location, demographics, etc.
- Example recipe site: the feed can consist of all kinds of data sets, such as demographic information, customer feedback, sales figures, etc.
- Searches as a feed: by entering common or trending searches, the AI can better respond to current needs and queries.
With private AI, the power of advanced content creation and data processing is in your hands. Whether it’s personalized content strategies or in-depth data analysis, with the right building blocks and training, your AI model can work wonders.
Our experience with private AI: reflection on 100 days since GO LIVE
After 100 days of intensive work with private AI, we gained a range of exciting experiences and insights. Our collaborative projects ranged from partnerships with commercial organizations to demos with publishers. Here’s an overview of what we learned:
Pay attention to input quality
The “rubbish in, rubbish out” principle proved true time and again. The quality of the training data largely determines the quality of the outcome.
Avoid intertwining opinions and facts
When we trained a model with 5,000 articles from a newspaper, it generally produced well-researched, objective articles. But one particular paragraph was unexpectedly opinionated and activist in nature. This was because we had included opinion articles in our training. Now we either adjust our training process to exclude such articles or we make sure that specific training data is not used in specific questions.
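The second option, keeping specific training data out of specific questions, could look roughly like the sketch below. It assumes every article carries a genre label. The genre names, the question types and the `sources_for` function are hypothetical.

```python
# Hypothetical policy: which genres may inform which type of question.
ALLOWED_GENRES = {
    "news_report": {"news", "background"},
    "commentary": {"news", "background", "opinion"},
}


def sources_for(question_type, articles):
    """Return only the articles whose genre is allowed for this question type."""
    allowed = ALLOWED_GENRES[question_type]
    return [a for a in articles if a["genre"] in allowed]


articles = [
    {"title": "Council approves budget", "genre": "news"},
    {"title": "Why the budget is a mistake", "genre": "opinion"},
    {"title": "How municipal budgets work", "genre": "background"},
]

# Opinion pieces are left out when the question asks for an objective report.
report_sources = sources_for("news_report", articles)
print([a["title"] for a in report_sources])
```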
The challenge of writing prompts
Writing good prompts is an art in itself. Regardless of whether you are working with public or private AI: without significant experience, you will not easily reach a desired outcome. As a guideline, it’s advisable to have more than 100 hours of experience with prompts or work with an expert who can help write effective prompts.
Volume of training data
We have found that you need between 1,000 and 2,500 articles per topic for optimal results. So far we have experimented with archives of up to 100,000 articles, and we are confident that our current approach can handle up to 1,000,000 articles.
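A quick way to check an archive against this guideline is to count articles per topic. The sketch below assumes each article already has a single topic label. The status names and the `coverage_report` function are illustrative.

```python
from collections import Counter

# Thresholds taken from the guideline above.
MIN_ARTICLES = 1_000
MAX_ARTICLES = 2_500


def coverage_report(topic_labels):
    """Map each topic to (article count, status) based on the guideline."""
    report = {}
    for topic, count in Counter(topic_labels).items():
        if count < MIN_ARTICLES:
            status = "too few"
        elif count <= MAX_ARTICLES:
            status = "in range"
        else:
            status = "more than needed"
        report[topic] = (count, status)
    return report


# Hypothetical archive: one topic label per article.
labels = ["energy"] * 1200 + ["housing"] * 300 + ["mobility"] * 4000
report = coverage_report(labels)
for topic, (count, status) in report.items():
    print(f"{topic}: {count} articles ({status})")
```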
Pay attention to balance between new and old
For example, if you are a newspaper and you want to embrace modern writing styles but at the same time make use of your historical archives, it is essential to keep this balance in mind when designing your private AI.
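One simple way to make this balance explicit is to fix the share of recent articles in the training set. The sketch below is only an assumption about how that could be done: the 70/30 split, the function name and the sample data are made up and would need tuning per archive.

```python
def balanced_mix(recent, historical, total, recent_share=0.7):
    """Build a training set of at most `total` articles with a fixed share of recent ones."""
    n_recent = min(len(recent), round(total * recent_share))
    n_historical = min(len(historical), total - n_recent)
    return recent[:n_recent] + historical[:n_historical]


# Hypothetical article identifiers, newest first within each list.
recent = [f"2023-{i}" for i in range(1, 21)]
historical = [f"1995-{i}" for i in range(1, 21)]

training_set = balanced_mix(recent, historical, total=10)
print(training_set)
```

A higher `recent_share` pushes the output toward the modern writing style, while a lower one lets the historical archive weigh more heavily.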
Provide diversity in content
It is crucial to have multiple content formulas or formats. Repeatedly generating one type of content can lead to monotony.
Working with private AI is something you learn by doing. It has helped us understand how powerful, but also how nuanced AI-based content creation can be. We look forward to further experiments in this area with end users and content agencies.
Private AI as the next step
Private AI, in contrast to public AI, marks a significant leap into a future where content creation is more authentic, nuanced and efficient, while mitigating many of the risks of public AI. This technology is inherently customizable. Each application is unique, given the variation in training data, formulas and prompts. As a result, it requires a combination of technical expertise and strategic content marketing insight. This, coupled with the fact that a lot of trial and error is involved, means that private AI is still a vast, unexplored area, brimming with opportunity.
The header image was generated with DALL-E 3 by Sebastiaan van der Lans.
In addition, earlier this year I recorded a podcast with Joost de Valk (founder of Yoast SEO) on AI and search.