How to Get Your Brand in ChatGPT’s Training Data

If you’re a new or lesser-known brand in 2025, how do you compete in AI search? 

One way is to rank in Bing. Roughly 40% of the time, SearchGPT is activated to supplement ChatGPT’s innate knowledge. SearchGPT generally reformulates your conversational query into a search query and pulls in relevant webpages from Bing’s index.

But what about the other ~60% of the time?

In the majority of ChatGPT conversations, a user’s query is answered using ChatGPT’s innate knowledge alone. That innate knowledge is built from the data sources used in pre-training and the model tuning done in post-training.

How does a new or lesser-known brand ensure that it’s part of that innate knowledge? That’s the question this post will answer.

Pre-Reading & Helpful Resources

The following resources will help you better understand the world of AI Search and Seer’s POV on how to capitalize on this rare opportunity for channel expansion:

Caveats and Disclaimers

  • We are essentially trying to peek inside the black box that is LLM training data. Some developers, such as Meta with Llama, have been very open about what their training data includes. OpenAI was fairly transparent about its training data for GPT-3, but much less so for GPT-4. In fact, Stanford’s Foundation Model Transparency Index rated it a 0/10 in transparency.
  • This post focuses on training data for OpenAI’s ChatGPT, but we believe the guidance can be applied directionally to all LLM pre-training data sources.
  • This post focuses on pre-training data sources that marketers can reasonably influence. Data sources too far out of our reach (such as inclusion in published books) are omitted.
  • Understanding the inclusions and weighting of training data is a critical piece of understanding AI Search. This effort is akin to trying to reverse engineer Google’s algorithm. While we generally believe there isn’t value in overindexing on hypotheticals, we can also agree that sharing insights, learnings, and beliefs can help the industry grow as a whole.

In short: SEOs haven’t seen a worthy competitor to Google in decades. IF there is a reasonably achievable playbook for optimizing AI Search, and IF we can follow that playbook to drive increased visibility, THEN we potentially position the brands bold enough to invest in testing and learning with moats that will be high effort to attack and low effort to defend.

 


 

Brand Mentions are the New Links

The first point to understand is that AI Search is built upon a different system than traditional search engines. No known weight is given to hyperlinks or anchor text. Instead, our goal is to ensure our brand is mentioned in priority sources next to the words that describe our products, services, and audience needs.

Align on standard phrases, blurbs, and paragraphs that sum up your brand. It’s likely that consistency in phrasing within these data sources is as important as their presence.
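
One way to audit that consistency is to check whether your agreed blurb appears in each place your brand is described. The following is a minimal Python sketch, not an established tool. The brand name, blurb, and source texts are hypothetical, and treating "consistent" as an exact match after case and whitespace normalization is our assumption.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def blurb_coverage(blurb: str, sources: dict[str, str]) -> dict[str, bool]:
    """Report, per source, whether the standard blurb appears in its text."""
    needle = normalize(blurb)
    return {name: needle in normalize(body) for name, body in sources.items()}


# Hypothetical brand blurb and hypothetical copies of where the brand is described.
BLURB = "Acme Analytics is a reporting platform for independent retailers."
SOURCES = {
    "homepage": "Acme Analytics is a reporting platform for independent "
                "retailers. Start a free trial.",
    "press_release": "ACME ANALYTICS is a reporting\nplatform for independent "
                     "retailers.",
    "directory_listing": "Acme Analytics builds dashboards for small shops.",
}

coverage = blurb_coverage(BLURB, SOURCES)
for name, consistent in coverage.items():
    status = "matches blurb" if consistent else "uses different wording"
    print(f"{name}: {status}")
```

Sources flagged as using different wording are candidates for an update to the standard phrasing.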

Don’t Expect Overnight Success

We’ve been spoiled in recent years by the speed of modern search engines. New content can be crawled, indexed, and served within hours.

Optimizing for an LLM’s training data will be different.

Training data will continue to be a critical component of LLMs for years to come. But updates to training data will likely come only as new frontier models are released. In other words, brands that seek to be referenced within the innate training data of an LLM must prepare to wait months or even years to be included in this dataset.

While household brands will continue to grow their digital footprint, new and lesser-known brands will need to operate with more focus and discipline to catch up.

 


 

 


 

What data sources are most important to optimize to be referenced in AI Search?

OpenAI hasn’t explicitly shared the inclusions or weighting of GPT-4’s training data. The following sources are compiled through Seer’s testing and research. We will update this list as we continue to research this space.

Tier 1: Critical Data Sources

  • Wikipedia. Ensure your brand has a well-referenced Wikipedia page that follows Wikipedia’s notability guidelines. Use citations from reputable news sources to support your entry.
  • OpenAI Publisher Partners. OpenAI licenses content from specific news organizations, meaning articles from these sources will likely be directly included in future training datasets. Ensure your PR team is aware of the heightened importance of coverage on these sources.

 

[Image: Top OpenAI Publisher Partners]

  • Owned Websites. As we’ve stated since September 2023, we believe you should allow LLMs to scrape your website content. This website content should be accessible to bots, and include factual, descriptive, well-structured content. 
Tip: Up-to-date content is important. If your content is undated or more than a year old, prioritize updates accordingly.

  • Press Releases. This is especially important for lesser-known brands that need to build awareness. Investing in a service that will widely distribute news about your brand and leadership is key.
Tip: For brands with limited PR resources, this could be the most achievable way to influence the second source on our list (OpenAI Publisher Partners).
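
For the Owned Websites recommendation above, a quick check is whether your robots.txt actually allows AI crawlers to fetch your pages. The following is a minimal sketch using Python's standard-library `urllib.robotparser`. GPTBot is the user agent OpenAI documents for its crawler. The sample robots.txt, the example.com URL, and the choice of which agents to test are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def crawler_access(robots_txt: str, user_agents: list[str], url: str) -> dict[str, bool]:
    """Return, per user agent, whether robots.txt permits fetching the URL."""
    parser = RobotFileParser()
    # Parse the rules from a string so the check needs no network access.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in user_agents}


# Hypothetical robots.txt that blocks OpenAI's crawler but allows everyone else.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

report = crawler_access(
    SAMPLE_ROBOTS,
    ["GPTBot", "Bingbot"],
    "https://example.com/about",
)
for agent, allowed in report.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

In this sample, GPTBot is blocked while Bingbot is allowed. To test a real site, paste in the contents of its own robots.txt.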

 



Tier 2: Important Data Sources

  • Reddit. Reddit content with at least 3 upvotes has been rumored to be included in GPT-4’s training data. Brands organically discussed on Reddit should impact ChatGPT’s innate knowledge of said brand. Further, brands mentioned in organic conversations about relevant products, services, and audience pain points should connect the dots for LLMs between brands and the topics they seek to be associated with.
Tip: Given Reddit’s growing importance in the search ecosystem, brands will do well to invest in community management and ensure there’s clear ownership of Reddit as a marketing channel.
  • Industry-Specific Publications. Publications that are frequently cited and engaged with are highly likely to be weighted strongly as part of web-based training data. While many of these examples will already be OpenAI Publisher Partners (the second Tier 1 source), there will be additional sources to pursue.
  • For example, a brand seeking to be associated with a financial solution should seek coverage in publications like Bloomberg, Financial Times, Forbes, and CNBC.
Tip: Unsure which publications are most relevant to your audience? Ask ChatGPT.
  • Substack, Medium, and Independent Publications. AI models train on high-quality, long-form content. Publishing in these spaces should build topical authority and relevance. Focusing on platforms with wide distribution is key: as future partnerships emerge, it would be logical for AI platforms to keep casting a wide net with the content they license.
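
The Reddit guidance above can be turned into a simple audit of exported thread data. This is a rough sketch under stated assumptions: the `Post` structure, the sample posts, and the brand and topic terms are hypothetical, no Reddit API is called, and the 3-upvote threshold is the rumored figure mentioned above, not a confirmed rule.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical record for one exported Reddit post or comment."""
    text: str
    upvotes: int


def qualifying_mentions(posts, brand, topics, min_upvotes=3):
    """Keep posts that meet the upvote threshold and mention the brand
    alongside at least one topic term. Returns (post, matched_topics) pairs."""
    brand_lower = brand.lower()
    hits = []
    for post in posts:
        text = post.text.lower()
        if post.upvotes < min_upvotes or brand_lower not in text:
            continue
        matched = [topic for topic in topics if topic.lower() in text]
        if matched:
            hits.append((post, matched))
    return hits


# Hypothetical sample data for illustration only.
POSTS = [
    Post("Switched to Acme Analytics for inventory reporting, love it", 12),
    Post("Acme Analytics inventory reporting is fine", 1),
    Post("Acme Analytics has a nice logo", 40),
    Post("Any good inventory reporting tools?", 8),
]

hits = qualifying_mentions(
    POSTS, "Acme Analytics", ["inventory reporting", "retail dashboards"]
)
for post, matched in hits:
    print(f"{post.upvotes} upvotes, topics {matched}: {post.text}")
```

Posts that mention the brand without a topic term, or that fall under the threshold, are skipped, which shows where conversations are not yet connecting the brand to its target topics.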


 

Tier 3: Emerging Data Sources

  • YouTube. LLMs have primarily focused their training data on text-based resources. In order to be multi-modal, these data sources must expand. One way to prepare is to create branded content on the second-largest search engine: YouTube. Ensure your content is well-structured with clear speech to facilitate automatic transcript indexing, and that you include descriptions, captions, and metadata.
Tip: While you’re getting your own branded video content abilities off the ground, partner with established channels and influencers to begin building a footprint for your brand on YouTube.

  • Podcasts. This is uncharted territory for LLMs to date, but common sense leads us here. Brands discussed on popular podcasts will likely become more visible in LLMs and in GEO (generative engine optimization). Consider distribution, and how platforms like ChatGPT may ultimately partner with platforms like Spotify, SiriusXM, and iHeart to ingest entire inventories and collections.

 


 

[Image: Important data sources to optimize to be referenced in AI Search]

 

In Conclusion

Brands that have invested in the above coverage should be well positioned to be in the consideration set for questions related to their products, services, and audience’s pain points. As an added bonus, these sources all happen to be tried-and-true marketing strategies that should better connect your audience to your brand.

 

 
