The Data Moat Principle
Part III: What to Do
Current AI capability tells you almost nothing about who wins long-term. Benchmarks, demos, parameter counts, context windows - they are all nearly useless for predicting the decade-long competition. The company leading today might be trailing in two years. The company that seems behind might have assets that compound into dominance.

What actually matters is data. Not data in the abstract sense of "AI models need lots of training data." Something more specific: proprietary data that a company generates or owns, that no competitor can access or replicate, and that compounds over time into an insurmountable advantage. This is the data moat principle.

You can hire a competitor's researchers, reproduce their techniques, and build similar infrastructure. You cannot conjure a decade of YouTube videos, a social network with 600 million users, or driving data from millions of vehicles. Once you understand this, you see the AI landscape completely differently.
Why Data Beats Everything Else
An AI model is essentially a compression of its training data. The model learns patterns from examples and then applies those patterns to new inputs. It can only produce outputs that are some combination or extrapolation of what it has seen during training. Better data in means better capabilities out. If you train a model on higher quality, more diverse, more current examples, the model will be more capable. Nearly every AI researcher would agree with this. It is no different from a human brain. Which human is better equipped to be a world-class scientist - one who has spent years studying science, or one who plays Fortnite 16 hours a day?

Now think about what this means for competition. If two companies have identical architectures and identical compute resources, the one with better training data will produce a better model. That is the fundamental reality. There is no way around it. The data determines the capability ceiling.

And data is much harder to replicate than anything else in AI. You can hire researchers from a competitor. You can read their papers and reproduce their techniques. Smart engineers can reverse-engineer architectures from published information. You can build similar infrastructure and train similar models on similar compute clusters. You cannot create a decade of YouTube videos. You cannot conjure a social network with 600 million users. You cannot retroactively collect the driving data from millions of vehicles. You cannot replicate search history from billions of queries.

Compute can be bought. Talent can be recruited. Algorithms get published. Data that a company has uniquely generated or uniquely has access to cannot be acquired on the open market. This is why data moats are the determining factor in long-term AI competition. Everything else is either purchasable or replicable, or both.
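The compression point can be made concrete with a toy sketch (all numbers here are invented for illustration, not drawn from any real model): fit a simple model to examples sampled only from a narrow range, and its error stays small inside that range but explodes outside it. A model can only extrapolate from what it has seen.

```python
# Toy illustration: a model only reflects its training data.
# We fit a least-squares line to points from y = x^2 on [0, 1]; inside
# the "training range" the fit is decent, far outside it the error explodes.
xs = [i / 10 for i in range(11)]   # training inputs, all within [0, 1]
ys = [x * x for x in xs]           # training targets

# Closed-form least-squares line y = a*x + b (stdlib only).
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

def err(x):
    """Absolute error of the fitted line against the true y = x^2."""
    return abs((a * x + b) - x * x)

print(f"error at x=0.5 (inside training range): {err(0.5):.3f}")
print(f"error at x=3.0 (far outside it):        {err(3.0):.3f}")
```

The gap between those two errors is the whole argument in miniature: capability is bounded by the coverage of the training data.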
This gives you a very good sense of where to hitch your ride - whether that means investments, places of employment, or model selection. In an AI world, the best AIs will be the ones with the best data and the best data moats.
The Snapshot Versus Trajectory Distinction
When you look at a benchmark score, you are looking at a snapshot. You are asking: what can this model do today? That is a reasonable question for deciding which tool to use for a task right now. It tells you nothing about what the model will be able to do in two years.

When you look at a data moat, you are looking at a trajectory. You are asking: what assets does this company have that will make its future models better? This tells you almost nothing about what the model does today. It tells you everything about where the company is going.

Most people are watching snapshots. They look at leaderboards and assume the company at the top will stay at the top. They watch product announcements and assume the most impressive demo indicates the most promising company. This is exactly backwards. The AI field is moving so fast that any capability advantage erodes within months. If one company launches a feature today, competitors will have similar features within six months. If one model scores higher on a benchmark today, the scores will equalize as everyone applies similar techniques.

But data moats compound. A company with unique data today will have more unique data tomorrow. Its models will improve faster because it has better signal to learn from. The gap widens over time rather than narrowing.

So when I look at an AI company and try to decide whether to invest, or which AI to use day to day, I barely look at what the current product does. I spend almost all my time on one question: what data do they own that no one else can access?
Applying the Framework
Let me apply this to the major players.
xAI: The Brain of the Musk Ecosystem

This is the intelligence layer for an integrated physical system that spans transportation, robotics, energy, and space infrastructure. That distinction matters enormously for evaluating its data moat. Consider the data sources xAI can access that no competitor can:

X (real-time human discourse): 600 million monthly active users generating real-time data about what humans think, argue about, and care about. Not months-old web crawls - the actual firehose of human consciousness as it happens.

Tesla fleet (physical-world navigation): Billions of miles of driving data, edge cases, and real-world scenarios. Useful for FSD, obviously, but also training data for understanding how AI systems should interact with the physical world more broadly.

Optimus (robotics manipulation): As Optimus deploys, every movement, every grasp, every interaction becomes training data for embodied AI. No other AI company has access to humanoid robotics data at scale.

Starlink (global connectivity patterns): Network data from millions of terminals worldwide, showing how information flows and where infrastructure bottlenecks exist.

And these data sources don't just add - they multiply. AI trained on X data can be applied to Tesla vehicle interactions. Robotics learning from Optimus informs how AI should reason about physical manipulation. The ecosystem creates compounding data advantages that standalone AI companies can't replicate.

Google's YouTube corpus gives them capability advantages that Veo 3 proved - when you can train on essentially all the video humanity has created, you build better video models. OpenAI and Anthropic have strong products but weaker data moats - their training data comes from the same sources everyone else can access.

But I need to be honest about xAI and examine my own potential biases here. xAI is newer and less proven than the others. Google has been doing AI research for over a decade. OpenAI has shipped multiple generations of products to hundreds of millions of users. Anthropic has some of the most respected safety researchers in the field, and they have my favorite model in Opus 4.6. xAI launched Grok and has been iterating fast, but they do not have the same track record of sustained execution just yet.

The X data asset is valuable, but it is also noisy. Most tweets are not high-signal training data. They are shitposts, arguments, bots, spam, and low-quality content - most of them from me. The signal-to-noise ratio is worse than YouTube, where people invest significant effort in creating content. Extracting the valuable patterns from X's firehose requires filtering and curation that is non-trivial.

Elon's attention is also genuinely split. He is running Tesla, SpaceX, X, Neuralink, The Boring Company, and xAI simultaneously. Each of these would be a full-time job for anyone else. His ability to context-switch and drive progress across all of them is remarkable, but there are only so many hours in a day. xAI does not get the same focus that Tesla or SpaceX got in their critical early years. But once they merge… that equation can change drastically.

That said, I still think the real-time data advantage is significant and underappreciated. The ability to train on current human discourse rather than months-old data is a qualitative difference that will matter more over time. And Elon has a track record of making these kinds of bold bets pay off. So what I want to focus on here is how you apply this filter to your own analysis.
The Investment Filter
How do you apply the data moat principle when evaluating AI companies for
investment?
I ask a series of questions:
What data does this company generate or control that no competitor can access?

This is the foundational question. If the answer is "nothing" or "the same data everyone else has," the company does not have a data moat.

A valid answer looks like: "They own a platform that generates real-time human behavior data that they alone can access." Or: "They have a fleet of devices collecting physical-world data." Or: "They control a content library that is essentially impossible to replicate."

An invalid answer looks like: "They have a lot of users." That is not a data moat unless those users are generating unique data that competitors cannot access elsewhere. This is OpenAI's strongest argument to date - they have entire conversations with people who share their deepest secrets while seeking advice. But you know who else has that? Google, across its Android and Gmail platforms. Apple, across its entire ecosystem. You see what I'm talking about?
Does this data compound over time?
A good data moat gets stronger the longer a company has it. More data leads
to better models leads to more users leads to more data. The flywheel should be self-reinforcing. If the data is static - if it does not grow or improve as the company operates - it is less valuable than data that compounds.
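The flywheel can be sketched as a toy simulation. The growth rates below are invented purely for illustration - they are not estimates of any real company - but the shape of the result is the point: a company whose data compounds pulls away from one whose data is static.

```python
# Toy flywheel: data -> model quality -> new users -> more data.
# All rates are made-up illustration, not estimates of real companies.
def simulate(years: int, flywheel: bool) -> list[float]:
    data = 100.0  # arbitrary units of proprietary data at year 0
    history = []
    for _ in range(years):
        quality = data ** 0.5                     # diminishing returns on data
        data += quality * 2.0 if flywheel else 0  # static data never grows
        history.append(data)
    return history

compounding = simulate(10, flywheel=True)
static = simulate(10, flywheel=False)
print(f"after 10 years: flywheel={compounding[-1]:.0f}, static={static[-1]:.0f}")
```

The static company ends exactly where it started; the flywheel company's gap over it widens every single year, which is what "self-reinforcing" means in practice.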
Can this data be replicated with enough money?
Some apparent moats can be overcome by well-funded competitors. If a
company has lots of data because they paid to license it, a competitor could pay to license similar data. That is not a moat. True data moats come from assets that cannot be purchased - user-generated content on proprietary platforms, proprietary sensor data from hardware deployments, historical accumulations that took decades to build.
Is the company organized to use its data effectively?
Having a data moat is necessary but not sufficient. A company has to be able
to actually leverage its assets. Organizational dysfunction, internal competition, or strategic confusion can prevent a company from capitalizing on data advantages.
What happens to the data moat in different scenarios?
Think about how regulatory changes, partnership dissolutions, or competitive
dynamics might affect the data advantage. A moat that depends on a partnership is less durable than one based on owned assets.
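The five questions above can be collapsed into a deliberately reductive checklist. The scoring below is my own illustrative sketch, not a rigorous valuation model - real analysis weighs these questions, it doesn't just count them - but it makes the filter mechanical enough to apply consistently.

```python
from dataclasses import dataclass

@dataclass
class MoatAssessment:
    """Reductive yes/no pass over the five filter questions."""
    exclusive_data: bool    # data no competitor can access?
    compounds: bool         # grows/improves as the company operates?
    money_proof: bool       # cannot be bought or licensed by rivals?
    organized: bool         # able to actually leverage the data?
    scenario_durable: bool  # survives regulation/partnership changes?

    def score(self) -> int:
        # Count the yes answers; 5 is a true data moat, 0-1 is none.
        return sum([self.exclusive_data, self.compounds, self.money_proof,
                    self.organized, self.scenario_durable])

# Hypothetical example: a well-run company whose data is merely licensed.
licensed_only = MoatAssessment(
    exclusive_data=False, compounds=False, money_proof=False,
    organized=True, scenario_durable=False,
)
print(licensed_only.score())  # a low score: no durable moat
```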
The Veo 3 Case Study
Let me spend some time on Veo 3 because it is such a clean example of the data moat principle in action.

When Google launched Veo 3, it was noticeably better than competing video generation models. Not marginally better - visibly, obviously superior. The quality gap surprised people who had expected the major labs to be roughly at parity.

Why was Veo 3 so much better? YouTube. That is the answer. Google did not invent a revolutionary new architecture. They did not have researchers that others lacked. They did not throw more compute at the problem than anyone else. Google trained Veo 3 on the YouTube corpus - billions of hours of video content with associated metadata, comments, descriptions, and engagement signals. No one else has access to this data. No one else can train on it.

When you train a video model on YouTube, you are training on essentially all the video that humans have ever created and organized. You are seeing not just the visual content but the human reaction to it - what people clicked on, how long they watched, what they commented, what they shared. That signal is irreplaceable. You cannot replicate it by scraping the open web. You cannot purchase it from data vendors. You cannot synthesize it.

The capability gap in Veo 3 is a direct manifestation of the data moat. Google had better training data, so they produced a better model. The advantage is not eroding over time - if anything, it is compounding as YouTube continues to accumulate more content. This is exactly what the data moat principle predicts. And it should inform how you think about every AI competition going forward.
What This Means for Your Investments
If you accept the data moat principle, it changes how you allocate capital in AI.

It means you should be skeptical of companies trading at high valuations based on current benchmark scores or current user counts. Those are trailing indicators that tell you about the past, not leading indicators that predict the future.

It means you should look for companies that are generating proprietary data at scale, even if their current products are less polished than their competitors'. The trajectory matters more than the snapshot.

It means you should be wary of companies whose advantages depend on partnerships, talent retention, or other assets that can be competed away. Only owned data moats provide durable competitive advantage.

If you apply the data moat principle consistently, you will reach different conclusions than the market consensus about which AI companies are undervalued and which are overpriced.
Beyond AI Companies
The data moat principle applies beyond pure-play AI companies. Any company that generates unique data at scale has potential value in the AI era. The question is whether they can capitalize on it.

Tesla is a car company that happens to generate the world's largest dataset of real-world driving video. That data is enormously valuable for training autonomous driving systems and robotics applications. The cars are just the mechanism for data collection.

Amazon has transaction data from hundreds of millions of purchases. They know what people buy, when they buy it, what they considered but did not buy, and how they react to recommendations. That is training signal for understanding consumer intent.

Healthcare companies have patient data that could train medical AI systems. Financial institutions have transaction patterns that could train fraud detection. Logistics companies have route optimization data.

The question for any of these companies is: can they build the organizational capability to leverage their data advantage? Many traditional companies have data moats but lack the AI expertise to use them. Many AI companies have expertise but lack data moats. The winners in the AI era might be companies that combine both - either traditional companies that develop AI capabilities, or AI companies that acquire unique data sources. Legacy companies will have to fight through The Innovator's Dilemma. New companies will have to fight through the Data Moat Principle.
The Durability Question
One fair challenge to the data moat principle is the question of durability. What if the importance of data diminishes over time? What if AI systems become capable of generating their own training data? What if synthetic data becomes good enough that proprietary real-world data matters less? These are legitimate scenarios to consider. I think they are unlikely to fully materialize, for a few reasons.

First, AI systems trained on synthetic data tend to exhibit subtle degradations that compound over generations. If you train a model on real data, then train a second model on the first model's outputs, then train a third model on the second model's outputs, the quality decays. This phenomenon, which researchers call model collapse, has been studied and appears fundamental. Real-world data has grounding in physical and social reality that synthetic data lacks. Training on the real thing produces better results than training on simulations of the real thing.
Second, even if synthetic data becomes viable for some capabilities, the companies with real data will be able to validate and calibrate their synthetic data generation. They will have an advantage in knowing what good synthetic data looks like, because they can compare it to the real thing.

Third, for domains that involve understanding human behavior - which covers most commercially valuable applications - human-generated data will always have primacy. You cannot synthesize authentic human expression. You can only observe it.

I could be wrong about this. The AI field is moving fast and surprises happen. But I think the data moat principle will remain valid for at least the next decade, which is the investment horizon that matters.
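A crude way to see the degradation-over-generations dynamic from the first point above is to treat each round of "training on your own outputs" as resampling from the previous generation. This is an analogy, not a model of actual training - real degradation is subtler - but the direction of travel is the same: diversity collapses.

```python
import random

random.seed(42)  # fixed seed so the toy run is reproducible

# Toy stand-in for "training on your own outputs": each generation is a
# bootstrap resample of the previous one. Rare examples get dropped and
# never come back, so the pool of distinct examples only shrinks.
data = list(range(100))  # generation 0: 100 distinct "real" examples

unique_counts = [len(set(data))]
for _ in range(10):
    data = [random.choice(data) for _ in data]  # resample with replacement
    unique_counts.append(len(set(data)))

print("distinct examples per generation:", unique_counts)
```

Every generation can only keep examples the previous generation still had, so the count of distinct examples is monotonically non-increasing - once real-world signal is lost, no amount of further self-training recovers it.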
Bringing It Together
The data moat principle is straightforward: in the AI era, whoever owns the data wins. Current capability tells you about today. Data moats tell you about the trajectory.

When you evaluate AI companies, run them through the five-question filter. What data do they own that no one else can access? Does that data compound over time? Can competitors replicate it with enough money? Is the company organized to use it effectively? How durable is the moat under different scenarios?

Apply this filter consistently and you will see the AI landscape very differently than the consensus view. This transformation is fundamentally about which companies are accumulating the assets that will define the next era - and data moats are those assets. Who builds the best products today is almost beside the point. Understanding this principle positions you for the abundance side of the fork in the road.
If you understand this principle and apply it to your decisions, you will be
positioned very differently than most people watching this space. And I think you will be positioned correctly.