At least once a week I have a pitch meeting with a Founder who says “We’re going to collect a ton of data, it’s going to be really valuable. We’re going to build a data network effect.”
Data is not inherently valuable. Most data doesn’t produce a real data network effect, and most data network effects aren’t that powerful even once established. Consider the idea that your data strategy, in terms of value creation, is overrated. It can be powerful, but it’s typically not.
I know that’s a sobering thought. But in order to build a lasting business, we need to understand the realities and nuances of what makes data valuable.
Let’s lay out what’s exciting and real about data defensibilities.
Data network effects can be built and data can be used to create defensibility. So how do you actually do it? There are at least three broad approaches to using data to give your product an advantage.
A data network effect is when a product’s value grows as a result of more usage via the accretion of data.
This is the most valuable type of defensibility you can build with data, and it’s much rarer than most realize.
One of the best examples of a real data network effect in the consumer world is Waze. It has 6 instructive elements that you should look for in your own products.
Let’s break these 6 elements down.
To have data network effects, your product doesn’t need to capture data automatically, but it sure helps.
CoStar is an example of a company that has a data network effect without collecting it automatically. They aggregate and sell pricing data for commercial real estate by having hundreds of people call to collect the data from tens of thousands of commercial real estate agents in the US. CoStar has a market cap over $20B.
Abacus is another example that started with phone calls and faxes to aggregate near real-time data about sales in the catalog retail space. Eventually, they sold for $1.7B.
Note that both of these companies collect and sell data that is constantly changing and old data is not very valuable. They are on a “data treadmill”, and so is Waze. But that treadmill gives them defensibility from competitors.
In general, however, you want to try to design your product to collect data automatically.
This is important because real network effects are flywheels that lead to runaway value creation. Best not to constrain that flywheel with manual effort in your operations. Waze, for example, has very little lag time between users “writing” to the Waze traffic database and other users “reading” from that database in creating their own routes.
It’s not enough to capture data if that data doesn’t then result in improved value for existing users, thereby completing the positive feedback loop that drives a data network effect.
It’s easy to collect data to make product improvements or find one or two key insights from a data set. These add value, but it’s not a network effect. The difference is that these product improvements are usually a manual and periodic process instead of being automatic and continuous. Product improvements are therefore an indirect and sporadic result of increased usage, and they get harder to find over time. With a true data network effect, you see direct, constant increases in product value with new usage and thus new data.
As a Waze user, for example, you continuously benefit from the data being uploaded by all the other users to get a more complete, real-time picture of the traffic within any given geography, impacting your route algorithm on a minute-to-minute basis.
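The write-then-read loop described above can be sketched in a few lines. This is a toy illustration, not how Waze actually works: the `TrafficStore` class, the segment names, the speeds, and the 50 km/h default are all hypothetical. The point is only that one user's automatic "write" changes what another user "reads" on their very next route, with no manual step in between.

```python
from collections import defaultdict
from statistics import mean


class TrafficStore:
    """Toy shared store: every user's report updates what every other user reads."""

    def __init__(self, default_speed_kmh=50.0):
        # Hypothetical fallback speed used when a segment has no reports yet.
        self.default_speed_kmh = default_speed_kmh
        self.reports = defaultdict(list)  # segment name -> reported speeds (km/h)

    def write(self, segment, speed_kmh):
        # Captured automatically as a side effect of someone driving the segment.
        self.reports[segment].append(speed_kmh)

    def read(self, segment):
        # Average of all reports, or the default guess if nobody has reported.
        speeds = self.reports.get(segment)
        return mean(speeds) if speeds else self.default_speed_kmh


def eta_minutes(store, route):
    """Estimate travel time for a route given as (segment, length_km) pairs."""
    return sum(60.0 * length_km / store.read(segment) for segment, length_km in route)


store = TrafficStore()
route = [("main_st", 5.0), ("bridge", 10.0)]

before = eta_minutes(store, route)  # no reports yet: both segments use the default
store.write("bridge", 20.0)         # another driver hits a jam on the bridge
store.write("bridge", 30.0)
after = eta_minutes(store, route)   # the same route now reflects the jam

print(before, after)
```

With these made-up numbers the estimate for the same route changes as soon as other drivers report the jam. A product improvement that needed an analyst to look at the data first would not have this property.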
How much data do you need to make your product valuable to users? How hard is it to get that dataset — or a dataset like it? If you need a lot of data that is hard to get, that creates a barrier to others competing with you in the future.
But note that most Founders overestimate how unique and valuable their dataset is. I did while running Tickle (story below). It’s often the case that the value of the data you have isn’t as high as you’re hoping, or that competitors can get a similar, but different, dataset to produce similar value in their product.
Similarly, most Founders fail to realize that a competitor can pitch similar value to users despite actually having less data.
So double-check that you have a real threshold defensibility: dataset size, value, and uniqueness.
Unfortunately, even if you have a data network effect, in most cases the value of that data tends to asymptote quickly. A bigger corpus of data has diminishing returns for the value being provided to the customer.
A typical example of where we see Founders needing more help understanding asymptotic data value is in the medical area. Many companies get data on 2,500 cases of a disease type and can use it to create a better algorithm for diagnosis.
Accumulating data on 20,000 cases improves the accuracy or speed of the diagnosis only a few percentage points relative to only having 2,500, which leaves room for competitors to claim the same value with only 2,500 cases. This is not much of a defensibility and thus not a powerful network effect.
Another simple example of asymptotic data value is Yelp reviews. The 50th review of a restaurant adds less marginal value than the 1st or 10th review. What keeps Yelp valuable are other defensibilities like discoverability on Google, brand, etc.
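One way to make the review example concrete is to treat each review as an independent noisy sample of a restaurant's "true" rating. That is a simplifying assumption of ours, as is the 1-star spread between reviewers; real reviews are neither independent nor equally informative. Under that assumption, the uncertainty in the average rating shrinks with the square root of the number of reviews, so each additional review tightens the estimate less than the one before it. Since the very first review has nothing to compare against, the sketch starts from the 2nd.

```python
import math


def standard_error(n_reviews, rating_std=1.0):
    """Standard error of the average rating after n independent reviews."""
    return rating_std / math.sqrt(n_reviews)


def marginal_gain(n_reviews, rating_std=1.0):
    """How much the nth review tightens the estimate compared with n - 1 reviews."""
    if n_reviews < 2:
        raise ValueError("need at least one earlier review to compare against")
    return standard_error(n_reviews - 1, rating_std) - standard_error(n_reviews, rating_std)


# rating_std=1.0 (one star of disagreement between reviewers) is an assumed value.
for n in (2, 10, 50):
    print(n, round(marginal_gain(n), 4))
```

Under these assumptions the 10th review tightens the estimate more than ten times as much as the 50th does, which is the asymptote in miniature.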
The threshold at which this asymptote occurs varies. Some applications of machine learning, such as search engines, require enormous datasets. Catching up to Google’s lead in search data would be a daunting prospect for a competitor, so their data network effect imposes a significant barrier to entry, which is one reason why they’ve been dominant in search for so long.
Such examples are rare, though, and Founders tend to overestimate the point of diminishing returns for data in their own market category. The asymptote is usually not that high: thousands or even hundreds of data points are often enough to get 90% of the product experience value.
The most reliable exceptions to this are in market categories where real-time data is valuable, like with Waze. Real-time data network effects require a constant feed of data, which means that bigger networks of users have a big advantage over smaller networks. The corpus of data ages so quickly that it doesn’t have time to hit the point of diminishing returns — new data is always valuable.
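A rough way to see why real-time data behaves differently is to assume each data point loses value as it ages. The sketch below uses an exponential decay with a 10-minute half-life; both the decay shape and the half-life are our assumptions for illustration, not measurements from any real product. Under that model an archive of old reports is worth almost nothing, and the useful stock of data at any moment is set by how fast new reports arrive.

```python
import math


def fresh_value(report_ages_min, half_life_min=10.0):
    """Total value of a set of reports when each one halves in value every half-life."""
    return sum(0.5 ** (age / half_life_min) for age in report_ages_min)


def steady_state_value(reports_per_min, half_life_min=10.0):
    """Long-run value held by a network receiving a constant stream of reports.

    This is the integral of rate * 0.5 ** (t / half_life) over all ages t >= 0,
    which works out to rate * half_life / ln(2).
    """
    return reports_per_min * half_life_min / math.log(2)


# An hour-old report retains under 2% of the value of a brand-new one.
print(fresh_value([0]), fresh_value([60]))

# Hypothetical feed rates: a small network vs. one with ten times the users.
small = steady_state_value(reports_per_min=1)
large = steady_state_value(reports_per_min=10)
print(large / small)
```

In this model the larger network's lead never gets competed away by a rival's accumulated history, because history decays. The only way to match it is to match its feed of new data.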
Netflix is a great example to highlight this point. They made a big deal of their data on your watching habits and their data-driven algorithm to match you with the right next movie to watch. They pitched it as a unique and defensible technology that should make Netflix your clear choice. But it turns out that the value of their discovery engine is not central to their value.
What’s central to their value is simply expanding their content library to encompass the movies and TV shows that most people want to watch. Netflix’s core product value, it turns out, is content licensing and content production. Recently, they have even dropped the long tail of titles they used to match you to. The data network effect they thought they had turned out to be peripheral to the real value of the product.
The lack of defensible network effect at Netflix is why the streaming wars have seen significant market share gain by new entrants like Disney+. Netflix’s market share of streaming services fell from 91% in 2007 to 19% today. And it’s why Netflix spent more than $15 billion on proprietary content development last year.
Peripheral data network effects like recommendation engines at Netflix are not powerful.
Can your user or customer perceive the value to them that comes from your use of data?
They can with a product like Waze, and that will keep people using Waze, demonstrating a real network effect.
But go back to the medical example above where this isn’t as powerful. A competitor can build a website claiming they have the same accuracy as you even with less data. They can put it on a PowerPoint. They can say it in a pitch to the hospital.
Thus, the customer may not be able to perceive the difference — or value the difference — between you and the competitor. Your data advantage is not an effective advantage in the market, even if it is an advantage. It’s not doing the work of making you more defensible.
What if you can’t get a real network effect in your business? The next best thing is probably a data scale advantage.
Yelp is actually a good example here. They have a scale advantage of covering the most restaurants and local businesses. Thus, you know when you use Yelp, you are most likely to find what you’re looking for.
Amazon is another example. They have a data scale advantage in the breadth of the product database.
Do I get more value because you looked at Yelp or bought something from Amazon? Not really. That’s why it’s not a network effect. But the scale of data in those companies certainly makes it hard for a new competitor to challenge them.
The reason why data scale is weaker than a true data network effect shouldn’t surprise anyone who’s familiar with what we’ve written about the value of scale vs. network effects in the past. Scale effects are linear and asymptote quickly, while network effects are nonlinear and create increasing returns.
Data scale can also be useful to help you use ML to find unique insights that create value for your customers.
But be careful with this.
Many Founders have a fantasy that data will be an ongoing insight engine — that as they collect more and more data, they will get more and more insights that will give their startup an ongoing advantage.
In practice, most of the valuable insights you will glean from a scaled-up dataset in a particular domain will come early on. More data just confirms those insights. Further, the number of insights that will matter enough for customers to pay you or choose you over other providers is typically small. So data scale can give you these early 1-3 key insights, but you typically need to develop other aspects of your defensibility to build an important company.
Data embedding is the idea that the more of your clients’/users’ data you hold, the harder you are to remove, and that gives you some defensibility.
As an example, if you’re a SaaS company, let’s say the more data you have on your client’s activities, the better your product will customize itself to their needs. And the more clients you have, the better your algorithms will be for all clients.
If Founders squint at this, they often think this is a network effect.
It’s not really a network effect, because the value the clients get from your working with the other clients is typically a small part of the value you provide and is not the reason clients would stay with you. The greater part of your value is that the more data you hold for them embedded in their operations, the harder it would be to remove you. That gives you defensibility. It’s effective, but it’s not a network effect.
Why is it effective? Mostly because you’re capturing and owning the customers’ data, and they can’t quit you.
In the early 2000s, I was Founder/CEO of a website called Tickle where we built a mother lode of unique data. We did it by putting up 450 self-assessment tests written by our PhDs, which received responses from 150 million users who answered 24 billion questions about themselves.
We were convinced we were sitting on a goldmine of psychometric and demographic data. After all, it was a unique and large structured dataset unlike any before it.
We went to Hallmark, Ford, CapitalOne and other companies with huge advertising budgets, assuming they’d be interested in our treasure trove.
Unfortunately, even those sophisticated users of data didn’t see much value in it. We heard things like, “we don’t have any way of using that data,” or “we just don’t have budget for that.” What we discovered is that a large company might be spending $3 billion on annual advertising, but only $30 million, about 1% of that, on research data. In the end, our data wasn’t monetizable.
Of course, we then tried using our data to make our own products better. After two years of trying, clever use of our data made our core products maybe 20% better. Unfortunately, 20% better — and even 40% better — doesn’t typically move the needle in growth or in defensibility against competitors. Working on other attributes of the products’ value made much more sense.
Realizing our error in believing the data would make the difference, we shifted to building raw traffic volume, grew our media tech revenues to $40M, and sold the company to Monster for $110M. It was an okay outcome, but below the scale we, and you, should be aiming for.
With AI/ML technologies making the leap in recent years, the seductive illusion of data has only grown. More Founders are falling prey to the idea we had at Tickle: the phantom value and defensibility of data. Collecting data is easy now, but turning that exhaust into value is not trivial. Further, in a world of abundant data, it’s getting harder to differentiate on the basis of data quantity alone.
Building a truly iconic company is about building true defensibility. Founders, don’t fall for the mirage of data value just because you’re able to collect a lot of it. Make sure you understand the methods for creating real value with your data.
Can you upgrade your defensibility with data? Can you get a data embedding, a data scale advantage, or — the holy grail — a true data network effect?