Publisher Data Leakage — is there an acceptable amount?
Imagine there were only two content sites on the open web. One was a high-quality auto review site where customers could figure out what car they wanted to buy (“Car Site”). The other site was mass-produced junky entertainment content, with all of its traffic coming from paid Facebook campaigns (“Junk Site”).
And let’s say there was one advertiser, Ford, with a budget of $1,000 a month and a maximum CPM of $3, because above that price they don’t meet their ROAS (return on ad spend) goals.
Let’s say Car Site gets 100,000 ad impressions per month and Junk Site gets 500,000. In a world with adtech and unlimited data leakage, assuming that the two sites had a complete overlap of users, Ford could buy the same users on Junk Site and Car Site, and effectively reduce the CPMs that Car Site earns.
Let’s do the math. If Ford’s budget is $1,000, they can buy all 600K ad impressions at a $1.67 CPM — amounting to $167 on Car Site and $833 on Junk Site — that’s a bad deal for Car Site, clearly. If Junk Site disappeared, Ford could pay the full $3 CPM on Car Site, but that means they would only spend $300 total, leaving $700 unspent. The $3 CPM for Car Site is clearly better than $1.67, but we’re also leaving money on the table.
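The arithmetic above can be sketched in a few lines of Python (the site names, impression counts, and budget are just the assumptions of this toy example):

```python
# Toy example: Ford's $1,000 budget across Car Site and Junk Site.
# CPM = cost per 1,000 impressions, so spend = impressions / 1000 * cpm.

budget = 1_000.0          # Ford's monthly budget ($)
max_cpm = 3.0             # Ford's ROAS-driven CPM cap ($)
car_site = 100_000        # Car Site impressions per month
junk_site = 500_000       # Junk Site impressions per month

# With unlimited data leakage, Ford buys all 600K impressions,
# so the effective CPM is whatever exhausts the budget.
total = car_site + junk_site
leaked_cpm = budget / (total / 1000)          # ~ $1.67
car_revenue = car_site / 1000 * leaked_cpm    # ~ $167 to Car Site
junk_revenue = junk_site / 1000 * leaked_cpm  # ~ $833 to Junk Site

# If Junk Site disappears, Ford pays its full $3 cap on Car Site only.
car_only_revenue = car_site / 1000 * max_cpm  # $300
unspent = budget - car_only_revenue           # $700 left on the table
```

Car Site earns more per impression without Junk Site in the picture, but most of Ford’s budget goes unspent — which is the tension the rest of this piece is about.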
Let’s change our example and replace Junk Site with another high-quality site (Computer Site, with great reviews of computers) and a second advertiser, Dell. Computer Site is bigger than Car Site and has 400,000 ad impressions. Dell has the same spending parameters ($3 max CPM, $1,000 budget) as Ford.
If the advertisers could only activate on the publisher data of Computer Site and Car Site on their respective sites, Ford would spend $300 on Car Site and Dell would spend $1,000 on Computer Site — buying 100,000 impressions on Car Site and 333,000 impressions on Computer Site.
If the advertisers could bid across both sites with data from both of them, Computer Site could actually make some more money (Car Site is sold out). Ford would be able to buy the remaining ~67,000 impressions on Computer Site at its $3 CPM, sending an incremental $200 that way.
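The two-advertiser scenario works out the same way (again, all names and numbers are just the example’s assumptions):

```python
# Toy example: Ford and Dell, each with a $1,000 budget and a $3 CPM cap.
budget = 1_000.0         # each advertiser's monthly budget ($)
max_cpm = 3.0            # each advertiser's CPM cap ($)
car_site = 100_000       # Car Site impressions/month
computer_site = 400_000  # Computer Site impressions/month

# Siloed data: each advertiser buys only its own contextual site.
ford_siloed = min(budget, car_site / 1000 * max_cpm)       # $300 (inventory-limited)
dell_siloed = min(budget, computer_site / 1000 * max_cpm)  # $1,000 (budget-limited)
dell_impressions = dell_siloed / max_cpm * 1000            # ~ 333,333

# Shared data: Ford can also buy Computer Site's leftover inventory.
leftover = computer_site - dell_impressions                # ~ 66,667 impressions
ford_incremental = leftover / 1000 * max_cpm               # ~ $200 extra to Computer Site
```

Note the asymmetry: Car Site is inventory-constrained while Dell is budget-constrained, so only cross-site bidding lets Ford’s leftover budget reach Computer Site’s leftover impressions.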
Obviously, the reality is more complicated, but the truth stands — if advertisers can bid across sites using some level of user data, publishers can make more money. But, just as obviously, this has been massively abused in today’s digital advertising world — and heavily to the disadvantage of good publishers.
Getting to reality
Some publishers think that by getting rid of adtech, so buyers can only reach audiences directly on their sites, their CPMs will increase — and that is true. For that reason, they view the death of third-party cookies and other privacy-related initiatives with glee: get rid of adtech, get rid of junk publishers, and they come out winners.
But while it might be true that some publishers will do better, the open web at large will do worse. Buyers in our example won’t pay CPMs above their cap, leaving lots of money on the table. In the real world, the situation is more complex, for several reasons:
- The digital advertising market is huge and dominated by walled gardens.
- Buyers want to reach users at multiple touchpoints and over time — not just when they are on one publisher’s site.
- Buyers need their advertising to be ROI-positive, so they’re not going to pay more than what they can make on an ad — fewer available impressions will drive CPMs up to a point, but overall it means buyers will spend less money.
Buyers will cut budgets and move money to the walled gardens — the money won’t flow to other good publishers and support great content creation.
For that reason, I believe some amount of publisher data leakage is actually better for the health of the open web. Not allowing buyers to reuse data in other contexts will force them to cut budgets and move money to the walled gardens.
Clearly, today’s model of unlimited data leakage is bad for premium publishers. But an open web that keeps data inside every large publisher, creating mini-walled gardens, is likely bad for the open web at large. It will help a small number of publishers (that includes CafeMedia and our publishers!) but reduce the total budgets for the open web. And the best publishers will still lose out on some money because they’ll have fewer available buyers for their ads (i.e., in our example above, Dell will never buy on Car Site).
Is there a happy medium?
It’s possible that a happy medium actually could be better for all premium content creators by making the web a more viable place for advertisers to spend their budgets.
The big question is, what is that happy medium? Some options include:
- Data from an originating domain can be tracked and reused for some short period of time (24 hours, let’s say)
- Data from an originating domain can be tracked and reused as long as the originator is compensated for the data usage
- Data from an originating domain can be reused across a group of authorized domains (e.g., publishers could agree to share data with other publishers they trusted)
What are some other options here?
But wait, doesn’t the death of cookies mean this doesn’t matter anyway?
You might think that with cookies going away in Chrome, this would spell the end of all of this data leakage anyway. But it doesn’t, for at least three reasons:
- Google’s FLoC proposal allows the browser to dynamically assemble cohorts of browsers that are similar to each other and assign each cohort a FLoC ID. The signals that the FLoC algorithm will take into account will almost entirely be based on browsing activity — effectively publisher data. For example, if a given user visits a lot of automotive review webpages in a short period of time, and this activity is found to be similar to that of a cohort of other browsers, they can all be assigned the same FLoC ID. To an outsider, it will be hard or impossible to discern which FLoCs mean what. But the companies best placed to gather enough data to understand which FLoCs are useful are adtech firms, with Google at the top of that list.
- Google’s TURTLEDOVE proposal gives advertisers the ability to assign users to cohorts on their own properties, but they can also ask publishers to assign cohorts for the advertiser to target against (read the “Browsers Joining Interest Groups” part of the TURTLEDOVE proposal). I think it’s a very safe assumption that many advertisers will take advantage of this to strong-arm publishers into dropping cohort-assignment tags everywhere in return for some amount of media spend — and that spend will not compensate publishers for the potential revenue lost as their data spreads all over the web.
- Identity solutions (like LiveRamp ATS/IDL, Trade Desk UID 2.0) will allow advertisers and adtech to thread authenticated users together across sites. In theory, this means that a given publisher’s most valuable users are actually trackable across sites, and data about them can leak across all authenticated sites.
There may be other ways within the Privacy Sandbox for data leakage to continue. The priority of the Sandbox proposals is, in order: 1) user privacy; and 2) advertiser use-cases. Publisher data leakage isn’t mentioned and appears not to have been considered, but it will continue to be an issue given the current state of the proposals.
Data leakage will live on — but can we reach a happy medium?
I’m curious to hear others’ opinions on all of this; share your comments below.