Meet the Potential Future of Third-Party Cookies – Google's Federated Learning of Cohorts API
Web advertising is a quintessential part of the internet ecosystem: it helps fund new content ideas and brings products and services closer to their target audience. But that ecosystem is threatened by the misuse of customer data. Users want assurance that their identity will remain anonymous and safe as they browse the internet.
Google introduced the Privacy Sandbox to replace third-party cookies with viable, privacy-first alternatives that deliver relevant ads without risking user privacy. Within it, Google has proposed several approaches covering the end-to-end digital advertising ecosystem, such as the Trust Token API.
To deliver more relevant ads in the new ecosystem, Google proposes an interest-based advertising approach code-named Federated Learning of Cohorts, or the FLoC API. It started with the idea that groups of people can replace individual identifiers, that is, third-party cookies.
What is Federated Learning of Cohorts?
Federated Learning of Cohorts, or FLoC, proposes a new way for businesses to reach prospects or customers based on their interests and browsing behavior. Users whose browsing histories show similar interests are clustered into a large group and assigned a cohort ID. As a result, each user's identity stays hidden among the other members of the group, and ads are targeted at the cohort rather than at an individual user.
So that all advertising providers can protect user privacy while delivering the most relevant ads, the following principles are defined:
- Cohort ID should prevent individual cross-site tracking
- Cohort should only consist of users with similar browsing behavior
- Since each provider has its own optimization function, cohort assignment should use an unsupervised algorithm
- The choice of algorithm parameters should be clear and easy to explain
- Computing an individual's cohort should require few system resources
What is a Clustering Algorithm?
Clustering algorithms are functions used by advertising providers, preferably run in the browser, that assign a cohort ID to a browser by grouping it with others that have a similar browsing history. Each candidate clustering algorithm is evaluated on the following factors:
Privacy: To ensure privacy, a large number of users should share the same cohort ID. The more users in a cohort, the harder it is to uniquely identify any one of them, and the easier it is to "hide in the crowd". The fewer users, the weaker that protection.
Utility: A large, diverse set of users within a cohort reduces the impact of the ads served to that cohort ID. The challenge is to keep each cohort large enough while its users still share the same interest segment.
Centralization: Does the algorithm require a centralized server to calculate the cohort ID?
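The privacy factor above can be made concrete with a small check. The "hide in the crowd" property is commonly called k-anonymity: every cohort must contain at least k users. The following is a minimal sketch, not part of any FLoC implementation; the function names, the toy assignments, and the threshold values are all assumptions made for illustration.

```python
from collections import Counter


def cohort_sizes(assignments):
    """Map each cohort ID to the number of users assigned to it."""
    return Counter(assignments.values())


def is_k_anonymous(assignments, k):
    """Return True if every cohort contains at least k users."""
    sizes = cohort_sizes(assignments)
    return bool(sizes) and min(sizes.values()) >= k


# Hypothetical toy data: user ID -> cohort ID.
assignments = {"u1": 0, "u2": 0, "u3": 0, "u4": 1, "u5": 1}

sizes = cohort_sizes(assignments)
print(dict(sizes))
print(is_k_anonymous(assignments, k=2))
print(is_k_anonymous(assignments, k=3))
```

With these toy assignments the smallest cohort has two users, so the check passes for k=2 and fails for k=3. A real deployment would use a far larger k; the value is a policy choice, not something this sketch determines.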
How are Interest Profiles created?
The input for each clustering algorithm is a user's browsing history, represented as a single vector in a d-dimensional space. A clustering operation then groups similar users together and assigns each group a cohort ID.
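One simple way to picture the "single vector in a d-dimensional space" is a binary vector with one dimension per known domain. The source does not say how the vector is actually built, so treat this as an illustrative sketch only: the function name, the vocabulary, and the example domains are all hypothetical.

```python
def history_to_vector(visited_domains, vocabulary):
    """Encode a browsing history as a binary vector over a fixed vocabulary.

    Position i is 1.0 if the user visited vocabulary[i], otherwise 0.0.
    Repeat visits are ignored in this simplified encoding.
    """
    visited = set(visited_domains)
    return [1.0 if domain in visited else 0.0 for domain in vocabulary]


# Hypothetical vocabulary of d = 4 domains.
vocabulary = ["news.example", "music.example", "movies.example", "shop.example"]

history = ["music.example", "shop.example", "music.example"]
vector = history_to_vector(history, vocabulary)
print(vector)  # [0.0, 1.0, 0.0, 1.0]
```

Other encodings, such as visit counts or weighted features, would fit the same description; the binary form is chosen here only because it is the easiest to read.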
For ease of discussion, I define an Interest Profile as the group of users who have been assigned the same cohort ID by the clustering algorithm.
To show how various clustering algorithms behave, here is an implementation on two publicly available datasets: the Million Song Dataset and MovieLens 25M. Each word cloud shows how frequently an interest appears among the users in a cohort.
Million Song Dataset
The MSD, or Million Song Dataset, is a collection of 1 million songs tagged with categories and user IDs. For each of the 650 thousand users, the data records which songs they listened to, how many times each song was played, and the categories each song belongs to. Among more than 200 thousand categories, the dataset also contains subjective tags such as "good" or "awesome".
Let's start by grouping users together at random. As you can see in the screenshot below, the resulting cohorts do not reflect any specific interest.
Applying the SimHash clustering algorithm to the same dataset produces much clearer definition between the cohorts. In the screenshot below, the second cohort is a group of users interested in 'metal' music, and the third cohort represents music everyone can dance to at a party.
Finally, we apply affinity clustering to the same dataset to obtain enhanced cohorts. Unlike the SimHash cohorts, these new cohorts are very well defined and smaller in size. The first and third cohorts each appear to center on a single genre, while the second cohort has multiple categories associated with it.
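For readers unfamiliar with SimHash, the general technique projects a vector onto a set of random hyperplanes and keeps one bit per hyperplane, depending on which side the vector falls. Vectors pointing in similar directions tend to share most bits, so the bit string can serve as a cohort ID. The sketch below illustrates that general idea and is not the code behind the screenshots; `make_hyperplanes`, `simhash`, the seed, and the 3-bit length are assumptions chosen for readability.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes in dim dimensions.

    A fixed seed means every browser would use the same hyperplanes,
    so identical vectors always map to the same cohort ID.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def simhash(vector, hyperplanes):
    """Return an integer cohort ID with one bit per hyperplane.

    Each bit is 1 if the vector lies on the positive side of the
    hyperplane (positive dot product), otherwise 0.
    """
    cohort_id = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        cohort_id = (cohort_id << 1) | (1 if dot > 0 else 0)
    return cohort_id


# Hypothetical 4-dimensional interest vectors.
planes = make_hyperplanes(num_bits=3, dim=4)
user_a = [0.0, 1.0, 0.0, 1.0]
user_b = [0.0, 2.0, 0.0, 2.0]  # same direction as user_a, larger magnitude

print(simhash(user_a, planes), simhash(user_b, planes))
```

Because only the sign of each projection matters, `user_a` and `user_b` receive the same ID. Note that nothing here guarantees a minimum cohort size; with more bits, cohorts become more specific but also smaller.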
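Affinity clustering is generally described as a bottom-up method in the style of Borůvka's minimum-spanning-tree algorithm: each cluster links to its nearest neighbor, linked clusters merge, and the process repeats. The sketch below shows only a single merge round on individual points, using Euclidean distance and a small union-find. It is a simplified illustration under those assumptions, not the procedure used to produce the cohorts described above, and all names in it are hypothetical.

```python
import math


def nearest_neighbor(i, points):
    """Return the index of the point closest to points[i], or None."""
    best, best_dist = None, math.inf
    for j, other in enumerate(points):
        if j == i:
            continue
        dist = math.dist(points[i], other)
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def affinity_round(points):
    """Run one Boruvka-style round: link each point to its nearest
    neighbor and return a cluster label for every point."""
    parent = list(range(len(points)))

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(points)):
        j = nearest_neighbor(i, points)
        if j is not None:
            parent[find(i)] = find(j)  # merge the two clusters

    return [find(i) for i in range(len(points))]


# Hypothetical 2-D interest vectors forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = affinity_round(points)
print(labels)
```

In this toy input the first two points end up in one cluster and the last two in another. A fuller version would repeat the round on the merged clusters and stop once cohorts reach a required minimum size.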
MovieLens 25M
MovieLens 25M is one of the most commonly used datasets for evaluating recommendation systems. It consists of 25 million ratings keyed by user ID, and each movie listed is associated with one of 20 movie genres.
Following the same method as for the previous dataset, random grouping of users does not produce cohorts that define any clear Interest Profiles.
Applying the SimHash clustering algorithm to the MovieLens 25M dataset produces a new set of cohorts, as detailed below. The first cluster maps clearly to a 'Children' movie interest profile, with titles such as 'Toy Story'. The remaining two clusters contain nearly the same interest profiles: 'Thriller', 'Drama', and 'Crime'.
Finally, with affinity clustering, the cohorts appear better defined than those produced by SimHash.
In conclusion, the future of web advertising rests on users' browsing history and on algorithms that assign each user a cohort ID that keeps their identity hidden in the crowd.