Brave Search's index

Hi,

Your videos and Henry’s Twitter seem to take Brave Search’s claims that they have their own index at face value, but this claim seems pretty questionable to me.

This blog post from Cliqz (the company that built Tailcat and was later bought by Brave and rebranded as Brave Search) shows the issue very clearly. Brave’s “index”, as of 2019, was nothing but a glorified cache with a basic added layer of custom ranking. As far as I know, in 2022, Brave still doesn’t have a crawler. A question regarding Brave’s crawler on their forums went unanswered, no Brave-related crawler appears on any web crawler list for system administrators, and no mention of a crawler is ever made in any Brave Search-related announcement. Brave has the Web Discovery Project, which is just the continuation of what is described in Cliqz’s blog: basing search results on Bing’s and “improving” the ranking based on observed user behaviour (I put “improving” in quotes because Bing already does the same, and such a strategy IMHO risks creating a bias in Brave Search, given that Brave’s userbase is far from being a statistically accurate representation of the average person). In the end, when Bing stops providing search results to Brave, their “index” will become obsolete in a few days because they have no way of making it discover new content on its own.

I consider Brave’s marketing to be pretty misleading, if not straight up a lie. They are not independent, and rely on Bing to get results for any content that they haven’t yet queried from Bing. I personally put them on the same level as Telegram’s claims about privacy/security.

Maybe I’m wrong and Brave has innovated a lot on Cliqz’s initial approach without talking about it (it would actually be amazing!) but I find it very unlikely. Brave Search’s budget is a rounding error compared to that of Bing/Google, yet their results have comparable accuracy. Mojeek has been in the game for more than a decade and their results are extremely bad, but they are really independent, which we can be fairly confident is true because they regularly post updates about their datacenter and the size of their index, and they have a crawler that can actually be observed by web admins.

If they have actually innovated and made search cheap, why aren’t they bragging about it in their blog like every other tech company does? They act like they invented a new color yet never show any proof (source code, high level description of the search engine, patents etc…).

If anyone has some more info on the validity of Brave Search’s claims I’d be happy to listen to you.

2 Likes

Firstly, there is a significant difference between indexing the web and crawling the web. Having search result information locally and ranking it is an index regardless of the source of that information. It is built on the backs of Google and Bing’s web crawling, which you could definitely argue is a flawed approach to building an independent search engine. I’ll explain why I don’t think this is a huge deal.

I believe this is a gross oversimplification of what is happening with the Web Discovery Project, because it is not gathering results based merely on user actions on Brave Search alone, as Bing does. The Web Discovery project has the advantage of running client-side in Brave Browser, and it is able to learn from Google or Bing’s results without any sort of partnership or agreement with them. This is the behavior described in Cliqz’s blog post as well, which is fundamentally different than entering into a search agreement with Bing or Google as DuckDuckGo or Startpage has.

Now, Brave has entered into a search agreement with Bing and Google. This was the quickest and easiest way to jumpstart their own index, but the Web Discovery Project is what is going to — in theory — allow them to be independent in a way DuckDuckGo or Startpage never could:

The Web Discovery Project is the “missing” crawler you’re looking for. It won’t appear in a sysadmin’s logs because it runs in Brave Browser itself, but the data it collects (search queries, URLs, and browsing information) is no different from what any other web crawler collects; it just runs in a decentralized manner rather than out of Brave’s datacenters. Brave’s user-base allows them to accomplish what projects like Mojeek or even YaCy could only dream of, by distributing that cost among all Brave users. This is how they’ve made search cheap.

I would argue that this is a non-issue. All search engines are biased, that is what “relevance” is and what sets them apart from each other. Brave Search returning results that are more relevant to Brave users is probably considered an advantage to its current user-base, and as its market-share grows it will learn to become more average anyways. Bing results are fairly bad, so clearly improvements have to be made for a competing search engine like Brave Search to become relevant.

I disagree with this point as well. Brave’s marketing doesn’t paint a picture of Brave Search being completely independent, in my opinion. They make it clear, not only in their documentation but on the results page itself, where your results are sourced from.

Finally, I completely disagree with this classification, which is what inspired me to write this reply in the first place. Telegram is built on fundamentally flawed cryptography and unsafe defaults that lull users into a false sense of security. It is one of the most harmful messaging platforms to privacy out there, in my opinion.

Even if you ignore the Web Discovery Project, and you consider Brave Search’s index to be a mere caching layer over Google and Bing, Brave Search provides at worst the same level of privacy, security, and independence as DuckDuckGo or Startpage. Using any of these three search engines is going to provide a net improvement to privacy for any of their users and should be encouraged. When companies like Mojeek build completely unusable search engines in the name of total independence, all it does is drive users back to Google, which is more harm than Brave Search is ever going to cause.

Anyone who knows me knows I am a pretty anti-Brave guy, but I really believe we can’t let perfect be the enemy of the good, or perhaps even the great in this case, as Brave Search has managed to provide a pretty stellar experience for its users right out of the gate.

6 Likes

Amazing breakdown!

Can 100% vouch for this :sweat_smile:

1 Like

Thanks for the detailed reply!

You’re right that I misunderstood what the Web Discovery Project was. I thought it was a way to opt in to that kind of tracking only for the pages you go to from Brave Search. I don’t know how I got it this wrong, thanks for correcting me.

I think it’s more complex than that. Crawling isn’t the only part of the pipeline. Even if it’s possible to put the costs of crawling on Brave’s users, indexing and ranking still need to be done on Brave’s servers and are going to be expensive (in both infrastructure and the R&D needed to make them fast and relevant). I wasn’t able to find numbers on the cost of crawling vs indexing/ranking for other engines, but my guess would be that indexing and ranking are much more expensive than crawling.

I agree, this is a nitpick more than a serious problem. It depends on opinions which will vary between people.

idk, I’ve been using DDG for a while and find the results to be often more relevant than Google’s (though YMMV, most of the searches I do are related to software development, which is where DDG shines for some reason).

I perceive their marketing (and I think the average non-tech person perceives similarly) as them claiming that if Google/Bing were to not renew their agreement they’d be able to respond to 90% of their queries, which would not be true.

I wasn’t clear about what I meant. I’m talking about their marketing, not about the technical aspects of Brave (Search) as a whole, or about the privacy of Brave Search. Brave Search is obviously much better than Google and Bing privacy-wise (and IMO on par with DDG), while Telegram is questionable even when compared to Messenger. What annoys me is the same pattern of “technically not wrong” facts that are conveniently cherry-picked to paint them as better than what they really achieve, by using terms for things those terms don’t usually refer to (“encryption” in the case of Telegram is used to talk about “client-server” encryption instead of E2EE, and “index” in the case of Brave is used to talk about their own really weird definition instead of a complete crawling + indexing + ranking pipeline).

100% true. I did not criticize the product itself or its privacy/security usefulness. My criticism goes 100% to their marketing. The ideas behind Brave Search are worth exploring, but putting a “90% independent” metric in the face of their users, most of whom will have a completely wrong idea about what it means, seems unethical to me.

I’m not advocating for ditching Brave Search or that you stop recommending it, but I do think that when talking about it, it’s worth mentioning that Brave’s definition of an “index” refers to something that is very different from the definition of “index” that people naturally think about. Other “private” search engines deserve their fair share of criticism, but they tend to get it more regularly than Brave Search does.

3 Likes

I agree 100% and actually said much the same thing on the Techlore Matrix/Discord.

Excerpt of conversation

My replies formatted like this. Some messages omitted for brevity.

Note that Brave Search is based on an independent index. However, for some queries, Brave can anonymously check our search results against third-party results, and mix them on the results page.
I have no idea whether they check Google or Bing, but I guarantee Brave Search will become unusable if they ever got cut off from Google/Bing

In reply:
they use other search engines to improve their own index, plus they show what percentage of search index independence they have

In reply:
There is literally no meaningful way to quantify “search index independence”. That’s just marketing BS.

In reply:
Why can’t you quanitfy it?
You can say how many results are ranked via your on system vs using googles

In reply to “Why can’t you quantify it”:
Does Brave have access to a complete copy of Google’s index so Brave can compute the number of shared results from both indeces over the total number of results in Google’s index?

In reply to “There is literally no meaningful way…”:
Wouldn’t that just be how many shown results come from their index, and how many come from google?

In reply to “Does Brave have access…”:
No, but they can say how many results were indexed by their own system, vs depending on google

In reply to “You can say how many results…”:
That’s not meaningful. That metric only holds for a single query, not overall. It’s not a measure of “search index independence”

In reply:
Any % of independance is better than 100% dependance

In reply to “There is literally no meaningful way…”:
If it’s just marketing then why do they admit that they sometimes use Google and Bing’s indexes for search results to complete the result better?

In reply to “Any % of independance…”:
That doesn’t make it a useful metric. That just lets them market a meaningless number

You could average it, or take the medium or smth
Over the last 1000 queries, we showed google results x% of the time, with a minimum of y and a maximum of z

In reply:
This is accurate, WJ just has a grudge against brave

In reply to “If it’s just marketing then why do they admit…”:
Your question makes no sense. The whole point of a “search index independence” number that’s not always 100% is to make it clear that some of your results didn’t come from Brave’s index but also that (hopefully) some of them did.

In reply to “This is accurate, WJ just has a grudge against brave”:
I have a grudge against Brave browser, not against their search engine. And I separate marketing BS from real data even for products I actively like.

In reply:
Its not BS. They’re showing how reliant they are on google vs themselves. Much better than a google or bing search front end

I want Brave Search to grow into as thorough an index as possible, because that is good for everybody. That doesn’t mean I will hesitate to point out an obvious bit of marketing BS.

In reply:
How is it BS??

Ppl hate brave for no reason

In reply to “Its not BS. They’re showing…”:
No, they’re not showing how reliant they are/aren’t on Google. They have no way to measure that without a carbon copy of Google’s entire index.

In reply:
They can know how often they pull results from google
Thats very measurable

It may be a statistic of the past x searches and a comparison of their results to Google, but it is not a measure of overall index independence.
Correlated metric perhaps. But it is not the metric it is named to be.

In reply:
You’re not making sense

In reply to “No, they’re not showing how reliant they are/aren’t…”:
It doesn’t matter how big Google’s index is. It only matters how much they use it. Bing isn’t reliant on Google

In reply:
Upvote
They can say they only use it 10% of the time

…And this is why so many statistics get misinterpreted to mean something they don’t.

In reply to “They can say they only use it 10% of the time”:
Saying they only use Google 10% of the time is, once again, a different metric from “search index independence”

In reply:
How so?

In reply to “Saying they only use Google 10% of the time…”:
If I get 10% of my income from Facebook, I’m 10% dependent on Facebook

In reply to “How so?”:
“Search index independence” can only be measured with a full copy of Google’s index. “Percentage of results taken from Google over the past 1000 searches” may be correlated but is not the same.

In reply:
How dependent is bing on google then? We’ll never know, we don’t have both of their indexes :pensive:

In reply to ’ “Search index independence” can only be measured…’:
They sound the same lol

In reply:
To describe the difference in statistician language, the latter (which Brave can measure) would enable Brave to compute a confidence interval for the former. It’s just like using random sampling in real life; you measure a sufficient sample size to estimate some value across the entire population that is unreasonable to measure directly.

If brave wanted they could drop google entirely, but they don’t because they think the comprise for better results is worth it at the moment
They could show as much or as little of google results as they want

In reply to “To describe the difference in statistician language…”:
Your the only one who can tell the difference. I don’t consider it “marketing bs” if the wording is slightly off but everyone understands what they mean

wait so what marketing bs does brave say

In reply:
“90% index independance” or smthn along those lines. It’s true, wj just doesnt like the wording

In reply to “Your [sic] the only one who can tell the difference. …”:
It bothers me disproportionately as someone passionate about accurate reporting of statistics, certainly. But technically it is wrong. If Brave stopped relying on other search engines altogether and claimed 100% index independence, that would be valid. But there is no way to measure index independence as long as it’s anything less than 100%.
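
To make the sampling point from the excerpt a bit more concrete, here is a minimal sketch of how a confidence interval could be computed from a sample of queries. Everything in it is an assumption for illustration: the function name `wilson_interval`, the numbers (900 out of 1000), and the idea that each sampled query can be cleanly labelled as “answered without third-party results” or not. It says nothing about how Brave actually computes its number. It uses the Wilson score interval, which is one standard way to put an interval around an observed proportion.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    successes: sampled queries answered without third-party results (hypothetical label)
    n:         total number of sampled queries
    z:         normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")

    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Made-up sample: 900 of 1000 sampled queries needed no third-party results.
low, high = wilson_interval(900, 1000)
print(f"observed 90.0%, 95% interval: {low:.1%} to {high:.1%}")
```

The interval only says something about the wider population of queries if the sample is reasonably random. “The last 1000 queries” is a convenience sample, so the observed percentage is an estimate of the underlying quantity, not the quantity itself.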

3 Likes

I can attest to that very well :sweat_smile: I think there were countless times where you were criticizing me or other people about it :slight_smile: god I hate these auto emotes, is there a way to turn them off?

  • conquodS

You could have set your forum name to conquodS if you were going to identify yourself anyway, you know :)

I know but gotta snag that rare username :crazy_face::crazy_face::crazy_face:

1 Like

I like seeing that this forum is being used well, thank you both!