Mwmbl: Free, open-source and non-profit search engine

162 points17 hoursmwmbl.org
illegally30 minutes ago

It's good for fun and learning, but I don't think it's practical... not even close to the functionality of search engines in the 90s

xenodium6 hours ago

If keen on some minor feedback (especially for mobile), you can likely cut down on landing page text:

From:

    MWMBL

    [Search on mwmbl...]

    Welcome to mwmbl, the free, open-source and non-profit search engine.

    You can start searching by using the search bar above!

    Find more on

    [Github] [Wiki]

To:

    MWMBL

    [Search on mwmbl...]

    A free, open-source and non-profit search engine.

    [Github] [Wiki]

mdaniel9 hours ago

I wondered if this approach would be feasible for a distributed crawler: https://github.com/mwmbl/mwmbl#crawling

Also, your own posting appears to be missing from the index: https://mwmbl.org/?q=mwmbl+ycombinator

(and, yes, another vote for changing the domain name; you can have a quirky project name, but if I can't remember the cat-walking-on-keyboard domain, I'm not going to use it)

marc_abonce9 hours ago

> We now have a distributed crawler that runs on our volunteers' machines! If you have Firefox you can help out by installing our extension.

This is a very interesting idea that other search engines have tried before. Actually, the Brave search engine is built on Cliqz[6], which implemented this same idea but *without* the user's consent.

Copy pasting from an old comment I made about this "human web" crawler idea:

Both PeARS[1] and Cliqz[2] tried to do that. Both got direct support from Mozilla[3][4] but it looks like neither really kicked off.

PeARS was meant to be installed voluntarily by users who would then choose to share their indexes only with those they personally trusted, so the idea is very privacy conscious but also very hard to scale.

Cliqz, on the other hand, apparently tried to work around that issue by having their add-on bundled by default in some Firefox installations[5] which was obviously very controversial because of its privacy and user consent implications.

I still think the idea has potential, though, even if it's in a more limited scope.

[1] https://github.com/PeARSearch/PeARS-orchard

[2] https://cliqz.com/en/whycliqz/human-web

[3] https://blog.mozilla.org/press-uk/2016/06/22/mozilla-gives-3...

[4] https://blog.mozilla.org/press-uk/2016/08/23/mozilla-makes-s...

[5] https://www.zdnet.com/article/firefox-tests-cliqz-engine-whi...

[6] https://www.theregister.com/2021/03/03/brave_buys_a_search_e...

robinduckett5 hours ago

I’m from Wales and it almost seems like a transliteration of the word “Mumble” - actual translation is “mwmial”

remram17 minutes ago

So not only is it based on an obscure word in Welsh, but it's not even spelled correctly?

dmurray4 hours ago

Welsh has the unfortunate combination of being unfamiliar to most English speakers, and not exotic enough to score diversity points.

melx4 hours ago

On that topic I love the Welsh-English encounter of civil servants thinking they understood each other[0]

[0] http://news.bbc.co.uk/1/hi/7702913.stm

zx80801 hour ago

Surprisingly, it resembles the Swedish language a bit.

toastal4 hours ago

But Mumble brings back fond memories https://www.mumble.info/

eviks6 hours ago

You don't need to remember it, just bookmark and tag it however you like (it's a waste of keystrokes anyway to manually type the full domain for a site used as frequently as a search engine)

DandyDev5 hours ago

That is not how a large part of the citizens on the internet work. Hell, a not insignificant number of people will still "search" for Google in their address bar before they get to the actual googling

eviks4 hours ago

Except I'm not talking to a large part of the citizens, but to a single one. Do you type 'google.com' in your address bar to search?

tux19683 hours ago

You're being argumentative for no good reason. He was suggesting a name change to improve the likelihood of a large userbase, not a change for his own convenience.

krishadi5 hours ago

This and the other engines seem to implement all the components of crawling, indexing, and searching strung together. Is there a reason for this? Wouldn't it make sense to offer, let's say, crawling + indexing separately, so that others could build a search algorithm on top, or to offer just the crawling as a service? Is there anything like this already available? Or is it just not a viable option?

marginalia_nu3 hours ago

Crawling can be done collaboratively, but there's not a lot of point to doing this. Crawling is the cheap and easy part.

As for the rest, in order to perform well, the indexer needs to be built specifically tailored to what the search engine is doing. Often you're scrounging for places to cram in individual bits to encode some additional piece of information about the term.

Where a DBMS tries to support every use case, a search engine index does the opposite: it supports a single use case and cuts every corner imaginable and then some to make that happen with as much resource frugality as possible.

ddorian434 hours ago

There is common crawl: https://github.com/commoncrawl

marginalia_nu3 hours ago

Kinda sucks that it's stuck in AWS with no easy way of exfiltrating the data from the Amazon ecosystem. Last I tried I got like 100 Kb/s on their HTTP mirror. At that rate, the download would take 12 years.
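
The arithmetic behind an estimate like this can be sketched as follows. This is only an illustration: it assumes "Kb/s" means kilobytes per second (it could also mean kilobits), and the function names are made up for the example.

```python
# Rough download-time arithmetic; all names here are hypothetical.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def implied_size_bytes(rate_bytes_per_s: float, years: float) -> float:
    """Data volume transferred at a steady rate over the given number of years."""
    return rate_bytes_per_s * years * SECONDS_PER_YEAR


def download_years(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Years needed to transfer size_bytes at a steady rate."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_YEAR


# Assumption: 100 Kb/s read as 100,000 bytes per second.
size = implied_size_bytes(100_000, 12)
print(f"{size / 1e12:.1f} TB implied by 12 years at 100 kB/s")
```

Under that reading, 12 years at 100 kB/s corresponds to roughly 38 TB of data; if the rate were kilobits per second, the implied volume would be one eighth of that.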

ddorian432 hours ago

I just tried 2 HTTP examples from https://commoncrawl.org/get-started for an old dataset and the most recent one and got 110Mb/s (my full download bandwidth):

  wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz

  wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-23/segments/1685224643388.45/warc/CC-MAIN-20230527223515-20230528013515-00483.warc.gz

evolve2k10 hours ago

Love that you folks are working on this. We desperately need more diversity in search options.

Much is at stake in this arena.

marginalia_nu6 hours ago

I'd love to see more competition in search. Feels like everyone right now gets tripped up on trying to emulate Google, which is a trap even if you succeed. Nobody is going to out-Google Google.

ChatGPT's recent huge success at performing specific tasks previously within the domain of Google, by doing something other than what Google does, is a good example of this.

kristopolous6 hours ago

ChatGPT is so much more useful for things that are specific and complex than Google is.

Google used to be good at it but it's now utterly befuddled by specificity and returns such garbage that I've given up.

But for questions of the form "I'm doing this, I'm seeing this and I'm wondering if X is possible", ChatGPT is solid: basically a personal Stack Overflow

marginalia_nu6 hours ago

Question-answering is something Google pivoted toward with great enthusiasm but never quite nailed down. They'd sometimes get some questions right, but it was more of a broken clock sort of a deal.

Fnoord2 hours ago

Google got completely and utterly raped by world-wide SEO. We'll have to see how ChatGPT ages. Since the dataset is more controlled, I give it a fair chance.

dang12 hours ago

Related:

Show HN: I'm building a non-profit search engine - https://news.ycombinator.com/item?id=29690877 - Dec 2021 (199 comments)

tamimio54 minutes ago

For my first test I searched “best business banks in Canada” and it showed no results, saying “We could not find anything for your search..”. I can also see two redundant lens icons.

carlsborg4 hours ago

Sub-100 ms search results, nicely typed Python codebase, good project. How many 4096-byte pages do you currently store?

worksonmine2 hours ago

Why 4096 bytes?

carlsborg1 hour ago

From the GitHub README: "Our design is a giant hash map. We have a single store consisting of a fixed number N of pages. Each page is of a fixed size (currently 4096 bytes to match a page of memory), and consists of a compressed list of items."
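
A minimal sketch of what such a fixed-page store could look like, to make the quoted design concrete. This is not mwmbl's actual code: the CRC32 hashing, the JSON-plus-zlib encoding, the 2-byte length prefix, `NUM_PAGES`, and all function names are assumptions made for illustration.

```python
import json
import struct
import zlib

PAGE_SIZE = 4096               # bytes per page, as stated in the quote
NUM_PAGES = 1024               # hypothetical N, kept small for the example
HEADER = struct.Struct("<H")   # assumed 2-byte length prefix for the payload


def page_index(term: str, num_pages: int = NUM_PAGES) -> int:
    """Map a term to one of the fixed pages using a stable hash."""
    return zlib.crc32(term.encode("utf-8")) % num_pages


def pack_page(items: list, page_size: int = PAGE_SIZE) -> bytes:
    """Keep as many leading items as fit, compressed, in one fixed-size page."""
    kept = list(items)
    while True:
        blob = zlib.compress(json.dumps(kept).encode("utf-8"))
        if HEADER.size + len(blob) <= page_size:
            # Pad with zero bytes so every page has exactly page_size bytes.
            return (HEADER.pack(len(blob)) + blob).ljust(page_size, b"\0")
        kept.pop()  # drop the last (lowest-ranked) item and try again


def unpack_page(page: bytes) -> list:
    """Recover the item list stored in a page."""
    (length,) = HEADER.unpack_from(page)
    start = HEADER.size
    return json.loads(zlib.decompress(page[start:start + length]))


# Usage: more items than fit in one page, so the tail is silently dropped.
urls = [f"https://example.com/{zlib.crc32(str(i).encode()):08x}" for i in range(1500)]
page = pack_page(urls)
stored = unpack_page(page)
print(len(page), len(stored), "of", len(urls), "items kept")
```

The point of the sketch is the trade-off: lookups are a single page read, but anything that does not fit in the page for a given key is simply not stored.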

TheExplorer1 hour ago

Typing 'Debian' gets some results; adding 'gdm' gets 0. lol

nonrandomstring5 hours ago

> This website requires you to support/enable scripts.

Bye bye.

You do not need "scripts" to turn the text string I'll supply into a list of candidate links. How can you not understand this basic accessibility foundation?

crtasm39 minutes ago

It's much easier to read after changing --bold-font-weight to 500 in the CSS.

Fnoord2 hours ago

When I enter Kamil Galeev I get directed to a Nitter post by him (and only that), but when I enter Kamil Kazani (the nickname mentioned in said Nitter post) I get nothing at all.

1vuio0pswjnm73 hours ago

"Welcome to mwmbl, the free, open-source and non-profit search engine.

This website requires you to support/enable scripts."

JSON results, no JavaScript

https://api.mwmbl.org/?search=search+the+web+without+javascr...

kwhitefoot3 hours ago

A lot of the terms I searched for returned no hits. The Firefox add-on crawls pages linked from Hacker News, which is perhaps amusing but seems unlikely to yield a representative selection of the web. Perhaps the user should be able to suggest pages to be crawled.

But when it does find something it is very quick! So I'll give it a go.

hk__23 hours ago

Same experience: it’s quick at finding irrelevant links. For some reason, it seems to have indexed a lot of spammy websites: search for "Trastevere" on Google, and you get Wikipedia and pages about the district in Rome. Search it on Mwmbl and you only get links from a random *.it-romehotels.com website.

Other random examples: search for "2023" and the very first link is "2023 Pomeroy College Basketball Ratings". Search for "iphone", and the 5th link is a page about iPhone 6s that was last updated in 2021. Typos don't work: "haker news" has only one result, a Hungarian press article.

joshxyz26 minutes ago

man these names are making me dyslexic. love it though.

BlackLotus894 hours ago

Gigablast (linked in the FAQ) has been dead for some time now. It had some sort of collab with freenode and then suddenly disappeared (not implying causality)

Black616Angel7 hours ago

Okay, name aside, because I instantly got that and English isn't my first language.

But the crawler seems to be lacking quite a bit. For my first search (current work problem) "rust json diff" it only found 6 links, only one of which was a Rust crate. Unfortunate.

Second Search: "black sabbath sleeping village lyrics" only gave 2 results, only one of which was correct.

Also the repo is missing the SearXNG[1] search engine.

[1] https://github.com/searxng/searxng

marginalia_nu6 hours ago

SearXNG isn't really a search engine. It's just a unified front-end for other search engines and doesn't do any actual crawling or indexing, as far as I'm aware.

davidebaldini2 hours ago

If I understand correctly, having only 4096 bytes of data per term means that multiple terms in the same query intersect to few or no results. The purpose seems to be cutting cost at the expense of completeness.

marginalia_nu2 hours ago

Yeah. That seems like a design decision that will scale poorly. For reference, even in my dinky 100M index I have individual terms with several gigabytes of associated document references.

In general, hash-table index designs don't tend to be very efficient. If you use a skip list or something similar, you can calculate the intersection between sets in sublinear time.
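
To illustrate the kind of intersection meant here, a rough sketch: it uses binary search over sorted Python lists as a stand-in for skip pointers, so it is not a real skip list and not how any particular engine implements it. The function name and the example data are made up.

```python
from bisect import bisect_left


def intersect_sorted(small: list, large: list) -> list:
    """Intersect two sorted, duplicate-free lists of document ids.

    For each id in the shorter list, jump ahead in the longer list with a
    binary search instead of scanning it element by element.
    """
    if len(small) > len(large):
        small, large = large, small
    result = []
    lo = 0
    for doc_id in small:
        lo = bisect_left(large, doc_id, lo)  # skip everything below doc_id
        if lo == len(large):
            break  # the longer list is exhausted
        if large[lo] == doc_id:
            result.append(doc_id)
    return result


# Usage: a rare term (3 documents) against a common term (500 documents).
rare = [3, 50, 900]
common = list(range(0, 1000, 2))
print(intersect_sorted(rare, common))
```

With m ids in the short list and n in the long one, this does on the order of m log n comparisons, which is sublinear in n when one term is much rarer than the other. That only works if the full sorted list for each term is available, which a fixed-size page per key cannot guarantee.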

Reticularas3 hours ago

Don't know if the index isn't complete, but the results with this are quite poor

marginalia_nu7 hours ago

I was happy to notice these guys a while back, but the git repo seems very dormant. I wonder if they backpedaled on the open source side of things, or if the project is asleep.

Either would be sad, because the world needs more open source search engines.

romwell12 hours ago

OK, the obvious question:

Why go with an unpronounceable name?

I mean, great that it was made, but I can't even tell people I'm using... mwumble? But it's spelled em-doubleyou-em-bee-el dot org.

BLKNSLVR10 hours ago

It's pronounced mumble. An explanation is at the very bottom of the github Readme, quoting:

> How do you pronounce "mwmbl"?

> Like "mumble". I live in Mumbles, which is spelt "Mwmbwls" in Welsh. But the intended meaning is "to mumble", as in "don't search, just mwmbl!"

9dev5 hours ago

Marketing 101: don’t try to be clever with your brand name :)

xpe10 hours ago

Ok, I'll think of it like this. Since the name of "w" is "double u" that means "mwmbl" is also "muumbl"!

UUho knows, maybe the name can uuork after all!

sdf4j10 hours ago

really? like mumble? [0]

[0] https://mumble.info

BLKNSLVR7 hours ago

According to government records, the only names not yet trademarked are "Popplers" and "Zittzers"

lionkor4 hours ago

Not for long! Someone ought to make a Popplers fastfood chain.

dabluecaboose8 hours ago

It's a tech startup, vowels are not chic

evolve2k10 hours ago

+1000 change the name

Make it so you can use it in a sentence to replace “Just google it”

thelastparadise11 hours ago

I think it's mimblewimble.

kylecazar10 hours ago

That's worse than I thought, suspected it was a play on mumble.

evolve2k10 hours ago

UI feedback: On my iPhone the search box shows two magnifying glasses, make it just one.

retrofuturism6 hours ago

Google only has 1. You gotta 1-up the competition to win.

imachine1980_9 hours ago

still don't understand ""

389 hours ago

slashes get eaten by the page. not cool.

dotcoma7 hours ago

So do vowels ;)

bobse8 hours ago

What a terrible name! Into the trash.

jw_cook8 hours ago

While it doesn't refute your point, the Frequently Asked Question section does give an explanation for the consonant soup: it's Welsh.

> How do you pronounce "mwmbl"? Like "mumble". I live in Mumbles, which is spelt "Mwmbwls" in Welsh. But the intended meaning is "to mumble", as in "don't search, just mwmbl!"

seanthemon7 hours ago

I highly recommend grabbing something simpler to say and remember, and redirecting it to your site. You're going to need a large amount of inertia to get people to comfortably use an odd domain name.

AlphaCerium5 hours ago

Arguably, Google was probably an odd name for a search engine to people in the 90s who weren't maths-savvy.

mdtrooper6 hours ago

I remember https://yacy.net/, but the big problem with that project was that it was written in Java and had no implementations in other languages. Imagine if torrent were only available in Perl.

marginalia_nu6 hours ago

YaCy's big problem is that distributed search is a bad idea that will never perform well. Search is as fast as the data is local.

kristopolous6 hours ago

There was an effort in the early 90s to have search as a protocol so you could have a query and then select the domains you want to run it on and return an aggregate result.

It was 100% abandoned and I think that's a mistake. It'd be nice to explore some of those ideas again

marginalia_nu6 hours ago

I think a big part of the problem is that domains in isolation don't provide the best search results. Out-of-band information like (global) anchor texts or click data makes search perform so much better.

If I want to learn how to do an INNER JOIN in MariaDB, this is the authoritative source: https://mariadb.com/kb/en/join-syntax/

The problem being that INNER JOIN isn't particularly important to that page using most IR measures of importance, it's also primarily in a <code>-block which is typically further de-prioritized. To learn that this is an important link, you need to look outside of mariadb.com.

kristopolous5 hours ago

There's more to it than that.

What if, instead of crawling the PHP-generated rendering of database rows with a bunch of cruft, the administrator published some kind of schema with scraping and querying rules, and you could alternatively make a single call to capture all of the data in a semantic schema?

You can still do all the stuff you're talking about but it could make search more coherent.

An entry for the humans and an entry for the computers.

Sure, you can't trust everybody like this, but take, say, IMDb, Discogs, Wikipedia, all of which provide database dumps anyway (e.g. https://datasets.imdbws.com/). That's what I'm advocating for revisiting. Lots of legit sites such as universities, newspapers, public records offices...

You could even have a search toggle "screened sources" or whatever for the ones that make the cut