
You Wouldn't Download a Hacker News

455 points | 5 days ago | jasonthorsness.com
montebicyclelo5 days ago

There are also two DBs I know of that keep an up-to-date Hacker News table you can run analytics on without downloading it first.

- BigQuery (requires a Google Cloud account; querying should fit in the free tier, I'd guess) — `bigquery-public-data.hacker_news.full`; see the sample query below

- ClickHouse — no signup needed; you can run queries directly in the browser [1]

[1] https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...
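For example, something along these lines should work against the BigQuery table (column names like `type`, `title`, `score`, `by`, and `timestamp` are my guess at the schema, so adjust as needed):

    -- Top-scoring stories of 2024 from the public BigQuery dataset
    SELECT title, score, `by` AS author
    FROM `bigquery-public-data.hacker_news.full`
    WHERE type = 'story'
      AND EXTRACT(YEAR FROM timestamp) = 2024
    ORDER BY score DESC
    LIMIT 10;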

ZeWaka4 days ago

and now yours :)

xnx5 days ago

The ClickHouse resource is amazing. It even has history! I had already done my own exercise of downloading all the JSON before discovering the Clickhouse HN DBs.

k0ns0l4 days ago

+1

mattkevan5 days ago

I did something similar a while back with the @fesshole Twitter/Bluesky account. Downloaded the entire archive and fine-tuned a model on it to create more unhinged confessions.

Was feeling pretty pleased with myself until I realised that all I’d done was teach an innocent machine about wanking and divorce. Felt like that bit in a sci-fi movie where the alien/super-intelligent AI speed-watches humanity’s history and decides we’re not worth saving after all.

nthingtohide5 days ago

> an innocent machine about wanking and divorce

Let's say you discovered a pendrive from a long-lost civilization and trained a model on that text data. How would you or the model know that the pendrive contained data on wanking and divorce without any kind of external grounding for that data?

dTal4 days ago

LLMs learn to translate without explicit Rosetta stones (parallel copies of the same text in different languages). I suspect they would learn to translate even in the absence of any parallel texts at all. "Meaning" is no more or less than structural isomorphism, and humans tend to talk about similar things in similar ways regardless of the language used. So provided that the pendrive contained similar topics to known texts, and was large enough to extract with statistical significance the long tail of semantically meaningful relationships, then a translation could be achieved.

alabastervlog4 days ago

This is much more concise than my usual attempts to explain why LLMs don’t “know” things. I’ll be stealing it. Maybe with a different example corpus, lol.

nthingtohide4 days ago

I actually fashioned this logic out of the philosophical question of why certain neural firings appear as sound in our brain while others appear as vision. What gives?

moate4 days ago

I think it only fair to leave that in for posterity. Where would we be without wanking and divorce after all?

harry84 days ago

Sexually fulfilled?

tomcam4 days ago

So, wanking and hentai?

falcor845 days ago

What's wrong with wanking and divorce? These are respectively a way for people to be happier and more self-reliant, and a way for people to get out of a situation that isn't working out for them. I think both are net positives, and I'm very grateful to live in a society that normalizes them.

pc865 days ago

I'm not implying that divorce should be stigmatized or prohibited or anything, but it is bad (necessary evil?) and most people would be much happier if they had never married that person in the first place rather than married them then gotten divorced.

So "normalize divorce" is pretty backward when what we should be doing is normalizing making sure you're marrying the right person.

nhod4 days ago

This reminds me of one of my very favorite essays of all time, "Why You Will Marry the Wrong Person" by Alain de Botton from the School of Life. The title is somewhat misleading, and I resisted reading it for a couple years as a result. It is exquisite writing — it couldn't be said with fewer words, and adding more wouldn't help either — and an extraordinary and ultimately hopeful meditation on love and marriage.

NYT Gift Article: https://www.nytimes.com/2016/05/29/opinion/sunday/why-you-wi...

Nzen4 days ago

Alain de Botton also published this in video form, seven years ago [0]. If you want the cliff's notes, his School of Life channel has a shorter version [1].

[0] https://www.youtube.com/watch?v=-EvvPZFdjyk 22 minutes

[1] https://www.youtube.com/watch?v=zuKV2DI9-Jg 4 minutes

didgetmaster4 days ago

I agree. The title is wrong. It should be "Why you are sure to think, whomever you marry, that they are the wrong person".

tailspin20194 days ago

You’re 100% right. That essay is superb and I’m glad I read it!

Thanks for sharing the link.

cgriswald4 days ago

Making sure you are marrying the right person is normalized. I’d have never even known my ex wasn’t the right person if I hadn’t married her. I didn’t come out of my marriage worse off.

Normalize divorce and stop stigmatizing it by calling it bad or evil.

dcuthbertson5 days ago

The innocent machine can't do either. It's akin to having no mouth, but it must scream (apologies to Harlan Ellison)

falcor845 days ago

That is a fair point, but it would then apply to everything else we teach it about, like how we perceive the color of the sky or the taste of champagne. Should we remove these from the training set too?

Is it not still good to be exposed to the experiences of others, even if one cannot experience these things themself?

adamc4 days ago

Having gone through a divorce... no. It would be better if people tried harder to make relationships work. Failing that, it would be better to not marry such a person.

zelphirkalt4 days ago

The state of having married the wrong person will always occur. To stigmatize divorce is to put people who made the wrong choice once in a worse spot.

Marriage should be less artificially loaded with meaning, and divorce should not be stigmatized. Instead, people who divorce when they notice it is not working should, within reason, be applauded for looking out for their own health.

At the same time people also should learn how to make relationships in general work.

falcor844 days ago

People sometimes grow in different directions. Sometimes the person who was perfect for you at 25 just isn't a good fit for you at age 40, regardless of how hard you try to make it work.

jakegmaths5 days ago

Your query for Java will include all instances of JavaScript as well, so you're overrepresenting Java.

smarnach5 days ago

Similarly, the Rust query will include "trust", "antitrust", "frustration", and a bunch of other words.
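If the data is queried somewhere with RE2-style regex support (ClickHouse, for instance), matching on word boundaries removes most of those false positives. A rough sketch, with the table and column names guessed rather than taken from any particular dataset:

    -- Count comments that mention Rust as a standalone word,
    -- so 'trust', 'antitrust', 'frustration' no longer match.
    -- (Plain 'rust' the metal still sneaks in.)
    SELECT count() AS rust_mentions
    FROM hackernews              -- hypothetical table name
    WHERE type = 'comment'
      AND match(text, '\\bRust\\b');

The same trick keeps 'Java' from matching 'JavaScript', since there is no word boundary in the middle of 'JavaScript'.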

sph5 days ago

A guerrilla marketing plan for a new language is to name it after a common single-syllable word, so that it appears much more prominent than it really is in badly done popularity contests.

Call it "Go", for example.

(Necessary disclaimer for the irony-impaired: this is a joke and an attempt at being witty.)

setopt5 days ago

Let’s make a language called “A” in that case. (I mean C was fine, so why not one letter?)

TZubiri5 days ago

Or call it the name of a popular song to appeal to the youngins.

I present to you "Gangnam C"

InDubioProRubio5 days ago

You also wouldn't acronym hijack overload to boost mental presence in gamers LOL

matsemann5 days ago

Reminded me of the Scunthorpe problem: https://en.wikipedia.org/wiki/Scunthorpe_problem

brian-armstrong4 days ago

Amusingly, the chart shows Rust's popularity starting from before its release. The rust hype crowd is so exuberant, they began before the language even existed!

Matumio4 days ago

Now if we only could disambiguate words based on context. But you'd need a good language model for that, and we don't... wait.

jasonthorsness5 days ago

Ah right… maybe even more unexpected then to see a decline

cs02rm05 days ago

I'm not so sure. While Java's never looked better to me, it does "feel" like it's in significant decline in terms of what people are asking for on LinkedIn.

I'd imagine these days TypeScript or Node might be taking over some of what would previously have matched JavaScript.

cess115 days ago

Recruiting Java developers is easy mode; there are rather large consultancies and similar suppliers that will sell or rent them to you in bulk, so you don't need to push adverts to the same extent as with Pythonistas, Rubyists, and TypeScript developers.

But there is likely some decline for Java. I'd bet Elixir and Erlang have been nibbling away at the JVM space for quite some time; they make it pretty comfortable to build the kind of systems you'd otherwise use a JVM-JMS-Wildfly/JBoss rig for. Oracle doesn't help: they take zero issue with being widely perceived as nasty, and it takes a bit of courage and knowledge to avoid getting a call from them at an inconvenient time.

karel-3d5 days ago

New Java actually looks good, but most of the actual Java ecosystem is stuck in the past... and you will mostly work within the existing ecosystem.

smcin4 days ago

a) Does your query for 'JS' return instances of 'JSON'?

b) The ultimate hard search topic is 'R' / 'R language'. Check whether you think you index it correctly. Or related terms like RStudio, Posit, [R]Shiny, tidyverse, data.table, Hadleyverse...

userbinator5 days ago

> I had a 20 GiB JSON file of everything that has ever happened on Hacker News

I'm actually surprised at that volume, given this is a text-only site. Humans have managed to post over 20 billion bytes of text to it over the 18 years that HN existed? That averages to over 2MB per day, or around 7.5KB/s.

sph5 days ago

2 MB per day doesn't sound like a lot. The number of posts has probably increased exponentially over the years, especially after the Reddit fiasco, when we had our latest and biggest neverending September.

Also, I bet a decent amount of that is not from humans. /newest is full of bot spam.

samplatt5 days ago

Plus the JSON structure metadata, which for the average comment is going to add, what, 10%?

kevincox5 days ago

I suspect it is closer to a 100% increase for the average comment. If the average comment is a few sentences and the metadata has an id, parent id, author, timestamp, and vote count, that can add up pretty fast.

FabHK5 days ago

Around one book every 12 hours.

olalonde4 days ago

7.5KB/s (aka 7500 characters per second) didn't sound realistic... So I did the math[0] and it turns out it's closer to 34 bytes/s (0.03 KB/s). And it's really lower than that because of all the metadata and syntax in the JSON. You were right about the "over 2MB per day" though.

[0] Well, ChatGPT did the math but it seems to check out: https://chatgpt.com/share/68124afc-c914-800b-8647-74e7dc4f21...
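If you'd rather check it yourself, the back-of-the-envelope version fits in a couple of lines of SQL (using the rough 20 GB over ~18 years figures from the thread):

    SELECT
        20e9 / (18 * 365.25 * 24 * 60 * 60) AS bytes_per_second,  -- roughly 35
        20e9 / (18 * 365.25) / 1e6          AS megabytes_per_day; -- roughly 3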

NitpickLawyer4 days ago

The entire Reddit archive was ~4TB sometime close to them removing the API. That's fully compressed; it used to be hosted on the-eye. There are still arrrr places where you can torrent the files if you're inclined to do so. A lot of that is garbage, but the early years are probably worth a look, especially before 2018-2019 when smarter bots came to be.

xnx5 days ago

20 GB JSON is surprising to me. I have an sqlite file of all HN data that is 20 GB, it would be much larger as JSON.

wolfgang424 days ago

20 GB of JSON is correct; here’s the entire dump straight from the API up to last Monday:

  $ du -c ~/feepsearch-prod/datasource/hacker-news/data/dump/*.jsonl | tail -n1
  19428360        total
Not sure how your sqlite file is structured but my intuition is that the sizes being roughly the same sounds plausible: JSON has a lot of overhead from redundant structure and ASCII-formatted values; but sqlite has indexes, btrees, ptrmaps, overflow pages, freelists, and so on.
elcritch4 days ago

Sqlite also doesn’t have fixed types, but uses a tagged value system to store data. Well according to what I’ve read on the topic.

kortilla4 days ago

SQLite files are optimized for fast querying, not size.

dredmorbius4 days ago

The total strikes me as small. That's nearly two decades of contributions from several 100k active members, and a few million total. HN is what would have been a substantial social network prior to Facebook, and (largely on account of its modest size and active moderation) a high-value one.

I did some modelling of how much contributed text data there was on Google+ as that site was shutting down in 2019.

By "text data", I'm excluding both media (images, audio, video), and all the extraneous page throw-weight (HTML scaffolding, CSS, JS).

Given the very low participation rates, and finding that posts on average ran about 120 characters (I strongly suspect that much activity was part of a Twitter-oriented social strategy, though it's possible that SocMed posts just trend short), seven years' of history from a few tens of millions of active accounts (out of > 4 billion registered profiles) only amounted to a few GiB.

This has a bearing on a few other aspects:

- The Archive Team (AT, working with, but unaffiliated with, the Internet Archive, IA) was engaged in an archival effort aimed at G+. That had ... mixed success: much content was archived, one heck of a lot wasn't, and very few comments survive (threads were curtailed to the most recent ten or so); absent search it remains fairly useless, and those with "vanity accounts" (based on a selected account name rather than a random hash) prove to be even less accessible. In addition to all of that, by scraping full pages and attempting to present the site as it appeared online, AT/IA are committing to a tremendous increase in storage requirements whilst missing much of what actually made the site interesting.

- Those interested in storing text contributions of even large populations face very modest storage requirements. If, say, average online time is 45 minutes daily, typing speed is 45 wpm, and only half of online time is spent writing vs. reading, that's roughly 1,000 words/(person*day), or about 6 KiB/(person*day). That's 6 MiB per 1,000 people, 6 GiB per 1 million, 6 TiB per billion. And ... the true values are almost certainly far lower: I'm pretty certain I've overstated writing time (it's likely closer to 10%), and typing speed (typing on mobile is likely closer to 20--30 wpm, if that). E.g., Facebook sees about 2.45 billion "pieces of content" posted per day, of which half is video. If we assume 120 characters (bytes) per post, that's a surprisingly modest amount, substantially less than 300 GiB/day of text data. (Images, audio, and video will of course inflate that markedly).

- The amount of non-entered data (e.g., location, video, online interactions, commerce) is the bulk of current data collection / surveillance state & capitalism systems.

SilverBirch5 days ago

What is the netiquette of downloading HN? Do you ping Dang and ask him before you blow up his servers? Or do you just assume at this point that every billion dollar tech company is doing this many times over so you probably won't even be noticed?

umvi4 days ago

What if someone from EU invokes "right to be forgotten" and demands HN delete past comments from years ago. Will those deletions be reflected in the public database? Or could you mine the db to discover deleted data?

jeremyjh4 days ago

They need to issue their demand to whoever is hosting their data. If HN has deleted it, they are not hosting it.

dang4 days ago

That's an entirely third party project so I doubt they should be listing YC as a partner there.

internetter4 days ago

Huh, yeah that is really misleading. Makes it look like it is by YC.

krapp5 days ago

HN has an API, as mentioned in the article, which isn't even rate limited. And all of the data is hosted on Firebase, which is a YC company. It's fine.

mikeevans5 days ago

Firebase is owned and operated by Google (has been for a while).

euroderf5 days ago

Not to mention three-letter agencies, incidentally attaching real names to HN monikers ?

TZubiri5 days ago

Well, it's called Hacker News, so hacking is fair game, at least in the good sense of the word.

alt2275 days ago

If something is on the public web, it is already being scraped by thousands of bots.

dangoodmanUT5 days ago

there's literally an API they promote. Did you read that part before trying to cancel them?

flakiness5 days ago

I have done something similar. I cheated and used the BigQuery dataset (which somehow keeps getting updated), exported the data to Parquet, downloaded it, and queried it with DuckDB.
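Once it's local, the whole dataset is one query away; roughly like this sketch (the file path and column names depend entirely on how you exported it):

    -- Count items by type across the downloaded Parquet files
    SELECT type, count(*) AS items
    FROM read_parquet('hn/*.parquet')
    GROUP BY type
    ORDER BY items DESC;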

minimaxir5 days ago

That's not cheating, that's just pragmatic.

AbstractH245 days ago

What a pragmatic way to rationalize most cheating

bambax5 days ago

> Now that I have a local download of all Hacker News content, I can train hundreds of LLM-based bots on it and run them as contributors, slowly and inevitably replacing all human text with the output of a chinese room oscillator perpetually echoing and recycling the past.

The author said this in jest, but I fear someone, someday, will try this; I hope it never happens but if it does, could we stop it?

icoder5 days ago

I'm more and more convinced of an old idea that seems to become more relevant over time: to somehow form a network of trust between humans so that I know that your account is trusted by a person (you) that is trusted by a person (I don't know) [...] that is trusted by a person (that I do know) that is trusted by me.

Lots of issues there to solve, privacy being one (the links don't have to be known to the users, but in a naive approach they are there on the server).

Paths of distrust could be added as negative weight, so I can distrust people directly or indirectly (based on the accounts that they trust) and that lowers the trust value of the chain(s) that link me to them.

Because it's a network, it can adjust itself to people trying to game the system, but it remains a question to how robust it will be.

XorNot5 days ago

I think technically this is the idea that GPG's web of trust was circling without quite staring at, which is the oddest thing about the protocol: it's used mostly today for machine authentication, which it's quite good at (e.g. deb repos)... but the tooling is actually generally oriented around verifying and trusting people.

wobfan5 days ago

Yeah, exactly, that was the idea behind it. Unfortunately, while on paper it sounds like a sound idea (at least IMO), the WOT approach in PGP has proven time and time again to have no chance against the laziness of humans.

genewitch5 days ago

The Matrix protocol, or at least the clients, represent a key as several emoji - which is fine - and you verify by looking at the emoji (on each client) at the same time, ideally in person. I've only ever signed for people in person, plus one remote attestation; but there we had a separate verified private channel and attested the emoji that way.

nickdothutton5 days ago

Do these still happen? They were common (-ish, at least in my circles) in the 90s during the crypto wars, often at the end of conferences and events, but I haven't come across them in recent years.

drcongo5 days ago

I actually built this once, a long time ago for a very bizarre social network project. I visualised it as a mesh where individuals were the points where the threads met, and as someone's trust level rose, it would pull up the trust levels of those directly connected, and to a lesser degree those connected to them - picture a trawler fishing net and lifting one of the points where the threads meet. Similarly, a user whose trust lowered over time would pull their connections down with them. Sadly I never got to see it at the scale it needed to become useful as the project's funding went sideways.

icoder4 days ago

Yeah, building something like this is not a weekend project, and getting enough traction for it to make sense is orders of magnitude beyond that.

I like the idea of one's trust leveraging that of those around them. This may make it more feasible to ask for some 'effort' for the trust gain (as a means to discourage duplicate 'personas' for a single human), since that can ripple outward.

all24 days ago

How would 'trust' manifest? A karma system?

How are individuals in the network linked? Just comments on comments? Or something different?

drcongo4 days ago

The system I built it for was invite only so the mesh was self-building, and yeah, there was a karma-like system that affected the trust levels, which in turn then gave users extra privileges such as more invites. Most of this was hidden from the users to make it slightly less exploitable, though if it had ever reached any kind of scale I'd imagine some users would work out ways to game it.

littlestymaar5 days ago

Ultimately, guaranteeing common trust between citizens is a fundamental role of the State.

For a mix of ideological reasons and a lack of genuine interest in the internet from legislators, mainly due to the generational factor I'd guess, it hasn't happened yet, but I expect government-issued equivalents of IDs and passports for the internet to become mainstream sooner rather than later.

eadmund5 days ago

> Ultimately, guaranteeing common trust between citizens is a fundamental role of the State.

I don’t think that really follows. Business credit bureaus and Dun & Bradstreet have been privately enabling trust between non-familiar parties for quite a long time. Various networks of merchants did the same in the Middle Ages.

nostrademons4 days ago

That’s not really what research on state formation has found. The basic definition of a state is “a centralized government with a monopoly on the legitimate use of force”, and as you might expect from the definition, groups generally attain statehood by monopolizing the use of force. In other words, they are the bandits that become big enough that nobody dares oppose them. They attain statehood through what’s effectively a peace treaty, when all possible opposition basically says “okay, we’re submit to your jurisdiction, please stop killing us”. Very often, it actually is a literal peace treaty.

States will often co-opt existing trust networks as a way to enhance and maintain their legitimacy, as with Constantine’s adoption of Christianity to preserve social cohesion in the Roman Empire, or all the compromises that led the 13 original colonies to ratify the U.S. constitution in the wake of the American Revolution. But violence comes first, then statehood, then trust.

Attempts to legislate trust don’t really work. Trust is an emotion, it operates person-to-person, and saying “oh, you need to trust such-and-such” doesn’t really work unless you are trusted yourself.

littlestymaar4 days ago

> The basic definition of a state is “a centralized government with a monopoly on the legitimate use of force

I'm not saying otherwise (I've even referred to this in a later comment).

> But violence comes first, then statehood, then trust.

Nobody said anything about the historical process so you're not contradicting anyone.

> Attempts to legislate trust don’t really work

Quite the opposite: it works very, very well. Civil laws and jurisdiction over contracts have existed since the Roman Republic, and every society has some equivalent (you should read about how the Taliban could get back to power so quickly in large part because they kept doing civil justice in rural Afghan society even while the country was occupied by the US coalition).

You must have institutions to be sure that the other party is going to respect the contract, so that you don't have to trust them; you just need to trust that the state is going to enforce that contract (which it can do because it has the monopoly on violence and can simply force the party violating the contract into submission).

With the monopoly on violence comes the responsibility to use that violence to enforce contracts, otherwise social structures are going to collapse (and someone else is going to take that job from you, and gone is your monopoly on violence).

icoder4 days ago

Interestingly, realising how easily a State's trust can sway has actually increased my belief that this should come from 'below'. I think a trust network between people (of different countries) can be much more resilient.

haswell5 days ago

I’ve also been thinking about this quite a bit lately.

I also want something like this for a lightweight social media experience. I’ve been off of the big platforms for years now, but really want a way to share life updates and photos with a group of trusted friends and family.

The more hostile the platforms become, the more viable I think something like this will become, because more and more people are frustrated and willing to put in some work to regain some control of their online experience.

TheOtherHobbes4 days ago

They're different application types - friends + family relationship reinforcement, social commenting (which itself varies across various dimensions, from highlighting usefulness to unapologetically mindless entertainment), social content sharing and distribution (interest group, not necessarily personal, not specifically for profit), social marketing (buy my stuff), and political influence/opinion management.

Meta and X have glommed them all together and made them unworkable with opaque algorithmic control, to the detriment of all of them.

And then you have all of them colonised by ad tech, which distorts their operation.

jeremyjh4 days ago

The key is to completely disconnect all ad revenue. I'm skeptical people are willing to put in some money to regain control; not in the kind of percentages that means I can move most of my social graph. Network effects are a real issue.

brongondwana5 days ago

Also there's the problem that every human has to have perfect opsec or you get the problem we have now, where there are massive botnets out there of compromised home computers.

im3w1l5 days ago

GPG lost, TLS won. Both are actually webs of trust with the same underlying technology, but they have different cultures and so different shapes. GPG culture is to trust your friends and have them trust their friends. With TLS culture you trust one entity (e.g. your browser) that trusts a couple dozen entities (root certificate authorities) that either sign keys directly or fan out to intermediate authorities that then sign keys. The hierarchical structure has proven much more successful than the decentralized one.

Frankly I don't trust my friends of friends of friends not to add thirst trap bots.

lxgr5 days ago

The difference is in both culture and topology.

TLS (or more accurately, the set of browser-trusted X.509 root CAs) is extremely hierarchical and all-or-nothing.

The PGP web of trust is non-hierarchical and decentralized (from an organizational point of view). That unfortunately makes it both more complex and less predictable, which I suppose is why it “lost” (not that it’s actually gone, but I personally have about one or maybe two trusted, non-expired keys left in my keyring).

kevin_thibedeau5 days ago

The issue is key management. TLS doesn't usually require client keys. GPG requires all receivers to have a key.

amenghra4 days ago

Couple dozen => it’s actually 50-ish, with a mix of private and government entities located all over the world.

The fact that the Spanish mint can mint (pun!) certificates for any domain is unfortunate.

Hopefully, any abuse would be noticed quickly and rights revoked.

It would maybe have made more sense for each country’s TLD to have one or more associated CA (with the ability to delegate trust among friendly countries if desired).

https://wiki.mozilla.org/CA/Included_Certificates

wkat42424 days ago

Yes, I never understood why the scope of a CA was not declared up front as part of their CA certificate. The purpose (email, website, etc.) is, but not the permissible domains. I'm not very happy that the countless Chinese CAs included in Firefox can sign any valid domain I use locally. They should be limited to anything under .cn only.

At least they seem to have kicked out the Russian ones now. But it's weird that such an important decision lies with arbitrary companies like OS and browser developers. On some platforms (Android) it's not even possible to add to the system CA list without root (only the user one which apps can choose to ignore)

marcusb5 days ago

Isn't this vaguely how the invite system at Lobsters functions? There's a public invite tree, and users risk their reputation (and posting access) when they invite new users.

withinboredom5 days ago

I know exactly zero people over there. I am also not about to go brown nose my way into it via IRC (or whatever chat they are using these days). I'd love to join, someday.

somethingsome5 days ago

Hey, I never actually tried Lobsters; do you mind if I ask for an invite?

SuperShibe5 days ago

I think this idea's problem might be the people part, specifically the majority type of people that will click absolutely anything for a free iPad.

icoder4 days ago

Theoretically that should swiftly be reflected in their trust level. But maybe I'm too optimistic.

I have nothing intrinsically against people that 'will click absolutely anything for a free iPad', but I wouldn't mind removing them from my online interactions if that also removes bots, trolls, spammers and propaganda.

miki1232115 days ago

How do you know it isn't already happening?

With long and substantive comments, sure, you can usually tell, though much less so now than a year or two ago. With short, 1 to 2 sentence comments though? I think LLMs are good enough to pass as humans by now.

Joker_vD5 days ago

But what if LLMs start leaving constructive and helpful comments? I personally would feel like xkcd [0], but others may disagree.

[0] https://xkcd.com/810/

gosub1005 days ago

That's the moment we will realize that it's not the spam that bothers us, but rather that there is no human interaction. How vapid would it be to have a bunch of fake comments saying eat more vegetables, good job for not running over that animal in the road, call mom tonight it's been a while, etc. They mean nothing if they were generated by a piece of silicon.

Pikamander25 days ago

I was browsing a Reddit thread recently and noticed that all of the human comments were off-topic one-liners and political quips, as is tradition.

Buried at the bottom of the thread was a helpful reply by an obvious LLM account that answered the original question far better than any of the other comments.

I'm still not sure if that's amazing or terrifying.

melagonster5 days ago

This is just another Reddit or HN.

nashashmi5 days ago

We LLMs only output the average response of humanity because we can only give results that are confirmed by multiple sources. On the contrary, many of HN’s comments are quite unique insights that run contrary to the average popular thought. If this is ever to be emulated by an LLM, we would give only gibberish answers. If we had a filter to that gibberish to only permit answers that are reasonable and sensible, our answers would be boring and still be gibberish. In order for our answers to be precise, accurate and unique, we must use something other than LLMs.

r3trohack3r5 days ago

HN already has a pretty good immune system for this sort of thing. Low-effort or repetitive comments get down-voted, flagged, and rate-limited fast. The site’s karma and velocity heuristics are crude compared with fancy ML, but they work because the community is tiny relative to Reddit or Twitter and the mods are hands-on. A fleet of sock-puppet LLM accounts would need to consistently clear that bar—i.e. post things people actually find interesting—otherwise they’d be throttled or shadow-killed long before they “replace all human text.”

Even if someone managed to keep a few AI-driven accounts alive, the marginal cost is high. Running inference on dozens of fresh threads 24/7 isn’t free, and keeping the output from slipping into generic SEO sludge is surprisingly hard. (Ask anyone who’s tried to use ChatGPT to farm karma—it reeks after a couple of posts.) Meanwhile the payoff is basically zero: you can’t monetize HN traffic, and karma is a lousy currency for bot-herders.

Could we stop a determined bad actor with resources? Probably, but the countermeasures would look the same as they do now: aggressive rate-limits, harsher newbie caps, human mod review, maybe some stylometry. That’s annoying for legit newcomers but not fatal. At the end of the day HN survives because humans here actually want to read other humans. As soon as commenters start sounding like a stochastic parrot, readers will tune out or flag, and the bots will be talking to themselves.

Written by GPT-3o

stephenhumphrey4 days ago

Regardless of whether that final line reflects reality or is merely tongue-in-cheek snark, it elevates the whole post into the sublime.

Etheryte5 days ago

See the Metal Gear franchise [0], the Dead Internet Theory [1], and many others who have predicted this.

> Hideo Kojima's ambitious script in Metal Gear Solid 2 has been praised, some calling it the first example of a postmodern video game, while others have argued that it anticipated concepts such as post-truth politics, fake news, echo chambers and alternative facts.

[0] https://en.wikipedia.org/wiki/Metal_Gear

[1] https://en.wikipedia.org/wiki/Dead_Internet_theory

djoldman5 days ago

A variant of this was done for 4chan by the fantastic Yannic Kilcher:

https://en.wikipedia.org/wiki/GPT4-Chan

holuponemoment5 days ago

Does it even matter?

Perhaps I am jaded, but most if not all people regurgitate about topics without thought or reason along very predictable paths, myself very much included. You can wave a single word like a muleta (the Spanish bullfighting cape) and the average person will happily charge at it and give you a predictable response.

bob10295 days ago

It's like a Pavlovian response in me to respond to anything SQL or C# adjacent.

I see the exact same in others. There are some HN usernames that I have memorized because they show up deterministically in these threads. Some are so determined it seems like a dedicated PR team, but I know better...

OccamsMirror5 days ago

I always love checking the comments on articles about Bevy to see how the metaverse client guy is going.

gosub1005 days ago

The paths are going to be predictable by necessity. It's not possible for everyone to have a uniquely derived interpretation of most common issues, whether that's standard lightning-rod politics or, to a somewhat lesser extent, tech socio-political issues.

photochemsyn4 days ago

It's hopeless.

We can still take the mathematical approach: any argument can be analyzed for logical self-consistency, and if it fails this basic test, reject it.

Then we can take the evidentiary approach: if any argument that relies on physical real-word evidence is not supported by well-curated, transparent, verifiable data then it should also be rejected.

Conclusion: finding reliable information online is a needle-in-a-haystack problem. This puts a premium on devising ways (eg a magnet for the needle) to filter the sewer for nuggets of gold.

no_time5 days ago

I can’t think of a solution that preserves the open and anonymous nature that we enjoy now. I think most open internet forums will go one of the following routes:

- ID/proof of human verification. Scan your ID, give me your phone number, rotate your head around while holding up a piece of paper etc. note that some sites already do this by proxy when they whitelist like 5 big email providers they accept for a new account.

- Going invite only. Self explanatory and works quite well to prevent spam, but limits growth. lobste.rs and private trackers come to mind as an example.

- Playing whack-a-mole with spammers (and losing eventually). 4chan does this by requiring you to solve a captcha and to pass the Cloudflare Turnstile, which may or may not do some browser fingerprinting/bot detection. CF is probably pretty good at deanonymizing you through this process too.

All options sound pretty grim to me. I'm not looking forward to the AI spam era of the internet.

theasisa5 days ago

Wouldn't those only mean that the account was initially created by a human? Afterwards there are no guarantees that the posts are by humans.

You'd need a permanent captcha that tracks that the actions you perform are human-like, such as mouse movement or scrolling on a phone, etc. And even then it would only deter current AI bots, and not for long, as impersonating human behavior would be a 'fun' challenge to break.

Trusted relationships are only as trustworthy as the humans trusting each other; eventually someone would break that trust, and afterwards it would be bots trusting bots.

Due to bots already filling up social media with their spew, and that output being used to train other bots, the only way I see this resolving itself is by everything eventually becoming nonsensical, and I predict we aren't that far from it happening. AI will eat itself.

no_time5 days ago

>Wouldn't those only mean that the account was initially created by a human but afterwards there are no guarantees that the posts are by humans.

Correct. But for curbing AI slop comments this is enough imo. As of writing this, you can quite easily spot LLM generated comments and ban them. If you have a verification system in place then you banned the human too, meaning you put a stop to their spamming.

icoder5 days ago

I sometimes think about account verification that requires work/effort over time (it could even be something fun), so that it becomes a lot harder to verify a whole army of them. We don't need identification per se, just being human and (somewhat) unique.

See also my other comment on the same parent wrt a network of trust. That could perhaps vet out spammers and trolls. On one hand it seems far-fetched and a quite underdeveloped idea; on the other hand, social interaction (including discussions like these) as we know it is in serious danger.

dns_snek5 days ago

There must be a technical solution to this based on some cryptographic black magic that both verifies you to be a unique person to a given website without divulging your identity, and without creating a globally unique identifier that would make it easy to track us across the web.

Of course this goes against the interests of tracking/spying industry and increasingly authoritarian governments, so it's unlikely to ever happen.

dns_snek5 days ago

I don't think that's what I was going for? As far as I can see it relies on a locked down software stack to "prove" that the user is running blessed software on top of blessed hardware. That's one way of dealing with bots but I'm looking for a solution that doesn't lock us out of our own devices.

vvillena5 days ago

These kinds of solutions are already deployed in some places. A trusted ID server creates a bunch of anonymous keys for a person, and the person uses these keys to identify themselves on pages that accept the ID server's keys. The page has no way to identify a person from a key.

The weak link is in the ID servers themselves. What happens if the servers go down, or if they refuse to issue keys? Think a government ID server refusing to issue keys for a specific person. Pages that only accept keys from these government ID servers, or that are forced to only accept those keys, would be inaccessible to these people. The right to ID would have to be enshrined into law.

no_time5 days ago

As I see it, a technical solution to AI spam inherently must include a way to uniquely identify particular machines at best, and particular humans responsible for said machines at worst.

This verification mechanism must include some sort of UUID to rein in a single bad actor who happens to validate his/her bot farm of 10000 accounts from the same certificate.

kriro5 days ago

I think LLMs could be a great driver of public/private key cryptography. I could see a future where everyone finally wants to sign their content. Then at least we know it's from that person, or from an LLM agent run by that person.

Maybe that'll be a use case for blockchain tech. See the whole posting history of the account on-chain.

ahoka5 days ago

Probably already happening.

drcongo5 days ago

The internet is going to become like William Basinski's Disintegration Loops, regurgitating itself with worse fidelity until it's all just unintelligible noise.

dangoodmanUT5 days ago

I imagine LLMs already have this too

genewitch5 days ago

I have all of n-gate as json with the cross references cross referenced.

Just in case I need to check for plagiarism.

I don't have enough Vram nor enough time to do anything useful on my personal computer. And yes I wrote vram like that to pothole any EE.

_Algernon_5 days ago

This is probably already happening to some extent. I think the best we can hope for is xkcd 810: https://xkcd.com/810/

g8oz4 days ago

I predict that in the coming years a lot of APIs will begin to offer the option of just returning a DuckDB file. If you're just going to load the JSON into a database anyway, why not get a database in the response?

vdm4 days ago

zstd-compressed Parquet files exported from my DuckDB 1.2 databases compress 2-3x better than the DuckDB files themselves.
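For reference, the export is a one-liner in DuckDB (the table name here is just a placeholder):

    -- Write a table out as zstd-compressed Parquet
    COPY items TO 'items.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);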

stefs5 days ago

Please do not use stacked charts! I think it's close to impossible not to distort the reader's impression, because a) it's very hard to gauge the height of a certain data point in the noise, and b) they imply a dependency where there _probably_ is none.

seabass5 days ago

My first thought as well! The author of uPlot has a good demo illustrating their pitfalls https://leeoniya.github.io/uPlot/demos/stacked-series.html

jasonthorsness5 days ago

It's true :( but with line charts the data had too much overlap and it was hard to see anything. I was thinking next time maybe multiple line charts, aligned and stacked, with one series per region?

jimmySixDOF4 days ago

That's just where a 3D approach fixes the problem, because you still stack but with some offset. There is nothing better for one-shot, look-once comprehension of high-volume data. For game-engine tech applied to real-world business intelligence, please see the work of https://flowimmersive.com/

dguest5 days ago

How do you feel about stacked plots on a logarithmic y axis? Some physics experiments do this all the time [1] but I find them pretty unintuitive.

[1]: https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PUBNOTES/ATL-...

lblume5 days ago

What is this even supposed to represent? The entire justification I could give for stacked bars is that you could permute the sub-bars and obtain comparable results. Do the bars still represent additive terms? Multiplicative constants? As a non-physicist I would have no idea on how to interpret this.

dguest5 days ago

It's a histogram. Each color is a different simulated physical process: they can all happen in particle collisions, so the sum of all of them should add up to the data the experiment takes. The data isn't shown here because it hasn't been taken yet: this is an extrapolation to a future dataset. And the dotted lines are some hypothetical signal.

The area occupied by each color is basically meaningless, though, because of the logarithmic y-scale. It always looks like there's way more of whatever you put on the bottom. And obviously you can grow it without bound: if you move the lower y-limit to 1e-20 you'll have the whole plot dominated by whatever is on the bottom.

For the record I think it's a terrible convention, it just somehow became standard in some fields.

ashish015 days ago

I wrote one a while back https://github.com/ashish01/hn-data-dumps and it was a lot of fun. One thing which would be cool to implement: more recent items change more over time, so recently downloaded items go stale faster than older ones.

jasonthorsness5 days ago

Yeah I’m really happy HN offers an API like this instead of locking things down like a bunch of other sites…

I used a function based on age for staleness; it considers things stale after a minute or two initially, and immutable once they're about two weeks old.

    // DefaultStaleIf marks stale at 60 seconds after creation, then frequently for the first few days after an item is
    // created, then quickly tapers after the first week to never again mark stale items more than a few weeks old.
    const DefaultStaleIf = "(:now-refreshed)>" +
        "(60.0*(log2(max(0.0,((:now-Time)/60.0))+1.0)+pow(((:now-Time)/(24.0*60.0*60.0)),3)))"

https://github.com/jasonthorsness/unlurker/blob/main/hn/core...
wslh4 days ago

It would be great if it were available as a torrent. There are also mutable torrents [1]. Not implemented everywhere, but there are implementations available [2].

[1] https://www.bittorrent.org/beps/bep_0046.html

[2] https://www.npmjs.com/package/bittorrent-dht

Am4TIfIsER0ppos5 days ago

I hope they snatched my flagged comments. I would be pleased to have helped make the AI into an asshole. Here's hoping for another Tay AI.

9rx5 days ago

> The Rise Of Rust

Shouldn't that be The Fall Of Rust? According to this, it saw the most attention during the years before it was created!

emilbratt5 days ago

The chart is a stacked one, so we are looking at the height each category takes up, not the height each category reaches.

vaylian4 days ago

What should the label of the y-axis be?

hsbauauvhabzb5 days ago

Is the raw dataset available anywhere? I really don’t like the HN search function, and grepping through the data would be handy.

Havoc5 days ago

It’s on Firebase/BigQuery to avoid people doing what OP did.

If you click the API link at the bottom of the page, it'll explain.

jasonthorsness5 days ago

I used the API! It only takes a few hours to download your own copy with the tool I used https://github.com/jasonthorsness/unlurker

I had to CTRL-C and resume a few times when it stalled; it might be a bug in my tool

xnx5 days ago

Is there any advantage to making all these requests instead of using ClickHouse or BigQuery?

jasonthorsness5 days ago

Probably not :P. I made the client for another project, https://hn.unlurker.com, and then just jumped straight to using it to download the whole thing instead of searching for an already available full data set.

Havoc4 days ago

My mistake - apologies. Had misunderstood what you did

byearthithatius4 days ago

Can you scrape all of HN by just incrementing item?id (since it's sequential) and using Python web requests with IP rotation (in case there is rate limiting)?

NVM: this approach of going item by item would take 460 days if the average request response time is 1 second (unless heavily parallelized; for instance, 500 instances _could_ do it in a day, but that's 40 million requests either way, so it would raise alarms).

deadbabe5 days ago

Is the 20GB JSON file available?

shayway5 days ago

Hah, I've been scraping HN over the past couple weeks to do something similar! Only submissions though, not comments. It was after I went to /newest and was faced with roughly 9/10 posts being AI-related. I was curious what the actual percentage of posts on HN were about AI, and also how it compared to other things heavily hyped in the past like Web3 and crypto.

alt2274 days ago

Here, the entire history of HN with the ability to run queries on it directly in the browser :)

https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...

matsemann5 days ago

One thing I'm curious about, but I guess not visible in any way, is random stats about my own user/usage of the site. What's my upvote/downvote ratio? Are there users I constantly upvote/downvote? Who is liking/hating my comments the most? And some I guessed could be scrapable: Which days/times am I the most active (like the GitHub green grid thingy)? How has my activity changed over the years?

pjc505 days ago

I don't think you can get the individual vote interactions, and that's probably a good thing. It is irritating that the "API" won't let me get vote counts; I should go back to my Python scraper of the comments page, since that's the only way to get data on post scores.

I've probably written over 50k words on here and was wondering if I could restructure my best comments into a long meta-commentary on what does well here and what I've learned about what the audience likes and dislikes.

(HN does not like jokes, but you can get away with it if you also include an explanation)

minimaxir5 days ago

The only vote data that is visible via any HN API is the scores on submissions.

Day/Hour activity maps for a given user are relatively trivial to do in a single query, but only public submission/comment data could be used to infer it.
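Something along these lines would do it against one of the public copies (table and column names are assumptions; swap in your own username):

    -- Day-of-week / hour-of-day activity grid for one user (ClickHouse-style functions)
    SELECT
        toDayOfWeek(time) AS day_of_week,
        toHour(time)      AS hour_of_day,
        count()           AS items
    FROM hackernews
    WHERE `by` = 'matsemann'
    GROUP BY day_of_week, hour_of_day
    ORDER BY day_of_week, hour_of_day;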

ryandrake5 days ago

Too bad! I’ve always sort of wanted to be able to query things like what were my most upvoted and downvoted comments, how often are my comments flagged, and so on.

saagarjha5 days ago

I did this once by scraping the site (very slowly, to be nice). It’s not that hard since the HTML is pretty consistent.

nottorp5 days ago

> Are there users I constantly upvote/downvote?

Hmm. Personally I never look at user names when I comment on something. It's too easy to go from "i agree/disagree with this piece of info" to "i like/dislike this guy"...

vidarh5 days ago

The exception, to me, is if I'm questioning whether the comment was in good faith or not, where the track record of the user on a given topic could go some way toward untangling that. It happens rarely here, compared to e.g. Reddit, but sometimes it's mildly useful.

pjc505 days ago

I recognize twenty or so of the most frequent and/or annoying posters.

The leaderboard https://news.ycombinator.com/leaders absolutely doesn't correlate with posting frequency. Which is probably a good thing. You can't bang out good posts non-stop on every subject.

matsemann5 days ago

Same, which is why it would be cool to see. Perhaps there are people I both upvote and downvote?

thaumasiotes5 days ago

> It's too easy to go from "i agree/disagree with this piece of info" to "i like/dislike this guy"...

...is that supposed to pose some kind of problem? The problem would be in the other direction, surely?

nottorp5 days ago

Either you got the direction wrong or you'd support someone who is wrong just because you like them.

You're wrong in both cases :)

xnx5 days ago

Some of this data is available through the API (and Clickhouse and BigQuery).

I wrote a Puppeteer script to export my own data that isn't public (upvotes, downvotes, etc.)

9rx5 days ago

> What's my upvote/downvote ratio?

Undefined, presumably. What reason would there be to take time out of your day to press a pointless button?

It doesn't communicate anything other than that you pressed a button. For someone participating in good faith, that doesn't add any value. But for those not participating in good faith, i.e. trolls, it adds incredible value to know that their trolling is being seen. So it is actually a net negative to the community if you did somehow accidentally press one of those buttons.

For those who seek fidget toys, there are better devices for that.

immibis5 days ago

Actually, its most useful purpose is to hide opinions you disagree with - if 3 other people agree with you.

Like when someone says GUIs are better than CLIs, or C++ is better than Rust, or you don't need microservices, you can just hide that inconvenient truth from the masses.

9rx5 days ago

So, what you are saying is that if the masses agree that some opinion is disagreeable, they will hide it from themselves? But they already read it to know it was disagreeable, so... What are they hiding it for, exactly? So that they don't have to read it again when they revisit the same comments 10 years later? Does anyone actually go back and reread the comments from 10 years ago?

matsemann5 days ago

Since there are no rules on downvoting, people probably use it for different things. Some to show dissent, some only to downvote things they think don't belong, etc. Which is why it would be interesting to see. Am I overusing it compared to the community? Underusing it?

saagarjha5 days ago

If Hacker News had reactions I’d put an eye roll here.

9rx5 days ago

You could have assigned 'eye roll' to one of the arrow buttons! Nobody else would have been able to infer your intent, but if you are pressing the arrow buttons it is not like you want anyone else to understand your intent anyway.

mike5034 days ago

Other people have asked, probably for the same reason but I would love an offline version, packaged in zim format or something.

For when the apocalypse happens it’ll be enjoyable to read relatively high quality interactions and some of them may include useful post-apoc tidbits!

xnx5 days ago

I have this data and a bunch of interesting analysis to share. Any suggestions on the best method to share results?

I like Tableau Public, because it allows for interactivity and exploration, but it can't handle this many rows of data.

Is there a good tool for making charts directly from Clickhouse data?

texodus5 days ago

No Clickhouse connector for free accounts yet, but if you can drop a Parquet file on S3 you can try https://prospective.co

xnx5 days ago

Thanks! I'll check that out. Thought it was a typo of "Perspective" for a moment: https://perspective.finos.org/

texodus5 days ago

Yes! This is the pro version, we also develop open source https://github.com/finos/perspective (which Prospective is substantially built on, with some customizations such as a wasm64 runtime).

tacker20005 days ago

Yea, I also get the feeling that these Rust evangelists get more annoying every day ;p

sebastianmestre5 days ago

Can you remake the stacked graphs with the variable of interest at the bottom? It's hard to see the percentage of Rust when it's all the way at the top, with a lot of noise on the lower layers.

Edit: or make a non-stacked version?

jasonthorsness5 days ago

Lots of valid criticism here of these graphs and the queries; I'll write a follow-up article.

dredmorbius4 days ago

I've been tempted to look into API-based HN access having scraped the front-page archive about two years ago.

One of the advantages of comments is that there's simply so much more text to work with. For the front page, there is up to 80 characters of context (often deliberately obtuse), as well as metadata (date, story position, votes, site, submitter).

I'd initially embarked on the project to find out what cities were mentioned most often on HN (in front-page titles), though it turned out to be a much more interesting project than I'd anticipated.

(I've somewhat neglected it for a while though I'll occasionally spin it up to check on questions or ideas.)

febeling4 days ago

You wonder what all the Rust talk was about before the programming language's release in Jan 2012.

0x0084 days ago

Like others have said, it likely includes partial matches as well (e.g. 'antitrust', etc.)

andrewshadura5 days ago

Funny nobody's mentioned "correct horse battery staple" in the comments yet…

th1nhng04 days ago

Can I ask how you draw the chart in the post?

jasonthorsness4 days ago

lol it was Excel (save as picture / SVG format / edit colors to support dark/light mode)

th1nhng04 days ago

wow, I never expected that xD thanks for letting me know

pier255 days ago

would love to see the graph of React, Vue, Angular, and Svelte

Too4 days ago

And Nextjs

hellostgeroge5 days ago

[dead]

a3w5 days ago

Cool project. Cool graphs.

But any GDPR requests for info and deletion in your inbox, yet?

arduanika4 days ago

Come on, you wouldn't GDPR a whimsical toy project!