Reflections as the Internet Archive turns 25

tkgally • 4 years ago

In his reflections, Brewster Kahle mentions his goal of creating “a library available to anybody, anywhere in the world.” He doesn’t mention, though, the costs of making that library available to the world for free or the fact that the Internet Archive accepts donations. So I will:

https://archive.org/donate/

raybb • 4 years ago

Donations are fantastic, but if you have engineering (or project management, design, etc.) skills, spending just 1 hour a week contributing to their open source projects goes a very long way!

Open Library in particular has a very active repo with lots of volunteers, a weekly community call, and a rather accessible codebase. https://github.com/internetarchive/openlibrary

If anyone knows webpack well, we would LOVE to have this dev-facing issue resolved so that CSS auto-reloads: https://github.com/internetarchive/openlibrary/issues/4955

dailyanchovy • 4 years ago

Alright, I'm new to pull requests but I had a go at it! https://github.com/internetarchive/openlibrary/pull/5451

state_less • 4 years ago

I sent my bones or clams or whatever you call them.

Will send more when I have more and when I've learned to be more generous. It's good to know that you're near, Internet Archive.

But, oh, what a wonderful feeling

Just to know that you are near

Sets my heart a-reeling

From my toes up to my ears

-Bob Dylan, The man in me

mkaufman • 4 years ago

state_less sounds like a Little Lebowski Urban Achiever.

state_less • 4 years ago

Little Lebowski Urban Achievers - inner city children of promise but without the necessary means for a - necessary means for a higher education. So Mr Lebowski is committed to sending all of them to college.

walterbell • 4 years ago

Anyone know the relative budgets/donations/staff of Wikipedia vs. Archive.org?

tkgally • 4 years ago

ProPublica has a database of tax filing information for nonprofits.

Internet Archive: https://projects.propublica.org/nonprofits/organizations/943...

Wikimedia Foundation: https://projects.propublica.org/nonprofits/organizations/200...

Mozilla Foundation: https://projects.propublica.org/nonprofits/organizations/200...

Electronic Frontier Foundation: https://projects.propublica.org/nonprofits/organizations/430...

jdc • 4 years ago

Wow, never thought I'd see Wikimedia at more than 4x the budget of Mozilla!

feudalism • 4 years ago

It's how Katherine Maher funds her trips to exotic locations.

dkdk8283 • 4 years ago

That’s Mozilla Foundation, not Mozilla Corporation. Corp has way more revenue.

joe_the_user • 4 years ago

Nice they're there. At the same time, it's amazingly easy for content to be removed from there - if someone objects or even if things are murky.

For example, all content from the old ezboard site has been removed based on the configuration of the current URL owner's robots.txt, and the current URL owner is just a domain parker. Ezboard hosted a lot of content back in the day.

https://archive.org/post/560730/ezboard-is-there-any-hope

1vuio0pswjnm7 • 4 years ago

This is an old problem. I could have sworn there were promises that they were going to change their procedures.

The question I have is how fast the content is removed after the domain name registration changes, i.e., is there a window of time between the appearance of a new robots.txt and the next scheduled crawl, and if so, would it be possible to "rescue" the content, as ArchiveTeam would do, during that window, before it disappears?

If this is possible, there could be a service for monitoring changes to domain name registrations for sites that have large amounts of historical content. I would happily volunteer to set up such a service.
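A rough sketch of the check such a monitor might run each time it fetches a watched site's robots.txt. The `ia_archiver` user-agent token and the function name are assumptions for illustration, and the parser only recognizes a blanket `Disallow: /`, so treat it as a starting point rather than a full robots.txt implementation:

```javascript
// Minimal robots.txt check: does the file contain a blanket "Disallow: /"
// for the given user agent (falling back to the "*" group)?
// This is a simplified reading of the format, not a full parser.
function blocksEverything(robotsTxt, userAgent) {
  const wanted = userAgent.toLowerCase();
  let agents = [];          // user agents named by the group being read
  let readingRules = false; // true once the current group has rule lines
  let specific = null;      // verdict of the group naming the agent, if any
  let wildcard = false;     // verdict of the "*" group

  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      // A user-agent line after rule lines starts a new group.
      if (readingRules) { agents = []; readingRules = false; }
      agents.push(value.toLowerCase());
    } else if (field === "disallow" || field === "allow") {
      readingRules = true;
      const blanket = field === "disallow" && value === "/";
      if (agents.includes(wanted)) specific = (specific || false) || blanket;
      if (agents.includes("*")) wildcard = wildcard || blanket;
    }
  }
  // A group that names the agent explicitly takes precedence over "*".
  return specific !== null ? specific : wildcard;
}

// Example: a typical domain-parker file that shuts out every crawler.
const parked = "User-agent: *\nDisallow: /\n";
console.log(blocksEverything(parked, "ia_archiver"));
```

A monitor could run this on every fetch and raise an alert when the answer flips from false to true for a domain known to hold a lot of historical content.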

joe_the_user • 4 years ago

Well, "complain on hn" has been a way to get stuff from Google. Maybe someone at archive will notice this thread.

hidden-spyder • 4 years ago

I'm curious. What changes has Google made due to complaints on HN?

SilverRed • 4 years ago

Hopefully it is just hidden and not deleted. But this is the main reason why alternative archive sites exist which ignore the original poster's requests. They are frequently used to archive posts from public figures that are suspected to be scrubbed later.

techrat • 4 years ago

> Hopefully it is just hidden and not deleted.

Hidden. Even when you request for them to remove stuff.

Had domain, stuff got archived, asked for them to remove it, added robots.txt. Domain lapsed. Someone else picked it up. Their robots.txt is now permissive, so the old stuff that I asked them to remove is now visible.

throwslackforce • 4 years ago

That's insidious. Does it even make sense to revive data that has been removed based on current configuration?

Even if the owner is the same, allowing the site to be archived going forward isn't the same thing as permitting it retroactively.

mavhc • 4 years ago

Wonder how you'd overcome that flaw, is there a history of domain name ownership?

account42 • 4 years ago

Maybe we need a whois.archive.org.

newswasboring • 4 years ago

fwn • 4 years ago

As far as I was able to experience it's just hidden and not deleted.

I have to keep an old domain indefinitely to host a robots.txt just to keep sensitive personal data hidden that little me foolishly published on the open internet.

But I'm not complaining. The internet archive is a great gift. Using it with a bookmarklet really feels like a super power.

mercora • 4 years ago

It sounds a lot like this would need some kind of delegation mechanism where you could point to a different URL in time, before abandoning the place. Or maybe some kind of sealing using a cryptographic function that lets you prove you are the owner of the current/previous content and would also prove you are not the owner of the newer content; this proof could be used to release the ban if ever needed.
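One way the "sealing" idea could look, as a sketch only: the owner publishes a public key while they still control the domain, and any later hide or release request has to carry a signature from the matching private key. The request format and function names here are made up for illustration, and nothing suggests the Internet Archive supports anything like this today:

```javascript
const crypto = require("node:crypto");

// The site owner generates a key pair once and publishes the public key
// alongside the content while they still control the domain.
const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519");

// Hypothetical request format: a plain string naming the domain and action.
function signRequest(request, key) {
  // Ed25519 takes no separate digest algorithm, hence the null.
  return crypto.sign(null, Buffer.from(request, "utf8"), key);
}

// The archive checks the signature against the key it saw published earlier.
function isFromOriginalOwner(request, signature, ownerPublicKey) {
  return crypto.verify(null, Buffer.from(request, "utf8"), ownerPublicKey, signature);
}

const request = "example.org: keep pre-2015 captures hidden";
const signature = signRequest(request, privateKey);
console.log(isFromOriginalOwner(request, signature, publicKey));
```

A new registrant of the domain would not hold the private key, so a permissive robots.txt from them would not be enough to release the old captures.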

joe_the_user • 4 years ago

Got any examples of these alternative archives?

SilverRed • 4 years ago

This is the main one https://archive.is/

From the FAQ, they do not respect robots.txt since they only archive on request by a user and they do not remove archives unless they contain illegal content.

1vuio0pswjnm7 • 4 years ago

HN commenters like to use archive.is but I always wonder if people are aware that archive.is is (a) blocked in some countries^1 and (b) may block access to itself in a country if it feels threatened.^2

There is also the issue of EDNS subnet.^3 archive.is tries to require it; it wants to know what location a request is coming from. In addition to EDNS, archive.is inserts the IP address and geolocation of the incoming request into the HTML of the returned page as a tracking pixel.^4

Thus archive.is does some things archive.org does not do besides just ignoring robots.txt

One of the things archive.org does that archive.is does not do is that archive.org inserts an HTTP response header intended to disable Chrome FLoC.^5 I add this header for all sites in a local proxy; however I do not see many sites adding it as a courtesy. Thanks archive.org for doing that.

1. https://en.wikipedia.org/wiki/Archive.is

2. https://www.reddit.com/r/KotakuInAction/comments/3e29vm/arch...

3. https://webapps.stackexchange.com/questions/135222/why-does-...

4. https://news.ycombinator.com/item?id=27498902

5. permissions-policy: interest-cohort=()
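For anyone who wants to add the header from footnote 5 in their own proxy or server, here is a minimal sketch. The function name and the plain-object representation of headers are assumptions; the only part taken from above is the `interest-cohort=()` directive:

```javascript
// Return a copy of the response headers with the FLoC opt-out directive
// added, keeping any Permissions-Policy directives that are already set.
function addFlocOptOut(headers) {
  const out = { ...headers };
  const directive = "interest-cohort=()";
  // Header names are case-insensitive, so look for any spelling.
  const key = Object.keys(out).find(
    (name) => name.toLowerCase() === "permissions-policy"
  );
  if (!key) {
    out["Permissions-Policy"] = directive;
  } else if (!/(^|,)\s*interest-cohort=/.test(out[key])) {
    out[key] = out[key] + ", " + directive;
  }
  return out;
}

console.log(addFlocOptOut({ "Content-Type": "text/html" }));
```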

mellosouls • 4 years ago

A couple of pointers to the wider world of web archiving:

https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...

https://github.com/iipc/awesome-web-archiving

v0x • 4 years ago

There’s archive.is, but I get the sense that the major use case for that is getting around paywalls as opposed to permanently archiving a page. Indeed, since they host content that the site owner probably doesn’t want them to, it would stand to reason the service would not be likely to stand the test of time. But I could be wrong.

Jiro • 4 years ago

They actually posted about this in 2017: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea... . At the time it sounded like they might change their robots.txt policy. I guess they never followed up on it.

(I checked and ezboard is still excluded.)

DonHopkins • 4 years ago

A search of youtube for "wayback machine" produces pages of stuff about the Internet Archive, and only the 24th result has anything to do with the origin of the term.

People who didn't spend their Saturday mornings glued in front of the TV screen as a child of the 1970's might not remember how American kids learned about history back then:

Peabody's Improbable History - Surrender of Cornwallis

Peabody and Sherman travel back to October 19, 1781 to witness Cornwallis surrendering to Washington. However, when they get there, he doesn't show up.

https://www.youtube.com/watch?v=3E8zmaOiCVw&ab_channel=bullw...

ConceptJunkie • 4 years ago

Actually, "The Bullwinkle Show" premiered in 1959, but I discovered it in the early 70s, as I suspect you did.

pwdisswordfish0 • 4 years ago

FYI, it's not obscure nowadays. There's a current series, The Mr. Peabody & Sherman Show, on Netflix, and Hollywood made a Mr. Peabody & Sherman movie (by DreamWorks) in 2014.

elric • 4 years ago

Without trying to be contrarian, I don't think that everything should be archived. Random tweets, random blog posts, random personal web sites. Let them wither and die and be forgotten. Notable content by notable people? Sure.

Everyone else ought to have the right to be forgotten, including some drunk tweet they wrote 10 years ago and regret, or an old personal page which contained too much PII.

Archive no longer has a way to opt-out, which is bad enough, but I still think they should be opt-in.

jl6 • 4 years ago

Perhaps the Internet Archive could do more to help people who find their personal/sensitive/embarrassing content made available in perpetuity (I’m not sure exactly what they could do), but it’s incredibly valuable to have archiving on by default. The voice of un-notable people is underrepresented in every field of study, and the voice of notable people tends to get preserved in other ways anyway.

jjkaczor • 4 years ago

It is helpful to get the perspectives of "un-notable" people from a historical perspective.

For example - the graffiti at Pompeii is interesting (and is pretty much at the same "quality bar" as Twitter):

https://www.theatlantic.com/technology/archive/2016/03/adrie...

https://kashgar.com.au/blogs/history/the-bawdy-graffiti-of-p...

jacquesm • 4 years ago

You never know in advance what will and what will not be worthwhile archiving, you only know that at some unspecified point in the future.

elric • 4 years ago

That's lovely from the perspective of a historian 200 years in the future. But it does nothing to alleviate the pain of people living in the present. People whose prospective employers comb over their embarrassing past. Or bullies. Or any other number of evildoers whose life is made easier by unfettered access to indelible information.

pwdisswordfish8 • 4 years ago

I don't like your proposed solution that uses the right to be forgotten as a blunt instrument to paper over serious problems, to use a mixed metaphor. When you appeal to the right to be forgotten to fix problems like "prospective employers comb over their embarrassing past", it's a way of throwing up your hands and neglecting deeper issues.

In a right-to-be-forgotten world, the way it would end up going is:

1. problematic potentates punish pitiable proles

2. someone invokes right to be forgotten

3. this is considered "good enough"

4. the problem conditions that allowed #1 to fester remain uncorrected

I feel this way about a lot of stuff these days (especially where the erosion of the tenets of a liberal society is involved), where people argue vociferously for a "solution" that can at best be considered an indirect way of handling the problem. You see this with a lot of contemporary calls for the dismantling of the tradition of free speech/free inquiry/freedom of association, for example. People end up chafing in the direction of proposals that have dual-use effects in the first instance and perniciously "null" effects in the second instance.

stared • 4 years ago

Most of the history we know is from the perspective of the wealthiest 0.1%. Even though the Internet is still biased towards the wealthier and more educated, having a history of the wealthiest 10% would be enormous progress.

X6S1x6Okd1st • 4 years ago

Historians spend a lot of time poring over minutiae from non-notable people. There is plenty about the world that doesn't seem worth writing down, but can be intuited from tangential texts.

The cost seems low enough to just keep it.

dogorman • 4 years ago

From what I understand, present day historians are sitting on a huge pile of cuneiform tablets that have yet to be transcribed or translated because there is much more material than there is interest/manpower.

Of course, to your point, they do keep it around. They don't just throw it in the trash.

jacobolus • 4 years ago

Only a tiny number of people in the world can read Sumerian or Akkadian, even for them the process is slow and error prone because we are missing a lot of the original context, and they have better things to do than skim through piles of delivery receipts.

* * *

It would be pretty neat if someone could figure out how to OCR all the cuneiform tablets and turn them into something searchable.

ignoramous • 4 years ago

> Random tweets, random blog posts, random personal web sites. Let them wither and die and be forgotten. Notable content by notable people? Sure.

Well, it isn't named the Internet Encyclopedia, for a reason.

> Without trying to be contrarian, I don't think that everything should be archived

It isn't contrarian. The deletionists are seemingly the majority. It is contrarian to in fact archive all. the. things.

generationP • 4 years ago

Who decides what "notable" is? I frequently use the Archive to find old academic grey lit (preprints, lecture notes, newsgroup posts, etc.). Much of it is on "random" blog posts and personal websites. Even the authors aren't usually notable by Wikipedia standards. Yeah, there is some PII on those pages, but also treasures of useful information.

spiritplumber • 4 years ago

The Internet Archive location is beautiful. it's a church that has been partially turned into a server farm. Big Neuromancer energy when you go inside and look.

DoingIsLearning • 4 years ago

I was too curious not to look this up:

https://www.atlasobscura.com/places/internet-archive-headqua...

N3cr0ph4g1st • 4 years ago

Did a hackathon there in 2016, it is so cool!

ignoramous • 4 years ago

A little-known bit of trivia: Apache Hadoop (and the multi-billion dollar open source big-data ecosystem it spawned) was first worked on at the Internet Archive [0].

Speaking of billions: According to Kahle, Alexa Internet's compute infrastructure informed Amazon's take on IaaS (AWS) [1].

Another perhaps lost nugget is that Amazon once funded (either in part or in full) the development of the Wayback Machine, the Internet Archive's most impactful product. In addition, to date (if I'm not mistaken) Amazon continues to donate data it fetches from Alexa Toolbar installations to the Wayback Machine.

[0] https://archive.is/Le3id

[1] https://archive.is/EnzHq

dleslie • 4 years ago

I love the Internet Archive; I worry that its utility will wane as content becomes more dynamic than static. What does it mean to archive the experience of scrolling through a social feed?

petertodd • 4 years ago

The paid, legal-oriented, archiving service Perma.cc that Harvard Law runs actually lets you upload your own PDFs and screenshots in addition to allowing Perma.cc's bots capture webpages. Of course, since you could upload anything the difference is made clear in the UI.

In a legal context, simply attesting to the validity of a screenshot is really common. So when that functionality is used Perma.cc is operating more as a permanent file storage service than a trusted archive.

Regardless, this does go a long way to solving the problem of dynamic sites.

cxr • 4 years ago

> actually lets you upload your own PDFs and screenshots in addition to allowing Perma.cc's bots capture webpages

FWIW: the Wayback Machine is just one part of the Internet Archive. The quoted bit accurately describes things you can do with an archive.org account, too. Readers here may be familiar with the archive.org-affiliated effort by a team specifically working to recreate the playability of old PC (and otherwise) video games with JSMESS.

> this does go a long way to solving the problem of dynamic sites

Maybe, but the "dynamic" aspect that I'm sure the other person had in mind doesn't have much to do with the D in DHTML so much as it has to do with the dynamism that arises when you have a smart server responding to requests from a fat, JS-powered frontend. It would be possible to accurately model this in and execute it from a series of static assets, in some cases, but it's rarely done.

Even many sites built with static site generators today are not going to be usable in the future. There's too much tight coupling to the environment/deployment configuration and not enough semantic richness to properly hint to the crawler what resources are necessary to archive. In the heyday of XML, it used to be a big deal to strive for machine-readable documents. Today's resume-driven-development-obsessed webdevs effectively cast a vote of no confidence even in HTML, doing an end-run around it daily, and figuratively holding up a middle finger to the Principle of Least Power.

https://www.w3.org/DesignIssues/Principles.html#PLP

To some extent, even a bunch of the projects associated with TBL's Solid initiative are guilty of doing the same.

musicale • 4 years ago

> This library would have all the published works of humankind. This library would be available not only to those who could pay the $1 per minute that LexusNexus charged, or only at the most elite universities. This would be a library available to anybody, anywhere in the world. Could we take the role of a library a step further, so that everyone’s writings could be included–not only those with a New York book contract? Could we build a multimedia archive that contains not only writings, but also songs, recipes, games, and videos?

For every Sci-Hub trying to create the library of Alexandria, there's an Elsevier trying to burn it down.

Current copyright law is largely on the side of the arsonists rather than the archivists.

(note: recipes are not copyrighted, though cookbooks are)

akkartik • 4 years ago

I wish IA hosted Usenet archives. Even if they stopped at year 2000 and didn't update further.

cxr • 4 years ago

Anything particular about Usenet that makes it attractive? Modern mailing lists (and even ones that are now dead but accessible) are a treasure trove of information, and I've often privately mulled over my concern that there doesn't seem to be a concerted effort to save them. Having tried to track down old copies of the Linux mailing lists, and having run into indicators of others' quests (and dead ends), I know how fragile the situation is.

As I've recently come to understand, the Internet Archive itself used to have its own mailing lists for handling discussion, which interestingly enough seem to no longer be accessible—perhaps even lost.

generationP • 4 years ago

They host some: https://archive.org/details/usenet

As far as my own old posts are concerned, it looks complete :) But it isn't easily findable or searchable; the intended way of interaction is apparently to download an entire hierarchy and grep.

akkartik • 4 years ago

Oh good to know! That's adequate.

ZeroGravitas • 4 years ago

This is a bit of a tangent, but there was mention of non-advertising based funding models.

Is anyone working on an advertising model that achieves the basic goal of advertising, but without the centralising aspect which seems to be the root cause of many of the issues? Giant monopolies are always going to subvert regulation but the same industry as disconnected units might be easier to police. Obviously you can just try to split them up or limit their size with regulation after the fact but a good technical basis might help out.

Does web advertising just not make sense unless you can amass lots of private user data and track people across the web? If so can we subcontract that data to smaller companies we can trust with our data and effectively punish if they break the rules?

coldpie • 4 years ago

> Does web advertising just not make sense unless you can amass lots of private user data and track people across the web?

It does make sense, but it has to compete with the invasive-style advertising, and it will always lose. If you want the "good advertising" you have to kill the "bad advertising".

soheil • 4 years ago

For anyone interested I made a quick and dirty way to pull up the archive of any URL by prefixing it with arxiv.link

e.g.,

http://arxiv.link/https://news.ycombinator.com/

fouc • 4 years ago

Nice, thanks! Ever since I started using archive.is last year, I've always wanted something like this! Could be paired with a bookmarklet too.

javascript:(function(){window.open('http://arxiv.link/'+location.href)})();

alexislours • 4 years ago

You have always been able to do this with archive.is; I’ve been using this bookmarklet for quite some time.

javascript: (() => { window.open("https://archive.is/" + window.location.href, '_blank')})();

fouc • 4 years ago

I meant I wanted to go directly to the most recent result in the wayback engine for the given page.

alexislours • 4 years ago

In this case you can just use the following and it will redirect to the latest archive without being dependent on a 3rd party website:

javascript: (() => { window.open("https://web.archive.org/web/" + window.location.href, '_blank')})();

codethief • 4 years ago

archive.is is not related to the Internet Archive's Wayback Machine, though, is it?

fouc • 4 years ago

It's not related, but it will give you a link there if it doesn't have the page archived.

johtso • 4 years ago

Is there an advantage compared to just putting `web.archive.org/web/` before the url?

fouc • 4 years ago

Oh interesting, http://web.archive.org/web/https://news.ycombinator.com also works, didn't know that.
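Putting the prefix trick into a small helper, as a sketch: with no date it builds the same URL as the bookmarklets above, and with a date it inserts a 14-digit `YYYYMMDDhhmmss` timestamp, which the Wayback Machine resolves to the capture nearest that time. The function name is made up for illustration:

```javascript
// Build a Wayback Machine lookup URL for a page.
// Without `when`, the URL redirects to the most recent capture.
// With `when` (a Date), it asks for the capture closest to that moment.
function waybackUrl(pageUrl, when) {
  const base = "https://web.archive.org/web/";
  if (!when) return base + pageUrl;

  const pad = (n, width = 2) => String(n).padStart(width, "0");
  const stamp =
    pad(when.getUTCFullYear(), 4) +
    pad(when.getUTCMonth() + 1) + // getUTCMonth() is zero-based
    pad(when.getUTCDate()) +
    pad(when.getUTCHours()) +
    pad(when.getUTCMinutes()) +
    pad(when.getUTCSeconds());
  return base + stamp + "/" + pageUrl;
}

console.log(waybackUrl("https://news.ycombinator.com/"));
console.log(waybackUrl("https://news.ycombinator.com/", new Date(Date.UTC(2021, 10, 2))));
```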

soheil • 4 years ago

Didn't know that, thanks. I guess just shorter.

ipsum2 • 4 years ago

a little confusing since arxiv refers to a popular research paper archive. Useful project though!

causi • 4 years ago

As a question related to archival, what's the best tool for local archiving? HTTRACK is getting long in the tooth and it just doesn't work for all the dynamic content on modern web pages.

ovebepari • 4 years ago

Brewster Kahle mentions his goal of creating “a library available to anybody, anywhere in the world.”

Fun fact: Archive.org is blocked in Bangladesh for god knows what reasons.

nonbirithm • 4 years ago

Isn't the publishers' lawsuit still going forward in November of this year? It would be terrible if the Archive had to shut down because they decided to do an unnecessary experiment with copyright while simultaneously being tied to an irreplaceable, yet centralized, historical resource.

I donate to them with the hopes that they won't try to do anything that carries that kind of risk to their continued existence again.

TekMol • 4 years ago

How can IA just go out there and copy+republish the content of others and get away with it?

Aren't they breaching copyright on a massive massive scale?

ghaff • 4 years ago

Because the vast majority of people don't care if someone archives something they've made public. And the Internet Archive bends over backwards (too far, to some) with robots.txt to exclude anything someone wants removed from public view.

ArtDev • 4 years ago

I think we need a new src attribute that can have a fallback. There are so many parts of the internet where most of the links are dead. It's pretty sad actually.

ezequiel-garzon • 4 years ago

This library would be available not only to those who could pay the $1 per minute that LexusNexus charged

Any idea on what LexusNexus is, or was back then? Thanks!

fumeux_fume • 4 years ago

I think it's a reference to LexisNexis, which still exists today in a similar capacity. Back in the day (1990s), it was an online service to search for and read virtually any legal or news article, among many other things. It was very expensive, but was usually accessed through a corporate or educational institution's license.

Mountain_Skies • 4 years ago

If you live in the US, they have a file on you and it's probably very thick. A credit report will run a couple of pages long, the LexusNexus report on me (a complete nobody) cost them about $8 to mail. It contained many errors, mostly about property I don't own but also about an insurance claim I never made.

Like your credit report, you can get a free copy by writing them and requesting a copy. IIRC when I did it a few years ago, I had to make the request in writing, I wasn't able to order it online, at least not for free.

jacquesm • 4 years ago

They ought to be liable for those errors.

lifekaizen • 4 years ago

Very expensive research tool places like large law firms would use.

Teever • 4 years ago

https://en.wikipedia.org/wiki/LexisNexis

endisneigh • 4 years ago

I'm curious - how does Archive.org get around DMCA? I assume it's because Fair Use and the fact that they're a not-for-profit, but more details would be great.

Other sites like outline.com (which I guess is a for-profit entity) don't really allow you to get around paywalls the way the Wayback Machine does.

As someone interested in building a site that gets around paywalls for semi-educational purposes I'm curious if anyone has details!

pabs3 • 4 years ago

IA are a designated library, which confers a set of privileges under US copyright law.

DoingIsLearning • 4 years ago

Yes as a non-US based user this is one of the most amazing things for me with IA.

In my head IA was just the Wayback Machine with cached pages. But during the pandemic I realized that there is a plethora of actual books that one can check out with far less friction than in a standard library.

It is such a cool concept. There are also IA satellite projects (not sure if they are owned by or in partnership with IA) for different non-English languages, e.g. arquivo.pt, so you can have the same plethora of content in other languages as well.

ghaff • 4 years ago

A very limited set of privileges. And AFAIK it's only a "designated library" in California. I'm not sure there is such a thing at the federal level other than the Library of Congress.

Basically, most people are fine with what the Wayback Machine does and they'll take down any mirror that the domain owner asks them to.

toomuchtodo • 4 years ago

Takedowns cause the content to be “darked.” It’s no longer publicly available, but still archived on disk until a future date.

joe_the_user • 4 years ago

My impression is that archive.org follows whatever directions are in robots.txt. If those directions allow a site to be archived, they do it, otherwise they don't. That's from looking at not-archived sites such as ezboard, which I mention in another post.

Santosh83 • 4 years ago

And what about sites that do not have robots.txt? Does IA snapshot those?

nlitened • 4 years ago

Yes. By default, if you put information on the web, it is assumed that you are okay with people seeing it.

Thoreandan • 4 years ago

It's a shame that the mirror in Alexandria appears to be long abandoned.

shrubby • 4 years ago

Pictures and touchscreens have ruined the internet ;-)

IfOnlyYouKnew • 4 years ago

Big fan of the archive, especially openlibrary. But the lofty talk of “democratizing” this-or-that is a bit overblown: in authoritarian countries, the archive does zero democratizing, because it’s either just not available, or all the relevant individual documents are blocked.

For democratic countries that have democratically elected to restrict the distribution of some materials the society considers harmful, such as Germany with Nazi propaganda, the archive happily decides to undermine those laws, which are clear, longstanding, and (disproving the single argument for free speech absolutism) have not slippery-sloped anywhere over decades. Why? Because laws you disagree with are, apparently, illegitimate.

It probably helps their bold defense of all that is holy to intimately know that these really are democratic countries, which aren’t going to just send a wet team to dismember them, Saudis-style, or to spend millions on an elaborate plan whose only purpose is to let you live for another month, with a clear mind that has complete certainty that you will die, and who did it.

The anti-semitism and racism on archive.org, plus some copyright violations isn’t a byproduct of their “freedom”. It’s all of it. There are plenty of free hosts for video or documents, and an hour at minimum wage would pay for hosting quite a lot for quite a while for most of the archive’s audience. But the killer feature is immunity, through anonymity and DMCA’s Safe Harbour.

Sure, everyone here is only defending “free speech” and would never agree with the swastika-fetishists. Only, somehow, they never complain about ISIS having a hard time on Twitter, or porn being censored on Facebook. It’s the scans of Der Stürmer, especially of the 38 to 44 vintage, that are the chosen symbols of “democratization”.

[0]: https://www.sfgate.com/california-politics/article/Internet-...

textfiles • 4 years ago

"Big Fan" is doing a lot of heavy lifting there.

ZeroGravitas • 4 years ago

I was intrigued by this comment, but couldn't really figure out what was being claimed so read the linked article.

It's still not 100% clear but elements of it include:

- Historical racist texts (Mein Kampf, 1930s newspapers) which I feel on balance is a good thing to preserve in a library. I would assume having those papers available does more to combat fascism than to encourage it (though who knows really)

- archives of websites that are dodgy (ISIS and Neo-Nazi are mentioned but there must be all sorts of crazy and or bad stuff on the web that gets archived)

- following US law rather than local law (a general conundrum for internet sites like this, especially if you do it for countries whose laws you like but not for other countries)

- providing a place for people to upload and share files anonymously

So yeah, I'm not really a Free Speech absolutist myself (doesn't seem like many that claim they are actually believe it when it comes to things they disagree with) but doesn't feel like they're in the same boat as social media platforms who actively spread bad things if it increases ad impressions. Some similar issues around policing large amounts of content and dealing with different legal, political and moral frameworks at scale though.

TedDoesntTalk • 4 years ago

How many books have you read that were published in the 1800s?

I’m betting close to zero.

Unfortunately, most people 200 years from now won’t care about the 70 petabytes the Internet Archive has saved.

Don’t misunderstand: I am glad they do this and love their work. I just think we overestimate the long-term value of this info beyond a very small set of future historians or social historians.

Most people have their lives to live in this moment, and if they have a chance to look backwards before they were born, it’s not a big piece of their time.

wolverine876 • 4 years ago

Jane Austen, Dickens, Melville, Oscar Wilde, Mark Twain, Tolstoy, Emily Dickinson, Dostoevsky, Emerson, Charlotte Brontë, Lewis Carroll, Victor Hugo, Grimm bros., Goethe, Darwin, Ibsen, Nietzsche, Pushkin ...

It's a pretty good century for literature and other books.

Also, lots of people are interested in history. Those can be best sellers.

Andrew_nenakhov • 4 years ago

You must be biased against the French, mentioning only Hugo in your list! What about Dumas, Stendhal, Balzac, Zola, Flaubert, Maupassant, George Sand, Verlaine, Rimbaud, Valery???!

(That's only from a Russian, pretty sure actual French will add dozens more to this list)

wolverine876 • 4 years ago

And that's just France. Imagine how long this list could be ...!

someguy101010 • 4 years ago

I was reading every newspaper that mentioned Spanish flu from the 1910s when COVID was starting. I never would have thought that I would be wishing for easier access to full-text-searchable newspaper archives. Here we are :p

SilverRed • 4 years ago

How many times have you clicked a link and found that the page 404s or redirects you somewhere else? That is the real value of the web archive to me. Wikipedia uses it a lot as well: they automatically snapshot any link used as a reference, and if the page goes away they automatically swap the link for a web archive version.

Reading a reference 50 or even 200 years old is not absurd. A post detailing some research findings which is referenced in Wikipedia is still greatly valuable. YouTube historians often reference ancient patents to uncover the history of old items.

techbio • 4 years ago

Interesting point about managing against link rot.

But what got me is YouTube historians and their "ancient patents..." I want to see the ones for reinventing the wheel.

instagraham • 4 years ago

Quite a spectacularly wrong take on multiple levels.

So much great literature comes from centuries before our own. And considering that the internet is likely to be around forever or as long as humans persist, a snapshot of its initial decades will one day be one of the greatest "archaeological" treasures.

Perhaps your main point is that you do not care for their work nor for the work of literature written before your time. You need not apply your yardstick to anything else in a bid to gauge its value.

TedDoesntTalk • 4 years ago

But 70 petabytes worth? Most people in this thread mention a few dozen authors. Maybe this amounts to a few hundred books. Not 70 petabytes (and counting).

allturtles • 4 years ago

Sure, no human has ever read 70 petabytes worth of books.

An archive is not the same as a local public library. The latter holds a small collection of mostly frequently accessed items (e.g. the published works of Dickens, Austen, etc.). The former holds a much larger collection of rarely accessed items (e.g. every letter written by/to Dickens that survived, every political pamphlet published in Philadelphia in the nineteenth century, etc.).

If your point is that most items in the archive will be rarely accessed, I don't think anyone will disagree with you, but suggesting that the literature of the nineteenth century is no longer of any interest was perhaps not the best way of making that point.

kart23 • 4 years ago

I hope you're wrong. Video and pictures coupled with text are a goldmine for me. I would've loved to see a 'vlog' from the 1800s honestly. What was it like for a perfectly normal person, not a career writer? I don't think we have a lot of that, or if we do, it's from a singular viewpoint, and we're subject to their view and biases.

I found an archive of videos in my city from 1970, a street-view like recording of select roads. I pored over it for a couple hours, noting the buildings that were still there, the completely empty hills now filled with houses, etc. That kind of stuff is really cool.

dannyobrien • 4 years ago

How many books have you read that were written by people who read books published in the 1800s?

TheCowboy • 4 years ago

> How many books have you read that were published in the 1800s? I’m betting close to zero.

Quite a few actually, and I'm not an outlier. Plus there are many adaptations and derivative works that exist.

mdp2021 • 4 years ago

>I’m betting

You are betting very wrongly. People use large amounts of older literature. Maybe not in your territory - well, be aware then that many cultures do.

>Unfortunately, most people

That the median individual should be considered a parameter is very controversial. (Contextually: services can very easily exist for interested minorities.)

>overestimate the long-term value

As if Project Gutenberg had not arguably been one of the most important endeavours in history.

>if they have a chance to look backwards

It is a fundamental part of education...

loughnane • 4 years ago

Just add a few more that are popular

- Ralph Waldo Emerson

- Henry David Thoreau

- Rudyard Kipling (jungle book)

- Anna Sewell (black beauty)

- Walt Whitman (leaves of grass)

- Edgar Allan Poe

- Alexandre Dumas (Count of Monte Cristo, musketeers)

- Tocqueville (democracy in America)

5etho • 4 years ago

When I read Balzac's novel Nucingen Bank (1839) [1] for the first time and learned about the business 'mindset' of the time, I was mesmerized.

[1] https://pl.wikipedia.org/wiki/Bank_Nucingena (no English-language wiki article, or at least no hyperlink)

dleslie • 4 years ago

19th century literature is full of treasures.

jfoutz • 4 years ago

Eh, Shelley, Dickens, Twain and Wells are all I can think of. Not zero, but a vanishingly small percentage.

kilroy123 • 4 years ago

I've read several from that time period. There are loads of good books from the 1800s.

TedDoesntTalk • 4 years ago

Yes. But 70 petabytes worth?

StrictDabbler • 4 years ago

70 petabytes of anything didn't exist in stored form in the 1800s so that's a ridiculous standard, but...

The value here is genuinely historical. In a hundred years, how will we track the etymology of common terms that originated in this age?

Memes make preserving the internet extremely important. Terms and ideas evolve so quickly that the history of language and thought will become obscure almost instantly. Even now it can be almost impossible to understand some internet terms if you weren't part of the subculture that spawned them at the exact time they were spawned.

Do you know what hunter2 means? Do you know it because of bash.org?

A person doesn't have to read all this material. The material has to be stored because our future society will have descended from this material, and if they don't have it they won't know how they got there.

You know, except for the bit where civilisation collapses over the next hundred years as the planet warms and hot countries invade cooler, developed countries looking for living space.

Mortiffer • 4 years ago

Via LibriVox, I and many others consume 1800s content almost daily.