
Dolt is Git for Data: a SQL database that you can fork, clone, branch, merge

752 points | 3 years ago | github.com
zomglings3 years ago

If anyone from the dolt team is reading this, I'd like to make an enquiry:

At bugout.dev, we have an ongoing crawl of public GitHub. We just created a dataset of code snippets crawled from popular GitHub repositories, listed by language, license, github repo, and commit hash and are looking to release it publicly and keep it up-to-date with our GitHub crawl.

The dataset for a single crawl comes in at about 60GB. We uploaded the data to Kaggle because we thought it would be a good place for people to work with the data. Unfortunately, the Kaggle notebook experience is not tailored to such large datasets. Our dataset is in a SQLite database. It takes a long time for the dataset to load into Kaggle notebooks, and I don't think they are provisioned with SSDs as queries take a long time. Our best workaround to this is to partition into 3 datasets on Kaggle - train, eval, and development, but it will be a pain to manage this for every update, especially as we enrich the dataset with results of static analysis, etc.

I'd like to explore hosting the public dataset on Dolthub. If this sounds interesting to you please, reach out to me - email is in my HN profile.

zachmu3 years ago

We'll be in touch :)

st1ck3 years ago

You can just try converting it into Parquet with ZSTD compression (and set a high enough compression level). You can even fiddle with the dictionary settings: https://arrow.apache.org/docs/python/generated/pyarrow.parqu...

st1ck3 years ago

A dataset similar to yours (GitHub data) which you can query using ClickHouse: https://gh.clickhouse.tech/explorer/

justinclift3 years ago

Yeah, that sized database is likely to be a challenge unless the computer system it's running on has scads of memory.

One of my projects (DBHub.io) is putting effort towards working through the problem of larger SQLite databases (~10GB), and that's mainly through using bare metal hosts with lots of memory, e.g. 64GB, 128GB, etc.

Putting the same data into PostgreSQL, or even MySQL, would likely be much more efficient memory wise. :)

zachmu3 years ago

We have 200 GB databases in dolt format that are totally queryable. They don't work well for querying on the web though - you need a local copy to query them effectively. Making web query as fast as local is an ongoing project.

justinclift3 years ago

Yeah the "on the web" piece is the thing we're talking about. :)

200GB databases for PostgreSQL (etc.) aren't any kind of amazing.

tomcam3 years ago

Uh, personal question here. Where does your ~10G number come from? I pretty much run my life on the Apple Notes app. My Notes database is about 12G and now I’m scared.

zomglings3 years ago

Can't beat SQLite for distribution as a public dataset, though.

StreamBright3 years ago

You have other options too. If I have time I can try to reduce the size with a columnar format that is designed for this use case (repeated values, static dataset).

zomglings3 years ago

That would be really great. Let me know if there's any way we can help. Maybe if we released a small version of the dataset for testing/benchmarking and then I could take care of running the final processing on the full dataset?

StreamBright3 years ago

That would be amazing. I get my internet back tomorrow and I can play with the dataset to see how much we could optimize.

laurent923 years ago

Wordpress would have benefited from this.

What a lot of webmasters want is to test the site locally, then merge it back. A lot of people turned to Jekyll or Hugo for the very reason that it can be checked into git, and git is reliable. A static website can't get hacked, whereas anyone who has been burnt by a Wordpress security failure knows they'd prefer a static site.

And even more: People would like to pass the new website from the designer to the customer to managers — Wordpress might not have needed to develop their approval workflows (permission schemes, draft/preview/publish) if they had had a forkable database.

berkes3 years ago

At the root of this lies the problem that content, configuration and code are stored in one blob (a single db), never clearly bounded nor contained. Config is spread over tables. User-generated content (from comments to orders) is mixed up with editorial content: often even in the same tables, sometimes even in the same column. Drupal suffers the same.

What is needed is a clear line between configuration (and logic), editorial content, and user-generated content. Those three have very distinct lifecycles. As long as they are treated as one, the distinct lifecycles will conflict. No matter how good your database merging is: the root problem is a missing piece of architecture.

Hallucinaut3 years ago

Very well summarised. I can see clearly you too have tried to apply normal software lifecycle principles and come out the other end with a rather dismal impression of WordPress.

TimTheTinker3 years ago

Sure, but Wordpress is still running PHP files for every page load and back-end query. Dolt would help offload some of the code’s complexity, but that would still leave a significant attack surface area.

In other words, by itself Dolt couldn’t solve the problem of Wordpress not being run mostly from static assets on a server (plus an API).

kenniskrag3 years ago

they have a theme preview now. :)

jimsmart3 years ago

The parent post here is speaking to the content stored in the database, not the templates on the filesystem. Dolt enables one to merge, push and pull data, just as one would with files and git.

2malaq3 years ago

I read it as sarcasm.

dang3 years ago

Some related past threads:

Dolt is Git for data - https://news.ycombinator.com/item?id=22731928 - March 2020 (191 comments)

Git for Data – A TerminusDB Technical Paper [pdf] - https://news.ycombinator.com/item?id=22045801 - Jan 2020 (5 comments)

Ask HN: Would you use a “git for data”? - https://news.ycombinator.com/item?id=11537934 - April 2016 (10 comments)

LukeEF3 years ago

thanks for the mention of TerminusDB (https://github.com/terminusdb/terminusdb) -> we are the graph cousins of Dolt. Great to see so much energy in the version control database world!

We are currently focused on data mesh use cases. Rather than trying to be GitHub for Data, we're trying to be YOUR GitHub for Data. Get all that good git lineage, pull, push, clone etc. and have data producers in your org own their data.

We see lots of organizations with big 'shadow data' problems and data being centrally managed rather than curated by domain experts.

scottmcdot3 years ago

Dolt might be good but never underestimate the power of Type 2 Slowly Changing Dimension tables [1]. For example, if you had an SSIS package that took CSVs and imported them into a database, and one day you noticed it accidentally rounded a value incorrectly, you could fix the data and retain traceability of the data which was there originally.

E.g., SSIS package writes row of data: https://imgur.com/DClXAi5

Then a few months later (on 2020-08-15) we identify that trans_value was imported incorrectly so we update it: https://imgur.com/wdQJWm4

Then whenever we SELECT from the table we always ensure we are extracting "today's" version of the data:

    select * from my_table where current_date between effective_from and effective_to

[1] https://en.wikipedia.org/wiki/Slowly_changing_dimension
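To make the flow concrete, here is a rough sketch of that correction in plain SQL - the table name, key, and values are invented for illustration and aren't taken from the screenshots above:

    -- Illustrative SCD2 table (name and key column are made up).
    CREATE TABLE trades (
        trade_id       INT,
        trans_value    DECIMAL(18, 2),
        effective_from DATE,
        effective_to   DATE
    );

    -- 2020-08-15: close out the row that was imported with the bad rounding...
    UPDATE trades
    SET    effective_to = '2020-08-14'
    WHERE  trade_id = 42
      AND  effective_to = '9999-12-31';

    -- ...and insert the corrected value as the new current version.
    INSERT INTO trades (trade_id, trans_value, effective_from, effective_to)
    VALUES (42, 1234.57, '2020-08-15', '9999-12-31');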
sixdimensional3 years ago

I definitely agree, just tossing in the superset concept that Dolt and Type 2 SCD involve - temporal databases [1].

I think the idea of a "diff" applied to datasets is quite awesome, but even then, we kind of do that with databases today with data comparison tools - it's just most of them are not time aware, rather they are used to compare data between two instances of the data in different databases, not at two points in time in the same database.

[1] https://en.wikipedia.org/wiki/Temporal_database
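For instance, a crude one-off "diff" between two snapshots of the same table can be done with set operations, assuming a database that supports EXCEPT and two illustrative snapshot tables:

    -- Rows added or changed in the new snapshot...
    SELECT * FROM customers_new
    EXCEPT
    SELECT * FROM customers_old;

    -- ...and rows removed or changed since the old snapshot.
    SELECT * FROM customers_old
    EXCEPT
    SELECT * FROM customers_new;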

skybrian3 years ago

The commit log in Dolt is edit history. (When did someone import or edit the data? Who made the change?) It's not about when things happened.

To keep track of when things happened, you would still need date columns to handle it. But at least you don't need to handle two-dimensional history for auditing purposes. So, in your example, I think the "effective" date columns wouldn't be needed.

They have ways to query how the dataset appeared at some time in the past. However, with both data changes and schema changes being mixed together in a single commit log, I could see this being troublesome.

I suppose writing code to convert old edit history to the new schema would still be possible, similar to how git allows you to create a new branch by rewriting an existing one.

zachmu3 years ago

If all you want is AS OF semantics, then SCD2 is a great match. Used it a ton in application development myself.

Dolt actually makes branch and merge possible, totally different beast.

antman3 years ago

Couldn't something like that work with SQL:2011 temporal tables?

There's a Postgres-equivalent solution in this [0] rather complex but great tutorial.

[0]: https://clarkdave.net/2015/02/historical-records-with-postgr...

crazygringo3 years ago

This is absolutely fascinating, conceptually.

However, I'm struggling to figure out a real-world use case for this. I'd love if anyone here can enlighten me.

I don't see how it can be for production databases involving lots of users, because while it seems appealing as a way to upgrade and then roll back, you'd lose all the new data inserted in the meantime. When you roll back, you generally want to roll back changes to the schema (e.g. delete the added column) but not remove all the rows that were inserted/deleted/updated in the meantime.

So does it handle use cases that are more like SQLite? E.g. where application preferences, or even a saved file, winds up containing its entire history, so you can rewind? Although that's really more of a temporal database -- you don't need git operations like branching. And you really just need to track row-level changes, not table schema modifications etc. The git model seems like way overkill.

Git is built for the use case of lots of different people working on different parts of a codebase and then integrating their changes, and saving the history of it. But I'm not sure I've ever come across a use case for lots of different people working on the data and schema in different parts of a database and then integrating their data and schema changes. In any kind of shared-dataset scenario I've seen, the schema is tightly locked down, and there's strict business logic around who can update what and how -- otherwise it would be chaos.

So I feel like I'm missing something. What is this actually intended for?

I wish the site explained why they built it -- if it was just "because we can" or if projects or teams actually had the need for git for data?

zachmu3 years ago

The application backing use case is best suited for when you have parts of your database that get updated periodically and need human review. So you have a production database that you serve to your customers. Then you have a branch / fork of that (dev) that your development team adds batches of products to. Once a week you do a data release: submit a PR from dev -> prod, have somebody review all the new copy, and merge it once you're happy. If there's a big mistake, just back it out again. We have several paying customers building products around this workflow.

As for lots of people collaborating on data together, we have started a data bounties program where we pay volunteers to assemble large datasets. Two have completed so far, and a third is in progress. For the first one, we paid $25k to assemble precinct-level voting data for the 2016 and 2020 presidential elections. For the second, we paid $10k to get procedure prices for US hospitals. You can read about them here:

https://www.dolthub.com/blog/2021-02-15-election-bounty-revi...

https://www.dolthub.com/blog/2021-03-03-hpt-bounty-review/

What's cool is that novices can make a really good income from data entry as a side gig, and it's two orders of magnitude cheaper than hiring a firm to build data sets for you.

You're right that the site is kind of vague about what dolt is "for." It's a really general, very multi-purpose tool that we think will get used a lot of places. Here's a blog we wrote a while back about some of the use cases we envision.

https://www.dolthub.com/blog/2020-03-30-dolt-use-cases/

fiedzia3 years ago

This won't work for the usual database use cases. This is meant for interactive work with data, the same way you work with code. Who needs that?

Data scientists working with large datasets. You want to be able to update data without redownloading everything. Also, make your local changes (some data cleaning) and propose your updates upstream the same way you would with git. Having many people working interactively with data is common here.

One of the companies I work with provided a set of data distributed to their partners on a daily basis. Once it grew larger, downloading everything daily became an issue. So that would be desirable.

I have a large data model that I need to deploy to production and update once in a while. For code, network usage is kept to a minimum because we have git. For data, options are limited.

As with git, it is something that once you have it, you will find a lot of use cases that make life easier and open many new doors.

saulrh3 years ago

Here's one thing I'd have used it for: Video game assets.

Say you have a tabletop game engine for designing starships. Different settings have different lists of parts. Some settings are run by a game's DM, some are collaborative efforts. I ended up saving the lists of parts in huge JSON files and dumping those into git. However, for much the same reason that data science is often done in a REPL or notebook type interface, it turned out that by far the most efficient way for people to iterate on assets was to boot up the game, fiddle with the parts in-engine until things looked right, then replicate their changes back into the JSON. With this, we could just save the asset database directly.

The same reasoning should hold for effectively any dataset which a) can be factored into encapsulated parts, b) isn't natively linear, and c) needs multiple developers. Game assets are one example, as I described above. Other datasets where that holds: ML training/testing sets, dictionaries, spreadsheets, catalogs, datasets for bio papers.

curryst3 years ago

I might look at it for work. Compliance requires us to keep track of who made what change when, and who it was approved by in case regulators need it.

Right now, this often means making an MR on Git with your snippet of SQL, getting it approved, and then manually executing it. This would let us bring the apply stage in, as well as avoid "that SQL query didn't do exactly what I expected" issues.

It's possible to do it within the SQL engine, but then I have to maintain that, which I would prefer not to do, as well as deal with the performance implications of that.

martincolorado3 years ago

I see a use-case for public resource data sets. For example, right of way, land use variance, permits, and land records. These are fundamentally public data sets that are often maintained by entities with limited budgets, and the potential (even incentive) for fraud is substantial. Also, there is significant value in being able to access the data set for analytical purposes such as real estate analysis.

zachmu3 years ago

That's a great use case, let us know how we can help. We have a discord if you want to chat about it:

https://discord.com/invite/RFwfYpu

sixdimensional3 years ago

I am not associated with Dolt, but I really like the idea of Dolt personally. I do see use cases, but not without challenges.

One of the main use cases you can see them targeting, and that I think makes a ton of sense, is providing tools for collecting, maintaining and publishing reference data sets using crowd sourcing.

For example, they are doing this with hospital charge codes (a.k.a. chargemaster data). Hospitals in the US are required to publish this data for transparency; however, I have never seen a single aggregated national (or international) data set of all these charges. In fact, such a data set could be worth a lot of money to a lot of organizations for so many reasons. I used to work in health insurance, gathering data from all kinds of sources (government rules/regs, etc.) and it was a lot of hard work, scraping, structuring, maintaining, etc.

This reference data can be used for analytics, to power table-driven business logic, machine learning - to help identify cost inequalities, efficiencies, maybe even illicit price gouging, etc. There are so many reference data sets that have similar characteristics... and "data marketplaces" in a way are targeted at making "private" reference data sets available for sale - so then where is the "open" data marketplace? Well, here you go.. Dolt.

I have often realized that the more ways we can make things collaborative, the better off we will be.

Data is one of those things where, coming up with common, public reference datasets is difficult and there are lots of different perspectives ("branches"), sometimes your data set is missing something and it would be cool if someone could propose it ("pull request"), sometimes you want to compare the old and new version of a dataset ("diff") to see what is different.

One difficult thing about Dolt is, it will only be successful if people are actually willing to work together to cook up and maintain common data sets collaboratively, or if those doing so have an incentive to manage an "open data" project on Dolt as benevolent maintainers, for example. But, I could say then it has the same challenges as "open source" in general, so therefore it is not really that different.

Dolt could even be used as a foundation for a master data management registry - in the sense of you could pop it in as a "communal data fountain" if you will where anybody in your org, or on the Web, etc. could contribute - and you can have benevolent maintainers look over the general quality of the data. Dolt would be missing the data quality/fuzzy matching aspect that master data tools offer, but this is a start for sure.

For example, I work in a giant corporation right now. Each department prefers to maintain its own domain data and in some cases duplicates common data in that domain. Imagine using Dolt to make it possible for all these different domains to collaborate on a single copy of common data in a "Dolt" data set - now people can share data on a single copy and use pull requests, etc. to have an orderly debate on what that common data schema and data set should look like.

I think it's an idea that is very timely.

P.S. Dolt maintainers, if you read this and want to talk, I'm game! Awesome work :)

LukeEF3 years ago

Your final example is very like the data mesh ideas coming out of ThoughtWorks and elsewhere (https://martinfowler.com/articles/data-mesh-principles.html). Data products being owned by data producers and common data models ONLY when required. It is as much an organizational model as a technology, but I don't really think this maps to SQL tables. You probably want to look at a versioned knowledge graph of some kind. Downside is no SQL, upside is flexibility and speed. (disclaimer - I work over at TerminusDB, the graph version of Dolt)

sixdimensional3 years ago

You're absolutely right that what I'm saying is like data mesh.

The data mesh ideas aren't necessarily new - as this idea that data has been managed/owned by what some would call "federated teams" (optimistic view) and others might call "silos" (pessimistic view) isn't really new - especially in certain industries.

Certainly, the current energy and focus on building collaborative organizational service models, process, standards, etc. and standardizing this around the concept of data mesh is "fresh" and very much on the tip of many people's minds.

I worked at a company that had built a tool which was a data federation engine, and we dabbled with the concepts of "data fabric", "mesh", "network" etc. but we were more talking about tech/implementation vs. architecture and organizational structure. Still, we thought the right tech could enable this architecture/organizational structure and that it wasn't something totally figured out.

Thanks for sharing and mentioning it, and also Terminus. Certainly, I agree that knowledge graph is another approach and one that can also fit very well with the idea of linking data that is both common and specific to domains.

However, I've rarely seen graph databases that had the kind of versioning/time travel that a temporal database (such as relational ones) had built in natively (or anything like "diff").

I don't agree this can't map to SQL tables. Tables and relations (tuples) are just a form of data. People do collaborative data management using relational-based master data management systems already today and have been for many, many years.

That is not to say that graphs aren't a good match for storing common/reference datasets or linking data together. Given that graphs can be considered a superset of relational model, I don't see any reason that you can't do the same thing in graph databases, just that doing diffs and enabling the same kind of workflow might be more difficult in a graph data model - but definitely not impossible.

Let's also not forget things like "Linked Data" [1], triple stores, etc., which have been trying to get at linked data across the entire Internet. However, I never saw collaborative data management as part of that vision; it was mostly for querying.

[1] https://en.wikipedia.org/wiki/Linked_data

zachmu3 years ago

We totally agree on all points! Come chat with us on our discord about how we can help your org solve its problems:

https://discord.com/invite/RFwfYpu

a-dub3 years ago

agree 100% this looks awesome for reference datasets.

not so sure how well it would work for live data sources that update with time as it could encourage people to apply more ad-hoc edits as opposed to getting their version controlled jobs to work 100%, but who knows, maybe that would be a net win in some cases?

jorgemf3 years ago

Machine Learning. I don't think it has many more use cases

sixdimensional3 years ago

Or more simply put, how about table-driven logic in general? It doesn't have to be as complex as machine learning. There are more use cases than just machine learning, IMHO.

jedberg3 years ago

Such as? I'm having difficulty coming up with any myself.

sixdimensional3 years ago

See my post earlier in this thread [1].

Yes you need reference data for machine learning, but the world isn't only about machine learning. You might want reference data for human-interpreted analytics, table-driven logic (business rule engines, for example), etc.

[1] https://news.ycombinator.com/item?id=26371748.

zachmu3 years ago

Say, network configuration.

justincormack3 years ago

I collected all the git for data open source projects I could find a few months back, there have been a bunch of interesting approaches https://docs.google.com/spreadsheets/d/1jGQY_wjj7dYVne6toyzm...

glogla3 years ago

This one seems to be missing: https://projectnessie.org/

justinclift3 years ago

It's also missing DBHub.io. ;)

chub5003 years ago

I've had a fairly long-term side project working on git for chronological data (data is a cause and effect DAG), know of anybody doing that?

michaelmure3 years ago

It might not be exactly what you are looking for, but git-bug[1] is encoding data into regular git objects, with merges and conflict resolution. I'm mentioning this because the hard part is providing an ordering of events. Once you have that you can store and recreate whatever state you want.

This branch[2], which I'm almost done with, removes the purely linear branch constraint and allows using full DAGs (that is, concurrent editing) while still providing a good ordering.

[1]: https://github.com/MichaelMure/git-bug [2]: https://github.com/MichaelMure/git-bug/pull/532

LukeEF3 years ago

Great list - I talked through it and what each tool does in an MLOps Community meetup: https://www.youtube.com/watch?v=r5uxntl_hWg

strogonoff3 years ago

You can also use Git for data!

It’s a bit slower, but smart use of partial/shallow clones can address performance degradation on large repositories over time. You just need to take care of the transformation between “physical” trees/blobs and “logical” objects in your dataset (which may not have 1:1 mapping, as having physical layer more granular reduces likelihood of merge conflicts).

I’m also following Pijul, which seems very promising in regards to versioning data—I believe they might introduce primitives allowing to operate on changes in actual data structures rather than between lines in files, like with Git.

Add to that a sound theory of patches, and that's a definite win over Git (or Dolt for that matter, which seems to be the same old Git but for SQL).

teej3 years ago

The fact that I can use git for data if I carefully avoid all the footguns is exactly why I don’t use git for data.

vvanders3 years ago

Nope, been there done that, no thanks.

Lack of locking for binary files, overhead > 1gb and all the shenanigans you need to do for proxy servers. There's better solutions out there but they aren't free.

strogonoff3 years ago

Would be very curious to hear more about issues with proxy servers (where were they required?), overheads (do you mean RAM usage?) and locking.

vvanders3 years ago

Sure keep in mind that my data is a little old but last time I peeked into the git LFS space it seemed like there were still a few gaps.

First, most of my background in this area comes from gamedev so YMMV if the same applies in your use cases.

For our usage we'd usually have a repo history size that crossed the 1TB mark and even upwards of 2-3TB in some cases. The developer sync was 150-200GB, the art sync was closer to 500-600GB and the teams were regularly churning through 50-100GB/week depending on where we were in production.

You need discipline specific views into the repo. It just speeds everything up and means that only the teams that need to take the pain have to. From a performance perspective Perforce blows the pants off anything else I've seen, SVN tries, but P4 was easily an order of magnitude faster to sync or do a clean fetch.

I've seen proxy servers done with git but it's usually some really hacky thing scripted together with a ton of duct tape and client-specific host overrides. When you have a team split across East Coast/West Coast (or another country) you need that proxy so that history is cached in a way that it only gets pulled in locally once. Having a split push/pull model is asking for trouble and last I checked it wasn't clear to me if stuff like git LFS actually handles locking cleanly across it.

From an overhead perspective git just falls over at ~1GB (hence git LFS, which I've seen teams use to varying degrees of success based on project size). The need to do shallow history and sidestep resolving deltas is a ton of complexity that isn't adding anything.

With a lot of assets, merging just doesn't exist and a DVCS totally falls over here. I've seen fights nearly break out in the hallway multiple times when two artists/animators both forgot to check out a file (usually because someone missed the metadata to say it's an exclusive-access file). With unmergeable binary files that don't get locked, your choice is who gets to drop 1-3 days of work on the floor when the other person blows away their changes to commit. If those changes span multiple interconnected packages/formats/etc. you have a hard fork that you can never bring back together.

There are a couple of other details but those are the large ones. Perforce worked incredibly well in this space but it is not cheap, and so I've seen teams try to go their own way with mixed success. I'll admit that you can't do a monorepo in P4 (and even tools like repo in the Android world have their problems too) but if you segregate your large business/product lines across P4 repos it scales surprisingly well.

Anyway, you may or may not hit any or all of this but I've yet to see git tackle a 1TB+ repo history well (and things like repo that use many mini-repos don't count in my book due to the lack of atomicity across submissions that span multiple repos).

graderjs3 years ago

I strike a balance with this by using git on JSON files, and I build the JSON files into a database (1 file per record, 1 directory per table, subdirectories for indexes). The whole thing is pretty beautiful, and it's functioning well for a user-account/access-management database I'm running in production. I like that I can go back and do:

`git diff -p` to see the users who have signed up recently, for example.

You can get the code, over at: https://github.com/i5ik/sirdb

The advantages of this approach are using existing unix tooling for text files, solid versioning, easy inspect-ability, and leveraging the filesystem's B-tree indexing as a fast index structure (rather than having to write my own B-trees). Another advantage is hardware-linked scaling. For example, if I use regular hard disks, it's slower. But if I use SSDs it's faster. And it should also be possible to mount the DB as a RAM disk and make it super fast.

The disadvantages are that the database side still only supports a couple of operations (like exact, multikey searches, lookup by ID, and so on) rather than a rich query language. I'm OK with that for now, and I'm also thinking of using skiplists in future to get nice ordering property for the keys in an index so I can easily iterate and page over those.

pradn3 years ago

Git is too complicated. It's barely usable for daily tasks. Look at how many people have to Google for basic things like uncommitting a commit, or cleaning your local repo to mirror a remote one. Complexity is a liability. Mercurial has a nicer interface. And now I see the real simplicity of non-distributed source control systems. I have never actually needed to work in a distributed manner, just client-server. I have never sent a patch to another dev to patch into their local repo or whatnot. All this complexity seems like a solution chasing after a problem - at least for most developers. What works for Linux isn't necessary for most teams.

Hendrikto3 years ago

> Git is too complicated. It's barely usable for daily tasks. Look at how many people have to Google for basic things like uncommitting a commit, or cleaning your local repo to mirror a remote one.

Cars are too complicated. They are barely usable for daily tasks. Look at how many people have to Google for basic things like changing a fan belt, or fixing a cylinder head gasket.

You can fill in almost anything here. Most tools are complicated. Yet you don't need to know their ins and outs for them to be useful to you.

yoavm3 years ago

To me it sounds like you're proving the exact opposite. I'd assume most car owners never need to change a fan belt themselves, while everyone who uses git daily needed at some point to revert a commit. "How to turn right" isn't huge on stackoverflow last time I checked...

ttz3 years ago

Git is used prolifically in the tech industry. What on earth are you talking about?

detaro3 years ago

Being needlessly complicated seldom stops the tech industry from using something as long as the complexity is slightly out of the way.

ttz3 years ago

"Barely usable for daily tasks"

Is a pretty strong statement, especially given many tech companies use it exactly for this purpose.

Git might have a learning curve, and sure, it's not the simplest. But "barely usable" is hyperbole in the face of actual evidence.

I'm not defending Git specifically; other VCSes are just as viable. The quoted statement seems a bit ridiculous.

strogonoff3 years ago

To me there’s some irony in that all insta-criticism of Git in responses to my comment presumably applies to a project that describes itself as “Git for data” and promises exact reproduction of all Git command behaviour—therefore suffering from the same shortcomings.

strogonoff3 years ago

Dolt boasts its likeness to Git as a feature. Does this mean it'll also be barely usable for daily tasks? Is it possible for a project to faithfully reproduce the entirety of Git's command interface and be less complicated than Git / not suffer from the same shortcomings?

I personally think Git isn’t that bad, once it’s understood. It could be counter-intuitive sometimes though (as an example, for the longest time I used Git without realizing it stores a snapshot of each file and diffs/deltas are only computed when required). Just trying to be pragmatic and not expecting a tool like Git to be entirely free of leaky abstractions.

rapjr93 years ago

We used git to store and manage data sets for a machine learning project involving chewing detection with audio data used in training. It was cumbersome and the huge datasets caused some problems with git (e.g., searches of our code base got really slow because the data was being searched also until we moved the data to a different repo). Something easier to use that could manage large datasets would be useful.

I wonder if Dolt could be used to create a clone of Apple's Time Machine. Seems like the basics are there.

Ericson23143 years ago

What people usually miss about these things is normal version control benefits hugely from content addressing and normal forms.

The salient aspect of relational data is that it's cyclic; this makes content addressing unable to provide normal forms on its own (unless someone figures out how to Merkle cyclic graphs!), but the normal form can still be made in other ways.

The first part is easy enough: store rows in some order.

The second part is more interesting: making the choice of surrogate keys not matter (quotienting it away). Sorting table rows containing surrogate keys depending on the sorting of table rows makes for some interesting bags of constraints, for which there may be more than one fixed point.

Example:

  CREATE TABLE Foo (
    a uuid PRIMARY KEY,
    b text,
    best_friend uuid REFERENCES Foo(a)
  );
DB 0:

  0 Alice 0

One reclusive Alice, best friends with herself. Just fine.

DB 1:

  0 Alice 1
  1 Alice 1

Two reclusive Alices, both best friends with the second one. The Alices are the same up to primary keys, but while primary keys are to be quotiented out, primary key equality isn't, so this is valid. And we have an asymmetry by which to sort.

DB 2:

  0 Alice 1
  1 Alice 0

Two reclusive Alices, each best friends with the other. The Alices are completely isomorphic, and one notion of normal forms would say this is exactly the same as DB 0: as if this is reclusive Alice in a fun house of mirrors.

All this is resolvable, but it's subtle. And there's no avoiding complexity. E.g. if one wants to cross reference two human data entries who each assigned their own surrogate IDs, this type of analysis must be done. Likewise when merging forks of a database.

I'd love be wrong, but I don't think any of the folks doing "git for data" are planning their technology with this level of mathematical rigor.

TOGoS3 years ago

A lot of your comment went over my head, but I have modeled relational data in a way conducive to being stored in a Merkle tree. The trick being that every entity in the system ended up having two IDs. A hash ID, identifying this specific version of the object, and an entity ID (probably a UUID or OID), which remained constant as new versions were added. In a situation where people can have friends that are also people, they are friends with the person long-term, not just a specific version of the friend, so in that case you'd use the entity IDs. Though you might also include a reference to the specific version of the person at the point in time at which they became friends, in which case they would necessarily reference an older version of that person. If you friend yourself, you're actually friending that person a moment ago.

A list of all current entities, by hash, is stored higher up the tree. Whether it's better that objects themselves store their entity ID or if that's a separate data structure mapping entity to hash IDs depends on the situation.
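A bare-bones sketch of that two-ID layout as a table (the table and column names are invented for illustration, not taken from any real schema):

    -- One row per immutable version of an entity.
    CREATE TABLE person_versions (
        version_hash CHAR(64) PRIMARY KEY,  -- content hash of this specific version
        entity_id    CHAR(36) NOT NULL,     -- stable ID that survives across versions
        name         TEXT,
        -- Friendships point at the stable entity, not a specific version of it.
        best_friend_entity_id CHAR(36)
    );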

On second reading I guess your comment was actually about how to come up with content-based IDs for objects. I guess my point was that in the real world you don't usually need to do that, because if object identity besides its content is important you can just give it an arbitrary ID. How often does the problem of differentiating between a graph containing identical Alices vs one with a single self-friending Alice actually come up? Is there any way around it other than numbering the Alices?

Ericson23143 years ago

> ...The trick being that every entity in the system ended up having two IDs...

I think we agree that this is a partial solution. Adding a temporal dimension and referencing immutable single-versions only can break cycles by making them unconstructable in the first place. But once an object refers to foreign entity IDs, hash IDs become "polluted" with surrogate values.

> How often does the problem of differentiating between a graph containing identical Alices vs one with a single self-friending Alice actually come up? Is there any way around it other than numbering the Alices?

I think it would come up with forks that have some identical edits, especially if those edits are in different orders. In that case, surrogate keygen state would get out of sync (whether it's counters or UUID state). Either we pessimize merges, or we need some way to recover.

I think allowing for "identical Alices" is probably necessary in practice, but an interface should have a warning of some sort about this. (Maybe ask the Alices more questions until you can differentiate them? Get prepared in case one of the alices comes back and wants a new I.D. card and you don't want to easily enable fraud.) Likewise when merging those warnings should be brought to the fore, along with a menu of resolutions at extremes for the user to decide between.

ghusbands3 years ago

> The salient aspect of relational data is that it's cyclic

This is an odd claim. Most relational data is not cyclic, and it's easy enough to come up with a scheme to handle cyclic data in a consistent fashion.

Conflicting changes (two changes to the same 'cell' of a database table) are a much more likely issue to hit and will need handling in much the same way merge conflicts are currently handled, so there are already situations in which manual effort will be needed.

chatmasta3 years ago

We have content addressable objects at Splitgraph. [0] And here’s an example of a “point in time” query across two versions of an image on Splitgraph. [1]

I’m on my phone and don’t have a ton of time to respond right now, but I’d recommend reading our docs. We’re working on a lot of what you mention.

(Also, we’re hiring for backend and frontend roles. See my comment history for more.)

[0] https://www.splitgraph.com/docs/concepts/objects

[1] https://www.splitgraph.com/workspace/ddn?layout=hsplit&query...

barrkel3 years ago

Consider the whole database - the whole set of facts across all relations - as the state in the tree. Each transaction a serialized delta that produces a new node in the tree, a new HEAD. That's closer to what's being gotten at, as I see it.

Transaction logs are already not that different from patch sets, and merge conflicts have an isomorphism with replication inconsistencies.

Ericson23143 years ago

> Consider the whole database - the whole set of facts across all relations - as the state in the tree.

I tried to demonstrate that this is easier said than done. Deciding the equality/redundancy of facts is very subtle. At some point it might even be a guess whether your two clerks each met the same Alice when entering data in their DB forks or not.

Transactions are just patches, I completely agree. And patches are just partial functions. But deciding what the action is of a partial function, or whether the input is in the domain requires a notion of equality. (Partial functions are encoded very nicely with incomplete pattern matching; in this case the question is whether Alice matches the pattern.)

Basically I know where you are coming from, and I want it to work too, but you cannot just wave away the math and the issues it points out.

jrumbut3 years ago

It's amazing this isn't a standard feature. The database world seems to have focused on large, high-volume, globally distributed databases. Presumably you wouldn't version clickstream or IoT sensor data.

Features like this that are only feasible below a certain scale are underdeveloped and I think there's opportunity there.

qbasic_forever3 years ago

Every DB engine used at scale has a concept of snapshots and backups. This just looks like someone making a git-like porcelain for the same kind of DB management constructs.

zachmu3 years ago

It's not just snapshots though.

Dolt actually stores the tables, rows, and commits in a Merkle DAG, like Git. So you get branch and merge. You can't do branch and merge with snapshots.

(You also get the full git / github toolbox: push and pull, fork and clone, rebase, and most other git commands)

qbasic_forever3 years ago

Yeah it's a neat idea but I struggle to think of good use-cases for merge, other than toy datasets. If I'm working on a service that's sharding millions of users across dozens of DB instances, a merge is going to be incomprehensible and the conflicts hard to reason about.

MyNameIsFred3 years ago

> Yeah it's a neat idea but I struggle to think of good use-cases [...] If I'm working on a service [...]

I suspect that's simply not the use-case they're targeting. You're thinking of a database as simply the persistence component for your service, a means to an end. For you, the service/app/software is the thing you're trying to deliver. Where this looks useful is the cases where the data itself is the thing you're trying to deliver.

fiddlerwoaroof3 years ago

Datomic has some sort of zero-cost forking of the database: its "add-only" design makes this cheap.

1010083 years ago

Isn't the MySQL log journal* what you are looking for?

* I don't remember the exact name, but I'm referring to the feature that is used to replicate actions if there was an error.

fiedzia3 years ago

No. Relational DB logs are kept for a short time, and do not allow for branching/merging. And even if you stored the full log since day 1, the only way to check the state of the database at some point would be to start from scratch and replay everything to the desired point. For any sizeable db, that's not practical.

smarx0073 years ago

If you are interested in going one step beyond, you can check out https://terminusdb.com/docs/terminusdb/#/ which uses RDF to represent (knowledge) graphs (and has revision control).

px433 years ago

Is there a nice way to distribute a database across many systems?

Would this be useful for a data-set like Wikipedia?

It would be really nice if anyone could easily follow Wikipedia in such a way that anyone could fork it, and merge some changes, but not others, etc. This is something that really needs to happen for large, distributed data-sets.

I really want to see large corpuses of data that can have a ton of contributors sending pull requests to build out some great collaborative bastion of knowledge.

metaodi3 years ago

This is basically Wikidata [1], where the distribution is linked data/RDF with a triple store and federation to query different triple stores at once using SPARQL.

[1] https://wikidata.org

Klwohu3 years ago

Problematic name; it could become a millstone around the neck of the developer far into the future.

maest3 years ago

It was most likely picked as an analogy to "git".

zachmu3 years ago

This is correct. Specifically, to pay homage to git and how Linus named it.

gerdesj3 years ago

Dolt and git are closer to synonymous than analogous.

ademarre3 years ago

Agreed. I couldn't immediately see if it was "DOLT" or "do it", as in "just do it". It's the former.

ssprang3 years ago

Reminds me of this user testing story from the early Macintosh days: https://www.folklore.org/StoryView.py?story=Do_It.txt

rapnie3 years ago

I was going back and forth between the two until seeing doLt in terminal font.

zachmu3 years ago

This ambiguity in sans serif fonts has actually been pretty annoying. Especially since GitHub doesn't let you choose your font on readmes and stuff.

dheera3 years ago

You can use <pre>...</pre> for monospace font.

mzs3 years ago

just lowercase it

TedDoesntTalk3 years ago

Already I would not use this project because of its name. I’m not offended by it, but I know others will be, and it will only be a matter of time before we have to replace it with something else. So why bother in the first place?

I know the name is not DOLT but it is close enough to cause offense. Imagine the N-word with one typo. Would it still be offensive? Probably to some.

mixedCase3 years ago

> but I know others will be, and it will only be a matter of time before we have to replace it with something else

Or we can just not give in to such insanity. That's always an option, and would help prevent things from getting increasingly worse as we cede ground to claims that increasingly get further and further away from the realm of what's reasonable.

TedDoesntTalk3 years ago

It’s not your choice when you’re an employee of a woke company (unless you want to quit) Don’t you know that by now?

simonw3 years ago

The name is DOLT with an L

_Wintermute3 years ago

Do you know what "git" means?

Klwohu3 years ago

This is the issue with names. Even though the project is called doit the DoIt stylizing makes it look problematic. It's a non-starter, hopefully the author makes a big change. Just choosing lower case for the project would be enough.

edgyquant3 years ago

It’s not though it is dolt, meaning a stupid person, like git (stupid simple version control.)

Klwohu3 years ago

Oh wow, that's awful! I thought this was just an innocent mistake.

0xbadcafebee3 years ago

If this could somehow work with existing production MariaDB/PostgreSQL databases, this would be the next Docker. Sad that it requires its own proprietary DB (SQL compatible though it may be, it's not my existing database).

I wish something completely free like this existed just to manage database migrations. I don't think anything is quite this powerful/user-friendly (if you can call Git that) and free.

jka3 years ago

Are you familiar with SQLAlchemy Alembic[1]?

For best results it requires modeling your project in Python using SQLAlchemy, and even then, some migrations may require manual tweaks. But it is free and supports an extensive range of database systems.

[1] - https://alembic.sqlalchemy.org/en/latest/

rymurr3 years ago

I've been working on https://github.com/projectnessie/nessie for about a year now. It's similar to Dolt in spirit but aimed at big data/data lakes. Would welcome feedback from the community.

It's very exciting to see this field picking up speed. Tons of interesting problems to be solved :-)

weeboid3 years ago

Scanned for this comment before making it. This optically reads as "Dolt", not "DoIt"

coldtea3 years ago

Just optically? It's meant to be "dolt", not "do it", optically and semantically.

It's a pun on git, which also means "stupid/unpleasant/etc person".

weeboid3 years ago

ahh, got whooshed by the cool kids. dammit

XCSme3 years ago

But it is "DOLT", right?

zachmu3 years ago

Yup.

Nican3 years ago

I love the idea of this project so much. Being able to more easily share comprehensible data is an interest of mine.

It is not the first time I have seen immutable B-trees being used as a method for being able to query a dataset at a different point in time.

Spanner (and its derivatives) uses a similar technique to ensure backup consistency. Solutions such as CockroachDB also allow you to query data in the past [1], and then use a garbage collector to delete older unused data. The time-to-live of history data is configurable.

[1] https://www.cockroachlabs.com/docs/stable/as-of-system-time....

Nican3 years ago

Although, now it makes me wonder how much of the functionality of Dolt could be replicated with CockroachDB. The internal data structures of both databases are mostly similar.

You can have infinite point-time-querying by setting the TTL of data to forever.

You have the ability to do distributed BACKUP/IMPORT of data (Which is mostly a file copy, and also includes historical data)

A transaction would be the equivalent of a commit, but I do not think there is a way to list out all transactions on CRDB, so that would have to be done separately.

And gain other benefits, such as distributed querying, and high availability.

I just find it interesting that both databases (CockroachDB and Dolt) share the same principle of immutable B-trees.

devwastaken3 years ago

Something like this but for SQLite would be great for building small git-enabled applications/servers that can benefit from the features git provides but only need a database and a library to do it.

lifty3 years ago

Noms might be what you’re looking for (https://github.com/attic-labs/noms). Dolt is actually a fork of Noms.

joshspankit3 years ago

I never understood why we don’t have SQL databases that track all changes in a “third dimension” (column being one dimension, row being the second dimension).

It might be a bit slower to write, but hook the logic in to write/delete, and suddenly you can see exactly when a field was changed to break everything. The right middleware and you could see the user, IP, and query that changed it (along with any other queries before or after).

predakanga3 years ago

This does exist, though support for it is pretty sparse; it's called "Temporal Tables" in the SQL:2011 standard - https://sigmodrecord.org/publications/sigmodRecord/1209/pdfs...

Last time I checked, it was supported in SQL Server and MariaDB, and in Postgres via an extension.
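In MariaDB, for instance, it looks roughly like this (a sketch from memory - check the MariaDB docs for the exact syntax; the table is made up):

    CREATE TABLE price (
        item_id INT PRIMARY KEY,
        amount  DECIMAL(10, 2)
    ) WITH SYSTEM VERSIONING;

    -- Ordinary updates; old row versions are retained automatically.
    UPDATE price SET amount = 9.99 WHERE item_id = 1;

    -- Query the table as it looked at an earlier point in time.
    SELECT * FROM price
    FOR SYSTEM_TIME AS OF TIMESTAMP '2021-01-01 00:00:00';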

mcrutcher3 years ago

This has existed for a very long time as a data modeling strategy (most commonly, a "type 2 dimension") and is the way that all MVCC databases work under the covers. You don't need a special database to do this, just add another column to your database and populate it with a trigger or on update.
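As a rough sketch of the trigger variant in MySQL-flavoured SQL (table and column names are made up; a real setup would also want to capture the user and handle deletes):

    -- History table mirrors the live table plus a change timestamp.
    CREATE TABLE price_history (
        item_id    INT,
        amount     DECIMAL(10, 2),
        changed_at DATETIME
    );

    -- Copy the old row into the history table whenever a row is updated.
    CREATE TRIGGER price_audit
    AFTER UPDATE ON price
    FOR EACH ROW
    INSERT INTO price_history (item_id, amount, changed_at)
    VALUES (OLD.item_id, OLD.amount, NOW());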

tthun3 years ago

MS SQL Server 2016 onwards has temporal tables that support this (point-in-time data)

joshspankit3 years ago

Huh. I just read the spec. Not quite a third 'dimension', but it looks like exactly what I was asking for: a (reasonably) automatic and transparent record of previous values, as well as timestamps for when they changed.

I'll call this a "you learn something every day" and a "hey thanks @tthun (and @predakanga)"

kenniskrag3 years ago

Because you can do that with after update triggers or server-side in software.

herpderperator3 years ago

This is pretty cool. I wish `dolt diff` would use + and - though (isn't that standard?) rather than > and <, which are harder to distinguish.

guerrilla3 years ago

Both are "standard" from the UNIX/GNU diff(1) tool. The default behavior gives you the '>' and '<' format and using -u gives you the "unified" '+' and '-' format.

Aeolun3 years ago

What does this look like when merging? Does it need a specialized tool?

I’m a bit sad it doesn’t seem to have git style syncing though.

zachmu3 years ago

Merge is built in, same as with git. Same syntax too:

   dolt checkout -b <branch>

   dolt merge <branch>

What do you mean by git-style syncing? It has `push`, `pull`, and `fetch`.
Aeolun3 years ago

In regards to merging, I understand that part, but say I have a conflict, how does it get resolved? Does it open some sort of text editor where I delete the unwanted lines?

I mean, if I have a repository on one PC, I can clone from there on a different PC.

As far as I can see this only allows S3/GC or Dolthub remotes.

zachmu3 years ago

Merge conflicts get put into a special conflicts table. You have to resolve them one by one just like text conflicts from a git merge. You can do this with the command line or with SQL.

It's true we don't support SSH remotes yet. Wouldn't be too hard to add though. File an issue if this is preventing you from using the product. Or use sshfs, which is a pretty simple workaround.

valtism3 years ago

I've looked into Dolt previously for powering our product, which is Git for financials.

The trouble is that most of our complexity lies in our data's relationships. Merging is very tricky when some data that has been added in one branch has not had a specific change in properties when other data in a master branch has been modified.

LukeEF3 years ago

Good point - and why a graph-based approach has advantages in some use cases (disclaimer: TerminusDB co-founder here). It allows more flexibility in versioning relationships and more scope to merge in the scenario you outline.

anasbarg3 years ago

I like this. A while ago, I was asking my co-founder the question "Why isn't there Git for data like there is for code?" while working on a database migrations engine that aims to provide automatic database migrations (data & schema migrations). After all, code is data.

yarg3 years ago

Merging is hard, but the rest can be done with copy-on-write cloning (or am I missing something?).

xkvhs3 years ago

Well, maybe this is the ONE. But I've heard "git for data" way too many times to jump on board with the latest tool. I'll wait till the dust settles and there's a clear winner. Till then, it's Parquet, or even pickle.

mayureshkathe3 years ago

Since you've hosted it on GitHub, would you also consider providing a JavaScript client library for read access?

Could pave the way for single page web applications (hosted via GitHub Pages) working with Dolt as their database of choice.

kyrieeschaton3 years ago

People interested in this approach should compare it with Rich Hickey's Datomic.

knbknb3 years ago

Is something like "dolt diff master..somebranch -- mytable" possible?

Same question: are in-between-branch comparisons possible in dolt? Which "diff" subcommands do you plan to support in the future?

twobitshifter3 years ago

This is cool, but the parent dolthub project is even cooler! Dolthub.com

laurent923 years ago

Free for public repos!

zachmu3 years ago

Also private repos under a gig, but you have to give us a credit card to be private.

jonnycomputer3 years ago

Naive question here. Aside from it being MySQL, what is different here from just using git + SQLite?

Update: When I posted, I'd forgotten that a SQLite db file is binary. Not sure what I was thinking.

bargle03 years ago

Merging a SQLite database is challenging.

justinclift3 years ago

We do it on DBHub.io. :)

If you're into Go, these are the commits where the pieces were hooked together.

* https://github.com/sqlitebrowser/dbhub.io/commit/6c40e051ff7...

* https://github.com/sqlitebrowser/dbhub.io/commit/989ee0d08e6...

unnouinceput3 years ago

I'd venture further and say merging any DB is challenging, SQLite or not.

bitslayer3 years ago

Is it for versions of the database design or versions of the data?

skybrian3 years ago

Both. Schema changes are versioned like everything else. But depending on what the change is, it might make merges difficult.

(I haven’t used it; I just read the blog.)

fiedzia3 years ago

BTW. I wish all databases versioned their schema and kept full history. This should be a standard feature.

codesnik3 years ago

Keep history just as a changelog to see when and what was changed? Or something with versions you can actively revert to? I suppose one of the problems is that changing the schema is usually interleaved with some sort of data migrations and conversions, and those I have no idea how to track without using a general scripting language, like the migrations in many frameworks which are already there.

fiedzia3 years ago

Just storing a changelog would be very useful. I could, for example, compare the db version some app was developed for with the current state. Most companies will store schema migrations in git, but doing anything with that information is difficult. Having it in the db would make automated checks easier.

1f60c3 years ago

I'm not sure if this is supposed to be Dolt or DoIt, but using a swear word for a name (even a relatively mild one) is pretty distracting, IMHO.

drewwwwww3 years ago

presumably a riff on git, the well known famously unsuccessful version control system

1f60c3 years ago

Huh, I had no idea. (I'm not a native speaker.)

stjohnswarts3 years ago

DoIt isn't really a swear word. It's a euphemism for coitus. It's also a very common phrase in general so it really isn't strongly associated with "doing it" to cause any issues with English speakers. Aka it's fine as a project name. See Nike's "Just do it." advertisement campaign, they would have never gone with the phrase if it had strong negative connotations.

zachmu3 years ago

It's DOLT. Sans serif fonts.

einpoklum3 years ago

But do you really need this functionality, if you already have an SQL database?

That is, you can:

1. Create a table with an extra changeset id column and a branch id column, so that you can keep historical values.

2. Have a view on that table with the latest version of each record on the master branch.

3. Express branching-related actions as actions on the main table with different record versions and branch names

4. For the chocolate sprinkles, have tables with changeset info and branch info

and that gives you a poor man's git already - doesn't it?
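As a rough sketch of points 1 and 2 (the table, column names, and branch convention are invented for illustration):

    CREATE TABLE accounts (
        account_id   INT,
        balance      DECIMAL(18, 2),
        changeset_id INT,
        branch_id    INT,
        deleted      BOOLEAN DEFAULT FALSE
    );

    -- Latest surviving version of each record on the master branch (branch_id = 0).
    CREATE VIEW accounts_master AS
    SELECT a.*
    FROM accounts a
    WHERE a.branch_id = 0
      AND a.deleted = FALSE
      AND a.changeset_id = (
          SELECT MAX(b.changeset_id)
          FROM accounts b
          WHERE b.account_id = a.account_id
            AND b.branch_id = 0
      );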

resonantjacket53 years ago

That doesn't let you merge in someone else's changes easily. E.g. two team members make changes to the (local copy of the) database and now you want to merge them.

I mean, sure, you can have another database tracking their changes and a merging algorithm, but that's what Dolt is doing for you.

mleonhard3 years ago

This could be useful for integration tests, especially for testing schema changes.

andrewmcwatters3 years ago

Reminds me a bit of datahub.io, but potentially more useful.

qwerty4561273 years ago

I would rather have a "GitHub for data" - an SQL database I could have hosted for free, give everybody read-only access to, and give some people I choose R/W access to. That's a thing I really miss.

rsstack3 years ago

Dolt also has a paid product called DoltHub; check it out.

zachmu3 years ago

Free for public repositories, and for private repos under a gig.

controlledchaos3 years ago

This is exactly what I was looking for!

tjkrusinski3 years ago

Any info on how this differs from Dat?

iamwil3 years ago

which db does it use?

zachmu3 years ago

It is a database. It implements the MySQL dialect and binary protocol, but it isn't MySQL. Totally separate storage engine and implementation.

pizzabearman3 years ago

Is this mysql only?

zachmu3 years ago

It uses the mysql SQL dialect for queries. But it's its own database.

efikoman3 years ago

?