WRGL repositories take heavy inspiration from Git, as will be evident throughout this article. The idea behind our product is simple: what if we had Git, but for data? Why not just use Git itself? Git was designed to keep track of code changes, and code files are typically much smaller than data files, so Git performs poorly when it comes to big files. You can use Git to diff data files as long as they are text files, but it is oblivious to the format of your data, so the diff it produces is of limited use.
WRGL repositories copy many ideas from Git, such as the concept of commits and the way they are chained into a linked list. But WRGL was designed from the ground up to store data and to diff it very quickly.
We also didn't want to develop something that is tied to a specific stack or database. Therefore the unit of data we store is CSV, the most universal and practical format we can think of.
A WRGL commit is very similar to a Git commit, but it refers to a single CSV file rather than an arbitrary number of files. It also keeps track of changes down to individual rows. Other than that, it carries the same metadata you find in Git:
- Author: the commit author.
- Message: the purpose of commit in plain English.
- Date: when the commit was made.
- Previous commit hash: as mentioned above, commits are arranged into a linked list. This keeps the history of changes in chronological order.
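To make the shape of a commit concrete, here is a minimal Python sketch. It is not WRGL's actual storage format: the `Commit` class, the `table_hash` field, the `commit_hash` property and the use of SHA-256 are all assumptions made for illustration. It only shows how the metadata above and a pointer to the previous commit form a linked list.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterator, Optional


@dataclass(frozen=True)
class Commit:
    """Hypothetical in-memory commit; field names are illustrative only."""
    author: str
    message: str
    date: datetime
    table_hash: str                      # assumed: identifies the committed CSV content
    parent: Optional["Commit"] = None    # previous commit, None for the first one

    @property
    def commit_hash(self) -> str:
        # Assumption: the hash covers the metadata plus the previous commit's hash,
        # so each commit pins down the whole history before it.
        parent_hash = self.parent.commit_hash if self.parent else ""
        payload = "\n".join(
            [self.author, self.message, self.date.isoformat(), self.table_hash, parent_hash]
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def history(head: Optional[Commit]) -> Iterator[Commit]:
    """Walk the linked list from the newest commit back to the first."""
    while head is not None:
        yield head
        head = head.parent


first = Commit("alice", "Initial import", datetime(2021, 1, 1, tzinfo=timezone.utc), "table-1")
second = Commit("bob", "Fix typos in names", datetime(2021, 1, 2, tzinfo=timezone.utc), "table-2", parent=first)
messages = [c.message for c in history(second)]  # newest first
```

Walking from the newest commit through each `parent` yields the history in reverse chronological order.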
When committing a CSV file, you can optionally specify a set of columns as its primary key. Just like in a database, the primary key helps us identify each row, which lets us tell which rows have changed, which have been deleted and which have been newly added.
If you don't specify a primary key, all columns are assumed to belong to it. In that case we can only detect added and removed rows, because a row whose values have changed no longer matches its old key.
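The effect of the primary key can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not WRGL's implementation: `index_rows` and `diff_rows` are hypothetical helpers, both CSV files are assumed to share the same header, and everything is held in memory.

```python
import csv
import io
from typing import Dict, List, Optional, Sequence, Tuple

Row = Tuple[str, ...]


def index_rows(csv_text: str, primary_key: Optional[Sequence[str]] = None) -> Dict[Row, Row]:
    """Map each row's key to the full row. With no primary key, every column is part of the key."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    key_columns = list(primary_key) if primary_key else header
    key_positions = [header.index(name) for name in key_columns]
    return {tuple(row[i] for i in key_positions): tuple(row) for row in reader}


def diff_rows(old_csv: str, new_csv: str, primary_key: Optional[Sequence[str]] = None) -> Dict[str, List[Row]]:
    """Classify row keys as added, removed or modified."""
    old = index_rows(old_csv, primary_key)
    new = index_rows(new_csv, primary_key)
    return {
        "added": sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "modified": sorted(k for k in new if k in old and new[k] != old[k]),
    }


old_csv = "id,name\n1,Ann\n2,Bob\n3,Cy\n"
new_csv = "id,name\n1,Ann\n2,Rob\n4,Di\n"

with_key = diff_rows(old_csv, new_csv, primary_key=["id"])
without_key = diff_rows(old_csv, new_csv)
```

With `id` as the primary key, row 2 is reported as modified. Without a key, the same change shows up as one removed row (`2,Bob`) and one added row (`2,Rob`), and nothing is ever reported as modified.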
A linked list of commits is called a repository. Despite the name, repositories are more similar to Git branches. You can also think of a WRGL repository as having only one branch.
When deciding how to name a repository and which repository a piece of data belongs in, you can think of each repository as a SQL table. Each should store data that does not change too abruptly between commits; otherwise you won't gain much from its history of changes.
How data diff is performed
Depending on how the data has changed, the diff can be performed in two ways:
- If the primary key stays the same, then added rows, removed rows and modified rows are shown. Furthermore, for each modified row we show changes in cell values as well as changes in columns.
- If the primary key has changed, then we can only show how the primary key has changed and the changes in columns, if any.
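The two cases above can be sketched as a single function that picks a mode based on the primary key. As before, this is a rough illustration rather than WRGL's algorithm: `diff_tables`, `column_changes` and the shape of the returned dictionary are assumptions, and tables are passed in as plain lists already parsed from CSV.

```python
from typing import Any, Dict, List, Sequence, Tuple

Key = Tuple[str, ...]


def column_changes(old_cols: Sequence[str], new_cols: Sequence[str]) -> Dict[str, List[str]]:
    """Columns that appear or disappear between two versions."""
    return {
        "added": [c for c in new_cols if c not in old_cols],
        "removed": [c for c in old_cols if c not in new_cols],
    }


def _index(header: Sequence[str], rows: Sequence[Sequence[str]], pk: Sequence[str]) -> Dict[Key, Dict[str, str]]:
    positions = [header.index(c) for c in pk]
    return {tuple(r[i] for i in positions): dict(zip(header, r)) for r in rows}


def diff_tables(old_header, old_rows, old_pk, new_header, new_rows, new_pk) -> Dict[str, Any]:
    # An empty primary key means "all columns", as described earlier.
    old_key = list(old_pk) or list(old_header)
    new_key = list(new_pk) or list(new_header)
    result: Dict[str, Any] = {"columns": column_changes(old_header, new_header)}

    if old_key != new_key:
        # Case 2: rows cannot be matched up, so only report key and column changes.
        result["primary_key"] = {"old": old_key, "new": new_key}
        return result

    # Case 1: same key, so rows can be matched and compared cell by cell.
    old = _index(old_header, old_rows, old_key)
    new = _index(new_header, new_rows, new_key)
    shared = [c for c in new_header if c in old_header]
    result["added"] = sorted(k for k in new if k not in old)
    result["removed"] = sorted(k for k in old if k not in new)
    result["modified"] = {}
    for key in new:
        if key in old:
            cells = {c: (old[key][c], new[key][c]) for c in shared if old[key][c] != new[key][c]}
            if cells:
                result["modified"][key] = cells
    return result


old_header = ["id", "name", "age"]
old_rows = [["1", "Ann", "30"], ["2", "Bob", "41"], ["3", "Cy", "25"]]
new_header = ["id", "name", "city"]
new_rows = [["1", "Ann", "Oslo"], ["2", "Rob", "Rome"], ["4", "Di", "Lima"]]

same_key = diff_tables(old_header, old_rows, ["id"], new_header, new_rows, ["id"])
changed_key = diff_tables(old_header, old_rows, ["id"], new_header, new_rows, ["name"])
```

In this sketch, cell-level changes are only computed for columns present in both versions; added and removed columns are reported separately under `columns`.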