Concepts

Wrgl's concepts are very similar to Git's, the main difference being that a commit represents a CSV table instead of an entire file tree. Wrgl also introduces concepts of its own: table, block, and row.

Content

    Repository

    A repository contains the versioned data of a single CSV table. Each version (known as a commit) is associated with a unique checksum, which is guaranteed to change if the underlying data changes in any way. It is also possible to compute a detailed diff between any two commits in the same repository.

    Commit

    A commit is like a snapshot of a CSV table at a single point in time. Each commit is uniquely identified by a 16-byte (128-bit) Meow hash checksum. Each commit is also linked to its immediate parent commits via their checksums. In this way, a commit captures its entire history. It has the following attributes:

    • Author name: the commit author's name
    • Author email: the commit author's email
    • Message: the commit message
    • Time: the commit time
    • Parents: the immediate parent commit checksums
    • Table: checksum of the Table that contains the actual data

    A side effect of the linked nature of commits is that it is possible to have diverged branches of data. You can manage them with the help of references, which are introduced in the next section.
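
    The linked structure described above can be illustrated with a minimal sketch. The Commit class, the history helper, and the placeholder checksums ("c1", "t1", and so on) are hypothetical names chosen for illustration; they are not Wrgl's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Commit:
    """Hypothetical in-memory mirror of the commit attributes listed above."""
    author_name: str
    author_email: str
    message: str
    time: str    # commit time, kept as a plain string for brevity
    table: str   # checksum of the Table that contains the actual data
    parents: List[str] = field(default_factory=list)  # parent commit checksums


def history(store: Dict[str, Commit], checksum: str) -> Iterator[str]:
    """Yield the checksum of a commit and of each of its ancestors, once each."""
    seen = set()
    stack = [checksum]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        yield current
        stack.extend(store[current].parents)


# Placeholder checksums stand in for real hex-encoded checksums.
store = {
    "c1": Commit("Ann", "ann@example.com", "initial data", "2021-01-01", "t1"),
    "c2": Commit("Ann", "ann@example.com", "fix typos", "2021-01-02", "t2", ["c1"]),
    "c3": Commit("Bob", "bob@example.com", "add rows", "2021-01-02", "t3", ["c1"]),
}

# c2 and c3 share the parent c1, so they form two diverged branches of data.
print(sorted(history(store, "c2")))  # ['c1', 'c2']
print(sorted(history(store, "c3")))  # ['c1', 'c3']
```

    Because every commit stores only its parents' checksums, following those links from any commit is enough to recover its full history.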

    Reference

    A Wrgl reference is almost identical to a Git reference. Feel free to skip this section if you are already familiar with the concept.

    While any commit can be accessed by its checksum, it is usually more practical to use a special name that references the checksum instead. References (sometimes shortened to just "refs") serve that function. Two kinds of references are supported:

    • branch: It has the form heads/<branch-name> and references the latest commit of a history branch. It is customary for a repository to have a branch named main which tracks the main history.
    • remote: It has the form remotes/<remote-name>/<remote-reference> and references the commit kept at a remote reference. It is usually the product of running wrgl fetch.
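
    As a rough illustration of the two naming forms above, the following sketch classifies reference names and resolves them to checksums. The refs dictionary, the describe and resolve helpers, and the placeholder checksums are assumptions made for this example, not part of Wrgl itself.

```python
from typing import Dict

# Hypothetical reference store: full reference name -> commit checksum.
refs: Dict[str, str] = {
    "heads/main": "c2",
    "heads/experiment": "c3",
    "remotes/origin/main": "c1",
}


def describe(ref: str) -> Dict[str, str]:
    """Classify a reference name as a branch or a remote reference."""
    parts = ref.split("/")
    if any(part == "" for part in parts):
        raise ValueError(f"malformed reference: {ref!r}")
    if parts[0] == "heads" and len(parts) >= 2:
        # heads/<branch-name>
        return {"kind": "branch", "name": "/".join(parts[1:])}
    if parts[0] == "remotes" and len(parts) >= 3:
        # remotes/<remote-name>/<remote-reference>
        return {"kind": "remote", "remote": parts[1], "name": "/".join(parts[2:])}
    raise ValueError(f"unsupported reference: {ref!r}")


def resolve(ref: str) -> str:
    """Return the commit checksum that a reference points to."""
    describe(ref)  # validate the name first
    return refs[ref]


print(describe("heads/main"))           # {'kind': 'branch', 'name': 'main'}
print(describe("remotes/origin/main"))  # {'kind': 'remote', 'remote': 'origin', 'name': 'main'}
print(resolve("heads/main"))            # c2
```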

    Remote

    A remote is just a hosted repository (i.e., not a local one), usually exposed to the internet via an HTTP API started with wrgld.

    Table

    A table houses the actual data of a CSV table and any other info necessary to index and compute changes. A table is also identified by a unique checksum. It has the following attributes:

    • Columns: the column names
    • PK: zero-based indices of the primary key columns. Just like in a SQL database, primary key columns serve as row identity. This is how Wrgl can tell which rows can be compared between two commits and which are simply marked as additions or removals.
    • Blocks: the data is stored in fixed-length blocks, which are discussed in the next section.
    • Row count: the number of rows in this table.
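
    To show why the primary key matters for diffing, here is a simplified sketch that matches rows of two table versions by their PK values. It is not Wrgl's actual diff algorithm; the function names and sample data are hypothetical, and the sketch assumes at least one primary key column.

```python
from typing import Dict, List, Sequence, Tuple

Row = Sequence[str]
Key = Tuple[str, ...]


def index_by_pk(rows: List[Row], pk: List[int]) -> Dict[Key, Row]:
    """Map each row's primary key values to the row itself."""
    return {tuple(row[i] for i in pk): row for row in rows}


def diff_rows(old: List[Row], new: List[Row], pk: List[int]) -> Dict[str, List[Key]]:
    """Classify primary keys as added, removed, or changed between two versions."""
    old_idx = index_by_pk(old, pk)
    new_idx = index_by_pk(new, pk)
    return {
        # Keys present only in the new version are additions.
        "added": sorted(k for k in new_idx if k not in old_idx),
        # Keys present only in the old version are removals.
        "removed": sorted(k for k in old_idx if k not in new_idx),
        # Keys present in both can be compared value by value.
        "changed": sorted(
            k for k in old_idx
            if k in new_idx and list(old_idx[k]) != list(new_idx[k])
        ),
    }


columns = ["id", "name", "age"]
pk = [0]  # zero-based index: the "id" column is the primary key

old = [["1", "Ann", "30"], ["2", "Bob", "25"], ["3", "Cy", "41"]]
new = [["1", "Ann", "31"], ["3", "Cy", "41"], ["4", "Dee", "19"]]

print(diff_rows(old, new, pk))
# {'added': [('4',)], 'removed': [('2',)], 'changed': [('1',)]}
```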

    Block

    The data of a table is not stored as a single list of rows but divided into blocks of 255 rows. Each block is identified by a table checksum and a block offset. You can fetch contiguous rows by fetching blocks with a start offset and an end offset. Laying out data in this way helps greatly with data storage and diff computation performance.
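
    The block layout above implies some simple arithmetic for deciding which blocks cover a given range of rows. The sketch below assumes a half-open range convention (the end offset is exclusive), which the text does not specify; the helper names are hypothetical.

```python
from typing import Tuple

BLOCK_SIZE = 255  # rows per block, as stated above


def block_count(row_count: int) -> int:
    """Number of blocks needed to hold row_count rows (the last may be partial)."""
    return (row_count + BLOCK_SIZE - 1) // BLOCK_SIZE


def blocks_for_rows(start_row: int, end_row: int) -> Tuple[int, int]:
    """Return the half-open block range [start, end) covering rows [start_row, end_row)."""
    if start_row < 0 or end_row <= start_row:
        raise ValueError("expected 0 <= start_row < end_row")
    start_block = start_row // BLOCK_SIZE
    end_block = (end_row - 1) // BLOCK_SIZE + 1
    return start_block, end_block


print(block_count(1000))          # 4
print(blocks_for_rows(250, 520))  # (0, 3): rows 250..519 touch blocks 0, 1 and 2
print(blocks_for_rows(255, 510))  # (1, 2): exactly one full block
```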

    Row

    A row in a table is identified by a table checksum and a row offset. Each row is contained within a block, so a row offset is equal to <block offset> * 255 + <row offset within block>. You can fetch rows by their offsets via the rows endpoint. Fetching rows in this way is slower than fetching whole blocks, so it is recommended only when fetching changed rows (which are referred to in a diff result by their row offsets).
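
    The offset formula above can be checked with a small worked example. The function names are hypothetical; only the block size of 255 and the formula itself come from the text.

```python
from typing import Tuple

BLOCK_SIZE = 255  # rows per block


def to_row_offset(block_offset: int, offset_in_block: int) -> int:
    """Compute <block offset> * 255 + <row offset within block>."""
    if block_offset < 0 or not 0 <= offset_in_block < BLOCK_SIZE:
        raise ValueError("offset out of range")
    return block_offset * BLOCK_SIZE + offset_in_block


def locate(row_offset: int) -> Tuple[int, int]:
    """Inverse mapping: return (block offset, row offset within block)."""
    if row_offset < 0:
        raise ValueError("row offset must be non-negative")
    return divmod(row_offset, BLOCK_SIZE)


print(to_row_offset(2, 10))  # 520
print(locate(520))           # (2, 10)
print(locate(254))           # (0, 254): last row of the first block
print(locate(255))           # (1, 0): first row of the second block
```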