It’s essential to take a step back and look at the evolution of computers and databases to understand the constraints and design choices that led to the current versions of the databases we use every day.
As we’ve learned elsewhere in the book, the term “relational” in relational databases doesn’t refer to relationships between data.
In 1970, Ted Codd invented Relational Algebra, the basis for all relational databases, which defined a relation as a database table and how to get data out of these relations and join them together. Therefore, Relational Algebra is how the DB responds to SQL; however, joining data together using relational math is computationally expensive. The following is the formula to join tables.
In 1986, when PostgreSQL was first released, a typical hard drive was around 10 MB in size and performed about 40 IOPS 1, and Apple released the Mac Plus with 1 MB of RAM. However, datasets were much smaller than today, and there wasn’t the expectation of running complex queries in milliseconds.
All early relational databases grew up with the mindset that the hard drive was the slowest part of the computer. Therefore, these databases relied heavily on RAM and avoided the hard drive as much as possible.
But, of course, hard drives were slow when they consisted of spinning platters, but solid-state drives don’t have any moving parts and are much faster. In fact, in 2022, PCIe NVMe SSDs can perform 1.7 million IOPS, roughly 42,500 times faster than the drives when Postgres was first released.
Modern databases such as Dgraph take advantage of fast SSDs, reducing the need for tons of memory.
In the ‘80s and ‘90s, CPU companies focused on making single-core processors faster and more efficient though it increasingly became harder to make improvements. The first dual-core processor was released in 2001 by IBM, but it wasn’t until 2005 and 2006 that AMD and Intel released their first dual-core processors, respectively.
Having multiple cores available to a CPU enabled a massive shift in thinking.
Early CPUs had a single core optimized to switch between multiple processes, known as concurrency. With the advent of multi-core processors, parallelism became possible. Parallelism is the ability to run multiple tasks at the same time. Newer languages, such as Golang, are optimized for parallelism in handling threads and garbage collection. Unfortunately, C’s language standard doesn’t support multithreading. It wasn’t until 1995 that POSIX threads or Pthreads became available to the GCC compiler, which allowed C programs to run multiple threads.
Postgres had already been around for nine years, so they based its architecture on processes instead of threads. Unfortunately, Postgres launches a process on startup that spawns new processes for each connection to the database, which is memory and resource-intensive. In 2017, Postgres released version 10, which contained the first code to start enabling parallel processing, and they’ve been working on getting more and more of the database to be thread-aware since then. 2
Scaling out instead of scaling up wasn’t limited to CPUs. The cost for servers continued to drop, making it affordable to own or rent many smaller computers for a fraction of the price of a single large computer. Networking these together created the benefit of additional parallelism, and if one of the nodes in the network died, the entire database wasn’t down.
The network architecture for early databases such as MySQL and Postgres contains a read/write or primary node and one or more read-only secondary or standby nodes. All writes are written on the master, replicated to read-only followers. When the primary node dies, all database writes halt until the promotion of a secondary node occurs, which is often a manual process involving down-time.
Early attempts at graph databases such as Neo4j maintained this model; however, Neo4j Enterprise can now run in cluster-mode with multiple read-write primary servers using the Raft protocol. Newer graph databases, such as Dgraph, were built from the ground up with multi-node clusters and included this functionality in their open-source version.
In my opinion, being able to scale a database cluster horizontally is the reason people invented NewSQL and NoSQL databases in the first place.
MySQL and Postgres struggle with scaling and high availability since their ultimate bottleneck is the number of writes that a single master can handle. Sharding, or the ability to move writes to multiple masters, is usually done in the application, adding complexity to the data layer and causing additional cognitive overload for the software engineer. “Failover” is often a manual process, which gets even more complicated when restoring the original master because new data needs to sync to the original master. Depending on the amount of data, a restore can take hours or days to return everything to a healthy state.
MySQL and Postgres leave the job of load balancing queries across an array of database followers to external services such as a proxy or the application, leaving it up to the database or software engineer, similar to sharding. Queries execute on a single machine instead of being broken up across several.
Dgraph built its system with these things in mind. Data for each attribute is spread across the cluster, allowing queries to achieve parallelism across machines. In addition, each node may be written to, sharing the write load across them all. Because each attribute, or predicate, is stored in separate files, writes are slower with Dgraph, and reads are much faster.
In a graph database, the relationship is a first-class citizen. To speed up reads in Postgres, you often need to preprocess data into a denormalized table or materialized view. Conversely, a graph database doesn’t need to denormalize its data because it’s fast when selecting complex relationships. Instead of a complicated formula to join tables, a graph database can follow the edges between nodes using pointers leading to much faster query times.
Relational DBs organize their data into tables, storing attributes as columns and individual tuples as rows. All columns are present in a table when saving a row, often leading to bloated files because of excessive nulls. In addition, adding and removing columns can become an expensive operation depending on its settings, such as unique or foreign constraints or default values.
Graph schemas are more flexible; there aren’t tables, columns, or rows. Graph DBs store data as nodes or vertices. Attributes are called predicates, stored in individual files. Edges denote relationships between nodes, which may contain details called facets.
For example, if the database stores car sales, it would have a node type for “People” and one for “Cars.” The DB stores an edge between the two nodes when a person buys a car. Since vehicles change owners many times, the edge maintains a facet called “Purchased On” to keep track of each purchase event. It might also preserve the salesperson’s name as another facet, or it could link to another Person node to show who sold the car.
Because edges are first-class citizens in a graph database, there’s no need to maintain “join” tables as relational databases require for complex relationships.
Dgraph supports JSON natively. Because nearly every programming language inherently supports JSON, you don’t need a complicated ORM to translate to the DBs query language. Instead, you can use data in built-in objects or structs to marshal to JSON. In addition, ORMs maintain foreign key consistency when inserting chained data. Dgraph handles all of this for you in a single mutation.
Dgraph uses a form of GraphQL that includes functions and filters for queries. GraphQL doesn’t require tokens as SQL does; instead, it lists desired predicates making string interpolation easier. Variables are used for user-supplied data to avoid query injection. Dgraph returns data in JSON, which is easy to unmarshal into structs.