Monday, July 12, 2010

NoSQL vs. RDBMS: Let the flames begin


The original name of this technology before people started calling it "NoSQL" was a distributed key/value store. This is a far more descriptive name, and I originally remember looking at it and going "hey, cool, I'll bet that will end up being very useful to a lot of people." The term has since expanded to essentially include "anything that isn't a relational database", but usually, when most people talk about NoSQL, they are talking about key/value stores.
Ever since the term NoSQL was coined, it's been getting touted as a silver bullet. I'm interested in products like Cassandra and follow their progress, but they are still immature technologies, and to claim that they are "replacing" SQL or RDBMSes in general (or that they will in the near future) is specious reasoning at best, if not an out-and-out lie.
Products and technologies fitting under the NoSQL umbrella are geared toward the following problem domain:-
  • You plan to deploy a large-scale, high-concurrency database (hundreds of GB, thousands of users);
  • Which doesn't need ACID guarantees;
  • Or relationships or constraints;
  • Stores a fairly narrow set of data (the equivalent of 5-10 tables in SQL);
  • Will be running on commodity hardware (i.e. Amazon EC2);
  • Needs to be implemented on a very low budget and "scaled out."
This actually describes a lot of web sites today. Google and Twitter fit very neatly into these requirements. Does it really matter if a few tweets are lost or delayed? On the other hand, these specs apply to nearly 0% of business systems, which is what a very high number of us work on developing. Most businesses have very different requirements:
  • Medium-to-large-scale databases (10-100 GB) with fairly low concurrency (hundreds of users at most);
  • ACID (especially the A and C - Atomicity and Consistency) is a hard requirement;
  • Data is highly correlated (hierarchies, master-detail, histories);
  • Has to store a wide assortment of data - hundreds or thousands of tables are not uncommon in anormalized schema (more for denormalization tables, data warehouses, etc.);
  • Run on high-end hardware;
  • Lots of capital available (if your business has millions of customers then you can probably find $25k or so lying behind the couch).
High-end SQL databases (SQL Server, Oracle, Teradata, Vertica, etc.) are designed for vertical scaling, they like being on machines with lots and lots of memory, fast I/O through SANs and SSDs, and the occasional horizontal scaling through clustering (HA) and partitioning (HC).
"NoSQL" is often compared favourably to "SQL" in performance terms. But fully maxed-out, a high-end SQL database server or cluster will scale almost infinitely. That is how they were intended to be deployed. Beware of dubious benchmarks comparing poorly-normalized, poorly-indexed SQL databases running mysql on entry-level servers (or worse, cloud servers like Amazon EC2) to similarly-deployed NoSQL databases. Apples and oranges. If you work with SQL, don't be scared by that hype.
SQL isn't going anywhere. DBAs are no more likely to vanish as a result of NoSQL than PHP programmers were as a result of Java and XML.
NoSQL isn't going anywhere either, because the development community has correctly recognized that RDBMSes aren't always the optimal solution to every problem.
NoSQL has an advantage with respect to the CAP Theorem, which states that nice things about databases are consistency, availability, and partition tolerance, but you can only pick 2 of the 3. Relational databases give you consistency and availability, and a lot of popular NoSQL databases give you availability and partition tolerance. i.e. You can distribute data across many computers easily with NoSQL. So if you have an application (like a lot of Google's applications) that need to scale to a gajillion users, NoSQL is a better choice.

I think some of the advocacy of NoSQL comes from people who have gotten applications working quickly with NoSQL, and ask, "...relational databases? we don't need no stink in' relational databases!". NoSQL seems to be the way to go at Google. Google folks advocate using NoSQL, but it requires a lot of code to dipsy-doodle around data in complex relationships. Problem domains with lots of joins and indices are harder to code with NoSQL plus code vs just SQL. This works for Google because they aggressively recruit prolific coders. The other thing is that a lot of Google applications boil down to being huge lists of stuff that can be scattered across multiple machines. They achieve good query speed with their search indices, Google File System, and Map-Reduce. Joins are not as much of a problem in those applications.
This video at YouTube talks about NoSQL vs SQL. It's kind of funny if you are a SQL advocate, but it describes how you solve problems in NoSQL that were solved in SQL relational databases.

So, as a developer you owe it to yourself to at least learn what NoSQL is, what products it refers to (Cassandra, BigTable, Voldemort, db4o, etc.), and how to build and code against a simple database created with one of these. But don't start throwing away all your SQL databases yet or thinking that your career is going to be made obsolete - that's hype, not reality.