What is SQL? The lingua franca of data analysis

SQL is neither the fastest nor the most elegant way to talk to databases, but it is the best way we have. Here’s why

Thinkstock

Today, Structured Query Language is the standard means of manipulating and querying data in relational databases, though with proprietary extensions among the products. The ease and ubiquity of SQL have even led the creators of many “NoSQL” or non-relational data stores, such as Hadoop, to adopt subsets of SQL or come up with their own SQL-like query languages.

But SQL wasn’t always the “universal” language for relational databases. From the beginning (circa 1980), SQL had certain strikes against it. Many researchers and developers at the time, including me, thought that the overhead of SQL would keep it from ever being practical in a production database.

Clearly, we were wrong. But many still believe that, for all of SQL’s ease and accessibility, the price exacted in runtime performance is often too high.

SQL history

Before there was SQL, databases had tight, navigational programming interfaces, and typically were designed around a network schema called the CODASYL data model. CODASYL (Committee on Data Systems Languages) was a consortium that was responsible for the COBOL programming language (starting in 1959) and database language extensions (starting 10 years later).

When you programmed against a CODASYL database, you were navigating to records through sets, which express one-to-many relationships. Older hierarchical databases only allow a record to belong to one set. Network databases allow a record to belong to multiple sets.

Say you wanted to list the students enrolled in CS 101. First you would find "CS 101" in the Courses set by name, set that as the owner or parent of the Enrollees set, find the first member (ffm) of the Enrollees set, which is a Student record, and list it. Then you would go into a loop: Find next member (fnm) and list it. When fnm failed, you would exit the loop.

That may seem like a lot of scut work for the database programmer, but it was very efficient at execution time. Experts like Michael Stonebraker of the University of California at Berkeley and Ingres pointed out that doing that sort of query in a CODASYL database such as IDMS took roughly half the CPU time and less than half the memory as the same query on a relational database using SQL.

For comparison, the equivalent SQL query to return all of the students in CS 101 would be something like 

SELECT student.name FROM courses, enrollees, students WHERE course.name ="CS 101"

That syntax implies a relational inner join (actually two of them), as I’ll explain below, and leaves out some important details, such as the fields used for the joins.

Relational databases and SQL

Why would you give up a factor of two improvement in execution speed and memory use? There were two big reasons: ease of development and portability. I didn’t think either one mattered much in 1980 compared to performance and memory requirements, but as computer hardware improved and became cheaper people stopped caring about execution speed and memory and worried more about the cost of development.

In other words, Moore’s Law killed CODASYL databases in favor of relational databases. As it happened, the improvement in development time was significant, but SQL portability turned out to be a pipe dream.

Where did the relational model and SQL come from? E.F. “Ted” Codd was a computer scientist at the IBM San Jose Research Laboratory who worked out the theory of the relational model in the 1960s and published it in 1970. IBM was slow to implement a relational database in an effort to protect the revenues of its CODASYL database IMS/DB. When IBM finally started its System R project, the development team (Don Chamberlin and Ray Boyce) wasn’t under Codd, and they ignored Codd’s 1971 Alpha relational language paper to design their own language, SEQUEL (Structured English Query Language). In 1979, before IBM had even released its product, Larry Ellison incorporated the language in his Oracle database (using IBM’s pre-launch SEQUEL publications as his spec). SEQUEL soon became SQL to avoid an international trademark violation.

The “tom-toms beating for SQL” (as Michael Stonebraker put it) were coming not only from Oracle and IBM, but also from customers. It wasn’t easy to hire or train CODASYL database designers and programmers, so SEQUEL (and SQL) looked much more attractive. SQL was so attractive in the later 1980s that many database vendors essentially stapled a SQL query processor on top of their CODASYL databases, to the great dismay of Codd, who felt that relational databases had to be designed from scratch to be relational.

A pure relational database, as designed by Codd, is built on tuples grouped into relations, consistent with first-order predicate logic. Real-world relational databases have tables that contain fields, constraints, and triggers, and tables are related through foreign keys. SQL is used to declare the data to be returned, and a SQL query processor and query optimizer turn the SQL declaration into a query plan that is executed by the database engine.

SQL includes a sub-language for defining schemas, the data definition language (DDL), along with a sub-language for modifying data, the data manipulation language (DML). Both of these have roots in early CODASYL specifications. The third sub-language in SQL declares queries, through the SELECT statement and relational joins.

SQL SELECT statement

The SELECT statement tells the query optimizer what data to return, what tables to look in, what relations to follow, and what order to impose on the returned data. The query optimizer has to figure out by itself what indexes to use to avoid brute force table scans and achieve good query performance, unless the particular database supports index hints.

Part of the art of relational database design hangs on the judicious use of indexes. If you omit an index for a frequent query, the whole database can slow down under heavy read loads. If you have too many indexes, the whole database can slow down under heavy write and update loads.

Another important art is choosing a good, unique primary key for every table. You not only have to consider the impact of the primary key on common queries, but how it will play in joins when it appears as a foreign key in another table, and how it will affect the data’s locality of reference.

In the advanced case of database tables that are split up into different volumes depending on the value of the primary key, called horizontal sharding, you also have to consider how the primary key will affect the sharding. Hint: You want the table distributed evenly across volumes, which suggests that you don’t want to use date stamps or consecutive integers as primary keys.

Discussions of the SELECT statement may start simple, but can quickly become confusing. Consider:

SELECT * FROM Customers;

Simple, right? It asks for all fields and all rows of the Customers table. Suppose, however, that the Customers table has a hundred million rows and a hundred fields, and one of the fields is a large text field for comments. How long will it take to pull down all that data over a 10 megabit per second network connection if each row contains an average of 1 kilobyte of data?

Perhaps you should cut down how much you send over the wire. Consider:

SELECT TOP 100 companyName, lastSaleDate, lastSaleAmount, totalSalesAmount FROM Customers
WHERE state ="Ohio" AND city ="Cleveland"
ORDER BY lastSaleDate DESCENDING;

Now you’re going to pull down a lot less data. You have asked the database to give you only four fields, to only consider the companies in Cleveland, and to give you just the 100 companies with the most recent sales. To do that most efficiently at the database server, however, the Customers table needs an index on state+city for the WHERE clause and an index on lastSaleDate for the ORDER BY and TOP 100 clauses.

By the way, TOP 100 is valid for SQL Server and SQL Azure, but not MySQL or Oracle. In MySQL, you’d use LIMIT 100 after the WHERE clause. In Oracle, you’d use a bound on ROWNUM as part of the WHERE clause, i.e. WHERE... AND ROWNUM <=100. Unfortunately, the ANSI/ISO SQL standards (and there are nine of them to date, stretching from 1986 to 2016) only go so far, beyond which each database introduces its own proprietary clauses and features.

SQL joins 

So far, I’ve described the SELECT syntax for single tables. Before I can explain JOIN clauses, you need to understand foreign keys and relations between tables. I’ll explain this by using examples in DDL, using SQL Server syntax.

The short version of this is fairly simple. Every table that you want to use in relations should have a primary key constraint; this can either be a single field or a combination of fields defined by an expression. For example:

CREATE TABLE Persons (
    PersonID int NOT NULL PRIMARY KEY,
    PersonName char(80),
    ...

Every table that needs to relate to Persons should have a field that corresponds to the Persons primary key, and to preserve relational integrity that field should have a foreign key constraint. For example:

CREATE TABLE Orders (
    OrderID int NOT NULL PRIMARY KEY,
    ...
    PersonID int FOREIGN KEY REFERENCES Persons(PersonID)
);

There are longer versions of both statements that use the CONSTRAINT keyword, which lets you name the constraint. That’s what most database design tools generate.

Primary keys are always indexed and unique (the field values cannot be duplicated). Other fields can optionally be indexed. It’s often useful to create indexes for foreign key fields and for fields that appear in WHERE and ORDER BY clauses, although not always, because of the potential overhead from writes and updates.

How would you write a query that returns all of the orders placed by John Doe?

SELECT PersonName, OrderID FROM Persons
INNER JOIN Orders ON Persons.PersonID = Orders.PersonID
WHERE PersonName ="John Doe";

In fact, there are four kinds of JOIN: INNER, OUTER, LEFT, and RIGHT. The INNER JOIN is the default (you can omit the word INNER), and it’s the one that includes only rows that contain matching values in both tables. If you want to list persons whether or not they have orders, you’d use a LEFT JOIN, for example:

SELECT PersonName, OrderID FROM Persons
LEFT JOIN Orders ON Persons.PersonID = Orders.PersonID
ORDER BY PersonName;

When you start doing queries that join more than two tables, that use expressions, or that coerce data types, the syntax can get a little hairy at first. Fortunately, there are database development tools that can generate correct SQL queries for you, often by dragging and dropping tables and fields from the schema diagram into a query diagram.

SQL stored procedures

Sometimes the declarative nature of the SELECT statement doesn’t get you where you want to go. Most databases have a facility called stored procedures; unfortunately this is an area where nearly all of the databases use proprietary extensions to the ANSI/ISO SQL standards.

In SQL Server, the initial dialect for stored procedures (or stored procs) was Transact-SQL, aka T-SQL; in Oracle, it was PL-SQL. Both databases have added additional languages for stored procedures, such as C#, Java, and R. A simple T-SQL stored procedure might only be a parameterized version of a SELECT statement. Its advantages are ease of use and efficiency. Stored procedures are optimized when they are saved, not every time they are executed.

A more complicated T-SQL stored procedure might use multiple SQL statements, input and output parameters, local variables, BEGIN...END blocks, IF...THEN...ELSE conditions, cursors (row-by-row processing of a set), expressions, temporary tables, and a whole host of other procedural syntax. Obviously if the stored procedure language is C#, Java, or R, you’re going to use the functions and syntax of those procedural languages. In other words, despite the fact that the motivation for SQL was to use standardized declarative queries, in the real world you see lots of database-specific procedural server programming.

That doesn’t quite take us back to the bad old days of CODASYL database programming (although cursors come close), but it does walk back from the ideas that SQL statements should be standardized and that performance concerns should be left to the database query optimizer. In the end, a doubling of performance is often too much to leave on the table.

Learn SQL

The sites listed below can help you learn SQL, or discover the quirks of various SQL dialects.

  • Codecademy. Learn SQL. Free and interactive. Pro upgrade available for a fee.
  • Khan Academy. Intro to SQL: Querying and Managing Data. Free video tutorial.
  • SoloLearn. SQL Fundamentals. MySQL-oriented. Free.
  • SQL Problems and Solutions and SQL Exercises. Interactive textbook and exercises. Free.
  • SQLZoo. An interactive SQL tutorial developed and maintained by Edinburgh Napier University. Supports Microsoft SQL Server, Oracle Database, MySQL, IBM DB2, and PostgreSQL. Free. 
  • Tutorialspoint. Learn SQL. Text only, not interactive. Free. 
  • Udacity. Intro to Relational Databases. Uses Python and requires some Python knowledge. Free. 
  • Udemy. Free courses include Introduction to Databases and SQL Querying, MySQL Database for Beginners, Microsoft SQL for Beginners, Hands-on SQL for Beginners (SELECT, FROM, and WHERE), and Sachin Quickly Learns (SQL).
  • Vertabelo Academy. SQL Basics, Operating on Data in SQL, Creating Tables in SQL, and eight other interactive SQL courses. Some courses have free test drives, after which there may be a fee. There are seven additional courses for Microsoft SQL Server. The site also has a graphical database design tool for PostgreSQL, MySQL, Oracle Database, SQL Server, SQLite, and IBM DB2.
  • W3Schools. SQL Tutorial. Free and available without registration.

Database documentation:

SQL commands quick reference:

  • DDL: CREATE, ALTER, DROP (refer to tables)
  • DMS: SELECT, INSERT, UPDATE, DELETE (refer to rows)
  • DCL: GRANT, REVOKE (refer to permissions)