Why LINQ beats SQL

If you're not a LINQ addict, you might wonder what the fuss is about. SQL isn't broken, so why fix it? Why do we need another querying language?

The popular answer is that LINQ is INtegrated with C# (or VB), thereby eliminating the impedance mismatch between programming languages and databases, as well as providing a single querying interface for a multitude of data sources. While that's true, it's only part of the story. More importantly: when it comes to querying databases, LINQ is in most cases a significantly more productive querying language than SQL.

Compared to SQL, LINQ is simpler, tidier, and higher-level. It's rather like comparing C# to C++. Sure, there are times when it's still best to use C++ (as is the case with SQL), but in most situations, working in a modern tidy language and not having to worry about lower-level details is a big win.

SQL is a very old language—invented in 1974. Since then it's been extended endlessly, but never redesigned. This has made the language messy—rather like VB6 or Visual FoxPro. You might have become so accustomed to this that you can't see anything wrong!

Let's take an example. You want to write a simple query that retrieves customers as follows:

SELECT UPPER(Name)
FROM Customer
WHERE Name LIKE 'A%'
ORDER BY Name

That doesn't look too bad, right? But now suppose these results are feeding a web page, and we want to retrieve just rows 21-30. Suddenly, you need a subquery:

SELECT UPPER(Name) FROM
(
   SELECT *, RN = row_number()
   OVER (ORDER BY Name)
   FROM Customer
   WHERE Name LIKE 'A%'
) A
WHERE RN BETWEEN 21 AND 30
ORDER BY Name

And if you need to support older databases (prior to SQL Server 2005), it gets worse:

SELECT TOP 10 UPPER (c1.Name)
FROM Customer c1
WHERE
   c1.Name LIKE 'A%'
   AND c1.ID NOT IN
   (
      SELECT TOP 20 c2.ID
      FROM Customer c2
      WHERE c2.Name LIKE 'A%'
      ORDER BY c2.Name
   ) 
ORDER BY c1.Name

Not only is this complicated and messy, but it violates the DRY principle (Don't Repeat Yourself). Here's same query in LINQ. The gain in simplicity is clear:

var query =
   from c in db.Customers
   where c.Name.StartsWith ("A")
   orderby c.Name
   select c.Name.ToUpper();

var thirdPage = query.Skip(20).Take(10);

Only when we enumerate thirdPage will the query actually execute. In the case of LINQ to SQL or Entity Framework, the translation engine will convert the query (that we composed in two steps) into a single SQL statement optimized for the database server to which it's connected.

Composability

You might have noticed another more subtle (but important) benefit of the LINQ approach. We chose to compose the query in two steps—and this allows us to generalize the second step into a reusable method as follows:

IQueryable<T> Paginate<T> (this IQueryable<T> query, int skip, int take)
{
   return query.Skip(skip).Take(take);
}

We can then do this:

var query = ...
var thirdPage = query.Paginate (20, 10);

The important thing, here, is that we can apply our Paginate method to any query. In other words, with LINQ you can break down a query into parts, and then re-use some of those parts across your application.

Associations

Another benefit of LINQ is that you can query across relationships without having to join. For instance, suppose we want to list all purchases of $1000 or greater made by customers who live in Washington. To make it interesting, we'll assume purchases are itemized (the classic Purchase / PurchaseItem scenario) and that we also want to include cash sales (with no customer). This requires querying across four tables (Purchase, Customer, Address and PurchaseItem). In LINQ, the query is effortless:

from p in db.Purchases
where p.Customer.Address.State == "WA" || p.Customer == null
where p.PurchaseItems.Sum (pi => pi.SaleAmount) > 1000
select p

Compare this to the SQL equivalent:

SELECT p.*
FROM Purchase p
    LEFT OUTER JOIN 
        Customer c INNER JOIN Address a ON c.AddressID = a.ID
    ON p.CustomerID = c.ID	
WHERE
   (a.State = 'WA' || p.CustomerID IS NULL)
    AND p.ID in
    (
        SELECT PurchaseID FROM PurchaseItem
        GROUP BY PurchaseID HAVING SUM (SaleAmount) > 1000
    )

Extending this example, suppose we want to sort the results in reverse order of price, and also want the salesperson's name and number of purchased items in the final projection. Notice how naturally we can express these additional criteria without repetition:

from p in db.Purchases
where p.Customer.Address.State == "WA" || p.Customer == null
let purchaseValue = p.PurchaseItems.Sum (pi => pi.SaleAmount)
where purchaseValue > 1000
orderby purchaseValue descending
select new
{
   p.Description,
   p.Customer.SalesPerson.Name,
   PurchaseItemCount = p.PurchaseItems.Count()
}

Here's the same query in SQL:

SELECT 
    p.Description,
    s.Name,
    (SELECT COUNT(*) FROM PurchaseItem pi WHERE p.ID = pi.PurchaseID) PurchaseItemCount	
FROM Purchase p
    LEFT OUTER JOIN 
        Customer c 
            INNER JOIN Address a ON c.AddressID = a.ID
            LEFT OUTER JOIN SalesPerson s ON c.SalesPersonID = s.ID
    ON p.CustomerID = c.ID	
WHERE
    (a.State = 'WA' OR p.CustomerID IS NULL)
    AND p.ID in
    (
        SELECT PurchaseID FROM PurchaseItem
        GROUP BY PurchaseID HAVING SUM (SaleAmount) > 1000
    )
ORDER BY
    (SELECT SUM (SaleAmount) FROM PurchaseItem pi WHERE p.ID = pi.PurchaseID) DESC

An interesting point is that it's possible to transliterate the above SQL query back into LINQ, yielding a query that's every bit as repetitive - and at least as clumsy. You'll often see (typically non-working versions of) such queries posted on forums - this happens as a result of thinking in SQL, rather than thinking in LINQ. It's rather like transliterating a Fortran program into C# 7, and then complaining about the clumsy syntax for GOTO.

Shaping Data

Selecting from more than one table in SQL requires joining - the end result being rows of flat tuples. If you've used SQL for many years, you may have become so accepting of this that it may not occur to you that this forced denormalization is often undesirable: it leads to data duplication and makes result sets awkward to work with on the client. In contrast, LINQ lets you retrieve shaped or hierarchical data. This avoids duplication, makes results easier to work with, and in most cases it even obviates the need for joining. For example, suppose we want to retrieve a selection of customers, each with their high-value purchases. In LINQ, you can do this:

from c in db.Customers
where c.Address.State == "WA"
select new
{
   c.Name,
   c.CustomerNumber,
   HighValuePurchases = c.Purchases.Where (p => p.Price > 1000)
}

HighValuePurchases, here, is a collection. And because we were querying an association property, we didn't need to join. Which means the detail of whether this was a inner or outer join is nicely abstracted away. In this case, the query, when translated to SQL, would be an outer join: LINQ doesn't exclude rows just because a subcollection returns zero elements. If we wanted something that translated to an inner join, we could do this:

from c in db.Customers
where c.Address.State == "WA"
let HighValuePurchases = c.Purchases.Where (p => p.Price > 1000)
where HighValuePurchases.Any()
select new
{
   c.Name,
   c.CustomerNumber,
   HighValuePurchases
}

LINQ also supports flat outer joins, adhoc joins, subqueries, and numerous other kinds of queries through a rich set of operators.

Parameterization

What if we wanted to parameterize our previous example, so that the state "WA" came from a variable? This is all we do:

string state = "WA";

var query =
   from c in db.Customers
   where c.Address.State == state
   ...

No messing with parameters on DbCommand objects or worrying about SQL injection attacks. LINQ's parameterization is inline, typesafe, and highly readable. It doesn't just solve the problem—it solves it really well.

As because LINQ queries are composable, we can add predicates conditionally. For example, we could write a method as follows:

IQueryable<Customer> GetCustomers (string state, decimal? minPurchase)
{
    var query = Customers.AsQueryable();
    
    if (state != null)
        query = query.Where (c => c.Address.State == state);
    
    if (minPurchase != null)
        query = query.Where (c => c.Purchases.Any (p => p.Price > minPurchase.Value));
    
    return query;
}

If we call this method with null state and minPurchase values, the following SQL is generated when we enumerate the result:

SELECT [t0].[ID], [t0].[Name], [t0].[AddressID]
FROM [Customer] AS [t0]

However, if we specify values for state and minPurchase, LINQ to SQL will not only add the predicates to the query, but the necessary joins as well:

SELECT [t0].[ID], [t0].[Name], [t0].[AddressID]
FROM [Customer] AS [t0]
LEFT OUTER JOIN [Address] AS [t1] ON [t1].[ID] = [t0].[AddressID]
WHERE (EXISTS(
    SELECT NULL AS [EMPTY]
    FROM [Purchase] AS [t2]
    WHERE ([t2].[Price] > @p0) AND ([t2].[CustomerID] = [t0].[ID])
    )) AND ([t1].[State] = @p1)

Because our method returns an IQueryable, the query is not actually translated to SQL and run until enumerated. This give the caller a chance to add further predicates, pagination, custom projections, and so on.

Static Type Safety

In the preceding queries, if we had declared the state variable as an integer rather than a string, the query would fail at compile-time rather than run-time. The same applies if you get any of the table or column names wrong. This is of real benefit when refactoring: the compiler will tell you if you haven't completely done your job.

Client Processing

LINQ lets you effortlessly shift part of the query onto the client for processing. Why would you want to do this? With a heavily burdened database server, it can actually improve performance. As long as you don't take more data than you need (in other words, you still do all the filtering on the server) you can often help performance by shifting some of the burden of reordering, transforming and regrouping the results onto a less loaded application server. With LINQ, all you need to do is to slip AsEnumerable() into the query, and everything from that point on executes locally.

When not to use LINQ for querying databases

Despite its power, LINQ doesn't deprecate SQL. It takes more than 95% of the querying brunt, but you still sometimes need SQL for hand-tweaked queries - especially with optimization or locking hints. Most providers don't support the use of LINQ for predicated updates, and of course you still need SQL for triggers. (SQL's also needed for stored procedures and functions, although the need for these crops up less often when you're using LINQ). You can combine LINQ with SQL by writing table-valued functions in SQL, and then calling those functions within more elaborate LINQ queries.

Having to know two querying languages is not really an issue because you'll want to learn LINQ anyway—LINQ is so useful for querying local collections and XML DOMs. If you're still using the old XmlDocument-based DOM, you'll find LINQ to XML's DOM a dramatic step up.

And because LINQ is easier to master than SQL, the goal of writing really good queries is more achievable with LINQ than with SQL.

LINQ in the Field

I use LINQ almost exclusively for querying databases because it's more productive.

For writing applications, my personal experience is a LINQ-enabled data access layer (using an API such as LINQ to SQL or Entity Framework) cuts the data access development time by more than a half, as well as making maintenance far easier.