If you talk to any data professionals about query performance, they’ll bring up indexes in the first few minutes. For good reasons, Indexes are critical to the overall performance of the database. There’s no such thing as required SQL indexes. However, I often get asked what indexes you should add when creating a new table. I have a simple guideline for two types of columns to index.
To get us started, let’s create a demo environment. This script creates three tables and only takes about 20 seconds to run. Please don’t run this in a production environment unless your resume and LinkedIn are up to date.
-- Run this script to follow along with the demo. USE [master]; GO -- Checking to see if our database exists and if it does drop it. IF DATABASEPROPERTYEX ('SqlHabits','Version') IS NOT NULL BEGIN ALTER DATABASE [SqlHabits] SET SINGLE_USER WITH ROLLBACK IMMEDIATE; DROP DATABASE [SqlHabits]; END GO -- Make sure you have at least 1GB to follow along. Also make sure the file path is already created. CREATE DATABASE [SqlHabits] ON PRIMARY ( NAME = N'TestDatabase', FILENAME = N'C:\SQLFiles\SqlHabits.mdf' ,SIZE = 100000KB , FILEGROWTH = 100000KB) LOG ON ( NAME = N'TestDatabase_log', FILENAME = N'C:\SQLFiles\SqlHabits_log.ldf',SIZE = 100000KB , FILEGROWTH = 100000KB) GO ALTER DATABASE [SqlHabits] SET RECOVERY SIMPLE; GO USE [SqlHabits]; GO CREATE SCHEMA [Sales]; GO CREATE SCHEMA [Admin]; GO -- Script inspired by Cathrine Wilhelmsen. Please checkout her blog https://www.cathrinewilhelmsen.net/. CREATE TABLE [Admin].[Numbers] ( [Number] bigint NOT NULL, CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED ([Number]) WITH FILLFACTOR = 100); GO WITH L0 AS(SELECT 1 AS c UNION ALL SELECT 1), L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B), L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B), L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B), L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B), L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B), Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5) INSERT INTO [Admin].[Numbers] ([Number]) SELECT TOP (1000000) n FROM Nums ORDER BY n; GO CREATE TABLE [Sales].[SalesPerson] ( [Id] int identity(1,1) NOT NULL, [EmployeeNumber] nvarchar(10) NOT NULL, [FirstName] nvarchar(500) NOT NULL, [LastName] nvarchar(500) NOT NULL, [SalaryHr] decimal(32,2) NULL, [Email] nvarchar(500) NULL, [IsActive] bit NOT NULL, [StartDate] date NOT NULL, [CreateDate] datetime NOT NULL DEFAULT GETDATE(), [ModifyDate] datetime NULL, ); GO WITH FirstName AS ( SELECT 'Tom' AS FirstName UNION ALL SELECT 'Sally' AS FirstName UNION ALL SELECT 'Bill' AS FirstName UNION ALL SELECT 'Karen' AS FirstName UNION ALL SELECT 'Lisa' AS FirstName UNION ALL SELECT 'Kerrie' AS FirstName UNION ALL SELECT 'Arun' AS FirstName UNION ALL SELECT 'Wanda' AS FirstName UNION ALL SELECT 'Tammy' AS FirstName UNION ALL SELECT 'Sarah' AS FirstName UNION ALL SELECT 'Emmit' AS FirstName UNION ALL SELECT 'Chris' AS FirstName UNION ALL SELECT 'Cathy' AS FirstName UNION ALL SELECT 'Dion' AS FirstName UNION ALL SELECT 'Aakash' AS FirstName ), LastName AS ( SELECT 'Jones' AS LastName UNION ALL SELECT 'Smith' AS LastName UNION ALL SELECT 'House' AS LastName UNION ALL SELECT 'Knocks' AS LastName UNION ALL SELECT 'James' AS LastName UNION ALL SELECT 'Friend' AS LastName UNION ALL SELECT 'Seker' AS LastName UNION ALL SELECT 'Lincoln' AS LastName UNION ALL SELECT 'Morgan' AS LastName UNION ALL SELECT 'Jones' AS LastName UNION ALL SELECT 'Jones' AS LastName UNION ALL SELECT 'Kumar' AS LastName ) INSERT INTO [Sales].[SalesPerson] ([EmployeeNumber], [FirstName], [LastName], [SalaryHr], [Email], [IsActive], [StartDate]) SELECT CONCAT('000',ROW_NUMBER() OVER(ORDER BY (SELECT NULL))) AS 'EmployeeNumber', FirstName AS 'FirstName', LastName AS 'LastName', ABS(CHECKSUM(NEWID()) % 100) + 50 AS 'SalaryHr', CONCAT(FirstName,'.',LastName,n.number,'@SqlHabits.com') AS 'Email', CASE WHEN n.Number % 20 = 0 THEN 0 ELSE 1 END AS 'IsActive', DATEADD(DAY, RAND(CHECKSUM(NEWID()))*(1+DATEDIFF(DAY, '01/01/2016', '05/01/2022')),'01/01/2016') AS 'StartDate' FROM FirstName CROSS JOIN LastName CROSS JOIN [Admin].[Numbers] n WHERE [n].[Number] < 30; GO CREATE TABLE [Sales].[SalesOrder] ( [Id] int identity(1,1) NOT NULL, [SalesPerson] int NOT NULL, [SalesAmount] decimal(36,2) NOT NULL, [SalesDate] date NOT NULL, [OrderDescription] nvarchar(MAX) NULL, [CreateDate] datetime NOT NULL DEFAULT GETDATE(), [ModifyDate] datetime NULL, ); GO DECLARE @Count int = 0; DECLARE @UpperBound int = (SELECT MAX(Id) FROM [Sales].[SalesPerson]) WHILE (@Count < 1000000) BEGIN INSERT INTO [Sales].[SalesOrder] ([SalesPerson], [SalesAmount], [SalesDate], [OrderDescription]) SELECT ABS(CHECKSUM(NEWID()) % @UpperBound) + 1 AS 'SalesPerson', ABS(CHECKSUM(NEWID()) % 50) + 10 AS 'SalesAmount', DATEADD(DAY, RAND(CHECKSUM(NEWID()))*(1+DATEDIFF(DAY, '01/01/2016', '04/30/2020')),'01/01/2016') AS 'SalesDate', REPLICATE('I love buying vintage games!',10) AS 'SalesDescription' FROM [Admin].[Numbers] AS nt WHERE [nt].[Number] < 10001 SET @Count = @Count + @@ROWCOUNT; END GO
Clustered Index on Primary Key
If you look through the syntax above, you can see we created a SalesPerson and SalesOrder table. We didn’t specify in the syntax, but the first thing I would do is add a clustered index and make the primary key on both tables.
You could argue that we could use the EmployeeNumber on the SalesPerson table as the primary key, but I’m sticking with the Id since it’s an integer. It will likely never change. However, I’ve seen an employee identification number change with new HR software.
Why add the clustered index, you might ask. Because currently, we have two heaps on our hands. Suppose we try and perform a simple select statement. Make sure to turn on your actual execution plan.
SELECT * FROM [Sales].[SalesOrder] Where Id = 100;
The plan will look something like this. Notice that thick line. We have to scan the entire table to find one row!
Let’s go ahead and create the clustered indexes and specify the primary keys for both tables.
ALTER TABLE [Sales].[SalesOrder] ADD CONSTRAINT PK_SalesOrder_Id PRIMARY KEY CLUSTERED (Id); GO ALTER TABLE [Sales].[SalesPerson] ADD CONSTRAINT PK_SalesPerson_Id PRIMARY KEY CLUSTERED (Id); GO -- Check your execution plan again. SELECT * FROM [Sales].[SalesOrder] Where Id = 100;
Now that’s much better! The table’s now sorted by the Id column.
Nonclustered Index on Foreign Keys
The second type of index I recommend is a nonclustered on any foreign keys. We don’t currently have a foreign key on the SalesOrder table. Let’s start by creating it.
ALTER TABLE [Sales].[SalesOrder] ADD CONSTRAINT [FK_SalesPerson_Id] FOREIGN KEY ([SalesPerson]) REFERENCES [Sales].[SalesPerson]([Id]); GO -- Let's select a specific SalesPerson from the table. SELECT * FROM [Sales].[SalesOrder] so WHERE so.SalesPerson = 100
What’s going on with the plan? Why do you have to scan the entire SalesOrder table when filtering on a SalesPerson?
That’s right, just because you add a foreign key, it doesn’t mean an index is created. Now let’s create the nonclustered index on the SalesPerson column.
CREATE NONCLUSTERED INDEX IX_SalesPerson_Id ON [Sales].[SalesOrder](SalesPerson); GO -- Let's run our query again. SELECT so.* FROM [Sales].[SalesOrder] so WHERE so.SalesPerson = 100;
The plan above is what I’m expecting. I know the Key Lookup isn’t ideal, but it’s better than what we had.
What’s the Bottom Line?
For 90%+ of tables, I’d advise adding a clustered index to the primary key. Additionally, I’d create nonclustered indexes on any foreign key columns. It should go without saying these guidelines are only a starting place for your overall indexing strategy.
Notice that I said 90%+; there will always be exceptions to the rules. For example, I wouldn’t apply this same principle if you’re loading data into staging tables.
Please let me know in the comments below if you have other ideas.