Replace Sub Queries within “IN” or “Not IN” command by JOIN Operator to improve SQL Query Performence

As a developer we have to ensure faster response time of our applications. From our practical experiences, we found that most of the performance degradation are caused by non optimized SQL query. At the very initial stage developers may not experience slow query performance but sooner or later he will realize or acknowledge when database started growing rapidly. So for performance issues we have to concentrate on costly database queries first. Tuning SQL queries is not an easy task. Because lot of parameters may increase query execution time. One of the most common thumb rule for optimizing SQL query is to replace all sub queries within “IN” or “NOT IN” clause. Here in this article we will show you how a sub query decrease the performance and what will be the remedy.

Case Study:
One of our technical team got an assignment to process millions of CSV files and do bulk insert into a SQL Server database. To ensure 100% accuracy they have added an extra column, named filename for each record in each table. So that by using a unique count they easily get the no of inserted files and also can identify the missing file list. They have done everything smoothly. Now time to report on data accuracy. They run the file count but unfortunately the count is lower than the no of files they have to be inserted. They build a query, execute it but the execution time is not up to the mark. They have to run this query to find out the missing file list for almost 100+ times. So this is a case where we need to tune up query to improve the query execution performance. Management involved our development team & assigned me to make sure the quality delivery in time.

What I found:
They have created 2 tables. One contains the file list they need to process & another one contains the inserted file list. They used a sub query as usual within a “NOT IN” clause.

How I Solved it?
I have just re write the query using LEFT JOIN Operator instead of sub query & it’s now taking few seconds instead of more than an hour.

Didn’t get it yet? Start a simulation:
Create two tables & named those as “Source” & “Db” (table name, don’t be confused). Let’s say “Source” table contains all file name & “Db” table contains only inserted files name.

SQL Script to Create “Source Table”:

CREATE TABLE [dbo].(
	[filename] [varchar](500) NULL
) ON [PRIMARY]

SQL Script to Create “Db Table”:

CREATE TABLE [dbo].[db](
	[filename] [varchar](500) NULL
) ON [PRIMARY]

Now need to insert dummy data into SQL Server Table:
Step 1:

DECLARE @id INT
SET @id= 1
WHILE(@id<=100000)
BEGIN
     INSERT INTO Source Values('TAP_File_'+CAST(@id as varchar))
     SET @id=@id+1
END

Step 2:

DECLARE @id INT
SET @id= 1
WHILE(@id<=99000)
BEGIN
     INSERT INTO DB Values('TAP_File_'+CAST(@id as varchar))
     SET @id=@id+1
END

Now the “source” table contains 100000 files name and “db” table contains 99000 files name. So the missing file count is 1000. So all are set to perform the query performance test. Write the Sub Query as they wrote like below:

SELECT filename FROM Source
WHERE RIGHT(filename,8) NOT IN (SELECT RIGHT(filename,8) FROM db)

Now replace the Sub Query with Left JOIN Operator:

SELECT S.filename FROM Source S
LEFT JOIN
db D ON RIGHT(S.filename,8)=RIGHT(D.filename,8)
WHERE D.filename IS NULL

Now run both. See the below video, captured to prove the performance enhancement:
Replace Sub Queries within IN or Not IN command by JOIN operator to improve SQL query Performence
Note: Another performence tip is, never use Count(*) like above. Always use Count(Column_Name/Column_Number).

Root Cause:
Sub Query caused the performance degradation.

Why Sub Query is performance killer?
“Not IN” command requires a check to be made on each record individually resulting the slow performance. In this case the number is 100000*99000.

The final word:
So what we learn? Sub Query is one of the major performance killer. Never use sub queries for big result set. Rather try to convert those with join operator.

Replace “IN/NOT IN” Using Full Join (another experiment):

SELECT S.filename FROM Source S
FULL JOIN
db D ON RIGHT(S.filename,8)=RIGHT(D.filename,8)
WHERE D.filename IS NULL

So far we have talked about to find out the missing files. If we want to know about the inserted files or reconcile the list we can do it in the following way:

SELECT S.filename FROM Source S
INNER JOIN
db D ON RIGHT(S.filename,8)=RIGHT(D.filename,8)

Hope this SQL/TSQL performance tuning article with the above real life case study will make you understand to avoid sub queries in future whenever possible. So, be cautious on using Sub Query !!

Posted in Database, SQL Server, SQL Server SQL/TSQL Performence Tuning

Leave a Reply

Your email address will not be published. Required fields are marked *

     

*