As a developer we have to ensure faster response time of our applications. From our practical experiences, we found that most of the performance degradation are caused by non optimized SQL query. At the very initial stage developers may not experience slow query performance but sooner or later he will realize or acknowledge when database started growing rapidly. So for performance issues we have to concentrate on costly database queries first. Tuning SQL queries is not an easy task. Because lot of parameters may increase query execution time. One of the most common thumb rule for optimizing SQL query is to replace all sub queries within “IN” or “NOT IN” clause. Here in this article we will show you how a sub query decrease the performance and what will be the remedy.
One of our technical team got an assignment to process millions of CSV files and do bulk insert into a SQL Server database. To ensure 100% accuracy they have added an extra column, named filename for each record in each table. So that by using a unique count they easily get the no of inserted files and also can identify the missing file list. They have done everything smoothly. Now time to report on data accuracy. They run the file count but unfortunately the count is lower than the no of files they have to be inserted. They build a query, execute it but the execution time is not up to the mark. They have to run this query to find out the missing file list for almost 100+ times. So this is a case where we need to tune up query to improve the query execution performance. Management involved our development team & assigned me to make sure the quality delivery in time.
What I found:
They have created 2 tables. One contains the file list they need to process & another one contains the inserted file list. They used a sub query as usual within a “NOT IN” clause.
How I Solved it?
I have just re write the query using LEFT JOIN Operator instead of sub query & it’s now taking few seconds instead of more than an hour.
Didn’t get it yet? Start a simulation:
Create two tables & named those as “Source” & “Db” (table name, don’t be confused). Let’s say “Source” table contains all file name & “Db” table contains only inserted files name.
SQL Script to Create “Source Table”:
CREATE TABLE [dbo].( [filename] [varchar](500) NULL ) ON [PRIMARY]
SQL Script to Create “Db Table”:
CREATE TABLE [dbo].[db]( [filename] [varchar](500) NULL ) ON [PRIMARY]
Now need to insert dummy data into SQL Server Table:
DECLARE @id INT SET @id= 1 WHILE(@id<=100000) BEGIN INSERT INTO Source Values('TAP_File_'+CAST(@id as varchar)) SET @id=@id+1 END
DECLARE @id INT SET @id= 1 WHILE(@id<=99000) BEGIN INSERT INTO DB Values('TAP_File_'+CAST(@id as varchar)) SET @id=@id+1 END
Now the “source” table contains 100000 files name and “db” table contains 99000 files name. So the missing file count is 1000. So all are set to perform the query performance test. Write the Sub Query as they wrote like below:
SELECT filename FROM Source WHERE RIGHT(filename,8) NOT IN (SELECT RIGHT(filename,8) FROM db)
Now replace the Sub Query with Left JOIN Operator:
SELECT S.filename FROM Source S LEFT JOIN db D ON RIGHT(S.filename,8)=RIGHT(D.filename,8) WHERE D.filename IS NULL
Now run both. See the below video, captured to prove the performance enhancement:
Note: Another performence tip is, never use Count(*) like above. Always use Count(Column_Name/Column_Number).
Sub Query caused the performance degradation.
Why Sub Query is performance killer?
“Not IN” command requires a check to be made on each record individually resulting the slow performance. In this case the number is 100000*99000.
The final word:
So what we learn? Sub Query is one of the major performance killer. Never use sub queries for big result set. Rather try to convert those with join operator.
Replace “IN/NOT IN” Using Full Join (another experiment):
SELECT S.filename FROM Source S FULL JOIN db D ON RIGHT(S.filename,8)=RIGHT(D.filename,8) WHERE D.filename IS NULL
So far we have talked about to find out the missing files. If we want to know about the inserted files or reconcile the list we can do it in the following way:
SELECT S.filename FROM Source S INNER JOIN db D ON RIGHT(S.filename,8)=RIGHT(D.filename,8)
Hope this SQL/TSQL performance tuning article with the above real life case study will make you understand to avoid sub queries in future whenever possible. So, be cautious on using Sub Query !!