Database design tricks using Mathematics

10. March 2011 21:54

Increase performance by 30-40% and reduce the table size by 90% by considering mathematics while database design

Writer: Amol Rajamane

Applies to: SQL Server 2000 / 2005 / 2008 / 2011

Summary:

In many area Mathematics makes your life easier and faster. This paper contains the SQL Server database design to get benefits of Bitwise operation in terms of size and performance.

Introduction:

When database table grows in size then its bit difficult to manage insert update deletes and Index maintenance. It’s very difficult to modify the existing schema as well. For example if you want change one datatype  say, INT to BIGINT, the operation would take its own sweet time to complete which is directly proportional to the number of records in the table.
More number of records resulting in slow query performance and large amount of space.  Which could affect the ETL process.
To overcome these two problems, size and performance, database design can  play a very important role. This paper tells you some tricky design to overcome these two problems.

Overview:

You might be aware of the bitwise operation in SQL Server.  Let’s step back and see how it works. Lets consider the simple example of Student and their SkillSet. We need to populate SkillSet table as shown in below.

 SkillSetId Skill 1 DB2 2 FoxPro 4 Informix 8 Interbase 16 MS Access 32 MS SQL Server 64 MySQL

SkillSetId whould be the sequence of POWER of 2, like 1, 2, 4, 8 and so on …
If you see carefully, sum of bit values will never exceeds next bit value. For example
1 + 2 = 3
1 + 2 + 4 = 7
1 + 2 + 4 + 8 = 15
. . .

Consider Student table as shown in below. (Students may be in millions)

 StudentId Name 1 Student 1 2 Student 2 3 Student 3 4 Student 4

Now consider Student 1 has skills sets MS Access, MS SQL Server, MySQL. We need to store Student and their skill sets. Consider the table StudentSkillSet like below. This is the normal way of storing the relational data.

 StudentId SkillSetId 1 16 1 32 1 64

In another approach, Instead of storing SkillSetIds of student, we will store the SUM of SkillSetIds.  Here 16 + 32 + 64 = 112

 StudentId SkillSetId 1 112

This approach reduces the number of records. As you see earlier you have 3 records for StudentId 1.
Now how to relate this 112 number with IDs associated with MSAccess(16), MS SQL Server(32) and MySQL(64).  Here we will use bitwise AND to get the SkillSetIds associated to number 112. Just run the following statement in query analyzer.

```SELECT  112 & 1 AS 'DB2',
112 & 2 AS 'FoxPro',
112 & 4 AS 'Informix',
112 & 8 AS 'Interbase',
112 & 16 AS 'MS Access',
112 & 32 AS 'MS SQL Server',
112 & 64 AS 'MySQL'
GO```

This will return the output like

 DB2 FoxPro Informix Interbase MS Access MS SQL Server MySQL 0 0 0 0 16 32 64

Only skill sets which are non zero are associated to 112. Isn’t it simple?

How it reduces the number of rows and size of table. For that consider there are 1 million students and each student has 10 skillsets. Then there will be 10 millions of rows in StudentSkillSet table. If we store the SUM of skillSetIds there will be the same number of records as in Student table. That means 1 millions.  To get the space used I created two tables StudentSkillSet and StudentSkillSetTemp. StudentSkillSet table contains StudentId and sum of SkillSetIds. And StudentSkillSetTemp contains StudentId and SkillSetId. Here is size difference between both approaches.

Execute below query to get the space used by StudentSkillSet

```EXEC sp_spaceused StudentSkillSet
GO```

Out is as below which indicate total space used = 35.976 MB  (Data + Index_size)

 Name rows Reserved data index_size unused StudentSkillSet 1000000 37216 KB 20816 KB 15160 KB 1240 KB

Same way get the space used by StudentSkillSetTemp

```EXEC sp_spaceused StudentSkillSetTemp
GO```

Out is as below which indicate total space used = 357.544 MB (Data + Index_size)

 name rows reserved Data index_size unused StudentSkillSetTemp 10000000 358224 KB 207824 KB 149720 KB 680 KB

You can see the difference between these two tables in size. There is size saving almost 90%

This is all about the size. Now step back and see if there is any performance impact of this technic.
Let’s create the following tables

1. SkillSet   –- To store all the skillsets
2. Student   –- To store all the students
3. StudentSkillSet  -- To store sum of skillsets associated to the student
4. StudentSkillSetTemp -- To store skillsets associated to the student

Run the below script in your test database.  Please be patient for 4-5 minutes to execute below script.

```-- First drop the dependent objects
IF OBJECT_ID('[dbo].[vwStudentSkillSet]') IS NOT NULL
DROP VIEW [dbo].[vwStudentSkillSet]
GO
IF OBJECT_ID('[dbo].[vwStudentSkillSetTemp]') IS NOT NULL
DROP VIEW [dbo].[vwStudentSkillSetTemp]
GO

-- Drop if exists
IF OBJECT_ID('[dbo].[SkillSet]') IS NOT NULL
DROP TABLE [dbo].[SkillSet]
GO

-- Create SkillSet table
CREATE TABLE [dbo].[SkillSet] (SkillSetID BIGINT NOT NULL, Skill VARCHAR(256) NOT NULL)

GO

-- Insert skillsets, SkillSetIDs must be in POWER of 2
INSERT INTO [dbo].[SkillSet] (SkillSetID, Skill)
SELECT 1 AS SkillSetID, 'DB2' AS Skill UNION ALL
SELECT 2 AS SkillSetID, 'FoxPro' AS Skill UNION ALL
SELECT 4 AS SkillSetID, 'Informix' AS Skill UNION ALL
SELECT 8 AS SkillSetID, 'Interbase' AS Skill UNION ALL
SELECT 16 AS SkillSetID, 'MS Access' AS Skill UNION ALL
SELECT 32 AS SkillSetID, 'MS SQL Server' AS Skill UNION ALL
SELECT 64 AS SkillSetID, 'MySQL' AS Skill UNION ALL
SELECT 128 AS SkillSetID, 'Oracle' AS Skill UNION ALL
SELECT 256 AS SkillSetID, 'Paradox' AS Skill UNION ALL
SELECT 512 AS SkillSetID, 'PostgreSQL' AS Skill UNION ALL
SELECT 1024 AS SkillSetID, 'Borland C++ Builder' AS Skill UNION ALL
SELECT 2048 AS SkillSetID, 'Borland Delphi' AS Skill UNION ALL
SELECT 4096 AS SkillSetID, 'Borland JBuilder' AS Skill UNION ALL
SELECT 8192 AS SkillSetID, 'ErWIN' AS Skill UNION ALL
SELECT 16384 AS SkillSetID, 'GNU C/C++' AS Skill UNION ALL
SELECT 32768 AS SkillSetID, 'MS Visual C++' AS Skill UNION ALL
SELECT 65536 AS SkillSetID, 'Sun ONE Studio' AS Skill UNION ALL
SELECT 131072 AS SkillSetID, 'HP ANSI C++' AS Skill UNION ALL
SELECT 262144 AS SkillSetID, 'JDBC' AS Skill UNION ALL
SELECT 524288 AS SkillSetID, 'JDK' AS Skill UNION ALL
SELECT 1048576 AS SkillSetID, 'Macromedia Flash' AS Skill UNION ALL
SELECT 2097152 AS SkillSetID, 'Macromedia Flash Media Server' AS Skill UNION ALL
SELECT 4194304 AS SkillSetID, 'MS Visual Basic' AS Skill UNION ALL
SELECT 8388608 AS SkillSetID, 'MS Visual J++' AS Skill UNION ALL
SELECT 16777216 AS SkillSetID, 'MS Visual Source Safe' AS Skill UNION ALL
SELECT 33554432 AS SkillSetID, 'Powerbuilder Foundation Classes' AS Skill UNION ALL
SELECT 67108864 AS SkillSetID, 'Rational Rose' AS Skill UNION ALL
SELECT 134217728 AS SkillSetID, 'UML' AS Skill UNION ALL
SELECT 268435456 AS SkillSetID, 'JavaScript' AS Skill UNION ALL
SELECT 536870912 AS SkillSetID, 'Lua' AS Skill UNION ALL
SELECT 1073741824 AS SkillSetID, 'Modula-2' AS Skill UNION ALL
SELECT 2147483648 AS SkillSetID, 'Pascal' AS Skill UNION ALL
SELECT 4294967296 AS SkillSetID, 'Perl' AS Skill UNION ALL
SELECT 8589934592 AS SkillSetID, 'PL/I' AS Skill UNION ALL
SELECT 17179869184 AS SkillSetID, 'PHP' AS Skill UNION ALL
SELECT 34359738368 AS SkillSetID, 'Python' AS Skill UNION ALL
SELECT 68719476736 AS SkillSetID, 'REALBasic' AS Skill UNION ALL
SELECT 137438953472 AS SkillSetID, 'Visual Basic' AS Skill UNION ALL
SELECT 274877906944 AS SkillSetID, 'Active X' AS Skill UNION ALL
SELECT 549755813888 AS SkillSetID, 'ADO' AS Skill UNION ALL
SELECT 1099511627776 AS SkillSetID, 'ASP' AS Skill UNION ALL
SELECT 2199023255552 AS SkillSetID, 'ATL' AS Skill UNION ALL
SELECT 4398046511104 AS SkillSetID, 'BDE' AS Skill UNION ALL
SELECT 8796093022208 AS SkillSetID, 'CGI' AS Skill UNION ALL
SELECT 17592186044416 AS SkillSetID, 'Client-Server Architecture' AS Skill UNION ALL
SELECT 35184372088832 AS SkillSetID, 'COM/DCOM' AS Skill UNION ALL
SELECT 70368744177664 AS SkillSetID, 'CORBA' AS Skill UNION ALL
SELECT 140737488355328 AS SkillSetID, 'DAO/RDO' AS Skill UNION ALL
SELECT 281474976710656 AS SkillSetID, 'DICOM' AS Skill UNION ALL
SELECT 562949953421312 AS SkillSetID, 'HTML' AS Skill UNION ALL
SELECT 1125899906842624 AS SkillSetID, 'HL7' AS Skill UNION ALL
SELECT 2251799813685248 AS SkillSetID, 'IBM WebSphere' AS Skill UNION ALL
SELECT 4503599627370496 AS SkillSetID, 'Image processing' AS Skill UNION ALL
SELECT 9007199254740992 AS SkillSetID, 'JDBC' AS Skill UNION ALL
SELECT 18014398509481984 AS SkillSetID, 'MFC' AS Skill UNION ALL
SELECT 36028797018963968 AS SkillSetID, 'Microsoft .NET' AS Skill UNION ALL
SELECT 72057594037927936 AS SkillSetID, 'MIDAS' AS Skill UNION ALL
SELECT 144115188075855872 AS SkillSetID, 'ODBC' AS Skill UNION ALL
SELECT 288230376151711744 AS SkillSetID, 'OLE' AS Skill UNION ALL
SELECT 576460752303423488 AS SkillSetID, 'Oracle Developer/Programmer' AS Skill UNION ALL
SELECT 1152921504606846976 AS SkillSetID, 'Sound processing' AS Skill UNION ALL
SELECT 2305843009213693952 AS SkillSetID, 'VBA' AS Skill

GO
-- Create PK
ALTER TABLE [dbo].[SkillSet] ADD CONSTRAINT pk_SkillSet PRIMARY KEY (SkillSetID)

GO

-- Drop if exists
IF OBJECT_ID('[dbo].[Student]') IS NOT NULL
DROP TABLE [dbo].[Student]
GO

-- Create student table
CREATE TABLE [dbo].[Student] (StudentID INT IDENTITY(1,1), StudentName VARCHAR(100) NOT NULL)
GO

-- Add 1 million records in Student table
SET NOCOUNT ON
DECLARE @LoopCounter INT
SET @LoopCounter = 1

BEGIN TRAN
WHILE @LoopCounter <= 100000
BEGIN
INSERT [dbo].[Student] VALUES('Student ' + CAST(@LoopCounter AS VARCHAR))
SET @LoopCounter = @LoopCounter + 1
END
COMMIT
GO

-- Create PK
ALTER TABLE [dbo].[Student] ADD CONSTRAINT pk_Student PRIMARY KEY (StudentID)

GO

-- Drop if exists
IF OBJECT_ID('[dbo].[vwStudentSkillSet]') IS NOT NULL
DROP VIEW [dbo].[vwStudentSkillSet]

GO
IF OBJECT_ID('[dbo].[vwStudentSkillSetTemp]') IS NOT NULL
DROP VIEW [dbo].[vwStudentSkillSetTemp]

GO

-- Drop if exists
IF OBJECT_ID('[dbo].[StudentSkillSet]') IS NOT NULL
DROP TABLE [dbo].[StudentSkillSet]
GO

-- Create StudentSkillSet
CREATE TABLE [dbo].[StudentSkillSet] (StudentID INT NOT NULL, SkillSetID BIGINT NOT NULL)
GO

-- Drop if exists
IF OBJECT_ID('[dbo].[StudentSkillSetTemp]') IS NOT NULL
DROP TABLE [dbo].[StudentSkillSetTemp]
GO

-- Create StudentSkillSetTemp
CREATE TABLE [dbo].[StudentSkillSetTemp] (StudentID INT NOT NULL, SkillSetID BIGINT NOT NULL)
GO

-- Insert random 10 skills for each student into StudentSkillSetTemp
SET NOCOUNT ON
DECLARE @LoopCounter INT, @StudentID INT, @strLoopCounter VARCHAR(100)
SET @LoopCounter = 1

BEGIN TRAN
WHILE @LoopCounter <= 100000
BEGIN

SELECT @StudentID = StudentID
FROM Student
WHERE StudentID = @LoopCounter

INSERT [dbo].[StudentSkillSetTemp] (StudentID, SkillSetID)
SELECT TOP 10 @StudentID, SkillSetID
FROM SkillSet
ORDER BY NEWID()

SET @LoopCounter = @LoopCounter + 1

SET @strLoopCounter = CAST(@LoopCounter AS VARCHAR)
RAISERROR (@strLoopCounter, 0, 1) WITH NOWAIT

END
COMMIT

GO

-- Create PK
ALTER TABLE [dbo].[StudentSkillSetTemp] ADD CONSTRAINT pk_StudentSkillSetTemp PRIMARY KEY (SkillSetID, StudentID)
GO

-- Create NC on StudentID
CREATE NONCLUSTERED INDEX [ix_StudentSkillSetTemp_StudentID] ON [dbo].[StudentSkillSetTemp] ([StudentID])
GO

-- Insert SUM of Skillsets for students into StudentSkillSet
INSERT [dbo].[StudentSkillSet] (StudentID, SkillSetID)
SELECT StudentID, SUM(SkillSetID) AS SkillSetID
FROM [dbo].[StudentSkillSetTemp]
GROUP BY StudentID
GO

-- Create PK, Here don't need to create composit key as PK as we have 1-1 mapping
ALTER TABLE [dbo].[StudentSkillSet] ADD CONSTRAINT pk_StudentSkillSet PRIMARY KEY (SkillSetID)
GO

-- Create NC on SkillSetID
CREATE NONCLUSTERED INDEX [ix_StudentSkillSet_StudentID] ON [dbo].[StudentSkillSet] (StudentID)
GO

-- Create view to get the actual skillset ids associated to students
CREATE VIEW [dbo].[vwStudentSkillSet]
WITH SCHEMABINDING
AS
SELECT
sss.StudentID,
ss.SkillSetID,
ss.Skill
FROM [dbo].[StudentSkillSet] sss
CROSS JOIN [dbo].[SkillSet] ss
WHERE ss.SkillSetID > 0
AND ss.SkillSetID | sss.SkillSetID = sss.SkillSetID

GO

-- Here we will create composite PK
CREATE UNIQUE CLUSTERED INDEX ucix_vwStudentSkillSet_StudentID
ON [dbo].[vwStudentSkillSet] (SkillSetID, StudentID)
GO

-- Create the same view for our temp table as well
CREATE VIEW [dbo].[vwStudentSkillSetTemp]
WITH SCHEMABINDING
AS
SELECT  sss.StudentID, ss.SkillSetID, ss.Skill
FROM    [dbo].[StudentSkillSetTemp] sss
INNER JOIN [dbo].[SkillSet] ss ON (sss.SkillSetID = ss.SkillSetID)
WHERE   ss.SkillSetID > 0

GO

-- Create composite PK
CREATE UNIQUE CLUSTERED INDEX ucix_vwStudentSkillSetTemp_StudentID
ON [dbo].[vwStudentSkillSetTemp] (SkillSetID, StudentID)
GO```

Now Lets look at the performance different between these two approaches.
First we will run some queries against the tables directly instead of indexed view which we created above.
Query Number  1 :  -- Get all the Students having Skill Set  1, 2,  and 3

```--With new design
SELECT  sss.StudentID,
ss.SkillSetID,
ss.Skill
FROM    [dbo].[StudentSkillSet] sss
CROSS JOIN [dbo].[SkillSet] ss
WHERE   ss.SkillSetID IN (1,2,4) AND
ss.SkillSetID | sss.SkillSetID = sss.SkillSetID
GO
--With normal design
SELECT  sss.StudentID,
ss.SkillSetID,
ss.Skill
FROM    [dbo].[StudentSkillSetTemp] sss
INNER JOIN [dbo].[SkillSet] ss ON (sss.SkillSetID = ss.SkillSetID)
WHERE   ss.SkillSetID IN (1,2,4)
GO```

Run these queries and see the execution plan.
Figure 1

See the execution plan in Figure 1, It’s 22% faster.

Query Number  2 :  -- Get all the Skill Sets of Student  1

```-- With new design
SELECT  sss.StudentID,
ss.SkillSetID,
ss.Skill
FROM    [dbo].[StudentSkillSet] sss
CROSS JOIN [dbo].[SkillSet] ss
WHERE   ss.SkillSetID > 0 AND
ss.SkillSetID | sss.SkillSetID = sss.SkillSetID
AND sss.StudentID = 1
GO

-- With normal design
SELECT  sss.StudentID,
ss.SkillSetID,
ss.Skill
FROM    [dbo].[StudentSkillSetTemp] sss
INNER JOIN [dbo].[SkillSet] ss ON (sss.SkillSetID = ss.SkillSetID)
WHERE   ss.SkillSetID > 0 AND
sss.StudentID = 1
GO```

Execute above queries and see the execution plan.

Figure 2

See the execution plan in Figure 2, It’s 8% faster.

Now lets run some queries against Indexed views, [dbo].[vwStudentSkillSet] and [dbo].[vwStudentSkillSetTemp]

Query Number  3 :  -- Get all the Students having Skill Sets  1, 2,  and 3 using Indexed views

```--With new design
SELECT  StudentID,
SkillSetID,
Skill
FROM    [dbo].[vwStudentSkillSet]
WHERE   SkillSetID IN (1,2,4)

GO
--With normal design
SELECT  StudentID,
SkillSetID,
Skill
FROM    [dbo].[vwStudentSkillSetTemp]
WHERE   SkillSetID IN (1,2,4)
GO```

Figure 3

See the execution plan in Figure 3, It’s 22% faster.

Query Number  4 :  -- Get all the Skill Sets of Student  1 using Indexed views

```--With new design
SELECT  StudentID,
SkillSetID,
Skill
FROM    [dbo].[vwStudentSkillSet]
WHERE   StudentID = 1

GO
--With new design
SELECT  StudentID,
SkillSetID,
Skill
FROM    [dbo].[vwStudentSkillSetTemp]
WHERE   StudentID  = 1
GO```

Figure 4

Query Number  5 :  -- Get the number of skillsets for Student  1 using Indexed views

```SELECT  COUNT(*)
FROM    [dbo].[vwStudentSkillSet]
WHERE   SkillSetID = 1

GO

SELECT  COUNT(*)
FROM    [dbo].[vwStudentSkillSetTemp]
WHERE   SkillSetID  = 1
GO```

Figure 5

If you observe, all the queries are running at least 22% (this number may vary depending on hardware) faster than old design.

Limitations:

The biggest disadvantage of this design is, you cannot use it if you have more than 62 dimensions keys. In this case I used 62 Skill Sets. This is because the POWER(2,63) crosses the max limit of big integer supported by SQL Server. And binary operations cannot be performed against Float or Decimal values.

Inferences:

This approach is very useful if you have dimensions having keys less than 63 and you have huge facts which holds these dimension keys.
Consider the case where you have 100 millions fact keys and each key hold minimum 10 dimension keys. So there will be 100 * 10 = 1000 Millions of rows. But if you use this design there will be 100 millions of rows. So at the max you could save the space 62 time, it’s huge one …

SQL Queries:

SQL Queries.sql (10.31 kb)

Whitepaper