Performance Engineering - TempDB File contention identification - DBMarlin
In this topic, I will explain how to find out TempDB File contention (SQL SERVER) with or without DBMarlin - a light weight smart monitoring tool for databases.
Environment : SQL Server 2019, Application type : .Net
Tools : JMeter, Grafana, DBMarlin and SQLQueryStress
In a recent upgrade, we did change the infra with new SQL server replacing the old one and exactly with same configuration in terms of CPU cores , CPU logical processors, memory, and storage in cloud. When we run our load tests with the defined load we could able to see more failures / latency when we exceed certain limit of Vusers. Multiple iterations of run confirms the same and Azure infra monitoring did not hint anything useful. CPU / Memory Utilization of App Server looks under utilized and it hints about some thing to look at Database end.
This below metrics is from the SQL server which shows not much read and write operations during the load test time. Also the CPU utilization around 80% but not exceeded consistently above 80%
Now, I started looking at the DB side monitoring with DBMarlin for the same period and interesting findings. DBMarlin will help much in case of SQL Server wait time statistics which will tell you how long DB is spending on specific waits.
You can see the for the specific duration, most of time DB was waiting for PAGELATCH_UP followed by running. Now you know little bit about SQL Server wait time analysis you can skip to the next heading. for beginners I will explain here .
Usually SQL server will have running, runnable and sleep states. When an app server is trying to execute a specific query, SQL server DB Engine will assign a thread to serve for the purpose. if no threads are available in the SQL server thread pool , then it denotes a CPU contention in SQL server and the it will be in the runnable state. But ideal case, it will be assigned with a thread. Now the thread will take the query and parse it and it will now looking for the data to be read from disk /buffer. Now SQL server engine will put this thread in a suspended state and look for the data from disk / buffer. Once the data is read from DB pages, then it will notify the thread (which is in suspended state) that Data is ready for processing and the thread will be moved runnable queue. if any threads available , it will be immediately moved to running queue and all the processing of data will happen and requested data will be provided to the application server and the thread will be freed.
SQL server keeps tracks of where the thread is spending time (Suspended, runnable, running) and the wait types in case of Suspended queue. This metrics will be stored in DMV and can be retrieved by simple queries against DMV. This data will be residing in DMV until the next SQL server engine restart.
The "sys.dm_os_wait_stats" dynamic management view can be used to obtain information about the wait type details. This view holds cumulative statistics about the wait types until the database engine restarted or the wait statistics cleared manually. Sample code from sqlhacks.com is below:
SELECT TOP 10 wait_typ
AS [Wait Type],
wait_time_ms/1000.0
AS [Total Wait Time (second)],
(wait_time_ms-signal_wait_time_ms)/1000.0
AS [Resource Wait Time (second)],
signal_wait_time_ms/1000.0
AS [Signal Wait Time (second)],
waiting_tasks_count
AS [Wait Count]
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN
(N'CLR_SEMAPHORE',
N'LAZYWRITER_SLEEP',
N'RESOURCE_QUEUE',
N'SQLTRACE_BUFFER_FLUSH',
N'SLEEP_TASK',
N'SLEEP_SYSTEMTASK',
N'WAITFOR',
N'HADR_FILESTREAM_IOMGR_IOCOMPLETION',
N'CHECKPOINT_QUEUE',
N'REQUEST_FOR_DEADLOCK_SEARCH',
N'XE_TIMER_EVENT',
N'XE_DISPATCHER_JOIN',
N'LOGMGR_QUEUE',
N'FT_IFTS_SCHEDULER_IDLE_WAIT',
N'BROKER_TASK_STOP',
N'CLR_MANUAL_EVENT',
N'CLR_AUTO_EVENT',
N'DISPATCHER_QUEUE_SEMAPHORE',
N'TRACEWRITE',
N'XE_DISPATCHER_WAIT',
N'BROKER_TO_FLUSH',
N'BROKER_EVENTHANDLER',
N'FT_IFTSHC_MUTEX',
N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
N'DIRTY_PAGE_POLL',
N'SP_SERVER_DIAGNOSTICS_SLEEP')
ORDER BY wait_time_ms-signal_wait_time_ms DESC;e
This will provide results like below :
Wait Type TotalWaitTime (s) ResourceWaitTime (s) Signal Wait Time (s)Wait Coun
QDS_PERSIST_TASK_MAIN_LOOP_SLEEP 14700.292000 14700.285000 0.007000 246
PAGEIOLATCH_SH 564.933000 563.587000 1.346000 47088
WRITELOG 149.947000 149.598000 0.349000 6445
PAGEIOLATCH_EX 49.911000 49.896000 0.015000 4486
PWAIT_ALL_COMPONENTS_INITIALIZED 31.028000 31.028000 0.000000 5
PREEMPTIVE_OS_DEVICEOPS 27.685000 27.685000 0.000000 8041
IO_COMPLETION 21.395000 21.380000 0.015000 4020
MEMORY_ALLOCATION_EXT 16.901000 16.901000 0.000000 7509349
LCK_M_U 12.684000 12.683000 0.001000 26
WAIT_XTP_HOST_WAIT 11.446000 11.446000 0.000000 6t
The same data can be visualized with DBMarlin in a nice way as a pie chart. Refer the images above.
Now all I know that , DB worker threads spent more time on suspended queue with most of the wait time occurred in PAGELATCH_UP wait time. With the help of DBMarlin , I can find which query is taking a longer time in this wait type.
Recommended by LinkedIn
This is one of the query in a list of queries that is causing PAGELATCH_UP wait type event.
When we started reading about PAGELATCH_UP wait type further. This wait type accumulates while page latches are in update mode. Update mode is not just for UPDATE operations. It is used when SQL needs to read then modify a page, before modification it will place an update on the data then when ready upgrade this to an exclusive latch. If this wait type is high then the common problem of PFS contention may be present. Also most of the tempdb related contention.
Thumb Rule is : One tempDB file for 4 cpu cores
When we created more tempDB files according to our infrastructure then run the similar load this wait type disappears also got good response times. Below Azure snapshot shows much improved DB Utilization (look at the DB read / writes per second)., the DBMarlin snapshot looks like below:
DBMarlin snapshot after fixing number of TempDB files according to the Thumb rule.
Annexure:
How to check the number of Temp Db in SQL Server:
Use Master
select
Name as [LogicalName],
physical_name as [Location],
state_desc as [Status]
from sys.master_files
where database_id = DB_ID(N'tempdb')
How to create more Temp DB's in SQl Server?:
USE [master]
GO
ALTER DATABASE [tempdb] ADD FILE (NAME = N'tempdev2', FILENAME = N'T:\MSSQL\DATA\tempdev2.ndf' , SIZE = 8GB , FILEGROWTH = 0);
ALTER DATABASE [tempdb] ADD FILE (NAME = N'tempdev3', FILENAME = N'T:\MSSQL\DATA\tempdev3.ndf' , SIZE = 8GB , FILEGROWTH = 0);
ALTER DATABASE [tempdb] ADD FILE (NAME = N'tempdev4', FILENAME = N'T:\MSSQL\DATA\tempdev4.ndf' , SIZE = 8GB , FILEGROWTH = 0);
GO;