
SQLServer Quiz 2010 - SQL Server Execution plan scans: what do they say?

  • SSQA Retail limited uses an inventory application for their day-to-day stock control operations. The actively used tables within the application database are Products (200,000 rows), Sales (5.5 million rows) and Customers (750,000 rows), with a one-to-one relationship between the Sales and Customers tables and one-to-many relationships between the Products, Sales and Customers tables. Recently the Marketing team has complained about performance issues with 2 active reports (daily-sales and stock), resulting in slow performance and occasional time-outs.

    A server-side trace using the TSQL tuning template shows that 3 stored procedures are used during those reports' execution, and SHOWPLAN event activity reveals that there are multiple scan operations such as index scan, hash-join, index seek and table scan on the above 3 tables.

    You as a DBA have the challenge of providing a descriptive answer about each of the above operations (hash-join/scan/seek) and feedback on which operation will be fastest (and why) and which will be slowest (and why), with valid reasons. You are at liberty to define an index strategy between the 3 tables.

    Posted on 10-25-2010 00:00 |
    Satya Jayanty (@sqlmaster)
    34 · 5% · 1720

12  Answers  

  • Score
    8

    Hash Join:

    A hash join is one of the join techniques SQL Server uses internally to join 2 tables. It uses the table with fewer records as the build table and the other as the probe table. For each row of the build table, it hashes the key columns (the columns used in the join expression) and builds a hash table. Then it processes the probe table: for each row it hashes the key columns and uses the hash to find the location of the matching entries directly.

    The advantage of a hash join is that the hash table provides O(1) time complexity to find a match for a row, whereas nested loop and merge joins provide O(n²) and O(n log n) respectively. Hence a hash join is faster for huge tables, whereas the other join methods are faster for small numbers of rows (since the hash join has the additional cost of building the hash table).
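
    A minimal T-SQL sketch of this, using the Sales and Products tables from the question (ProductName and Quantity are assumed column names), forces a hash join so the build/probe behaviour can be seen in the plan:

    -- With an explicit join hint the written join order is kept, so the smaller Products
    -- table is the build input and the larger Sales table is probed against the hash table.
    -- ProductName and Quantity are assumed columns, not given in the question.
    SELECT p.ProductName, SUM(s.Quantity) AS TotalSold
    FROM dbo.Products AS p
    INNER HASH JOIN dbo.Sales AS s
        ON s.ProductID = p.ProductID
    GROUP BY p.ProductName;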

    Seek:

    An index/table seek indicates that the SQL engine is using the B-tree (binary search) mechanism while processing the records. The binary search mechanism follows 3 rules.

    1. If the current node is greater than the search value, go to its left child (in SQL Server, the previous page/previous record).

    2. If the current node is less than the search value, go to its right child (in SQL Server, the next page/next record).

    3. When it finds the page, mark it as a match and go to the next record, until it finds one with a higher value than the search value (for EXISTS-style logic, it stops as soon as it finds one record).

    Applying these rules reduces the cost of the search from the number of pages to the number of index levels. Suppose the Sales table has 1 million records across 10,000 pages: with a scan it sequentially searches all pages, so it always takes 10,000 reads (except with EXISTS, where it stops when it finds a record). With a seek it takes far fewer reads, roughly the number of levels in the index (mostly fewer than 10 pages).

    Scan :

    An index/table scan indicates that the engine is processing the rows one by one in a sequential search, without taking advantage of the B-tree mechanism. Suppose a Sales table holds 30,000 records covering the 30 days of a month, and we query the 16th day's records: with a scan it reads all 30,000 records and does not take advantage of the logical ordering, whereas with an index on the date it can go straight to the 16th day's records with a few page reads.
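
    A small sketch of that 16th-day example, assuming a SalesDate column and otherwise hypothetical column names:

    -- With this index the query below can seek straight to one day's rows
    -- instead of scanning all 30,000 records; SalesID, ProductID and Quantity are assumed.
    CREATE NONCLUSTERED INDEX IX_Sales_SalesDate ON dbo.Sales (SalesDate);

    SELECT SalesID, ProductID, Quantity
    FROM dbo.Sales
    WHERE SalesDate >= '20101016' AND SalesDate < '20101017';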

    Now, comparing slow versus fast:

    For huge tables, a hash join will be faster, whereas for smaller tables nested loop joins or merge joins will be faster.

    Seeks are generally faster and scans slower, but at the same time we need to look at random versus sequential I/O: when retrieving a huge set of records, random I/O costs more, since the disk has to move back and forth more.

    Coming to the indexing strategy:

    The Products table will have an index on ProductID.

    The Sales table will have an index on SalesDate, ProductID, CustomerID.

    The Customers table will have an index on CustomerID.

    The daily report will greatly benefit from the index on SalesDate, so it can process only the pages for that day's records using an index seek.
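
    A minimal sketch of that strategy in T-SQL (the index names and the clustered/non-clustered choices are illustrative assumptions, not part of the answer above):

    -- Index names and clustered vs non-clustered choices are assumptions.
    CREATE CLUSTERED INDEX IX_Products_ProductID ON dbo.Products (ProductID);
    CREATE CLUSTERED INDEX IX_Customers_CustomerID ON dbo.Customers (CustomerID);
    CREATE NONCLUSTERED INDEX IX_Sales_SalesDate_ProductID_CustomerID
        ON dbo.Sales (SalesDate, ProductID, CustomerID);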

    For the stock report, which shows the current remaining stock of each product, I would maintain an additional "remaining quantity" column in the stock table rather than recomputing it by summing the sales and incoming quantities and subtracting. Having the additional column simplifies the report, but it depends on the type of system. In a more transactional system, keeping the additional column increases insert/update/delete costs, because whenever the Sales table (or the table that stores incoming quantities) changes, the stock table must be updated too; that means we need to implement transactions, which slows down inserts, updates and deletes. In a more analytical system, it is better to have the additional column.
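
    A hedged sketch of the transactional update described above, assuming hypothetical Sales and Stock tables with a RemainingQuantity column:

    -- Hypothetical tables/columns: keep a denormalized RemainingQuantity column in sync
    -- with each sale. @ProductID, @CustomerID and @Quantity stand for the sale being recorded.
    DECLARE @ProductID int = 42, @CustomerID int = 1001, @Quantity int = 3;

    BEGIN TRANSACTION;

        INSERT INTO dbo.Sales (ProductID, CustomerID, SalesDate, Quantity)
        VALUES (@ProductID, @CustomerID, GETDATE(), @Quantity);

        UPDATE dbo.Stock
        SET RemainingQuantity = RemainingQuantity - @Quantity
        WHERE ProductID = @ProductID;

    COMMIT TRANSACTION;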

    Replied on Oct 25 2010 1:36AM  . 
    Ramireddy
    2 · 40% · 12972
  • Score
    9

    Question 1: A descriptive answer about each of the above operations (hash-join/scan/seek)

    Hash Join:

    A hash join gives the best performance when two or more large tables are joined and at least one of them has no index or is not sorted.

    When it comes to physical join operators, hash join does the heavy lifting. Hash join excels at performing the largest joins. Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.

    A hash join is a two-phase process: 1) the build phase and 2) the probe phase.

    1. Build phase: it reads the rows of the smaller table into memory and hashes the key columns that will be used for the join.

    2. Probe phase: it reads the rows of the other table and hashes its join key columns, checking each hashed key for a matching row in the hash table created during the build phase. Keeping the smaller table in memory and the larger table on disk is the basic rule; if the smaller table does not fit in memory it spills to disk (the DBSPACETEMP configuration parameter is used to store the hashed table during the probe phase). In summary, instead of joining directly on the columns, a separate hash table is created and used to join the tables for improved performance.

    Scans and seeks are the iterators that SQL Server uses to read data from tables and indexes. These iterators are among the most fundamental ones that SQL Server supports.

    Scan

    A scan returns the entire table or index.

    Since a scan touches every row in the table whether or not it qualifies, the cost is proportional to the total number of rows in the table. Thus, a scan is an efficient strategy if the table is small or if most of the rows qualify for the predicate. However, if the table is large and if most of the rows do not qualify, we touch many more pages and rows and perform many more I/Os than is necessary.

    Seek

    A seek efficiently returns rows from one or more ranges of an index based on a predicate.

    Since a seek only touches rows that qualify and pages that contain these qualifying rows, the cost is proportional to the number of qualifying rows and pages rather than to the total number of rows in the table. Thus, a seek is generally a more efficient strategy if we have a highly selective seek predicate; that is, if we have a seek predicate that eliminates a large fraction of the table.

    For example, consider the following query:

     select OrderDate from Orders where OrderKey = 2
    

    With a scan, we read each row in the Orders table, evaluate the predicate "where OrderKey = 2" and, if the predicate is true (i.e., if the row qualifies), return the row. In this case, we refer to the predicate as a "residual" predicate.

    With a Seek, if we have an index on OrderKey, a seek may be a better plan. With a seek, we use the index to navigate directly to those rows that satisfy the predicate. In this case, we refer to the predicate as a “seek” predicate. In most cases, we do not need to re-evaluate the seek predicate as a residual predicate; the index ensures that the seek only returns rows that qualify
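
    A small sketch of the index that would enable that seek, assuming the Orders table and the OrderKey/OrderDate columns from the example above:

    -- With this index the query above becomes an index seek with a seek predicate on OrderKey;
    -- including OrderDate also makes the index covering for that query.
    CREATE NONCLUSTERED INDEX IX_Orders_OrderKey
        ON dbo.Orders (OrderKey)
        INCLUDE (OrderDate);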

    Question 2: Feedback which operation will be fastest (why) and will be slowest (why) with valid reasons.

    Which of the operations (hash-join/scan/seek) will be fastest and which slowest depends on the scenario.

    Hash Join: A hash join gives the best performance when two or more large tables are joined and at least one of them has no index or is not sorted. When it comes to physical join operators, hash join does the heavy lifting. Hash join excels at performing the largest joins. Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.

    Scan: A scan returns the entire table or index. Since a scan touches every row in the table whether or not it qualifies, the cost is proportional to the total number of rows in the table. Thus, a scan is an efficient strategy if the table is small or if most of the rows qualify for the predicate. However, if the table is large and if most of the rows do not qualify, we touch many more pages and rows and perform many more I/Os than is necessary.

    Seek: A seek efficiently returns rows from one or more ranges of an index based on a predicate. Since a seek only touches rows that qualify and pages that contain these qualifying rows, the cost is proportional to the number of qualifying rows and pages rather than to the total number of rows in the table. Thus, a seek is generally a more efficient strategy if we have a highly selective seek predicate; that is, if we have a seek predicate that eliminates a large fraction of the table.

    Question 3: define an index strategy between 3 tables

    On the Products table: add a unique clustered index on ProductID.

    On the Sales table: add a unique clustered index on InvoiceNo, and a non-clustered index on SalesDate, ProductID, CustomerID.

    On the Customers table: add a unique clustered index on CustomerID.
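
    A minimal T-SQL sketch of that strategy (InvoiceNo is taken from the answer above; the index names are assumptions):

    -- Index names are illustrative.
    CREATE UNIQUE CLUSTERED INDEX UCI_Products_ProductID ON dbo.Products (ProductID);
    CREATE UNIQUE CLUSTERED INDEX UCI_Sales_InvoiceNo ON dbo.Sales (InvoiceNo);
    CREATE NONCLUSTERED INDEX IX_Sales_SalesDate_ProductID_CustomerID
        ON dbo.Sales (SalesDate, ProductID, CustomerID);
    CREATE UNIQUE CLUSTERED INDEX UCI_Customers_CustomerID ON dbo.Customers (CustomerID);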

    References:

    Hash Join

    Scan Vs Seek

    Replied on Oct 25 2010 2:32AM  . 
    Sivaprasad S - SIVA
    240 · 1% · 188
  • Score
    5

    First, down to business: I would define what an index is and how it is useful for organizing data, probably using the book-index example, and give a preview of a simple storage structure using a B-tree. Then I would define the terms as below:

    1. Index Scan – when SQL Server has to read through each and every record stored in the index. Good for smaller tables.
    2. Index Seek – when SQL Server can go directly to the records in the index using the B-tree structure. The most efficient operation.
    3. Table Scan – when there is no effective index to be used for the query and the SQL engine has to scan the whole table – not efficient, especially for larger tables.
    4. Hash Join – record search using hash tables – effective for large tables/joins.

    Strategy to optimize :

    1. All tables should have a clustered index
    2. Create non-clustered indexes for all foreign keys
    3. Create non-clustered or covering indexes for columns used in search criteria (covering what is in the result set).
    Replied on Oct 25 2010 9:37AM  . 
    rpathak
    300 · 0% · 145
  • Score
    6

    Provide a descriptive answer about each of the above operations (hash-join/scan/seek)

    I would use the (common?) phone book example as a base for a descriptive answer: the task is to find all people (last name and zip code) with a local phone number starting with '4', a first name equal to 'Joseph', and a zip code starting with '1001'.

    table scan

    A query against a single table that will read the entire table (each and every row) and compare the columns in question with the search predicates in order to find matching rows. This query might be part of a more complex query, whereas the table scan is used to read (and, if possible, to limit) the number of matching rows and extract the columns required for further processing.

    In order to answer the phone book question from above we would need to read each and every entry off the complete NYC phone books and compare each and every entry with our search arguments.

    index scan

    An index is a logical order of rows holding a (limited) number of columns from a table. A clustered index is a special kind of an index since the logical order of the key values determines the physical order of the corresponding rows in a table, therewith representing the whole table.

    An index scan will read each and every entry to find matching rows by comparing the index column(s) with the related search arguments.

    The "clustered index" of our physical phone book could be based on area code range, last name, first name, and street holding the complete address and some adverts here and there... (I have to guess here since I don't have a physical copy of one of the NYC's phone books...).

    If this were the only index available, it would be the same as searching the original phone book. Or, in other words, a clustered index scan equals a table scan.

    Now let's assume we would have another kind of phone book with the same sort order as the original one, just without any adverts and with the zip code only instead of the entire address. If we were able to choose one of the two books knowing the sort order and content, which one would we use to find the answer? The smaller book, of course. SQL Server would do the same to speed up the process...

    index seek

    An index seek is similar to an index scan, with the major difference that a seek will only read a specific section of the index to find matching rows. The more selective an index is, the more likely it will be chosen during query optimization.

    If we had another kind of phone book (let's call it NYC_2), ordered by first name, zip code, and local phone number and holding all the other information requested, we would probably use that book, since we would just have to pick the 'Joseph' section, locate the zip code section, and check that small section of the whole book for the phone numbers starting with '4'.

    Sounds like a doable task, even manually...

    (I'll leave the discussion about "covering index" and "nonkey columns at the leaf level -aka included columns-" for a separate discussion since it will go beyond the scope from my point of view...)

    conclusion:

    An index seek will be the fastest (or at least equally fast) compared to an index scan on the same index. If there is no selection that limits the number of affected rows, an index seek will take as long as an index scan.

    If we had to find all people (last name and zip code) from NYC without any other conditions, we'd still have to read the whole NYC_2 book. We would not benefit from the sort order, but we would benefit from the reduced size (e.g. due to the eliminated adverts).

    An index scan will usually be faster than a table scan (assuming the index in question is smaller in size compared to the clustered index / heap table). An index scan on the clustered index will take as long as a table scan.

    hash-join

    Unlike the index operations above, a hash join describes the way two tables are joined in a query. As per http://blogs.msdn.com/b/craigfr/archive/2006/08/10/687630.aspx:

    When it comes to physical join operators, hash join does the heavy lifting. While nested loops join works well with relatively small data sets and merge join helps with moderately sized data sets, hash join excels at performing the largest joins. Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.

    To stick to the phone book example: if we additionally needed to display all restaurants in the same zip codes as the "matching Josephs" by referencing a book that holds all of NYC's restaurants and their zip codes, we would need to read the whole phone book, and for each entry we'd need to read the entire restaurant guide...

    Define an index strategy:

    Step 1: Verify that each table has a clustered index that is narrow and highly selective (e.g. ProductID, SaleID and CustomerID).

    Step 2: Analyze the queries in the three sprocs in question. For each query, get the tables involved, including the columns of each table in the FROM and WHERE clauses (including the WHERE conditions themselves) as well as, separately, all selected columns. (Instead of doing all that manually, use some system DMVs such as the sys.dm_db_missing_index_* group to help identify index candidates.) Compare those results and try to merge them into a few narrow indexes that will cover as many queries as possible. Sometimes it's better to have a few more indexes that are narrow, sometimes wider indexes but fewer of them (it depends on various factors, including update frequency). Try to use INCLUDE to add columns used in the SELECT part of the queries as nonkey columns rather than incorporating those columns into the index key itself.
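
    A hedged sketch of that DMV approach (the sys.dm_db_missing_index_* views are real; the ordering expression is just one common heuristic):

    -- List missing-index suggestions for the current database, roughly ordered by estimated benefit.
    SELECT  d.statement AS table_name,
            d.equality_columns,
            d.inequality_columns,
            d.included_columns,
            s.user_seeks,
            s.avg_user_impact
    FROM sys.dm_db_missing_index_details AS d
    JOIN sys.dm_db_missing_index_groups AS g
        ON g.index_handle = d.index_handle
    JOIN sys.dm_db_missing_index_group_stats AS s
        ON s.group_handle = g.index_group_handle
    WHERE d.database_id = DB_ID()
    ORDER BY s.user_seeks * s.avg_user_impact DESC;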

    Step 3: Re-run Profiler to verify the results. Repeat step 2, if needed.

    Replied on Oct 25 2010 8:27PM  . 
    lmu92
    0 · 0% · 0
  • Score
    8

    Question:

    SSQA Retail limited uses an inventory application for their day-to-day stock control operations. The actively used tables within the application database are Products (200,000 rows), Sales (5.5 million rows) and Customers (750,000 rows), with a one-to-one relationship between the Sales and Customers tables and one-to-many relationships between the Products, Sales and Customers tables. Recently the Marketing team has complained about performance issues with 2 active reports (daily-sales and stock), resulting in slow performance and occasional time-outs.

    A server-side trace using the TSQL tuning template shows that 3 stored procedures are used during those reports' execution, and SHOWPLAN event activity reveals that there are multiple scan operations such as index scan, hash-join, index seek and table scan on the above 3 tables.

    You as a DBA have the challenge of providing a descriptive answer about each of the above operations (hash-join/scan/seek) and feedback on which operation will be fastest (and why) and which will be slowest (and why), with valid reasons. You are at liberty to define an index strategy between the 3 tables.


    First I'd like to explain what kinds of joins exist in SQL Server and how they compare in performance.

    SQL Server supports three physical join operators: nested loops join, merge join, and hash join.

    Quoting Craig Freedman:

    Nested Loops Join

    In its simplest form, a nested loops join compares each row from one table (known as the outer table) to each row from the other table (known as the inner table) looking for rows that satisfy the join predicate.

    We can express the algorithm in pseudo-code as:

    for each row R1 in the outer table
        for each row R2 in the inner table
            if R1 joins with R2
                return (R1, R2)
    

    It’s the nesting of the for loops in this algorithm that gives nested loops join its name.

    The total number of rows compared and, thus, the cost of this algorithm is proportional to the size of the outer table multiplied by the size of the inner table. Since this cost grows quickly as the size of the input tables grow, in practice we try to minimize the cost by reducing the number of inner rows that we must consider for each outer row.

    Is NL join good or bad?

    Neither actually. There is no “best” join algorithm and no join algorithm is inherently good or bad. Each join algorithm performs well in the right circumstances and poorly in the wrong circumstances. Because the complexity of a nested loops join is proportional to the size of the outer input multiplied by the size of the inner input, a nested loops join generally performs best for relatively small input sets. The inner input need not be small, but, if it is large, it helps to include an index on a highly selective join key.

    There is a recent blog by Denis Gobo showing how using a JOIN hint can significantly improve performance: Use an OUTER LOOP JOIN to improve performance.

    Merge Join

    Unlike the nested loops join which supports any join predicate, the merge join requires at least one equijoin predicate. Moreover, the inputs to the merge join must be sorted on the join keys. For example, if we have a join predicate “T1.a = T2.b,” table T1 must be sorted on T1.a and table T2 must be sorted on T2.b.

    The merge join works by simultaneously reading and comparing the two sorted inputs one row at a time. At each step, we compare the next row from each input. If the rows are equal, we output a joined row and continue. If the rows are not equal, we discard the lesser of the two inputs and continue. Since the inputs are sorted, we know that we are discarding a row that is less than any of the remaining rows in either input and, thus, can never join.

    We can express the algorithm in pseudo-code as:

    get first row R1 from input 1
    get first row R2 from input 2
    while not at the end of either input
        begin
            if R1 joins with R2
                begin
                    return (R1, R2)
                    get next row R2 from input 2
                end
            else if R1 < R2
                get next row R1 from input 1
            else
                get next row R2 from input 2
        end
    

    Unlike the nested loops join where the total cost may be proportional to the product of the number of rows in the input tables, with a merge join each table is read at most once and the total cost is proportional to the sum of the number of rows in the inputs. Thus, merge join is often a better choice for larger inputs.

    Hash Join

    When it comes to physical join operators, hash join does the heavy lifting. While nested loops join works well with relatively small data sets and merge join helps with moderately sized data sets, hash join excels at performing the largest joins. Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.

    Now, looking at the comparison between these 3 joins in Summary of Join Properties,

    we may conclude that although a hash join is good for large inputs, it doesn't work well for a high number of concurrent users. Since we can expect this report to be used by many users simultaneously, we may experiment with trying to force a merge join instead of the hash join for better performance.
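
    A small sketch of that experiment, assuming the Sales/Customers join from the question (SalesID and CustomerName are assumed column names):

    -- Ask the optimizer to use a merge join instead of a hash join so the two plans
    -- can be compared under concurrent load; column names are assumed.
    SELECT s.SalesID, s.SalesDate, c.CustomerName
    FROM dbo.Sales AS s
    JOIN dbo.Customers AS c ON c.CustomerID = s.CustomerID
    OPTION (MERGE JOIN);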

    With regard to index scans, index seeks and table scans, it is generally best to get an index seek, as this results in fewer I/O operations. We can refer to Pinal Dave's blog SQL SERVER – Index Seek Vs. Index Scan (Table Scan).

    We can also review this helpful article SQL Server Indexes: The Basics by Kathi Kellenberger.

    For the Product table we need a clustered index on ProductID. For the Customers table we need a clustered index on CustomerID. For the Sales table we need a clustered index on SalesID and non-clustered indexes on ProductId and CustomerID (to be used in JOIN expressions). We also need a covering index on SalesDate that includes ProductID and CustomerID fields. I hope that these indexes will help with the Daily Summary report.
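
    A minimal sketch of those indexes in T-SQL (SalesID and the index names are assumptions where not stated above):

    -- The last index is the covering index on SalesDate with ProductID and CustomerID included.
    CREATE CLUSTERED INDEX CIX_Products_ProductID ON dbo.Products (ProductID);
    CREATE CLUSTERED INDEX CIX_Customers_CustomerID ON dbo.Customers (CustomerID);
    CREATE CLUSTERED INDEX CIX_Sales_SalesID ON dbo.Sales (SalesID);
    CREATE NONCLUSTERED INDEX IX_Sales_ProductID ON dbo.Sales (ProductID);
    CREATE NONCLUSTERED INDEX IX_Sales_CustomerID ON dbo.Sales (CustomerID);
    CREATE NONCLUSTERED INDEX IX_Sales_SalesDate ON dbo.Sales (SalesDate)
        INCLUDE (ProductID, CustomerID);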

    An inventory stock report is usually a complex report (see, for example, the description of such a report here). Different algorithms may be involved in calculating the cost of goods: FIFO (first in, first out), WAC (weighted average cost) and Last Cost. To produce these reports it is not unusual to use a table-valued user-defined function. Tuning the performance of this report can be a really daunting task.

    Replied on Oct 25 2010 11:59PM  . 
    Naomi
    31 · 6% · 1776
  • Score
    8

    SSQA Retail limited uses an inventory application for their day-to-day stock control operations. The actively used tables within the application database are Products (200,000 rows), Sales (5.5 million rows) and Customers (750,000 rows), with a one-to-one relationship between the Sales and Customers tables and one-to-many relationships between the Products, Sales and Customers tables. Recently the Marketing team has complained about performance issues with 2 active reports (daily-sales and stock), resulting in slow performance and occasional time-outs. A server-side trace using the TSQL tuning template shows that 3 stored procedures are used during those reports' execution, and SHOWPLAN event activity reveals that there are multiple scan operations such as index scan, hash-join, index seek and table scan on the above 3 tables. You as a DBA have the challenge of providing a descriptive answer about each of the above operations (hash-join/scan/seek) and feedback on which operation will be fastest (and why) and which will be slowest (and why), with valid reasons. You are at liberty to define an index strategy between the 3 tables.


    Answer:

    Introduction:

    In order to answer these questions, it is very important for any DBA to understand indexes, execution plans, and joins in SQL Server. It is also critical to understand the SQL Server architecture, how the SQL Server optimizer works, and its strategy for selecting the best execution plan for fetching records from the database engine.

    What is an Index?

    All relational databases use indexes to retrieve data quickly when a query is fired against a table or tables. Indexes generally work behind the scenes to support the relational database engine and keep the application running optimally. Creating proper indexes on database tables can drastically improve the performance of an application.

    Let us look at the different types of indexes in SQL Server: clustered, non-clustered, covering, composite, and unique indexes.

    Clustered Index:

    A clustered index stores the actual data at the leaf level of the index. Its main characteristic is that the rows are stored in the sort order of the indexed values, either ascending or descending. There can be only one clustered index on a table or view.

    Non-Clustered Index:

    A non-clustered index does not sort the table data the way a clustered index does. We can create more than one non-clustered index on a table or view.

    Covering Index:

    A non-clustered index that contains all the columns used in a SQL query is called a covering index. Using a covering index on a table resolves many slow-running query performance problems.

    Composite Index:

    A composite index is an index that contains more than one column in its definition. We can have up to 16 columns in an index key, and the key should not exceed the 900-byte limit.

    Unique Index:

    An index that ensures the uniqueness of each value in the indexed column. References: http://www.simple-talk.com/sql/learn-sql-server/

    Execution Plan:

    An execution plan is a graphical representation of how a query is executed. It is one of the tools a DBA can use to gather metrics on poorly performing queries. Using this plan, we can obtain a baseline of what is happening within the query being executed against the database and find the SQL segment that is causing the problem. The query optimizer does an excellent job of selecting the plan that takes the least time to fetch the records. Using the graphical execution plan, we can see whether there are index seeks, index scans, table scans, hash joins, or RID lookups. In general, scans on large tables are a warning sign, so we need to look into them and see whether we can add an index and turn them into index seeks.

    1. Index Scan:

    An index scan reads all the rows of an index. Generally, an index scan occurs when the index definition cannot narrow down closely enough to the rows that satisfy the search conditions, so the SQL Server engine has to scan many pages to cover the range of rows that satisfy them. An index scan is efficient only if the table is relatively small.

    2. Hash-Join:

    There are three kinds of join operations in SQL Server: merge joins, nested-loop joins, and hash joins. Unlike the other join operations, hash joins work well on large data sets. Hash joins are used when there are not enough indexes on the columns used in the join conditions, which is usually bad for data-retrieval performance. A hash join is most efficient when one of the tables is significantly smaller than the other; when two tables are joined and one of them has no usable index, a hash join can still give good performance. Hash joins are faster on large amounts of data.

    3. Index Seek:

    When we submit a query to the database engine, the query optimizer creates an execution plan and tries to use an index seek. An index seek means that the query optimizer was able to find a useful index to locate the appropriate result set. If the optimizer is not able to use an index seek, it is clear that either there is no index or no useful index available on the tables. When there are no indexes or no useful indexes, the SQL Server engine has to scan all the records of the table from top to bottom to satisfy the query conditions.

    4. Table Scan

    When we submit a query to SQL Server, it gets parsed and then the optimizer analyzes the objects involved in the query. The optimizer tries to find which indexes are available, whether they are useful to the query, and whether the criteria/conditions in the query can use an index efficiently. If no index can be used efficiently to fetch the records, the query processor has to process all the records of the table, one at a time, to build the complete result set, and this is called a table scan. Basically, the query processor has to scan the entire table row by row to get the result, which is bad for performance.

    5. Comparison:

    When we compare all these operations, each has its own advantages and disadvantages. Hash joins perform better on joins between tables with large numbers of records. Merge joins and nested-loop joins will be faster on smaller tables with small amounts of data. Scans are costly because the SQL Server engine has to look at records row by row, from the top to the bottom of the data pages, to meet the query conditions. Table seeks/index seeks are almost always faster than index scans/table scans when fetching selective result sets from database tables.

    6. Index Strategy:

    As a rule, all tables should have a clustered index. It is recommended to create non-clustered indexes on columns that reference other tables (foreign key columns). Also create covering indexes on columns used in search criteria or other conditions.

    • Products Table:

    Create a clustered index on ProductID

    • Sales Table:

    Create a covering or composite index on SalesDate, ProductID, and CustomerID.

    • Customers Table:

    Create a clustered index on CustomerID.

    Replied on Oct 26 2010 8:45PM  . 
    Abi Chapagai
    70 · 3% · 808
  • Score
    7

    What's faster depends on how selective the report is.

    For high-selectivity reports, a nested loop join with an index seek on its right side is fastest. Compared with the cost of a scan, the inefficiency of random-access I/O is justified if the number of iterations (the number of rows on the left side of the join) is low. Tables serviced by selective queries are also more likely to be in cache, further mitigating the inefficiency of random-access I/O.

    Low-selectivity reports that return (or feed into an aggregate) 100,000+ rows are better off with a hash join with scans, or WHERE-predicate-only seeks, on its right side. This is because high-volume reports incur more physical I/O, and random-access physical I/O is very slow: scans are largely sequential (fragmentation can cause random-access patterns), while seeks are random-access.

    If there is no SalesProducts table, then this database is in serious need of help, unless the business rules state that there is only one product involved in every sale. I assume that this database is in such dire need of help as there are plenty of badly designed databases out in the world.

    For indexing strategy, I prefer every table to be clustered. I try to favor the right side of joins for the cluster key, to guarantee coverage (no lookups, which are always random-access I/O). Nonclustered indexes optimized for WHERE predicates can take advantage of cluster-key inheritance. All B-tree keys must be unique, so non-unique nonclustered indexes must append the clustered index's seek keys to their own seek key to fulfill that requirement. That enables the nonclustered indexes to be useful for both join and WHERE predicates at the same time. INCLUDE needs to be used to ensure coverage, as long as it isn't too wide, but at least cover all WHERE/JOIN predicates. Inequality and non-SARGable WHERE predicates are the top priority for include-only columns. Coverage is important. Bookmark lookups are entirely random-access I/O, a bane to all reporting performance (whether it is looking up too many rows, or switching to a scan because the optimizer figured out that the lookups would be too expensive). If the table is hardly ever used on the right side of a join, then I cluster by the predominant WHERE predicates used.

    It is difficult to give an index strategy for this question because the join order details are not given, but from experience I would lean toward clustering on ProductID in Products (PK) and Sales (FK), and on CustomerID in Customers (PK). Provide an additional nonclustered index on CustomerID on the Sales table (FK). Add nonclustered indexes for any WHERE predicates: the column(s) depicting stock level for Products, and the column depicting sales date for Sales (this is likely an inequality predicate). Make very liberal use of INCLUDE columns, but if full coverage creates excessive width, then focus only on covering all WHERE and JOIN predicates (pointing mostly to inequality and non-SARGable predicates such as LIKE or strings of ORs).

    ALTER TABLE dbo.Products 
        ADD CONSTRAINT PK_C_IX__dbo_Products__ProductID 
        PRIMARY KEY CLUSTERED (PK_ProductID)
    
    CREATE NONCLUSTERED INDEX IX__dbo_Products__Stock__{Columns Needed To Cover Query} 
        ON dbo.Products (Stock) INCLUDE ({columns needed to cover query})
    
    ALTER TABLE dbo.Sales 
        ADD CONSTRAINT PK_IX__dbo_Sales__SalesID 
        PRIMARY KEY NONCLUSTERED (PK_SalesID)
    
    CREATE UNIQUE CLUSTERED INDEX U_C_IX__dbo_Sales__ProductID_SalesID 
        ON dbo.Sales (FK_ProductID, PK_SalesID)
    
    CREATE NONCLUSTERED INDEX IX__dbo_Sales__SalesDate__CustomerID_ {Additional Columns needed to cover query} 
        ON dbo.Sales (SalesDate) INCLUDE (FK_CustomerID, {Additional Columns needed to cover query})
    
    CREATE NONCLUSTERED INDEX IX__dbo_Sales__CustomerID 
        ON dbo.Sales (FK_CustomerID)
    
    ALTER TABLE dbo.Customers 
        ADD CONSTRAINT PK_C_IX__dbo_Customers__CustomerID 
        PRIMARY KEY CLUSTERED (PK_CustomerID)
    
    Replied on Oct 28 2010 3:30PM  . 
    Jesse Roberge
    81 · 2% · 707
  • Score
    7

    Here are my quick answers:

    1) Hash Join:

    The hash join has two inputs: the build input and probe input. The query optimizer assigns these roles so that the smaller of the two inputs is the build input.

    Hash joins are used for many types of set-matching operations: inner join; left, right, and full outer join; left and right semi-join; intersection; union; and difference. Moreover, a variant of the hash join can do duplicate removal and grouping, such as SUM(salary) GROUP BY department. These modifications use only one input for both the build and probe roles.

    http://technet.microsoft.com/en-us/library/ms189313.aspx

    2) Index Seek vs Index Scan:

    Index Scan retrieves all the rows from the table. Index Seek retrieves selective rows from the table.

    Index Scan:

    Since a scan touches every row in the table whether or not it qualifies, the cost is proportional to the total number of rows in the table. Thus, a scan is an efficient strategy if the table is small or if most of the rows qualify for the predicate.

    Index Seek:

    Since a seek only touches rows that qualify and pages that contain these qualifying rows, the cost is proportional to the number of qualifying rows and pages rather than to the total number of rows in the table.

    An index scan is nothing but scanning the data pages from the first page to the last. If there is an index on a table but the query touches a large amount of data, meaning it retrieves more than 50 to 90 percent of the rows, the optimizer will simply scan all the data pages to retrieve the data rows. If there is no index, you might see a table scan instead in the execution plan.

    Index seeks are generally preferred for highly selective queries, meaning the query requests only a small number of rows, typically under 10 percent (some documents say 15 percent) of the rows of the table.

    In general the query optimizer tries to use an index seek, which means it has found a useful index to retrieve the record set. If it is not able to do so, either because there is no index or no useful index on the table, then SQL Server has to scan all the records to find those that satisfy the query condition.

    Firstly, let us revisit indexes briefly. An index in a SQL Server database is analogous to the index at the start of a book. That is, its function is to allow you to quickly find the data you are searching for inside the book; in the case of a database, the “book” is a table.

    An index scan means that SQL Server reads all rows in a table, and then returns only those rows that satisfy the search criteria. When an index scan is performed, all the rows in the leaf level of the index are scanned. This essentially means that all of the rows of the index are examined instead of the table directly. This is sometimes contrasted to a table scan, in which all the table data is read directly. However, there is usually little difference between an index scan and a table scan.

    You may wonder why the Query Optimizer may choose to do an index or table scan. Surely it is much faster to first look up data using an index than to go through all the rows in a table? In fact, for small tables data retrieval via an index or table scan is faster than using the index itself for selection. This is because the added overhead of first reading the index, then reading the pages containing the rows returned by the index, does not offer any performance improvement for a table with only a few rows.

    Other reasons to use an index scan would be when an index is not selective enough, and when a query will return a large percentage (greater than 50%) of rows from the table. In such cases the additional overhead of first using the index may result in a small degradation of performance.

    An index seek, on the other hand, means that the Query Optimizer relies entirely on the index leaf data to locate rows satisfying the query condition. An index seek will be most beneficial in cases where a small percentage (less than 10 or 15%) of rows will be returned. An index seek will only affect the rows that satisfy a query condition and the pages that contain these qualifying rows; this is highly beneficial, in performance terms, when a table has a very large number of rows.

    It is also worth noting that it is usually not worthwhile to create indexes on low-cardinality columns as they would rarely be used by the Query Optimizer. A low-cardinality column is one that contains a very small range of distinct values, for example a ‘Gender’ column would have only two distinct values- Male or Female. An example of a high-cardinality column is of course the primary key column, in which each value is distinct.

    In summary, the Query Optimizer generally tries to perform an index seek. If this is not possible or beneficial (for example when the total number of rows is very small) then an index scan is used instead.

    http://blog.sqlauthority.com/2007/03/30/sql-server-index-seek-vs-index-scan-table-scan/

    http://blog.sqlauthority.com/2009/08/24/sql-server-index-seek-vs-index-scan-diffefence-and-usage-a-simple-note/
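
    One way to check which of these operators the optimizer actually chose is to look at the plan text; a minimal sketch (the sample query reuses the Orders example from earlier in this thread):

    -- Show the estimated plan as text instead of executing the query, so index seeks,
    -- index scans and hash joins can be spotted directly.
    SET SHOWPLAN_TEXT ON;
    GO
    SELECT OrderDate FROM Orders WHERE OrderKey = 2;
    GO
    SET SHOWPLAN_TEXT OFF;
    GO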

    For large tables, a hash join will be faster. For smaller tables, nested loop joins / merge joins will be faster.

    For large tables, an index seek will be faster. For smaller tables, an index scan will be faster.

    Replied on Nov 2 2010 6:15AM  . 
    Nupur Dave
    173 · 1% · 284
  • Score
    9

    You as a DBA have a challenge to provide a descriptive answer about each of the above operation (hash-join/scan/seek) and feedback which operation will be fastest (why) and will be slowest (why) with valid reasons. You are at liberty to define an index strategy between 3 tables.

    Products (200,000)

    Sales (5.5m)

    Customers (750,000)

    Seeing query plans on the above tables with table or index scans, seeks and hash joins is in no way an indicator of bad performance or bad query plans. The only way to determine which queries are performing badly is to look at them individually and assess the plan based on the intentions of the query.

    A scan, whether on a table or an index, hits every row regardless of whether the row qualifies for selection. A seek is targeted and only hits a row or range of rows that satisfy the selection criteria. Therefore if you have highly selective criteria, targeting a relatively small proportion of the table, then an index seek is likely to be what you'd aim for in the query plan. By contrast, if you wish to return a large proportion of the table then a table scan is likely to be quicker. There is no "magic threshold" for this, as it depends on many factors, such as the index and B-tree size, whether the index is a covering index, the physical location of the pages on disk (i.e. the amount of random I/O needed) etc.

    Joins also influence the outcome. There are 3 types of joins: nested loops, merge and hash. Nested loop joins run through every record on the inner dataset, and look up the matching value(s) on the outer dataset. They are very efficient where the inner dataset is small, and the outer dataset can use index seeks for the lookups. Merge joins require both inputs to be sorted, allowing the join key(s) to be scanned in order to identify the matching rows on either side. Merge joins are probably the fastest type of join if the inputs are already sorted; however, if the sorts must take place first, that can add an overhead to the join that makes it a bad option. Hash joins are generally considered the fastest join for large datasets, due to their efficiency in joining them. The inner dataset is scanned and a hash table built using the join key(s). The outer dataset is then scanned and each value looked up in the hash table to find matches. With two large unsorted datasets, the hash join will outperform the other two comfortably in most cases.

    So, we need to assess the current scenario - a daily sales report, and a stock report. The daily sales report is likely to require a small subset of the Sales table, driven by the sales date. It is also likely to display the Customer and the Product sold, meaning we need to use foreign keys to those tables (at this point it is worth mentioning the assumption that only one product exists for a sale, else we need a bridging table). Immediately it is clear that our driving criteria is highly selective - one day's sales - so we need an index seek on the Sales table and therefore a clustered index on the sales date (descending to get the latest first). An alternative, if a clustered index is not feasible, is to create a nonclustered index and "include" the fields needed for the report, as well as the foreign keys to the Customer and Product tables. We also need to look for highly selective values on both the Customer and Product tables via the foreign keys, therefore a unique clustered index on each of the primary key fields in the Product and Customer tables would allow the index seeks needed - however the query optimizer may opt for index scans on the Product and Customer tables with merge joins due to the clustered indexes being sorted, so experiments with join hints can be used to test the optimal strategy. Foreign Key constraints can also be used, however they are not always the best option, particularly in large data warehouses, so I'll skip over them for this question.
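
    A minimal sketch of the two indexing options described above, assuming the column names from the question (Quantity and Amount are hypothetical report columns):

    -- Option 1: clustered index on the sales date, descending to get the latest first (assumed names).
    CREATE CLUSTERED INDEX CIX_Sales_SalesDate ON dbo.Sales (SalesDate DESC);

    -- Option 2 (if a clustered index on the date is not feasible): a nonclustered index that
    -- "includes" the report fields and the foreign keys; Quantity and Amount are hypothetical.
    CREATE NONCLUSTERED INDEX IX_Sales_SalesDate_Covering
        ON dbo.Sales (SalesDate DESC)
        INCLUDE (CustomerID, ProductID, Quantity, Amount);

    -- Unique clustered indexes on the primary keys of the lookup tables.
    CREATE UNIQUE CLUSTERED INDEX UCI_Products_ProductID ON dbo.Products (ProductID);
    CREATE UNIQUE CLUSTERED INDEX UCI_Customers_CustomerID ON dbo.Customers (CustomerID);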

    The stock report is very different, and is likely to be sourced entirely from the Product table. It really depends on the requirements for the report as to how this should be sourced, e.g. is it all products in stock, or is it the stock levels of all products, or is it an aggregation by product category etc? The likelihood is that we will need to hit a large proportion of the Product table meaning an index scan is probably going to be the best option. If however there is only a small subset of "active" products, and the report requires such, then an index would be needed on any ActiveFlag or StockQuantity fields to allow an index seek for better performance.

    In truth it is virtually impossible to create an ideal index strategy without knowing the requirements, or the system as a whole. Indexes create overheads, so it depends on how the system is used and the level of reads/writes etc, as well as how many other processes use these tables. Only then is it possible to decide on the optimal approach.

    Replied on Nov 9 2010 7:55AM  . 
    Mike Lewis (@SQLTuna)
    42 · 4% · 1336
  • Score
    8

    Hash Join

    • It is one of the join techniques used by SQL Server to join 2 tables internally.
    • It has build and probe tables. The build table has fewer records than the probe table, since the build phase reads the rows of the smaller table into memory.
    • Using the rows of the build table, the hash join hashes the columns in the join expression and builds a hash table. After that the probe table gets processed: for each row, the key columns of the probe table are hashed and the matching entries are found directly.
    • It has the additional cost of building the hash table. The smaller table will be in memory and the larger table on disk. If the smaller table does not fit in memory it spills to disk (the DBSPACETEMP configuration parameter is used to store the hashed table during the probe phase). Instead of joining the tables directly on the columns, a separate hash table is created and used for better performance.
    • It provides O(1) lookup time complexity, hence it is faster for huge tables with large amounts of data.
    • It is very helpful for maximizing throughput in data warehousing applications.
    • A hash join gives the best performance when two or more tables are joined and at least one of them has no index.

    Seek

    • This is used to read data from SQL Server tables and indexes.
    • In an index/table seek the SQL engine uses a binary tree search while records are being processed.
    • The binary search is based on the following rules.
    • Rule 1: If the current node is greater than the node being searched for, go to its left child, i.e. the previous page / previous record.
    • Rule 2: If the current node is less than the search node, go to its right child, i.e. the next page / next record.
    • Rule 3: When it finds the page, mark the node as a match and go to the next record, until the next higher value is found.
    • Here the cost is proportional to the number of qualifying rows and pages rather than to the total number of rows in the table.

    Scan

    • This is used to read data from SQL Server tables and indexes.
    • In an index/table scan the engine processes the rows one by one, following a sequential search; it touches each and every row. The cost is directly proportional to the total number of rows in the table.
    • An index scan retrieves all the rows from the table; an index seek retrieves only selective rows.
    • A scan is an efficient strategy if the table is small.
    • For huge tables, a hash join will be faster, whereas for smaller tables nested loop joins or merge joins will be faster.

    Analyze the 3 stored procedures that are causing problems, identify the tables involved, and check whether the indexes are properly created.

    Run the queries used in the stored procedures under different scenarios to find out the best approach.

    Indexing strategy

    1. The Products table will have a clustered index on ProductID
    2. The Sales table will have non-clustered indexes on SalesDate, ProductID, CustomerID and a clustered index on InvoiceNo
    3. The Customers table will have a clustered index on CustomerID
    4. An index seek can then be applied on SalesDate for the daily report, so it only processes the pages for the current day
    5. Create non-clustered indexes on all foreign keys and on columns used in search criteria
    Replied on Nov 24 2010 4:49AM  . 
    Vamshi
    134 · 1% · 376

