The monthly blog party that is known as T-SQL Tuesday has hit 100th episode (and I only missed 98 of them!). In true Olympic fashion, it returns to its roots for its centennial celebration. In other words, our host this month is Adam Machanic (b|t) himself. And even though 100 is a perfectly valid number to look back, he decided that looking ahead is more fun. He asks all bloggers to whip out the crystal ball and predict what our world will look like in another 100 months, when (hopefully) T-SQL Tuesday #200 will be celebrated.
I must admit that, looking at the news, looking at current trends in the world, it is easy to get very pessimistic about the future. But as a father of two fantastic children, I decided long ago that I can no longer afford to wallow in pessimistic visions of the future. So let’s all assume that we as the collective human race will not destroy our world in any of the dozens of ways we could do that. Let’s assume we will all be here in another 100 months, and be able to look back at these posts and laugh at how wrong we were.
So I will not look at politics, environment, social movements, or any other global trends. I will focus on trends in the nice shielded world of data professionals. And within that subset, I have decided to pick four key points, three because they are or have been trending in the last few years and one because that’s where my heart is.
Cloud computing definitely has taken off. I will admit that I have been on the wrong side of this for way too long. I fully expected that most companies would prefer not to have their valuable data stored on someone else’s computer. But the benefits of cloud computing are too real, so cloud computing has taken off big time. Having your application and/or your data in the cloud means that your employees can connect to it from wherever they are, using whatever device they have, without you having to provide the infrastructure. This is great for enabling sales people, drivers, or support staff that gets sent to a client site. It also makes it a lot easier to offer remote working to your employees.
Because of those benefits, I now expect this trend to continue even more than it already has. In all the more technologically advanced areas of the world, I expect that in a few years we will all expect wireless internet signal to be available everywhere. This will be used by personal devices (we’d call them phones now but their name may change) that everyone has as a replacement for many different things we now need to keep: phone, car keys, house keys, agenda, wallet, tickets, remote control, etc. The device itself will have limited storage and computing; it will rely on that fast internet connection that is assumed to always be available. Internet outages will become REALLY interesting!
More and more companies will move towards serverless. The computers used in the offices will connect to cloud computers where all application logic is done and all data is stored.
There will be some exceptions to this. For companies that have really large volumes of data and really complex programs running, the cost of buying all that “as a service” will be prohibitive and it will be cheaper to have on premises hardware and the accompanying support staff. There may also be companies where the sensitivity of the data continues to stand in the way of moving to cloud. And some companies will keep critical applications on their own hardware because they do not want to run the risk of downtime when an internet provided has issues.
Overall, however, I predict that in 100 months most applications that are now hosted and maintained on premises will have moved to the cloud or someone is working on moving it there.
Artificial intelligence and machine learning
There has been a lot of progress in the AI area. One of the areas where I was recently really impressed is in voice recognition. I was in the car, navigation app open on my phone, which was on the dashboard. I clicked the speak button and muttered the name of a company and a city. In between the background noise of the engine already running and the radio playing, it recognized what I had said, found the address of the company’s offices, and provided me with driving directions (avoiding a traffic jam along the way).
However, I also noticed its limitations. When I got a call that the meeting had been cancelled, I said “Hey Google! Bestemming wissen” (which is Dutch for “erase destination”). It computed a new route for a three hour drive. I later found that this route was to the German city Wissen (written like the Dutch word for erase, but pronounced differently). Clearly, the system has not been programmed to recognize normal Dutch instructions to erase a destination and responded with the best match it could find.
These problems will disappear over time. Siri and Cortana already beyond speech to text transformation and are starting to get better at interpreting the meaning of spoken instructions. This trend too will continue. Remember those personal unified devices I mention in the previous paragraph? They will not have an on-screen keyboard anymore; they will be purely voice driven. And they will become smarter and smarter. Application developers may need to start looking for other employment as software becomes smart enough to understand a spoken request and find on its own the best way to fulfill it. There will still be demand for people to code the more complex applications but standard reports and dashboards will become entirely machine-generated.
But AI and ML is much more than just voice recognition and voice-based interaction. There is also a huge trend nowadays towards using machine learning algorithms to analyze huge collections of data to train a predictive model that is then applied to new cases to predict the most likely outcome. And in this area, I expect problems. Not because of the algorithms itself, but because of the people. When a computer pops up a conclusion, people tend to trust the algorithms and blindly follow the recommendations. So sales people will no longer call all leads, but only the 30% marked as most likely to respond. And when the prediction is wrong and a very viable prospect is never called, no one will ever know so this problem will not heal itself. Until things really start to go wrong. Until an AI algorithm at an airport points out the people in line that need to be inspected and fails to point out someone who then turns out to be a bad person with bad plans. Or until something else terrible happens and everybody points at anonymous algorithms in unknown computers that base their conclusion on recognizing patterns in thousands of known inputs.
The picture to the right (click to enlarge) is a screenshot that shows the result of feeding the T-SQL Tuesday logo to a publicly available visual recognition website. Yes, it recognizes Tuesday with 99% certainty. But all other weekdays are also mentioned with high 90s certainty. This is still a beta and scientists are working day over day to improve it. One day we will have software that uses the results as visual recognition as input to a prediction model trained by Machine Learning. And we will all blindly follow the recommendations – until the stinky stuff starts hitting the fan because either the recognition was incorrect, or the model has a bad prediction, or both.
My prediction is that, 100 months from now, many data professionals will be licking their wounds, managers will be doing damage control, and the world will start to exhibit a healthy distrust from machine-generated predictions for which we do not see exactly what reasoning has caused the prediction. Machine learning and predictive models will remain in use but only in areas where they are “safe”. In other areas their results will at best be treated as gentle recommendations.
Data protection and privacy
A subject that has always been at interest to data professionals but is now also getting a lot of attention at management level, thanks to the GDPR legislation, is protecting data from unauthorized access and protecting the privacy of people you collect data about. At this time a lot of companies are struggling to get GDPR compliant. Some will finish in time, some will finish a bit late, and a few are not even trying. Once the EU sets a few examples by dishing out harsh penalties for non-compliance, the remaining companies will scurry to either get compliant or rigorously shut out all EU customers so they don’t have to.
But other countries will follow the EU example of setting stricter guidelines (not all; some countries continue to believe in self-regulation). Customers will start to demand more privacy guarantees even in areas where less strict laws apply. And press coverage of large scale data breaches and privacy violations will continue to fuel that fire. In a 100 months from now, every serious company will have better data protection plans then they have now, and be more privacy-friendly in their terms and conditions.
But that does not mean that there will be no more scandals. Those, too, will continue. Just like neither laws nor change in cultural perception will ever fully root out drunk driving, there will always be incidents where data is leaked or privacy is violated. Sometimes because people tend to cut corners (usually for all the right reasons, but it can still backfire). Sometimes because companies try to save money (usually for the wrong reasons, because short-term gain is easier to explain to shareholders or directors than long-term safety).
I also expect that by that time, the GDPR restrictions that are now considered to be very strict will have been surpassed. At that time other countries will be working on legislation (or maybe already have it completed) that is stricter then GDPR. One area where I mainly expect additional rules is in the use of data collected about in individual as input to train predictive models. Because of the incidents described in the previous paragraphs, large percentages of the public will have become aware of machine learning. Many of them will not accept that “their” data is used without their consent to train these models, and politicians will start to push for regulation – either because they agree with those people, or because they want their votes for the next election.
Performance tuning and execution plans
So here is my precited future. Computing is done in the cloud, through personal devices that everyone is carrying and that are always connected. They use voice recognition, and most requests are interpreted by smart algorithms the way that currently the SQL Server optimizer figures out how to perform a task based on a query. Data is stored in that same cloud but subject to strict rules as to who can use what data for what purpose. Some companies will still have on-premises computing but it’s a minority. What does this mean for the future of someone like me: someone who specializes in query tuning, someone who loves to pour over an execution plan, find the reason of a slow response, and fixes it. Will I still have work?
I think I will, but my work will be different from what it is now. And I think a lot of people currently doing other development work will be moving to “my” new line of work.
The parallel I drew between computers responding to voice commands and SQL Server compiling an execution plan for a query is very deliberate. There will be large similarities between the two processes. When you say “Hey Cortana! My parents will be over tonight so come up with a good meal that they will like and have the shop deliver the ingredients to my home by 4PM”, the computer will have to parse the request to make sure it understands what you say, then come up with a plan to do this efficiently (identify who your parents are, fetch data accessible to you on food preferences and allergies of these people, select recipes that qualify, identify store near you that can deliver on time, place order). That plan may sometimes be flawed, either causing incorrect results or just processing slower than it should, and in those cases a root case needs to be found and a fix be deployed. That will be my job, and I will still be looking at execution plans – using operators that are slightly different from what we now see in SQL Server execution plans, but probably not even that different. Identifying your parents is probably still an Index Seek in a public data collection!
So the next question is: who will we be working for. Well, for the public at large, the service of connecting to the internet and asking it questions will be based on a monthly fee. It may be a flat fee or it may be usage based, but it will not be based on the compute power involved with your requests. Just the number of requests. So it will be the provider that benefits from serving the highest number of requests using the least amount of resources. So for requests as described above, it will be the service providers who employ optimization specialists.
Companies will have different contracts. They will pay for the amount of data retrieved from various sources, for the amount of data stored, and for the compute resources used to satisfy the requests that the CEO blurts into his reporting tool. These companies will need to employ their own optimization specialists, to help the CEO phrase his request such that it will not break the bank. And finally, the few companies that still have their own computers will obviously also buy licenses to use these same algorithms and will therefor also have a need for optimization specialists.
The road ahead of us will be interesting. I really do not actually know how much of the changes I predict will actually come true. But one thing is sure: there will be change. And the pace of change will continue to increase.
If you do not want to fall behind, better make sure you stay on top of the change, and on top of emerging new technologies.
This is the third post in the plansplaining series. Each of these blog posts focuses on a sample execution plan that exposes an uncommon and interesting pattern, and details exactly how that plan works.
In the first post, I covered each individual step of each operator in great detail, to make sure that everyone understands exactly how operators work in the pull-based execution plans. In this post (and all future installments), I will leave out the details that I now assume to be known to my readers. If you did not read part 1 already, I suggest you start there.
Today I will look at a pattern that can be seen when working with partitioned tables. I will not use a standard demo database but create my own set of sample tables to get maximum control. Here is the code to create two partitioned tables and load them with some data. Execute this code in your own “playground” database if you want to follow along.
-- Create three partitioned demo tables on two partition functions
-- Put in some sample data
This may take some time to run (about 2:45 minutes on my laptop). Since it’s a one-time deal, I did not bother to optimize this. Do make sure to run this with the Include Actual Execution Plan option disabled, or you will generate 60,000 execution plans (which will probably crash your SSMS instance).
Once the tables are created and the data is loaded, let’s run a simple query that joins two of the demo tables:
The execution plan for this query is very simple and holds no surprises. There is no WHERE clause and the tables are fairly large, so a Nested Loops join would be a bad idea. A Merge Join would be an option, but the equality part of the predicate is on column a only, which is not unique in both tables; so this would be a many to many Merge Join, with all the associated overhead in tempdb. A Hash Match join is a better choice in this case, and that is indeed what the optimizer picks:
Now let’s see what happens if we make one small change to the query. Instead of joining tables Part_1a and Part_2 (both partitioned but on different partition functions), we’ll now join Part_1a and Part_1b (both partitioned and using the same partition function). Here is the changed query:
With basically the exact same query you would probably expect to see the exact same execution plan. But that would not make for a good Plansplaining post, so fortunately we do actually get a different execution plan this time:
Comparing this execution plan to the one above, we see that the exact same pattern of a Hash Match join processing two Clustered Index Scans is used here. But in this case that pattern is on the inner side of a Nested Loops join, which means it executes once for each row in the outer input (which is a Constant Scan operator that we’ll examine in more detail shortly). The properties of the Hash Match operator show us exactly how often this section of the plan was executed:
Upon seeing this, you would be forgiven for thinking that the optimizer has gone absolutely bonkers. Why is the optimizer creating a plan that, by the looks of it, appears to do the same work four times? Granted, since the aggregation asks for a MAX only there will be no concerns over correctness of results in this case. But repeating the same task four times? Really? Why?
The good news is that there actually is a good reason for this. And that the execution plan is not actually doing the same work four times. However, this is not visible by looking at a picture of the graphical execution plan; we need to dig deep into the details to understand what’s going on.
The devil is in the details
Before diving into the apparently repeated section of the plan, let’s first examine the Nested Loops join operator and the Constant Scan on its outer input. The Constant Scan operator is by itself actually already a mystery here. Looking at the query, there is no obvious reason at all to have a Constant Scan in this plan, and yet here it is.
Whenever we see a non-obvious Constant Scan operator, out first stop should always be its properties, specifically the Output List and Values properties. These expose the data that the Constant Scan returns. Though these properties are exposed in the property popup, the formatting of the Values property makes it hard to parse. The full properties window has the same data in an easier to understand format:
The Output list shows that this Constant Scan will return rows with just a single column, called Expr1005. Note this column name, because we need to find which operators use this column in order to understand the role of this Constant Scan. The Values property shows that four rows will be returned (not surprising, since we already saw that the inner input of the Nested Loops operator executes four times, and that the value of Expr1005 will be 1 for the first row, 2 for the second, then 3, and finally 4. None of these values appear in the query so they make no sense yet. We really need to find where these values are used!
Looking at the Nested Loops operator, we see a few interesting things. We know that it receives four single-column rows from its outer input, and then executes its inner input for each of these rows. Let’s look at its properties, more specifically at the Output List and Outer References properties:
Let’s start at the bottom, with the Outer References property. We have already seen this property in previous plansplaining posts, but as a quick reminder: Outer References means that a value from the outer input is pushed into the inner input; the inner input then ensures that only matching rows are returned which is why there is no Predicate property on this join operator. In this case Expr1005, the column that is set 1, 2, 3, and 4 for each of the rows from the Constant Scan, is pushed into the inner input.
The Output List does not include Expr1005. Apparently, the Nested Loops operator in this case doesn’t actually join its two input sources; it merely uses the outer input to drive four executions of the inner input, and then returns data from the inner input only. The values returned from the Constant Scan are not returned by the Nested Loops operator, which also means that we now know that the Stream Aggregate doesn’t do anything different than in the first plan – it doesn’t receive Expr1005, so it cannot act upon it.
Clustered Index Scan
The next part of the execution plan to look at is the section with a Hash Match and two Clustered Index Scan operators. I am not discussing the Hash Match here, because there is nothing of interest on this operator. However, both of the Clustered Index Scan operators have a property that most people would not expect to see, ever, on a Clustered Index Scan (or, in fact, any Index Scan) operator: a Seek Predicates property! (I show a screenshot of the second Clustered Index Scan here, but they both have a very similar Seek Predicates property).
Normally, a Seek Predicates property is only found on (Clustered) Index Seek operators, and not on a (Clustered) Index Scan. An Index Scan is not designed to seek; it is designed to read a whole table. So how can this scan suddenly apply a Seek Predicates? And also, what exactly is the predicate here? A column called PtnId1002, that we never created in any of our tables, is compared to the Expr1005 column, the data coming from the Constant Scan and pushed into the Nested Loop’s inner input.
Luckily the optimizer loves to use mnemonic codes when introducing its own columns. The name PtnId1002 is a so-called “partition id”. Remember, all the tables used in this example are partitioned tables. And partitioned tables happen to be the only (as far as I know) context where you can actually see a Seek Predicates property on a scan operator. It is used to limit the scan to one or more selected partitions only. In this case a single partition. Which partition? Well, that is determined by Expr1005.
Remember, Expr1005 is set to 1 for the first row (and hence the first execution of this section of the execution plan), then to 2 for the second, and so on. So for the first execution, the Clustered Index Scan operators will read data from partition 1 only. This applies to both Clustered Index Scan operators. The Hash Match then combines this data, returning the joined results for partition 1, which Nested Loops then passes to Stream Aggregate. After that, the second row from Constant Scan is read, and the inner loop restarts, this time reading and joining data from the second partition of both tables. Once all four partitions are processed in this way, Stream Aggregate returns the maximum value it found and execution stops.
Moving back to the helicopter view, we now know that the section consisting of one Hash Match and two Clustered Index Scans only appears to be the same in the two plans. In reality, the first plan actually processed the entire tables in the Clustered Index Scan operators, causing a massive join between 20,000 rows from each input. The second plan used a Constant Scan to enumerate the partitions, a Nested Loops to repeat the section for each partition, and a surprising Seek Predicates property on each of the Clustered Index Scan operators to process only a single partition in each execution. So while the logic is indeed executed four times, each execution now only had to join 5,000 rows from each input. And though the scans execute four times, they do not scan the full table four times. Each execution scans only the requested partition. In the end, the amount of IO for the Clustered Index Scan operators in this plan is exactly the same as in the first plan.
This is an optimization pattern known as “join collocation”. The most obvious place where you can see the effect of this optimization is in the memory grant (visible as a property of the SELECT operator in each execution plan). The first query requests 6112 KB, needed for the Hash Match operator to store 20,000 rows in the in-memory hash table. Because the second query processes the same 20,000 rows in four independent chunks of 5,000 rows each, its memory grant is a mere 1536 KB. This memory will be reused for each of the four executions.
On my laptop, this is not very relevant. But imagine a busy system, with lots of users, plus a buffer pool, a plan cache, and other resources, all competing over the same precious memory. Now a 75% reduction in the memory footprint of a query suddenly becomes very valuable!
Why not both?
At this point you may be wondering why SQL Server doesn’t always apply this pattern? Why did we not get a similar “join collocation” plan for the first query? Or for every join ever done between two large tables?
The answer is simple. Join collocation needs a few conditions to be met before it can be used. The conditions are:
Both tables in the join need to be partitioned.
The join condition has to include equality of the partition columns in the two tables.
Both tables in the join need to use the same partitioning function. Or rather, the partitions need to align exactly (if you use two different partition functions with the exact same definition, you will still get join collocation).
The combination of these three requirements enables the join collocation pattern. Because the tables are partitioned, the (Clustered) Index Scan operators can use a Seek Predicates to return only a subset of the data without overhead. Because the partitions align and the join is on equality of the partition columns, we know that data that needs to be joined from both sources is processed in the same execution of the inner input.
In the case of the first query, the partitions did not align completely: they use the same boundary values, but one uses RANGE LEFT and the other RANGE RIGHT. Now it is possible that a row that is in partition 1 of the first table may need to be joined to a row in partition 2 of the second table, but these rows would not be processed during the same execution so the results can be incorrect.
Details are always important. Just because two execution plan fragments look the same in the graphical execution plan, does not mean they are the same. You always need to look at all the details that are available in the property sheet.
In this post, it appeared at first sight as if the same logic was being executed multiple times. But by looking at the properties we were able to ascertain that these four executions each worked on a quarter of the data. We were also able, by looking at the properties of a Constant Scan and following the generated column through the execution plan, to make sense of an operator that at first appeared to make no sense at all.
In plansplaining part four, planned for April, we will look at an execution plan that uses Bitmap operators to optimize a star join. However, I do not want this series to be about only my interesting plans. Reader input is greatly encouraged! If you have ever run into a pattern in an execution plan that appeared to make no sense until you dug into the gory details, or even into a pattern that still makes no sense to you, let me know and I will consider it for inclusion in a future post. (Note that I WILL ask you for a copy of the actual execution plan, saved as a .sqlplan file, or for a repro script).
When I am tuning a query, I usually start with the query text. I execute it to retrieve an execution plan, and then the fun starts. However, very often I first need to find the specific query (or queries) to tune. There are multiple ways to do this, depending on what monitoring tools and which version of SQL Server the customer uses. In some cases, I at first find only an execution plan. Now I do like to see execution plans when tuning, but I also need to have the query!
The old method
So here is an example of what your SSMS screen may look like when you open an execution plan:
As you can see, the top of the screen shows the query text. Or rather, the start of it. Even with large, high resolution monitors, and even with very small fonts, the single line allocated to this is typically too short to show the full query. In most cases, I see SELECT and the first 10 or 15 columns names, but the FROM is already invisible.
This is just a display issue. The actual execution plan contains the first 4000 characters of the query (which luckily in most cases equates to the entire query), it just doesn’t fit in the SSMS window. And until recently, I did not know a simple way to get to this string. So what I had to do was right-click somewhere in the plan area, select “Show Execution Plan XML”, then in the XML look for the property StatementText=“xxxx”. This then shows the entire query string, with line breaks represented as “
” and a few similar XML-friendly (but person-unfriendly) replacements.
This is usually enough to locate the original query in the source code of the application. If that fails, I can copy the entire content of the StatementText property, paste it in a new window, use search and replace to change all that weird XML stuff back to normal text, and then I finally have the query and I can start tuning.
Bottom line: It works, but it ain’t pretty.
Enter: Edit Query Text
However, there is good news. I recently discovered an incredibly useful new function in SSMS 17.4. I do not know when it was first introduced. Perhaps it really is completely new, or perhaps it has been around for a few years already and I just never noticed. Who knows.
Look at the little button to the right of the query text. This is the new function I discovered. If I hover my mouse over it, it reveals its name: “Edit Query Text”. But that’s not what excites me. What does make me happy is what it does. When I click this button, a new SSMS window opens, with the query text that is stored within the execution plan. And no weird
shenanigans either – just normal query text, ready to be executed, or to be modified and tuned.
Now obviously, the 4000 character limit does still apply. Only the first 4000 characters of a query are stored in the execution plan. SSMS can not present you data it doesn’t have. But for any query of normal length, this tiny button offers an amazingly simple and useful way to quickly get the text of the query that resulted in this execution plan.
Thank you, Microsoft!
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.