This video has captions you may enable in the viewer, plus a written transcript below.
Learn what we’ll cover in this course, then dive into the sample windowing function that we’ll be tuning.
Welcome to Indexing for Window Functions in SQL Server
Today we are going to cover: a quick overview of what window functions are, and then we’ll spend most of our time today demonstrating how to tune a specific window function. I have an example where we can tune it in all sorts of different ways: with rowstore indexes, with an indexed view, with columnstore indexes. And it gets — I think — really interesting!
We’re going to spend most of our time tuning that windowing function, then we’ll do a recap of the techniques we used and see how are they all turned out. Then we’ll finish up with some Q & A. Before we dig into this: some quick concepts.
When I say ‘window function’ what do I mean?
The terms for this are getting kinda interesting. In Books Online, they do use the term ‘windowing functions’. The term ‘windowing function’ isn’t just specific to SQL Server. There are other database engines where people use these– where they talk about windowing functions too.
What we’re talking about when we talk about windowing functions is: looking at a row in the context of a window of data. If I want to do something like row numbering — okay I want to number my rows, and maybe I want to do a windowed row number. Maybe I want number a set of rows in a group, right? And then restart my numbering for another group. Maybe I want to do a running total: what is the total of this in the context of another set of rows? Like the sum of revenue this quarter, and then the sum of revenue year- to-date.
Maybe I want to do something like lag and lead: I want to see okay what’s this row, and then what was this value for the previous row, or for the next? We can do a lot of these cool things with window functions in SQL Server.
Here’s the window function we’re going to be looking at today in our demo
We will be seeing a lot of this code and I want to talk about it a little bit in screenshots, because I can very easily literally highlight the code. This isn’t the whole query we’re looking at here.
There’s two common table expressions at the beginning. The first one is called RunningTotal. What we’re doing in that first CTE named RunningTotal is: I’m saying okay I want the FirstNameId , you know, given the first name — I want the state it’s in, the Gender for that FirstNameId, and the ReportYear. Then I want to sum up the name count — OVER — then we get into the guts of our window function. When I say ‘partition by’, what we’re doing is we’re breaking the data into groups. I want to group by the FirstNameId, the StateCode, and gender.
‘Partition’ is a little confusing here. We’re NOT talking about table partitioning! This is specific to the windowing function. That ‘over’ keyword right before partition is introducing: okay I’m summing up name count over groups of the FirstNameId, StateCode, and Gender, and I’m ordering by ReportYear. That is my TotalNamed column.
That first CTE builds my running total. After that we have a second CTE and I have a second windowing function in it: RunningTotalPlusLag is the name of this second CTE, and it’s pulling from RunningTotal. RunningTotal itself: you look at the ‘FROM’ in there, it’s pulling from a table named agg.FirstNameByYearState.
We’re accessing the table in the first CTE. In the second CTE, we’re actually accessing RunningTotal, because I want to say: ok, how many were, based on that group of FirstNameId, StateCode, and Gender, how many total had that name the year before? Right? So what I’m doing with my running total, is what I want to know is– when the total amount of FirstNames get that name based on their FirstNameId, StateCode and Gender. Because I don’t want to, you know, sum up all the names for all the Chris’s who are both male and female. I want to keep the baby names separate.
But I want to keep track of: Okay, what was their total this year and the previous year. Because our query is going to look at, when did they cross a given threshold? When were more than x number of babies; when did they get this name, based on FirstNameId, StateCode, and Gender, and ReportYear.
See how I’m ordering by ReportYear.
So I’m using my window functions in a fun way here, and what we’re looking at is breakthrough values: when did a name break through getting, you know, this x number of babies given this name, based on that windowing function.
Demo Time: Let’s dig in and see how we can tune this!
We’re asking a lot of this windowing function. Not am I doing a running total, right, but I want to look at the running total for this year AND the previous year altogether. Wo here is our code that we saw in the setup slides. I’m just going to highlight it. So, we’ve got RunningTotal, our CTE with that running total in it, we’ve RunningTotalPlusLag, that says okay what was the running total for the prior year. And here’s the guts of our query: this is not pretty TSQL, and that’s on purpose, because in the real-world we don’t always get to deal pretty TSQL!
What we’re saying in the guts of this is, all right we want the data from RunningTotalPlusLag, because it has TotalNamed, it has everything we want. It has the lag information as well as the running total. We want that data, we’re going to join to ref.FirstName where we can actually translate out and see: what are the names. We’ve got some predicates in here. We’re passing in a threshold parameters, so if the threshold parameter is null, then we’re going to say, okay, we’re going to set a threshold of one hundred.
So if you don’t pass in a threshold, we’ll just assume okay when did the running total get past the one hundred babies, with that name, for that gender, in that state, for that year. Or what year did they pass that value in. Or if threshold isn’t null, then we’ll take the threshold that’s passed in and use that one, right? We do allow a threshold to be null, and then we kinda handle it. And this is where we’re saying: okay, we’re looking at the running total and then saying, okay in the previous year they had less than that amount, and then they passed that amount in the current year.
So TotalNamed less than our threshold. Let’s go ahead and create our procedure there we now have the PopularNames procedure created. I have not indexed this yet. We are going to give this a first-run, and let me actually turn on my actual execution plans with this button here. We’re going to turn our actual plans and give this a first-run.
The table we’re pulling from is agg.FirstNameByYearState
It has a clustered primary key on it, it does not have any non clustered indexes, nothing tuned for this window function yet. So we’re going to get our baseline here, which as you can see is NOT that fast.
This isn’t a huge table!
But, we’re having too, for every version mit in the table do a running total based on the FirstNameId, Gender andStateCode. And, then looking at it by each year, ordering by year, and then figuring out: okay when did they pass getting five hundred names (as their running total)?
For that FirstName and StateCode. For example, here’s Tristan. In Arkansas, for males named tristan, it was 2015. In 2014, there were four hundred ninety two Tristans, and in 2015 there were five hundred five. This is, for 2015 in arkansas, this is the set of names, you know, that passed that threshold that year.