Parallelism (Distribute Streams)

Introduction

The Parallelism operator, also known as the Exchange operator, is used to exchange rows between parallel and serial zones of an execution plan, or between parallel zones with different requirements as to how the rows are distributed. The Parallelism operator has three different Logical Operations, each represented with its own icon in the graphical execution plan. For ease of understanding, each of the three Logical Operations is described on its own page here.

The Distribute Streams operation of the Parallelism operator reads rows from a serial zone (a single thread) and outputs them into a parallel zone (distributed across multiple threads). The two other logical operations are Gather Streams (which reads from a parallel zone and outputs to a serial zone) and Repartition Streams (which connects two parallel zones with different distributions).

For a good primer on the basics of parallelism, please read this article by Paul White.

Visual appearance in execution plans

Depending on the tool being used, a Parallelism operator with its Logical Operation set to Distribute Streams is displayed in a graphical execution plan as shown below:

[Icons for Parallelism (Distribute Streams) as displayed by SSMS and ADS (current versions), legacy SSMS, Plan Explorer, and Paste The Plan]

Algorithm

The Parallelism operator always runs multiple processes on multiple threads at the same time. When the Logical Operation is set to Distribute Streams, Parallelism has one process on a single thread on the side that is shown on the right in graphical execution plans, connecting to the serial child operator of the Parallelism operator. This side is called the “producer side”. On the other side, called the “consumer side”, one process runs on each thread, connecting to the parent operator of the Parallelism operator on that thread.
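
To make this topology concrete, here is a minimal Python sketch that runs one serial producer thread and DOP consumer threads. Queues stand in for the packet mechanism, the end-of-data marker is invented for the example, and the round robin style distribution is just one of the options described further down; all names are illustrative, not SQL Server internals.

    import queue
    import threading

    DOP = 2                                         # degree of parallelism

    streams = [queue.Queue() for _ in range(DOP)]   # stand-ins for the packets

    def producer(rows):
        # The single producer side process: read rows from the (serial) child
        # operator and distribute them across the consumer side threads.
        for row in rows:
            streams[row % DOP].put(row)             # round robin style here
        for stream in streams:
            stream.put(None)                        # illustrative end-of-data marker

    def consumer(thread_no):
        # One of DOP consumer side processes, each feeding its own copy of the
        # parent operator on its own thread.
        while (row := streams[thread_no].get()) is not None:
            print(f"thread {thread_no} received row {row}")

    workers = [threading.Thread(target=consumer, args=(i,)) for i in range(DOP)]
    for worker in workers:
        worker.start()
    producer(range(6))
    for worker in workers:
        worker.join()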

Producer side

The process on the producer side reads rows from the child operator, stores them in holding areas called “packets” (of which there is one for each consumer side thread), and initiates a transfer of the packet to the consumer side process when a packet is full.

The basic algorithm for the producer side process of a Parallelism operator with its Logical Operation set to Distribute Streams is as shown below:

[Flowchart for the producer side process of a Parallelism (Distribute Streams)]

Note that this flowchart is a simplification. It does not show the details of what happens when the producer side process tries to push a packet of rows to the consumer side (this is shown in more detail below). It also doesn’t show the details of how a parallel plan starts up its execution. (Paul White has written an excellent series of articles to describe specifically the startup phase: Part 1, Part 2, Part 3, Part 4, Part 5.)
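
To make the flowchart steps concrete, the single-threaded Python sketch below mimics the producer side loop: read a row, apply the optional predicate, pick a target packet, and push the packet when it is full or when the data runs out. The packet size and all names are illustrative; the real exchange internals are not documented.

    PACKET_SIZE = 3      # illustrative capacity; the real packet size is undocumented
    DOP = 2              # degree of parallelism (number of consumer side threads)

    packets = [[] for _ in range(DOP)]          # one packet per consumer side thread

    def push_packet(thread_no):
        # Stand-in for the transfer to the consumer side (detailed below).
        print(f"pushing {packets[thread_no]} to consumer thread {thread_no}")
        packets[thread_no].clear()

    def produce(child_rows, predicate=None, target_thread=lambda row: 0):
        for row in child_rows:
            if predicate is not None and not predicate(row):
                continue                        # discard rows failing the Predicate
            target = target_thread(row)         # see "Determine target thread(s)"
            packets[target].append(row)
            if len(packets[target]) == PACKET_SIZE:
                push_packet(target)             # packet full: transfer it
        for thread_no in range(DOP):            # end of data: flush partial packets
            if packets[thread_no]:
                push_packet(thread_no)

    produce(range(10), target_thread=lambda row: row % DOP)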

Predicate?

The Parallelism operator supports an optional Predicate property. When present, the supplied predicate is applied to every row read in the producer side process. Only rows where the predicate evaluates to True are processed; if the predicate is False or Unknown, the row is discarded.
So far, I have only ever observed Predicate properties that use a PROBE function in their expression.
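
As an illustration of what such a predicate does, the hypothetical Python sketch below keeps only rows whose hashed join column hits a bit in a bitmap built elsewhere in the plan, which is roughly what a PROBE function tests. The bitmap size, its contents, and the hash function are all invented for this example.

    BITMAP_BITS = 16
    bitmap = {hash(value) % BITMAP_BITS for value in (3, 7, 12)}   # bits set by the build side

    def probe_predicate(row):
        # A row passes only when its hash hits a bit that is set in the bitmap;
        # hash collisions can let extra rows through, just as with a real bitmap.
        return hash(row) % BITMAP_BITS in bitmap

    print([row for row in range(15) if probe_predicate(row)])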

Determine target thread(s)

Perhaps the most important step in the producer side flowchart is the determination of the correct thread (or packet) that a row needs to be sent to. The Parallelism operator supports five different ways to distribute the rows; which one is used depends on the Partitioning Type property.

These are the five supported partitioning types (a sketch of the first three follows the list):

  • Hash: One or more columns from the input, as indicated by the Partition Columns property, are hashed to find a number between 1 and the degree of parallelism. This number is the number of the packet that the row is stored in.
    This partitioning type is typically used when a parallel zone includes operators that require rows with the same value in certain columns to be on the same thread (such as join or aggregation operators).
  • Round Robin: The first row read is stored in packet 1, the second in packet 2, and so on. Once all packets have received their first row, it wraps back to packet 1.
    This partitioning type is typically used when there are no functional reasons to have rows on specific threads. A benefit of round robin partitioning is that each thread will receive the same number of rows. This does not automatically prevent skew, though, as some rows might require more work than others, and some threads may receive fewer CPU cycles than others due to other concurrent processes running on the same core.
  • Broadcast: Identical copies of each row read are stored in all packets.
    This partitioning type is typically used when the input is estimated to be very small in size. By making a copy of this small input available on all threads, the rest of the execution plan can often be simplified. When the cardinality of the input is severely misestimated and its actual size is much larger, then this can cause performance issues due to doing the same work on all those rows on each thread in the parallel zone.
  • Range: One column from the input, as indicated by the Partition Columns property, is used to determine the number of the packet that the row is stored in, based on consecutive and non-overlapping ranges. Values in the first range go to packet 1, values in the second range to packet 2, etc.
    This partitioning type is rare; it is, as far as currently known, only used in certain parallel index build execution plans and in parallel update statistics plans.
    The boundaries for the defined ranges are not exposed in either the graphical execution plan, or the execution plan xml. I assume that they are stored in the internal representation of the execution plan, but not represented in the execution plan xml.
  • Demand: Each row is sent to the first consumer thread that asks for rows. That makes this partitioning type the only one that effectively creates a pull-based distribution. Because the distribution is demand based, skew is almost always avoided.
    This partitioning type is rare; it is only used in execution plans with partitioned tables.
    I have so far never been able to reproduce an execution plan that uses this partitioning type, so I am not aware of any specific properties that might occur in this case. Please let me know if you have seen this partitioning type and can share an execution plan or some repro code with me.
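
The Python sketch below shows one way the target packet number could be computed for the Hash, Round Robin, and Broadcast types. Range and Demand are omitted because they depend on internals (the range boundaries, the consumers' demand) that are not exposed. The hash function, the zero-based packet numbering, and all names are illustrative, not SQL Server's actual implementation.

    from itertools import count

    DOP = 4                                  # degree of parallelism

    def hash_target(row, partition_columns):
        # Hash: equal values in the Partition Columns always map to the same packet.
        key = tuple(row[column] for column in partition_columns)
        return hash(key) % DOP

    _rows_seen = count()
    def round_robin_target(row):
        # Round Robin: ignore the row content and simply cycle through the packets.
        return next(_rows_seen) % DOP

    def broadcast_targets(row):
        # Broadcast: a copy of the row goes into every packet.
        return list(range(DOP))

    row = {"CustomerID": 42, "Amount": 9.99}
    print(hash_target(row, ["CustomerID"]))
    print(round_robin_target(row), round_robin_target(row), round_robin_target(row))
    print(broadcast_targets(row))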

Push packet

When the producer side has completely filled the packet for one of the consumer side processes, or when the end of data has been reached and there are partially filled packets, the producer pushes the packet to the consumer. This process of pushing a packet to the consumer has its own logic, as follows:

Flowchart for the "push packet" action in the producer side algorithm of any Parallelism operatorHere, the “Empty packet?” test checks whether the consumer process on the corresponding thread has an empty packet available. If this is not the case, then the consumer side process is still working on the rows sent in a previous packet, and the producer side will have to wait until an empty packet is once more available.

Once the consumer side packet is empty, the “Push packet” process moves the data in the packet on the producer side to the packet on the consumer side, by first copying the data to the packet on the consumer side, and then removing it from the packet on the producer side.
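
The Python sketch below models this handshake for a single consumer side thread, with threading.Condition standing in for whatever synchronization mechanism SQL Server actually uses internally (which is not documented). The two wait() calls correspond to the CXPACKET and CXCONSUMER waits described in the following sections; all names are illustrative.

    import threading

    class ExchangePort:
        """One producer/consumer packet pair for a single consumer side thread."""
        def __init__(self):
            self.consumer_packet = None      # None means an empty packet is available
            self.cond = threading.Condition()

        def push_packet(self, producer_packet):
            with self.cond:
                while self.consumer_packet is not None:
                    self.cond.wait()         # producer waits here (CXPACKET)
                # Copy the rows to the consumer side, then empty the producer packet.
                self.consumer_packet = list(producer_packet)
                producer_packet.clear()
                self.cond.notify_all()

        def take_packet(self):
            with self.cond:
                while self.consumer_packet is None:
                    self.cond.wait()         # consumer waits here (CXCONSUMER)
                rows, self.consumer_packet = self.consumer_packet, None
                self.cond.notify_all()       # hand the empty packet back
                return rows

    port = ExchangePort()
    producer_packet = [1, 2, 3]
    port.push_packet(producer_packet)        # succeeds at once: the packet was empty
    print(port.take_packet(), producer_packet)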

Wait (until packet available)

When the consumer side process has not yet finished processing all rows in a packet by the time the next packet for the same thread is ready, the producer side process cannot continue until the consumer side finishes processing those rows. It enters a wait state, where the process is taken off the processor (so that other SQL Server tasks can use it), and waits until the consumer side has an empty packet available for receiving new rows.

These are the infamous CXPACKET waits. Ideally, the consumer side should be able to finish work on all rows in a packet before the next packet is ready, so there should be little to no CXPACKET waits. When there is a lot of skew in the execution plan, however, you will see high CXPACKET waits.

Consumer side

On the consumer side, the same process runs on each thread. This process reads rows from the packet (after it has been transferred from the producer side) and returns them to its parent operator.

The basic algorithm for the consumer side process of a Parallelism operator with its Logical Operation set to Distribute Streams is as shown below:

[Flowchart for the consumer side process of a Parallelism (Distribute Streams)]

Note that this flowchart is a simplification. It doesn’t show that execution stops whenever a row is returned, and resumes where it was upon next being called.
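
A minimal Python sketch of one consumer side thread is shown below. A queue stands in for the packet transfer, the end-of-data marker is invented for the example, and the yield mimics how control returns to the parent operator after every row.

    import queue

    def consume(packet_queue):
        # One consumer side thread: wait for a packet, then hand its rows to the
        # parent operator one at a time.
        while True:
            packet = packet_queue.get()      # wait until data available
            if packet is None:               # illustrative end-of-data marker
                return
            for row in packet:
                yield row                    # return one row to the parent

    q = queue.Queue()
    for packet in ([1, 2, 3], [4, 5], None):
        q.put(packet)
    print(list(consume(q)))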

Wait (until data available)

When the packet on a consumer side process is empty, then the process on that thread has to wait until the producer side process pushes a new packet of data for it to process. The consumer side processes on the other threads do not have to wait for this; they can continue to work.

These waits used to be marked as CXPACKET waits, but since Service Pack 2 for SQL Server 2016 and Cumulative Update 3 for SQL Server 2017, they have been relabeled as CXCONSUMER waits. Ideally, the consumer side should be able to finish work on all rows in a packet before the next packet is ready, so high CXCONSUMER waits are to be expected and not a reason to worry. (Before these waits were relabeled as CXCONSUMER, it was a lot harder to troubleshoot parallelism, because benign and actionable waits were all registered under the same wait type).

Operator properties

The properties below are specific to the Parallelism operator with its Logical Operation set to Distribute Streams, or have a specific meaning when appearing on it. For all other properties, see Common properties. Properties that are included on the Common properties page but are also included below for their specific meaning for the Parallelism (Distribute Streams) operator are marked with a *.

Partition Columns: When the Partitioning Type property is set to Hash or Range, the Partition Columns property lists the column or columns that are used to determine what thread each row gets sent to. All rows with the same value in each of the columns in this property are guaranteed to go to the same thread.

Partitioning Type: The Partitioning Type property determines how rows will be distributed across the threads on the consumer side. The five available values for Partitioning Type and their meaning are described in the main text.

Predicate: When present, this property defines a logical test to be applied to all rows that the producer side process reads. Only rows for which the predicate evaluates to True are returned.

Implicit properties

The table below lists the behavior of the implicit properties for the Parallelism operator with its Logical Operation set to Distribute Streams.

Batch Mode enabled: The Parallelism operator with its Logical Operation set to Distribute Streams only appears in row mode execution plans, or in row mode sections of batch mode execution plans. Batch mode (sections of) execution plans use a different model for parallelism, where no Parallelism operators are needed.

Blocking: The Parallelism operator is non-blocking.

Memory requirement: The Parallelism operator with its Logical Operation set to Distribute Streams requires enough memory to store a number of packets equal to the degree of parallelism on the producer side, and one packet for each process on the consumer side. Unlike most other operators with high memory consumption, the Parallelism operator does not expose the Memory Fractions and Memory Usage properties.

Order-preserving: The Parallelism operator with its Logical Operation set to Distribute Streams is fully order-preserving.

Parallelism aware: The Parallelism operator is only used in parallel execution plans, and is fully parallelism aware.

Segment aware: The Parallelism operator is not segment aware.
