The algorithm works in steps - one per each level of the tree with all tree nodes of given level computed in parallel. During my work I observed that various solutions have different efficiency depending on the number of primitives in a node. In order to reach maximum speed, kd-tree construction is divided into 3 stages: big, medium and small. Each of them differs on how counting is resolved and how work is distributed on GPU.
My current program constructs the tree faster than any other known to me algorithm but still have competing quality compared to serial algorithms. Also, tests have shown that execution time scales down well in respect to power of GPU which gives reasons to believe it will continue doing so with future releases of the hardware.