How can building a heap be O(n) time complexity?

Can someone help explain how building a heap can be O(n) complexity?

Inserting an item into a heap is O(log n), and the insert is repeated n/2 times (the remainder are leaves, and can't violate the heap property). So, this means the complexity should be O(n log n), I would think.

In other words, for each item we "heapify", it has the potential to have to filter down (i.e., sift down) once for each level for the heap so far (which is log n levels).

What am I missing?

Best Answer

I think there are several questions buried in this topic:

  • How do you implement buildHeap so it runs in O(n) time?
  • How do you show that buildHeap runs in O(n) time when implemented correctly?
  • Why doesn't that same logic work to make heap sort run in O(n) time rather than O(n log n)?

How do you implement buildHeap so it runs in O(n) time?

Often, answers to these questions focus on the difference between siftUp and siftDown. Making the correct choice between siftUp and siftDown is critical to get O(n) performance for buildHeap, but does nothing to help one understand the difference between buildHeap and heapSort in general. Indeed, proper implementations of both buildHeap and heapSort will only use siftDown. The siftUp operation is only needed to perform inserts into an existing heap, so it would be used to implement a priority queue using a binary heap, for example.

I've written this to describe how a max heap works. This is the type of heap typically used for heap sort or for a priority queue where higher values indicate higher priority. A min heap is also useful; for example, when retrieving items with integer keys in ascending order or strings in alphabetical order. The principles are exactly the same; simply switch the sort order.

The heap property specifies that each node in a binary heap must be at least as large as both of its children. In particular, this implies that the largest item in the heap is at the root. Sifting down and sifting up are essentially the same operation in opposite directions: move an offending node until it satisfies the heap property:

  • siftDown swaps a node that is too small with its largest child (thereby moving it down) until it is at least as large as both nodes below it.
  • siftUp swaps a node that is too large with its parent (thereby moving it up) until it is no larger than the node above it.

The number of operations required for siftDown and siftUp is proportional to the distance the node may have to move. For siftDown, it is the distance to the bottom of the tree, so siftDown is expensive for nodes at the top of the tree. With siftUp, the work is proportional to the distance to the top of the tree, so siftUp is expensive for nodes at the bottom of the tree. Although both operations are O(log n) in the worst case, in a heap, only one node is at the top whereas half the nodes lie in the bottom layer. So it shouldn't be too surprising that if we have to apply an operation to every node, we would prefer siftDown over siftUp.
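To make the two operations concrete, here is a minimal sketch for a max heap stored in a zero-based int array, where the children of index i sit at 2i+1 and 2i+2. The function names and signatures are my own illustration, not part of the original answer:

void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

// Move the item at index i down until it is at least as large as both children.
void siftDown(int arr[], int n, int i) {
    while (2 * i + 1 < n) {
        int child = 2 * i + 1;                            // left child
        if (child + 1 < n && arr[child + 1] > arr[child])
            child++;                                      // pick the larger child
        if (arr[i] >= arr[child])
            break;                                        // heap property holds here
        swap(&arr[i], &arr[child]);
        i = child;                                        // keep sifting from the child's slot
    }
}

// Move the item at index i up until it is no larger than its parent.
void siftUp(int arr[], int i) {
    while (i > 0 && arr[(i - 1) / 2] < arr[i]) {
        swap(&arr[(i - 1) / 2], &arr[i]);
        i = (i - 1) / 2;                                  // keep sifting from the parent's slot
    }
}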

The buildHeap function takes an array of unsorted items and moves them until they all satisfy the heap property, thereby producing a valid heap. There are two approaches one might take for buildHeap using the siftUp and siftDown operations we've described.

  1. Start at the top of the heap (the beginning of the array) and call siftUp on each item. At each step, the previously sifted items (the items before the current item in the array) form a valid heap, and sifting the next item up places it into a valid position in the heap. After sifting up each node, all items satisfy the heap property.

  2. Or, go in the opposite direction: start at the end of the array and move backwards towards the front. On each iteration, you sift an item down until it is in the correct location.

Which implementation of buildHeap is more efficient?

Both of these solutions will produce a valid heap. Unsurprisingly, the more efficient one is the second approach, the one that uses siftDown.
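For illustration, the two approaches from the list above might look like this, reusing the siftUp and siftDown sketches from earlier (again, a sketch rather than a canonical implementation):

// Approach 1: sift each item up, front to back. Worst case O(n log n).
void buildHeapWithSiftUp(int arr[], int n) {
    for (int i = 1; i < n; i++)
        siftUp(arr, i);
}

// Approach 2: sift each item down, back to front. O(n) overall.
// The last n/2 items are leaves and already satisfy the heap property,
// so we start at the last internal node.
void buildHeapWithSiftDown(int arr[], int n) {
    for (int i = n / 2 - 1; i >= 0; i--)
        siftDown(arr, n, i);
}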

Let h = log n represent the height of the heap. The work required for the siftDown approach is given by the sum

(0 * n/2) + (1 * n/4) + (2 * n/8) + ... + (h * 1).

Each term in the sum is the maximum distance a node at the given height will have to move (zero for the bottom layer, h for the root) multiplied by the number of nodes at that height. In contrast, the sum for calling siftUp on each node is

(h * n/2) + ((h-1) * n/4) + ((h-2)*n/8) + ... + (0 * 1).

The second sum is clearly larger. The first term alone is hn/2 = 1/2 n log n, so this approach has complexity at best O(n log n).
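As a quick sanity check (my own example, not from the original answer): for n = 15 the heap has height h = 3, with 8, 4, 2, and 1 nodes per level from the bottom up. The siftDown sum is 0*8 + 1*4 + 2*2 + 3*1 = 11, while the siftUp sum is 3*8 + 2*4 + 1*2 + 0*1 = 34, and the gap only grows with n.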

How do we prove that the sum for the siftDown approach is indeed O(n)?

One method (there are other analyses that also work) is to turn the finite sum into an infinite series and then use Taylor series. We may ignore the first term, since it is zero:

[Image: Taylor series for the complexity of buildHeap]

If you don't see why each of those steps works, here is a justification for the process in words:

  • The terms are all positive, so the finite sum must be smaller than the infinite sum.
  • The series is equal to a power series evaluated at x = 1/2.
  • That power series is equal to (a constant times) the derivative of the Taylor series for f(x)=1/(1-x).
  • x=1/2 is within the interval of convergence of that Taylor series.
  • Therefore, we can replace the Taylor series with 1/(1-x), differentiate, and evaluate to find the value of the infinite series.

Since the infinite sum is exactly n, we conclude that the finite sum is no larger and is therefore O(n).
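Written out symbolically (my own rendering of the steps above, with h = log n), the argument is

$$\sum_{k=0}^{h} k \cdot \frac{n}{2^{k+1}}
\;\le\; \frac{n}{2} \sum_{k=0}^{\infty} \frac{k}{2^{k}}
\;=\; \frac{n}{2} \cdot \left.\frac{x}{(1-x)^{2}}\right|_{x=1/2}
\;=\; \frac{n}{2} \cdot 2
\;=\; n,$$

where $\sum_{k \ge 0} k x^{k} = x/(1-x)^{2}$ comes from differentiating $1/(1-x) = \sum_{k \ge 0} x^{k}$ and multiplying by x.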

Why does heap sort require O(n log n) time?

If it is possible to run buildHeap in linear time, why does heap sort require O(n log n) time? Well, heap sort consists of two stages. First, we call buildHeap on the array, which requires O(n) time if implemented optimally. The next stage is to repeatedly delete the largest item in the heap and put it at the end of the array. Because we delete an item from the heap, there is always an open spot just after the end of the heap where we can store the item. So heap sort achieves a sorted order by successively removing the next largest item and putting it into the array starting at the last position and moving towards the front. It is the complexity of this last part that dominates in heap sort. The loop looks like this:

for (i = n - 1; i > 0; i--) {
    arr[i] = deleteMax();   // remove the current max and drop it into its final slot
}

Clearly, the loop runs O(n) times (n - 1 to be precise, the last item is already in place). The complexity of deleteMax for a heap is O(log n). It is typically implemented by removing the root (the largest item left in the heap) and replacing it with the last item in the heap, which is a leaf, and therefore one of the smallest items. This new root will almost certainly violate the heap property, so you have to call siftDown until you move it back into an acceptable position. This also has the effect of moving the next largest item up to the root. Notice that, in contrast to buildHeap where for most of the nodes we are calling siftDown from the bottom of the tree, we are now calling siftDown from the top of the tree on each iteration! Although the tree is shrinking, it doesn't shrink fast enough: The height of the tree stays constant until you have removed the first half of the nodes (when you clear out the bottom layer completely). Then for the next quarter, the height is h - 1. So the total work for this second stage is

h*n/2 + (h-1)*n/4 + ... + 0 * 1.

Notice the switch: now the zero work case corresponds to a single node and the h work case corresponds to half the nodes. This sum is O(n log n) just like the inefficient version of buildHeap that is implemented using siftUp. But in this case, we have no choice since we are trying to sort and we require the next largest item be removed next.
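To tie this back to the loop above, here is a minimal sketch of deleteMax and of the full two-stage sort, reusing the siftDown and buildHeapWithSiftDown sketches from earlier (names and signatures are my own, not from the original answer):

// Remove and return the largest item from a max heap of *size elements.
// Moving the last leaf to the root and sifting it down is the O(log n) step.
int deleteMax(int arr[], int *size) {
    int max = arr[0];            // the root holds the largest item
    arr[0] = arr[*size - 1];     // replace it with the last leaf
    (*size)--;                   // the heap shrinks by one
    siftDown(arr, *size, 0);     // restore the heap property from the top
    return max;
}

void heapSort(int arr[], int n) {
    buildHeapWithSiftDown(arr, n);       // stage 1: O(n)
    int size = n;
    for (int i = n - 1; i > 0; i--)      // stage 2: O(n log n)
        arr[i] = deleteMax(arr, &size);  // place the current max in its final slot
}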

In summary, the work for heap sort is the sum of the two stages: O(n) time for buildHeap and O(n log n) to remove each node in order, so the total complexity is O(n log n). You can prove (using some ideas from information theory) that for a comparison-based sort, O(n log n) is the best you could hope for anyway, so there's no reason to be disappointed by this or to expect heap sort to achieve the O(n) time bound that buildHeap does.
