# java 為什麼處理排序數組比處理未排序數組更快?

## c++ performance (17)

``````#include <algorithm>
#include <ctime>
#include <iostream>

int main()
{
// Generate data
const unsigned arraySize = 32768;
int data[arraySize];

for (unsigned c = 0; c < arraySize; ++c)
data[c] = std::rand() % 256;

// !!! With this, the next loop runs faster
std::sort(data, data + arraySize);

// Test
clock_t start = clock();
long long sum = 0;

for (unsigned i = 0; i < 100000; ++i)
{
// Primary loop
for (unsigned c = 0; c < arraySize; ++c)
{
if (data[c] >= 128)
sum += data[c];
}
}

double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

std::cout << elapsedTime << std::endl;
std::cout << "sum = " << sum << std::endl;
}
``````
• 沒有`std::sort(data, data + arraySize);` ，代碼運行時間為11.54秒。
• 使用排序數據，代碼運行1.93秒。

``````import java.util.Arrays;
import java.util.Random;

public class Main
{
public static void main(String[] args)
{
// Generate data
int arraySize = 32768;
int data[] = new int[arraySize];

Random rnd = new Random(0);
for (int c = 0; c < arraySize; ++c)
data[c] = rnd.nextInt() % 256;

// !!! With this, the next loop runs faster
Arrays.sort(data);

// Test
long start = System.nanoTime();
long sum = 0;

for (int i = 0; i < 100000; ++i)
{
// Primary loop
for (int c = 0; c < arraySize; ++c)
{
if (data[c] >= 128)
sum += data[c];
}
}

System.out.println((System.nanoTime() - start) / 1000000000.0);
System.out.println("sum = " + sum);
}
}
``````

• 到底是怎麼回事？
• 為什麼處理排序數組比處理未排序數組更快？
• 代碼總結了一些獨立的術語，順序無關緊要。

``````____________________________________________________________________________________
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
``````

``````______   ________
|     |__|
``````

``````___________________________________________ Straight road
``````

``````if (data[c] >= 128)
sum += data[c];
``````

`if`-statement 的表現取決於其條件是否具有可預測的模式。如果條件始終為真或始終為假，則處理器中的分支預測邏輯將獲取模式。另一方面，如果模式不可預測，那麼 - `if`陳述將更加昂貴。

``````for (int i = 0; i < max; i++)
if (condition)
sum++;
``````

``````Condition            Pattern                 Time (ms)

(i & 0×80000000) == 0    T repeated          322

(i & 0xffffffff) == 0    F repeated          276

(i & 1) == 0            TF alternating    760

(i & 3) == 0            TFFFTFFF…          513

(i & 2) == 0            TTFFTTFF…          1675

(i & 4) == 0            TTTTFFFFTTTTFFFF… 1275

(i & 8) == 0            8T 8F 8T 8F …     752

(i & 16) == 0            16T 16F 16T 16F … 490
``````

1. 它是一個小桌子，可能會被緩存在處理器中
2. 您正在以非常緊湊的循環運行和/或處理器可以預加載數據

Pfew，那到底是什麼意思呢？

``````for (int i=0; i<array.Length; ++i)
// Use array[i]
``````

``````// Generate data
int arraySize = 32768;
int[] data = new int[arraySize];

Random rnd = new Random(0);
for (int c = 0; c < arraySize; ++c)
data[c] = rnd.Next(256);

//To keep the spirit of the code in-tact I'll make a separate lookup table
// (I assume we cannot modify 'data' or the number of loops)
int[] lookup = new int[256];

for (int c = 0; c < 256; ++c)
lookup[c] = (c >= 128) ? c : 0;

// Test
DateTime startTime = System.DateTime.Now;
long sum = 0;

for (int i = 0; i < 100000; ++i)
{
// Primary loop
for (int j = 0; j < arraySize; ++j)
{
// Here you basically want to use simple operations - so no
// random branches, but things like &, |, *, -, +, etc. are fine.
sum += lookup[data[j]];
}
}

DateTime endTime = System.DateTime.Now;
Console.WriteLine(endTime - startTime);
Console.WriteLine("sum = " + sum);

``````

``````==32551== Branches:        656,645,130  (  656,609,208 cond +    35,922 ind)
==32551== Mispredicts:         169,556  (      169,095 cond +       461 ind)
==32551== Mispred rate:            0.0% (          0.0%     +       1.2%   )
``````

``````==32555== Branches:        655,996,082  (  655,960,160 cond +  35,922 ind)
==32555== Mispredicts:     164,073,152  (  164,072,692 cond +     460 ind)
==32555== Mispred rate:           25.0% (         25.0%     +     1.2%   )
``````

``````          Bc    Bcm Bi Bim
10,001      4  0   0      for (unsigned i = 0; i < 10000; ++i)
.      .  .   .      {
.      .  .   .          // primary loop
327,690,000 10,016  0   0          for (unsigned c = 0; c < arraySize; ++c)
.      .  .   .          {
327,680,000 10,006  0   0              if (data[c] >= 128)
0      0  0   0                  sum += data[c];
.      .  .   .          }
.      .  .   .      }
``````

``````          Bc         Bcm Bi Bim
10,001           4  0   0      for (unsigned i = 0; i < 10000; ++i)
.           .  .   .      {
.           .  .   .          // primary loop
327,690,000      10,038  0   0          for (unsigned c = 0; c < arraySize; ++c)
.           .  .   .          {
327,680,000 164,050,007  0   0              if (data[c] >= 128)
0           0  0   0                  sum += data[c];
.           .  .   .          }
.           .  .   .      }
``````

``````perf stat ./sumtest_sorted
``````

`````` Performance counter stats for './sumtest_sorted':

11808.095776 task-clock                #    0.998 CPUs utilized
1,062 context-switches          #    0.090 K/sec
14 CPU-migrations            #    0.001 K/sec
337 page-faults               #    0.029 K/sec
26,487,882,764 cycles                    #    2.243 GHz
41,025,654,322 instructions              #    1.55  insns per cycle
6,558,871,379 branches                  #  555.455 M/sec
567,204 branch-misses             #    0.01% of all branches

11.827228330 seconds time elapsed
``````

`````` Performance counter stats for './sumtest_unsorted':

28877.954344 task-clock                #    0.998 CPUs utilized
2,584 context-switches          #    0.089 K/sec
18 CPU-migrations            #    0.001 K/sec
335 page-faults               #    0.012 K/sec
65,076,127,595 cycles                    #    2.253 GHz
41,032,528,741 instructions              #    0.63  insns per cycle
6,560,579,013 branches                  #  227.183 M/sec
1,646,394,749 branch-misses             #   25.10% of all branches

28.935500947 seconds time elapsed
``````

``````perf record -e branch-misses ./sumtest_unsorted
perf annotate -d sumtest_unsorted
``````
`````` Percent |      Source code & Disassembly of sumtest_unsorted
------------------------------------------------
...
:                      sum += data[c];
0.00 :        400a1a:       mov    -0x14(%rbp),%eax
39.97 :        400a1d:       mov    %eax,%eax
5.31 :        400a1f:       mov    -0x20040(%rbp,%rax,4),%eax
4.60 :        400a26:       cltq
...
``````

1. 沒有分支預測器。

1. 使用分支預測器，不要進行條件跳轉。讓我們假設預測沒有採取條件跳躍。

1. 使用分支預測器並進行條件跳轉。讓我們假設預測沒有採取條件跳躍。

``````if (expression)
{
// Run 1
} else {
// Run 2
}
``````

### 可視化：

`````` O      Route 1  /-------------------------------
/|\             /
|  ---------##/
/ \            \
\
Route 2  \--------------------------------
``````

``````// Test
clock_t start = clock();
long long a[] = {0, 0};
long long sum;

for (unsigned i = 0; i < 100000; ++i)
{
// Primary loop
for (unsigned c = 0; c < arraySize; ++c)
{
int j = (data[c] >> 7);
a[j] += data[c];
}
}

double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
sum = a[1];
``````

``````// declare and then fill in the lookup table
int lut[256];
for (unsigned c = 0; c < 256; ++c)
lut[c] = (c >= 128) ? c : 0;

// use the lookup table after it is built
for (unsigned i = 0; i < 100000; ++i)
{
// Primary loop
for (unsigned c = 0; c < arraySize; ++c)
{
sum += lut[data[c]];
}
}
``````

``````if (x < node->value)
node = node->pLeft;
else
node = node->pRight;
``````

``````i = (x < node->value);
``````

``````if (likely( everything_is_ok ))
{
/* Do something */
}
``````

``````if (unlikely(very_improbable_condition))
{
/* Do something */
}
``````

``````if (data[c] >= 128)
sum += data[c];
``````

`C``C++` ，這個語句可以直接編譯（沒有任何優化）到`x86`的條件移動指令，是三元運算符`... ? ... : ...` `... ? ... : ...` 所以我們將上面的語句重寫為等價的語句：

``````sum += data[c] >=128 ? data[c] : 0;
``````

86

``````//  Branch - Random
seconds = 8.885

//  Branch - Sorted
seconds = 1.528

//  Branchless - Random
seconds = 3.716

//  Branchless - Sorted
seconds = 3.71
``````

64位

``````//  Branch - Random
seconds = 11.302

//  Branch - Sorted
seconds = 1.830

//  Branchless - Random
seconds = 2.736

//  Branchless - Sorted
seconds = 2.737
``````

`max1`使用條件分支`if... else ...`

``````int max1(int a, int b) {
if (a > b)
return a;
else
return b;
}
``````

`max2`使用三元運算符`... ? ... : ...` `... ? ... : ...`

``````int max2(int a, int b) {
return a > b ? a : b;
}
``````

``````:max1
movl    %edi, -4(%rbp)
movl    %esi, -8(%rbp)
movl    -4(%rbp), %eax
cmpl    -8(%rbp), %eax
jle     .L2
movl    -4(%rbp), %eax
movl    %eax, -12(%rbp)
jmp     .L4
.L2:
movl    -8(%rbp), %eax
movl    %eax, -12(%rbp)
.L4:
movl    -12(%rbp), %eax
leave
ret

:max2
movl    %edi, -4(%rbp)
movl    %esi, -8(%rbp)
movl    -4(%rbp), %eax
cmpl    %eax, -8(%rbp)
cmovge  -8(%rbp), %eax
leave
ret
``````

## 什麼是分支預測？

• 如果你猜對了，那就繼續吧。
• 如果你猜對了，機長會停下來，然後向你大喊大叫，然後翻轉開關。 然後它可以重新啟動另一條路徑。

• 如果你猜對了，你繼續執行。
• 如果您猜錯了，則需要刷新管道並回滾到分支。 然後你可以重新啟動另一條路徑。

## 如上所述，罪魁禍首就是這個if語句：

``````if (data[c] >= 128)
sum += data[c];
``````

``````T = branch taken
N = branch not taken

data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N  N  N  N  N  ...   N    N    T    T    T  ...   T    T    T  ...

= NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT  (easy to predict)
``````

``````data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118,  14, 150, 177, 182, 133, ...
branch =   T,   T,   N,   T,   T,   T,   T,  N,   T,   N,   N,   T,   T,   T,   N  ...

= TTNTTTTNTNNTTTN ...   (completely random - hard to predict)
``````

``````if (data[c] >= 128)
sum += data[c];
``````

``````int t = (data[c] - 128) >> 31;
sum += ~t & data[c];
``````

（注意，這個hack並不完全等同於原始的if語句。但在這種情況下，它對`data[]`所有輸入值都有效。）

C ++ - Visual Studio 2010 - x64發行版

``````//  Branch - Random
seconds = 11.777

//  Branch - Sorted
seconds = 2.352

//  Branchless - Random
seconds = 2.564

//  Branchless - Sorted
seconds = 2.587
``````

Java - Netbeans 7.1.1 JDK 7 - x64

``````//  Branch - Random
seconds = 10.93293813

//  Branch - Sorted
seconds = 5.643797077

//  Branchless - Random
seconds = 3.113581453

//  Branchless - Sorted
seconds = 3.186068823
``````

• 使用分支：已排序和未排序的數據之間存在巨大差異。
• 使用Hack：排序和未排序數據之間沒有區別。
• 在C ++的情況下，當數據被排序時，hack實際上比分支慢一點。

• 在x64上使用`-O3``-ftree-vectorize` GCC 4.6.1能夠生成條件移動。 因此，排序和未排序數據之間沒有區別 - 兩者都很快。

• 即使在`/Ox`下，VC ++ 2010也無法為此分支生成條件移動。

• 英特爾編譯器11做了一些奇蹟。 它交換兩個循環 ，從而將不可預測的分支提升到外循環。 因此，它不僅能夠免受錯誤預測的影響，而且速度也是VC ++和GCC能夠產生的速度的兩倍！ 換句話說，ICC利用測試循環來擊敗基準......

• 如果你給英特爾編譯器提供無分支代碼，那麼它就是向右矢量化它......並且與分支一樣快（使用循環交換）。

1. IF - 從內存中獲取指令
2. ID - 解碼指令
3. EX - 執行指令
4. WB - 寫回CPU寄存器

4級管道一般用於2條指令。

``````                        A) if (data[c] >= 128)
/\
/  \
/    \
true /      \ false
/        \
/          \
/            \
/              \
B) sum += data[c];          C) for loop or print().
``````

1. 所有元素都小於128
2. 所有元素都大於128
3. 一些起始新元素小於128，後來大於128

• 分支預測器是古老的性能改進技術之一，它仍然與現代建築相關。雖然簡單的預測技術提供快速查找和功率效率，但是它們具有高的誤預測率。

• 另一方面，複雜的分支預測 - 無論是基於神經的還是兩級分支預測的變體 - 提供更好的預測準確性，但它們消耗更多的功率和復雜性以指數方式增加。

• 除此之外，在復雜的預測技術中，預測分支所花費的時間本身非常高，從2到5個週期 - 這與實際分支的執行時間相當。

• 分支預測本質上是一種優化（最小化）問題，其中重點在於以最少的資源實現最低可能的未命中率，低功耗和低複雜度。

``````bool a, b, c, d;
c = a && b;
d = a || b;
``````

``````bool a, b, c, d;
if (a != 0) {
if (b != 0) {
c = 1;
}
else {
goto CFALSE;
}
}
else {
CFALSE:
c = 0;
}
if (a == 0) {
if (b == 0) {
d = 0;
}
else {
goto DTRUE;
}
}
else {
DTRUE:
d = 1;
}
``````

``````char a = 0, b = 1, c, d;
c = a & b;
d = a | b;
``````

`char`而不是`bool`為了使得可以使用按位運算符（`&``|`）而不是布爾運算符（`&&``||`）。按位運算符是單指令，只需一個時鐘週期。OR運算符（`|`）即使`a`並且`b`具有除`0`或之外的其他值也可以工作`1`。如果操作數具有除和之外的其他值，則AND運算符（`&`）和EXCLUSIVE OR運算符（`^`）可能會產生不一致的結果。`0``1`

`~`不能用於NOT。相反，你可以對已知的變量`0``1`通過XOR對其進行布爾值`1`

``````bool a, b;
b = !a;
``````

``````char a = 0, b;
b = a ^ 1;
``````

`a && b`不能替換為`a & b`if `b`是一個不應該被評估的表達式，如果`a``false``&&`將不會評估`b``&`將）。同樣地，`a || b`不能被替換`a | b`，如果`b`是，如果不應該被計算的表達式`a``true`

``````bool a; double x, y, z;
a = x > y && z < 5.0;
``````

``````MOV R0, #0     // R0 = sum = 0
MOV R1, #0     // R1 = c = 0
ADR R2, data   // R2 = addr of data array (put this instruction outside outer loop)
.inner_loop    // Inner loop branch label
LDRB R3, [R2, R1]     // R3 = data[c]
CMP R3, #128          // compare R3 to 128
ADDGE R0, R0, R3      // if R3 >= 128, then sum += data[c] -- no branch needed!
ADD R1, R1, #1        // c++
CMP R1, #arraySize    // compare c to arraySize
BLT inner_loop        // Branch to inner_loop if c < arraySize
``````

1. 英特爾 - 避免分支機構錯誤預測的成本
2. 英特爾 - 分支和循環重組以防止錯誤預測
3. 科學論文 - 分支預測計算機結構
4. 書籍：JL Hennessy，DA Patterson：計算機體系結構：定量方法
5. 科學出版物中的文章：TY Yeh，YN Patt在分支預測中做了很多這些。

``````data[c] = std::rand() % 256;
``````

``````for (unsigned i = 0; i < 100000; ++i)
{
for (unsigned j = 0; j < arraySize; ++j)
{
if (data[j] >= 128)
sum += data[j];
}
}
``````

``````for (unsigned j = 0; j < arraySize; ++j)
{
for (unsigned i = 0; i < 100000; ++i)
{
if (data[j] >= 128)
sum += data[j];
}
}
``````

``````for (unsigned j = 0; j < arraySize; ++j)
{
if (data[j] >= 128)
{
for (unsigned i = 0; i < 100000; ++i)
{
sum += data[j];
}
}
}
``````

``````for (unsigned j = 0; j < arraySize; ++j)
{
if (data[j] >= 128)
{
sum += data[j] * 100000;
}
}
``````