TOC

数组按行访问和按列访问速度有这么大差异?

看到一片技术文章《改了一行代码,数组遍历耗时从 10.3 秒降到了 0.5 秒!》,试验了一下。

测试代码

#define SIZE 10240

#include <stdio.h>
#include <time.h>

int array[SIZE][SIZE];

int test1() {
    int x, y;
    for (x=0; x<SIZE; x++) {
        for (y=0; y<SIZE; y++) {
            array[x][y] = 1024; // 按行访问
        }
    }
}

int test2() {
    int x, y;
    for (x=0; x<SIZE; x++) {
        for (y=0; y<SIZE; y++) {
            array[y][x] = 1024; // 按列访问
        }
    }
}

int main() {
    clock_t start, end;
    double cpu_time_used;

    start = clock();
    test1();
    end = clock();
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("按行访问: %f\n", cpu_time_used);

    start = clock();
    test2();
    end = clock();
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("按列访问: %f\n", cpu_time_used);

    return 0;
}
gcc arrayFast.c -o /tmp/a && /tmp/a
按行访问: 0.576199
按列访问: 2.444890

文章中是 0.5 对 10.3 的差异,在我本地(GCC 11.4)差异小一些。

原理

数据在 Cache 和内存之间传输时,不是一个字节一个字节进行传输的,而是以缓存行(Cache Line)为单位进行传输的。
不同 CPU 的 Cache line 大小可能不同,典型的 CPU Cache line 大小是 64 个字节。

C 语言的数组,所有元素是存放在地址连续的内存中的,此外,C 语言的多维数组,是按行进行存储的。

文中介绍了使用 pref 来做分析:

sudo apt update
sudo apt install -y linux-tools-common linux-tools-generic linux-cloud-tools-generic
# 后面两个不安装会提示:
# perf --version
# WARNING: perf not found for kernel 5.15.0-87
#
#   You may need to install the following packages for this specific kernel:
#     linux-tools-5.15.0-87-generic
#     linux-cloud-tools-5.15.0-87-generic
#
#   You may also want to install one of the following packages to keep up to date:
#     linux-tools-generic
#     linux-cloud-tools-generic
sudo perf stat -d /tmp/a

 Performance counter stats for '/tmp/a':

            511.27 msec task-clock                #    0.998 CPUs utilized
                 5      context-switches          #    9.780 /sec
                 1      cpu-migrations            #    1.956 /sec
           102,453      page-faults               #  200.389 K/sec
     1,304,037,593      cycles                    #    2.551 GHz                      (49.93%)
     1,997,636,690      instructions              #    1.53  insn per cycle           (62.44%)
       176,746,327      branches                  #  345.700 M/sec                    (62.45%)
           104,734      branch-misses             #    0.06% of all branches          (62.46%)
       543,705,000      L1-dcache-loads           #    1.063 G/sec                    (60.30%)
         8,549,186      L1-dcache-load-misses     #    1.57% of all L1-dcache accesses  (25.03%)
         1,035,710      LLC-loads                 #    2.026 M/sec                    (25.02%)
           557,731      LLC-load-misses           #   53.85% of all LL-cache accesses  (37.52%)

       0.512050821 seconds time elapsed

       0.347952000 seconds user
       0.163977000 seconds sys
sudo perf stat -d /tmp/b

 Performance counter stats for '/tmp/b':

          2,454.55 msec task-clock                #    1.000 CPUs utilized
                19      context-switches          #    7.741 /sec
                 5      cpu-migrations            #    2.037 /sec
           102,452      page-faults               #   41.740 K/sec
     6,191,911,818      cycles                    #    2.523 GHz                      (49.82%)
     1,986,048,111      instructions              #    0.32  insn per cycle           (62.36%)
       177,822,350      branches                  #   72.446 M/sec                    (62.37%)
           194,485      branch-misses             #    0.11% of all branches          (62.53%)
       548,174,579      L1-dcache-loads           #  223.330 M/sec                    (62.24%)
       217,998,941      L1-dcache-load-misses     #   39.77% of all L1-dcache accesses  (25.09%)
        39,386,881      LLC-loads                 #   16.046 M/sec                    (24.93%)
         1,255,181      LLC-load-misses           #    3.19% of all LL-cache accesses  (37.31%)

       2.455436836 seconds time elapsed

       2.291020000 seconds user
       0.164216000 seconds sys

可以看到,采用按列访问之后,L1 miss 从 1.57% 增长到了 39.77%。
文章中提到,这是因为 CPU 缓存是按 Cache Line 加载的,按列读取会导致更多 Miss,从而总是从内存加载数据,速度会慢一些。

举一反三

相同逻辑转换成 Golang 之后也是类似的现象:

go run arrayFast.go
按行访问 took 414.485988ms
按列访问 took 2.5762119s
按行访问 took 98.654441ms
按列访问 took 2.564306352s