I came across a technical article, "Changing one line of code cut array-traversal time from 10.3 s to 0.5 s!", and tried it out myself.
Test code
#include <stdio.h>
#include <time.h>

#define SIZE 10240

int array[SIZE][SIZE];

void test1(void) {
    int x, y;
    for (x = 0; x < SIZE; x++) {
        for (y = 0; y < SIZE; y++) {
            array[x][y] = 1024; // row-major access
        }
    }
}

void test2(void) {
    int x, y;
    for (x = 0; x < SIZE; x++) {
        for (y = 0; y < SIZE; y++) {
            array[y][x] = 1024; // column-major access
        }
    }
}

int main(void) {
    clock_t start, end;
    double cpu_time_used;

    start = clock();
    test1();
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("row-major access: %f\n", cpu_time_used);

    start = clock();
    test2();
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("column-major access: %f\n", cpu_time_used);

    return 0;
}
gcc arrayFast.c -o /tmp/a && /tmp/a
row-major access: 0.576199
column-major access: 2.444890
The article reported 0.5 s versus 10.3 s; on my machine (GCC 11.4) the gap is smaller.
How it works
Data moves between the cache and main memory not byte by byte, but in units of a cache line.
The cache line size varies across CPUs; 64 bytes is typical. A C array stores all of its elements in contiguous memory, and a multidimensional C array is laid out in row-major order.
The article demonstrates analyzing this with perf (the two perf stat runs below appear to profile the row-major and column-major versions built as separate binaries, /tmp/a and /tmp/b):
sudo apt update
sudo apt install -y linux-tools-common linux-tools-generic linux-cloud-tools-generic
# Without the last two packages, perf complains:
# perf --version
# WARNING: perf not found for kernel 5.15.0-87
#
# You may need to install the following packages for this specific kernel:
# linux-tools-5.15.0-87-generic
# linux-cloud-tools-5.15.0-87-generic
#
# You may also want to install one of the following packages to keep up to date:
# linux-tools-generic
# linux-cloud-tools-generic
sudo perf stat -d /tmp/a
Performance counter stats for '/tmp/a':
511.27 msec task-clock # 0.998 CPUs utilized
5 context-switches # 9.780 /sec
1 cpu-migrations # 1.956 /sec
102,453 page-faults # 200.389 K/sec
1,304,037,593 cycles # 2.551 GHz (49.93%)
1,997,636,690 instructions # 1.53 insn per cycle (62.44%)
176,746,327 branches # 345.700 M/sec (62.45%)
104,734 branch-misses # 0.06% of all branches (62.46%)
543,705,000 L1-dcache-loads # 1.063 G/sec (60.30%)
8,549,186 L1-dcache-load-misses # 1.57% of all L1-dcache accesses (25.03%)
1,035,710 LLC-loads # 2.026 M/sec (25.02%)
557,731 LLC-load-misses # 53.85% of all LL-cache accesses (37.52%)
0.512050821 seconds time elapsed
0.347952000 seconds user
0.163977000 seconds sys
sudo perf stat -d /tmp/b
Performance counter stats for '/tmp/b':
2,454.55 msec task-clock # 1.000 CPUs utilized
19 context-switches # 7.741 /sec
5 cpu-migrations # 2.037 /sec
102,452 page-faults # 41.740 K/sec
6,191,911,818 cycles # 2.523 GHz (49.82%)
1,986,048,111 instructions # 0.32 insn per cycle (62.36%)
177,822,350 branches # 72.446 M/sec (62.37%)
194,485 branch-misses # 0.11% of all branches (62.53%)
548,174,579 L1-dcache-loads # 223.330 M/sec (62.24%)
217,998,941 L1-dcache-load-misses # 39.77% of all L1-dcache accesses (25.09%)
39,386,881 LLC-loads # 16.046 M/sec (24.93%)
1,255,181 LLC-load-misses # 3.19% of all LL-cache accesses (37.31%)
2.455436836 seconds time elapsed
2.291020000 seconds user
0.164216000 seconds sys
As the numbers show, switching to column-major access pushes the L1 dcache miss rate from 1.57% up to 39.77%.
The article explains that because the CPU cache loads data one cache line at a time, column-major traversal touches a new line on almost every access, causing many more misses; the data must then repeatedly be fetched from memory, so the loop runs slower.
Going further
The same logic translated into Go shows a similar pattern:
go run arrayFast.go
row-major access took 414.485988ms
column-major access took 2.5762119s
row-major access took 98.654441ms
column-major access took 2.564306352s