I came across a technical article, "Changing one line of code cut array-traversal time from 10.3 s to 0.5 s!", and tried it out myself.
Test code
#include <stdio.h>
#include <time.h>

#define SIZE 10240

int array[SIZE][SIZE];

void test1(void) {
    int x, y;
    for (x = 0; x < SIZE; x++) {
        for (y = 0; y < SIZE; y++) {
            array[x][y] = 1024; // row-major access
        }
    }
}

void test2(void) {
    int x, y;
    for (x = 0; x < SIZE; x++) {
        for (y = 0; y < SIZE; y++) {
            array[y][x] = 1024; // column-major access
        }
    }
}

int main(void) {
    clock_t start, end;
    double cpu_time_used;

    start = clock();
    test1();
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("row-major access: %f\n", cpu_time_used);

    start = clock();
    test2();
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("column-major access: %f\n", cpu_time_used);

    return 0;
}
gcc arrayFast.c -o /tmp/a && /tmp/a
row-major access: 0.576199
column-major access: 2.444890
The article reported 0.5 s versus 10.3 s; on my machine (GCC 11.4) the gap is smaller.
How it works
Data moves between the cache and main memory not byte by byte, but in units of a cache line.
The cache line size varies across CPUs; 64 bytes is typical. A C array stores all of its elements in contiguous memory, and a multidimensional C array is laid out in row-major order.
The article demonstrates analyzing this with perf (the two perf stat runs below appear to profile the row-major and column-major versions built as separate binaries, /tmp/a and /tmp/b):
sudo apt update
sudo apt install -y linux-tools-common linux-tools-generic linux-cloud-tools-generic
# Without the last two packages, perf complains:
# perf --version
# WARNING: perf not found for kernel 5.15.0-87
#
# You may need to install the following packages for this specific kernel:
# linux-tools-5.15.0-87-generic
# linux-cloud-tools-5.15.0-87-generic
#
# You may also want to install one of the following packages to keep up to date:
# linux-tools-generic
# linux-cloud-tools-generic
sudo perf stat -d /tmp/a
Performance counter stats for '/tmp/a':
511.27 msec task-clock # 0.998 CPUs utilized
5 context-switches # 9.780 /sec
1 cpu-migrations # 1.956 /sec
102,453 page-faults # 200.389 K/sec
1,304,037,593 cycles # 2.551 GHz (49.93%)
1,997,636,690 instructions # 1.53 insn per cycle (62.44%)
176,746,327 branches # 345.700 M/sec (62.45%)
104,734 branch-misses # 0.06% of all branches (62.46%)
543,705,000 L1-dcache-loads # 1.063 G/sec (60.30%)
8,549,186 L1-dcache-load-misses # 1.57% of all L1-dcache accesses (25.03%)
1,035,710 LLC-loads # 2.026 M/sec (25.02%)
557,731 LLC-load-misses # 53.85% of all LL-cache accesses (37.52%)
0.512050821 seconds time elapsed
0.347952000 seconds user
0.163977000 seconds sys
sudo perf stat -d /tmp/b
Performance counter stats for '/tmp/b':
2,454.55 msec task-clock # 1.000 CPUs utilized
19 context-switches # 7.741 /sec
5 cpu-migrations # 2.037 /sec
102,452 page-faults # 41.740 K/sec
6,191,911,818 cycles # 2.523 GHz (49.82%)
1,986,048,111 instructions # 0.32 insn per cycle (62.36%)
177,822,350 branches # 72.446 M/sec (62.37%)
194,485 branch-misses # 0.11% of all branches (62.53%)
548,174,579 L1-dcache-loads # 223.330 M/sec (62.24%)
217,998,941 L1-dcache-load-misses # 39.77% of all L1-dcache accesses (25.09%)
39,386,881 LLC-loads # 16.046 M/sec (24.93%)
1,255,181 LLC-load-misses # 3.19% of all LL-cache accesses (37.31%)
2.455436836 seconds time elapsed
2.291020000 seconds user
0.164216000 seconds sys
As the numbers show, switching to column-major access pushes the L1 dcache miss rate from 1.57% up to 39.77%.
The article explains that because the CPU cache loads data one cache line at a time, column-major traversal touches a new line on almost every access, causing many more misses; the data must then repeatedly be fetched from memory, so the loop runs slower.
Going further
The same logic translated into Go shows a similar pattern:
go run arrayFast.go
row-major access took 414.485988ms
column-major access took 2.5762119s
row-major access took 98.654441ms
column-major access took 2.564306352s