
The Most-Copied Code Snippet on StackOverflow

Author: catroll
Date: 2023-11-28
Tags: Developer, Java

According to "The most copied StackOverflow snippet of all time is flawed!", the most-copied piece of code on StackOverflow comes from the question "How can I convert byte size into a human-readable format in Java?". It converts a byte count into a readable form, for example 1024 -> 1KB.

The Flawed Code

public static String humanReadableByteCount(long bytes, boolean si) {
    int unit = si ? 1000 : 1024;
    if (bytes < unit) return bytes + " B";
    int exp = (int) (Math.log(bytes) / Math.log(unit));
    String pre = (si ? "kMGTPE" : "KMGTPE").charAt(exp - 1) + (si ? "" : "i");
    return String.format("%.1f %sB", bytes / Math.pow(unit, exp), pre);
}

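The International System of Units (SI) is decimal: 1 kB = 1000 B, 1 MB = 1000 kB, 1 GB = 1000 MB, and so on.
Binary units use powers of 1024: 1 KiB = 1024 B, 1 MiB = 1024 KiB, 1 GiB = 1024 MiB, and so on.
String formatting rounds the value: 550 kB is 0.55 MB, and formatting 0.55 with %.1f (one decimal place) gives 0.6.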
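
To make the two unit ladders and the rounding step concrete, here is a small Python sketch. The helper name `unit_ladder` is made up for illustration, and `ROUND_HALF_UP` from the `decimal` module is used on the assumption that it matches how `%.1f` rounds the decimal value in the Java snippet.

```python
from decimal import Decimal, ROUND_HALF_UP


def unit_ladder(unit: int, prefixes: str, suffix: str) -> dict[str, int]:
    """Map each unit name to its size in bytes (hypothetical helper)."""
    return {f"{p}{suffix}": unit ** (i + 1) for i, p in enumerate(prefixes)}


si = unit_ladder(1000, "kMG", "B")       # kB, MB, GB
binary = unit_ladder(1024, "KMG", "iB")  # KiB, MiB, GiB

print(si)      # {'kB': 1000, 'MB': 1000000, 'GB': 1000000000}
print(binary)  # {'KiB': 1024, 'MiB': 1048576, 'GiB': 1073741824}

# 550 kB expressed in MB is 0.55; keeping one decimal place rounds it up.
in_mb = Decimal(550) / Decimal(1000)
one_decimal = in_mb.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
print(one_decimal)  # 0.6
```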

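Most of the time this is fine, but the output gets confusing just below the next unit, for example:
Near 1 MB (SI): values >= 999950 are shown as 1000.0 kB
Near 1 MiB (binary): values >= 1048525 are shown as 1024.0 KiB

What happens is that once the fractional part reaches .95, rounding carries into the integer part, but the unit prefix is not promoted to the next level, so the result looks wrong.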
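
The boundary behavior can be reproduced with a rough Python port of the original snippet. This is a sketch under assumptions: the name `flawed_format` is hypothetical, the exponent is found with an integer loop instead of `Math.log`, and `decimal` with `ROUND_HALF_UP` stands in for Java's `%.1f`. It only handles non-negative input, like the original.

```python
from decimal import Decimal, ROUND_HALF_UP


def flawed_format(n: int, si: bool = True) -> str:
    """Rough port of the original snippet, for non-negative n only."""
    unit = 1000 if si else 1024
    if n < unit:
        return f"{n} B"
    # Integer stand-in for (int) (Math.log(bytes) / Math.log(unit)).
    exp, rest = 0, n
    while rest >= unit:
        rest //= unit
        exp += 1
    prefix = ("kMGTPE" if si else "KMGTPE")[exp - 1] + ("" if si else "i")
    # The prefix is already fixed at this point; rounding happens afterwards.
    value = (Decimal(n) / Decimal(unit) ** exp).quantize(
        Decimal("0.1"), rounding=ROUND_HALF_UP
    )
    return f"{value} {prefix}B"


print(flawed_format(999_949))           # 999.9 kB
print(flawed_format(999_950))           # 1000.0 kB
print(flawed_format(1_048_525, False))  # 1024.0 KiB
```

Because the prefix is chosen before rounding, the rounded number can reach the unit size while the prefix stays one level too low.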

Solution

The author later updated the code to fix this: a value above 999949 is no longer shown as 1000.0 kB but as 1.0 MB.

public static strictfp String humanReadableByteCount(long bytes, boolean si) {
    int unit = si ? 1000 : 1024;
    long absBytes = bytes == Long.MIN_VALUE ? Long.MAX_VALUE : Math.abs(bytes);
    if (absBytes < unit) return bytes + " B";
    int exp = (int) (Math.log(absBytes) / Math.log(unit));
    long th = (long) Math.ceil(Math.pow(unit, exp) * (unit - 0.05));
    if (exp < 6 && absBytes >= th - ((th & 0xFFF) == 0xD00 ? 51 : 0)) exp++;
    String pre = (si ? "kMGTPE" : "KMGTPE").charAt(exp - 1) + (si ? "" : "i");
    if (exp > 4) {
        bytes /= unit;
        exp -= 1;
    }
    return String.format("%.1f %sB", bytes / Math.pow(unit, exp), pre);
}
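
The idea behind the fix can be sketched in Python without the floating-point special cases. This is not the author's implementation (which works on doubles and patches specific edge cases), only an illustration of "round first, then promote the prefix if the rounded value reaches the unit". The name `promoted_format` is hypothetical, and the sketch assumes the input fits in a signed 64-bit integer like a Java `long`.

```python
from decimal import Decimal, ROUND_HALF_UP

ONE_DECIMAL = Decimal("0.1")


def promoted_format(n: int, si: bool = True) -> str:
    """Round first, then move up one prefix if the rounded value hits the unit.

    Assumes n fits in a signed 64-bit integer, like a Java long.
    """
    unit = 1000 if si else 1024
    if abs(n) < unit:
        return f"{n} B"
    prefixes = "kMGTPE" if si else "KMGTPE"
    # Largest exponent such that unit**exp <= abs(n), using integers only.
    exp, rest = 0, abs(n)
    while rest >= unit:
        rest //= unit
        exp += 1

    def rounded(e: int) -> Decimal:
        return (Decimal(n) / Decimal(unit) ** e).quantize(
            ONE_DECIMAL, rounding=ROUND_HALF_UP
        )

    value = rounded(exp)
    # If rounding carried the value up to a full unit, use the next prefix.
    if abs(value) >= unit and exp < len(prefixes):
        exp += 1
        value = rounded(exp)
    return f"{value} {prefixes[exp - 1]}{'' if si else 'i'}B"


print(promoted_format(999_949))           # 999.9 kB
print(promoted_format(999_950))           # 1.0 MB
print(promoted_format(1_048_525, False))  # 1.0 MiB
```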

Production Use

For production use, the author recommends the following versions:

SI (1 k = 1,000)

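// Both production versions need these imports:
// import java.text.CharacterIterator;
// import java.text.StringCharacterIterator;
public static String humanReadableByteCountSI(long bytes) {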
    if (-1000 < bytes && bytes < 1000) {
        return bytes + " B";
    }
    CharacterIterator ci = new StringCharacterIterator("kMGTPE");
    while (bytes <= -999_950 || bytes >= 999_950) {
        bytes /= 1000;
        ci.next();
    }
    return String.format("%.1f %cB", bytes / 1000.0, ci.current());
}
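
For readers who want to step through the loop, here is a hedged Python port of the SI version. The name `human_readable_si` is hypothetical, and the sketch assumes the input fits in a signed 64-bit integer. Two differences from Java are worth noting: Python's `//` floors instead of truncating, so the magnitude is divided and the sign restored, and Python's float formatting may round exact .x5 ties differently from Java, so only non-tie values are shown.

```python
def human_readable_si(n: int) -> str:
    """Python port of the SI production version (64-bit input assumed)."""
    if -1000 < n < 1000:
        return f"{n} B"
    prefixes = "kMGTPE"
    idx = 0
    # 999_950 is the smallest value that "%.1f" would display as 1000.0,
    # so anything at or beyond it is moved to the next prefix.
    while n <= -999_950 or n >= 999_950:
        # Java's long division truncates toward zero; Python's // floors,
        # so divide the magnitude and restore the sign.
        n = (abs(n) // 1000) * (1 if n > 0 else -1)
        idx += 1
    return f"{n / 1000.0:.1f} {prefixes[idx]}B"


print(human_readable_si(999_949))    # 999.9 kB
print(human_readable_si(999_950))    # 1.0 MB
print(human_readable_si(1_500_000))  # 1.5 MB
```

The threshold check happens on integers before any division to floating point, which is why the promotion no longer depends on how the formatter rounds.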

Binary (1 Ki = 1,024)

public static String humanReadableByteCountBin(long bytes) {
    long absB = bytes == Long.MIN_VALUE ? Long.MAX_VALUE : Math.abs(bytes);
    if (absB < 1024) {
        return bytes + " B";
    }
    long value = absB;
    CharacterIterator ci = new StringCharacterIterator("KMGTPE");
    for (int i = 40; i >= 0 && absB > 0xfffccccccccccccL >> i; i -= 10) {
        value >>= 10;
        ci.next();
    }
    value *= Long.signum(bytes);
    return String.format("%.1f %ciB", value / 1024.0, ci.current());
}
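
One way to read the constant 0xfffccccccccccccL in the loop condition (an interpretation, not something the source states) is as the integer part of 1023.95 × 2^50. Shifting it right by i then gives the integer part of 1023.95 × 1024^k for each prefix level k, which is the last value that still displays below 1024.0. The Python check below uses only integer arithmetic; the names `MAGIC` and `thresholds` are made up for illustration.

```python
MAGIC = 0xfffcccccccccccc  # constant from the loop in humanReadableByteCountBin

thresholds = {}
# The loop uses i = 40, 30, 20, 10, 0, which corresponds to k = 1..5.
for k in range(1, 6):
    shift = 50 - 10 * k
    # floor(1023.95 * 1024**k), computed with integers only.
    expected = 102395 * 1024 ** k // 100
    thresholds[k] = MAGIC >> shift
    print(k, thresholds[k], thresholds[k] == expected)

# Level 1: absB > 1048524 means absB >= 1048525, the first value that
# one-decimal rounding would display as 1024.0 KiB.
print(MAGIC >> 40)  # 1048524
```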

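Thoughts

This problem is not specific to Java; the same pitfall is likely to show up in other languages, so it is worth watching for.
That said, in this scenario, would truncating instead of rounding to the final decimal places match more people's expectations? For example, 99.999999 MB kept to two decimal places would then be 99.99 MB.

A Python example: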

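import math


def size4human(size_bytes: int, si: bool = True) -> str:
    """Format a byte count, truncating (not rounding) to two decimals."""
    unit = 1000 if si else 1024
    if size_bytes < unit:
        return f"{size_bytes} B"
    exp = int(math.log(size_bytes) / math.log(unit))
    # Prefix letter first, then "i" for binary units: "KiB", not "ikB".
    pre = ("kMGTPE" if si else "KMGTPE")[exp - 1] + ("" if si else "i")
    # Integer floor division truncates to hundredths before any float is involved.
    return f"{size_bytes * 100 // unit**exp / 100:.2f} {pre}B"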
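
The difference between the two policies can also be seen in isolation with the standard `decimal` module. This is only a side-by-side illustration of truncation (`ROUND_DOWN`) versus rounding (`ROUND_HALF_UP`) on the 99.999999 example; the variable names are arbitrary.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

value = Decimal("99.999999")
two_places = Decimal("0.01")

truncated = value.quantize(two_places, rounding=ROUND_DOWN)
rounded = value.quantize(two_places, rounding=ROUND_HALF_UP)

print(truncated)  # 99.99
print(rounded)    # 100.00
```

With truncation the displayed number never reaches the next unit, so no prefix promotion is needed; the trade-off is that the output can understate the size by up to one unit in the last place.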

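All articles published on 码厩技术博客 are the author's original work unless marked as reposts. Reposting is welcome, but the source must be credited.
Respect the work of others and build the open-source community together! When reposting, please include the following:
Source: 码厩技术博客 [https://www.markjour.com]
Original title: StackOverflow 上被复制最多的代码
Original URL: /article/20231128-most-copied-stackoverflow-snippet.html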
