#902 F12 Network Filters

2023-07-16

Examples

mime-type:image/png larger-than:1K
# Official Edge example: show PNG images larger than 1K

domain:*.csdn.net method:POST
# Show POST requests to CSDN domains.

has-response-header:Content-Type -domain:*.baidu.com
# Show requests carrying a Content-Type header, excluding Baidu domains.

Reference

Property Details
domain Only show resources from the specified domain. You can use a wildcard (*) to match multiple domains.
For example, *.com shows resources from every domain name ending in .com.
DevTools populates the autocomplete dropdown with all domains it has found.
has-response-header Show resources that contain the specified HTTP response header.
DevTools populates the autocomplete dropdown with all response headers it has found.
is Use is:running to find WebSocket resources.
larger-than Show resources larger than the specified size, in bytes.
Setting a value of 1000 is equivalent to setting 1k.
method Show resources retrieved via the specified HTTP method.
DevTools populates the dropdown with all HTTP methods it has found.
mime-type Show resources of the specified MIME type.
DevTools populates the dropdown with all MIME types it has found.
mixed-content Show all mixed-content resources (mixed-content:all), or only those currently displayed (mixed-content:displayed).
scheme Show resources retrieved over insecure HTTP (scheme:http) or secure HTTPS (scheme:https).
set-cookie-domain Show resources whose Set-Cookie header has a Domain attribute matching the specified value.
DevTools populates the autocomplete with all cookie domains it has found.
set-cookie-name Show resources whose Set-Cookie header has a name matching the specified value.
DevTools populates the autocomplete with all cookie names it has found.
set-cookie-value Show resources whose Set-Cookie header has a value matching the specified value.
DevTools populates the autocomplete with all cookie values it has found.
status-code Show resources matching a specific HTTP status code.
DevTools populates the autocomplete dropdown with all status codes it has found.
  1. Multiple conditions are combined with AND; OR is not supported yet.
  2. Prefix a filter with - to negate it. Tip: type - and use the autocomplete to see every supported property.
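The properties above compose freely; for instance, a combined filter with negation might look like this (a made-up example in the same style as the ones above):

```
-method:GET larger-than:10k mime-type:application/javascript
# Exclude GET requests; show only JavaScript responses larger than 10k.
```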

References and Further Reading

#901 "Upstream" and "Downstream" in Software Development

2023-07-13

Think of a river: the notions of upstream and downstream are obvious, since water flows from upstream to downstream.

On a factory floor, the successive production steps form a river through which products flow: earlier steps are upstream, later steps are downstream.
Downstream depends on upstream.

In software design, services (SOA, microservices) form a river through which data flows: services that process the data earlier are upstream, those that process it later are downstream.
Downstream depends on upstream.
Note: how the data actually moves doesn't matter. Upstream may push it, downstream may pull it, and the two sides may even call each other's APIs; none of this changes the direction of the data dependency.
In short, upstream sits near the ingress gateway and downstream near the egress gateway.

In open-source projects, fork relationships form a river through which code flows: the forked-from project is upstream, the fork is downstream.
Downstream depends on upstream.
Of course, changes made downstream may also flow back upstream.

#900 Repost: Why Did Harvard Lose on America's Most Controversial Issue?

2023-07-11

Two weeks ago, one of the most contested policies in American society, affirmative action, practiced since the 1960s, was struck down.

Ruling on cases accusing Harvard University and the University of North Carolina of weighing race in admissions, the U.S. Supreme Court held that the Constitution forbids universities from considering an applicant's race in the admissions process. The decision is seen as another landmark ruling after the overturning of the abortion-rights precedent; some commentators argue it mostly reflects the overwhelming strength of the Court's conservative wing and a broad conservative resurgence.

Pang Ying and Zhan Qingyun, two program hosts at 看理想, recorded a special episode on the overturning of affirmative action. Following the format of their show 《思辨力 35 讲》, they debated the controversy and the pros and cons of affirmative action, one taking the supporters' side and the other the opponents'. The questions at stake concern not only racial justice and educational equality, but also the fairness and decency of society.

#899 The PEP 703 Proposal

2023-07-11

Python's internal GIL (Global Interpreter Lock) makes multithreaded Python development simpler by preventing most race conditions, but it keeps a Python process from using multiple processor cores, so every workload that needs parallelism is constrained.
The official advice: if you need multiple cores, use multiprocessing.

When Python first appeared there were no multi-core CPUs, but multi-core has long since become mainstream; 4 or 8 cores are commonplace.

So there has never been a shortage of calls to remove the GIL, yet none has succeeded. Doing it without badly disrupting existing projects is extraordinarily hard.
PS: A few years ago, a project called Gilectomy (GIL-ectomy, surgical removal of the GIL) failed because it severely degraded Python's performance.

The biggest challenge is staying compatible with existing code and protecting Python's core appeal: easy to pick up, easy to develop with.
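The constraint can be seen with a quick benchmark sketch (not from the original post; sizes and worker counts are arbitrary): the same CPU-bound function run on a thread pool and a process pool. Under the GIL, the thread pool takes roughly serial time, while the process pool can spread the work across cores.

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def count_down(n: int) -> int:
    # Pure-Python CPU-bound loop: the GIL serializes it across threads.
    while n > 0:
        n -= 1
    return n

def timed(pool_cls, n: int = 500_000, workers: int = 4) -> float:
    # Run the same job on `workers` workers and return the wall-clock time.
    t0 = time.perf_counter()
    with pool_cls(max_workers=workers) as pool:
        list(pool.map(count_down, [n] * workers))
    return time.perf_counter() - t0

if __name__ == "__main__":
    # Expect the thread-pool time to be close to 4x a single call,
    # and the process-pool time to be much closer to 1x (given free cores).
    print(f"threads:   {timed(ThreadPoolExecutor):.2f}s")
    print(f"processes: {timed(ProcessPoolExecutor):.2f}s")
```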

  • In 2021, Shannon Plan (Faster CPython) member Eric Snow submitted PEP 684, A Per-Interpreter GIL.
    This should be a fairly safe approach, though I doubt whether the cost/benefit works out.
    PS: It does not conflict with the nogil approach mentioned below.

  • PEP 683 – Immortal Objects, Using a Fixed Refcount

  • PEP 554 – Multiple Interpreters in the Stdlib

  • In 2021, Meta (Facebook) developer Sam Gross created a nogil fork based on Python 3.9, which ultimately proved the approach technically feasible: unlike Gilectomy, single-core performance did not suffer much, and scalability was good.

  • In May 2022, Sam Gross presented his nogil project at the Python Language Summit and proposed removing the GIL.
  • In January 2023, Sam Gross submitted a new proposal, PEP 703, Making the Global Interpreter Lock Optional in CPython.

This is the closest we have ever been to removing the GIL.
If the proposal passes, within a few years we will be able to use a GIL-free Python.

Issues

  1. PEP 703 introduces --disable-gil to build a GIL-free Python.
    But this means some Python libraries will have to ship two kinds of package builds from then on: one for the nogil runtime and one for the GIL runtime.

References and Further Reading

#897 Go 1.21 for-Loop Semantics Change

2023-07-05

Study the examples below and the problem becomes clear:

Example 1

package main

import "fmt"

func main() {
    items := []int{1, 2, 3}

    {
        var all []*int
        for _, item := range items {
            all = append(all, &item)
        }
        fmt.Printf("%+v\n", all)
        // [0xc00008c018 0xc00008c018 0xc00008c018]
        // All three pointers are the same, so they all print the last value!
        for _, i := range all {
            fmt.Printf("%+v, %+v\n", i, *i)
        }
    }

    // fix it:
    {
        var all []*int
        for _, item := range items {
            item := item // the key line: shadow the loop variable
            all = append(all, &item)
        }
        for _, i := range all {
            fmt.Printf("%+v, %+v\n", i, *i)
        }
    }
}

Example 2

package main

import "fmt"

func main() {
    {
        var prints []func()
        for _, v := range []int{1, 2, 3} {
            prints = append(prints, func() { fmt.Println(v) })
        }
        for _, print := range prints {
            print()
        }
    }
    // Prints 3 3 3 instead of 1 2 3. Why?

    // fix it:
    {
        var prints []func()
        for _, v := range []int{1, 2, 3} {
            v := v // the key line: shadow the loop variable
            prints = append(prints, func() { fmt.Println(v) })
        }
        for _, print := range prints {
            print()
        }
    }
}

Example 3

package main

import (
    "fmt"
    "sync"
)

func main() {
    items := []int{1, 2, 3}

    {
        wg := sync.WaitGroup{}
        for _, v := range items {
            wg.Add(1)
            go func() {
                // go vet warns: loop variable v captured by func literal
                fmt.Println(v)
                wg.Done()
            }()
        }
        wg.Wait()
    }

    // fix it:
    {
        wg := sync.WaitGroup{}
        for _, v := range items {
            wg.Add(1)
            v := v // the key line: shadow the loop variable
            go func() {
                fmt.Println(v)
                wg.Done()
            }()
        }
        wg.Wait()
    }
}

This example can also be rewritten as:

package main

import (
    "fmt"
)

func main() {
    items := []int{1, 2, 3}
    done := make(chan bool)
    {
        for _, v := range items {
            go func() {
                // go vet warns: loop variable v captured by func literal
                fmt.Println(v)
                done <- true
            }()
        }
        for range items {
            <-done
        }
    }

    // fix it:
    {
        for _, v := range items {
            v := v // the key line: shadow the loop variable
            go func() {
                fmt.Println(v)
                done <- true
            }()
        }
        for range items {
            <-done
        }
    }
}

My Understanding

By Go's design philosophy, a pair of braces delimits an independent scope, and each iteration of a for loop should be independent as well:
the variable in the current iteration should be distinct from the variable in the previous iteration.
In practice, though, the output shows that they share the same address.

Closures presumably behave accordingly: they go back to the original location to read the variable, but by then later iterations have written new values there.

This design is a huge trap, and very unfriendly to newcomers.

Starting with 1.21, the revised semantics can be enabled via the environment variable GOEXPERIMENT=loopvar.
Starting with 1.22, the for semantics officially change.

Russ Cox argues that the current semantics carry a high cost: they go wrong more often than they go right.

This change to for semantics does conflict with earlier promises (the compatibility guarantee rests on never redefining semantics).
Fortunately, Go is well prepared for this situation: within a single build, each package follows the language semantics declared by the go directive in its own module's mod file.
For example, if package A requires 1.22 and depends on package B, which requires 1.18, then code in A follows the new definition while code in B follows the old one.
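The per-module rule can be sketched with two hypothetical go.mod files (module paths invented for illustration; shown together in one block for brevity):

```
// go.mod of module example.com/a -- its code gets the new (1.22) loop semantics
module example.com/a

go 1.22

require example.com/b v1.0.0

// go.mod of module example.com/b -- its code keeps the old loop semantics
module example.com/b

go 1.18
```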

https://github.com/golang/go/discussions/56010

#896 The Brothers Karamazov

2023-07-04

The Brothers Karamazov (Russian: Бра́тья Карама́зовы) is the last novel by the Russian author Fyodor Dostoevsky, generally regarded as the crowning achievement of his literary career. This monumental work was completed in 1880 after two years of serialization in The Russian Messenger. Dostoevsky had envisioned it as the first part of an even grander work, The Life of a Great Sinner, but never realized that plan: he died only four months after finishing The Brothers Karamazov.

I haven't read the book, but this line still struck me:

Love people in the concrete, not people in the abstract. Love life itself, not the meaning of life.

Luo Xiang's commentary

A recurring tendency among intellectuals is that we prefer abstract concepts to concrete things,
yet the more someone loves humanity in the abstract, the harder it often is for them to show love to concrete people.
The abstract person is beautiful; the abstract person exists among ideals, while every concrete person is flawed. That is why the more beautiful you find people in the abstract, the more detestable, contemptible, and shameful you will find the concrete people around you.

A little story

I remember a story from my childhood: a boy gets up in the morning and sits quietly, pondering what he can do today for his father, mother, and little brother. But when each of them comes by to ask for his help, he snaps every time: "I'm thinking!"

References and Further Reading

#895 Windows 10 Dev Environment Reinstall Notes

2023-07-03

Base system: Windows 10 (with Edge)

  1. Update drivers

  2. Activate Windows

  3. Windows Update
    PS: winget becomes available after the system updates

  4. Link the user-profile folders to a secondary disk
    Videos, Pictures, Documents, Downloads, Music

  5. Run a round of 360 "optimization"

  6. 360 Security Guard (360安全卫士)

  7. Driver Master (驱动大师)
  8. 360 Zip (360压缩)

  9. Copy over the .ssh and .config directories

  10. Set up the network proxy + office VPN

  11. Install software

In the Microsoft Store, winget is called "App Installer".
Windows Terminal can also be found in the Microsoft Store.

winget install Microsoft.VisualStudioCode
winget install Microsoft.Edge
winget install Microsoft.WindowsTerminal

winget install Git.Git
winget install qishibo.AnotherRedisDesktopManager # just import the saved config
winget install ScooterSoftware.BeyondCompare4
winget install KeePassXCTeam.KeePassXC
winget install Apifox.Apifox

# winget install Tencent.wechat-work
winget install Tencent.WeCom
winget install Tencent.WeChat

winget install Python.Python.3.8
winget install GoLang.Go

winget install voidtools.Everything
winget install hluk.CopyQ

winget install Nutstore.Nutstore
winget install NetEase.YoudaoNote # Youdao Note
winget install NetEase.CloudMusic # NetEase Cloud Music

# winget install iFlytek.iFlyIME # iFlytek IME
winget install Rime.Weasel # Weasel input method (Rime)

# Not needed; traffic is relayed through the laptop
# winget install OpenVPNTechnologies.OpenVPNConnect

Other software:

  • Filezilla Client

No install needed:

  • GFW circumvention: relayed through the laptop
  • HeidiSQL (portable version)

Edge Browser

  1. Auto-update: ... > Help and feedback > About Microsoft Edge
  2. Sign in; everything syncs automatically
  3. Sync takes a while, so install SwitchyOmega first:
    https://microsoftedge.microsoft.com/addons/search/switchyomega?hl=zh-CN

VSCode

  1. Sign in; everything syncs automatically
  2. Sync takes a while, so install Remote - SSH first

Windows Terminal

  1. Add an SSH session to the dev machine and make it the default profile
  2. Memorize the Git Bash shortcut

When connecting to zsh on the dev machine, Home / End don't work; only Ctrl + A / Ctrl + E do.

  1. bash is fine
  2. Connecting to zsh from PuTTY is also fine

After some experimenting, it turns out that using the ssh bundled with Git fixes it:

# C:\Windows\System32\OpenSSH\ssh.exe
C:\Users\Administrator>ssh -V
OpenSSH_for_Windows_7.7p1, LibreSSL 2.6.5

# C:\Program Files\Git\usr\bin\ssh.exe
C:\Users\Administrator>"C:\Program Files\Git\usr\bin\ssh.exe" -V
OpenSSH_9.0p1, OpenSSL 1.1.1p  21 Jun 2022

# For comparison, the SSH version on the dev machine (Ubuntu 22.10):
ssh -V
OpenSSH_8.9p1 Ubuntu-3ubuntu0.1, OpenSSL 3.0.2 15 Mar 2022
"C:\Program Files\Git\usr\bin\ssh.exe" markjour@172.16.0.49 -F C:\Users\Administrator\.ssh\configwin

Git Bash

~/.bash_profile

test -f ~/.profile && . ~/.profile
test -f ~/.bashrc && . ~/.bashrc

~/.bashrc

source ~/Projects/StdEnv/aliases/main.sh

Format the old disk and overwrite it repeatedly:

dd if=/dev/zero of=/e/bigfile bs=10M

#894 Repost: Making Python 100x faster with less than 100 lines of Rust

2023-06-28

A while ago at $work, we had a performance issue with one of our core Python libraries.

This particular library forms the backbone of our 3D processing pipeline. It’s a rather big and complex library which uses NumPy and other scientific Python packages to do a wide range of mathematical and geometrical operations.

Our system also has to work on-prem with limited CPU resources, and while at first it performed well, as the number of concurrent physical users grew we started running into problems and our system struggled to keep up with the load.

We came to the conclusion that we had to make our system at least 50 times faster to handle the increased workload, and we figured that Rust could help us achieve that.

Because the performance problems we encountered are pretty common, we can recreate & solve them right here, in a (not-so-short) article.

So grab a cup of tea (or coffee) and I’ll walk you through (a) the basic underlying problem and (b) a few iterations of optimizations we can apply to solve this problem.

If you want to jump straight to the final code, just go to the summary.

Our running example

Let’s create a small library, which will exhibit our original performance issues (but does completely arbitrary work).

Imagine you have a list of polygons and a list of points, all in 2D. For business reasons, we want to “match” each point to a single polygon.

Our imaginary library is going to:

  1. Start with an initial list of points and polygons (all in 2D).
  2. For each point, find a much smaller subset of polygons that are closest to it, based on distance from the center.
  3. Out of those polygons, select the “best” one (we are going to use “smallest area” as “best”).

In code, that’s going to look like this (The full code can be found here):

from typing import List, Tuple
import numpy as np
from dataclasses import dataclass
from functools import cached_property

Point = np.array

@dataclass
class Polygon:
    x: np.array
    y: np.array

    @cached_property
    def center(self) -> Point: ...
    def area(self) -> float: ...

def find_close_polygons(polygon_subset: List[Polygon], point: Point, max_dist: float) -> List[Polygon]:
    ...

def select_best_polygon(polygon_sets: List[Tuple[Point, List[Polygon]]]) -> List[Tuple[Point, Polygon]]:
    ...

def main(polygons: List[Polygon], points: np.ndarray) -> List[Tuple[Point, Polygon]]:
    ...

The key difficulty (performance wise) is this mix of Python objects and numpy arrays.

We are going to analyze this in depth in a minute.

It’s worth noting that converting parts (or all) of this toy library to vectorized numpy might be possible, but would be nearly impossible for the real library, would make the code much less readable and modifiable, and the gains would be limited (here’s a partially vectorized version, which is faster but far from the results we are going to achieve).

Also, using any JIT-based tricks (PyPy / numba) results in very small gains (as we will measure, just to make sure).

Why not just Rewrite It (all) In Rust™?

As compelling as a complete rewrite was, it had a few problems:

  1. The library was already using numpy for a lot of its calculations, so why should we expect Rust to be better?
  2. It is big and complex and very business critical and highly algorithmic, so that would take ~months of work, and our poor on-prem server is dying today.
  3. A bunch of friendly researchers are actively working on said library, implementing better algorithms and doing a lot of experiments. They aren’t going to be very happy to learn a new programming language, waiting for things to compile and fighting with the borrow checker. They would appreciate us not moving their cheese too far.

Dipping our toes

It is time to introduce our friend the profiler.

Python has a built-in profiler (cProfile), but in this case it’s not really the right tool for the job:

  1. It’ll introduce a lot of overhead to all the Python code, and none for native code, so our results might be biased.
  2. We won’t be able to see into native frames, meaning we aren’t going to be able to see into our Rust code.

We are going to use py-spy (GitHub).

py-spy is a sampling profiler which can see into native frames.

They also mercifully publish pre-built wheels to pypi, so we can just pip install py-spy and get to work.

We also need something to measure.

# measure.py
import time
import poly_match
import os

# Reduce noise, actually improve perf in our case.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

polygons, points = poly_match.generate_example()

# We are going to increase this as the code gets faster and faster.
NUM_ITER = 10

t0 = time.perf_counter()
for _ in range(NUM_ITER):
    poly_match.main(polygons, points)
t1 = time.perf_counter()

took = (t1 - t0) / NUM_ITER
print(f"Took an avg of {took * 1000:.2f}ms per iteration")

It’s not very scientific, but it’s going to take us very far.

“Good benchmarking is hard. Having said that, do not stress too much about having a perfect benchmarking setup, particularly when you start optimizing a program.”

~ Nicholas Nethercote, in “The Rust Performance Book”

Running this script will give us our baseline:

$ python measure.py
Took an avg of 293.41ms per iteration

For the original library, we used 50 different examples to make sure all cases are covered.

This matched the overall system perf, meaning we can start working on crushing this number.

Side note: We can also measure using PyPy (we’ll also add a warmup to allow the JIT to do its magic).

$ conda create -n pypyenv -c conda-forge pypy numpy && conda activate pypyenv
$ pypy measure_with_warmup.py
Took an avg of 1495.81ms per iteration

Measure first

So, let’s find out what is so slow here.

$ py-spy record --native -o profile.svg -- python measure.py
py-spy> Sampling process 100 times a second. Press Control-C to exit.

Took an avg of 365.43ms per iteration

py-spy> Stopped sampling because process exited
py-spy> Wrote flamegraph data to 'profile.svg'. Samples: 391 Errors: 0

Already, we can see that the overhead is pretty small. Just for comparison, using cProfile we get this:

$ python -m cProfile measure.py
Took an avg of 546.47ms per iteration
         7551778 function calls (7409483 primitive calls) in 7.806 seconds
         ...

We get this nice, reddish graph called a flamegraph:

Each box is a function, and we can see the relative time we spend in each function, including the functions it calls (going down the graph/stack). Try clicking on the norm box to zoom into it.

Here, the main takeaways are:

  1. The vast majority of time is spent in find_close_polygons.
  2. Most of that time is spent doing norm, which is a numpy function.

So, let’s have a look at find_close_polygons:

def find_close_polygons(
    polygon_subset: List[Polygon], point: np.array, max_dist: float
) -> List[Polygon]:
    close_polygons = []
    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)

    return close_polygons

We are going to rewrite this function in Rust.

Before diving into the details, it’s important to notice a few things here:

  1. This function accepts & returns complex objects (Polygon, np.array).
  2. The size of the objects is non-trivial (so copying stuff might cost us).
  3. This function is called “a lot” (so overhead we introduce is probably going to matter).

My first Rust module

pyo3 is a crate for interacting between Python and Rust. It has exceptionally good documentation, and they explain the basic setup here.

We are going to call our crate poly_match_rs, and add a function called find_close_polygons.

mkdir poly_match_rs && cd "$_"
pip install maturin
maturin init --bindings pyo3
maturin develop

Starting out, our crate is going to look like this:

use pyo3::prelude::*;

#[pyfunction]
fn find_close_polygons() -> PyResult<()> {
    Ok(())
}

#[pymodule]
fn poly_match_rs(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(find_close_polygons, m)?)?;
    Ok(())
}

We also need to remember to execute maturin develop every time we change the Rust library.

And that’s it! Let’s call our new function and see what happens.

>>> poly_match_rs.find_close_polygons(polygons, point, max_dist)
E TypeError: poly_match_rs.poly_match_rs.find_close_polygons() takes no arguments (3 given)

v1 - A naive Rust translation

We’ll start with matching the expected API.

PyO3 is pretty smart about Python to Rust conversions, so that’s going to be pretty easy:

#[pyfunction]
fn find_close_polygons(polygons: Vec<PyObject>, point: PyObject, max_dist: f64) -> PyResult<Vec<PyObject>> {
    Ok(vec![])
}

PyObject is (as the name suggests) a generic “anything goes” Python object. We’ll try to interact with it in a bit.

This should make the program run (albeit incorrectly).

I’m going to just copy and paste the original Python function, and fix the syntax.

#[pyfunction]
fn find_close_polygons(polygons: Vec<PyObject>, point: PyObject, max_dist: f64) -> PyResult<Vec<PyObject>> {
    let mut close_polygons = vec![];

    for poly in polygons {
        if norm(poly.center - point) < max_dist {
            close_polygons.push(poly)
        }
    }

    Ok(close_polygons)
}

Cool, but this won’t compile:

% maturin develop
...

error[E0609]: no field `center` on type `Py<PyAny>`
 --> src/lib.rs:8:22
  |
8 |         if norm(poly.center - point) < max_dist {
  |                      ^^^^^^ unknown field


error[E0425]: cannot find function `norm` in this scope
 --> src/lib.rs:8:12
  |
8 |         if norm(poly.center - point) < max_dist {
  |            ^^^^ not found in this scope


error: aborting due to 2 previous errors ] 58/59: poly_match_rs

We need three crates to implement our function:

# For Rust-native array operations.
ndarray = "0.15"

# For a `norm` function for arrays.
ndarray-linalg = "0.16"

# For accessing numpy-created objects, based on `ndarray`.
numpy = "0.18"

First, let’s turn the opaque and generic point: PyObject into something we can work with.

Just like we asked PyO3 for a “Vec of PyObjects”, we can ask for a numpy-array, and it’ll auto-convert the argument for us.

use numpy::PyReadonlyArray1;

#[pyfunction]
fn find_close_polygons(
    // An object which says "I have the GIL", so we can access Python-managed memory.
    py: Python<'_>,
    polygons: Vec<PyObject>,
    // A reference to a numpy array we will be able to access.
    point: PyReadonlyArray1<f64>,
    max_dist: f64,
) -> PyResult<Vec<PyObject>> {
    // Convert to `ndarray::ArrayView1`, a fully operational native array.
    let point = point.as_array();
    ...
}

Because point is now an ArrayView1, we can actually use it. For example:

// Make the `norm` function available.
use ndarray_linalg::Norm;

assert_eq!((point.to_owned() - point).norm(), 0.);

Now we just need to get the center of each polygon, and “cast” it to an ArrayView1.

In PyO3, this looks like this:

let center = poly
  .getattr(py, "center")?                 // Python-style getattr, requires a GIL token (`py`).
  .extract::<PyReadonlyArray1<f64>>(py)?  // Tell PyO3 what to convert the result to.
  .as_array()                             // Like `point` before.
  .to_owned();                            // We need one of the sides of the `-` to be "owned".

It’s a bit of a mouthful, but overall the result is a pretty clear line-to-line translation of the original code:

 1  use pyo3::prelude::*;
 2
 3  use ndarray_linalg::Norm;
 4  use numpy::PyReadonlyArray1;
 5
 6  #[pyfunction]
 7  fn find_close_polygons(
 8      py: Python<'_>,
 9      polygons: Vec<PyObject>,
10      point: PyReadonlyArray1<f64>,
11      max_dist: f64,
12  ) -> PyResult<Vec<PyObject>> {
13      let mut close_polygons = vec![];
14      let point = point.as_array();
15      for poly in polygons {
16          let center = poly
17              .getattr(py, "center")?
18              .extract::<PyReadonlyArray1<f64>>(py)?
19              .as_array()
20              .to_owned();
21
22          if (center - point).norm() < max_dist {
23              close_polygons.push(poly)
24          }
25      }
26
27      Ok(close_polygons)
28  }

vs the original:

def find_close_polygons(
    polygon_subset: List[Polygon], point: np.array, max_dist: float
) -> List[Polygon]:
    close_polygons = []
    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)

    return close_polygons

We expect this version to have some advantage over the original function, but how much?

$ (cd ./poly_match_rs/ && maturin develop)
$ python measure.py
Took an avg of 609.46ms per iteration

So.. Is Rust just super slow? No! We just forgot to ask for speed! If we run with maturin develop --release we get much better results:

$ (cd ./poly_match_rs/ && maturin develop --release)
$ python measure.py
Took an avg of 23.44ms per iteration

Now that is a nice speedup!

We also want to see into our native code, so we are going to enable debug symbols in release. While we are at it, we might as well ask for maximum speed.

# added to Cargo.toml
[profile.release]
debug = true       # Debug symbols for our profiler.
lto = true         # Link-time optimization.
codegen-units = 1  # Slower compilation but faster code.

v2 - Rewrite even more in Rust

Now, using the --native flag in py-spy is going to show us both Python and our new native code.

Running py-spy again

$ py-spy record --native -o profile.svg -- python measure.py
py-spy> Sampling process 100 times a second. Press Control-C to exit.

we get this flamegraph (non-red colors were added so we can refer to them):

Looking at the profiler output, we can see a few interesting things:

  1. The relative size of find_close_polygons::...::trampoline (the symbol Python directly calls) and __pyfunction_find_close_polygons (our actual implementation): hovering over them, they are 95% vs 88% of samples, so the overhead is pretty small.
  2. The actual logic (if (center - point).norm() < max_dist { ... }), which is lib_v1.rs:22 (the very small box on the right), is about 9% of the total runtime, so a x10 improvement should still be possible!
  3. Most of the time is spent in lib_v1.rs:16, which is poly.getattr(...).extract(...), and if we zoom in we can see it is really just getattr and getting the underlying array using as_array.

The conclusion here is that we need to focus on solving the 3rd point, and the way to do that is to Rewrite Polygon in Rust.

Let’s look at our target:

@dataclass
class Polygon:
    x: np.array
    y: np.array
    _area: float = None

    @cached_property
    def center(self) -> np.array:
        centroid = np.array([self.x, self.y]).mean(axis=1)
        return centroid

    def area(self) -> float:
        if self._area is None:
            self._area = 0.5 * np.abs(
                np.dot(self.x, np.roll(self.y, 1)) - np.dot(self.y, np.roll(self.x, 1))
            )
        return self._area

We’ll want to keep the existing API as much as possible, but we don’t really need area to be that fast (for now).

The actual class might have additional complex stuff, like a merge method which uses ConvexHull from scipy.spatial.

To cut costs (and limit the scope of this already long article), we will only move the “core” functionality of Polygon to Rust, and subclass that from Python to implement the rest of the API.

Our struct is going to look like this:

// `Array1` is a 1d array, and the `numpy` crate will play nicely with it.
use ndarray::Array1;

// `subclass` tells PyO3 to allow subclassing this in Python.
#[pyclass(subclass)]
struct Polygon {
    x: Array1<f64>,
    y: Array1<f64>,
    center: Array1<f64>,
}

Now we need to actually implement it. We want to expose poly.{x, y, center} as:

  1. Properties.
  2. numpy Arrays.

We also need a constructor so Python can create new Polygons.

use numpy::{PyArray1, PyReadonlyArray1, ToPyArray};

#[pymethods]
impl Polygon {
    #[new]
    fn new(x: PyReadonlyArray1<f64>, y: PyReadonlyArray1<f64>) -> Polygon {
        let x = x.as_array();
        let y = y.as_array();
        let center = Array1::from_vec(vec![x.mean().unwrap(), y.mean().unwrap()]);

        Polygon {
            x: x.to_owned(),
            y: y.to_owned(),
            center,
        }
    }

    // the `Py<..>` in the return type is a way of saying "an Object owned by Python".
    #[getter]
    fn x(&self, py: Python<'_>) -> PyResult<Py<PyArray1<f64>>> {
        Ok(self.x.to_pyarray(py).to_owned()) // Create a Python-owned, numpy version of `x`.
    }

    // Same for `y` and `center`.
}

We need to add our new struct as a class to the module:

#[pymodule]
fn poly_match_rs(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<Polygon>()?; // new.
    m.add_function(wrap_pyfunction!(find_close_polygons, m)?)?;
    Ok(())
}

And now we can update the Python code to use it:

class Polygon(poly_match_rs.Polygon):
    _area: float = None

    def area(self) -> float:
        ...

We can compile it and it’ll actually work, but it’ll be much slower! (Remember that x, y, and center will now need to create a new numpy array on each access).

To actually improve performance, we need to extract our original Rust-based Polygon from the list of Python-Polygons.

PyO3 is very flexible with this type of operation, so there are a few ways we could do it. One limit we have is that we also need to return Python-Polygons, and we don’t want to do any cloning of the actual data.

It’s possible to manually call .extract::<Polygon>(py)? on each PyObject, but instead we ask PyO3 to give us Py<Polygon> directly.

This is a reference to a Python-owned object, which we expect to contain an instance (or a subclass, in our case) of a native pyclass struct.

45  #[pyfunction]
46  fn find_close_polygons(
47      py: Python<'_>,
48      polygons: Vec<Py<Polygon>>,             // References to Python-owned objects.
49      point: PyReadonlyArray1<f64>,
50      max_dist: f64,
51  ) -> PyResult<Vec<Py<Polygon>>> {           // Return the same `Py` references, unmodified.
52      let mut close_polygons = vec![];
53      let point = point.as_array();
54      for poly in polygons {
55          let center = poly.borrow(py).center // Need to use the GIL (`py`) to borrow the underlying `Polygon`.
56              .to_owned();
57
58          if (center - point).norm() < max_dist {
59              close_polygons.push(poly)
60          }
61      }
62
63      Ok(close_polygons)
64  }

Let’s see what we get using this code:

$ python measure.py
Took an avg of 6.29ms per iteration

We are nearly there! Just x2 to go!

v3 - Avoid allocations

Let’s fire up the profiler one more time.

  1. We start to see select_best_polygon, which now calls some Rust code (when it gets the x & y vectors). We could fix that, but it’s a very small potential improvement (maybe 10%).
  2. We see we spend about 20% of the time on extract_argument (under lib_v2.rs:48), so we are still paying quite a lot of overhead! Most of it is in PyIterator::next and PyTypeInfo::is_type_of, which aren’t easy to fix.
  3. We see a bunch of time spent allocating stuff! lib_v2.rs:58 is our if, and we see drop_in_place and to_owned. The actual line is about 35% of the overall time, which is a lot more than we expect: this should be the “fast bit” with all the data in place.

Let’s tackle the last point.

This our problematic snippet:

let center = poly.borrow(py).center
    .to_owned();

if (center - point).norm() < max_dist { ... }

What we want is to avoid that to_owned. But we need an owned object for norm, so we’ll have to implement that manually.

(The reason we can improve on ndarray here is that we know that our array is actually just 2 f64s).

This would look like this:

use ndarray_linalg::Scalar;

let center = &poly.as_ref(py).borrow().center;

if ((center[0] - point[0]).square() + (center[1] - point[1]).square()).sqrt() < max_dist {
    close_polygons.push(poly)
}

But, alas, the borrow checker is unhappy with us:

error[E0505]: cannot move out of `poly` because it is borrowed
  --> src/lib.rs:58:33
   |
55 |         let center = &poly.as_ref(py).borrow().center;
   |                       ------------------------
   |                       |
   |                       borrow of `poly` occurs here
   |                       a temporary with access to the borrow is created here ...
...
58 |             close_polygons.push(poly);
   |                                 ^^^^ move out of `poly` occurs here
59 |         }
60 |     }
   |     - ... and the borrow might be used here, when that temporary is dropped and runs the `Drop` code for type `PyRef`

As usual, the borrow checker is correct: we are doing memory crimes.

The simplest fix is to Just Clone: close_polygons.push(poly.clone()) compiles.

This is actually a very cheap clone, because we only increment the reference count of the Python object.

However, in this case we can also shorten the borrow by doing a classic Rust trick:

let norm = {
    let center = &poly.as_ref(py).borrow().center;

    ((center[0] - point[0]).square() + (center[1] - point[1]).square()).sqrt()
};

if norm < max_dist {
    close_polygons.push(poly)
}

Because poly is only borrowed in the inner scope, once we reach close_polygons.push the compiler can know that we no longer hold that reference, and will happily compile the new version.
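The same scope trick applies in plain Rust whenever a Drop guard would otherwise outlive its last use. A minimal self-contained sketch (not the article's pyo3 code; here a RefCell borrow stands in for the PyRef guard, and the types and numbers are invented):

```rust
use std::cell::RefCell;

// `RefCell` stands in for pyo3's runtime-checked borrow: `borrow()` returns a
// guard with a `Drop` impl, just like `PyRef`.
struct Poly {
    center: RefCell<[f64; 2]>,
}

// Keep only the polygons whose center is within `max_dist` of the origin.
fn close_to_origin(polys: Vec<Poly>, max_dist: f64) -> Vec<Poly> {
    let mut close = Vec::new();
    for poly in polys {
        // Compute the norm in an inner scope so the borrow guard is dropped
        // before we move `poly` into the result vector. Without this scope,
        // `close.push(poly)` fails with E0505, like in the article.
        let norm = {
            let c = poly.center.borrow();
            (c[0] * c[0] + c[1] * c[1]).sqrt()
        };
        if norm < max_dist {
            close.push(poly); // OK: no outstanding borrow of `poly` here
        }
    }
    close
}

fn main() {
    let polys = vec![
        Poly { center: RefCell::new([1.0, 2.0]) }, // norm ~2.24, kept
        Poly { center: RefCell::new([9.0, 9.0]) }, // norm ~12.73, dropped
    ];
    println!("{}", close_to_origin(polys, 5.0).len()); // prints 1
}
```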

And finally, we have

$ python measure.py
Took an avg of 2.90ms per iteration

Which is a 100x improvement over the original code.

Summary

We started out with this Python code:

@dataclass
class Polygon:
    x: np.array
    y: np.array
    _area: float = None

    @cached_property
    def center(self) -> np.array:
        centroid = np.array([self.x, self.y]).mean(axis=1)
        return centroid

    def area(self) -> float:
        ...

def find_close_polygons(
    polygon_subset: List[Polygon], point: np.array, max_dist: float
) -> List[Polygon]:
    close_polygons = []
    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)

    return close_polygons

# Rest of file (main, select_best_polygon).

We profiled it using py-spy, and even our most naive, line-to-line translation of find_close_polygons resulted in more than x10 improvement.

We did a few profile-rewrite-measure iterations until we finally gained x100 improvement in runtime, while keeping the same API as the original library.

Version                                                      Avg time per iteration (ms)   Multiplier
Baseline implementation (Python)                             293.41                        1x
Naive line-to-line Rust translation of find_close_polygons   23.44                         12.50x
Polygon implementation in Rust                               6.29                          46.53x
Optimized allocation implementation in Rust                  2.90                          101.16x

The final Python code looks like this:

import poly_match_rs
from poly_match_rs import find_close_polygons

class Polygon(poly_match_rs.Polygon):
    _area: float = None

    def area(self) -> float:
        ...

# Rest of file unchanged (main, select_best_polygon).

which calls this Rust code:

use pyo3::prelude::*;

use ndarray::Array1;
use ndarray_linalg::Scalar;
use numpy::{PyArray1, PyReadonlyArray1, ToPyArray};

#[pyclass(subclass)]
struct Polygon {
    x: Array1<f64>,
    y: Array1<f64>,
    center: Array1<f64>,
}

#[pymethods]
impl Polygon {
    #[new]
    fn new(x: PyReadonlyArray1<f64>, y: PyReadonlyArray1<f64>) -> Polygon {
        let x = x.as_array();
        let y = y.as_array();
        let center = Array1::from_vec(vec![x.mean().unwrap(), y.mean().unwrap()]);

        Polygon {
            x: x.to_owned(),
            y: y.to_owned(),
            center,
        }
    }

    #[getter]
    fn x(&self, py: Python<'_>) -> PyResult<Py<PyArray1<f64>>> {
        Ok(self.x.to_pyarray(py).to_owned())
    }

    // Same for `y` and `center`.
}

#[pyfunction]
fn find_close_polygons(
    py: Python<'_>,
    polygons: Vec<Py<Polygon>>,
    point: PyReadonlyArray1<f64>,
    max_dist: f64,
) -> PyResult<Vec<Py<Polygon>>> {
    let mut close_polygons = vec![];
    let point = point.as_array();
    for poly in polygons {
        let norm = {
            let center = &poly.as_ref(py).borrow().center;

            ((center[0] - point[0]).square() + (center[1] - point[1]).square()).sqrt()
        };

        if norm < max_dist {
            close_polygons.push(poly)
        }
    }

    Ok(close_polygons)
}

#[pymodule]
fn poly_match_rs(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<Polygon>()?;
    m.add_function(wrap_pyfunction!(find_close_polygons, m)?)?;
    Ok(())
}

Takeaways

  • Rust (with the help of pyo3) unlocks true native performance for everyday Python code, with minimal compromises.

  • Python is a superb API for researchers, and crafting fast building blocks with Rust is an extremely powerful combination.

  • Profiling is super interesting, and it pushes you to truly understand everything that’s happening in your code.

And finally: computers are crazy fast. The next time you wait for something to complete, consider firing up a profiler, you might learn something new 🚀