Web Excursions 2021-04-09
🌟 [Post of The Day] Gregory Szorc's Digital Home | Surprisingly Slow
Over the years, I've encountered specific instances and common patterns that make software or computers slow. In this post, I'll shine a spotlight on some of them.
Environment Detection in Build Systems (e.g. configure and cmake)
Build systems often feature an environment detection / configuration phase before the build phase.
This environment detection and configuration is a necessary evil because machines and environments often vary substantially and you need to account for those variances.
The problem is that this configuration step often takes longer to run than the build itself.
a few years ago Mozilla observed that building LLVM/Clang on a 96 vCPU EC2 instance resulted in more time spent in cmake/configuring than compiling and linking
To improve efficiency, build configuration needs to be parallelized.
Even better, it should be integrated into the main build DAG itself so parts of the build can start running without having to wait for all build configuration.
Another solution to this problem is avoiding the problem of environment detection in the first place.
If you have deterministic and reproducible build environments, you can take a lot of shortcuts to skip environment detection that just isn't needed any more.
New Process Overhead on Windows
New processes on Windows can't be spawned as quickly as they can on POSIX based operating systems, like Linux.
On Windows, assume a new process will take 10-30ms to spawn.
On Linux, new processes (often via fork() + exec()) will take single-digit milliseconds to spawn, if that.
However, thread creation on Windows is very fast (~dozens of microseconds).
It is not uncommon for a configure script to spawn thousands of new processes. Assuming 10 ms per process, at 1,000 invocations that is 10 s of overhead just spawning new processes.
Consider a multi-threaded architecture or using longer-lived daemon/background processes instead.
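The gap is easy to measure. Below is a minimal, cross-platform sketch in Python; note that the subprocess timing also includes Python's own interpreter startup, so it overstates the raw spawn cost:

```python
import subprocess, sys, threading, time

N = 20

# Spawn N short-lived processes (each also pays Python startup cost).
start = time.perf_counter()
for _ in range(N):
    subprocess.run([sys.executable, "-c", "pass"])
proc_ms = (time.perf_counter() - start) / N * 1000

# Spawn and join N threads doing nothing.
start = time.perf_counter()
for _ in range(N):
    t = threading.Thread(target=lambda: None)
    t.start()
    t.join()
thread_ms = (time.perf_counter() - start) / N * 1000

print(f"process: {proc_ms:.2f} ms/spawn  thread: {thread_ms:.3f} ms/spawn")
```

On any OS the thread number should come out orders of magnitude smaller than the process number.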
Closing File Handles on Windows
Calls to CloseHandle() were often taking 1-10+ milliseconds to complete.
The cause for this was/is Windows Defender.
A/V scanners work on Windows by installing what's called a filesystem filter driver.
This is a kernel driver that essentially hooks itself into the kernel and receives callbacks on I/O and filesystem events
the close file callback triggers scanning of written data, and this scanning occurs synchronously, blocking CloseHandle() from returning.
as long as Windows Defender (and presumably other A/V scanners) are running, there's no way to make the Windows I/O APIs consistently fast.
Use a thread pool for calling CloseHandle().
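A sketch of that suggestion in Python, using a ThreadPoolExecutor to take the (potentially slow) closes off the main thread; the pool size and file names here are arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor
import os, tempfile

# Background pool dedicated to closing file handles.
closer = ThreadPoolExecutor(max_workers=4)

def write_file(path, data):
    f = open(path, "wb")
    f.write(data)
    # Hand the close to a background thread: on Windows, each close can
    # stall 1-10+ ms while the A/V filter driver scans the written data.
    closer.submit(f.close)

with tempfile.TemporaryDirectory() as d:
    paths = [os.path.join(d, f"file{i}.bin") for i in range(8)]
    for p in paths:
        write_file(p, b"hello")
    closer.shutdown(wait=True)  # wait for all pending closes to finish
    ok = all(os.path.getsize(p) == 5 for p in paths)
print("all closed:", ok)
```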
Writing to Terminals
Writing tons of output or getting clever with writing to the terminal (e.g. writing colors, moving the cursor position to write over existing content) can drastically slow down applications.
Writing to the terminal via stderr/stdout is likely performed via blocking I/O.
So if the thing handling your write() (the terminal emulator) doesn't finish its handling promptly, your process just sits around waiting on the terminal to do its thing.
Different terminals have their own quirks.
Historically, the Windows Command Prompt and the built-in Terminal.app on macOS were very slow at handling tons of output.
Modern terminals are better about writing a ton of plain text than they were in ~2012
Exercise extreme caution when doing fancy things with the terminal, like coloring text, drawing footers, etc.
Consider throttling writes to the terminal.
If drawing a progress bar or spinner or something of that nature, I would limit drawing to ~10 Hz to minimize terminal overhead.
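A minimal sketch of ~10 Hz throttling: redraw the progress line only if at least 100 ms have passed since the last draw (the per-item work here is simulated with a short sleep):

```python
import sys, time

def run_with_progress(total, min_interval=0.1):
    """Process `total` items, redrawing progress at most every min_interval s."""
    last_draw = 0.0
    draws = 0
    for i in range(total):
        now = time.monotonic()
        # Redraw only when the interval has elapsed, plus one final frame.
        if now - last_draw >= min_interval or i == total - 1:
            sys.stdout.write(f"\rprocessed {i + 1}/{total}")
            sys.stdout.flush()
            last_draw = now
            draws += 1
        time.sleep(0.001)  # simulate a fast unit of work
    sys.stdout.write("\n")
    return draws

draws = run_with_progress(500)
print("redraws:", draws)
```

Without the throttle this would hit the terminal 500 times; with it, only a handful of redraws occur.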
Thermal Throttling / ACPI C/P-States / Processor Throttling Behavior
Processors are constantly changing their operating envelope as they are running.
Truths
The behavior of power scaling can vary substantially depending on whether the battery is fully charged or nearly empty.
Apple laptops may exhibit thermal throttling when charging from the left side.
A core may slow down in order to process certain instructions (like AVX-512).
When benchmarking, you need to control the power variable or at least report its state so results are qualified appropriately.
I personally had a MacBook Pro become thermally throttled because an internal screw came loose and blocked a fan from spinning.
Python, Node.js, Ruby, and other Interpreter Startup Overhead
each invocation often takes single-digit to dozens of milliseconds to initialize the interpreter.
i.e. the new process spends time at the beginning of process execution just getting to the code you are telling it to run.
Consider using fewer processes and/or alternative programming languages that don't have significant startup overhead if this could become a problem (anything that compiles down to assembly is usually fine).
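The "fewer processes" advice can be sketched like this: running the same work as N separate interpreter invocations pays the startup tax N times, while one invocation pays it once (the workload snippet is arbitrary):

```python
import subprocess, sys, time

SNIPPET = "print(sum(range(1000)))"
N = 10

# N separate interpreter invocations: pay startup cost N times.
start = time.perf_counter()
for _ in range(N):
    subprocess.run([sys.executable, "-c", SNIPPET], stdout=subprocess.DEVNULL)
many = time.perf_counter() - start

# One invocation running the same work N times: pay startup once.
start = time.perf_counter()
subprocess.run([sys.executable, "-c", f"for _ in range({N}): {SNIPPET}"],
               stdout=subprocess.DEVNULL)
one = time.perf_counter() - start

print(f"{N} processes: {many:.3f}s  1 process: {one:.3f}s")
```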
Pretty Much all Storage I/O
Modern NVMe storage is 1.5-3 orders of magnitude faster than the best spinning disks from a little over a decade ago.
Most software fails to utilize the potential of modern storage devices or even worse actively undermines it through bad practices.
Software abstractions in the OS/kernel are eating a lot of potential.
Take the POSIX fsync() function.
By calling this function, you effectively say "be sure the state of this file descriptor is persisted to the storage device" or "I don't want to lose any changes I've made."
On many Linux filesystems (including ext4), the implementation of fsync() is such that upon calls, all unflushed writes are persisted to storage.
On most systems, there's usually something continuously incurring write I/O,
so the amount of storage device I/O incurred by fsync() is almost always larger than just the mutated file/directory you actually want persisted.
When people say databases and other stores should be isolated to their own volumes/filesystems, fsync()'s wonky behavior is a big reason why.
Many data workloads and machine environments don't actually need strong data guarantees: think Kubernetes pods, CI runners, or even servers for a stateless service. If you've designed your system to be stateless and fault tolerant, fsync() buys you little to nothing but can cost you a lot.
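A minimal sketch of that durability trade-off: flush() only hands data to the kernel, while os.fsync() forces it to the device. A stateless workload could pass durable=False (a parameter invented here for illustration) and skip the expensive flush:

```python
import os, tempfile

def write_durable(path, data, durable=True):
    """Write data; optionally force it all the way to the storage device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()                 # push Python's userspace buffer to the kernel
        if durable:
            os.fsync(f.fileno())  # ask the kernel to persist to the device;
                                  # on ext4 this can flush unrelated writes too

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "state.bin")
    write_durable(p, b"important", durable=True)
    size = os.path.getsize(p)
print(size)
```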
Data Compression
Data compression is a trade-off between CPU and I/O usage.
Since the early days of computing, a maxim has been that storage is slow and expensive compared to CPU.
CPUs have somewhat plateaued in their single core performance in the past decade.
the relative performance difference between CPUs and I/O has compressed significantly
~30 years ago, CPUs ran at ~100 MHz and the Internet used dial-up at, say, 50 kbps (0.05 Mbps, or 6.25 kB/s). That's 16,000 cycles per byte.
Today, we're at ~4 GHz with, say, 1 Gbps / 125 MB/s networks. That's 32 cycles per byte, a decrease of 500x.
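The arithmetic checks out:

```python
# Cycles-per-byte budget, then and now, using the figures above.
dialup_Bps = 50_000 / 8            # 50 kbps = 6250 bytes/sec
cycles_then = 100e6 / dialup_Bps   # 100 MHz CPU over dial-up
cycles_now = 4e9 / 125e6           # 4 GHz CPU over a 125 MB/s network
print(cycles_then, cycles_now, cycles_then / cycles_now)
# 16000.0 32.0 500.0
```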
Years ago, trading CPU to lessen the I/O load was often obviously correct.
Today, because of the advancements in I/O performance relative to CPU and a substantially reduced cycles per I/O byte budget, the answer is a lot murkier.
Not helping is the prevalence of ancient compression algorithms.
Use of zlib in 2021 constitutes negligence because its performance lags contemporary solutions.
If decompression is slower than your uncompressed line speed, you are artificially slowing down access to your data.
Common operations that are bottlenecked by use of slow compression formats:
Installing Apt packages (packages are gzip compressed).
Installing Homebrew packages (packages are gzip compressed).
Installing Python packages via pip (source archives are gzip tarballs and wheels are zip files, which use zlib compression).
Pushing/pulling Docker images (layers inside Docker images are gzip compressed).
Git (wire protocol data exchange and on-disk storage use zlib).
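The trade-off is easy to observe with the standard library alone. This sketch times zlib at a few levels on repetitive sample data; modern codecs such as zstd (not in the Python stdlib) typically beat zlib on both speed and ratio:

```python
import time, zlib

# Highly repetitive sample payload (~2.2 MB) so compression is meaningful.
data = b"the quick brown fox jumps over the lazy dog\n" * 50_000

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(compressed):>8} bytes "
          f"in {elapsed * 1000:.1f} ms")
```

Higher levels buy a smaller output at the cost of CPU time; whether that trade is worth it depends on your cycles-per-byte budget.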
x86_64 Binaries in Linux Distribution Packages
By default, binaries provided by many Linux distributions won't contain instructions from modern Instruction Set Architectures (ISAs). No SSE4. No AVX. No AVX2. And more.
C/C++ compilers (like Clang and GCC) will also target an ancient x86_64 microarchitecture level by default
Many Implementations of Myers Diff and Other Line Based Diffing Algorithms
There are various algorithms for generating a diff of text. Myers Diff is probably the most famous.
The run-time of the algorithms is proportional to the number of lines, probably O(n log n) or O(n^2).
when diffing two text documents, large parts of the inputs are often identical
most implementations of diff algorithms have a myriad of optimizations to limit the number of lines compared. Two common optimizations are to identify and exclude the common prefix and suffix of the input.
splitting text into lines can be grossly inefficient
An efficient solution to this problem employs the use of vectorized CPU instructions (like AVX/AVX2) which can scan several bytes at a time looking for a sentinel value or matching a byte mask. So instead of 1 instruction per input byte, you have 1/n.
Another culprit causing inefficiency is hashing each line
Another common inefficiency is computing the lines and hashes of content in the common prefix and suffix.
it demonstrates that a step that is seemingly O(n) can be slower than the O(n log n)/O(n^2) diff algorithm it feeds
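The common prefix/suffix optimization can be sketched as follows, handing only the differing core to Python's difflib (standing in here for the quadratic diff core):

```python
import difflib

def trimmed_diff(a, b):
    """Diff two line lists after stripping their common prefix and suffix."""
    # Strip the common prefix.
    pre = 0
    while pre < min(len(a), len(b)) and a[pre] == b[pre]:
        pre += 1
    # Strip the common suffix (without overlapping the prefix).
    suf = 0
    while suf < min(len(a), len(b)) - pre and a[-1 - suf] == b[-1 - suf]:
        suf += 1
    core_a = a[pre:len(a) - suf]
    core_b = b[pre:len(b) - suf]
    # Only the differing core is handed to the expensive algorithm.
    sm = difflib.SequenceMatcher(a=core_a, b=core_b, autojunk=False)
    return pre, suf, sm.get_opcodes()

a = ["same"] * 1000 + ["old line"] + ["same"] * 1000
b = ["same"] * 1000 + ["new line"] + ["same"] * 1000
pre, suf, ops = trimmed_diff(a, b)
print(pre, suf, ops)  # 1000 1000 [('replace', 0, 1, 0, 1)]
```

The expensive comparison sees 1 line instead of 2,001 on each side.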
Conclusion
The issues I described could be manifesting in your software and environments but the effort to address them may not be worth the reward.
Computers and software, like life, are full of trade-offs. Performance is just one trade-off.
Hacker News
chungy: A very old trick I remember on Windows is to minimize the command prompt if a lot of output is expected and would otherwise slow down the process.
lifthrasiir: IIRC the font rendering in Windows was surprisingly slow and even somewhat unsafe.
In one case, lots of different webfonts displayed in MSIE broke the whole text rendering stack, with all text across the entire system disappearing.
One Step Ahead: Using Tasker + AutoX to Jump Straight to the Health Code with One Tap - SSPAI (少数派)
Obtaining an Alipay mini-program appid
For most commonly used mini-programs, the ID can be found online: just search for "支付宝" + "appid" + the mini-program's name, and make good use of what others have already figured out.
Extracting it from a share link:
Open the Alipay mini-program → tap "More" in the top-right corner → Share → Copy Link, which gives you a short link like:
https://ur.alipay.com/2IcAMP
Open the short link in a desktop browser and press Esc right after the first redirect; the browser address bar will then show a long link like:
https://render.alipay.com/p/s/i/?scheme=alipays%3A%2F%2Fplatformapi%2Fstartapp%3FappId%3D2021001135679870%26page%3Dpages%252Fhome%252Findex%26enbsv%3D0.2.2103202323.33%26chInfo%3Dch_share__chsub_CopyLink
The mini-program appid is the value that follows appId%3D in the long link.
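That manual extraction can be automated; here is a sketch using Python's urllib.parse on the example long link above:

```python
from urllib.parse import parse_qs, urlparse

def extract_appid(long_url):
    # The long link carries an alipays:// deep link, percent-encoded,
    # in its "scheme" query parameter; parse_qs decodes it once.
    deep_link = parse_qs(urlparse(long_url).query)["scheme"][0]
    # The deep link's own query string holds the appId.
    return parse_qs(urlparse(deep_link).query)["appId"][0]

url = ("https://render.alipay.com/p/s/i/?scheme=alipays%3A%2F%2F"
       "platformapi%2Fstartapp%3FappId%3D2021001135679870%26page%3D"
       "pages%252Fhome%252Findex%26enbsv%3D0.2.2103202323.33"
       "%26chInfo%3Dch_share__chsub_CopyLink")
print(extract_appid(url))  # 2021001135679870
```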
Apple expands its Find My feature - iPhone J.D.
To try to prevent a stalker from doing something similar with a Find My tracking device, such as the upcoming Chipolo ONE Spot, Apple will reportedly include a new feature in iOS 14.5 (which should be out this month) called Item Safety Alerts.
This feature, which is enabled by default, will notify you if an unknown item has been found moving with you.