Easton Man's Channel

@easton_channel

Channel geography and language: China, Chinese

Category: Technology

News that @EastonMan reads
+ ramblings
+ admiring the experts
+ occasional cats
+ songs Easton listens to

Easton Man's Channel

21 Dec, 02:40

Chips and Cheese
Skymont in Desktop Form: Atom Unleashed
#ChipAndCheese

Telegraph | source
(author: Chester Lam)

Skymont in Desktop Form: Atom Unleashed

Skymont is Intel's newest E-Core architecture. E-Cores trace their lineage to low power and low performance Atom cores of long ago. But E-Cores have become an integral part of Intel's high performance desktop strategy, letting Intel maintain competitive multithreaded performance against AMD's high c...

72 0 1 1

Easton Man's Channel

18 Dec, 06:05

Reposted from 一个存在的世界

BIRD 3.0.0 https://gitlab.nic.cz/labs/bird/-/blob/v3.0.0/NEWS?ref_type=tags

99 0 1 3

Easton Man's Channel

17 Dec, 07:20

杰哥的{运维，编程，调板子}小笔记

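A Methodology for Reverse Engineering CPU Microarchitecture

Background

I have recently done quite a few microarchitecture evaluations, many of which involved reverse engineering CPU microarchitectures:

● Qualcomm Oryon microarchitecture evaluation
● AMD Zen 5 microarchitecture evaluation
● ARM Neoverse V2 microarchitecture evaluation

So here is a summary of the methodology for reverse engineering CPU microarchitectures.

Definition

First, a definition: what is CPU microarchitecture reverse engineering? In my view it has two meanings:

1. When a CPU microarchitecture is already known to use a certain design and only its design parameters are unknown, obtaining those parameters through reverse engineering
2. When it is uncertain which design a CPU microarchitecture uses, proposing several candidate designs, ruling them out or confirming them through reverse engineering, and then going on to find the design parameters

For example, suppose a CPU microarchitecture is known to have a set-associative L1 DCache, but its capacity and associativity are unknown. Microarchitecture reverse engineering can then recover the capacity and the number of ways, and may go further and recover the Index function. This is the first meaning.

As another example, suppose a CPU microarchitecture is known to have a branch predictor, but it is unknown what information it predicts from: it might use the branch address, the branch target address, or the branch direction. Microarchitecture reverse engineering then eliminates the possibilities one by one to find the real one. If the candidates cannot be narrowed down to exactly one, or if every candidate is eliminated, the actual design does not match expectations.

For the first meaning, there are already many mature Microbenchmarks (a Benchmark designed for the Microarchitecture is called a Microbenchmark). They target common microarchitecture designs, recover the corresponding design parameters, and can be used directly on many platforms. For the second meaning, the only option so far is case-by-case analysis: guess the underlying design, then construct a Microbenchmark for that design.

The rest of this post covers the methodology for designing and implementing Microbenchmarks.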

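Principle

First, understand the principle behind a Microbenchmark. The core idea is to construct a program that makes a particular microarchitectural component the bottleneck, then sweep along the dimension of the design parameter to be reverse engineered, using some metric to reveal whether the bottleneck has appeared. The program parameter at which the bottleneck appears then yields the value of the design parameter. This is somewhat abstract, so here is an example:

Suppose the target is the capacity of the L1 DCache. Then the L1 DCache capacity should become the bottleneck. To make that happen, repeatedly access a region of memory that is larger than the L1 DCache, so that the L1 DCache cannot hold all of it and cache misses occur. To tell whether cache misses occur, use time or cycle counts, since cache misses always cost performance, or read the cache-miss performance counter directly. Because the design parameter being reverse engineered is the L1 DCache capacity, sweep over capacity: allocate arrays of different sizes in memory, say one of 32KB and another of 64KB, and access only one array per test. Scan each array a number of times, then record the total time, cycle count, or number of cache misses. If the real L1 DCache capacity lies between 32KB and 64KB, the 64KB array should perform noticeably worse than the 32KB one. With a finer granularity, one array size per 1KB, the actual L1 DCache capacity can be pinned down.

In this example, the component that becomes the bottleneck is the L1 DCache, the design parameter to reverse engineer is its capacity, the metric that reveals the bottleneck is performance or the cache-miss count, and the constructed program repeatedly accesses an array of variable size, where the array size is tied to the design parameter being reverse engineered.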

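This gives the elements of Microbenchmark design:

1. Which microarchitectural component is targeted
2. Which design parameter of that component is targeted
3. Which metric reveals the bottleneck
4. How to construct a program that causes the bottleneck
5. Under what conditions the program causes the bottleneck
6. How the program's parameter maps to the design parameter

For the L1 DCache capacity test above, the answers are:

1. Which microarchitectural component is targeted: the L1 DCache
2. Which design parameter of that component is targeted: the capacity of the L1 DCache
3. Which metric reveals the bottleneck: time, cycle count, cache-miss count
4. How to construct a program that causes the bottleneck: allocate an array in memory and scan it repeatedly
5. Under what conditions the program causes the bottleneck: the array size exceeds the L1 DCache capacity
6. How the program's parameter maps to the design parameter: the array size corresponds to the L1 DCache capacity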
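The six answers for the L1 DCache capacity test can be sketched in C++ roughly as follows. This is a minimal illustration under stated assumptions, not the author's benchmark: the names `ns_per_access` and `sweep_kb`, the 64-byte cache line, and the pass count are all hypothetical. A sequential scan is also friendly to hardware prefetchers, so the jump may be much less visible than with a randomized access pattern, and a real measurement would likely need hand-written assembly to control exactly which instructions run.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: average nanoseconds per access when repeatedly
// scanning an array of `bytes` bytes, touching one byte per cache line.
// The 64-byte line size is an assumption, not a measured value.
double ns_per_access(std::size_t bytes, int passes = 200) {
    assert(bytes > 0 && passes > 0);
    constexpr std::size_t kLine = 64;
    std::vector<std::uint8_t> buf(bytes, 1);
    // Reading through a volatile pointer keeps the loads from being optimized away.
    volatile std::uint8_t* p = buf.data();
    std::uint64_t sum = 0;
    // Warm-up pass so the timed passes start from a steady state.
    for (std::size_t i = 0; i < bytes; i += kLine) sum += p[i];
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < passes; ++r)
        for (std::size_t i = 0; i < bytes; i += kLine) sum += p[i];
    auto t1 = std::chrono::steady_clock::now();
    (void)sum;
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::size_t accesses =
        static_cast<std::size_t>(passes) * ((bytes + kLine - 1) / kLine);
    return ns / static_cast<double>(accesses);
}

// Sweep the array size in 1 KB steps. A jump between neighboring entries
// suggests the size at which the array stops fitting in the L1 DCache.
std::vector<double> sweep_kb(std::size_t from_kb, std::size_t to_kb) {
    assert(from_kb > 0 && from_kb <= to_kb);
    std::vector<double> out;
    for (std::size_t kb = from_kb; kb <= to_kb; ++kb)
        out.push_back(ns_per_access(kb * 1024));
    return out;
}
```

Calling `sweep_kb(16, 128)` and plotting the result against the size in KB would be one way to look for the step; the numbers themselves depend entirely on the machine.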

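To design a test for the capacity of the ROB (ReOrder Buffer), think through the same elements:

1. Which microarchitectural component is targeted: the ROB
2. Which design parameter of that component is targeted: how many instructions the ROB can hold
3. Which metric reveals the bottleneck: time, cycle count
4. How to construct a program that causes the bottleneck: place one long-latency instruction at the start and one at the end of the ROB, with filler instructions in between
5. Under what conditions the program causes the bottleneck: if enough filler instructions are inserted, the trailing long-latency instruction cannot enter the ROB, so it cannot be executed speculatively
6. How the program's parameter maps to the design parameter: the number of instructions in the ROB at the point where the trailing long-latency instruction is kept out of the ROB

Once these elements are clear, it is clear how to design a Microbenchmark.

That covers the principle. Next come some commonly used methods.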

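Obtaining the metric

As mentioned above, revealing a bottleneck requires a metric. Ideally it reflects precisely whether the bottleneck occurred, with as little noise as possible. There are not many usable metrics, only two kinds:

1. Time: the most general, usable on every platform; record the time before and after the program and take the difference
2. Performance counters: more cumbersome to use; they sometimes need root privileges, or the hardware details are not public, or the hardware simply does not implement the counter in question. Availability by platform:
   1. Windows: available, with a ready-made API
   2. macOS: available, through a reverse-engineered private framework API
   3. Linux: available, with a ready-made API
   4. iOS: currently usable only through Xcode, which is inconvenient
   5. Android: requires root or use through adb shell, which is cumbersome
   6. HarmonyOS NEXT: not available

Although measuring time is the simplest and most general approach, it is limited by frequency fluctuation: if the frequency changes sharply while the test runs (especially on phones), a large amount of noise is introduced and the useful signal is drowned out.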
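The "record the time before and after, take the difference" metric could look like the sketch below. The helper name `min_time_ns` and the choice to repeat the measurement and keep the fastest run are assumptions made for illustration; they are not from the article. Keeping the minimum tends to discard runs disturbed by interrupts or a temporarily lowered clock, but it does not remove frequency effects, so cycle counts from a performance counter remain the more robust metric where they are available.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: run `fn` `reps` times and return the fastest run in
// nanoseconds. Taking the minimum is an assumption of this sketch; it is
// chosen because disturbances typically make a run slower, not faster.
template <typename Fn>
double min_time_ns(Fn&& fn, int reps = 10) {
    assert(reps > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < reps; ++i) {
        auto t0 = std::chrono::steady_clock::now();  // time before
        fn();                                        // code under test
        auto t1 = std::chrono::steady_clock::now();  // time after
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        best = std::min(best, ns);
    }
    return best;
}
```

`std::chrono::steady_clock` is used because it is monotonic, so the difference cannot go negative when the wall clock is adjusted.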

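Of the two, performance counters are the most precise. They are more cumbersome to use, but they have enabled many deeper reverse engineering studies of CPU microarchitecture. I hope hardware vendors who read this will not hide performance counters in order to prevent reverse engineering, because they really are useful for application performance analysis. For how to use performance counters, refer to existing Microbenchmark frameworks.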

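Common patterns

Here are some common patterns for constructing bottlenecks:

1. Measuring capacity (e.g. the I/D Caches at each level and the TLBs): construct a program that fills the capacity; once the capacity is exhausted, a performance drop can be observed
2. Measuring the depth of a microarchitectural queue or Buffer (e.g. the ROB, register files, scheduler queues): use an instruction at the head of the queue to block dequeuing, then keep enqueuing new instructions. Once the queue is full, no new instructions can be enqueued. Then introduce some instructions that would not otherwise be blocked; because the queue is now full they cannot enter, and performance drops
3. Measuring set-associative structures (e.g. the BTB and Caches): in a set-associative structure the capacity within each Index is fixed. Measuring capacity shows how many Indexes are covered. If the input to the Index function (e.g. the PC) is modified so that some Indexes can no longer be reached, a reduction in capacity is observed, and the measured capacity in turn indicates how many Indexes can still be reached
4. Constructing pointer chasing: at a granularity of 8B (a 64-bit pointer), the cache line size, or the page size, shuffle the elements randomly and then link them with pointers, so that the memory each pointer points to holds the address of the next pointer
5. Constructing long-latency instructions: commonly used when testing instruction queues; usually implemented with a pointer chasing long latency load, or with a sequence of serially dependent floating-point divide or square-root instructions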
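The pointer chasing construction in item 4 can be sketched in C++ as below. The function names, the fixed seed, and the use of Sattolo's algorithm to obtain a single random cycle are illustrative assumptions; the article only says to shuffle and link the elements. The `stride` parameter plays the role of the granularity (8 bytes, a cache line, or a page).

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Build a pointer-chasing chain over `n` slots spaced `stride` bytes apart.
// Each slot stores the address of the next slot. The slots form one random
// cycle (Sattolo's algorithm), so a walk visits every slot exactly once
// before returning to its start. The stored pointers refer to this buffer,
// so copying the vector would leave the copy pointing into the original.
std::vector<char> make_chain(std::size_t n, std::size_t stride, unsigned seed = 42) {
    assert(n >= 1 && stride >= sizeof(void*) && stride % alignof(void*) == 0);
    std::vector<std::size_t> next(n);
    for (std::size_t i = 0; i < n; ++i) next[i] = i;
    std::mt19937 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        // Picking j strictly below i is what makes the result a single cycle.
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    std::vector<char> buf(n * stride);
    for (std::size_t i = 0; i < n; ++i)
        *reinterpret_cast<void**>(&buf[i * stride]) = &buf[next[i] * stride];
    return buf;
}

// Follow the chain for `steps` loads. Each load depends on the previous
// one, so the loads cannot overlap and their latency adds up.
void* chase(void* start, std::size_t steps) {
    void* p = start;
    for (std::size_t i = 0; i < steps; ++i) p = *static_cast<void**>(p);
    return p;
}
```

Timing `chase(buf.data(), steps)` and dividing by `steps` would give an estimate of the load-to-use latency for a working set of `n * stride` bytes.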

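And some common pitfalls:

1. Write test cases in assembly where possible; C/C++ compilers may introduce unwanted behavior
2. Some linker behavior may need to be avoided, for example it may modify certain instructions
3. The linker may also have limitations, for example it does not support very large alignments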

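Existing Microbenchmarks

In fact, many ready-made Microbenchmarks already exist, along with documents that describe Microbenchmarks:

● https://www.agner.org/optimize/
● https://github.com/clamchowder/Microbenchmarks/
● https://github.com/JamesAslan/MicroArchBench
● https://github.com/name99-org/AArch64-Explore
● https://github.com/jiegec/cpu-micro-benchmarks

There are also sites that publish reverse engineering done with Microbenchmarks:

● https://chipsandcheese.com
● Anandtech (sadly no longer updated)
● https://blog.hjc.im/
● https://www.zhihu.com/people/jamesaslan
● This blog

If you want to reverse engineer a component of some microarchitecture but do not know how, look through the sites above to see whether an implementation already exists.

If you are not interested in how to write these Microbenchmarks, you can also try running them on your own computer, or simply read the existing analyses.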

source

191 0 6 3

Easton Man's Channel

17 Dec, 06:11

Chips and Cheese
Rebellions: From High Frequency Trading to AI Acceleration
#ChipAndCheese

Telegraph | source
(author: George Cozma)

Rebellions: From High Frequency Trading to AI Acceleration

Hello you fine Internet folks, At Supercomputing 2024 we stopped by the Rebellions Booth. Rebellions is a Korean startup that originally focused on the High Frequency Trading (HFT) sector and now is transitioning to the AI sector with their second and third generation products. Rebellions’ first SoC...

126 0 0

Easton Man's Channel

16 Dec, 10:47

#名言

The best kind of security is the lack of marketshare

166 0 2

Easton Man's Channel

16 Dec, 03:57

Daniel Lemire's blog
Accessing the attributes of a struct in C++ as array elements?

In C++, it might be reasonable to represent a URL using a class or a struct made of several strings, like so:
struct basic {
    std::string protocol;
    std::string username;
    std::string password;
    std::string hostname;
    std::string port;
    std::string pathname;
    std::string search;
    std::string hash;
};
You might associate to each component (protocol, username, etc.) an index, like so:
enum class component {
    PROTOCOL = 0,
    USERNAME = 1,
    PASSWORD = 2,
    HOSTNAME = 3,
    PORT = 4,
    PATHNAME = 5,
    SEARCH = 6,
    HASH = 7,
};
What you might like to do then is to access a component by its index. The following code might do:
std::string& get_component(basic& url, component comp) {
    switch (comp) {
        case component::PROTOCOL: return url.protocol;
        case component::USERNAME: return url.username;
        case component::PASSWORD: return url.password;
        case component::HOSTNAME: return url.hostname;
        case component::PORT: return url.port;
        case component::PATHNAME: return url.pathname;
        case component::SEARCH: return url.search;
        case component::HASH: return url.hash;
    }
}
But what if you are constantly accessing values by their indexes? You might be concerned that the overhead of the switch/case could be too much.

Instead, you might flip the data structure around and store the values in an array within the data structure. The following might work:
struct fat {
    std::array<std::string, 8> data;
    std::string &protocol = data[0];
    std::string &username = data[1];
    std::string &password = data[2];
    std::string &hostname = data[3];
    std::string &port = data[4];
    std::string &pathname = data[5];
    std::string &search = data[6];
    std::string &hash = data[7];
};
With this new data structure, getting a component by its index becomes simpler:
std::string& get_component(fat& url, component comp) {
    return url.data[int(comp)];
}
Unfortunately, each reference in the new fat data structure might use 8 bytes. That is not a concern if you expect to have few instances of the data structure. However, if you expect many instances, you might want to avoid the references. You might try to replace the references by simple methods:
struct advanced {
    std::array<std::string, 8> data;
    std::string &protocol() { return data[0]; }
    std::string &username() { return data[1]; }
    std::string &password() { return data[2]; }
    std::string &hostname() { return data[3]; }
    std::string &port() { return data[4]; }
    std::string &pathname() { return data[5]; }
    std::string &search() { return data[6]; }
    std::string &hash() { return data[7]; }
};
It is not entirely satisfactory as it requires calling methods instead of accessing attributes.

I am not sure whether you can do any better currently in C++.
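One further option that may be worth trying, offered here only as a sketch and not as something the post proposes, is a shared table of pointers to members. It keeps the plain attributes of `basic`, adds no storage per instance, and replaces the switch with an indexed lookup. Whether it is actually faster than the switch is an open question, since a compiler may already turn the switch into something similar. The table name `kMembers` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct basic {
    std::string protocol, username, password, hostname,
                port, pathname, search, hash;
};

enum class component {
    PROTOCOL, USERNAME, PASSWORD, HOSTNAME, PORT, PATHNAME, SEARCH, HASH
};

std::string& get_component(basic& url, component comp) {
    // One table for all instances; its order must match the enum above.
    static constexpr std::string basic::* kMembers[] = {
        &basic::protocol, &basic::username, &basic::password, &basic::hostname,
        &basic::port,     &basic::pathname, &basic::search,   &basic::hash,
    };
    auto i = static_cast<std::size_t>(comp);
    assert(i < sizeof(kMembers) / sizeof(kMembers[0]));
    return url.*kMembers[i];  // apply the pointer to member to this object
}
```

With this, `url.hostname` and `get_component(url, component::HOSTNAME)` refer to the same string, so callers can mix attribute access and indexed access.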

source

137 0 1

Easton Man's Channel

15 Dec, 19:45

杰哥的{运维，编程，调板子}小笔记
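Exploring the Implementation of Linux Performance Analysis (Perf)

Telegraph | source

Exploring the Implementation of Linux Performance Analysis (Perf)

Exploring the Implementation of Linux Performance Analysis (Perf) Background I have been using Linux's performance analysis features a lot recently, but have rarely looked into the principles behind them, such as how the hardware PMU is configured and how the PMU is sampled per process or even per thread. This post tries to explore those principles. PMU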

175 0 2 1

Easton Man's Channel

13 Dec, 21:48

Chips and Cheese
Fujitsu's Monaka CPU: ARMv9, SVE2, and 3D Stacking
#ChipAndCheese

Telegraph | source
(author: George Cozma)

Fujitsu's Monaka CPU: ARMv9, SVE2, and 3D Stacking

Hello you fine Internet folks, today we are going back to SC24 with a short about Fujitsu’s upcoming Monaka CPU. Fujitsu’s CPUs are quite prevalent in the HPC space with their prior gen A64FX CPU powering the former world’s number 1 Supercomputer, Fugaku. However, Monaka is not a direct replacement ...

188 0 0

Easton Man's Channel

12 Dec, 08:29

Reposted from Lobste.rs

Common Misconceptions about Compilers

Comments

via sbaziotis.com via bitfield

Common Misconceptions about Compilers

A curated list of misconceptions about mainstream compilers.

194 0 1 1

Easton Man's Channel

11 Dec, 07:31

Chips and Cheese
Turning off Zen 4's Op Cache for Curiosity and Giggles
#ChipAndCheese

Telegraph | source
(author: Chester Lam)

Turning off Zen 4's Op Cache for Curiosity and Giggles

CPUs start executing instructions by fetching those instruction bytes from memory and decoding them into internal operations (micro-ops). Getting data from memory and operating on it consumes power and incurs latency. Micro-op caching is a popular technique to improve on both fronts, and involves ca...

217 0 1 1

Easton Man's Channel

9 Dec, 00:11

Daniel Lemire's blog
Data structures as jigs for programmers (Go edition)

Telegraph | source

Data structures as jigs for programmers (Go edition)

A data structure in programming is a specific way of organizing and storing data in a computer so that it can be accessed and used efficiently. In woodworking or metalworking, a jig holds a piece of work and guides the tools operating on it. It helps to produce consistent results. The simplest jig i...

244 0 0 2

Easton Man's Channel

8 Dec, 19:30

Chips and Cheese
400G Omnipath is Coming: Cornelis Networks at SC24
#ChipAndCheese

Hello you fine Internet folks,

At Supercomputing 2024 we stopped by the Cornelis Networks Booth where we got a peek at their CN5000 series of products, their 400G line of NICs, Switches, and Director Class Switches, which is due to arrive in H1 of 2025.

Hope y’all enjoy!

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord and subscribing to the Chips and Cheese Youtube channel.

source
(author: George Cozma)

263 0 0

Easton Man's Channel

8 Dec, 17:03

Reposted from Lancern's Treasure Chest

Introduction to std::hive

https://lancern.xyz/posts/2024/12/std-hive

205 0 0 1

Easton Man's Channel

5 Dec, 19:03

#TIL
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d4148aeab412432bf928f311eca8a2ba52bb05df

reports regressions in various spec benchmarks, with
up to 600% slowdown of the cactusBSSN benchmark on some platforms. The
benchmark seems to create many mappings of 4632kB, which would have merged
to a large THP-backed area before commit efa7df3e3bb5 and now they are
fragmented to multiple areas each aligned to PMD boundary with gaps
between. The regression then seems to be caused mainly due to the
benchmark's memory access pattern suffering from TLB or cache aliasing due
to the aligned boundaries of the individual areas
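The gaps described in the commit message can be illustrated with a little arithmetic. The sketch below assumes a PMD covers 2 MiB (2048 kB), which holds on x86-64 with 4 KiB pages but is not stated in the quoted text, and it assumes each mapping starts exactly on a PMD boundary. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: a PMD covers 2 MiB (2048 kB), as on x86-64 with 4 KiB pages.
constexpr std::uint64_t kPmdKb = 2048;

// Unused space, in kB, between the end of a mapping of `size_kb` that starts
// on a PMD boundary and the next PMD boundary.
std::uint64_t gap_after_kb(std::uint64_t size_kb) {
    assert(size_kb > 0);
    return (kPmdKb - size_kb % kPmdKb) % kPmdKb;
}

// Number of whole PMD-sized regions that fit inside the mapping.
std::uint64_t full_pmds(std::uint64_t size_kb) {
    assert(size_kb > 0);
    return size_kb / kPmdKb;
}
```

For the 4632 kB mappings mentioned above, 4632 = 2 * 2048 + 536, so under these assumptions each mapping holds two full PMD-sized regions and is followed by a 2048 - 536 = 1512 kB gap before the next aligned mapping can begin.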

304 0 0 3

Easton Man's Channel

5 Dec, 00:37

Chips and Cheese
Examining Intel's Arrow Lake, at the System Level
#ChipAndCheese

Telegraph | source
(author: Chester Lam)

Examining Intel's Arrow Lake, at the System Level

Arrow Lake is the codename for Intel's newest generation of high performance desktop CPUs. Its highest end offering, the Core Ultra 9 285K, implements eight Lion Cove P-Cores and 16 Skymont E-Cores. This site has already covered the Lion Cove and Skymont architectures in articles on Intel's Lunar La...

295 0 0 1

Easton Man's Channel

3 Dec, 19:29

https://github.com/dendibakh/perf-book/releases/download/2.0_release/PerformanceAnalysisAndTuningOnModernCPUs_SecondEdition.pdf

280 0 0

Easton Man's Channel

3 Dec, 09:22

302 0 0

Easton Man's Channel

3 Dec, 09:21

https://www.itjungle.com/2024/12/02/power11-takes-memory-bandwidth-up-to-well-eleven/

Power11 Takes Memory Bandwidth Up To, Well, Eleven - IT Jungle

Last week, we went over the roadmaps for the future Power11 processor from IBM and its follow-on, the Power Next chip that we presume will be called Power 12 because, you know, history. This week we want to take a little bit of a deeper dive into the Power11 strategy and what this might mean

286 0 0

Easton Man's Channel

2 Dec, 22:54

Chips and Cheese
An EPYC Exclusive for Azure: AMD's MI300C
#ChipAndCheese

Telegraph | source
(author: George Cozma)

An EPYC Exclusive for Azure: AMD's MI300C

Hello you fine Internet folks, At SC24 we stopped by the Azure Booth to check out their new HBv5 VMs powered by the AMD EPYC 9v64H CPU. Each AMD EPYC 9v64H CPU physically has 96 Zen 4 cores along with 128GB of HBM3E. Four of these 9v64H CPUs are then put into a HBv5 VM which has a combined 352 Zen ...

263 0 0

Easton Man's Channel

2 Dec, 17:07

267 0 0 3

The 20 most recent posts are shown.

455 subscribers

Channel statistics

Popular in the channel

https://github.com/riscv/riscv-profiles/issues/193

Chips and Cheese Ayar Labs at Supercomputing 2024: Making Light Move Bits (YouTube Short) #ChipAn...

Daniel Lemire's blog Parsing floats at over a gigabyte per second in C# source

Chips and Cheese Pushing AMD’s Infinity Fabric to its Limits #ChipAndCheese Telegraph | source (...

Chips and Cheese NextSilicon: Putting HPC First #ChipAndCheese Telegraph | source (author: Georg...