Basic Specifications
Specifications | |
---|---|
Xe-cores | 20 |
Graphics Clock | 2670 MHz |
Memory | 12 GB GDDR6 |
Graphics Memory Interface | 192 bit |
Graphics Memory Bandwidth | 456 GB/s |
GPU Peak TOPS (Int8) | 233 |
TBP | 190 W |
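As a quick sanity check on the table, the bus width and bandwidth together imply a per-pin memory data rate. The sketch below assumes the usual relation bandwidth = bus width × per-pin rate / 8; the resulting figure is derived here and is not part of the spec table.

```python
# Values taken from the specification table above.
BUS_WIDTH_BITS = 192
BANDWIDTH_GB_S = 456


def implied_pin_rate_gbps(bandwidth_gb_s: float, bus_width_bits: int) -> float:
    """Per-pin data rate implied by bandwidth = bus_width * pin_rate / 8."""
    return bandwidth_gb_s * 8 / bus_width_bits


pin_rate = implied_pin_rate_gbps(BANDWIDTH_GB_S, BUS_WIDTH_BITS)
print(f"implied GDDR6 data rate: {pin_rate:.1f} Gbps per pin")
```

The result is a convenient reference point when comparing against the measured global memory bandwidth in the benchmark sections.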
System Environment
kernel: 6.12.7
mesa: 24.3.2
os: Fedora 41
egpu: M.2 to OCuLink
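PCIe Gen 1 x1
For Intel Arc cards, lspci always reports a PCIe Gen 1 x1 link speed.
This appears to be expected behavior and does not reflect the actual speed; see Troubleshooting, Intel Support Forums.
The actual link speed can be measured with xpumanager.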
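As a rough cross-check that does not need xpumanager, the host-to-device transfer figures from the clpeak run later in this post (about 7 GB/s) can be compared against theoretical PCIe link bandwidth. This is only a sketch: it assumes the M.2 to OCuLink adapter carries 4 lanes at Gen4 speed, which the post does not state, and it ignores protocol overhead beyond line encoding.

```python
def pcie_bandwidth_gb_s(gt_per_s: float, lanes: int) -> float:
    """Theoretical one-direction bandwidth for PCIe Gen3+ (128b/130b encoding)."""
    return gt_per_s * lanes * (128 / 130) / 8


# Assumption: the adapter provides 4 lanes at Gen4 (16 GT/s per lane).
gen4_x4 = pcie_bandwidth_gb_s(16.0, 4)
# Gen1 runs at 2.5 GT/s with 8b/10b encoding, so a single lane is 0.25 GB/s.
gen1_x1 = 2.5 * 1 * (8 / 10) / 8

# enqueueWriteBuffer non-blocking, copied from the clpeak output in this post.
measured = 7.08

print(f"Gen4 x4 theoretical: {gen4_x4:.2f} GB/s")
print(f"Gen1 x1 theoretical: {gen1_x1:.2f} GB/s")
print(f"measured / Gen4 x4:  {measured / gen4_x4:.0%}")
```

The measured transfer rate is far above what a Gen 1 x1 link could carry, which is consistent with the lspci value being cosmetic.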
Resizable BAR
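Intel states that Arc cards need Resizable BAR enabled to reach full performance.
From what I could find, a Base Address Register maps a PCIe device's resources into the system memory address space, with synchronization handled internally.
Desktop motherboards usually expose a direct option to enable it.
On laptops and mini PCs, the memory allocated to the integrated GPU has to be set to 4 GB or more.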
$ sudo lspci -s 03:00.0 -vv | grep -i bar
Capabilities: [420 v1] Physical Resizable BAR
BAR 2: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB
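To check the BAR size in a script, the lspci lines above can be parsed directly. This is a minimal sketch that only handles the exact line format shown here (sizes in MB or GB); the function names are made up for illustration, and other lspci versions may format the line differently.

```python
import re

# Sample output copied from the lspci command above.
LSPCI_OUTPUT = """\
Capabilities: [420 v1] Physical Resizable BAR
BAR 2: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB
"""

UNIT_MB = {"MB": 1, "GB": 1024}


def size_to_mb(text: str) -> int:
    """Convert a size such as '256MB' or '16GB' to megabytes."""
    match = re.fullmatch(r"(\d+)(MB|GB)", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * UNIT_MB[match.group(2)]


def parse_rebar(output: str) -> dict:
    """Map BAR index -> (current size in MB, list of supported sizes in MB)."""
    bars = {}
    pattern = re.compile(r"BAR (\d+): current size: (\S+), supported: (.+)")
    for line in output.splitlines():
        match = pattern.search(line)
        if match:
            index, current, supported = match.groups()
            sizes = [size_to_mb(s) for s in supported.split()]
            bars[int(index)] = (size_to_mb(current), sizes)
    return bars


VRAM_MB = 12 * 1024  # 12 GB card, from the spec table
for index, (current, supported) in parse_rebar(LSPCI_OUTPUT).items():
    print(f"BAR {index}: {current} MB, largest supported {max(supported)} MB, "
          f"covers VRAM: {current >= VRAM_MB}")
```

A current size at least as large as the card's memory suggests the whole VRAM is mapped, which is the state Resizable BAR is meant to provide.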
AI Applications
llama.cpp
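llama.cpp supports both the SYCL and Vulkan backends; I have written a Dockerfile for it.
The SYCL build requires Intel oneAPI to be installed on the host and then mounted into the container at build time.
$ llama-bench -ngl 99
Results (for reference only):
SYCL
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | pp512 | 1982.69 ± 4.02 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | tg128 | 34.83 ± 0.11 |
Vulkan
ggml_vulkan: 0 = Intel(R) Graphics (BMG G21) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: none
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 175.56 ± 2.65 |
llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 44.12 ± 0.09 |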
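To put the token-generation numbers in context, they can be compared with a simple memory-bandwidth bound. The sketch assumes that generating one token streams the full set of weights from VRAM once, so tokens per second cannot exceed bandwidth divided by model size; that model is a common rule of thumb and not something llama-bench reports.

```python
GIB = 1024 ** 3

model_bytes = 3.56 * GIB   # "size" column from llama-bench
mem_bandwidth = 456e9      # bytes/s, from the spec table

# Assumption: every generated token reads all weights once.
upper_bound = mem_bandwidth / model_bytes

tg128 = {"SYCL": 34.83, "Vulkan": 44.12}
pp512 = {"SYCL": 1982.69, "Vulkan": 175.56}

for backend, rate in tg128.items():
    print(f"{backend}: tg128 {rate:.2f} t/s is {rate / upper_bound:.0%} "
          f"of the ~{upper_bound:.0f} t/s bandwidth bound")

# Prompt processing is compute-bound, so compare the backends directly.
print(f"pp512 SYCL / Vulkan: {pp512['SYCL'] / pp512['Vulkan']:.1f}x")
```

Both backends sit well below the bound for generation, while the prompt-processing gap between them is about an order of magnitude.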
TODO
Games
TODO
Benchmarks
GravityMark
- Vulkan 1920x1080 Asteroids: 200000 Score: 24896
- OpenGL 1920x1080 Asteroids: 200000 Score: 20477
clpeak
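- Single-precision compute: about the same as the RTX 4060
- Half-precision compute: the website has no data; going by the vkpeak results below, about the same as the RTX 4070 Super
- Integer compute: roughly half of the RTX 4060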
$ ./clpeak
Platform: Intel(R) OpenCL Graphics
Device: Intel(R) Graphics [0xe20b]
Driver version : 24.35.30872.32 (Linux x64)
Compute units : 160
Clock frequency : 2850 MHz
Global memory bandwidth (GBPS)
float : 417.71
float2 : 429.27
float4 : 434.78
float8 : 446.30
float16 : 449.03
Single-precision compute (GFLOPS)
float : 14083.21
float2 : 14401.35
float4 : 14096.05
float8 : 12096.30
float16 : 13596.31
Half-precision compute (GFLOPS)
half : 24831.84
half2 : 28076.27
half4 : 28199.86
half8 : 27916.44
half16 : 27768.14
Double-precision compute (GFLOPS)
double : 891.12
double2 : 893.37
double4 : 898.90
double8 : 888.26
double16 : 842.54
Integer compute (GIOPS)
int : 4433.77
int2 : 4461.17
int4 : 4456.32
int8 : 4364.93
int16 : 4192.12
Integer compute Fast 24bit (GIOPS)
int : 4473.72
int2 : 4448.66
int4 : 4470.87
int8 : 4379.65
int16 : 4225.50
Integer char (8bit) compute (GIOPS)
char : 21487.19
char2 : 26128.29
char4 : 25867.58
char8 : 25227.17
char16 : 24600.19
Integer short (16bit) compute (GIOPS)
short : 20924.35
short2 : 26189.49
short4 : 25835.04
short8 : 25127.30
short16 : 23874.53
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 6.77
enqueueReadBuffer : 6.71
enqueueWriteBuffer non-blocking : 7.08
enqueueReadBuffer non-blocking : 7.01
enqueueMapBuffer(for read) : 6.97
memcpy from mapped ptr : 26.67
enqueueUnmap(after write) : 7.20
memcpy to mapped ptr : 26.41
Kernel launch latency : 49.15 us
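The clpeak figures can be compared with a theoretical peak computed from the compute-unit count and clock that clpeak itself reports. This is a sketch under two assumptions that do not come from this post: each compute unit is a 16-lane vector engine, and a fused multiply-add counts as 2 FLOPs per lane per clock.

```python
# Figures reported by the clpeak run above.
compute_units = 160
clock_ghz = 2.85
best_fp32_gflops = 14401.35   # float2
best_fp16_gflops = 28199.86   # half4
best_bw_gb_s = 449.03         # float16

# Assumptions (not stated in this post).
LANES_PER_CU = 16
FLOPS_PER_FMA = 2

peak_fp32 = compute_units * LANES_PER_CU * FLOPS_PER_FMA * clock_ghz  # GFLOPS
spec_bw_gb_s = 456  # from the spec table

print(f"theoretical FP32 peak: {peak_fp32:.0f} GFLOPS")
print(f"FP32 efficiency:       {best_fp32_gflops / peak_fp32:.1%}")
print(f"FP16 / FP32 ratio:     {best_fp16_gflops / best_fp32_gflops:.2f}")
print(f"bandwidth efficiency:  {best_bw_gb_s / spec_bw_gb_s:.1%}")
```

Under these assumptions both the FP32 throughput and the global memory bandwidth land within a few percent of their theoretical values, and half precision runs at close to twice the single-precision rate.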
vkpeak
- fp32: about the same as the RTX 4060
- int32: particularly poor
- Everything else, surprisingly, comes close to the RTX 4070 Super
$ ./vkpeak 0
device = Intel(R) Graphics (BMG G21)
fp32-scalar = 7277.51 GFLOPS
fp32-vec4 = 10706.95 GFLOPS
fp16-scalar = 21068.69 GFLOPS
fp16-vec4 = 23754.69 GFLOPS
fp16-matrix = 0.00 GFLOPS
fp64-scalar = 799.67 GFLOPS
fp64-vec4 = 779.47 GFLOPS
int32-scalar = 3024.92 GIOPS
int32-vec4 = 3102.06 GIOPS
int16-scalar = 10801.67 GIOPS
int16-vec4 = 13525.01 GIOPS
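Since clpeak (OpenCL) and vkpeak (Vulkan) were run on the same card, their best results per data type can be put side by side. The two tools use different kernels, so this is only a rough indication of how far the Mesa Vulkan path trails the OpenCL runtime here, not a like-for-like comparison.

```python
# Best result per data type, copied from the two runs in this post.
clpeak = {"fp32": 14401.35, "fp16": 28199.86, "fp64": 898.90,
          "int32": 4461.17, "int16": 26189.49}
vkpeak = {"fp32": 10706.95, "fp16": 23754.69, "fp64": 799.67,
          "int32": 3102.06, "int16": 13525.01}

ratios = {name: vkpeak[name] / clpeak[name] for name in clpeak}

# Print from the largest gap to the smallest.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:>5}: Vulkan reaches {ratio:.0%} of the OpenCL figure")
```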
VkFFT
- single precision
$ ./VkFFT_TestSuite -vkfft 0
0 - VkFFT FFT + iFFT C2C benchmark 1D batched in single precision
VkFFT System: 3 8x16777216 Buffer: 1024 MB avg_time_per_step: 10.984 ms std_error: 0.054 num_iter: 3 benchmark: 95462 bandwidth: 364.2
VkFFT System: 4 16x8388608 Buffer: 1024 MB avg_time_per_step: 11.302 ms std_error: 0.378 num_iter: 3 benchmark: 92774 bandwidth: 353.9
VkFFT System: 5 32x4194304 Buffer: 1024 MB avg_time_per_step: 11.164 ms std_error: 0.332 num_iter: 3 benchmark: 93921 bandwidth: 358.3
VkFFT System: 6 64x2097152 Buffer: 1024 MB avg_time_per_step: 10.819 ms std_error: 0.024 num_iter: 3 benchmark: 96918 bandwidth: 369.7
VkFFT System: 7 128x1048576 Buffer: 1024 MB avg_time_per_step: 10.816 ms std_error: 0.022 num_iter: 3 benchmark: 96946 bandwidth: 369.8
VkFFT System: 8 256x524288 Buffer: 1024 MB avg_time_per_step: 10.918 ms std_error: 0.016 num_iter: 3 benchmark: 96037 bandwidth: 366.4
VkFFT System: 9 512x262144 Buffer: 1024 MB avg_time_per_step: 10.963 ms std_error: 0.045 num_iter: 3 benchmark: 95645 bandwidth: 364.9
VkFFT System: 10 1024x131072 Buffer: 1024 MB avg_time_per_step: 10.990 ms std_error: 0.052 num_iter: 3 benchmark: 95414 bandwidth: 364.0
VkFFT System: 11 2048x65536 Buffer: 1024 MB avg_time_per_step: 11.018 ms std_error: 0.067 num_iter: 3 benchmark: 95172 bandwidth: 363.1
VkFFT System: 12 4096x32768 Buffer: 1024 MB avg_time_per_step: 10.981 ms std_error: 0.080 num_iter: 3 benchmark: 95486 bandwidth: 364.3
VkFFT System: 13 8192x16384 Buffer: 1024 MB avg_time_per_step: 11.638 ms std_error: 1.004 num_iter: 3 benchmark: 90095 bandwidth: 343.7
- half precision
$ ./VkFFT_TestSuite -vkfft 2
2 - VkFFT FFT + iFFT C2C benchmark 1D batched in half precision
VkFFT System: 3 8x16777216 Buffer: 512 MB avg_time_per_step: 5.755 ms std_error: 0.054 num_iter: 2 benchmark: 182207 bandwidth: 347.5
VkFFT System: 4 16x8388608 Buffer: 512 MB avg_time_per_step: 5.855 ms std_error: 0.018 num_iter: 2 benchmark: 179090 bandwidth: 341.6
VkFFT System: 5 32x4194304 Buffer: 512 MB avg_time_per_step: 6.394 ms std_error: 0.066 num_iter: 2 benchmark: 163989 bandwidth: 312.8
VkFFT System: 6 64x2097152 Buffer: 512 MB avg_time_per_step: 6.731 ms std_error: 0.127 num_iter: 2 benchmark: 155790 bandwidth: 297.1
VkFFT System: 7 128x1048576 Buffer: 512 MB avg_time_per_step: 7.127 ms std_error: 0.059 num_iter: 2 benchmark: 147137 bandwidth: 280.6
VkFFT System: 8 256x524288 Buffer: 512 MB avg_time_per_step: 5.517 ms std_error: 0.100 num_iter: 2 benchmark: 190062 bandwidth: 362.5
VkFFT System: 9 512x262144 Buffer: 512 MB avg_time_per_step: 5.403 ms std_error: 0.004 num_iter: 2 benchmark: 194084 bandwidth: 370.2
VkFFT System: 10 1024x131072 Buffer: 512 MB avg_time_per_step: 5.575 ms std_error: 0.168 num_iter: 2 benchmark: 188085 bandwidth: 358.7
VkFFT System: 11 2048x65536 Buffer: 512 MB avg_time_per_step: 5.560 ms std_error: 0.126 num_iter: 2 benchmark: 188598 bandwidth: 359.7
VkFFT System: 12 4096x32768 Buffer: 512 MB avg_time_per_step: 5.590 ms std_error: 0.108 num_iter: 2 benchmark: 187580 bandwidth: 357.8
VkFFT System: 13 8192x16384 Buffer: 512 MB avg_time_per_step: 10.461 ms std_error: 1.182 num_iter: 2 benchmark: 100241 bandwidth: 191.2
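The bandwidth column in the VkFFT output can be reproduced from the buffer size and the time per step. The factor of 4 below is inferred from the numbers in this post (it matches the reported values) and is not taken from the VkFFT documentation; a plausible reading is one read and one write of the buffer for each of the FFT and the inverse FFT.

```python
def vkfft_bandwidth(buffer_mb: float, avg_time_ms: float, passes: int = 4) -> float:
    """Buffer traffic per second, assuming `passes` full sweeps over the buffer per step.

    passes=4 is an inference from the output above, not a documented constant.
    """
    return passes * (buffer_mb / 1024) / (avg_time_ms / 1000)


# (buffer MB, avg_time_per_step ms, reported bandwidth) copied from the output above.
rows = [
    (1024, 10.984, 364.2),
    (1024, 10.816, 369.8),
    (512, 5.403, 370.2),
    (512, 10.461, 191.2),
]

for buffer_mb, time_ms, reported in rows:
    computed = vkfft_bandwidth(buffer_mb, time_ms)
    print(f"{buffer_mb:>4} MB, {time_ms:>6.3f} ms: "
          f"computed {computed:.1f}, reported {reported}")
```

Read this way, most sizes run at roughly 80% of the card's 456 GB/s memory bandwidth, and the half-precision 8192x16384 case stands out as the one that drops to about half of that.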