Juejin • 2024-04-24 10:39

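Introduction

uPIMulator is a cycle-level hardware simulator for the UPMEM DPU architecture, developed by Bongjoon Hyun et al. at KAIST. It was presented at HPCA 2024 on March 4, 2024, where it received the Best Paper Award.
Project repository: https://github.com/VIA-Research/uPIMulator
Original paper: https://arxiv.org/pdf/2308.00846.pdf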

About uPIMulator

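uPIMulator combines the LLVM-based compilation toolchain of the UPMEM SDK with a self-developed, cycle-level hardware performance simulator. The simulator bundled with the UPMEM SDK is tied to the structure of the real hardware; uPIMulator removes that restriction, which makes it possible to explore extensions to the hardware architecture.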

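Software compilation toolchain

The open-source UPMEM SDK compiler (dpu-upmem-dpurte-clang) follows the same workflow as a standard C compiler. It takes the programmer's source code together with a UPMEM-PIM-compatible, glibc-style C library (for example, mem_alloc, which allocates from the DPU's WRAM, is used in place of malloc), preprocesses, compiles, and assembles them into binary objects, and finally links these into a single UPMEM-PIM binary executable.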

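uPIMulator reuses the UPMEM SDK preprocessor and compiler to compile the DPU-side source code and the DPU-side glibc-style C library down to assembly. Both are then fed into uPIMulator's custom linker (the Linker in the paper and the source code), which combines them, performs lexical analysis, parsing, and liveness analysis, and finally produces the fully linked binary program, dumped separately for MRAM, WRAM, and IRAM.

The assembler (the Assembler in the paper and the source code) randomly generates test data for the selected benchmark type and dumps it according to where it will be stored in the DPU.

Hardware performance simulator

Default parameters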

DPU processor architecture
  • Operating frequency: 350 MHz
  • Number of pipeline stages: 14
  • Revolver scheduling cycles: 11
  • WRAM / IRAM size: 64 KB / 24 KB
  • WRAM / IRAM access latency: 1 cycle
  • WRAM / IRAM access granularity: 4 / 6 B per clock
  • WRAM / IRAM access bandwidth: 1,400 / 2,100 MB/sec
  • Atomic memory size: 256 bits

DRAM system
  • MRAM size: 64 MB
  • DDR specification: DDR4-2400
  • Memory scheduling policy: FR-FCFS
  • Row buffer size: 1 KB
  • tRCD, tRAS, tRP, tCL, tBL: 16, 39, 16, 16, 4 cycles

Communication
  • CPU→DPU bandwidth (per rank): 0.296 GB/s per DPU
  • CPU←DPU bandwidth (per rank): 0.063 GB/s per DPU

Software architecture
  • Number of general-purpose registers: 24
  • Maximum number of threads: 24
  • Stack size (per thread): 2 KB
  • Heap size: 4 KB
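
The WRAM / IRAM bandwidth figures in the default parameters follow from the access granularity and the operating frequency. A small sanity check, assuming peak bandwidth = bytes per clock × frequency in MHz (the function name bandwidth_mb_per_sec is illustrative and not part of uPIMulator):

```shell
#!/bin/bash
# Illustrative helper (not part of uPIMulator): peak bandwidth in MB/s,
# assuming one access of the given granularity completes every clock.
bandwidth_mb_per_sec() {
    local freq_mhz=$1 bytes_per_clock=$2
    echo $(( freq_mhz * bytes_per_clock ))
}

echo "WRAM: $(bandwidth_mb_per_sec 350 4) MB/s"   # prints "WRAM: 1400 MB/s"
echo "IRAM: $(bandwidth_mb_per_sec 350 6) MB/s"   # prints "IRAM: 2100 MB/s"
```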

Test results

#!/bin/bash

# Paths to the uPIMulator root directory and the binary output directory
ROOT_DIRPATH="/home/asong/桌面/uPIMulator/golang/uPIMulator"
BIN_DIRPATH="${ROOT_DIRPATH}/bin"

# Benchmark name and simulation parameters
VERBOSE=0
BENCHMARK="VA"
NUM_CHANNELS=1
NUM_RANKS_PER_CHANNEL=1
NUM_DPUS_PER_RANK=1
NUM_TASKLETS=1
DATA_PREP_PARAMS=2048

# Recreate the bin directory from scratch so files from earlier runs are removed
rm -rf "${BIN_DIRPATH}"
mkdir -p "${BIN_DIRPATH}"

# Run uPIMulator (nothing may follow the line-continuation backslashes)
./build/uPIMulator --verbose "${VERBOSE}" \
                   --root_dirpath "${ROOT_DIRPATH}" \
                   --bin_dirpath "${BIN_DIRPATH}" \
                   --benchmark "${BENCHMARK}" \
                   --num_channels "${NUM_CHANNELS}" \
                   --num_ranks_per_channel "${NUM_RANKS_PER_CHANNEL}" \
                   --num_dpus_per_rank "${NUM_DPUS_PER_RANK}" \
                   --num_tasklets "${NUM_TASKLETS}" \
                   --data_prep_params "${DATA_PREP_PARAMS}"

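With the shell script above, you can vary the number of channels, ranks, DPUs, and tasklets per DPU, or change the dataset size, to analyze the DPU's runtime behavior and locate bottlenecks in execution. The output of a single test run is shown below: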

NUM_CHANNELS=1
NUM_RANKS_PER_CHANNEL=1
NUM_DPUS_PER_RANK=1
NUM_TASKLETS=1
DATA_PREP_PARAMS=524288

ThreadScheduler[0_0_0]_breakdown_etc: 37233714
ThreadScheduler[0_0_0]_breakdown_run: 3723370
ThreadScheduler[0_0_0]_breakdown_dma: 4489211
Logic[0_0_0]_num_instructions: 3723370
Logic[0_0_0]_active_tasklets_0: 4556810
Logic[0_0_0]_active_tasklets_1: 40889485
Logic[0_0_0]_logic_cycle: 45446295
CycleRule[0_0_0]_cycle_rule: 20497
MemoryController[0_0_0]_memory_cycle: 272677770
MemoryScheduler[0_0_0]_num_fcfs: 743424
MemoryScheduler[0_0_0]_num_fr: 32768
RowBuffer[0_0_0]_num_activations: 10240
RowBuffer[0_0_0]_num_precharges: 10239
RowBuffer[0_0_0]_num_writes: 262144
RowBuffer[0_0_0]_write_bytes: 2097152
RowBuffer[0_0_0]_num_reads: 524288
RowBuffer[0_0_0]_read_bytes: 4194304
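
Varying one parameter at a time, as described above, can be scripted as a loop. The following is a minimal sketch that sweeps the tasklet count while reusing the flags from the script above. The names sweep_tasklets and DRY_RUN are illustrative, and the sketch assumes that uPIMulator writes its statistics to standard output, which should be checked against your build. By default it only prints the commands it would run.

```shell
#!/bin/bash
# Sketch: sweep the tasklet count for one benchmark.
# sweep_tasklets and DRY_RUN are illustrative names, not part of uPIMulator.
ROOT_DIRPATH="${ROOT_DIRPATH:-$PWD}"
BIN_DIRPATH="${BIN_DIRPATH:-$ROOT_DIRPATH/bin}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

sweep_tasklets() {
    local benchmark=$1 data_prep_params=$2
    shift 2
    local n cmd
    for n in "$@"; do
        cmd=(./build/uPIMulator --verbose 0
             --root_dirpath "$ROOT_DIRPATH"
             --bin_dirpath "$BIN_DIRPATH"
             --benchmark "$benchmark"
             --num_channels 1
             --num_ranks_per_channel 1
             --num_dpus_per_rank 1
             --num_tasklets "$n"
             --data_prep_params "$data_prep_params")
        if [ "$DRY_RUN" = "1" ]; then
            printf '%s\n' "${cmd[*]}"
        else
            # Recreate the bin directory for every run, as the original script does.
            rm -rf "${BIN_DIRPATH:?}" && mkdir -p "$BIN_DIRPATH"
            # Assumption: the statistics are printed to stdout.
            "${cmd[@]}" > "tasklets_${n}.log"
        fi
    done
}

sweep_tasklets VA 2048 1 2 4 8 16
```

Set DRY_RUN=0 to execute the runs; each run then leaves one log file per tasklet count for later comparison.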

IPC (instructions per cycle)

Formula: IPC = (value of num_instructions) / (value of logic_cycle)
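
As a worked example, the IPC of the sample run can be computed directly from the log. This is a minimal sketch, assuming the log contains "key: value" lines in the format shown above and covers a single DPU; the function name ipc_from_log is illustrative.

```shell
#!/bin/bash
# Illustrative helper: print num_instructions / logic_cycle,
# reading "key: value" lines from stdin (single-DPU log assumed).
ipc_from_log() {
    awk -F': *' '
        /_num_instructions:/ { instr = $2 }
        /_logic_cycle:/      { cycles = $2 }
        END {
            if (cycles > 0) printf "%.4f\n", instr / cycles
            else { print "logic_cycle missing" > "/dev/stderr"; exit 1 }
        }'
}

# Values copied from the sample output above.
ipc_from_log <<'EOF'
Logic[0_0_0]_num_instructions: 3723370
Logic[0_0_0]_logic_cycle: 45446295
EOF
# prints 0.0819
```

For this run, 3723370 / 45446295 ≈ 0.0819. With a single tasklet and 11 revolver scheduling cycles, one would expect the IPC to stay below roughly 1/11 ≈ 0.0909, and the measured value is consistent with that.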

Breakdown of DPU’s runtime

Formulas:

  • Issuable ratio = (value of breakdown_run) / (value of logic_cycle)
  • Idle (Memory) ratio = (value of breakdown_dma) / (value of logic_cycle)
  • Idle (Revolver) ratio = (value of breakdown_etc) / (value of logic_cycle)
  • Idle (RF) ratio = (value of backpressure) / (value of logic_cycle)
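
The four ratios above can be computed from the same log. This is a minimal sketch under two assumptions: the log covers a single DPU, and the backpressure counter appears under a key containing "backpressure" (the sample output above has no such line, so the sketch treats a missing counter as 0). The function name runtime_breakdown is illustrative.

```shell
#!/bin/bash
# Illustrative helper: print each runtime component as a share of logic_cycle.
# A counter missing from the log (here: backpressure) is treated as 0.
runtime_breakdown() {
    awk -F': *' '
        /_breakdown_run:/ { run = $2 }
        /_breakdown_dma:/ { dma = $2 }
        /_breakdown_etc:/ { etc = $2 }
        /backpressure:/   { rf = $2 }      # assumed key name
        /_logic_cycle:/   { cycles = $2 }
        END {
            if (cycles <= 0) exit 1
            printf "issuable %.2f%%\n",      100 * run / cycles
            printf "idle_memory %.2f%%\n",   100 * dma / cycles
            printf "idle_revolver %.2f%%\n", 100 * etc / cycles
            printf "idle_rf %.2f%%\n",       100 * rf  / cycles
        }'
}

# Values copied from the sample output above.
runtime_breakdown <<'EOF'
ThreadScheduler[0_0_0]_breakdown_etc: 37233714
ThreadScheduler[0_0_0]_breakdown_run: 3723370
ThreadScheduler[0_0_0]_breakdown_dma: 4489211
Logic[0_0_0]_logic_cycle: 45446295
EOF
```

For the sample run this gives about 8.19% issuable, 9.88% idle (memory), and 81.93% idle (revolver). The three counters add up to exactly logic_cycle (37233714 + 3723370 + 4489211 = 45446295), which matches the absence of a backpressure entry in the log.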
