掘金 后端 ( ) • 2024-04-23 19:59

以下来自本人拉的一个关于 Java 技术的讨论群。关注公众号:hashcon,私信进群拉你

Hotspot JVM 进程启动后,流量到来的时候 JIT 吃掉很多 CPU,如何观察到?

很多途径都能观察到:

  1. top -Hp:这个需要一直 dump 并保存在另一个地方,有可能采集不到需要的时间点,看的不明显。但是大部分情况能看出趋势。
  2. 通过 JFR 的事件看到:这个是 JVM 自带的机制,可以通过 Thread CPU Load 看出来,这个和 top -Hp 的原理差不多,也有可能采集不到需要的时间点,看的不明显。但是大部分情况能看出趋势。
  3. 通过 safepoint 日志可以看出:我们知道进入 safepoint 对于 JVM 是一个非常有消耗的事情(GC 中的 Stop-the-world 阶段就是要所有应用线程暂停进入安全点),JIT 的频繁大量编译也会触发全局安全点。

我们看一下 JFR 中的事件(截图由 @吃泡面的女娲 https://space.bilibili.com/395445925 提供,感谢):

image

通过 -Xlog:safepoint=debug:file=${LOG_PATH}/safepoint%t.log:utctime,level,tags:filecount=10,filesize=10M 这个参数观察 safepoint 日志,可以看到流量第一次到达的时候,有大量的 ICBufferFull 这种 JIT 相关:

[2022-06-20T07:13:22.518+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 505901458 ns, Reaching safepoint: 203535 ns, At safepoint: 15007 ns, Total: 218542 ns
[2022-06-20T07:13:22.745+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 226749494 ns, Reaching safepoint: 229619 ns, At safepoint: 11684 ns, Total: 241303 ns
[2022-06-20T07:13:22.938+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 192997437 ns, Reaching safepoint: 207982 ns, At safepoint: 12948 ns, Total: 220930 ns
[2022-06-20T07:13:23.126+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 187535899 ns, Reaching safepoint: 300756 ns, At safepoint: 14449 ns, Total: 315205 ns
[2022-06-20T07:13:23.247+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 120311192 ns, Reaching safepoint: 249434 ns, At safepoint: 14875 ns, Total: 264309 ns
[2022-06-20T07:13:23.340+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 92724282 ns, Reaching safepoint: 232015 ns, At safepoint: 12110 ns, Total: 244125 ns
[2022-06-20T07:13:23.411+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 70987313 ns, Reaching safepoint: 211428 ns, At safepoint: 13071 ns, Total: 224499 ns
[2022-06-20T07:13:23.635+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 223611196 ns, Reaching safepoint: 254973 ns, At safepoint: 18280 ns, Total: 273253 ns
[2022-06-20T07:13:23.749+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 114334671 ns, Reaching safepoint: 222406 ns, At safepoint: 13293 ns, Total: 235699 ns
[2022-06-20T07:13:23.770+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 20734555 ns, Reaching safepoint: 226780 ns, At safepoint: 15401 ns, Total: 242181 ns
[2022-06-20T07:13:23.792+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 21983900 ns, Reaching safepoint: 205602 ns, At safepoint: 14367 ns, Total: 219969 ns
[2022-06-20T07:13:23.825+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 32108657 ns, Reaching safepoint: 242410 ns, At safepoint: 16345 ns, Total: 258755 ns
[2022-06-20T07:13:23.866+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 41107333 ns, Reaching safepoint: 217483 ns, At safepoint: 14228 ns, Total: 231711 ns
[2022-06-20T07:13:23.895+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 29113807 ns, Reaching safepoint: 300896 ns, At safepoint: 12997 ns, Total: 313893 ns
[2022-06-20T07:13:24.137+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 240862135 ns, Reaching safepoint: 194731 ns, At safepoint: 12127 ns, Total: 206858 ns
[2022-06-20T07:13:24.510+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 372782412 ns, Reaching safepoint: 220987 ns, At safepoint: 13850 ns, Total: 234837 ns
[2022-06-20T07:13:24.977+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 466883979 ns, Reaching safepoint: 232573 ns, At safepoint: 12644 ns, Total: 245217 ns
[2022-06-20T07:13:25.434+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 457259363 ns, Reaching safepoint: 226976 ns, At safepoint: 14064 ns, Total: 241040 ns
[2022-06-20T07:13:25.713+0000][info ][safepoint] Safepoint "G1CollectForAllocation", Time since last: 244672157 ns, Reaching safepoint: 156750 ns, At safepoint: 34190733 ns, Total: 34347483 ns
[2022-06-20T07:13:26.202+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 487912565 ns, Reaching safepoint: 381822 ns, At safepoint: 15451 ns, Total: 397273 ns
[2022-06-20T07:13:26.593+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 391357644 ns, Reaching safepoint: 291855 ns, At safepoint: 14654 ns, Total: 306509 ns
[2022-06-20T07:13:27.346+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 752212094 ns, Reaching safepoint: 266156 ns, At safepoint: 14112 ns, Total: 280268 ns
[2022-06-20T07:13:28.110+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 764204652 ns, Reaching safepoint: 293265 ns, At safepoint: 13933 ns, Total: 307198 ns
[2022-06-20T07:13:28.701+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 590275844 ns, Reaching safepoint: 218853 ns, At safepoint: 12948 ns, Total: 231801 ns
[2022-06-20T07:13:29.255+0000][info ][safepoint] Safepoint "ICBufferFull", Time since last: 554270782 ns, Reaching safepoint: 228667 ns, At safepoint: 12594 ns, Total: 241261 ns
[2022-06-20T07:13:29.560+0000][info ][safepoint] Safepoint "RedefineClasses", Time since last: 274809435 ns, Reaching safepoint: 259527 ns, At safepoint: 30060483 ns, Total: 30320010 ns

CPU Usage 与 Throttling

对于因为 CPU 出现的性能瓶颈,通过 CPU Usage 可能看不出来:CPU Usage 的实现,只能看平均值,最短周期是 1s,瞬时流量看不出来。这个只能通过 CPU Throttling 看出来。

举个比较极端的例子来说明这两个指标:比如两个线程,每个执行,每个都需要占满 CPU 跑 200ms,只有一个 CPU,不考虑 Linux 线程抢断式切换以及调度算法。 从 CPU Usage 看,就是 40%(200+200)/1000 从 CPU Throttling 看,这时候 CPU 限流发生了 200ms,就是一个线程等了 200ms。

再举一个微服务的例子,这个微服务在高峰的时候,已经有瓶颈,接口开始超过 1s,通过 CPU Usage 看,问题不大,但是其实从 CPU Throttling 看打,已经比较高了:

CPU Usageimage CPU Throttlingimage

如何观察到 CPU Throtlling?

  1. 通过 JFR 的事件看到:JFR 的事件包含 CPU Throttling。
  2. JFR 底层也是通过看 /proc 下对应进程的信息知道的cat /sys/fs/cgroup/cpu,cpuacct/cpu.stat
    • nr_periods:经历的 cgroup 周期数
    • nr_throttled:进程所有线程发生 CPU 限流的次数
    • throttled_time:进程所有线程被限流的时间(ns)

微服务预热

我们主要做了3件事:

  1. 启动的时候预热微服务之后再注册:初始化链接池,线程池,然后,多线程本地调用 health 接口 50001 次(超过各种 jit 编译界限),之后 health 接口中的 Readiness 才会就绪,才会注册到注册中心( https://zhuanlan.zhihu.com/p/452533097
  2. 针对 JVM:使用 CDS 加速类加载(我们的微服务发布之后,10 分钟后,dump APPCDS,用于下次启动新实例) Graal JIT 替换 C2。对于无存储 io 的微服务,使用 CRaC(同理,也是微服务发布之后,10 分钟后,dump APPCDS,用于下次启动新实例)
  3. 客户端,调用的时候,检查实例注册时间,使用 beta 函数,减少调用到新实例的比例,最大 3 分钟。

个人简介:个人业余研究了 AI LLM 微调与 RAG,目前成果是微调了三个模型:

  1. 一个模型是基于 whisper 模型的微调,使用我原来做的精翻的视频按照语句段落切分的片段,并尝试按照方言类别,以及技术类别分别尝试微调的成果。用于视频字幕识别。
  2. 一个模型是基于 Mistral Large 的模型的微调,识别提取视频课件的片段,辅以实际的课件文字进行识别微调。用于识别课件的片段。
  3. 最后一个模型是基于 Claude 3 的模型微调,使用我之前制作的翻译字幕,与 AWS、Go 社区、CNCF 生态里面的官方英文文档以及中文文档作为语料,按照内容段交叉拆分,进行微调,用于字幕翻译。

目前,准确率已经非常高了。大家如果有想要我制作的视频,欢迎关注留言。

本人也是开源代码爱好者,贡献过很多项目的源码(Mycat 和 Java JFRUnit 的核心贡献者,贡献过 OpenJDK,Spring,Spring Cloud,Apache Bookkeeper,Apache RocketMQ,Ribbon,Lettuce、 SocketIO、Langchain4j 等项目 ),同时也是深度技术迷,编写过很多硬核的原理分析系列(JVM)。本人也有一个 Java 技术交流群,感兴趣的欢迎关注。

另外,一如即往的是,全网的所有收益,都会捐赠给希望工程,坚持靠爱与兴趣发电。