Issue
I created a single-node container. After it had been running for a few days, the node stopped working, but I could not find any problems in the cnosdb logs or the system logs. What happened?
I checked the cnosdb log; the last entry is timestamped 2024-01-20 20:32:53, as follows:
2024-01-20T20:32:53.963266264Z INFO tskv::compaction::picker: Picker: Calculate level scores: [ { Level-1: 1.6 }, { Level-4: 1.6 }, { Level-3: 0.4 } ]
2024-01-20T20:32:53.963393433Z INFO tskv::compaction::picker: Picker: picked level: 1 to 2
2024-01-20T20:32:53.965405243Z INFO tskv::compaction::picker: Picker: picked L1 files(2) does not reach trigger(4), return None
2024-01-20T20:32:53.965539938Z INFO tskv::compaction::job: Starting compaction on ts_family 3
2024-01-20T20:32:53.965603168Z INFO tskv::compaction::picker: Picker: picked no level
2024-01-20T20:32:53.966003610Z INFO tskv::compaction::job: Compacting on vnode(job start): {12: true, 6: true, 3: true} costs 0 sec
I then checked the system log with dmesg -T, as follows:
[Sat Jan 20 10:05:16 2024] [36395] 89 36395 22974 268 48 0 0 pickup
[Sat Jan 20 10:05:16 2024] Out of memory: Kill process 32449 (tokio-runtime-w) score 671 or sacrifice child
[Sat Jan 20 10:05:16 2024] Killed process 32449 (tokio-runtime-w), UID 0, total-vm:323738056kB, anon-rss:179339976kB, file-rss:0kB, shmem-rss:0kB
[Sat Jan 20 10:06:14 2024] docker0: port 1(veth798dd8b) entered disabled state
[Sat Jan 20 10:06:14 2024] docker0: port 1(veth798dd8b) entered disabled state
The timestamp of the most recent OOM entry does not match the time of the last cnosdb log entry. So why did the cnosdb service go down?
Solution
Follow-up testing that deliberately triggered an OOM showed that the timestamps the system records for OOM entries (as printed by dmesg -T) can deviate from the actual system time. Applying that offset to the time of the last log line printed by cnosdb, we located the matching entry in the system log, and it records the CnosDB process being killed by the OOM killer. The test went as follows:
- Last system log entry before triggering the OOM
dmesg -T
...
[Wed Jan 24 20:53:41 2024] docker0: port 7(xxx) entered forwarding state
- Current time
[root@xxx ~]# date
Mon Jan 29 10:30:39 CST 2024
- Trigger the OOM
[root@cicd_ujv23 ~]# stress --vm 10 --vm-bytes 25G --vm-keep
stress: info: [38900] dispatching hogs: 0 cpu, 0 io, 10 vm, 0 hdd
stress: FAIL: [38900] (415) <-- worker 38904 got signal 9
stress: WARN: [38900] (417) now reaping child worker processes
stress: FAIL: [38900] (415) <-- worker 38910 got signal 9
stress: WARN: [38900] (417) now reaping child worker processes
stress: FAIL: [38900] (451) failed run completed in 40s
- Check the OOM log
dmesg -T
[Sun Jan 28 11:59:09 2024] Out of memory: Kill process 38910 (stress) score 92 or sacrifice child
[Sun Jan 28 11:59:09 2024] Killed process 38910 (stress), UID 0, total-vm:26221716kB, anon-rss:25472716kB, file-rss:0kB, shmem-rss:0kB
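The offset used above can be estimated from the two timestamps in the test: the `date` output taken just before the OOM was triggered and the dmesg -T timestamp of the resulting OOM entry. The following is a minimal sketch, assuming GNU date is available; the function names `to_epoch`, `dmesg_offset`, and `correct_dmesg_time` are hypothetical helpers introduced here for illustration. Both inputs are parsed as UTC purely so they are treated identically; the result is only approximate, because the `date` command ran shortly before the OOM actually occurred.

```shell
# Sketch only: estimate how far dmesg -T timestamps lag behind the wall
# clock, then shift a dmesg timestamp by that amount. Requires GNU date.

# to_epoch TIMESTAMP: convert a timestamp string to epoch seconds.
# Parsing with -u treats every input as UTC, so differences are consistent.
to_epoch() {
  date -u -d "$1" +%s
}

# dmesg_offset WALL_CLOCK DMESG_TIME: seconds to add to a dmesg -T timestamp
# so that it lines up with the wall clock.
dmesg_offset() {
  echo $(( $(to_epoch "$1") - $(to_epoch "$2") ))
}

# correct_dmesg_time DMESG_TIME OFFSET_SECONDS: print the adjusted timestamp.
correct_dmesg_time() {
  date -u -d "@$(( $(to_epoch "$1") + $2 ))" '+%Y-%m-%d %H:%M:%S'
}

# Values taken from the test above (time zone suffix dropped on purpose).
offset=$(dmesg_offset "Mon Jan 29 10:30:39 2024" "Sun Jan 28 11:59:09 2024")
echo "offset: ${offset}s"

# Shift the OOM entry's dmesg timestamp by the estimated offset.
correct_dmesg_time "Sun Jan 28 11:59:09 2024" "$offset"
```

The offset is not necessarily constant over the lifetime of a machine, so it is safer to measure it close to the time of the incident and to treat the corrected time as a hint for where to look in the system log rather than an exact value.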
Answered By - Baker X Answer Checked By - David Goodson (WPSolving Volunteer)