Issue
I have a weird one that I've spent a few days on without any luck. Coming here in the hopes that someone can point me in a direction or has seen something similar.
So, what is happening is that I have two Debian VMs on GCP running MySQL 8; call them 'Processing Server' and 'History Server'. Processing Server processes ~1 million messages a day and inserts each message into one of ~6,000 tables on History Server (using Python with pyMySQL). This ran fine for almost a month without an issue, and then suddenly, every 0-30 minutes, I get connection timeouts from History Server:
"OperationalError(2003, "Can't connect to MySQL server on 'history.server' (timed out)")".
I am using PMM and have dug through all my metrics/logs around this time and found nothing related to MySQL (no weird spikes or dips in anything obvious, nothing maxed out).
After digging through MySQL and all available logs looking for correlations, I suspect a network or server issue, but I have no clue where to look. I am using GCP virtual machines running Debian, and I don't think anything changed around the time this started. I also don't see anything obvious in the various logs I've looked at on my VMs, and no resource spikes (disk, RAM, CPU). It's so strange because my Python connector just randomly can't connect to the server. I know this is a long shot, but does anyone have a direction they might look in? Happy to provide any more information that might be helpful.
I created a test script on Processing Server that hits History Server with the same query that is failing; just spamming the query in a loop, I can reproduce the connection timeout within 0-10 minutes. If I make that same program open and close a connection to MySQL without executing the stored procedure, I can see it occasionally hang for a couple of seconds, but generally not long enough to trigger the same timeout.
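The test script boils down to something like this (same placeholder connection details as above), logging the wall-clock time of each failure so it can be lined up with other logs:

import time
import pymysql

while True:
    started = time.time()
    try:
        conn = pymysql.connect(host="history.server", user="writer", password="secret",
                               database="history", connect_timeout=10)
        with conn.cursor() as cur:
            cur.callproc("store_message", ("test_table", "test payload"))
        conn.close()
    except pymysql.err.OperationalError as exc:
        # Record exactly when the timeout happened and how long the connect attempt took
        print(time.strftime("%Y-%m-%d %H:%M:%S"),
              "failed after %.1fs: %s" % (time.time() - started, exc))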
No clue where to look. I wish there were somewhere in MySQL where I could see rejected/failed connections and try to correlate them with ANYTHING. None of the graphs or logs available to me show any meaningful pattern.
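The closest thing I have found is the connection counters in SHOW GLOBAL STATUS. A small watcher like the sketch below (placeholder credentials, pointed at History Server) would at least show whether Aborted_connects or any Connection_errors_% counter moves when a timeout happens; the caveat is that a connect attempt that never reaches mysqld at all will not increment them.

import time
import pymysql

while True:
    conn = pymysql.connect(host="history.server", user="monitor", password="secret",
                           connect_timeout=5)
    with conn.cursor() as cur:
        for pattern in ("Aborted_connects", "Connection_errors_%"):
            cur.execute("SHOW GLOBAL STATUS LIKE %s", (pattern,))
            for name, value in cur.fetchall():
                print(time.strftime("%Y-%m-%d %H:%M:%S"), name, value)
    conn.close()
    time.sleep(10)  # poll every 10 seconds and watch for counters that jump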
Systems Data:
History Server
- OS: Debian GNU/Linux 11 (bullseye)
- RAM: 12 GB
- Cores: 2 vCPUs, 2 threads/core
- Disk: 750 GB "Balanced persistent disk"
  - Read IOPS per instance: 3,000
  - Write IOPS per instance: 3,000
  - Read throughput per instance: 140 MB/s
  - Write throughput per instance: 140 MB/s
- Host: GCP Compute Engine
- Data dumps:
  - Table Info: https://justpaste.it/3qapn
  - Global Status: https://justpaste.it/dfm6j
  - Global Variables: https://justpaste.it/87sl8
  - Process List: https://justpaste.it/3xh3f
  - Status: https://justpaste.it/1oka4
  - InnoDB Status: https://justpaste.it/3wdck
Processing Server
- OS: Debian GNU/Linux 10 (buster)
- RAM: 32 GB
- Cores: 8 vCPUs, 2 threads/core
- Disk: 1,000 GB "SSD persistent disk"
  - Read IOPS per instance: 15,000
  - Write IOPS per instance: 9,000
  - Read throughput per instance: 240 MB/s
  - Write throughput per instance: 204 MB/s
- Host: GCP Compute Engine
- Data dumps:
  - Table Info: https://justpaste.it/8o85k
  - Global Status: https://justpaste.it/7eukf
  - Global Variables: https://justpaste.it/df3z4
  - Process List: https://justpaste.it/bl4u2
  - Status: https://justpaste.it/4b0dj
  - InnoDB Status: https://justpaste.it/91z2g
I have pored through all available system & MySQL logs, used Percona Monitoring and Management (PMM) to look for spikes, dips, or maxed/min'd-out values, and watched resources with 'top' as well as GCP's Compute Engine monitoring, all with no luck. I see no correlation between the timestamps of my query timeouts and anything significant from these sources, and no obvious patterns. There is also no pattern in the timing between timeouts (anywhere between 0 and 30 minutes, but nothing consistent).
Solution
Rate Per Second = RPS
Suggestions to consider for your History Server Database flags.
connect_timeout=20  # from the 10-second limit, for more tolerance to complete a connect
innodb_parallel_read_threads=2  # from 4, because you only have 2 cores
table_open_cache_instances=2  # from 16, because you only have 2 cores
innodb_lru_scan_depth=100  # from 1024, to conserve ~90% of the CPU cycles used for this function
innodb_log_buffer_size=67108864  # from 16M, to reduce innodb_os_log_written RPS of 30,938
read_rnd_buffer_size=16384  # from 256K, to reduce handler_read_rnd_next RPS of 505
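As a sketch, these would go under the [mysqld] section of the server's option file (typically /etc/mysql/mysql.conf.d/mysqld.cnf on Debian), followed by a MySQL restart so the non-dynamic settings take effect:

[mysqld]
connect_timeout              = 20
innodb_parallel_read_threads = 2
table_open_cache_instances   = 2
innodb_lru_scan_depth        = 100
innodb_log_buffer_size       = 67108864
read_rnd_buffer_size         = 16384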
There are other opportunities to improve performance with additional changes.
Observations:
innodb_flush_method is O_DIRECT_NO_FSYNC, and we normally observe O_DIRECT.
For your workload, more cores would be helpful in completing tasks.
Com_create_table reported 2,995,059 tables created in 67 hours (2,995,059 / 241,200 seconds ≈ 12 per second), which seems extreme. No Com_drop_table was reported in those 2+ days.
Answered By - Wilson Hauck
Answer Checked By - Willingham (WPSolving Volunteer)