Issue
The OS is RHEL 6 (2.6.32). I have isolated a core and am running a compute-intensive thread on it. /proc/{thread-id}/status shows one non-voluntary context switch every second.
The thread in question is a SCHED_NORMAL thread and I don't want to change this.
How can I reduce this number of non-voluntary context switches? Does this depend on any scheduling parameters in /proc/sys/kernel?
EDIT: Several responses suggest alternative approaches. Before going that route, I first want to understand why I am getting exactly one non-voluntary context switch per second, even over hours of runtime. For example, is this caused by CFS? If so, which parameters are involved, and how?
EDIT2: Further clarification - first question I would like an answer to is the following: Why am I getting one non-voluntary context switch per second instead of, say, one switch every half or two seconds?
Solution
This is a guess, but an educated one. Since you use an isolated CPU, the scheduler does not schedule any task on it except your own, with one exception: the vmstat code in the kernel has a timer that queues a single work queue item on each CPU once per second to calculate memory usage statistics. That work item is what you are seeing get scheduled each second.
The work queue code is smart enough not to schedule the work queue kernel thread if the core is 100% idle, but it makes no such exception when the core is running a single task.
You can verify this using ftrace. Look at which entity the sched_switch tracer shows you switching to once every second or so. (The interval is rounded to the nearest jiffy, and the timer does not count while the CPU is idle, so the timing may be skewed.) If that entity is the events/CPU_NUMBER task (or keventd on older kernels), then it is almost certain that the cause is the vmstat_update function setting its timer to queue a work queue item every second, which the events kernel thread then runs.
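As a rough illustration of that check, here is a minimal Python sketch that tallies which tasks a given CPU switched to, based on sched_switch lines copied out of the ftrace buffer. The function name count_switch_targets is made up for this example, and the parser assumes the key=value event format ("==> next_comm=... next_pid=..."). Your kernel may print the event differently, in which case the pattern needs adjusting.

```python
import re
from collections import Counter

# Matches the "==> next_comm=NAME next_pid=PID" part of a sched_switch line.
# Assumption: the key=value event format. Some kernels print the event
# differently, so adjust this pattern to whatever your trace output shows.
NEXT_TASK = re.compile(r"==> next_comm=(.+?) next_pid=(\d+)")

# Matches the "[003]" CPU column that ftrace prints on each event line.
CPU_FIELD = re.compile(r"\[(\d+)\]")


def count_switch_targets(trace_lines, cpu):
    """Count how often each task was switched to on the given CPU.

    count_switch_targets is a hypothetical helper, not part of ftrace.
    """
    targets = Counter()
    for line in trace_lines:
        if "sched_switch" not in line:
            continue  # ignore other events such as sched_wakeup
        cpu_match = CPU_FIELD.search(line)
        next_match = NEXT_TASK.search(line)
        if not cpu_match or not next_match:
            continue  # line is not in the format this sketch assumes
        if int(cpu_match.group(1)) != cpu:
            continue  # event happened on some other CPU
        targets[next_match.group(1)] += 1
    return targets
```

To use it, feed it the lines of the trace file from the tracing directory (typically /sys/kernel/debug/tracing/trace when debugfs is mounted in the usual place) together with the number of your isolated CPU. If events/CPU_NUMBER shows up about once per second of trace time, that supports the vmstat explanation.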
Note that the interval at which vmstat sets its timer is configurable: you can set it to another value via the vm.stat_interval sysctl knob. Increasing this value will give you a lower rate of such interruptions, at the cost of less accurate memory usage statistics.
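To see whether changing the knob actually moves the number, you could compare the configured interval with the measured switch rate. The following is a minimal Python sketch under a few assumptions: the sysctl is exposed at /proc/sys/vm/stat_interval (the usual mapping for vm.* knobs), the counter is the nonvoluntary_ctxt_switches field of the status file, and the helper names are invented for this example.

```python
import time

# Assumption: the vm.stat_interval sysctl is exposed at this procfs path.
STAT_INTERVAL_PATH = "/proc/sys/vm/stat_interval"


def read_stat_interval(path=STAT_INTERVAL_PATH):
    """Return the vmstat update interval in seconds."""
    with open(path) as f:
        return int(f.read().strip())


def parse_nonvoluntary_switches(status_text):
    """Extract nonvoluntary_ctxt_switches from the text of a status file."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "nonvoluntary_ctxt_switches":
            return int(value.strip())
    raise ValueError("nonvoluntary_ctxt_switches field not found")


def switches_per_second(status_path, window=10.0):
    """Sample the counter twice, `window` seconds apart, and return the rate."""
    def sample():
        with open(status_path) as f:
            return parse_nonvoluntary_switches(f.read())

    start = time.monotonic()
    before = sample()
    time.sleep(window)
    after = sample()
    elapsed = time.monotonic() - start
    return (after - before) / elapsed
```

If the vmstat work item is the cause, the measured rate should be roughly 1 / read_stat_interval(). After raising the interval (for example with sysctl -w vm.stat_interval=10, as root), the rate for the isolated thread should drop accordingly.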
I maintain a wiki with all the sources of interruptions to isolated CPU workloads here. I also have a patch in the works to stop vmstat from scheduling the work queue item if nothing changed between one vmstat work queue run and the next, as would happen if the single task on your CPU does not do any dynamic memory allocation. I am not sure it will benefit you, though; it depends on your workload.
Answered By - gby Answer Checked By - Candace Johnson (WPSolving Volunteer)