Issue
I am trying to find a way to run a cron job in a Kubernetes-deployed app without unwanted duplicates. To give you a little bit of context, I will describe my scenario:
I want to schedule jobs that execute once at a specified date. More precisely: creating such a job can happen at any time, and its execution date is only known at that moment. The job to be done is always the same, but it needs parametrization. My application runs inside a Kubernetes cluster, and I cannot assume that there will only ever be one instance of it running at a time. Therefore, creating the said job would lead to multiple executions of it, because every one of my application instances would spawn it. However, I want to guarantee that a job runs only once in the whole cluster.
I tried to find solutions to the problem of duplicate jobs running:
- Create a local file and check if it is already there when starting a new job. If it is there, cancel the job.
-> This is not possible in my case, since the duplicate jobs might run on other machines!
- Utilize the Kubernetes CronJob API
-> I cannot use this feature because I have to create cron jobs dynamically from inside my application, and I cannot change the cluster configuration from a pod running inside that cluster. Maybe there is a way, but it seems to me there has to be a better solution than giving the application access to the cluster it is running in.
Would you be so kind as to point me in any direction where I might find a solution?
I am using a managed Kubernetes Cluster on Digital Ocean:
Client Version: v1.22.4, Server Version: v1.21.5
Solution
After thinking about a solution for quite a long time, I found one: move the scheduling of the jobs to a central place. It is as simple as building a job web service that exposes endpoints to create jobs. A backend instance creating a job at this service also provides a callback endpoint in the request, which the job web service will call at the execution date/time. In my case the callback links back to the calling backend server, which carries the logic to be executed; it would be rather tedious to make the job service execute the logic directly, since the job has a lot of dependencies. The job service keeps a separate database just to store information about whom to call and how. Recovering after a crash also becomes trivial: since there is only one instance of the job web service, it can simply re-create the pending jobs from its database on startup.
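To make the idea concrete, here is a minimal sketch of what such a central job web service could look like, using only the Python standard library. It is not the author's actual implementation; the endpoint path /jobs, the JSON fields (run_at, callback_url, payload), and the SQLite schema are all illustrative assumptions. A backend would POST a job with its own callback URL, and the service fires that callback at the requested time and re-schedules any pending jobs after a restart.

```
# Minimal sketch of a central "job web service" (stdlib only). Endpoint path,
# JSON fields, and schema are illustrative assumptions, not the original code.
import json
import sqlite3
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

DB = "jobs.db"

def init_db():
    con = sqlite3.connect(DB)
    con.execute("""CREATE TABLE IF NOT EXISTS jobs (
                     id INTEGER PRIMARY KEY AUTOINCREMENT,
                     run_at REAL NOT NULL,          -- unix timestamp
                     callback_url TEXT NOT NULL,    -- backend endpoint to call
                     payload TEXT NOT NULL,         -- job parameters as JSON
                     done INTEGER NOT NULL DEFAULT 0)""")
    con.commit()
    con.close()

def fire(job_id, callback_url, payload):
    """Call the backend's callback endpoint and mark the job as done."""
    req = urllib.request.Request(
        callback_url,
        data=payload.encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(req, timeout=10)
        con = sqlite3.connect(DB)
        con.execute("UPDATE jobs SET done = 1 WHERE id = ?", (job_id,))
        con.commit()
        con.close()
    except OSError:
        # Left undone in the DB so a reconciliation pass can retry it later.
        pass

def schedule(job_id, run_at, callback_url, payload):
    delay = max(0, run_at - time.time())
    threading.Timer(delay, fire, args=(job_id, callback_url, payload)).start()

def restore_pending_jobs():
    """On startup, re-schedule every job that has not fired yet."""
    con = sqlite3.connect(DB)
    rows = con.execute(
        "SELECT id, run_at, callback_url, payload FROM jobs WHERE done = 0"
    ).fetchall()
    con.close()
    for job_id, run_at, callback_url, payload in rows:
        schedule(job_id, run_at, callback_url, payload)

class JobHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # POST /jobs  {"run_at": <unix ts>, "callback_url": "...", "payload": {...}}
        if self.path != "/jobs":
            self.send_error(404)
            return
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        payload = json.dumps(body.get("payload", {}))
        con = sqlite3.connect(DB)
        cur = con.execute(
            "INSERT INTO jobs (run_at, callback_url, payload) VALUES (?, ?, ?)",
            (body["run_at"], body["callback_url"], payload),
        )
        con.commit()
        job_id = cur.lastrowid
        con.close()
        schedule(job_id, body["run_at"], body["callback_url"], payload)
        self.send_response(201)
        self.end_headers()
        self.wfile.write(json.dumps({"id": job_id}).encode())

if __name__ == "__main__":
    init_db()
    restore_pending_jobs()
    ThreadingHTTPServer(("", 8080), JobHandler).serve_forever()
```

Because only this single service schedules the callbacks, it does not matter how many backend replicas exist: the job fires exactly once, against one callback URL.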
Do not forget to take care of failing jobs. If your backends are unreachable when the callback fires, there must be some reconciliation mechanism in place that prevents such a failure from going unnoticed.
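As one possible shape for such a reconciliation mechanism (again an assumption, building on the hypothetical sketch above and reusing its DB path and fire() helper), a periodic pass can retry every overdue job whose callback has not yet succeeded:

```
# Continuing the sketch above: periodically retry overdue jobs that are still
# marked as not done, e.g. because the backend was unreachable at fire time.
RECONCILE_INTERVAL = 60  # seconds between reconciliation passes

def reconcile():
    con = sqlite3.connect(DB)
    overdue = con.execute(
        "SELECT id, run_at, callback_url, payload FROM jobs "
        "WHERE done = 0 AND run_at <= ?", (time.time(),)
    ).fetchall()
    con.close()
    for job_id, _run_at, callback_url, payload in overdue:
        fire(job_id, callback_url, payload)  # retry the callback
    threading.Timer(RECONCILE_INTERVAL, reconcile).start()
```

Since a retry could overlap with a callback that is still in flight, it is safest to make the backend's callback endpoint idempotent.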
A little note I want to add: if you also want to scale the job service horizontally, you run into very similar problems again. However, if you think about what the actual work in that service is, you realize it is very lightweight. I am not sure horizontal scaling will ever be a requirement, since the service only makes requests at specified times and does not execute heavy work.
Answered By - Alexander Grass
Answer Checked By - Dawn Plyler (WPSolving Volunteer)