
Airflow scheduler logs location

Recently, I came across an annoying problem. We are experimenting with Apache Airflow (version 1.10rc2, with Python 2.7), deploying it to Kubernetes with the webserver and scheduler in different pods and the database on Cloud SQL, and we have been facing out-of-memory problems with the scheduler pod. At the moment of the OOM, we were running only 4 example DAGs (approximately 20 tasks).

Is there any rule of thumb we can use to calculate how much memory we would need for the scheduler based on the number of parallel tasks? Is there any tuning, apart from decreasing the parallelism, that could be done to reduce the memory use of the scheduler itself? I've seen in other posts that a task might consume approximately 50 MiB of memory when running, and that all task operations happen in memory, nothing is flushed to disk, so that would already give 1 GB. I don't think our use case would require Dask or Celery to horizontally scale Airflow with more machines for the workers.

Just a few more details about the configuration (the relevant settings are sketched below): parallelism is set to 10, with just a single worker, and dag_concurrency is set to 5. The DAGs running at the time were example_bash_operator, example_branch_operator, example_python_operator and one quick DAG we have developed, all of them with only simple tasks/operators: DummyOperators, BranchOperators, BashOperators doing nothing more than echo or sleep, and PythonOperators doing only sleep as well. In total that is approximately 40 tasks, but not all of them run in parallel, because some of them are downstream dependencies and so on. I can't see anything abnormal in the Airflow logs or in the task logs, and running just one of these DAGs, Airflow seems to work as expected. I can, however, see a lot of scheduler processes in the scheduler pod, each one using 0.2% of memory or more.
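For context, the part of airflow.cfg behind those numbers would look roughly like the sketch below. The actual file isn't reproduced in this post, so the section layout, the LocalExecutor line and the comments are assumptions based only on the values mentioned above.

[core]
# Single worker on one node, so tasks run as subprocesses of the scheduler (assumed)
executor = LocalExecutor
# Hard cap on task instances running simultaneously across all DAG runs
parallelism = 10
# Cap on concurrent task instances within a single DAG
dag_concurrency = 5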


There isn't really a concise rule of thumb to follow, because it can vary so much based on your workflow. As you've seen, the scheduler will create several forked processes, and every task (except Dummy) will run in its own process. Depending on the operator and the data it's processing, the amount of memory needed per task can vary wildly.

The parallelism setting will directly limit how many tasks are running simultaneously across all DAG runs/tasks, which would have the most dramatic effect for you, using the LocalExecutor. You can also try setting max_threads under [scheduler] to 1.

So a (very) general rule of thumb, being gracious with resources:

(memory for the scheduler itself) + ( parallelism * (100 MB + size of data processed per task) )

where the size of data will need to change depending on whether you load a full dataset or process chunks of it over the course of the execution of the task.
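To make that formula concrete with the numbers from this setup: with parallelism = 10 and, purely as an assumed figure, around 200 MB of data held in memory per task, the estimate comes to the scheduler's own footprint plus 10 * (100 MB + 200 MB), i.e. roughly 3 GB on top of the scheduler. The two knobs mentioned above sit in airflow.cfg like this (example values, not a recommendation for every workload):

[core]
# Limits how many task instances run at the same time across all DAG runs
parallelism = 10

[scheduler]
# Fewer scheduler threads means fewer forked scheduler processes
max_threads = 1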


Even if you don't think you'll need to scale your cluster, I would still recommend using the CeleryExecutor, if only to isolate the scheduler and the tasks from each other. That way, if your scheduler or a Celery worker dies, it doesn't take both down. This matters especially when running in Kubernetes: if your scheduler receives a SIGTERM, it is going to be killed along with any running tasks, whereas if the scheduler and workers run in different pods and the scheduler pod restarts, your tasks can finish uninterrupted. And if you have more workers, it would lessen the impact of memory/processing spikes from other tasks.
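As a minimal sketch of what that switch could look like in airflow.cfg, assuming a Redis broker and the existing metadata database (here guessed to be Postgres) as the Celery result backend; both the broker choice and the connection strings are illustrative assumptions, not values from this setup:

[core]
executor = CeleryExecutor

[celery]
# Hypothetical endpoints; point these at your own broker and database
broker_url = redis://redis:6379/0
result_backend = db+postgresql://airflow:airflow@postgres:5432/airflow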







