I use Celery with RabbitMQ in my Django app (on Elastic Beanstalk) to manage background tasks, and I daemonized it using Supervisor. The problem now is that one of the periodic tasks I defined is failing (after a week in which it worked properly), with this error:
[01/Apr/2014 23:04:03] [ERROR] [celery.worker.job:272] Task clean-dead-sessions[1bfb5a0a-7914-4623-8b5b-35fc68443d2e] raised unexpected: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
  File "/opt/python/run/venv/lib/python2.7/site-packages/billiard/pool.py", line 1168, in mark_as_worker_lost
    human_status(exitcode)),
WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
All the processes managed by Supervisor are up and running properly (supervisorctl status says RUNNING).
I tried reading several logs on my EC2 instance, but none of them helped me find out what caused the SIGKILL. What should I do? How can I investigate?
These are my Celery settings:
CELERY_TIMEZONE = 'UTC'
CELERY_TASK_SERIALIZER = 'json'
CELERY_ACCEPT_CONTENT = ['json']
BROKER_URL = os.environ['RABBITMQ_URL']
CELERY_IGNORE_RESULT = True
CELERY_DISABLE_RATE_LIMITS = False
CELERYD_HIJACK_ROOT_LOGGER = False
And this is my supervisord.conf:
[program:celery_worker]
environment=$env_variables
directory=/opt/python/current/app
command=/opt/python/run/venv/bin/celery worker -A com.cygora -l info --pidfile=/opt/python/run/celery_worker.pid
startsecs=10
stopwaitsecs=60
stopasgroup=true
killasgroup=true
autostart=true
autorestart=true
stdout_logfile=/opt/python/log/celery_worker.stdout.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stderr_logfile=/opt/python/log/celery_worker.stderr.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
numprocs=1

[program:celery_beat]
environment=$env_variables
directory=/opt/python/current/app
command=/opt/python/run/venv/bin/celery beat -A com.cygora -l info --pidfile=/opt/python/run/celery_beat.pid --schedule=/opt/python/run/celery_beat_schedule
startsecs=10
stopwaitsecs=300
stopasgroup=true
killasgroup=true
autostart=false
autorestart=true
stdout_logfile=/opt/python/log/celery_beat.stdout.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stderr_logfile=/opt/python/log/celery_beat.stderr.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
numprocs=1
Edit: after restarting celery beat, the problem remains :(
Edit 2: I changed killasgroup=true to killasgroup=false and the problem remains.
The SIGKILL your worker received was initiated by another process. Your supervisord config looks fine, and killasgroup would only affect a supervisor-initiated kill (e.g. via supervisorctl or a plugin); without that setting supervisor would have sent the signal to the dispatcher anyway, not to the child.
Most likely you have a memory leak, and the OS's OOM killer is assassinating your process for bad behavior.
Run grep oom /var/log/messages. If you see messages, that's your problem.
If you don't find anything, try running the periodic task manually in a shell:
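For example, something along these lines from a Django shell on the instance (the module path and task name below are guesses based on the task id in your traceback; substitute your own):

# python manage.py shell  (run inside the virtualenv on the EC2 instance)
from myproject.tasks import clean_dead_sessions  # assumed module/name
clean_dead_sessions.apply()  # executes the task synchronously in this process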
And see what happens. I'd monitor system and process metrics with top in another terminal, if you don't have good instrumentation like Cacti, Ganglia, etc. for this host.
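If the OOM killer does turn out to be the culprit, one common mitigation (a sketch, not a fix for the leak itself) is to have Celery recycle its pool processes after a fixed number of tasks, so slowly leaking memory is reclaimed; using the Celery 3.x-style setting names you already have:

# Replace each pool worker process after it has run 100 tasks,
# so memory leaked by task code is released when the process is recycled.
CELERYD_MAX_TASKS_PER_CHILD = 100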