# Workers

## Coding Guidelines

### Hierarchical Config
- The default & dev config goes into `workers/config/common.py`.
- The overrides for production go into `workers/config/production.py`.
- Based on the environment variable `APP_ENV`, config from that file is imported and merged with the common config (see the sketch after this list).
- Add `APP_ENV=production` to the `.env` file, which `load_dotenv` reads, or set it directly with `export APP_ENV=production`.
- In project files, import the config as `from workers.config import config`.
- The imported config is a DotMap object, which supports both `config[]` and `config.` access.
- To add a new environment (for example "stage"), create a new file inside `workers/config` called `stage.py`, and assign the overriding config as a dict to a variable named `config`.
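A minimal sketch of what this merge can look like, assuming a `_deep_merge` helper (the actual `workers/config/__init__.py` may differ):

```python
# workers/config/__init__.py -- illustrative sketch of the merge described above
import importlib
import os

from dotenv import load_dotenv
from dotmap import DotMap

from workers.config.common import config as common_config

load_dotenv()  # makes APP_ENV from .env visible via os.environ


def _deep_merge(base: dict, overrides: dict) -> dict:
    """Recursively overlay `overrides` on top of `base`."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = _deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


_merged = dict(common_config)
_env = os.environ.get('APP_ENV')
if _env:
    # e.g. APP_ENV=production -> workers/config/production.py
    _overrides = importlib.import_module(f'workers.config.{_env}')
    _merged = _deep_merge(_merged, _overrides.config)

# DotMap supports both config['key'] and config.key access
config = DotMap(_merged)
```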
### Celery config

- Config specific to Celery is in `workers/config/celeryconfig.py`.
- Config is defined as plain Python values, instead of a dict.
- Environment-specific values and secrets are loaded from the `.env` file, as sketched below.
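For illustration, a config module in this style might look like the following; the environment variable names here (`QUEUE_URL`, `MONGO_URL`) are assumptions, not necessarily what this repo uses:

```python
# workers/config/celeryconfig.py -- illustrative sketch; env var names are assumptions
import os

from dotenv import load_dotenv

load_dotenv()  # pull env-specific values and secrets from .env

# Plain module-level Python values, not a dict -- Celery can read these
# directly when this module is passed to app.config_from_object().
broker_url = os.environ['QUEUE_URL']
result_backend = os.environ['MONGO_URL']
worker_prefetch_multiplier = 1
```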
### Code Organization

- Celery tasks: `workers/tasks/*.py`
- Scheduled jobs and other scripts: `workers/scripts/*.py`
- Helper code: `workers/*.py`
- Config / settings: `workers/config/*.py` and `.env`
- Test code: `tests/`
## Hot Module Replacement

Workers automatically run with updated code, except for the code in:

- `workers.config.*`
- `workers.utils`
- `workers.celery_app`
- `workers.task.declaration`
## Deployment

- Add `module load python/3.10.5` to `~/.modules`.
- Update `.env` (make a copy of `.env.example` and add values).
- Install dependencies and start the workers:

  ```bash
  poetry export --without-hashes --format=requirements.txt > requirements.txt
  pip install -r requirements.txt

  cd ~/app/workers
  pm2 start ecosystem.config.js
  # optional
  pm2 save
  ```
## Testing with workers running on local machine

### Start mongo and queue

```bash
cd <rhythm_api>
docker-compose up queue mongo -d
```
### Start Workers

```bash
python -m celery -A tests.celery_app worker --loglevel INFO -O fair --pidfile celery_worker.pid --hostname 'bioloop-celery-w1@%h' --autoscale=2,1 --queues 'bioloop-dev.sca.iu.edu.q'
```

- `--concurrency 1`: number of worker processes to pre-fork.
- `-O fair`: optimization profile; disables prefetching of tasks, which guarantees child processes will only be allocated tasks when they are actually available.
- `--hostname '<app_name>-celery-<worker_name>@%h'`: distinguishes multiple workers running on the same machine, either for the same app or for different apps.
  - Replace `<app_name>` with the app name (ex: bioloop).
  - Replace `<worker_name>` with the worker name (ex: w1).
- `--autoscale=10,3`: auto-scaling as max_concurrency,min_concurrency (always keep 3 processes, but grow to 10 if necessary).
- `--queues '<app_name>-dev.sca.iu.edu'`: comma-separated queue names; the worker will subscribe to these queues for accepting tasks. Configured in `workers/config/celeryconfig.py` with `task_routes` and `task_default_queue` (see the sketch below).
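As a sketch, the queue wiring in `workers/config/celeryconfig.py` could look like this; the `'workers.tasks.*'` pattern is an assumption for illustration:

```python
# Illustrative queue configuration; the task name pattern is an assumption.
task_default_queue = 'bioloop-dev.sca.iu.edu.q'
task_routes = {
    # route all worker tasks to the app-specific queue
    'workers.tasks.*': {'queue': 'bioloop-dev.sca.iu.edu.q'},
}
```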
### Run test

```bash
python -m tests.test
```
## Testing with workers running on COLO node and Rhythm API

There are no test instances of the API, rhythm_api, mongo, postgres, or queue running. These need to be run locally and port forwarded through ssh.
- Start postgres locally using docker:

  ```bash
  cd <app_name>
  docker-compose up postgres -d
  ```

- Start rhythm_api locally:

  ```bash
  cd <rhythm_api>
  docker-compose up queue mongo -d
  poetry run dev
  ```

- Start UI and API locally:

  ```bash
  cd <app_name>/api
  pnpm start

  cd <app_name>/ui
  pnpm dev
  ```
- Reverse port forward the API, mongo, and queue, letting clients on the remote machine talk to servers running on the local machine:
  - API: local port 3030, remote port 3130
  - Mongo: local port 27017, remote port 28017
  - Queue: local port 5672, remote port 5772

  ```bash
  ssh \
    -A \
    -R 3130:localhost:3030 \
    -R 28017:localhost:27017 \
    -R 5772:localhost:5672 \
    bioloopuser@workers.iu.edu
  ```
- Pull the latest changes in the dev branch to `<bioloop_dev>`:

  ```bash
  colo23> cd <app_dev>
  colo23> git checkout dev
  colo23> git pull
  ```

- Create / update `<app_dev>/workers/.env`.
- Create an auth token to communicate with the express server (postgres db):

  ```bash
  cd <app>/api
  node src/scripts/issue_token.js <service_account>
  ```

  - ex: `node src/scripts/issue_token.js svc_tasks`
  - docker ex: `sudo docker compose -f "docker-compose-prod.yml" exec api node src/scripts/issue_token.js svc_tasks`

- Install dependencies using poetry and start the celery workers:

  ```bash
  colo23> cd workers
  colo23> poetry install
  colo23> poetry shell
  colo23> python -m celery -A workers.celery_app worker --loglevel INFO -O fair --pidfile celery_worker.pid --hostname 'bioloop-dev-celery-w1@%h' --autoscale=2,1
  ```
## Dataset Name

- Taken from the name of the directory ingested.
- Used in `watch.py` to filter out registered datasets.
- Used to compute the staging path: `staging_dir / alias / dataset['name']`
- Used to compute the qc path: `Path(config['paths'][dataset_type]['qc']) / dataset['name'] / 'qc'`
- Used to compute the scratch tar path while downloading the tar file from SDA: `Path(f'{str(compute_staging_path(dataset)[0].parent)}/{dataset["name"]}.tar')`
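Putting these together, a hedged sketch of the path computations (the `alias` derivation and the config keys used below are assumptions inferred from the expressions above, not this repo's actual helpers):

```python
# Illustrative sketch of the dataset-name-derived paths listed above.
from pathlib import Path

from workers.config import config


def compute_staging_path(dataset: dict) -> tuple[Path, str]:
    # ASSUMPTION: how `alias` and the staging root are obtained is hypothetical
    alias = dataset.get('alias', dataset['name'])
    staging_dir = Path(config['paths'][dataset['type']]['staging'])
    return staging_dir / alias / dataset['name'], alias


def compute_qc_path(dataset: dict, dataset_type: str) -> Path:
    # qc path: <qc root for the dataset type> / <dataset name> / 'qc'
    return Path(config['paths'][dataset_type]['qc']) / dataset['name'] / 'qc'


def compute_scratch_tar_path(dataset: dict) -> Path:
    # the tar sits next to the dataset's staging directory while it is
    # being downloaded from SDA
    staging_path, _ = compute_staging_path(dataset)
    return Path(f'{str(staging_path.parent)}/{dataset["name"]}.tar')
```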