# Workers

## Coding Guidelines

### Hierarchical Config
- The default & dev config goes into `workers/config/common.py`.
- The overrides for production go into `workers/config/production.py`.
- Based on the environment variable `APP_ENV`, config from the matching file is imported and merged with the common config (sketched after this list).
- Add `APP_ENV=production` to the `.env` file, which `load_dotenv` reads, or set it directly with `export APP_ENV=production`.
- In project files, import the config with `from workers.config import config`.
- The imported config is a DotMap object, which supports both `config['key']` and `config.key` access.
- To add a new environment (for example "stage"), create a new file inside `workers/config` called `stage.py` and assign the overriding config, as a dict, to a variable named `config`.
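A minimal sketch of how this merge could work, assuming a shallow dict merge (the project's actual config module may deep-merge); everything here beyond the file and variable names mentioned above is illustrative:

```python
# Sketch only: assumes a shallow merge and that each env module exposes
# a dict named `config`, as described above.
import importlib
import os

from dotmap import DotMap

from workers.config.common import config as common_config

merged = dict(common_config)

app_env = os.environ.get('APP_ENV')
if app_env:
    # e.g. APP_ENV=production imports workers.config.production
    env_module = importlib.import_module(f'workers.config.{app_env}')
    merged.update(env_module.config)

config = DotMap(merged)

# DotMap supports both access styles:
#   config['celery'] and config.celery
```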
### Celery config
- Config specific to Celery is in `workers/config/celeryconfig.py`.
- Config is expressed as plain Python values, instead of a dict.
- Env-specific values and secrets are loaded from the `.env` file, as sketched below.
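An illustrative sketch of what `workers/config/celeryconfig.py` might contain, not the project's actual file; the env var names below are assumptions:

```python
# Plain module-level Python values, with secrets pulled from .env.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env so secrets stay out of source control

broker_url = os.environ['AMQP_URL']        # hypothetical env var name
result_backend = os.environ['MONGO_URL']   # hypothetical env var name
task_default_queue = 'bioloop-dev.sca.iu.edu.q'
worker_prefetch_multiplier = 1
```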
## Code Organization
- Celery tasks: `workers/tasks/*.py` (see the example after this list)
- Scheduled jobs and other scripts: `workers/scripts/*.py`
- Helper code: `workers/*.py`
- Config / settings: `workers/config/*.py` and `.env`
- Test code: `tests/`
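A hypothetical example of a task module under `workers/tasks/`; the Celery instance is assumed to be named `app` inside `workers.celery_app`, matching the `-A workers.celery_app` flag used in the commands below:

```python
# workers/tasks/hello.py (hypothetical module)
from workers.celery_app import app  # assumed name of the Celery instance


@app.task
def say_hello(name: str) -> str:
    return f'hello, {name}'
```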
## Hot Module Replacement
Workers automatically run with updated code, except for code in:

- `workers.config.*`
- `workers.utils`
- `workers.celery_app`
- `workers.task.declaration`

Changes to these modules require a worker restart.
## Deployment
- Add `module load python/3.10.5` to `~/.modules`
- Update `.env` (make a copy of `.env.example` and add values)
- Install dependencies:

```bash
poetry export --without-hashes --format=requirements.txt > requirements.txt
pip install -r requirements.txt
```

- Start the workers with pm2:

```bash
cd ~/app/workers
pm2 start ecosystem.config.js
# optional
pm2 save
```

## Testing with workers running on local machine
### Start mongo and queue

```bash
cd <rhythm_api>
docker-compose up queue mongo -d
```

### Start Workers
```bash
python -m celery -A tests.celery_app worker \
  --loglevel INFO \
  -O fair \
  --pidfile celery_worker.pid \
  --hostname 'bioloop-celery-w1@%h' \
  --autoscale=2,1 \
  --queues 'bioloop-dev.sca.iu.edu.q'
```

- `--concurrency 1`: number of worker processes to pre-fork.
- `-O fair`: optimization profile that disables prefetching of tasks, guaranteeing that child processes are only allocated tasks when they are actually available.
- `--hostname '<app_name>-celery-<worker_name>@%h'`: distinguishes multiple workers running on the same machine, either for the same app or for different apps.
  - Replace `<app_name>` with the app name (e.g. bioloop).
  - Replace `<worker_name>` with the worker name (e.g. w1).
- `--autoscale=10,3`: auto-scaling as `max_concurrency,min_concurrency` (always keep 3 processes, but grow to 10 if necessary).
- `--queues '<app_name>-dev.sca.iu.edu'`: comma-separated queue names; the worker subscribes to these queues to accept tasks. Configured in `workers/config/celeryconfig.py` with `task_routes` and `task_default_queue` (sketched below).
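A sketch of how the routing settings named above might look in `workers/config/celeryconfig.py`; the route pattern is an assumption:

```python
# Hypothetical routing config; actual patterns in the project may differ.
task_default_queue = 'bioloop-dev.sca.iu.edu.q'
task_routes = {
    # send everything declared under workers.tasks to the app's queue
    'workers.tasks.*': {'queue': 'bioloop-dev.sca.iu.edu.q'},
}
```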
### Run test

```bash
python -m tests.test
```
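A hypothetical sketch of what `tests/test.py` could do: submit a task to the worker started above and block on the result (the task name and timeout are illustrative):

```python
# tests/test.py (sketch): drive the worker end to end.
from tests.celery_app import app  # assumed name of the Celery instance

result = app.send_task('workers.tasks.say_hello', args=['world'])
print(result.get(timeout=30))  # waits for the worker to finish the task
```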
## Testing with workers running on COLO node and Rhythm API

There are no test instances of the API, rhythm_api, mongo, postgres, or queue running. These need to be run locally and port forwarded through ssh.
- Start postgres locally using docker:

```bash
cd <app_name>
docker-compose up postgres -d
```

- Start rhythm_api locally:

```bash
cd <rhythm_api>
docker-compose up queue mongo -d
poetry run dev
```

- Start UI and API locally:

```bash
cd <app_name>/api
pnpm start
```

```bash
cd <app_name>/ui
pnpm dev
```

- Reverse port forward the API, mongo and queue, letting clients on the remote machine talk to servers running on the local machine:
  - API: local port 3030, remote port 3130
  - Mongo: local port 27017, remote port 28017
  - Queue: local port 5672, remote port 5772
```bash
ssh \
  -A \
  -R 3130:localhost:3030 \
  -R 28017:localhost:27017 \
  -R 5772:localhost:5672 \
  bioloopuser@workers.iu.edu
```

- Pull the latest changes in the dev branch to `<bioloop_dev>`:
```bash
colo23> cd <app_dev>
colo23> git checkout dev
colo23> git pull
```

- Create / update `<app_dev>/workers/.env`.
- Create an auth token to communicate with the express server (postgres db):

```bash
cd <app>/api
node src/scripts/issue_token.js <service_account>
```

  - Example: `node src/scripts/issue_token.js svc_tasks`
  - Docker example: `sudo docker compose -f "docker-compose-prod.yml" exec api node src/scripts/issue_token.js svc_tasks`
- Install dependencies using poetry and start the celery workers:

```bash
colo23> cd workers
colo23> poetry install
colo23> poetry shell
colo23> python -m celery -A workers.celery_app worker --loglevel INFO -O fair --pidfile celery_worker.pid --hostname 'bioloop-dev-celery-w1@%h' --autoscale=2,1
```

## Dataset Name
- Taken from the name of the directory ingested.
- Used in `watch.py` to filter out registered datasets.
- Used to compute the staging path: `staging_dir / alias / dataset['name']`
- Used to compute the QC path: `Path(config['paths'][dataset_type]['qc']) / dataset['name'] / 'qc'`
- Used to compute the scratch tar path while downloading the tar file from SDA: `Path(f'{str(compute_staging_path(dataset)[0].parent)}/{dataset["name"]}.tar')`
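A worked example of these derivations with hypothetical values; `compute_staging_path` is the project's helper, and its return shape is assumed here to match the staging path formula above:

```python
from pathlib import Path

dataset = {'name': 'dataset_001'}  # name of the ingested directory
staging_dir = Path('/staging')     # hypothetical staging root
alias = 'a1b2c3'                   # hypothetical alias

staging_path = staging_dir / alias / dataset['name']
# -> /staging/a1b2c3/dataset_001

qc_base = Path('/qc/raw_data')     # stands in for config['paths'][dataset_type]['qc']
qc_path = qc_base / dataset['name'] / 'qc'
# -> /qc/raw_data/dataset_001/qc

scratch_tar = Path(f'{staging_path.parent}/{dataset["name"]}.tar')
# -> /staging/a1b2c3/dataset_001.tar
```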