It’s odd that it doesn’t show a stack trace along with the terminated worker.
You could try adding the --log-level debug command-line argument to force gunicorn to use the debug log level.
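For reference, the log level can be set either on the command line or in the gunicorn config file (the startup output suggests the service uses config.py). A minimal sketch of the config-file variant, assuming these settings are not already pinned elsewhere:

# config.py (gunicorn configuration file)
loglevel = "debug"       # one of: debug, info, warning, error, critical
errorlog = "-"           # "-" writes the error log to stderr
capture_output = True    # also redirect the workers' stdout/stderr into the error log

The equivalent command-line form would be something like gunicorn --config config.py --log-level debug, assuming you can edit the invocation inside start-service-prod.sh.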
Without seeing the stack trace it’s really difficult to judge what the issue is.
You could also try installing another version of MedCAT to see if it works there, though I don’t really see why that should result in a different outcome.
Searching for gunicorn workers respawning without a trace of what went wrong, I found it mentioned that when something external (e.g. an OS-related cause) kills the worker process, gunicorn may simply notice the worker was killed and respawn a new process. I will ask the system admins of the cluster I am on whether they have implemented settings that do not fit our use case; I cannot read all the system logs here.
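If it is something like the OOM killer terminating the worker, the closest thing I can check myself is scanning the kernel log for OOM messages. A rough sketch (assuming dmesg is readable at all for regular users on this cluster, which it may not be):

import subprocess

# Print kernel log lines that look like OOM-killer activity.
# Reading dmesg may be restricted on a shared cluster.
result = subprocess.run(["dmesg"], capture_output=True, text=True)
for line in result.stdout.splitlines():
    if "oom" in line.lower() or "killed process" in line.lower():
        print(line)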
Is the entire model loaded into RAM (vocab + cdb ~ 1.2 GB)?
With --log-level debug there is no additional information at the time of the crash. Perhaps you, @mart.ratas, see something in the start-up phase?
I will contact the system admins to see if there is anything in their log or settings.
(mdcservice) [tgwelter@node012 MedCATservice]$ ./start-service-prod.sh
threads = 4
Starting up Flask app using gunicorn server ...
[2023-12-11 11:20:09 +0100] [2900978] [DEBUG] Current configuration:
config: config.py
wsgi_app: None
bind: ['0.0.0.0:5000']
backlog: 2048
workers: 1
worker_class: sync
threads: 4
worker_connections: 1000
max_requests: 0
max_requests_jitter: 0
timeout: 300
graceful_timeout: 30
keepalive: 2
limit_request_line: 4094
limit_request_fields: 100
limit_request_field_size: 8190
reload: False
reload_engine: auto
reload_extra_files: []
spew: False
check_config: False
print_config: False
preload_app: False
sendfile: None
reuse_port: False
chdir: /trinity/home/tgwelter/MedCATservice
daemon: False
raw_env: []
pidfile: None
worker_tmp_dir: None
user: 1165
group: 1165
umask: 0
initgroups: False
tmp_upload_dir: None
secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
forwarded_allow_ips: ['127.0.0.1']
accesslog: -
disable_redirect_access_to_syslog: False
access_log_format: %(t)s [ACCESSS] %(h)s "%(r)s" %(s)s "%(f)s" "%(a)s"
errorlog: -
loglevel: debug
capture_output: False
logger_class: gunicorn.glogging.Logger
logconfig: None
logconfig_dict: {}
syslog_addr: udp://localhost:514
syslog: False
syslog_prefix: None
syslog_facility: user
enable_stdio_inheritance: False
statsd_host: None
dogstatsd_tags:
statsd_prefix:
proc_name: None
default_proc_name: wsgi
pythonpath: None
paste: None
on_starting: <function OnStarting.on_starting at 0x1555465782c0>
on_reload: <function OnReload.on_reload at 0x155546578400>
when_ready: <function WhenReady.when_ready at 0x155546578540>
pre_fork: <function Prefork.pre_fork at 0x155546578680>
post_fork: <function post_fork at 0x15554657a980>
post_worker_init: <function PostWorkerInit.post_worker_init at 0x155546578900>
worker_int: <function WorkerInt.worker_int at 0x155546578a40>
worker_abort: <function WorkerAbort.worker_abort at 0x155546578b80>
pre_exec: <function PreExec.pre_exec at 0x155546578cc0>
pre_request: <function PreRequest.pre_request at 0x155546578e00>
post_request: <function PostRequest.post_request at 0x155546578ea0>
child_exit: <function ChildExit.child_exit at 0x155546578fe0>
worker_exit: <function WorkerExit.worker_exit at 0x155546579120>
nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x155546579260>
on_exit: <function OnExit.on_exit at 0x1555465793a0>
proxy_protocol: False
proxy_allow_ips: ['127.0.0.1']
keyfile: None
certfile: None
ssl_version: 2
cert_reqs: 0
ca_certs: None
suppress_ragged_eofs: True
do_handshake_on_connect: False
ciphers: None
raw_paste_global_conf: []
strip_header_spaces: False
[2023-12-11 11:20:09 +0100] [2900978] [INFO] Starting gunicorn 20.1.0
[2023-12-11 11:20:09 +0100] [2900978] [DEBUG] Arbiter booted
[2023-12-11 11:20:09 +0100] [2900978] [INFO] Listening at: http://0.0.0.0:5000 (2900978)
[2023-12-11 11:20:09 +0100] [2900978] [INFO] Using worker: gthread
[2023-12-11 11:20:09 +0100] [2900979] [INFO] Booting worker with pid: 2900979
[2023-12-11 11:20:09 +0100] [2900979] [INFO] Worker spawned (pid: 2900979)
[2023-12-11 11:20:09 +0100] [2900979] [INFO] APP_CUDA_DEVICE_COUNT device variables not set
[2023-12-11 11:20:09 +0100] [2900978] [DEBUG] 1 workers
[2023-12-11 11:20:33 +0100] [2900979] [DEBUG] POST /api/process
[2023-12-11 11:20:33,165] [DEBUG] MedCatProcessor: APP log level set to : DEBUG
[2023-12-11 11:20:33,165] [DEBUG] MedCatProcessor: MedCAT log level set to : DEBUG
[2023-12-11 11:20:33,165] [INFO] MedCatProcessor: Initializing MedCAT processor ...
[2023-12-11 11:20:34,430] [INFO] MedCatProcessor: Loading model pack...
[2023-12-11 11:20:48,652] [WARNING] medcat.cdb: You have MedCAT version '1.7.3' installed while the CDB was exported by MedCAT version '1.3.0',
which may or may not work. If you experience any compatibility issues, please reinstall MedCAT
or download the compatible model.
[2023-12-11 11:20:57 +0100] [2900978] [WARNING] Worker with pid 2900979 was terminated due to signal 9
[2023-12-11 11:20:57 +0100] [2901139] [INFO] Booting worker with pid: 2901139
[2023-12-11 11:20:57 +0100] [2901139] [INFO] Worker spawned (pid: 2901139)
[2023-12-11 11:20:57 +0100] [2901139] [INFO] APP_CUDA_DEVICE_COUNT device variables not set
Yes, the entire model will be loaded into memory. But that shouldn’t really be an issue with the amount of memory you’ve got, should it?
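If you want to verify the footprint on your side, something like the following should give a rough number (a sketch assuming a MedCAT 1.x model pack; the path is a placeholder):

import resource
from medcat.cat import CAT

# Load the model pack and report this process's peak resident memory afterwards.
cat = CAT.load_model_pack("<path-to-model-pack>.zip")  # placeholder path

# ru_maxrss is reported in kilobytes on Linux.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Peak RSS after loading the model pack: {peak_kb / 1024:.0f} MiB")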
With that said, since this is an HPC cluster, they might not want you to run heavy applications on the front end (i.e. without going through the workload management system).
If that’s what you’ve been doing, you could try doing the same things in an interactive job environment.
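It might also be worth checking what memory limits the login session actually imposes; a rough sketch (which cgroup path applies depends on how the cluster is set up):

import resource

# Address-space limit of the current session; RLIM_INFINITY means unlimited.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("RLIMIT_AS soft/hard:", soft, hard)

# cgroup memory limit, if the session is confined to a cgroup.
# Which file exists depends on cgroup v1 vs v2, so both paths are tried.
for path in ("/sys/fs/cgroup/memory.max",                     # cgroup v2
             "/sys/fs/cgroup/memory/memory.limit_in_bytes"):  # cgroup v1
    try:
        with open(path) as f:
            print(path, "=", f.read().strip())
    except OSError:
        pass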
Thanks, I contacted the sysadmins. They did not mention that my login environment was restricted in any way, but we will see. Indeed, memory should certainly not be a problem.
@mart.ratas It appears to be related to the cluster settings, because the service runs on my laptop without any problems. Thanks for the help; I will let you know when it runs on the cluster.
@mart.ratas It was indeed a cluster setting that put a limit on memory usage. Increasing the memory allocated to the session resolved the issue, and it is working fine now.
regards,
Tom