
Debugging Roundcube SMTP Errors Caused by ClamAV Out of Memory Kills

During a routine log review on one of my Postfix/Dovecot mail servers, I noticed a pattern of SMTP send failures in the Roundcube webmail logs. The errors were intermittent, affecting messages with larger attachments, and had been occurring for about a day before I spotted them. What followed was a debugging session that traced the problem through four layers of the mail stack, from Roundcube all the way down to a ClamAV process being killed by the Linux OOM killer due to host memory exhaustion.

The root cause turned out to be simple: too many containers running on an undersized VPS, leaving no memory headroom for ClamAV’s periodic database reloads. But the debugging path revealed several configuration issues worth documenting, including a cgroups v2 gotcha on Debian 13, a multiplicative timeout trap in Rspamd, and stale services that had been silently dead for weeks.

If you run a similar Postfix, Dovecot, Rspamd, and ClamAV stack on Incus containers, this post should save you some troubleshooting time.

The Errors

The Roundcube errors.log showed repeated SMTP failures when users tried to send messages with attachments:

SMTP Error: Failed to send data. [-1]
PHP Error: Invalid response code received from server

The -1 error code means Roundcube’s PHP SMTP library connected to the mail server successfully, but received an unexpected or empty response during the SMTP DATA phase. This is not a connection failure: the SMTP conversation was breaking down mid-session.

Tracing Through the Stack

Postfix: The Connection Was Successful

The Postfix logs on mx1 showed that Roundcube’s connections were arriving and completing the full SMTP handshake:

postfix/smtps/smtpd: disconnect from roundcube-host
  ehlo=1 auth=1 mail=1 rcpt=1 data=1 rset=1 quit=1 commands=7

The data=1 confirms the message data was transmitted. But the rset=1 is the tell. Roundcube sent an RSET (reset) after the DATA phase, meaning it received a response it didn’t understand and aborted the session. Compare with a successful send:

ehlo=1 auth=1 mail=1 rcpt=1 data=1 quit=1 commands=6

No RSET, clean disconnect. So Postfix was receiving the message, but something during post-DATA processing was causing a delayed or invalid response back to Roundcube.
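
To spot this pattern across a whole log rather than one session at a time, a small filter helps. This is a minimal sketch, not part of the original debugging session: the function name rset_after_data is made up for illustration, and it assumes each disconnect record sits on a single log line (the excerpt above is wrapped for readability). Other mail clients can send RSET for harmless reasons, so treat matches as candidates only.

```shell
# Print smtpd disconnect lines where DATA completed but the client also sent RSET.
# Reads Postfix log lines on stdin. The PID in brackets is optional in the pattern.
rset_after_data() {
    grep -E 'smtpd(\[[0-9]+\])?: disconnect from .* data=[0-9]+ .*rset=[0-9]+'
}

# Example usage (the log path is an assumption; on a journald-only system,
# pipe "journalctl -u postfix" into the function instead):
#   rset_after_data < /var/log/mail.log
```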

Rspamd: 60 Second Scan Timeouts

The Rspamd logs revealed the problem. Every failed message showed a processing time of almost exactly 60 seconds:

rspamd_task_write_log: user: user@example.com,
  [CLAM_VIRUS_FAIL(0.00){failed to scan and retransmits exceed;},...],
  len: 8883839, time: 60228.667ms

Two critical details: CLAM_VIRUS_FAIL means ClamAV was completely failing to scan the message, and time: 60228.667ms means Rspamd waited a full 60 seconds before giving up.

The Rspamd antivirus configuration explained why it waited so long:

timeout = 15.0;
retransmits = 3;

That is 15 seconds multiplied by 3 retries, totaling 45 seconds minimum, plus overhead pushing it to ~60 seconds. The fail_action = "accept" setting meant Rspamd would eventually pass the message through, but only after burning through all retries.

Meanwhile, Roundcube’s SMTP library has its own timeout (default 30 to 60 seconds). By the time Rspamd finished waiting for ClamAV and told Postfix to proceed, Roundcube had already given up and reported the error to the user.
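
To find every message hit by these long scans, the time: field in the Rspamd task log can be filtered against a threshold. The following is a rough sketch under a few assumptions: the function name slow_rspamd_tasks is hypothetical, the 10000 ms default is arbitrary, and each task log entry is expected on one line ending in a "time: NNN.NNNms" field like the excerpt above.

```shell
# Print "<milliseconds> <log line>" for Rspamd task log entries slower than a threshold.
# $1: threshold in milliseconds (default 10000). Reads log lines on stdin.
slow_rspamd_tasks() {
    awk -v limit="${1:-10000}" '
        match($0, /time: [0-9.]+ms/) {
            # Drop the leading "time: " (6 chars) and the trailing "ms" (2 chars).
            ms = substr($0, RSTART + 6, RLENGTH - 8)
            if (ms + 0 > limit) print ms, $0
        }'
}

# Example usage (the log path is an assumption):
#   slow_rspamd_tasks 10000 < /var/log/rspamd/rspamd.log
```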

ClamAV: OOM Killed

systemctl status clamav-daemon

Active: failed (Result: oom-kill)
Mem peak: 1G

ClamAV was dead. The Linux OOM killer had terminated it.
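
The kernel log shows which processes the OOM killer picked and how often. This is a sketch only: oom_victims is a made-up helper name, and the "Killed process PID (name)" wording it matches is the format recent kernels print, which may differ on older ones. Inside a container the kernel log may not be readable, so this is best run on the host.

```shell
# Count OOM-killer victims by process name.
# Reads kernel log lines on stdin; matches both host-wide and cgroup OOM messages.
oom_victims() {
    sed -n 's/.*[Oo]ut of memory: Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Example usage:
#   journalctl -k --no-pager | oom_victims
```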

The Root Cause: Host Memory Exhaustion

The mail server runs as an Incus container on a Hetzner CAX21 VPS with 8 GB of RAM. Checking the host revealed the real problem:

free -h
              total        used        free
Mem:          7.6Gi       7.3Gi       205Mi
Swap:            0B          0B          0B

Only 205 MB free, no swap. The host was running 15 containers, several of which were orphaned or idle but still consuming memory. One idle Nextcloud demo instance alone was using 1.2 GB. ClamAV, as the single largest process at ~960 MB, was the natural target for the OOM killer whenever anything spiked.

ClamAV’s memory usage is normally stable around 960 MB with the official signature databases. However, when freshclam downloads an updated database (every 2 hours), ClamAV briefly needs to hold both the old and the new database in memory during the reload. This spike can push it to around 1.8 GB momentarily. I confirmed this on an identical mail server running on a 16 GB VPS, where the spike has never caused a problem. On the memory-starved host, this reload spike was consistently triggering OOM kills.
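
The headroom question can be turned into a quick check: is there enough available memory on the host to absorb the difference between the reload peak and the steady state? The sketch below uses the figures from this post (roughly 1800 MiB peak, 960 MiB baseline). The function names are hypothetical, and MemAvailable is only the kernel's estimate, so read the result as a hint rather than a guarantee.

```shell
# Print MemAvailable in MiB from a meminfo-style file (default: /proc/meminfo).
available_mib() {
    awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeed if available memory covers the extra needed during a database reload.
# $1: reload peak in MiB, $2: steady-state usage in MiB, $3: optional meminfo file.
has_reload_headroom() {
    [ "$(available_mib "$3")" -ge "$(( $1 - $2 ))" ]
}

# Example usage on the host:
#   has_reload_headroom 1800 960 || echo "not enough headroom for a ClamAV reload"
```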

Additional Issues Found Along the Way

While debugging, I found several configuration issues that were compounding the problem:

cgroups v2 Memory Limit Ignored

The systemd override for ClamAV was using the old cgroups v1 directive:

# WRONG on Debian 13 with Incus (cgroups v2)
MemoryLimit=1024M

# CORRECT
MemoryMax=1024M

MemoryLimit is the deprecated cgroups v1 name for this setting. On Debian 13 with cgroups v2, it may not apply at all, so the intended 1 GB cap was likely being ignored.
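
Rather than guessing, it is possible to ask systemd which limit is actually in effect. The two commands in the usage comments are standard (stat and systemctl show); the describe_memory_max helper around them is a hypothetical convenience that turns the raw byte count into something readable.

```shell
# Convert "MemoryMax=<bytes|infinity>", as printed by
# "systemctl show <unit> -p MemoryMax", into a readable form.
describe_memory_max() {
    value=${1#MemoryMax=}
    case "$value" in
        infinity)     echo "no limit" ;;
        ''|*[!0-9]*)  echo "unrecognized: $1" ;;
        *)            echo "$(( value / 1024 / 1024 )) MiB" ;;
    esac
}

# Example usage:
#   stat -fc %T /sys/fs/cgroup     # prints "cgroup2fs" on a cgroups v2 system
#   describe_memory_max "$(systemctl show clamav-daemon -p MemoryMax)"
```

If the second command reports no limit even though an override file exists, the directive in that file is not being applied.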

Duplicate Directives in clamd.conf

The ClamAV configuration contained duplicate entries:

MaxThreads 1       # early in the file
MaxThreads 12      # later, this one wins

MaxQueue 2         # early
MaxQueue 100       # later, this one wins

ClamAV uses the last value it reads. So despite an apparent attempt to limit threads to 1, the server was running 12 scanning threads.
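
Duplicates like these are easy to miss in a long config file, so a mechanical check is worthwhile. The sketch below lists every directive that appears more than once, with its values in file order. The function name dup_directives is made up for illustration. Some clamd.conf options may legitimately be given several times, so the output is a list to review, not a list of errors.

```shell
# List directives that occur more than once in a clamd.conf-style file,
# with their values in file order. Comment lines and blank lines are skipped.
dup_directives() {
    awk '!/^[[:space:]]*(#|$)/ {
             count[$1]++
             values[$1] = (count[$1] == 1 ? $2 : values[$1] " " $2)
         }
         END {
             for (name in count)
                 if (count[name] > 1) print name ": " values[name]
         }' "$1" | sort
}

# Example usage:
#   dup_directives /etc/clamav/clamd.conf
```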

Freshclam Dead for Six Weeks

The virus database update service had been inactive since mid February:

clamav-freshclam.service
Active: inactive (dead) since Sun 2026-02-15

ClamAV was running with a database that was over six weeks old, which also means any signature-related memory optimizations in newer databases were not being applied.
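
Since a dead freshclam produces no errors, the age of the daily database file is a simple signal to check. This is a sketch with assumptions: stale_clamav_dbs is a hypothetical helper, the two-day threshold is arbitrary, and only daily.cvd and daily.cld are checked because the other official databases can go a long time between updates.

```shell
# List daily ClamAV database files in directory $1 that have not been
# modified for more than $2 days. Prints nothing if the database is fresh.
stale_clamav_dbs() {
    find "$1" -maxdepth 1 -type f -name 'daily.c[lv]d' -mtime +"$2"
}

# Example usage:
#   systemctl is-active clamav-freshclam
#   stale_clamav_dbs /var/lib/clamav 2
```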

ExitOnOOM Misconfigured

The ExitOnOOM true setting in clamd.conf causes ClamAV to exit when it detects internal memory pressure. Combined with systemd restart, this creates a scenario where ClamAV may exit and restart repeatedly without any clear indication in the logs about what is actually happening. Setting this to false lets systemd’s MemoryMax handle the constraint instead.

The Fixes

1. Free Up Host Memory

The most impactful fix was stopping orphaned and unused containers on the host:

incus stop nextcloud       # idle Nextcloud demo, 1.2 GB
incus stop code
incus stop easy-appointments
incus stop n8n             # not actively used

This freed over 1.5 GB of host memory, giving ClamAV enough headroom for database reload spikes.

2. Fix ClamAV Configuration

Remove duplicate directives and reduce thread count:

# In /etc/clamav/clamd.conf
MaxThreads 2        # two threads is sufficient for a small mail server
MaxQueue 15
ExitOnOOM false      # let systemd manage memory limits

3. Fix systemd Memory Limit for cgroups v2

# /etc/systemd/system/clamav-daemon.service.d/override.conf
[Service]
MemoryMax=1536M
CPUQuota=25%
IOSchedulingPriority=7
CPUSchedulingPolicy=idle
Nice=19

4. Reduce Rspamd ClamAV Timeout

The multiplicative timeout was the direct cause of the 60 second delays:

# /etc/rspamd/local.d/antivirus.conf
clamav {
  enabled = true;
  type = "clamav";
  servers = "/var/run/clamav/clamd.ctl";
  symbol = "CLAM_VIRUS";
  action = "reject";
  message = "Virus detected: %s";
  scan_mime_parts = true;
  scan_text_mime = false;
  scan_image_mime = false;
  timeout = 10.0;      # was 15.0
  retransmits = 1;      # was 3, max wait is now 10s instead of 60s
  fail_action = "accept";
  log_clean = true;
  patterns {
    virus = '^VIR';
    phish = '^Heuristics\.Phishing';
  }
}

Important: reload Rspamd after changing the config. This caught me out during debugging. Even though the config file had the correct values, Rspamd kept using the old timeout and retransmit settings until I explicitly ran:

systemctl reload rspamd

I verified this by sending test emails before and after the reload. Before the reload, scans were still timing out at about 20 seconds, so the new values were clearly not fully in effect. After the reload, ClamAV scanned a 9.5 MB attachment in under 250 ms.

5. Re-enable Freshclam

systemctl start clamav-freshclam
systemctl enable clamav-freshclam

6. Disable Unofficial ClamAV Signatures (Precaution)

The server was also running clamav-unofficial-sigs, which adds third-party signature databases from SaneSecurity, Linux Malware Detect, Yara rules, and others. On a host with plenty of memory, these run fine. On this memory-constrained host, they added unnecessary pressure.

As a precaution, I disabled them:

sed -i 's/^user_configuration_complete="yes"/user_configuration_complete="no"/' \
  /etc/clamav-unofficial-sigs/user.conf
systemctl stop clamav-unofficial-sigs.timer
systemctl disable clamav-unofficial-sigs.timer

# Remove unofficial database files, keep only official ones
# Only delete if the cd succeeded, so rm never runs in the wrong directory
cd /var/lib/clamav && {
  rm -f *.ndb *.ldb *.hdb *.hsb *.cdb *.yar *.yara *.ftm *.fp
  rm -f interservertopline.db sigwhitelist.ign2
}

The reasoning: Rspamd already provides phishing detection, DKIM/SPF/DMARC enforcement, fuzzy hash matching, Bayesian filtering, and reputation-based blocking. The unofficial ClamAV signatures have significant overlap with what Rspamd catches, particularly in the phishing and malicious URL categories. On a memory-constrained system, removing them reduces ClamAV’s footprint without meaningfully reducing protection.

On my other mail server running on a 16 GB VPS, the same unofficial signatures run without issues. This is a practical decision based on available resources, not a universal recommendation against them.

7. Increase Roundcube SMTP Timeout

As a safety net, increase Roundcube’s timeout so it can wait for Rspamd to finish scanning larger messages:

// /var/www/roundcube/config/config.inc.php
$config['smtp_timeout'] = 120;
$config['smtp_debug'] = true;  // logs full SMTP dialogue to smtp.log

The smtp_debug setting is worth enabling permanently. It logs the complete SMTP conversation to /var/www/roundcube/logs/smtp.log, making future debugging trivial instead of guessing from cryptic -1 error codes.

Adding Monit for Automatic Recovery

Even after fixing the root cause, ClamAV remains the most memory hungry process on the mail server. Adding Monit provides a safety net that automatically restarts crashed services and sends email alerts.

Install and Configure

apt install monit -y

Add the following to /etc/monit/monitrc:

set daemon 30

set mailserver localhost
set alert alerts@yourdomain.com

set mail-format {
  from: monit@mail.yourdomain.com
  subject: [monit] $SERVICE - $EVENT
  message: $EVENT on $HOST at $DATE

  Service: $SERVICE
  Event:   $EVENT
  Action:  $ACTION

  $DESCRIPTION
}

set httpd port 2812 and
  use address localhost
  allow localhost

Service Monitors

Create individual config files in /etc/monit/conf.d/:

# /etc/monit/conf.d/clamav
check process clamav-daemon matching "clamd"
  start program = "/usr/bin/systemctl start clamav-daemon"
  stop program = "/usr/bin/systemctl stop clamav-daemon"
  if does not exist then restart
  if 3 restarts within 5 cycles then alert
  if memory > 1400 MB for 3 cycles then restart

# /etc/monit/conf.d/rspamd
check process rspamd matching "rspamd"
  start program = "/usr/bin/systemctl start rspamd"
  stop program = "/usr/bin/systemctl stop rspamd"
  if does not exist then restart
  if 3 restarts within 5 cycles then alert

# /etc/monit/conf.d/postfix
check process postfix matching "master"
  start program = "/usr/bin/systemctl start postfix"
  stop program = "/usr/bin/systemctl stop postfix"
  if does not exist then restart
  if 3 restarts within 5 cycles then alert

# /etc/monit/conf.d/dovecot
check process dovecot matching "dovecot"
  start program = "/usr/bin/systemctl start dovecot"
  stop program = "/usr/bin/systemctl stop dovecot"
  if does not exist then restart
  if 3 restarts within 5 cycles then alert

The ClamAV monitor is the most interesting one. If ClamAV’s memory exceeds 1400 MB for three consecutive checks (90 seconds at the 30 second cycle), monit will proactively restart it before the OOM killer intervenes. If ClamAV crashes and gets restarted three times within five check cycles, monit sends an alert, so you know there is a fundamental problem that needs manual intervention.

Test It

systemctl restart monit
monit status

# Verify alerting works
systemctl stop clamav-daemon
tail -f /var/log/monit.log
# Monit should restart ClamAV within 30 seconds and send an alert email

The Result

After freeing host memory, fixing the configuration issues, and adding monit:

Stage                                    Scan Time    Result
Before fix (ClamAV dead)                 60,000 ms    CLAM_VIRUS_FAIL timeout
ClamAV restarted, Rspamd not reloaded    20,000 ms    CLAM_VIRUS_FAIL still failing
After Rspamd reload (cold cache)         16,000 ms    Clean scan, message passed
Second scan (warm cache)                 248 ms       Clean scan, fully operational

ClamAV is now stable at ~960 MB with sufficient host headroom for database reload spikes. Monit watches it and will restart and alert if anything goes wrong.

Key Takeaways

“SMTP Error: Failed to send data [-1]” in Roundcube often means a milter timeout, not a connection or authentication failure. Check Rspamd processing times in the logs.

Host memory matters more than container memory. ClamAV’s baseline memory usage of ~960 MB is manageable, but database reloads can spike to 1.8 GB. If your host doesn’t have headroom for this, ClamAV will be the first process the OOM killer targets. Clean up orphaned containers and monitor host-level memory.

MemoryLimit vs MemoryMax matters on Debian 13. With cgroups v2 (used by Incus and modern systemd), use MemoryMax. The old MemoryLimit directive may be silently ignored.

Rspamd’s timeout and retransmits multiply. A timeout of 15 with retransmits of 3 means at least 45 seconds of blocking (about 60 seconds in practice here), not 15. Set retransmits to 1 for ClamAV scanning to cap the maximum wait at the timeout value.

Always reload Rspamd after config changes. Writing new values to the config file is not enough. Without systemctl reload rspamd, the old settings remain active. Verify with a test email after changes.

Enable smtp_debug in Roundcube. The full SMTP dialogue logged to smtp.log makes future debugging straightforward instead of guessing from cryptic error codes.

Check ClamAV config for duplicate directives. ClamAV uses the last value it reads. If you have MaxThreads 1 early in the file and MaxThreads 12 later, you’re running 12 threads.

Freshclam failures are silent. If the update service dies, ClamAV keeps running with stale databases indefinitely. Monitor it or check periodically.

Monit is a lightweight safety net that automatically restarts crashed services and alerts you before users notice. It is particularly valuable for memory hungry services like ClamAV that can be affected by factors outside their own container.

This issue was debugged on a production Debian 13 mail server running Postfix, Dovecot 2.4, Rspamd, and ClamAV in an Incus container, with Roundcube webmail on a separate host. The full mail server setup is documented in my Debian 13 mail server series.
