
Advanced Mail Filtering: ClamAV, Neural Networks, Machine Learning

Part 5 of the Building a Modern Mail Server on Debian 13 series

Introduction

Part 4 established a production-ready mail server with Hall of Fame security status on internet.nl and excellent spam filtering through Rspamd. You now have:

  • ✅ Professional spam filtering with scoring and headers
  • ✅ Complete email authentication (DKIM, SPF, DMARC)
  • ✅ Greylisting for spam bot protection
  • ✅ DANE/TLSA certificate pinning
  • ✅ Automatic spam-to-Junk delivery

This is already a production-ready mail server suitable for most organizations. Part 5 is optional and adds advanced filtering layers that improve detection rates and reduce false positives through machine learning and collaborative intelligence.

What Part 5 Adds

Virus and Malware Protection:

  • ClamAV – Scans attachments for viruses, malware, and phishing

Phishing Protection:

  • OpenPhish & PhishTank – Real-time phishing URL detection

Machine Learning and Adaptation:

  • Neural Networks – Pattern recognition for sophisticated spam
  • Fuzzy Hashing – Detects near-duplicate spam campaigns
  • Bayes Classifier – Statistical learning from your mail patterns
  • Self-Learning – Automatic training from user actions

Production Considerations

Should you implement Part 5?

YES, implement if you:

  • Handle sensitive data requiring virus scanning
  • Experience sophisticated spam that bypasses basic filters
  • Have users who make Junk/INBOX classification decisions
  • Want adaptive filtering that learns from your mail patterns
  • Need the highest possible spam detection rates

SKIP Part 5 if you:

  • Run a small personal mail server (< 10 users)
  • Have limited system resources (< 4GB RAM)
  • Prefer simpler systems with fewer moving parts
  • Are satisfied with Part 4’s detection rates

Resource requirements for Part 5:

  • Additional 512MB-1GB RAM (primarily for ClamAV)
  • 2-4GB disk space for virus signatures
  • Modest CPU overhead for scanning and learning

Remember: Part 4 alone provides production-ready mail security. Part 5 enhances detection but isn’t required.

What You’ll Build

After completing this part, your mail server will have:

Multi-Layer Virus Protection

  • ClamAV virus scanning – All attachments checked for malware
  • Real-time signature updates – Fresh virus definitions daily
  • Optional unofficial signatures – Raises the total to 18M+ virus definitions
  • Safe failure mode – Mail delivered even if the scanner is down

Phishing Protection

  • OpenPhish feed – Real-time phishing URL database
  • PhishTank feed – Community-sourced phishing URLs
  • Automatic updates – Feeds refresh hourly
  • Credential theft prevention – Block fake login pages

Intelligent Learning Systems

  • Neural network – Recognizes sophisticated spam patterns
  • Fuzzy hashing – Detects spam with minor variations
  • Bayes classifier – Learns from your specific mail patterns
  • Auto-learning – Trains automatically from high-confidence decisions

User-Driven Training

  • Self-learning pipeline – Learns from folder moves
  • IMAPSieve integration – Automatic training scripts
  • Ham and spam training – Both directions taught
  • Continuous improvement – Gets smarter over time

Performance and Monitoring

  • Resource-efficient – Optimized for production servers
  • Detailed metrics – Per-module effectiveness tracking
  • Learning progress – Monitor improvement over time
  • Fallback strategies – Resilient to scanner failures

Prerequisites Check

From Part 4: Core Rspamd Setup

Verify you’ve completed Part 4 with these essentials:

# Check Rspamd is running
systemctl status rspamd

# Verify Valkey connection (as _rspamd user)
sudo -u _rspamd valkey-cli -s /run/valkey/valkey.sock ping
# Should respond: PONG
# Testing as _rspamd ensures Rspamd has proper permissions to access Valkey

# Check Rspamd metrics
rspamc stat
# Should show: Messages scanned, Actions taken

# Verify DKIM signing is configured (from Part 4)
# Note: This check only works if you've already sent mail from your server
journalctl -u rspamd --since "24 hours ago" | grep DKIM_SIGNED | tail -1
# Should show: DKIM_SIGNED(0.0){your-domain.com}
# If empty: Either no mail sent yet, or DKIM not configured

# If empty, send a test email via authenticated submission (port 587)
# This ensures mail goes through Rspamd for DKIM signing
# Replace the recipient address, the info@${DOMAIN} account, and the password with your own values
source /root/mail-server-vars.sh
echo "Test email to verify DKIM signing" | \
  swaks --to your-personal-email@gmail.com \
        --from info@${DOMAIN} \
        --server localhost:587 \
        --auth-user info@${DOMAIN} \
        --auth-password 'your-password' \
        --tls

# Wait for processing, then check logs
sleep 5
journalctl -u rspamd --since "1 minute ago" | grep DKIM_SIGNED
# Should now show: DKIM_SIGNED(0.0){your-domain.com}

# Alternative: Verify DKIM keys exist without sending mail
ls -la /var/lib/rspamd/dkim/
# Should show: default.key (owned by _rspamd)

# Or confirm that Rspamd itself can read the key (read-only check, never regenerates the key)
sudo -u _rspamd test -r /var/lib/rspamd/dkim/default.key && echo "DKIM key readable by _rspamd"
# If the message is printed, the key exists and has usable permissions

Expected from Part 4:

  • Rspamd installed and processing mail
  • Valkey (Redis) running and connected
  • DKIM keys created in /var/lib/rspamd/dkim/
  • DKIM signing outgoing mail
  • ARC preserving authentication through mailing lists/forwarders
  • SPF, DMARC records configured
  • Spam headers being added
  • Greylisting active
  • Sieve spam filter moving mail to Junk
  • Hall of Fame status on internet.nl

System Resources Check

Part 5 adds resource requirements, especially for ClamAV:

# Check available RAM
free -h
# Recommended minimum: 4GB total (2GB+ free)

# Check available disk space
df -h /
# Recommended minimum: 10GB free (for ClamAV signatures + learning data)

# Check CPU cores
nproc
# Recommended minimum: 2 cores (4+ for smooth performance)

Resource guidelines:

  • < 4GB RAM: Skip ClamAV or use on-demand scanning only
  • 4-8GB RAM: Full setup works but monitor closely
  • 8GB+ RAM: Comfortable for all Part 5 features
  • SSD storage: Strongly recommended for learning databases
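
The RAM tiers above can be turned into a quick check. This is a minimal sketch: the function name part5_ram_tier and its output labels are illustrative only, and the cut-offs are simply the 4GB and 8GB boundaries from the list.

```shell
#!/bin/sh
# Map total RAM (in MB) to the Part 5 guideline tiers listed above.
# part5_ram_tier and its labels are hypothetical, not part of any tool.
part5_ram_tier() {
  total_mb=$1
  if [ "$total_mb" -lt 4096 ]; then
    echo "skip-clamav"     # < 4GB: skip ClamAV or use on-demand scanning only
  elif [ "$total_mb" -lt 8192 ]; then
    echo "full-monitor"    # 4-8GB: full setup works, monitor closely
  else
    echo "full"            # 8GB+: comfortable for all Part 5 features
  fi
}

# Fixed example inputs
part5_ram_tier 2048
part5_ram_tier 6144
part5_ram_tier 16384
```

On a live system you could feed it the real value with part5_ram_tier "$(free -m | awk '/^Mem:/ {print $2}')". Note that a machine sold as "4GB" usually reports slightly less than 4096MB, so it would land in the lowest tier.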

Verify Mail Flow

Ensure basic mail flow is working correctly:

# Source configuration
source /root/mail-server-vars.sh

# Send test email
echo "Part 5 Prerequisites Test - $(date)" | mail -s "Pre-P5 Test" info@${DOMAIN}

# Watch complete mail pipeline (press Ctrl+C to stop following before running the next command)
journalctl -u postfix -u dovecot -u rspamd -f

# Check email was delivered
doveadm search -u info@${DOMAIN} subject "Pre-P5 Test"
# Should show message ID

Expected behavior:

  1. Postfix receives mail
  2. Rspamd processes and scores mail
  3. Dovecot delivers to INBOX or Junk
  4. Sieve filter executes correctly

If all checks pass, you’re ready for Part 5!

Architecture Overview

Here’s how the advanced filtering pipeline integrates with your existing setup from Part 4:

┌──────────────────────────────────────────────────────┐
│                Incoming/Outgoing Mail                │
└──────────────────────────┬───────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────┐
│                       Postfix                        │
│                 SMTP Server (Port 25)                │
└──────────────────────────┬───────────────────────────┘
                           │
                           │ Milter Protocol
                           │
┌──────────────────────────▼───────────────────────────┐
│                       Rspamd                         │
│            Advanced Multi-Layer Analysis             │
│                                                      │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐  │
│  │  Traditional │ │ Virus/Malware│ │   Learning   │  │
│  │   Filters    │ │  Protection  │ │   Systems    │  │
│  ├──────────────┤ ├──────────────┤ ├──────────────┤  │
│  │ • SPF/DKIM   │ │ • ClamAV     │ │ • Neural Net │  │
│  │ • Greylisting│ │ • Phishing   │ │ • Bayes      │  │
│  │ • Headers    │ │ • RBL Checks │ │ • Fuzzy Hash │  │
│  │ • Reputation │ │ • Signatures │ │ • Auto-learn │  │
│  └──────────────┘ └──────────────┘ └──────────────┘  │
│                           │                          │
│  Each module votes on message classification         │
│  Final score = weighted combination of all signals   │
│                                                      │
└──────────────────────────┬───────────────────────────┘
                           │
                           │ Headers: X-Spam, X-Spam-Score
                           │ Actions: No action, Greylist, Add header, Reject
                           │
┌──────────────────────────▼───────────────────────────┐
│                    Dovecot LMTP                      │
│                    Mail Delivery                     │
└──────────────────────────┬───────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────┐
│                    Sieve Filter                      │
│                if X-Spam: Yes → Junk                 │
│                     else → INBOX                     │
└──────────────────────────┬───────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────┐
│                   Final Delivery                     │
│                 INBOX or Junk Folder                 │
└──────────────────────────┬───────────────────────────┘
                           │
                           │ User Actions
                           │ (Move Junk↔INBOX)
                           │
┌──────────────────────────▼───────────────────────────┐
│                   IMAPSieve Scripts                  │
│            Automatic Learning Pipeline               │
│                                                      │
│              Move to Junk → Train as Spam            │
│              Move to INBOX → Train as Ham            │
│                    ┌────────────┐                    │
│                    │  Learning  │                    │
│                    │   Feedback │                    │
│                    └─────┬──────┘                    │
└──────────────────────────┼───────────────────────────┘
                           │
                           │ Trained Models
                           │
┌──────────────────────────▼───────────────────────────┐
│                   Valkey (Redis)                     │
│              Persistent Learning Data                │
│                                                      │
│              • Bayes Token Statistics                │
│              • Neural Network Weights                │
│              • Fuzzy Hash Database                   │
│              • Statistical Counters                  │
└──────────────────────────────────────────────────────┘

Data Flow Explained

Mail Arrival → Analysis:

  1. Postfix receives mail → Forwards to Rspamd via milter
  2. Rspamd performs parallel analysis:
    • Traditional checks: SPF, DKIM, DMARC, headers, reputation, RBL
    • Virus scanning: ClamAV (viruses, malware, phishing)
    • Learning systems: Neural network, Bayes, fuzzy hashing
  3. Each module “votes” with a score (positive = spam, negative = ham)
  4. Final score = weighted combination of all signals
  5. Action taken based on thresholds (reject, greylist, add header, no action)

Mail Delivery → Learning:

  1. Sieve filter reads X-Spam header → Routes to INBOX or Junk
  2. User reviews mail and moves messages if needed
  3. IMAPSieve detects folder changes:
    • Move to Junk → Train Rspamd as spam
    • Move to INBOX → Train Rspamd as ham (not spam)
  4. Training updates stored in Valkey:
    • Bayes: Token frequency statistics
    • Neural network: Weight adjustments
    • Fuzzy: Hash signatures

Continuous Improvement:

  • More user corrections → Better learning
  • More spam seen → Better pattern recognition
  • System adapts to YOUR specific mail patterns
  • Detection accuracy improves over time

Module Interaction

Voting system example:

Message: "Buy cheap pills now!"
├─ SPF: -0.2 (pass)
├─ DKIM: -0.2 (valid signature)
├─ Bayes: +5.0 (learned spam pattern)
├─ Neural: +4.5 (spam-like structure)
├─ Fuzzy: +3.0 (similar to previous spam)
├─ RBL: +2.8 (sender IP on blacklist)
└─ ClamAV: 0.0 (no virus)

Final Score: +14.9 / 15.0 → ADD HEADER (0.1 below the REJECT threshold; delivered to Junk)

Each module contributes evidence. The combined score determines the action.
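
The arithmetic in the voting example can be reproduced with a few lines of shell. This is a sketch under stated assumptions: sum_scores and action_for are hypothetical helpers, and the thresholds (greylist 4, add header 6, reject 15) are Rspamd's stock defaults, which may differ from what you configured in Part 4.

```shell
#!/bin/sh
# Sum per-module scores given as "NAME SCORE" lines on stdin.
sum_scores() {
  awk '{ total += $2 } END { printf "%.1f\n", total }'
}

# Map a total score to an action.
# Thresholds are ASSUMED defaults (greylist 4, add header 6, reject 15);
# pass your own as arguments 2-4 if Part 4 uses different values.
action_for() {
  awk -v s="$1" -v greylist="${2:-4}" -v add_header="${3:-6}" -v reject="${4:-15}" 'BEGIN {
    if (s >= reject) print "reject"
    else if (s >= add_header) print "add header"
    else if (s >= greylist) print "greylist"
    else print "no action"
  }'
}

total=$(sum_scores <<'EOF'
SPF -0.2
DKIM -0.2
Bayes 5.0
Neural 4.5
Fuzzy 3.0
RBL 2.8
ClamAV 0.0
EOF
)
echo "total: $total"
echo "action: $(action_for "$total")"
```

Under those assumed defaults a total of 14.9 sits just below the reject threshold, so which action fires for a borderline message depends on the thresholds you actually set.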

ClamAV Integration

ClamAV is an open-source antivirus engine that scans email attachments for viruses, malware, trojans, and phishing attempts.

Why ClamAV for Mail

What ClamAV catches:

  • Viruses and malware – Executable files with malicious code
  • Office macro viruses – Infected Word/Excel documents
  • Phishing emails – Credential-stealing attempts
  • Suspicious attachments – Password-protected archives, scripts
  • Zero-day threats – Heuristic detection for unknown malware

Production considerations:

  • RAM-intensive: Requires 512MB-1GB RAM for the signature database
  • Disk space: Virus signatures consume 2-4GB
  • CPU overhead: Scanning adds 50-200ms per message with attachments
  • Scan timeouts: Configure reasonable limits to prevent blocking mail

Scan strategy:

  • All attachments scanned before delivery
  • Non-attachment mail passes immediately (minimal overhead)
  • Safe failure mode: Mail delivered if the scanner is unavailable
  • Daily signature updates for the latest threat detection

Installation

# Install ClamAV and update daemon
apt install -y clamav clamav-daemon

# ClamAV installs without running - fresh signature database needed
# Check service is disabled (expected)
systemctl status clamav-daemon
# Should show: disabled or inactive

# Start signature update (will take several minutes)
systemctl stop clamav-freshclam
freshclam

# Initial download takes 5-10 minutes
# Database is 200-400MB compressed, 2-4GB uncompressed

Monitor signature update:

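# Watch update progress (a manual freshclam run prints to the terminal;
# the journal shows progress once the clamav-freshclam service is running)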
journalctl -u clamav-freshclam -f

# Expected output:
# Reading CVD header (main.cvd): OK
# main database available for download (version: 27)
# Downloading main.cvd [100%]
# Database updated (10,389,214 signatures)

Start services after initial update:

# Enable and start freshclam (automatic updates)
systemctl enable clamav-freshclam
systemctl start clamav-freshclam

# Enable and start scanner daemon
systemctl enable clamav-daemon
systemctl start clamav-daemon

# Verify both services running
systemctl status clamav-daemon clamav-freshclam

Configure ClamAV for Mail Scanning

ClamAV’s default configuration needs tuning for mail server use.

Configure Scanner Performance

Edit /etc/clamav/clamd.conf:

# Backup original
cp /etc/clamav/clamd.conf /etc/clamav/clamd.conf.orig

# Edit configuration
vi /etc/clamav/clamd.conf

Find and modify these settings:

# Maximum file size to scan (25MB sufficient for mail)
MaxFileSize 25M

# Maximum scan size (some files expand - 100MB safe)
MaxScanSize 100M

# Maximum recursion level (for compressed archives)
MaxRecursion 10

# Maximum files in archive
MaxFiles 1000

# Phishing detection (important for email!)
PhishingSignatures yes
PhishingScanURLs yes

# Heuristic detection (catches unknown malware)
HeuristicScanPrecedence yes

# Alert on broken executables and on encrypted archives/documents (suspicious in email)
AlertBrokenExecutables yes
AlertEncrypted yes
AlertEncryptedArchive yes
AlertEncryptedDoc yes

# Performance settings
# MaxThreads: Default is 12 (good for most systems)
# Reduce only if you have limited CPU cores (e.g., 2 cores = MaxThreads 2)
MaxThreads 12

Configuration explained:

  • MaxFileSize 25M: Don’t scan enormous files (mail usually < 25MB)
  • MaxScanSize 100M: Extraction limit (archives expand)
  • MaxRecursion 10: How deep into nested archives to scan
  • Phishing*: Essential for catching credential-stealing emails
  • Heuristic*: Detects suspicious patterns in unknown files
  • Alert*: Flag encrypted/broken files as suspicious
  • MaxThreads 12: Parallel scanning threads (default, good for 4+ core systems)
    • Reduce to 2-4 only if you have 2 CPU cores or less
    • Keep default (12) for modern VPS/dedicated servers
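
If you would rather script the clamd.conf changes than edit by hand, a small helper can replace an existing option or append a missing one. This is only a sketch: set_conf_option is a hypothetical function, it assumes GNU sed (as shipped on Debian) and simple values without "|" or "&", and the demo runs against a temporary file rather than the real configuration.

```shell
#!/bin/sh
# Set "Key Value" in a clamd.conf-style file.
# Replaces an existing uncommented line for the key, or appends the option.
# set_conf_option is a hypothetical helper, not a ClamAV tool.
set_conf_option() {
  file=$1
  key=$2
  value=$3
  if grep -Eq "^${key}[[:space:]]" "$file"; then
    # The trailing whitespace match keeps AlertEncrypted from
    # matching AlertEncryptedArchive.
    sed -i -E "s|^${key}[[:space:]].*|${key} ${value}|" "$file"
  else
    printf '%s %s\n' "$key" "$value" >> "$file"
  fi
}

# Demo on a throwaway file
tmp=$(mktemp)
printf 'MaxFileSize 100M\nMaxThreads 12\n' > "$tmp"
set_conf_option "$tmp" MaxFileSize 25M
set_conf_option "$tmp" AlertEncrypted yes
cat "$tmp"
rm -f "$tmp"
```

Against the real file you would call, for example, set_conf_option /etc/clamav/clamd.conf MaxScanSize 100M after taking the backup shown earlier.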

Configure Automatic Updates

Edit /etc/clamav/freshclam.conf:

vi /etc/clamav/freshclam.conf

Find and verify these settings:

# Update checks per day (24 = one check per hour)
Checks 24

# Database mirror (use default)
DatabaseMirror database.clamav.net
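
freshclam documents Checks as the number of database checks per day, so the interval between checks follows from a simple division. The helper name below is illustrative only.

```shell
#!/bin/sh
# Convert a freshclam "Checks" value (checks per day) to minutes between checks.
# checks_interval_minutes is a hypothetical helper.
checks_interval_minutes() {
  echo $(( 1440 / $1 ))   # 1440 minutes in a day
}

checks_interval_minutes 24
checks_interval_minutes 12
```

A check that finds nothing new downloads nothing, so frequent checks are cheap.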

Restart services to apply configuration:

systemctl restart clamav-daemon clamav-freshclam

# Verify ClamAV daemon is running and socket exists
systemctl status clamav-daemon
ls -la /run/clamav/clamd.ctl
# Should show the socket (on Debian /var/run is a symlink to /run,
# so this is the same file as /var/run/clamav/clamd.ctl used by Rspamd below)

# Verify no errors in logs
journalctl -u clamav-daemon -n 20

# Verify freshclam can notify clamd (no warnings)
journalctl -u clamav-freshclam -n 20 | grep -i warning
# Should be empty (no warnings about "Can't connect to clamd")

Integrate ClamAV with Rspamd

Configure Rspamd to use ClamAV for virus scanning.

Enable ClamAV Module

Create /etc/rspamd/local.d/antivirus.conf:

cat > /etc/rspamd/local.d/antivirus.conf << 'EOF'
# ClamAV antivirus configuration

clamav {
  # Enable ClamAV scanner
  enabled = true;
  
  # Virus scanning
  type = "clamav";
  
  # ClamAV socket (Debian default)
  servers = "/var/run/clamav/clamd.ctl";
  
  # Symbol for virus detection
  symbol = "CLAM_VIRUS";
  
  # Actions based on detection
  action = "reject";
  message = 'Message rejected: Virus detected - ${VIRUS}';
  
  # Scan execution
  scan_mime_parts = true;    # Scan all MIME parts
  scan_text_mime = false;    # Skip text-only parts (performance)
  scan_image_mime = false;   # Skip images without executables
  
  # Timeout settings
  timeout = 15.0;            # 15 seconds max per scan
  retransmits = 3;           # Retry up to 3 times
  
  # If ClamAV is down, deliver mail anyway (safe failure)
  # Don't block legitimate mail because antivirus is offline
  fail_action = "accept";
  
  # Logging
  log_clean = false;         # Don't log clean messages (reduces noise)
  
  # Patterns to detect in ClamAV response
  patterns {
    # Virus found pattern
    virus = '^VIR';
    # Phishing found pattern  
    phish = '^Heuristics\.Phishing';
  }
  
  # Whitelist specific file types (trusted content)
  # Uncomment if needed:
  # whitelist = "/etc/rspamd/antivirus_whitelist.map";
}
EOF

Configuration explained:

  • servers: Unix socket to ClamAV daemon
  • symbol = "CLAM_VIRUS": Rspamd symbol for detection
  • action = "reject": What to do when virus found
  • scan_mime_parts = true: Scan all attachments
  • scan_text_mime / scan_image_mime = false: Skip non-executable parts (performance)
  • timeout = 15.0: Reasonable timeout (prevents mail delays)
  • fail_action = "accept": If ClamAV down, deliver mail anyway
  • log_clean = false: Only log virus detections (less journal noise)

Configure ClamAV Symbol Weight

Create /etc/rspamd/local.d/external_services_group.conf:

cat > /etc/rspamd/local.d/external_services_group.conf << 'EOF'
# External services symbol configuration

# ClamAV virus detection and phishing URL detection
symbols = {
  "CLAM_VIRUS" {
    weight = 0.0;
    description = "Virus found by ClamAV";
    
    # Virus detection ALWAYS rejects, regardless of score
    # Weight 0.0 because we reject immediately on detection
    # Not part of spam score - it's a hard block
  }
  
  "PHISHED_OPENPHISH" {
    weight = 10.0;
    description = "URL found in OpenPhish phishing database";
  }
  
  "PHISHED_PHISHTANK" {
    weight = 10.0;
    description = "URL found in PhishTank phishing database";
  }
}
EOF

Symbol weight explanations:

  • CLAM_VIRUS (0.0): Virus detection is binary – reject action handles blocking, not scoring
  • PHISHED_* (10.0): Confirmed phishing URLs warrant an immediate high score for rejection

Test Configuration and Restart

# Test Rspamd configuration
rspamadm configtest
# Should show: "syntax OK"

# Restart Rspamd to load ClamAV integration
systemctl restart rspamd

# Verify ClamAV module loaded
journalctl -u rspamd --since "5 minutes ago" | grep -i clam

# Should show ClamAV socket connection

Test ClamAV Scanning

Test virus detection using SWAKS from another server. Testing from localhost doesn’t reliably trigger ClamAV scanning.

From another server (not your mail server), run:

# Create EICAR test file
cat > /tmp/eicar.txt << 'EOF'
X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*
EOF

# Send test email with EICAR attachment
swaks --to info@yourdomain.com \
      --from test@example.com \
      --server mail.yourdomain.com \
      --helo example.com \
      --header "Subject: EICAR Virus Test" \
      --body "Testing ClamAV virus detection" \
      --attach /tmp/eicar.txt

Expected result – Email rejected:

<** 550 5.7.1 Message rejected: Virus detected - Eicar-Signature
*** Error: Message rejected: Virus detected

On your mail server, verify:

# Check for virus detection in logs
journalctl -u rspamd --since "5 minutes ago" | grep CLAM_VIRUS

# Expected output:
# CLAM_VIRUS(0.00){Eicar-Signature;}
# forced: reject "Virus detected: Eicar-Signature"

# Verify rejection (case-insensitive: the reject message says "Virus")
journalctl -u postfix --since "5 minutes ago" | grep -i "reject.*virus"

# Mail queue should be empty (virus rejected before queueing)
mailq

Test clean email delivery:

# Send normal email without attachment
swaks --to info@yourdomain.com \
      --from test@example.com \
      --server mail.yourdomain.com \
      --helo example.com \
      --body "Clean test email"

# Should deliver successfully

Troubleshooting:

# Verify ClamAV is running
systemctl status clamav-daemon

# Test ClamAV directly
echo 'X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*' | clamdscan -
# Should show: Eicar-Signature FOUND

# Update virus signatures if needed
freshclam
systemctl restart clamav-daemon

Monitor ClamAV Activity

# ClamAV daemon status
systemctl status clamav-daemon

# Recent virus detections
journalctl -u rspamd --since "7 days ago" | grep CLAM_VIRUS

# Shows lines like:
# CLAM_VIRUS(0.00){Eicar-Signature;}
# forced: reject "Virus detected: %s"; score=nan (set by clamav)

# Count virus detections in last 7 days
journalctl -u rspamd --since "7 days ago" | grep CLAM_VIRUS | wc -l

# ClamAV signature database version
clamdscan --version

# Check last database update status
journalctl -u clamav-freshclam | grep -E "Database updated|database is up-to-date" | tail -n 5
# Shows either "Database updated" (when signatures were downloaded) 
# or "database is up-to-date" (when already current)

# Signature database files and sizes
ls -lh "$(awk '/^DatabaseDirectory/ {print $2}' /etc/clamav/clamd.conf)"
# Shows main.cvd (main signatures), daily.cvd (daily updates)

# Verify ClamAV security and performance settings
grep -E "MaxFileSize|MaxScanSize|MaxRecursion|MaxFiles|PhishingSignatures|PhishingScanURLs|HeuristicScanPrecedence|AlertBrokenExecutables|AlertEncrypted|AlertEncryptedArchive|AlertEncryptedDoc|MaxThreads" /etc/clamav/clamd.conf

# Reference values before tuning (defaults vary by ClamAV version).
# If you applied the changes above, expect your tuned values instead,
# e.g. MaxFileSize 25M, MaxScanSize 100M, AlertEncrypted yes:
# MaxFileSize 100M          - Maximum file size to scan
# MaxScanSize 400M          - Maximum scan size (decompressed)
# MaxRecursion 17           - Archive nesting depth
# MaxFiles 10000            - Maximum files in archive
# PhishingSignatures yes    - Enable phishing signatures
# PhishingScanURLs yes      - Scan URLs for phishing
# HeuristicScanPrecedence no - Signature scan before heuristic
# AlertBrokenExecutables yes - Alert on corrupted executables
# AlertEncrypted no         - Don't alert on encrypted files (would cause false positives)
# AlertEncryptedArchive no  - Don't alert on encrypted archives
# AlertEncryptedDoc no      - Don't alert on encrypted documents
# MaxThreads 10             - Number of scanning threads
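
The detection count from the monitoring commands above can be broken down by signature name. This is a minimal sketch: clam_virus_summary is a hypothetical helper, it assumes the symbol is logged as CLAM_VIRUS(score){Name;} as in the sample lines, and "Other-Name" in the demo is a made-up signature name.

```shell
#!/bin/sh
# Count ClamAV detections per signature name from Rspamd log lines on stdin.
# Assumes the log format CLAM_VIRUS(score){Name;} shown in the samples above.
clam_virus_summary() {
  sed -n 's/.*CLAM_VIRUS([0-9.]*){\([^;}]*\).*/\1/p' |
    awk '{ count[$0]++ } END { for (name in count) print count[name], name }' |
    sort -rn
}

# Demo with sample lines (Other-Name is a made-up signature name)
clam_virus_summary <<'EOF'
rspamd: task; CLAM_VIRUS(0.00){Eicar-Signature;}
rspamd: task; CLAM_VIRUS(0.00){Eicar-Signature;}
rspamd: task; CLAM_VIRUS(0.00){Other-Name;}
rspamd: unrelated line
EOF
```

On the server, pipe the journal into it: journalctl -u rspamd --since "7 days ago" | clam_virus_summary.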

Optional: ClamAV Unofficial Signatures

⚠️ WARNING: This can add up to 2GB of extra RAM usage. Use only on systems with 8GB RAM or more.

The unofficial signatures project provides significantly more virus definitions than the official ClamAV databases, detecting many more threats.

Resource requirements:

  • An additional 2GB RAM
  • Additional 1-2GB disk space
  • Slightly longer scan times

When to use unofficial signatures:

  • ✅ High-security environments
  • ✅ Systems with ≥8GB RAM
  • ✅ Need maximum threat detection
  • ❌ Low-memory systems (< 8GB RAM)
  • ❌ Basic personal mail servers

Installation:

# Clone the unofficial signatures repository
cd /opt
git clone https://github.com/extremeshok/clamav-unofficial-sigs
cd clamav-unofficial-sigs

# Install the script
cp clamav-unofficial-sigs.sh /usr/local/sbin/
chmod +x /usr/local/sbin/clamav-unofficial-sigs.sh

# Create configuration directory
mkdir -p /etc/clamav-unofficial-sigs

# Copy configuration files
cp config/os/os.debian.conf /etc/clamav-unofficial-sigs/os.conf
cp config/master.conf /etc/clamav-unofficial-sigs/
cp config/user.conf /etc/clamav-unofficial-sigs/

# Enable the configuration
sed -i "s/#user_configuration_complete=\"yes\"/user_configuration_complete=\"yes\"/g" /etc/clamav-unofficial-sigs/user.conf

# Install logrotate and man page
/usr/local/sbin/clamav-unofficial-sigs.sh --install-logrotate
/usr/local/sbin/clamav-unofficial-sigs.sh --install-man

# Install systemd service and timer
cp systemd/clamav-unofficial-sigs.service /etc/systemd/system/
cp systemd/clamav-unofficial-sigs.timer /etc/systemd/system/
systemctl daemon-reload

# Enable automatic updates
systemctl enable clamav-unofficial-sigs.timer
systemctl start clamav-unofficial-sigs.timer

# Run initial download (this will take several minutes)
/usr/local/sbin/clamav-unofficial-sigs.sh

# Restart ClamAV to load new signatures
systemctl restart clamav-daemon

Verify installation:

# Check loaded signatures
clamscan --debug 2>&1 /dev/null | grep "loaded"

# Should show significantly more signatures after installation:
# Before: ~9 million signatures
# After: ~18+ million signatures

# Check memory usage
ps aux | grep clamd
# Expect ~2GB more RAM usage than before

Monitor signature updates:

# Check when signatures were last updated
systemctl status clamav-unofficial-sigs.timer

# View update logs
journalctl -u clamav-unofficial-sigs.service | tail -50

# Verify signatures are current
ls -lh /var/lib/clamav/

Maintenance:

The systemd timer automatically updates signatures daily. No manual intervention needed.

If you need to disable unofficial signatures:

# Stop and disable the timer
systemctl stop clamav-unofficial-sigs.timer
systemctl disable clamav-unofficial-sigs.timer

# Remove unofficial signature files (keep only official ones)
# The && ensures rm never runs in the wrong directory if cd fails
cd /var/lib/clamav/ && rm -f *.ndb *.hdb *.fp *.ftm *.ign *.ign2 *.mdb *.ldb *.sfp *.yar*

# Restart ClamAV
systemctl restart clamav-daemon

# Memory usage should return to normal (~500MB-1GB)

ClamAV Maintenance

Daily automatic tasks (already configured):

  • Freshclam automatically updates signatures daily
  • No manual intervention needed
  • Updates applied without service restart

Weekly maintenance tasks:

# Check disk space (signatures grow over time)
df -h /var/lib/clamav/
# Expected: 2-4GB used for signature databases
# Expected with unofficial sigs: 4-6GB used

# Verify signatures are current
journalctl -u clamav-freshclam | grep -E "Database updated|database is up-to-date" | tail -n 5

# Check for update errors
journalctl -u clamav-freshclam --since "7 days ago" | grep -i error

# Verify ClamAV daemon is healthy
systemctl status clamav-daemon
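
As a complement to reading the freshclam journal, the age of the daily database file itself can be checked. This is a sketch under assumptions: stale_signatures is a hypothetical helper, it relies on find's -maxdepth and -mtime options (available in GNU findutils on Debian), and it looks only at daily.cvd and daily.cld because main.cvd legitimately goes a long time between releases.

```shell
#!/bin/sh
# Print the daily signature database if it has not been modified recently.
# find's "-mtime +N" counts whole 24-hour periods, so +2 matches files
# last modified at least 3 days ago.
# stale_signatures is a hypothetical helper.
stale_signatures() {
  dir=$1
  days=${2:-2}
  find "$dir" -maxdepth 1 -type f \
    \( -name 'daily.cvd' -o -name 'daily.cld' \) -mtime +"$days"
}
```

Running stale_signatures /var/lib/clamav 2 should print nothing on a healthy server. Any path it prints is a daily database that freshclam has not refreshed for several days.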

Monthly monitoring:

# Review virus detection statistics
journalctl -u rspamd --since "30 days ago" | grep CLAM_VIRUS | wc -l
# Count of virus detections this month

# Check resource usage
ps aux | grep clamd
# Monitor RAM usage (should be 500MB-1GB)
# With unofficial signatures: 2.5-3GB

Phishing Protection

Phishing attacks try to steal credentials by impersonating legitimate services. Rspamd can check URLs against known phishing databases to block these attempts.

What is Phishing?

Phishing characteristics:

  • Fake login pages mimicking banks, email providers and social media
  • Urgent messages claiming account problems or security issues
  • Links to malicious sites harvesting credentials
  • Often uses legitimate-looking domains (paypa1.com instead of paypal.com)

Why phishing protection matters:

  • Protects users from credential theft
  • Prevents account compromise
  • Blocks access to malware distribution sites
  • Complements other security layers

Enable OpenPhish and PhishTank Feeds

OpenPhish and PhishTank maintain public feeds of known phishing URLs that are updated continuously. These are free, reliable services that provide excellent protection.

Configure phishing detection:

cat > /etc/rspamd/local.d/phishing.conf << 'EOF'
# Phishing URL detection using public feeds
# Note: Settings below are merged into the existing phishing { } block from modules.d

# Enable OpenPhish support
openphish_enabled = true;

# OpenPhish feed URL (moved to GitHub)
openphish_map = "https://raw.githubusercontent.com/openphish/public_feed/refs/heads/main/feed.txt";

# Set to true only if using premium feed
openphish_premium = false;

# Enable PhishTank feed
phishtank_enabled = true;
EOF

Configuration explained:

  • Files in local.d/ are merged into the default config from modules.d/
  • Don’t wrap settings in phishing { } – Rspamd does this automatically
  • openphish_enabled = true: Activates OpenPhish checking
  • openphish_map: OpenPhish free feed (now hosted on GitHub)
  • openphish_premium = false: Uses free feed (set to true only with paid account)
  • phishtank_enabled = true: Activates PhishTank checking

Note: Phishing symbol weights were already configured in the external_services_group.conf file created earlier in the ClamAV section.

Test and restart:

# Test configuration
rspamadm configtest
# Should show: "syntax OK"

# Restart Rspamd to load phishing feeds
systemctl restart rspamd

# Verify phishing module loaded
journalctl -u rspamd --since "1 minute ago" | grep -i phish

# Check that feeds are being downloaded
sleep 60  # Wait for initial feed download
journalctl -u rspamd --since "5 minutes ago" | grep -i "openphish\|phishtank" | tail -10

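Expected output (similar to the following; exact wording varies by Rspamd version):

rspamd: loaded openphish map from https://raw.githubusercontent.com/openphish/public_feed/refs/heads/main/feed.txt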
rspamd: loaded phishtank feed

Monitor Phishing Protection

Important: Phishing symbols only appear after detecting actual phishing attempts. If you haven’t received phishing emails yet, statistics will be empty – this is normal!

Verify phishing protection is active:

# Confirm phishing module is loaded
journalctl -u rspamd --since "1 hour ago" | grep "init lua module phishing"
# Should show: init lua module phishing from /usr/share/rspamd/plugins/phishing.lua

# Check OpenPhish feed loaded successfully
journalctl -u rspamd --since "1 hour ago" | grep "parsed.*elements from openphish"
# Should show: parsed 300 elements from openphish feed (or similar number)

# Verify feed cache files exist
ls -lh /var/lib/rspamd/*.map | head -5
# Should show multiple .map files (phishing databases are cached here)

# Check when feeds will refresh
journalctl -u rspamd | grep "next check at" | grep -i "openphish" | tail -1
# Shows next automatic update time

Monitor phishing detections (only shows data after phishing emails are blocked):

# Check for phishing detections
journalctl -u rspamd | grep "PHISHED_" | tail -20

# Shows blocked phishing attempts like:
# PHISHED_OPENPHISH(10.0){http://malicious-site.com}

# Count phishing blocks in last 7 days
journalctl -u rspamd --since "7 days ago" | grep -E "PHISHED_OPENPHISH|PHISHED_PHISHTANK" | wc -l

# View recent phishing URLs blocked
journalctl -u rspamd --since "7 days ago" | grep "PHISHED_" | grep -oP 'https?://[^}]+' | sort -u

Note: rspamc stat and /var/lib/rspamd/ won’t show “phish” until actual phishing is detected. The map cache files use hashed names, not “phish” in the filename.
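
To see which feed is doing the work, the detections can be tallied per symbol. This is a minimal sketch: phish_feed_summary is a hypothetical helper, and the sample URLs in the demo are invented for illustration.

```shell
#!/bin/sh
# Tally phishing detections per feed symbol from Rspamd log lines on stdin.
# Prints "SYMBOL COUNT", one line per symbol seen.
# phish_feed_summary is a hypothetical helper.
phish_feed_summary() {
  grep -oE 'PHISHED_[A-Z]+' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Demo with invented sample lines
phish_feed_summary <<'EOF'
rspamd: PHISHED_OPENPHISH(10.0){http://malicious-site.example}
rspamd: PHISHED_PHISHTANK(10.0){http://fake-login.example}
rspamd: PHISHED_OPENPHISH(10.0){http://another-site.example}
EOF
```

On the server: journalctl -u rspamd --since "7 days ago" | phish_feed_summary. No output simply means no phishing URLs were matched in that period.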

Phishing Feed Updates

Automatic updates:

  • OpenPhish: Feed refreshes automatically every hour
  • PhishTank: Feed refreshes automatically every hour
  • No manual intervention needed

Verify feeds are current:

# Check OpenPhish feed status
journalctl -u rspamd | grep -E "openphish.*read map data" | tail -1
# Should show: read map data [number] bytes

# View feed refresh schedule
journalctl -u rspamd | grep "next check at" | grep openphish | tail -1
# Shows when next update will occur

What This Provides

Real-time phishing URL detection – URLs checked against current threat databases
Free public feeds – No premium account needed
Automatic feed updates – Stays current without manual work
Very low resource overhead – Simple URL lookups, minimal CPU/RAM
High-confidence detection – Only confirmed phishing sites in feeds
Complements other filtering – Works alongside ClamAV, Bayes, Neural networks

Detection rate improvement:

  • Phishing protection catches credential-stealing attempts that traditional spam filters miss
  • Particularly effective against targeted “spear phishing” attacks
  • Works even when phishing emails have perfect SPF/DKIM/DMARC

Collaborative Spam Detection: Razor, Pyzor, and DCC

Overview

Three collaborative spam detection networks exist that work similarly:

  • Razor – Collaborative network sharing fuzzy checksums of spam
  • Pyzor – Similar to Razor, using hash-based spam digests
  • DCC (Distributed Checksum Clearinghouse) – Bulk email detection via checksums

How They Work

All three operate on the same principle:

  1. Compute a signature/checksum of incoming messages
  2. Query a global network: “Has this been reported as spam?”
  3. If many reports exist, increase spam score
  4. Optionally report spam back to the network
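
The steps above can be illustrated with a toy lookup. This sketch is not the Razor, Pyzor, or DCC protocol: it assumes a hypothetical local text file of "digest report-count" lines in place of the global network, and it uses a plain SHA-256 of the whitespace-stripped, lowercased text where the real tools use their own signatures. All function names are made up for illustration.

```shell
#!/bin/sh
# Step 1: compute a signature of a message read from stdin.
message_digest() {
  tr -d '[:space:]' | tr '[:upper:]' '[:lower:]' | sha256sum | cut -d' ' -f1
}

# Step 2: look the digest up in a reports file ("digest count" per line).
# Prints the report count, or 0 if the digest is unknown.
report_count() {
  awk -v d="$1" '$1 == d { n = $2 } END { print n + 0 }' "$2"
}

# Step 3: add to the spam score only when enough reports exist.
# $1 = report count, $2 = threshold, $3 = score to add
score_for_reports() {
  if [ "$1" -ge "$2" ]; then echo "$3"; else echo 0; fi
}
```

Because the digest ignores case and whitespace, two copies of a message that differ only in spacing map to the same entry. Any other change produces a different digest, which is the weakness that fuzzy hashing (covered later) addresses.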

Why We’re Not Implementing Them

After extensive testing, we’ve decided not to include these tools in this guide for the following reasons:

1. Technical Implementation Issues

Razor (written in Perl) and Pyzor (written in Python) are legacy applications that have significant compatibility issues with modern systemd security sandboxing:

  • Permission problems with Perl and Python module access for restricted users
  • Complex systemd socket activation workarounds required
  • Unreliable operation in restricted security contexts
  • Time-consuming troubleshooting for marginal benefit

2. Marginal Value

With the comprehensive stack we’ve already implemented, these tools add minimal additional protection:

Your current anti-spam stack:

  • ClamAV – Virus and malware detection
  • Bayes classifier – Personalized statistical learning
  • Neural networks – Advanced pattern recognition
  • Fuzzy hashing – Spam variant detection
  • RBL checks – Real-time IP/domain blacklists (Part 4)
  • DKIM/SPF/DMARC – Email authentication (Part 4)

Adding Razor/Pyzor/DCC provides perhaps 1-2% additional detection rate at best, which doesn’t justify the added complexity.

3. Maintenance Burden

These tools require:

  • Additional services to monitor and maintain
  • Regular connectivity checks to external networks
  • Troubleshooting when external services have issues
  • Updates when protocols or servers change

4. Network Dependencies

All three depend on external networks being available and responsive. Network issues or service outages can:

  • Slow down mail processing
  • Create timeout errors in logs
  • Require manual intervention

DCC Specific Issues

DCC has additional complications:

  • Requires accepting a commercial license (even for free tier)
  • More restrictive than other tools
  • No significant advantage over Razor/Pyzor to justify the extra licensing complexity

Our Recommendation

Skip Razor, Pyzor, and DCC entirely. Your mail server will have:

Excellent spam detection – The core stack catches 99%+ of spam
Reliable operation – No legacy tool compatibility issues
Easier maintenance – Fewer moving parts to monitor
Better performance – No additional checksum-network queries for every message

For Advanced Users

If you still want to implement these tools despite the challenges:

Razor/Pyzor:

  • The official Rspamd documentation covers systemd socket integration
  • Expect to spend significant time troubleshooting Perl and Python module permissions
  • May require disabling systemd security features

DCC:

Warning: We don’t provide setup instructions for these tools because they don’t meet our reliability and value standards for production mail servers.

Neural Network Learning

Rspamd includes a powerful neural network that learns spam patterns through training.

How Rspamd Neural Networks Work

Architecture:

  • Input layer: Rspamd symbols (SPF, DKIM, Bayes, etc.)
  • Hidden layer: Pattern recognition and feature extraction
  • Output layer: Ham vs Spam classification

Training process:

  1. Neural network observes Rspamd’s symbol outputs for each message
  2. Observes final classification (spam or ham) based on existing rules
  3. Adjusts internal weights to better predict classification
  4. Over time, learns patterns and symbol correlations
  5. Provides prediction even for messages that partially match known patterns

What it learns:

  • Symbol correlations: Which combinations indicate spam
  • Pattern recognition: Message structures typical of spam/ham
  • Local patterns: Specific to YOUR mail patterns
  • Nuanced scoring: Not binary, provides confidence scores

Example: Neural network learns that:

SPF_PASS + DKIM_ALLOW + Bayes_Ham + No_Fuzzy_Match = Definitely ham (-3.0)
SPF_FAIL + No_DKIM + Bayes_Spam + Fuzzy_Match = Definitely spam (+5.0)
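
How a set of symbols turns into a single confidence value can be sketched with one artificial neuron. This is a simplified illustration only: the real module trains a multi-layer network, and the function name, weights, and bias used here are made-up values, not anything Rspamd has learned. Each input is 1 if a symbol fired and 0 if it did not.

```shell
#!/bin/sh
# Forward pass of a single sigmoid neuron.
# $1 = space-separated weights, $2 = space-separated inputs (0 or 1), $3 = bias
# Prints a value between 0 (ham-like) and 1 (spam-like).
neural_predict() {
  awk -v w="$1" -v x="$2" -v b="$3" 'BEGIN {
    n = split(w, W, " ")
    split(x, X, " ")
    z = b
    for (i = 1; i <= n; i++) z += W[i] * X[i]   # weighted sum of symbols
    printf "%.4f\n", 1 / (1 + exp(-z))          # sigmoid squashes to 0..1
  }'
}
```

For example, neural_predict "2 2" "0 0" 0 prints 0.5000 (no symbols fired, so no evidence either way), while the same weights with both inputs set to 1 push the output close to 1. Training consists of adjusting the weights so that these outputs agree with the known classifications.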

Configure Neural Network

Enable and configure the neural module:

cat > /etc/rspamd/local.d/neural.conf << 'EOF'
# Neural network configuration

# Use Valkey for persistent storage
servers = "/run/valkey/valkey.sock";

# Neural network structure
train {
  max_trains = 1000;       # Limit training iterations per session
  max_usages = 20;         # Limit influences per classification
  max_iterations = 25;     # Maximum epochs during training
  learning_rate = 0.01;    # How quickly network adjusts
  
  # Training triggers
  ham_score = -1.0;        # Score below -1.0 trains as ham
  spam_score = 6.0;        # Score above 6.0 trains as spam
  
  # Minimum symbols required for training
  min_learns = 3;
}

# Network layers
layers = [
  {
    # Input layer size calculated from enabled symbols
    size = auto;
  },
  {
    # Hidden layer
    size = 64;
    activation = "relu";
  },
  {
    # Output layer (binary classification)
    size = 1;
    activation = "sigmoid";
  }
];

# Symbol for neural network prediction
symbol = "NEURAL_HAM";
symbol_spam = "NEURAL_SPAM";

# Per-user neural networks (disabled here: one shared network for the whole server)
per_user = false;          # Set to true if serving multiple distinct domains

# Pre-filter - only use neural network for uncertain messages
# Messages with clear spam/ham signals skip neural (performance)
pre_filter = {
  min_score = 3.0;         # Only check if current score between 3-7
  max_score = 7.0;
}
EOF

Configuration explained:

  • max_trains: Limit training per session (prevents overtraining)
  • ham_score / spam_score: Confidence thresholds for automatic training
  • learning_rate: How aggressively to adjust weights (0.01 = cautious)
  • layers: Network structure (input → 64-neuron hidden → output)
  • pre_filter: Only invoke neural for uncertain messages (performance)
  • per_user = false: Single neural network for entire server (simplest)

Configure Neural Symbols

Edit /etc/rspamd/local.d/neural_group.conf:

cat > /etc/rspamd/local.d/neural_group.conf << 'EOF'
# Neural network symbol group

# Neural network symbols
symbols = {
  "NEURAL_HAM" {
    weight = -3.0;
    description = "Neural network ham prediction";
  }
  
  "NEURAL_SPAM" {
    weight = 5.0;
    description = "Neural network spam prediction";
  }
}
EOF

Enable Neural Training

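Neural networks need training data. We’ll configure automatic training from high-confidence classifications. The block is appended with >> so that the symbol definitions written to neural_group.conf in the previous step are not overwritten:

cat >> /etc/rspamd/local.d/neural_group.conf << 'EOF'

# Neural network training configuration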

# Automatic training enabled
settings {
  # Train automatically from high-confidence messages
  train {
    # Spam threshold (messages above this train as spam)
    spam_score = 12.0;
    
    # Ham threshold (messages below this train as ham)
    ham_score = -5.0;
    
    # Maximum number of training samples to store
    max_trains = 10000;
    
    # Learning rate (how strongly weights are adjusted per training step)
    learning_rate = 0.01;
  }
}
EOF

Test and Restart

# Test configuration
rspamadm configtest
# Should show: "syntax OK"

# Restart Rspamd to enable neural network
systemctl restart rspamd

# Verify neural module loaded
journalctl -u rspamd --since "5 minutes ago" | grep -i neural

# Check Valkey for neural network data
valkey-cli -s /run/valkey/valkey.sock --scan --pattern "rn:*"
# Should show neural network keys after some training

Monitor Neural Network Learning

Check training progress:

# Neural network statistics
rspamc stat | grep -i neural

# Check for neural predictions in logs
journalctl -u rspamd | grep -E "NEURAL_HAM|NEURAL_SPAM" | tail -n 20

# Valkey neural network data
valkey-cli -s /run/valkey/valkey.sock --scan --pattern "rn:*" | wc -l
# Shows count of neural network data keys

What to expect:

  • First 100-200 messages: No neural predictions (insufficient training)
  • After 500+ messages: Neural starts making predictions
  • After 1000+ messages: Neural predictions become reliable
  • Continuous improvement as more mail is processed

Neural network symbols in action:

Example spam:
├─ SPF: +2.0 (fail)
├─ Bayes: +4.0 (spam-like)
├─ Fuzzy: +2.5 (near-duplicate match)
├─ NEURAL_SPAM: +5.0 (network learned this pattern is spam)
└─ Final Score: +13.5 / 15.0 → ADD_HEADER (delivered to Junk)

Fuzzy Hashing

Fuzzy hashing detects near-duplicate spam messages, catching spam campaigns with minor text variations.

How Fuzzy Hashing Works

Traditional spam filters fail on variations:

Message 1: "Buy cheap watches now!"
Message 2: "Buy cheap wat ches now!"     ← Spaces added
Message 3: "Buy cheap w4tches now!"      ← Characters changed

Traditional filter: 3 different messages
Fuzzy hash: 3 nearly identical messages → Spam pattern!

Fuzzy hashing:

  1. Computes fuzzy hash of message content
  2. Stores hash with classification (spam or ham)
  3. New messages compared to stored hashes
  4. Near matches trigger spam score
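
The idea of a "near match" can be made concrete with a small similarity measure. This is a sketch for illustration, not Rspamd's algorithm: it compares the sets of character trigrams of two strings (Jaccard similarity) after removing spacing and punctuation, and the function name trigram_similarity is hypothetical.

```shell
#!/bin/sh
# Jaccard similarity (0.00 to 1.00) of the character trigram sets of two strings.
trigram_similarity() {
  awk -v a="$1" -v b="$2" '
    # Fill set[] with the distinct trigrams of s; return how many there are.
    function trigrams(s, set,    i, g, n) {
      s = tolower(s)
      gsub(/[^a-z0-9]/, "", s)        # ignore spacing and punctuation
      n = 0
      for (i = 1; i <= length(s) - 2; i++) {
        g = substr(s, i, 3)
        if (!(g in set)) { set[g] = 1; n++ }
      }
      return n
    }
    BEGIN {
      split("", A); split("", B)      # make sure A and B are empty arrays
      na = trigrams(a, A); nb = trigrams(b, B)
      common = 0
      for (g in A) if (g in B) common++
      total = na + nb - common        # size of the union of both sets
      printf "%.2f\n", (total > 0 ? common / total : 0)
    }'
}
```

With the three example messages, the spaced variant scores 1.00 against the original because spacing is ignored, and the "w4tches" variant still shares most of its trigrams with the original. A filter built on this idea would treat any score above a chosen threshold as a match.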

Use cases:

  • Mass spam campaigns: Same message sent with minor variations
  • Personalized spam: Template with name/company variations
  • Evasion techniques: Spammers deliberately vary messages slightly

Configure Fuzzy Storage

Enable Valkey-based fuzzy hash storage:

cat > /etc/rspamd/local.d/fuzzy_check.conf << 'EOF'
# Fuzzy hash configuration

# Disable the default rspamd.com fuzzy rule (we're using local storage)
rule "rspamd.com" {
  enabled = false;
}

# Define our local fuzzy rule
rule "local" {
  # Algorithm - mumhash is fast and effective
  algorithm = "mumhash";
  
  # Backend storage
  backend = "redis";
  servers = "/run/valkey/valkey.sock";
  
  # Symbol for matches
  symbol = "LOCAL_FUZZY";
  
  # Flags
  read_only = false;       # Allow learning new hashes
  skip_unknown = true;     # Skip if no hash found
  
  # Scoring
  min_score = 1.0;         # Weak match
  max_score = 3.0;         # Strong match
  
  # Storage settings
  expire = 2592000;        # 30 days (2592000 seconds)
  min_length = 100;        # Don't hash very short messages
}

# Automatic fuzzy learning from high-confidence messages
fuzzy_learn {
  # Learn spam hashes from clear spam
  spam {
    min_score = 12.0;
  }
  
  # Learn ham hashes from clear ham
  ham {
    max_score = -3.0;
  }
}
EOF

Configuration explained:

  • rule "rspamd.com" { enabled = false; }: Disables default public fuzzy storage
  • algorithm = "mumhash": Fast modern hash algorithm
  • backend = "redis": Store hashes in Valkey (Redis-compatible)
  • min_score / max_score: How much to add to spam score on match
  • expire = 2592000: Keep fuzzy hashes for 30 days
  • min_length = 100: Don’t bother hashing very short messages

Configure Fuzzy Symbols

Edit /etc/rspamd/local.d/fuzzy_group.conf:

cat > /etc/rspamd/local.d/fuzzy_group.conf << 'EOF'
# Fuzzy hash symbol configuration

symbols = {
  "LOCAL_FUZZY" {
    weight = 3.0;
    description = "Fuzzy hash match (near-duplicate spam)";
  }
  
  "LOCAL_FUZZY_DENIED" {
    weight = 3.0;
    description = "Fuzzy hash match (known spam)";
  }
  
  "LOCAL_FUZZY_PROB" {
    weight = 1.5;
    description = "Fuzzy hash probable match";
  }
}
EOF

Test and Restart

# Test configuration
rspamadm configtest

# Restart Rspamd
systemctl restart rspamd

# Verify fuzzy module loaded
journalctl -u rspamd --since "5 minutes ago" | grep -i fuzzy

Train Fuzzy Hashes

Automatic training from high-confidence messages is already configured: the fuzzy_learn block in the fuzzy_check.conf written above learns spam hashes from messages scoring 12.0 or higher and ham hashes from messages scoring -3.0 or lower. Do not append the block a second time, since that would only duplicate it in the file.

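Manual training (optional):

# rspamc learn_spam and learn_ham train the Bayes classifier, not fuzzy storage.
# Fuzzy hashes are added with fuzzy_add. The flag (-f) must match a flag
# defined for your fuzzy rule; 1 is a placeholder here. -w sets the hash weight.
rspamc -f 1 -w 10 fuzzy_add /path/to/spam-message.eml

# Remove a hash again, for example after a false positive
rspamc -f 1 fuzzy_del /path/to/spam-message.eml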

Monitor Fuzzy Hash Effectiveness

# Check fuzzy matches
journalctl -u rspamd | grep LOCAL_FUZZY | tail -n 20

# Check Valkey fuzzy database size
valkey-cli -s /run/valkey/valkey.sock --scan --pattern "fuzzy:*" | wc -l
# Shows count of stored fuzzy hashes

# View fuzzy statistics
rspamc stat | grep -i fuzzy

What to expect:

  • Fuzzy hashes accumulate over weeks/months
  • Effectiveness increases with more spam seen
  • Particularly useful for recurring spam campaigns

Bayes Classifier

The Bayes classifier uses statistical analysis to learn spam patterns specific to YOUR mail.

How Bayes Classification Works

Statistical learning:

  • Token extraction: Breaks messages into words/tokens
  • Probability calculation: Computes P(spam|token) for each token
  • Combines probabilities: Overall spam probability for message
  • Local learning: Learns patterns specific to your mail

Example tokens and learned probabilities:

Token: "viagra" → P(spam) = 0.95      (95% of messages with "viagra" were spam)
Token: "meeting" → P(spam) = 0.05     (5% of messages with "meeting" were spam)
Token: "invoice" → P(spam) = 0.30     (ambiguous - depends on context)

Message: "Urgent meeting about viagra invoice"
Bayes: Combines probabilities → Overall spam score
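
One classic way to combine per-token probabilities is the naive Bayes formula with equal priors: P = Πp / (Πp + Π(1-p)). The sketch below applies it to the three example tokens. It is a simplified illustration, and the combination Rspamd actually uses may differ; the function name is hypothetical.

```shell
#!/bin/sh
# Combine per-token spam probabilities (given as arguments) into one value.
combine_token_probs() {
  printf '%s\n' "$@" | awk '
    BEGIN { s = 1; h = 1 }
    { s *= $1; h *= (1 - $1) }        # evidence for spam and for ham
    END { printf "%.2f\n", s / (s + h) }'
}

# viagra = 0.95, meeting = 0.05, invoice = 0.30
combine_token_probs 0.95 0.05 0.30    # prints 0.30
```

The strong spam token and the strong ham token cancel each other out here, so the ambiguous token decides the result. This is why a single alarming word in otherwise normal mail does not automatically produce a high Bayes score.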

Why Bayes is powerful:

  • Learns YOUR specific mail patterns
  • Adapts to your correspondents
  • Recognizes legitimate newsletters vs spam
  • Gets smarter over time with training

Configure Bayes Classifier

Enable Bayes with Valkey backend:

cat > /etc/rspamd/local.d/classifier-bayes.conf << 'EOF'
# Bayes classifier configuration

# Backend storage (Valkey/Redis)
backend = "redis";
servers = "/run/valkey/valkey.sock";

# Token settings
tokenizer {
  name = "osb";           # Orthogonal Sparse Bigrams (modern algorithm)
}

# Learning settings
learn_condition = [[
  return function(task, is_spam, is_unlearn)
    -- Only learn from high-confidence messages
    local score = task:get_metric_score('default')[1]
    
    -- Learn spam if score > 12
    if is_spam and score > 12 then
      return true
    end
    
    -- Learn ham if score < -3
    if not is_spam and score < -3 then
      return true
    end
    
    return false
  end
]];

# Autolearn from high-confidence decisions
autolearn = true;

# Minimum learned messages per class before Bayes starts classifying
min_learns = 200;

# Token frequency minimum
min_token_hits = 2;

# Per-user learning (set to true for multi-tenant)
per_user = false;

# Cache settings
cache {
  backend = "redis";
  servers = "/run/valkey/valkey.sock";
  
  # Cache expiration
  expire = 86400;         # 1 day
}
EOF

Configuration explained:

  • backend = "redis": Store Bayes data in Valkey
  • tokenizer { name = "osb" }: Modern bigram tokenizer (better than simple word tokens)
  • learn_condition: Lua function to determine when to learn
    • Learn spam if score > 12 (high confidence)
    • Learn ham if score < -3 (high confidence)
    • Skip uncertain messages (avoid poisoning classifier)
  • autolearn = true: Learn automatically from high-confidence messages
  • min_learns = 200: Need 200+ samples before making predictions
  • per_user = false: Single classifier for entire server
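
To check how the thresholds in learn_condition treat a given score, the same decision can be mirrored in a few lines of shell. This is only a sketch of the threshold logic with a hypothetical function name; it does not call Rspamd, and the defaults of 12 and -3 are taken from the configuration above.

```shell
#!/bin/sh
# Print "spam", "ham", or "skip" for a message score.
# $1 = score, $2 = spam threshold (default 12), $3 = ham threshold (default -3)
autolearn_class() {
  awk -v score="$1" -v spam_min="${2:-12}" -v ham_max="${3:--3}" 'BEGIN {
    if (score + 0 > spam_min + 0)      print "spam"
    else if (score + 0 < ham_max + 0)  print "ham"
    else                               print "skip"
  }'
}
```

Scores between the two thresholds, including the thresholds themselves, print skip. That middle band is what keeps uncertain messages out of the classifier.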

Configure Bayes Symbols

Edit /etc/rspamd/local.d/bayes_group.conf:

cat > /etc/rspamd/local.d/bayes_group.conf << 'EOF'
# Bayes classifier symbol configuration

symbols = {
  "BAYES_HAM" {
    weight = -3.0;
    description = "Bayes classifier: Ham (not spam)";
  }
  
  "BAYES_SPAM" {
    weight = 5.0;
    description = "Bayes classifier: Spam";
  }
}
EOF

Test and Restart

# Test configuration
rspamadm configtest

# Restart Rspamd
systemctl restart rspamd

# Verify Bayes module loaded
journalctl -u rspamd --since "5 minutes ago" | grep -i bayes

Initial Bayes Training

Bayes needs initial training before making predictions. You can train it from existing mail folders:

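# Source configuration
source /root/mail-server-vars.sh

# Learn each message separately: piping a whole mailbox into a single rspamc
# call would treat everything in it as one message.
# tail -n +2 strips the "text:" header line that doveadm fetch prints.

# Train from existing INBOX (ham)
doveadm search -u "info@${DOMAIN}" mailbox INBOX ALL | while read -r guid uid; do
  doveadm fetch -u "info@${DOMAIN}" text mailbox-guid "$guid" uid "$uid" | tail -n +2 | rspamc learn_ham
done

# Train from existing Junk (spam)
doveadm search -u "info@${DOMAIN}" mailbox Junk ALL | while read -r guid uid; do
  doveadm fetch -u "info@${DOMAIN}" text mailbox-guid "$guid" uid "$uid" | tail -n +2 | rspamc learn_spam
done

Expected output (printed once per message learned):

success = true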

Check Bayes statistics:

# View Bayes learning stats
rspamc stat | grep -i bayes

# Check Valkey Bayes database
valkey-cli -s /run/valkey/valkey.sock --scan --pattern "bayes:*" | head -n 20
# Shows Bayes token keys

Monitor Bayes Effectiveness

# Recent Bayes classifications
journalctl -u rspamd | grep -E "BAYES_HAM|BAYES_SPAM" | tail -n 20

# Bayes statistics
rspamc stat | grep -i bayes

# Shows:
# Bayes learns: ham + spam count
# Tokens learned: total token database size

What to expect:

  • First 200 messages: No Bayes predictions (insufficient data)
  • After 200+ ham and 200+ spam: Bayes starts predicting
  • After 1000+ messages: Bayes becomes highly accurate
  • Continuous improvement with more training

Self-Learning Setup

The ultimate goal: Your mail server learns from USER ACTIONS automatically.

How Self-Learning Works

User teaches the system:

User receives email → Rspamd classifies it → Delivers to INBOX or Junk
        ↓
User reviews classification
        ↓
User moves message if needed:
  - Move from INBOX to Junk → "This is spam, learn it!"
  - Move from Junk to INBOX → "This is ham, learn it!"
        ↓
IMAPSieve detects folder change → Triggers learning script
        ↓
Script calls Rspamd: rspamc learn_spam or learn_ham
        ↓
Rspamd updates: Bayes, Neural, Fuzzy databases
        ↓
Future similar messages classified better ✅

Benefits:

  • Zero manual training needed
  • Users implicitly train the system by moving mail
  • System learns YOUR specific mail patterns
  • Continuous improvement over time

Step 1: Enable Sieve Protocols

Enable IMAPSieve in IMAP protocol:

cat > /etc/dovecot/conf.d/20-imap.conf << 'EOF'
###
### IMAP Protocol Settings
###
protocols {
  imap = yes
}
protocol imap {
  mail_plugins {
    quota = yes
    imap_quota = yes
    imap_sieve = yes    # Enable IMAPSieve for automatic learning
  }
  mail_max_userip_connections = 50
  imap_idle_notify_interval = 29 mins
}
EOF

Enable Sieve in LMTP protocol:

cat > /etc/dovecot/conf.d/20-lmtp.conf << 'EOF'
###
### LMTP Protocol Settings
###
protocols {
  lmtp = yes
}
protocol lmtp {
  mail_plugins {
    quota = yes
    sieve = yes    # Enable Sieve for mail delivery
  }
}
EOF

Step 2: Configure IMAPSieve (Dovecot 2.4 Syntax)

IMPORTANT: Dovecot 2.4 uses a block-based configuration structure. The IMAPSieve rules MUST be placed inside the protocol imap section.

cat > /etc/dovecot/conf.d/90-sieve.conf << 'EOF'
##
## Dovecot 2.4 Sieve Configuration with IMAPSieve
##

# Personal sieve script location
sieve_script personal {
  path = ~/sieve
  active_path = ~/.dovecot.sieve
}

# Global spam filter - runs BEFORE user scripts
sieve_script before {
  type = before
  path = /var/lib/dovecot/sieve/spam-global.sieve
}

# Maximum script size
sieve_max_script_size = 1M

# Maximum number of actions per script
sieve_max_actions = 32

##
## Sieve / IMAPSieve configuration for Dovecot 2.4
##

# Sieve plugins (block-based syntax)
sieve_plugins {
  sieve_imapsieve = yes
  sieve_extprograms = yes
}

# Allow external program execution
sieve_pipe_bin_dir = /usr/local/bin

# Enable required Sieve extensions (block-based syntax)
sieve_global_extensions {
  vnd.dovecot.pipe = yes
  vnd.dovecot.environment = yes
  imapsieve = yes
}

# IMAPSieve rules MUST be under protocol imap in Dovecot 2.4
protocol imap {
  # When message moved TO Junk → learn spam
  mailbox Junk {
    sieve_script spam {
      type = before
      cause = copy
      path = /var/lib/dovecot/sieve/global/report-spam.sieve
    }
  }
  
  # When message moved FROM Junk → learn ham
  imapsieve_from Junk {
    sieve_script ham {
      type = before
      cause = copy
      path = /var/lib/dovecot/sieve/global/report-ham.sieve
    }
  }
}
EOF

Configuration explained:

  • sieve_plugins { }: Block-based syntax for enabling plugins (Dovecot 2.4 style)
  • sieve_global_extensions { }: Block-based syntax for extensions (Dovecot 2.4 style)
  • protocol imap { }: IMAPSieve rules MUST be inside this block in Dovecot 2.4
  • mailbox Junk { }: Triggers when mail is copied/moved TO Junk folder
  • imapsieve_from Junk { }: Triggers when mail is copied/moved FROM Junk folder

Dovecot 2.4 Syntax Notes:

  • Uses block-based syntax: sieve_plugins { }, sieve_global_extensions { }
  • IMAPSieve rules must be inside protocol imap { } block
  • Old sieve_plugins = sieve_imapsieve sieve_extprograms style doesn’t work
  • No manual sievec compilation needed – Dovecot auto-compiles on first use

Step 3: Create Sieve Learning Scripts

# Create Sieve directory
mkdir -p /var/lib/dovecot/sieve/global

# Create spam learning Sieve script
cat > /var/lib/dovecot/sieve/global/report-spam.sieve << 'EOF'
require ["vnd.dovecot.pipe", "copy", "imapsieve", "environment", "variables"];

if environment :matches "imap.user" "*" {
  pipe :copy "rspamd-learn-spam.sh" [ "${1}" ];
}
EOF

# Create ham learning Sieve script
cat > /var/lib/dovecot/sieve/global/report-ham.sieve << 'EOF'
require ["vnd.dovecot.pipe", "copy", "imapsieve", "environment", "variables"];

if environment :matches "imap.user" "*" {
  pipe :copy "rspamd-learn-ham.sh" [ "${1}" ];
}
EOF

How the Sieve scripts work:

  • require ["vnd.dovecot.pipe", ...]: Load required Sieve extensions
  • environment :matches "imap.user" "*": Get the IMAP username
  • pipe :copy "script" [ "${1}" ]: Pipe the email to the external script

Step 4: Create Rspamd Learning Shell Scripts

# Create spam learning script
cat > /usr/local/bin/rspamd-learn-spam.sh << 'EOF'
#!/bin/sh
exec /usr/bin/rspamc -h localhost:11334 learn_spam
EOF

# Create ham learning script
cat > /usr/local/bin/rspamd-learn-ham.sh << 'EOF'
#!/bin/sh
exec /usr/bin/rspamc -h localhost:11334 learn_ham
EOF

# Make scripts executable
chmod 755 /usr/local/bin/rspamd-learn-*.sh
chown root:root /usr/local/bin/rspamd-learn-*.sh

Script explained:

  • Reads email from stdin (piped from Sieve)
  • Sends to Rspamd’s learning API on localhost:11334
  • exec replaces the shell process (efficient)
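
The two scripts differ only in the rspamc subcommand, so the shared logic can be sketched as one helper. This is an optional variant, not a replacement for the scripts above: the function name learn_message and the RSPAMC override variable are hypothetical, while the rspamc invocation is the same one the scripts use.

```shell
#!/bin/sh
# RSPAMC can be overridden, for example to test with a stub.
RSPAMC="${RSPAMC:-/usr/bin/rspamc}"

# Read one message from stdin and learn it as the given class.
# $1 = "spam" or "ham"; anything else is rejected with exit status 2.
learn_message() {
  case "$1" in
    spam|ham) ;;
    *) echo "usage: learn_message spam|ham" >&2; return 2 ;;
  esac
  "$RSPAMC" -h localhost:11334 "learn_$1"
}
```

Rejecting unknown classes up front means a typo in a calling script fails loudly instead of sending an invalid command to Rspamd.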

Step 5: Set Permissions

# Set directory ownership and permissions FIRST (important!)
chown root:vmail /var/lib/dovecot/sieve/global
chmod 770 /var/lib/dovecot/sieve/global

# Set ownership for Sieve scripts
chown root:vmail /var/lib/dovecot/sieve/global/*.sieve
chmod 640 /var/lib/dovecot/sieve/global/*.sieve

Why these permissions:

  • Dovecot runs as vmail user and needs to read the Sieve scripts
  • Directory needs 770 (not 750) for vmail to write compiled .svbin files
  • Scripts should be writable only by root (security)
  • Critical: Directory permissions must be set before file permissions

Common permission error: If you see “Permission denied” when moving mail, the directory likely needs 770 not 750 because Dovecot needs to create .svbin compiled files.
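
Since a wrong mode on the directory is the most common cause of failures, it can help to check modes explicitly rather than reading ls output. The helper below is a small sketch with a hypothetical name; it relies on GNU stat (stat -c), which is what Debian ships.

```shell
#!/bin/sh
# Compare the octal mode of a path with an expected value.
# Prints OK or MISMATCH; returns 0 on match, 1 on mismatch, 2 if stat fails.
check_mode() {
  actual=$(stat -c '%a' "$1") || return 2
  if [ "$actual" = "$2" ]; then
    echo "OK $1 ($actual)"
  else
    echo "MISMATCH $1: mode $actual, expected $2"
    return 1
  fi
}
```

Possible usage: check_mode /var/lib/dovecot/sieve/global 770, followed by check_mode on each .sieve file with 640.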

Step 6: Test and Apply Configuration

# Test Dovecot configuration
doveconf -n | grep -E "(sieve_plugins|sieve_global_extensions|mailbox Junk|imapsieve_from)"

# Should show the block-style settings from 90-sieve.conf, including:
# sieve_plugins {
# sieve_global_extensions {
# mailbox Junk {
# imapsieve_from Junk {

# Restart Dovecot
systemctl restart dovecot

# Check Dovecot status
systemctl status dovecot

Step 7: Verify Files

# Check Sieve scripts exist
ls -la /var/lib/dovecot/sieve/global/
# Should show: report-spam.sieve, report-ham.sieve

# Check learning scripts exist
ls -la /usr/local/bin/rspamd-learn-*.sh
# Should show: rspamd-learn-spam.sh, rspamd-learn-ham.sh

# Verify permissions
stat /var/lib/dovecot/sieve/global/report-spam.sieve
# Should show: Access: (0640/-rw-r-----) Uid: (    0/    root)   Gid: ( 5000/   vmail)

Step 8: Test Self-Learning

Self-learning trains Rspamd based on your corrections. Test that the pipeline works by moving mail between folders.

Step 1: Send a test email

# From another server, send a clean test email
swaks --to info@yourdomain.com \
      --from test@example.com \
      --server mail.yourdomain.com \
      --helo example.com \
      --body "Test email for self-learning - $(date)"

The email should arrive in your INBOX.

Step 2: Test spam learning (move TO Junk)

  1. Open your email client (Thunderbird, Roundcube, webmail, etc.)
  2. Find the test email in INBOX
  3. Move it to the Junk folder (drag and drop, or right-click → Move)

Step 3: Watch the server logs

On your mail server:

# Watch for self-learning activity
journalctl -u dovecot -u rspamd -f

Expected output (SUCCESS):

dovecot: imap: sieve: executed pipe action: rspamd-learn-spam.sh
rspamd: rspamd_controller_learn_fin_task: <127.0.0.1> learned message as spam: <message-id>

Step 4: Verify spam was learned

# Check recent spam learning
journalctl -u rspamd --since "5 minutes ago" | grep "learned message as spam"

# Should show:
# rspamd_controller_learn_fin_task: learned message as spam

Step 5: Test ham learning (move FROM Junk back to INBOX)

  1. In your email client, go to the Junk folder
  2. Move the test email back to INBOX
  3. Watch the logs again

Expected output (SUCCESS):

dovecot: imap: sieve: executed pipe action: rspamd-learn-ham.sh
rspamd: rspamd_controller_learn_fin_task: <127.0.0.1> learned message as ham: <message-id>

What this proves:

✅ IMAPSieve detects folder moves
✅ Sieve scripts execute learning commands
✅ Rspamd receives and processes learning requests
✅ Self-learning pipeline is working end-to-end

Common Issues:

Nothing happens when moving mail:

# 1. Check Sieve scripts compiled
ls -la /var/lib/dovecot/sieve/global/*.svbin
# Should show: report-spam.svbin and report-ham.svbin

# 2. Check directory permissions
ls -la /var/lib/dovecot/sieve/global/
# Should show: drwxrwx--- root vmail

# 3. Verify IMAPSieve configuration loaded
doveconf -n | grep -A 3 "mailbox Junk"

# 4. Check for Sieve errors
journalctl -u dovecot --since "10 minutes ago" | grep -i sieve

# 5. Restart Dovecot
systemctl restart dovecot

Scripts execute but learning fails:

# Check Rspamd is accepting learn commands
rspamc stat | grep -i bayes

# Test learning manually
echo "test" | rspamc learn_spam
# Should complete without errors

Understanding Bayes Training Requirements

IMPORTANT: You’ll see this message when you start learning:

bayes_classify: not classified as ham. The ham class needs more training samples. Currently: 0; minimum 200 required

This is completely normal! Bayes requires:

  • Minimum 200 spam messages before it can classify spam
  • Minimum 200 ham messages before it can classify ham
  • ✅ Both thresholds must be met for Bayes to activate

Training Progress:

Messages Learned → Bayes Status

1-199 spam → Training (not yet active)
200+ spam → Waiting for ham training
1-199 ham → Training (not yet active)
200+ ham → Waiting for spam training
200+ spam and 200+ ham → ✅ ACTIVE – Bayes now classifies!

What needs no manual training:

  • ✅ Neural networks (train automatically from high-confidence messages; predictions begin once enough samples have accumulated)
  • ✅ Fuzzy hashing (learns from high-confidence spam/ham)
  • ✅ DKIM/SPF/DMARC (external validation)
  • ✅ RBL checks (real-time blacklists)
  • ✅ Phishing detection (URL databases)

What needs training:

  • → Bayes classifier (200+ spam AND 200+ ham)

Monitor Learning Progress

Check training status:

# View current Bayes statistics
rspamc stat | grep -A 20 "Bayes"

# Expected output shows training progress:
# Statfile: BAYES_HAM type: redis; length: 50; total hits: 50; ...
# Statfile: BAYES_SPAM type: redis; length: 150; total hits: 150; ...
# (This shows 50 ham and 150 spam learned - need 200 of each)

Count learned messages:

# Count spam learning events
journalctl -u rspamd | grep "learned message as spam" | wc -l

# Count ham learning events
journalctl -u rspamd | grep "learned message as ham" | wc -l

# View recent learning activity
journalctl -u rspamd --since "24 hours ago" | grep "learned message as"
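
The two counts can be turned into a one-line status that follows the training table above. This is a sketch with a hypothetical function name; the minimum of 200 is the min_learns value from the Bayes configuration and can be overridden with a third argument.

```shell
#!/bin/sh
# Report Bayes training status from learned-message counts.
# $1 = spam learned, $2 = ham learned, $3 = minimum per class (default 200)
bayes_status() {
  spam="$1"; ham="$2"; min="${3:-200}"
  if [ "$spam" -ge "$min" ] && [ "$ham" -ge "$min" ]; then
    echo "active"
  elif [ "$spam" -ge "$min" ]; then
    echo "waiting for ham ($ham/$min)"
  elif [ "$ham" -ge "$min" ]; then
    echo "waiting for spam ($spam/$min)"
  else
    echo "training (spam $spam/$min, ham $ham/$min)"
  fi
}
```

Possible usage: bayes_status "$(journalctl -u rspamd | grep -c 'learned message as spam')" "$(journalctl -u rspamd | grep -c 'learned message as ham')". Counts taken from the journal are only an approximation, since older log entries may have been rotated away; rspamc stat remains the authoritative source.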

Check Valkey storage:

# Bayes data (tokens)
sudo -u _rspamd valkey-cli -s /run/valkey/valkey.sock --scan --pattern "bayes:*" | wc -l

# Neural network data
sudo -u _rspamd valkey-cli -s /run/valkey/valkey.sock --scan --pattern "rn:*" | wc -l

# Fuzzy hash data
sudo -u _rspamd valkey-cli -s /run/valkey/valkey.sock --scan --pattern "fuzzy:*" | wc -l

# View Valkey memory usage
sudo -u _rspamd valkey-cli -s /run/valkey/valkey.sock INFO memory | grep used_memory_human

Troubleshooting Self-Learning

Sieve scripts not executing:

# Check Dovecot can find the scripts
doveconf -n | grep sieve_script

# Check sievec auto-compilation
ls -la /var/lib/dovecot/sieve/global/*.svbin
# Should show compiled .svbin files (created automatically)

# Check for Sieve errors
journalctl -u dovecot | grep -i sieve | tail -20

Learning scripts not being called:

# Verify scripts are executable
ls -la /usr/local/bin/rspamd-learn-*.sh
# Should show: -rwxr-xr-x

# Test learning script manually
echo "Subject: Test" | /usr/local/bin/rspamd-learn-spam.sh
# Should complete without errors

# Check Rspamd is accepting learn commands
rspamc stat
# Should show: Statfile: BAYES_SPAM type: redis ...

Permissions errors:

# Check vmail user can read Sieve scripts
sudo -u vmail cat /var/lib/dovecot/sieve/global/report-spam.sieve
# Should output the script content

# Check directory permissions (must be 770 for .svbin compilation)
ls -la /var/lib/dovecot/sieve/ | grep global
# Should show: drwxrwx--- ... root vmail ... global

# Fix permissions if needed
chown root:vmail /var/lib/dovecot/sieve/global
chmod 770 /var/lib/dovecot/sieve/global
chown root:vmail /var/lib/dovecot/sieve/global/*.sieve
chmod 640 /var/lib/dovecot/sieve/global/*.sieve

What Gets Updated

When self-learning runs, Rspamd updates:

  1. Bayes classifier – Token statistics for spam/ham
  2. Neural networks – Weight adjustments for pattern recognition
  3. Fuzzy hashes – If configured with automatic learning

Check Valkey storage:

# Bayes data
sudo -u _rspamd valkey-cli -s /run/valkey/valkey.sock --scan --pattern "bayes:*" | wc -l

# Neural data
sudo -u _rspamd valkey-cli -s /run/valkey/valkey.sock --scan --pattern "rn:*" | wc -l

# Fuzzy data
sudo -u _rspamd valkey-cli -s /run/valkey/valkey.sock --scan --pattern "fuzzy:*" | wc -l

Important Notes

No manual sievec compilation needed – Dovecot 2.4 auto-compiles .sieve scripts to .svbin on first use
Both COPY and MOVE work – The cause = copy setting triggers on both operations
Works with all IMAP clients – Thunderbird, Outlook, mobile apps, webmail
Training is immediate – Rspamd updates happen as soon as mail is moved

Don’t use old Dovecot 2.3 syntax – The imapsieve_mailbox1_name style doesn’t work in 2.4
Don’t put settings in plugin { } blocks – Dovecot 2.4 uses top-level settings
Don’t manually compile Sieve scripts – Let Dovecot handle compilation

Summary and Next Steps

Continue with:
Part 6: Rspamd Web Interface & Roundcube – Web-based monitoring, management, and webmail

Complete Mail Server Journey

You’ve now built a professional, production-ready mail server:

Part 1: Prerequisites and preparation
Part 2: Core mail server (Postfix, Dovecot, PostfixAdmin)
Part 3: Intrusion prevention (CrowdSec)
Part 4: Professional spam filtering (Rspamd, DKIM, Hall of Fame)
Part 5: Advanced filtering (ClamAV, Neural Networks, Bayes, Learning)
Part 6: Web interfaces (coming soon)

Your mail server now rivals commercial solutions in capability and security!
