Technical deep-dive · 2026-05-25

SMART correlations: how 15 independent pipelines become one prioritised view

Isolated alerts are a ticket machine. The real threat lives in the combination. How monsys wires CVE scanning, honeypots, capacity and process DNA together via SMART correlations.

A monitoring platform that only throws alerts is a ticket machine. Every pipeline — CVE matching, honeypots, capacity, kernel tracker, integrity — produces its own signals, but the actual threat almost always lives in the combination.

This is how monsys connects those pipelines via SMART correlations.

The problem with isolated pipelines

Suppose you have three events on the same server within ten minutes:

  1. A Medium-severity CVE on an npm package (axios@0.27.2, GHSA-2025-xxxx)
  2. A disk capacity alert: 89% full, projected full in 4 hours
  3. A process DNA deviation: /usr/bin/node has an unknown hash

Individually: a Medium CVE is low priority. A full disk is operational. Process DNA is suspicious but might be an auto-update.

Together: a Node binary that has been silently replaced, combined with abnormal disk activity and a vulnerable dependency in the same process, is an active incident.

SMART correlations build those connections automatically.

Architecture: one signal_streams table

Every pipeline writes to the same hypertable via a uniform Go interface:

emitter.Emit(ctx, tenantID, signals.Signal{
    Source:      "process_dna",          // or "cve_match", "honeypot", "capacity", ...
    SubjectType: "agent",
    SubjectID:   agentID,
    Key:         "binary.hash_mismatch",
    Value:       map[string]any{
        "exe_path":      "/usr/bin/node",
        "baseline_hash": "a3f2c1...",
        "observed_hash": "b7e4d2...",
    },
    Severity:    signals.SeverityCritical,
    ObservedAt:  time.Now().UTC(),
})

Any worker that wants to add a new correlation type only has to update SourceToCategory and add one Emit() call. The rest of the infrastructure (dashboards, Trust Score, evidence packs, alerts) picks it up automatically.

The nine correlation workers

1. Blast-radius CVE prioritisation

Not every CVE is equally urgent. A Critical on a server that's connected to nothing is less acute than a Medium on a load balancer fronting 40 production services.

The CapacityPredictorWorker (1h cadence) runs a breadth-first search over the topology graph to compute the hop distance from every node to the nearest internet entry point:

GET /api/v1/topology/exposure?map_id=<id>

Response:
{
  "nodes": [
    { "agent_id": "web-edge-01", "internet_hops": 0, "exposure_score": 1.0 },
    { "agent_id": "api-server-03", "internet_hops": 1, "exposure_score": 0.7 },
    { "agent_id": "db-primary", "internet_hops": 2, "exposure_score": 0.4 }
  ]
}
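
The hop computation behind internet_hops can be sketched as a multi-source breadth-first search. This is an illustration only: the adjacency-map representation, the function name internetHops and the edge direction (from the entry point inward) are assumptions, and how hops map to exposure_score is not covered here.

```go
package main

import "fmt"

// internetHops runs a breadth-first search that starts from every entry point
// at once and returns the hop distance of each reachable node.
// Unreachable nodes are simply absent from the result.
func internetHops(adj map[string][]string, entryPoints []string) map[string]int {
	hops := make(map[string]int)
	queue := make([]string, 0, len(entryPoints))
	for _, e := range entryPoints {
		if _, seen := hops[e]; !seen {
			hops[e] = 0
			queue = append(queue, e)
		}
	}
	for len(queue) > 0 {
		node := queue[0]
		queue = queue[1:]
		for _, next := range adj[node] {
			// The first visit is the shortest path, because BFS expands level by level.
			if _, seen := hops[next]; !seen {
				hops[next] = hops[node] + 1
				queue = append(queue, next)
			}
		}
	}
	return hops
}

func main() {
	// Hypothetical topology matching the three nodes in the response above.
	adj := map[string][]string{
		"web-edge-01":   {"api-server-03"},
		"api-server-03": {"db-primary"},
	}
	hops := internetHops(adj, []string{"web-edge-01"})
	fmt.Println(hops["web-edge-01"], hops["api-server-03"], hops["db-primary"]) // 0 1 2
}
```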

CVE recommendations are re-ranked on cvss_base × exposure_score × epss_probability. A CVSS 7.5 on the database server with EPSS 0.02 scores lower than a CVSS 5.5 on the edge server with EPSS 0.34.
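
The arithmetic behind that comparison, as a small sketch. The Finding type and the rank helper are hypothetical names; the formula and the numbers are the ones given above, with exposure scores taken from the example response.

```go
package main

import (
	"fmt"
	"sort"
)

// Finding is a hypothetical container for the three ranking inputs.
type Finding struct {
	Agent    string
	CVSS     float64
	Exposure float64
	EPSS     float64
}

// Priority implements cvss_base × exposure_score × epss_probability.
func (f Finding) Priority() float64 { return f.CVSS * f.Exposure * f.EPSS }

// rank returns a copy of the findings sorted by descending priority.
func rank(fs []Finding) []Finding {
	out := append([]Finding(nil), fs...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Priority() > out[j].Priority()
	})
	return out
}

func main() {
	ranked := rank([]Finding{
		{Agent: "db-primary", CVSS: 7.5, Exposure: 0.4, EPSS: 0.02},
		{Agent: "web-edge-01", CVSS: 5.5, Exposure: 1.0, EPSS: 0.34},
	})
	for _, f := range ranked {
		fmt.Printf("%s %.2f\n", f.Agent, f.Priority())
	}
	// web-edge-01 1.87
	// db-primary 0.06
}
```

The edge server's Medium CVE outranks the database server's High by roughly a factor of 30, because exposure and exploit probability multiply rather than add.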

2. Capacity as CVE

A full disk sounds like an operational problem, not a security one. But a server with a 100% full disk can't write logs, can't create core dumps, and can't start security tools. That's a security condition.

The worker fits a linear regression in Postgres:

SELECT
    regr_slope(disk_used_pct, EXTRACT(EPOCH FROM observed_at)) AS slope,
    regr_intercept(disk_used_pct, EXTRACT(EPOCH FROM observed_at)) AS intercept,
    regr_r2(disk_used_pct, EXTRACT(EPOCH FROM observed_at)) AS r2
FROM agent_metrics
WHERE agent_id = $1
  AND mount_point = $2
  AND observed_at > NOW() - INTERVAL '30 days'

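From the slope and intercept, the worker projects when the disk will reach 100%. Because the regression runs over epoch seconds, that moment is the Unix timestamp t at which slope × t + intercept = 100. The projection is emitted as a capacity.disk_full finding and handled like a vulnerability with a deadline.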

3. Lateral-movement detection

This is the correlation that most often surprises people in demos:

-- LateralMovementWorker, 60s cadence
SELECT
    hp.agent_id      AS source_agent,
    hp.canary_path,
    ae.target_agent_id,
    ae.auth_user,
    ae.src_ip,
    ae.observed_at
FROM honeypot_events hp
JOIN auth_events ae
    ON ae.src_ip IN (
        SELECT ip_address FROM agent_ips WHERE agent_id = hp.agent_id
    )
    AND ae.observed_at BETWEEN hp.observed_at - INTERVAL '5 min'
                            AND hp.observed_at + INTERVAL '5 min'
    AND ae.success = true
    AND ae.auth_type = 'ssh'
WHERE hp.tenant_id = $1
  AND hp.observed_at > NOW() - INTERVAL '70 seconds'

A honeypot trigger on server A, combined with a successful SSH login from server A to server B within five minutes on either side of it, emits a lateral_movement_suspected event with MITRE tags T1021 (Remote Services) and T1078 (Valid Accounts). Both events are linked in the detection record, so the forensic trail is already there.

4. Compliance-erosion alerts

Compliance is not a static point — it erodes silently when workers crash, keys expire, or evidence generation fails.

-- ComplianceErosionWorker, 24h cadence
INSERT INTO compliance_evidence_history (tenant_id, control_id, evidence_count, snapshot_date)
SELECT tenant_id, control_id, COUNT(*), CURRENT_DATE
FROM compliance_evidence
GROUP BY tenant_id, control_id;

-- Alert on >50% loss vs 7 days ago
SELECT
    h1.control_id,
    h1.evidence_count AS current_count,
    h7.evidence_count AS week_ago_count,
    (1.0 - h1.evidence_count::float / NULLIF(h7.evidence_count, 0)) AS loss_rate
FROM compliance_evidence_history h1
JOIN compliance_evidence_history h7
    ON h1.tenant_id = h7.tenant_id      -- never compare counts across tenants
    AND h1.control_id = h7.control_id
    AND h7.snapshot_date = CURRENT_DATE - INTERVAL '7 days'
WHERE h1.snapshot_date = CURRENT_DATE
  AND (1.0 - h1.evidence_count::float / NULLIF(h7.evidence_count, 0)) > 0.5

This catches situations that don't generate alerts but are still a compliance problem: a backup worker that quietly stopped, a certificate scanner that's failing, a log pipeline that's gone empty.

5. Centrality-weighted Trust Score

The Trust Score (0-100) is a weighted average over 8 compliance categories. But not every agent weighs the same. A chokepoint server (load balancer, VPN gateway, database) that fails has more impact than a leaf node.

CentralityRefreshWorker (hourly):
→ Compute betweenness centrality (Brandes algorithm) over topology_edges
→ Persist to topology_node_centrality
→ Trust Score weights each agent with (1 + centrality)
   → max centrality ≈ 1.0 → chokepoint counts up to 2× as heavy

A server with centrality 0.8 that has a failing compliance control pushes the Trust Score down harder than the same failing control on a leaf node. SPOFs are automatically weighted more heavily.
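
A sketch of the (1 + centrality) weighting, reduced to a weighted average of per-agent scores. How per-agent scores are derived from the 8 compliance categories is not shown in this article, so the AgentScore type and the trustScore function are assumptions for illustration.

```go
package main

import "fmt"

// AgentScore is a hypothetical per-agent compliance score (0-100)
// together with the agent's betweenness centrality (0-1).
type AgentScore struct {
	Agent      string
	Score      float64
	Centrality float64
}

// trustScore averages the per-agent scores, weighting each by (1 + centrality).
func trustScore(agents []AgentScore) float64 {
	var sum, weights float64
	for _, a := range agents {
		w := 1 + a.Centrality
		sum += w * a.Score
		weights += w
	}
	if weights == 0 {
		return 0
	}
	return sum / weights
}

func main() {
	// Same failing score (40), once on the chokepoint and once on the leaf.
	chokepointFails := []AgentScore{
		{Agent: "lb", Score: 40, Centrality: 0.8},
		{Agent: "leaf", Score: 100, Centrality: 0},
	}
	leafFails := []AgentScore{
		{Agent: "lb", Score: 100, Centrality: 0.8},
		{Agent: "leaf", Score: 40, Centrality: 0},
	}
	fmt.Printf("%.1f %.1f\n", trustScore(chokepointFails), trustScore(leafFails)) // 61.4 78.6
}
```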

6. False-positive learning

Any detection rule that gets acknowledged within 5 minutes more than 50% of the time is probably set too aggressively.

GET /api/v1/detection/rules/flake-stats

Response:
[
  {
    "rule_id": "cpu_threshold_prod",
    "fires_30d": 84,
    "quick_ack_30d": 52,
    "flake_rate": 0.619,
    "opinion": "Raise threshold from 85% to 92%, or add a duration filter: CPU>85% for >15min"
  }
]

This is operator-driven: the suggestion is shown, never applied automatically. Auditors want repeatable, human-approved rules — not a system that adjusts itself.
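
The flake rate in that response is simply quick acknowledgements divided by fires. A sketch, with the RuleStats type and the NeedsReview name invented for illustration and the 50% threshold taken from the rule stated above:

```go
package main

import "fmt"

// RuleStats holds the 30-day counters from the flake-stats response.
type RuleStats struct {
	RuleID      string
	Fires30d    int
	QuickAck30d int
}

// FlakeRate is the share of fires acknowledged within 5 minutes.
func (r RuleStats) FlakeRate() float64 {
	if r.Fires30d == 0 {
		return 0
	}
	return float64(r.QuickAck30d) / float64(r.Fires30d)
}

// NeedsReview flags rules that are quick-acked more than 50% of the time.
// It only produces a suggestion; nothing is changed automatically.
func (r RuleStats) NeedsReview() bool { return r.FlakeRate() > 0.5 }

func main() {
	r := RuleStats{RuleID: "cpu_threshold_prod", Fires30d: 84, QuickAck30d: 52}
	fmt.Printf("%.3f %v\n", r.FlakeRate(), r.NeedsReview()) // 0.619 true
}
```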

7. Time-machine diff

Forensic investigation almost always starts with: "what changed between Monday and Friday?"

GET /api/v1/time-machine/diff?agent_id=web-edge-01&from=2026-05-19T00:00:00Z&to=2026-05-23T23:59:59Z

Response:
{
  "packages": {
    "added":   ["libssl3 3.0.15-1"],
    "removed": [],
    "upgraded": [
      { "name": "openssl", "from": "3.0.13-1", "to": "3.0.15-1" }
    ]
  },
  "services": {
    "added":   ["monsys-agent"],
    "removed": ["rsync"]
  },
  "kernel":   { "from": "6.8.0-51", "to": "6.8.0-55" },
  "open_ports": {
    "added":   [{ "port": 9100, "process": "node_exporter" }],
    "removed": []
  }
}

One API call gives you a forensic timeline of every inventory change. No manual comparison of log files.
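
Each added/removed pair in that response is a set difference between two inventory snapshots. A minimal sketch, assuming each snapshot is a list of unique names; diffSets and the sshd entry are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// diffSets returns the items present only in after (added) and the items
// present only in before (removed), each sorted for stable output.
func diffSets(before, after []string) (added, removed []string) {
	inBefore := make(map[string]bool, len(before))
	for _, b := range before {
		inBefore[b] = true
	}
	inAfter := make(map[string]bool, len(after))
	for _, a := range after {
		inAfter[a] = true
		if !inBefore[a] {
			added = append(added, a)
		}
	}
	for _, b := range before {
		if !inAfter[b] {
			removed = append(removed, b)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

func main() {
	monday := []string{"rsync", "sshd"}
	friday := []string{"sshd", "monsys-agent"}
	added, removed := diffSets(monday, friday)
	fmt.Println(added, removed) // [monsys-agent] [rsync]
}
```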

How correlations affect the Trust Score

The Trust Score (0-100) reflects the combined output of every pipeline. A few examples of how correlations move the score:

Event                                                   Score impact
Lateral movement detected (unacknowledged)              -15 to -25 points (depending on involved node centrality)
Compliance erosion >50% on ISO 27001 A.8.7              -8 points per failing control
CVE Critical on an internet-facing node (EPSS > 0.1)    -12 points
Process DNA mismatch unresolved > 24h                   -10 points
All honeypot canaries intact                            +5 points (positive signal)

The score is reproducible via the inputs_hash in the evidence pack — an auditor can recompute why the score was X on a specific day.

Operational implications

The correlation workers add some load to the hub database. Operational notes:

What this means in practice

Without correlations: three separate alerts, each triaged separately by an analyst, conclusion uncertain.

With correlations: one detection event lateral_movement_suspected with the full forensic context — which honeypot was triggered, which SSH login followed, on which servers, with which user. The analyst doesn't need to correlate — the system did it.

That's the difference between a ticket machine and a monitoring platform.


SMART correlations are documented at docs.monsys.ai/en/security/smart-correlations. First five servers free: monsys.ai/en/signup.
