SMART correlations: how 15 independent pipelines become one prioritised view
Isolated alerts are a ticket machine. The real threat lives in the combination. How monsys wires CVE scanning, honeypots, capacity and process DNA together via SMART correlations.
A monitoring platform that only throws alerts is a ticket machine. Every pipeline — CVE matching, honeypots, capacity, kernel tracker, integrity — produces its own signals, but the actual threat almost always lives in the combination.
This is how monsys connects those pipelines via SMART correlations.
The problem with isolated pipelines
Suppose you have three events on the same server within ten minutes:
- A Medium-severity CVE on an npm package (axios@0.27.2, GHSA-2025-xxxx)
- A disk capacity alert: 89% full, projected full in 4 hours
- A process DNA deviation: /usr/bin/node has an unknown hash
Individually: a Medium CVE is low priority. A full disk is operational. Process DNA is suspicious but might be an auto-update.
Together: a Node binary that has been silently replaced, combined with abnormal disk activity and a vulnerable dependency in the same process, is an active incident.
SMART correlations build those connections automatically.
Architecture: one signal_streams table
Every pipeline writes to the same hypertable via a uniform Go interface:
emitter.Emit(ctx, tenantID, signals.Signal{
	Source:      "process_dna", // or "cve_match", "honeypot", "capacity", ...
	SubjectType: "agent",
	SubjectID:   agentID,
	Key:         "binary.hash_mismatch",
	Value: map[string]any{
		"exe_path":      "/usr/bin/node",
		"baseline_hash": "a3f2c1...",
		"observed_hash": "b7e4d2...",
	},
	Severity:   signals.SeverityCritical,
	ObservedAt: time.Now().UTC(),
})
Any worker that wants to add a new correlation type only has to update SourceToCategory and add one Emit() call. The rest of the infrastructure (dashboards, Trust Score, evidence packs, alerts) picks it up automatically.
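To make the "update SourceToCategory and add one Emit() call" step concrete, here is a minimal sketch of how such a registry could gate incoming signals. Everything in it is an assumption for illustration: the trimmed-down Signal struct, the category names, the Categorize helper and the "dns_tunnel" source are hypothetical and not taken from the monsys codebase.

```go
package main

import (
	"errors"
	"fmt"
)

// Signal carries only the fields this sketch needs (hypothetical subset).
type Signal struct {
	Source string
	Key    string
}

// SourceToCategory maps a pipeline source to a category.
// The category names are illustrative, not the real monsys values.
var SourceToCategory = map[string]string{
	"process_dna": "integrity",
	"cve_match":   "vulnerability",
	"honeypot":    "intrusion",
	"capacity":    "capacity",
}

// ErrUnknownSource is returned for sources missing from the registry.
var ErrUnknownSource = errors.New("unknown signal source")

// Categorize looks up the category for a signal and rejects
// sources that have not been registered.
func Categorize(s Signal) (string, error) {
	cat, ok := SourceToCategory[s.Source]
	if !ok {
		return "", fmt.Errorf("%w: %q", ErrUnknownSource, s.Source)
	}
	return cat, nil
}

func main() {
	for _, src := range []string{"process_dna", "dns_tunnel"} {
		cat, err := Categorize(Signal{Source: src, Key: "example.key"})
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", src, cat)
	}
}
```

Under this design a new pipeline becomes visible downstream once its source has a registry entry, which is one plausible reading of why a single map update is enough.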
Seven of the nine correlation workers
1. Blast-radius CVE prioritisation
Not every CVE is equally urgent. A Critical on a server that's connected to nothing is less acute than a Medium on a load balancer fronting 40 production services.
The CapacityPredictorWorker (1h cadence) does a Breadth-First Search over the topology graph to compute the hop distance from every node to the nearest internet entry point:
GET /api/v1/topology/exposure?map_id=<id>
Response:
{
"nodes": [
{ "agent_id": "web-edge-01", "internet_hops": 0, "exposure_score": 1.0 },
{ "agent_id": "api-server-03", "internet_hops": 1, "exposure_score": 0.7 },
{ "agent_id": "db-primary", "internet_hops": 2, "exposure_score": 0.4 }
]
}
CVE recommendations are re-ranked by cvss_base × exposure_score × epss_probability. With the exposure scores above, a CVSS 7.5 on db-primary with EPSS 0.02 scores 7.5 × 0.4 × 0.02 = 0.06, far below a CVSS 5.5 on web-edge-01 with EPSS 0.34, which scores 5.5 × 1.0 × 0.34 = 1.87.
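The hop computation and the re-ranking can be sketched in a few lines of Go. This is an illustration under assumptions, not the worker's actual code: the adjacency-list shape, the function names, and in particular the linear hop-to-score mapping (1.0 minus 0.3 per hop, floored at 0.1) are hypothetical. The mapping was chosen only because it reproduces the three sample values 1.0, 0.7 and 0.4; the real formula is not stated.

```go
package main

import "fmt"

// internetHops runs a multi-source breadth-first search from the internet
// entry points and returns the hop distance of every reachable node.
func internetHops(edges map[string][]string, entryPoints []string) map[string]int {
	hops := make(map[string]int)
	queue := make([]string, 0, len(entryPoints))
	for _, e := range entryPoints {
		if _, seen := hops[e]; !seen {
			hops[e] = 0
			queue = append(queue, e)
		}
	}
	for len(queue) > 0 {
		node := queue[0]
		queue = queue[1:]
		for _, next := range edges[node] {
			if _, seen := hops[next]; !seen {
				hops[next] = hops[node] + 1
				queue = append(queue, next)
			}
		}
	}
	return hops
}

// exposureScore is an ASSUMED mapping: linear decay of 0.3 per hop,
// never dropping below 0.1.
func exposureScore(hops int) float64 {
	score := 1.0 - 0.3*float64(hops)
	if score < 0.1 {
		return 0.1
	}
	return score
}

// priority is the ranking product from the text:
// cvss_base × exposure_score × epss_probability.
func priority(cvss, exposure, epss float64) float64 {
	return cvss * exposure * epss
}

func main() {
	edges := map[string][]string{
		"web-edge-01":   {"api-server-03"},
		"api-server-03": {"db-primary"},
	}
	hops := internetHops(edges, []string{"web-edge-01"})
	edge := priority(5.5, exposureScore(hops["web-edge-01"]), 0.34)
	db := priority(7.5, exposureScore(hops["db-primary"]), 0.02)
	fmt.Printf("edge=%.2f db=%.2f\n", edge, db)
}
```

Nodes that BFS never reaches are absent from the returned map; a real implementation would need to decide how to score those.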
2. Capacity as CVE
A full disk sounds like an operational problem, not a security one. But a server with a 100% full disk can't write logs, can't create core dumps, and can't start security tools. That's a security condition.
The worker fits a linear regression in Postgres:
SELECT
regr_slope(disk_used_pct, EXTRACT(EPOCH FROM observed_at)) AS slope,
regr_intercept(disk_used_pct, EXTRACT(EPOCH FROM observed_at)) AS intercept,
regr_r2(disk_used_pct, EXTRACT(EPOCH FROM observed_at)) AS r2
FROM agent_metrics
WHERE agent_id = $1
AND mount_point = $2
AND observed_at > NOW() - INTERVAL '30 days'
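From slope and intercept the worker projects when the disk will reach 100%: solving slope × t + intercept = 100 gives t = (100 − intercept) / slope, in epoch seconds. The projection is emitted as a capacity.disk_full finding and handled like a vulnerability with a deadline.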
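As a worked example of the projection step, the sketch below turns the regression output into a timestamp. The function name and the sample numbers are hypothetical. The only logic it relies on is the fitted line used_pct = slope × t + intercept, with t in Unix seconds as in the query above.

```go
package main

import (
	"fmt"
	"time"
)

// projectDiskFull solves slope*t + intercept = 100 for t (Unix seconds).
// ok is false when slope <= 0: usage is flat or shrinking, so there is
// no projected fill date.
func projectDiskFull(slope, intercept float64) (fullAt float64, ok bool) {
	if slope <= 0 {
		return 0, false
	}
	return (100 - intercept) / slope, true
}

func main() {
	// Hypothetical fit: 90% used at t0, growing one percentage point per hour.
	t0 := 1_700_000_000.0
	slope := 1.0 / 3600.0
	intercept := 90 - slope*t0

	if fullAt, ok := projectDiskFull(slope, intercept); ok {
		fmt.Println("projected full at:", time.Unix(int64(fullAt), 0).UTC())
	}
}
```

A production version would probably also check regr_r2 before trusting the projection, since a poor linear fit makes the date meaningless. That threshold is not specified here.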
3. Lateral-movement detection
This is the correlation that most often surprises people in demos:
-- LateralMovementWorker, 60s cadence
SELECT
hp.agent_id AS source_agent,
hp.canary_path,
ae.target_agent_id,
ae.auth_user,
ae.src_ip,
ae.observed_at
FROM honeypot_events hp
JOIN auth_events ae
  ON ae.tenant_id = hp.tenant_id  -- stay within the tenant; matches the (tenant_id, src_ip, observed_at) index
 AND ae.src_ip IN (
       SELECT ip_address FROM agent_ips WHERE agent_id = hp.agent_id
     )
 AND ae.observed_at BETWEEN hp.observed_at - INTERVAL '5 min'
                        AND hp.observed_at + INTERVAL '5 min'
 AND ae.success = true
 AND ae.auth_type = 'ssh'
WHERE hp.tenant_id = $1
AND hp.observed_at > NOW() - INTERVAL '70 seconds'
A honeypot trigger on server A and a successful SSH login from server A to server B within five minutes of each other emit a lateral_movement_suspected event with MITRE tags T1021 (Remote Services) and T1078 (Valid Accounts). Both events are linked in the detection record, so the forensic trail is already in place.
4. Compliance-erosion alerts
Compliance is not a static point — it erodes silently when workers crash, keys expire, or evidence generation fails.
-- ComplianceErosionWorker, 24h cadence
INSERT INTO compliance_evidence_history (tenant_id, control_id, evidence_count, snapshot_date)
SELECT tenant_id, control_id, COUNT(*), CURRENT_DATE
FROM compliance_evidence
GROUP BY tenant_id, control_id;
-- Alert on >50% loss vs 7 days ago
SELECT
  h1.tenant_id,
  h1.control_id,
  h1.evidence_count AS current_count,
  h7.evidence_count AS week_ago_count,
  (1.0 - h1.evidence_count::float / NULLIF(h7.evidence_count, 0)) AS loss_rate
FROM compliance_evidence_history h1
JOIN compliance_evidence_history h7
  ON h1.tenant_id = h7.tenant_id  -- never compare snapshots across tenants
 AND h1.control_id = h7.control_id
 AND h7.snapshot_date = CURRENT_DATE - INTERVAL '7 days'
WHERE h1.snapshot_date = CURRENT_DATE
  AND (1.0 - h1.evidence_count::float / NULLIF(h7.evidence_count, 0)) > 0.5;
This catches situations that don't generate alerts but are still a compliance problem: a backup worker that quietly stopped, a certificate scanner that's failing, a log pipeline that's gone empty.
5. Centrality-weighted Trust Score
The Trust Score (0-100) is a weighted average over 8 compliance categories. But not every agent weighs the same. A chokepoint server (load balancer, VPN gateway, database) that fails has more impact than a leaf node.
CentralityRefreshWorker (hourly):
→ Compute betweenness centrality (Brandes algorithm) over topology_edges
→ Persist to topology_node_centrality
→ Trust Score weights each agent with (1 + centrality)
→ max centrality ≈ 1.0 → chokepoint counts up to 2× as heavy
A server with centrality 0.8 that has a failing compliance control pushes the Trust Score down harder than the same failing control on a leaf node. SPOFs are automatically weighted more heavily.
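The effect of the (1 + centrality) weight is easiest to see with two agents. The sketch below assumes each agent already has a 0-100 score and simply takes the weighted mean. The struct, the function name and the per-agent simplification are hypothetical, since the real Trust Score also aggregates over 8 compliance categories.

```go
package main

import "fmt"

// agentScore is a hypothetical per-agent input: a 0-100 score plus the
// betweenness centrality (0..1) taken from topology_node_centrality.
type agentScore struct {
	AgentID    string
	Score      float64
	Centrality float64
}

// weightedTrustScore averages agent scores with weight (1 + centrality),
// so a chokepoint with centrality close to 1.0 counts up to twice as much.
func weightedTrustScore(agents []agentScore) float64 {
	var sum, weights float64
	for _, a := range agents {
		w := 1 + a.Centrality
		sum += w * a.Score
		weights += w
	}
	if weights == 0 {
		return 0
	}
	return sum / weights
}

func main() {
	// Same failing control (score 50), once on the leaf, once on the chokepoint.
	failingLeaf := []agentScore{
		{AgentID: "leaf", Score: 50, Centrality: 0},
		{AgentID: "lb", Score: 100, Centrality: 0.8},
	}
	failingChokepoint := []agentScore{
		{AgentID: "leaf", Score: 100, Centrality: 0},
		{AgentID: "lb", Score: 50, Centrality: 0.8},
	}
	fmt.Printf("leaf fails: %.1f, chokepoint fails: %.1f\n",
		weightedTrustScore(failingLeaf), weightedTrustScore(failingChokepoint))
}
```

With these inputs the failing leaf yields (50 × 1 + 100 × 1.8) / 2.8 and the failing chokepoint yields (100 × 1 + 50 × 1.8) / 2.8, so the second score is lower.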
6. False-positive learning
Any detection rule that gets acknowledged within 5 minutes more than 50% of the time is probably set too aggressively.
GET /api/v1/detection/rules/flake-stats
Response:
[
{
"rule_id": "cpu_threshold_prod",
"fires_30d": 84,
"quick_ack_30d": 52,
"flake_rate": 0.619,
"opinion": "Raise threshold from 85% to 92%, or add a duration filter: CPU>85% for >15min"
}
]
This is operator-driven: the suggestion is shown, never applied automatically. Auditors want repeatable, human-approved rules — not a system that adjusts itself.
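The flake_rate in the response is simply quick acknowledgements divided by fires. A minimal sketch of that check, with hypothetical type and function names and the 50% threshold taken from the rule of thumb above:

```go
package main

import "fmt"

// ruleStats mirrors the counters in the flake-stats response (names assumed).
type ruleStats struct {
	RuleID      string
	Fires30d    int
	QuickAck30d int // acknowledged within 5 minutes
}

// flakeRate is the share of fires that were acknowledged quickly.
func flakeRate(s ruleStats) float64 {
	if s.Fires30d == 0 {
		return 0
	}
	return float64(s.QuickAck30d) / float64(s.Fires30d)
}

// tooAggressive flags rules whose quick-ack share exceeds 50%.
// It only produces a suggestion; nothing is changed automatically.
func tooAggressive(s ruleStats) bool {
	return flakeRate(s) > 0.5
}

func main() {
	s := ruleStats{RuleID: "cpu_threshold_prod", Fires30d: 84, QuickAck30d: 52}
	fmt.Printf("%s flake_rate=%.3f flagged=%v\n", s.RuleID, flakeRate(s), tooAggressive(s))
}
```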
7. Time-machine diff
Forensic investigation almost always starts with: "what changed between Monday and Friday?"
GET /api/v1/time-machine/diff?agent_id=web-edge-01&from=2026-05-19T00:00:00Z&to=2026-05-23T23:59:59Z
Response:
{
"packages": {
"added": ["libssl3 3.0.15-1"],
"removed": [],
"upgraded": [
{ "name": "openssl", "from": "3.0.13-1", "to": "3.0.15-1" }
]
},
"services": {
"added": ["monsys-agent"],
"removed": ["rsync"]
},
"kernel": { "from": "6.8.0-51", "to": "6.8.0-55" },
"open_ports": {
"added": [{ "port": 9100, "process": "node_exporter" }],
"removed": []
}
}
One API call gives you a forensic timeline of every inventory change. No manual comparison of log files.
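How the "packages" part of such a diff could be computed from two inventory snapshots is sketched below. The snapshot shape (package name mapped to version) and all names are assumptions for illustration; the actual storage format behind the endpoint is not described here.

```go
package main

import (
	"fmt"
	"sort"
)

// upgrade records a version change of a package present in both snapshots.
type upgrade struct {
	Name, From, To string
}

// packageDiff mirrors the "packages" object of the response above.
type packageDiff struct {
	Added    []string
	Removed  []string
	Upgraded []upgrade
}

// diffPackages compares two name->version snapshots.
func diffPackages(before, after map[string]string) packageDiff {
	var d packageDiff
	for name, to := range after {
		from, existed := before[name]
		switch {
		case !existed:
			d.Added = append(d.Added, name+" "+to)
		case from != to:
			d.Upgraded = append(d.Upgraded, upgrade{Name: name, From: from, To: to})
		}
	}
	for name, from := range before {
		if _, still := after[name]; !still {
			d.Removed = append(d.Removed, name+" "+from)
		}
	}
	// Map iteration order is random in Go; sort for stable output.
	sort.Strings(d.Added)
	sort.Strings(d.Removed)
	sort.Slice(d.Upgraded, func(i, j int) bool { return d.Upgraded[i].Name < d.Upgraded[j].Name })
	return d
}

func main() {
	before := map[string]string{"openssl": "3.0.13-1"}
	after := map[string]string{"openssl": "3.0.15-1", "libssl3": "3.0.15-1"}
	fmt.Printf("%+v\n", diffPackages(before, after))
}
```

The sketch treats any version change as an upgrade; telling upgrades from downgrades would need a distribution-aware version comparison.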
How correlations affect the Trust Score
The Trust Score (0-100) reflects the combined output of every pipeline. A few examples of how correlations move the score:
| Event | Score impact |
|---|---|
| Lateral movement detected (unacknowledged) | -15 to -25 points (depending on involved node centrality) |
| Compliance erosion >50% on ISO 27001 A.8.7 | -8 points per failing control |
| CVE Critical on an internet-facing node (EPSS > 0.1) | -12 points |
| Process DNA mismatch unresolved > 24h | -10 points |
| All honeypot canaries intact | +5 points (positive signal) |
The score is reproducible via the inputs_hash in the evidence pack — an auditor can recompute why the score was X on a specific day.
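One way such an inputs_hash could work is to hash a canonical serialisation of the score inputs, so the same inputs always yield the same digest. The sketch below is an assumption about the approach, not the documented monsys format: the input shape, the key names and the choice of SHA-256 over JSON are all hypothetical. It relies on the fact that Go's encoding/json sorts map keys when marshalling.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// inputsHash returns a hex SHA-256 digest over the JSON encoding of the
// inputs. encoding/json emits map keys in sorted order, so the digest does
// not depend on the order in which the map was filled.
func inputsHash(inputs map[string]float64) (string, error) {
	canonical, err := json.Marshal(inputs)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(canonical)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	// Hypothetical input names, for illustration only.
	inputs := map[string]float64{
		"lateral_movement_open": 1,
		"critical_cve_exposed":  0,
		"canaries_intact":       1,
	}
	h, err := inputsHash(inputs)
	if err != nil {
		fmt.Println("hash failed:", err)
		return
	}
	fmt.Println("inputs_hash:", h)
}
```

An auditor holding the same inputs can recompute the digest and compare it with the one stored in the evidence pack; any changed input produces a different hash.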
Operational implications
The correlation workers add some load to the hub database. Operational notes:
- LateralMovementWorker (60s) is the heaviest: a JOIN across two tables with a time window. Optimised via a composite index on (tenant_id, src_ip, observed_at) on auth_events.
- Every worker is idempotent: it can re-run without generating duplicate events.
- A cold-start delay of 2-5 minutes after hub restart prevents heavy queries from firing while the database is still warming up.
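Idempotency matters most for a worker whose query window overlaps its cadence, such as a 60-second worker looking back 70 seconds. One common way to get it is a deterministic key per detection, so a re-run produces the same key and the insert becomes a no-op. The sketch below illustrates that idea only. The key fields, the in-memory store and all names are hypothetical; in a database the same effect would typically come from a unique constraint on the key.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// detectionKey derives a stable identifier from the fields that define
// one lateral-movement detection (field choice is an assumption).
func detectionKey(tenantID, sourceAgent, targetAgent, authUser string, authUnix int64) string {
	raw := strings.Join([]string{
		tenantID, sourceAgent, targetAgent, authUser, strconv.FormatInt(authUnix, 10),
	}, "|")
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// eventStore is an in-memory stand-in for a table with a unique key column.
type eventStore struct {
	seen map[string]bool
}

func newEventStore() *eventStore {
	return &eventStore{seen: make(map[string]bool)}
}

// insertOnce stores the key and reports whether it was new.
func (s *eventStore) insertOnce(key string) bool {
	if s.seen[key] {
		return false
	}
	s.seen[key] = true
	return true
}

func main() {
	store := newEventStore()
	key := detectionKey("tenant-1", "server-a", "server-b", "deploy", 1_700_000_000)
	fmt.Println("first run inserted:", store.insertOnce(key))
	fmt.Println("re-run inserted:", store.insertOnce(key))
}
```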
What this means in practice
Without correlations: three separate alerts, each triaged separately by an analyst, conclusion uncertain.
With correlations: one detection event lateral_movement_suspected with the full forensic context — which honeypot was triggered, which SSH login followed, on which servers, with which user. The analyst doesn't need to correlate — the system did it.
That's the difference between a ticket machine and a monitoring platform.
SMART correlations are documented at docs.monsys.ai/en/security/smart-correlations. First five servers free: monsys.ai/en/signup.