Network health and intrusion watch
Watch fleet latency and suspicious traffic patterns, enrich real anomalies with context, and page the right team only when severity warrants it.
[ workflow / network ]
Cosmos polls fleet telemetry for latency drift, jitter spikes, and traffic that looks like scanning, brute force, geo anomalies, or lateral movement. It enriches suspected anomalies with logs, firewall rules, and recent deploys, filters known maintenance events, and opens incidents only for real issues. High-severity cases page on-call; the rest notify chat.
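The "latency drift" and "jitter spike" checks can be pictured as a comparison against a rolling baseline, which is how the workflow prompt further down phrases it. The following is a minimal sketch under stated assumptions, not how Cosmos itself implements the check: the class name `RollingBaseline`, the z-score rule, and the default window, threshold, and warm-up size are all hypothetical choices made for illustration.

```python
from collections import deque
from statistics import mean, stdev


class RollingBaseline:
    """Flags samples that sit too far from a rolling window of recent values.

    Hypothetical helper: the z-score rule and the defaults are assumptions.
    """

    def __init__(self, window: int = 60, threshold: float = 3.0, min_samples: int = 10):
        self.samples = deque(maxlen=window)   # most recent accepted samples
        self.threshold = threshold            # allowed distance, in standard deviations
        self.min_samples = max(min_samples, 2)  # stdev needs at least two points

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to the current baseline."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            # Floor sigma so a perfectly flat baseline does not divide by zero.
            sigma = max(stdev(self.samples), 1e-9)
            anomalous = abs(value - mu) / sigma > self.threshold
        if not anomalous:
            # Only accepted samples feed the baseline, so a spike does not
            # immediately widen what counts as normal.
            self.samples.append(value)
        return anomalous


# Usage: per-link latency in milliseconds (made-up values).
baseline = RollingBaseline(window=30, threshold=3.0, min_samples=10)
warmup_flags = [baseline.observe(10.0 + (i % 2)) for i in range(20)]
spike_flag = baseline.observe(50.0)
normal_flag = baseline.observe(10.5)
```

Keeping flagged samples out of the window is a trade-off: a short spike cannot pollute the baseline, but a genuine, lasting shift in latency would keep firing until someone resets or re-seeds the window.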
15 nodes
11 edges
Continuous + IDS events
Latency, traffic, auth, scans
Intrusion + latency drift
Decision
Anything unusual?
Beyond rolling baseline
Update telemetry log
Intrusion / latency / traffic
Logs, firewall, deploys
Maintenance, drills, pentests
Decision
Explained by known event?
Window or planned drill
Log and end
Impact + confidence
PagerDuty / Linear / Jira
Decision
High severity?
Page-worthy threshold
Security or network on-call
Slack / Teams ping
Workflow prompt
Paste this into Augment to reproduce the workflow end-to-end.
Build a Cosmos workflow that watches network health and surfaces intrusion or instability.

Trigger: continuous: every metric tick from our network telemetry pipeline, plus any high-priority security event from the IDS / firewall.

Steps:
1. Collect signals across the fleet: server-to-server latency and jitter, packet loss, traffic volume per link, authentication events, port-scan activity, geo of inbound connections, lateral movement between subnets.
2. Run an anomaly pass. Flag anything that drifts from the rolling baseline: latency or jitter spikes between servers, traffic patterns that look like reconnaissance, brute-force auth, sudden cross-zone movement, or unexpected outbound destinations.
3. Decision: "Anything unusual?".
   - If no, record a healthy baseline snapshot and end.
   - If yes, continue.
4. Classify the anomaly into a bucket: latency degradation, intrusion suspicion (which kind), capacity / traffic anomaly, or unknown.
5. Pull related context: application and system logs from the affected hosts for the same window, recent firewall rule changes, recent deploys, ongoing maintenance windows, and the relevant runbook if one exists.
6. Run a safety filter that suppresses known-benign events: planned maintenance windows, scheduled chaos / failover drills, and approved penetration tests.
7. Decision: "Explained by a known event?".
   - If yes, mark the anomaly as expected, log it on the timeline and end.
   - If no, continue.
8. Assess the severity. Combine impact (how many services / users affected) and confidence (how sure we are this is real and malicious or destabilising). Output a level: low, medium, high, critical.
9. Open an incident in the on-call system (PagerDuty, Linear, Jira) with the anomaly summary, the correlated logs, the diagnosed bucket, and the severity.
10. Decision: "High severity?". Treat high and critical as page-worthy.
   - If yes, page the right responder: the security on-call for intrusion-class anomalies, the network on-call for latency / capacity anomalies.
   - If no, notify the relevant team channel (Slack / Teams) with a link to the incident and don't page anyone.

Constraints:
- Always attach the correlated log excerpts and the firewall / deploy diff to the incident; the responder should not have to dig.
- Never page on a known-benign event (maintenance, drill, approved pentest).
- Keep the healthy-baseline log append-only so we can build trend dashboards (anomaly rate, false-positive rate, MTTA / MTTR) later.
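
The severity and paging steps of the prompt above (steps 7 to 10, plus the "never page on a known-benign event" constraint) can be sketched as a small decision function. This is an illustrative sketch only, not the logic Cosmos generates. The prompt says to combine impact and confidence but gives no formula, so the product score and its cut-offs are assumptions, as are the action strings, the target names, and the choice to send the "unknown" bucket to the security on-call.

```python
from dataclasses import dataclass

# Hypothetical routing table. The prompt names only the two on-call roles;
# who owns the "unknown" bucket is an assumption made here.
PAGE_TARGETS = {
    "intrusion": "security-oncall",
    "latency": "network-oncall",
    "capacity": "network-oncall",
    "unknown": "security-oncall",
}


@dataclass
class Anomaly:
    bucket: str                 # "latency", "intrusion", "capacity", or "unknown"
    impact: float               # 0.0-1.0, assumed share of services / users affected
    confidence: float           # 0.0-1.0, how sure we are the anomaly is real
    known_benign: bool = False  # maintenance window, drill, or approved pentest


def assess_severity(anomaly: Anomaly) -> str:
    """Map impact x confidence to a level. Cut-offs are illustrative assumptions."""
    score = anomaly.impact * anomaly.confidence
    if score >= 0.6:
        return "critical"
    if score >= 0.35:
        return "high"
    if score >= 0.15:
        return "medium"
    return "low"


def plan_actions(anomaly: Anomaly) -> list[str]:
    """Return the ordered actions for one anomaly."""
    if anomaly.known_benign:
        # Known-benign events are logged as expected and never page anyone.
        return ["log_expected"]
    severity = assess_severity(anomaly)
    actions = [f"open_incident:{severity}"]  # every unexplained anomaly gets an incident
    if severity in ("high", "critical"):
        actions.append(f"page:{PAGE_TARGETS[anomaly.bucket]}")
    else:
        actions.append("notify:team-channel")
    return actions
```

For example, `plan_actions(Anomaly("latency", 0.7, 0.6))` opens a high-severity incident and pages the network on-call, while the same anomaly with `known_benign=True` is only logged.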