Data provenance
Where the public demo's events come from, how they are emitted, and how to verify the pipeline that produced them.
The platform is real. The data is replayed. Every event, anomaly score, similar-anomaly match, and entity-timeline tick on the public sandbox at app.sentrygrid.org is derived from a row in the CICIDS-2018 dataset, scored at runtime by the production-trained ensemble using the v1 feature mapping the models were fit against. That bit-for-bit fidelity between train and serve is the load-bearing guarantee: the scores you see on the demo are the scores the production models produce against this data. The deployment itself is a sandbox, not a FedRAMP-authorized environment, and stating that plainly is part of the federal posture this trust center documents.
Dataset
- Name: CSE-CIC-IDS-2018 on AWS (CICIDS-2018)
- Publisher: Canadian Institute for Cybersecurity, University of New Brunswick
- License: Open; academic and commercial use permitted with attribution.
- Version: Distrinet-corrected parquet (dhoogla redistribution)
- Source row count: ~16.2M flows (78 CICFlowMeter columns + Label)
- Demo deck row count: ~50,000 flows, label-stratified
- Date range: 2018-02-14 to 2018-03-02 (UTC)
- Attack families: Brute force, DoS, infiltration, web attacks, botnet (per CICIDS-2018 labelling).
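To make "label-stratified" concrete, here is a minimal sketch of proportional stratified sampling. It is not taken from scripts/build-replay-deck.py: the function name `stratified_sample`, the `label_key` parameter, and the choice of proportional allocation with largest-remainder rounding are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List


def stratified_sample(
    rows: List[Dict[str, Any]], label_key: str, target: int, seed: int = 0
) -> List[Dict[str, Any]]:
    """Draw about `target` rows while preserving each label's share of the source.

    ASSUMPTION: proportional allocation; the real build script may stratify differently.
    """
    by_label: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    total = len(rows)
    if target >= total:
        return list(rows)

    # Floor each label's proportional quota, then hand the leftover slots
    # to the labels with the largest fractional remainders.
    quotas: Dict[Any, int] = {}
    remainders = []
    for label, group in by_label.items():
        exact = target * len(group) / total
        quotas[label] = int(exact)
        remainders.append((exact - int(exact), str(label), label))
    leftover = target - sum(quotas.values())
    for _, _, label in sorted(remainders, key=lambda r: (r[0], r[1]), reverse=True)[:leftover]:
        quotas[label] += 1

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample: List[Dict[str, Any]] = []
    for label, group in by_label.items():
        sample.extend(rng.sample(group, quotas[label]))
    return sample
```

With a fixed seed the same source rows always yield the same sample, which is the property a checksummed deck would need.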
Replay mechanics
- Per-session publisher emits ~6 events per second by default.
- Speed is selectable in the Live page header at 1×, 2×, or 4×.
- The replay clock — the data-time of the row currently being emitted — is rendered in the Live page footer compliance strip and advances with the publisher.
- Severity mix follows the dataset's natural distribution. No injection, no upweighting.
- The deck is built offline by scripts/build-replay-deck.py and emitted to services/api/replay/cicids_v1.jsonl.gz. The build script reuses the same feature_mapping.py module the trained ensemble was fit against, so scoring at deploy time is bit-for-bit identical to scoring at training time.
- Production deployments do not run this publisher; they consume real telemetry per the architecture document.
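The pacing described above can be sketched as follows. This is an illustration, not the publisher's actual code: the names `emit_interval`, `read_deck`, and `replay` are hypothetical, and the sketch assumes that the speed setting simply multiplies the base emission rate and that the deck holds one JSON object per line.

```python
import gzip
import json
import time
from typing import Any, Callable, Dict, Iterable, Iterator

BASE_EVENTS_PER_SECOND = 6.0  # "~6 events per second by default"
ALLOWED_SPEEDS = (1, 2, 4)    # the 1x, 2x, 4x options in the Live page header


def emit_interval(speed: int, base_rate: float = BASE_EVENTS_PER_SECOND) -> float:
    """Seconds to wait between events at the given speed multiplier."""
    if speed not in ALLOWED_SPEEDS:
        raise ValueError(f"speed must be one of {ALLOWED_SPEEDS}, got {speed}")
    return 1.0 / (base_rate * speed)


def read_deck(path: str) -> Iterator[Dict[str, Any]]:
    """Yield rows from a gzipped JSON Lines deck (ASSUMED layout: one object per line)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def replay(
    rows: Iterable[Dict[str, Any]],
    publish: Callable[[Dict[str, Any]], None],
    speed: int = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Publish rows in deck order at a fixed pace; returns the number emitted."""
    interval = emit_interval(speed)
    emitted = 0
    for row in rows:
        publish(row)
        sleep(interval)
        emitted += 1
    return emitted
```

A call such as `replay(read_deck("services/api/replay/cicids_v1.jsonl.gz"), print, speed=2)` would emit roughly 12 rows per second under these assumptions. Passing `sleep` as a parameter keeps the loop testable without real delays.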
Feature mapping
The 78-column CICFlowMeter schema collapses to the 47-feature v1 input the ensemble consumes. Below is a representative subset covering the seven ECS-aligned event fields the platform surfaces downstream; the full mapping lives in feature_mapping.py (constant SCHEMA_TO_CICIDS_COLUMN). The 13 SentryGrid stateful columns are zero-filled at training time and computed online per entity at serving time.
| Platform field | CICIDS-2018 column | Note |
|---|---|---|
| flow_duration | Flow Duration | Microseconds between first and last packet of the flow. |
| flow_bytes_per_s | Flow Bytes/s | Throughput rate; primary signal for byte-rate anomalies. |
| flow_pkts_per_s | Flow Packets/s | Packet rate; primary signal for scan and DoS detection. |
| fwd_packets_total | Total Fwd Packets | Forward direction packet count. |
| bwd_packets_total | Total Backward Packets | Backward direction packet count. |
| tcp_flag_syn_count | SYN Flag Count | TCP SYN tally; load-bearing for half-open scan detection. |
| sg_protocol_id | Protocol | Ordinal: TCP→1, UDP→2, ICMP→3, else→4. |
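The rows in the table can be read as a small lookup plus one ordinal encoding. The sketch below is illustrative only and is not the contents of feature_mapping.py: `MAPPING_SUBSET`, `protocol_ordinal`, and `map_row` are hypothetical names, and it assumes the raw Protocol column carries IANA protocol numbers (6 for TCP, 17 for UDP, 1 for ICMP).

```python
from typing import Any, Dict

# Platform field -> CICIDS-2018 column, copied from the seven table rows above.
MAPPING_SUBSET: Dict[str, str] = {
    "flow_duration": "Flow Duration",
    "flow_bytes_per_s": "Flow Bytes/s",
    "flow_pkts_per_s": "Flow Packets/s",
    "fwd_packets_total": "Total Fwd Packets",
    "bwd_packets_total": "Total Backward Packets",
    "tcp_flag_syn_count": "SYN Flag Count",
    "sg_protocol_id": "Protocol",
}

# ASSUMPTION: the raw Protocol column holds IANA protocol numbers.
IANA_TO_ORDINAL = {6: 1, 17: 2, 1: 3}  # TCP -> 1, UDP -> 2, ICMP -> 3
OTHER_PROTOCOL_ORDINAL = 4             # everything else -> 4


def protocol_ordinal(raw_protocol: Any) -> int:
    """Encode a raw protocol number as the table's ordinal."""
    return IANA_TO_ORDINAL.get(int(raw_protocol), OTHER_PROTOCOL_ORDINAL)


def map_row(raw: Dict[str, Any]) -> Dict[str, float]:
    """Project one CICFlowMeter row onto the seven platform fields."""
    mapped: Dict[str, float] = {}
    for field, column in MAPPING_SUBSET.items():
        value = raw[column]
        if field == "sg_protocol_id":
            mapped[field] = float(protocol_ordinal(value))
        else:
            mapped[field] = float(value)
    return mapped
```

Keeping the mapping in a single constant that both the build script and the serving path import is what lets train and serve share one definition.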
Manifest
A machine-readable companion is published at /trust-center/data-provenance/manifest.json. It carries the deck path, row count, date range, build-script reference, and a SHA-256 checksum of the deployed deck. The fields below ship with placeholder values in source control; the demo deployment overlay (infra/helm/sentrygrid/values.demo.yaml) overwrites them at deploy time with the real values produced by make demo-data.
- deck.sha256 — SHA-256 of the gzipped deck file. Overwritten by `make demo-data` at deploy.
- deck.builtAt — ISO-8601 build timestamp. Overwritten at deploy.
- deck.builderScriptSha256 — SHA-256 of scripts/build-replay-deck.py at build time.
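A reader who has both the manifest and a copy of the deck could check the checksum along these lines. This is a minimal sketch under stated assumptions: the dotted names above are taken to mean a nested `deck` object with a `sha256` key, and `sha256_of_file` and `verify_deck` are hypothetical helper names, not part of the platform.

```python
import hashlib
import json
from typing import Any, Dict


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so a large deck is not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_deck(manifest: Dict[str, Any], deck_path: str) -> bool:
    """Compare the manifest's deck.sha256 against the gzipped deck on disk.

    ASSUMPTION: manifest.json nests the field as {"deck": {"sha256": "..."}}.
    """
    expected = str(manifest["deck"]["sha256"]).strip().lower()
    return sha256_of_file(deck_path) == expected


def load_manifest(path: str) -> Dict[str, Any]:
    """Read a locally saved copy of manifest.json."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)
```

The hash is taken over the gzipped file as shipped, not over the decompressed rows, matching the description of deck.sha256 above. A manifest that still carries source-control placeholder values will fail this check by design.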
This page documents the demo profile only. It does not change any production posture claim (infra/helm/sentrygrid/values.govcloud.yaml).