SENTRYGRID — TRUST CENTER · DATA PROVENANCE

Data provenance

Where the public demo's events come from, how they are emitted, and how to verify the pipeline that produced them.

The platform is real. The data is replayed. Every event, anomaly score, similar-anomaly match, and entity-timeline tick on the public sandbox at app.sentrygrid.org is derived from a row in the CICIDS-2018 dataset and scored at runtime by the production-trained ensemble, using the same v1 feature mapping the models were fit against. That bit-for-bit fidelity between training and serving is the load-bearing guarantee: the scores you see on the demo are the scores the production models produce against this data. The deployment itself is a sandbox, not a FedRAMP-authorized environment, and stating that plainly is part of the federal posture this trust center documents.

Dataset

Name: CSE-CIC-IDS-2018 on AWS (CICIDS-2018)
Publisher: Canadian Institute for Cybersecurity, University of New Brunswick
License: Open; academic and commercial use permitted with attribution.
Source: unb.ca/cic/datasets/ids-2018.html
Version: Distrinet-corrected parquet (dhoogla redistribution)
Source row count: ~16.2M flows (78 CICFlowMeter columns + Label)
Demo deck row count: ~50,000 flows, label-stratified
Date range: 2018-02-14 to 2018-03-02 (UTC)
Attack families: Brute force, DoS, infiltration, web attacks, botnet (per CICIDS-2018 labelling).
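
To make "label-stratified" concrete, here is a minimal sketch of proportional stratified sampling, in which each label keeps roughly its share of the source rows. It is an illustration under assumptions, not the logic of scripts/build-replay-deck.py: the function name, the seed, the in-memory row format, and the rule that keeps at least one row per label are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_key, target_size, seed=0):
    """Sample about target_size rows, keeping each label's share of the source."""
    rng = random.Random(seed)  # fixed seed so a rebuild picks the same rows
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    total = sum(len(group) for group in by_label.values())
    sample = []
    for label in sorted(by_label):  # sorted for a deterministic draw order
        group = by_label[label]
        # Proportional allocation. The floor of one row per label is an
        # assumption made here so that very rare labels are not dropped.
        quota = max(1, round(target_size * len(group) / total))
        sample.extend(rng.sample(group, min(quota, len(group))))
    return sample
```

Because the quotas are rounded per label, the result can differ from target_size by a few rows, which is consistent with the approximate deck size quoted above.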

Replay mechanics

  • Per-session publisher emits ~6 events per second by default.
  • Speed is selectable in the Live page header at 1×, 2×, or 4×.
  • The replay clock — the data-time of the row currently being emitted — is rendered in the Live page footer compliance strip and advances with the publisher.
  • Severity mix follows the dataset's natural distribution. No injection, no upweighting.
  • The deck is built offline by scripts/build-replay-deck.py and emitted to services/api/replay/cicids_v1.jsonl.gz. The build script reuses the same feature_mapping.py module the trained ensemble was fit against, so scoring in the deployment is bit-for-bit identical to scoring during training.
  • Production deployments do not run this publisher; they consume real telemetry per the architecture document.
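
The pacing described in the list above can be sketched as follows. This is a hedged illustration, not the publisher's actual code: the function names, the injectable sleep parameter, and the "timestamp" field used for the replay clock are assumptions. Only the ~6 events per second default, the 1×, 2×, and 4× speeds, and the gzipped JSONL deck format come from this page.

```python
import gzip
import json
import time

BASE_EVENTS_PER_SECOND = 6.0  # default rate stated in the list above
ALLOWED_SPEEDS = (1, 2, 4)    # multipliers offered in the Live page header


def emit_interval(speed):
    """Seconds to wait between events at the given speed multiplier."""
    if speed not in ALLOWED_SPEEDS:
        raise ValueError(f"unsupported speed: {speed}")
    return 1.0 / (BASE_EVENTS_PER_SECOND * speed)


def replay(deck_path, speed=1, sleep=time.sleep):
    """Yield (replay_clock, event) pairs from a gzipped JSONL deck, paced."""
    interval = emit_interval(speed)
    with gzip.open(deck_path, "rt", encoding="utf-8") as deck:
        for line in deck:
            if not line.strip():
                continue  # tolerate blank lines in the deck
            event = json.loads(line)
            # "timestamp" is a hypothetical field name for the row's data-time,
            # i.e. the value the replay clock would display.
            yield event.get("timestamp"), event
            sleep(interval)
```

Passing the sleep function as a parameter keeps the sketch testable: a test can substitute a recorder for time.sleep and check the intervals without waiting.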

Feature mapping

The 78-column CICFlowMeter schema collapses to the 47-feature v1 input the ensemble consumes. Below is a representative subset covering the seven ECS-aligned event fields the platform surfaces downstream; the full mapping lives in feature_mapping.py (constant SCHEMA_TO_CICIDS_COLUMN). The 13 SentryGrid stateful columns are zero-filled at training time and computed online per entity at serving time.

Platform field     | CICIDS-2018 column     | Note
flow_duration      | Flow Duration          | Microseconds between first and last packet of the flow.
flow_bytes_per_s   | Flow Bytes/s           | Throughput rate; primary signal for byte-rate anomalies.
flow_pkts_per_s    | Flow Packets/s         | Packet rate; primary signal for scan and DoS detection.
fwd_packets_total  | Total Fwd Packets      | Forward direction packet count.
bwd_packets_total  | Total Backward Packets | Backward direction packet count.
tcp_flag_syn_count | SYN Flag Count         | TCP SYN tally; load-bearing for half-open scan detection.
sg_protocol_id     | Protocol               | Ordinal: TCP→1, UDP→2, ICMP→3, else→4.
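
As an illustration of how the seven rows above might be applied to a single flow, here is a minimal sketch. The dictionary reproduces only the subset shown in the table, not the full SCHEMA_TO_CICIDS_COLUMN constant, and the helper names are hypothetical. It also assumes the Protocol column holds IANA protocol numbers (6 for TCP, 17 for UDP, 1 for ICMP), which this page does not state.

```python
# Subset of the mapping, limited to the seven rows in the table above.
SCHEMA_TO_CICIDS_COLUMN = {
    "flow_duration": "Flow Duration",
    "flow_bytes_per_s": "Flow Bytes/s",
    "flow_pkts_per_s": "Flow Packets/s",
    "fwd_packets_total": "Total Fwd Packets",
    "bwd_packets_total": "Total Backward Packets",
    "tcp_flag_syn_count": "SYN Flag Count",
    "sg_protocol_id": "Protocol",
}

# Assumption: the source column stores IANA protocol numbers.
IANA_TO_ORDINAL = {6: 1, 17: 2, 1: 3}  # TCP -> 1, UDP -> 2, ICMP -> 3
OTHER_PROTOCOL = 4                     # everything else -> 4


def protocol_ordinal(iana_number):
    """Collapse an IANA protocol number to the ordinal described in the table."""
    return IANA_TO_ORDINAL.get(int(iana_number), OTHER_PROTOCOL)


def map_row(cicids_row):
    """Rename CICIDS-2018 columns to platform fields for one flow record."""
    features = {}
    for field, column in SCHEMA_TO_CICIDS_COLUMN.items():
        value = cicids_row[column]
        if field == "sg_protocol_id":
            features[field] = protocol_ordinal(value)
        else:
            features[field] = float(value)
    return features
```

Keeping the mapping in one module that both the training and serving paths import is what makes the train/serve parity claim checkable: there is a single definition to diff.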

Manifest

A machine-readable companion is published at /trust-center/data-provenance/manifest.json. It carries the deck path, row count, date range, build-script reference, and a SHA-256 checksum of the deployed deck. The fields below ship with placeholder values in source control; the demo deployment overlay (infra/helm/sentrygrid/values.demo.yaml) overwrites them at deploy time with the real values produced by make demo-data.

  • deck.sha256: SHA-256 of the gzipped deck file. Overwritten by `make demo-data` at deploy.
  • deck.builtAt: ISO-8601 build timestamp. Overwritten at deploy.
  • deck.builderScriptSha256: SHA-256 of scripts/build-replay-deck.py at build time.
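
A reader who has both the manifest and a copy of the deck could check the checksum along these lines. This is a sketch under assumptions: the function names are hypothetical, and it assumes the dotted field name deck.sha256 corresponds to a nested JSON object of the form {"deck": {"sha256": "..."}}, which this page does not confirm.

```python
import hashlib
import json


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deck_matches_manifest(manifest_path, deck_path):
    """Compare the deck's SHA-256 with the value recorded in the manifest."""
    with open(manifest_path, encoding="utf-8") as handle:
        manifest = json.load(handle)
    # Assumed layout: the checksum is nested under a "deck" object.
    expected = manifest["deck"]["sha256"].strip().lower()
    return sha256_of_file(deck_path) == expected
```

The hash is taken over the gzipped file as stored, matching the description of deck.sha256 above, so the deck must not be decompressed or recompressed before checking.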

This page documents the demo profile only. It does not change any production posture claim (see infra/helm/sentrygrid/values.govcloud.yaml).
