2026-kulatilleke-mambanetburst-direct-byte-level
MambaNetBurst: Direct Byte-level Network Traffic Classification without Tokenization or Pretraining
arxiv: 2605.11034
findings extracted from this paper
Striding with factor 4 (early downsampling) produces the largest single-factor degradation in the ablation study: average macro-F1 drops from 0.9909 to 0.9772, cross-dataset variance increases from 4.77×10⁻⁵ to 4.51×10⁻⁴, and worst-case dataset performance falls to 0.9524. Fine-grained byte order and short-range structure (protocol headers, payload signatures, repeated byte motifs) carry essential discriminative signal that stride-based aggregation destroys.
Early downsampling via striding (stride=4) is the single most damaging ablation: it reduces average macro-F1 from 0.9909 to 0.9772, increases cross-dataset variance from 4.77×10⁻⁵ to 4.51×10⁻⁴, and drops the worst-case dataset to F1=0.9524, a far larger degradation than any other design choice, including the Mamba-1 vs Mamba-2 swap.
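To make the ablated design choice concrete, here is a minimal PyTorch sketch (not the authors' code) contrasting a full-resolution byte sequence with an early stride-4 downsampling layer placed in front of the sequence model; the embedding width and the Conv1d aggregator are illustrative assumptions.

```python
# Minimal sketch (not the released code): keep full byte resolution vs.
# apply early stride-4 downsampling before the sequence model.
# Hyperparameters (d_model=64, kernel_size=4) are illustrative assumptions.
import torch
import torch.nn as nn

d_model = 64
byte_embed = nn.Embedding(256, d_model)           # one embedding per byte value
downsample = nn.Conv1d(d_model, d_model, kernel_size=4, stride=4)

burst = torch.randint(0, 256, (1, 1600))          # 5 packets x 320 bytes
x = byte_embed(burst)                             # (1, 1600, d_model)

# Full-resolution path: all 1600 byte positions reach the Mamba-2 stack.
full_res = x                                      # (1, 1600, d_model)

# Ablated path: stride-4 aggregation collapses every 4 adjacent bytes into
# one position, discarding fine-grained byte order inside each window.
strided = downsample(x.transpose(1, 2)).transpose(1, 2)   # (1, 400, d_model)

print(full_res.shape, strided.shape)              # 1600 vs 400 positions
```

The strided path hands the downstream stack only 400 aggregated positions, so byte-level motifs shorter than the 4-byte window can no longer be recovered, which is consistent with the degradation reported above.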
A burst of just 5 packets truncated to 320 bytes each (1600 bytes total) suffices for macro-F1 ≥0.9824 across all six benchmarks; the classification token reads out the final recurrent state of a 4-layer Mamba-2 stack that processes this fixed-length prefix, with no additional flow-level or session-level context required.
Classification from the first 5 packets × 320 bytes (1600-byte burst) achieves near-perfect accuracy across Tor (F1=0.9990), VPN (F1=0.9871), malware (F1=0.9954), and IoT attack traffic (F1=0.9966), with IP addresses masked and only the header and initial payload retained. The earliest portion of each packet provides sufficient discriminative information for a classification decision made within the first 1600 bytes of a flow.
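A rough sketch, under stated assumptions, of how such a burst could be assembled and classified: each packet is truncated or zero-padded to 320 bytes, IPv4 source/destination addresses are zeroed at their usual header offsets, and a GRU stands in for the 4-layer Mamba-2 stack (e.g. mamba_ssm's Mamba2) purely so the example runs on CPU. All names, offsets, and hyperparameters here are hypothetical, not the paper's pipeline.

```python
# Preprocessing and model sketch (assumptions, not the released pipeline):
# build a fixed-length 1600-byte burst from the first 5 packets of a flow,
# mask IP addresses, and classify from the final recurrent state.
import torch
import torch.nn as nn

PKTS, PKT_LEN = 5, 320                       # 5 packets x 320 bytes = 1600 bytes

def packet_to_bytes(pkt: bytes) -> list[int]:
    b = bytearray(pkt[:PKT_LEN])
    # Assumed raw-IPv4 layout: zero src/dst addresses at header offsets 12-19.
    for i in range(12, min(20, len(b))):
        b[i] = 0
    b.extend([0] * (PKT_LEN - len(b)))       # zero-pad short packets
    return list(b)

def flow_to_burst(packets: list[bytes]) -> torch.Tensor:
    pkts = (packets + [b""] * PKTS)[:PKTS]   # pad missing packets with empties
    flat = [v for p in pkts for v in packet_to_bytes(p)]
    return torch.tensor(flat, dtype=torch.long)        # shape (1600,)

class BurstClassifier(nn.Module):
    def __init__(self, n_classes: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(256, d_model)
        # Stand-in for the 4-layer Mamba-2 stack; a GRU is used here only so
        # the sketch runs without the CUDA kernels mamba_ssm requires.
        self.seq = nn.GRU(d_model, d_model, num_layers=4, batch_first=True)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, bursts: torch.Tensor) -> torch.Tensor:
        h, _ = self.seq(self.embed(bursts))
        return self.head(h[:, -1])           # classify from the final state

burst = flow_to_burst([b"\x45\x00" + bytes(60)] * 3).unsqueeze(0)
print(BurstClassifier(n_classes=8)(burst).shape)        # torch.Size([1, 8])
```

Reading the prediction from the last sequence position mirrors the classification-from-final-recurrent-state setup described above, with no flow- or session-level features involved.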
MambaNetBurst classifies Tor traffic (ISCXTor2016) at F1=0.9990 and VPN traffic (ISCXVPN2016) at F1=0.9871 using only the first 5 packets (1600 bytes total) with no pre-training, matching or exceeding pre-trained baselines such as ET-BERT (ISCXTor F1=0.9967, ISCXVPN F1=0.9565) and NetMamba (ISCXTor F1=0.9986, ISCXVPN F1=0.9806), while using only 2.5–2.7M parameters.
Mamba-2's constrained scalar-times-identity A-matrix acts as an implicit regularizer for packet-byte sequences: under matched settings it yields higher mean F1 (0.9909 vs 0.9874), better worst-case F1 (0.9824 vs 0.9769), and 48% lower cross-dataset variance (4.77×10⁻⁵ vs 9.21×10⁻⁵) relative to Mamba-1, while delivering 30–60% faster backward passes and 2–4× lower GPU memory usage.
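The constraint at issue is the shape of the state-transition term in the SSM recurrence h_t = A_t·h_{t-1} + B_t·x_t: Mamba-1 learns a per-channel diagonal decay, while Mamba-2's SSD formulation shares a single scalar per step across all state channels (scalar times identity). A toy sketch with illustrative values, not the paper's parameters:

```python
# Toy recurrence sketch (illustrative values only): contrast a per-channel
# diagonal decay (Mamba-1 style) with a single shared scalar decay per step
# (Mamba-2 / SSD style, A = a_t * I). Fewer degrees of freedom per step is
# the implicit regularization effect the finding describes.
import torch

T, N = 6, 4                                   # time steps, state channels
x = torch.randn(T, N)                         # input projections B_t x_t

a_diag = torch.rand(T, N)                     # Mamba-1: T x N decay factors
a_scalar = torch.rand(T, 1)                   # Mamba-2: one scalar per step

def run(decay: torch.Tensor) -> torch.Tensor:
    h = torch.zeros(N)
    for t in range(T):                        # h_t = a_t * h_{t-1} + x_t
        h = decay[t] * h + x[t]
    return h

print("diagonal A :", run(a_diag))
print("scalar*I A :", run(a_scalar))
```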
Mamba-2 (2.5M parameters) is Pareto-optimal on the accuracy-vs-inference-time frontier: it achieves average macro-F1 of 0.9909 with 30–60% faster backward passes than Mamba-1 and 2–3× faster inference than linear Transformers with FlashAttention-2 at medium-to-large batch sizes on a single RTX 3090. Memory usage is 2–4× lower than Transformer-based counterparts, enabling single-GPU operation at sequence length 1600.
Supervised byte-level training without pre-training reduces wall-clock training time by an estimated 3–15× and training memory footprint by 2–4× compared to pre-trained Transformer baselines (ET-BERT, YaTC, NetMamba), while achieving equivalent or superior classification F1 across six benchmarks spanning encrypted app identification, VPN/Tor, malware, and IoT attack traffic.
Eliminating self-supervised pretraining reduces total wall-clock training time by an estimated 3–15× relative to ET-BERT, YaTC, and NetMamba, while achieving comparable or superior accuracy. Pretraining in representative baselines typically consumes 10–100× more compute than downstream fine-tuning; removing it also eliminates the risk of negative transfer from mismatched pretraining corpora under concept drift.
MambaNetBurst achieves macro-F1 of 0.9990 on ISCXTor2016 and 0.9871 on ISCXVPN2016 without any pretraining, matching or exceeding heavily pretrained baselines such as ET-BERT (F1=0.9967/0.9565) and YaTC (F1=0.9986/0.9806). High-accuracy Tor and VPN traffic classification is achievable with a compact 2.5M-parameter supervised model that requires no pretraining corpus at all.