Overview
PCAPR answers the question: given an unknown pcap, what is in it and is it suspicious? It operates without signatures, without a protocol database, and without prior knowledge of the protocol being analyzed. Structure is inferred from the byte stream itself.
Current version: v0.6.0 (feature-complete, actively maintained).
Install: run uv sync in the project directory, then run pcapr --help.
Analysis Pipeline
1. Stream Reconstruction
Primary reader: dpkt. Fallback: Scapy.
TCP streams are reconstructed by 5-tuple (src_ip, dst_ip, src_port, dst_port, proto) and reassembled in sequence order. UDP flows are grouped by 5-tuple within a session timeout. USB captures are parsed separately via USB-specific stream handling.
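The sequence-order reassembly described above can be sketched as follows. This is a minimal illustration, not PCAPR's actual code: it assumes segments have already been parsed into (five_tuple, seq, payload) records, the helper name reassemble is hypothetical, and it ignores 32-bit sequence wraparound and missing segments.

```python
from collections import defaultdict

def reassemble(segments):
    """Group TCP segments by 5-tuple and join payloads in sequence order.

    `segments` is an iterable of (five_tuple, seq, payload) records.
    Hypothetical helper: ignores sequence wraparound and gaps.
    """
    flows = defaultdict(dict)
    for five_tuple, seq, payload in segments:
        if payload:
            # Keep the first copy of a retransmitted segment.
            flows[five_tuple].setdefault(seq, payload)

    streams = {}
    for five_tuple, by_seq in flows.items():
        data = bytearray()
        next_seq = None
        for seq in sorted(by_seq):
            payload = by_seq[seq]
            if next_seq is not None and seq < next_seq:
                # Overlapping segment: drop the bytes already delivered.
                payload = payload[next_seq - seq:]
                seq = next_seq
            if not payload:
                continue
            data += payload
            next_seq = seq + len(payload)
        streams[five_tuple] = bytes(data)
    return streams
```

Each direction of a connection has its own 5-tuple here, so client-to-server and server-to-client bytes end up in separate streams.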
2. Protocol Recognition
Thirty-plus protocols are recognized by signature matching before structural inference begins: HTTP, TLS, DNS, SMTP, FTP, SSH, SMB, MySQL, Redis, Kafka, and others. For recognized protocols, field semantics are partially known in advance. For unrecognized protocols, all inference proceeds from first principles.
3. Framing Inference
Given a reconstructed stream, PCAPR infers how the protocol segments messages:
- Length-prefix: A fixed-position field contains the byte length of the following message. PCAPR searches for fields whose value predicts the distance to the next message boundary.
- Delimiter-framed: Messages end at a recognizable byte sequence (CRLF, null byte, custom magic).
- Fixed-frame: All messages are the same length.
- Varint: Variable-length integer encoding (protobuf-style).
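As an illustration of the length-prefix search, the sketch below tries each candidate field position and accepts it when walking the stream by the decoded lengths lands exactly on the end of the stream. The function name, the candidate widths, and the assumption that the payload starts immediately after the length field are all illustrative choices, not PCAPR's actual implementation.

```python
def find_length_prefix(stream, max_offset=8, widths=(1, 2, 4)):
    """Return (offset, width, byteorder, header_len) candidates that frame `stream`.

    A candidate is accepted when repeatedly reading the length field and
    skipping `header_len + length` bytes ends exactly at the end of the
    stream after at least two messages.
    """
    hits = []
    for offset in range(max_offset):
        for width in widths:
            for byteorder in ("big", "little"):
                # Assumption: the payload follows the length field directly.
                header_len = offset + width
                pos, count = 0, 0
                while pos + header_len <= len(stream):
                    field = stream[pos + offset: pos + offset + width]
                    length = int.from_bytes(field, byteorder)
                    pos += header_len + length
                    count += 1
                if pos == len(stream) and count >= 2:
                    hits.append((offset, width, byteorder, header_len))
    return hits
```

Short streams can produce accidental matches, so a real detector would likely require more messages or agreement across several streams before trusting a candidate.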
4. Field Classification
Within each message, byte positions are classified by their statistical behavior across all observed messages:
| Field type | Inference criterion |
|---|---|
| Magic | Constant value at fixed offset across all messages |
| Length | Value predicts remaining payload size |
| Enum | Small bounded set of observed values (opcode-like) |
| Fixed | Low variance, not magic |
| Sequence | Monotonically increasing (counter, timestamp) |
| Echo | Matches a field in the preceding message (request/response correlation) |
| Flags | Bit-field structure (individual bits carry independent meaning) |
| Checksum | Value is a function of other fields |
Per-opcode re-analysis is run after opcode fields are identified: fields may have different types across message types sharing the same framing.
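A few of the criteria in the table can be sketched for single-byte fields as follows. The function name, the enum_max cutoff, and the order in which the checks run are assumptions made for illustration; Fixed, Echo, Flags, and Checksum are left out to keep the example short.

```python
def classify_offset(messages, offset, enum_max=8):
    """Classify one byte offset across messages (covers a subset of field types)."""
    present = [m for m in messages if len(m) > offset]
    values = [m[offset] for m in present]
    if len(values) < 2:
        return "unknown"
    distinct = set(values)
    if len(distinct) == 1:
        # Same value at the same offset in every message.
        return "magic"
    if all(m[offset] == len(m) - offset - 1 for m in present):
        # Value equals the number of bytes remaining after the field.
        return "length"
    if all(b > a for a, b in zip(values, values[1:])):
        # Strictly increasing across the message sequence.
        return "sequence"
    if len(distinct) <= enum_max and len(distinct) < len(values):
        # Small set of values that repeat, as opcodes tend to.
        return "enum"
    return "unknown"
```

For example, with messages laid out as magic, opcode, counter, length, payload, offsets 0 to 3 would be classified as "magic", "enum", "sequence", and "length" respectively.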
5. Session State Machine Inference
PCAPR models the session as a state machine over observed message sequences. States are defined by the enum field values; transitions are the observed orderings. The resulting state machine describes the session protocol structure as a directed graph.
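A minimal sketch of this idea, assuming each session has already been reduced to its sequence of enum (opcode) values, is to count the observed orderings as weighted edges. The function name infer_transitions is hypothetical.

```python
from collections import Counter

def infer_transitions(sessions):
    """Count observed (state, next_state) edges across sessions.

    `sessions` is an iterable of opcode sequences, one per session.
    The result is a directed graph stored as an edge -> count mapping.
    """
    edges = Counter()
    for opcodes in sessions:
        for current, following in zip(opcodes, opcodes[1:]):
            edges[(current, following)] += 1
    return edges
```

Edges with a count of one are worth a second look, since a transition seen only once may be an anomaly rather than part of the normal protocol flow.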
Security Detection
Beaconing (analyze/beacon.py)
C2 beaconing is detected via the coefficient of variation (CV) of inter-message timing gaps:
CV = std(gaps) / mean(gaps)
- CV < 0.3: highly regular (automated beaconing)
- CV 0.3–0.6: possibly automated
- CV > 0.6: likely human or jittered
The threshold is not a signature: it is derived from the observed timing of the specific pcap under analysis. Cobalt Strike default beaconing (60-second interval, no jitter) produces CV ≈ 0.02–0.05. Human-generated traffic produces CV > 1.0 in most cases.
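The CV computation and the bands above can be sketched as follows. The function names are hypothetical, and the use of the population standard deviation is an assumption; the source does not say which estimator analyze/beacon.py uses.

```python
import statistics

def beacon_cv(timestamps):
    """Coefficient of variation of inter-message gaps, or None if undefined."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2:
        return None
    mean = statistics.fmean(gaps)
    if mean == 0:
        return None
    # Assumption: population standard deviation.
    return statistics.pstdev(gaps) / mean

def classify_cv(cv):
    """Map a CV value onto the three bands listed above."""
    if cv < 0.3:
        return "highly regular"
    if cv <= 0.6:
        return "possibly automated"
    return "likely human or jittered"
```

A flow with perfectly periodic messages yields a CV of 0 and falls in the first band; a handful of irregularly spaced messages pushes the CV well above 0.6.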
DNS Tunneling (analyze/dns_tunnel.py)
DNS tunneling is scored by:
- Subdomain entropy: Exfiltration via DNS encodes data in subdomains, producing high Shannon entropy labels compared to legitimate FQDNs.
- Label length distribution: Tunneled subdomains are systematically longer than legitimate DNS queries.
- Query rate: Exfiltration requires many queries in a short window.
The scoring is continuous: the output is a DNS tunnel score, not a binary flag.
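The entropy and label-length signals can be combined into a continuous score along these lines. This is a toy sketch: the equal weighting, the normalization constants, and the assumption that the registered domain is the last two labels are illustrative and not taken from analyze/dns_tunnel.py. Query rate is omitted because it needs timing data.

```python
import math
from collections import Counter

def shannon_entropy(label):
    """Shannon entropy of a string, in bits per character."""
    if not label:
        return 0.0
    n = len(label)
    counts = Counter(label)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def tunnel_score(qname, max_entropy=5.0, max_label_len=63):
    """Toy score in [0, 1] from the longest subdomain label."""
    labels = qname.rstrip(".").split(".")
    # Assumption: the last two labels are the registered domain.
    sub = max(labels[:-2] or [""], key=len)
    entropy_part = min(shannon_entropy(sub) / max_entropy, 1.0)
    length_part = min(len(sub) / max_label_len, 1.0)
    return (entropy_part + length_part) / 2
```

A query such as www.example.com scores near zero, while a long random-looking subdomain label scores much higher.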
TLS Fingerprinting (analyze/tls_fp.py)
JA3 fingerprints are computed from TLS ClientHello fields: TLS version, cipher suites, extensions, elliptic curves, elliptic curve point formats. The resulting MD5 hash is characteristic of the TLS client library and version, although unrelated clients can occasionally share a hash.
Known-bad JA3 hashes are matched against a curated list including:
- Cobalt Strike default (multiple variants)
- Metasploit Meterpreter
- Common commodity RAT families
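The JA3 computation described above can be sketched as follows, assuming the ClientHello has already been parsed into integer lists. The function names are hypothetical; the string layout (five comma-separated fields, values joined by hyphens, GREASE values removed) follows the public JA3 description.

```python
import hashlib

def _is_grease(value):
    # GREASE values (RFC 8701) look like 0x0A0A, 0x1A1A, ..., 0xFAFA.
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)

def ja3(version, ciphers, extensions, curves, point_formats):
    """Return (ja3_string, md5_hex) for already-parsed ClientHello fields."""
    fields = [
        [version],
        [c for c in ciphers if not _is_grease(c)],
        [e for e in extensions if not _is_grease(e)],
        [g for g in curves if not _is_grease(g)],
        list(point_formats),
    ]
    # Values are decimal, hyphen-joined; fields are comma-joined.
    ja3_string = ",".join("-".join(str(v) for v in f) for f in fields)
    digest = hashlib.md5(ja3_string.encode("ascii")).hexdigest()
    return ja3_string, digest
```

Matching against a known-bad list then reduces to a set lookup on the hex digest.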
TLS decryption is supported via SSLKEYLOGFILE when the key material is available (analyze/tls_decrypt.py).
XOR Key Recovery (analyze/xor_recover.py)
Single-byte and multi-byte XOR obfuscation is detected and reversed via index of coincidence (IC) analysis:
- For each candidate key length k (1–16 bytes), split the payload into k interleaved subsequences.
- Compute IC for each subsequence. If all IC values are close to the expected IC of plaintext for the relevant language/protocol, k is a candidate key length.
- For confirmed key lengths, recover each key byte by frequency analysis.
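The three steps above can be sketched as follows. This is a simplified illustration: the function names are hypothetical, the IC test is reduced to a lower-bound threshold of 0.04 (an assumed value), and key recovery assumes the most frequent plaintext byte is an ASCII space, which only suits text-like protocols.

```python
from collections import Counter

def index_of_coincidence(data):
    """Probability that two bytes drawn without replacement are equal."""
    n = len(data)
    if n < 2:
        return 0.0
    counts = Counter(data)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def candidate_key_lengths(ciphertext, max_len=16, threshold=0.04):
    """Key lengths whose interleaved subsequences all look like plaintext."""
    candidates = []
    for k in range(1, max_len + 1):
        columns = [ciphertext[i::k] for i in range(k)]
        if all(index_of_coincidence(col) >= threshold for col in columns):
            candidates.append(k)
    return candidates

def recover_key(ciphertext, k, most_common_plain=0x20):
    """Recover each key byte from the most frequent byte in its column."""
    key = bytearray()
    for i in range(k):
        top, _ = Counter(ciphertext[i::k]).most_common(1)[0]
        # Assumption: the most frequent plaintext byte is a space.
        key.append(top ^ most_common_plain)
    return bytes(key)
```

Multiples of the true key length also pass the IC test, so the smallest candidate is normally the one to try first.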
ECB Detection (analyze/ecb_detect.py)
AES and 3DES in ECB mode produce repeated 16-byte (AES) or 8-byte (3DES) ciphertext blocks when encrypting repeated plaintext. PCAPR scans payload bytes for block repetitions at block-size boundaries and reports the repetition rate. ECB in network traffic typically indicates a custom crypto implementation.
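A minimal sketch of the repetition scan is shown below. The function name is hypothetical, and the definition of the rate (the fraction of aligned blocks whose value occurs more than once) is an assumption about how the metric is computed.

```python
from collections import Counter

def ecb_repetition_rate(payload, block_size=16):
    """Fraction of aligned blocks whose value appears more than once."""
    blocks = [
        payload[i:i + block_size]
        for i in range(0, len(payload) - block_size + 1, block_size)
    ]
    if not blocks:
        return 0.0
    counts = Counter(blocks)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(blocks)
```

Calling it with block_size=8 covers the 3DES case. Well-encrypted or random data should score at or near zero, so even a small nonzero rate on a large payload is notable.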
Sensitive Data Detection (analyze/sensitive.py)
Cleartext pattern matching for credentials, PII, and secrets: usernames/passwords in HTTP auth headers, credit card PANs (Luhn-valid 13- to 19-digit sequences), US SSNs, API key patterns, private key headers.
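The PAN check mentioned above can be sketched as a digit-run search followed by Luhn validation. The names PAN_RE, luhn_valid, and find_pans are hypothetical, and this version only handles unbroken digit runs, not numbers written with spaces or hyphens.

```python
import re

# Runs of 13 to 19 digits that are not part of a longer digit run.
PAN_RE = re.compile(r"(?<!\d)\d{13,19}(?!\d)")

def luhn_valid(digits):
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_pans(text):
    """Return Luhn-valid 13- to 19-digit sequences found in `text`."""
    return [m.group() for m in PAN_RE.finditer(text) if luhn_valid(m.group())]
```

The Luhn filter discards most random digit runs (roughly nine in ten), which keeps timestamps and numeric IDs from flooding the report.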
Malware Family Attribution (analyze/malware_cluster.py)
Flows are clustered by behavioral profile (timing, entropy, field patterns) and matched against known malware families via behavioral signatures. Attribution confidence is provided alongside the family name.
Output Formats
| Format | Command flag | Notes |
|---|---|---|
| Terminal | (default) | Rich color output with anomaly cards |
| HTML | --html | Self-contained, shareable report |
| JSON | --json | Machine-readable, pipeline-friendly |
| Wireshark Lua | --wireshark | Dissector plugin for the observed protocol |
| Kaitai Struct | --kaitai | .ksy definition for binary format tooling |
| Scapy layer | --scapy | Python Packet subclass for scripting |
| Snort/Suricata | --ids-rules | Detection rules from observed signatures |
| boofuzz/AFL | --fuzz-dict | Mutation dictionary for fuzzing follow-up |
| Python client | --client | Standalone client for the observed protocol |
Additional Modes
pcapr-diff: Compare two captures (baseline vs. suspicious). Reports fields and timing patterns that changed between captures.
pcapr-replay: Replay a captured session with dynamic field patching (modify specific fields while preserving framing).
pcapr-batch: Process many pcap files and produce a consolidated index.html.
pcapr-serve: Drag-and-drop web UI with an integrated Academy curriculum (interactive learning mode with 11 challenge scenarios).
--learn mode: Plain-English annotation of every analysis section, suitable for analysts building protocol analysis skills.
Integration Position
PCAPR is the investigation-layer tool in the minidet detection stack. It does not run on every flow; it runs when the correlator has identified a suspicious destination IP and opens a formal investigation. The InvestigationWorker slices the relevant flows from the pcap and passes them to PCAPR; the resulting structured report is emitted as an investigation_report EVE JSON event into the SIEM stream.
Standalone use: pcapr capture.pcap --html report.html --learn