PCAPR: Protocol Reverse Engineering and Security Analysis from Raw Packet Captures

Abstract

PCAPR is a protocol reverse engineering tool that infers structure from raw packet captures without prior protocol knowledge. Given a pcap file, it reconstructs TCP/UDP streams, infers framing (length-prefix, delimiter, fixed-frame), classifies field types (magic, length, enum, fixed, sequence, echo, flags, checksum), and recovers session state machines. The security detection layer identifies C2 beaconing via coefficient of variation of inter-message gaps, DNS tunneling via subdomain entropy scoring, TLS anomalies via JA3 fingerprinting, XOR obfuscation via index-of-coincidence key recovery, ECB mode encryption via block repetition detection, and sensitive data via cleartext pattern matching. Analysis output is available as a terminal report, self-contained HTML, JSON, a Wireshark Lua dissector, a Kaitai Struct definition, a Scapy layer, Snort/Suricata rules, and boofuzz/AFL fuzzing dictionaries.

Overview

PCAPR answers the question: given an unknown pcap, what is in it and is it suspicious? It operates without signatures, without a protocol database, and without prior knowledge of the protocol being analyzed. Structure is inferred from the byte stream itself.

Current version: v0.6.0 (feature-complete, actively maintained).

Install: uv sync in the project directory, then pcapr --help.

Analysis Pipeline

1. Stream Reconstruction

Primary reader: dpkt. Fallback: Scapy.

TCP streams are reconstructed by 5-tuple (src_ip, dst_ip, src_port, dst_port, proto) and reassembled in sequence order. UDP flows are grouped by 5-tuple within a session timeout. USB captures are handled by a separate, USB-specific stream parser.
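
The reassembly step can be sketched as follows. This is a minimal illustration, not PCAPR's implementation: the `(five_tuple, seq, payload)` input shape and the name `reassemble` are hypothetical, and sequence-number wraparound and gap handling are left out.

```python
from collections import defaultdict


def reassemble(segments):
    """Join TCP payloads per 5-tuple in sequence order.

    `segments` yields (five_tuple, seq, payload); this shape is hypothetical.
    """
    flows = defaultdict(dict)
    for key, seq, payload in segments:
        if payload:
            # Keep the first copy of each sequence number (drops retransmissions).
            flows[key].setdefault(seq, payload)

    streams = {}
    for key, by_seq in flows.items():
        data = bytearray()
        expected = None  # sequence number just past the bytes emitted so far
        for seq in sorted(by_seq):
            payload = by_seq[seq]
            end = seq + len(payload)
            if expected is not None and seq < expected:
                payload = payload[expected - seq:]  # trim the overlapping prefix
            data += payload
            expected = end if expected is None else max(expected, end)
        streams[key] = bytes(data)
    return streams
```

Out-of-order arrival, exact retransmissions, and partially overlapping segments all collapse to the same byte stream under this scheme.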

2. Protocol Recognition

Thirty-plus protocols are recognized by signature matching before structural inference begins: HTTP, TLS, DNS, SMTP, FTP, SSH, SMB, MySQL, Redis, Kafka, and others. For recognized protocols, field semantics are partially known in advance. For unrecognized protocols, all inference proceeds from first principles.
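
As an illustration of signature matching on the first bytes of a stream, the sketch below checks a few well-known protocol prefixes. The function name `recognize` and the choice of four protocols are assumptions made for this example; PCAPR's actual signature set is not shown here.

```python
HTTP_PREFIXES = (b"GET ", b"POST ", b"HEAD ", b"PUT ", b"DELETE ", b"HTTP/1.")


def recognize(first_bytes):
    """Return a protocol name for a stream's leading bytes, or None."""
    if first_bytes.startswith(HTTP_PREFIXES):
        return "HTTP"
    # TLS record header: content type 0x16 (handshake), major version 0x03.
    if len(first_bytes) >= 3 and first_bytes[0] == 0x16 and first_bytes[1] == 0x03:
        return "TLS"
    # SSH identification string begins with "SSH-".
    if first_bytes.startswith(b"SSH-"):
        return "SSH"
    # SMB over TCP: 4-byte session header, then \xffSMB (SMB1) or \xfeSMB (SMB2+).
    if first_bytes[4:8] in (b"\xffSMB", b"\xfeSMB"):
        return "SMB"
    return None  # unrecognized: fall through to structural inference
```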

3. Framing Inference

Given a reconstructed stream, PCAPR infers how the protocol segments messages:

Length-prefix: A fixed-position field contains the byte length of the following message. PCAPR searches for fields whose value predicts the distance to the next message boundary.
Delimiter-framed: Messages end at a recognizable byte sequence (CRLF, null byte, custom magic).
Fixed-frame: All messages are the same length.
Varint: The message length is encoded as a variable-length integer (protobuf-style) rather than a fixed-width field.
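
The length-prefix search can be sketched as a brute-force test of hypotheses: for each candidate (offset, width, byte order), walk the stream and accept the hypothesis only if the length values chain exactly to the end. The names below are hypothetical, and the sketch assumes the length counts the bytes that follow the length field; real protocols may also count the header or add a constant.

```python
def walk_length_prefix(stream, offset, width, byteorder):
    """Return message (start, end) pairs if the hypothesis consumes the stream exactly."""
    header_len = offset + width  # assumption: the payload starts right after the length field
    pos, bounds = 0, []
    while pos < len(stream):
        field_end = pos + header_len
        if field_end > len(stream):
            return None
        length = int.from_bytes(stream[pos + offset:field_end], byteorder)
        end = field_end + length
        if end > len(stream):
            return None
        bounds.append((pos, end))
        pos = end
    return bounds


def infer_length_prefix(stream, max_offset=4, min_messages=3):
    """List (offset, width, byteorder, message_count) hypotheses that fit the stream."""
    hits = []
    for offset in range(max_offset + 1):
        for width in (1, 2, 4):
            orders = ("big",) if width == 1 else ("big", "little")
            for byteorder in orders:
                bounds = walk_length_prefix(stream, offset, width, byteorder)
                if bounds and len(bounds) >= min_messages:
                    hits.append((offset, width, byteorder, len(bounds)))
    return hits
```

Requiring several chained messages (`min_messages`) keeps accidental matches rare, since a wrong hypothesis seldom lands on the stream end by chance.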

4. Field Classification

Within each message, byte positions are classified by their statistical behavior across all observed messages:

Field type	Inference criterion
Magic	Constant value at fixed offset across all messages
Length	Value predicts remaining payload size
Enum	Small bounded set of observed values (opcode-like)
Fixed	Low variance, not magic
Sequence	Monotonically increasing (counter, timestamp)
Echo	Matches a field in the preceding message (request/response correlation)
Flags	Bit-field structure (individual bits carry independent meaning)
Checksum	Value is a function of other fields

Per-opcode re-analysis is run after opcode fields are identified, because fields may have different types across message types that share the same framing.
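
A minimal sketch of per-offset classification, covering only three of the criteria in the table (magic, sequence, enum). The function name, the single-byte field granularity, and the `enum_max` cutoff are assumptions made for this example; the other field types need cross-message or cross-field tests that are omitted here.

```python
def classify_offsets(messages, header_len, enum_max=8):
    """Label each header byte offset by how its value behaves across messages."""
    labels = []
    for off in range(header_len):
        column = [m[off] for m in messages]
        distinct = set(column)
        if len(distinct) == 1:
            labels.append("magic")      # constant at a fixed offset
        elif all(b > a for a, b in zip(column, column[1:])):
            labels.append("sequence")   # strictly increasing across messages
        elif len(distinct) <= enum_max and len(column) >= 2 * len(distinct):
            labels.append("enum")       # small set of values, each seen repeatedly
        else:
            labels.append("unknown")
    return labels
```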

5. Session State Machine Inference

PCAPR models the session as a state machine over observed message sequences. States are defined by the enum field values; transitions are the observed orderings. The resulting state machine describes the session protocol structure as a directed graph.
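
The directed graph can be built by counting adjacent opcode pairs per session. In this sketch the synthetic `START` and `END` states and the function name are assumptions added for illustration.

```python
from collections import Counter


def infer_transitions(sessions):
    """Count observed state transitions.

    `sessions` is a list of opcode sequences, one per session.
    Returns a Counter mapping (from_state, to_state) to its occurrence count.
    """
    edges = Counter()
    for opcodes in sessions:
        path = ["START", *opcodes, "END"]
        edges.update(zip(path, path[1:]))
    return edges
```

Edge counts make it possible to separate the common path through the protocol from rare transitions, which are often the interesting ones in an investigation.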

Security Detection

Beaconing (analyze/beacon.py)

C2 beaconing is detected via the coefficient of variation (CV) of inter-message timing gaps:

CV = std(gaps) / mean(gaps)

CV < 0.3: highly regular (automated beaconing)
CV 0.3–0.6: possibly automated
CV > 0.6: likely human or jittered

Detection does not rely on a signature: the CV is computed from the observed timing of the specific pcap under analysis. Cobalt Strike default beaconing (60-second interval, no jitter) produces CV ≈ 0.02–0.05. Human-generated traffic produces CV > 1.0 in most cases.
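
A minimal sketch of the CV computation and the three bands above. The function names are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the formula does not say which estimator is used.

```python
import statistics


def beacon_cv(timestamps):
    """Coefficient of variation of inter-message gaps; None if too few messages."""
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return None
    mean = statistics.fmean(gaps)
    if mean <= 0:
        return None
    return statistics.pstdev(gaps) / mean


def classify_cv(cv):
    """Map a CV value onto the bands listed above."""
    if cv < 0.3:
        return "highly regular"
    if cv <= 0.6:
        return "possibly automated"
    return "likely human or jittered"
```

For example, timestamps at 0, 60, 120.5, 180, 241, and 300 seconds give gaps within one second of 60 and a CV of about 0.012, well inside the first band.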

DNS Tunneling (analyze/dns_tunnel.py)

DNS tunneling is scored by:

Subdomain entropy: Exfiltration via DNS encodes data in subdomains, producing high Shannon entropy labels compared to legitimate FQDNs.
Label length distribution: Tunneled subdomains are systematically longer than legitimate DNS queries.
Query rate: Exfiltration requires many queries in a short window.

The scoring is continuous: PCAPR reports a DNS tunnel score, not a binary flag.
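
The three features can be combined into a continuous score along the following lines. This is a sketch under stated assumptions, not PCAPR's scoring: the equal weighting, the normalization constants (5 bits, 63 characters, 10 queries per second), and the treatment of the last two labels as the registered domain are all hypothetical choices.

```python
import math
import statistics
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())


def tunnel_score(qnames, window_seconds):
    """Score in [0, 1] from subdomain entropy, subdomain length, and query rate."""
    # Assumption: the last two labels form the registered domain.
    subs = [".".join(q.rstrip(".").split(".")[:-2]) for q in qnames]
    subs = [s for s in subs if s]
    if not subs or window_seconds <= 0:
        return 0.0
    entropy = statistics.fmean(shannon_entropy(s.replace(".", "")) for s in subs)
    length = statistics.fmean(len(s) for s in subs)
    rate = len(subs) / window_seconds
    # Hypothetical normalization constants, each feature capped at 1.0.
    parts = (min(entropy / 5.0, 1.0), min(length / 63.0, 1.0), min(rate / 10.0, 1.0))
    return sum(parts) / 3
```

A production scorer would use a public suffix list instead of the two-label shortcut, since domains such as `example.co.uk` break that assumption.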

TLS Fingerprinting (analyze/tls_fp.py)

JA3 fingerprints are computed from TLS ClientHello fields: TLS version, cipher suites, extensions, elliptic curves, elliptic curve point formats. The resulting MD5 hash characterizes the TLS client library and its configuration. Distinct clients can share a hash, so a match is an indicator rather than proof of identity.
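
Given already-parsed ClientHello fields, the JA3 string and hash can be computed as sketched below. The function signature is hypothetical and parsing the ClientHello itself is out of scope. The string layout (five comma-separated fields, decimal values joined by dashes, GREASE values dropped) follows the published JA3 method.

```python
import hashlib

# GREASE placeholder values 0x0a0a, 0x1a1a, ..., 0xfafa are excluded from JA3.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3(version, ciphers, extensions, curves, point_formats):
    """Return (ja3_string, md5_hex) for parsed ClientHello fields."""
    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    ja3_string = ",".join(
        [str(version), join(ciphers), join(extensions), join(curves), join(point_formats)]
    )
    digest = hashlib.md5(ja3_string.encode("ascii")).hexdigest()
    return ja3_string, digest
```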

Known-bad JA3 hashes are matched against a curated list including:

Cobalt Strike default (multiple variants)
Metasploit Meterpreter
Common commodity RAT families

TLS decryption is supported via SSLKEYLOGFILE when the key material is available (analyze/tls_decrypt.py).

XOR Key Recovery (analyze/xor_recover.py)

Single-byte and multi-byte XOR obfuscation is detected and reversed via index of coincidence (IC) analysis:

For each candidate key length k (1–16 bytes), split the payload into k interleaved subsequences.
Compute IC for each subsequence. If all IC values are close to the expected IC of plaintext for the relevant language/protocol, k is a candidate key length.
For confirmed key lengths, recover each key byte by frequency analysis.
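
The three steps above can be sketched as follows. The function names, the IC threshold of 0.04, and the assumption that the most frequent plaintext byte is a space (0x20) are hypothetical choices suited to text-like payloads; binary protocols would need a different expected-frequency model.

```python
from collections import Counter
from itertools import cycle


def index_of_coincidence(data):
    """Probability that two bytes drawn without replacement are equal."""
    n = len(data)
    if n < 2:
        return 0.0
    return sum(c * (c - 1) for c in Counter(data).values()) / (n * (n - 1))


def guess_key_length(payload, max_len=16, threshold=0.04):
    """Smallest k for which every interleaved subsequence looks like plaintext."""
    for k in range(1, max_len + 1):
        columns = [payload[i::k] for i in range(k)]
        if all(index_of_coincidence(col) >= threshold for col in columns):
            return k
    return None


def recover_key(payload, key_len, common_byte=0x20):
    """Per column, assume the most frequent ciphertext byte encrypts `common_byte`."""
    return bytes(
        Counter(payload[i::key_len]).most_common(1)[0][0] ^ common_byte
        for i in range(key_len)
    )


def xor(data, key):
    """Apply a repeating-key XOR (encrypts and decrypts)."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))
```

Uniformly random bytes have an IC near 1/256, so a threshold well above that separates correctly aligned columns from misaligned ones. Returning the smallest passing k avoids reporting multiples of the true key length.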

ECB Detection (analyze/ecb_detect.py)

AES and 3DES in ECB mode produce repeated 16-byte (AES) or 8-byte (3DES) ciphertext blocks when encrypting repeated plaintext blocks. PCAPR scans payload bytes for block repetitions at block-size boundaries and reports the repetition rate. ECB in network traffic typically indicates a custom crypto implementation.
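
A minimal sketch of the block repetition scan. The function name is hypothetical, and the definition of the rate used here (the fraction of aligned blocks whose value occurs more than once) is an assumption about how the repetition rate is measured.

```python
from collections import Counter


def ecb_repetition_rate(payload, block_size=16):
    """Fraction of aligned blocks that occur more than once in the payload."""
    blocks = [
        payload[i:i + block_size]
        for i in range(0, len(payload) - block_size + 1, block_size)
    ]
    if not blocks:
        return 0.0
    counts = Counter(blocks)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(blocks)
```

Calling it with `block_size=8` covers the 3DES case. For ciphertext from a chained or counter mode the rate is expected to be zero for any realistic payload size.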

Sensitive Data Detection (analyze/sensitive.py)

Cleartext pattern matching for credentials, PII, and secrets: usernames/passwords in HTTP auth headers, credit card PANs (Luhn-valid 13–19 digit sequences), US SSNs, API key patterns, private key headers.
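
The PAN check can be sketched as a digit-run match followed by Luhn validation. The regular expression and function names are assumptions made for this example; the Luhn algorithm itself is standard.

```python
import re

# Runs of 13 to 19 digits that are not part of a longer digit run.
PAN_RE = re.compile(r"(?<!\d)\d{13,19}(?!\d)")


def luhn_valid(digits):
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pans(text):
    """Return Luhn-valid 13-19 digit sequences found in cleartext."""
    return [m.group() for m in PAN_RE.finditer(text) if luhn_valid(m.group())]
```

The Luhn filter discards about nine in ten random digit runs, which removes most false positives from order numbers and timestamps.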

Malware Family Attribution (analyze/malware_cluster.py)

Flows are clustered by behavioral profile (timing, entropy, field patterns) and matched against known malware families via behavioral signatures. Attribution confidence is provided alongside the family name.

Output Formats

Format	Command flag	Notes
Terminal	(default)	Rich color output with anomaly cards
HTML	`--html`	Self-contained, shareable report
JSON	`--json`	Machine-readable, pipeline-friendly
Wireshark Lua	`--wireshark`	Dissector plugin for the observed protocol
Kaitai Struct	`--kaitai`	`.ksy` definition for binary format tooling
Scapy layer	`--scapy`	Python Packet subclass for scripting
Snort/Suricata	`--ids-rules`	Detection rules from observed signatures
boofuzz/AFL	`--fuzz-dict`	Mutation dictionary for fuzzing follow-up
Python client	`--client`	Standalone client for the observed protocol

Additional Modes

pcapr-diff: Compare two captures (baseline vs. suspicious). Reports fields and timing patterns that changed between captures.

pcapr-replay: Replay a captured session with dynamic field patching (modify specific fields while preserving framing).

pcapr-batch: Process many pcap files and produce a consolidated index.html.

pcapr-serve: Drag-and-drop web UI with an integrated Academy curriculum (interactive learning mode with 11 challenge scenarios).

--learn mode: Plain-English annotation of every analysis section, suitable for analysts building protocol analysis skills.
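
For pcapr-replay, "preserving framing" means that a patched message must still parse under the inferred framing. The sketch below shows the idea for a length-prefixed message: replace the payload and rewrite the length field to match. The function name and the default layout (a 2-byte big-endian length at offset 1) are hypothetical; in practice the layout would come from the framing inference step.

```python
def patch_payload(message, new_payload, length_offset=1, length_width=2, byteorder="big"):
    """Swap the payload of a length-prefixed message and fix up the length field."""
    header_len = length_offset + length_width
    header = bytearray(message[:header_len])
    # Recompute the length so the receiver's framing stays consistent.
    header[length_offset:header_len] = len(new_payload).to_bytes(length_width, byteorder)
    return bytes(header) + new_payload
```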

Integration Position

PCAPR is the investigation-layer tool in the minidet detection stack. It does not run on every flow; it runs only when the correlator has identified a suspicious destination IP and opens a formal investigation. The InvestigationWorker slices the relevant flows from the pcap and passes them to PCAPR; the resulting structured report is emitted as an investigation_report EVE JSON event into the SIEM stream.

Standalone use: pcapr capture.pcap --html report.html --learn