Network Layer
The network layer builds the communication graph, detects communities, and computes node activity history. Functions operate on in-memory values (typically a DataFrame and a CorpusConfig); none take file paths or depend on global state.
Address Parsing
DiscoveryGraph.extract_addrs — Function
extract_addrs(s) -> Vector{String}
Parse a stringified email address list and return individual addresses.
Handles two common storage formats:
- Python list literals, such as "['a@corp.com', 'b@corp.com']": RFC-5321 regex extraction is tried first.
- Bare comma-separated strings, such as "a@corp.com, b@corp.com": falls back to comma splitting with quote stripping.
Returns an empty vector for missing or empty input.
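The two-stage parsing described above can be sketched as follows. This is a minimal illustration, not the DiscoveryGraph source: the name extract_addrs_sketch and the simplified regex ADDR_RE are assumptions made for the example, and the real pattern may be stricter.

```julia
# Hypothetical, simplified address pattern (an assumption for this sketch).
const ADDR_RE = r"[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}"

function extract_addrs_sketch(s)
    # Missing or blank input yields an empty vector.
    (ismissing(s) || isempty(strip(s))) && return String[]

    # First attempt: pull every address-shaped substring out with the regex.
    hits = String[m.match for m in eachmatch(ADDR_RE, s)]
    isempty(hits) || return hits

    # Fallback: split on commas, then strip whitespace, quotes, and brackets.
    parts = [strip(p, [' ', '\'', '"', '[', ']']) for p in split(s, ',')]
    return String[p for p in parts if !isempty(p)]
end

extract_addrs_sketch("['alice@corp.com', 'bob@corp.com']")
```

Trying the regex first means list punctuation never has to be parsed; the comma split only runs when nothing address-shaped was found.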
Arguments
- s: A string (or missing) containing one or more email addresses.
Returns
Vector{String} of individual email addresses, or String[] if none are found.
Example
extract_addrs("['alice@corp.com', 'bob@corp.com']")
# => ["alice@corp.com", "bob@corp.com"]
extract_addrs(missing)
# => String[]
Bot Detection
DiscoveryGraph.is_bot — Function
is_bot(address::AbstractString, cfg::CorpusConfig) -> Bool
Return true if address is a broadcast sender or system account that should be excluded from the communication network.
Matching proceeds in order:
- Exact membership in cfg.bot_senders.
- Any pattern in cfg.bot_patterns matches via occursin.
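As a rough sketch of that matching order, the snippet below uses a hypothetical BotRules struct in place of CorpusConfig. The field types (a Set of strings and a vector of Regex) and the sample patterns are assumptions for illustration; the actual configuration may store them differently.

```julia
# Hypothetical stand-in for the two CorpusConfig fields used by is_bot.
struct BotRules
    bot_senders::Set{String}
    bot_patterns::Vector{Regex}
end

function is_bot_sketch(address::AbstractString, rules::BotRules)
    # 1. Exact membership in the explicit sender set.
    address in rules.bot_senders && return true
    # 2. Otherwise, flag the address if any pattern occurs in it.
    return any(p -> occursin(p, address), rules.bot_patterns)
end

# Illustrative rules only; not the values shipped with enron_config().
rules = BotRules(Set(["newsletter@corp.com"]), [r"mailer-daemon"i, r"^no-?reply@"i])
is_bot_sketch("mailer-daemon@corp.com", rules)
```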
Arguments
- address::AbstractString: The email address to test.
- cfg::CorpusConfig: Configuration carrying bot_senders and bot_patterns.
Returns
true if the address matches any bot criterion, false otherwise.
Example
cfg = enron_config()
is_bot("mailer-daemon@corp.com", cfg) # => true
is_bot("alice@corp.com", cfg) # => false
DiscoveryGraph.identify_bots — Function
identify_bots(senders::Vector{String}, cfg::CorpusConfig) -> DataFrame
Classify a vector of sender addresses as bot or non-bot and return a summary table.
For each address, records whether it matched a bot criterion and, if so, which pattern or "(explicit)" for direct membership in cfg.bot_senders.
Arguments
- senders::Vector{String}: Sender addresses to classify.
- cfg::CorpusConfig: Configuration carrying bot_senders and bot_patterns.
Returns
DataFrame with columns:
- :sender::String — the address.
- :is_bot::Bool — true if the address was flagged.
- :matched_pattern::String — the matching pattern string, "(explicit)" for exact-set matches, or "" if not flagged.
Example
cfg = enron_config()
senders = ["alice@corp.com", "mailer-daemon@corp.com"]
result = identify_bots(senders, cfg)
# result.is_bot == [false, true]
Edge Construction
DiscoveryGraph.build_edges — Function
build_edges(df::DataFrame, cfg::CorpusConfig) -> DataFrame
Build a broadcast-discounted edge table from a corpus DataFrame.
For each message, the function:
- Skips rows where the sender is empty, a bot, or a garbage address.
- Parses To and CC recipient lists via extract_addrs.
- Skips messages with no valid recipients.
- Computes an edge weight using cfg.broadcast_discount(n), where n is the total recipient count (default: 1/log(n+2), natural log). The weight shrinks as the recipient list grows (about 0.14 for 1,000 recipients), while a one-to-one message gets a weight of ≈ 0.91.
- Emits one row per (sender, recipient) pair, filtering out bot and garbage recipients.
- When cfg.internal_domain is non-empty, restricts output to edges where both sender and recipient belong to that domain.
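To make the weighting step concrete, here is a small sketch of the default discount applied to a single message. The helper names default_discount and message_edges are hypothetical; bot, garbage, and domain filtering are left out.

```julia
# Default discount as described above: 1 / log(n + 2), natural logarithm.
default_discount(n::Integer) = 1 / log(n + 2)

# Emit one (sender, recipient, weight) row per recipient of a single message.
# Every recipient of the same message shares the same discounted weight.
function message_edges(sender::String, recipients::Vector{String};
                       discount = default_discount)
    w = discount(length(recipients))
    return [(sender = sender, recipient = r, weight = w) for r in recipients]
end

edges = message_edges("alice@corp.com", ["bob@corp.com", "carol@corp.com"])
round(default_discount(1), digits = 2)  # one-to-one message
```

Because the discount is logarithmic, the weight falls off slowly: large distribution lists are down-weighted but never reach zero.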
Arguments
- df::DataFrame: Corpus with columns named according to cfg.
- cfg::CorpusConfig: Configuration supplying column names, domain filter, bot rules, and discount function.
Returns
DataFrame with columns:
- :hash::String — message identifier.
- :sender::String — sender address.
- :recipient::String — recipient address.
- :date::DateTime — message timestamp.
- :weight::Float64 — broadcast-discounted edge weight.
Example
cfg = enron_config()
edges = build_edges(corpus, cfg)
Community Detection
DiscoveryGraph.build_snapshot_graph — Function
build_snapshot_graph(edges::DataFrame, node_idx::Dict{String,Int}, n::Int) -> SimpleWeightedGraph
Construct a weighted undirected graph from an edge table for a single time snapshot.
Each unique (sender, recipient) pair in edges becomes an undirected edge with the corresponding weight. Nodes are identified by their integer index in node_idx; any address not present in node_idx is silently skipped.
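The index lookup and silent skipping can be illustrated without any graph package by filling a symmetric weight matrix. This is only a sketch: adjacency_sketch is a hypothetical helper, edge rows are plain tuples rather than DataFrame rows, and letting a later row overwrite an earlier weight for the same pair is an assumption, since how repeated pairs are combined is not stated here.

```julia
# Edge rows as (sender, recipient, weight) tuples; a stand-in for the edge table.
function adjacency_sketch(rows, node_idx::Dict{String,Int}, n::Int)
    W = zeros(Float64, n, n)
    for (s, r, w) in rows
        # Skip rows whose endpoints are not both present in the index.
        (haskey(node_idx, s) && haskey(node_idx, r)) || continue
        i, j = node_idx[s], node_idx[r]
        # Undirected: store the weight symmetrically.
        W[i, j] = w
        W[j, i] = w
    end
    return W
end

nodes = ["a@corp.com", "b@corp.com", "c@corp.com"]
node_idx = Dict(n => i for (i, n) in enumerate(nodes))
rows = [("a@corp.com", "b@corp.com", 0.5), ("b@corp.com", "unknown@else.com", 1.0)]
W = adjacency_sketch(rows, node_idx, length(nodes))
```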
Arguments
- edges::DataFrame: Edge table with columns :sender, :recipient, and :weight.
- node_idx::Dict{String,Int}: Mapping from node address to 1-based integer index.
- n::Int: Total number of nodes (size of the graph).
Returns
A SimpleWeightedGraph with n vertices and one edge per row of edges that has both endpoints in node_idx.
Example
nodes = unique(vcat(edges.sender, edges.recipient))
node_idx = Dict(n => i for (i, n) in enumerate(nodes))
g = build_snapshot_graph(edges, node_idx, length(nodes))
DiscoveryGraph.leiden_communities — Function
leiden_communities(g::SimpleWeightedGraph, node_labels::Vector{String};
                   resolution=1.0, n_iterations=10, seed=42) -> DataFrame
Detect communities in a weighted graph using the Leiden algorithm.
Calls the Python leidenalg library (via PythonCall) with the RBConfigurationVertexPartition objective, which supports weighted edges and a resolution parameter. Community IDs are 1-based integers in the output (Python's 0-based membership is incremented).
Results are non-deterministic across fresh runs even with the same seed because Leiden's refinement phase uses a randomised order; use match_communities to track identity across snapshots.
Arguments
- g::SimpleWeightedGraph: The graph to partition.
- node_labels::Vector{String}: Address label for each vertex (length must equal nv(g)).
- resolution: Resolution parameter controlling community granularity. Higher values yield more, smaller communities (default: 1.0).
- n_iterations: Number of Leiden iterations (default: 10).
- seed: Random seed for reproducibility within a single run (default: 42).
Returns
DataFrame with columns:
- :node::String — node address.
- :community_id::Int32 — 1-based community membership.
Example
result = leiden_communities(g, nodes; resolution=1.0)
DiscoveryGraph.jaccard — Function
jaccard(a::Set, b::Set) -> Float64
Compute the Jaccard similarity between two sets.
Jaccard similarity is |a ∩ b| / |a ∪ b|. Returns 1.0 when both sets are empty (identical empty sets are treated as perfectly similar).
Arguments
- a::Set: First set.
- b::Set: Second set.
Returns
A Float64 in [0.0, 1.0].
Example
jaccard(Set([1,2,3]), Set([2,3,4])) # => 0.5
jaccard(Set{Int}(), Set{Int}()) # => 1.0
DiscoveryGraph.build_kernel — Function
build_kernel(members::Vector{String}, weekly_snapshots::Vector{DataFrame};
             threshold=2/3) -> Set{String}
Identify the stable core ("kernel") of a community across weekly snapshots.
A node is a kernel member if it appears in at least threshold fraction of the provided weekly snapshot DataFrames. Kernel membership is used by match_communities to produce stable community IDs across Leiden's non-deterministic reassignments.
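The threshold rule can be sketched in a few lines. In this illustration each weekly snapshot is represented by a plain vector of node addresses, standing in for the :node column of a DataFrame; kernel_sketch is a hypothetical name, not the package function.

```julia
# Each element of weekly_nodes stands in for one snapshot's :node column.
function kernel_sketch(members::Vector{String},
                       weekly_nodes::Vector{Vector{String}};
                       threshold = 2/3)
    # No snapshots means no evidence of stability: return an empty kernel.
    isempty(weekly_nodes) && return Set{String}()
    week_sets = [Set(w) for w in weekly_nodes]
    kernel = Set{String}()
    for m in members
        # Fraction of weeks in which this member is present.
        frac = count(ws -> m in ws, week_sets) / length(week_sets)
        frac >= threshold && push!(kernel, m)
    end
    return kernel
end

weeks = [["a", "b"], ["a"], ["a", "b", "c"]]
kernel_sketch(["a", "b", "c"], weeks)  # "a" in 3/3 weeks, "b" in 2/3, "c" in 1/3
```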
Arguments
- members::Vector{String}: Candidate community members (typically from the baseline run).
- weekly_snapshots::Vector{DataFrame}: Weekly node-membership DataFrames, each with a :node column.
- threshold: Minimum fraction of weeks a node must appear in to be a kernel member (default: 2/3).
Returns
Set{String} of addresses that cleared the threshold. Returns an empty set when weekly_snapshots is empty.
Example
kernel = build_kernel(community_members, snapshots; threshold=2/3)
DiscoveryGraph.match_communities — Function
match_communities(prior_kernels::Dict{Int32,Set{String}},
                  current_kernels::Dict{Int32,Set{String}};
                  min_jaccard=0.6) -> Dict{Int32,Int32}
Match current community IDs to prior community IDs using kernel Jaccard similarity.
Because Leiden assigns community IDs non-deterministically, this function provides stable identity tracking across weekly snapshots. All candidate (current, prior) pairs with Jaccard similarity ≥ min_jaccard are scored, sorted descending, and greedily assigned (each community ID used at most once on either side).
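A minimal sketch of the score, sort, and greedy-assign procedure is shown below. The names jac and match_sketch are hypothetical, and the order in which ties are resolved is an assumption, since tie-breaking is not specified here.

```julia
# Jaccard similarity; two empty sets count as identical.
jac(a::Set, b::Set) =
    (isempty(a) && isempty(b)) ? 1.0 : length(intersect(a, b)) / length(union(a, b))

function match_sketch(prior::Dict{Int32,Set{String}},
                      current::Dict{Int32,Set{String}};
                      min_jaccard = 0.6)
    # Score every (current, prior) pair that clears the threshold.
    scored = Tuple{Float64,Int32,Int32}[]
    for (cid, ck) in current, (pid, pk) in prior
        s = jac(ck, pk)
        s >= min_jaccard && push!(scored, (s, cid, pid))
    end
    # Highest similarity first.
    sort!(scored; by = first, rev = true)

    mapping = Dict{Int32,Int32}()
    used_prior = Set{Int32}()
    for (_, cid, pid) in scored
        # Each ID may be used at most once on either side.
        (haskey(mapping, cid) || pid in used_prior) && continue
        mapping[cid] = pid
        push!(used_prior, pid)
    end
    return mapping
end

prior = Dict{Int32,Set{String}}(Int32(1) => Set(["a", "b", "c"]),
                                Int32(2) => Set(["x", "y"]))
current = Dict{Int32,Set{String}}(Int32(7) => Set(["a", "b", "c", "d"]),
                                  Int32(8) => Set(["q"]))
m = match_sketch(prior, current)  # community 7 overlaps prior 1 with Jaccard 0.75
```

Greedy assignment is simple and usually sufficient when kernels are well separated; it does not guarantee the assignment with the highest total similarity.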
Arguments
- prior_kernels::Dict{Int32,Set{String}}: Kernel sets from the previous snapshot, keyed by community ID.
- current_kernels::Dict{Int32,Set{String}}: Kernel sets from the current snapshot, keyed by community ID.
- min_jaccard: Minimum Jaccard threshold to consider a pair a match (default: 0.6).
Returns
Dict{Int32,Int32} mapping current_community_id => prior_community_id for all matched pairs. Unmatched current communities are absent from the dict.
Example
mapping = match_communities(prior_kernels, current_kernels; min_jaccard=0.6)
# mapping[current_id] == prior_id
Node History
DiscoveryGraph.build_node_history — Function
build_node_history(edges::DataFrame, cfg::CorpusConfig) -> DataFrame
Build a weekly activity time series for every sender in the edge table.
For each sender and each calendar week (Monday-aligned) in which they sent at least one message, the function computes:
- message_count: total outgoing edges (one per recipient, including duplicates).
- recipient_count: number of distinct recipients contacted.
- entropy: Shannon entropy of the recipient frequency distribution (nats), measuring how broadly the sender distributed messages across recipients.
Weeks beyond cfg.corpus_end are excluded. Returns an empty DataFrame with the correct schema when edges is empty.
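The three per-week metrics can be sketched for a single sender and week as follows. The helpers week_start and recipient_entropy are hypothetical names, and using firstdayofweek from the Dates standard library for Monday alignment is an assumption about how the alignment could be done, not a statement of the package's implementation.

```julia
using Dates

# Monday of the calendar week containing d.
week_start(d::Date) = firstdayofweek(d)

# Shannon entropy (nats) of the recipient frequency distribution.
function recipient_entropy(recipients::Vector{String})
    counts = Dict{String,Int}()
    for r in recipients
        counts[r] = get(counts, r, 0) + 1
    end
    total = length(recipients)
    # Sum of -p * log(p) over distinct recipients; 0.0 for empty input.
    return sum((-(c / total) * log(c / total) for c in values(counts)); init = 0.0)
end

# One sender's outgoing edges within a single week (illustrative data).
recipients = ["bob@corp.com", "bob@corp.com", "carol@corp.com", "carol@corp.com"]
message_count = length(recipients)            # one per edge, duplicates included
recipient_count = length(unique(recipients))  # distinct recipients
entropy = recipient_entropy(recipients)       # two equally likely recipients: log(2)
week_start(Date(2001, 5, 16))                 # a Wednesday, aligned back to Monday
```

Entropy is 0 when every message goes to the same recipient and grows as messages are spread more evenly across more recipients.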
Arguments
- edges::DataFrame: Edge table from build_edges, with columns :sender, :recipient, and :date.
- cfg::CorpusConfig: Configuration supplying corpus_end for the date filter.
Returns
DataFrame with columns:
- :node::String — sender address.
- :week_start::Date — Monday of the calendar week.
- :message_count::Int — outgoing edge count for that week.
- :recipient_count::Int — distinct recipient count for that week.
- :entropy::Float64 — Shannon entropy of recipient distribution (nats).
Example
history = build_node_history(edges, cfg)