Network Layer

The network layer builds the communication graph, detects communities, and computes node activity history. Every function takes its inputs as arguments (typically a DataFrame and a CorpusConfig); none reads file paths or relies on global state.

Address Parsing

DiscoveryGraph.extract_addrs (Function)
extract_addrs(s) -> Vector{String}

Parse a stringified email address list and return individual addresses.

Handles two common storage formats:

  1. Python list literals: "['a@corp.com', 'b@corp.com']" — RFC-5321 regex extraction is tried first.
  2. Bare comma-separated strings: "a@corp.com, b@corp.com" — falls back to comma splitting with quote stripping.

Returns an empty vector for missing or empty input.
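
The two-stage parse described above can be sketched roughly as follows. This is an illustration only: the name extract_addrs_sketch and the simplified pattern ADDR_RE_SKETCH are assumptions, not the package's actual regex or implementation.

```julia
# Hypothetical, simplified address pattern (an assumption for this sketch).
const ADDR_RE_SKETCH = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

function extract_addrs_sketch(s)
    # Missing or blank input yields an empty vector.
    (ismissing(s) || isempty(strip(s))) && return String[]

    # Stage 1: regex extraction, which copes with Python list literals.
    hits = String[m.match for m in eachmatch(ADDR_RE_SKETCH, s)]
    isempty(hits) || return hits

    # Stage 2: fall back to comma splitting, stripping quotes and spaces.
    parts = [strip(p, ['\'', '"', ' ']) for p in split(s, ',')]
    return String[p for p in parts if !isempty(p)]
end

extract_addrs_sketch("['alice@corp.com', 'bob@corp.com']")
extract_addrs_sketch("a@corp.com, b@corp.com")
```

Trying the regex first means surrounding brackets and quotes never need special handling; the comma split only runs when no address-shaped token is found.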

Arguments

  • s: A string (or missing) containing one or more email addresses.

Returns

Vector{String} of individual email addresses, or String[] if none are found.

Example

extract_addrs("['alice@corp.com', 'bob@corp.com']")
# => ["alice@corp.com", "bob@corp.com"]

extract_addrs(missing)
# => String[]

Bot Detection

DiscoveryGraph.is_bot (Function)
is_bot(address::AbstractString, cfg::CorpusConfig) -> Bool

Return true if address is a broadcast sender or system account that should be excluded from the communication network.

Matching proceeds in order:

  1. Exact membership in cfg.bot_senders.
  2. Any pattern in cfg.bot_patterns matches via occursin.
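
A minimal sketch of that matching order, assuming a stand-in configuration. The NamedTuple demo_cfg, its sample patterns, and the name is_bot_sketch are hypothetical; the real CorpusConfig may store these fields differently.

```julia
# Hypothetical stand-in for CorpusConfig, carrying only the two fields used here.
demo_cfg = (bot_senders  = Set(["mailer-daemon@corp.com"]),
            bot_patterns = [r"^no-?reply@", r"newsletter"])

function is_bot_sketch(address::AbstractString, cfg)
    # 1. Exact membership in the explicit sender set.
    address in cfg.bot_senders && return true
    # 2. Otherwise, flag the address if any pattern occurs in it.
    #    occursin accepts both Regex and plain-string patterns.
    return any(p -> occursin(p, address), cfg.bot_patterns)
end

is_bot_sketch("noreply@corp.com", demo_cfg)
is_bot_sketch("alice@corp.com", demo_cfg)
```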

Arguments

  • address::AbstractString: The email address to test.
  • cfg::CorpusConfig: Configuration carrying bot_senders and bot_patterns.

Returns

true if the address matches any bot criterion, false otherwise.

Example

cfg = enron_config()
is_bot("mailer-daemon@corp.com", cfg)  # => true
is_bot("alice@corp.com", cfg)          # => false

DiscoveryGraph.identify_bots (Function)
identify_bots(senders::Vector{String}, cfg::CorpusConfig) -> DataFrame

Classify a vector of sender addresses as bot or non-bot and return a summary table.

For each address, records whether it matched a bot criterion and, if so, which one: the matching pattern, or "(explicit)" for direct membership in cfg.bot_senders.
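
As an illustration of how the matched criterion might be recorded, the sketch below classifies one address at a time and returns a row-like NamedTuple. The helper classify_sender_sketch and the plain-string patterns are assumptions; the package itself returns a DataFrame.

```julia
function classify_sender_sketch(address, bot_senders, bot_patterns)
    # Explicit membership takes precedence and is labelled "(explicit)".
    if address in bot_senders
        return (sender = address, is_bot = true, matched_pattern = "(explicit)")
    end
    # Otherwise report the first pattern that occurs in the address.
    for p in bot_patterns
        if occursin(p, address)
            return (sender = address, is_bot = true, matched_pattern = string(p))
        end
    end
    # No criterion matched: empty pattern string.
    return (sender = address, is_bot = false, matched_pattern = "")
end

bot_senders  = Set(["mailer-daemon@corp.com"])   # hypothetical sample data
bot_patterns = ["noreply", "newsletter"]          # hypothetical sample data
senders      = ["alice@corp.com", "mailer-daemon@corp.com", "noreply@corp.com"]

rows = [classify_sender_sketch(s, bot_senders, bot_patterns) for s in senders]
```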

Arguments

  • senders::Vector{String}: Sender addresses to classify.
  • cfg::CorpusConfig: Configuration carrying bot_senders and bot_patterns.

Returns

DataFrame with columns:

  • :sender::String — the address.
  • :is_bot::Bool — true if the address was flagged.
  • :matched_pattern::String — the matching pattern string, "(explicit)" for exact-set matches, or "" if not flagged.

Example

cfg     = enron_config()
senders = ["alice@corp.com", "mailer-daemon@corp.com"]
result  = identify_bots(senders, cfg)
# result.is_bot == [false, true]

Edge Construction

DiscoveryGraph.build_edges (Function)
build_edges(df::DataFrame, cfg::CorpusConfig) -> DataFrame

Build a broadcast-discounted edge table from a corpus DataFrame.

For each message, the function:

  1. Skips rows where the sender is empty, a bot, or a garbage address.
  2. Parses To and CC recipient lists via extract_addrs.
  3. Skips messages with no valid recipients.
  4. Computes an edge weight using cfg.broadcast_discount(n), where n is the total recipient count. With the default, 1/log(n+2) (natural log), a one-to-one message weighs ≈ 0.91 and the weight decays toward zero as n grows (≈ 0.22 at n = 100).
  5. Emits one row per (sender, recipient) pair, filtering out bot and garbage recipients.
  6. When cfg.internal_domain is non-empty, restricts output to edges where both sender and recipient belong to that domain.
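
The weighting step can be illustrated with a small sketch. The names default_discount_sketch and message_edges_sketch are hypothetical, and bot, garbage, and domain filtering are left out to keep the example short.

```julia
# Default discount as described above; Julia's `log` is the natural logarithm.
default_discount_sketch(n) = 1 / log(n + 2)

# Hypothetical helper: one (sender, recipient, weight) row per recipient of a message.
function message_edges_sketch(sender, recipients; discount = default_discount_sketch)
    w = discount(length(recipients))   # n counts every parsed recipient (To plus CC)
    # An empty recipient list produces no rows.
    return [(sender = sender, recipient = r, weight = w) for r in recipients]
end

default_discount_sketch(1)     # one-to-one message, about 0.91
default_discount_sketch(100)   # 100 recipients, about 0.22
edges_sketch = message_edges_sketch("alice@corp.com", ["bob@corp.com", "carol@corp.com"])
```

Every edge from the same message shares one weight, so a message to two recipients contributes two rows at 1/log(4), roughly 0.72 each.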

Arguments

  • df::DataFrame: Corpus with columns named according to cfg.
  • cfg::CorpusConfig: Configuration supplying column names, domain filter, bot rules, and discount function.

Returns

DataFrame with columns:

  • :hash::String — message identifier.
  • :sender::String — sender address.
  • :recipient::String — recipient address.
  • :date::DateTime — message timestamp.
  • :weight::Float64 — broadcast-discounted edge weight.

Example

cfg   = enron_config()
edges = build_edges(corpus, cfg)

Community Detection

DiscoveryGraph.build_snapshot_graph (Function)
build_snapshot_graph(edges::DataFrame, node_idx::Dict{String,Int}, n::Int) -> SimpleWeightedGraph

Construct a weighted undirected graph from an edge table for a single time snapshot.

Each unique (sender, recipient) pair in edges becomes an undirected edge with the corresponding weight. Nodes are identified by their integer index in node_idx; any address not present in node_idx is silently skipped.
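
To show the index lookup and the silent skip without depending on a graph package, the sketch below fills a symmetric weight matrix instead of a SimpleWeightedGraph. The name snapshot_matrix_sketch is hypothetical, and summing the weights of repeated pairs is an assumption; the package may combine duplicates differently.

```julia
function snapshot_matrix_sketch(edges, node_idx::Dict{String,Int}, n::Int)
    W = zeros(Float64, n, n)
    for e in edges
        i = get(node_idx, e.sender, 0)
        j = get(node_idx, e.recipient, 0)
        # Addresses missing from node_idx are skipped silently.
        (i == 0 || j == 0) && continue
        W[i, j] += e.weight          # assumption: repeated pairs are summed
        if i != j
            W[j, i] += e.weight      # mirror the entry so the matrix stays symmetric
        end
    end
    return W
end

demo_edges = [(sender = "a", recipient = "b", weight = 0.5),
              (sender = "b", recipient = "a", weight = 0.25),
              (sender = "a", recipient = "z", weight = 1.0)]   # "z" is not indexed
demo_idx = Dict("a" => 1, "b" => 2)
W = snapshot_matrix_sketch(demo_edges, demo_idx, 2)
```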

Arguments

  • edges::DataFrame: Edge table with columns :sender, :recipient, and :weight.
  • node_idx::Dict{String,Int}: Mapping from node address to 1-based integer index.
  • n::Int: Total number of nodes (size of the graph).

Returns

A SimpleWeightedGraph with n vertices and one undirected edge for each (sender, recipient) pair in edges whose endpoints are both in node_idx.

Example

nodes    = unique(vcat(edges.sender, edges.recipient))
node_idx = Dict(n => i for (i, n) in enumerate(nodes))
g = build_snapshot_graph(edges, node_idx, length(nodes))

DiscoveryGraph.leiden_communities (Function)
leiden_communities(g::SimpleWeightedGraph, node_labels::Vector{String};
                   resolution=1.0, n_iterations=10, seed=42) -> DataFrame

Detect communities in a weighted graph using the Leiden algorithm.

Calls the Python leidenalg library (via PythonCall) with the RBConfigurationVertexPartition objective, which supports weighted edges and a resolution parameter. Community IDs are 1-based integers in the output (Python's 0-based membership is incremented).

Results are non-deterministic across fresh runs even with the same seed because Leiden's refinement phase uses a randomised order; use match_communities to track identity across snapshots.

Arguments

  • g::SimpleWeightedGraph: The graph to partition.
  • node_labels::Vector{String}: Address label for each vertex (length must equal nv(g)).
  • resolution: Resolution parameter controlling community granularity. Higher values yield more, smaller communities (default: 1.0).
  • n_iterations: Number of Leiden iterations (default: 10).
  • seed: Random seed for reproducibility within a single run (default: 42).

Returns

DataFrame with columns:

  • :node::String — node address.
  • :community_id::Int32 — 1-based community membership.

Example

result = leiden_communities(g, nodes; resolution=1.0)

DiscoveryGraph.jaccard (Function)
jaccard(a::Set, b::Set) -> Float64

Compute the Jaccard similarity between two sets.

Jaccard similarity is |a ∩ b| / |a ∪ b|. Returns 1.0 when both sets are empty (identical empty sets are treated as perfectly similar).
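
A minimal sketch of the formula and its empty-set convention, using only Base set operations. The name jaccard_sketch is hypothetical.

```julia
function jaccard_sketch(a::Set, b::Set)
    # Two empty sets are treated as identical, which also avoids dividing by zero.
    (isempty(a) && isempty(b)) && return 1.0
    return length(intersect(a, b)) / length(union(a, b))
end

jaccard_sketch(Set([1, 2, 3]), Set([2, 3, 4]))   # 2 shared of 4 total
jaccard_sketch(Set([1]), Set([2]))               # disjoint sets
```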

Arguments

  • a::Set: First set.
  • b::Set: Second set.

Returns

A Float64 in [0.0, 1.0].

Example

jaccard(Set([1,2,3]), Set([2,3,4]))  # => 0.5
jaccard(Set{Int}(), Set{Int}())      # => 1.0

DiscoveryGraph.build_kernel (Function)
build_kernel(members::Vector{String}, weekly_snapshots::Vector{DataFrame};
             threshold=2/3) -> Set{String}

Identify the stable core ("kernel") of a community across weekly snapshots.

A node is a kernel member if it appears in at least threshold fraction of the provided weekly snapshot DataFrames. Kernel membership is used by match_communities to produce stable community IDs across Leiden's non-deterministic reassignments.
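
The threshold rule can be sketched as below. For brevity each weekly snapshot is represented as a plain vector of node addresses rather than a DataFrame with a :node column; that representation and the name build_kernel_sketch are assumptions.

```julia
function build_kernel_sketch(members::Vector{String},
                             weekly_nodes::Vector{Vector{String}};
                             threshold = 2/3)
    # No snapshots means no evidence of stability: return an empty kernel.
    isempty(weekly_nodes) && return Set{String}()

    weekly_sets = Set.(weekly_nodes)   # one membership set per week
    kernel = Set{String}()
    for m in members
        # Fraction of weeks in which this member appears.
        frac = count(s -> m in s, weekly_sets) / length(weekly_sets)
        frac >= threshold && push!(kernel, m)
    end
    return kernel
end

weeks  = [["a", "b"], ["a", "c"], ["a", "b"]]
kernel = build_kernel_sketch(["a", "b", "c"], weeks)
```

Here "a" appears in all three weeks and "b" in two of three, so both clear the default threshold of 2/3, while "c" (one week of three) does not.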

Arguments

  • members::Vector{String}: Candidate community members (typically from the baseline run).
  • weekly_snapshots::Vector{DataFrame}: Weekly node-membership DataFrames, each with a :node column.
  • threshold: Minimum fraction of weeks a node must appear in to be a kernel member (default: 2/3).

Returns

Set{String} of addresses that cleared the threshold. Returns an empty set when weekly_snapshots is empty.

Example

kernel = build_kernel(community_members, snapshots; threshold=2/3)

DiscoveryGraph.match_communities (Function)
match_communities(prior_kernels::Dict{Int32,Set{String}},
                  current_kernels::Dict{Int32,Set{String}};
                  min_jaccard=0.6) -> Dict{Int32,Int32}

Match current community IDs to prior community IDs using kernel Jaccard similarity.

Because Leiden assigns community IDs non-deterministically, this function provides stable identity tracking across weekly snapshots. All candidate (current, prior) pairs with Jaccard similarity ≥ min_jaccard are scored, sorted descending, and greedily assigned (each community ID used at most once on either side).
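
The score, sort, and greedy-assign procedure might look roughly like this. The names match_communities_sketch and jaccard_sim_sketch are hypothetical, and tie-breaking between equal scores is left unspecified here, since the source does not say how ties are resolved.

```julia
# Jaccard similarity, with two empty sets treated as identical.
jaccard_sim_sketch(a, b) =
    (isempty(a) && isempty(b)) ? 1.0 : length(intersect(a, b)) / length(union(a, b))

function match_communities_sketch(prior::Dict{Int32,Set{String}},
                                  current::Dict{Int32,Set{String}};
                                  min_jaccard = 0.6)
    # Score every (current, prior) pair that clears the threshold.
    candidates = Tuple{Float64,Int32,Int32}[]
    for (c, ck) in current, (p, pk) in prior
        score = jaccard_sim_sketch(ck, pk)
        score >= min_jaccard && push!(candidates, (score, c, p))
    end
    sort!(candidates; by = first, rev = true)   # highest similarity first

    mapping    = Dict{Int32,Int32}()
    used_prior = Set{Int32}()
    for (_, c, p) in candidates
        # Each community ID is used at most once on either side.
        (haskey(mapping, c) || p in used_prior) && continue
        mapping[c] = p
        push!(used_prior, p)
    end
    return mapping
end

prior_k   = Dict{Int32,Set{String}}(Int32(1) => Set(["a", "b", "c"]),
                                    Int32(2) => Set(["x", "y"]))
current_k = Dict{Int32,Set{String}}(Int32(1) => Set(["x", "y"]),
                                    Int32(2) => Set(["a", "b", "c", "d"]),
                                    Int32(3) => Set(["q"]))
mapping = match_communities_sketch(prior_k, current_k)
```

In this example current community 1 maps to prior 2 (similarity 1.0), current 2 maps to prior 1 (similarity 0.75), and current 3 has no match, so it is absent from the result.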

Arguments

  • prior_kernels::Dict{Int32,Set{String}}: Kernel sets from the previous snapshot, keyed by community ID.
  • current_kernels::Dict{Int32,Set{String}}: Kernel sets from the current snapshot, keyed by community ID.
  • min_jaccard: Minimum Jaccard threshold to consider a pair a match (default: 0.6).

Returns

Dict{Int32,Int32} mapping current_community_id => prior_community_id for all matched pairs. Unmatched current communities are absent from the dict.

Example

mapping = match_communities(prior_kernels, current_kernels; min_jaccard=0.6)
# mapping[current_id] == prior_id

Node History

DiscoveryGraph.build_node_history (Function)
build_node_history(edges::DataFrame, cfg::CorpusConfig) -> DataFrame

Build a weekly activity time series for every sender in the edge table.

For each sender and each calendar week (Monday-aligned) in which they sent at least one message, the function computes:

  • message_count: total outgoing edges (one per recipient, including duplicates).
  • recipient_count: number of distinct recipients contacted.
  • entropy: Shannon entropy of the recipient frequency distribution (nats), measuring how broadly the sender distributed messages across recipients.

Weeks beyond cfg.corpus_end are excluded. Returns an empty DataFrame with the correct schema when edges is empty.
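
The per-week quantities can be illustrated for a single sender and week. The helpers week_start_sketch and weekly_summary_sketch are hypothetical names; grouping by sender and filtering on cfg.corpus_end are omitted.

```julia
using Dates

# Monday of the calendar week containing `d` (Dates.firstdayofweek is Monday-aligned).
week_start_sketch(d::Date) = firstdayofweek(d)

# Summarise one sender's recipients for one week (one entry per outgoing edge).
function weekly_summary_sketch(recipients)
    counts = Dict{String,Int}()
    for r in recipients
        counts[r] = get(counts, r, 0) + 1
    end
    total = length(recipients)
    probs = [c / total for c in values(counts)]
    # Shannon entropy in nats: -sum(p * log(p)) over the recipient distribution.
    entropy = -sum(p -> p * log(p), probs; init = 0.0)
    return (message_count = total, recipient_count = length(counts), entropy = entropy)
end

week_start_sketch(Date(2001, 1, 4))                # a Thursday
summary = weekly_summary_sketch(["b", "b", "c", "c"])
```

A sender who splits messages evenly between two recipients gets entropy log(2), about 0.69 nats, while a sender who writes to a single recipient gets zero.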

Arguments

  • edges::DataFrame: Edge table from build_edges, with columns :sender, :recipient, and :date.
  • cfg::CorpusConfig: Configuration supplying corpus_end for the date filter.

Returns

DataFrame with columns:

  • :node::String — sender address.
  • :week_start::Date — Monday of the calendar week.
  • :message_count::Int — outgoing edge count for that week.
  • :recipient_count::Int — distinct recipient count for that week.
  • :entropy::Float64 — Shannon entropy of recipient distribution (nats).

Example

history = build_node_history(edges, cfg)