Network Layer

The network layer builds the communication graph, detects communities, and computes node activity history. Every function operates on data passed in as arguments (DataFrames, graphs, sets, or a CorpusConfig); none reads file paths or relies on global state.

Community Detection

DiscoveryGraph.build_snapshot_graph — Function

build_snapshot_graph(edges::DataFrame, node_idx::Dict{String,Int}, n::Int) -> SimpleWeightedGraph

Construct a weighted undirected graph from an edge table for a single time snapshot.

Each unique (sender, recipient) pair in edges becomes an undirected edge with the corresponding weight. Nodes are identified by their integer index in node_idx; any address not present in node_idx is silently skipped.

Arguments

edges::DataFrame: Edge table with columns :sender, :recipient, and :weight.
node_idx::Dict{String,Int}: Mapping from node address to 1-based integer index.
n::Int: Total number of nodes (size of the graph).

Returns

A SimpleWeightedGraph with n vertices and an edge for each (sender, recipient) pair in edges whose endpoints are both present in node_idx.

Example

nodes    = unique(vcat(edges.sender, edges.recipient))
node_idx = Dict(n => i for (i, n) in enumerate(nodes))
g = build_snapshot_graph(edges, node_idx, length(nodes))
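The skipping rule can be illustrated without any graph package. The following is a minimal sketch, not the package's implementation: `snapshot_edges_sketch` is a hypothetical name, rows are plain tuples rather than a DataFrame, a Dict keyed by index pair stands in for SimpleWeightedGraph, and summing the weights of repeated pairs is an assumption that the source does not confirm.

```julia
# Hypothetical sketch: rows are (sender, recipient, weight) tuples and the
# "graph" is a Dict keyed by an ordered index pair, so (a, b) and (b, a)
# land on the same undirected edge.
function snapshot_edges_sketch(rows, node_idx::Dict{String,Int})
    weights = Dict{Tuple{Int,Int},Float64}()
    for (s, r, w) in rows
        # Silently skip rows whose endpoints are not both in the index.
        (haskey(node_idx, s) && haskey(node_idx, r)) || continue
        key = minmax(node_idx[s], node_idx[r])
        # Assumption: repeated pairs are summed; the package may behave differently.
        weights[key] = get(weights, key, 0.0) + w
    end
    return weights
end

rows = [("a@x.com", "b@x.com", 2.0),
        ("b@x.com", "c@x.com", 1.0),
        ("a@x.com", "z@x.com", 5.0)]   # "z@x.com" is not indexed
node_idx = Dict("a@x.com" => 1, "b@x.com" => 2, "c@x.com" => 3)
w = snapshot_edges_sketch(rows, node_idx)  # two edges; the third row is dropped
```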

DiscoveryGraph.leiden_communities — Function

leiden_communities(g::SimpleWeightedGraph, node_labels::Vector{String};
                   resolution=1.0, n_iterations=10, seed=42) -> DataFrame

Detect communities in a weighted graph using the Leiden algorithm.

Calls the Python leidenalg library (via PythonCall) with the RBConfigurationVertexPartition objective, which supports weighted edges and a resolution parameter. Community IDs are 1-based integers in the output (Python's 0-based membership is incremented).

Results are non-deterministic across fresh runs, even with the same seed, because Leiden's refinement phase visits nodes in a randomised order. Community IDs therefore cannot be compared directly between runs; use match_communities to track identity across snapshots.

Arguments

g::SimpleWeightedGraph: The graph to partition.
node_labels::Vector{String}: Address label for each vertex (length must equal nv(g)).
resolution: Resolution parameter controlling community granularity. Higher values yield more, smaller communities (default: 1.0).
n_iterations: Number of Leiden iterations (default: 10).
seed: Random seed for reproducibility within a single run (default: 42).

Returns

DataFrame with columns:

:node::String — node address.
:community_id::Int32 — 1-based community membership.

Example

result = leiden_communities(g, nodes; resolution=1.0)

DiscoveryGraph.jaccard — Function

jaccard(a::Set, b::Set) -> Float64

Compute the Jaccard similarity between two sets.

Jaccard similarity is |a ∩ b| / |a ∪ b|. Returns 1.0 when both sets are empty (identical empty sets are treated as perfectly similar).

Arguments

a::Set: First set.
b::Set: Second set.

Returns

A Float64 in [0.0, 1.0].

Example

jaccard(Set([1,2,3]), Set([2,3,4]))  # => 0.5
jaccard(Set{Int}(), Set{Int}())      # => 1.0
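The definition above fits in a few lines. The following is a minimal sketch written from that definition, not necessarily the package's code; the name `jaccard_sketch` is hypothetical and chosen to avoid clashing with the exported function.

```julia
# Hypothetical stand-in for DiscoveryGraph.jaccard, written from the definition above.
function jaccard_sketch(a::Set, b::Set)
    # Two empty sets are treated as identical, which also avoids dividing 0 by 0.
    isempty(a) && isempty(b) && return 1.0
    return length(intersect(a, b)) / length(union(a, b))
end

jaccard_sketch(Set([1, 2, 3]), Set([2, 3, 4]))  # 2 shared elements out of 4 distinct
jaccard_sketch(Set{Int}(), Set{Int}())          # both empty
```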

DiscoveryGraph.build_kernel — Function

build_kernel(members::Vector{String}, weekly_snapshots::Vector{DataFrame};
             threshold=2/3) -> Set{String}

Identify the stable core ("kernel") of a community across weekly snapshots.

A node is a kernel member if it appears in at least a fraction threshold of the provided weekly snapshot DataFrames. match_communities compares kernels, rather than raw memberships, to keep community IDs stable across Leiden's non-deterministic reassignments.

Arguments

members::Vector{String}: Candidate community members (typically from the baseline run).
weekly_snapshots::Vector{DataFrame}: Weekly node-membership DataFrames, each with a :node column.
threshold: Minimum fraction of weeks a node must appear in to be a kernel member (default: 2/3).

Returns

Set{String} of addresses that cleared the threshold. Returns an empty set when weekly_snapshots is empty.

Example

kernel = build_kernel(community_members, snapshots; threshold=2/3)
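The threshold rule can be sketched as follows. This is an illustration under stated assumptions rather than the package's implementation: `kernel_sketch` is a hypothetical name, and each weekly snapshot is a plain vector of addresses standing in for the :node column of a DataFrame.

```julia
# Hypothetical sketch: each snapshot is a plain vector of node addresses,
# standing in for the :node column of a weekly DataFrame.
function kernel_sketch(members::Vector{String},
                       weekly_nodes::Vector{Vector{String}};
                       threshold = 2 / 3)
    kernel = Set{String}()
    isempty(weekly_nodes) && return kernel   # no weeks, so no kernel
    weekly_sets = [Set(w) for w in weekly_nodes]
    for m in members
        # Fraction of weeks in which this member is present.
        frac = count(s -> m in s, weekly_sets) / length(weekly_sets)
        frac >= threshold && push!(kernel, m)
    end
    return kernel
end

weeks = [["a", "b"], ["a", "c"], ["a", "b"]]
k = kernel_sketch(["a", "b", "c"], weeks)
# "a" appears in 3 of 3 weeks and "b" in 2 of 3, so both pass; "c" (1 of 3) does not.
```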

DiscoveryGraph.match_communities — Function

match_communities(prior_kernels::Dict{Int32,Set{String}},
                  current_kernels::Dict{Int32,Set{String}};
                  min_jaccard=0.6) -> Dict{Int32,Int32}

Match current community IDs to prior community IDs using kernel Jaccard similarity.

Because Leiden assigns community IDs non-deterministically, this function provides stable identity tracking across weekly snapshots. All candidate (current, prior) pairs with Jaccard similarity ≥ min_jaccard are scored, sorted descending, and greedily assigned (each community ID used at most once on either side).

Arguments

prior_kernels::Dict{Int32,Set{String}}: Kernel sets from the previous snapshot, keyed by community ID.
current_kernels::Dict{Int32,Set{String}}: Kernel sets from the current snapshot, keyed by community ID.
min_jaccard: Minimum Jaccard threshold to consider a pair a match (default: 0.6).

Returns

Dict{Int32,Int32} mapping current_community_id => prior_community_id for all matched pairs. Unmatched current communities are absent from the dict.

Example

mapping = match_communities(prior_kernels, current_kernels; min_jaccard=0.6)
# mapping[current_id] == prior_id
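The score, sort, and greedy-assign steps described above might look like the sketch below. The names `match_sketch` and `jaccard_sketch` are hypothetical, and how the package breaks ties between equally similar pairs is not stated in the source, so tie order here is left unspecified.

```julia
# Hypothetical sketch of the greedy matching described above.
jaccard_sketch(a::Set, b::Set) =
    isempty(a) && isempty(b) ? 1.0 : length(intersect(a, b)) / length(union(a, b))

function match_sketch(prior::Dict{Int32,Set{String}},
                      current::Dict{Int32,Set{String}};
                      min_jaccard = 0.6)
    # Score every (current, prior) pair that clears the threshold.
    candidates = Tuple{Float64,Int32,Int32}[]
    for (cid, ck) in current, (pid, pk) in prior
        s = jaccard_sketch(ck, pk)
        s >= min_jaccard && push!(candidates, (s, cid, pid))
    end
    # Highest similarity first; tie order is unspecified in this sketch.
    sort!(candidates; by = first, rev = true)

    mapping = Dict{Int32,Int32}()
    used_prior = Set{Int32}()
    for (_, cid, pid) in candidates
        # Each community ID may be used at most once on either side.
        (haskey(mapping, cid) || pid in used_prior) && continue
        mapping[cid] = pid
        push!(used_prior, pid)
    end
    return mapping
end

prior   = Dict(Int32(1) => Set(["a", "b", "c"]), Int32(2) => Set(["x", "y"]))
current = Dict(Int32(7) => Set(["a", "b", "c", "d"]), Int32(8) => Set(["q"]))
mapping = match_sketch(prior, current)
# Community 7 matches prior community 1 (Jaccard 3/4); community 8 stays unmatched.
```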

Node History

DiscoveryGraph.build_node_history — Function

build_node_history(edges::DataFrame, cfg::CorpusConfig) -> DataFrame

Build a weekly activity time series for every sender in the edge table.

For each sender and each calendar week (Monday-aligned) in which they sent at least one message, the function computes:

message_count: total outgoing edges (one per recipient, including duplicates).
recipient_count: number of distinct recipients contacted.
entropy: Shannon entropy of the recipient frequency distribution (nats), measuring how broadly the sender distributed messages across recipients.

Weeks beyond cfg.corpus_end are excluded. Returns an empty DataFrame with the correct schema when edges is empty.

Arguments

edges::DataFrame: Edge table from build_edges, with columns :sender, :recipient, and :date.
cfg::CorpusConfig: Configuration supplying corpus_end for the date filter.

Returns

DataFrame with columns:

:node::String — sender address.
:week_start::Date — Monday of the calendar week.
:message_count::Int — outgoing edge count for that week.
:recipient_count::Int — distinct recipient count for that week.
:entropy::Float64 — Shannon entropy of recipient distribution (nats).

Example

history = build_node_history(edges, cfg)
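The per-week quantities can be reproduced by hand for a single sender. The sketch below uses only the Dates standard library; `week_start` and `recipient_entropy` are hypothetical helper names, and the Monday alignment shown is one straightforward way to obtain it, not necessarily how the package does so.

```julia
using Dates

# Monday of the calendar week containing d (dayofweek: Monday = 1, ..., Sunday = 7).
week_start(d::Date) = d - Day(dayofweek(d) - 1)

# Shannon entropy, in nats, of the recipient frequency distribution.
function recipient_entropy(recipients::Vector{String})
    counts = Dict{String,Int}()
    for r in recipients
        counts[r] = get(counts, r, 0) + 1
    end
    total = length(recipients)
    h = 0.0
    for c in values(counts)
        p = c / total
        h -= p * log(p)   # natural log, so the result is in nats
    end
    return h
end

monday = week_start(Date(2001, 5, 9))        # a Wednesday, mapped to Monday 2001-05-07
recips = ["bob", "bob", "carol", "dave"]     # one sender's recipients in that week
message_count   = length(recips)             # 4, duplicates included
recipient_count = length(unique(recips))     # 3 distinct recipients
entropy = recipient_entropy(recips)          # equals 1.5 * log(2) for this distribution
```

A sender who writes to a single recipient all week has entropy 0; spreading messages evenly over more recipients raises it.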
