Discovery Layer

The discovery layer provides privilege triage, interactive inspection, temporal anomaly detection, community vocabulary analysis, and Rule 26(f) documentation.

Role Detection

DiscoveryGraph.find_roles - Function
find_roles(node_reg::DataFrame, cfg::CorpusConfig) -> DataFrame

Annotate a node registry with role labels and counsel flags from cfg.roles.

For each node address, every RoleConfig in cfg.roles is tested in order using three matching rules (any match assigns the role):

  1. Exact membership in rc.explicit_addresses.
  2. Any pattern in rc.address_patterns matches via occursin.
  3. The address ends with "@<domain>" or ".<domain>" for any domain in rc.domain_list.

A node's is_counsel flag is set to true if it matches any role whose counsel_type is InHouse or OutsideFirm.
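The three matching rules can be sketched in plain Julia as follows. This is only an illustration under stated assumptions: RoleSketch is a hypothetical stand-in for RoleConfig, patterns are assumed to be Regex values, and the Bool field is_counsel_role stands in for the check that counsel_type is InHouse or OutsideFirm. The package's actual implementation may differ.

```julia
# Hypothetical stand-in for the package's RoleConfig. Field names follow the
# text above; the real type may have more fields or different field types.
struct RoleSketch
    label::String
    explicit_addresses::Vector{String}
    address_patterns::Vector{Regex}      # assumption: patterns are Regex values
    domain_list::Vector{String}
    is_counsel_role::Bool                # stands in for counsel_type in (InHouse, OutsideFirm)
end

# Apply the three rules in order; any single match assigns the role.
function matches_role(addr::AbstractString, rc::RoleSketch)
    addr in rc.explicit_addresses && return true                      # rule 1
    any(p -> occursin(p, addr), rc.address_patterns) && return true   # rule 2
    return any(d -> endswith(addr, "@" * d) || endswith(addr, "." * d),
               rc.domain_list)                                        # rule 3
end

# Collect every matching label and derive the counsel flag for one address.
function roles_for(addr::AbstractString, roles::Vector{RoleSketch})
    matched = RoleSketch[rc for rc in roles if matches_role(addr, rc)]
    return (roles = String[rc.label for rc in matched],
            is_counsel = any(rc -> rc.is_counsel_role, matched))
end

# Illustrative roster; addresses and domains are made up.
roster = [
    RoleSketch("in_house_counsel", ["counsel.a@corp.example"], Regex[], String[], true),
    RoleSketch("outside_counsel", String[], Regex[], ["lawfirm.example"], true),
    RoleSketch("trader", String[], [r"^trader\."], String[], false),
]

in_house = roles_for("counsel.a@corp.example", roster)    # matched by rule 1
outside  = roles_for("j.doe@ny.lawfirm.example", roster)  # matched by rule 3 (subdomain)
trader   = roles_for("trader.bob@corp.example", roster)   # matched by rule 2, not counsel
```

Because every role is tested, an address can carry several labels at once, which is why the :roles column holds a vector rather than a single string.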

Arguments

  • node_reg::DataFrame: Node registry with at least a :node column of address strings.
  • cfg::CorpusConfig: Configuration carrying the roles vector to apply.

Returns

A copy of node_reg with two additional columns:

  • :roles::Vector{String}: list of role labels matched for each node (empty if none).
  • :is_counsel::Bool: true if the node matched any counsel role.

Example

node_reg = find_roles(base_node_reg, cfg)
counsel_nodes = filter(r -> r.is_counsel, eachrow(node_reg))
DiscoveryGraph.identify_counsel_communities - Function
identify_counsel_communities(result::DataFrame, cfg::CorpusConfig) -> DataFrame

Tentatively identify which Leiden communities contain counsel nodes using cfg.roles.

Applies the same role-matching logic as find_roles directly to the Leiden output, without requiring a manually curated node registry. Use this immediately after leiden_communities to identify which community IDs to focus on — replacing the need to call review_all_communities and scan output by eye.

Arguments

  • result::DataFrame: Leiden output with at least :node and :community_id columns.
  • cfg::CorpusConfig: Configuration carrying the roles vector to apply.

Returns

A DataFrame with one row per community containing at least one counsel node:

  • :community_id — Leiden community identifier.
  • :n_members — total nodes in the community.
  • :n_counsel — nodes matching any counsel role.
  • :roles — unique role labels present (e.g. ["in_house_counsel"]).
  • :counsel_nodes — addresses of matched counsel nodes.

Sorted by :n_counsel descending. Returns an empty DataFrame if no counsel nodes are found; in that case, check that cfg.roles is correctly populated.
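The per-community summary can be pictured with the sketch below, which uses plain tuples and a Dict instead of a DataFrame so that it runs without extra packages. The is_counsel predicate is a placeholder for the cfg.roles matching rules, and the addresses are invented; none of these names come from the package.

```julia
# Hypothetical Leiden output as (node, community_id) pairs. The real function
# takes a DataFrame with :node and :community_id columns.
members = [
    ("counsel.a@corp.example", 9),
    ("counsel.b@corp.example", 9),
    ("trader.bob@corp.example", 9),
    ("counsel.c@corp.example", 6),
    ("trader.amy@corp.example", 6),
    ("trader.cy@corp.example", 3),
]

# Placeholder predicate standing in for the cfg.roles matching rules.
is_counsel(addr) = startswith(addr, "counsel.")

function counsel_summary(members, is_counsel)
    # Group node addresses by community ID.
    by_comm = Dict{Int,Vector{String}}()
    for (node, cid) in members
        push!(get!(by_comm, cid, String[]), node)
    end
    # Keep only communities with at least one counsel node.
    rows = [(community_id = cid,
             n_members = length(nodes),
             n_counsel = count(is_counsel, nodes),
             counsel_nodes = filter(is_counsel, nodes))
            for (cid, nodes) in by_comm if any(is_counsel, nodes)]
    # Largest counsel presence first, as in the documented return value.
    return sort(rows; by = r -> r.n_counsel, rev = true)
end

summary = counsel_summary(members, is_counsel)
```

Community 3 has no counsel node and therefore does not appear in the result at all.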

Example

result = leiden_communities(g, all_nodes; resolution=1.0, seed=42)
identify_counsel_communities(result, cfg)
# community_id  n_members  n_counsel  roles                  counsel_nodes
#           9        142         6  ["in_house_counsel"]   ["sara.shackleton@enron.com", ...]
DiscoveryGraph.audit_counsel_coverage - Function
audit_counsel_coverage(corpus, node_reg, cfg; keywords, broadcast_min_recipients) -> NamedTuple

Scan the corpus for attorney-flavored messages where no party is a known counsel node.

Identifies potential gaps in cfg.roles — senders who write about legal topics but were not captured by the role-matching rules. Most results will be broadcast announcements (high broadcast_fraction); outliers with low broadcast_fraction and many messages are candidates for manual review and possible addition to cfg.roles.

Messages are filtered by subject keyword match (case-insensitive). A message is excluded from results if the sender or any recipient is already in node_reg as counsel. Bot senders (per cfg) are also excluded.
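A rough sketch of this filter and of the per-sender broadcast statistics is shown below. It works on plain NamedTuples rather than the corpus DataFrame, uses an invented keyword list (not the package's ATTORNEY_KEYWORDS), assumes lowercase keywords, and leaves out the bot-sender exclusion for brevity.

```julia
# Illustrative inputs; the keyword list and all addresses are assumptions.
keywords = ["privileged", "attorney"]
counsel  = Set(["counsel.a@corp.example"])
msgs = [
    (sender = "hr@corp.example",
     recipients = ["a@corp.example", "b@corp.example", "c@corp.example",
                   "d@corp.example", "e@corp.example"],
     subject = "Attorney roster update"),
    (sender = "pat@corp.example", recipients = ["lee@corp.example"],
     subject = "Privileged: draft response"),
    (sender = "pat@corp.example", recipients = ["counsel.a@corp.example"],
     subject = "privileged question"),
    (sender = "lee@corp.example", recipients = ["pat@corp.example"],
     subject = "Lunch"),
]

# Case-insensitive subject match against lowercase keywords.
has_keyword(subject) = any(k -> occursin(k, lowercase(subject)), keywords)
# True when the sender or any recipient is a known counsel node.
has_counsel(m) = m.sender in counsel || any(in(counsel), m.recipients)

function sender_stats(msgs; broadcast_min_recipients = 5)
    tallies = Dict{String,Tuple{Int,Int}}()   # sender => (n_messages, n_broadcast)
    for m in msgs
        (has_keyword(m.subject) && !has_counsel(m)) || continue
        n, b = get(tallies, m.sender, (0, 0))
        is_broadcast = length(m.recipients) >= broadcast_min_recipients
        tallies[m.sender] = (n + 1, b + is_broadcast)
    end
    return Dict(s => (n_messages = n, n_broadcast = b, broadcast_fraction = b / n)
                for (s, (n, b)) in tallies)
end

stats = sender_stats(msgs)
```

In this toy data the HR announcement is a pure broadcast, while the message from pat@corp.example to a single non-counsel recipient is the kind of low-broadcast outlier worth a manual look.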

Arguments

  • corpus::DataFrame: Full corpus as returned by load_corpus.
  • node_reg::DataFrame: Node registry with :is_counsel column from find_roles.
  • cfg::CorpusConfig: Configuration supplying column names and bot rules.
  • keywords: Subject keywords to match (default: ATTORNEY_KEYWORDS).
  • broadcast_min_recipients: Recipient count at or above which a message is flagged as a broadcast (default: 5).

Returns

A NamedTuple with:

  • :suspicious_senders::DataFrame — one row per non-counsel sender, columns: :sender, :n_messages, :n_broadcast, :broadcast_fraction, :sample_subjects. Sorted by :n_messages descending.
  • :uncovered_count::Int — total attorney-flavored messages with no counsel party.
  • :keywords_used::Vector{String} — the keyword list applied.

Example

node_reg = find_roles(base_node_reg, cfg)
audit  = audit_counsel_coverage(corpus, node_reg, cfg)
# Filter to non-broadcast candidates for cfg.roles additions:
filter(r -> r.broadcast_fraction < 0.5, audit.suspicious_senders)
DiscoveryGraph.ATTORNEY_KEYWORDS - Constant
ATTORNEY_KEYWORDS

Default subject-line keywords used by audit_counsel_coverage to identify messages that discuss legal topics but involve no known counsel party. A message whose subject contains any of these terms (case-insensitive) is a candidate for review when neither its sender nor any recipient is in the counsel node set.

Pass a custom list as the keywords argument to audit_counsel_coverage to override.


Interactive Session

DiscoveryGraph.DiscoverySession - Type
DiscoverySession

Primary interactive interface for exploring a communication network and its detected communities. Bundles the three DataFrames and the configuration that all inspection functions require, eliminating repetitive argument passing.

Fields

  • corpus_df::DataFrame: The full message corpus, with columns named according to cfg.
  • result::DataFrame: Community membership table with columns :node and :community_id.
  • edge_df::DataFrame: Broadcast-discounted edge table from build_edges, with columns :sender, :recipient, :date, and :weight.
  • cfg::CorpusConfig: The corpus configuration (column names, date bounds, roles, etc.).
  • leiden_seed::Int: Random seed passed to leiden_communities (default 42). Recorded in the Rule 26(f) memo for reproducibility documentation.
  • leiden_resolution::Float64: Resolution parameter passed to leiden_communities (default 1.0). Recorded in the Rule 26(f) memo.

Pass all six fields to record non-default Leiden parameters in the methodology statement:

S = DiscoverySession(corpus_df, leiden_result, edge_df, cfg, seed, resolution)

The 4-argument form defaults to leiden_seed=42, leiden_resolution=1.0.

Example

S = DiscoverySession(corpus_df, leiden_result, edge_df, cfg)
eyeball(S, 6; block=(DateTime(2000,7,1), DateTime(2000,7,31)), n=20)
inspect_community(S, 6)
DiscoveryGraph.eyeball - Function
eyeball(S::DiscoverySession, cid::Integer;
        mode=:random, n=25, start=nothing, stop=nothing, block=nothing)

Print a sample of message headers from a single community to the console.

Filters the corpus to messages sent by members of community cid within the time window, then prints timestamp, sender, and subject for up to n messages.

Arguments

  • S::DiscoverySession: The active discovery session.
  • cid::Integer: Community ID to inspect.
  • mode: Sampling mode — :random (default) shuffles before taking n; :chrono takes the first n in chronological order.
  • n: Maximum number of messages to display (default: 25).
  • start: Window start (DateTime); defaults to S.cfg.baseline_start.
  • stop: Window end (DateTime); defaults to S.cfg.baseline_end.
  • block: A (start, stop) tuple of DateTime values; sets the window and forces mode=:chrono. Overrides start/stop when provided.
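
How mode, start, stop, and block interact can be sketched as below. This is an illustration only: sample_headers is a hypothetical helper operating on a bare vector of timestamps, the window is assumed to be inclusive at both ends, and the rng keyword is added purely so the sketch can be made reproducible.

```julia
using Dates, Random

# Return the indices of up to n timestamps inside the window.
# A block tuple overrides start/stop and forces chronological order.
function sample_headers(dates::Vector{DateTime}; mode = :random, n = 25,
                        start, stop, block = nothing,
                        rng = Random.default_rng())
    if block !== nothing
        start, stop = block
        mode = :chrono
    end
    idx = findall(d -> start <= d <= stop, dates)   # assumption: inclusive window
    idx = mode == :chrono ? sort(idx; by = i -> dates[i]) : shuffle(rng, idx)
    return idx[1:min(n, length(idx))]
end

dates = [DateTime(2000, 7, 15), DateTime(2000, 6, 1),
         DateTime(2000, 7, 2),  DateTime(2000, 8, 1)]

# The block restricts the sample to July 2000 and orders it chronologically.
july = sample_headers(dates; start = DateTime(2000, 1, 1), stop = DateTime(2000, 12, 31),
                      block = (DateTime(2000, 7, 1), DateTime(2000, 7, 31)), n = 20)
```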

Returns

nothing (output goes to stdout).

Example

eyeball(S, 9; mode=:chrono, n=10)
eyeball(S, 6; block=(DateTime(2000,7,1), DateTime(2000,7,31)), n=20)
DiscoveryGraph.inspect_community - Function
inspect_community(S::DiscoverySession, cid::Integer)

Print a structural summary of a single community to the console.

Displays the community's member count, total internal edge count, and the top-5 internal senders by message volume.

Arguments

  • S::DiscoverySession: The active discovery session.
  • cid::Integer: Community ID to summarise.

Returns

nothing (output goes to stdout).

Example

inspect_community(S, 6)
DiscoveryGraph.inspect_bridge - Function
inspect_bridge(S::DiscoverySession, cid_a::Integer, cid_b::Integer;
               start=nothing, stop=nothing)

Print the count of cross-community edges between two communities within a time window.

Identifies edges where one endpoint is a member of cid_a and the other is a member of cid_b (in either direction), filtered to the specified date range.
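
The edge test described above can be sketched as a simple count. The helper bridge_count, the membership Dict, and the edge tuples are all hypothetical; the sketch also assumes an inclusive date window and ignores the :weight column.

```julia
using Dates

# Hypothetical membership map (node => community ID) and edge list.
community = Dict("a@corp.example" => 9, "b@corp.example" => 6, "c@corp.example" => 9)
edges = [
    (sender = "a@corp.example", recipient = "b@corp.example", date = DateTime(2000, 10, 5)),  # 9 -> 6
    (sender = "b@corp.example", recipient = "c@corp.example", date = DateTime(2000, 11, 1)),  # 6 -> 9
    (sender = "a@corp.example", recipient = "c@corp.example", date = DateTime(2000, 11, 2)),  # 9 -> 9
    (sender = "a@corp.example", recipient = "b@corp.example", date = DateTime(2001, 1, 5)),   # out of window
]

# Count edges with one endpoint in cid_a and the other in cid_b, in either direction.
function bridge_count(edges, community, cid_a, cid_b; start, stop)
    return count(edges) do e
        ca = get(community, e.sender, nothing)
        cb = get(community, e.recipient, nothing)
        in_window = start <= e.date <= stop
        in_window && ((ca == cid_a && cb == cid_b) || (ca == cid_b && cb == cid_a))
    end
end

n_bridge = bridge_count(edges, community, 9, 6;
                        start = DateTime(2000, 10, 1), stop = DateTime(2000, 12, 31))
```

Only the first two edges qualify: the third is internal to community 9 and the fourth falls outside the window.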

Arguments

  • S::DiscoverySession: The active discovery session.
  • cid_a::Integer: First community ID.
  • cid_b::Integer: Second community ID.
  • start: Window start (DateTime); defaults to S.cfg.baseline_start.
  • stop: Window end (DateTime); defaults to S.cfg.baseline_end.

Returns

nothing (output goes to stdout).

Example

inspect_bridge(S, 9, 6; start=DateTime(2000,10,1), stop=DateTime(2000,12,31))
DiscoveryGraph.review_all_communities - Function
review_all_communities(S::DiscoverySession; n=10, start=nothing, stop=nothing)

Run eyeball on every community in the session and print the results sequentially.

Communities are processed in ascending community_id order. Useful for a first-pass review of all communities immediately after Leiden detection.

Arguments

  • S::DiscoverySession: The active discovery session.
  • n: Maximum messages to display per community (default: 10).
  • start: Window start (DateTime); defaults to S.cfg.baseline_start.
  • stop: Window end (DateTime); defaults to S.cfg.baseline_end.

Returns

nothing (output goes to stdout).

Example

review_all_communities(S; n=5)
source

Privilege Triage

DiscoveryGraph.TierClass - Type
TierClass

Five-tier classification for privilege log triage, used by generate_outputs.

Variants

  • Tier1: High-priority privilege review — litigation anticipation or active regulatory investigation. Requires immediate human review.
  • Tier2: Secondary privilege review — regulatory compliance or direct legal advice. Requires human review after Tier 1.
  • Tier3: Transactional legal work — privilege likely waived in transactional context. Deprioritised; review if time permits.
  • Tier4: Unclassified — counsel is involved but no keyword from any tier list matched. Human judgment required.
  • Tier5: No counsel involvement — excluded from privilege review queue.
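
A keyword-driven assignment consistent with these variants might look like the sketch below. The enum TierSketch, the keyword lists, and the rule that the highest-priority matching tier wins are all assumptions made for illustration; the package's real keyword lists and precedence are not documented here.

```julia
# Hypothetical mirror of TierClass for illustration.
@enum TierSketch Tier1 Tier2 Tier3 Tier4 Tier5

# Invented keyword lists, checked in priority order (assumption).
const TIER_KEYWORDS = [
    Tier1 => ["litigation", "subpoena"],
    Tier2 => ["compliance", "legal advice"],
    Tier3 => ["contract", "term sheet"],
]

function assign_tier(subject::AbstractString, has_counsel::Bool)
    has_counsel || return Tier5            # no counsel involvement
    s = lowercase(subject)
    for (tier, words) in TIER_KEYWORDS
        any(w -> occursin(w, s), words) && return tier
    end
    return Tier4                           # counsel involved, no keyword matched
end

t_spike   = assign_tier("Re: Subpoena response", true)        # Tier1
t_mixed   = assign_tier("Contract compliance review", true)   # Tier2 outranks Tier3 here
t_unknown = assign_tier("Lunch on Friday", true)               # Tier4
t_none    = assign_tier("Subpoena", false)                     # Tier5
```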
DiscoveryGraph.generate_outputs - Function
generate_outputs(S::DiscoverySession, node_reg::DataFrame)
    -> NamedTuple{(:community_table, :review_queue, :tier1, :tier2, :tier3, :tier4, :anomaly_list)}

Generate the primary discovery outputs from a DiscoverySession.

Processes every message in S.corpus_df and identifies those involving at least one counsel party. Counsel is detected via two complementary paths:

  1. Graph-node counsel: parties present in node_reg with is_counsel = true (derived from find_roles).
  2. Pattern-matched counsel: parties absent from the graph (e.g., outside counsel at firm domains excluded by cfg.internal_domain) are checked directly against cfg.roles using the same domain/pattern/address rules as find_roles. This closes the privilege gap where messages to outside counsel are missed when the communication graph is restricted to internal addresses only.

Each matched message is added to the review queue with the roles implicated and a keyword-based tier assignment.

node_reg must be the output of find_roles(node_reg, cfg) — it must contain columns :roles and :is_counsel.

Arguments

  • S::DiscoverySession: The active discovery session.
  • node_reg::DataFrame: Node registry annotated by find_roles, with columns :node, :community_id, :roles, :is_counsel, and optionally :is_kernel.

Returns

A NamedTuple with:

  • community_table::DataFrame: subset of node_reg with columns :node, :community_id, :roles, :is_counsel, and :is_kernel (when present).
  • review_queue::DataFrame: all Tier1–4 messages combined; columns :hash, :date, :sender, :recipients, :subject, :roles_implicated, :tier (TierClass), :basis.
  • tier1, tier2, tier3, tier4::DataFrame: per-tier subsets of review_queue for direct access.
  • anomaly_list::DataFrame: empty placeholder (anomaly detection is performed separately by detect_anomalies); columns :node, :week_start, :anomaly_type, :z_score, :basis.

Example

node_reg = find_roles(base_reg, cfg)
S        = DiscoverySession(corpus, result, edges, cfg)
outputs  = generate_outputs(S, node_reg)
memo     = generate_rule26f_memo(S, outputs)

Outputs

DiscoveryGraph.write_outputs - Function
write_outputs(S::DiscoverySession, outputs::NamedTuple, dir::AbstractString;
              overwrite::Bool = false) -> NamedTuple

Write per-tier DataFrames and the Rule 26(f) memo to dir.

Creates the directory if it does not exist. By default refuses to overwrite existing files; set overwrite = true to replace them.
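
The create-then-refuse behaviour can be sketched with Base file functions alone. The helper write_all is hypothetical, it writes plain strings rather than Arrow tables, and checking every target before writing anything is an assumption about ordering rather than a documented guarantee.

```julia
# Write each named file into dir, refusing to clobber existing files unless asked.
function write_all(dir::AbstractString, files::Dict{String,String}; overwrite::Bool = false)
    mkpath(dir)   # creates the directory (and parents) if missing
    paths = Dict(name => abspath(joinpath(dir, name)) for name in keys(files))
    if !overwrite
        existing = sort([p for p in values(paths) if isfile(p)])
        isempty(existing) || error("refusing to overwrite: " * join(existing, ", "))
    end
    for (name, content) in files
        write(paths[name], content)
    end
    return paths   # absolute path for each file written
end

dir   = mktempdir()
paths = write_all(dir, Dict("rule26f_memo.md" => "# Methodology\n"))
# Calling write_all again on the same directory without overwrite = true
# raises an error instead of replacing the memo.
```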

Files written

File                 Contents
tier1.arrow          Tier 1 review queue (litigation / regulatory)
tier2.arrow          Tier 2 review queue (legal advice / compliance)
tier3.arrow          Tier 3 review queue (transactional)
tier4.arrow          Tier 4 review queue (no keyword signal)
review_queue.arrow   Combined Tier 1–4 queue
rule26f_memo.md      Rule 26(f)(3)(D) methodology statement

The Arrow files are the intended input to a privilege-review UI (not yet built). They are not designed for direct attorney use; do not export to CSV or spreadsheet. The memo is attorney-ready as written.

Arguments

  • S::DiscoverySession: Active session (used to generate the memo).
  • outputs::NamedTuple: Return value of generate_outputs(S, node_reg).
  • dir::AbstractString: Destination directory path.
  • overwrite::Bool: If false (default), error if any output file already exists.

Returns

A NamedTuple of absolute paths for each file written.

Example

outputs = generate_outputs(S, node_reg)
paths   = write_outputs(S, outputs, "discovery_export")
@info "Memo" path=paths.memo

Temporal Analysis

DiscoveryGraph.detect_anomalies - Function
detect_anomalies(history_df::DataFrame, cfg::CorpusConfig) -> DataFrame

Detect statistically anomalous weekly message-volume spikes in a node history table.

For each node with at least 3 weeks of history, computes the mean and standard deviation of message_count. Any week whose count exceeds the node mean by more than cfg.anomaly_zscore_threshold standard deviations is flagged as a "volume_spike". Nodes with near-zero standard deviation (< 1e-9) are skipped.
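
For a single node, the z-score test can be sketched as below. The helper volume_spikes is hypothetical; it assumes the sample standard deviation (the default of Statistics.std) and a strict greater-than comparison, neither of which is confirmed by the text above. The threshold keyword stands in for cfg.anomaly_zscore_threshold.

```julia
using Statistics

# Return (week_index, z_score) for each week flagged as a volume spike.
function volume_spikes(counts::AbstractVector{<:Real}; threshold = 3.0)
    length(counts) >= 3 || return Tuple{Int,Float64}[]   # too little history
    mu, sigma = mean(counts), std(counts)                # assumption: sample std
    sigma < 1e-9 && return Tuple{Int,Float64}[]          # flat series is skipped
    return [(i, round((c - mu) / sigma; digits = 2))
            for (i, c) in enumerate(counts) if (c - mu) / sigma > threshold]
end

weekly = [10, 10, 10, 10, 10, 10, 10, 10, 10, 100]
spikes = volume_spikes(weekly; threshold = 2.5)   # only the tenth week is flagged
flat   = volume_spikes([5, 5, 5, 5])              # empty: zero standard deviation
```

Note that a single extreme week also inflates the standard deviation it is measured against, so with short histories a spike may need to be very large to clear a high threshold.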

Arguments

  • history_df::DataFrame: Weekly node history from build_node_history, with columns :node, :week_start, and :message_count.
  • cfg::CorpusConfig: Configuration supplying anomaly_zscore_threshold.

Returns

DataFrame with one row per detected anomaly and columns:

  • :node::String — node address.
  • :week_start::Date — Monday of the anomalous week.
  • :anomaly_type::String — currently always "volume_spike".
  • :z_score::Float64 — z-score of the anomalous week's message count (rounded to 2 decimal places).
  • :basis::String — human-readable explanation including raw count, z-score, and node mean.

Example

anomalies = detect_anomalies(history, cfg)
filter(r -> r.node == "alice@corp.com", anomalies)

Community Vocabulary

DiscoveryGraph.build_community_vocabulary - Function
build_community_vocabulary(corpus_df::DataFrame, community_table::DataFrame,
                           cfg::CorpusConfig) -> Dict{Int32, Vector{Pair{String,Float64}}}

Build a TF-IDF vocabulary for each community from the corpus subject lines.

v0.1.0 stub

This function is not yet implemented. It returns empty term lists for every community. Full TF-IDF computation is a future deliverable.

Arguments

  • corpus_df::DataFrame: The full message corpus with columns named according to cfg.
  • community_table::DataFrame: Community membership table with a :community_id column.
  • cfg::CorpusConfig: Configuration supplying stopwords and column name mappings.

Returns

Dict{Int32, Vector{Pair{String,Float64}}} mapping each community ID to a list of (term => tfidf_score) pairs sorted by descending score. In v0.1.0 every community maps to an empty vector.

Example

vocab = build_community_vocabulary(corpus_df, community_table, cfg)
# vocab[6] => Pair{String,Float64}[]  (stub; always empty in v0.1.0)

Rule 26(f) Documentation

DiscoveryGraph.generate_rule26f_memo - Function
generate_rule26f_memo(S::DiscoverySession, outputs::NamedTuple) -> String

Generate a Rule 26(f)(3)(D) privilege log methodology statement as a Markdown string.

Produces a structured memo suitable for filing or service that documents:

  • Corpus size and the reduction ratio achieved by the review queue.
  • Community detection algorithm, parameters, and thresholds.
  • Attorney/role roster derived from outputs.community_table.
  • The five-tier classification scheme and the v0.1.0 semantic analysis caveat.
  • A reproducibility reference (Zenodo DOI pending in v0.1.0).

outputs must be the result of generate_outputs(S, node_reg) where node_reg was produced by find_roles. The outputs.community_table must contain columns :is_counsel and :roles.

Arguments

  • S::DiscoverySession: The active discovery session (supplies corpus size and cfg).
  • outputs::NamedTuple: Named tuple returned by generate_outputs, with fields community_table, review_queue, and anomaly_list.

Returns

A String containing the complete methodology memo in Markdown format.

Example

node_reg = find_roles(base_reg, cfg)
outputs  = generate_outputs(S, node_reg)
memo     = generate_rule26f_memo(S, outputs)
write("rule26f_memo.md", memo)