Discovery Layer
The discovery layer provides privilege triage, interactive inspection, temporal anomaly detection, community vocabulary analysis, and Rule 26(f) documentation.
Role Detection
DiscoveryGraph.find_roles — Function
find_roles(node_reg::DataFrame, cfg::CorpusConfig) -> DataFrame
Annotate a node registry with role labels and counsel flags from cfg.roles.
For each node address, every RoleConfig in cfg.roles is tested in order using three matching rules (any match assigns the role):
- Exact membership in rc.explicit_addresses.
- Any pattern in rc.address_patterns matches via occursin.
- The address ends with "@<domain>" or ".<domain>" for any domain in rc.domain_list.
A node's is_counsel flag is set to true if it matches any role whose counsel_type is InHouse or OutsideFirm.
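The three matching rules and the counsel flag can be sketched in plain Julia. This is a minimal illustration, not the package's implementation: RoleSketch is a hypothetical stand-in for RoleConfig, its is_counsel field stands in for a counsel_type of InHouse or OutsideFirm, and the addresses are made up.

```julia
# Hypothetical stand-in for RoleConfig; field names follow the rules above,
# but the real type may differ.
struct RoleSketch
    label::String
    explicit_addresses::Vector{String}
    address_patterns::Vector{Regex}   # assumption: patterns are regexes
    domain_list::Vector{String}
    is_counsel::Bool                  # stands in for counsel_type InHouse/OutsideFirm
end

# True if any of the three rules matches.
function matches_role(addr::AbstractString, rc::RoleSketch)
    addr in rc.explicit_addresses && return true
    any(p -> occursin(p, addr), rc.address_patterns) && return true
    return any(d -> endswith(addr, "@" * d) || endswith(addr, "." * d), rc.domain_list)
end

roles = [
    RoleSketch("in_house_counsel", ["gc@corp.example"], [r"^legal\."], String[], true),
    RoleSketch("outside_firm", String[], Regex[], ["firm.example"], true),
    RoleSketch("executive", ["ceo@corp.example"], Regex[], String[], false),
]

# Collect every matching label in order, then derive the counsel flag.
function annotate(addr, roles)
    matched = filter(rc -> matches_role(addr, rc), roles)
    return (roles = String[rc.label for rc in matched],
            is_counsel = any(rc -> rc.is_counsel, matched))
end

a = annotate("jane@mail.firm.example", roles)   # matches via the ".<domain>" rule
b = annotate("ceo@corp.example", roles)         # matches a non-counsel role
c = annotate("nobody@other.example", roles)     # matches nothing
```

A node can carry several role labels at once, which is why the sketch collects all matches rather than stopping at the first.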
Arguments
- node_reg::DataFrame: Node registry with at least a :node column of address strings.
- cfg::CorpusConfig: Configuration carrying the roles vector to apply.
Returns
A copy of node_reg with two additional columns:
- :roles::Vector{String} — list of role labels matched for each node (empty if none).
- :is_counsel::Bool — true if the node matched any counsel role.
Example
node_reg = find_roles(base_node_reg, cfg)
counsel_nodes = filter(r -> r.is_counsel, eachrow(node_reg))
DiscoveryGraph.identify_counsel_communities — Function
identify_counsel_communities(result::DataFrame, cfg::CorpusConfig) -> DataFrame
Tentatively identify which Leiden communities contain counsel nodes using cfg.roles.
Applies the same role-matching logic as find_roles directly to the Leiden output, without requiring a manually curated node registry. Use this immediately after leiden_communities to identify which community IDs to focus on — replacing the need to call review_all_communities and scan output by eye.
Arguments
- result::DataFrame: Leiden output with at least :node and :community_id columns.
- cfg::CorpusConfig: Configuration carrying the roles vector to apply.
Returns
A DataFrame with one row per community containing at least one counsel node:
- :community_id — Leiden community identifier.
- :n_members — total nodes in the community.
- :n_counsel — nodes matching any counsel role.
- :roles — unique role labels present (e.g. ["in_house_counsel"]).
- :counsel_nodes — addresses of matched counsel nodes.
Sorted by :n_counsel descending. Returns an empty DataFrame if no counsel nodes are found (check cfg.roles is correctly populated).
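The shape of this result can be illustrated without DataFrames. The sketch below is an assumption-laden toy: the is_counsel predicate is hypothetical (the package applies the cfg.roles rules instead), the addresses are invented, and rows are plain NamedTuples rather than DataFrame rows.

```julia
# Toy Leiden output: one (node, community_id) record per node.
result = [
    (node = "a@corp.example", community_id = 1),
    (node = "gc@corp.example", community_id = 2),
    (node = "b@corp.example", community_id = 2),
    (node = "lawyer1@firm.example", community_id = 3),
    (node = "lawyer2@firm.example", community_id = 3),
    (node = "c@corp.example", community_id = 3),
]

# Hypothetical counsel test standing in for the cfg.roles matching rules.
is_counsel(addr) = addr == "gc@corp.example" || endswith(addr, "@firm.example")

function counsel_summary(result)
    rows = NamedTuple[]
    for cid in sort(unique(r.community_id for r in result))
        members = [r.node for r in result if r.community_id == cid]
        counsel = filter(is_counsel, members)
        isempty(counsel) && continue   # keep only communities with counsel
        push!(rows, (community_id = cid, n_members = length(members),
                     n_counsel = length(counsel), counsel_nodes = counsel))
    end
    return sort(rows; by = r -> r.n_counsel, rev = true)   # most counsel first
end

table = counsel_summary(result)
```

Community 1 has no counsel node and is dropped, so the toy result has two rows.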
Example
result = leiden_communities(g, all_nodes; resolution=1.0, seed=42)
identify_counsel_communities(result, cfg)
# community_id  n_members  n_counsel  roles                 counsel_nodes
# 9             142        6          ["in_house_counsel"]  ["sara.shackleton@enron.com", ...]
DiscoveryGraph.audit_counsel_coverage — Function
audit_counsel_coverage(corpus, node_reg, cfg; keywords, broadcast_min_recipients) -> NamedTuple
Scan the corpus for attorney-flavored messages where no party is a known counsel node.
Identifies potential gaps in cfg.roles — senders who write about legal topics but were not captured by the role-matching rules. Most results will be broadcast announcements (high broadcast_fraction); outliers with low broadcast_fraction and many messages are candidates for manual review and possible addition to cfg.roles.
Messages are filtered by subject keyword match (case-insensitive). A message is excluded from results if the sender or any recipient is already in node_reg as counsel. Bot senders (per cfg) are also excluded.
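The keyword filter and the broadcast fraction can be sketched as follows. The keyword list here is hypothetical (the contents of ATTORNEY_KEYWORDS are not reproduced on this page), and the sketch assumes that broadcast_fraction is the share of a sender's flagged messages that meet the recipient threshold.

```julia
# Hypothetical keyword list; the package default is ATTORNEY_KEYWORDS.
keywords = ["privileged", "attorney", "legal"]

# Case-insensitive subject match against any keyword.
has_keyword(subject, kws) = any(k -> occursin(lowercase(k), lowercase(subject)), kws)

# Toy messages from a single non-counsel sender.
msgs = [
    (subject = "Legal hold notice", n_recipients = 40),
    (subject = "RE: Attorney comments on draft", n_recipients = 2),
    (subject = "Lunch?", n_recipients = 1),
    (subject = "PRIVILEGED: settlement", n_recipients = 1),
]

broadcast_min_recipients = 5
flagged = filter(m -> has_keyword(m.subject, keywords), msgs)
n_broadcast = count(m -> m.n_recipients >= broadcast_min_recipients, flagged)
broadcast_fraction = n_broadcast / length(flagged)
```

A sender like this one, with a low broadcast fraction and several flagged messages, is the kind of outlier the audit is meant to surface.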
Arguments
- corpus::DataFrame: Full corpus as returned by load_corpus.
- node_reg::DataFrame: Node registry with :is_counsel column from find_roles.
- cfg::CorpusConfig: Configuration supplying column names and bot rules.
- keywords: Subject keywords to match (default: ATTORNEY_KEYWORDS).
- broadcast_min_recipients: Recipient count at or above which a message is flagged as a broadcast (default: 5).
Returns
A NamedTuple with:
- :suspicious_senders::DataFrame — one row per non-counsel sender, columns :sender, :n_messages, :n_broadcast, :broadcast_fraction, :sample_subjects. Sorted by :n_messages descending.
- :uncovered_count::Int — total attorney-flavored messages with no counsel party.
- :keywords_used::Vector{String} — the keyword list applied.
Example
node_reg = find_roles(base_node_reg, cfg)
audit = audit_counsel_coverage(corpus, node_reg, cfg)
# Filter to non-broadcast candidates for cfg.roles additions:
filter(r -> r.broadcast_fraction < 0.5, audit.suspicious_senders)
DiscoveryGraph.ATTORNEY_KEYWORDS — Constant
ATTORNEY_KEYWORDS
Default subject-line keywords used by audit_counsel_coverage to identify messages that discuss legal topics but involve no known counsel party. A message whose subject contains any of these terms (case-insensitive) is a candidate for review when neither its sender nor any recipient is in the counsel node set.
Pass a custom list as the keywords argument to audit_counsel_coverage to override.
Interactive Session
DiscoveryGraph.DiscoverySession — Type
DiscoverySession
Primary interactive interface for exploring a communication network and its detected communities. Bundles the three DataFrames and the configuration that all inspection functions require, eliminating repetitive argument passing.
Fields
- corpus_df::DataFrame: The full message corpus, with columns named according to cfg.
- result::DataFrame: Community membership table with columns :node and :community_id.
- edge_df::DataFrame: Broadcast-discounted edge table from build_edges, with columns :sender, :recipient, :date, and :weight.
- cfg::CorpusConfig: The corpus configuration (column names, date bounds, roles, etc.).
- leiden_seed::Int: Random seed passed to leiden_communities (default 42). Recorded in the Rule 26(f) memo for reproducibility documentation.
- leiden_resolution::Float64: Resolution parameter passed to leiden_communities (default 1.0). Recorded in the Rule 26(f) memo.
Pass all six fields to record non-default Leiden parameters in the methodology statement:
S = DiscoverySession(corpus_df, leiden_result, edge_df, cfg, seed, resolution)
The 4-argument form defaults to leiden_seed=42, leiden_resolution=1.0.
Example
S = DiscoverySession(corpus_df, leiden_result, edge_df, cfg)
eyeball(S, 6; mode=:block, block=(DateTime(2000,7,1), DateTime(2000,7,31)), n=20)
inspect_community(S, 6)
DiscoveryGraph.eyeball — Function
eyeball(S::DiscoverySession, cid::Integer;
        mode=:random, n=25, start=nothing, stop=nothing, block=nothing)
Print a sample of message headers from a single community to the console.
Filters the corpus to messages sent by members of community cid within the time window, then prints timestamp, sender, and subject for up to n messages.
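The window-then-sample behaviour can be sketched with the standard library alone. The helper name sample_headers, the inclusive window bounds, and the toy message records are all assumptions for illustration; the real function reads its column names from cfg and prints rather than returns.

```julia
using Dates, Random

# Toy messages, one per day in July 2000.
msgs = [(date = DateTime(2000, 7, d), sender = "a@corp.example", subject = "msg $d")
        for d in 1:31]

# Filter to a window, then sample: :chrono takes the first n by date,
# :random shuffles before taking n.
function sample_headers(msgs, start, stop; mode = :random, n = 25,
                        rng = Random.default_rng())
    window = filter(m -> start <= m.date <= stop, msgs)
    ordered = mode === :chrono ? sort(window; by = m -> m.date) : shuffle(rng, window)
    return first(ordered, n)
end

lo, hi = DateTime(2000, 7, 10), DateTime(2000, 7, 20)
chrono = sample_headers(msgs, lo, hi; mode = :chrono, n = 5)
shuffled = sample_headers(msgs, lo, hi; n = 5, rng = MersenneTwister(42))
```

Passing an explicit rng makes a random sample repeatable, which can help when two reviewers need to look at the same messages.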
Arguments
- S::DiscoverySession: The active discovery session.
- cid::Integer: Community ID to inspect.
- mode: Sampling mode — :random (default) shuffles before taking n; :chrono takes the first n in chronological order.
- n: Maximum number of messages to display (default: 25).
- start: Window start (DateTime); defaults to S.cfg.baseline_start.
- stop: Window end (DateTime); defaults to S.cfg.baseline_end.
- block: A (start, stop) tuple of DateTime values; sets the window and forces mode=:chrono. Overrides start/stop when provided.
Returns
nothing (output goes to stdout).
Example
eyeball(S, 9; mode=:chrono, n=10)
eyeball(S, 6; block=(DateTime(2000,7,1), DateTime(2000,7,31)), n=20)
DiscoveryGraph.inspect_community — Function
inspect_community(S::DiscoverySession, cid::Integer)
Print a structural summary of a single community to the console.
Displays the community's member count, total internal edge count, and the top-5 internal senders by message volume.
Arguments
- S::DiscoverySession: The active discovery session.
- cid::Integer: Community ID to summarise.
Returns
nothing (output goes to stdout).
Example
inspect_community(S, 6)
DiscoveryGraph.inspect_bridge — Function
inspect_bridge(S::DiscoverySession, cid_a::Integer, cid_b::Integer;
               start=nothing, stop=nothing)
Print the count of cross-community edges between two communities within a time window.
Identifies edges where one endpoint is a member of cid_a and the other is a member of cid_b (in either direction), filtered to the specified date range.
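The either-direction test can be sketched as below. The helper bridge_count, the membership dictionary, and the inclusive date bounds are assumptions for illustration; the real function works on S.edge_df and S.result and prints its result.

```julia
using Dates

# Toy community membership and edges; node names are illustrative.
community = Dict("a" => 9, "b" => 9, "c" => 6, "d" => 6, "e" => 3)
edges = [
    (sender = "a", recipient = "c", date = DateTime(2000, 10, 5)),
    (sender = "d", recipient = "b", date = DateTime(2000, 11, 2)),
    (sender = "a", recipient = "b", date = DateTime(2000, 11, 3)),  # internal to 9
    (sender = "c", recipient = "a", date = DateTime(2001, 1, 9)),   # outside window
    (sender = "e", recipient = "c", date = DateTime(2000, 10, 7)),  # other community
]

# Count edges with one endpoint in cid_a and the other in cid_b, in either
# direction, whose date falls inside [start, stop].
function bridge_count(edges, community, cid_a, cid_b, start, stop)
    count(edges) do e
        s = get(community, e.sender, nothing)
        r = get(community, e.recipient, nothing)
        in_window = start <= e.date <= stop
        in_window && ((s == cid_a && r == cid_b) || (s == cid_b && r == cid_a))
    end
end

n_bridge = bridge_count(edges, community, 9, 6,
                        DateTime(2000, 10, 1), DateTime(2000, 12, 31))
```

Because the test is symmetric, swapping cid_a and cid_b gives the same count.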
Arguments
- S::DiscoverySession: The active discovery session.
- cid_a::Integer: First community ID.
- cid_b::Integer: Second community ID.
- start: Window start (DateTime); defaults to S.cfg.baseline_start.
- stop: Window end (DateTime); defaults to S.cfg.baseline_end.
Returns
nothing (output goes to stdout).
Example
inspect_bridge(S, 9, 6; start=DateTime(2000,10,1), stop=DateTime(2000,12,31))
DiscoveryGraph.review_all_communities — Function
review_all_communities(S::DiscoverySession; n=10, start=nothing, stop=nothing)
Run eyeball on every community in the session and print the results sequentially.
Communities are processed in ascending community_id order. Useful for a first-pass review of all communities immediately after Leiden detection.
Arguments
- S::DiscoverySession: The active discovery session.
- n: Maximum messages to display per community (default: 10).
- start: Window start (DateTime); defaults to S.cfg.baseline_start.
- stop: Window end (DateTime); defaults to S.cfg.baseline_end.
Returns
nothing (output goes to stdout).
Example
review_all_communities(S; n=5)
Privilege Triage
DiscoveryGraph.TierClass — Type
TierClass
Five-tier classification for privilege log triage, used by generate_outputs.
Variants
- Tier1: High-priority privilege review — litigation anticipation or active regulatory investigation. Requires immediate human review.
- Tier2: Secondary privilege review — regulatory compliance or direct legal advice. Requires human review after Tier 1.
- Tier3: Transactional legal work — privilege likely waived in transactional context. Deprioritised; review if time permits.
- Tier4: Unclassified — counsel is involved but no keyword from any tier list matched. Human judgment required.
- Tier5: No counsel involvement — excluded from privilege review queue.
DiscoveryGraph.generate_outputs — Function
generate_outputs(S::DiscoverySession, node_reg::DataFrame)
    -> NamedTuple{(:community_table, :review_queue, :tier1, :tier2, :tier3, :tier4, :anomaly_list)}
Generate the primary discovery outputs from a DiscoverySession.
Processes every message in S.corpus_df and identifies those involving at least one counsel party. Counsel is detected via two complementary paths:
- Graph-node counsel: parties present in node_reg with is_counsel = true (derived from find_roles).
- Pattern-matched counsel: parties absent from the graph (e.g., outside counsel at firm domains excluded by cfg.internal_domain) are checked directly against cfg.roles using the same domain/pattern/address rules as find_roles. This closes the privilege gap where messages to outside counsel are missed when the communication graph is restricted to internal addresses only.
Each matched message is added to the review queue with the roles implicated and a keyword-based tier assignment.
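A keyword-based tier assignment of this kind might look like the sketch below. The keyword lists are hypothetical (the package's actual lists are not shown on this page), and the sketch assumes a case-insensitive subject match with the highest-priority tier winning. Integers 1 to 5 stand in for the TierClass variants.

```julia
# Hypothetical tier keyword lists, checked in priority order.
const TIER_KEYWORDS_SKETCH = [
    (tier = 1, keywords = ["litigation", "subpoena", "investigation"]),
    (tier = 2, keywords = ["compliance", "legal advice"]),
    (tier = 3, keywords = ["contract", "agreement"]),
]

# Returns 1-3 for the first tier list with a match, 4 when counsel is involved
# but no keyword matched, and 5 when no counsel party is involved.
function assign_tier(subject::AbstractString, counsel_involved::Bool)
    counsel_involved || return 5
    s = lowercase(subject)
    for entry in TIER_KEYWORDS_SKETCH
        any(k -> occursin(k, s), entry.keywords) && return entry.tier
    end
    return 4
end

tiers = [assign_tier("RE: Subpoena response", true),
         assign_tier("Compliance training", true),
         assign_tier("Draft agreement", true),
         assign_tier("Lunch", true),
         assign_tier("Subpoena", false)]
```

Only tiers 1 through 4 would enter the review queue; a tier 5 result means the message is excluded.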
node_reg must be the output of find_roles(node_reg, cfg) — it must contain columns :roles and :is_counsel.
Arguments
- S::DiscoverySession: The active discovery session.
- node_reg::DataFrame: Node registry annotated by find_roles, with columns :node, :community_id, :roles, :is_counsel, and optionally :is_kernel.
Returns
A NamedTuple with:
- community_table::DataFrame — subset of node_reg with columns :node, :community_id, :roles, :is_counsel, and :is_kernel (when present).
- review_queue::DataFrame — all Tier1–4 messages combined; columns :hash, :date, :sender, :recipients, :subject, :roles_implicated, :tier (TierClass), :basis.
- tier1–tier4::DataFrame — per-tier subsets of review_queue for direct access.
- anomaly_list::DataFrame — empty placeholder (anomaly detection performed separately by detect_anomalies); columns :node, :week_start, :anomaly_type, :z_score, :basis.
Example
node_reg = find_roles(base_reg, cfg)
S = DiscoverySession(corpus, result, edges, cfg)
outputs = generate_outputs(S, node_reg)
memo = generate_rule26f_memo(S, outputs)
Outputs
DiscoveryGraph.write_outputs — Function
write_outputs(S::DiscoverySession, outputs::NamedTuple, dir::AbstractString;
              overwrite::Bool = false) -> NamedTuple
Write per-tier DataFrames and the Rule 26(f) memo to dir.
Creates the directory if it does not exist. By default refuses to overwrite existing files; set overwrite = true to replace them.
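The create-then-refuse behaviour can be sketched with Base file functions. The helper guarded_write is hypothetical and not part of the package API; it also assumes that all targets are checked before anything is written, which the description above does not confirm.

```julia
# Hypothetical helper: write name => content pairs into dir, refusing to
# replace existing files unless overwrite is true.
function guarded_write(dir::AbstractString, files::Dict{String,String};
                       overwrite::Bool = false)
    mkpath(dir)   # create the directory if it does not exist
    paths = Dict(name => abspath(joinpath(dir, name)) for name in keys(files))
    if !overwrite
        existing = sort([p for p in values(paths) if isfile(p)])
        isempty(existing) || error("refusing to overwrite: " * join(existing, ", "))
    end
    for (name, content) in files
        write(paths[name], content)
    end
    return paths   # absolute path for each file written
end

dir = mktempdir()
paths = guarded_write(dir, Dict("rule26f_memo.md" => "# Memo\n"))
```

Checking every target up front means a refused call leaves the directory untouched instead of half-written.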
Files written
| File | Contents |
|---|---|
| tier1.arrow | Tier 1 review queue (litigation / regulatory) |
| tier2.arrow | Tier 2 review queue (legal advice / compliance) |
| tier3.arrow | Tier 3 review queue (transactional) |
| tier4.arrow | Tier 4 review queue (no keyword signal) |
| review_queue.arrow | Combined Tier 1–4 queue |
| rule26f_memo.md | Rule 26(f)(3)(D) methodology statement |
The Arrow files are the intended input to a privilege-review UI (not yet built). They are not designed for direct attorney use; do not export to CSV or spreadsheet. The memo is attorney-ready as written.
Arguments
- S::DiscoverySession: Active session (used to generate the memo).
- outputs::NamedTuple: Return value of generate_outputs(S, node_reg).
- dir::AbstractString: Destination directory path.
- overwrite::Bool: If false (default), error if any output file already exists.
Returns
A NamedTuple of absolute paths for each file written.
Example
outputs = generate_outputs(S, node_reg)
paths = write_outputs(S, outputs, "discovery_export")
@info "Memo" path=paths.memo
Temporal Analysis
DiscoveryGraph.detect_anomalies — Function
detect_anomalies(history_df::DataFrame, cfg::CorpusConfig) -> DataFrame
Detect statistically anomalous weekly message-volume spikes in a node history table.
For each node with at least 3 weeks of history, computes the mean and standard deviation of message_count. Any week where the count exceeds the node mean by cfg.anomaly_zscore_threshold standard deviations is flagged as a "volume_spike". Nodes with near-zero standard deviation (< 1e-9) are skipped.
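For a single node, the rule above can be sketched with the Statistics standard library. The helper spike_weeks is hypothetical, and the sketch assumes the sample standard deviation (Julia's default for std); the package may use a different estimator.

```julia
using Statistics

# Return the indices of weeks whose z-score exceeds the threshold.
function spike_weeks(counts::AbstractVector{<:Real}, threshold::Real)
    length(counts) < 3 && return Int[]   # need at least 3 weeks of history
    μ, σ = mean(counts), std(counts)
    σ < 1e-9 && return Int[]             # near-zero variance: skip the node
    return [i for (i, c) in enumerate(counts) if (c - μ) / σ > threshold]
end

counts = [10, 12, 11, 9, 10, 60, 11, 10]   # toy weekly message counts
spikes = spike_weeks(counts, 2.0)
```

Here only the sixth week, with 60 messages against a mean of about 16.6, clears a threshold of 2.0.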
Arguments
- history_df::DataFrame: Weekly node history from build_node_history, with columns :node, :week_start, and :message_count.
- cfg::CorpusConfig: Configuration supplying anomaly_zscore_threshold.
Returns
DataFrame with one row per detected anomaly and columns:
- :node::String — node address.
- :week_start::Date — Monday of the anomalous week.
- :anomaly_type::String — currently always "volume_spike".
- :z_score::Float64 — z-score of the anomalous week's message count (rounded to 2 decimal places).
- :basis::String — human-readable explanation including raw count, z-score, and node mean.
Example
anomalies = detect_anomalies(history, cfg)
filter(r -> r.node == "alice@corp.com", anomalies)
Community Vocabulary
DiscoveryGraph.build_community_vocabulary — Function
build_community_vocabulary(corpus_df::DataFrame, community_table::DataFrame,
                           cfg::CorpusConfig) -> Dict{Int32, Vector{Pair{String,Float64}}}
Build a TF-IDF vocabulary for each community from the corpus subject lines.
This function is not yet implemented. It returns empty term lists for every community. Full TF-IDF computation is a future deliverable.
Arguments
- corpus_df::DataFrame: The full message corpus with columns named according to cfg.
- community_table::DataFrame: Community membership table with a :community_id column.
- cfg::CorpusConfig: Configuration supplying stopwords and column name mappings.
Returns
Dict{Int32, Vector{Pair{String,Float64}}} mapping each community ID to a list of (term => tfidf_score) pairs sorted by descending score. In v0.1.0 every community maps to an empty vector.
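Since the function is a stub, the sketch below only illustrates what a populated return value of this type could look like. It uses one conventional TF-IDF formulation (term frequency times log of inverse document frequency) and treats each community's subject lines as a single document. Both choices are assumptions, as are the helper names and the toy subjects; stopword handling is omitted.

```julia
# Toy subject lines grouped by community ID.
subjects = Dict{Int32,Vector{String}}(
    Int32(6) => ["gas swap confirm", "swap pricing"],
    Int32(9) => ["legal review", "swap legal memo"],
)

# Lowercase whitespace tokenisation across all lines of one community.
tokens(lines) = [String(t) for l in lines for t in split(lowercase(l))]

function tfidf_sketch(subjects)
    docs = Dict(cid => tokens(lines) for (cid, lines) in subjects)
    n_docs = length(docs)
    df = Dict{String,Int}()   # number of communities containing each term
    for toks in values(docs), t in unique(toks)
        df[t] = get(df, t, 0) + 1
    end
    vocab = Dict{Int32,Vector{Pair{String,Float64}}}()
    for (cid, toks) in docs
        scores = Dict{String,Float64}()
        for t in toks   # accumulate tf * idf one occurrence at a time
            scores[t] = get(scores, t, 0.0) + (1 / length(toks)) * log(n_docs / df[t])
        end
        vocab[cid] = sort(collect(scores); by = last, rev = true)
    end
    return vocab
end

vocab = tfidf_sketch(subjects)
```

A term that appears in every community, such as "swap" here, scores zero and sinks to the bottom of each list.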
Example
vocab = build_community_vocabulary(corpus_df, community_table, cfg)
# vocab[6] => Pair{String,Float64}[] (stub; always empty in v0.1.0)
Rule 26(f) Documentation
DiscoveryGraph.generate_rule26f_memo — Function
generate_rule26f_memo(S::DiscoverySession, outputs::NamedTuple) -> String
Generate a Rule 26(f)(3)(D) privilege log methodology statement as a Markdown string.
Produces a structured memo suitable for filing or service that documents:
- Corpus size and the reduction ratio achieved by the review queue.
- Community detection algorithm, parameters, and thresholds.
- Attorney/role roster derived from outputs.community_table.
- The five-tier classification scheme and the v0.1.0 semantic analysis caveat.
- A reproducibility reference (Zenodo DOI pending in v0.1.0).
outputs must be the result of generate_outputs(S, node_reg) where node_reg was produced by find_roles. The outputs.community_table must contain columns :is_counsel and :roles.
Arguments
- S::DiscoverySession: The active discovery session (supplies corpus size and cfg).
- outputs::NamedTuple: Named tuple returned by generate_outputs, with fields community_table, review_queue, and anomaly_list.
Returns
A String containing the complete methodology memo in Markdown format.
Example
node_reg = find_roles(base_reg, cfg)
outputs = generate_outputs(S, node_reg)
memo = generate_rule26f_memo(S, outputs)
write("rule26f_memo.md", memo)