Schema Layer
The schema layer defines configuration types, corpus validation, and corpus-specific reference data. All pipeline functions accept a CorpusConfig that maps arbitrary corpus column names to canonical fields.
Types
DiscoveryGraph.CounselType — Type
CounselType
Enum classifying the type of legal counsel associated with a network node.
Variants
- NotCounsel: The node is not identified as legal counsel.
- InHouse: The node is identified as in-house legal counsel (employee of the organization).
- OutsideFirm: The node is identified as outside legal counsel (external law firm).
- RegulatoryAdvisor: The node is a non-attorney staff member who routinely handles regulatory or litigation-adjacent correspondence (e.g., government affairs, compliance). Sets is_counsel = true so messages enter the review queue, but the role label distinguishes them from attorneys in the methodology memo and community table. Messages involving only RegulatoryAdvisor parties are not presumptively privileged; they require separate legal analysis to determine privilege status.
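The distinction between "enters the review queue" and "is an attorney" can be sketched as below. This is a minimal illustration, not the package's definition: CounselKind, enters_review_queue, and is_attorney are hypothetical names, and only the variant names come from the list above.

```julia
# Hypothetical stand-in for CounselType; variant names follow the list above.
@enum CounselKind NotCounsel InHouse OutsideFirm RegulatoryAdvisor

# Assumption: every variant except NotCounsel sets is_counsel = true,
# so its messages enter the review queue.
enters_review_queue(k::CounselKind) = k != NotCounsel

# RegulatoryAdvisor is queued for review but is not an attorney role.
is_attorney(k::CounselKind) = k in (InHouse, OutsideFirm)
```

Keeping the two predicates separate lets a reviewer queue RegulatoryAdvisor traffic without labelling it as attorney communication.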
DiscoveryGraph.RoleConfig — Type
RoleConfig
Configuration for identifying nodes that hold a particular legal or organizational role.
Each RoleConfig defines one role (e.g., "in_house_counsel") and the address-matching rules used to assign nodes to that role during privilege triage.
Fields
- label::String: Human-readable role name (e.g., "in_house_counsel", "outside_counsel").
- counsel_type::CounselType: Whether this role constitutes legal counsel, given as a CounselType variant (e.g., InHouse, OutsideFirm, or NotCounsel).
- address_patterns::Vector{Regex}: Regex patterns matched against node addresses. Any match assigns the role.
- domain_list::Vector{String}: Email domains whose addresses are assigned the role.
- explicit_addresses::Set{String}: Exact email addresses that are unconditionally assigned the role.
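One way the three matching rules could combine is sketched below. RoleRule and matches_role are hypothetical names that mirror the fields above; the lowercasing and the order of the checks are assumptions, not documented behaviour of the package.

```julia
# Hypothetical mirror of the address-matching fields described above.
struct RoleRule
    label::String
    address_patterns::Vector{Regex}
    domain_list::Vector{String}
    explicit_addresses::Set{String}
end

# Return true if `address` satisfies any of the three rule kinds.
function matches_role(rule::RoleRule, address::AbstractString)
    addr = lowercase(strip(address))           # assumption: matching ignores case
    addr in rule.explicit_addresses && return true
    parts = split(addr, '@')
    domain = length(parts) == 2 ? parts[2] : ""
    domain in rule.domain_list && return true
    return any(p -> occursin(p, addr), rule.address_patterns)
end

rule = RoleRule("outside_counsel", [r"@lawfirm\.com$"],
                ["lawfirm.com"], Set(["partner@lawfirm.com"]))
```

In this sketch `matches_role(rule, "assoc@lawfirm.com")` succeeds through the domain list, while an address at an unrelated domain matches none of the rules.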
Example
rc = RoleConfig(
    "outside_counsel",
    OutsideFirm,
    [r".*@lawfirm\.com"],
    ["lawfirm.com"],
    Set(["partner@lawfirm.com"]),
)
DiscoveryGraph.CorpusConfig — Type
CorpusConfig(; sender, recipients_to, recipients_cc, timestamp, subject, hash, lastword,
             corpus_start, corpus_end, baseline_start, baseline_end, roles,
             extra_columns, internal_domain, bot_patterns, bot_domains, bot_senders,
             broadcast_discount, kernel_threshold, kernel_jaccard_min,
             anomaly_zscore_threshold, semantic_classifier, stopwords)
Configuration struct that fully describes a corpus and its analysis parameters. All pipeline functions accept a CorpusConfig to remain corpus-agnostic.
Required keyword arguments
- sender::Symbol: Column name for the message sender address.
- recipients_to::Symbol: Column name for the To-recipients field (stored as a stringified list).
- recipients_cc::Symbol: Column name for the CC-recipients field (stored as a stringified list).
- timestamp::Symbol: Column name for the message timestamp (DateTime).
- subject::Symbol: Column name for the subject line.
- hash::Symbol: Column name for the unique message identifier.
- lastword::Symbol: Column name for a corpus-specific auxiliary text field.
- corpus_start::DateTime: Earliest date of the full corpus window.
- corpus_end::DateTime: Latest date of the full corpus window.
- baseline_start::DateTime: Start of the community-detection baseline period.
- baseline_end::DateTime: End of the community-detection baseline period.
- roles::Vector{RoleConfig}: Role definitions used by find_roles.
Optional keyword arguments
- extra_columns::Vector{Symbol}: Additional corpus columns to preserve (default: Symbol[]).
- internal_domain::String: Domain suffix used to restrict edges to internal senders/recipients. Empty string disables filtering (default: "").
- bot_patterns::Vector{Regex}: Regex patterns identifying broadcast/bot senders (default: Regex[]).
- bot_domains::Vector{String}: Domains whose senders are treated as bots (default: String[]).
- bot_senders::Set{String}: Explicit sender addresses treated as bots (default: empty set).
- broadcast_discount::Function: Weight function n -> Float64, where n is the recipient count (default: n -> 1/log(n+2)).
- kernel_threshold::Float64: Fraction of baseline weeks a node must appear in to be a kernel member (default: 2/3).
- kernel_jaccard_min::Float64: Minimum Jaccard similarity to match a community across snapshots (default: 0.6).
- anomaly_zscore_threshold::Float64: Z-score threshold for volume spike detection (default: 2.0).
- semantic_classifier::Function: Message classifier (df, cfg) -> df; default is a no-op stub.
- stopwords::Set{String}: Words excluded from TF-IDF vocabulary (default: built-in English stopword list).
- hotbutton_keywords::Vector{String}: Case-specific escalation terms supplied by the user; any match assigns Tier 1 before standard keyword lists are checked. Disclosed explicitly in the Rule 26(f) memo (default: String[]).
- tier1_keywords::Vector{String}: Standard litigation/regulatory keywords (default: DEFAULT_TIER1_KEYWORDS).
- tier2_keywords::Vector{String}: Standard legal-advice keywords (default: DEFAULT_TIER2_KEYWORDS).
- tier3_keywords::Vector{String}: Standard transactional keywords (default: DEFAULT_TIER3_KEYWORDS).
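To give a feel for the numeric defaults above, the sketch below evaluates the quoted discount formula and shows textbook versions of the Jaccard and z-score quantities that the thresholds refer to. Only the formula n -> 1/log(n+2) comes from this page; jaccard and zscore_latest are hypothetical helpers, and the package may compute these statistics differently.

```julia
using Statistics

# Default discount quoted above: the weight shrinks as recipient count n grows.
broadcast_discount(n) = 1 / log(n + 2)

# Jaccard similarity of two membership sets: |A ∩ B| / |A ∪ B|.
# Assumption: two empty sets are treated as having similarity 0.
jaccard(a::Set, b::Set) =
    isempty(a) && isempty(b) ? 0.0 : length(intersect(a, b)) / length(union(a, b))

# Z-score of the latest value against the mean and sample standard
# deviation of earlier values (returns 0.0 when the history is constant).
function zscore_latest(history::Vector{<:Real}, latest::Real)
    σ = std(history)
    return σ == 0 ? 0.0 : (latest - mean(history)) / σ
end
```

Under these definitions a message to one recipient weighs 1/log(3), a 50-recipient broadcast weighs much less, and two communities sharing two of four distinct members have Jaccard similarity 0.5, which falls below the 0.6 default.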
Example
cfg = CorpusConfig(
    sender          = :sender,
    recipients_to   = :tos,
    recipients_cc   = :ccs,
    timestamp       = :date,
    subject         = :subj,
    hash            = :hash,
    lastword        = :lastword,
    corpus_start    = DateTime(2000, 1, 1),
    corpus_end      = DateTime(2002, 12, 31),
    baseline_start  = DateTime(2000, 7, 1),
    baseline_end    = DateTime(2000, 9, 30),
    roles           = [in_house_role, outside_role],
    internal_domain = "corp.com",
)
Corpus Loading and Validation
DiscoveryGraph.load_corpus — Function
load_corpus(df::DataFrame, cfg::CorpusConfig) -> DataFrame
Validate that a corpus DataFrame satisfies the requirements of cfg and return it unchanged.
Checks that:
- cfg.corpus_start < cfg.corpus_end and cfg.baseline_start < cfg.baseline_end.
- All required columns (sender, recipients_to, recipients_cc, timestamp, subject, hash, lastword) are present.
- The sender, hash, and timestamp columns contain no missing values.
Throws ArgumentError on any violation. If all checks pass, returns df unmodified so the call can be composed in a pipeline.
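The checks listed above can be sketched without the package as follows. This is an illustration under stated assumptions: check_corpus is a hypothetical function, it operates on a NamedTuple of column vectors rather than a DataFrame so that it runs with the standard library alone, and it covers only the corpus date bounds.

```julia
using Dates

# Hypothetical validator over a NamedTuple of column vectors
# (a stand-in for a DataFrame in this sketch).
function check_corpus(table::NamedTuple, required::Vector{Symbol},
                      no_missing::Vector{Symbol},
                      corpus_start::DateTime, corpus_end::DateTime)
    corpus_start < corpus_end ||
        throw(ArgumentError("corpus_start must precede corpus_end"))
    for col in required
        haskey(table, col) ||
            throw(ArgumentError("missing required column: $col"))
    end
    for col in no_missing
        any(ismissing, table[col]) &&
            throw(ArgumentError("column $col contains missing values"))
    end
    return table   # returned unchanged so the call composes in a pipeline
end
```

Returning the input itself, rather than a copy, is what allows the validator to sit in the middle of a pipeline at no extra cost.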
Arguments
- df::DataFrame: Raw corpus to validate.
- cfg::CorpusConfig: Configuration describing expected column names and date bounds.
Returns
The input df unchanged if valid.
Example
cfg = enron_config()
corpus = load_corpus(raw_df, cfg)
Enron Reference Configuration
DiscoveryGraph.enron_config — Function
enron_config() -> CorpusConfig
Return a CorpusConfig pre-configured for the Enron email corpus.
The configuration encodes:
- Column name mapping for the Enron Arrow schema (:sender, :tos, :ccs, :date, :subj, :hash, :lastword).
- Corpus window: 1999-01-01 to 2002-12-31.
- Baseline period: Q3 2000 (2000-07-01 to 2000-09-30).
- Internal domain: "enron.com" (only @enron.com ↔ @enron.com edges are built).
- Bot/broadcast sender patterns and explicit bot addresses derived from the Enron corpus.
- Two role definitions:
  - "in_house_counsel" (InHouse): 21 named Enron in-house attorneys by explicit address, including General Counsel James Derrick and attorneys surfaced by audit_counsel_coverage.
  - "outside_counsel" (OutsideFirm): 13 firm domains including Vinson & Elkins, Bracewell & Patterson, Andrews Kurth, Sullivan & Cromwell, Weil Gotshal, Gibbs & Bruns, Jones Day, and others.
Returns
A fully populated CorpusConfig ready to pass to load_corpus, build_edges, and the rest of the DiscoveryGraph pipeline.
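The internal-domain rule in the configuration list (only @enron.com ↔ @enron.com edges are built) can be pictured with the sketch below. is_internal and internal_edges are hypothetical helpers; matching on an exact "@domain" suffix, ignoring case, is an assumption about how the filter behaves, and build_edges may differ.

```julia
# Hypothetical helper: true when an address belongs to the internal domain.
# An empty domain disables filtering, as described for internal_domain.
is_internal(addr::AbstractString, domain::AbstractString) =
    isempty(domain) || endswith(lowercase(addr), "@" * lowercase(domain))

# Keep only sender => recipient pairs where both ends are internal.
internal_edges(pairs, domain) =
    [p for p in pairs if is_internal(first(p), domain) && is_internal(last(p), domain)]

pairs = ["a@enron.com" => "b@enron.com", "a@enron.com" => "x@lawfirm.com"]
```

With the domain set to "enron.com" only the first pair survives; with an empty domain both pairs are kept.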
Example
cfg = enron_config()
corpus = load_corpus(raw_df, cfg)
edges = build_edges(corpus, cfg)
DiscoveryGraph.enron_corpus — Function
enron_corpus() -> DataFrame
Load the Enron email corpus from the package artifact store.
Downloads and caches the corpus automatically on first call via Julia's Artifacts system. The artifact is hosted on Zenodo; internet access is required on first use.
Returns
A DataFrame with the Enron corpus in the schema expected by enron_config(): columns :sender, :tos, :ccs, :date, :subj, :hash, :lastword.
Example
cfg = enron_config()
corpus = load_corpus(enron_corpus(), cfg)
edges = build_edges(corpus, cfg)
DiscoveryGraph.ENRON_HOTBUTTON_EXAMPLES — Constant
ENRON_HOTBUTTON_EXAMPLES
Illustrative case-specific escalation terms for the Enron investigation. These are the names of trading schemes, special-purpose entities, and accounting mechanisms that were central to the FERC and SEC investigations.
Pass any subset to enron_config() or build_corpus_config() as hotbutton_keywords to promote matching messages to Tier 1 before standard keyword classification runs.
cfg = enron_config(hotbutton_keywords = ENRON_HOTBUTTON_EXAMPLES)
DiscoveryGraph.ENRON_TIER1_EXAMPLES — Constant
ENRON_TIER1_EXAMPLES
Corpus-specific Tier 1 regulatory keywords for the Enron investigation. ferc (Federal Energy Regulatory Commission) and sec (Securities and Exchange Commission) are the primary enforcement bodies in the Enron case and are not part of DEFAULT_TIER1_KEYWORDS, which contains only matter-independent terms.
enron_config() includes these automatically. For other matters substitute the relevant regulator abbreviations (e.g., ["occ", "fdic"] for a banking matter).
cfg = enron_config() # includes ferc + sec automatically
cfg = build_corpus_config(...,
    tier1_keywords = vcat(DEFAULT_TIER1_KEYWORDS, ["occ", "fdic"]))
Generic Config Builder
DiscoveryGraph.build_corpus_config — Function
build_corpus_config(; internal_domain, corpus_start, corpus_end,
                    baseline_start, baseline_end,
                    in_house_attorneys, outside_firm_domains,
                    hotbutton_keywords, kwargs...) -> CorpusConfig
Construct a CorpusConfig from plain lists; no Julia struct knowledge is required.
Uses the standard Enron column layout (:sender, :tos, :ccs, :date, :subj, :hash, :lastword) and sensible defaults for all network and classification parameters. A paralegal or technician can populate the four domain-specific lists; everything else is handled automatically.
Required arguments
- internal_domain::String: Email domain that defines "internal" nodes (e.g., "enron.com"). Only @domain ↔ @domain edges are built.
- corpus_start, corpus_end: Earliest and latest dates of the full corpus (Date or DateTime).
- baseline_start, baseline_end: Community-detection baseline window (Date or DateTime).
Optional arguments
- in_house_attorneys::Vector{String}: Exact email addresses of in-house counsel (default: String[]).
- outside_firm_domains::Vector{String}: Email domains of outside counsel firms, e.g. ["vinson-elkins.com", "bracepatt.com"] (default: String[]).
- hotbutton_keywords::Vector{String}: Case-specific escalation terms that promote matching messages to Tier 1 ahead of standard keyword lists (default: String[]). See ENRON_HOTBUTTON_EXAMPLES for examples.
- Any additional keyword argument accepted by CorpusConfig (e.g., tier1_keywords, anomaly_zscore_threshold).
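Because the date arguments may be given as Date or DateTime while CorpusConfig stores DateTime, some normalisation step is implied. A minimal sketch of one possible approach follows; to_datetime is a hypothetical helper, and the package's actual conversion is not documented here.

```julia
using Dates

# Hypothetical normaliser: accept either Date or DateTime, return DateTime.
to_datetime(d::DateTime) = d
to_datetime(d::Date) = DateTime(d)   # midnight at the start of that day
```

Note that under this conversion `Date(2002, 12, 31)` becomes midnight at the start of 31 December, so a caller who wants the whole final day included may prefer to pass an explicit DateTime.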
Returns
A fully populated CorpusConfig ready for load_corpus, build_edges, and the rest of the DiscoveryGraph pipeline.
Example
cfg = build_corpus_config(
    internal_domain      = "enron.com",
    corpus_start         = Date(1999, 1, 1),
    corpus_end           = Date(2002, 12, 31),
    baseline_start       = Date(2000, 7, 1),
    baseline_end         = Date(2000, 9, 30),
    in_house_attorneys   = ["sara.shackleton@enron.com", "mark.haedicke@enron.com"],
    outside_firm_domains = ["vinson-elkins.com", "bracepatt.com"],
    hotbutton_keywords   = ["raptors", "ljm", "mark-to-market"],
)
corpus = load_corpus(raw_df, cfg)
XLSX Config Helper
DiscoveryGraph.write_config_template — Function
write_config_template(path::AbstractString) -> String
Write a blank DiscoveryGraph configuration workbook to path.
Creates an .xlsx file with four sheets that a paralegal or technician can populate and return to the developer. Pass the completed file to config_from_xlsx to produce a CorpusConfig.
Sheets
| Sheet | Contents |
|---|---|
| Metadata | Internal domain, corpus/baseline date bounds, schema version |
| InHouseAttorneys | One in-house counsel email address per row |
| OutsideFirmDomains | One outside-counsel email domain per row |
| HotbuttonKeywords | One case-specific escalation keyword per row |
Returns
The absolute path written (same as path).
Example
write_config_template("matter_config_template.xlsx")
# Hand the file to a paralegal; receive it back completed.
cfg = config_from_xlsx("matter_config_completed.xlsx")
DiscoveryGraph.config_from_xlsx — Function
config_from_xlsx(path::AbstractString) -> CorpusConfig
Load a CorpusConfig from a completed DiscoveryGraph configuration workbook.
Reads the four sheets produced by write_config_template and constructs a CorpusConfig via build_corpus_config. Rows beginning with # are treated as comments and ignored.
Sheet requirements
- Metadata: Field/Value columns; all six fields must be present.
- InHouseAttorneys: email column; one address per row.
- OutsideFirmDomains: domain column; one domain per row.
- HotbuttonKeywords: keyword column; one term per row (case-insensitive matching applied at runtime).
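The comment-row rule and the case-insensitive treatment of keywords might look like the sketch below once a sheet column has been read into a vector of strings. clean_rows is a hypothetical helper; dropping blank rows and lowercasing at load time are assumptions, since this page states only that # rows are ignored and that matching is case-insensitive at runtime.

```julia
# Hypothetical cleaner for one sheet column, given its cells as strings.
function clean_rows(cells::Vector{String})
    rows = strip.(cells)
    # Skip blank rows (assumption) and rows that begin with '#' (comments).
    return [lowercase(r) for r in rows if !isempty(r) && !startswith(r, "#")]
end
```

For instance, a HotbuttonKeywords column containing a comment row, "Raptors", a blank row, and " LJM " would reduce to the two terms "raptors" and "ljm".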
Returns
A fully populated CorpusConfig.
Example
cfg = config_from_xlsx("matter_config.xlsx")
corpus = load_corpus(raw_df, cfg)
Default Keyword Lists
DiscoveryGraph.DEFAULT_TIER1_KEYWORDS — Constant
DEFAULT_TIER1_KEYWORDS
Matter-independent Tier 1 keywords signalling litigation anticipation or active regulatory investigation. Any subject or body match promotes a counsel-involved message to Tier 1 (immediate human review).
These terms are deliberately generic — they apply across matter types without modification. Corpus-specific regulatory abbreviations (e.g. "ferc", "sec") should be added via CorpusConfig(tier1_keywords = vcat(DEFAULT_TIER1_KEYWORDS, [...])) or the corpus-specific constant (see ENRON_TIER1_EXAMPLES).
DiscoveryGraph.DEFAULT_TIER2_KEYWORDS — Constant
DEFAULT_TIER2_KEYWORDS
Matter-independent Tier 2 keywords signalling regulatory compliance or direct legal advice. Messages matching these terms (and no Tier 1 term) are placed in the secondary review queue.
DiscoveryGraph.DEFAULT_TIER3_KEYWORDS — Constant
DEFAULT_TIER3_KEYWORDS
Matter-independent Tier 3 keywords signalling transactional legal work where privilege is likely waived in the transactional context. Deprioritised for review.
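The precedence among the keyword lists (hotbutton terms first, then Tier 1, Tier 2, Tier 3) can be sketched as below. assign_tier is a hypothetical function, the keyword lists passed to it are placeholders rather than the package defaults, and the sketch ignores the counsel-involvement check; plain substring matching and returning nothing when no list matches are also assumptions.

```julia
# Hypothetical tier assignment by first matching keyword list.
function assign_tier(text::AbstractString;
                     hotbutton=String[], tier1=String[],
                     tier2=String[], tier3=String[])
    t = lowercase(text)
    hit(words) = any(w -> occursin(lowercase(w), t), words)
    hit(hotbutton) && return 1   # case-specific terms are checked first
    hit(tier1) && return 1
    hit(tier2) && return 2       # reached only when no Tier 1 term matched
    hit(tier3) && return 3
    return nothing               # assumption: no keyword hit, no tier
end
```

Substring matching is the simplest choice but over-matches short abbreviations (a keyword such as "sec" also matches "second"), so a production classifier would more likely match on word boundaries.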