Schema Layer

The schema layer defines configuration types, corpus validation, and corpus-specific reference data. All pipeline functions accept a CorpusConfig that maps arbitrary corpus column names to canonical fields.

Types

DiscoveryGraph.CounselType — Type

CounselType

Enum classifying the type of legal counsel associated with a network node.

Variants

NotCounsel: The node is not identified as legal counsel.
InHouse: The node is identified as in-house legal counsel (employee of the organization).
OutsideFirm: The node is identified as outside legal counsel (external law firm).
RegulatoryAdvisor: The node is a non-attorney staff member who routinely handles regulatory or litigation-adjacent correspondence (e.g., government affairs, compliance). This variant sets is_counsel = true so that messages enter the review queue, but the role label distinguishes these nodes from attorneys in the methodology memo and community table. Messages involving only RegulatoryAdvisor parties are not presumptively privileged; they require separate legal analysis to determine privilege status.
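
The distinction between "enters the review queue" and "presumptively privileged" can be illustrated with a small stand-in enum. This is only a sketch: CounselSketch and both helper functions are hypothetical names, not part of DiscoveryGraph, and the two rules encode one reading of the variant descriptions above.

```julia
# Hypothetical stand-in for the package enum; variant names come from the list above.
@enum CounselSketch NotCounsel InHouse OutsideFirm RegulatoryAdvisor

# Assumption: every variant except NotCounsel sends messages to the review queue.
enters_review_queue(t::CounselSketch) = t != NotCounsel

# Assumption: only the two attorney variants carry a presumption of privilege.
presumptively_privileged(t::CounselSketch) = t in (InHouse, OutsideFirm)
```

Under these assumptions a RegulatoryAdvisor node is queued for review but is not treated as presumptively privileged.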

DiscoveryGraph.RoleConfig — Type

RoleConfig

Configuration for identifying nodes that hold a particular legal or organizational role.

Each RoleConfig defines one role (e.g., "in_house_counsel") and the address-matching rules used to assign nodes to that role during privilege triage.

Fields

label::String: Human-readable role name (e.g., "in_house_counsel", "outside_counsel").
counsel_type::CounselType: The CounselType variant this role maps to (NotCounsel, InHouse, OutsideFirm, or RegulatoryAdvisor).
address_patterns::Vector{Regex}: Regex patterns matched against node addresses. Any match assigns the role.
domain_list::Vector{String}: Email domains whose addresses are assigned the role.
explicit_addresses::Set{String}: Exact email addresses that are unconditionally assigned the role.
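
The three matching rules above combine as a logical OR: an address gets the role if any one of them matches. The helper below is a minimal sketch of that idea, not the package's implementation. matches_role is a hypothetical name, and lowercasing before comparison is an assumption.

```julia
# Hypothetical helper, not part of DiscoveryGraph: true when an address satisfies
# any of the three rules (explicit address, regex pattern, or email domain).
function matches_role(address::AbstractString,
                      patterns::Vector{Regex},
                      domains::Vector{String},
                      explicit::Set{String})
    addr = lowercase(strip(address))      # assumption: matching ignores case
    addr in explicit && return true
    any(p -> occursin(p, addr), patterns) && return true
    # The domain is the text after the last '@'; no '@' means no domain match.
    at = findlast('@', addr)
    at === nothing && return false
    return addr[nextind(addr, at):end] in domains
end
```

For example, matches_role("partner@lawfirm.com", Regex[], ["lawfirm.com"], Set{String}()) matches on the domain rule alone.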

Example

rc = RoleConfig(
    "outside_counsel",
    OutsideFirm,
    [r".*@lawfirm\.com"],
    ["lawfirm.com"],
    Set(["partner@lawfirm.com"]),
)

DiscoveryGraph.CorpusConfig — Type

CorpusConfig(; sender, recipients_to, recipients_cc, timestamp, subject, hash, lastword,
               corpus_start, corpus_end, baseline_start, baseline_end, roles,
               extra_columns, internal_domain, bot_patterns, bot_domains, bot_senders,
               broadcast_discount, kernel_threshold, kernel_jaccard_min,
               anomaly_zscore_threshold, semantic_classifier, stopwords,
               hotbutton_keywords, tier1_keywords, tier2_keywords, tier3_keywords)

Configuration struct that fully describes a corpus and its analysis parameters. All pipeline functions accept a CorpusConfig to remain corpus-agnostic.

Required keyword arguments

sender::Symbol: Column name for the message sender address.
recipients_to::Symbol: Column name for the To-recipients field (stored as a stringified list).
recipients_cc::Symbol: Column name for the CC-recipients field (stored as a stringified list).
timestamp::Symbol: Column name for the message timestamp (DateTime).
subject::Symbol: Column name for the subject line.
hash::Symbol: Column name for the unique message identifier.
lastword::Symbol: Column name for a corpus-specific auxiliary text field.
corpus_start::DateTime: Earliest date of the full corpus window.
corpus_end::DateTime: Latest date of the full corpus window.
baseline_start::DateTime: Start of the community-detection baseline period.
baseline_end::DateTime: End of the community-detection baseline period.
roles::Vector{RoleConfig}: Role definitions used by find_roles.

Optional keyword arguments

extra_columns::Vector{Symbol}: Additional corpus columns to preserve (default: Symbol[]).
internal_domain::String: Domain suffix used to restrict edges to internal senders/recipients. Empty string disables filtering (default: "").
bot_patterns::Vector{Regex}: Regex patterns identifying broadcast/bot senders (default: Regex[]).
bot_domains::Vector{String}: Domains whose senders are treated as bots (default: String[]).
bot_senders::Set{String}: Explicit sender addresses treated as bots (default: empty set).
broadcast_discount::Function: Weight function n -> Float64 where n is recipient count (default: n -> 1/log(n+2)).
kernel_threshold::Float64: Fraction of baseline weeks a node must appear in to be a kernel member (default: 2/3).
kernel_jaccard_min::Float64: Minimum Jaccard similarity to match a community across snapshots (default: 0.6).
anomaly_zscore_threshold::Float64: Z-score threshold for volume spike detection (default: 2.0).
semantic_classifier::Function: Message classifier (df, cfg) -> df; default is a no-op stub.
stopwords::Set{String}: Words excluded from TF-IDF vocabulary (default: built-in English stopword list).
hotbutton_keywords::Vector{String}: Case-specific escalation terms supplied by the user; any match assigns Tier1 before standard keyword lists are checked. Disclosed explicitly in the Rule 26(f) memo (default: String[]).
tier1_keywords::Vector{String}: Standard litigation/regulatory keywords (default: DEFAULT_TIER1_KEYWORDS).
tier2_keywords::Vector{String}: Standard legal-advice keywords (default: DEFAULT_TIER2_KEYWORDS).
tier3_keywords::Vector{String}: Standard transactional keywords (default: DEFAULT_TIER3_KEYWORDS).
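
The numeric defaults above are easier to judge with the underlying quantities written out. The following is a rough sketch using only the Julia standard library. The discount formula is the documented default, but is_kernel, jaccard, and spike_indices are hypothetical helpers, and the comparison directions (>= for kernel membership, > for spikes) are assumptions.

```julia
using Statistics  # stdlib: mean, std

# Documented default: the edge weight shrinks as the recipient count n grows.
broadcast_discount = n -> 1 / log(n + 2)

# Assumption: kernel membership means appearing in at least `threshold` of the baseline weeks.
is_kernel(weeks_present::Int, total_weeks::Int; threshold = 2 / 3) =
    weeks_present / total_weeks >= threshold

# Jaccard similarity of two member sets, the quantity compared against kernel_jaccard_min.
function jaccard(a::Set, b::Set)
    u = union(a, b)
    return isempty(u) ? 0.0 : length(intersect(a, b)) / length(u)
end

# Assumption: a spike is any period whose z-score within the series exceeds the threshold.
function spike_indices(counts::Vector{<:Real}; threshold = 2.0)
    μ, σ = mean(counts), std(counts)
    σ == 0 && return Int[]
    return findall(c -> (c - μ) / σ > threshold, counts)
end
```

With the default discount, a message to one recipient contributes more weight per edge than a message to ten, which is what keeps broadcast mail from dominating the graph.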

Example

using Dates  # provides DateTime

# in_house_role and outside_role are RoleConfig values defined beforehand.
cfg = CorpusConfig(
    sender          = :sender,
    recipients_to   = :tos,
    recipients_cc   = :ccs,
    timestamp       = :date,
    subject         = :subj,
    hash            = :hash,
    lastword        = :lastword,
    corpus_start    = DateTime(2000, 1, 1),
    corpus_end      = DateTime(2002, 12, 31),
    baseline_start  = DateTime(2000, 7, 1),
    baseline_end    = DateTime(2000, 9, 30),
    roles           = [in_house_role, outside_role],
    internal_domain = "corp.com",
)

Corpus Loading and Validation

DiscoveryGraph.load_corpus — Function

load_corpus(df::DataFrame, cfg::CorpusConfig) -> DataFrame

Validate that a corpus DataFrame satisfies the requirements of cfg and return it unchanged.

Checks that:

cfg.corpus_start < cfg.corpus_end and cfg.baseline_start < cfg.baseline_end.
All required columns (sender, recipients_to, recipients_cc, timestamp, subject, hash, lastword) are present.
The sender, hash, and timestamp columns contain no missing values.

Throws ArgumentError on any violation. If all checks pass, returns df unmodified so the call can be composed in a pipeline.
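
The three checks above can be sketched without any DataFrames dependency by treating a table as a NamedTuple of column vectors. validate_columns is a hypothetical stand-in written for illustration; the real load_corpus takes a DataFrame and a CorpusConfig, and its error messages are not shown in this document.

```julia
using Dates

# Hypothetical stand-in for the validation steps: check the date window, then
# column presence, then missing values, and hand the table back untouched.
function validate_columns(table::NamedTuple,
                          required::Vector{Symbol},
                          no_missing::Vector{Symbol},
                          window::Tuple{DateTime,DateTime})
    window[1] < window[2] ||
        throw(ArgumentError("window start must precede window end"))
    absent = setdiff(required, collect(keys(table)))
    isempty(absent) || throw(ArgumentError("missing columns: $(absent)"))
    for col in no_missing
        any(ismissing, table[col]) &&
            throw(ArgumentError("column $(col) contains missing values"))
    end
    return table  # unchanged, so the call composes in a pipeline
end
```

Returning the input rather than a Bool is what allows the validate-then-continue style used in the examples on this page.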

Arguments

df::DataFrame: Raw corpus to validate.
cfg::CorpusConfig: Configuration describing expected column names and date bounds.

Returns

The input df unchanged if valid.

Example

cfg = enron_config()
corpus = load_corpus(raw_df, cfg)

Enron Reference Configuration

DiscoveryGraph.enron_config — Function

enron_config() -> CorpusConfig

Return a CorpusConfig pre-configured for the Enron email corpus.

The configuration encodes:

Column name mapping for the Enron Arrow schema (:sender, :tos, :ccs, :date, :subj, :hash, :lastword).
Corpus window: 1999-01-01 to 2002-12-31.
Baseline period: Q3 2000 (2000-07-01 to 2000-09-30).
Internal domain: "enron.com" (only @enron.com ↔ @enron.com edges are built).
Bot/broadcast sender patterns and explicit bot addresses derived from the Enron corpus.
Two role definitions:
- "in_house_counsel" (InHouse): 21 named Enron in-house attorneys by explicit address, including General Counsel James Derrick and attorneys surfaced by audit_counsel_coverage.
- "outside_counsel" (OutsideFirm): 13 firm domains including Vinson & Elkins, Bracewell & Patterson, Andrews Kurth, Sullivan & Cromwell, Weil Gotshal, Gibbs & Bruns, Jones Day, and others.
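
The internal-domain rule in the list above (only @enron.com ↔ @enron.com edges are built) amounts to a predicate over both endpoints of an edge. The sketch below is illustrative only: is_internal_edge is a hypothetical name, and matching on "@" plus the domain, case-insensitively, is an assumption about how the suffix is compared.

```julia
# Hypothetical predicate: keep an edge only when both endpoints are on the internal domain.
# An empty domain disables the filter, mirroring the documented default of "".
function is_internal_edge(sender::AbstractString,
                          recipient::AbstractString,
                          domain::AbstractString)
    isempty(domain) && return true
    suffix = "@" * lowercase(domain)   # assumption: compare against "@domain"
    return endswith(lowercase(sender), suffix) &&
           endswith(lowercase(recipient), suffix)
end
```

One consequence worth noting: under this rule, mail between an internal sender and an outside firm never becomes a graph edge, so outside counsel are identified through role matching rather than through graph structure.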

Returns

A fully populated CorpusConfig ready to pass to load_corpus, build_edges, and the rest of the DiscoveryGraph pipeline.

Example

cfg    = enron_config()
corpus = load_corpus(raw_df, cfg)
edges  = build_edges(corpus, cfg)

DiscoveryGraph.enron_corpus — Function

enron_corpus() -> DataFrame

Load the Enron email corpus from the package artifact store.

Downloads and caches the corpus automatically on first call via Julia's Artifacts system. The artifact is hosted on Zenodo; internet access is required on first use.

Returns

A DataFrame with the Enron corpus in the schema expected by enron_config(): columns :sender, :tos, :ccs, :date, :subj, :hash, :lastword.

Example

cfg    = enron_config()
corpus = load_corpus(enron_corpus(), cfg)
edges  = build_edges(corpus, cfg)

DiscoveryGraph.ENRON_HOTBUTTON_EXAMPLES — Constant

ENRON_HOTBUTTON_EXAMPLES

Illustrative case-specific escalation terms for the Enron investigation. These are the names of trading schemes, special-purpose entities, and accounting mechanisms that were central to the FERC and SEC investigations.

Pass any subset to enron_config() or build_corpus_config() as hotbutton_keywords to promote matching messages to Tier 1 before standard keyword classification runs.

cfg = enron_config(hotbutton_keywords = ENRON_HOTBUTTON_EXAMPLES)

DiscoveryGraph.ENRON_TIER1_EXAMPLES — Constant

ENRON_TIER1_EXAMPLES

Corpus-specific Tier 1 regulatory keywords for the Enron investigation. ferc (Federal Energy Regulatory Commission) and sec (Securities and Exchange Commission) are the primary enforcement bodies in the Enron case and are not part of DEFAULT_TIER1_KEYWORDS, which contains only matter-independent terms.

enron_config() includes these automatically. For other matters substitute the relevant regulator abbreviations (e.g., ["occ", "fdic"] for a banking matter).

cfg = enron_config()                        # includes ferc + sec automatically
cfg = build_corpus_config(...,
    tier1_keywords = vcat(DEFAULT_TIER1_KEYWORDS, ["occ", "fdic"]))

Generic Config Builder

DiscoveryGraph.build_corpus_config — Function

build_corpus_config(; internal_domain, corpus_start, corpus_end,
                      baseline_start, baseline_end,
                      in_house_attorneys, outside_firm_domains,
                      hotbutton_keywords, kwargs...) -> CorpusConfig

Construct a CorpusConfig from plain lists — no Julia struct knowledge required.

Uses the standard Enron column layout (:sender, :tos, :ccs, :date, :subj, :hash, :lastword) and default values for all network and classification parameters. A paralegal or technician only needs to supply the matter-specific inputs (internal domain, date bounds, attorney addresses, firm domains, and hot-button keywords); every other parameter falls back to its default.

Required arguments

internal_domain::String: Email domain that defines "internal" nodes (e.g., "enron.com"). Only @domain ↔ @domain edges are built.
corpus_start, corpus_end: Earliest and latest dates of the full corpus (Date or DateTime).
baseline_start, baseline_end: Community-detection baseline window (Date or DateTime).

Optional arguments

in_house_attorneys::Vector{String}: Exact email addresses of in-house counsel (default: String[]).
outside_firm_domains::Vector{String}: Email domains of outside counsel firms, e.g. ["vinson-elkins.com", "bracepatt.com"] (default: String[]).
hotbutton_keywords::Vector{String}: Case-specific escalation terms that promote matching messages to Tier 1 ahead of standard keyword lists (default: String[]). See ENRON_HOTBUTTON_EXAMPLES for examples.
Any additional keyword argument accepted by CorpusConfig (e.g., tier1_keywords, anomaly_zscore_threshold).
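
To make the "plain lists in, configuration out" idea concrete, here is a minimal sketch of two steps such a builder plausibly performs: normalising Date inputs to DateTime, and turning the two address lists into role definitions. Both functions are hypothetical, NamedTuples stand in for RoleConfig, and the lowercasing is an assumption.

```julia
using Dates

# Hypothetical normaliser: accept Date or DateTime, always return DateTime.
as_datetime(d::Date) = DateTime(d)
as_datetime(d::DateTime) = d

# Hypothetical stand-in for the role-building step; NamedTuples replace RoleConfig.
function roles_from_lists(in_house::Vector{String}, firm_domains::Vector{String})
    return [
        (label = "in_house_counsel",
         explicit_addresses = Set(lowercase.(in_house)),   # assumption: case-folded
         domain_list = String[]),
        (label = "outside_counsel",
         explicit_addresses = Set{String}(),
         domain_list = lowercase.(firm_domains)),
    ]
end
```

The role labels mirror the two roles described for enron_config(): in-house counsel are matched by explicit address, outside counsel by firm domain.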

Returns

A fully populated CorpusConfig ready for load_corpus, build_edges, and the rest of the DiscoveryGraph pipeline.

Example

using Dates  # provides Date

cfg = build_corpus_config(
    internal_domain      = "enron.com",
    corpus_start         = Date(1999, 1, 1),
    corpus_end           = Date(2002, 12, 31),
    baseline_start       = Date(2000, 7, 1),
    baseline_end         = Date(2000, 9, 30),
    in_house_attorneys   = ["sara.shackleton@enron.com", "mark.haedicke@enron.com"],
    outside_firm_domains = ["vinson-elkins.com", "bracepatt.com"],
    hotbutton_keywords   = ["raptors", "ljm", "mark-to-market"],
)
# raw_df is a DataFrame that already holds the corpus.
corpus = load_corpus(raw_df, cfg)

XLSX Config Helper

DiscoveryGraph.write_config_template — Function

write_config_template(path::AbstractString) -> String

Write a blank DiscoveryGraph configuration workbook to path.

Creates an .xlsx file with four sheets that a paralegal or technician can populate and return to the developer. Pass the completed file to config_from_xlsx to produce a CorpusConfig.

Sheets

Sheet	Contents
Metadata	Internal domain, corpus/baseline date bounds, schema version
InHouseAttorneys	One in-house counsel email address per row
OutsideFirmDomains	One outside-counsel email domain per row
HotbuttonKeywords	One case-specific escalation keyword per row

Returns

The absolute path written (same as path).

Example

write_config_template("matter_config_template.xlsx")
# Hand the file to a paralegal; receive it back completed.
cfg = config_from_xlsx("matter_config_completed.xlsx")

DiscoveryGraph.config_from_xlsx — Function

config_from_xlsx(path::AbstractString) -> CorpusConfig

Load a CorpusConfig from a completed DiscoveryGraph configuration workbook.

Reads the four sheets produced by write_config_template and constructs a CorpusConfig via build_corpus_config. Rows beginning with # are treated as comments and ignored.
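
The comment-row rule can be pictured as a filter over the cells of a single-column sheet. This is a sketch of the idea only: clean_rows is a hypothetical helper, and trimming whitespace and dropping blank cells are assumptions beyond what is documented here.

```julia
# Hypothetical helper: keep trimmed cell values, skipping blanks and any row
# whose first non-space character is '#'.
clean_rows(cells) = [strip(c) for c in cells
                     if !isempty(strip(c)) && !startswith(strip(c), "#")]
```

Applied to a column such as ["# one address per row", "sara.shackleton@enron.com "], this keeps only the address.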

Sheet requirements

Metadata: Field/Value columns; all six fields must be present.
InHouseAttorneys: email column; one address per row.
OutsideFirmDomains: domain column; one domain per row.
HotbuttonKeywords: keyword column; one term per row (case-insensitive matching applied at runtime).

Returns

A fully populated CorpusConfig.

Example

cfg    = config_from_xlsx("matter_config.xlsx")
corpus = load_corpus(raw_df, cfg)

Default Keyword Lists

DiscoveryGraph.DEFAULT_TIER1_KEYWORDS — Constant

DEFAULT_TIER1_KEYWORDS

Matter-independent Tier 1 keywords signalling litigation anticipation or active regulatory investigation. Any subject or body match promotes a counsel-involved message to Tier 1 (immediate human review).

These terms are deliberately generic — they apply across matter types without modification. Corpus-specific regulatory abbreviations (e.g. "ferc", "sec") should be added via CorpusConfig(tier1_keywords = vcat(DEFAULT_TIER1_KEYWORDS, [...])) or the corpus-specific constant (see ENRON_TIER1_EXAMPLES).

DiscoveryGraph.DEFAULT_TIER2_KEYWORDS — Constant

DEFAULT_TIER2_KEYWORDS

Matter-independent Tier 2 keywords signalling regulatory compliance or direct legal advice. Messages matching these terms (and no Tier 1 term) are placed in the secondary review queue.

DiscoveryGraph.DEFAULT_TIER3_KEYWORDS — Constant

DEFAULT_TIER3_KEYWORDS

Matter-independent Tier 3 keywords signalling transactional legal work where privilege is likely waived in the transactional context. Deprioritised for review.
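
The precedence among the keyword lists (hot-button terms first, then Tier 1, Tier 2, Tier 3) can be sketched as a chain of early returns. assign_tier is a hypothetical function, not the package's classifier. Substring matching, lowercasing, and returning nothing for unmatched text are all assumptions, and the counsel-involvement precondition is left out for brevity.

```julia
# Hypothetical tier assignment: the first list with a match decides the tier.
function assign_tier(text::AbstractString;
                     hotbutton = String[], tier1 = String[],
                     tier2 = String[], tier3 = String[])
    t = lowercase(text)
    hit(words) = any(w -> occursin(lowercase(w), t), words)
    hit(hotbutton) && return 1   # case-specific terms are checked before standard lists
    hit(tier1) && return 1
    hit(tier2) && return 2
    hit(tier3) && return 3
    return nothing               # assumption: no keyword match means no tier
end
```

Because hot-button terms are checked first, a message matching both a hot-button term and a Tier 3 term lands in Tier 1.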