Schema Layer

The schema layer defines configuration types, corpus validation, and corpus-specific reference data. All pipeline functions accept a CorpusConfig that maps arbitrary corpus column names to canonical fields.

Types

DiscoveryGraph.CounselType (Type)
CounselType

Enum classifying the type of legal counsel associated with a network node.

Variants

  • NotCounsel: The node is not identified as legal counsel.
  • InHouse: The node is identified as in-house legal counsel (employee of the organization).
  • OutsideFirm: The node is identified as outside legal counsel (external law firm).
  • RegulatoryAdvisor: The node is a non-attorney staff member who routinely handles regulatory or litigation-adjacent correspondence (e.g., government affairs, compliance). Sets is_counsel = true so messages enter the review queue, but the role label distinguishes them from attorneys in the methodology memo and community table. Messages involving only RegulatoryAdvisor parties are not presumptively privileged — they require separate legal analysis to determine privilege status.
DiscoveryGraph.RoleConfig (Type)
RoleConfig

Configuration for identifying nodes that hold a particular legal or organizational role.

Each RoleConfig defines one role (e.g., "in_house_counsel") and the address-matching rules used to assign nodes to that role during privilege triage.

Fields

  • label::String: Human-readable role name (e.g., "in_house_counsel", "outside_counsel").
  • counsel_type::CounselType: The kind of counsel this role represents (NotCounsel, InHouse, OutsideFirm, or RegulatoryAdvisor).
  • address_patterns::Vector{Regex}: Regex patterns matched against node addresses. Any match assigns the role.
  • domain_list::Vector{String}: Email domains whose addresses are assigned the role.
  • explicit_addresses::Set{String}: Exact email addresses that are unconditionally assigned the role.

Example

rc = RoleConfig(
    "outside_counsel",
    OutsideFirm,
    [r".*@lawfirm\.com"],
    ["lawfirm.com"],
    Set(["partner@lawfirm.com"]),
)
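
The three matching rules can be pictured with a small self-contained sketch. It assumes the rules are combined with a logical OR and that addresses are compared case-insensitively; neither detail is confirmed by the field descriptions. RoleRules and matches_role are hypothetical names used for illustration only and are not part of DiscoveryGraph.

```julia
# Hypothetical stand-in for the three matching fields of RoleConfig.
struct RoleRules
    address_patterns::Vector{Regex}
    domain_list::Vector{String}
    explicit_addresses::Set{String}   # assumed to be stored in lowercase
end

# Assumption: any one rule is enough to assign the role.
function matches_role(addr::AbstractString, rules::RoleRules)
    a = lowercase(strip(addr))
    # Rule 1: exact address match.
    a in rules.explicit_addresses && return true
    # Rule 2: domain match (text after the single '@', if present).
    parts = split(a, '@')
    domain = length(parts) == 2 ? parts[2] : ""
    domain in rules.domain_list && return true
    # Rule 3: any regex pattern matches the full address.
    return any(p -> occursin(p, a), rules.address_patterns)
end

rules = RoleRules([r".*@lawfirm\.com"], ["lawfirm.com"], Set(["partner@lawfirm.com"]))
matches_role("Partner@lawfirm.com", rules)   # true
matches_role("analyst@corp.com", rules)      # false
```

In this sketch the explicit-address check runs first because it is the cheapest test; the order does not change the result when the rules are OR'ed.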
DiscoveryGraph.CorpusConfig (Type)
CorpusConfig(; sender, recipients_to, recipients_cc, timestamp, subject, hash, lastword,
               corpus_start, corpus_end, baseline_start, baseline_end, roles,
               extra_columns, internal_domain, bot_patterns, bot_domains, bot_senders,
               broadcast_discount, kernel_threshold, kernel_jaccard_min,
               anomaly_zscore_threshold, semantic_classifier, stopwords,
               hotbutton_keywords, tier1_keywords, tier2_keywords, tier3_keywords)

Configuration struct that fully describes a corpus and its analysis parameters. All pipeline functions accept a CorpusConfig to remain corpus-agnostic.

Required keyword arguments

  • sender::Symbol: Column name for the message sender address.
  • recipients_to::Symbol: Column name for the To-recipients field (stored as a stringified list).
  • recipients_cc::Symbol: Column name for the CC-recipients field (stored as a stringified list).
  • timestamp::Symbol: Column name for the message timestamp (DateTime).
  • subject::Symbol: Column name for the subject line.
  • hash::Symbol: Column name for the unique message identifier.
  • lastword::Symbol: Column name for a corpus-specific auxiliary text field.
  • corpus_start::DateTime: Earliest date of the full corpus window.
  • corpus_end::DateTime: Latest date of the full corpus window.
  • baseline_start::DateTime: Start of the community-detection baseline period.
  • baseline_end::DateTime: End of the community-detection baseline period.
  • roles::Vector{RoleConfig}: Role definitions used by find_roles.

Optional keyword arguments

  • extra_columns::Vector{Symbol}: Additional corpus columns to preserve (default: Symbol[]).
  • internal_domain::String: Domain suffix used to restrict edges to internal senders/recipients. Empty string disables filtering (default: "").
  • bot_patterns::Vector{Regex}: Regex patterns identifying broadcast/bot senders (default: Regex[]).
  • bot_domains::Vector{String}: Domains whose senders are treated as bots (default: String[]).
  • bot_senders::Set{String}: Explicit sender addresses treated as bots (default: empty set).
  • broadcast_discount::Function: Weight function n -> Float64 where n is recipient count (default: n -> 1/log(n+2)).
  • kernel_threshold::Float64: Fraction of baseline weeks a node must appear in to be a kernel member (default: 2/3).
  • kernel_jaccard_min::Float64: Minimum Jaccard similarity to match a community across snapshots (default: 0.6).
  • anomaly_zscore_threshold::Float64: Z-score threshold for volume spike detection (default: 2.0).
  • semantic_classifier::Function: Message classifier (df, cfg) -> df; default is a no-op stub.
  • stopwords::Set{String}: Words excluded from TF-IDF vocabulary (default: built-in English stopword list).
  • hotbutton_keywords::Vector{String}: Case-specific escalation terms supplied by the user; any match assigns Tier1 before standard keyword lists are checked. Disclosed explicitly in the Rule 26(f) memo (default: String[]).
  • tier1_keywords::Vector{String}: Standard litigation/regulatory keywords (default: DEFAULT_TIER1_KEYWORDS).
  • tier2_keywords::Vector{String}: Standard legal-advice keywords (default: DEFAULT_TIER2_KEYWORDS).
  • tier3_keywords::Vector{String}: Standard transactional keywords (default: DEFAULT_TIER3_KEYWORDS).
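
The numeric defaults in this list are easier to judge with concrete values. The following is a minimal sketch, not the package's implementation: jaccard, is_kernel, and is_spike are hypothetical helpers, and the comparison operators (>= versus >) and the use of the sample standard deviation are assumptions.

```julia
using Statistics  # mean, std

# Default broadcast discount quoted above: weight shrinks as recipient count n grows.
broadcast_discount = n -> 1 / log(n + 2)

# Jaccard similarity between two communities, each a set of node identifiers.
function jaccard(a::Set, b::Set)
    isempty(a) && isempty(b) && return 1.0
    return length(intersect(a, b)) / length(union(a, b))
end

# Assumed kernel rule: the node appears in at least `threshold` of the baseline weeks.
is_kernel(weeks_present::Int, baseline_weeks::Int; threshold = 2 / 3) =
    weeks_present / baseline_weeks >= threshold

# Assumed spike rule: z-score of the latest count against the earlier counts.
function is_spike(history::AbstractVector{<:Real}, latest::Real; threshold = 2.0)
    s = std(history)
    s == 0 && return false          # no variation, so no meaningful z-score
    return (latest - mean(history)) / s > threshold
end

broadcast_discount(1) > broadcast_discount(50)      # true: mass mailings weigh less
jaccard(Set([1, 2, 3]), Set([2, 3, 4]))             # 0.5, below the 0.6 default
is_kernel(9, 13)                                    # true: 9/13 exceeds 2/3
is_spike([10, 12, 11, 9, 13], 30)                   # true
```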

Example

using Dates  # provides DateTime

# in_house_role and outside_role are RoleConfig values constructed beforehand.
cfg = CorpusConfig(
    sender          = :sender,
    recipients_to   = :tos,
    recipients_cc   = :ccs,
    timestamp       = :date,
    subject         = :subj,
    hash            = :hash,
    lastword        = :lastword,
    corpus_start    = DateTime(2000, 1, 1),
    corpus_end      = DateTime(2002, 12, 31),
    baseline_start  = DateTime(2000, 7, 1),
    baseline_end    = DateTime(2000, 9, 30),
    roles           = [in_house_role, outside_role],
    internal_domain = "corp.com",
)

Corpus Loading and Validation

DiscoveryGraph.load_corpus (Function)
load_corpus(df::DataFrame, cfg::CorpusConfig) -> DataFrame

Validate that a corpus DataFrame satisfies the requirements of cfg and return it unchanged.

Checks that:

  • cfg.corpus_start < cfg.corpus_end and cfg.baseline_start < cfg.baseline_end.
  • All required columns (sender, recipients_to, recipients_cc, timestamp, subject, hash, lastword) are present.
  • The sender, hash, and timestamp columns contain no missing values.

Throws ArgumentError on any violation. If all checks pass, returns df unmodified so the call can be composed in a pipeline.

Arguments

  • df::DataFrame: Raw corpus to validate.
  • cfg::CorpusConfig: Configuration describing expected column names and date bounds.

Returns

The input df unchanged if valid.

Example

cfg = enron_config()
corpus = load_corpus(raw_df, cfg)
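
The validation checks listed for load_corpus can be sketched without any table library. This is an illustration under stated assumptions rather than the package's code: validate_sketch is a hypothetical name, the table is a plain Dict from column name to values instead of a DataFrame, and only one date window is checked.

```julia
using Dates

# Hypothetical stand-in for the checks described above.
function validate_sketch(table::AbstractDict{Symbol}, required, no_missing,
                         window_start::DateTime, window_end::DateTime)
    # Check 1: the date window must be ordered.
    window_start < window_end ||
        throw(ArgumentError("start of window must precede its end"))
    # Check 2: every required column must exist.
    absent = setdiff(required, keys(table))
    isempty(absent) || throw(ArgumentError("missing columns: $(absent)"))
    # Check 3: selected columns must not contain missing values.
    for col in no_missing
        any(ismissing, table[col]) &&
            throw(ArgumentError("column $(col) contains missing values"))
    end
    return table  # returned unchanged so the call composes in a pipeline
end

good = Dict(:sender => ["a@corp.com"], :hash => ["h1"])
validate_sketch(good, [:sender, :hash], [:sender, :hash],
                DateTime(2000, 1, 1), DateTime(2001, 1, 1)) === good   # true
```

Returning the input unchanged is what lets a validator of this kind sit at the head of a pipeline without altering downstream code.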

Enron Reference Configuration

DiscoveryGraph.enron_config (Function)
enron_config() -> CorpusConfig

Return a CorpusConfig pre-configured for the Enron email corpus.

The configuration encodes:

  • Column name mapping for the Enron Arrow schema (:sender, :tos, :ccs, :date, :subj, :hash, :lastword).
  • Corpus window: 1999-01-01 to 2002-12-31.
  • Baseline period: Q3 2000 (2000-07-01 to 2000-09-30).
  • Internal domain: "enron.com" (only @enron.com ↔ @enron.com edges are built).
  • Bot/broadcast sender patterns and explicit bot addresses derived from the Enron corpus.
  • Two role definitions:
    • "in_house_counsel" (InHouse): 21 named Enron in-house attorneys by explicit address, including General Counsel James Derrick and attorneys surfaced by audit_counsel_coverage.
    • "outside_counsel" (OutsideFirm): 13 firm domains including Vinson & Elkins, Bracewell & Patterson, Andrews Kurth, Sullivan & Cromwell, Weil Gotshal, Gibbs & Bruns, Jones Day, and others.

Returns

A fully populated CorpusConfig ready to pass to load_corpus, build_edges, and the rest of the DiscoveryGraph pipeline.

Example

cfg    = enron_config()
corpus = load_corpus(raw_df, cfg)
edges  = build_edges(corpus, cfg)
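
The internal-domain restriction (only @enron.com ↔ @enron.com edges) can be illustrated with a short sketch. It assumes an address counts as internal when it ends with "@" followed by the domain, compared case-insensitively, and that an empty domain disables the filter as described for internal_domain. is_internal and keep_edge are hypothetical names; how build_edges actually applies the filter is not shown in this reference.

```julia
# Assumption: suffix match on "@" * domain, case-insensitive.
is_internal(addr::AbstractString, domain::AbstractString) =
    isempty(domain) || endswith(lowercase(addr), "@" * lowercase(domain))

# An edge is kept only when both endpoints are internal.
keep_edge(from, to, domain) = is_internal(from, domain) && is_internal(to, domain)

keep_edge("a@enron.com", "b@ENRON.com", "enron.com")      # true
keep_edge("a@enron.com", "x@lawfirm.com", "enron.com")    # false
keep_edge("a@enron.com", "x@lawfirm.com", "")             # true: filter disabled
```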
DiscoveryGraph.enron_corpus (Function)
enron_corpus() -> DataFrame

Load the Enron email corpus from the package artifact store.

Downloads and caches the corpus automatically on first call via Julia's Artifacts system. The artifact is hosted on Zenodo; internet access is required on first use.

Returns

A DataFrame with the Enron corpus in the schema expected by enron_config(): columns :sender, :tos, :ccs, :date, :subj, :hash, :lastword.

Example

cfg    = enron_config()
corpus = load_corpus(enron_corpus(), cfg)
edges  = build_edges(corpus, cfg)
DiscoveryGraph.ENRON_HOTBUTTON_EXAMPLES (Constant)
ENRON_HOTBUTTON_EXAMPLES

Illustrative case-specific escalation terms for the Enron investigation. These are the names of trading schemes, special-purpose entities, and accounting mechanisms that were central to the FERC and SEC investigations.

Pass any subset to enron_config() or build_corpus_config() as hotbutton_keywords to promote matching messages to Tier 1 before standard keyword classification runs.

cfg = enron_config(hotbutton_keywords = ENRON_HOTBUTTON_EXAMPLES)
DiscoveryGraph.ENRON_TIER1_EXAMPLES (Constant)
ENRON_TIER1_EXAMPLES

Corpus-specific Tier 1 regulatory keywords for the Enron investigation. ferc (Federal Energy Regulatory Commission) and sec (Securities and Exchange Commission) are the primary enforcement bodies in the Enron case and are not part of DEFAULT_TIER1_KEYWORDS, which contains only matter-independent terms.

enron_config() includes these automatically. For other matters substitute the relevant regulator abbreviations (e.g., ["occ", "fdic"] for a banking matter).

cfg = enron_config()                        # includes ferc + sec automatically
cfg = build_corpus_config(...,
    tier1_keywords = vcat(DEFAULT_TIER1_KEYWORDS, ["occ", "fdic"]))

Generic Config Builder

DiscoveryGraph.build_corpus_config (Function)
build_corpus_config(; internal_domain, corpus_start, corpus_end,
                      baseline_start, baseline_end,
                      in_house_attorneys, outside_firm_domains,
                      hotbutton_keywords, kwargs...) -> CorpusConfig

Construct a CorpusConfig from plain lists — no Julia struct knowledge required.

Uses the standard Enron column layout (:sender, :tos, :ccs, :date, :subj, :hash, :lastword) and sensible defaults for all network and classification parameters. A paralegal or technician can populate the four domain-specific lists; everything else is handled automatically.

Required arguments

  • internal_domain::String: Email domain that defines "internal" nodes (e.g., "enron.com"). Only @domain ↔ @domain edges are built.
  • corpus_start, corpus_end: Earliest and latest dates of the full corpus (Date or DateTime).
  • baseline_start, baseline_end: Community-detection baseline window (Date or DateTime).

Optional arguments

  • in_house_attorneys::Vector{String}: Exact email addresses of in-house counsel (default: String[]).
  • outside_firm_domains::Vector{String}: Email domains of outside counsel firms, e.g. ["vinson-elkins.com", "bracepatt.com"] (default: String[]).
  • hotbutton_keywords::Vector{String}: Case-specific escalation terms that promote matching messages to Tier 1 ahead of standard keyword lists (default: String[]). See ENRON_HOTBUTTON_EXAMPLES for examples.
  • Any additional keyword argument accepted by CorpusConfig (e.g., tier1_keywords, anomaly_zscore_threshold).

Returns

A fully populated CorpusConfig ready for load_corpus, build_edges, and the rest of the DiscoveryGraph pipeline.

Example

using Dates  # provides Date

cfg = build_corpus_config(
    internal_domain      = "enron.com",
    corpus_start         = Date(1999, 1, 1),
    corpus_end           = Date(2002, 12, 31),
    baseline_start       = Date(2000, 7, 1),
    baseline_end         = Date(2000, 9, 30),
    in_house_attorneys   = ["sara.shackleton@enron.com", "mark.haedicke@enron.com"],
    outside_firm_domains = ["vinson-elkins.com", "bracepatt.com"],
    hotbutton_keywords   = ["raptors", "ljm", "mark-to-market"],
)
corpus = load_corpus(raw_df, cfg)  # raw_df: a corpus DataFrame loaded beforehand
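
Because the date arguments may be Date or DateTime while CorpusConfig stores DateTime values, some conversion has to happen. The sketch below shows one plausible conversion with a hypothetical helper, as_datetime; whether the package converts this way, or treats an end Date as covering the whole day, is not stated in this reference.

```julia
using Dates

# Hypothetical helper: accept either Date or DateTime and return a DateTime.
as_datetime(d::DateTime) = d
as_datetime(d::Date) = DateTime(d)   # midnight at the start of that day

corpus_end   = as_datetime(Date(2002, 12, 31))
late_message = DateTime(2002, 12, 31, 17, 30)
late_message <= corpus_end   # false: 17:30 is after midnight on the end date
```

Under this assumption, a message sent during the final day falls outside a window whose end was given as a Date, so passing an explicit DateTime for the end bound avoids any ambiguity.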

XLSX Config Helper

DiscoveryGraph.write_config_template (Function)
write_config_template(path::AbstractString) -> String

Write a blank DiscoveryGraph configuration workbook to path.

Creates an .xlsx file with four sheets that a paralegal or technician can populate and return to the developer. Pass the completed file to config_from_xlsx to produce a CorpusConfig.

Sheets

Sheet               Contents
Metadata            Internal domain, corpus/baseline date bounds, schema version
InHouseAttorneys    One in-house counsel email address per row
OutsideFirmDomains  One outside-counsel email domain per row
HotbuttonKeywords   One case-specific escalation keyword per row

Returns

The absolute path written (same as path).

Example

write_config_template("matter_config_template.xlsx")
# Hand the file to a paralegal; receive it back completed.
cfg = config_from_xlsx("matter_config_completed.xlsx")
DiscoveryGraph.config_from_xlsx (Function)
config_from_xlsx(path::AbstractString) -> CorpusConfig

Load a CorpusConfig from a completed DiscoveryGraph configuration workbook.

Reads the four sheets produced by write_config_template and constructs a CorpusConfig via build_corpus_config. Rows beginning with # are treated as comments and ignored.

Sheet requirements

  • Metadata: Field/Value columns; all six fields must be present.
  • InHouseAttorneys: email column; one address per row.
  • OutsideFirmDomains: domain column; one domain per row.
  • HotbuttonKeywords: keyword column; one term per row (case-insensitive matching applied at runtime).

Returns

A fully populated CorpusConfig.

Example

cfg    = config_from_xlsx("matter_config.xlsx")
corpus = load_corpus(raw_df, cfg)
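
The comment-row rule and the case-insensitive keyword matching can be sketched on plain vectors, without reading a workbook. clean_rows and keyword_hit are hypothetical helpers; skipping empty or missing cells and matching by substring are assumptions that go beyond what the sheet requirements state.

```julia
# Hypothetical cleanup of one sheet column, given its raw cell values.
function clean_rows(cells)
    out = String[]
    for c in cells
        ismissing(c) && continue                  # assumed: blank cells are skipped
        s = strip(string(c))
        (isempty(s) || startswith(s, "#")) && continue   # rows beginning with # are comments
        push!(out, s)
    end
    return out
end

# Case-insensitive match applied when a keyword is used, not when it is loaded.
keyword_hit(term, text) = occursin(lowercase(term), lowercase(text))

terms = clean_rows(["# project names", " Raptors ", missing, "", "ljm"])
keyword_hit(terms[1], "RE: RAPTORS restructuring")   # true
```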

Default Keyword Lists

DiscoveryGraph.DEFAULT_TIER1_KEYWORDS (Constant)
DEFAULT_TIER1_KEYWORDS

Matter-independent Tier 1 keywords signalling litigation anticipation or active regulatory investigation. Any subject or body match promotes a counsel-involved message to Tier 1 (immediate human review).

These terms are deliberately generic — they apply across matter types without modification. Corpus-specific regulatory abbreviations (e.g. "ferc", "sec") should be added via CorpusConfig(tier1_keywords = vcat(DEFAULT_TIER1_KEYWORDS, [...])) or the corpus-specific constant (see ENRON_TIER1_EXAMPLES).

DiscoveryGraph.DEFAULT_TIER2_KEYWORDS (Constant)
DEFAULT_TIER2_KEYWORDS

Matter-independent Tier 2 keywords signalling regulatory compliance or direct legal advice. Messages matching these terms (and no Tier 1 term) are placed in the secondary review queue.

DiscoveryGraph.DEFAULT_TIER3_KEYWORDS (Constant)
DEFAULT_TIER3_KEYWORDS

Matter-independent Tier 3 keywords signalling transactional legal work where privilege is likely waived in the transactional context. Deprioritised for review.
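
The precedence among the keyword lists (hotbutton terms first, then Tier 1, Tier 2, Tier 3) can be sketched as follows. assign_tier is a hypothetical function, the keyword lists are placeholders rather than the real DEFAULT_* contents, matching is assumed to be a case-insensitive substring test, and the sketch ignores the precondition that a message is counsel-involved. Returning nothing for unmatched text is also an assumption.

```julia
# Hypothetical tier assignment; returns 1, 2, 3, or nothing when no list matches.
function assign_tier(text; hotbutton, tier1, tier2, tier3)
    t = lowercase(text)
    hit(words) = any(w -> occursin(lowercase(w), t), words)
    hit(hotbutton) && return 1    # case-specific terms are checked first
    hit(tier1) && return 1
    hit(tier2) && return 2        # reached only when no Tier 1 term matched
    hit(tier3) && return 3
    return nothing
end

# Placeholder lists for illustration only.
lists = (hotbutton = ["raptors"], tier1 = ["subpoena"],
         tier2 = ["legal advice"], tier3 = ["closing checklist"])

assign_tier("Raptors closing checklist"; lists...)        # 1
assign_tier("Need legal advice on filing"; lists...)      # 2
assign_tier("Lunch on Friday?"; lists...)                 # nothing
```

With plain substring matching, a short abbreviation such as "sec" would also match inside longer words like "second"; a real implementation may use word boundaries to avoid that.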
