We had to classify 30,800 legal orders. So we asked five different classifiers — and read every case they fought over.
SOR orders are written in templated language, but the templates drift, the older orders are scanned PDFs with OCR damage, and the boundary cases between "ordered to release" and "ordered to clarify" turn on a single sentence. A single classifier was never going to be reliable enough. So every order gets read five times, by five different systems, and the cases where they disagree are the cases a human reads next. Three of the five classifiers are independently trained Chinese models, which gives the ensemble real diversity beyond its Western-trained members. The whole cross-validation run cost about $4.
01
30,800 SOR orders
Every public-records appeal order from 2014 to today.
02
5 classifiers vote
Regex pipeline + Gemini Flash-Lite + DeepSeek + Qwen3-235B + local qwen3 24B. Each reads every order independently.
03
Sort by agreement
All five agree → ship. Four agree against the regex → likely regex bug. Models split → real ambiguity. Triage by stakes — release calls read first.
04
A human reads
Every disagreement case opened by hand. The goal isn't to label this one case — it's to find the pattern that broke.
05
Patch the regex
Fix the structural bug. Re-run all five classifiers. Re-check the labeled holdout. Lock the new version.
Hand-validated holdout · 84 of 85 cases · every patch passes this before shipping
29,882
SOR orders classified · 2014 through 2026 ytd · 97% mapped to a canonical agency
~$4
Total cost of the five-classifier cross-validation pass across the whole corpus
The classifier is iterated, not finished. If you spot a misclassification on a specific case, tell us the case number and what we got wrong — that becomes the next round.
The reading model
Every SOR order has the same structure: setup, agency argument, optional petitioner challenge, the Supervisor's reasoning, and a disposition in the last one or two paragraphs. The classifier reads the disposition tail (last 40% of the document, anchored on the final "Sincerely,") and matches against a curated list of operative phrases.
Document-type detection runs first on the opening paragraph: is this a regular requester appeal, an agency-side fee or time petition, or a reconsideration? Each type gets its own outcome enum set, because the same word ("granted") means opposite things across types.
Source: regex_classifier_v3.py (the V4b lock as of 2026-05-03). 580 lines, ~80 patterns, runs the full 30,800-order corpus in under two minutes, costs zero.
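A minimal sketch of the disposition-tail step, not the code in regex_classifier_v3.py. It assumes "last 40%" is measured on the text before the final "Sincerely,"; the names disposition_region, SIGN_OFF and TAIL_FRACTION are made up for illustration.

```python
SIGN_OFF = "Sincerely,"
TAIL_FRACTION = 0.40  # "last 40% of the document", per the text above


def disposition_region(order_text: str) -> str:
    """Return the slice of an order where the ruling is expected to sit."""
    # Anchor on the final sign-off: str.rfind scans from the end, so any
    # "Sincerely," inside a quoted email earlier in the body is ignored.
    cut = order_text.rfind(SIGN_OFF)
    body = order_text if cut == -1 else order_text[:cut]
    # Assumption: the 40% is taken from the body that precedes the sign-off.
    tail_len = round(len(body) * TAIL_FRACTION)
    return body[len(body) - tail_len:]
```

Anchoring with rfind rather than find is the design choice that matters here: an order that reproduces an email chain still yields the region just above the Supervisor's own signature.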
Four bugs the humans found — that no model caught alone
The loop above is the boring part. The interesting part is what came out of it. Reading the disagreement queue by hand surfaced four structural bugs in the regex pipeline — patterns that any individual model could have mislabeled forever without noticing, because they're about how the text was being read, not what the words said. Each one became a new version of the pipeline.
Bug 01 · V3 → V4
The "ordered to provide" outcome was losing to "properly withheld" recitals.
When two outcome phrases appeared in the same order, a priority list decided which won. The V3 list ranked "upheld" above "ordered to provide." But in 74% of the supposedly-upheld cases, the actual disposition near the bottom of the order was "is ordered to provide" — the phrase up top was just the SOR quoting the agency's position.
Fix: read the LAST operative phrase in the disposition region, not the first. The SOR's actual ruling is always the last operative thing in the order.
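The "last operative phrase wins" rule can be sketched as follows. This is an illustration under assumptions, not the production pattern set: PATTERNS holds only three of the phrases listed later on this page, and last_operative_outcome is a hypothetical name.

```python
import re
from typing import Optional

# Small illustrative subset of the operative phrases listed on this page.
PATTERNS = {
    "ORDERED_TO_PROVIDE": re.compile(r"is (?:hereby )?ordered to provide", re.I),
    "UPHELD_WITHHOLDING": re.compile(r"properly withheld|the appeal is denied", re.I),
}


def last_operative_outcome(region: str) -> Optional[str]:
    """Pick the outcome whose phrase starts latest in the disposition region."""
    best_pos, best_label = -1, None
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(region):
            if match.start() > best_pos:
                best_pos, best_label = match.start(), label
    return best_label
```

Because the choice depends on position rather than on a fixed priority list, an early "properly withheld" that merely restates the agency's argument loses to a later "is ordered to provide".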
Bug 02 · V3 → V4
A legal recital looked exactly like a denial.
The Supervisor's office opens nearly every privilege case with the same recital from Suffolk Construction Co. v. DCAM: "in assessing whether a records custodian has properly withheld records based on the claim of attorney-client privilege…" The phrase "properly withheld" inside that recital was triggering the regex as if it were the disposition. It appeared in 41% of true-upheld cases — and 80% of false-positive cases.
Fix: a separate library of canonical-recital phrases that get stripped from the disposition region before outcome regex runs. Variants of the recital that don't match the canonical phrase get flagged for human review — has the boilerplate drifted, or is the law changing?
Bug 03 · V4 → V4b
The signature detector was reading the wrong "Sincerely."
Some long orders quote email chains in the body — the requester's original email, the agency's response, all reproduced inside the SOR's analysis. Those quoted emails have their own "Sincerely," sign-offs. The signature detector was treating the first one it found as the SOR's signature and truncating the disposition there. So 227 orders had their actual ruling cut off entirely.
Fix: search for the LAST "Sincerely," in the document. The SOR's signature is reliably the final one, and reading backwards from the end of the file is unambiguous.
Bug 04 · V4b (open)
The holdout was over-sampling the easy cases.
The hand-validated holdout had hit 100% on V3, but at full-corpus scale there were ~470 systematic errors. The holdout had drawn cases from common outcomes and under-sampled rare ones like RECORDS_DESTROYED. Errors on the rare classes never showed up in the holdout numbers.
Fix: expand the holdout with targeted rare-class examples. Always validate at full corpus scale, not just on the holdout. A holdout that's 100% accurate is not the same thing as a corpus that's 100% accurate.
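One way to catch this kind of sampling gap early is to compare class coverage in the holdout against the classes the classifier emits on the full corpus. The sketch below is an assumption about how such a check could look; the function name and the min_examples threshold are hypothetical, not values from the pipeline.

```python
from collections import Counter
from typing import Iterable, List


def underrepresented_classes(
    holdout_labels: Iterable[str],
    corpus_labels: Iterable[str],
    min_examples: int = 5,  # hypothetical threshold
) -> List[str]:
    """List outcomes seen in the corpus that the holdout barely covers."""
    holdout_counts = Counter(holdout_labels)
    corpus_counts = Counter(corpus_labels)
    # Counter returns 0 for missing keys, so absent classes are flagged too.
    return sorted(
        label for label in corpus_counts
        if holdout_counts[label] < min_examples
    )
```

A class flagged here is a candidate for targeted hand-labeling before the holdout accuracy is trusted.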
Where the regex assigned a verdict different from the cloud LLM, the second independently-trained LLM sided with the cloud LLM 73% of the time and with the regex 3.5%. Two separately-trained models converging against the regex is a much stronger signal than either alone — that's the rule we use to decide which disagreements to escalate to the human queue.
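The agreement buckets from step 03 can be sketched like this. It is a simplified illustration: the bucket names, the triage and has_release_call helpers, and the assumption that model_verdicts is non-empty are all ours, and the production escalation rule may weigh the models differently.

```python
from collections import Counter
from typing import Sequence


def triage(regex_verdict: str, model_verdicts: Sequence[str]) -> str:
    """Bucket one order by how the model votes line up with the regex."""
    top_label, top_votes = Counter(model_verdicts).most_common(1)[0]
    models_unanimous = top_votes == len(model_verdicts)
    if models_unanimous and top_label == regex_verdict:
        return "ship"              # all five agree
    if models_unanimous:
        return "likely_regex_bug"  # every model agrees against the regex
    return "ambiguous"             # models split: human queue


def has_release_call(regex_verdict: str, model_verdicts: Sequence[str]) -> bool:
    """True when any reader called a release; those cases are read first."""
    return "ORDERED_TO_PROVIDE" in (regex_verdict, *model_verdicts)
```

Sorting the human queue with `sorted(queue, key=lambda c: not has_release_call(*c))` would put release calls at the front, since False sorts before True.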
Massachusetts State Police was four different strings. UMass was much harder.
Every distinct custodian-name string in the corpus is mapped to one canonical agency. The original string is preserved per-order so anyone can audit the merge. The canonical mapping is an additional column on every row — researchers who want to slice by the raw as-filed strings can still do that.
The finite-list trick
Massachusetts has a knowable, closed list of governments. 351 cities and towns. 14 counties. A finite roster of state agencies, quasi-publics, and statutorily-chartered bodies. We treated that list as a strong prior: fuzzy clustering of the corpus's custodian strings against it surfaces candidate clusters, but the cluster is only adopted after a rule maps it to a real, named entity. No string gets canonicalized to "anonymous government body #57" — it lands on a row of the YAML or it stays as-is.
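A rough sketch of the split between "fuzzy match suggests" and "rule adopts". ROSTER and ALIAS_RULES stand in for the YAML file, and the raw strings in them are invented examples rather than strings from the corpus; difflib is the standard-library matcher used here for illustration, not necessarily the clustering the project ran.

```python
import difflib
from typing import List, Optional

# Hypothetical slice of the closed roster: canonical id -> official name.
ROSTER = {
    "massachusetts_state_police": "Massachusetts State Police",
    "city_of_boston": "City of Boston",
}
# Hypothetical human-approved rules: normalized raw string -> canonical id.
ALIAS_RULES = {
    "mass. state police": "massachusetts_state_police",
    "ma state police": "massachusetts_state_police",
}


def normalize(raw: str) -> str:
    return " ".join(raw.lower().split())


def canonicalize(raw: str) -> Optional[str]:
    """Adopt a canonical id only when an explicit rule exists."""
    return ALIAS_RULES.get(normalize(raw))


def suggest(raw: str, cutoff: float = 0.6) -> List[str]:
    """Fuzzy candidates for a human to approve or reject; never auto-applied."""
    by_name = {name.lower(): cid for cid, name in ROSTER.items()}
    hits = difflib.get_close_matches(normalize(raw), list(by_name), n=3, cutoff=cutoff)
    return [by_name[h] for h in hits]
```

A misspelled string can score well in suggest and still return None from canonicalize; it stays as filed until someone writes the rule.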
A worked example: UMass
University of Massachusetts is the case that taught us how much of this needed a human in the loop. UMass is not one entity. It is seven canonical entities (five campuses, the system office, and the separately chartered Building Authority), plus an ambiguous bare-name bucket held in suspension pending case-level review:
umass_amherst · with sub-units (Facilities & Campus Services, the Police Department) rolled up
umass_boston
umass_dartmouth · with police department rolled up
umass_lowell
umass_chan_medical_school · renamed from "University of Massachusetts Medical School" in late 2021 following the $175M Chan / Morningside donation; pre-2022 filings aliased to the same legal entity
umass_system · central system administration / President's Office at 1 Beacon Street, Boston
umass_building_authority · separately chartered quasi-public corporation under G.L. c. 75 § 5 and c. 40I; issues bonds for UMass capital projects; not part of any campus or the system administration, and intentionally not rolled up
The bare string "University of Massachusetts" (10 appeals) is still held in suspension. It could be the system office, or it could be any of the five campuses filing under an abbreviated label. Resolution requires reading the addressee block in each underlying order PDF. That work is queued for human review and tracked in the YAML notes. Premature merging would have produced a leaderboard that overstated the system office's appeal volume by ten cases — not catastrophic, but the wrong answer.
The four moves that did the heavy lifting
Statutory rule reading. A city's police department, fire department, and public records office are sub-offices of the same municipality. Separately chartered bodies (Boston Public Health Commission, UMass Building Authority, MBTA Retirement Board) are kept independent because the statute treats them as independent entities. This is corporate-form analysis, not name similarity.
MA reorganization history. Tracked and dated: EOHED's 2023 split into EOED and EOHLC. The Department of Veterans Services' 2022–2023 promotion from EOHHS to its own Executive Office. The Division of Professional Licensure's 2021 rename to Division of Occupational Licensure. The UMass Medical School → UMass Chan renaming, 2021. The Framingham 2018 Town-to-City transition. Pre-rename filings are aliased to the post-rename canonical entity.
Order-body addressee reading. When the filed custodian string is ambiguous (bare "University of Massachusetts"; bare "Department of State Police"), the addressee block in the actual order PDF disambiguates. The system office's letterhead address is different from any individual campus's.
AI-assisted verification, then hand-check. A research model with web access reviews each candidate cluster and reports whether the strings describe the same agency, with sources. The hand-check is the load-bearing step — the model occasionally proposes a clean merge that a human reads against the statutory framework and rejects.
Every alias decision is recorded in a YAML file under git-style change tracking, so any decision can be re-examined or reversed. Notes fields record the reasoning for the non-obvious calls. The 13 explicit rejections in the canonical-entity layer are merge candidates that looked plausible to the clustering and were ruled apart on substantive grounds — those rejections are themselves a deliverable, because they document the lines that should not be crossed.
The SOR's own corpus has gaps. We disclose them.
The SOR's appellate corpus has its own categorical inconsistencies. We disclose them rather than paper over them.
Material — affects 2025 totals
Case-type drift in 2025.
Starting in 2025, a large block of orders is missing the case_type field — 1,973 of 3,851 in 2025 (51%). We do not know yet whether this is a coding-procedure change at SOR, a data-export gap, or something else. Annual totals for 2025 should be read with this caveat.
Status: open. Under investigation.
Minor — known, bounded
Date format inconsistency.
Some orders have dates in MM/DD/YYYY format, others in YYYY-MM-DD, others in handwritten variants. We parse all of them. A small fraction (well under 1%) have ambiguous or missing dates.
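Multi-format date parsing of this kind is commonly done by trying a short list of formats in order. The sketch below assumes that approach; the third format (a written-out month) is a guess at one of the variants, and parse_order_date is a hypothetical name.

```python
from datetime import date, datetime
from typing import Optional

# The first two formats are named in the text; the third is an assumed variant.
FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%B %d, %Y")


def parse_order_date(raw: str) -> Optional[date]:
    """Try each known format; return None for ambiguous or missing dates."""
    cleaned = raw.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None
```

Returning None rather than guessing keeps the unparseable fraction countable instead of silently mis-dated.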
Out of scope
Reorganizations preceding 2014.
Several MA agencies were renamed or restructured before 2014. The corpus begins in 2014; we don't reach back. Where pre-2014 names appear in the order text, they're recorded as written.
A leaderboard isn't a moral ranking. Read the disclosures.
When we publish a custodian comparison, we try to:
Normalize for size where possible.
Compare against peer-cohort means, not the global mean (police vs. police, town clerks vs. town clerks, quasi-publics vs. quasi-publics).
Disclose any roll-ups (e.g., the Boston umbrella excludes BPHC, which is separately chartered and gets its own row).
Link to every cited SOR determination so a critical reader can verify.
How the homepage rankings are counted
Two of the three callouts are one-line SQL aggregations against the canonical-custodian join described above, refreshed daily. The third is hand-curated for the reason explained below.
Most appealed agency. SELECT custodian_id, COUNT(*) FROM orders GROUP BY 1 ORDER BY 2 DESC. All case types (Appeal, Fee Petition, Time Petition, Reconsideration, In Camera Review, plus a small uncoded set in 2025), rolled up to the canonical agency. The filter matches the labeled definition on the stats page's Top Appeal Targets table ("All appeal types · canonicalized custodians"). Auto-updates daily. City of Boston, 2,298.
Top fee-petition filer. Same shape, WHERE case_type IN ('Fee Petition', 'Time Petition'). A fee petition is when an agency asks the Supervisor to authorize a fee above the statutory cap; a time petition is the analogous extension-of-time filing. Both are agency-side filings against a specific requester. Fees an agency quotes informally to a requester (where the requester then files a regular Appeal) are not in this count; that undercount is disclosed on the stats page. Auto-updates daily. MBTA, 149.
Largest single fee disclosed (hand-curated). The structured fees column on each row of orders is incomplete: many Appeal-classified orders carry the dollar figure in the column too, but the biggest cases on the docket — Hinkle's $1.88M, Faulk's $1.34M, Lipton's $823K — sit in rows where fees = 0 because the dollar amount appears only in the order body, not in a structured field. A naive MAX(fees) against the column returns a 2025 Town of Danvers case ($499,950), missing every mega-fee. The hand-curated mega-fees table on the stats page was assembled by reading those dollar figures out of order bodies directly. Until classifier-rerun #32 re-extracts dollar amounts from Appeal-order bodies into a structured column, this callout updates only when a new mega-fee lands and someone reads it in. $1,877,775 — Hinkle (Gannett) for traffic-stop demographic data, Massachusetts State Police, DENIED, SPR20/1153.
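For illustration only, here is how dollar figures might be pulled out of an order body as candidates for a human to confirm. This is not the queued re-extraction pass: the regex, the function names, and the idea of surfacing the largest figure are all assumptions, and a body can mention hourly rates or unrelated amounts that a reader still has to sort out.

```python
import re
from decimal import Decimal
from typing import List, Optional

# Matches "$1,877,775", "$499,950.00" or "$25"; shorthand like "$1.88M" is not handled.
DOLLAR = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)(\.\d{2})?")


def dollar_amounts(body: str) -> List[Decimal]:
    """Every dollar figure in the text, in order of appearance."""
    amounts = []
    for match in DOLLAR.finditer(body):
        whole = match.group(1).replace(",", "")
        cents = match.group(2) or ""
        amounts.append(Decimal(whole + cents))
    return amounts


def largest_amount(body: str) -> Optional[Decimal]:
    """Largest figure mentioned: a review candidate, not a verified fee."""
    amounts = dollar_amounts(body)
    return max(amounts) if amounts else None
```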
What the auto-updating callouts don't count. The 811 orders (2.7% of the corpus) with no normalized custodian_id are excluded.
Why these three. They name specific institutions a reader can investigate further: the agency most often dragged to the Supervisor, the agency that most often asks the Supervisor to bless a high fee, and the highest dollar figure on the docket. None of the three normalize for agency size; "worst at X" is a count, not a moral judgment. The pages those callouts link to handle the size-normalized comparisons.
Known limits
These are issues we have found and not yet fixed. Listing them here is part of the contract: a working classifier with documented errors is more useful than a black-box claiming perfect accuracy.
Material — affects headline numbers
"Ordered" is one bucket. It should be several.
The ORDERED_TO_PROVIDE outcome triggers on phrases like "is ordered to provide a response." Reading the underlying orders, that catches at least four meaningfully different things:
Released records — agency must turn over the documents.
Respond properly — agency must either release or identify an exemption. Often satisfied by a better-justified denial.
Clarify scope — agency must clarify whether additional records exist, or what the request actually covers.
Provide a fee estimate — agency must quote a price the requester can decide to pay or not.
In a working journalist's read of the docket, true orders to release records are rare. The bulk of "ordered" outcomes are procedural-cure orders telling the agency to respond better. Today's classifier collapses all four into one bucket, and the sub-flag that was supposed to catch the procedural-cure subset (merits_avoided) currently fires on only 3% of ORDERED — far too low.
Estimated impact: the headline rate of disclosure orders is materially overstated. The "actually released records" rate is being studied; expect it to land well below the current 52–58% post-reform band. Sub-categories will be exposed as drill-down filters on the stats page when the next iteration cycle lands.
Status: open, iteration in progress. Surfaced 2026-05-04.
Update 2026-05-05: first split landed. 1,022 anticipatory-close cases (where SOR closes the appeal because the agency promised a future response, then orders the agency to deliver that response within ten business days) were promoted to a separate AGENCY_PROMISED_RESPONSE_ORDERED outcome. Two new finding-type sub-flags now track the doctrinal split between SOR holdings: remand_burden_not_met ("the agency has not met its burden") fires on 1,097 ORDERED cases (8.1%), and remand_unclear ("I find it is unclear whether/how/if") fires on 239 (1.8%). These two phrasings may be doctrinally distinct: SOR's "burden not met" can be either a specificity-failure remand (agency keeps records and re-justifies) or a substantive merits loss for the agency, while "unclear" tends to attach to factual existence questions. They are tracked separately pending litigation-side feedback.
Next: per-exemption verdict extraction (the schema work needed to compute "CORI win rate" or "Exemption (a) success rate" by custodian) is the next iteration block.
Resolved — LLM rescue pass landed
Regex-only residual was 10.1%. After LLM rescue: ~1%.
The regex pipeline alone left 3,106 of 30,800 orders (10.1%) in a NONE bucket — older, more variable disposition phrasings that eluded every pattern. The Gemini Flash-Lite cross-validation pass (see above) produces confident verdicts on the large majority of those. Applied as overrides, the residual unresolvable rate now lands near 1%. The 2018-cohort OCR damage (465 orders with a uniform corrupt fingerprint) is the next batch of fixes queued.
Only 5 orders out of 30,800 are classified PARTIAL (granted in part, denied in part). Hand-reading suggests the true rate is much higher — orders that uphold withholding under one exemption while preserving the agency's right to charge under another are functionally partial wins, but the classifier currently calls them DENIED or UPHELD. Specific anchor phrases ("however, this determination does not preclude," "to the extent") need a dedicated regex pass.
Status: open. Iteration cycle queued.
Outcomes — regular appeals
A regular appeal is filed by the requester after an agency response (or non-response). These are the bulk of the corpus — 27,362 of 30,217 portal-tagged cases. Each disposition resolves to one of the following outcomes.
Substantive — favors requester
ORDERED_TO_PROVIDE
The Supervisor orders the agency to provide a response. As noted in Known limits, this can mean either substantive disclosure or a procedural-cure order to re-justify withholding.
is hereby ordered to provide
is hereby ordered to make
is ordered to provide
is ordered to make said response
is ordered to make a response
shall provide … responsive records
is ordered to review … records … provide
is ordered to redact … (provide | disclose | release)
once the fees are paid … (custodian) … provide
failure to comply with this order may result in referral
has not met its burden … why … may not be redacted
In plain English: SOR is telling the agency to do something — release, redact-and-release, give a better written response, clarify scope, or quote a fee. The current classifier counts all of those equally as a "win." Sub-categories (release vs. respond-better vs. clarify vs. fee-quote) are coming in the next iteration.
Substantive — favors agency
UPHELD_WITHHOLDING
The Supervisor finds the agency's withholding was proper under one or more exemptions.
the appeal is denied
properly withheld
upheld … withholding
has properly denied
acted properly in withholding
whereas … properly (denied | invoked | withheld)
public interest in … does not outweigh … privacy interest
has met its burden (in | to) withhold
i find … has met its burden … withhold
SOR sided with the agency. The records stay sealed.
Procedural close
AGENCY_RESPONDED_DURING_APPEAL
The agency provided a further response while the appeal was pending and the Supervisor closed the matter without ordering substantive disclosure. Common — about a third of all orders. The same matter often comes back as a new appeal.
i will now consider this … appeal closed
will now consider this … appeal closed
this administrative appeal is now closed
this appeal is now closed
considered closed
intends on providing a (further | supplemental | written | forthcoming) response … is ordered to provide … within (10 | ten) business days
SOR closed without reaching the merits. Whether the requester actually got the records they wanted is between them and the agency now.
Substantive — split
PARTIAL
The Supervisor grants disclosure of some records or fields and upholds withholding of others. As noted in Known limits, currently severely under-counted.
partial … disclosure
granted in part … denied in part
A nuanced ruling — typical when one exemption is rejected but a second is preserved.
Other procedural
CLOSED_OTHER
Procedural close on grounds other than agency-responded: parallel litigation, jurisdictional non-subject, no duty to create records, insufficient specificity, unique-right-of-access (police records), and similar.
no duty to create … appeal closed
no authority to compel … create records
§ 6A(d)
unique right of access … (declin | appeal closed)
declin … review … unique right of access
32.08(2)(b) (parallel litigation regulation cite)
parallel … litigation
32.08(1)(f) (specificity regulation cite)
insufficient … specificity
lack of jurisdiction
not subject to the public records law
is dismissed
shares jurisdiction with the superior court
i decline to … intervene
A grab-bag of procedural exits. The matter is closed without a merits ruling.
Procedural — narrow
DECLINED_TO_OPINE
The Supervisor declines to render a determination. Different from CLOSED_OTHER in that no procedural exit was cited; the Supervisor simply found it unnecessary or improper to opine on the question.
unnecessary to opine
no need to opine
declines? to opine
decline to render a determination
declines? to provide a determination
Rare outcomes, each with a small set of trigger phrases. Combined fewer than 100 cases across the corpus.
records … have been destroyed
records were destroyed
the requestor … has withdrawn
the request … has been withdrawn
withdrew (his | her | the | their) request
in[- ]camera review … (is | are) … order
is ordered … submit … in[- ]camera
i respectfully decline … reconsider
decline … reconsideration
reconsideration is denied
requestor … is deceased
requester … has (passed | died)
Edge-case dispositions. The requester withdrew, the records don't exist anymore, the Supervisor needs to see the records privately before ruling, the reconsideration was denied on a separate matter, or the requester died before resolution.
Outcomes — agency-side petitions
Fee petitions and time petitions are filed by the agency, asking the Supervisor's permission to charge for redaction time or to extend the response deadline. They use a different statutory basis than regular appeals — G.L. c. 66 §§ 10(c) and 10(d)(iv) — and they are tracked separately from regular appeals throughout this site.
Favors agency
GRANTED
The Supervisor grants the agency's petition. On a fee petition, this means the requester pays for the redaction work. On a time petition, the agency gets more time.
established good cause … (extension | fee)
established good cause to permit an extension
the … has met its burden … redaction
i find the … has met its burden
may assess a fee for … (segregation | redaction | commercial)
is granted an extension
i (approve | grant) … petition
Note the inversion: GRANTED on a petition is bad for the requester. GRANTED on a regular appeal is good for the requester. Same word, opposite consequence.
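The per-type enum idea can be sketched as a lookup keyed on document type first and outcome second. The type keys ("regular_appeal", "agency_petition") and the favors helper are hypothetical names, and the rare regular-appeal outcomes are omitted for brevity.

```python
from typing import Optional

# Who benefits, per document type. None marks procedural or split outcomes.
OUTCOMES_BY_TYPE = {
    "regular_appeal": {
        "ORDERED_TO_PROVIDE": "requester",
        "UPHELD_WITHHOLDING": "agency",
        "AGENCY_RESPONDED_DURING_APPEAL": None,
        "PARTIAL": None,
        "CLOSED_OTHER": None,
        "DECLINED_TO_OPINE": None,
    },
    "agency_petition": {
        "GRANTED": "agency",
        "DENIED": "requester",
    },
}


def favors(doc_type: str, outcome: str) -> Optional[str]:
    """Look up who an outcome favors; reject outcomes from the wrong enum set."""
    outcomes = OUTCOMES_BY_TYPE[doc_type]
    if outcome not in outcomes:
        raise ValueError(f"{outcome} is not a valid outcome for {doc_type}")
    return outcomes[outcome]
```

Keeping the sets separate means a petition's GRANTED can never be tallied as a requester win by accident.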
Favors requester
DENIED
The Supervisor denies the agency's petition.
has not met its burden … petition
i (deny | decline to grant) … petition
the petition is denied
The agency cannot charge, or cannot extend, on the grounds it asked for.
What we strip before classifying
Some legal-standard recitations appear in nearly every order regardless of outcome. If we let those through, they cause false positives — the recital "may be properly withheld from disclosure" reads like UPHELD even when the disposition is the opposite. The classifier strips these phrases from the disposition region before running outcome regex.
Suffolk § 10A(a) recital
Standard framing language for attorney-client privilege analysis. Cites Suffolk Constr. Co. v. Div. of Capital Asset Mgmt. Appears in 41% of true UPHELD orders and 80% of false-positive cases.
in assessing whether a records custodian has properly withheld records based on the claim of attorney-client privilege
Case-law quote on Exemption (X)
Quoted prior case-law containing "may be properly withheld from disclosure under Exemption (X)." Narrative cite, not a disposition.
may be properly withheld from disclosure under exemption (X)
Suffolk three-prong burden recital
The standard "records custodian claiming the attorney-client privilege has the burden of not only proving…" recital. Appears in nearly every privilege case as mandatory framing.
records custodian claiming the attorney-client privilege has the burden of not only proving …
Each canonical phrase is paired with a fuzzy probe. When the fuzzy probe matches but the canonical phrase doesn't, we flag the case for human review — has the boilerplate drifted (a new way of writing the same thing), or is this a substantive change of law?
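A sketch of the strip-then-probe pairing for one recital. The canonical phrase is the one quoted above; the fuzzy probe regex and the function name are our own guesses at what a probe could look like, not the library the pipeline ships.

```python
import re
from typing import Tuple

CANONICAL = (
    "in assessing whether a records custodian has properly withheld records "
    "based on the claim of attorney-client privilege"
)
CANONICAL_RE = re.compile(re.escape(CANONICAL), re.I)
# Hypothetical probe: same skeleton, with bounded slack between the key words.
PROBE_RE = re.compile(
    r"in assessing whether .{0,80}?properly withheld .{0,80}?privilege", re.I
)


def strip_recital(region: str) -> Tuple[str, bool]:
    """Return (text with the canonical recital removed, needs_human_review)."""
    text = " ".join(region.split())  # collapse line breaks and repeated spaces
    stripped = CANONICAL_RE.sub(" ", text)
    # If the probe still fires after stripping, a non-canonical variant remains.
    return stripped, bool(PROBE_RE.search(stripped))
```

Running the probe on the already-stripped text is what separates the two cases: an exact recital disappears quietly, while a drifted one is left in place and flagged.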
How accuracy is measured here
Each new pattern is tested three ways before being added: (1) does the smoke set still pass, (2) does it improve the holdout against hand-labels, and (3) does the corpus distribution shift in a defensible way. Pattern changes are version-controlled and the prior corpus distributions are kept on disk as diff baselines.
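The three checks could be wired together roughly as below. This is a sketch under assumptions: classifiers are modeled as plain functions from text to label, smoke_set and holdout as lists of (text, label) pairs, and all function names are hypothetical. Whether a shift is "defensible" remains a human call; the code only surfaces it.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

Classifier = Callable[[str], str]
Labeled = List[Tuple[str, str]]


def accuracy(classify: Classifier, labeled: Labeled) -> float:
    return sum(classify(text) == label for text, label in labeled) / len(labeled)


def distribution_shift(old: Iterable[str], new: Iterable[str]) -> Dict[str, int]:
    """Per-outcome count change between two runs; unchanged outcomes omitted."""
    old_counts, new_counts = Counter(old), Counter(new)
    labels = sorted(set(old_counts) | set(new_counts))
    return {k: new_counts[k] - old_counts[k] for k in labels if new_counts[k] != old_counts[k]}


def evaluate_patch(old: Classifier, new: Classifier, smoke_set: Labeled,
                   holdout: Labeled, corpus: List[str]) -> dict:
    return {
        "smoke_ok": all(new(text) == label for text, label in smoke_set),
        "holdout_improved": accuracy(new, holdout) > accuracy(old, holdout),
        "shift": distribution_shift([old(t) for t in corpus], [new(t) for t in corpus]),
    }
```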
When a finding like the "ordered to respond" issue surfaces, it is added to Known limits immediately and queued as the next iteration cycle. Patches do not silently change the headline numbers; they ship with a note explaining what changed and why.
Audit invitation
If you read an order on the underlying SOR docket and our classification looks wrong, send us the case number and what we got wrong. We will check, fix the pattern if the fix is a real one, and credit you in the changelog.