AI-Enabled Scoring
Full transparency on how artificial intelligence is used in this project — what it does, what it cannot do, why the approach was chosen, and where the risks are.
AI generates proposed assessments. Humans review and approve every single one before it enters the published dataset. Nothing reaches the public record without that review.
That is not a disclaimer buried at the bottom of a methodology document. It is the design. The rest of this page explains why the pipeline was built this way, what it actually does well, where it introduces risk, and what the review process looks like in practice.
The Cultiness Spectrum Dataset covers 370 organizations across every major category of American institutional life. Each organization requires assessment across ten criteria, with evidence-based body text, source citations, confidence ratings, trajectory assessment, and two independent metric scores. Producing that at scale — consistently, evenhandedly, and to a documented methodological standard — is not feasible for a single researcher working manually.
The alternative to AI-assisted scoring is not a better dataset. It is a much smaller one, produced more slowly, with less cross-batch consistency because the researcher's judgment naturally drifts over time. A dataset of 50 organizations scored manually over two years answers fewer questions and introduces its own consistency problems.
AI is used here not to replace analytical judgment but to make consistent application of a documented methodology tractable at a scale that would otherwise be impossible for a single researcher.
The specific model used is Anthropic's Claude Sonnet, accessed via the Anthropic API. Each organization is scored in a separate API call using a system prompt containing the full V4.0 methodology, calibration anchor references across the scoring spectrum, and explicit instructions for N/A discipline, evenhandedness, and evidence standards. The model produces a structured JSON output with scores, body texts, confidence ratings, and source citations.
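As a rough illustration of what "structured JSON output" can mean in practice, the sketch below parses one model response and rejects records that are structurally incomplete. The field names (`trajectory`, `criteria`, `score`, `body_text`, `confidence`, `sources`) are assumptions made for this example, not the project's actual schema, and the API call itself is left out.

```python
import json
from dataclasses import dataclass
from typing import List, Union

# Trajectory labels as listed in the methodology text.
TRAJECTORIES = {"Escalating", "Stable", "Declining", "Defunct"}
NUM_CRITERIA = 10  # each organization is assessed across ten criteria


@dataclass
class CriterionAssessment:
    score: Union[float, str]  # a number, or the string "N/A"
    body_text: str
    confidence: str
    sources: List[str]


def parse_assessment(raw: str) -> List[CriterionAssessment]:
    """Parse one model response; raise ValueError on an incomplete record.

    All field names here are hypothetical and chosen for illustration.
    """
    data = json.loads(raw)
    if data.get("trajectory") not in TRAJECTORIES:
        raise ValueError(f"unknown trajectory: {data.get('trajectory')!r}")
    criteria = data.get("criteria", [])
    if len(criteria) != NUM_CRITERIA:
        raise ValueError(f"expected {NUM_CRITERIA} criteria, got {len(criteria)}")
    parsed = []
    for i, item in enumerate(criteria, start=1):
        score = item.get("score")
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not (is_number or score == "N/A"):
            raise ValueError(f"criterion {i}: score must be a number or 'N/A'")
        if not item.get("body_text") or not item.get("confidence"):
            raise ValueError(f"criterion {i}: missing body text or confidence")
        parsed.append(
            CriterionAssessment(
                score=score,
                body_text=item["body_text"],
                confidence=item["confidence"],
                sources=list(item.get("sources", [])),
            )
        )
    return parsed
```

A check like this only confirms that a record is complete in form. It says nothing about whether the evidence behind it is sound, which is the job of the human review.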
Consistent application of the framework
The AI applies the same ten criteria to the 370th organization with the same care and attention as the first. Human researchers drift: fatigue, changing intuitions, and evolving interpretations of edge cases all shift their judgment over time. The AI applies the documented methodology consistently because it has no memory of previous sessions and no accumulated fatigue. Each assessment starts fresh against the same standard.
Cross-ideological evenhandedness
A human researcher assessing both MAGA and Antifa, both the Black Church and the Southern Baptist Convention, both the NAACP and the Heritage Foundation, will bring conscious and unconscious political judgments to each assessment regardless of effort. The AI applies the same criteria to all of them. It has no political commitments, no social circle whose reactions it is anticipating, and no career consequences attached to particular findings. The evenhandedness of the results is, in part, a product of the tool.
Rapid synthesis of public documentation
The AI has broad familiarity with court records, investigative journalism, academic scholarship, regulatory findings, and institutional history across a wide range of organizations. It can quickly identify which documented behaviors are relevant to each criterion and construct evidence-based body text with source citations. A human researcher would spend hours on each organization doing the same initial synthesis.
Structured, auditable output
Every AI-generated assessment produces a complete structured record: scores, body text, confidence ratings, sources, and calculated composite metrics. This makes the human review process tractable — the reviewer is checking a complete record against documented standards, not filling in gaps or reconstructing reasoning.
Calibration anchor consistency
The system prompt includes reference anchors across the full scoring spectrum — from 100% Cult tier to 5% Healthy Group. This gives the model a comparative reference frame that helps prevent score inflation or compression over large batches. The anchors are drawn directly from the database and updated when methodology changes.
The limitations and risks that follow are documented honestly, not minimized.
The model does not know what it does not know
AI confidence does not reliably track evidence quality. The model may produce a well-structured, internally consistent assessment for an organization with thin public documentation — and that assessment will look identical in form to one backed by extensive court records and investigative journalism. The confidence ratings in each entry are designed to surface this, but they are themselves AI-generated and subject to the same limitation. This is why the human reviewer must independently assess source quality, not just accept the model's confidence rating.
Training data cutoff and recency
The model's knowledge has a training cutoff. Organizations that have changed significantly since then, through scandals, leadership transitions, policy reversals, or membership collapse, may be assessed based on outdated information. Trajectory assessments (Escalating / Stable / Declining / Defunct) are particularly vulnerable to this. The human review process includes checking for major post-training developments, but that check is not applied systematically to every entry.
Pattern matching versus genuine understanding
The model applies the methodology through pattern recognition against its training data. For organizations that appear frequently in public discourse, this works well. For organizations that are underrepresented in English-language text — smaller regional movements, less-covered religious formations, non-English-origin organizations — the model may produce thinner, less reliable assessments. Low confidence ratings in these entries reflect this, but the limitation is worth stating directly.
Systematic bias in the training data
If the corpus on which the model was trained over- or under-represents certain types of organizations, certain political formations, or certain ideological perspectives, those biases may appear in the assessments. The evenhandedness of the methodology is designed to catch gross asymmetries, but subtle systematic biases in how the model characterizes different types of organizations are difficult to detect and may not surface through the review process. This is the most difficult risk to quantify.
The reviewer is also a person
The human review gate is only as good as the reviewer applying it. The reviewer brings their own knowledge gaps, time constraints, and potential blind spots to every assessment. The methodology documents what the review is supposed to check. Whether that check is consistently executed is a function of the reviewer's discipline and time — not a guarantee built into the system.
Source citation quality
The model cites sources but cannot verify that those sources say what it claims they say, that the sources still exist at the cited URLs, or that the sources have not been updated or retracted since training. Source citations in AI-generated assessments should be independently verified before being relied upon for consequential purposes. The human review process checks that cited sources are plausible and consistent with the body text, but does not systematically verify every citation.
Every AI-generated assessment passes through human review before entering the published dataset. The review is not a rubber stamp. It is a structured check against the methodology.
The reviewer verifies:
Each score is consistent with the body text — if the body text describes structural absence, the score must be N/A, not a floor number
N/A designations have documented structural rationale, not just low evidence
Cited sources are plausible and consistent with the claims made in the body text
The assessment reflects consistent application across the ideological and cultural spectrum — a comparable organization on the other side of the political spectrum would be scored the same way
Confidence ratings reflect actual evidence quality, not just structural completeness of the output
The composite score matches the formula exactly given the criterion scores
Young's Original Score was derived from independent application of the binary checklist, not mechanically from composite intensity
The trajectory assessment reflects documented current state, not just historical reputation
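Two of the checks above are mechanical enough to sketch in code: that every N/A carries a documented rationale, and that the reported composite matches the formula exactly. The sketch below is an illustration only. The composite formula is not reproduced on this page, so it is passed in as a parameter, and the `placeholder_mean` used in the example is a stand-in, not the project's formula. The function and argument names are hypothetical.

```python
from typing import Callable, Dict, List, Union

Score = Union[float, str]  # a number, or the string "N/A"


def mechanical_review(
    scores: Dict[str, Score],
    na_rationales: Dict[str, str],
    reported_composite: float,
    composite_formula: Callable[[List[float]], float],
) -> List[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    for criterion, score in scores.items():
        # An N/A needs a documented structural rationale, not just low evidence.
        if score == "N/A" and not na_rationales.get(criterion, "").strip():
            problems.append(f"{criterion}: N/A without documented structural rationale")
    numeric = [s for s in scores.values() if s != "N/A"]
    expected = composite_formula(numeric)
    # The methodology asks for an exact match, so no tolerance is applied.
    if expected != reported_composite:
        problems.append(
            f"composite {reported_composite} does not match formula result {expected}"
        )
    return problems


def placeholder_mean(values: List[float]) -> float:
    """Stand-in formula for the example only; NOT the project's composite formula."""
    return sum(values) / len(values)


problems = mechanical_review(
    scores={"c1": 2, "c2": 4, "c3": "N/A"},
    na_rationales={},
    reported_composite=3.0,
    composite_formula=placeholder_mean,
)
# problems == ["c3: N/A without documented structural rationale"]
```

The remaining checks, such as source plausibility, evenhandedness, and whether confidence reflects evidence quality, depend on reviewer judgment and are not the kind of thing a script like this can settle.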
Assessments that fail the review are returned for revision or rejected entirely. Accepted assessments are logged in an immutable audit trail with timestamp, methodology version, and reviewer notation. Score changes after initial acceptance are also logged — the full history of every score in the dataset is preserved and traceable.
The AI-assisted methodology produces assessments that are more consistent and more comprehensive than a single researcher could produce manually. It also introduces risks that a purely manual process would not have in the same form.
The appropriate use of this dataset is as a research resource and analytical tool — a structured starting point for investigation, not a final determination. High composite scores document institutional architecture worth examining. They are not verdicts. Low scores do not certify institutional health. Every entry should be read with the understanding that it represents a human-reviewed AI assessment anchored to publicly available documentation, produced under a documented methodology, at a specific point in time.
Model and Version
Current scoring model: claude-sonnet-4-6 (Anthropic)
Methodology version applied: V4.0
All accepted scores are version-tagged in the audit log. Score history is immutable — no accepted score can be modified or deleted, only superseded by a new accepted score with documented rationale.
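One way to picture the "superseded, never modified" rule is an append-only log in which each accepted score is a frozen record and a later score for the same organization requires a rationale. This is a minimal in-memory sketch under that assumption. The class and field names are hypothetical, and the project's actual audit log may be stored and structured quite differently.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class AuditEntry:
    organization: str
    score: float
    methodology_version: str
    model: str
    reviewer_note: str
    rationale: str
    timestamp: str


class AuditLog:
    """Append-only log: entries are added, never edited or removed."""

    def __init__(self) -> None:
        self._entries: List[AuditEntry] = []

    def history(self, organization: str) -> Tuple[AuditEntry, ...]:
        """Every accepted score for an organization, oldest first."""
        return tuple(e for e in self._entries if e.organization == organization)

    def current(self, organization: str) -> Optional[AuditEntry]:
        """The most recently accepted score, or None if there is none yet."""
        past = self.history(organization)
        return past[-1] if past else None

    def accept(
        self,
        organization: str,
        score: float,
        methodology_version: str,
        model: str,
        reviewer_note: str,
        rationale: str = "",
    ) -> AuditEntry:
        # Superseding an earlier accepted score needs a documented rationale.
        if self.current(organization) is not None and not rationale.strip():
            raise ValueError("superseding an accepted score requires a rationale")
        entry = AuditEntry(
            organization=organization,
            score=score,
            methodology_version=methodology_version,
            model=model,
            reviewer_note=reviewer_note,
            rationale=rationale,
            timestamp=datetime.now(timezone.utc).isoformat(),
        )
        self._entries.append(entry)
        return entry
```

For example, calling `accept("Example Org", 42.0, "V4.0", "claude-sonnet-4-6", "reviewed")` and later calling it again with a new score and a rationale leaves both entries in `history("Example Org")`, with `current` returning the newer one. The organization name and score here are made up for illustration.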