How We Test AI Against 2,000 Years of Theology

The problem with AI and theology is not that the models are stupid. They are not. The problem is that they are trained to be neutral. A language model will hedge on the divinity of Christ the same way it hedges on whether Python is better than JavaScript. It treats theological truth claims as opinions to be balanced, not convictions to be stated.

For a Christian who believes that Jesus is God incarnate, “many theologians believe Jesus was divine” is not an answer. It is a dodge. The Nicene Creed does not say “we think he might be of one substance with the Father.” It says he is. The Council of Nicaea in 325 AD did not convene to express a range of perspectives.

This is why we built FaithMark.

What FaithMark is

FaithMark is a theological accuracy benchmark for AI. It is not a filter or a content moderation layer. It is a structured evaluation system that tests whether an AI model can state Christian theology correctly, within the right tradition, without drifting toward known heresies.

It has three layers. Each one tests something different, and a model has to pass all three to be considered theologically sound.

Layer 1: Mere Christianity

The first layer is roughly 100 questions on the fundamentals that all Christians share. We borrowed the framing from C.S. Lewis, who used “mere Christianity” to describe the common ground across traditions. These are the things that make someone a Christian rather than something else.

The Trinity. The Incarnation. The bodily Resurrection. The Atonement. The authority of Scripture. The return of Christ. These are not negotiable across any tradition we serve. A Catholic, a Baptist, and a Greek Orthodox Christian all affirm the Nicene Creed. If the AI cannot clearly state that God is three persons in one essence, it fails everything else. There is no point testing tradition-specific nuance if the model cannot get the basics right.

We test for affirmation, not just accuracy. The AI has to state these truths as truths, not as one perspective among many. “Christians believe Jesus rose from the dead” gets a lower score than “Jesus rose from the dead on the third day.” The first is a sociological observation. The second is a confession of faith. We want the second.

This layer is pass/fail. If a model scores below our threshold on Mere Christianity, it does not move to Layer 2. We do not ship it.
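A minimal sketch of how such a gate could work. The 0-to-1 per-question scores, the `PASS_THRESHOLD` value, and the function names are assumptions made for illustration; the actual threshold and scoring scheme are not described in this post.

```python
from statistics import mean

# Hypothetical threshold; the real cut-off is not stated in this post.
PASS_THRESHOLD = 0.9


def layer1_score(question_scores: list[float]) -> float:
    """Average the per-question scores (each assumed to lie in [0, 1])."""
    if not question_scores:
        raise ValueError("Layer 1 needs at least one scored question")
    return mean(question_scores)


def passes_layer1(question_scores: list[float],
                  threshold: float = PASS_THRESHOLD) -> bool:
    """Pass/fail gate: a model below the threshold never reaches Layer 2."""
    return layer1_score(question_scores) >= threshold


# A confession ("Jesus rose from the dead") is assumed to score higher
# than a report ("Christians believe Jesus rose from the dead").
confessing_model = [1.0, 1.0, 0.9, 1.0]
reporting_model = [0.6, 0.5, 0.7, 0.6]

print(passes_layer1(confessing_model))  # True
print(passes_layer1(reporting_model))   # False
```

Averaging is only one possible aggregation. A stricter gate could also require a minimum score on every individual question, so that one badly wrong answer on the Trinity cannot be offset by good answers elsewhere.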

Layer 2: Tradition-specific orthodoxy

This is where it gets interesting. Layer 2 has 40 to 60 questions per tradition, and the correct answer depends on which tradition is asking.

Ask about the Eucharist. A Catholic answer should affirm transubstantiation: the bread and wine become the body and blood of Christ in substance, while the appearances remain. The Council of Trent defined this in 1551. A Reformed answer should describe the spiritual presence of Christ in the Lord's Supper, following Calvin's teaching that believers truly receive Christ by faith through the Holy Spirit. A Lutheran answer should affirm the real presence in, with, and under the elements, per the Formula of Concord.

All three of these answers are correct within their tradition. A generic answer that splits the difference is wrong for everyone. FaithMark tests that the AI gives the right answer for the right tradition. When a Catholic user asks about the Eucharist, they get a Catholic answer. Not a survey of views. Not “Christians have different opinions on this.” The actual teaching.
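The idea that the expected answer is keyed by tradition can be sketched with a small lookup table. The `ANSWER_KEY` structure and the `matches_tradition` helper are hypothetical, and the keyword check is a deliberately crude stand-in for real grading. The expected terms come from the three teachings summarized above.

```python
# Hypothetical answer key: (topic, tradition) -> terms a correct answer
# is expected to contain. Real grading would be far richer than this.
ANSWER_KEY: dict[tuple[str, str], list[str]] = {
    ("eucharist", "catholic"): ["transubstantiation"],
    ("eucharist", "reformed"): ["spiritual presence"],
    ("eucharist", "lutheran"): ["in, with, and under"],
}


def matches_tradition(topic: str, tradition: str, response: str) -> bool:
    """Return True if the response contains every expected term."""
    expected = ANSWER_KEY[(topic, tradition)]
    text = response.lower()
    return all(term in text for term in expected)


catholic_answer = (
    "The Church teaches transubstantiation: the substance of the bread "
    "and wine becomes the body and blood of Christ."
)
generic_answer = "Christians have different opinions on this."

print(matches_tradition("eucharist", "catholic", catholic_answer))  # True
print(matches_tradition("eucharist", "catholic", generic_answer))   # False
print(matches_tradition("eucharist", "lutheran", catholic_answer))  # False
```

The last call shows the point of the layer: an answer that is correct for a Catholic user is scored as wrong for a Lutheran one.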

We test tradition-specific questions across the major areas where traditions diverge: sacraments, salvation, ecclesiology, Mariology, the saints, eschatology, and church authority. Each tradition's question set was developed by consulting the relevant catechisms, confessions, and authoritative documents. The Catechism of the Catholic Church. The Westminster Confession. The Orthodox Study Bible.

Layer 3: Controversy detection

The third layer is the one that catches what the other two miss. It tests for proximity to known heresies using embedding-based similarity scoring. This sounds technical, so here is what it actually does.

We have a database of heretical positions. Real ones, from real church history. Arianism taught that Jesus was a created being, not eternal God. The Council of Nicaea condemned this in 325. Modalism taught that the Father, Son, and Holy Spirit are not distinct persons but different modes of one person. Sabellius taught this in the third century and the church rejected it. Pelagianism taught that humans can achieve salvation through their own effort without God's grace. Augustine fought this in the fifth century. Nestorianism divided Christ into two separate persons. The Council of Ephesus condemned it in 431.

These heresies are not ancient curiosities. They show up constantly in AI-generated theology, because language models do not know the difference between an orthodox statement and a heretical one. They just predict the next likely word. A model can easily produce something that sounds Christian but is actually Arian, because the statistical patterns of Arian language and orthodox language are close together.

Layer 3 takes every AI response, converts it to an embedding vector, and measures its similarity to each known heresy in our database. If a response about the Trinity scores above our proximity threshold to Modalism, it gets flagged. The response might look fine to a casual reader. Layer 3 catches the drift.
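A minimal sketch of embedding-based proximity scoring, using cosine similarity as the measure. The similarity metric, the toy three-dimensional vectors, the `PROXIMITY_THRESHOLD` value, and all names here are assumptions for illustration. A real system would obtain high-dimensional vectors from an embedding model.

```python
import math

# Toy 3-dimensional "embeddings" standing in for real model output.
HERESY_VECTORS: dict[str, list[float]] = {
    "arianism": [0.9, 0.1, 0.2],
    "modalism": [0.1, 0.9, 0.3],
}

# Hypothetical flagging threshold.
PROXIMITY_THRESHOLD = 0.85


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def heresy_proximity(response_vector: list[float]) -> dict[str, float]:
    """Similarity of one response to every heresy in the database."""
    return {
        name: cosine_similarity(response_vector, vec)
        for name, vec in HERESY_VECTORS.items()
    }


def flagged(response_vector: list[float],
            threshold: float = PROXIMITY_THRESHOLD) -> list[str]:
    """Names of heresies the response sits at or above the threshold for."""
    scores = heresy_proximity(response_vector)
    return [name for name, score in scores.items() if score >= threshold]


# A response whose vector leans toward the "modalism" direction.
print(flagged([0.2, 0.8, 0.3]))  # ['modalism']
```

Because the similarity is a number rather than a yes/no verdict, the same scores can be reused later to distinguish mild drift from strong drift.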

The five-step pipeline

FaithMark runs every AI response through a five-step evaluation before it reaches you.

First, claim extraction. The system identifies every theological claim in the response. “Jesus died for our sins” is a claim. “Prayer is good for you” is not. We only evaluate actual theological content, not general advice.

Second, verification. Each extracted claim is checked against the relevant doctrinal source for the user's tradition. A claim about Mary's perpetual virginity is verified against Catholic teaching for a Catholic user and against Reformed confessions for a Reformed user.

Third, framing detection. This catches hedging and relativism. If the AI says “many theologians believe Jesus was divine” instead of “Jesus is God incarnate,” that gets flagged. We test for conviction, not just accuracy. An AI that states true things in a wishy-washy way is still failing the user.
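Framing detection could be approximated, very roughly, with pattern matching on hedging phrases. The pattern list and the `find_hedges` helper below are hypothetical, and a production detector would likely need more than regular expressions. The sketch only shows the shape of the check.

```python
import re

# Hypothetical hedging patterns; a real list would be longer and
# tuned by human reviewers.
HEDGE_PATTERNS = [
    r"\bmany (theologians|christians|scholars) believe\b",
    r"\bsome (theologians|christians|scholars) (believe|hold|argue)\b",
    r"\bchristians believe\b",
    r"\bchristians have different opinions\b",
]
HEDGE_RE = re.compile("|".join(HEDGE_PATTERNS), flags=re.IGNORECASE)


def find_hedges(response: str) -> list[str]:
    """Return each hedging phrase found, in order of appearance."""
    return [match.group(0) for match in HEDGE_RE.finditer(response)]


print(find_hedges("Many theologians believe Jesus was divine."))
# ['Many theologians believe']
print(find_hedges("Jesus is God incarnate."))
# []
```

A phrase list like this will produce false positives, for example when a response quotes a hedge in order to reject it, which is one reason uncertain cases are better routed to a person than decided automatically.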

Fourth, heresy proximity scoring. This is Layer 3 running on the full response. Every claim is measured against our database of historical heresies. The scoring is continuous, not binary. A response can be mildly Pelagian or strongly Pelagian, and the system captures that distinction.

Fifth, human audit. Responses that score in the uncertain range on any step get queued for human review. We do not ship borderline theology. If the system is not confident, a person looks at it.
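The routing rule in the fifth step can be sketched as a function over per-step scores. The 0-to-1 risk scale, the band edges `UNCERTAIN_LOW` and `UNCERTAIN_HIGH`, the `StepResult` type, and the three outcome labels are all assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical band edges: risk scores inside this range are "uncertain".
UNCERTAIN_LOW = 0.4
UNCERTAIN_HIGH = 0.7


@dataclass
class StepResult:
    step: str    # e.g. "verification", "framing", "heresy_proximity"
    risk: float  # assumed scale: 0.0 = clearly fine, 1.0 = clearly a problem


def route(results: list[StepResult]) -> str:
    """Decide what happens to a response given per-step risk scores."""
    # A clear problem on any step flags the response outright.
    if any(r.risk > UNCERTAIN_HIGH for r in results):
        return "flag"
    # An uncertain score on any step sends the response to a person.
    if any(UNCERTAIN_LOW <= r.risk <= UNCERTAIN_HIGH for r in results):
        return "human_review"
    return "pass"


print(route([StepResult("verification", 0.1), StepResult("framing", 0.2)]))
# pass
print(route([StepResult("verification", 0.1), StepResult("framing", 0.5)]))
# human_review
print(route([StepResult("heresy_proximity", 0.9)]))
# flag
```

Checking for clear problems before uncertain ones means a response with one strongly heretical claim is flagged immediately rather than waiting in the review queue.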

Why nobody else does this

Building FaithMark required knowing both AI engineering and historical theology. You need someone who understands embedding spaces and someone who understands why the filioque clause helped split the church in 1054. Those two people are rarely in the same room.

Most Christian AI products do not test their theology at all. They use a general-purpose model, maybe add a system prompt that says “answer from a Christian perspective,” and hope for the best. That is like using spell check as your editor. It catches some obvious errors and misses everything subtle.

We checked. We found no other Christian AI product with a theological benchmark. Most cannot tell you whether their AI can correctly state the Nicene Creed. That is a low bar, and they have not shown that they clear it.

FaithMark is not perfect. Theology is deep and complex and no automated system will catch everything. But it catches a lot. And it catches things that matter: the difference between orthodoxy and heresy, the difference between your tradition's teaching and a generic summary, the difference between conviction and hedging. For an AI that is going to help people study Scripture and pray, that matters.

AI that knows the difference between orthodoxy and heresy.

Join the Talents waitlist. Theology tested, not guessed.
