Data poisoning defences for fintech AI

Data poisoning is the quiet threat in fintech AI: planted in grounding data, it fires long after you stop looking. How to defend what you control.

By Giovanni Salvador · 12 June 2026 · 6 min read

Data poisoning is the threat that hides in plain sight: planted now, triggered later, durable across thousands of queries, and almost invisible to the teams looking in the wrong place.

When a fintech CISO asks me about AI security risks, prompt injection usually comes up first. It should. But the conversation rarely gets to data poisoning, and that is a mistake. Prompt injection acts in the request that carries it. Data poisoning is different: it is planted in your grounding data or model artefacts before a query ever arrives, and it fires persistently on every retrieval that surfaces it, potentially for months. The attack is quiet by design.

For most fintech firms, the honest picture is narrower and more tractable than the academic poisoning literature suggests. You are almost certainly a model consumer, not a trainer. Your threat surface is concentrated in two places: the retrieval-augmented generation (RAG) corpus you build and maintain, and the model artefacts you load from third-party sources. Getting those two corners right is worth far more than chasing exotic training-time attacks against runs you do not perform.

The stake

The defining property of data poisoning is deferral. An injection attack is live: it acts in the request that carries it. A poisoned corpus document is dormant: it sits in your index, passes every smoke test, and activates under conditions ordinary evaluation does not cover.

In a fintech context, the consequences are precisely the ones regulators and boards worry about. A poisoned exceptions document in a credit assistant’s policy corpus could quietly authorise out-of-policy limit increases for applicants matching a specific narrative. A poisoned KYC guidance note could steer an onboarding agent to wave through a particular risk pattern. A poisoned dispute-handling article could bias chargeback outcomes systematically. Each of these is durable, consequential, and introduced through a write path you own.

The RAG corpus: your highest-probability exposure

Almost every fintech I work with has built a RAG corpus: indexed policies, case notes, product documentation, knowledge-base articles. Most firms do not run their own pre-training. The RAG corpus is where you own the risk and where you have the most leverage to control it.

RAG-corpus poisoning means planting attacker-controlled content in that grounding-data store so that, when it is retrieved into the context window, it persistently steers the model’s output. Once the poisoned document is retrieved, the delivery mechanism is indirect prompt injection, which exploits a structural property of language models: they do not reliably separate instructions from data in their context. What makes this a poisoning threat rather than a one-shot injection is persistence: the attack compromises the integrity of the store itself, not a single request.

Three properties define the exposure:

It is durable. A direct injection lives for one request. A poisoned corpus document lives in the index until someone removes it, affecting every retrieval that surfaces it.
It is a write-path problem. The exposure is set by who and what can write to the corpus. An ingestion pipeline that indexes inbound email, scraped web content, customer-supplied documents, or third-party feeds without provenance control is an open poisoning channel.
It widens with agent memory. Where an agent persists state across sessions, a poisoned entry written in one session can influence decisions many steps later. This is the same latency that defines a backdoor, expressed through memory rather than model weights.
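
The write-path point can be made concrete with a gate at ingestion. What follows is a minimal sketch, not a reference implementation: the source names in `TRUSTED_SOURCES`, the `IngestRecord` fields, and the `gate_ingest` function are all hypothetical, and a real pipeline would load the allowlist from reviewed configuration rather than hard-code it.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical allowlist: the only sources permitted to write to the corpus.
TRUSTED_SOURCES = {"policy-cms", "compliance-wiki"}


@dataclass(frozen=True)
class IngestRecord:
    """Provenance captured for every document accepted into the corpus."""
    doc_id: str
    source: str
    author: str
    sha256: str
    ingested_at: str


def gate_ingest(doc_id: str, source: str, author: str, content: bytes) -> IngestRecord:
    """Refuse writes from unlisted sources; otherwise return a provenance record."""
    if source not in TRUSTED_SOURCES:
        raise PermissionError(f"source {source!r} may not write to the corpus")
    return IngestRecord(
        doc_id=doc_id,
        source=source,
        author=author,
        sha256=hashlib.sha256(content).hexdigest(),
        ingested_at=datetime.now(timezone.utc).isoformat(),
    )
```

The design choice is that the default answer is refusal: inbound email or scraped content only reaches the index if someone has deliberately added that channel to the allowlist.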

Model artefacts: the inherited corner to prioritise

When you load an open-weight model, pull a fine-tuned checkpoint, or accept a vendor model artefact, you are running a binary you did not build, distributed through a channel you do not fully control. The threats here are generic software supply-chain failures applied to AI artefacts:

Distribution-channel risk. Models are pulled from public hubs and registries. Account takeover of a publisher, typosquatted model names, or a tampered artefact from a compromised mirror are the same threats you already manage in your software supply chain, applied to a dependency with opaque, unreadable internals.
Unsafe serialisation. Some model formats execute code on load; pickle-based formats are the familiar example. A malicious artefact can compromise the host, not just alter model behaviour. This is a host-compromise vector disguised as a model file.
Tampered fine-tunes. A fine-tuned checkpoint from a third party can carry behaviour the base model did not, including a deliberately planted backdoor. Fine-tunes are where a lot of fintech value is added, and therefore where a lot of consumed risk enters.

The uncomfortable fact about model weights is that they are opaque. Unlike source code, you cannot read a weight set to understand what it will do. Behavioural testing samples the input space; it cannot prove the absence of a triggered behaviour. Assurance has to come from chain of custody, not inspection.
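
Chain of custody can start with something as plain as a pinned digest checked before load. The sketch below assumes you have recorded a SHA-256 for each approved artefact at review time; the function names and the suffix list are illustrative, and the list of pickle-based extensions is not exhaustive.

```python
import hashlib
from pathlib import Path

# Illustrative, not exhaustive: extensions commonly used by pickle-based
# formats, which can run arbitrary code when deserialised.
UNSAFE_SUFFIXES = {".pkl", ".pickle", ".pt", ".pth"}


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artefact(path: Path, pinned_sha256: str) -> None:
    """Raise ValueError unless the file is in an allowed format and matches its pin."""
    if path.suffix.lower() in UNSAFE_SUFFIXES:
        raise ValueError(f"{path.name}: format can execute code on load")
    actual = sha256_file(path)
    if actual != pinned_sha256.lower():
        raise ValueError(f"{path.name}: digest {actual} does not match pinned value")
```

Calling `verify_artefact` as a precondition of every load means a tampered mirror or a swapped checkpoint fails closed. It says nothing about what the weights do, only that they are the bytes you approved.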

Backdoors: real, but not your first priority

A backdoor is a deliberately trained-in behaviour: the model acts normally except when a specific trigger appears, at which point it produces an attacker-chosen output. Research has demonstrated that such backdoors can survive subsequent safety fine-tuning, and that adversarial training can end up teaching a model to conceal the triggered behaviour more effectively rather than removing it.

Be honest about how likely this is. The base rate of a targeted backdoor planted specifically to attack your workflow is low. It should not crowd out the more probable corpus-poisoning and artefact-compromise threats. The defensible posture is not “test harder for triggers,” which is a losing game against an unknown condition. It is provenance: prefer models from publishers you have decided to trust, whose origin, integrity, and signature you can verify. That is a procurement decision, not an evaluation one.

Training-data poisoning: mostly not your attack

If you are not running pre-training, training-data poisoning is largely not your attack to defend. Two qualifications keep it on the map rather than off it.

First, you inherit upstream poisoning through the model. If a foundation or open-weight model you consume was trained on a poisoned corpus, you carry whatever was planted. This collapses back into provenance: you cannot inspect the training set, so you must rely on the publisher’s trustworthiness and on artefact integrity verification.

Second, it re-enters scope the instant you fine-tune. The moment you do, the data you feed the fine-tune is a poisoning surface you own, governed by the same write-path question as your RAG corpus: who and what can write to it, with what provenance?

Detecting poisoning at runtime

Because poisoning is dormant by design, detection is harder than with injection. The most reliable signal is behavioural drift: systematic deviation from a known-good output baseline, measured over time. This is why runtime behavioural baselining matters alongside build-time evaluation.
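
One simple way to express "systematic deviation from a baseline" is a two-proportion z-score on a binary outcome, such as the share of credit queries that end in an exception flag. This is a sketch under stated assumptions: the function names and the threshold of 3.0 are mine, and a production monitor would also need to account for seasonality and legitimate policy changes.

```python
import math


def rate_shift_z(base_hits: int, base_n: int, recent_hits: int, recent_n: int) -> float:
    """Two-proportion z-score for a recent outcome rate against a baseline rate."""
    if base_n <= 0 or recent_n <= 0:
        raise ValueError("both windows need at least one observation")
    p_base = base_hits / base_n
    p_recent = recent_hits / recent_n
    pooled = (base_hits + recent_hits) / (base_n + recent_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / recent_n))
    if se == 0:
        return 0.0  # every outcome identical in both windows: no measurable shift
    return (p_recent - p_base) / se


def is_drifting(base_hits: int, base_n: int, recent_hits: int, recent_n: int,
                threshold: float = 3.0) -> bool:
    """Flag a shift only when it is large relative to sampling noise."""
    return abs(rate_shift_z(base_hits, base_n, recent_hits, recent_n)) >= threshold
```

For example, a baseline of 200 exception flags in 10,000 decisions against 30 flags in the most recent 500 would be flagged, while 10 in 500 would not.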

Three signals are worth instrumenting:

Output drift per query class. Establish a baseline of expected output distributions for high-consequence query types (credit decisions, KYC recommendations, exception flags) and alert on systematic shifts. A single off-policy output might be noise; a shift in the distribution over 48 hours is a signal.
Retrieval anomalies. Log which documents are retrieved for which query types and flag documents that appear in retrievals they were not historically associated with. A newly indexed document surfacing at high frequency for a specific query pattern is worth investigating.
Corpus write-path auditing. Log every write to the corpus with source provenance, author identity, and content hash. An alert on an unusual source or an unexpected update to a high-consequence policy document is cheaper to instrument than a post-hoc investigation.
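
The retrieval-anomaly signal can be sketched as a set difference over retrieval logs. The example assumes you already log each retrieval as a `(query_class, doc_id)` pair; that log shape, the `novel_retrievals` name, and the `min_hits` default are hypothetical choices for illustration.

```python
from collections import Counter
from typing import Iterable

Retrieval = tuple[str, str]  # (query_class, doc_id)


def novel_retrievals(
    baseline: Iterable[Retrieval],
    recent: Iterable[Retrieval],
    min_hits: int = 5,
) -> dict[Retrieval, int]:
    """Return pairs absent from the baseline that recur at least min_hits times."""
    known = set(baseline)
    counts = Counter(pair for pair in recent if pair not in known)
    return {pair: n for pair, n in counts.items() if n >= min_hits}
```

The `min_hits` floor is there to keep a single odd retrieval from raising an alert; what deserves a look is a document that suddenly becomes a regular answer for a query class it never served before.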

What to do this week

Inventory your RAG corpus write paths. List every channel that can write content into your grounding data store. Flag any that ingest third-party or untrusted content without provenance control. That list is your highest-priority poisoning exposure.
Treat every model artefact as a privileged dependency. For each open-weight model, vendor checkpoint, or fine-tuned model you run, ask: where did this come from, and can I verify it is unaltered? Make this a precondition for loading, not an afterthought.
Make publisher trust an explicit, recorded decision. Pulling a model from a public hub is choosing to inherit that publisher’s security posture. Record the decision, the verification steps taken, and the review date.
Add behavioural drift monitoring to your highest-consequence AI workflows. Start with credit decisions, KYC recommendations, and any workflow where output bias could cause regulatory or financial harm. Drift is often the first visible signal that something in the corpus or artefact is not what it claims to be.
Do not over-invest in defending training runs you do not perform. Spend the marginal control effort on corpus write-path control and provenance verification, where your leverage is real.
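
Recording a publisher-trust decision does not need tooling beyond a structured record and a review check. The shape below is one possible sketch; every field name is an assumption rather than a standard, and the register could equally live in a spreadsheet or a GRC system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TrustDecision:
    """One recorded decision to consume a model artefact from a publisher."""
    artefact: str
    publisher: str
    pinned_sha256: str
    verification_steps: tuple[str, ...]
    approved_by: str
    review_due: date


def overdue(decisions: list[TrustDecision], today: date) -> list[TrustDecision]:
    """Return decisions whose review date has arrived or passed."""
    return [d for d in decisions if d.review_due <= today]
```

Running `overdue` on a schedule turns the review date from a note in a document into something that actually prompts a re-check.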
