GPT 5.5 test output - how to make LLMs not hallucinate

Nov 4th, 2025
Short answer: LangChain can strongly reduce hallucinations, but it cannot make them go to zero. Forcing the model to ground every claim in retrieved sources and to output citations helps a lot, but you also have to: control what context the model sees, control how it is allowed to answer, verify its claims after generation, and sometimes reject answers entirely.

I'll break down how to do this in LangChain.

---

1. Core idea

Hallucination means: the model produces text that sounds plausible but is not supported by the available data.

LangChain is an orchestration framework. It lets you build multi-step call graphs ("chains") where you can force the model to:

1. Read only retrieved context.

2. Answer only using that context.

3. Show where each statement came from.

4. Abstain if data is missing.

This pattern is usually called RAG (Retrieval-Augmented Generation). RAG means: instead of just prompting the model from scratch, you first retrieve relevant documents from your own data store (vector DB, SQL, etc.), then provide those documents to the model and say "answer using only this".

That solves two problems:

The model doesn't have to guess.

You can audit its claims because you know the source chunks.

But RAG alone is not enough. You have to enforce discipline.
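
For orientation, here is a minimal, unguarded RAG chain; the rest of this answer is about adding the guards around it. This is only a sketch: the model names, the my_texts corpus variable, and k=4 are placeholders, not part of the pattern.

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Index your own documents (my_texts is a placeholder list of strings).
vectorstore = FAISS.from_texts(my_texts, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

# Retrieve, stuff the docs into the prompt, generate.
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)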

---

2. High-leverage controls

2.1 Retrieval quality gate

If retrieval returns garbage, the answer will hallucinate. This is the single highest-leverage point.

In LangChain terms:

Retriever = component that pulls documents relevant to a query from a store like FAISS, Chroma, Milvus, etc. These are vector stores: databases that store document embeddings (numerical representations of text meaning).

You wrap the retriever in logic: if it returns nothing, or only low-score results, you do not ask the LLM to answer. You instead return "No data found".

Pseudocode sketch:

docs = retriever.get_relevant_documents(user_query)

if relevance_score_too_low(docs):
    return "I don't have sources for that."
else:
    answer = llm.invoke(prompt_with(docs, user_query))

LangChain lets you do this with a RunnableLambda before the LLM call. You explicitly check metadata such as the similarity score.
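
A minimal sketch of that gate as an LCEL step. The threshold value and the gate_docs helper are mine, and I assume the retriever writes a relevance score into each document's metadata; many retrievers do not do that by default, in which case you would query the vector store with similarity_search_with_score instead.

from langchain_core.runnables import RunnableLambda

SCORE_THRESHOLD = 0.75  # illustrative; tune per embedding model and domain

def gate_docs(docs):
    # Assumes the retriever stored a relevance score in doc.metadata["score"].
    kept = [d for d in docs if d.metadata.get("score", 0.0) >= SCORE_THRESHOLD]
    if not kept:
        # Stop here instead of letting the LLM improvise an answer.
        raise ValueError("No sufficiently relevant documents retrieved.")
    return kept

gated_retrieve = retriever | RunnableLambda(gate_docs)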

If you don't gate like this, the model will make stuff up to be "helpful".

Conclusion: hallucination is often retrieval failure, not generation failure.

---

2.2 Grounded answering prompt

You use a strict prompt template. A prompt template is a structured string with instructions and slots like {context} and {question}.

Example pattern that actually works in production:

Rule 1. Only answer using the "Context" section.

Rule 2. For every factual claim include a citation like [1], [2], etc.

Rule 3. If the answer is not in context say "Not in sources".

Example template body:

"You are a factual QA system. Use ONLY the Context. Do not use prior knowledge.
If the Context does not contain the answer say 'Not in sources'.
After each sentence add citation tags like [1] [2] that map to the sources below.
Context:
{context}
Question: {question}
Answer:"

In LangChain you pass this as a PromptTemplate to the LLM.
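
A sketch of wiring that template in. The variable name prompt_template matches the chain code later in this answer; only PromptTemplate itself is LangChain API.

from langchain_core.prompts import PromptTemplate

prompt_template = PromptTemplate.from_template(
    "You are a factual QA system. Use ONLY the Context. Do not use prior knowledge.\n"
    "If the Context does not contain the answer say 'Not in sources'.\n"
    "After each sentence add citation tags like [1] [2] that map to the sources below.\n"
    "Context:\n{context}\n"
    "Question: {question}\n"
    "Answer:"
)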

Why this works:

You are removing "be helpful" behavior and replacing it with "be literal".

You are training the model (in-context) to fear answering without proof.

This does not eliminate hallucination, but it drives it down.

---

2.3 Forced citation schema

You mentioned "forcing factual citations of data from a RAG". Yes. This is mandatory.

In LangChain you also add an OutputParser. An output parser is a post-processor that checks whether the LLM output matches a required format.

For example, you can demand structured JSON like:

{
  "answer": "text",
  "citations": [
    {
      "claim": "string",
      "source_ids": ["doc_3", "doc_5"]
    }
  ],
  "can_answer": true
}

Then:

If parsing fails (the LLM didn't give citations), you reject the answer.

If citations[*].source_ids includes an ID that wasn't actually retrieved, you reject the answer.

Optionally you re-prompt with "Your previous answer was invalid. Fix the format."

This is a deterministic filter you control, not model magic.

LangChain supports this with StructuredOutputParser or a custom PydanticOutputParser. You define the schema with Pydantic and force the model to fill it.
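
A sketch of that JSON contract as a Pydantic schema plus parser. The class names and the reject_unknown_sources helper are mine, and the import path for PydanticOutputParser varies slightly between LangChain versions.

from pydantic import BaseModel
from langchain_core.output_parsers import PydanticOutputParser

class Citation(BaseModel):
    claim: str
    source_ids: list[str]

class GroundedAnswer(BaseModel):
    answer: str
    citations: list[Citation]
    can_answer: bool

output_parser = PydanticOutputParser(pydantic_object=GroundedAnswer)

def reject_unknown_sources(parsed: GroundedAnswer, retrieved_ids: set[str]) -> bool:
    # True only if every cited source ID was actually retrieved.
    cited = {sid for c in parsed.citations for sid in c.source_ids}
    return cited.issubset(retrieved_ids)

You also inject output_parser.get_format_instructions() into the prompt so the model knows the schema it has to fill.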

This step is underused. It cuts hallucination because you are not trusting free-form prose.

---

2.4 Post-verification / grading chain

After the model answers, you run a second model call that acts as a verifier. The verifier gets:

The retrieved docs.

The proposed answer with its citations.

The verifier is prompted like: "For each claim in the answer, confirm that the claim is directly and unambiguously supported by at least one cited source span. Mark any unsupported claim."

This pattern is often called grading or "LLM-as-a-judge". In LangChain it is just a second chain.

Behavior:

If the verifier finds unsupported claims, you either:

Strip them.

Replace them with "Not in sources".

Or block the answer.

This is similar to "constitutional checks" in safety research. It treats hallucination as a policy violation and rejects it.

This helps because the first model can still drift, but the second model is optimized only for "is this grounded?". That is cognitively easier than "answer the question" and yields higher accuracy.
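
A sketch of the verifier components, reusing the prompt wording above. The VerifierVerdict schema and all variable names are assumptions, not a built-in LangChain feature; section 3.3 below composes them with the retriever output and the draft answer.

from pydantic import BaseModel
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser

class VerifierVerdict(BaseModel):
    unsupported_claims: list[str]
    grounded: bool

verifier_parser = PydanticOutputParser(pydantic_object=VerifierVerdict)

verifier_prompt = ChatPromptTemplate.from_template(
    "For each claim in the answer, confirm that the claim is directly and "
    "unambiguously supported by at least one cited source span. "
    "Mark any unsupported claim.\n"
    "Sources:\n{retrieved_docs}\n\nAnswer:\n{draft_answer}\n\n"
    "{format_instructions}"
).partial(format_instructions=verifier_parser.get_format_instructions())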

---

2.5 Refusal / abstention

You must allow the system to say "unknown".

Hallucination mostly appears when the model thinks "I must answer anyway".

So at every step:

The retrieval gate can stop the chain.

The prompt explicitly rewards saying "Not in sources".

The output schema has a boolean can_answer.

The verifier can flip can_answer to false.

If can_answer is false, the final output is "No supported answer available from the provided sources." That is boring but safe.

In practice this is the single cleanest way to get "hallucination-free" behavior, at the cost of coverage.

---

3. Architecture in LangChain terms

Below is a clean, high-level LangChain pipeline. I will define each block.

Terms:

A Runnable in LangChain is any step that can be executed in a chain (retriever call, LLM call, lambda).

LCEL is the LangChain Expression Language. It lets you compose runnables with | like Unix pipes.

3.1 Retrieval step

retrieve = retriever | RunnableLambda(filter_low_score_docs)

filter_low_score_docs enforces similarity score thresholds.

A similarity score is a numeric measure of how close two embeddings are. For cosine similarity, 1.0 means identical direction in vector space and 0.0 means orthogonal (unrelated).

3.2 Answer draft step

generate = (
    {
        "context": retrieve,
        "question": RunnablePassthrough()
    }
    | prompt_template
    | llm
    | output_parser  # forces JSON with citations
)

RunnablePassthrough lets the original user question flow forward unchanged so you can use it alongside the retrieved documents.

3.3 Verification step

verify = (
    {
        "retrieved_docs": retrieve,
        "draft_answer": generate
    }
    | verifier_prompt
    | verifier_llm
    | verifier_parser
)

The verifier LLM can be a cheaper model. It just has to check grounding.

3.4 Final selection step

final = RunnableLambda(select_verified_or_refuse)

where select_verified_or_refuse is a pure Python function (sketched below) that:

Drops any claim marked unsupported.

If too much is unsupported, returns "Not in sources".
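
An illustrative implementation, assuming the GroundedAnswer and VerifierVerdict shapes sketched earlier and a simple "more than half pruned" cutoff; both are assumptions you would tune.

def select_verified_or_refuse(state: dict) -> str:
    draft = state["answer"]           # GroundedAnswer from the generate chain
    verdict = state["verification"]   # VerifierVerdict from the verify chain

    if not draft.can_answer or not verdict.grounded:
        return "No supported answer available from the provided sources."

    # Drop any claim the verifier flagged as unsupported.
    kept = [c.claim for c in draft.citations
            if c.claim not in verdict.unsupported_claims]

    # If most of the answer was pruned, refuse instead of returning fragments.
    if not kept or len(kept) < len(draft.citations) / 2:
        return "Not in sources"
    return " ".join(kept)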

3.5 Full chain

full_chain = (
    {
        "answer": generate,
        "verification": verify
    }
    | final
)

You then expose full_chain.invoke(user_query). The raw question string flows into both branches, each of which retrieves and grounds on its own.
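
For example (the question string is purely illustrative):

result = full_chain.invoke("What is the notice period in the 2023 supplier contract?")
print(result)  # either a cited, verified answer or the refusal string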

Result:

Every claim is traceable.

Unsupported claims are pruned.

No "naked" statements without source IDs.

The model is not allowed to invent a source.

This is about as close as you get to "hallucination-free" with current LLMs.

---

4. Important constraints

Constraint 1. You cannot actually prove truth. You only prove consistency with the provided sources.

You are reducing hallucinations relative to your knowledge base. If the knowledge base is wrong or incomplete you will produce confident wrong answers that are still "grounded". The system has no external fact check against reality. It only checks against the retrieved context.

In other words you get "citation compliance", not "truth".

To improve this you can:

Add multiple retrievers: internal data + public web search + policy docs. Then merge and deduplicate results before generation.

Penalize answers that rely on a single source when multiple sources disagree.

This is ensemble retrieval.
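
A sketch using LangChain's EnsembleRetriever to merge several retrievers before generation; the three retriever variables and the weights are placeholders.

from langchain.retrievers import EnsembleRetriever

ensemble = EnsembleRetriever(
    retrievers=[internal_retriever, web_retriever, policy_retriever],
    weights=[0.5, 0.3, 0.2],  # relative influence on the merged ranking
)
docs = ensemble.get_relevant_documents(user_query)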

Constraint 2. Chunking and the context window matter.

If you chunk documents badly you get retrieval misses.

Chunking means: how you split source documents into pieces before embedding. Example: 512 tokens per chunk with an overlap of 50 tokens.

If chunks are too small you lose context and the model may overfit to fragments. If chunks are too big you blow the prompt window and the model can't see all relevant evidence at once. That reintroduces hallucination because it tries to fill gaps between partial chunks.

Optimal chunking is empirical per domain. Regulatory text prefers larger chunks because definitions cross-reference heavily. FAQs prefer smaller chunks.

LangChain exposes text splitters like RecursiveCharacterTextSplitter. You tune this as part of quality control.
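
Example splitter configuration. RecursiveCharacterTextSplitter counts characters by default, so the numbers below are my rough character-level analogue of the 512-token example above; depending on your LangChain version the import may be langchain.text_splitter instead.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,    # roughly 500 tokens of typical English text (assumption)
    chunk_overlap=200,  # overlap so definitions are not cut mid-chunk
)
chunks = splitter.split_documents(raw_documents)  # raw_documents: your loaded Documents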

Constraint 3. The model will try to "smooth" language between sources.

Even if every sentence has [1] [2], the model will often generate connecting glue text. Sometimes that glue text encodes inferences not actually in the docs.

Your verifier chain must explicitly catch "synthesis that goes beyond sources". The verifier prompt should say:

"Mark a claim unsupported if it is inferred from multiple sources but never explicitly stated."

This prevents creative synthesis that lawyers/regulators will call hallucination.

---

5. Direct answers to the implied questions

Q: Can LangChain make an LLM hallucination-free just by forcing factual citations of data from RAG?

A: No. Mandatory citations reduce hallucination. They do not eliminate it. The model can still:

Cite a source number that does not really support the claim.

Fabricate relationships across sources.

Answer when retrieval failed.

To get close to zero hallucination in practice you must add:

1. Retrieval gating.

2. A strict grounded-answer prompt.

3. Structured output with per-claim citations and can_answer.

4. Automatic verification / grading against the retrieved sources.

5. A hard refusal path.

When you add all five you approach "the LLM cannot assert anything that is not in your data". That is the real target for most regulated use cases (compliance, finance, medical triage, internal policy, etc.).

---

6. Summary

LangChain is useful because it lets you build multi-step guarded pipelines, not because it magically fixes hallucination.

Forcing citations is necessary but not sufficient.

The winning pattern is: retrieve → generate with rules → parse → verify → refuse or answer.

You must be willing to return "Not in sources" often. If the business side refuses that outcome, you will get hallucinations. That is not a model problem. It is a product owner problem.