Stop Copying and Pasting: A No-Nonsense Guide to Training a Real Estate Chatbot
Every real estate agency has an FAQ document. It’s usually a poorly formatted PDF, a neglected webpage, or a Word document that gets emailed to new hires. The default move is to feed this mess directly into a chatbot builder and expect magic. The result is almost always a glorified search bar that parrots back entire paragraphs, completely missing the user’s actual question. This isn’t intelligent automation. It’s a liability.
The core problem is a fundamental misunderstanding of the task. You are not building a document retriever. You are building a system that maps user intent to a specific piece of information. The raw FAQ text is just the source material, not the final dataset. We have to gut it, restructure it, and convert it into a format a machine can actually reason with. Forget the marketing hype. Let’s build something that works.
Prerequisite: Your Data Is Unusable. Fix It First.
Before writing a single line of code, you must accept that your existing FAQ is structured for humans, not machines. It’s full of conversational filler, compound questions, and implicit context. A language model needs clean, atomic pairs of questions and answers. Our first job is to perform a data extraction and sanitation process that turns narrative text into a structured dataset, preferably JSONL or CSV.
This means breaking down entries like “What are closing costs and how are they calculated for FHA loans?” into two distinct entries, whether manually or with a script: one defining closing costs, and another covering the FHA calculation. You are creating a one-to-one mapping between a potential query and its direct answer. This is tedious, non-glamorous work, and it’s the single most important factor for success.
Your target format should look something like this for each entry:
{
  "intent_id": "closing_costs_definition",
  "question_variations": [
    "What are closing costs?",
    "Can you explain closing costs?",
    "Define closing costs for me.",
    "Tell me about the fees at closing."
  ],
  "answer": "Closing costs are fees paid at the closing of a real estate transaction. These costs are incurred by either the buyer or the seller and typically include loan origination fees, appraisal fees, title insurance, and property taxes."
}
Notice the `question_variations` array. This is critical. You are front-loading the model with different ways a user might phrase the same core intent. You are doing the model’s homework for it, which drastically reduces the chances of a mismatched or empty response. Without this step, you’re just gambling on keyword proximity.
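To make the split concrete, here is a minimal sketch of writing two atomic entries out as JSONL, the line-per-object format the ingestion step will read back. The second entry’s `intent_id` and answer text are illustrative placeholders, not real FAQ content:

```python
import json

# Two atomic entries split out of one compound FAQ item.
entries = [
    {
        "intent_id": "closing_costs_definition",
        "question_variations": [
            "What are closing costs?",
            "Can you explain closing costs?",
        ],
        "answer": "Closing costs are fees paid at the closing of a real estate transaction.",
    },
    {
        # Placeholder entry for the second half of the compound question.
        "intent_id": "closing_costs_fha_calculation",
        "question_variations": [
            "How are closing costs calculated for FHA loans?",
        ],
        "answer": "For FHA loans, closing costs are calculated from the loan amount plus lender and government fees.",
    },
]

# JSONL: exactly one JSON object per line.
with open("faq_sanitized.jsonl", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```

Each line is an independent JSON object, which is what makes JSONL trivial to stream through the ingestion script later.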

Step 1: Ingesting and Vectorizing the Sanitized Data
With a clean JSONL file, the real work begins. We need to convert this text data into numerical representations called embeddings. An embedding is a vector, a list of numbers, that captures the semantic meaning of a piece of text. The goal is to represent text in a multi-dimensional space where similar concepts are located close to each other. This is the foundation of modern semantic search.
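The “closeness” that semantic search relies on is typically cosine similarity, which can be computed in a few lines. The three-dimensional toy vectors below are stand-ins for the hundreds of dimensions a real embedding model produces; the numbers are invented purely to illustrate the geometry:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: imagine these came from an embedding model.
closing_costs = [0.9, 0.1, 0.2]  # "What are closing costs?"
closing_fees  = [0.8, 0.2, 0.1]  # "Tell me about the fees at closing."
school_zones  = [0.1, 0.9, 0.3]  # "Which school district is this home in?"

print(cosine_similarity(closing_costs, closing_fees))  # high: same concept
print(cosine_similarity(closing_costs, school_zones))  # low: unrelated
```

Two phrasings of the same intent land near each other in the vector space even though they share almost no keywords, which is exactly what keyword matching cannot do.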
You have two primary routes for this: a hosted API like OpenAI’s `text-embedding-ada-002` or a self-hosted open-source model from a library like Sentence-Transformers. The OpenAI route is fast to implement but becomes a recurring operational expense. Self-hosting models like `all-MiniLM-L6-v2` gives you full control and zero per-call cost, but demands more setup and compute resources. For this application, a small Sentence-Transformers model is more than sufficient and avoids API latency.
We’ll process our structured JSONL file. For each entry, we will embed the `answer` text. Some schools of thought say to embed the questions or an average of the questions and answers. Start by embedding just the answers. It’s a cleaner signal and forces the system to find the document that contains the information, not just the document that parrots the user’s question back.
The process looks like this: load the data, initialize the embedding model, iterate through each FAQ item, and generate a vector for its `answer` field. You then store this vector alongside its original text and metadata. This collection of vectors is what will populate our vector database. Feeding raw, unstructured text directly to a model for this task is like shoving a firehose through a needle. You need this structured, vectorized intermediate layer.
Step 2: Building and Populating a Vector Database
A vector database is a specialized system designed for one thing: extremely fast similarity searches on high-dimensional vectors. Given a query vector, it can find the “nearest neighbors” from millions or billions of stored vectors in milliseconds. Trying to do this with a traditional database and cosine similarity calculations is sluggish and will not scale past a proof-of-concept.
Your options range from managed services to local libraries. Pinecone is a popular managed service, but it’s a wallet-drainer for anything beyond a hobby project. For production, you might look at a self-hosted solution like Weaviate or Milvus. For development and smaller-scale applications, a library like ChromaDB or Facebook’s FAISS is perfect. We’ll use ChromaDB here because it runs locally with minimal setup.
The logic is simple. You create a “collection” in the database. Then you loop through your sanitized FAQ data. For each item, you pass the text of the `answer` to your embedding model to get the vector. You then inject this vector, along with the original text and its unique ID (`intent_id`), into the ChromaDB collection. This is a one-time setup process, though you’ll need to re-run it whenever the FAQ content changes.
Here’s a barebones Python script to illustrate the concept using `sentence-transformers` and `chromadb`.
import json

import chromadb
from sentence_transformers import SentenceTransformer

# 1. Initialize model and DB client
model = SentenceTransformer('all-MiniLM-L6-v2')
client = chromadb.Client()
# get_or_create_collection makes the script safe to re-run when the FAQ changes.
collection = client.get_or_create_collection("real_estate_faq")

# 2. Load sanitized data
faq_data = []
with open('faq_sanitized.jsonl', 'r') as f:
    for line in f:
        faq_data.append(json.loads(line))

# 3. Iterate, embed, and inject
for item in faq_data:
    # We embed the answer, as it contains the context.
    answer_embedding = model.encode(item['answer']).tolist()
    collection.add(
        embeddings=[answer_embedding],
        documents=[item['answer']],
        metadatas=[{"intent_id": item['intent_id']}],
        ids=[item['intent_id']],
    )

print(f"Injection complete. Collection contains {collection.count()} items.")
This script rips through your clean data file, converts each answer into a numerical vector, and loads it into a local database named `real_estate_faq`. Now you have a queryable knowledge base.
Step 3: Engineering the Retrieval and Generation Flow
The chatbot’s brain is not just the Large Language Model (LLM). It’s the entire process, known as Retrieval-Augmented Generation (RAG). This process prevents the LLM from making things up (hallucinating) by forcing it to answer based only on the information you provide it.
The flow is executed on every user query:
- User Input: The user asks a question, e.g., “how much are closing fees?”
- Query Embedding: The same embedding model you used for your documents now converts the user’s question into a query vector.
- Vector Search: This query vector is used to search your vector database. The database returns the top ‘k’ most similar documents (we’ll start with k=3). These are your context documents.
- Prompt Assembly: You construct a new prompt for a generative LLM (like GPT-3.5 or an open-source alternative). This prompt is carefully engineered. It contains the original user question and the context documents you just retrieved.
- LLM Generation: The LLM receives the prompt and generates a response. Because the prompt instructs it to answer *based on the provided context*, it synthesizes an answer from your verified FAQ data instead of its own general knowledge.

The quality of your prompt is everything. A weak prompt will be ignored by the model. A strong prompt forces it to comply.
Weak Prompt: “Here is some context: [CONTEXT]. Answer this question: [QUESTION]”
Strong Prompt:
“You are an AI assistant for a real estate agency. Answer the user’s question based *only* on the provided context documents below. If the context does not contain the answer, state that you do not have enough information. Do not use any external knowledge. Be concise.
—CONTEXT—
[Retrieved Document 1]
[Retrieved Document 2]
—QUESTION—
[User’s Original Question]”
This structured, forceful prompt dramatically reduces hallucinations and keeps the bot on topic. It’s the guardrail that makes the system reliable.
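Wired together, steps 1 through 5 and the strong prompt fit in one short function. This is a sketch under assumptions: `embed_fn`, `search_fn`, and `generate_fn` are hypothetical callables wrapping your embedding model, vector database, and LLM, and the stubs below exist only so the flow can be exercised end to end without any of those attached:

```python
def answer_query(question, embed_fn, search_fn, generate_fn, k=3):
    query_vector = embed_fn(question)          # step 2: embed the user's question
    context_docs = search_fn(query_vector, k)  # step 3: top-k similar answer docs
    prompt = (                                 # step 4: assemble the strong prompt
        "You are an AI assistant for a real estate agency. Answer the user's "
        "question based only on the provided context documents below. If the "
        "context does not contain the answer, state that you do not have enough "
        "information. Do not use any external knowledge. Be concise.\n"
        "---CONTEXT---\n" + "\n".join(context_docs) + "\n"
        "---QUESTION---\n" + question
    )
    return generate_fn(prompt)                 # step 5: generate a grounded answer

# Stubs standing in for the real model, vector database, and LLM.
stub_embed = lambda q: [0.0] * 384
stub_search = lambda vec, k: [
    "Closing costs are fees paid at the closing of a real estate transaction."
]
stub_generate = lambda prompt: prompt.split("---CONTEXT---")[1].split("---QUESTION---")[0].strip()

print(answer_query("how much are closing fees?", stub_embed, stub_search, stub_generate))
```

Keeping the three components injectable like this also makes the pipeline testable: you can verify the prompt assembly and retrieval plumbing without paying for a single LLM call.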
Step 4: Closing the Loop with Validation and Monitoring
Deploying the bot is not the end. It’s the beginning of a continuous feedback loop. Your initial FAQ will have gaps, and users will ask questions you never anticipated. You need a mechanism to identify these failures and use them to improve the system.
The easiest method is to log every query and the similarity score of the top retrieved document. When a user asks a question and the highest score is below a certain threshold (e.g., 0.6 cosine similarity), it’s a strong signal that your knowledge base does not contain a relevant answer. These low-confidence queries should be flagged for human review.
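A minimal sketch of that flagging logic might look like the following; the function name and queue structure are assumptions, and note that some vector databases report distances rather than similarities, so check which convention your client returns before picking a threshold:

```python
LOW_CONFIDENCE_THRESHOLD = 0.6  # below this, assume the knowledge base has no good answer

def flag_low_confidence(query, top_score, review_queue, threshold=LOW_CONFIDENCE_THRESHOLD):
    # Route weak matches to a human review queue; strong matches pass through.
    if top_score < threshold:
        review_queue.append({"query": query, "score": top_score})
        return True
    return False

review_queue = []
flag_low_confidence("do you allow emotional support animals?", 0.41, review_queue)
flag_low_confidence("what are closing costs?", 0.88, review_queue)
print(review_queue)  # only the low-scoring query is queued
```

In production the queue would be a database table or a logging sink, but the decision rule stays this simple.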
A reviewer can then take one of two actions:
- Content Gap: The question is valid, but no FAQ entry exists to answer it. This informs the business that they need to create a new FAQ entry.
- Phrasing Mismatch: An answer exists, but the user’s phrasing was too different for the embedding model to make a connection. The user’s question should be added to the `question_variations` array for the relevant intent in your source data.
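The phrasing-mismatch fix can be scripted against the source data directly; `patch_knowledge_base` is a hypothetical helper name, and the sample entry is a trimmed version of the format shown earlier:

```python
def patch_knowledge_base(entries, intent_id, new_question):
    """Add a user's phrasing to an existing intent's question_variations."""
    for entry in entries:
        if entry["intent_id"] == intent_id:
            # Phrasing mismatch: the answer exists, the wording just didn't match.
            if new_question not in entry["question_variations"]:
                entry["question_variations"].append(new_question)
            return entry
    # Content gap: no entry covers this question, so a human must write one.
    raise KeyError(f"No entry for intent '{intent_id}'; a new FAQ entry is needed.")

entries = [{
    "intent_id": "closing_costs_definition",
    "question_variations": ["What are closing costs?"],
    "answer": "Closing costs are fees paid at the closing of a real estate transaction.",
}]

patch_knowledge_base(entries, "closing_costs_definition", "what do I owe at signing?")
print(entries[0]["question_variations"])
```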

After updating your source JSONL file, you simply re-run the ingestion script to update the vector database. This iterative process of monitoring, identifying failures, and patching the knowledge base is what separates a demo-quality toy from a production-ready tool. Without it, your chatbot’s accuracy will degrade over time as business policies and user questions evolve.