How RAG and Agentic AI get built in the real world.
From a ten-person business querying its policy documents to a government ministry running a sovereign AI system on air-gapped infrastructure — the architecture, the tools, and the decisions that separate a demo from a production deployment.
Two technologies. One architecture.
RAG: Retrieval-Augmented Generation
Language models are trained on general data. They do not know your internal policies, your contracts from last year, or your ministry's 2024 circulars. Ask them — they hallucinate or refuse.
RAG fixes this by splitting the job into two steps. Before the model answers, it searches your actual documents using vector search — finding meaning, not just matching keywords. The model then reasons over the retrieved content and answers with a citation. Every claim links back to its exact source.
The result: an AI that knows your documents and can prove where every answer came from. That is what makes it deployable in regulated environments.
Agentic AI: Reasoning and Action
RAG answers questions. An agent decides what to do about the answer — and then does it.
An agent can plan a sequence of steps, use RAG as one of many tools, call external APIs, query databases, draft documents, route approvals, and log its actions — all in response to a single instruction. It knows the boundary of its authority. Everything beyond that boundary goes to a human with full context already assembled.
This is the difference between an AI that is a very good research assistant and an AI that completes a meaningful workflow and hands a human exactly what they need to act.
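The authority boundary described above can be sketched as a simple policy check. This is a minimal illustration, not a prescribed design: the `Action` dataclass, the `AUTHORITY` table, the action names, and the AED 5,000 limit are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    name: str
    amount_aed: float = 0.0
    context: dict = field(default_factory=dict)


# Hypothetical policy: which actions the agent may take alone, and up to
# what value. Anything not listed here is escalated.
AUTHORITY = {"draft_document": float("inf"), "approve_purchase": 5_000.0}


def decide(action):
    """Return ("execute", action) inside the boundary, otherwise
    ("escalate", packet) with the context a human reviewer needs."""
    limit = AUTHORITY.get(action.name)
    if limit is not None and action.amount_aed <= limit:
        return "execute", action
    packet = {
        "requested_action": action.name,
        "amount_aed": action.amount_aed,
        "reason": "outside agent authority",
        "context": action.context,
    }
    return "escalate", packet
```

Calling `decide(Action("approve_purchase", 12_000.0, {"vendor": "Example LLC"}))` returns an escalation packet that carries the vendor context along, so the reviewer does not have to reassemble it.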
Seven steps from document to cited answer.
Ingest
Documents pulled into the pipeline. PDFs, Word, Excel, scanned files. OCR applied to scanned content.
Chunk
Each document split into meaningful passages. Metadata attached: source, page, section, date, version.
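A minimal sketch of the chunking step, assuming a paragraph-based splitter with a character budget. The `Chunk` dataclass, the `chunk_document` function, and the 500-character limit are illustrative choices; production pipelines often split on headings or semantic boundaries instead.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str   # file name or document ID
    page: int
    section: str
    date: str     # ISO date of the document version
    version: str


def chunk_document(pages, source, section, date, version, max_chars=500):
    """Split each page into paragraph-aligned chunks of at most max_chars.

    `pages` is a list of page texts; paragraphs are separated by blank
    lines. A single paragraph longer than max_chars is kept whole here.
    """
    chunks = []
    for page_no, page_text in enumerate(pages, start=1):
        buffer = ""
        for para in (p.strip() for p in page_text.split("\n\n")):
            if not para:
                continue
            candidate = f"{buffer}\n\n{para}" if buffer else para
            if buffer and len(candidate) > max_chars:
                # Adding this paragraph would exceed the budget: emit the
                # buffer as a chunk and start a new one.
                chunks.append(Chunk(buffer, source, page_no, section, date, version))
                buffer = para
            else:
                buffer = candidate
        if buffer:
            chunks.append(Chunk(buffer, source, page_no, section, date, version))
    return chunks
```

Keeping the metadata on every chunk is what later makes page-level citations and version filtering possible.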
Embed
Each chunk converted to a vector — a mathematical representation of meaning. Stored in a vector database.
Retrieve
Query converted to a vector. The database returns the most semantically similar chunks across all documents.
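The embed and retrieve steps can be sketched with an in-memory index and cosine similarity. Note the loud assumption: `toy_embed` is a hashed bag-of-words stand-in that captures word overlap, not meaning. A real deployment would swap in an embedding model and a vector database; `InMemoryIndex` is a hypothetical name used only for this illustration.

```python
import math
import re
import zlib


def toy_embed(text, dim=256):
    """Hashed bag-of-words vector, L2-normalised. A stand-in for a real
    embedding model: it captures word overlap, not meaning."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    # Vectors are already unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


class InMemoryIndex:
    def __init__(self, embed=toy_embed):
        self.embed = embed
        self.items = []  # list of (vector, chunk_text, metadata)

    def add(self, text, metadata):
        self.items.append((self.embed(text), text, metadata))

    def search(self, query, top_k=3):
        """Return the top_k (score, text, metadata) tuples, best first."""
        q = self.embed(query)
        scored = [(cosine(q, vec), text, meta) for vec, text, meta in self.items]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]
```

The shape of the flow is the point here: the same `embed` function is applied to chunks at ingestion and to the query at search time, and ranking is by similarity between the two.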
Generate
LLM reasons over retrieved chunks and produces a cited answer. The prompt restricts it to the provided context, which reduces hallucination but does not eliminate it.
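One way to sketch the generation step is a prompt builder that numbers the retrieved sources, plus a check that the answer cites only sources that were actually provided. The function names and the prompt wording are assumptions made for illustration; the call to the LLM itself is left out.

```python
import re


def build_grounded_prompt(question, chunks):
    """Assemble a prompt that asks the model to answer only from the
    numbered sources and to cite them as [1], [2], ...

    `chunks` is a list of dicts with "text", "source", and "page" keys.
    """
    sources = "\n\n".join(
        f"[{i}] ({c['source']}, p. {c['page']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the sources below. "
        "Cite each claim with its source number, e.g. [1]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )


def invalid_citations(answer, num_sources):
    """Return citation numbers in the answer that match no provided source."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)
```

A non-empty result from `invalid_citations` is a cheap signal that the answer should be rejected or regenerated before a user sees it.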
Audit trail
Every query, every retrieved chunk, every response logged. Fully reconstructable for regulators and auditors.
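A minimal sketch of an audit entry, using only the Python standard library. The field names are illustrative. The hash chain linking each record to the previous one is an added assumption, not something the pipeline above requires; it is shown as one common way to make later edits to the log detectable.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(body):
    serialized = json.dumps(body, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def audit_record(user_id, query, retrieved, response, prev_hash=""):
    """Build one append-only audit entry.

    `retrieved` is a list of dicts identifying each chunk (source, page).
    Each entry stores the hash of the previous entry, so a later edit to
    any record breaks the chain.
    """
    body = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "query": query,
        "retrieved": retrieved,
        "response": response,
        "prev_hash": prev_hash,
    }
    body["hash"] = _digest(body)  # computed before the "hash" key exists
    return body


def verify_chain(records):
    """Recompute every hash and check each link to the previous record."""
    prev = ""
    for rec in records:
        body = {k: v for k, v in rec.items() if k != "hash"}
        if rec["prev_hash"] != prev or rec["hash"] != _digest(body):
            return False
        prev = rec["hash"]
    return True
```

Writing each record as one JSON line to append-only storage keeps the trail easy to replay for an auditor.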
Access control
Document-level permissions enforced at retrieval. Users only get answers from documents their role permits.
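Permission enforcement at retrieval can be sketched as a role filter over candidate chunks. The `allowed_roles` field and the function name are hypothetical. In a real vector database the same rule would more likely be expressed as a metadata filter on the query itself, so that filtered-out chunks never use up top-K slots.

```python
def permitted_chunks(chunks, user_roles):
    """Keep only chunks whose document allows at least one of the user's roles.

    Each chunk dict carries an "allowed_roles" collection copied from its
    source document at ingestion time. Chunks with no roles listed are
    denied by default.
    """
    roles = set(user_roles)
    return [c for c in chunks if roles & set(c.get("allowed_roles", ()))]
```

Denying by default matters: a document ingested without permissions attached should be invisible rather than visible to everyone.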
Agentic layer
When RAG is a tool inside an agent: retrieval becomes one step in a multi-action workflow with planning and decision logic.
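To illustrate retrieval as one step among several, here is a small workflow runner with a fixed plan. Everything named here is hypothetical: the tools are stubs, and in a real agent the plan would be produced by the LLM rather than hard-coded.

```python
def run_workflow(plan, tools):
    """Execute a plan: a list of (tool_name, argument_builder) steps.

    Each argument_builder receives the results so far and returns the
    keyword arguments for the next tool call. Every call is logged.
    """
    results, log = {}, []
    for tool_name, build_args in plan:
        kwargs = build_args(results)
        results[tool_name] = tools[tool_name](**kwargs)
        log.append({"tool": tool_name, "args": kwargs})
    return results, log


# Stub tools for illustration; a real deployment would call the RAG
# pipeline and an LLM here.
def retrieve_policy(query):
    return [{"text": "Annual leave is 30 days.", "source": "hr.pdf", "page": 4}]


def draft_reply(question, evidence):
    cited = "; ".join(
        f"{e['text']} ({e['source']}, p. {e['page']})" for e in evidence
    )
    return f"Re: {question}\n{cited}"


tools = {"retrieve_policy": retrieve_policy, "draft_reply": draft_reply}
plan = [
    ("retrieve_policy", lambda r: {"query": "annual leave entitlement"}),
    ("draft_reply", lambda r: {"question": "How much leave do I get?",
                               "evidence": r["retrieve_policy"]}),
]
results, log = run_workflow(plan, tools)
```

The second step consumes the output of the first, which is the essential difference from a single question-and-answer call.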
The full stack — by deployment tier.
Small Business — Cloud-first, cost-managed, fast to deploy.
Limited IT overhead, tolerance for cloud data processing, priority on time-to-value. Entirely managed services — no infrastructure to operate. Pay-as-you-go pricing means costs scale with usage, not with headcount.
Prerequisites
- A body of documents that contain the answers people are searching for — HR policies, product specs, procedures, FAQs. Quality matters more than quantity.
- A defined scope: "questions about our procurement process" not "everything." RAG performs best when scoped to a category of questions.
- A content audit: outdated, contradictory, or poorly written documents produce poor retrieval. Remove or update before ingesting.
- A simple feedback mechanism decided in advance — thumbs up/down on responses drives continuous improvement.
Medium Enterprise — Managed cloud with control and compliance.
Internal IT team, mix of cloud and on-premise, role-based access requirements, needs reliability SLAs. Higher retrieval quality through semantic chunking and reranking. Data processed within compliance boundaries via Azure or AWS.
Why reranking is non-optional at medium enterprise scale
- Vector search retrieves by approximate similarity — it is fast and good but not perfect. It returns the top-K chunks that are semantically close to the query, but not necessarily the most relevant.
- A reranker is a second, more precise model that reads the query and each retrieved chunk together and scores them for true relevance. It runs after retrieval, not instead of it.
- The improvement is measurable: retrieval accuracy typically improves 15–30% when a reranker is added to a naive vector retrieval pipeline.
- At medium enterprise scale, the document volumes and query complexity justify the additional latency (typically 200–500ms) and cost.
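The two-stage pattern described in this list can be sketched as follows. The scorer is a deliberate simplification: `overlap_score` counts shared terms and merely stands in for a cross-encoder model, which would read the query and passage together. The function names and `top_n` default are assumptions.

```python
import re


def overlap_score(query, passage):
    """Toy relevance scorer: fraction of query terms found in the passage.
    Stands in for a cross-encoder, which would read both texts jointly."""
    q_terms = set(re.findall(r"\w+", query.lower()))
    p_terms = set(re.findall(r"\w+", passage.lower()))
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def rerank(query, candidates, score=overlap_score, top_n=3):
    """Re-order first-stage candidates by a more precise pairwise score.

    `candidates` is the top-K list from vector search (K larger than
    top_n); only the best top_n passages go on to generation.
    """
    ranked = sorted(candidates, key=lambda p: score(query, p), reverse=True)
    return ranked[:top_n]
```

Because the scorer is passed in as a parameter, a hosted reranking API or a self-hosted cross-encoder can replace `overlap_score` without changing the surrounding pipeline.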
Corporate — Private cloud, full governance, in-region data residency.
Dedicated infrastructure and security teams, strict data governance, multi-department deployment. All components are deployed in-region for UAE data residency: UAE North (Azure) or me-central-1 (AWS Middle East, UAE). Note that me-south-1 is the AWS Bahrain region and does not keep data inside the UAE. Custom chunking strategy per document type. Long-running agent workflows with persistent state.
HyDE (Hypothetical Document Embeddings)
- Standard RAG embeds the user's query and searches for similar document chunks. This works well when the query is worded much like the documents it should match.
- HyDE inverts this: the LLM first generates a hypothetical answer to the query — what the ideal answer might look like. That hypothetical is then embedded and used for retrieval.
- Why this works: a well-formed answer sounds more like the source document than a short question does. Retrieval quality improves significantly for complex queries where the question phrasing is far from the document phrasing.
- The risk: if the LLM's hypothetical is hallucinated in a misleading direction, retrieval goes wrong. HyDE works best combined with a reranker that catches poor retrievals before they reach the generation step.
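The HyDE flow described in this list can be sketched as a single function with the model calls injected. This is a sketch under stated assumptions: `generate`, `embed`, and `search` are caller-supplied callables with hypothetical signatures, and the prompt wording is illustrative.

```python
def hyde_retrieve(query, generate, embed, search, top_k=5):
    """HyDE retrieval: embed a generated hypothetical answer, not the query.

    generate(prompt) -> str        hypothetical answer from an LLM
    embed(text) -> list[float]     embedding model
    search(vector, top_k) -> list  vector database lookup
    All three are supplied by the caller; nothing here is tied to a vendor.
    """
    prompt = f"Write a short passage that answers the question:\n{query}"
    hypothetical = generate(prompt)
    vector = embed(hypothetical)
    # Return the hypothetical too, so it can be logged and inspected when
    # retrieval goes wrong.
    return hypothetical, search(vector, top_k)
```

The hypothetical answer is used only as a search key. It is never shown to the user, and the final answer is still generated from the retrieved chunks.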
Government — Air-gapped, sovereign, Arabic-first.
Data sovereignty absolute. No document leaves the ministry's network at any stage. Arabic language is a first-class requirement, not an afterthought. Full compliance with NESA, UAE ISR, and NCA ECC frameworks. Every component selected, configured, and audited against information security requirements.
Minimum viable hardware for a government deployment
- 2x NVIDIA A100 80GB GPUs — LLM inference (Jais-30B or Llama 3.1 70B)
- 1x NVIDIA A10G or RTX 4090 — embedding model (BGE-M3)
- 64–128GB RAM — application and orchestration server
- High-speed NVMe storage — vector database (minimum 2TB, RAID for redundancy)
- Air-gapped network segment — physically isolated from public internet
- Backup power and cooling infrastructure appropriate for GPU workloads
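A back-of-envelope check shows why this hardware is a plausible minimum. The figures below are assumptions for the estimate, not measurements: 16-bit weights (2 bytes per parameter), nominal parameter counts of 30 and 70 billion, 1024-dimensional float32 vectors (the dense output size of BGE-M3), and a hypothetical corpus of 10 million chunks. KV cache, activations, and index overhead are not modelled.

```python
def weights_gb(params_billion, bytes_per_param=2):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes).
    bytes_per_param=2 corresponds to FP16/BF16."""
    return params_billion * 1e9 * bytes_per_param / 1e9


def vector_store_gb(num_chunks, dim=1024, bytes_per_value=4):
    """Raw float32 vector storage, before index overhead and metadata."""
    return num_chunks * dim * bytes_per_value / 1e9


gpu_total_gb = 2 * 80  # two A100 80GB cards

for name, size in [("Jais-30B", 30), ("Llama 3.1 70B", 70)]:
    need = weights_gb(size)
    print(f"{name}: {need:.0f} GB weights, {gpu_total_gb - need:.0f} GB left "
          "for KV cache and activations")

print(f"10M chunks: {vector_store_gb(10_000_000):.1f} GB of raw vectors")
```

Under these assumptions a 70B model takes about 140 GB of the 160 GB available, which leaves little headroom for long contexts or many concurrent users; quantised weights or a third GPU would widen that margin.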
Why Jais-30B for UAE government
- Developed in the UAE by Inception (a G42 company) with Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), the UAE's dedicated AI university, and Cerebras Systems
- Arabic-first architecture: pretrained from scratch on Arabic and English text, not an English model fine-tuned for Arabic after the fact
- Reported by its developers to outperform open multilingual models of comparable size on Arabic benchmarks
- Runs on-premise with no dependency on international model providers
The same architecture. Different execution.
| Layer | Small Business | Medium Enterprise | Corporate | Government |
|---|---|---|---|---|
| Ingestion | LlamaParse cloud | Azure Doc Intelligence | Self-hosted Unstructured | On-prem + Arabic OCR |
| Chunking | Character-based | Semantic chunking | Custom per doc type | Custom + Arabic RTL logic |
| Embedding | OpenAI API | Cohere API | Self-hosted BGE-M3 | Jais / BGE-M3 on-prem |
| Vector DB | Pinecone managed | Qdrant / Weaviate cloud | Qdrant on Kubernetes | Qdrant air-gapped |
| LLM | GPT-4o mini / Groq | Azure OpenAI / Bedrock | Private endpoint / self-hosted | Jais-30B / Llama 3.1 70B on-prem |
| Orchestration | LangChain | LangGraph | LangGraph + persistent state | LangGraph, no telemetry |
| Reranking | None | Cohere Rerank | Cross-encoder + HyDE | BGE-reranker self-hosted |
| Guardrails | None | Basic | Guardrails AI / NeMo | Custom rule-based |
| Observability | LangSmith | Langfuse self-hosted | Langfuse + Grafana | On-prem SIEM + Grafana |
| Auth | Auth0 / Clerk | Azure AD / Okta | SAML + ABAC | UAE PASS + ministry AD |
| Data location | Cloud (shared) | Cloud (compliance boundary) | Private cloud, in-region | Air-gapped gov data centre |
| Monthly cost | AED 300 – 1,500 | AED 5,000 – 25,000 | AED 50,000 – 200,000+ | AED 40,000 – 120,000 infra |
Know your tier. Ready to build?
Every engagement starts with a discovery call. We scope the right architecture for your data, your infrastructure, and your compliance requirements — and tell you honestly what you need before we propose anything.