Technical Reference — Agentic AI and RAG

How RAG and Agentic AI get built in the real world.

From a ten-person business querying its policy documents to a government ministry running a sovereign AI system on air-gapped infrastructure — the architecture, the tools, and the decisions that separate a demo from a production deployment.

4 Deployment tiers
10 Stack layers per tier
12+ Years building production systems

Two technologies. One architecture.

RAG RETRIEVAL-AUGMENTED GENERATION

Language models are trained on general data. They do not know your internal policies, your contracts from last year, or your ministry's 2024 circulars. Ask them — they hallucinate or refuse.

RAG fixes this by splitting the job into two steps. Before the model answers, it searches your actual documents using vector search — finding meaning, not just matching keywords. The model then reasons over the retrieved content and answers with a citation. Every claim links back to its exact source.

The result: an AI that knows your documents and can prove where every answer came from. That is what makes it deployable in regulated environments.
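The retrieve-then-generate loop can be sketched end to end in a few lines. This is a toy under loud assumptions: a bag-of-words counter stands in for a real embedding model, the "generation" step simply returns the top chunk with its citation, and the document ids and texts are invented for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model (e.g. text-embedding-3-small):
    bag-of-words term counts instead of a learned dense vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Step 1: index the document chunks (one vector per chunk).
chunks = [
    {"id": "hr-policy:p4", "text": "annual leave is 30 days per calendar year"},
    {"id": "it-policy:p2", "text": "passwords must be rotated every 90 days"},
]
index = [(c, embed(c["text"])) for c in chunks]

def retrieve(query: str, k: int = 1):
    qv = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(qv, pair[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

# Step 2: answer only from retrieved context, citing the source chunk.
def answer(query: str) -> str:
    top = retrieve(query)[0]
    return f"{top['text']} [source: {top['id']}]"

print(answer("how many days of annual leave do I get"))
```

In production the `embed` stub becomes an API call or a self-hosted model, and the final answer comes from an LLM prompted with the retrieved chunks, but the shape of the pipeline is exactly this.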

Agentic AI REASONING AND ACTION

RAG answers questions. An agent decides what to do about the answer — and then does it.

An agent can plan a sequence of steps, use RAG as one of many tools, call external APIs, query databases, draft documents, route approvals, and log its actions — all in response to a single instruction. It knows the boundary of its authority. Everything beyond that boundary goes to a human with full context already assembled.

This is the difference between an AI that is a very good research assistant and an AI that completes a meaningful workflow and hands a human exactly what they need to act.
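The authority-boundary idea can be sketched as a plain tool loop. Everything here is hypothetical: the tool names, the pre-computed plan (which an LLM would normally produce step by step), and the approval set are invented to show the routing logic, not any particular framework's API.

```python
# Hypothetical tool registry; each tool is a plain function the agent may call.
TOOLS = {
    "search_policies": lambda q: f"found 2 passages for '{q}'",
    "draft_email": lambda body: f"draft created ({len(body)} chars)",
}

# Actions outside the agent's authority are never executed directly:
# they are handed to a human with the assembled context.
REQUIRES_APPROVAL = {"send_email"}

def run_agent(plan):
    """Execute a plan of (action, argument) steps, logging every decision."""
    log = []
    for action, arg in plan:
        if action in REQUIRES_APPROVAL:
            log.append(("escalate", action, arg))   # human-in-the-loop hand-off
        elif action in TOOLS:
            log.append(("done", action, TOOLS[action](arg)))
        else:
            log.append(("error", action, "unknown tool"))
    return log

trace = run_agent([
    ("search_policies", "leave policy"),
    ("draft_email", "Dear team, the leave policy says..."),
    ("send_email", "hr@example.com"),
])
for entry in trace:
    print(entry)
```

The first two steps run autonomously; the third crosses the boundary and is escalated rather than executed, which is the pattern the tiers below implement with real checkpointing.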

Seven steps from document to cited answer.

01

Ingest

Documents pulled into the pipeline. PDFs, Word, Excel, scanned files. OCR applied to scanned content.

02

Chunk

Each document split into meaningful passages. Metadata attached: source, page, section, date, version.
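A minimal chunking sketch, assuming fixed-size character windows with overlap; production splitters (recursive or semantic) first break on paragraph and sentence boundaries, and the metadata keys shown are illustrative.

```python
def chunk(text: str, size: int = 200, overlap: int = 40, **meta):
    """Split text into overlapping character windows and attach metadata
    to each piece. Assumes size > overlap so the window always advances."""
    step = size - overlap
    chunks = []
    for i, start in enumerate(range(0, max(len(text) - overlap, 1), step)):
        chunks.append({
            "text": text[start:start + size],
            "chunk_index": i,
            **meta,  # source, page, section, date, version...
        })
    return chunks

doc = "Section 4.2 Annual leave. " * 40   # ~1,040 characters of sample text
pieces = chunk(doc, source="hr-policy.pdf", page=4, version="2024-03")
print(len(pieces), pieces[0]["source"])
```

The overlap means the tail of each chunk repeats as the head of the next, so a passage that straddles a boundary is still retrievable whole.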

03

Embed

Each chunk converted to a vector — a mathematical representation of meaning. Stored in a vector database.

04

Retrieve

Query converted to a vector. The database returns the most semantically similar chunks across all documents.

05

Generate

LLM reasons over retrieved chunks and produces a cited answer. The prompt constrains it to the provided context, which sharply reduces hallucination.
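A sketch of how the grounded prompt is typically assembled before the LLM call; the instruction wording and chunk ids are illustrative, not a fixed template.

```python
def build_prompt(query: str, retrieved: list) -> str:
    """Assemble a generation prompt that confines the model to the
    retrieved context and asks for a citation on every claim."""
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in retrieved)
    return (
        "Answer using ONLY the context below. Cite the bracketed source id "
        "for every claim. If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_prompt(
    "how long is annual leave?",
    [{"id": "hr-policy:p4", "text": "Annual leave is 30 days."}],
)
print(prompt)
```

The "say so" escape hatch matters: without it, a model with no relevant context tends to answer from its training data instead of declining.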

06

Audit trail

Every query, every retrieved chunk, every response logged. Fully reconstructable for regulators and auditors.
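One common shape for such a log is append-only JSON lines, one record per query; the field names here are illustrative, and a `StringIO` buffer stands in for the append-only file or SIEM sink a real deployment would use.

```python
import io
import json
import time

def log_query(sink, user, query, retrieved_ids, response):
    """Append one JSON line per query: enough to reconstruct exactly
    what the model saw and answered, for regulators and auditors."""
    record = {
        "ts": time.time(),
        "user": user,
        "query": query,
        "retrieved": retrieved_ids,   # chunk ids that went into the prompt
        "response": response,
    }
    sink.write(json.dumps(record) + "\n")

buf = io.StringIO()   # stand-in for an append-only audit file
log_query(buf, "u123", "leave policy?", ["hr-policy:p4"],
          "30 days [hr-policy:p4]")
print(buf.getvalue().strip())
```

Because retrieved chunk ids are logged alongside the response, an auditor can replay precisely which document passages produced any given answer.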

07

Access control

Document-level permissions enforced at retrieval. Users only get answers from documents their role permits.
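The key design point is that the filter runs before results are returned, not in the UI. A toy sketch with invented roles and chunk metadata:

```python
# Hypothetical chunk store with document-level ACLs attached as metadata.
CHUNKS = [
    {"id": "hr:p1", "text": "leave policy...", "allowed_roles": {"staff", "hr"}},
    {"id": "fin:p9", "text": "salary bands...", "allowed_roles": {"hr"}},
]

def retrieve_for(role: str, query_matches: list) -> list:
    """Filter BEFORE returning results, so unauthorized documents can
    never leak into the model's context, not just out of the UI."""
    return [c for c in query_matches if role in c["allowed_roles"]]

print([c["id"] for c in retrieve_for("staff", CHUNKS)])
```

Real vector databases (Qdrant, Pinecone, Weaviate) support metadata filters in the query itself, so the unauthorized chunks are excluded inside the database rather than in application code as shown here.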

+A

Agentic layer

When RAG is a tool inside an agent: retrieval becomes one step in a multi-action workflow with planning and decision logic.

The full stack — by deployment tier.

Small Business — Cloud-first, cost-managed, fast to deploy.

Limited IT overhead, tolerance for cloud data processing, priority on time-to-value. Entirely managed services — no infrastructure to operate. Pay-as-you-go pricing means costs scale with usage, not with headcount.

Monthly estimate
AED 300 – 1,500
  • Ingestion | LlamaParse / Unstructured.io | CLOUD: Hosted service. Handles PDF, Word, Excel. No self-hosting required. Unstructured.io for richer document types.
  • Chunking | LangChain Text Splitters | CLOUD: Recursive character splitting with overlap. Simple, effective, no custom engineering. Sufficient for most small business document types.
  • Embedding | OpenAI text-embedding-3-small | CLOUD: USD 0.02 per million tokens (roughly AED 0.07). Total embedding cost for a small business knowledge base is negligible. High quality, fast, no infrastructure.
  • Vector DB | Pinecone (managed) or Chroma (local) | CLOUD: Pinecone: no infrastructure management, free tier sufficient. Chroma: runs locally in Python for zero external dependency. Start with Pinecone, migrate if needed.
  • LLM | GPT-4o mini or Groq + Llama 3.3 70B | CLOUD: GPT-4o mini for cost and capability balance. Groq for speed: Llama 3.3 70B at 500+ tokens/sec is dramatically faster and cheaper. Both via API.
  • Orchestration | LangChain or LlamaIndex | CLOUD: LangChain for agent workflows. LlamaIndex for pure RAG pipelines. Both have extensive documentation and fast implementation. LlamaIndex preferred for RAG-heavy workloads.
  • Tool integrations | Google Drive, Gmail, Notion, Slack APIs | CLOUD: All the SaaS tools a small business already runs on. Zapier as a no-code bridge for simpler integrations. Custom tool wrappers take 1–2 days to build per integration.
  • Application | Streamlit or Next.js | CLOUD: Streamlit for internal tools: fast to build, no frontend engineering required. Next.js if client-facing or needs production UX. Vercel for hosting.
  • Observability | LangSmith | CLOUD: Traces every LLM call, retrieval step, and agent action. Free tier available. Essential for debugging retrieval quality and understanding costs.
  • Auth | Auth0 or Clerk | CLOUD: Managed authentication, minimal setup. Clerk preferred for simplicity. Add role-based access control to gate document access by user type.

Prerequisites

  • A body of documents that contain the answers people are searching for — HR policies, product specs, procedures, FAQs. Quality matters more than quantity.
  • A defined scope: "questions about our procurement process" not "everything." RAG performs best when scoped to a category of questions.
  • A content audit: outdated, contradictory, or poorly written documents produce poor retrieval. Remove or update before ingesting.
  • A simple feedback mechanism decided in advance — thumbs up/down on responses drives continuous improvement.

Medium Enterprise — Managed cloud with control and compliance.

Internal IT team, mix of cloud and on-premise, role-based access requirements, needs reliability SLAs. Higher retrieval quality through semantic chunking and reranking. Data processed within compliance boundaries via Azure or AWS.

Monthly estimate
AED 5,000 – 25,000
  • Ingestion | Unstructured.io self-hosted or Azure Document Intelligence | HYBRID: Handles complex layouts, tables, forms, scanned documents with higher accuracy. Azure Document Intelligence for Arabic document support, critical for UAE/GCC deployments.
  • Chunking | Semantic chunking (LlamaIndex) | HYBRID: Uses an embedding model to detect natural topic boundaries within documents rather than splitting by character count. Produces better retrieval quality for complex documents.
  • Embedding | OpenAI text-embedding-3-large or Cohere Embed v3 | CLOUD: Higher dimensional embeddings, better retrieval precision. Cohere Embed v3 strong for multilingual including Arabic. Choose based on language distribution of your document set.
  • Vector DB | Qdrant self-hosted or Weaviate Cloud | HYBRID: Qdrant: fast, memory-efficient, straightforward to operate on a single VM. Self-hosted gives control over where data lives. Weaviate Cloud if managed is preferred and budget allows.
  • LLM | Azure OpenAI Service or Anthropic Claude via AWS Bedrock | CLOUD: Same models as direct API but data processed within the Azure or AWS compliance boundary, not shared for training. Azure OpenAI preferred for Microsoft-aligned enterprises; Bedrock for AWS-aligned.
  • Orchestration | LangGraph | HYBRID: More sophisticated than LangChain for complex agent workflows with conditional branching, parallel tool calls, and human-in-the-loop checkpoints. Designed explicitly for multi-step agents.
  • Reranking | Cohere Rerank or cross-encoder model | CLOUD: After initial vector retrieval, a reranker scores retrieved chunks for relevance more precisely. Significantly improves answer quality. Critical at this tier: adds measurable improvement to retrieval accuracy.
  • Tool integrations | Microsoft 365, Salesforce, SAP, REST APIs | HYBRID: SharePoint, Teams, Outlook via Graph API. Salesforce and SAP connectors. Custom tool wrappers for proprietary internal systems. Each integration is a defined agent tool.
  • Application | React + Next.js frontend, FastAPI backend | HYBRID: Proper engineering, not Streamlit. Responsive, accessible, production UX. Deployed on Azure Container Apps or AWS ECS. Custom design to match enterprise brand standards.
  • Observability | Langfuse self-hosted | HYBRID: Open source LLM observability. Traces, latency metrics, retrieval quality scores, cost per query. Self-hostable, preferred for enterprises that cannot send trace data to external services.
  • Auth | Azure Active Directory or Okta | HYBRID: Single sign-on, role-based access, document-level permissions enforced at retrieval time. The vector database query filters by user role before returning results, not just at the application layer.

Why reranking is non-optional at medium enterprise scale

  • Vector search retrieves by approximate similarity — it is fast and good but not perfect. It returns the top-K chunks that are semantically close to the query, but not necessarily the most relevant.
  • A reranker is a second, more precise model that reads the query and each retrieved chunk together and scores them for true relevance. It runs after retrieval, not instead of it.
  • The improvement is measurable: retrieval accuracy typically improves 15–30% when a reranker is added to a naive vector retrieval pipeline.
  • At medium enterprise scale, the document volumes and query complexity justify the additional latency (typically 200–500ms) and cost.
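The two-stage pattern described above can be sketched as follows. The scorer here is a toy term-overlap function standing in for a real cross-encoder (such as Cohere Rerank or a BGE reranker), and the candidate chunks are invented; only the shape of the pipeline, coarse retrieval followed by precise rescoring, is the point.

```python
def toy_score(query: str, text: str) -> float:
    """Stand-in relevance scorer: fraction of query terms present in the
    chunk. A production reranker is a cross-encoder reading both together."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q)

def rerank(query: str, candidates: list, score_fn, top_n: int = 2) -> list:
    """Second stage: rescore the first-stage candidates and keep the best."""
    scored = sorted(candidates, key=lambda c: score_fn(query, c["text"]),
                    reverse=True)
    return scored[:top_n]

# Suppose first-stage vector search returned these top-K candidates:
candidates = [
    {"id": "a", "text": "the office closes early during ramadan"},
    {"id": "b", "text": "annual leave entitlement is 30 days"},
    {"id": "c", "text": "leave requests need manager approval"},
]
best = rerank("how many days of annual leave", candidates, toy_score)
print([c["id"] for c in best])
```

Because the reranker only sees the handful of chunks the vector search already surfaced, its extra per-pair cost stays bounded regardless of corpus size.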

Corporate — Private cloud, full governance, in-region data residency.

Dedicated infrastructure and security teams, strict data governance, multi-department deployment. All components deployed in-region — UAE North (Azure) or me-south-1 (AWS Bahrain) for UAE data residency. Custom chunking strategy per document type. Long-running agent workflows with persistent state.

Monthly estimate
AED 50,000 – 200,000+
  • Ingestion | Unstructured.io enterprise, self-hosted pipeline | ON-PREM: Runs inside corporate infrastructure. No document leaves the network during processing. Supports 25+ file types including legacy formats. Custom preprocessing for domain-specific document structures.
  • Chunking | Custom chunking strategy per document type | ON-PREM: Legal documents chunked by clause. Technical manuals by section and subsection. Financial reports by disclosure item. Not one-size-fits-all: developed and tuned during project delivery.
  • Embedding | Self-hosted: BGE-M3 or E5-Mistral-7B | ON-PREM: BGE-M3: strong multilingual including Arabic, runs on 1x A100 or 2x A10G GPUs. E5-Mistral-7B: high accuracy for English-dominant sets. No embeddings sent to external APIs.
  • Vector DB | Qdrant or Weaviate on Kubernetes | ON-PREM: High availability deployment, multiple replicas, backup and restore procedures. Multi-tenancy: different departments have isolated vector spaces within the same cluster. Operator pattern for lifecycle management.
  • LLM | Azure OpenAI private endpoint / AWS Bedrock VPC / self-hosted Llama | HYBRID: Azure OpenAI with private endpoint and no data exfiltration for standard data. Anthropic Claude via Bedrock in private VPC. Self-hosted quantised Llama 3.1 70B on a GPU cluster for the highest sensitivity data.
  • Orchestration | LangGraph with persistent state | ON-PREM: Agent workflows that can pause, wait for human approval, resume. Supports long-running processes across hours or days. Redis or PostgreSQL for state persistence. Explicit checkpoints at authority boundaries.
  • Reranking | Cross-encoder + HyDE | ON-PREM: HyDE (Hypothetical Document Embedding): generates a hypothetical answer to the query, embeds that, retrieves against it. Significantly better recall for complex queries. Cross-encoder reranker for final scoring.
  • Guardrails | Guardrails AI or NeMo Guardrails | ON-PREM: Validates model outputs before they reach the user. Checks for hallucination, off-topic responses, policy violations, sensitive data leakage. Non-negotiable at corporate tier.
  • Tool integrations | SAP, Oracle, Salesforce, ServiceNow, Workday, custom MCP servers | ON-PREM: Full enterprise integration surface. Custom MCP (Model Context Protocol) servers for proprietary internal systems: a standardised interface between the agent and any tool it needs to call.
  • Observability | Langfuse self-hosted + Grafana dashboards | ON-PREM: Real-time visibility into query volumes, latency percentiles, retrieval quality, LLM costs, error rates. Alerting on anomalies. Full trace retention for audit. No trace data leaves the network.
  • Auth | SAML 2.0 / OIDC + attribute-based access control | ON-PREM: Integrated with the corporate identity provider. Attribute-based access control: not just role, but department, clearance level, data classification. Enforced at the vector database query layer, not just the application.
  • Hosting | UAE North (Azure) or me-south-1 (AWS Bahrain) | ON-PREM: In-region data residency. Application on managed container services. Vector database on dedicated VMs. LLM on a GPU cluster or via private endpoint. All within a single private network.

HyDE — Hypothetical Document Embedding

  • Standard RAG embeds the user's query and searches for similar document chunks. This works well when the query language matches the document language.
  • HyDE inverts this: the LLM first generates a hypothetical answer to the query — what the ideal answer might look like. That hypothetical is then embedded and used for retrieval.
  • Why this works: a well-formed answer sounds more like the source document than a short question does. Retrieval quality improves significantly for complex queries where the question phrasing is far from the document phrasing.
  • The risk: if the LLM's hypothetical is hallucinated in a misleading direction, retrieval goes wrong. HyDE works best combined with a reranker that catches poor retrievals before they reach the generation step.
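The mechanism in the bullets above can be demonstrated with a toy retriever. The bag-of-words "embedding" and the stubbed LLM answer are stand-ins under stated assumptions; the documents and phrasings are invented to show a case where the raw query retrieves the wrong passage while the hypothetical answer retrieves the right one.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def fake_llm(query: str) -> str:
    """Stand-in for the LLM generating a hypothetical answer to the query."""
    return "employees may be away on annual leave for thirty days each year"

docs = [
    "employees are entitled to thirty days of annual leave per calendar year",
    "how long the cafeteria queue is depends on the the day",
]
index = [(d, embed(d)) for d in docs]

def retrieve(vector: Counter, k: int = 1) -> list:
    ranked = sorted(index, key=lambda pair: cosine(vector, pair[1]),
                    reverse=True)
    return [d for d, _ in ranked[:k]]

query = "how long can i be away from work each year"
plain = retrieve(embed(query))            # question phrasing: weak match
hyde = retrieve(embed(fake_llm(query)))   # hypothetical answer: strong match
print(plain[0])
print(hyde[0])
```

The question's wording ("how long") happens to match the irrelevant document better than the relevant one; the hypothetical answer shares vocabulary with the policy text, so HyDE recovers the right chunk.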

Government — Air-gapped, sovereign, Arabic-first.

Data sovereignty absolute. No document leaves the ministry's network at any stage. Arabic language is a first-class requirement, not an afterthought. Full compliance with NESA, UAE ISR, and NCA ECC frameworks. Every component selected, configured, and audited against information security requirements.

Infrastructure estimate
AED 40,000 – 120,000 / mo
  • Ingestion | Fully on-premise pipeline: Unstructured.io enterprise | AIR-GAPPED: Servers inside the government data centre or approved government cloud (UAE GCCP, Huawei Government Cloud). No commercial cloud. Arabic OCR validated first: the quality of Arabic OCR is the first decision before any other component is chosen.
  • Arabic OCR | Tesseract with Arabic language pack or commercial Arabic OCR | AIR-GAPPED: Standard OCR fails on Arabic. Tesseract with the Arabic language pack covers most cases. For high-volume, high-accuracy requirements: a commercial Arabic OCR solution validated against the ministry's actual document formats.
  • Chunking | Custom Arabic chunking logic | AIR-GAPPED: Arabic text requires different chunking than English. Right-to-left, different punctuation conventions, formal versus dialectal register differences in government documents. Custom engineering, not a library default.
  • Embedding | BGE-M3 self-hosted or CAMeL-BERT / AraBART | AIR-GAPPED: BGE-M3: best open-source multilingual model, strong Arabic, runs on 1x A100 or 2x A10G GPUs. CAMeL-BERT or AraBART for Arabic-dominant document sets. No external API calls: embedding runs on-premise.
  • Vector DB | Qdrant, air-gapped deployment | AIR-GAPPED: Written in Rust, minimal dependencies, no JVM, easy to secure and audit. No telemetry, no external connections. Preferred for air-gapped: simpler operational footprint than Weaviate. FIPS-compliant configuration available.
  • LLM | Jais-30B (primary) or Llama 3.1 70B (fallback) | AIR-GAPPED: Jais-30B: Arabic-first LLM developed at MBZUAI in the UAE. Strongest Arabic language capability, purpose-built for this context. Llama 3.1 70B at 4-bit quantisation as fallback: requires 2x A100 80GB or 4x A40 GPUs. All self-hosted.
  • Inference serving | vLLM (production) or Ollama (single-GPU) | AIR-GAPPED: vLLM: production LLM serving framework, handles batching, GPU memory management, high throughput. Runs on-premise. Ollama for simpler single-GPU proof-of-concept deployments before production rollout.
  • Orchestration | LangGraph, no external telemetry | AIR-GAPPED: All state stored in on-premise PostgreSQL. Agent workflows include mandatory human-in-the-loop checkpoints at defined authority boundaries. Every tool call logged before execution. No data leaves the network.
  • Reranking | BGE-reranker-large, self-hosted | AIR-GAPPED: Runs on CPU, no additional GPU required. Self-hosted cross-encoder model. No external reranking API calls. Performance comparable to cloud reranking services at a fraction of the operational complexity.
  • Guardrails | Custom rule-based guardrails, defined by the ministry IS team | AIR-GAPPED: Not a third-party service. An explicit allowlist of response types validated before surfacing to users. Defined by the ministry's information security team. Reviewed and approved as part of security assessment.
  • Tool integrations | UAE PASS, ministry systems, Emirates ID API, SharePoint on-prem | AIR-GAPPED: Government-specific integrations only. UAE PASS for identity. Internal ministry systems via approved APIs. Emirates ID verification. Government document management systems. No commercial SaaS integrations.
  • Observability | Grafana + Prometheus + Langfuse self-hosted + on-prem SIEM | AIR-GAPPED: No trace data leaves the network. Audit logs immutable, retained per ministry data retention policy. On-premise SIEM integration for security event correlation. Grafana dashboards for operational monitoring.
  • Auth | Ministry Active Directory + UAE PASS + document clearance levels | AIR-GAPPED: SECRET documents never retrieved for users without appropriate clearance, enforced at the database layer, not the application layer. SAML integration with the ministry identity provider. Attribute-based access per data classification.
  • Compliance | NESA / UAE ISR / NCA ECC (KSA) | AIR-GAPPED: Every component mapped to NESA controls. UAE ISR alignment documented. Data classification applied at ingestion. Penetration testing before go-live. Security assessment by an approved third party. Ongoing vulnerability management.

Minimum viable hardware for a government deployment

  • 2x NVIDIA A100 80GB GPUs — LLM inference (Jais-30B or Llama 3.1 70B)
  • 1x NVIDIA A10G or RTX 4090 — embedding model (BGE-M3)
  • 64–128GB RAM — application and orchestration server
  • High-speed NVMe storage — vector database (minimum 2TB, RAID for redundancy)
  • Air-gapped network segment — physically isolated from public internet
  • Backup power and cooling infrastructure appropriate for GPU workloads

Why Jais-30B for UAE government
  • Developed at Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) — UAE national AI institution
  • Arabic-first architecture: trained on Arabic at the model level, not fine-tuned after the fact
  • Outperforms general models on Arabic language tasks by a significant margin
  • Runs on-premise with no dependency on international model providers

The same architecture. Different execution.

Layer | Small Business | Medium Enterprise | Corporate | Government
Ingestion | LlamaParse cloud | Azure Doc Intelligence | Self-hosted Unstructured | On-prem + Arabic OCR
Chunking | Character-based | Semantic chunking | Custom per doc type | Custom + Arabic RTL logic
Embedding | OpenAI API | Cohere API | Self-hosted BGE-M3 | BGE-M3 / AraBART on-prem
Vector DB | Pinecone managed | Qdrant / Weaviate cloud | Qdrant on Kubernetes | Qdrant air-gapped
LLM | GPT-4o mini / Groq | Azure OpenAI / Bedrock | Private endpoint / self-hosted | Jais-30B / Llama 70B on-prem
Orchestration | LangChain | LangGraph | LangGraph + persistent state | LangGraph, no telemetry
Reranking | None | Cohere Rerank | Cross-encoder + HyDE | BGE-reranker self-hosted
Guardrails | None | Basic | Guardrails AI / NeMo | Custom rule-based
Observability | LangSmith | Langfuse self-hosted | Langfuse + Grafana | On-prem SIEM + Grafana
Auth | Auth0 / Clerk | Azure AD / Okta | SAML + ABAC | UAE PASS + ministry AD
Data location | Cloud (shared) | Cloud (compliance boundary) | Private cloud, in-region | Air-gapped gov data centre
Monthly cost | AED 300–1,500 | AED 5,000–25,000 | AED 50,000–200,000+ | AED 40,000–120,000 infra

Know your tier. Ready to build?

Every engagement starts with a discovery call. We scope the right architecture for your data, your infrastructure, and your compliance requirements — and tell you honestly what you need before we propose anything.
