Technical Reference — Agentic AI and RAG

How RAG and Agentic AI get built in the real world.

From a ten-person business querying its policy documents to a government ministry running a sovereign AI system on air-gapped infrastructure — the architecture, the tools, and the decisions that separate a demo from a production deployment.

4 Deployment tiers
10 Stack layers per tier
12+ Years building production systems

Two technologies. One architecture.

RAG RETRIEVAL-AUGMENTED GENERATION

Language models are trained on general data. They do not know your internal policies, your contracts from last year, or your ministry's 2024 circulars. Ask them — they hallucinate or refuse.

RAG fixes this by splitting the job into two steps. Before the model answers, it searches your actual documents using vector search — finding meaning, not just matching keywords. The model then reasons over the retrieved content and answers with a citation. Every claim links back to its exact source.

The result: an AI that knows your documents and can prove where every answer came from. That is what makes it deployable in regulated environments.
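The retrieve-then-generate loop can be sketched end to end in a few lines. This is a toy under loud assumptions: a bag-of-words counter stands in for a real embedding model, the "generation" step simply returns the top chunk with its citation, and the document ids and texts are invented for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model (e.g. text-embedding-3-small):
    bag-of-words term counts instead of a learned dense vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Step 1: index the document chunks (one vector per chunk).
chunks = [
    {"id": "hr-policy:p4", "text": "annual leave is 30 days per calendar year"},
    {"id": "it-policy:p2", "text": "passwords must be rotated every 90 days"},
]
index = [(c, embed(c["text"])) for c in chunks]

def retrieve(query: str, k: int = 1):
    qv = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(qv, pair[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

# Step 2: answer only from retrieved context, citing the source chunk.
def answer(query: str) -> str:
    top = retrieve(query)[0]
    return f"{top['text']} [source: {top['id']}]"

print(answer("how many days of annual leave do I get"))
```

In production the `embed` stub becomes an API call or a self-hosted model, and the final answer comes from an LLM prompted with the retrieved chunks, but the shape of the pipeline is exactly this.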

Agentic AI REASONING AND ACTION

RAG answers questions. An agent decides what to do about the answer — and then does it.

An agent can plan a sequence of steps, use RAG as one of many tools, call external APIs, query databases, draft documents, route approvals, and log its actions — all in response to a single instruction. It knows the boundary of its authority. Everything beyond that boundary goes to a human with full context already assembled.

This is the difference between an AI that is a very good research assistant and an AI that completes a meaningful workflow and hands a human exactly what they need to act.
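The authority-boundary idea can be sketched as a plain tool loop. Everything here is hypothetical: the tool names, the pre-computed plan (which an LLM would normally produce step by step), and the approval set are invented to show the routing logic, not any particular framework's API.

```python
# Hypothetical tool registry; each tool is a plain function the agent may call.
TOOLS = {
    "search_policies": lambda q: f"found 2 passages for '{q}'",
    "draft_email": lambda body: f"draft created ({len(body)} chars)",
}

# Actions outside the agent's authority are never executed directly:
# they are handed to a human with the assembled context.
REQUIRES_APPROVAL = {"send_email"}

def run_agent(plan):
    """Execute a plan of (action, argument) steps, logging every decision."""
    log = []
    for action, arg in plan:
        if action in REQUIRES_APPROVAL:
            log.append(("escalate", action, arg))   # human-in-the-loop hand-off
        elif action in TOOLS:
            log.append(("done", action, TOOLS[action](arg)))
        else:
            log.append(("error", action, "unknown tool"))
    return log

trace = run_agent([
    ("search_policies", "leave policy"),
    ("draft_email", "Dear team, the leave policy says..."),
    ("send_email", "hr@example.com"),
])
for entry in trace:
    print(entry)
```

The first two steps run autonomously; the third crosses the boundary and is escalated rather than executed, which is the pattern the tiers below implement with real checkpointing.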

Seven steps from document to cited answer.

01

Ingest

Documents pulled into the pipeline. PDFs, Word, Excel, scanned files. OCR applied to scanned content.

02

Chunk

Each document split into meaningful passages. Metadata attached: source, page, section, date, version.
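A minimal chunking sketch, assuming fixed-size character windows with overlap; production splitters (recursive or semantic) first break on paragraph and sentence boundaries, and the metadata keys shown are illustrative.

```python
def chunk(text: str, size: int = 200, overlap: int = 40, **meta):
    """Split text into overlapping character windows and attach metadata
    to each piece. Assumes size > overlap so the window always advances."""
    step = size - overlap
    chunks = []
    for i, start in enumerate(range(0, max(len(text) - overlap, 1), step)):
        chunks.append({
            "text": text[start:start + size],
            "chunk_index": i,
            **meta,  # source, page, section, date, version...
        })
    return chunks

doc = "Section 4.2 Annual leave. " * 40   # ~1,040 characters of sample text
pieces = chunk(doc, source="hr-policy.pdf", page=4, version="2024-03")
print(len(pieces), pieces[0]["source"])
```

The overlap means the tail of each chunk repeats as the head of the next, so a passage that straddles a boundary is still retrievable whole.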

03

Embed

Each chunk converted to a vector — a mathematical representation of meaning. Stored in a vector database.

04

Retrieve

Query converted to a vector. The database returns the most semantically similar chunks across all documents.

05

Generate

LLM reasons over retrieved chunks and produces a cited answer. The prompt constrains it to the provided context, which sharply reduces hallucination.
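A sketch of how the grounded prompt is typically assembled before the LLM call; the instruction wording and chunk ids are illustrative, not a fixed template.

```python
def build_prompt(query: str, retrieved: list) -> str:
    """Assemble a generation prompt that confines the model to the
    retrieved context and asks for a citation on every claim."""
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in retrieved)
    return (
        "Answer using ONLY the context below. Cite the bracketed source id "
        "for every claim. If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_prompt(
    "how long is annual leave?",
    [{"id": "hr-policy:p4", "text": "Annual leave is 30 days."}],
)
print(prompt)
```

The "say so" escape hatch matters: without it, a model with no relevant context tends to answer from its training data instead of declining.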

06

Audit trail

Every query, every retrieved chunk, every response logged. Fully reconstructable for regulators and auditors.
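One common shape for such a log is append-only JSON lines, one record per query; the field names here are illustrative, and a `StringIO` buffer stands in for the append-only file or SIEM sink a real deployment would use.

```python
import io
import json
import time

def log_query(sink, user, query, retrieved_ids, response):
    """Append one JSON line per query: enough to reconstruct exactly
    what the model saw and answered, for regulators and auditors."""
    record = {
        "ts": time.time(),
        "user": user,
        "query": query,
        "retrieved": retrieved_ids,   # chunk ids that went into the prompt
        "response": response,
    }
    sink.write(json.dumps(record) + "\n")

buf = io.StringIO()   # stand-in for an append-only audit file
log_query(buf, "u123", "leave policy?", ["hr-policy:p4"],
          "30 days [hr-policy:p4]")
print(buf.getvalue().strip())
```

Because retrieved chunk ids are logged alongside the response, an auditor can replay precisely which document passages produced any given answer.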

07

Access control

Document-level permissions enforced at retrieval. Users only get answers from documents their role permits.
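The key design point is that the filter runs before results are returned, not in the UI. A toy sketch with invented roles and chunk metadata:

```python
# Hypothetical chunk store with document-level ACLs attached as metadata.
CHUNKS = [
    {"id": "hr:p1", "text": "leave policy...", "allowed_roles": {"staff", "hr"}},
    {"id": "fin:p9", "text": "salary bands...", "allowed_roles": {"hr"}},
]

def retrieve_for(role: str, query_matches: list) -> list:
    """Filter BEFORE returning results, so unauthorized documents can
    never leak into the model's context, not just out of the UI."""
    return [c for c in query_matches if role in c["allowed_roles"]]

print([c["id"] for c in retrieve_for("staff", CHUNKS)])
```

Real vector databases (Qdrant, Pinecone, Weaviate) support metadata filters in the query itself, so the unauthorized chunks are excluded inside the database rather than in application code as shown here.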

+A

Agentic layer

When RAG is a tool inside an agent: retrieval becomes one step in a multi-action workflow with planning and decision logic.

The full stack — by deployment tier.

Small Business — Cloud-first, cost-managed, fast to deploy.

Limited IT overhead, tolerance for cloud data processing, priority on time-to-value. Entirely managed services — no infrastructure to operate. Pay-as-you-go pricing means costs scale with usage, not with headcount.

Monthly estimate
AED 300 – 1,500
  • Ingestion | LlamaParse / Unstructured.io | CLOUD: Hosted service. Handles PDF, Word, Excel. No self-hosting required. Unstructured.io for richer document types.
  • Chunking | LangChain Text Splitters | CLOUD: Recursive character splitting with overlap. Simple, effective, no custom engineering. Sufficient for most small business document types.
  • Embedding | OpenAI text-embedding-3-small | CLOUD: USD 0.02 per million tokens (roughly AED 0.07). Total embedding cost for a small business knowledge base is negligible. High quality, fast, no infrastructure.
  • Vector DB | Pinecone (managed) or Chroma (local) | CLOUD: Pinecone: no infrastructure management, free tier sufficient. Chroma: runs locally in Python for zero external dependency. Start with Pinecone, migrate if needed.
  • LLM | GPT-4o mini or Groq + Llama 3.3 70B | CLOUD: GPT-4o mini for cost and capability balance. Groq for speed: Llama 3.3 70B at 500+ tokens/sec is dramatically faster and cheaper. Both via API.
  • Orchestration | LangChain or LlamaIndex | CLOUD: LangChain for agent workflows. LlamaIndex for pure RAG pipelines. Both have extensive documentation and fast implementation. LlamaIndex preferred for RAG-heavy workloads.
  • Tool integrations | Google Drive, Gmail, Notion, Slack APIs | CLOUD: All the SaaS tools a small business already runs on. Zapier as a no-code bridge for simpler integrations. Custom tool wrappers take 1–2 days to build per integration.
  • Application | Streamlit or Next.js | CLOUD: Streamlit for internal tools: fast to build, no frontend engineering required. Next.js if client-facing or needs production UX. Vercel for hosting.
  • Observability | LangSmith | CLOUD: Traces every LLM call, retrieval step, and agent action. Free tier available. Essential for debugging retrieval quality and understanding costs.
  • Auth | Auth0 or Clerk | CLOUD: Managed authentication, minimal setup. Clerk preferred for simplicity. Add role-based access control to gate document access by user type.

Prerequisites

  • A body of documents that contain the answers people are searching for — HR policies, product specs, procedures, FAQs. Quality matters more than quantity.
  • A defined scope: "questions about our procurement process" not "everything." RAG performs best when scoped to a category of questions.
  • A content audit: outdated, contradictory, or poorly written documents produce poor retrieval. Remove or update before ingesting.
  • A simple feedback mechanism decided in advance — thumbs up/down on responses drives continuous improvement.

Medium Enterprise — Managed cloud with control and compliance.

Internal IT team, mix of cloud and on-premise, role-based access requirements, needs reliability SLAs. Higher retrieval quality through semantic chunking and reranking. Data processed within compliance boundaries via Azure or AWS.

Monthly estimate
AED 5,000 – 25,000
  • Ingestion | Unstructured.io self-hosted or Azure Document Intelligence | HYBRID: Handles complex layouts, tables, forms, scanned documents with higher accuracy. Azure Document Intelligence for Arabic document support, critical for UAE/GCC deployments.
  • Chunking | Semantic chunking (LlamaIndex) | HYBRID: Uses an embedding model to detect natural topic boundaries within documents rather than splitting by character count. Produces better retrieval quality for complex documents.
  • Embedding | OpenAI text-embedding-3-large or Cohere Embed v3 | CLOUD: Higher dimensional embeddings, better retrieval precision. Cohere Embed v3 strong for multilingual including Arabic. Choose based on language distribution of your document set.
  • Vector DB | Qdrant self-hosted or Weaviate Cloud | HYBRID: Qdrant: fast, memory-efficient, straightforward to operate on a single VM. Self-hosted gives control over where data lives. Weaviate Cloud if managed is preferred and budget allows.
  • LLM | Azure OpenAI Service or Anthropic Claude via AWS Bedrock | CLOUD: Same models as direct API but data processed within the Azure or AWS compliance boundary, not shared for training. Azure OpenAI preferred for Microsoft-aligned enterprises; Bedrock for AWS-aligned.
  • Orchestration | LangGraph | HYBRID: More sophisticated than LangChain for complex agent workflows with conditional branching, parallel tool calls, and human-in-the-loop checkpoints. Designed explicitly for multi-step agents.
  • Reranking | Cohere Rerank or cross-encoder model | CLOUD: After initial vector retrieval, a reranker scores retrieved chunks for relevance more precisely. Significantly improves answer quality. Critical at this tier: adds measurable improvement to retrieval accuracy.
  • Tool integrations | Microsoft 365, Salesforce, SAP, REST APIs | HYBRID: SharePoint, Teams, Outlook via Graph API. Salesforce and SAP connectors. Custom tool wrappers for proprietary internal systems. Each integration is a defined agent tool.
  • Application | React + Next.js frontend, FastAPI backend | HYBRID: Proper engineering, not Streamlit. Responsive, accessible, production UX. Deployed on Azure Container Apps or AWS ECS. Custom design to match enterprise brand standards.
  • Observability | Langfuse self-hosted | HYBRID: Open source LLM observability. Traces, latency metrics, retrieval quality scores, cost per query. Self-hostable, preferred for enterprises that cannot send trace data to external services.
  • Auth | Azure Active Directory or Okta | HYBRID: Single sign-on, role-based access, document-level permissions enforced at retrieval time. The vector database query filters by user role before returning results, not just at the application layer.

Why reranking is non-optional at medium enterprise scale

  • Vector search retrieves by approximate similarity — it is fast and good but not perfect. It returns the top-K chunks that are semantically close to the query, but not necessarily the most relevant.
  • A reranker is a second, more precise model that reads the query and each retrieved chunk together and scores them for true relevance. It runs after retrieval, not instead of it.
  • The improvement is measurable: retrieval accuracy typically improves 15–30% when a reranker is added to a naive vector retrieval pipeline.
  • At medium enterprise scale, the document volumes and query complexity justify the additional latency (typically 200–500ms) and cost.
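The two-stage pattern described above can be sketched as follows. The scorer here is a toy term-overlap function standing in for a real cross-encoder (such as Cohere Rerank or a BGE reranker), and the candidate chunks are invented; only the shape of the pipeline, coarse retrieval followed by precise rescoring, is the point.

```python
def toy_score(query: str, text: str) -> float:
    """Stand-in relevance scorer: fraction of query terms present in the
    chunk. A production reranker is a cross-encoder reading both together."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q)

def rerank(query: str, candidates: list, score_fn, top_n: int = 2) -> list:
    """Second stage: rescore the first-stage candidates and keep the best."""
    scored = sorted(candidates, key=lambda c: score_fn(query, c["text"]),
                    reverse=True)
    return scored[:top_n]

# Suppose first-stage vector search returned these top-K candidates:
candidates = [
    {"id": "a", "text": "the office closes early during ramadan"},
    {"id": "b", "text": "annual leave entitlement is 30 days"},
    {"id": "c", "text": "leave requests need manager approval"},
]
best = rerank("how many days of annual leave", candidates, toy_score)
print([c["id"] for c in best])
```

Because the reranker only sees the handful of chunks the vector search already surfaced, its extra per-pair cost stays bounded regardless of corpus size.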

Corporate — Private cloud, full governance, in-region data residency.

Dedicated infrastructure and security teams, strict data governance, multi-department deployment. All components deployed in-region — UAE North (Azure) or me-south-1 (AWS Bahrain) for UAE data residency. Custom chunking strategy per document type. Long-running agent workflows with persistent state.

Monthly estimate
AED 50,000 – 200,000+
  • Ingestion | Unstructured.io enterprise, self-hosted pipeline | ON-PREM: Runs inside corporate infrastructure. No document leaves the network during processing. Supports 25+ file types including legacy formats. Custom preprocessing for domain-specific document structures.
  • Chunking | Custom chunking strategy per document type | ON-PREM: Legal documents chunked by clause. Technical manuals by section and subsection. Financial reports by disclosure item. Not one-size-fits-all: developed and tuned during project delivery.
  • Embedding | Self-hosted: BGE-M3 or E5-Mistral-7B | ON-PREM: BGE-M3: strong multilingual including Arabic, runs on 1x A100 or 2x A10G GPUs. E5-Mistral-7B: high accuracy for English-dominant sets. No embeddings sent to external APIs.
  • Vector DB | Qdrant or Weaviate on Kubernetes | ON-PREM: High availability deployment, multiple replicas, backup and restore procedures. Multi-tenancy: different departments have isolated vector spaces within the same cluster. Operator pattern for lifecycle management.
  • LLM | Azure OpenAI private endpoint / AWS Bedrock VPC / self-hosted Llama | HYBRID: Azure OpenAI with private endpoint and no data exfiltration for standard data. Anthropic Claude via Bedrock in private VPC. Self-hosted quantised Llama 3.1 70B on a GPU cluster for the highest sensitivity data.
  • Orchestration | LangGraph with persistent state | ON-PREM: Agent workflows that can pause, wait for human approval, resume. Supports long-running processes across hours or days. Redis or PostgreSQL for state persistence. Explicit checkpoints at authority boundaries.
  • Reranking | Cross-encoder + HyDE | ON-PREM: HyDE (Hypothetical Document Embedding): generates a hypothetical answer to the query, embeds that, retrieves against it. Significantly better recall for complex queries. Cross-encoder reranker for final scoring.
  • Guardrails | Guardrails AI or NeMo Guardrails | ON-PREM: Validates model outputs before they reach the user. Checks for hallucination, off-topic responses, policy violations, sensitive data leakage. Non-negotiable at corporate tier.
  • Tool integrations | SAP, Oracle, Salesforce, ServiceNow, Workday, custom MCP servers | ON-PREM: Full enterprise integration surface. Custom MCP (Model Context Protocol) servers for proprietary internal systems: a standardised interface between the agent and any tool it needs to call.
  • Observability | Langfuse self-hosted + Grafana dashboards | ON-PREM: Real-time visibility into query volumes, latency percentiles, retrieval quality, LLM costs, error rates. Alerting on anomalies. Full trace retention for audit. No trace data leaves the network.
  • Auth | SAML 2.0 / OIDC + attribute-based access control | ON-PREM: Integrated with the corporate identity provider. Attribute-based access control: not just role, but department, clearance level, data classification. Enforced at the vector database query layer, not just the application.
  • Hosting | UAE North (Azure) or me-south-1 (AWS Bahrain) | ON-PREM: In-region data residency. Application on managed container services. Vector database on dedicated VMs. LLM on a GPU cluster or via private endpoint. All within a single private network.

HyDE — Hypothetical Document Embedding

  • Standard RAG embeds the user's query and searches for similar document chunks. This works well when the query language matches the document language.
  • HyDE inverts this: the LLM first generates a hypothetical answer to the query — what the ideal answer might look like. That hypothetical is then embedded and used for retrieval.
  • Why this works: a well-formed answer sounds more like the source document than a short question does. Retrieval quality improves significantly for complex queries where the question phrasing is far from the document phrasing.
  • The risk: if the LLM's hypothetical is hallucinated in a misleading direction, retrieval goes wrong. HyDE works best combined with a reranker that catches poor retrievals before they reach the generation step.
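The mechanism in the bullets above can be demonstrated with a toy retriever. The bag-of-words "embedding" and the stubbed LLM answer are stand-ins under stated assumptions; the documents and phrasings are invented to show a case where the raw query retrieves the wrong passage while the hypothetical answer retrieves the right one.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def fake_llm(query: str) -> str:
    """Stand-in for the LLM generating a hypothetical answer to the query."""
    return "employees may be away on annual leave for thirty days each year"

docs = [
    "employees are entitled to thirty days of annual leave per calendar year",
    "how long the cafeteria queue is depends on the the day",
]
index = [(d, embed(d)) for d in docs]

def retrieve(vector: Counter, k: int = 1) -> list:
    ranked = sorted(index, key=lambda pair: cosine(vector, pair[1]),
                    reverse=True)
    return [d for d, _ in ranked[:k]]

query = "how long can i be away from work each year"
plain = retrieve(embed(query))            # question phrasing: weak match
hyde = retrieve(embed(fake_llm(query)))   # hypothetical answer: strong match
print(plain[0])
print(hyde[0])
```

The question's wording ("how long") happens to match the irrelevant document better than the relevant one; the hypothetical answer shares vocabulary with the policy text, so HyDE recovers the right chunk.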

Government — Air-gapped, sovereign, Arabic-first.

Data sovereignty absolute. No document leaves the ministry's network at any stage. Arabic language is a first-class requirement, not an afterthought. Full compliance with NESA, UAE ISR, and NCA ECC frameworks. Every component selected, configured, and audited against information security requirements.

Infrastructure estimate
AED 40,000 – 120,000 / mo
  • Ingestion | Fully on-premise pipeline: Unstructured.io enterprise | AIR-GAPPED: Servers inside the government data centre or approved government cloud (UAE GCCP, Huawei Government Cloud). No commercial cloud. Arabic OCR validated first: the quality of Arabic OCR is the first decision before any other component is chosen.
  • Arabic OCR | Tesseract with Arabic language pack or commercial Arabic OCR | AIR-GAPPED: Standard OCR fails on Arabic. Tesseract with the Arabic language pack covers most cases. For high-volume, high-accuracy requirements: a commercial Arabic OCR solution validated against the ministry's actual document formats.
  • Chunking | Custom Arabic chunking logic | AIR-GAPPED: Arabic text requires different chunking than English. Right-to-left, different punctuation conventions, formal versus dialectal register differences in government documents. Custom engineering, not a library default.
  • Embedding | BGE-M3 self-hosted or CAMeL-BERT / AraBART | AIR-GAPPED: BGE-M3: best open-source multilingual model, strong Arabic, runs on 1x A100 or 2x A10G GPUs. CAMeL-BERT or AraBART for Arabic-dominant document sets. No external API calls: embedding runs on-premise.
  • Vector DB | Qdrant, air-gapped deployment | AIR-GAPPED: Written in Rust, minimal dependencies, no JVM, easy to secure and audit. No telemetry, no external connections. Preferred for air-gapped: simpler operational footprint than Weaviate. FIPS-compliant configuration available.
  • LLM | Jais-30B (primary) or Llama 3.1 70B (fallback) | AIR-GAPPED: Jais-30B: Arabic-first LLM developed at MBZUAI in the UAE. Strongest Arabic language capability, purpose-built for this context. Llama 3.1 70B at 4-bit quantisation as fallback: requires 2x A100 80GB or 4x A40 GPUs. All self-hosted.
  • Inference serving | vLLM (production) or Ollama (single-GPU) | AIR-GAPPED: vLLM: production LLM serving framework, handles batching, GPU memory management, high throughput. Runs on-premise. Ollama for simpler single-GPU proof-of-concept deployments before production rollout.
  • Orchestration | LangGraph, no external telemetry | AIR-GAPPED: All state stored in on-premise PostgreSQL. Agent workflows include mandatory human-in-the-loop checkpoints at defined authority boundaries. Every tool call logged before execution. No data leaves the network.
  • Reranking | BGE-reranker-large, self-hosted | AIR-GAPPED: Runs on CPU, no additional GPU required. Self-hosted cross-encoder model. No external reranking API calls. Performance comparable to cloud reranking services at a fraction of the operational complexity.
  • Guardrails | Custom rule-based guardrails, defined by the ministry IS team | AIR-GAPPED: Not a third-party service. An explicit allowlist of response types validated before surfacing to users. Defined by the ministry's information security team. Reviewed and approved as part of security assessment.
  • Tool integrations | UAE PASS, ministry systems, Emirates ID API, SharePoint on-prem | AIR-GAPPED: Government-specific integrations only. UAE PASS for identity. Internal ministry systems via approved APIs. Emirates ID verification. Government document management systems. No commercial SaaS integrations.
  • Observability | Grafana + Prometheus + Langfuse self-hosted + on-prem SIEM | AIR-GAPPED: No trace data leaves the network. Audit logs immutable, retained per ministry data retention policy. On-premise SIEM integration for security event correlation. Grafana dashboards for operational monitoring.
  • Auth | Ministry Active Directory + UAE PASS + document clearance levels | AIR-GAPPED: SECRET documents never retrieved for users without appropriate clearance, enforced at the database layer, not the application layer. SAML integration with the ministry identity provider. Attribute-based access per data classification.
  • Compliance | NESA / UAE ISR / NCA ECC (KSA) | AIR-GAPPED: Every component mapped to NESA controls. UAE ISR alignment documented. Data classification applied at ingestion. Penetration testing before go-live. Security assessment by an approved third party. Ongoing vulnerability management.

Minimum viable hardware for a government deployment

  • 2x NVIDIA A100 80GB GPUs — LLM inference (Jais-30B or Llama 3.1 70B)
  • 1x NVIDIA A10G or RTX 4090 — embedding model (BGE-M3)
  • 64–128GB RAM — application and orchestration server
  • High-speed NVMe storage — vector database (minimum 2TB, RAID for redundancy)
  • Air-gapped network segment — physically isolated from public internet
  • Backup power and cooling infrastructure appropriate for GPU workloads

Why Jais-30B for UAE government
  • Developed at Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) — UAE national AI institution
  • Arabic-first architecture: trained on Arabic at the model level, not fine-tuned after the fact
  • Outperforms general models on Arabic language tasks by a significant margin
  • Runs on-premise with no dependency on international model providers

The same architecture. Different execution.

Layer | Small Business | Medium Enterprise | Corporate | Government
Ingestion | LlamaParse cloud | Azure Doc Intelligence | Self-hosted Unstructured | On-prem + Arabic OCR
Chunking | Character-based | Semantic chunking | Custom per doc type | Custom + Arabic RTL logic
Embedding | OpenAI API | Cohere API | Self-hosted BGE-M3 | BGE-M3 / AraBART on-prem
Vector DB | Pinecone managed | Qdrant / Weaviate cloud | Qdrant on Kubernetes | Qdrant air-gapped
LLM | GPT-4o mini / Groq | Azure OpenAI / Bedrock | Private endpoint / self-hosted | Jais-30B / Llama 70B on-prem
Orchestration | LangChain | LangGraph | LangGraph + persistent state | LangGraph, no telemetry
Reranking | None | Cohere Rerank | Cross-encoder + HyDE | BGE-reranker self-hosted
Guardrails | None | Basic | Guardrails AI / NeMo | Custom rule-based
Observability | LangSmith | Langfuse self-hosted | Langfuse + Grafana | On-prem SIEM + Grafana
Auth | Auth0 / Clerk | Azure AD / Okta | SAML + ABAC | UAE PASS + ministry AD
Data location | Cloud (shared) | Cloud (compliance boundary) | Private cloud, in-region | Air-gapped gov data centre
Monthly cost | AED 300–1,500 | AED 5,000–25,000 | AED 50,000–200,000+ | AED 40,000–120,000 infra

Know your tier. Ready to build?

Every engagement starts with a discovery call. We scope the right architecture for your data, your infrastructure, and your compliance requirements — and tell you honestly what you need before we propose anything.
