
The first generation of AI SaaS applications had a fundamental flaw: they were glorified wrappers. You typed a prompt, it went to an LLM, and it returned a generic, stateless answer.
When I set out to architect the backend for a personalized AI platform designed to actively track user goals and habits, I knew standard RAG (Retrieval-Augmented Generation) wouldn't be enough. The system needed to deeply understand the user, remember their past, analyze their media, and survive the harsh realities of mobile network instability—all while scaling gracefully to support over 25,000 concurrent users.
Here is an architectural breakdown of how I engineered an Agentic RAG pipeline, avoided cloud vendor lock-in, and built a fault-tolerant infrastructure capable of delivering highly relevant, hyper-personalized AI guidance.
Phase 1: The Cloud-Agnostic Foundation
Before writing a single line of AI logic, the infrastructure had to be bulletproof. A common trap for startups is coupling their architecture too tightly to managed cloud services (like AWS S3 or DynamoDB), leading to vendor lock-in and runaway costs at scale.
To ensure absolute system resiliency and sovereignty over our data, I designed a completely cloud-agnostic backend.
- Compute: The core API was broken down into modular FastAPI microservices, fully containerized using Docker. This allowed us to deploy the exact same image on an AWS EC2 instance, a DigitalOcean droplet, or a bare-metal server without changing the codebase.
- Storage: Instead of relying on proprietary cloud object storage, I deployed a self-hosted MinIO cluster. MinIO provides massive, scalable, S3-compatible object storage. By keeping this self-hosted, we maintained complete sovereignty over user media and drastically reduced bandwidth egress costs.
Phase 2: The Brains — Agentic RAG and MCP
![Agentic RAG Pipeline]
The biggest challenge with conversational AI is "generic output syndrome": if the AI doesn't know the user's specific context, engagement plummets. To solve this, I moved away from linear RAG and built an Agentic RAG pipeline using LangGraph and Qdrant (our vector database).
This wasn't just pulling text chunks; it was a multi-step reasoning engine.
1. Query Reprompting (LLM Pre-Processing)
Users rarely ask perfect questions. If a user types, "Why did I fail yesterday?", a standard RAG system will search the database for the word "fail" and return useless results. To fix this, I implemented an LLM Query Rewriter. Before touching the database, a fast, lightweight LLM intercepts the user's message and rewrites it using recent chat history. "Why did I fail yesterday?" is autonomously expanded into: "Retrieve the user's habit tracking data and journal entries for [Date], specifically looking for reasons they did not complete their daily running goal." This dramatically increased the accuracy of our Qdrant vector searches.
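The rewriter itself is a thin prompt around a fast model. Here is a minimal Python sketch of the shape of that step; the function names and prompt wording are illustrative assumptions, and the LLM call is stubbed:

```python
from typing import List

# Hypothetical lightweight-LLM call. In production this would hit a fast,
# cheap model; here it is stubbed to echo the question for illustration.
def call_rewriter_llm(prompt: str) -> str:
    return prompt.rsplit("User question:", 1)[-1].strip()

def rewrite_query(user_message: str, recent_history: List[str]) -> str:
    """Expand a terse user message into a self-contained retrieval query."""
    history_block = "\n".join(recent_history[-10:])  # short-term context only
    prompt = (
        "Rewrite the final user question as an explicit, self-contained "
        "search query, resolving pronouns and relative dates using the "
        f"chat history below.\n\nHistory:\n{history_block}\n\n"
        f"User question: {user_message}"
    )
    return call_rewriter_llm(prompt)
```

The key design choice is that the rewriter never touches the vector database; it only consumes cheap, recent chat history, so the extra latency stays small.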
2. The Model Context Protocol (MCP) Integration
Text is only half the story. Users upload images of their meals, screenshots of their workouts, and log daily habits. To feed this into the AI, I implemented the Model Context Protocol (MCP). MCP acted as a standardized bridge, allowing the LangGraph agents to dynamically query external APIs—fetching the user's habit streaks from the PostgreSQL database or pulling image metadata from MinIO—and injecting it directly into the LLM's context window.
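Conceptually, that bridge is a registry of context-fetching tools the agent can invoke by name, with the results serialized into the prompt. A simplified Python sketch follows; the tool names and return shapes are assumptions, and the production version spoke MCP to PostgreSQL and MinIO rather than returning stubs:

```python
from typing import Callable, Dict, List

# Illustrative stand-in for an MCP-style tool registry.
TOOLS: Dict[str, Callable[[str], dict]] = {}

def tool(name: str):
    """Decorator that registers a context-fetching tool under a name."""
    def register(fn: Callable[[str], dict]) -> Callable[[str], dict]:
        TOOLS[name] = fn
        return fn
    return register

@tool("habit_streaks")
def habit_streaks(user_id: str) -> dict:
    # Production version would query PostgreSQL; stubbed here.
    return {"running": 12, "journaling": 4}

@tool("image_metadata")
def image_metadata(user_id: str) -> dict:
    # Production version would read object metadata from MinIO; stubbed here.
    return {"last_upload": "meal.webp", "count": 37}

def build_context(user_id: str, needed: List[str]) -> str:
    """Resolve the tools the agent requested and serialize them for the prompt."""
    parts = [f"{name}: {TOOLS[name](user_id)}" for name in needed if name in TOOLS]
    return "\n".join(parts)
```

Because the registry is data-driven, the LangGraph agent can decide at runtime which tools to call, and unknown tool names are simply skipped.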
The Result: The AI stopped sounding like a robot and started acting like a personalized coach. Goal-tracking engagement spiked by 35%.
Phase 3: The Memory Engine — Storing User Facts
![Memory Summarization Workflow]
If you stuff an entire month of chat history into an LLM prompt, you will hit context limits and rack up astronomical API bills. Yet, the AI must remember that the user is allergic to peanuts or is training for a marathon.
To achieve "infinite memory," I decoupled short-term chat from long-term facts.
- Short-Term Context: Only the last 10 messages are sent to the LLM directly.
- Long-Term Fact Extraction: I engineered an asynchronous background worker. Every night, it ingests the user's daily conversations and uses an LLM to extract concrete "facts" (e.g., "User expressed frustration with knee pain," "User prefers vegetarian meals").
- Fact Injection: These facts are embedded and stored in Qdrant. When the user asks a question, the Agentic pipeline queries these summarized facts and injects only the highly relevant ones into the system prompt. The AI remembers the user perfectly without the immense token overhead.
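The whole loop can be sketched in a few lines. This toy version swaps Qdrant's embedding search for naive keyword overlap and stubs the LLM extraction step, purely to show the data flow; every name here is illustrative:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FactStore:
    """Toy stand-in for the Qdrant collection holding extracted user facts."""
    facts: List[str] = field(default_factory=list)

    def upsert(self, fact: str) -> None:
        if fact not in self.facts:  # naive de-duplication
            self.facts.append(fact)

    def search(self, query: str, top_k: int = 3) -> List[str]:
        # Real system: embed `query` and run a vector search in Qdrant.
        # Here: crude keyword overlap, purely for illustration.
        scored = sorted(
            self.facts,
            key=lambda f: len(set(f.lower().split()) & set(query.lower().split())),
            reverse=True,
        )
        return scored[:top_k]

def extract_facts(daily_messages: List[str]) -> List[str]:
    # Stand-in for the nightly LLM extraction pass; a real worker would
    # prompt a model to emit concrete, third-person facts.
    return [m for m in daily_messages if m.startswith("FACT:")]
```

The important property is that the prompt only ever receives `top_k` facts, so token cost stays flat no matter how long the user's history grows.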
Phase 4: Real-World Network Resiliency
![Network Resiliency Architecture]
Architecting for the real world means acknowledging that mobile networks are terrible. Users walk into elevators, switch from WiFi to 4G, and drive through tunnels.
Initially, the platform used bidirectional WebSockets for real-time chat. However, WebSockets are highly fragile on unstable mobile connections. When the connection dropped, payloads were lost, resulting in silent failures and frustrated users.
The Fix: I completely ripped out WebSockets and replaced them with a highly resilient HTTP POST + Client-Side Polling architecture.
- When a user sends a message, it is an HTTP POST request. If the network drops, the mobile client simply retries the request seamlessly.
- The client then polls the server for the AI's response stream. Because HTTP is stateless, network drops no longer broke the application logic.

The Impact: Payload delivery failures dropped by an astonishing 98%.
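The client-side contract is simple enough to sketch directly. This Python version is illustrative only: the transport callbacks are stubbed, the retry and polling parameters are assumptions, and the real clients were mobile apps:

```python
import time
from typing import Callable, Optional

def send_with_retry(post: Callable[[dict], bool], payload: dict,
                    attempts: int = 5, backoff: float = 0.5) -> bool:
    """Retry an idempotent POST with exponential backoff.

    `post` is the transport callback: it returns True on a 2xx response.
    """
    for attempt in range(attempts):
        try:
            if post(payload):
                return True
        except ConnectionError:
            pass  # a dropped connection is treated like a failed attempt
        time.sleep(backoff * (2 ** attempt))
    return False

def poll_for_reply(fetch: Callable[[str], Optional[str]], message_id: str,
                   interval: float = 1.0, timeout: float = 30.0) -> Optional[str]:
    """Poll until the server has the AI's response, or give up at `timeout`."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        reply = fetch(message_id)
        if reply is not None:
            return reply
        time.sleep(interval)
    return None
```

Note that both functions are safe to interrupt and re-run, which is exactly the property WebSockets lacked: no in-flight connection state has to survive a tunnel or an elevator.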
Optimizing Media Ingestion:
Alongside chat resiliency, user media uploads (photos of meals and workouts) were consuming massive storage. I integrated sharp directly into the Node.js backend ingestion pipeline: before an image ever touched the MinIO cluster, it was dynamically compressed and converted to WebP. This cut our storage costs by 75% without noticeable quality loss.
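The production pipeline used sharp in Node.js; the equivalent transformation can be sketched in Python with Pillow. The dimensions and quality setting below are assumed values, not the production configuration:

```python
import io

from PIL import Image

def compress_to_webp(raw: bytes, max_dim: int = 1600, quality: int = 80) -> bytes:
    """Downscale and re-encode an uploaded image as WebP before storage."""
    img = Image.open(io.BytesIO(raw))
    img.thumbnail((max_dim, max_dim))  # preserves aspect ratio, never upscales
    out = io.BytesIO()
    img.save(out, format="WEBP", quality=quality)
    return out.getvalue()
```

Doing this at ingestion time, rather than on read, means every downstream consumer (thumbnails, the AI's image analysis, CDN delivery) benefits from the smaller payload.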
Phase 5: Observability and Continuous Delivery
![PGL Observability Stack]
You cannot scale a system you cannot see. Operating microservices blindly is a recipe for disaster.
To guarantee reliability, I deployed a custom PGL Stack (Prometheus, Grafana, Loki).
- Prometheus scraped real-time metrics from our FastAPI containers and MinIO nodes.
- Loki centralized all our distributed logs, allowing us to trace a single request's journey across the entire microservice ecosystem.
- Grafana visualized this telemetry, setting off automated Slack alerts if vector search latency spiked or API error rates climbed.
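The alerting rule behind those Slack messages boils down to a rolling percentile check. Here is a dependency-free sketch of that logic; the window size and latency budgets are assumed values, and in production Prometheus and Grafana handled this rather than application code:

```python
from collections import defaultdict, deque

WINDOW = 500  # samples retained per metric (assumed)
P95_LIMIT_MS = {"vector_search": 250.0, "api_request": 400.0}  # assumed budgets

samples = defaultdict(lambda: deque(maxlen=WINDOW))

def record(metric: str, latency_ms: float) -> None:
    """Append one latency observation to the metric's rolling window."""
    samples[metric].append(latency_ms)

def breached(metric: str) -> bool:
    """True when the rolling p95 exceeds the metric's latency budget."""
    data = sorted(samples[metric])
    if len(data) < 20:  # not enough signal to alert on yet
        return False
    p95 = data[int(0.95 * (len(data) - 1))]
    return p95 > P95_LIMIT_MS.get(metric, float("inf"))
```

Alerting on p95 rather than the mean is deliberate: a handful of slow agentic tool calls can ruin the experience for real users while leaving the average latency looking healthy.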
By tracking granular application telemetry, we could see exactly where users were experiencing UX drop-offs (e.g., an agentic tool call taking three seconds longer than it should).
Coupled with a rigorous, automated CI/CD pipeline, this observability allowed us to iteratively refine the application with extreme confidence. We slashed our deployment cycles by 80%, shipping smaller, safer updates daily, ultimately achieving a sustained 99.9% uptime.
The Evolution of an Engineer
Building this platform reinforced a core engineering philosophy: the best architecture isn't about using the flashiest new AI model. It is about how gracefully you connect that model to the real world.
From managing state across unstable mobile networks to engineering memory systems that bypass LLM token limits, the challenge of building AI SaaS is deeply rooted in traditional, highly scalable distributed system design.
Want to dive deeper into Agentic RAG or cloud-agnostic architecture? Let's connect! I am Ankit Jaiswal, a Senior Full Stack AI Engineer specializing in conceptualizing and delivering highly resilient, personalized AI platforms and scalable SaaS infrastructure.