Arham Humayun

Unbreakable AI Chat: Streaming Responses with Convex + Vercel AI SDK

How I built a persistent, reliable chat system with Convex and the AI SDK

Published on: 2025-04-30

Streaming AI responses is easy until something goes wrong. Maybe your user loses internet, or your server hiccups. Suddenly, your beautiful real-time chat breaks. Lost messages and broken UX with no recovery. I built a solution using Convex and the Vercel AI SDK that solves this with minimal complexity and no database spam.

After evaluating several options, I decided to build on top of the Vercel AI SDK. While it's not perfect out of the box, its clean interface for LLM integration and structured generation made it the best foundation to build upon. The challenge would be extending it to handle the persistence and reliability requirements of a production chat system.

Why I Picked the Vercel AI SDK (Despite Its Limits)

The Vercel AI SDK gives you easy model swapping, structured generation, and a clean interface. But useChat() is limiting for production chat UIs that need custom streaming and persistence.

The F5 Problem

Most AI chat UIs fail this basic resilience test. Refresh the page mid-stream or lose your connection, and you either lose the message entirely or have to wait until it's fully generated, then refresh again to see it.
Here's how some of today's top platforms stack up:

| Platform | Result | Details |
| --- | --- | --- |
| chatgpt.com | ⚠️ Partial Pass | Requires another refresh to see the complete response |
| claude.ai | ❌ Fail | Response is lost entirely when the page is refreshed |
| T3.chat | ❌ Fail | Response is stuck at the point of interruption |
| chat.vercel.ai | ❌ Fail | Response is lost entirely when refreshing the page |
| chatprd.ai | ❌ Fail | Shows error alert for in-progress response, then loses it |
| Perplexity | ⚠️ Partial Pass | Requires another refresh to see the complete response |
| sdk.vercel.ai/playground | ❌ Fail | Chat is lost entirely on refresh |

As of 2025-04-28. Tested on Chrome.

To get both real-time streaming and persistence, you need to decouple the LLM stream from the HTTP request and persist every chunk. Most frameworks don't help you do this. The Vercel AI SDK assumes Node and Next.js API routes. If you use Convex or another backend, you end up streaming across systems, adding latency and risk.

How does Convex help?

Convex lets you put LLM logic right next to your database, which means minimal network hops and fast data operations.

Saving Every Token?

Saving every token chunk to Convex means far too many database writes. A job queue could help, but then you lose the streaming feel. My solution: buffer tokens in memory and post a message chunk to Convex every 200ms.
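To make the pattern concrete, here's a minimal sketch of the buffering idea in isolation (names like `saveChunk` are illustrative; the real implementation, with a minimum chunk size and retries, is in step 4 below):

```ts
// Minimal sketch: accumulate streamed tokens and flush at most once per interval.
// `saveChunk` stands in for the Convex mutation that persists a message chunk.
const FLUSH_INTERVAL_MS = 200;

async function streamWithBuffer(
  textStream: AsyncIterable<string>,
  saveChunk: (content: string) => Promise<void>,
) {
  let buffer = "";
  let lastFlush = Date.now();

  for await (const token of textStream) {
    buffer += token;
    // Write to the database at most every ~200ms, not on every token.
    if (Date.now() - lastFlush >= FLUSH_INTERVAL_MS) {
      await saveChunk(buffer);
      buffer = "";
      lastFlush = Date.now();
    }
  }

  // Persist whatever is left once the stream ends.
  if (buffer) await saveChunk(buffer);
}
```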

The Solution

Here's a brief overview of the solution I came up with. I'll go into more detail in the steps below.

We're going to store the response as message chunks in the database, and Convex's reactive queries will push each chunk to the client as soon as it's written.

1. Define your message schema

This is the schema I used for the chat. We're going to need a message thread, a message, and a message chunk. The message thread is just a title. The message is the user's message or the AI's response. The message chunk is a chunk of the AI's response.

```ts
import { defineTable } from "convex/server";
import { v } from "convex/values";

const messageThreads = defineTable({
  title: v.string(),
}).index("by_title", ["title"]);

const messageChunks = defineTable({
  content: v.string(),
  messageId: v.id("messages"),
}).index("by_messageId", ["messageId"]);

const messages = defineTable({
  isComplete: v.optional(v.boolean()),
  role: v.union(v.literal("user"), v.literal("assistant")),
  threadId: v.id("messageThreads"),
}).index("by_threadId", ["threadId"]);

const userSettings = defineTable({
  userId: v.id("users"),
  role: v.union(v.literal("free"), v.literal("pro"), v.literal("admin")),
}).index("by_userId", ["userId"]);
```
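Assuming the standard Convex setup, these tables would then be registered in `convex/schema.ts` roughly like this (a sketch, not shown in the original):

```ts
import { defineSchema } from "convex/server";

// Register the tables above so Convex generates types and indexes for them.
export default defineSchema({
  messageThreads,
  messageChunks,
  messages,
  userSettings,
});
```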

2. Start the chat with startChatMessagePair.

From the client, we kick things off by calling the action with useAction:

```tsx
// In your client component
const startChat = useAction(api.chat.startChatMessagePair);

const sendMessage = async () => {
  if (!input.trim()) return;
  setIsSending(true);
  await startChat({
    threadId,
    content: input,
  });
  setInput('');
  setIsSending(false);
};

const handleSubmit = async (e: React.FormEvent) => {
  e.preventDefault();
  await sendMessage();
  formRef.current?.reset();
};
```

3. Save the user's message and initiate the LLM response

This Convex action saves the user's message, creates an empty assistant message, and schedules the LLM job:

```ts
export const startChatMessagePair = action({
  args: {
    threadId: v.id('messageThreads'),
    content: v.string(),
    saveId: v.id('saves'),
  },
  returns: v.object({
    assistantMessageId: v.id("messages"),
  }),
  handler: async (ctx, { threadId, content, saveId }) => {
    // 1. Save the user's message
    await ctx.runMutation(api.messages.createMessage, {
      threadId,
      role: 'user',
      content,
      isComplete: true,
    });

    // 2. Create an empty assistant message to stream into
    const assistantMessageResult = await ctx.runMutation(api.messages.createMessage, {
      threadId,
      role: 'assistant',
      content: '',
      isComplete: false,
    });
    const assistantMessageId: Id<"messages"> = assistantMessageResult;

    // 3. Schedule the LLM job immediately
    await ctx.scheduler.runAfter(0, internal.llm.generateAssistantMessage, {
      threadId,
      content,
      assistantMessageId,
      saveId,
    });

    return { assistantMessageId };
  },
});

export const createMessage = mutation({
  args: {
    role: v.union(v.literal("user"), v.literal("assistant")),
    threadId: v.id("messageThreads"),
    isComplete: v.boolean(),
    content: v.optional(v.string()),
  },
  handler: async (ctx, args) => {
    const newMessageId = await ctx.db.insert("messages", {
      threadId: args.threadId as Id<"messageThreads">,
      role: args.role,
      isComplete: args.isComplete,
    });
    if (args.content) {
      await ctx.db.insert("messageChunks", {
        messageId: newMessageId,
        content: args.content,
      });
    }
    return newMessageId;
  },
});
```

4. Generate the response in internal.llm.generateAssistantMessage

This internal Convex action streams tokens from the LLM, buffers them in memory, and flushes chunks to the database on a timer:

```ts
// Minimum chunk size to reduce database writes
const MIN_CHUNK_SIZE = 20;
const FLUSH_INTERVAL = 200; // ms

export const generateAssistantMessage = internalAction({
  args: {
    threadId: v.id("messageThreads"),
    content: v.string(),
    assistantMessageId: v.id("messages"),
    saveId: v.id("saves"),
  },
  handler: async (ctx, args) => {
    try {
      const messages = await ctx.runQuery(api.messages.getMessages, {
        threadId: args.threadId,
      });

      const fullPrompt = [
        ...messages.map((m) => ({
          role: m.role,
          content: m.messageChunks.map(chunk => chunk.content).join(''),
        })),
        { role: "user", content: args.content },
      ];

      const result = streamText({
        model: openai("gpt-4.1-mini"),
        system: `You are a helpful assistant.`,
        messages: fullPrompt as CoreMessage[],
      });

      let buffer = "";
      let lastFlushTime = Date.now();
      let flushTimeout: NodeJS.Timeout | null = null;

      const flush = async (force = false) => {
        if (!force && (buffer.length < MIN_CHUNK_SIZE || Date.now() - lastFlushTime < FLUSH_INTERVAL)) {
          return;
        }
        if (buffer.length === 0) return;

        const contentToFlush = buffer;
        buffer = "";
        flushTimeout = null;
        lastFlushTime = Date.now();

        try {
          await ctx.runMutation(api.messages.createMessageChunk, {
            messageId: args.assistantMessageId,
            content: contentToFlush,
          });
        } catch (error) {
          console.error("Failed to save message chunk:", error);
          // In case of error, add content back to buffer
          buffer = contentToFlush + buffer;
          // Retry after a short delay
          await new Promise(resolve => setTimeout(resolve, 1000));
          await flush(true);
        }
      };

      for await (const chunk of result.textStream) {
        if (chunk) {
          buffer += chunk;

          // Schedule a flush if not already scheduled
          if (!flushTimeout) {
            flushTimeout = setTimeout(() => flush(), FLUSH_INTERVAL);
          }

          // Force flush if buffer gets too large
          if (buffer.length >= MIN_CHUNK_SIZE * 2) {
            if (flushTimeout) {
              clearTimeout(flushTimeout);
              flushTimeout = null;
            }
            await flush(true);
          }
        }
      }

      // Final cleanup
      if (flushTimeout) {
        clearTimeout(flushTimeout);
      }

      // Force flush any remaining content
      await flush(true);

      // Mark message as complete
      await ctx.runMutation(api.messages.updateMessage, {
        messageId: args.assistantMessageId,
        isComplete: true,
      });
    } catch (error) {
      console.error("Error in generateAssistantMessage:", error);

      // Mark message as complete but with error state
      await ctx.runMutation(api.messages.updateMessage, {
        messageId: args.assistantMessageId,
        isComplete: true,
      });

      // Add error message as final chunk
      await ctx.runMutation(api.messages.createMessageChunk, {
        messageId: args.assistantMessageId,
        content: "\n\nI apologize, but I encountered an error while generating the response. Please try again.",
      });

      throw error; // Re-throw to trigger Convex's error handling
    }
  },
});
```
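The action above relies on `api.messages.createMessageChunk` and `api.messages.updateMessage`, which aren't shown in the post. Based on the schema above, they would look roughly like this (a sketch, not the original implementation):

```ts
// Sketch of the two mutations referenced by generateAssistantMessage.
export const createMessageChunk = mutation({
  args: {
    messageId: v.id("messages"),
    content: v.string(),
  },
  handler: async (ctx, args) => {
    // Append another chunk of the assistant's response.
    await ctx.db.insert("messageChunks", {
      messageId: args.messageId,
      content: args.content,
    });
  },
});

export const updateMessage = mutation({
  args: {
    messageId: v.id("messages"),
    isComplete: v.boolean(),
  },
  handler: async (ctx, args) => {
    // Flip the completion flag once streaming finishes (or fails).
    await ctx.db.patch(args.messageId, { isComplete: args.isComplete });
  },
});
```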

5. Use useQuery to stream the response

Convex's useQuery is reactive, so the UI updates as soon as each new chunk is saved.

```tsx
// Convex query:
export const getMessages = query({
  args: {
    threadId: v.id("messageThreads"),
    limit: v.optional(v.number()),
  },
  handler: async (ctx, args) => {
    // Get the most recent messages first, limited if specified
    const query = ctx.db
      .query("messages")
      .withIndex("by_threadId", (q) => q.eq("threadId", args.threadId))
      .order("desc");

    const messages = await (args.limit ? query.take(args.limit) : query.collect());
    messages.reverse(); // Put them back in chronological order

    // Fetch chunks for each message
    const messagesWithChunks = await Promise.all(
      messages.map(async (message) => {
        const chunks = await ctx.db
          .query("messageChunks")
          .withIndex("by_messageId", (q) => q.eq("messageId", message._id))
          .order("asc")
          .collect();
        return { ...message, messageChunks: chunks };
      })
    );

    return messagesWithChunks;
  },
});

// Client component:
// Memoized Message component to optimize re-renders
const Message = memo(({ role, content, isComplete = true }: {
  role: 'user' | 'assistant',
  content: string,
  isComplete?: boolean,
}) => (
  <div className={cn(
    "mb-4 flex w-full",
    role === "assistant" ? "justify-start" : "justify-end"
  )}>
    <div className={cn(
      "rounded-lg px-4 py-3",
      role === "assistant" ? "w-full" : "max-w-[90%] bg-primary text-primary-foreground"
    )}>
      <Markdown>{content}</Markdown>
      {!isComplete && (
        <span className="inline-block h-4 w-1 animate-pulse bg-current" />
      )}
    </div>
  </div>
));

// ...

export default function Chat() {
  const messages = useQuery(api.messages.getMessages, {
    threadId,
    limit: MESSAGE_LIMIT,
  });

  /* Rest of the chat code. */

  return (
    <div>
      {/* ... */}
      {messages?.map(message => (
        <Message
          key={message._id}
          role={message.role}
          content={message.messageChunks.map(chunk => chunk.content).join('')}
          isComplete={message.isComplete}
        />
      ))}
    </div>
  );
}
```

Results

The result is a simple implementation that keeps the streaming feel even across disconnects while keeping database load under control. The system preserves message history, recovers gracefully from refreshes and dropped connections, and stays smooth without overwhelming the database, since writes are capped at one every 200ms.

Streaming Chat

Example of the streaming chat in action. The grey flashes are me refreshing the page.

Final Thoughts

If you're building real AI products, you can't afford to ignore resilience. The longer a response takes, the greater the risk of losing it. This system works well for me, but it's not a silver bullet. If your use case demands token-level persistence (like hallucination tracking or audit logs), you'll need to tweak the flush cadence or add extra logic.

If you're building something real with AI, don't just trust the magic SDKs. Understand what's happening under the hood. Own your infra.

What This is For: Veilborn

This system powers Veilborn, an infinite, AI-enhanced, text-based RPG for the web.

It's ambitious, but I've never been more excited about a project.

Also, if you're into TTRPGs, check out my side project: MonsterLabs.app, where you can create AI-driven monsters and items for D&D. I also wrote about it here.