Faster Retrieval Augmented Generation on a Frontend Stack

Pushkar Yadav

June 8, 2024


No single LLM can hold all the information in the world, so we need a way to retrieve information from a data store and then generate the output from it. This is where RAG (Retrieval Augmented Generation) comes into the picture.

I've developed AI Speed, a project that uses OpenAI's text-embedding-3-small to retrieve information from a Pinecone vector store and Groq-powered Llama 3 to generate the output. This makes it context-aware and super fast, since Groq-hosted llama-3-8b is the fastest option available right now (~1,250 output tokens per second).
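Here's a minimal sketch of that pipeline in TypeScript, assuming the standard openai, @pinecone-database/pinecone, and groq-sdk clients with API keys in the environment; the index name "ai-speed-index" and the "text" metadata field are hypothetical:

```ts
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";
import Groq from "groq-sdk";

// Clients read OPENAI_API_KEY, PINECONE_API_KEY, GROQ_API_KEY from the environment.
const openai = new OpenAI();
const pinecone = new Pinecone();
const groq = new Groq();

async function answer(query: string): Promise<string> {
  // 1. Embed the user query with text-embedding-3-small.
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });

  // 2. Retrieve the nearest chunks from the Pinecone index
  //    ("ai-speed-index" and the "text" metadata field are hypothetical).
  const results = await pinecone.index("ai-speed-index").query({
    vector: data[0].embedding,
    topK: 5,
    includeMetadata: true,
  });
  const context = results.matches
    .map((m) => String(m.metadata?.text ?? ""))
    .join("\n---\n");

  // 3. Generate a grounded answer with Groq-hosted Llama 3 8B.
  const completion = await groq.chat.completions.create({
    model: "llama3-8b-8192",
    messages: [
      {
        role: "system",
        content: `Answer using only the following context:\n${context}`,
      },
      { role: "user", content: query },
    ],
  });
  return completion.choices[0].message.content ?? "";
}
```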

With ~5M+ tokens of publicly available data embedded into the index, it generates pretty accurate, context-aware output. It can also produce code snippets and other structured output usable in real-world applications.

What's next? Growing the data set, making it more context-aware and faster, and adding checks to ensure the output is correct and not hallucinated.

Live here: ai-speed.vercel.app

Available here: pushkarydv/ai-speed. The first release tweet is attached below.
