
XTechnology - Technology is beautiful, let's discover it together!

Building a RAG System in Node.js: Vector Databases, Embeddings & Chunking

Large Language Models (LLMs) are powerful, but they often lack real-time knowledge. Retrieval-Augmented Generation (RAG) bridges this gap by fetching relevant information from external sources before generating responses. In this workshop, we’ll explore how to build an efficient RAG pipeline in Node.js using RSS feeds as a data source. We’ll compare different vector databases (FAISS, pgvector, Elasticsearch), embedding methods, and testing strategies. We’ll also cover the crucial role of chunking—splitting and structuring data effectively for better retrieval performance.

Prerequisites

Code

Agenda

Introduction

Alex Korzhikov

alex korzhikov photo

Software Engineer, Netherlands

My primary interest is self-development and craftsmanship. I enjoy exploring technologies, coding open source and enterprise projects, teaching, speaking and writing about programming - JavaScript, Node.js, TypeScript, Go, Java, Docker, Kubernetes, JSON Schema, DevOps, Web Components, Algorithms 🎧 ⚽️ 💻 👋 ☕️ 🌊 🎾

Pavlik Kiselev

Pavlik

Software Engineer, Netherlands

JavaScript developer with full-stack experience and a passion for frontend. He happily works at ING in the Fraud Prevention department, where he helps protect the finances of ING customers.

About Everything

Generative Artificial Intelligence (GenAI)

Use of Artificial Intelligence to produce human-consumable content in text, image, audio, or video format.

Language Model (LM)

Language Models are Machine Learning models trained on natural language resources that aim to predict the next word based on a given context.

Typical LM tasks - summarization, Q&A, classification, and more.

Large Language Model (LLM)

Modern generative AI is powered by Large Language Models - language models trained on vast amounts of source material and having billions of parameters.

llm as blackbox

Retrieval Augmented Generation (RAG)

RAG is a method that combines information retrieval with language model generation.


Retrieval = Search

Generation = LLM

Use Cases:

async function rag(q) {
  // Retrieval: find documents relevant to the question
  const results = await search(q)
  // Augmentation: combine the question with the retrieved context
  const prompt = makePrompt(q, results)
  // Generation: let the LLM answer based on the augmented prompt
  const answer = await llm(prompt)
  return answer
}
Why do we need RAG?
- Additional, specific knowledge
- Reduce hallucinations
- Control system costs

Why can't we just put all the context into the LLM and ask it directly?
- Cost
- Slow
- Size (context window limits)
- Noise

What do we need to build a RAG?
- Application (backend)
- Search system (vector databases)
- LLM (blackbox)

Langchain 🦜️🔗

LangChain is a Python and JavaScript framework that provides flexible abstractions and an AI-first toolkit for building GenAI applications and integrating them with LLMs. It includes components for abstracting and chaining LLM prompts, configuring and using vector databases (for semantic search), loading and splitting documents (to analyze them and learn from them), parsing output, and more.
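
For example, the prompt-building step of the rag() function above can be abstracted with a prompt template. This is a minimal sketch using PromptTemplate from @langchain/core/prompts; the template wording and sample values are just illustrations:

const { PromptTemplate } = require("@langchain/core/prompts");

// A reusable prompt that injects retrieved context into the question
const prompt = PromptTemplate.fromTemplate(
  "Answer the question using only this context:\n{context}\n\nQuestion: {question}",
);

const text = await prompt.format({
  context: "Node.js releases a new major version every six months.",
  question: "How often does Node.js release a major version?",
});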

Setup

Practice #1 - Hello World

const { Ollama } = require("ollama");

// Assumes a local Ollama server with a pulled model (the model name may differ)
const ollama = new Ollama();
const response = await ollama.chat({
  model: "llama3.2",
  messages: [
    { role: "user", content: "What is retrieval-augmented generation?" },
  ],
});
console.log(response.message.content);

Use Case - Exploring Node.js News

We aim to understand what happened in the Node.js community over the past year. To achieve this, we:

  1. Collect and process news, popular blog posts, and articles
  2. Store the documents in a database
  3. Analyze and query the documents by asking targeted questions

How do we find sources?

How do we scrape articles?

Some scraping and text-quality problems
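
One possible collection step, sketched here with the rss-parser package (the feed URL is a placeholder, and full article text often still requires scraping the linked page):

const Parser = require("rss-parser");

const parser = new Parser();

// Fetch a feed and keep only the fields we need for later processing
async function collectArticles(feedUrl) {
  const feed = await parser.parseURL(feedUrl);
  return feed.items.map((item) => ({
    title: item.title,
    link: item.link,
    published: item.isoDate,
    text: item.contentSnippet, // often truncated - may need scraping for full text
  }));
}

const articles = await collectArticles("https://example.com/nodejs-feed.xml"); // placeholder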

Articles

Chunking

To chunk or not to chunk?

text splitter

Chunking involves breaking down texts into smaller, manageable pieces called “chunks”. Each chunk becomes a unit of information that is vectorized and stored in a database, fundamentally shaping the efficiency and effectiveness of natural language processing tasks.

How do we split documents?

text splitter example

const { CharacterTextSplitter } = require("@langchain/textsplitters");

// Split a document into 100-character chunks with no overlap between them
const textSplitter = new CharacterTextSplitter({
  chunkSize: 100,
  chunkOverlap: 0,
});
const texts = await textSplitter.splitText(document); // `document` is the article text

Practice #2 - Chunking

const { RecursiveCharacterTextSplitter } = require("@langchain/textsplitters");

// The recursive splitter tries to keep paragraphs, sentences, and words intact
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 100,
  chunkOverlap: 0,
});

const chunks = await splitter.splitText(text);

Store & Retrieve

what are embedding models

To embed or not?

An embedding is a vector representation of data in an embedding space (the high-dimensional space of the initial data is projected into a lower-dimensional vector space).

embeddings vs indexing

Vectors are stored in a database, which compares them to find data that is similar in meaning (using the dot product or cosine similarity).

embedding space
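
As a minimal sketch of what the database does under the hood, cosine similarity between two vectors can be computed like this:

// Cosine similarity: close to 1 means similar direction (similar meaning)
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineSimilarity([1, 0, 1], [1, 1, 1]); // ≈ 0.82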

Can we exchange embedding models with equal vector dimensions?
- Nope: even with matching dimensions, different models produce incompatible embedding spaces, and embedding models also evolve over time.
obvious reaction

Practice #3 - Store & Retrieve

const { Chroma } = require("@langchain/community/vectorstores/chroma");

// `texts` are the chunks from Practice #2; `embeddings` is the embedding model
const vectorStore = await Chroma.fromTexts(
  texts,
  { id: Array.from({ length: texts.length }, (_, i) => `chunk-${i}`) },
  embeddings,
  { collectionName: "rag_node_workshop_articles" },
);
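
Once chunks are stored, they can be retrieved by semantic similarity. similaritySearch is part of the LangChain vector store interface; the query string below is just an example:

const query = "What changed in the Node.js test runner?"; // example query
const docs = await vectorStore.similaritySearch(query, 4); // top 4 most similar chunks
console.log(docs.map((d) => d.pageContent));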

Reranking

reranking process

Rerankers analyze the initial search output and adjust the ranking to better match user preferences and application requirements.
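
A minimal sketch of the idea, assuming some scoreFn (for example, a cross-encoder model or an LLM call) that rates how well a document answers the query:

// Re-order retrieved documents by a relevance score (higher is better)
async function rerank(query, documents, scoreFn) {
  const scored = await Promise.all(
    documents.map(async (doc) => ({
      doc,
      score: await scoreFn(query, doc.pageContent),
    })),
  );
  return scored.sort((a, b) => b.score - a.score).map((entry) => entry.doc);
}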

Evaluation

RAG Triad

Evaluate the retriever and the generator of a RAG pipeline separately

Retriever

Hyperparameters:

Does your reranker model rank the retrieved documents in the “correct” order?

Metrics:

Generation

Hyperparameters:

Can we use a smaller, faster, cheaper LLM?

Metrics:

Frameworks

Our choice

Practice #4 - Evaluation

const relevantDocs = await getRelevantDocuments(vectorStore, summary);
// A retrieved chunk is relevant if it matches the chunk the summary was made from
const relevance = relevantDocs.map((d) => d.pageContent === chunk);
console.log(hitRate(relevance));
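
getRelevantDocuments and hitRate are workshop helpers. As an assumption about the latter (not necessarily the workshop's exact implementation): a single query "hits" if at least one retrieved chunk is the expected one, and the hit rate is this value averaged over many queries:

// For one query: 1 if any retrieved chunk matched the expected one, else 0;
// averaging these values over many queries gives the overall hit rate
function hitRate(relevance) {
  return relevance.some(Boolean) ? 1 : 0;
}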

Summary

Feedback

Please share your feedback on the workshop. Thank you, and happy coding!

If you like the workshop, you can become our patron, yay! 🙏

Technologies