← All posts
TechnologyJun 10, 2026· 10 min

RAG vs Fine-tuning: What Your Personal AI Actually Uses

Fine-tuning was the answer when models were small and data was static. For a device that has to remember your lease, your prescriptions, and yesterday's invoice, RAG is the only architecture that holds up.

By Digitec Team · yeongsil.digitecsolution.com
On this page+

Ask ten engineers how a personal AI device "remembers" your documents and you will get two camps. One says fine-tuning — train the model on your data so the knowledge is baked in. The other says retrieval-augmented generation — keep the model generic and pull the right chunks at query time. The camps argue past each other because they are answering different questions. This post separates the two, then explains why YeongSil is built on RAG and what that choice means for your privacy.

What is fine-tuning

Fine-tuning takes a pre-trained large language model and continues training it on a smaller, focused dataset. The weights of the model shift to encode the new information. After fine-tuning, the new knowledge is part of the model itself; you do not need to supply the source documents at inference time.

The classic use case is style and format adaptation. If you want a model to always answer in legal English, or to mimic the voice of a particular author, fine-tuning is the right tool. You feed it thousands of examples and the output distribution shifts.

The cost is real. A single fine-tuning run on a frontier-class open model is measured in hundreds to thousands of GPU-hours. Hugging Face's documentation on LoRA and QLoRA — the cheapest mainstream fine-tuning methods — still assumes you have at least a single high-end GPU and a curated training set in the tens of thousands of examples. Per-user fine-tuning at consumer scale is not economically possible today and probably will not be for the rest of the decade.

There is a second cost, often missed in product discussions. Fine-tuned knowledge is static at the moment of training. The day after you fine-tune, your model knows what the dataset knew on the day you ran the job. New information requires another training run. For a device that needs to ingest a new prescription, a new invoice, or a new email every day, this is the wrong tool.

What is RAG

Retrieval-augmented generation, introduced in the 2020 Lewis et al. paper at Facebook AI Research, takes the opposite approach. The model stays generic. The user's data lives in a separate vector database — an index of text chunks converted into numerical embeddings. At query time, the system finds the chunks most semantically relevant to the user's question, stitches them into the prompt as context, and asks the LLM to answer using that context.

The pipeline has four parts:

  • Ingestion. Documents are split into chunks (typically 200–800 tokens), each chunk is passed through an embedding model, and the resulting vectors are stored alongside the original text.
  • Retrieval. The user's query is embedded the same way. The vector database returns the top-k chunks whose embeddings are closest to the query embedding.
  • Augmentation. Retrieved chunks are inserted into the prompt as "here is the relevant context" before the user's actual question.
  • Generation. The LLM produces the answer grounded in the supplied context, ideally citing which chunks it used.

The model never memorises the documents. The documents are looked up. This is the same architectural pattern as a database with an index — the only difference is the lookup uses semantic similarity instead of exact keyword match.

Why RAG wins for personal memory

For a personal AI device, four properties matter more than raw model capability.

Updates are instant. When you hand your device a new document, it is searchable within seconds — no retraining, no downtime. With fine-tuning, the document would have to wait for the next training batch, which in practice never happens.

Per-user data stays per-user. Every user's vector database is its own isolated index. Nothing about your documents touches another user's session, and nothing about your documents shapes the model's weights for anyone else. Compare this to fine-tuning, where the only way to keep one user's data out of another user's outputs is to fine-tune a separate model per user — operationally impossible at scale.

Hallucinations are catchable. Because the model is asked to answer "using the following context," you can show the user which chunks were retrieved, with citations back to the source document. If the answer drifts from what the chunks actually say, the gap is visible. Fine-tuned knowledge has no source pointer — you cannot tell whether the model is remembering accurately or making it up.

The model can be swapped without losing memory. When a better open-source model ships next year, you can upgrade the LLM without re-indexing your library. Your memories are decoupled from the model that reads them. With fine-tuning, the memory is the model — upgrading means losing everything.

The trade-off RAG makes is on stylistic adaptation. A RAG system will answer in the default voice of the underlying model. If you want it to write like Hemingway, you still want fine-tuning. For a personal AI whose job is to recall and reason over your documents, the trade-off is one-sided.

How YeongSil uses RAG

YeongSil ships with an embedded vector store running on the device — no cloud round-trip required for retrieval. The pipeline runs locally on the Raspberry Pi 5's CPU; the only network call happens when the system needs to invoke the language model for generation (and that endpoint is OpenAI-compatible, so you can point it at a self-hosted model when those become small enough to run on-device in 2027–2028).

The chunking strategy is tuned for the kinds of documents personal users actually own: leases (long, dense, full of cross-references), medical letters (short, factual), invoices (semi-structured), and conversational notes (short, contextual). Each document type gets its own chunk size and overlap, set empirically rather than by a single one-size-fits-all rule.

Retrieval is hybrid — semantic similarity from the embedding model combined with BM25 keyword scoring, then re-ranked. This matters for personal data, because users often ask questions using the exact words on the document ("what was the deposit on the Gulberg apartment lease") that pure vector search can under-weight in favour of fuzzy semantic matches.

The system always returns citations alongside the spoken answer. When you ask "when does my prescription expire," YeongSil tells you the date and offers to read the line from the original letter. That is not a UX nicety; it is the core safety mechanism that comes free with the RAG architecture.

What this means for privacy

The privacy story for RAG is structurally better than for fine-tuning, but only if it is implemented carefully. Three commitments matter.

Your embeddings are yours. The numerical vectors that represent your documents are stored in your account's encrypted partition. They are never pooled with other users' embeddings, never used to train a shared model, and never sold. AES-256 encrypts the index at rest; the key is derived per-account.

Your documents never train any model. This is the bright line that distinguishes a RAG product from a "we'll fine-tune on your data for better personalisation" product. The former needs your data only at retrieval time. The latter needs your data at training time, which means it is in a training pipeline, which means it is one bug or one policy change away from leaking into another user's output.

Export and delete are one click. Because your memory is a separate database from the model, you can export it (or delete it) without affecting the rest of the system. The same is not true if your knowledge was fine-tuned in — there is no "delete" operation for weights once trained.

This is the same architectural commitment the EU's AI Act will likely formalise for high-risk personal AI products: separation of model and memory, with the user controlling the latter. RAG is not just the better engineering choice; it is the architecture that will survive the regulatory environment of 2027 and beyond.

If you want to see how this looks in practice, our post on [what makes a personal AI actually personal](/blog/what-makes-ai-personal) walks through the three properties — memory, presence, action — that turn a chatbot into a companion. And if you want to live with one, [join the waitlist](#waitlist).

Sources & further reading

  1. 01Retrieval-Augmented Generation for Knowledge-Intensive NLP TasksLewis et al., arXiv
  2. 02LoRA: Low-Rank Adaptation of Large Language ModelsHu et al., arXiv
  3. 03QLoRA: Efficient Finetuning of Quantized LLMsDettmers et al., arXiv
  4. 04Hugging Face PEFT documentationHugging Face
  5. 05EU AI Act — official textEuropean Commission

Be first to live with it.

Join 2,400+ people on the waitlist. Early members get 30% off launch price and priority shipping.