Ask HN: What is the best way to provide continuous context to models?

Given the research done to date, what, in your view, is the best way to provide context to a model? Are there any articles that go into depth on how Cursor does it?

How do context collation companies work?

71 points

nemath

a day ago


41 comments

vivekraja a day ago

I think the emerging best way is to do "agentic search" over files. If you think about it, Claude Code is quite good at navigating large codebases and finding the required context for a problem.

Further, instead of polluting the context of your main agent, you can run a subagent to do the search, retrieve the important bits of information, and report back to your main agent. This is what Claude Code does if you use the keyword "explore": it starts a subagent with Haiku, which reads tens of thousands of tokens in seconds.

From my experience, the only shortcoming of this approach right now is that it's slow, and sometimes Haiku misses some details in what it reads. These will get better very soon (in one or two generations, we will likely see Opus 4.5-level intelligence at Haiku speed/price). For now, if not missing a detail is important for your use case, you can give the output from the first subagent to a second one and ask the second one to find important details the first one missed. I've found this additional step catches most things the first search missed. You can try this for yourself with Claude Code: ask it to create a plan for your spec, then pass the plan to a second Claude Code session and ask it to find gaps and missing files in the plan.
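
To make the shape concrete, here's a minimal sketch of the pattern (not how Claude Code is actually wired), assuming an OpenAI-style chat API; the model names, prompts, and the crude "dump every .py file" retrieval are placeholders:

    from openai import OpenAI
    from pathlib import Path

    client = OpenAI()

    def ask(model, prompt):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def search_subagent(question, repo):
        # A cheap, fast model burns tokens reading files; only its short report survives.
        corpus = "\n\n".join(
            f"### {p}\n{p.read_text(errors='ignore')[:4000]}"
            for p in Path(repo).rglob("*.py")
        )
        return ask("gpt-4o-mini",
                   f"Question: {question}\n\nFiles:\n{corpus}\n\n"
                   "Report only the files, symbols and snippets relevant to the question.")

    def answer(task, repo):
        findings = search_subagent(task, repo)
        # Optional second pass, as suggested above: ask what the first search missed.
        gaps = ask("gpt-4o-mini",
                   f"Task: {task}\n\nFindings:\n{findings}\n\n"
                   "List any important files or details this report seems to be missing.")
        # The main model only ever sees the two short reports, never the raw files.
        return ask("gpt-4o",
                   f"Task: {task}\n\nContext from search:\n{findings}\n\nPossible gaps:\n{gaps}")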

  • nl a day ago

    Gemini 3 Flash is very good at the search task (it benchmarks quite close to 3 Pro on coding tasks but is much faster). I believe Amp switched to Gemini Flash for their search agent because it is better.

    • esperent a day ago

      I very much doubt this. I've been using Gemini whenever I hit Codex limits over the past week. 3 Pro is very good - a bit behind Codex but still very useful. I've tried 3 Flash several times, and each time what I got back was complete garbage. After the third or fourth attempt I stopped trying.

    • theshrike79 a day ago

      Gemini the model is good, Gemini the framework around the model is a steaming pile of crap.

      • woggy a day ago

        Yup. Gemini CLI needs a lot of work.

        • esperent a day ago

          I actually defended Gemini CLI when someone said this a few days ago. Then Murphy's law hit me with bug after bug, all of which have been reported many times already, going back a year or more. I keep having to completely clear out the ~/.gemini folder, and then it works again for a while.

        • theshrike79 a day ago

          I'm pretty sure Anthropic is the only company that actually dogfoods their CLI tool.

          Gemini seems like an afterthought an intern did; Codex has cool features, but none of them align with real-world needs.

  • theshrike79 a day ago

    You can also create per-project agents with specific "expertise" in different parts of the code.

    Basically they're just a few kilobytes of text that are given as extra context to "explore" agents when looking at specific parts of the code.

Agent_Builder 4 hours ago

We tried “continuous context” early on and it mostly created entropy. Context kept growing, signal degraded, and the model got confidently wrong in subtle ways.

What worked better for us while building GTWY was doing the opposite. Context is disposable. Each step rebuilds only what it actually needs, with explicit inputs and outputs.

Long-lived context feels efficient, but in production it tends to hide bugs and amplify hallucinations. Dropping context aggressively between steps made failures obvious instead of mysterious.

bluegatty a day ago

Every time you send a request to a model you're already providing all of the context history along with it. To edit the context, just send a different context history. You can send whatever you want as history, it's entirely up to you and entirely arbitrary.

We only think in conversational turns because that's what we've expected a conversation to 'look like'. But that's just a very deeply ingrained convention.

Forget that there is such a thing as 'turns' in an LLM convo for now; imagine that it's all 'one-shot'.

So you ask A, it responds A1.

But when you ask B, and expect B1 - which depends on A and A1 already being in the convo history - consider that you are actually sending all of that again anyhow.

Behind the scenes, when you think you're sending just 'B' (the next prompt), you're actually sending A + A1 + B, aka including the history.

A and A1 are usually 'cached', but that's not the simplest way to think about it; the caching is just an optimization.

Without caching, the model would just process all of A + A1 + B and return B1 just the same.

And then A + A1 + B + B1 + C and expect C1 in return.

It just so happens that it will cache the state of the convo at your previous turn, so it's optimized, but the key insight is that you can send whatever context you want at any time.

After you send A + A1 + B + B1 + C and get C1, if you want to then send A + B + C + D and expect D1 (basically sending the prompts with no responses) - you can totally do that. It will have to re-process all of that, aka no cached state, but it will definitely do it for you.

Heck you can send Z + A + X, or A + A1 + X + Y - or whatever you want.

So in that sense, what you are really doing (if you're using the simplest form of the API) is sending 'a bunch of content' and 'expecting a response'. That's it. Everything is actually 'one shot' (prefill => response). It feels conversational, but that's just structural and operational convention.

So the very simple answer to your question is: send whatever context you want. That's it.
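
To make that concrete, here's a sketch assuming an OpenAI-style chat completions API (the model name and prompts are made up); the "conversation" is just whatever list you decide to send on each call:

    from openai import OpenAI

    client = OpenAI()

    def complete(messages):
        resp = client.chat.completions.create(model="gpt-4o", messages=messages)
        return resp.choices[0].message.content

    A = {"role": "user", "content": "Summarize the auth module."}
    A1 = {"role": "assistant", "content": complete([A])}

    # "Turn two" is really just another one-shot call that happens to include A and A1.
    B = {"role": "user", "content": "Now list its external dependencies."}
    B1 = {"role": "assistant", "content": complete([A, A1, B])}

    # Nothing forces you to keep the history intact: drop responses, reorder, inject whatever.
    C = {"role": "user", "content": "Given all of the above, propose a refactor."}
    C1 = complete([A, B, C])   # the prompts with no responses, as described above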

  • Fazebooking a day ago

    Bigger context makes responses slower.

    Context is limited.

    You do not want the cloud provider running context compaction when you can control it a lot better yourself.

    There are even tips on where to put the question, like "send the content first, then ask the question" vs. "ask the question, then send the content".

    • bluegatty 10 minutes ago

      When the history is cached, conversations tend not to be slower, because the LLM can 'continue' from a previous state.

      So if there was already A + A1 + B + B1 + C + C1 and you ask 'D' ... well, [A->C1] is saved as state. It costs 10ms to prepare. Then they add 'D' as your question, and that will be done 'all tokens at once' in bulk - which is fast.

      Then, when they generate D1 (the response), they have to do it one token at a time, which is slow. Each token has to be generated separately.

      Also, even if they had to redo all of [A->C1] 'from scratch', it's not that slow, because the entire block of tokens can be processed in one pass.

      'prefill' (aka A->C1) is fast, which by the way is why it's 10x cheaper.

      So prefill is 10x faster than generation, and cache is 10x cheaper than prefill as a very general rule of thumb.

  • _bobm a day ago

    This is how I view it as well.

    And... and...

    This results in a _very_ deep implication, which big companies may not be eager to let you see:

    they are context processors

    Take it for what it is.

    • derrida a day ago

      What you are trying to say is they are plagiarists and training on the input?

      We know that already; I don't know why we have to be quiet or hint at it. In fact, they have been quite explicit about it.

      Or is there some other context to your statement? Anyway that’s my “take that for what you will”.

swid a day ago

If you know you will be pruning or otherwise reusing the context across multiple threads, the best place for context that will be retained is at the beginning due to prompt caching - it will reduce the cost and improve the speed.

If not, inserting new context any place other than at the end will cause cache misses and therefore slow down the response and increase cost.

Models also have some bias for tokens at start and end of the context window, so potentially there is a reason to put important instructions in one of those places.
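
A small sketch of that ordering rule, using an OpenAI-style message list (the names and content are placeholders):

    SYSTEM_PROMPT = "You are a coding assistant for project X."      # placeholder
    PROJECT_DOCS = "(long-lived docs reused across threads)"         # placeholder

    STABLE_PREFIX = [
        {"role": "system", "content": SYSTEM_PROMPT},   # byte-identical on every request
        {"role": "user", "content": PROJECT_DOCS},      # retained context goes first
    ]

    def build_request(history, new_user_msg):
        # Cache-friendly: everything new lands after the shared, unchanging prefix.
        return STABLE_PREFIX + history + [{"role": "user", "content": new_user_msg}]

    def build_request_cache_hostile(history, new_user_msg, extra_doc):
        # Inserting new material before the tail changes the prefix, so everything
        # from that point onward is a cache miss: slower and more expensive.
        return ([{"role": "system", "content": SYSTEM_PROMPT + "\n" + extra_doc}]
                + history + [{"role": "user", "content": new_user_msg}])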

  • catlifeonmars a day ago

    I wonder how far you can take that. Basically, can you jam a bunch of garbage in the middle and still get useful results?

HarHarVeryFunny a day ago

One thing Cursor does differently from some other agents, such as Claude Code, in managing context is to use a vector database of code chunks, so that it can selectively load relevant chunks into context rather than entire source files.

Another way to control context size (not specific to Cursor) is to use subagents with their own context for specific tasks, so that the subagent context can be discarded when done rather than just adding to the agent's main context.

If context gets too full (performance may degrade well before you hit LLM max context length), then the main remedy is to compact - summarize the old context and discard. One way to prevent this from being too disruptive is to have the agent maintain a TODO list tracking progress and what it is doing, so that it can better remain on track after compaction.
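
Neither of these is Cursor's actual implementation, but the chunk-index idea is small enough to sketch, assuming an OpenAI-style embeddings API (the model name is illustrative and the index is just kept in memory):

    from pathlib import Path
    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([d.embedding for d in resp.data])

    def chunk_repo(repo, size=40):
        # Split each source file into fixed-size line chunks, keeping the path.
        chunks = []
        for p in Path(repo).rglob("*.py"):
            lines = p.read_text(errors="ignore").splitlines()
            for i in range(0, len(lines), size):
                chunks.append((str(p), "\n".join(lines[i:i + size])))
        return chunks

    def relevant_chunks(question, chunks, k=8):
        # Cosine similarity between the question and every chunk; load only the top k.
        vecs = embed([c[1] for c in chunks])
        q = embed([question])[0]
        scores = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
        return [chunks[i] for i in np.argsort(-scores)[:k]]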

  • kristianp 16 hours ago

    A vector database of code chunks sounds like it would have advantages over agentic search: less reimplementation of code that already exists in your codebase, and fewer things missed by grep. It should be faster too; I get impatient watching CC do the same set of searches every time it's launched.

d4rkp4ttern a day ago

Specifically for coding agents, one issue is how to continue work when you've almost filled the context window.

Compaction always loses information, so I use an alternative approach that works extremely well, based on this almost silly idea — your original session file itself is the golden source of truth with all details, so why not directly leverage it?

So I built the aichat feature in my claude-code-tools repo with exactly this in mind: the aichat rollover option puts you in a fresh session with the original session path injected, and you use subagents to recover any arbitrary detail at any time. Now I keep auto-compact turned off and never compact.

https://github.com/pchalasani/claude-code-tools?tab=readme-o...

It’s a relatively simple idea; no elaborate “memory” artifacts, no discipline or system to follow; just work until 95%+ context usage.

The tool (with the related plugins) makes it seamless: first type “>resume” in your session (this copies the session id to the clipboard), then quit and run

    aichat resume <pasted session id>
This launches a TUI offering a few ways to resume your work, one of which is “rollover”; it puts you in a new session with the original session's jsonl path injected. Then, in the new session, say something like:

“There is a chat session log file path shown to you; Use subagents strategically to extract details of the task we were working on at the end of it”, or use the /recover-context slash command. If it doesn’t quite get all of it, prompt it again for specific details.

There’s also an aichat search command for fast Rust/Tantivy-based full-text search across sessions, with a TUI for humans and a CLI/JSON mode for agents/subagents. The latter (and the corresponding skill and subagent) can be used to recover arbitrarily detailed context about past work.

ako a day ago

What works best for me using Claude Code is to let CC engineer its own context. You need to provide it with tools that it can use to engineer its context. CC comes with a lot of tools already (grep, sed, curl, etc.), but for specific domains you may want to add more, e.g. access to a database, a CMS, a parser for a bespoke language, etc.

With these I'll mostly just give it questions: what are some approaches to implement x, what are the pros and cons, what libraries are available to handle x? What data would you need to create x screen, or y report? And then let it google things, or run queries on your data.

I'll have it create markdown documents or skills to persist the insights it comes back with that will be useful in the future.

LLMs are pretty good at plan/do/check/act: create a plan (maybe to run a query to see what tables you have in your database), run the query, understand the output, and then determine the next step.

Your main goal should be to enable the PDCA loop of the LLM through tools you provide.
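
As a toy sketch of that loop with one bespoke tool (a database query): the model name, the "SQL:" convention, and the database are all invented for illustration, and Claude Code's real tool protocol is far richer than this.

    import sqlite3
    from openai import OpenAI

    client = OpenAI()
    db = sqlite3.connect("app.db")   # stand-in for whatever domain data you expose

    def ask(messages):
        r = client.chat.completions.create(model="gpt-4o", messages=messages)
        return r.choices[0].message.content

    def pdca(question, max_steps=5):
        messages = [
            {"role": "system", "content": (
                "Answer the user's question. If you need data first, reply with a "
                "single line 'SQL: <query>' and you will get the result back.")},
            {"role": "user", "content": question},
        ]
        for _ in range(max_steps):
            reply = ask(messages)                                   # plan
            messages.append({"role": "assistant", "content": reply})
            if not reply.strip().startswith("SQL:"):
                return reply                                        # act: final answer
            rows = db.execute(reply.strip()[4:]).fetchmany(50)      # do: run the tool
            messages.append({"role": "user",
                             "content": f"Query result: {rows}"})   # check: feed result back
        return reply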

bob1029 a day ago

Tool calling + recursion seems to be the answer. Two tools manipulate the logical call stack: call and return. The trick is to not permit the use of any meaningful tools at the root of the recursion, but to always make their descriptions available. For instance, the root can't QueryWidgets or ExecuteShell, but any descendant of it can.

These constraints result in token-hungry activity being confined to child scopes that are fully isolated from their parents. The only way to communicate between stack frames is by way of the arguments to call() and return(). Theoretically, recursive dispatch gives us exponential scaling of effective context size as we descend into the call graph. It also helps to isolate bad trips and potentially learn from them.
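
A toy sketch of the stack-frame idea, with the model's decisions stubbed out; the tools are placeholders, and the only thing it demonstrates is the gating and the call()/return() boundary between frames:

    TOOLS = {
        "QueryWidgets": lambda arg: f"(rows matching {arg!r})",   # placeholder tools
        "ExecuteShell": lambda arg: f"(output of {arg!r})",
    }

    def fake_actions(task, depth):
        # Stand-in for the LLM deciding what to do in this stack frame.
        if depth == 0:
            return [("call", f"gather widget data for: {task}"),
                    ("return", "summary for the user")]
        return [("QueryWidgets", task), ("return", "found 3 matching widgets")]

    def run_frame(task, depth=0):
        # At the root, tools are not executable (their descriptions would still be
        # shown to the model); descendants get the real tools.
        executable = {} if depth == 0 else TOOLS
        notes = []   # token-hungry output stays in this frame, never leaks upward
        for kind, arg in fake_actions(task, depth):
            if kind == "return":
                return f"[depth {depth}] {arg}"            # the only thing the parent sees
            if kind == "call":
                notes.append(run_frame(arg, depth + 1))    # descend: fresh, isolated context
            elif kind in executable:
                notes.append(executable[kind](arg))
        return notes[-1] if notes else ""

    print(run_frame("audit the widget inventory"))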

energy123 a day ago

I open 4 chat windows with Gemini 3.0 Pro. I paste in all file contents to each window. I ask them "which files would an AI need to do $TASK effectively?"

Each of the 4 responses will disagree, despite some overlap. I take the union of the 4 responses as the canonical set of files that an implementer would need to see.

This reduces the risk of missing key files, while increasing the risk of including marginally important files. An easy trade-off.

Then I paste the subset of files into GPT 5.2 Pro, and give it $TASK.

You could replace the upstream process with N codex sessions instead of N gemini chat windows. It doesn't matter.

This process can be automated with structured JSON outputs, but I haven't bothered yet.

It uses a lot of inference compute, but that's better than missing key inputs and wasting time on hallucinated output.
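
For what it's worth, the upstream step is straightforward to automate with structured JSON output; here's a sketch assuming an OpenAI-style API (a single vendor and model name are used only for brevity, not the Gemini/GPT split described above):

    import json
    from openai import OpenAI

    client = OpenAI()

    def needed_files(task, all_files, n=4):
        prompt = (f"Task: {task}\n\nProject files:\n" + "\n".join(all_files) +
                  '\n\nWhich files would an AI need to do this task effectively? '
                  'Respond with JSON like {"files": ["path/one.py", "path/two.py"]}.')
        picks = set()
        for _ in range(n):   # independent samples disagree; the union reduces misses
            r = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                response_format={"type": "json_object"},
            )
            picks |= set(json.loads(r.choices[0].message.content).get("files", []))
        return sorted(picks & set(all_files))   # drop any hallucinated paths

    def implement(task, repo_files):            # repo_files: {path: contents}
        chosen = needed_files(task, list(repo_files))
        context = "\n\n".join(f"### {p}\n{repo_files[p]}" for p in chosen)
        r = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"{context}\n\nTask: {task}"}])
        return r.choices[0].message.content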

  • DrSiemer a day ago

    That sounds cumbersome and even more wasteful than my own method of simply dumping a fixed selection of project code in Gemini for each set of requests. Is there any benefit to pruning?

    • energy123 a day ago

      > Is there any benefit to pruning?

      1- Better quality output from pruning out noise, while reducing the chances of missing key context.

      2- Saving time/effort by not using my brain to decide which files to include.

      3- ChatGPT 5.2 Pro only allows 60k tokens, so I have no choice sometimes.

      It comes with costs as you identified. It's a trade-off that I am willing to pay.

jaychia 16 hours ago

It's fairly surprising to me how naive/early we are still in the techniques that we use here.

Anthropic's post on the Claude Agent SDK (formerly Claude Code SDK) talks about how the agent "gathers context", and is fairly accurate as to how people do it today.

1. Agentic Search (give the agent tools and let it run its own search trajectory): specifically, the industry seems to have made really strong advances towards giving the agents POSIX filesystems and UNIX utilities (grep/sed/awk/jq/head etc) for navigating data. MCP for data retrieval also falls into this category, since the agent can choose to invoke tools to hit MCP servers for required data. But because coding agents know filesystems really well, it seems like that is outperforming everything else today ("bash is all you need").

2. Semantic Search (essentially chunking + embedding, a la RAG in 2022/2023): I've definitely noticed a growing trend amongst leading AI companies to move away from this. Especially if your data is easily represented as a filesystem, (1) seems to be the winning approach.

Interestingly, though, both approaches have a pretty glaring flaw: they really only provide the agents with raw, unprocessed data. There's a ton of recomputation on raw data! An agent that has sifted through the raw data once (maybe reading v1, v2 and v_final of a design document or something) will have to do the same thing again in the next session.

I have a strong thesis that this will change in 2026 (Knowledge Curation, not search, is the next data problem for AI) https://www.daft.ai/blog/knowledge-curation-not-search-is-th... and we're building towards this future as well. Related ideas that have anecdotal evidence of providing benefits but haven't really stuck in practice yet include agentic memory, processing agent trajectory logs, continuous learning, persistent note-taking, etc.

Agent_Builder a day ago

We ran into this while building GTWY.ai. What worked for us wasn’t trying to keep a single model “continuously informed”, but breaking work into smaller steps with explicit context passed between them. Long-lived context drifted fast. Short-lived, well-scoped context stayed predictable.

zarathustra333 a day ago

I've been building https://www.usesatori.sh/ to give persistent context to agents

Would be happy to onboard you personally.

  • dboon a day ago

    I'm not OP, but send me an email. My address is in my HN profile. You and I are building the same thing, and I would love to have a chat.

  • bnt123 21 hours ago

    I see you are only using vector search; many memory solutions involve some combination of vector search and/or graphs (mem0, Zep, Cognee, etc). Have you compared against these?

akshay326 a day ago

What would you consider not-continuous?

The best methods I’ve observed: progressive loading (Claude skills) and symbolic search (Serena MCP).

_boffin_ a day ago

> what, in your view, is the best way to provide context to a model?

Are you talking about manually or in an automated fashion?

  • nemath a day ago

    An automated fashion would be what I'm curious about.

DonHopkins a day ago

Check out "cursor-mirror", this extended Anthropic Skill I've developed as a part of MOOLLM, which will tell you all about how cursor assembles its context:

cursor-mirror skill: https://github.com/SimHacker/moollm/tree/main/skills/cursor-...

cursor-mirror

See yourself think. Introspection tools for Cursor IDE — 47 read-only commands to inspect conversations, tool calls, context assembly, and agent reasoning from Cursor's internal SQLite databases.

By Don Hopkins, Leela AI — Part of MOOLLM

The Problem

LLM agents are black boxes. You prompt, they respond, you have no idea what happened inside. Context assembly? Opaque. Tool selection? Hidden. Reasoning? Buried in thinking blocks you can't access.

Cursor stores everything in SQLite. This tool opens those databases.

The Science

"You can't think about thinking without thinking about thinking about something." — Seymour Papert, Mindstorms: Children, Computers, and Powerful Ideas (Basic Books, 1980), p. 137

Papert's insight: metacognition requires concrete artifacts. Abstract introspection is empty. You need something to inspect.

This connects to three traditions:

Constructionism (Papert, 1980) — Learning happens through building inspectable artifacts. The Logo turtle wasn't about drawing; it was about making geometry visible so children could debug their mental models. cursor-mirror makes agent behavior visible so you can debug your mental model of how Cursor works.

Society of Mind (Minsky, 1986) — Intelligence emerges from interacting agents. Minsky's "K-lines" are activation patterns that recall mental states. cursor-mirror lets you see these patterns: which tools activated, what context was assembled, how the agent reasoned.

Schema Mechanism (Drescher, 1991) — Made-Up Minds describes how agents learn causal models through Context → Action → Result schemas. cursor-mirror provides the data for schema refinement: what context was assembled, what action was taken, what result occurred.

What You Can Inspect:

Conversation Structure

Context Assembly

Tool Execution

Server Configuration

MCP Servers

Image Archaeology

Python Sister Script CLI Tool: cursor_mirror.py

cursor_mirror.py: https://github.com/SimHacker/moollm/blob/main/skills/cursor-...

Here is the design and exploration and hacking session in which I iteratively designed and developed it, using MOOLLM's Constructionist "PLAY-LEARN-LIFT" methodology:

cursor-chat-reflection.md: https://github.com/SimHacker/moollm/blob/main/examples/adven...

Look at the "Scene 19 — Context Assembly Deep Dive" section and messageRequestContext schema, and "Scene 23 — Orchestration Deep Dive" section!

PR-CURSOR-MIRROR-GENESIS.md: https://github.com/SimHacker/moollm/blob/main/designs/PR-CUR...

play-learn-lift skill: https://github.com/SimHacker/moollm/tree/main/skills/play-le...

MOOLLM Anthropic compatible extended meta skill skill: https://github.com/SimHacker/moollm/tree/main/skills/skill

Specifically you can check out ORCHESTRATION.yml and other "YAML Jazz" metadata in the directory:

ORCHESTRATION.yml: https://github.com/SimHacker/moollm/blob/main/skills/cursor-...

Currently it only supports Cursor running on Mac, but I'd be happy to accept PRs for Linux and Windows support. Look at the cursor-chat-reflection.md document to see how I had Cursor analyze its own directories, files, SQLite databases, and JSON schemas. I'm also looking for help developing mirrors and MOOLLM kernel drivers for other orchestrators like Claude Code, etc.

DATA-SCHEMAS.yml: https://github.com/SimHacker/moollm/blob/main/skills/cursor-...

OutOfHere a day ago

It's called continuous learning. You can't do it with an LLM service, but you can if you're in training mode with bigger hardware.

Cursor and AI coding tools don't do it; they use agentic subtasks.

dtagames a day ago

There is no such thing as continuous context. There is only context that you start and stop, which is the same as typing those words in the prompt. To make anything carry over to a second thread, it must be included in the second thread's context.

Rules are just context, too, and all elaborate AI control systems boil down to these contexts and tool calls.

In other words, you can rig it up any way you like. Only the context in the actual thread (or "continuation", as it used to be called) is sent to the model, which has no memory or context outside that prompt.

  • tcdent a day ago

    Furthermore, all of the major LLM APIs reward you for re-sending the same context with only appended data in the form of lower token costs (caching).

    There may be a day when we retroactively edit context, but the system in its current state is not very supportive of that.

    • vanviegen a day ago

      > Furthermore, all of the major LLM APIs reward you for re-sending the same context with only appended data in the form of lower token costs (caching).

      There's a little more flexibility than that. You can strip off some trailing context before appending new context. This allows you to keep the 'long-term context' minimal while still making good use of the cache.

journal a day ago

I don't understand why these questions are so common. Is it not obvious how one should use these capabilities? I compose my context in an md file and send it through the API. I wrote a simple lms.exe to send the context and append the response to the same file. Why doesn't everyone else do that? I never believed in agents that compose their own context like Cursor, and I always pass the lowest reasoning value parameter to the API that I can. Why doesn't anyone else do this?

You become dependent on a tool; some of you are already dependent on fancy IDEs and agents. We're already dependent on the top 3 vendors, and OpenAI is the only one that no one complains about from the API key configuration side. You're going to become dependent not only on LLMs but on the tooling as well? No thanks.

Anyone with a different opinion on this, down to the exact workflow: you are walking down the wrong path. You have to become efficient at converting electricity to text. Admit it, some are just better than others, while some will never get it at all. You know if you won't, because you know people in your life who never change their opinions about something, or who always get into car accidents because they're bad drivers. You can't change these people, and you might be one of them.