As long as there's no solution to the long-term memory problem, we will have a "country of geniuses in a data center" that are all suffering from anterograde amnesia (movie: Memento), which requires human hand-holding.
I have experimented with a lot of hacks, like hierarchies of indexed md files, semantic DBs, embeddings, dynamic context retrieval, but none of this is really a comprehensive solution to get something that feels as intelligent as what these systems are able to do within their context windows.
I am also a touch skeptical that adjusting weights to learn context will do the trick without a transformer-like innovation in reinforcement learning.
Anyway, I'll keep tinkering…
I've used OpenClaw (just for learning; I agree with the author that it's not reliable enough to do anything useful) but also have a similar daily summary routine, which is a basic Gemini API call to a personal MCP server with access to my email, calendar, etc. The latter is so much more reliable. OpenClaw flows sometimes nail it, and then the next day fail miserably. It seems like we need a way to 'bank' the correct behaviours - like 'do it like you did it on Monday'. I feel that for any high-percentage reliability, we will end up moving towards using LLMs as glue, with as much of the actual work as possible handed off to MCP or persisted routine code. The best use case for LLMs currently is writing code, because once it's written, tested and committed, it's useful for the long term. If we had to generate the same code on the fly for every run, there's no way it would ever work reliably. Extrapolating that idea helps show what we can and can't expect from AI.
This is interesting. I haven't used OpenClaw but I set up my own autonomous agent using Codex + ChatGPT Plus + systemd + normal UNIX email and user account infrastructure. And it's been working great! I'm very happy with it. It's been doing all kinds of tasks for me, effectively as an employee of my company.
I haven't seen any issues with memory so far. Using one long rolling context window, a diary and a markdown wiki folder seems sufficient to have it do stuff well. It's early days still and I might still encounter issues as I demand more, but I might just create a second or third bot and treat them as 'specialists' as I would with employees.
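(For anyone wanting to replicate this: the nightly wakeup maps neatly onto a systemd user timer. A sketch; unit and script names here are hypothetical, not from my actual setup:)

```ini
# ~/.config/systemd/user/agent-wakeup.service  (names/paths hypothetical)
[Unit]
Description=Agent wakeup tick

[Service]
Type=oneshot
# The script feeds the diary and markdown wiki back into the harness
ExecStart=%h/agent/bin/wakeup.sh

# ~/.config/systemd/user/agent-wakeup.timer
[Unit]
Description=Run the agent wakeup nightly

[Timer]
OnCalendar=*-*-* 03:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with `systemctl --user enable --now agent-wakeup.timer`; `Persistent=true` makes a missed run fire at the next opportunity instead of being skipped.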
I did (using Claude Code) something that sounds very similar to this. It’s a bunch of bootstrapped Unix tools, systemd units, and some markdown files. Two comments:
- I suspect that in this moment, cobbling together your own simple version of a “claw-alike” is far more likely to be productive than a “real” claw. These are still pretty complex systems! And if you don’t have good mental models of what they’re doing under the hood and why, they’re very likely to fail in surprising, infuriating, or downright dangerous ways.
For example, I have implemented my own “sleep” context compaction process, and while I’m certain there are objectively better implementations of it than mine… Mine is legible to me, and therefore I can predict with some accuracy how my productivity tamagotchi will behave day-to-day, in a way that I could not if I wasn’t involved in creating it.
(Nb I expect this is a temporary state of affairs while the quality gap between homemade and “professional” just isn’t that big)
- I do use mine as a personal assistant, and I think there is a lot of potential value in this category for people like me with ADD-style brains. For whatever reason, explaining in some detail how a task should be done is often much easier for me than just doing the task (even if, objectively, there’s equal or higher effort required for the former). It therefore doesn’t do anything I _couldn’t_ do myself. But it does do stuff I _wouldn’t_ do on my own.
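(A homemade “sleep” compaction like the one described in the first point can be very small. A sketch, assuming the transcript is kept as a list of strings; `summarize()` is a hypothetical hook into whatever model you use, not a real API:)

```python
# Minimal "sleep" compaction pass: fold older turns into one summary
# line, keep the recent tail verbatim. summarize() is a hypothetical
# callback into your model of choice.

def compact(transcript, summarize, keep_tail=20, budget=4000):
    """Compact a transcript once it exceeds a character budget."""
    if sum(len(t) for t in transcript) <= budget:
        return transcript                      # still fits, nothing to do
    head, tail = transcript[:-keep_tail], transcript[-keep_tail:]
    return ["[diary summary] " + summarize(head)] + tail
```

Run it at the end of each day (or whenever the budget trips) and the agent wakes up with a digest of the old context plus the last few turns verbatim.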
Right - I think email is a much better UI than Slack or WhatsApp or Discord for that reason. It forces you to write properly and explain what you want, instead of firing off a quick chat. Writing things down helps you think. And because coding harnesses like Codex are very good at interacting with their UNIX environments but are also kinda slow, email's higher latency expectations are a better fit for the underlying technology.
Any chance you might put this on GH? Sounds really interesting.
Maybe but it's so simple I'm not sure it's worth it. You can easily make your own!
What sort of tasks do you have it do for you?
Two categories: actual useful work for the company, and improving the bot's own infrastructure.
Useful work includes: bug triage, matching external user bug reports on GitHub to the internal YouTrack, fixing easy-looking bugs, and working on a redesign of the website. I also want to extend it to the quarterly accounting (already largely automated with AI, but I still need to run the scripts myself), preparing answers to support queries, and more bug fixing and feature work. It has access to the bug tracker, internal git and CI system as if it were an employee, and uses all of those quite successfully.
Meta-work has so far included: making a console so I can watch what it's doing when it wakes up, regularly organizing its own notes and home directory, improving the wakeup rhythm, and packaging up its infrastructure to a repeatable install script so I can create more of them. I work with a charity in the UK whose owner has expressed interest in an OpenClaw but I warned him off because of all the horror stories. If this experiment continues to work out I might create some more agents for people like him.
I'm not sure it's super useful for individuals. I haven't felt any great need to treat it as a personal assistant yet. ChatGPT web UI works fine for most day to day stuff in my personal life. It's very much acting like an extra employee would at a software company, not a personal secretary or anything like that.
It sounds like our experience differs because you wanted something more controlled with access to your own personal information like email, etc, whereas I gave "Axiom" (it chose its own name) its own accounts and keep it strictly separated from mine. Also, so far I haven't given it many regular repeating tasks beyond a nightly wakeup to maintain its own home directory. I can imagine that for e.g. the accounting work we'd need to do some meta-work first on a calendar integration so it doesn't forget.
I’m doing this exact same thing at my solo SaaS company, except with Cursor’s Cloud Agents. I can kick them off from web, Slack, Linear, or on a schedule, so I’m doing a lot of the same things as you. It’s just prompts on a cron, with access to some tools and skills, but super useful.
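(The “prompts on a cron” pattern really is that literal. A sketch; the agent CLI name and paths are placeholders, not a real tool:)

```
# m h dom mon dow  command   (crontab(5) syntax; agent-cli is hypothetical)
0 7 * * 1-5  cd "$HOME/agent" && agent-cli run "$(cat prompts/triage.md)" >> logs/triage.log 2>&1
```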
That unreliability was why I gave up on OpenClaw. I tried hard to give it very simple tasks but it had a high degree of failure. Heartbeats and RAG are lightyears away from where they need to be. I'm not sure if this can be overcome using an application layer right now, but I trust that many people are trying, and I'm eager to see what emerges in the next year. In the mean time I know that they're working very hard on continuous learning - real-time updates to weights and parametric knowledge. It could be that in a year or so, we can all have customised models.
That would be great if that comes to fruition. Investing in a model with weights updates would be like investing in employee training, rather than just giving the same unreliable employee more and more specific instructions.
You're right to be skeptical. Without a way to actually replicate how the human brain consolidates experiences into memory, we won't be able to solve the long-term memory problem at all. Not with the current technology.
An LLM's context is a pretty good extended short-term memory, and the trained network is a very nice, comprehensive long-term memory, but due to the way we currently train these networks, an LLM is fundamentally unable to "move" experiences into long-term memory the way a human brain does (through sleep, among other mechanisms).
Until we can teach a machine to experience something once and remember it (preferably in a local model, because you wouldn't want a global memory remembering your information), we just cannot solve this problem.
I think this is probably the most interesting field of research right now. Actually understanding in depth how the brain learns, and figuring out a way to build a model that implements this. Because right now, with backpropagation and weight adjustments, I just can't see us getting there.
I think if we want to build on what we have, instead of compaction at the end of the context window, the LLM would have to 'sleep', i.e. adjust its weights, then wake up with the last bits of the old context window in the new one, and have a 'feel' for what it did before through the change in weights. I just sense it's not that simple to get there, because simply updating the weights based on a single context sample risks degrading the weights of the whole network.
I like the idea of using a small local model (or several) for tackling this problem, e.g. with low-rank adaptation, but with current tech I'd still have to piece this together myself, or the small local models will forget old memories.
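(For anyone unfamiliar with low-rank adaptation, the core trick is tiny. A toy numpy sketch, with illustrative sizes; the base weights stay frozen and only a small rank-r delta is trained on top:)

```python
import numpy as np

# Toy LoRA-style adapter: base weights W are frozen; only the small
# low-rank delta B @ A (2*d*r numbers vs d*d) would be trained.
rng = np.random.default_rng(0)
d, r = 512, 8                            # hidden size, adapter rank

W = rng.standard_normal((d, d))          # frozen base weights
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero init

def forward(x):
    # Base path plus low-rank path; with B = 0 the adapter is a no-op,
    # so the adapted model starts out identical to the base.
    return x @ W.T + x @ A.T @ B.T

x = rng.standard_normal((1, d))
assert np.allclose(forward(x), x @ W.T)  # no-op at initialization
```

Training only A and B keeps per-user customisation cheap, but as the parent says, it doesn't by itself stop the adapter forgetting old memories as new ones are trained in.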
Sleep would probably be part of the equation for consolidation, but there's still the question of how exactly the brain processes information during sleep in a way that permanently consolidates it.
That's not how an LLM can work right now; it needs too many iterations and a much bigger dataset than we can work with. We can experience something a single time and remember it. That's orders of magnitude more efficient than what an LLM can currently achieve.
Couldn't fine-tuning solve the problem? That's what companies do: take a model as a base and train it on the specific data long enough that it prefers the new data. Overfitting may be a concern, but for personal use I may want it to work as I expect, every time.
> I think this is probably the most interesting field of research right now. Actually understanding in depth how the brain learns, and figuring out a way to build a model that implements this.
This field of research has been around for decades, so who's to say when there'll be a breakthrough.
In fact, LLMs are great despite our very limited understanding, and not because we had some breakthrough about the human brain.
Exactly. It's been around so long and we still don't know how to mimic it.
The way an LLM learns is a very interesting way of doing it, but it sure isn't what the brain is doing.
But it's indisputable that we can get enormous results with this technique. It's just probably not the way forward for the faster learning needed to remedy the issue of context loss.
Why does a language model have to be monolithic? I think retraining a model is expensive (relatively speaking). Is there some way to bolt on specialization?
That's exactly the issue. Retraining is too expensive & needs too much iteration to work efficiently I think.
How well do LoRAs work for this, using something like Thinking Machines' Tinker?
It's kind of fascinating that everyone is trying to build a Chinese Room agent with stateless models, since we don't know how to produce a stateful model with continuous, incremental training.
It's like a spontaneous implementation of thought experiments from yesteryear. I wonder if all this product-focused experimentation will accidentally impact philosophy of mind after all...
I agree. A key to human intelligence is our ability to adjust our weights in real-time. All knowledge becomes parametric knowledge - the knowledge stored inside the model. RAG is a messy workaround which requires making assumptions about what is needed to load from external sources before it is clear what is needed. Agentic loops can go some way to overcome this, but they are resource intensive, slow, prone to mistakes and deviations, and far less accurate. The secret sauce of an LLM is the vectorised weights. RAG is like putting a 1990s Honda Civic engine into a Ferrari. You can do it, but the result is quite terrible.
I think we will eventually end up with models which can be individually trained and customised on regular schedules. After that, real-time.