Throwing this out there, I have a command line driver for LLMs. Lots of little tricks in there to adapt the CLI to make it amenable to LLMs. Like interrupting a long-running process periodically and asking the LLM if it wants to kill it or continue waiting. Also allowing the LLM to use and understand apps that use the alternate screen buffer (to some degree).
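Roughly the shape of that interrupt trick, sketched in TypeScript (the `askLlm` helper and the prompt wording are hypothetical stand-ins, not the actual NAISYS code):

    import { spawn } from "child_process";

    // Stand-in for whatever chat-completion call the wrapper uses.
    declare function askLlm(prompt: string): Promise<string>;

    // Run a shell command; every `intervalMs`, show the LLM the output
    // so far and let it decide to kill the process or keep waiting.
    async function runWithCheckIns(cmd: string, intervalMs = 30_000) {
      const proc = spawn(cmd, { shell: true });
      let output = "";
      let done = false;
      proc.stdout.on("data", (d) => (output += d));
      proc.stderr.on("data", (d) => (output += d));
      proc.on("exit", () => (done = true));

      while (!done) {
        await new Promise((r) => setTimeout(r, intervalMs));
        if (done) break;
        const answer = await askLlm(
          `Process still running. Output so far:\n${output}\n` +
            `Reply 'kill' to terminate or 'wait' to keep waiting.`
        );
        if (answer.trim().toLowerCase() === "kill") {
          proc.kill();
          break;
        }
      }
      return output;
    }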
Overall I try to keep it as thin a wrapper as I can. The better the model, the less wrapper is needed. It's a good way to measure model competence. The code is here https://github.com/swax/NAISYS and example context logs are here - https://test.naisys.org/logs/
I have agents built with it that do research on the web for content, run python scripts, update the database, maintain a website, etc., all running through the CLI; if an agent calls APIs, it does so with curl. Example agent instructions here: https://github.com/swax/NAISYS/tree/main/agents/scdb/subagen...
The one thing that I always wonder is how varied those interactions with an agent really are. My workflow is enough of a routine that I just write scripts and create functions and aliases to improve ergonomics. Anything that has to do with interacting with the computer can be automated.
Yeah, a lot of this is experimental. I basically have plain-text instructions per agent, with the agents all talking to each other, coordinating and running an entire pipeline to do what would typically be hard-coded. There are definite pros and cons: a lot of unpredictability of course, but also resilience and flexibility in the ways they can work around unexpected errors.
> It's a good way to measure model competence.
Can you elaborate?
Sure, so you tell the model: here's a command prompt, what do you type next? Ideally it types commands, but a lesser model may just type what it's thinking, which isn't valid input. You can give it an out with a 'comment' command, but some models will forget about that. The next biggest problem is fake output: the model types not just 'cat file.txt' but also the following command prompt and fabricated output for the file.
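A cheap defense against that fake-output failure is to truncate the reply at the first thing that looks like a fabricated shell prompt. A minimal sketch in TypeScript (the prompt pattern is an assumption for illustration, not what NAISYS actually matches):

    // Keep only what the model "typed" before it started inventing the
    // next command prompt and fake output, e.g. "agent@host:~$ ...".
    const FAKE_PROMPT_RE = /^[\w.-]+@[\w.-]+:[^\n]*\$\s/m;

    function stripFakeOutput(reply: string): string {
      const match = FAKE_PROMPT_RE.exec(reply);
      return match ? reply.slice(0, match.index).trimEnd() : reply.trim();
    }

    // stripFakeOutput("cat file.txt\nagent@host:~$ fake file contents")
    //   -> "cat file.txt"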
The biggest mark of intelligence is whether it can continue a project long-term over multiple contexts and sessions. Take 'build me a whole website to do x': many AIs are very good at one-shotting something, but not at continually working on the same thing, maintaining and improving it. Basically, after that good one-shot, the AI starts regressing, the project gets continually broken by changes, and the architecture becomes convoluted.
My plan is not to change NAISYS that much. I'm not going to continually add crutches and scaffolding to hand-hold the AI; it's the AI that needs to get better, and it has improved significantly since I mostly finished the project last year.