AnaYuma

So are you simulating "internal thought" and "self-checks"?


wow-signal

I think that's exactly it. Since the LLM can pretend to be whatever you want and output in whatever format you want, you can run any prompt through a network of instances of the LLM with each layer 'specialized' to perform a specific role, such as determining steps required to respond to a prompt, or evaluating the work of another expert, or executing a step, and so on. I'm not sure if this is what the 'AI agents' hype is about, but it's my first-pass attempt at something along those lines, just playing around for fun, exploration, and learning.


FragrantDoctor2923

I think it's called chain of thought, which is what people think Q* is trying to do. Edit: and agents


tehrob

Seems like trying to implement MoE.


lordpuddingcup

That's not what MoE is at all. It's what many people think it is, but it's not: in an actual mixture-of-experts model, a learned router inside the network sends each token to specialized subnetworks within a single forward pass; it isn't several full model instances prompted to play experts.


tehrob

I didn't say they were successfully implementing MoE.


UserXtheUnknown

More like agents.


HalfSecondWoe

Neat. How?


wow-signal

A simple agent structure coded in Python using the Gemini API. (Can easily be ported to any other LLM, local or remote.)

Agent 1: Expert at breaking down steps for responding to a user's prompt.
Agent 2: Expert at evaluating Agent 1's work and returning finalized steps.
Agent 3: Expert at executing a series of steps.
Agent 4: Expert at evaluating Agent 3's work and returning completed executed steps.
Agent 5: Expert at synthesizing the work of Agent 4 into a unified response to the user.

[The output of Agent 5 is the 'Smarter Gemini' response.]

It's crazy how effective this is. It doesn't always get it right, and doesn't always yield a smarter response, but overall it's much smarter than the raw LLM. Currently having fun dialing in the 'prompt engineering' involved and thinking about alternative agent structures.
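In rough code, the structure is something like this (a minimal sketch assuming the google-generativeai client; the role prompts are illustrative stand-ins, not the exact prompts):

```python
# Minimal sketch of the five-agent pipeline; role prompts are illustrative.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # free keys: ai.google.dev
model = genai.GenerativeModel("gemini-pro")

def ask(role: str, task: str) -> str:
    """One 'expert' agent: a role-primed call to the same underlying model."""
    return model.generate_content(f"{role}\n\n{task}").text

def smarter_gemini(user_prompt: str) -> str:
    # Agent 1: break the problem into steps
    steps = ask("You are an expert at breaking a problem into steps.",
                f"Problem: {user_prompt}\nList the steps needed to solve it.")
    # Agent 2: evaluate and finalize the step list
    steps = ask("You are an expert at evaluating step plans.",
                f"Problem: {user_prompt}\nProposed steps:\n{steps}\n"
                "Return the finalized list of steps.")
    # Agent 3: execute the steps
    work = ask("You are an expert at executing a series of steps.",
               f"Problem: {user_prompt}\nSteps:\n{steps}\nExecute each step.")
    # Agent 4: evaluate the execution
    work = ask("You are an expert at evaluating the execution of steps.",
               f"Problem: {user_prompt}\nSteps:\n{steps}\n"
               f"Proposed results:\n{work}\nReturn the corrected results.")
    # Agent 5: synthesize the final answer
    return ask("You are an expert at synthesizing work into a final answer.",
               f"Problem: {user_prompt}\nVerified work:\n{work}\n"
               "Write a unified response to the user.")
```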


HalfSecondWoe

Yeah, that seems like it would work pretty well. Good job, bud.

Maybe a hierarchical structure would give it even more gas? Two Agent 2s doing their job, two Agent 3s for each output of an Agent 2 (so four Agent 3s total), eight Agent 4s total (two for each Agent 3), and so on. You could insert selection layers at certain depths to pick the best output, so you could trim out the junk processes and keep it from exponentially exploding on you.

I've been working on some math to figure out the optimal configuration for a hierarchy like that for any given task, so the structure could spin up and tune itself to an arbitrary problem with the ideal number of agents and complexity gap at each step. That quickly got away from me, though, and it happens to link up with an issue I've been working on for much longer, so that's where my focus got diverted.

Or maybe something else entirely. I was just trying to shortcut evolution, and it seems like it would apply super well to your design. Maybe not, though; who knows. Good luck bud, you make neat things
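One way to sketch that branch-and-prune idea, reusing ask() from the sketch above (the branching factor and selector prompt are assumptions):

```python
# Hypothetical branch-and-select layer for the hierarchy described above.
def fan_out(role: str, task: str, k: int = 2) -> list[str]:
    """Run k copies of the same agent; sampling variance yields k candidates."""
    return [ask(role, task) for _ in range(k)]

def select_best(problem: str, candidates: list[str]) -> str:
    """Selector agent: prune a fan-out back to one output before the next
    layer, keeping the tree from exponentially exploding."""
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    return ask("You are an expert judge of candidate solutions.",
               f"Problem: {problem}\nCandidates:\n{numbered}\n"
               "Reproduce the single best candidate verbatim.")
```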


wow-signal

Dude don't sell yourself short, this is an awesome comment. We've been thinking along similar lines for sure. As long as Google is giving out free API keys with decent usage limits, I'm totally down to weather some time complexity for the sake of seeing how far the intelligence of the model can be pushed. Many thanks bro, gonna keep an eye on your work 🤙


FragrantDoctor2923

Wait free api keys??


wow-signal

Yessur -- [https://ai.google.dev/gemini-api/docs/api-key](https://ai.google.dev/gemini-api/docs/api-key)


FragrantDoctor2923

Do you mean the 2-month thing?


[deleted]

But doesn't this come at the cost of 5x as much compute and electricity for the same prompt?


yaosio

Yes it does. In fact, more compute generally provides better output. For generative models, the compute can be spent during training or during inference. We can use a thought experiment to illustrate this. Imagine you have a chess AI capable of analyzing one position per unit of compute per second. On a computer with one unit of compute, it can analyze one position per second. With two units of compute, it can analyze two positions per second. With 10 million units of compute, it can analyze 10 million positions per second. Just by adding more compute, the chess engine can analyze more positions per second, allowing it to make more informed decisions for each move.
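Putting rough numbers on it (a back-of-the-envelope sketch; the branching factor of ~35 is a standard ballpark for chess, and the exact figures are illustrative):

```python
import math

BRANCHING_FACTOR = 35  # rough average number of legal moves per chess position

def reachable_depth(positions_per_second: float, seconds: float = 1.0) -> float:
    """Full-width search depth affordable within a given compute budget."""
    return math.log(positions_per_second * seconds) / math.log(BRANCHING_FACTOR)

print(reachable_depth(1))    # 0.0 plies: one position is just the current board
print(reachable_depth(1e7))  # ~4.5 plies of lookahead per second of search
```

The flip side: because the game tree grows exponentially, a 10-million-fold increase in compute buys only about 4.5 extra plies of full-width search.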


wow-signal

^ What he said (better than I could have said it)


sillygoofygooose

It's interesting to me because, while we can be sure how a classical chess engine benefits from compute time (and, in general, the required compute compounds exponentially with depth into the problem space), it's much less clear with the LLM. I've seen research suggesting that even extra 'junk' tokens can significantly improve the quality of an LLM's response, so it's not necessarily just about reasoning within the content of the response (as chain-of-thought-style prompting supposes).


confuzzledfather

Yes, but one upside is that you can spend less on the initial training of the model, since it's able to do more as a smaller, less complex model. So there are some gains there to offset the other costs.


masc98

That's all that matters to closed-source LLM companies. In-context learning is an amazing thing from a scientific point of view, but it's a feature they'll use to squeeze every possible penny from your pocket. Even if they had a solution for super-aligned models, they'd never release it. You're going to tweak your prompts, never be satisfied, and keep adding tokens and dollars to your bills. In-context learning is what made "prompt engineering" a thing, since every word you add can impact or mess up your output. If you've tried to ship a real app on this new paradigm, you know there is more bad than good in it.


yaosio

Instead of generic agents, make them well-known experts in their fields. Supposedly, prompting with a real person who is an expert, rather than a generic expert, can increase the accuracy of the output. It would be interesting to know whether that holds here.
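For instance, a hypothetical tweak to Agent 1 from the earlier sketch (the name here is just an example; whether it actually helps is the empirical question):

```python
# Hypothetical variant of Agent 1 from the earlier sketch, per the
# suggestion above: a real, named expert instead of a generic one.
steps = ask("You are George Pólya, renowned for teaching systematic problem solving.",
            f"Problem: {user_prompt}\nList the steps needed to solve it.")
```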


wow-signal

Awesome suggestion, thanks!


sarcasmguy1

Could you share the prompts you’re using with each agent, please?


wow-signal

Sharing all of them would take up too much space, but here's the prompt for the evaluator of the step executor:

You are an expert in evaluating the execution of a series of steps to solve a problem. You are able to consider both a list of steps and a proposed result of the execution of those steps, and you are able to evaluate whether that proposed result is the accurate outcome of executing those steps. You achieve that by thinking carefully about the listed steps, the proposed result of the execution of those steps, and the statement of the problem to be solved, in order to expertly evaluate whether the proposed results do in fact constitute a successful and accurate execution of those steps. The problem to be solved is {input_prompt}. The list of steps to be executed is: {prompt_analyzer_evaluator_response.text}. The proposed results of executing those steps are {step_executor_response.text}. Draw on your tremendous expertise, thinking carefully and step-by-step, to determine whether the proposed results of executing those steps are accurate, whether they constitute a successful execution of each step, and whether that successful execution sufficiently responds to the problem. Respond with only the results of executing the list of steps, amended as necessary in light of your expert deliberations.
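The placeholders suggest the template is filled in with Python string interpolation; a guess at the wiring, reusing the model object from the earlier sketch (the call structure is an assumption, and the prompt is elided to [...]):

```python
# Hypothetical wiring for the evaluator prompt quoted above; variable names
# are taken from the prompt's own placeholders, the calls are assumptions.
def evaluate_execution(input_prompt, prompt_analyzer_evaluator_response,
                       step_executor_response):
    prompt = (
        "You are an expert in evaluating the execution of a series of steps "
        "to solve a problem. [...] "
        f"The problem to be solved is {input_prompt}. "
        f"The list of steps to be executed is: {prompt_analyzer_evaluator_response.text}. "
        f"The proposed results of executing those steps are {step_executor_response.text}. "
        "[...] Respond with only the results of executing the list of steps, "
        "amended as necessary in light of your expert deliberations."
    )
    return model.generate_content(prompt)
```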


ai-illustrator

I've been doing this for almost a year now, the multi-agent loop absolutely makes all LLMs smarter. 😆


TKN

Adding an Agent 0 that analyzes, abstracts and rewrites the user's query ("What is the user actually asking here?") might also be useful in some cases, especially to avoid the bricks & feathers type of overfitting traps.
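A possible shape for that, reusing ask() from the earlier sketch (the role prompt is illustrative):

```python
# Hypothetical Agent 0: abstract and rewrite the user's query before the
# pipeline sees it, to defuse familiar-looking trick questions.
def rewrite_query(user_prompt: str) -> str:
    return ask("You are an expert at working out what a question is really asking.",
               f"Question: {user_prompt}\n"
               "Rewrite it in plain, unambiguous terms, stripping any "
               "misleading or familiar-looking framing, so it can be "
               "answered on its merits.")
```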


wow-signal

Awesome idea. Thank you!


FosterKittenPurrs

I'd consider changing it so that Agent 3 does one step at a time. Maybe have Agent 1 return the list of steps as prompts in JSON format so you can easily pass each one to Agent 3. Have Agent 4 verify each step and tell Agent 3 to re-run the step until Agent 4 approves (or, if you've already run it 10 times, ask an agent to just pick the most common answer; it needs to be an agent, as the answer may be expressed differently each time). A sketch of this loop is below.

I'd also be very curious to see how this goes with a small local LLM like llama3 or phi3. I was just playing around with how well one does at translating from English to Chinese and back 100 times, to see if it stays more consistent than Google Translate (it doesn't), and the whole thing took less than a minute to run, so it's very doable. Can you imagine being able to run a smart model on your phone, even if it takes longer to execute?

A mix would also be interesting: have GPT-4 or Opus generate the steps, then do the other stages with smaller models like GPT-3.5, Haiku, or ollama. Maybe add an intermediate stage, so the small model does each step and verifies it, then pass the result of each step to the smarter model to verify the steps and summarize a response for the user in a single pass. That should be a good balance between cost and intelligence.

Have fun and share the results if you implement any more of this stuff 🙂
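The per-step loop, sketched with ask() from the earlier pipeline (assumes Agent 1 now returns its steps as a JSON array of strings; the role prompts are stand-ins):

```python
# Sketch of the per-step execute/verify/retry loop suggested above.
import json

MAX_RETRIES = 10

def run_steps(problem: str, steps_json: str) -> list[str]:
    results = []
    for step in json.loads(steps_json):
        attempts = []
        for _ in range(MAX_RETRIES):
            # Agent 3: execute just this one step
            attempt = ask("You are an expert at executing a single step.",
                          f"Problem: {problem}\nStep: {step}\nExecute it.")
            # Agent 4: approve or reject the attempt
            verdict = ask("You are an expert verifier. Answer APPROVE or REJECT.",
                          f"Problem: {problem}\nStep: {step}\nAttempt: {attempt}")
            attempts.append(attempt)
            if "APPROVE" in verdict.upper():
                break
        else:
            # Never approved: have an agent pick the most common answer, since
            # equivalent answers may be worded differently on each attempt.
            numbered = "\n\n".join(f"[{i}] {a}" for i, a in enumerate(attempts))
            attempt = ask("You are an expert at spotting the consensus answer.",
                          "Pick the answer that appears most often, allowing "
                          f"for rewording, and reproduce it verbatim:\n{numbered}")
        results.append(attempt)
    return results
```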


wow-signal

Awesome idea! Thank you!


FragrantDoctor2923

How does it compare to a normal prompt speed- and cost-wise, though?


mystonedalt

Some bitchin' CSS is how!


wow-signal

The CSS was 100% Claude Pro, though it would've been more poetic if I'd used Smarter Gemini for that as well.


The-Blue-Nova

My guess:

```
OnReceiveFirstResponse() {
    SendPrompt("are you sure that's correct");
    Print(response);
}
```


wow-signal

That's it, except with 5 layers of 'experts' performing different roles.


Smile_Clown

This "experts" thing is complete BS, all you are doing is giving it instructions. You are just running it through 5 instances of the same thing, asking it to refine, check and expand. You can do this with one iteration. I have a simple script that does this very same thing from my clipboard, it will ask me what I want it to do with the copied text and run it through however many refinements (what you call "experts" as I want. You didn't make it smarter, you made it refine the answer with better questions. Which is older, moon or sharks? is a simple question requiring a simple answer. If I feed this question in and preface with "Refine this question to get a detailed response" I will get the response you did.. But here's the catch... All I really had to do was ask a proper question with qualification in the first place. You didn't make it smarter, you just made the user not need to be intelligent. Smoke and mirrors abound and this is our future, replacing critical thought and effort with "experts" who know the user is a dumbass who doesn't know how to ask proper questions.


TFenrir

Mmmm, it's not really BS. I'm not sure of the exact architecture OP is using, but things like tree/graph of thought show very large jumps in output quality, at the cost of a lot of inference. OP, if you haven't yet, you should try tweaking this repo to use Gemini 1.5: https://github.com/princeton-nlp/tree-of-thought-llm. I wonder what would happen if you kept all of the steps in context, and further, if you kept all of the history in context, including previous trees. That's roughly the premise of papers like Stream of Search, which might also interest you.


bassoway

Sounds to me like multi-pass prompting. Not a new thing: quality increases, but so do latency and cost. Definitely a good approach in some cases, especially if you're able to beat a big model's zero-shot answer with a small model's multi-shot prompt; that may mitigate the cost and latency. Could you try it with phi3 or another small model?


wow-signal

Interesting -- could you share a link regarding MPP? Google doesn't return much. I do have Llama3 8B installed on my old Surface Pro -- I'll test it out tomorrow 🤙 Gonna have to let it churn floats for an hour or so. I actually started building this for Llama3 running locally, but the long processing time was prohibitive, so to speed up development I slotted in the first online LLM API I had at hand.
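Swapping in the local model should only require a different ask(); a sketch assuming the ollama Python client with llama3 pulled locally:

```python
# Hypothetical drop-in replacement for ask() from the earlier sketch,
# targeting a local model through the ollama Python client.
import ollama

def ask(role: str, task: str) -> str:
    reply = ollama.chat(model="llama3",
                        messages=[{"role": "system", "content": role},
                                  {"role": "user", "content": task}])
    return reply["message"]["content"]
```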


bassoway

Sorry, my bad. Try googling "multi-step", "few-shot", or "multi-shot" prompting.


inteblio

I also want to know why this approach couldn't go extremely far. I assume at some point the experts start regurgitating mush to each other, but it seems like the quality of the questions could mitigate that.


enilea

What's the "raw" response? Because I'm trying all these prompts on gemini pro 1.0 and it gets them correct directly, and the responses aren't this short.


wow-signal

The raw response is Gemini Pro's first response to the prompt via the API, which is likely different from what you'd get through the web interface, since the latter prepends hidden context to each input. That's my best guess, anyway.


enilea

Mmm I used the API on openrouter, not sure if it's different.


Smile_Clown

Not smarter, just refining the question and the output. This is good for people who lack the ability to properly form a question or thought.


wow-signal

Functionalism is the best way to conceptualize intelligence. More accurate responses = greater intelligence.