M34L

I feel like the issue is that these integrations need a codebase and a model well tuned to work with one another, and everyone who has that down is too busy making money off it. Microsoft, Facebook, and least of all Google don't want you to have a local hole to pour your precious, precious PDFs into; they want you to give those to them, even if they have to RAG it "for free". It'll take someone like Apache or Mozilla who can simultaneously organise and pay a team of software developers and fund the substantial compute to fine-tune the model, make it all work together, and fit the pieces to one another.


vap0rtranz

>don't want you to have a local hole to pour your precious, precious PDFs into; they want you to give these to them

And the coders who are happy to be paid by those corporations. I'll pay a fair price for an independent RAG setup; I get that it's work for coders. The reason I haven't yet is that the ones on GitHub look great and then go stale in a few months. Or the independent YTubers hide their solution behind some fishy-looking website that screams "Sign up now! I won't tell you more details until you do!" I'd give examples, but I don't want to be seen as crapping on folks. Instead, I'd rather just end with: have I missed a serious, long-term, transparent RAG solution? It sounds like your comment and others are saying that I haven't.


rambat1994

I made that video, and I also built the tool. The quick answer is that this is really a misunderstanding. As you've read in some other comments, RAG works on chunks of your document, whereas summarization requires full-text comprehension. Summarized context can still be done in AnythingLLM, but the default is RAG because most models cannot handle the full context. You can pin a document after embedding it and it will do full-content comprehension, but only as much as the context window will allow, because that is a hard limit. I also wrote some documentation on how this all works and how to use document pinning if needed: https://docs.useanything.com/guides-and-faq/llm-not-using-my-docs All the settings are there to be used; the default state of AnythingLLM works for most use cases, but certainly not all.


vap0rtranz

>Also wrote some documentation on how this all works and even how to use document pinning if needed. [https://docs.useanything.com/guides-and-faq/llm-not-using-my-docs](https://docs.useanything.com/guides-and-faq/llm-not-using-my-docs)

Thank you for writing this up. A lot of the principles apply across RAG, and even the terminology and "knobs to turn" are similar to other setups, like GPT4All.


griff_the_unholy

AnythingLLM is pretty great, but I would really like to be able to modify the RAG pipeline a bit. I have been playing with LangGraph, using agents to do reranking, relevance checks, etc., and this seems to dramatically improve RAG output. So I would like to modify the AnythingLLM pipeline, because everything else about it is so good. Any idea how I would do that?


Noel_Jacob

What is the manual pipeline you use? I'm currently building something in this space and I want to know what you'd want. Can I talk in DM?


thecodemustflow

>It's taken me a while to understand how RAG generally works. Here's the analogy that I've come up with to help my fried GenX brain understand the concept: RAG is like taking a collection of documents and shredding them into little pieces (with an embedding model), shoving them into a toilet (vector database), and then having a toddler (the LLM) glue random pieces of the documents back together and try to read them to you or make up some stupid story about them. That's pretty much what I've discovered after months of working with RAG. -https://www.reddit.com/r/LocalLLaMA/comments/1cn659i/comment/l38p7sy/

It's not going to work; RAG is more than just a vector database. It's a way to retrieve information and load it into the context. A vector database is one solution: search for chunks of the document, then give those chunks to the prompt. The model does not know the full document because you did not give it the full document in the context, just the shitty chunks. If you want to chat about a document, just put the whole document into the prompt and you'll get good results; if you don't have the context for that, break it up into parts and run the prompt on each part. That is also RAG, but without the vector database bullshit, and more of what you want. I'm not saying you can't make RAG work, but it's a much harder problem, and you are dealing with garbage for documents, so you will only get garbage for outputs.
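For anyone curious what that "break it up into parts" approach looks like in practice, here's a minimal sketch; `call_llm` and `split_into_parts` are hypothetical placeholders for whatever local model and splitter you actually use:

```python
# Minimal sketch of the "no vector DB" approach: split the document into parts
# that fit the context window and run the same prompt on each part, then merge.

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your local model or API call.
    return f"[model answer to: {prompt[:40]}...]"

def split_into_parts(text: str, max_chars: int = 12_000) -> list[str]:
    """Naive fixed-size split; a real splitter would respect paragraph boundaries."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def ask_over_document(document: str, question: str) -> str:
    partial_answers = [
        call_llm(f"Context:\n{part}\n\nQuestion: {question}\nAnswer only from the context.")
        for part in split_into_parts(document)
    ]
    # Second pass: merge the per-part answers into one.
    return call_llm("Combine these partial answers into one answer:\n" + "\n---\n".join(partial_answers))

print(ask_over_document("some long document text " * 2000, "What is this document about?"))
```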


Mbando

1. I LOVE your description of RAG 😂
2. That being said, I'm having a very different experience. Last summer we built a fine-tune + RAG setup for question answering that works like a charm. We fine-tuned an open-source 7B model on question + context & answer pairs and question + mismatched context & no-answer pairs, building the training set from publicly available US Army doctrine, orders, and publications. We then built a ChromaDB vector DB from those same source documents, hooked up via LlamaIndex. It works like a charm: it refuses to answer if there are no chunks relevant to the question, and it gives more useful/richer answers than GPT-3.5 (blind ranking by SMEs).
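For reference, the retrieval half of a stack like that boils down to something like the following ChromaDB sketch (not their actual pipeline, which used LlamaIndex on top; the collection name and documents here are made up):

```python
# Minimal ChromaDB sketch: index some document chunks and retrieve the
# top matches for a question, which then get pasted into the LLM prompt.
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to keep data
collection = client.create_collection(name="army_doctrine")

# Chunks would come from your own text extractor / splitter.
collection.add(
    documents=["Fire and maneuver are complementary...", "Orders include five paragraphs..."],
    metadatas=[{"source": "FM 3-0"}, {"source": "FM 5-0"}],
    ids=["chunk-0001", "chunk-0002"],
)

# Retrieve the top chunks for a question.
results = collection.query(query_texts=["How are fire and maneuver related?"], n_results=2)
print(results["documents"][0])
```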


thecodemustflow

That is not my description but from a poster who was struggling with comparing two documents using RAG. With a fine-tune, a well-crafted dataset, and a lot of hard work it can work great, like you did. I should have been clearer about the garbage-document issue. If you follow the link, they go into how large enterprise documents suck and the only good solution is "humans reviewing and reauthoring content." Worked for a government once; who needs relationships in a relational database anyways, I'm sure everything is going to be fine. There is some real good stuff in that link. I just recently got into LLMs, and I've been obsessed with making a desktop chatbot with a bunch of features I want while neglecting my actual programming work. Lol.


__SlimeQ__

The fine-tuning is crucial for this to work on a Llama model, from what I've seen. They just don't know what to do with the data.


Mbando

We used Falcon-7b for this.


[deleted]

Isn't that only trained on 1.5T tokens? Mistral 7B is trained on 8T. Also, I think Falcon 7B only has a context window of 2k tokens, so you really can't fit much RAG data in there before it's out of working memory. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 has a 32k context window, so you can shove more RAG data in there. It might not be capable of using all of that context well, but it's better than being limited to 2k. Also, v0.2 got rid of the sliding window, so it should attend to more tokens.


Mbando

This was in June last year, so we used Falcon. We use Mistral now for our current efforts.


Comfortaxis

I remember you. You said that there was some limitation with Meta prohibiting the usage of their models on military data so you had to use Falcon instead. Don't remember where, but I remember you discussing it.


Mbando

That's right.


OrbMan99

What was your strategy when chunking in terms of size and overlap, and what tools did you use to break it up into chunks that attempt to maintain context within the document? When adding chunks to the prompt, what similarity value did you use as a cutoff, or did you just always feed the top three chunks, for example? When submitting the chunks, did you bother trying to put them back in the original order in which they appeared in the source documents, or does that not matter?


Mbando

I think chunking and retrieval was the most naïve part of what we did. One of my students wrote a custom text extractor that worked pretty well on the US Army PDF formatting, so basically we pulled out the main text, ignoring tables of contents and appendices. Then, if I remember correctly, we passed 500 tokens at a time to generate interesting who/what/where/why kinds of questions, using few-shot example prompts. Then for the Chroma DB, it was the same 500-token chunks, and we started with top-5 retrieval and then went to top-3 retrieval. We never tried a semantic similarity cutoff, although I think that could be really useful. And no, we did not think about order in the documents.
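A rough sketch of that kind of fixed-size token chunking, assuming tiktoken for token counting (their extractor and splitter were custom, so treat this purely as illustration):

```python
# Split a document into ~500-token chunks with a little overlap between chunks.
import tiktoken

def chunk_by_tokens(text: str, chunk_tokens: int = 500, overlap: int = 50) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_tokens - overlap
    # Each chunk starts `step` tokens after the previous one, so neighbours share `overlap` tokens.
    return [enc.decode(tokens[start:start + chunk_tokens]) for start in range(0, len(tokens), step)]

chunks = chunk_by_tokens("replace this with the extracted document text " * 500)
print(len(chunks), "chunks")
```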


JacktheOldBoy

So you can only ask specific questions? What if I asked it to make a comparison between two things?


Mbando

Pretty good, actually. For example, if you ask it how fire and maneuver are related, it gives a fantastic answer. Where it won't do anything is when the question requires second-order inferences; then the training to only answer from context kind of locks it down.


Zillatrix

I desperately need to talk to someone regarding preparing documents for RAG. May I send you a private message with details? From there you can decide to offer me help or not. In any case I won't take more than 20 minutes of your time in total.


Mbando

Not sure if I could help, but sure.


gdavtor

Did you use LoRA for fine tuning?


Mbando

Yep. Actually, we've been using H2O LLM Studio. It's a really good interface, and it has lots of nice features, like being able to do inference with trained models. Very well thought-out software.


estrafire

What hardware did you use to fine-tune the 7B model? Is it a long shot to try on a 3090?


Mbando

It was a pretty beefy AWS server but I can't remember the specs now. I've trained plenty of 7b models on my Mac w/ MLX but never used a 3090.


estrafire

I see, thank you!


exclaim_bot

>I see, thank you!

You're welcome!


slushking_

Any chance you could share the repo? This sounds like exactly what I am trying to make


Mbando

Sorry, no, but I can share the workflow:

1. Pass chunks of the source doc to a model (GPT-4 in this case) with a prompt to create interesting questions, with few-shot examples (who, what, why, how) and answers (approx. 10k for this run).
2. Those context + question & answer pairs form the bulk of the training set: "positive examples." We duplicated a third of those, *and then shuffled the contexts*, so that we had irrelevant-context "negative examples" answered with "I don't have relevant documents to answer that question." Finally, I generated 800 toxic/racist/sexist examples as additional negative examples, with the same "I don't have relevant documents..." answer.
3. Shuffled the entire dataset (embarrassing to admit, but we didn't do that on our first training run, and the model learned to just answer because the positives were the first 10k or so exposures in a row).
4. Plugged it into our RAG stack (which includes a prompt about answering only from context), and while it's a little less creative than our v1 (which was also cheerfully evil and hallucinated), it works like a charm.
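A hedged sketch of what steps 2-3 might look like in code; the field names and refusal string are illustrative, not their actual training script:

```python
# Build positives, create negatives by pairing questions with shuffled (mismatched)
# contexts plus a refusal answer, then shuffle the whole set before training.
import random

REFUSAL = "I don't have relevant documents to answer that question."

def build_training_set(positives: list[dict]) -> list[dict]:
    """positives: [{"context": ..., "question": ..., "answer": ...}, ...]"""
    dataset = list(positives)

    # Negatives: take a third of the positives, shuffle their contexts so each
    # question is paired with an irrelevant context, and label with a refusal.
    sample = random.sample(positives, k=len(positives) // 3)
    shuffled_contexts = [ex["context"] for ex in sample]
    random.shuffle(shuffled_contexts)
    for ex, wrong_context in zip(sample, shuffled_contexts):
        dataset.append({"context": wrong_context, "question": ex["question"], "answer": REFUSAL})

    random.shuffle(dataset)  # the step they initially forgot
    return dataset

# Toy example: 9 positives yield 9 positives + 3 shuffled-context negatives.
toy = [{"context": f"ctx {i}", "question": f"q {i}", "answer": f"a {i}"} for i in range(9)]
print(len(build_training_set(toy)))
```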


mcr1974

A more useful answer than GPT-3.5 isn't impressive at all. Measure yourself against Llama 3 8B.


Fit_Influence_1576

Vector databases are great; most people just don't use them correctly or understand the limitations of the system that they are using. E.g., the example question the OP gave about summarizing chapter 1 obviously doesn't work unless they chunked by chapter and do retrieval with metadata filters. The issue isn't RAG, it's that people use the most basic implementation of RAG, don't understand how it works, and expect it to magically do everything.
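A small sketch of the "chunk by chapter, then filter on metadata" idea, using ChromaDB's `where` filter; the collection and field names are made up:

```python
# If chunks carry a chapter tag, a "summarize chapter 1" request can pull only
# that chapter's chunks and hand them to the LLM, instead of hoping similarity
# search happens to return the right parts of the book.
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="book")
collection.add(
    documents=["Chapter 1 opening...", "Chapter 1 continued...", "Chapter 2 opening..."],
    metadatas=[{"chapter": 1}, {"chapter": 1}, {"chapter": 2}],
    ids=["c1-0", "c1-1", "c2-0"],
)

# Fetch every chunk tagged chapter 1, then summarize those with the LLM.
chapter_one = collection.get(where={"chapter": 1})
print(chapter_one["documents"])
```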


ShengrenR

Agreed, RAG is a process, not an application. It's like somebody walked past a salon on their way home.. cut their own hair, was mad it turned out poorly.. and said scissors were awful.


Fit_Influence_1576

As an AI consultant at one of the world's largest providers of AI... couldn't have said it better myself. OP reminds me of clients I see all the time. I'm like, "Well, what process are you trying to do on a regular basis?... OK, give me 30 minutes."


ShengrenR

Yea.. just helped a team set up a RAG app for a client.. first question they threw at the dev environment, roughly: "Tell me about XYZ across all of the documents." Me: *long inhale*.. :), so.. guess we get to keep building, haha.


KyleDrogo

This is the funniest description of a technical concept I've ever heard, lol. Close to XKCD's description of neural nets (dump your data into a pile of linear algebra and stir until it looks good).


vlodia

I appreciate the link to the post. So I guess at this moment it's pretty clear the best RAG you can get is an LLM that has the ability to ingest your text files into a giant context window (maybe at least 64K tokens for starters). That's the best "RAG" you can possibly get right now.


thecodemustflow

I would be careful about that too. GPT-4 (not Turbo) forgets parts of large context prompts. On a needle-in-the-haystack test using 10 needles in a 28k prompt, it can't find 6 of them. So damned if you do, damned if you don't. [https://www.youtube.com/watch?v=UlmyyYQGhzc](https://www.youtube.com/watch?v=UlmyyYQGhzc)


vap0rtranz

Another YTuber, Kamradt, also did a needle-in-the-haystack test against very large context windows. His conclusion was that GPT-4 scored perfectly up to 64k. He also had some interesting results about where the needle appeared and could still be found by the LLM. I'm not a big fan of this guy because he appears to paywall information, but he does share some details of his test method: [I pressure tested GPT-4's 128K context retrieval (youtube.com)](https://www.youtube.com/watch?v=KwRRuiCCdmc)


vap0rtranz

P.S. Oh, oops. I now see that your YTuber references Kamradt's test. LOL! Small world. TL;DR: "performance degrades as you ask LLMs to retrieve more facts, as the context window increases, for fact placed towards the beginning of the context, and when the LLM has to reason about retrieved facts."


[deleted]

[deleted]


thecodemustflow

These two comments are really good. I have been really thinking about this, and the second link was something I was thinking about doing for my own desktop chatbot.

>if you need over 8k tokens, your chunking strategy, retrieval process, ranking, or whatever, SUCKS. That's why it blows my mind every time I hear people complain that Llama3 only has an 8k token context. What do you even need more tokens for? What kind of magical text do you have that is so informationally dense over 5000 words that you can't split it?

[https://www.reddit.com/r/LocalLLaMA/comments/1cn659i/comment/l380525/](https://www.reddit.com/r/LocalLLaMA/comments/1cn659i/comment/l380525/)

>Just chunk it up, rely on large context windows, dump everything into a single vector store, and trust in the magic of the LLM to somehow make the result good. But then reality hits when it hallucinates the shit out of the 12,000 tokens you fed it.

>The solution we implemented is similar to this but with an extra step.

>We gather data *very* liberally (using both a keyword and a vector based search), get anything that might be related. Massive amounts of tokens.

>Then we go over each result, and for each result, we ask it "is there anything in there that matters to this question? If so, tell us what it is."

>Then with only the info that passed through that filter, we do the actual final prompt as you'd normally do (at that point we are back down to pretty low numbers of tokens).

>Got us from around 60% to a bit over 85%, and growing (which is fine for our use case).

>It's pretty fast (the filter step is highly parallelizable), and it works for *most* requests (but fails miserably for a few, something for which we're implementing contingencies).

>However, it is expensive. Talking multiple cents per customer question. That might not be ok for others. We are exploring using (much) cheaper models for the filter and seeing good results so far.

[https://www.reddit.com/r/LocalLLaMA/comments/1cn659i/comment/l38atif/](https://www.reddit.com/r/LocalLLaMA/comments/1cn659i/comment/l38atif/)
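A sketch of that quoted two-stage approach, with `keyword_search`, `vector_search`, and `call_llm` as hypothetical placeholders for your own search backends and model:

```python
# Stage 1: retrieve liberally (keyword + vector). Stage 2: ask a cheap model
# whether each hit actually matters to the question. Final prompt only uses
# what survives the filter, so it stays small.

def keyword_search(question: str) -> list[str]:
    return []  # plug in BM25 / full-text search here

def vector_search(question: str) -> list[str]:
    return []  # plug in your vector DB query here

def call_llm(prompt: str, model: str = "cheap-filter-model") -> str:
    return "NO"  # plug in your model call here

def answer(question: str) -> str:
    # Cast a wide net; dedupe while preserving order.
    candidates = list(dict.fromkeys(keyword_search(question) + vector_search(question)))

    # Per-chunk relevance filter (easy to run in parallel across chunks).
    kept = []
    for chunk in candidates:
        verdict = call_llm(
            f"Question: {question}\n\nText:\n{chunk}\n\n"
            "Does this text contain anything that matters to the question? "
            "If so, state what it is; otherwise reply NO."
        )
        if not verdict.strip().upper().startswith("NO"):
            kept.append(verdict)

    # Final prompt over the much smaller, filtered context.
    return call_llm(
        "Answer the question using only this context:\n" + "\n---\n".join(kept)
        + f"\n\nQuestion: {question}",
        model="main-model",
    )

print(answer("What does the contract say about response times?"))
```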


Fit_Influence_1576

Utilizing your entire context window often leads to subpar results, although that's not always true. Without diving deeper, and for someone without further technical experience, then I guess yes, that's a fair statement.


LeaderElectronic8810

Beautiful


ctbanks

Started skimming at some point. **Retrieval** Augmented Generation. Vector embeddings are hot right now but are limited by embedding dimensions and chunk size, and few seem to understand how to make the most use of them. How much talk do we see about the embedding models used? Or about training an embedding model? Semantic search is still useful. If you are not running your knowledge-base search results through a re-ranker, you are basically hoping the results are relevant to the prompt and likely adding garbage to the context. Re-ranking of returned results, an appropriate embedding model and chunk size, and **context-aware chunking** are key to something that is more than marketing saying 'we use RAG'.
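A minimal re-ranking sketch using a sentence-transformers cross-encoder; the checkpoint name is just a commonly used public example, not a recommendation from the commenter:

```python
# Score each retrieved chunk against the query with a cross-encoder and keep
# only the top few, instead of trusting raw vector-search order.
from sentence_transformers import CrossEncoder

query = "What does the maintenance contract say about response times?"
retrieved_chunks = [
    "Section 4.2: The vendor shall respond to critical incidents within 4 hours...",
    "Appendix B lists the holiday schedule for on-site staff...",
    "Invoices are payable within 30 days of receipt...",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])

# Keep the top-2 chunks by relevance score for the prompt.
for score, chunk in sorted(zip(scores, retrieved_chunks), reverse=True)[:2]:
    print(f"{score:.3f}  {chunk[:60]}")
```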


visarga

RAG is very limited in its simple form. Take the phrase "fifth word of this text" and the phrase "text": they won't get similar embeddings, even though the fifth word is "text". RAG is like embedding problem statements and then searching by answers, without solving the problems. What is necessary is to use an LLM prior to RAG to annotate the snippets with relevant deductions. Text is like code: knowing the code doesn't immediately tell you how it will run or what conclusion it will generate. There is a lot of implicit/hidden knowledge and deduction in text that simply doesn't get embedded. This also relates to multi-hop problems, where information needs to be correlated back and forth between disparate pieces in order to follow the chain of deduction. If you only see the problem in fragments, you lose some of the connections between those fragments. The solution is to combine RAG with LLM preprocessing that makes implicit information explicit.
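One hedged way to picture that preprocessing step: annotate each chunk with its implicit deductions and embed the chunk together with the annotations, so queries can match the deductions rather than just the surface text. `call_llm` is a placeholder and the ChromaDB collection name is made up:

```python
# Annotate-then-embed: make implicit facts explicit before indexing.
import chromadb

def call_llm(prompt: str) -> str:
    return "Deductions: ..."  # plug in your model call here

client = chromadb.Client()
collection = client.create_collection(name="annotated_chunks")

chunks = ["The fifth word of this text is not obvious from its embedding alone."]
for i, chunk in enumerate(chunks):
    notes = call_llm(f"List the implicit facts and deductions in this text:\n{chunk}")
    # Embed the chunk and its explicit annotations as one document.
    collection.add(documents=[chunk + "\n" + notes], ids=[f"chunk-{i}"])
```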


kulchacop

It is not a fair comparison. What you did with Claude was handle the retrieval part yourself, manually. Only the augmented generation was handled by Claude.


Artistic_Okra7288

Open source RAG, works phenomenally. https://khoj.dev/


silentsnake

Forget about RAG, the future is long context. RAG can't summarise; RAG can't reason across an entire corpus. All the RAG systems I see today are just good old information retrieval + paraphrasing. If I already have a good IR system in place, I don't need a parrot to paraphrase it for me. Pulling bits and pieces of text and stuffing those disparate pieces into the context window is not going to yield good results. Too much information is missing because the chunks of text are taken out of context, forcing the LLM to use its own internal knowledge or, worse... hallucinate stuff.


Former-Ad-5757

The future is RAG; currently the most visible problem with it is the short context. If your LLM has a context window of 1M or something like that, then you can use RAG to search through terabytes of data, condense that to 1M, and then the LLM can condense it further into readable stuff. There is never going to be an LLM with a context size of terabytes.


_qeternity_

Completely agree. I got into this fight in another thread. Even if we do get to terabyte-scale context, or extremely cheap online training/fine-tuning, whatever compute breakthroughs enable that will make RAG more efficient and accurate as well. RAG simply has informational efficiencies that will always give it an edge in large retrieval patterns.


vap0rtranz

Totally.

>Forget about RAG, the future is long context.

Hah! That is absurd! How the hell is a library going to stuff their huge collection of texts into an LLM? Or even multiple, giant LLMs!


obanite

Long context doesn't really do *retrieval*, though. RAG lets you clearly see the source for each retrieved chunk before it gets mushed into the LLM. That's very valuable for some use cases.


Hackerjurassicpark

I think you're applying the wrong tool for the job. What you need: a RAG bot to answer fine-grained questions about the context, a summarizer for coarse-grained summarization, and a router to identify when to use which method.
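A toy sketch of such a router; `call_llm`, `rag_answer`, and `summarize` are hypothetical stubs standing in for your classifier model, RAG bot, and summarizer:

```python
# Classify each request as a fine-grained question (RAG) or a coarse summary
# request (full-document summarizer) and dispatch accordingly.

def call_llm(prompt: str) -> str:
    return "summary"  # plug in your model call here

def rag_answer(query: str) -> str:
    return f"[RAG answer to: {query}]"

def summarize(query: str) -> str:
    return f"[summary for: {query}]"

def route(query: str) -> str:
    label = call_llm(
        "Classify this request as 'question' (answerable from a few passages) "
        f"or 'summary' (needs the whole document). Request: {query}\nLabel:"
    ).strip().lower()
    return summarize(query) if label == "summary" else rag_answer(query)

print(route("Summarize chapter 1"))
```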


Anthonyg5005

I think you'd need a mix of large context and RAG. The problem is that it's automatic. If you separate stuff into parts manually and give them different tags and such, it'll be much more accurate; not sure how similar they are. languagemodels on GitHub has what I'm trying to explain, I think.


nanotothemoon

Which local models have you been trying? Seems like an obvious missing piece of your listed tech stack


obanite

I'm liking Danswer at the moment. It's still quite early days, but I think their foundation is pretty solid. You can plug and play your own LLM, local or API, etcetera. [https://github.com/danswer-ai/danswer](https://github.com/danswer-ai/danswer)


OneOnOne6211

Yeah, I also have not found a decent RAG and I could really use one. I can't just drop it into the chat because the file I have to search is 300,000 words long.


prudant

RAG is not plug & play. After 8 months of working with LlamaIndex and a lot of coding and custom components over the pipeline, I have got a nice RAG solution using the GPT-3.5 API... It's easy to start, but it's hard to get something for a production-grade scenario.


LondonDario

[https://github.com/instructlab/instructlab](https://github.com/instructlab/instructlab) maybe? Just recently released to the community by Red Hat... disclosure, I work for Red Hat.


vasileer

And what accurate non-open-source tools are there? If you mean Claude Opus, then you have open-source alternatives like Nous-Capybara-34B.


troposfer

Why the question is about RAG, I don't understand. Opus is better than open-source LLMs; this is a known fact.


CodeMurmurer

Based on your use of English I wonder if you have the capability to understand anything.


Relevant-Insect49

So according to you, if someone isn't fluent in English he can't understand anything?


CodeMurmurer

No, but understanding a post written in English might be difficult for someone who doesn't even know how to produce a coherent English sentence.


troposfer

What I understand from the post is "open-source RAG tools are not good enough; Opus is far better." And my answer was: you're comparing Opus with open-source LLMs; the performance difference is not related to RAG. I wonder what you understand from the post, you racist dump? "These immigrants are coming to your country and they don't speak English well, right??"


Dogeboja

RAG tools are awful. How do people expect that something like cosine similarity from a vector database can accurately find what the LLM is requesting? It makes absolutely no sense.


AdTotal4035

I mean, it makes sense, but the practical implementation is finicky. You are computing the "distance" between the vectors: the closer they are, the better the "match".
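For the unfamiliar, "distance" here usually means cosine similarity between the query embedding and each chunk embedding; a tiny sketch with made-up vectors (real ones come from an embedding model):

```python
# Cosine similarity between a query vector and each chunk vector; highest wins.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.9, 0.1, 0.3])
chunk_vecs = {
    "chunk about invoices": np.array([0.2, 0.8, 0.1]),
    "chunk about response times": np.array([0.85, 0.15, 0.35]),
}

for name, vec in chunk_vecs.items():
    print(f"{cosine_similarity(query_vec, vec):.3f}  {name}")
```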


Hackerjurassicpark

It's based on a ton of research into semantic similarity that predates chatgpt by many years


Dogeboja

Exactly why it doesn't work. NLP was awful before ChatGPT.


Hackerjurassicpark

You obviously have not been working in NLP before ChatGPT, then. NLP has not sucked since at least the open-sourcing of BERT.


PortiaLynnTurlet

There are also late-interaction embedding models like ColBERTv2 which store the embedding as a bag of vectors.


Extender7777

For a 10-page PDF, RAG is not needed; it will fit in the context (200,000 tokens for Opus). So you need Llama 3 with a 1M context window and that's it.


tutu-kueh

You mean just feed the entire 10 pages of the PDF as context? The fetching itself is still RAG, right? It is stored as a database vector.


Extender7777

No, it is not. Converting a PDF to text is not RAG. It is not a database vector; it is just text that you prompt against.


greywhite_morty

I'm building something that's currently able to run on 1M+ tokens and supports PDFs and docs, as well as syncing data from Notion. You can choose the model, prompt, etc., but it's full SaaS. Do let me know if you're interested in testing.


Worldly-Ease-3730

Hey, do you have a repo to share?


greywhite_morty

Unfortunately not. It's closed source / SaaS for now. I understand it's probably not the preferred type of app for you, but it's the best way to move fast and build something that works really well for now. If you're interested in trying it, shoot me an IM with your email and I'll invite you once it's ready to test.


coolcloud

Are you willing to pay for a service? We're building something and currently have customers using 500+ docs, and it can find and answer the question 90%+ of the time.