Run llama3 locally with 1M token context (ollama.com)
204 points by mritchie712 20 days ago | 72 comments



I've been using Gemini 1.5 with 1M tokens and it's a totally different game. You can essentially "train" the model on whatever hyper specific stuff you have on hand.

Don't know how to work with my test system? No problem, here is an 800 page reference manual. Now you know.


I'm curious what your setup is? I mean what platform, how much RAM and which particular version of the model?


They are talking about Google's Gemini, not running locally.


Oops I confused Gemini and Gemma.


I agree, definitely a game changer. Finally made me a convert.

Re: latency: yes, it can get quite high for large contexts. Minutes, even tens of minutes, when you try to max out the context.


Could you clarify whether you’re context stuffing the whole document on every query, or something else?


Yeah, looks like they misused the word "train" and instead meant context stuffing.
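For anyone curious what context stuffing looks like in practice with the model from this post, here is a minimal sketch using the ollama Python client (assumes the llama3-gradient model is already pulled; the manual path and question are hypothetical, and num_ctx is the option the ollama page tells you to raise):

    import ollama  # pip install ollama; assumes a local ollama server with llama3-gradient pulled

    manual = open("test_system_manual.txt").read()  # hypothetical 800-page reference manual

    response = ollama.chat(
        model="llama3-gradient",
        messages=[
            {"role": "system", "content": "Answer using only the reference manual below.\n\n" + manual},
            {"role": "user", "content": "How do I reset the test fixture to factory defaults?"},
        ],
        options={"num_ctx": 256000},  # raise the context window; the page notes 256k needs 64GB+ of memory
    )
    print(response["message"]["content"])

Nothing is "trained" here; the whole manual gets re-sent and re-processed on every request.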



What kind of latency do you observe when using Gemini with such large contexts?


Around 250k tokens it's around 15-20 seconds. When you really pack it full (which I have only done just to test it), it can take a minute before it outputs anything. It also sometimes glitches out with massive contexts and just stops generation mid-sentence. Asking it to repeat the response clears it up, though.


Inference takes 30+ seconds before it starts outputting even the first token.


How's recall over 1M tokens?


"run locally" is meaningless if you don't state the hardware requirements unless you don't mind wasting the readers time making him or her figure it out only find out it takes a $20,000 computer. Or is this just assumed?


$20,000 to run a close to state-of-the-art model is very reasonable. Well within the grasp of most startups and well off individual "gentleman scientists".


I aspire to be a well off individual gentleman scientist


Noob here. How would I figure out whether my machine would handle a particular model?


Check the RAM requirements for the model (can be hard to find, though) and compare to the available RAM you want to run it in (VRAM if you are running on GPU, system RAM if running on CPU.)

E.g., this extended-context Llama 3 70B requires 64GB at 256K context and over 100GB at 1M.
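As a rough rule of thumb for the weights part (a back-of-the-envelope sketch that ignores quantization overhead and the KV cache):

    # rough weight footprint: parameter count x bytes per parameter
    def weights_gb(params_billion, bits=16):
        return params_billion * bits / 8

    print(weights_gb(8))      # llama3 8B at fp16: ~16 GB of weights
    print(weights_gb(70))     # llama3 70B at fp16: ~140 GB
    print(weights_gb(70, 4))  # 70B at 4-bit: ~35 GB
    # the KV cache for long contexts comes on top of this, which is why the
    # 256K / 1M figures quoted above are so much larger than the weights alone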


man, exactly. it gets on my nerves.


I asked Claude Opus about it and got an estimate. It's going to take over 100GB of RAM (a 128GB RAM stick can cost around $1,500) and probably multiple Nvidia A100s ($10,000 each) to run locally, so it's possible if you're rich or work is supplying the hardware or cloud.

What are the hardware requirements to run the model I gave you?

Based on the provided document, the hardware requirements for running the llama3-gradient model depend on the desired context window size:

To use a 256k context window, at least 64GB of memory is required. To use a 1M+ context window, significantly more memory (100GB+) is needed.

The exact requirements are not specified, but it's clear that running this model with very large context windows requires a machine with a substantial amount of RAM, likely a high-end workstation or server rather than a typical consumer PC. The computational demands also likely necessitate a powerful CPU and/or GPU.

What type of CPU, what type of GPUs, and how many would it take?

The document does not provide specific details on the CPU or GPU requirements for running the llama3-gradient model. However, based on the memory requirements mentioned and the nature of the model, we can make some educated guesses:

CPU:

For optimal performance, a high-end, multi-core server-grade CPU would likely be necessary, such as an Intel Xeon or AMD EPYC processor with a high core count (e.g., 16+ cores).

GPU:

Running large language models often benefits greatly from the use of one or more powerful GPUs. High-end consumer GPUs like the Nvidia GeForce RTX 3090 (24GB VRAM) or data center GPUs like the Nvidia A100 (40GB-80GB VRAM) would likely be suitable.

The number of GPUs required would depend on the specific implementation and the desired performance. A single high-end GPU might be sufficient, but using multiple GPUs could speed up processing.

Keep in mind that these are estimates based on the provided information. The actual hardware requirements may vary depending on the specific implementation, optimizations, and performance requirements. Running such a model with very large context windows would likely require a powerful, expensive system with ample RAM, fast storage, and robust cooling. Detailed benchmarks or recommendations from the model's developers would provide more definitive guidance on the exact hardware needed.


Great. Thanks!

I wish there was something like "can you run it" but for LLMs.


I second that! The lack of a tool for this is nuts.


Tested this on Apple Silicon at 128k context, very poor results with random character hallucinations.


If you got random tokens on Apple Silicon Macs, you should check your stderr for Metal’s error code, like 5 for out of metal memory. Metal lets you use half of your unified memory as metal memory.


Thanks for testing it. I wonder if there’s a reasonable line? 64k? How much RAM did the 128k context require? (Maybe future models might be better at that level)


Pardon my ignorance, but I am a dilettante. I was just wondering about this earlier today.

How much unlocking of a more active intelligence from LLMs, will far larger context length provide?

Would virtually unlimited context move us much closer to an LLM where training is continuous? Is that a bit of a holy grail? I assume not, but would love to know why.


> How much unlocking of a more active intelligence from LLMs, will far larger context length provide?

Remarkably, foundation models can learn new tasks via a few examples (called few-shot learning). LLM answers also significantly improve when given relevant supplemental information. Boosting context length: grows its "working memory"; provides richer knowledge to inform its reasoning; and expands its capacity for new tasks, given germane examples.

> Would virtually unlimited context move us much closer to an LLM where training is continuous?

No. You can already continually train a context-limited LLM. Virtually unlimited context window schemes also exist. Training is a separate concept from context length. In pre-training, we work backward from a model's incorrect answer, tweaking its parameters to more likely say the correct thing next time. Fine-tuning is the same, but focused on specific tasks important to the user. After training, when running the model (called inference), you can change the context length to suit your needs and tradeoffs.
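To make the few-shot point concrete, "learning a new task from examples in the context" is just a prompt shaped like this (an illustrative sketch; the task and examples are made up):

    few_shot_prompt = """Classify each support ticket as positive or negative.

    Ticket: "The export button finally works, thanks!" -> positive
    Ticket: "App crashes every time I open settings." -> negative
    Ticket: "Loving the new dark mode." -> """
    # the model is expected to continue with "positive"; no weights were updated,
    # the "learning" lives entirely in the context window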


Thank you. If you don't mind...

> tweaking its parameters to more likely say the correct thing next time.

Is this entirely, or just partially done via human feedback on models like GPT-4 and LLama-3, for example?


Training is done by partially hiding (masking) text and making a (decoder-only) model predict what's missing. Training on text which humans prefer as model responses is called RLHF; that preference data is an expensive and comparatively small dataset.
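As a toy illustration of that "predict what's missing" objective, a sketch with made-up shapes standing in for a real transformer stack:

    import torch
    import torch.nn.functional as F

    vocab, dim = 1000, 64
    emb = torch.nn.Embedding(vocab, dim)   # stand-ins for a real decoder-only model
    head = torch.nn.Linear(dim, vocab)

    tokens = torch.randint(0, vocab, (1, 16))        # a tiny "document"
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # hide each next token, then predict it

    logits = head(emb(inputs))
    loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
    loss.backward()  # these gradients are what "tweak the parameters" toward the right answer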


Look into agent training.


The other thing to consider is that, at least today, more context also means more chances at hallucinations or otherwise imperfect recall that goes into giving an answer. This has certainly been my experience with larger context windows.


Thanks, I was wondering about this. Has this been your experience across many models, universally? Or, are some worse than others at what I just learned is called In-Context Learning?


This has been my experience across all of them, yes. Especially when I ask it to select a decently-sized subset of the text I pass in, as opposed to just doing a needle-in-the-haystack type thing.


Recall ability varies quite a bit. GPT-4-Turbo's recall becomes quite spotty as you reach 30-50k tokens, whereas Claude-3 has really good recall over the entire context window.


context length doesn't unlock intelligence, just more information. e.g. adding info that wasn't part of training (or wasn't heavily weighted in training).


> context length doesn't unlock intelligence, just more information

This isn't correct (for most definitions of "unlock intelligence"). In-Context Learning (ICL) can do everything off-line training can do.

> We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases and can learn high-dimensional functions with numerical inputs.

https://arxiv.org/abs/2404.11018


Right, I should have been more clear than the words "active intelligence." But as one use of this... would unlimited, say 1 to 10 billion tokens of context, used as a system prompt, with "just" 32k left for the user, allow a model to be updated every day, in between actual training? (This is in the future, where training only takes a month, or much less.)

I guess part of what I really don't understand is how context tokens compare to training weights, as far as value to the final response. Would a giant context window muddle the value of weights?

(Maybe what I am missing is the human-feedback on the training weights? If the giant system prompt I am imagining is garbage, then that would be bad.)


In-context learning (ICL) is a thing. People aren't entirely sure how it works[1].

LLMs are very effective at few-shot learning via ICL, so for all practical purposes, yes, large context windows do allow for continuous learning.

Note that the context needs to be loaded and processed on every request to the LLM though - so all that additional information has to be "retaught" each time.

[1] https://openreview.net/pdf?id=992eLydH8G "These results indicate that the equivalence between ICL (ed: In-context-learning) and GD (ed: Gradient Descent) is an open hypothesis, requires nuanced considerations, and calls for further studies."


Thank you so much for your response. It's amazing what typing a little acronym like "ICL" can do as far as sharing knowledge. This is so cool!

Also, your link appears to exactly address my question. It's late here, but I am very excited to do my best at understanding that paper in the morning.


I mean, working memory is an aspect of intelligence.


Our results are varied on long contexts. True — one can put a lot of stuff in there and the thing groks a bunch of it. Being able to synthesize specific answers from “long tail” facts in the context can be difficult. YMMV.


> Note: using a 256k context window requires at least 64GB of memory. Using a 1M+ context window requires significantly more (100s of GBs).

The RAM requirements at 200k seem to be assuming 16 bits? With 4-bit quantization it's more like ~12GB of KV cache for 200k sequence length. Unless I'm missing something?
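For reference, a back-of-the-envelope KV-cache estimate, assuming Llama-3 8B shapes (32 layers, 8 KV heads via GQA, head dim 128; an assumption on my part, not something stated on the ollama page):

    def kv_cache_gb(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2):
        elems_per_token = 2 * layers * kv_heads * head_dim  # K and V, every layer
        return tokens * elems_per_token * bytes_per_elem / 1e9

    print(kv_cache_gb(256_000))                       # ~33.6 GB at fp16
    print(kv_cache_gb(1_000_000))                     # ~131 GB at fp16
    print(kv_cache_gb(200_000, bytes_per_elem=0.5))   # ~6.6 GB at 4-bit

Add ~16GB of fp16 weights on top and the fp16 numbers land in the same ballpark as the ollama note, so 16-bit KV does look like the assumption.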


Are people quantizing the activations? I was under the impression that people generally quantize the parameters but leave the activations unchanged.


You can quantize KV caches


How much does this do to performance though?


Is this better than the large context window versions of dolphin-llama3?

e.g. https://ollama.com/library/dolphin-llama3:8b-256k-v2.9-q6_K


Can we expect that the 1M version has the same level of intelligence as the vanilla version? Across the whole 1M tokens without degradation? What are the trade offs?


Well, according to LocalLLaMa, it's surprisingly good: https://www.reddit.com/r/LocalLLaMA/s/ypFd2Xlnvf


"Note: using a 256k context window requires at least 64GB of memory. Using a 1M+ context window requires significantly more (100GB+)."


It’s just so clear the math is wrong on these things.

You’ve got an apparent contradiction: SGD (AdamW at 1e-6 give or take) works. So we’ve got extremely abundant local maxima up to epsilon_0, but it always lands in the same place, so there are abundant “well-studied” minima, likewise symmetrical up to epsilon_1, both of which are roughly: “start debugging if above we can tell”.

The maxima have meaningful curvature tensors at or adjacent to them: AdamW works.

But joker in the deck: control vectors work. So you’re in a quasi-Euclidean region.

In fact all the useful regions are about exactly the same, the weights are actually complex valued, everyone knows this part….

The conserved quantity up to let’s call it phi is compression ratio.

Maybe in a year or two when Altman is in jail and Mme. Su gives George cards that work, we’ll crunch numbers more interesting than how much a googol FMA units cost, and Emmy Noether gets some damned credit for knowing this a century ago.


In an attempt to have some basic idea of what you wrote I turned to ChatGPT but ended up having a deeply dystopian conversation. Admittedly triggered by me.

https://chat.openai.com/share/d228e04e-ae36-4468-ac45-fdb035...


I was fascinated enough to read it to the end.


I think that's expected. SGD in such a high-dimensional space is exceedingly likely to find such a minimum.


I’m aware of both theoretical and empirical arguments that I find quite persuasive that you’re exactly right.

I think it’s extremely thought provoking why they would be symmetrically, locally Euclidean in such abundance.


I have no idea what you’re saying. It sounds important and interesting but I’d like some more details.


Roughly, all of the parameters of any NN (and many other models as well) can be thought of as spaces that are flatter or smoother in one region or another or under some pretty fancy “zooming” (Atiyah-Singer indexing give or take).

The way we train them involves finding steepness and chasing it, and it almost always works for a bit, often for quite a while. But the flat places it ends up are both really flat, and zillions of them are nearly identical.

Those two sets of nearly identical “places”, and in particular their difference in being useful via selection bias, are called together or separately a “gauge symmetry”, which basically means things remain true as you vary things a lot. The things that remain true are usually “conserved quantities”, and in the case of OpenAI 100% compressing the New York Times, the conserved quantity is compression ratio up to some parameter of lossiness.


I am probably way off base here, but what I think you are saying is that these 'flat regions' come close to lossless compression, and thus copyright infringement is occurring?


Not quite, the abuse of the commons in trivial violation of the spirit of our system of government is suggested (I’d contend demonstrated) by necessary properties of the latent manifolds.

The uniformity (gauge symmetry up to a bound) of such regions is a way of thinking about the apparent contradiction between the properties of a billion dimensional space before and after a scalar loss pushing a gradient around in it.


Okay, yeah obviously there is a loss of entropy.


Entropy is a tricky word: legend has it that von Neumann persuaded Shannon to use it for the logarithmic information measure because “no one knows what it means anyways”.

These days we have KL-divergence and information gain and countless other ways to be rigorous, but you still have to be kind of careful with “macro” vs “micro” states, it’s just a slippery concept.

Whether or not some 7B parameter NN that was like, Xavier-Xe initialized or whatever the Fortress of Solitude people are doing these days is more or less unique than after you push an exabyte of Harry Potter fan fiction through it?

I think that’s an interesting question even if I (we) haven’t yet posed it in a rigorous way.


Rare rule violation on the downvote thing.

Make a dissenting case or leave it the fuck alone. I’ve been pretty laid back about the OpenAI bot thing but I’m over it.


You are cloaking what seems to be an opinion of some kind (unclear what? Something about Sam Altman or maybe about copyright or maybe anti NYT lawsuit?) in obtuse math.

The conclusions you seem to draw are by no means conclusive and at best seem only vaguely related to the unclear moral, ethical or legal stance you seem to believe in.


Ben, honestly some of your comments are extremely abstruse to the point where people can't tell if you are serious or not. That's my take anyway (I don't have downvote power)


I think the contentious tone within the comment combined with the mathematical abstrusity makes it doubly difficult to determine whether the criticism stems from correcting proper high-level-math-other-people-don't-understand or just a personal axe to grind with... someone (openai? sam altman? gradient.ai? meta?).

To those who don't understand the math (or its implications/connections to the shared link here, especially since the link is sparse on details unless you know where to look), I could see how a reader would lean toward the latter and downvote without comment.


A sibling asked about the math, I gave some hopefully useful analogies. I’m happy to elaborate further on what a colossal waste of money the current mega autoencoders are and why this follows from some geometry and topology that’s table stakes in AI now.

“Jail Altman” is basically my sig now. And no one wonders why someone would be passionate about that. Not in good faith.


The mathematics is necessarily an argument sketch.

The single-minded, singular goal of seeing Sam Altman answer to a criminal jury for his crimes?

Serious as a guy happy to talk to journalists. I won’t sleep a full night until he hears a verdict carried by a bailiff.

Any remaining confusion?


@dang there are the same number of points on me finally accusing big Azure IPv4 blocks of manipulating the site as there are downvotes on an argument south of preprint but north of what passes for AI math here.

This is by no means the worst example this month. You run the best moderation team on the Internet, but no one at OpenAI (including Fidji) will flat deny they’re doing it, and it’s just obvious.

I know you’re doing yeoman’s work like always. Have someone let @sama know that at least one person is going to start making charts. Not here.


You are being downvoted because you are annoying.


Please elaborate?


I’ve heard feedback in this thread that I’m being mathematically abstruse, and that I’m being controversial or inflammatory or something.

I’m paying attention, but we are talking about a giant neural network trained by my friends and former colleagues at FAIR based mostly out of FBNY where I used to go every day, so, I’ll contend there’s some math involved: this is a topic for people who make a serious priority out of it these days.

The controversial piece no one is coming right out and saying, I think it’s my “fuck @sama” refrain.

Though how something that’s a meme on YouTube channels about TypeScript is a bigger topic than finally giving Emmy Noether her props (if she’d been a man she’d be far more famous than e.g. Heisenberg) eludes me.

I’m saying that an iconic mathematician and physicist deprived of her rightful place in history had it right, and once crooks like “Fired for Fraud Repeatedly” Altman and Madame Su are out of the picture, we might re-learn what she taught us.

On reflection? Fuck you, you’re annoying, ignorant, and a shill if your comments are anything to go by.


Yeah like I said you're really annoying


You could have learned something.

Instead we all wasted memory remembering that twice.

I plan to forget your username. I hope I never have cause to remember it.

Ronin, masterless. There’s no one to call me to heel if I take a dislike.


Ironic



