How can I create a local large language model?

I’m currently using two AI tools, Rosebud and Goblin Tools. I use Rosebud for writing in my journal and for any questions related to my entries, while I use Goblin Tools to help make my emails sound more professional (haha).

However, I’m a bit worried about how private these tools really are. Rosebud says it keeps my data private and doesn’t use my journal entries to learn or improve, but I’m not entirely sure, especially since last year they showed something similar to Spotify Wrapped that talked about my entries and topics.

I’m interested in creating my own large AI model that would have similar features—like journaling and helping me write in a certain style. I’m not sure if I should do this on a Mac or if I’d need a dedicated Linux computer. I also don’t know the safest and most private way to do this, or if it’s even possible. I saw a video where someone talked about building a large language model (LLM), but I didn’t quite understand all the details.

I’m not a super hacker or expert coder or tech genius, but I’m also not a total beginner. I’m good with getting my hands a bit dirty to figure things out.

Could anyone help me figure out where to start?

To some extent, the short answer is: you don’t.
The LLMs people use are generally far too big to train on consumer hardware.

You can run an existing model locally, assuming you have hardware that can run it effectively and that the model can be downloaded.
On higher-end GPUs it might also be possible to fine-tune an existing model.

I’m not that interested in LLMs, so this is about the extent of my knowledge, but apparently Mistral is a popular option for local use.

I don’t think you mentioned what hardware you currently have?
The setup differs depending on hardware and OS.
Most of the setups I’ve seen or heard of seem to use Ollama to run the model, so its instructions might be worth a look (there’s a minimal sketch of talking to it below).
Nvidia cards should just work, but I don’t know that for certain.
AMD cards generally seem to require ROCm, a sort of AI- and compute-specific driver stack, which I believe is only available on a few major Linux distros.
I don’t know about AI implementations on Intel cards or macOS. I’d expect an Intel GPU on Windows to work if you can find a DirectX implementation.
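
For what it’s worth, once Ollama is installed and a model is pulled, talking to it is just an HTTP request to localhost. A minimal sketch in Python (the model name is only an example; use whatever you’ve actually pulled):

```python
import requests

# Ask a locally running Ollama server a question over its HTTP API.
# Ollama listens on port 11434 by default; "mistral" is just an example model
# that you would first download with `ollama pull mistral`.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Summarise this journal entry in one sentence: ...",
        "stream": False,  # return a single JSON object instead of a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])  # the generated text
```

Nothing here leaves your machine; the request only goes to localhost.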

There are videos of people running some of the smaller LLMs on a Raspberry Pi, although I’m not sure whether the AI HAT (a sort of expansion card) is necessary.
It would at least be a lower-cost option, but I’m not sure how well models would perform under normal usage (as opposed to a YouTuber trying something out for a day or so and scrapping the project later).

Most of the time, Nvidia GPUs use CUDA to run LLMs. It’s remarkably well supported on Linux (considering the general state of Nvidia support on Linux) and barely supported on Windows (TensorFlow simply refuses to use CUDA on Windows).

As for ROCm, it mostly just doesn’t work: almost nobody cares about ROCm, so most tools don’t support it and most OSes can’t run it.
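
If you want to check what your Python stack actually sees, a quick sanity check (assuming PyTorch is installed; its ROCm builds also report through the same torch.cuda interface) looks like this:

```python
import torch

# Does the Python ML stack see a usable GPU?
# (ROCm builds of PyTorch also report here via the torch.cuda interface.)
if torch.cuda.is_available():
    print("GPU detected:", torch.cuda.get_device_name(0))
else:
    print("No CUDA/ROCm device visible; inference will run on the CPU.")
```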

If you have a relatively recent GPU (particularly Nvidia) or a MacBook Pro with an M1/2/3/4 chip, you can run a decent LLM locally with software like llama.cpp, a Mozilla llamafile, Ollama, LM Studio (non-FOSS) or jan.ai.

If you don’t, you’re in the same boat as me: you’re limited to smaller, more basic models at slower speeds on CPU/RAM only, but it’s still possible.
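
As a rough sketch of what the llama.cpp route looks like from Python (via the llama-cpp-python bindings; the GGUF file name below is just a placeholder for whatever quantized model you download, e.g. from Hugging Face):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Load a quantized GGUF model file (placeholder path; use the file you downloaded).
# n_gpu_layers=0 keeps everything on CPU/RAM; raise it if you have a supported GPU.
llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=4096,      # context window in tokens
    n_gpu_layers=0,
)

out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Rewrite this email so it sounds more professional: ..."}]
)
print(out["choices"][0]["message"]["content"])
```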

The r/LocalLLaMA community on Reddit is helpful for learning what is possible and practical. Hugging Face is also a good community to be aware of.


I’m not a super hacker or expert coder or tech genius, but I’m also not a total beginner. I’m good with getting my hands a bit dirty to figure things out.

I have done a few things with LLMs in my job (RAG, fine-tuning, vectorizing, chunking and implementation).

And to say it plainly: you can’t, and neither can I. No single individual has the resources to train a large language model from scratch; it takes far too much data and far too much compute. Organizations, countries and companies have that, but not you or me.

What we can do is use existing models offline and locally, build RAG setups on top of them, or fine-tune them. But for this you need¹ a solid technical background and a good understanding of machine learning, deep learning and LLMs in general.
If you want to learn, you can, and I can send you some links to get started.
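
To give a feeling for what “creating a RAG” means, here is a deliberately tiny sketch: embed your documents, pick the one closest to the question, and put it into the prompt. The embed() and generate() functions are only placeholders for whatever local embedding model and LLM you end up using:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: call your local embedding model here and return a vector."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call your local LLM here (e.g. through Ollama)."""
    raise NotImplementedError

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(question: str, documents: list[str]) -> str:
    # Retrieve: pick the document whose embedding is closest to the question.
    q_vec = embed(question)
    best = max(documents, key=lambda d: cosine(embed(d), q_vec))
    # Augment and generate: put the retrieved text into the prompt.
    prompt = f"Using only this context:\n{best}\n\nAnswer the question: {question}"
    return generate(prompt)
```

Real setups chunk the documents and use a vector database instead of a plain list, but the idea is the same.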

What you would normally do is install Ollama on a system that has a good GPU or NPU.
I would go with an RTX 4070 or higher. If you have such a GPU, you can get started with local AI.

If you want, you can also build your own server. But please don’t just follow the guides on YouTube; many of them use overkill or outright bad setups.
Two Nvidia Tesla T4s (a server GPU) are a good start, and you can find a T4 used for around 800€.

1: I’m talking about fine-tuning or RAG, not about simply using models locally.

Which model to choose depends on what you want to do. I would go with DeepSeek-R1 (one of the smaller distilled variants), Mistral-Nemo or Llama 3.2.
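
As a sketch of what that looks like once Ollama is installed and one of those models is pulled (using the official Python client; the model tag is whatever you actually pulled):

```python
import ollama  # pip install ollama; requires a running Ollama installation

# Chat with a locally pulled model, e.g. after `ollama pull llama3.2`.
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user",
               "content": "Help me phrase tonight's journal prompt."}],
)
print(response["message"]["content"])
```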

I’ll look into these options, as I do have a more recent Mac with an M-series chip. But I’m still curious what sort of benchmarks are needed for a tower PC or another laptop, like a ThinkPad running Linux. I imagine those benchmarks are somewhere in that community.

Thank you for helping me understand. I think what I was actually picturing is using an existing model. Eventually I wouldn’t mind building my own server; that seems cool, although I’m not exactly sure how, as I’ve never really done it before.

Do these models run on Linux as well? Pardon the newbie questions, as I am… quite the newbie in this area haha

What matters most for inference is memory bandwidth and capacity (for prompt processing, compute matters too), but the main factor is memory bandwidth. This is what gives Apple M-series chips an edge (not the base models, but the Pro/Max/Ultra variants), thanks to the high bandwidth of their unified memory, and it’s why GPUs with fast VRAM perform even better. Beyond that you can still run things, but it’ll be slower, particularly on an older system with DDR4. Figure out the memory bandwidth you have and that’ll help inform what you can realistically run at acceptable speeds.
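
A rough rule of thumb (an upper bound, not a benchmark): each generated token has to stream the whole model through memory once, so tokens per second is capped at roughly memory bandwidth divided by model size. The numbers below are only illustrative:

```python
# Rough upper bound on generation speed:
# tokens/s ≈ memory bandwidth (GB/s) / model size in memory (GB)
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# Illustrative values, not measurements:
print(max_tokens_per_second(200, 5))  # ~40 tok/s: high-bandwidth unified memory, ~5 GB 4-bit model
print(max_tokens_per_second(50, 5))   # ~10 tok/s: dual-channel DDR4 desktop, same model
```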

Yes, Linux is great for running models. For most hobbyists, and for people working in the field, using Linux is fairly standard.

Do these models run on Linux as well? Pardon the newbie questions, as I am… quite the newbie in this area haha

It does, as is the case for most server applications (AI tools are usually designed to run on servers).
Most AI software works on macOS and Windows too, but the best support and performance is on Linux.
