So, LabLab.ai are running a hackathon for using GPT5 to create AI software. GPT5 isn't our bag. We're not fans of closed-source AI, so we're not entering that hackathon.
What we are doing is using an abliterated Llama 3.2 model with 3 billion parameters to show that A) bigger isn't always better, B) open source is very cool, and C) whilst offering us $20 of credits for GPT5 is nice and all, we can grab a model from Huggingface, integrate it into a free Colab GPU instance or run it on three-year-old hardware, and get stuff done. And when we say small, we mean it: we're using a quantized version of Llama 3.2, so it's roughly a 2GB download.
Why are we putting all these restrictions on ourselves? Why not grab $20 of free credits from LabLab.ai and do something amazing with GPT5? Well, not everyone has $20 for an API. But maybe they have a machine with a crappy GPU. Or they can wait for CPU processing. Or they can use the free GPU time from Google Colab to run it.
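If you fancy trying the same cheap-and-cheerful route, here's a minimal sketch of it: pull a quantized GGUF from Huggingface and run it locally with llama-cpp-python. The repo and file names below are placeholders, not the exact model we're using, so swap in whichever Llama 3.2 3B quant you like.

```python
# Grab a small quantized model from Huggingface and run it on whatever
# hardware you have -- CPU, a crappy GPU, or a free Colab instance.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder repo/filename: substitute your preferred Llama 3.2 3B quant.
model_path = hf_hub_download(
    repo_id="some-uploader/Llama-3.2-3B-GGUF",
    filename="llama-3.2-3b.Q4_K_M.gguf",  # a ~2GB quant
)

llm = Llama(
    model_path=model_path,
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers if a GPU exists; set 0 for CPU-only
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, little model."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```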
These blog posts document our process. The thing is, Llama 3.2 wasn't our first choice. Or even our second. We first tried Qwen 2.5 3B, which didn't work that well. Then we tried Gemma 3 4B, which was unusable.
We use Llama every day. It's a retrained version we found on Huggingface, and it's excellent. The version we're using for this project isn't the same one, but it's uncensored, small, and modern, which is what we need.
The first thing we did was put it into the software we use to chat with our AIs every day. We use Koboldcpp, because we can either use the interface or run it as a server. Very handy. Then we added our modifications to the model, along with the personality and configuration. That setup was written for a bigger model, and it doesn't always translate well (see Qwen 2.5 and Gemma 3), but our 2.02GB Llama 3.2 3B model seemed to handle it reasonably well. A sketch of the server side of this is below.
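For the curious, here's roughly what talking to Koboldcpp in server mode looks like. This is a sketch, not our exact code: the persona string is an invented stand-in for our real personality configuration, and the sampler values are just sensible starting points.

```python
# Query a running KoboldCpp server via its Kobold-style API
# (default port 5001), prepending a personality to each prompt.
import requests

PERSONA = "You are a helpful, slightly sarcastic assistant."  # placeholder persona

def ask(prompt: str) -> str:
    payload = {
        "prompt": f"{PERSONA}\n\nUser: {prompt}\nAssistant:",
        "max_length": 200,   # tokens to generate
        "temperature": 0.7,  # worth tuning per model
        "rep_pen": 1.1,      # repetition penalty
    }
    r = requests.post(
        "http://localhost:5001/api/v1/generate",
        json=payload,
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["results"][0]["text"]

print(ask("How big are you, really?"))
```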
Tomorrow we’ll do more work on building something around it and tweaking some of the settings to make sure we get every bit of performance out of the model.