The Quest for themachine ⚔️
I've been fascinated by the potential of large language models (LLMs) to revolutionize the way we interact with technology. However, running these models requires serious computational horsepower.
That's why I set out to build a custom machine learning server, dubbed themachine, capable of running multiple LLMs concurrently and handling large context sizes. In this post, I'll share the story of how I built themachine, its specs, and some of the exciting projects and experiments I've run on it.
From Humble Beginnings
Building themachine wasn't easy. I started with a modest setup featuring a pair of NVIDIA GeForce RTX 3090 GPUs, a desktop motherboard, and 96GB of RAM. But... those GPUs wouldn't quite fit, so I had to get creative. I designed and 3D-printed a custom PCIe riser bracket to orient them vertically.
Even with the addition of an RTX 4060 Ti, I quickly realized that this configuration was holding me back. It was time for an upgrade. I swapped out my hardware for some serious server-grade gear, including an AMD EPYC CPU and the ASRock ROMED8-2T motherboard, which boasts seven PCIe 4.0 x16 slots. This allowed me to add even more GPUs to the mix.
A Custom Case for a Custom Machine
With all this new hardware, I needed a case that could keep up. Unfortunately, there weren't many options available that could fit everything, so I designed and built a custom case using 2020 extrusion. The end result was a modular design that could grow over time.
When taking apart the original machine to migrate parts over, I noticed the GPUs were sagging... the PCIe mount (3D-printed PLA) had deformed from the heat!
The Never-Ending Upgrade
For a while, everything was good and the universe was at peace.
But it didn't last. A month or so later, it became clear that four GPUs just weren't enough... we needed... more.
The release of Llama3.1 405b was on the horizon, and I wanted to be able to run it at a reasonable speed. A rig that size would also let me run Llama3.1 70b with full context, as well as multiple large LLMs concurrently. Little did I know, this would require some serious power: four 1600W power supplies, to be exact. We had definitely exceeded the capabilities of a standard US household electrical circuit!
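The back-of-the-envelope math makes the problem obvious (a rough sketch, assuming standard US 120V circuits and the usual 80% continuous-load rule):

```python
# Why one household circuit can't feed themachine (US 120V assumed)
psu_capacity_w = 4 * 1600        # four 1600W PSUs -> 6400W potential draw
circuit_15a_w = 120 * 15 * 0.8   # 1440W continuous limit on a standard 15A circuit
circuit_20a_w = 120 * 20 * 0.8   # 1920W continuous limit on a 20A circuit

print(psu_capacity_w / circuit_20a_w)  # ~3.3 circuits' worth of potential draw
```

In practice the GPUs are power-limited to 300W each (see the inference numbers below), so actual draw tops out around 4000W, but even that is several circuits' worth.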
Fortunately, I had designed the case with modularity in mind, allowing me to expand it one tier at a time.
The modular design worked well and provided a compact housing for the whole system.
Overcoming the Final Hurdles
Getting to this point took some elbow grease, but the real test was yet to come. The cheap PCIe risers I had been using were causing PCIe bus errors, and the system wouldn't stay stable. I sourced higher-quality risers from c-payne.com, which have worked like a charm. To keep the risers and cards aligned and mounted, I designed and built custom mounting plates.
Finally, after all the blood, sweat, and tears, we had all 12 GPUs working in harmony! Well, almost. The top tier of cards still can't reliably run at PCIe 4.0 speeds and had to be stepped down to 3.0. I suspect I need to add PCIe redrivers to get them running flawlessly, but for now, the impact is negligible since my workloads are mostly inference-oriented.
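If you're curious what link each card actually negotiated, NVML exposes it; here's a quick sketch using the pynvml bindings (pip install nvidia-ml-py):

```python
# Report each GPU's negotiated PCIe generation and link width via NVML
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)   # what it trained at
    width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)      # e.g. x8 behind the risers
    max_gen = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(h)
    print(f"GPU {i}: PCIe gen {gen} x{width} (max gen {max_gen})")
pynvml.nvmlShutdown()
```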
Whispers of silicon
Pulsing thoughts in darkened sea
Wisdom's warm heartbeat

--kalle
Technical Specifications
- ASRock ROMED8-2T motherboard
- AMD EPYC 7502P w/ 32 CPU cores @ 2.5GHz
- 512GB DDR4-3200 RAM @ 204.8GB/s (see the quick math below)
- 16TB NVMe total disk
- 288GB VRAM (12x RTX 3090 @ PCIe 4.0 x8)
- 4000W max power consumption
- UPS power backup
- Custom 2020-extrusion-based modular case design
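The 204.8GB/s memory bandwidth figure falls straight out of the EPYC 7502P's eight DDR4-3200 channels:

```python
# Theoretical bandwidth: 8 channels x 3200 MT/s x 8 bytes per transfer
channels = 8
transfers_per_sec = 3200e6  # DDR4-3200
bytes_per_transfer = 8      # 64-bit channel
print(channels * transfers_per_sec * bytes_per_transfer / 1e9)  # 204.8 GB/s
```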
The specs for themachine were heavily inspired by this build.
Inference Speeds
(All GPUs limited to 300W)
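Capping the cards is `sudo nvidia-smi -i <id> -pl 300` per GPU, or scriptable via NVML; a minimal sketch:

```python
# Cap every GPU at 300W via NVML (needs root; the limit is given in milliwatts)
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    pynvml.nvmlDeviceSetPowerManagementLimit(h, 300_000)
pynvml.nvmlShutdown()
```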
| Model | Quant | Speed | With draft model |
|---|---|---|---|
| Llama-3.1-405b-Instruct | 4.5bpw exl2 | ~3.4 t/s | ~6.5 t/s (Llama3.1-8b) |
| Mistral-Large-Instruct-2407 | 8.0bpw exl2 | ~8 t/s | ~19 t/s (Mistral-7b-Instruct-v0.3) |
| Llama-3.1-70b-Instruct | 8.0bpw exl2 | ~12 t/s | ~23 t/s (Llama3.1-8b) |
| Llama-3.1-8b-Instruct | 8.0bpw exl2 | ~80 t/s | |
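The draft-model speedups come from speculative decoding: the small model cheaply proposes a few tokens, and the big model verifies them all in a single batched forward pass, keeping the longest prefix it agrees with. A conceptual sketch of the greedy variant (just the idea; the real work happens inside the inference engine, e.g. exllamav2):

```python
# Greedy speculative decoding, conceptually (not a real inference API).
# draft_next_token(seq)   -> the draft model's greedy next token for seq
# target_next_tokens(seq) -> the target model's greedy next-token
#                            prediction for every prefix of seq (one pass)
def speculative_step(target_next_tokens, draft_next_token, tokens, k=4):
    # 1. Cheap draft model proposes k tokens autoregressively.
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(draft_next_token(proposal))

    # 2. Expensive target model verifies all k proposals in ONE forward pass.
    preds = target_next_tokens(proposal)  # preds[j] follows proposal[:j+1]

    # 3. Keep draft tokens while the target agrees; on the first miss,
    #    substitute the target's own token and stop.
    out = list(tokens)
    for j in range(len(tokens), len(proposal)):
        if proposal[j] == preds[j - 1]:
            out.append(proposal[j])
        else:
            out.append(preds[j - 1])
            break
    else:
        out.append(preds[-1])  # all k accepted: bonus token for free
    return out
```

Every accepted draft token is one fewer full-model decode step, which is why a well-matched draft model roughly doubles throughput in the table above.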
Parts List
| Category | Item | Qty |
|---|---|---|
| PSU | EVGA Supernova 1600 G+ | 4 |
| | Add2PSU Multiple Power Supply Adapter | 3 |
| GPU | EVGA GeForce RTX 3090 24GB FTW3 Ultra | 12 |
| | NVIDIA GeForce RTX NVLink HB Bridge | 4 |
| | CPayne PCIe SlimSAS Host Adapter | 6 |
| | SlimSAS SFF-8654 8i cable (45cm) | 4 |
| | SlimSAS SFF-8654 8i cable (75cm) | 8 |
| | CPayne SlimSAS PCIe gen4 Device Adapter | 12 |
| Compute | ASRock ROMED8-2T Motherboard | 1 |
| | AMD EPYC 7502P CPU | 1 |
| | Noctua NH-U9 TR4-SP3 CPU Cooler | 1 |
| | Hynix 64GB Server RAM | 8 |
| Storage | ASUS Hyper M.2 x16 Gen5 Card | 1 |
| | SK hynix Platinum P41 2TB SSD | 1 |
| | Team Group MP44 4TB SSD | 4 |
| UPS | CyberPower Smart App PR1000LCD | 1 |
| | CyberPower Smart App PR1500LCD | 3 |
| Case | ATX Open Chassis Case Rack | 1 |
| | 1000mm T Slot Aluminum Extrusion | ~40 |
| | 30pcs Black 2020 T-Slot Corner Bracket | 32 |
| | 2020 T-Slot L-Shape Corner Connector | 80 |
| | 2020 Joint Plate Connector | 12 |
| Tools | VANPO Digital Torque Screwdriver | 1 |
Capabilities and Applications
themachine can run Llama3.1 405b at a 4.5bpw quant, and it can also run multiple LLMs in the 70-120B parameter range concurrently, such as Llama3.1-70b @ 6.0bpw alongside Mistral-Large-Instruct-2407 @ 8.0bpw, with at least 32k of context for each. This leaves decent headroom for various tasks.
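Here's a back-of-the-envelope check on why those configurations fit (weights only; KV cache and other overhead eat into the remainder; Mistral-Large-Instruct-2407 is 123B parameters):

```python
# Weights-only VRAM footprint: params (billions) x bits-per-weight / 8 = GB
def weights_gb(params_b, bpw):
    return params_b * bpw / 8

print(weights_gb(405, 4.5))                        # ~228 GB of the 288GB total
print(weights_gb(70, 6.0) + weights_gb(123, 8.0))  # ~176 GB for the concurrent pair
```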
Some potential uses for themachine include:
- Running multiple LLMs for various tasks, such as language translation, text summarization, and sentiment analysis
- Developing and training custom LLMs and other machine learning models for specific applications
- Experimenting with novel LLM-driven applications and workflows
Projects and Experiments
I've run several exciting projects and experiments on themachine, including:
- A CLI assistant, kalle, powered by models running on themachine (it can also use third-party LLM APIs)
- Generating synthetic conversations for an experiment with intent classification
- A CLI color palette tool that generates color palettes from a source image using a number of methods, including k-means, median cut, and DBSCAN (sketched below)
- An experiment in LLM-driven software maintenance, where logs and code are analyzed, the code is updated, and the LLM-written changes are deployed without a human making any edits
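To illustrate the palette idea (a sketch under my own assumptions, not the tool's actual code): cluster the image's pixels with k-means and use the centroids as the palette.

```python
# Sketch: pull a 5-color palette out of an image with k-means
# ("photo.jpg" is a placeholder filename)
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

pixels = np.asarray(Image.open("photo.jpg").convert("RGB")).reshape(-1, 3)
idx = np.random.choice(len(pixels), min(len(pixels), 10_000), replace=False)

km = KMeans(n_clusters=5, n_init=10).fit(pixels[idx])  # subsample for speed
palette = km.cluster_centers_.round().astype(int)      # 5 RGB centroids
print(palette)
```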
What's Next?
themachine is a powerful machine learning server capable of running multiple LLMs concurrently and handling large context sizes. Its custom design and technical specifications make it an ideal setup for natural language processing, text generation, and other applications.
I'm excited to continue exploring the possibilities of themachine and sharing my experiences with the community. If you're interested in building your own machine learning server or learning more about themachine, I'd be happy to share more details and answer any questions you may have.