themachine


A Powerful Personal Machine Learning Server

2024

The Quest for themachine ⚔️

I've been fascinated by the potential of large language models (LLMs) to revolutionize the way we interact with technology. However, running these models requires serious computational horsepower.

That's why I set out to build a custom machine learning server, dubbed themachine, capable of running multiple LLMs concurrently and handling large context sizes. In this post, I'll share the story of how I built themachine, its specs, and some of the exciting projects and experiments I've run on it.


From Humble Beginnings

Building themachine wasn't easy. I started with a modest setup featuring a pair of NVIDIA GeForce RTX 3090 GPUs, a desktop motherboard, and 96GB of RAM. But... those GPUs wouldn't fit quite right, so I had to get creative. I designed and 3D-printed a custom PCIe riser bracket to orient them vertically.

Even with the addition of a 4060Ti, I quickly realized that this configuration was holding me back. It was time for an upgrade. I swapped out my hardware for some serious server-grade gear, including an EPYC CPU and the ASRock ROMED8-2T motherboard, which boasts seven PCIe slots. This allowed me to add even more GPUs to the mix.


A Custom Case for a Custom Machine

With all this new hardware, I needed a case that could keep up. Unfortunately, there weren't many options available that could fit everything, so I designed and built a custom case using 2020 extrusion. The end result was a modular design that could grow over time.

When taking apart the original machine to migrate parts over, I noticed the GPUs were sagging... The PCIe mount (3D-printed PLA) had deformed from the heat!


The Never-Ending Upgrade

For a while, everything was good and the universe was at peace.

But it didn't last. A month or so later, it became clear that four GPUs just weren't enough... we needed... more.

The release of Llama3.1 405b was on the horizon, and I wanted to be able to run it at a reasonable speed. This also meant I could run Llama3.1 70b with full context as well as multiple large LLMs concurrently. Little did I know, this would require some serious power - four 1600W power supplies, to be exact. We had definitely exceeded the capabilities of a standard US household electrical circuit!
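
(The math is unforgiving: a standard US 15A/120V circuit is rated for 1800W, and only 1440W continuous under the usual 80% rule, while themachine can pull up to 4000W under load. Hence multiple circuits and multiple supplies.)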

Fortunately, I had designed the case with modularity in mind, allowing me to expand it one tier at a time.

The modular design worked well and provided a compact housing for the whole system.


Overcoming the Final Hurdles

Getting to this point took some elbow grease, but the real test was yet to come. The cheap PCIe risers I had been using were causing PCIe bus errors, and the system wouldn't stay stable. I sourced higher-quality risers from c-payne.com, which have worked like a charm. To keep the risers and cards aligned and mounted, I designed and built custom mounting plates.

Finally, after all the blood, sweat, and tears, we had all 12 GPUs working in harmony! Well, almost. The top tier of cards couldn't reliably run at PCIe 4.0 speeds and had to be stepped down to 3.0. I suspect I need to add PCIe redrivers to get them running flawlessly at 4.0, but for now the impact is negligible since my workloads are mostly inference-oriented.
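
(The penalty is modest: PCIe 4.0 x8 tops out around 16GB/s versus roughly 8GB/s for 3.0 x8, and for inference the model weights stay resident in VRAM, so link speed mostly matters while loading models.)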


Whispers of silicon
Pulsing thoughts in darkened sea
Wisdom's warm heartbeat
--kalle

Technical Specifications

  • ASRock ROMED8-2T motherboard
  • AMD EPYC 7502P w/ 32 CPU cores @ 2.5GHz
  • 512GB DDR4-3200 RAM @ 204.8GB/s
  • 16TB NVMe total disk
  • 288GB VRAM (12x RTX3090 @ 8x PCIe 4.0)
  • Max 4000W power consumption
  • UPS power backup
  • Custom 2020 extrusion based modular case design

The specs for themachine were heavily inspired by this build.
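
(The 204.8GB/s memory bandwidth figure follows from the EPYC's eight DDR4-3200 channels: 3200MT/s x 8 bytes x 8 channels = 204.8GB/s.)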



Inference Speeds

(All GPUs power-limited to 300W)

Llama-3.1-405b-Instruct

4.5bpw exl2

  • ~3.4 t/s
  • ~6.5 t/s with Llama3.1-8b draft model

Mistral-large-Instruct-2407

8.0bpw exl2

  • ~8 t/s
  • ~19 t/s with Mistral-7b-Instruct-v0.3 draft model

Llama-3.1-70b-Instruct

8.0bpw exl2

  • ~12 t/s
  • ~23 t/s with Llama3.1-8b draft model

Llama-3.1-8b-Instruct

8.0bpw exl2

  • ~80 t/s
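
The draft-model speedups above come from speculative decoding: a small draft model proposes several tokens, and the large model verifies them in a single forward pass, so outputs are unchanged while throughput roughly doubles. Here's a minimal sketch of how this gets wired up with exllamav2's dynamic generator (class and parameter names reflect recent exllamav2 releases and may differ in yours; the model paths are placeholders):

  from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
  from exllamav2.generator import ExLlamaV2DynamicGenerator

  def load(model_dir):
      # Load an exl2-quantized model, splitting layers across available GPUs
      config = ExLlamaV2Config(model_dir)
      model = ExLlamaV2(config)
      cache = ExLlamaV2Cache(model, lazy=True)
      model.load_autosplit(cache)
      return model, cache, config

  model, cache, config = load("/models/Llama-3.1-70B-Instruct-8.0bpw-exl2")
  draft_model, draft_cache, _ = load("/models/Llama-3.1-8B-Instruct-exl2")

  generator = ExLlamaV2DynamicGenerator(
      model=model,
      cache=cache,
      tokenizer=ExLlamaV2Tokenizer(config),
      draft_model=draft_model,   # the 8B proposes tokens...
      draft_cache=draft_cache,   # ...the 70B verifies them in one pass
  )

  print(generator.generate(prompt="Explain speculative decoding briefly.", max_new_tokens=200))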

Parts List

  Category  Item                                       Qty
  PSU       EVGA Supernova 1600 G+                       4
            Add2PSU Multiple Power Supply Adapter        3
  GPU       EVGA GeForce RTX 3090 24GB FTW3 Ultra       12
            NVIDIA GeForce RTX NVLink HB Bridge          4
            CPayne PCIe SlimSAS Host Adapter             6
            SlimSAS SFF-8654 8i cable (45cm)             4
            SlimSAS SFF-8654 8i cable (75cm)             8
            CPayne SlimSAS PCIe gen4 Device Adapter     12
  Compute   ASRock ROMED8-2T Motherboard                 1
            AMD EPYC 7502P CPU                           1
            Noctua NH-U9 TR4-SP3 CPU Cooler              1
            Hynix 64GB Server RAM                        8
  Storage   ASUS Hyper M.2 x16 Gen5 Card                 1
            SK hynix Platinum P41 2TB SSD                1
            Team Group MP44 4TB SSD                      4
  UPS       CyberPower Smart App PR1000LCD               1
            CyberPower Smart App PR1500LCD               3
  Case      ATX Open Chassis Case Rack                   1
            1000mm T Slot Aluminum Extrusion           ~40
            30pcs Black 2020 T-Slot Corner Bracket      32
            2020 T-Slot L-Shape Corner Connector        80
            2020 Joint Plate Connector                  12
  Tools     VANPO Digital Torque Screwdriver             1


Capabilities and Applications

themachine is capable of running Llama3.1 405b at a 4.5bpw quant, or of running multiple LLMs in the 70-120B parameter range concurrently, such as Llama3.1-70b @ 6.0bpw alongside Mistral-Large-Instruct-2407 @ 8.0bpw, each with at least 32k of context. This leaves decent headroom for various tasks.
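
(A rough weights-only estimate is params x bpw / 8 bytes: 405B at 4.5bpw is ~228GB, while 70B at 6.0bpw (~53GB) plus the 123B Mistral-Large at 8.0bpw (~123GB) together come to ~176GB, leaving over 100GB of the 288GB pool for KV caches and overhead.)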

Some potential uses for themachine include:

  • Running multiple LLMs for various tasks, such as language translation, text summarization, and sentiment analysis
  • Developing and training custom LLMs and other machine learning models for specific applications
  • Experimenting with applications of LLMs for various tasks

Projects and Experiments

I've run several exciting projects and experiments on themachine, including:

  • A CLI assistant, kalle, which is powered by models running on themachine (it can also use third party LLM APIs)
  • Generating synthetic conversations for an experiment with intent classification
  • A CLI color palette tool that generates palettes from a source image using several methods, including k-means, median cut, and DBSCAN (a k-means sketch follows this list)
  • An experiment in LLM-driven software maintenance: the logs and code are analyzed, the code is updated, and the LLM-updated code is deployed without a human making any code edits
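
The palette tool's k-means mode boils down to clustering pixels in RGB space and reading the cluster centers back as colors. A minimal sketch, assuming Pillow and scikit-learn (palette_from_image is a hypothetical name, not the tool's actual code):

  # Cluster an image's pixels and return the cluster centers as hex colors.
  from PIL import Image
  import numpy as np
  from sklearn.cluster import KMeans

  def palette_from_image(path, n_colors=8):
      img = Image.open(path).convert("RGB")
      img.thumbnail((256, 256))                # downscale for speed
      pixels = np.asarray(img).reshape(-1, 3)  # (N, 3) array of RGB values
      km = KMeans(n_clusters=n_colors, n_init=10).fit(pixels)
      return ["#%02x%02x%02x" % tuple(c) for c in km.cluster_centers_.astype(int)]

  print(palette_from_image("photo.jpg"))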

What's Next?

themachine is a powerful machine learning server that is capable of running multiple LLMs concurrently and handling large context sizes. Its custom design and technical specifications make it an ideal setup for natural language processing, text generation, and other applications.

I'm excited to continue exploring the possibilities of themachine and sharing my experiences with the community. If you're interested in building your own machine learning server or learning more about themachine, I'd be happy to share more details and answer any questions you may have.