TernaryPhysics-7B: Our Quantized LLM
TernaryPhysics-7B is the brain behind our agents' conversational capabilities: a 4-bit quantized model that runs entirely on CPU, with no GPU required. This post explains what it is, how we built it, and why we made the choices we did.
Model Specifications
| Specification | Value |
|---|---|
| Model size | 7 billion parameters (quantized) |
| Disk space | ~4-5 GB |
| Context window | Large context support |
| Inference speed | Real-time conversational (CPU) |
| RAM required | 8 GB minimum (16 GB recommended) |
| GPU required | No |
Choosing the Right Model
We evaluated dozens of models for infrastructure investigation tasks. Our criteria:
- Instruction following. The model needs to understand complex multi-step queries about infrastructure.
- Technical knowledge. It needs to understand Kubernetes, databases, networking, Linux internals.
- Reasoning ability. Root cause analysis requires multi-hop reasoning.
- Efficiency. Must run well on CPU without excessive resource usage.
- Permissive license. Must be deployable commercially without restrictions.
TernaryPhysics-7B is the result of extensive evaluation and optimization. It provides strong infrastructure reasoning capabilities while running efficiently on standard hardware.
What is Quantization?
Neural networks typically store their weights as 32-bit or 16-bit floating-point numbers. Quantization maps those weights to lower-precision values, such as 4-bit integers, shrinking the model's memory footprint and making efficient CPU inference practical.
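To make this concrete, here is a minimal NumPy sketch of one simple quantization scheme (absmax, one scale per tensor). It is an illustration of the general idea, not TernaryPhysics-7B's actual quantization format, which uses more sophisticated techniques such as per-block scales:

```python
import numpy as np

def quantize_int4(weights: np.ndarray):
    """Absmax 4-bit quantization: map floats to integers in [-8, 7]."""
    scale = np.abs(weights).max() / 7.0  # one scale per tensor (real schemes use per-block scales)
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use at inference time."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1024).astype(np.float32)

q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale)
err = float(np.abs(w - w_hat).max())
print(f"max abs error: {err:.6f} (scale={scale:.6f})")
```

The round trip is lossy, but the worst-case error per weight is bounded by half the scale, which is why careful quantization preserves model quality far better than the 4x size reduction might suggest.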
Why Quantize?
Going from 16-bit floats to 4 bits cuts the model's memory footprint roughly 4x. That is what lets a 7-billion-parameter model fit in ~4-5 GB on disk and run within 8 GB of RAM. CPU inference is largely memory-bandwidth bound, so smaller weights also mean faster token generation. The trade-off is a small loss of precision; TernaryPhysics-7B uses quantization techniques chosen to keep that loss low enough to preserve accurate infrastructure reasoning.
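The back-of-envelope arithmetic behind the spec table is straightforward. The 0.6 bytes/weight figure below is an assumption to account for the scales and metadata that real quantized formats store alongside the 4-bit weights; the exact overhead varies by format:

```python
params = 7e9  # 7 billion parameters

fp16_gb = params * 2.0 / 1024**3    # 16-bit floats: 2 bytes per weight
int4_gb = params * 0.5 / 1024**3    # ideal 4-bit: 0.5 bytes per weight
int4_real_gb = params * 0.6 / 1024**3  # ~4.8 bits/weight with scales/metadata (assumption)

print(f"fp16: {fp16_gb:.1f} GB")
print(f"ideal 4-bit: {int4_gb:.1f} GB")
print(f"realistic 4-bit: {int4_real_gb:.1f} GB")
```

The realistic figure lands in the ~4-5 GB range the spec table quotes, versus ~13 GB for the unquantized fp16 weights, which would not fit comfortably in 8 GB of RAM at all.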
Optimized Inference
TernaryPhysics-7B uses an optimized inference engine designed for CPU execution. This enables real-time conversational responses without GPU acceleration.
Fast CPU Inference
Real-time conversational responses on modern hardware.
Memory Efficient
Optimized to run on systems with 8GB+ RAM.
Cross-Platform
Linux, macOS, Windows. x86_64, ARM64. Works everywhere.
No Dependencies
No GPU, no special drivers. Just standard hardware.
How It Fits the Architecture
TernaryPhysics-7B is the "Tier 2" brain in our two-tier architecture. It works alongside the TNN™ (Tier 1):
```
Normal Operation
────────────────
TNN™ runs continuously    → minimal resource usage
TernaryPhysics-7B sleeps  → 0 CPU usage

Anomaly Detected / Human Query
──────────────────────────────
TNN™ detects anomaly      → wakes TernaryPhysics-7B
TernaryPhysics-7B analyzes logs/metrics
Returns findings          → goes back to sleep
```
This pattern minimizes resource usage. During normal operation, only the tiny TNN consumes resources. The heavyweight LLM only activates when needed.
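The wake-on-demand pattern can be sketched in a few lines. Every name here is illustrative, not the product's actual API; the point is that the heavyweight model is loaded lazily and released after use, so it costs nothing while asleep:

```python
class TwoTierAgent:
    """Sketch of a two-tier wake-on-demand loop: a cheap detector runs
    every tick, and the expensive LLM is loaded only when it fires."""

    def __init__(self, detector, load_llm):
        self.detector = detector  # Tier 1: cheap check, runs continuously
        self.load_llm = load_llm  # Tier 2: expensive loader, called lazily
        self.llm = None           # None while "asleep" => no RAM/CPU cost

    def tick(self, metrics):
        if not self.detector(metrics):
            return None              # normal operation: Tier 2 stays asleep
        if self.llm is None:
            self.llm = self.load_llm()  # anomaly: wake the LLM
        findings = self.llm(metrics)
        self.llm = None              # release it: back to sleep
        return findings

# Toy usage: detector fires on high CPU; "LLM" is a stand-in lambda.
agent = TwoTierAgent(
    detector=lambda m: m["cpu"] > 0.9,
    load_llm=lambda: (lambda m: f"investigating cpu={m['cpu']:.2f}"),
)
print(agent.tick({"cpu": 0.20}))  # None: LLM never loaded
print(agent.tick({"cpu": 0.95}))
```

In a real deployment the "load" and "release" steps would map and unmap the model weights, which is what makes the sleeping state genuinely free.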
Hardware Requirements
TernaryPhysics-7B is designed to run on commodity hardware:
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB | 16 GB |
| Disk | 6 GB | 10 GB |
| CPU | Any x86_64 or ARM64 | Modern multi-core |
| GPU | Not required | Not required |
On modern hardware, you'll get real-time conversational responses. Older hardware still works, just with slightly longer response times.
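A quick preflight check against the table above can be done with the standard library alone. This sketch assumes a Linux or macOS host (the `os.sysconf` names used are POSIX) and is illustrative, not a supported tool:

```python
import os
import shutil

def preflight_check(min_ram_gb: float = 8.0, min_disk_gb: float = 6.0, path: str = "."):
    """Rough check of RAM and free disk against the minimums in the table."""
    # Total physical RAM = page size * number of physical pages (POSIX)
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    disk_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "ram_gb": round(ram_gb, 1),
        "disk_gb": round(disk_gb, 1),
        "ok": ram_gb >= min_ram_gb and disk_gb >= min_disk_gb,
    }

print(preflight_check())
```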
Future Improvements
We're actively working on:
- TNN™ integration. Using the TNN™ to accelerate LLM inference.
- Infrastructure fine-tuning. Training on infrastructure-specific data for better technical understanding.
- Smaller models. Exploring smaller models for resource-constrained environments.
- Efficiency improvements. Continuous optimization for faster, leaner inference.
For more details on how the model fits into the broader architecture, see our Architecture documentation.