
TernaryPhysics-7B: Our Quantized LLM

10 min read · April 2026

TernaryPhysics-7B is the brain behind our agents' conversational capabilities. It is a 4-bit quantized model that runs entirely on CPU; no GPU is required. This post explains what it is, how we built it, and why we made the choices we did.

Model Specifications

Model Size         7 billion parameters (quantized)
Disk Space         ~4-5 GB
Context Window     Large context support
Inference Speed    Real-time conversational (CPU)
RAM Required       8 GB minimum (16 GB recommended)
GPU Required       No

Choosing the Right Model

We evaluated dozens of models for infrastructure investigation tasks. Our criteria:

  • Instruction following. The model needs to understand complex multi-step queries about infrastructure.
  • Technical knowledge. It needs to understand Kubernetes, databases, networking, and Linux internals.
  • Reasoning ability. Root cause analysis requires multi-hop reasoning.
  • Efficiency. Must run well on CPU without excessive resource usage.
  • Permissive license. Must be deployable commercially without restrictions.
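One way to compare candidates against criteria like these is a weighted scorecard. Here is a hypothetical sketch; the weights and the per-model scores are illustrative only, not our actual evaluation data:

```python
# Hypothetical weighted scorecard for comparing candidate models.
# Weights and scores below are illustrative, not real evaluation results.

WEIGHTS = {
    "instruction_following": 0.25,
    "technical_knowledge":   0.25,
    "reasoning":             0.25,
    "cpu_efficiency":        0.15,
    "license":               0.10,
}

def score(model_scores):
    """Weighted sum of 0-10 criterion scores."""
    return sum(WEIGHTS[c] * s for c, s in model_scores.items())

candidate = {
    "instruction_following": 8,
    "technical_knowledge":   9,
    "reasoning":             7,
    "cpu_efficiency":        9,
    "license":               10,
}
total = score(candidate)  # weighted score out of 10
```

Reweighting the criteria (for example, raising cpu_efficiency for edge deployments) changes the ranking, which is why we fixed the weights before scoring any candidates.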

TernaryPhysics-7B is the result of extensive evaluation and optimization. It provides strong infrastructure reasoning capabilities while running efficiently on standard hardware.

What is Quantization?

Neural networks typically use 32-bit or 16-bit floating-point numbers to represent weights. Quantization reduces this precision to enable efficient CPU inference.
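To make this concrete, here is a minimal sketch of symmetric 4-bit quantization in pure Python. This is illustrative only; it is not the production scheme TernaryPhysics-7B uses:

```python
# Minimal sketch of symmetric 4-bit weight quantization (illustrative).
# Each float weight is mapped to a signed 4-bit integer plus a shared scale.

def quantize_4bit(weights):
    """Map float weights to signed 4-bit integers in [-8, 7] plus a scale."""
    scale = max(abs(w) for w in weights) / 7.0 or 1.0  # avoid zero scale
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the 4-bit values."""
    return [v * scale for v in q]

weights = [0.42, -1.30, 0.07, 0.91]
q, scale = quantize_4bit(weights)   # q == [2, -7, 0, 5]
approx = dequantize(q, scale)       # close to the original weights
```

Real quantization schemes refine this idea: weights are typically quantized in small blocks with per-block scales, and outlier-sensitive layers may be kept at higher precision.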

Why Quantize?

  • Smaller size. Reduces model size dramatically, from tens of GB to just a few GB.
  • Faster inference. Smaller models load faster and process more efficiently.
  • CPU-friendly. Enables real-time inference without GPU acceleration.
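The size reduction is simple arithmetic. For a 7-billion-parameter model (back-of-envelope; real quantized files also store scales and keep some layers at higher precision, which is why the disk footprint is ~4-5 GB rather than exactly 3.5 GB):

```python
# Back-of-envelope model size at different weight precisions.
params = 7_000_000_000
GB = 1e9

fp32_gb = params * 4 / GB    # 4 bytes/weight -> 28.0 GB
fp16_gb = params * 2 / GB    # 2 bytes/weight -> 14.0 GB
int4_gb = params * 0.5 / GB  # 4 bits/weight  ->  3.5 GB
```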

TernaryPhysics-7B uses advanced quantization techniques that balance size reduction with quality preservation, ensuring accurate infrastructure reasoning.

Optimized Inference

TernaryPhysics-7B uses an optimized inference engine designed for CPU execution. This enables real-time conversational responses without GPU acceleration.

  • Fast CPU inference. Real-time conversational responses on modern hardware.
  • Memory efficient. Optimized to run on systems with 8 GB+ RAM.
  • Cross-platform. Linux, macOS, Windows; x86_64 and ARM64. Works everywhere.
  • No dependencies. No GPU, no special drivers. Just standard hardware.

How It Fits the Architecture

TernaryPhysics-7B is the "Tier 2" brain in our two-tier architecture. It works alongside the TNN™ (Tier 1):

Normal Operation
────────────────
TNN™ runs continuously → minimal resource usage
TernaryPhysics-7B sleeps → 0 CPU usage

Anomaly Detected / Human Query
──────────────────────────────
TNN™ detects anomaly → wakes TernaryPhysics-7B
TernaryPhysics-7B analyzes logs/metrics
Returns findings → goes back to sleep

This pattern minimizes resource usage. During normal operation, only the tiny TNN consumes resources. The heavyweight LLM only activates when needed.
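The wake/sleep pattern above can be sketched in a few lines. The names here (HeavyAnalyzer, tier1_loop) are illustrative stand-ins, not our actual API:

```python
# Minimal sketch of the two-tier pattern: a cheap Tier 1 monitor scans
# continuously, and the heavyweight Tier 2 model is invoked only on anomalies.

class HeavyAnalyzer:
    """Stand-in for the LLM: idle (zero work) until explicitly invoked."""
    def __init__(self):
        self.invocations = 0

    def analyze(self, event):
        self.invocations += 1
        return f"root-cause analysis of {event}"

def tier1_loop(metrics, analyzer, threshold=0.9):
    """Tier 1 checks each metric cheaply; Tier 2 wakes only above threshold."""
    findings = []
    for m in metrics:
        if m > threshold:                         # anomaly detected
            findings.append(analyzer.analyze(m))  # wake Tier 2
        # otherwise the heavyweight model stays asleep
    return findings

analyzer = HeavyAnalyzer()
results = tier1_loop([0.1, 0.2, 0.95, 0.3], analyzer)
# analyzer.invocations == 1: the LLM ran once, for the single anomaly
```

Three of the four metric samples cost only a comparison; the expensive analysis runs exactly once.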

Hardware Requirements

TernaryPhysics-7B is designed to run on commodity hardware:

Component    Minimum               Recommended
RAM          8 GB                  16 GB
Disk         6 GB                  10 GB
CPU          Any x86_64 or ARM64   Modern multi-core
GPU          Not required          Not required

On modern hardware, you'll get real-time conversational responses. Older hardware still works, just with slightly longer response times.
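A preflight check against the minimums in the table is easy to script. Here is a sketch using only the Python standard library (the thresholds come from the table; the RAM check is POSIX-only):

```python
import os
import shutil

def preflight(path=".", min_disk_gb=6, min_ram_gb=8):
    """Check free disk and (where available) total RAM against the minimums."""
    checks = {}
    checks["disk"] = shutil.disk_usage(path).free / 1e9 >= min_disk_gb
    try:
        # POSIX-only: total physical memory = page size * number of pages.
        ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        checks["ram"] = ram_bytes / 1e9 >= min_ram_gb
    except (ValueError, OSError, AttributeError):
        checks["ram"] = None  # not determinable on this platform
    return checks
```

Running it before download avoids pulling a multi-GB model file onto a machine that cannot hold it.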

Future Improvements

We're actively working on:

  • TNN™ integration. Using the TNN™ to accelerate LLM inference.
  • Infrastructure fine-tuning. Training on infrastructure-specific data for better technical understanding.
  • Smaller models. Exploring more compact variants for resource-constrained environments.
  • Efficiency improvements. Continuous optimization for faster, leaner inference.

For more details on how the model fits into the broader architecture, see our Architecture documentation.