Deep Learning

Quantization and LLMs - Condensing Models to Manageable Sizes

February 15, 2024
8 min read

The Scale and Complexity of LLMs

The incredible abilities of LLMs are powered by their vast neural networks which are made up of billions of parameters. These parameters are the result of training on extensive text corpora and are fine-tuned to make the models as accurate and versatile as possible. This level of complexity requires significant computational power for processing and storage.

[Figure: GPT-4 parameters vs. traditional language models' parameter sizes]

The accompanying bar graph shows the number of parameters across different scales of language models. Moving from smaller to larger models, parameter counts climb sharply: 'Small' language models sit in the millions of parameters, while 'Large' models reach into the tens of billions.

However, it is OpenAI's GPT family that dwarfs the rest: GPT-3 already weighs in at 175 billion parameters, and GPT-4 is reported to be larger still. These models power the most recognizable generative AI application, ChatGPT. Their towering presence on the graph is representative of other LLMs of this class, and it illustrates the scale required to power the next generation of AI chatbots, as well as the processing power needed to support such advanced AI systems.

Fueling Innovation with an Exxact Multi-GPU Server

Training AI models on massive datasets can be accelerated exponentially with the right system. It's not just a high-performance computer, but a tool to propel and accelerate your research.

Configure Now

The Cost of Running LLMs and Quantization

Deploying and operating complex models can get costly: they require either cloud computing or specialized on-premises hardware, such as high-end GPUs and AI accelerators, plus continuous energy consumption. Choosing an on-premises solution can save a great deal of money and adds flexibility in hardware choices and the freedom to use the system however you like, with a trade-off in maintenance and the need for skilled staff. Either way, high costs can make it challenging for small businesses to train and power an advanced AI. Here is where quantization comes in handy.

What is Quantization?

Quantization is a technique that reduces the numerical precision of each parameter in a model, thereby decreasing its memory footprint. This is akin to compressing a high-resolution image to a lower resolution: the essence and most important aspects are retained, but at a reduced data size. This approach enables the deployment of LLMs on less hardware without substantial performance loss.

ChatGPT was trained and is deployed on thousands of NVIDIA DGX systems, representing millions of dollars of hardware and tens of thousands more for infrastructure. Quantization can enable a good proof of concept, or even a fully fledged deployment, on less spectacular (but still high-performance) hardware.

In the sections to follow, we will dissect the concept of quantization, its methodologies, and its significance in bridging the gap between the highly resource-intensive nature of LLMs and the practicalities of everyday technology use. The transformative power of LLMs can become a staple in smaller-scale applications, offering vast benefits to a broader audience.

Basics of Quantization

Quantizing a large language model refers to reducing the precision of the numerical values used in the model. In neural networks and deep learning models, including large language models, numerical values are typically represented as floating-point numbers with high precision (e.g., 32-bit or 16-bit floating-point format).

Quantization addresses this by converting these high-precision floating-point numbers into lower-precision representations, such as 16-bit or 8-bit integers, making the model more memory-efficient and faster during both training and inference at the cost of some precision. As a result, training and inference require less storage, consume less memory, and can be executed more quickly on hardware that supports lower-precision computations.
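
To make the idea concrete, here is a minimal, framework-agnostic sketch: a tensor of FP32 weights is mapped to INT8 with a single scale factor, cutting storage roughly 4x at the cost of a small rounding error. The matrix size and values are illustrative only.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: FP32 -> INT8 plus one scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from the INT8 representation."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4096, 4096).astype(np.float32)  # toy weight matrix
q, scale = quantize_int8(weights)

print(f"FP32 size: {weights.nbytes / 1e6:.1f} MB")   # ~67 MB
print(f"INT8 size: {q.nbytes / 1e6:.1f} MB")         # ~17 MB
print(f"Max rounding error: {np.abs(weights - dequantize(q, scale)).max():.5f}")
```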

Types of Quantization

Quantization can be applied at various stages in the lifecycle of a model's development and deployment. Each method has its distinct advantages and trade-offs and is selected based on the specific requirements and constraints of the use case.

1. Static Quantization

In static quantization, the weights and activations are quantized to a lower bit precision ahead of time, typically using a calibration pass over representative data, and the scheme is applied to all layers. Once set, the quantization parameters remain fixed throughout inference. Static quantization is a good fit when the memory requirements of the target deployment system are known in advance (see the PyTorch sketch after the pros and cons below).

  • Pros of Static Quantization
    • Simplifies deployment planning as the quantization parameters are fixed.
    • Reduces model size, making it more suitable for edge devices and real-time applications.
  • Cons of Static Quantization
    • While the performance drop is predictable, certain parts of the model may suffer more than others because of the broad, one-size-fits-all static approach.
    • Limited adaptability to varying input patterns, and the quantized weights cannot be adjusted after deployment.
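
As a rough illustration, the sketch below walks a toy network through PyTorch's eager-mode static quantization workflow: observers collect activation ranges during a calibration pass, then weights and activations are fixed to INT8. TinyNet and the random calibration batches are placeholders for a real model and a representative dataset.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qconfig, prepare, convert
)

class TinyNet(nn.Module):
    """Toy stand-in for a real model; the stubs mark where quantization begins/ends."""
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.fc1 = nn.Linear(128, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = TinyNet().eval()
model.qconfig = get_default_qconfig("fbgemm")   # x86 backend; use "qnnpack" on ARM
prepared = prepare(model)                        # insert observers

# Calibration: run representative inputs so observers record activation ranges.
with torch.no_grad():
    for _ in range(32):
        prepared(torch.randn(8, 128))

quantized = convert(prepared)                    # weights and activations fixed to INT8
print(quantized)
```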

2. Dynamic Quantization

Dynamic quantization quantizes the weights statically ahead of time, while activations are quantized on the fly during inference as data passes through the network. This means different parts of the model are executed at different precisions rather than defaulting to one fixed quantization scheme (a minimal sketch follows the list below).

  • Pros of Dynamic Quantization
    • Balances model compression and runtime efficiency without significant drop in accuracy.
    • Useful for models where activation precision is more critical than weight precision.
  • Cons of Dynamic Quantization
    • Performance improvements aren’t predictable compared to static methods (but this isn’t necessarily a bad thing).
    • Dynamic calculation adds computational overhead, so inference is somewhat slower than with the other quantized methods, while still being lighter weight than running without quantization.
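
A minimal sketch with PyTorch's built-in quantize_dynamic helper follows; the small nn.Sequential stands in for a real network, and only its nn.Linear layers are quantized. Weights are stored as INT8 up front, while activation scales are computed on the fly per batch.

```python
import torch
import torch.nn as nn

# Toy stand-in for a larger network.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).eval()

# Quantize the Linear layers' weights to INT8; activations are quantized
# dynamically at inference time as data flows through the network.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
print(quantized(x).shape)   # same interface, smaller weights
```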

3. Post-Training Quantization (PTQ)

In this technique, quantization is applied after the model has been fully trained, with no further training required. It involves analyzing the distribution of weights and activations and mapping these values to a lower bit depth. PTQ is commonly used to deploy models on resource-constrained devices such as edge devices and mobile phones, and it can be either static or dynamic (a short LLM-focused example follows the list below).

  • Pros of PTQ
    • Can be applied directly to a pre-trained model without the need for retraining.
    • Reduces the model size and decreases memory requirements.
    • Improved inference speeds enabling faster computations during and after deployment.
  • Cons of PTQ
    • Potential loss in model accuracy due to the approximation of weights.
    • Requires careful calibration and fine tuning to mitigate quantization errors.
    • May not be optimal for all types of models, particularly those sensitive to weight precision.
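
For LLMs specifically, one common flavor of PTQ is simply loading a pre-trained checkpoint in 8-bit. The sketch below assumes the Hugging Face transformers, accelerate, and bitsandbytes packages and a CUDA GPU are available; the checkpoint name is only an example, substitute your own model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"   # example checkpoint; substitute your own model

# Load the pre-trained weights directly in 8-bit -- no retraining required.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Quantization lets us run this model on", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```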

4. Quantization Aware Training (QAT)

During training, the model is made aware of the quantization operations that will be applied during inference, and its parameters are adjusted accordingly. This allows the model to learn to handle quantization-induced errors (see the sketch after the pros and cons below).

  • Pros of QAT
    • Tends to preserve model accuracy compared to PTQ since the model training accounts for quantization errors during training.
    • More robust for models sensitive to precision and better preserves accuracy when inferencing at lower precisions.
  • Cons of QAT
    • Requires retraining the model resulting in longer training times.
    • More computationally intensive, since it simulates quantization and tracks its errors throughout training.
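
The sketch below shows the general shape of a QAT loop in PyTorch's eager-mode API: fake-quantization modules are inserted before fine-tuning so the weights learn to live with INT8 rounding, then the model is converted to a true INT8 model. The tiny network, random data, and step count are placeholders.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert
)

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant, self.dequant = QuantStub(), DeQuantStub()
        self.fc1, self.fc2 = nn.Linear(128, 256), nn.Linear(256, 10)

    def forward(self, x):
        x = self.quant(x)
        x = torch.relu(self.fc1(x))
        return self.dequant(self.fc2(x))

model = TinyNet()
model.qconfig = get_default_qat_qconfig("fbgemm")
model = prepare_qat(model.train())        # insert fake-quantization modules

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Fine-tune with quantization simulated in the loop so weights adapt to INT8 rounding.
for _ in range(100):
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()

quantized = convert(model.eval())         # finalize as a real INT8 model
```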

5. Binary and Ternary Quantization

These methods quantize the weights to just two values (binary) or three values (ternary), the most extreme form of quantization. Weights are constrained to +1 and -1 for binary quantization, or +1, 0, and -1 for ternary quantization, either during or after training. This drastically reduces the number of possible weight values and, with them, the storage and compute required (a short sketch follows the list below).

  • Pros of Binary and Ternary Quantization
    • Maximizes model compression and inference speed with minimal memory requirements.
    • Fast inference and simple quantized arithmetic make these models usable on underpowered hardware.
  • Cons of Binary and Ternary Quantization
    • The heavy compression and reduced precision result in a significant drop in accuracy.
    • Not suitable for all types of tasks or datasets, and struggles with complex tasks.
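
A standalone sketch of binary and ternary weight quantizers is below, loosely following BinaryConnect- and TWN-style schemes: a per-tensor scale times the sign of each weight, with a zero band for the ternary case. In practice these quantizers are applied during training with a straight-through estimator; this snippet only illustrates the forward mapping.

```python
import torch

def binarize(w: torch.Tensor) -> torch.Tensor:
    """Binary quantization: every weight becomes +alpha or -alpha."""
    alpha = w.abs().mean()                       # per-tensor scale
    return alpha * torch.sign(w)

def ternarize(w: torch.Tensor, threshold_ratio: float = 0.7) -> torch.Tensor:
    """Ternary quantization: small weights become 0, the rest +alpha or -alpha."""
    delta = threshold_ratio * w.abs().mean()     # zero-band threshold
    mask = (w.abs() > delta).float()
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1)
    return alpha * torch.sign(w) * mask

w = torch.randn(256, 256)
print("distinct binary values: ", torch.unique(binarize(w)).numel())   # 2
print("distinct ternary values:", torch.unique(ternarize(w)).numel())  # 3
```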

The Benefits & Challenges of Quantization

[Figure: before and after quantization]

The quantization of large language models brings multiple operational benefits. First and foremost, it significantly reduces memory requirements: the goal is for the post-quantization model to have a notably smaller memory footprint. That efficiency allows these models to be deployed on platforms with more modest memory capacities, and the reduced processing power needed to run a quantized model translates directly into higher inference speeds and quicker response times that enhance the user experience.

On the other hand, quantization can introduce some loss in model accuracy, since it involves approximating real numbers. The challenge is to quantize the model without significantly affecting its performance. The practical approach is to benchmark your model's accuracy and completion time before and after quantization to gauge effectiveness and efficiency; a minimal measurement sketch follows.
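
One way to run that before-and-after comparison is sketched below, using dynamic quantization as the example: serialize the model to measure its on-disk size, and time repeated forward passes to estimate latency. The toy model, input shape, and run count are arbitrary; accuracy would be measured the same way against your own evaluation set.

```python
import os, time
import torch
import torch.nn as nn

def size_mb(model: nn.Module, path: str = "tmp_state.pt") -> float:
    """Serialize the state dict and report its size on disk in MB."""
    torch.save(model.state_dict(), path)
    mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return mb

def latency_ms(model: nn.Module, x: torch.Tensor, runs: int = 50) -> float:
    """Average wall-clock time per forward pass, in milliseconds."""
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
    return (time.perf_counter() - start) / runs * 1e3

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).eval()
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 1024)
print(f"FP32: {size_mb(model):6.1f} MB, {latency_ms(model, x):.2f} ms/inference")
print(f"INT8: {size_mb(quantized):6.1f} MB, {latency_ms(quantized, x):.2f} ms/inference")
```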

By optimizing the balance between performance and resource consumption, quantization not only broadens the accessibility of LLMs but also contributes to more sustainable computing practices.

We're Here to Deliver the Tools to Power Your Research

With access to the highest-performing hardware, Exxact can offer a platform optimized for your deployment, budget, and desired performance so you can make an impact with your research!

Talk to an Engineer Today
