Quantization Techniques for Large-Scale Data Processing
Summary
Quantization techniques for large-scale data processing are methods that reduce the size and complexity of data by converting high-precision values into lower-precision representations, making storage and computation more efficient without sacrificing much accuracy. These approaches are especially important for handling massive datasets and deploying machine learning models in environments with limited resources.
Choose smart compression: Selecting the right quantization method, such as product or rotational quantization, can deliver significant storage savings and speed up search operations while keeping accuracy high.
Match precision to needs: Adjusting the bit-width of data representations lets you balance resource use and performance, making it easier to deploy models across different hardware setups.
Streamline deployment: Using flexible quantization strategies, like group-wise or nested models, allows you to maintain strong search and retrieval quality even when working with strict memory or computing constraints.
Summarized by AI from LinkedIn member posts
Philip Vollet
VP Developer Relations & Growth @ Weaviate - Building memory
135,471 followers
6 months ago
4× compression. 42% faster queries. BETTER accuracy.
Sounds impossible, right? Compression is supposed to be a tradeoff, BUT 8-bit Rotational Quantization breaks the rules.
What makes this absolutely wild:
The typical compression story: You compress vectors to save memory, you lose some quality, maybe you gain some speed. Pick your poison.
The RQ story: 4x compression + 2.3x faster distance computation + 98-99% recall (basically perfect) + 15-50% higher throughput. MIND BLOWN.
How does this even work? Two steps:
1. Fast pseudorandom rotation using the Walsh-Hadamard Transform that redistributes your vector's information evenly across all dimensions (this is the clever part that makes everything else possible)
2. Scalar quantization to 8-bit integers with the quantization interval defined by each vector's min/max values
The rotation is doing something beautiful here - it's making EVERY vector universally well-suited for quantization regardless of structure or position in your dataset. No training needed, no cluster centers, just pure mathematical elegance.
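A minimal sketch of the two steps described above, in plain Python. It is not Weaviate's implementation: the function names are made up for illustration, the rotation is assumed to be random sign flips followed by a normalized Walsh-Hadamard transform, and the dimension is assumed to be a power of two (production code has to handle other dimensions somehow, which this sketch does not attempt).

```python
import math
import random

def fwht(x):
    """Fast Walsh-Hadamard Transform, normalized so it preserves vector length."""
    x = list(x)
    n = len(x)
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

def random_signs(n, seed=0):
    """Fixed pseudorandom +1/-1 pattern shared by all vectors (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def rotate(x, signs):
    """Pseudorandom rotation: sign flips followed by the transform."""
    return fwht([v * s for v, s in zip(x, signs)])

def quantize_8bit(x):
    """Scalar-quantize to codes 0..255 using this vector's own min/max."""
    lo, hi = min(x), max(x)
    step = (hi - lo) / 255 or 1.0          # avoid division by zero for flat vectors
    codes = bytes(min(255, round((v - lo) / step)) for v in x)
    return codes, lo, step

def dequantize(codes, lo, step):
    return [lo + c * step for c in codes]

# A "spiky" vector: all of its energy sits in one coordinate.
spike = [0.0] * 64
spike[3] = 10.0
rotated = rotate(spike, random_signs(64))
# After rotation every coordinate has magnitude 10 / sqrt(64) = 1.25,
# so a single min/max interval covers the whole vector tightly.
codes, lo, step = quantize_8bit(rotated)
print(len(codes), "bytes instead of", 4 * len(spike))
```

Storing one byte per dimension instead of a 4-byte float is where the 4x figure comes from; the per-vector `lo` and `step` add a small constant overhead that this sketch ignores.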
We went DEEP on the technical implementation in the full blog post. The engineering to make this production-ready is genuinely impressive 💙
This is now available in Weaviate 1.32+ and we see it as a better default than uncompressed vectors for most use cases.
Blog: https://lnkd.in/e7aBbaeG
Krishna Konar
Data Engineer @ Meta Reality Labs | ML Data Infrastructure and Distributed Systems
5,966 followers
8 months ago
How to compress 2.88 TB of vector embeddings down to 97 GB
I just wrote a detailed article on Product Quantization (PQ), the technique powering efficient similarity search in modern vector databases.
The challenge is...
Storing 1 billion 768-dimensional embeddings requires 2.88 TB of memory. That's expensive and impractical for most applications.
1,000,000,000 vectors × 768 dimensions × 4 bytes = 3,072,000,000,000 bytes
The Solution: Product Quantization achieves roughly 30x compression while maintaining good search accuracy (Recall@10); how good depends on how the sub-vectors are chosen.
How it works:
- Split vectors into sub-vectors (e.g., 96 sub-vectors of 8 dimensions each)
- Learn a codebook of centroids for each sub-vector using K-means
- Replace each sub-vector with a 1-byte code pointing to its nearest centroid
- Search using fast table lookups instead of full distance calculations
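The four steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the article's code: it uses 8-dimensional vectors, 4 sub-vectors and 16 centroids instead of the post's 96 sub-vectors and 256 centroids, a bare-bones k-means, and hypothetical function names.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(v, m):
    """Cut a vector into m equal-length sub-vectors."""
    step = len(v) // m
    return [v[i * step:(i + 1) * step] for i in range(m)]

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[min(range(k), key=lambda c: sq_dist(p, centroids[c]))].append(p)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def train_pq(vectors, m, k):
    """Learn one codebook per sub-vector position."""
    parts = [split(v, m) for v in vectors]
    return [kmeans([p[i] for p in parts], k, seed=i) for i in range(m)]

def encode(v, codebooks):
    """Replace each sub-vector with the index of its nearest centroid (1 byte each)."""
    return bytes(
        min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
        for sub, cb in zip(split(v, len(codebooks)), codebooks)
    )

def decode(code, codebooks):
    """Approximate reconstruction: concatenate the chosen centroids."""
    return [x for c, cb in zip(code, codebooks) for x in cb[c]]

def adc_search(query, codes, codebooks, topk=5):
    """Build per-query distance tables once, then score each code by table lookups."""
    tables = [[sq_dist(sub, c) for c in cb]
              for sub, cb in zip(split(query, len(codebooks)), codebooks)]
    dists = [sum(t[c] for t, c in zip(tables, code)) for code in codes]
    return sorted(range(len(codes)), key=dists.__getitem__)[:topk]

rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
codebooks = train_pq(data, m=4, k=16)
codes = [encode(v, codebooks) for v in data]
nearest_ids = adc_search(data[0], codes, codebooks, topk=5)
print(nearest_ids)
```

The search never touches the original floats: each candidate costs one table lookup per sub-vector, which is why the approach scales to very large collections.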
Real-world impact: With 96 sub-vectors and 256 centroids:
1 billion vectors → just ~97 GB
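As a quick check on the sizes, the raw arithmetic for this configuration can be written out as below. The variable names are illustrative. Note that the raw products come to about 3.07 TB uncompressed and 96 GB of codes (a 32x ratio), which is close to but not identical to the post's 2.88 TB, ~97 GB and 30x; the post does not spell out what accounts for the difference.

```python
n_vectors = 1_000_000_000
dims = 768
m, k = 96, 256                     # sub-vectors per vector, centroids per codebook
sub_dims = dims // m               # 8 dimensions per sub-vector

raw_bytes = n_vectors * dims * 4               # float32 embeddings
code_bytes = n_vectors * m * 1                 # one byte per sub-vector
codebook_bytes = m * k * sub_dims * 4          # float32 centroids, shared by all vectors
compressed_bytes = code_bytes + codebook_bytes
ratio = raw_bytes / compressed_bytes

print(f"raw:        {raw_bytes / 1e12:.3f} TB")
print(f"codes:      {code_bytes / 1e9:.1f} GB")
print(f"codebooks:  {codebook_bytes / 1e6:.2f} MB")
print(f"ratio:      {ratio:.1f}x")
```

The codebooks are under a megabyte in total, so at this scale the compressed size is almost entirely the one-byte codes.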
This article walks through a complete step-by-step example with actual numbers, showing exactly how the encoding and search process works.
Understanding Product Quantization: A Step-by-Step Guide
Krishna Konar, published on LinkedIn
Rahul Agarwal
Staff ML Engineer | Meta, Roku, Walmart | 1:1 @ topmate.io/MLwhiz
45,948 followers
3 months ago
A colleague asked me to explain Product Quantization last week. Twenty minutes later, we were both completely lost.
I knew IVF-PQ worked - I had used FAISS dozens of times. But explaining it from scratch? That is where things got messy.
So I went back and actually figured it out. My latest post is the explanation I wish I had.
Here is what we cover:
- Why brute-force search falls apart at 100M+ vectors
- How IVF partitions space so you are not searching everything
- How Product Quantization (PQ) compresses vectors by 64x
- Why IVF-PQ combined is what most production search engines use
- Metadata filtering for real-world constraints
- FAISS vs Milvus vs Pinecone - which stack to pick and when
- How to benchmark recall, QPS, latency, and memory
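To make the IVF item in the list above concrete, here is a minimal sketch of the idea of partitioning vectors by nearest centroid and probing only a few partitions per query. It is an illustration under simplifying assumptions, not the linked post's code or the FAISS API: centroids are just a random sample of the data (real systems usually train them with k-means), and all names are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))

def build_ivf(vectors, centroids):
    """Inverted lists: for each centroid, the ids of the vectors assigned to it."""
    lists = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        lists[nearest(v, centroids)].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=2, topk=3):
    """Scan only the nprobe partitions whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda j: sq_dist(query, centroids[j]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:topk]

rng = random.Random(7)
data = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(500)]
centroids = rng.sample(data, 10)        # stand-in for trained centroids
lists = build_ivf(data, centroids)

query = [0.1, -0.2, 0.3, 0.0]
print(ivf_search(query, data, centroids, lists, nprobe=2, topk=3))
```

Raising `nprobe` trades speed for recall; with `nprobe` equal to the number of centroids the search degenerates to brute force.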
Part 5 of my RecSys for MLEs series - a deep dive into the vector infra that makes retrieval work at scale.
https://lnkd.in/ehA_WRFk
Have you ever had to explain IVF-PQ to someone from scratch? How did it go?
#MachineLearning #RecommenderSystems #MLEngineering #DataScience
Vector Search at Scale: The Production Engineer's Guide
mlwhiz.com
Zain Hasan
I build and teach AI | AI/ML @ Together AI | EngSci ℕΨ/PhD @ UofT | Previously: Vector DBs, Data Scientist, Lecturer & Health Tech Founder | 🇺🇸🇨🇦🇵🇰
20,106 followers
1 year ago
The researchers at Google DeepMind just introduced "Matryoshka Quantization" (MatQuant), a clever new technique that could make deploying large language models much more efficient.
The key insight? Rather than creating separate models for different quantization levels (int8, int4, int2), MatQuant leverages the nested "Matryoshka" structure naturally present in integer data types. Think of it like Russian nesting dolls - the int2 representation is nested within int4, which is nested within int8.
Here are the major innovations:
1. Single Model, Multiple Precisions
>> MatQuant trains one model that can operate at multiple precision levels (int8, int4, int2)
>> You can extract lower precision models by simply slicing the most significant bits
>> No need to maintain separate models for different deployment scenarios
2. Improved Low-Precision Performance
>> Int2 models extracted from MatQuant are up to 10% more accurate than standard int2 quantization
>> This is a huge breakthrough since int2 quantization typically severely degrades model quality
>> The researchers achieved this through co-training and co-distillation across precision levels
3. Flexible Deployment
>> MatQuant enables "Mix'n'Match" - using different precisions for different layers
>> You can interpolate to intermediate bit-widths like int3 and int6
>> This allows fine-grained control over the accuracy vs. efficiency trade-off
The results are impressive. When applied to the FFN parameters of Gemma-2 9B:
>> Int8 and int4 models perform on par with individually trained baselines
>> Int2 models show significant improvements (8%+ better on downstream tasks)
>> Remarkably, an int2 FFN-quantized Gemma-2 9B outperforms an int8 FFN-quantized Gemma-2 2B
This work represents a major step forward in model quantization, making it easier to deploy LLMs across different hardware constraints while maintaining high performance. The ability to extract multiple precision levels from a single trained model is particularly valuable for real-world applications.
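The nesting that makes this extraction possible can be shown with a few lines of integer arithmetic. This is a simplified sketch, assuming unsigned 8-bit codes and plain truncation; the paper's actual slicing operator may differ (for example by rounding or clamping), and the function names are illustrative.

```python
def slice_msb(code8, bits):
    """Keep the `bits` most significant bits of an unsigned 8-bit code."""
    return code8 >> (8 - bits)

def dequantize(code, bits, scale, zero=0.0):
    """Map a `bits`-wide code back onto the original 8-bit grid, then to a real value."""
    return zero + scale * (code << (8 - bits))

weights8 = [0, 37, 128, 200, 255]                 # example int8 codes
weights4 = [slice_msb(c, 4) for c in weights8]    # [0, 2, 8, 12, 15]
weights2 = [slice_msb(c, 2) for c in weights8]    # [0, 0, 2, 3, 3]
weights3 = [slice_msb(c, 3) for c in weights8]    # an intermediate width, [0, 1, 4, 6, 7]

print(weights4, weights2, weights3)
```

Because the lower-precision codes are prefixes of the 8-bit codes, one stored model can serve several bit-widths, and different layers can be sliced to different widths.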
Looking forward to seeing how this technique gets adopted by the community and what further improvements it enables in model deployment efficiency!
Let me know if you'd like me to elaborate on any aspect of the paper. I'm particularly fascinated by how they managed to improve int2 performance through the co-training approach.
https://lnkd.in/g6mdmVjx
Kuldeep Singh Sidhu
Senior Data Scientist @ Walmart | BITS Pilani
16,689 followers
1 year ago
Groundbreaking Research Alert: 4-bit Quantization for RAG Systems
A fascinating new paper from San José State University introduces an innovative approach to optimize Retrieval-Augmented Generation (RAG) systems through 4-bit quantization of vector embeddings.
>> Technical Deep Dive:
The research tackles a critical challenge in RAG systems: the massive memory requirements for storing high-dimensional embedding vectors. Current top-ranked models on MTEB typically use embedding dimensions between 512 and 4096, consuming substantial memory resources.
Consider this: A standard DBpedia dataset with 1M entries and 1536 dimensions requires 6.1 GB of RAM just for embeddings (1,000,000 × 1536 × 4 bytes in float32). The proposed solution? A sophisticated 4-bit quantization approach that:
- Reduces memory footprint by up to 87.5%
- Maintains search accuracy within 4% of original performance
- Implements group-wise quantization for enhanced precision
- Outperforms HNSW algorithm in accuracy with group sizes ≤ 128
>> Under the Hood:
The system employs symmetric linear quantization with group-wise processing, where vectors are split into equal-sized groups with individual quantization scales. This approach significantly outperforms traditional Product Quantization methods, maintaining correlation coefficients above 0.82 across multiple semantic textual similarity datasets.
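A minimal sketch of symmetric, group-wise linear quantization as described above. It is not the paper's code: the group size of 4 is chosen only to keep the example readable, the function names are hypothetical, and packing two 4-bit codes into one byte is omitted.

```python
def quantize_groupwise(vec, group_size=4, bits=4):
    """Symmetric linear quantization with one scale per group of values."""
    qmax = 2 ** (bits - 1) - 1            # 7 for signed 4-bit codes
    codes, scales = [], []
    for start in range(0, len(vec), group_size):
        group = vec[start:start + group_size]
        scale = max(abs(v) for v in group) / qmax or 1.0   # 1.0 for all-zero groups
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return codes, scales

def dequantize_groupwise(codes, scales, group_size=4):
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

# Two groups with very different magnitudes.
vec = [0.02, -0.01, 0.03, 0.07, 5.0, -3.2, 1.0, 7.0]
codes, scales = quantize_groupwise(vec, group_size=4)
print(codes)                               # small values keep distinct codes
print(dequantize_groupwise(codes, scales))

# With a single scale for the whole vector, the small values all collapse to 0.
coarse_codes, _ = quantize_groupwise(vec, group_size=8)
print(coarse_codes[:4])
```

Smaller groups track local magnitude better but store more scales, so the group size is a trade-off between accuracy and the 87.5% headline saving.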
>> Impact:
This breakthrough enables RAG deployment in resource-constrained environments while maintaining high accuracy. The research demonstrates that intelligent quantization can dramatically reduce infrastructure costs without compromising performance.
Kasper Groes Albin Ludvigsen
Leader | Board Chair @ DDSC | Advisor @ Mimiry | Open Source AI | Digital sovereignty
5,217 followers
2 months ago
Google just published something that's a gift to self-hosted LLM inference 🚀
TurboQuant is a new KV cache quantization method from Google Research.
It compresses the KV cache to 3.5 bits with no accuracy loss and no calibration data.
The benchmark results are hard to argue with. Tested on several benchmarks using open-weights models, 3.5-bit TurboQuant is statistically indistinguishable from the BF16 baseline.
On H100s it also achieves up to 8x speedup computing attention logits at 4-bit, compared to 32-bit unquantized keys.
For those self-hosting models with vLLM or llama.cpp, the practical implications are significant.
A 128K-context 32B session drops from roughly 30 GB of KV cache to around 6 GB. That compression ratio doesn't just save memory; it changes what's feasible on a given server.
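A rough way to reproduce numbers of this size is to count the stored key and value entries and multiply by the bits per entry. The model configuration below (64 layers, 8 KV heads, head dimension 128) is a hypothetical 32B-class setup chosen only for illustration; it is not taken from the post or from any specific model card.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Bytes needed to hold keys and values for one sequence."""
    values = 2 * layers * kv_heads * head_dim * tokens   # factor 2: keys and values
    return values * bits_per_value / 8

# Hypothetical configuration, for illustration only.
cfg = dict(tokens=128 * 1024, layers=64, kv_heads=8, head_dim=128)

bf16 = kv_cache_bytes(**cfg, bits_per_value=16)
tq = kv_cache_bytes(**cfg, bits_per_value=3.5)

print(f"BF16:    {bf16 / 1e9:.1f} GB")
print(f"3.5-bit: {tq / 1e9:.1f} GB")
print(f"ratio:   {bf16 / tq:.2f}x")
```

Under these assumptions the cache goes from about 34 GB to about 7.5 GB, the same ballpark as the post's 30 GB to 6 GB; the exact figures depend on the real layer count, KV head count and head dimension.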
A few scenarios where this becomes especially useful:
👉 Agentic coding pipelines running on-prem, where long context windows are the norm and multiple agents may be running concurrently. The freed memory can support far more parallel sessions on the same hardware.
👉 RAG-heavy enterprise deployments where large documents are included in context. The bottleneck shifts from "can we fit this in memory" to "what do we actually want to retrieve."
👉 Long-context document analysis and summarization workflows, where 32B or 70B models at 64K+ context have historically been memory-prohibitive on single-node setups.
On the implementation side, things are moving quickly: work is underway in both vLLM and llama.cpp.
The gap between "interesting research" and "running in production" looks unusually short here. Perhaps because of agentic coding tools?
Sohrab Rahimi
Director, AI/ML Lead @ Google
23,988 followers
2 months ago
Google solved one of the most important constraints of language models by reducing how much memory they need to run, which directly translates into freed compute capacity and lower serving cost.
The paper “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate” from Google Research focuses on a simple observation. LLMs do not rely on exact vector values during inference. They rely on inner products between vectors. If those similarities are preserved, the model behaves the same.
The design follows from that.
First, vectors are transformed to make compression tractable. A random rotation makes each dimension behave almost independently. That removes the need for complex, data-specific quantization schemes and allows each coordinate to be compressed separately with minimal loss.
Second, the method fixes what compression breaks. Standard quantization distorts inner products, which directly affects attention. TurboQuant isolates that distortion and encodes it using a 1-bit correction signal based on a Quantized Johnson–Lindenstrauss transform. This restores unbiased similarity calculations with negligible overhead.
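The 1-bit correction idea can be illustrated with a small sign-projection estimator. This is a sketch of the general Quantized Johnson-Lindenstrauss construction as commonly described, not TurboQuant's implementation: it encodes a generic vector, whereas the post describes the correction as being applied to the distortion left by the first stage, and all names and sizes here are assumptions.

```python
import math
import random

def gaussian_matrix(m, d, seed=0):
    """m random projection directions with i.i.d. standard normal entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

def project(S, x):
    return [sum(s * v for s, v in zip(row, x)) for row in S]

def qjl_encode(S, k):
    """Store one sign bit per projection plus the vector's norm."""
    signs = [1 if p >= 0 else -1 for p in project(S, k)]
    norm = math.sqrt(sum(v * v for v in k))
    return signs, norm

def qjl_inner_product(S, q, signs, norm):
    """Estimate <q, k> from k's sign bits; the query itself is not quantized."""
    sq = project(S, q)
    total = sum(a * b for a, b in zip(sq, signs))
    return math.sqrt(math.pi / 2) * norm / len(S) * total

d, m = 16, 4000                      # toy sizes; m is large to make the demo stable
S = gaussian_matrix(m, d)
k = [1.0] + [0.0] * (d - 1)
q = [0.25] * d                       # unit-length query; true inner product is 0.25

signs, norm = qjl_encode(S, k)
estimate = qjl_inner_product(S, q, signs, norm)
print(f"true 0.25, estimated {estimate:.3f}")
```

The scaling factor sqrt(pi/2) is what makes the estimate unbiased for Gaussian projections; with fewer projections the estimate stays unbiased but becomes noisier.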
The key architectural move is separation. One stage compresses efficiently. The other guarantees correctness of interactions between vectors. That is why the system can push compression without degrading performance.
The empirical result is what matters. The KV cache can be compressed by more than 5x while maintaining the same downstream accuracy on long-context tasks. This brings compression close to the theoretical limit for this problem.
Now the practical impact.
Take a RAG system with 10M documents embedded at 1536 dimensions. Stored in float16, that is roughly 30 GB of vector data. With 5x to 8x compression, that drops to around 4–6 GB. The entire index fits in GPU memory. Retrieval becomes local, faster, and cheaper.
On the generation side, the KV cache shrinks by the same factor. A system capped at 32k context due to memory can push toward 150k on the same hardware, or run several times more concurrent requests per GPU.
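The arithmetic behind these two paragraphs is simple enough to write down. The sketch below uses illustrative names and assumes, for the context estimate, that the KV cache is the only memory limit, which makes it an upper bound rather than a prediction.

```python
docs, dims = 10_000_000, 1536
fp16_bytes = docs * dims * 2          # 2 bytes per float16 value

def compressed_gb(ratio):
    """Index size in GB after compressing the float16 vectors by `ratio`."""
    return fp16_bytes / ratio / 1e9

def max_context(base_tokens, ratio):
    """Context length that fits in the same KV-cache memory after compression."""
    return int(base_tokens * ratio)

print(f"float16 index: {fp16_bytes / 1e9:.2f} GB")
for ratio in (5, 8):
    print(f"{ratio}x -> {compressed_gb(ratio):.2f} GB")
print(f"32k context at 5x -> {max_context(32_000, 5):,} tokens")
```

The index works out to 30.72 GB in float16 and between 3.84 GB and 6.14 GB after 8x or 5x compression, in line with the figures above; the 160,000-token ceiling at 5x sits slightly above the post's more conservative 150k.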
In practice, a deployment that required 8 GPUs to serve long-context RAG queries can drop to 2–3 GPUs for the same throughput, or keep the hardware and scale traffic significantly.
No retraining. No change to the model. Just a different way of encoding state during inference.
This is a significant and impactful breakthrough. Memory is the bottleneck in modern LLM systems. If you compress it without breaking similarity, you unlock longer context, higher throughput, and materially lower cost at the same time.
Blog: https://lnkd.in/ei3Nb5Vv
Paper: https://lnkd.in/eyJ4Hf9U
Kartik Mathur
AI @ Vectara, Ex-Microsoft
3,600 followers
2 months ago
This week, Google Research achieved nearly lossless compression of the KV Cache in frontier models.
The paper is called TurboQuant, and it's being presented at ICLR 2026. Here's what it does and why it matters.
Large language models keep a "cheat sheet" while they're thinking. It's called the KV Cache, short for Key-Value Cache. Instead of re-reading everything from scratch, the model stores recently used information there so it can retrieve it fast.
The catch: this cheat sheet is enormous. It eats up memory, slows things down, and becomes a real bottleneck at scale.
The obvious fix is compression. But traditional compression methods introduce their own overhead, adding 1 to 2 extra bits per number and partially canceling out the benefit.
TurboQuant compresses the KV Cache down to just 3 bits per number, with no retraining required and no measurable loss in model accuracy.
Two algorithms power this approach:
• PolarQuant handles the heavy lifting. Instead of storing a vector using standard X/Y/Z coordinates, it converts it into polar coordinates (think: an angle plus a distance).
• QJL (Quantized Johnson-Lindenstrauss) mops up the residual error using just 1 bit. It acts as a mathematical error-checker that removes bias from the compressed approximation, keeping attention scores accurate.
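To make the "angle plus a distance" picture concrete, here is a two-dimensional toy: convert a coordinate pair to polar form and quantize only the angle. This is an illustration of the idea, not the PolarQuant algorithm itself, which is more involved than this sketch; the function names and the choice to leave the radius unquantized are assumptions.

```python
import math

def to_polar(x, y):
    """Return (radius, angle) for a 2-D point; angle is in (-pi, pi]."""
    return math.hypot(x, y), math.atan2(y, x)

def quantize_angle(theta, bits):
    """Snap the angle to one of 2**bits evenly spaced directions."""
    levels = 2 ** bits
    step = 2 * math.pi / levels
    return round(theta / step) % levels

def from_polar(r, code, bits):
    """Rebuild the point from its radius and quantized direction."""
    theta = code * 2 * math.pi / 2 ** bits
    return r * math.cos(theta), r * math.sin(theta)

r, theta = to_polar(3.0, 4.0)
code = quantize_angle(theta, bits=4)
x_hat, y_hat = from_polar(r, code, bits=4)
print(r, code, (round(x_hat, 3), round(y_hat, 3)))
```

The reconstruction keeps the length exactly and only perturbs the direction, by at most half an angular step, which is the property that matters for similarity computations.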
The KV Cache bottleneck is one of the main reasons long-context inference is expensive. But vector quantization is also the foundation of large-scale semantic search, the technology that lets search engines find meaning rather than just keywords.
This is the kind of foundational infrastructure work that quietly makes everything else faster and cheaper.
Blog: https://lnkd.in/g5NcAFQ5
#KVCache #Quantization #LLMInference #AIResearch #GoogleResearch #MachineLearning #ICLR2026.
Taras Tsugrii
GenAI builder
5,517 followers
2 months ago
Most people hear “3-bit quantization” and imagine each value now costs 3 bits.
In practice, that is often false.
The real cost is:
effective bits/value = payload bits + metadata overhead
At very low bitwidths, that hidden overhead can dominate.
That is why the most interesting part of Google’s TurboQuant (https://lnkd.in/g5FNc7Gv) writeup is not just “more compression for LLMs.”
It is the deeper systems idea:
below a certain bitwidth, side information stops being a detail and starts becoming the format.
Traditional low-bit vector quantization usually carries extra baggage: scales, norms, block constants, codebooks, normalization metadata.
Those bits are amortized, but they are still real. If the payload is 3 bits/value and the side information costs another 1-2 bits/value, then the representation is not really 3-bit in the only sense hardware cares about: bytes stored and bytes moved.
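The formula above is easy to put numbers on. The sketch below assumes a common block-quantization layout in which each group of values shares some metadata (for example a 16-bit scale, optionally with a 16-bit zero point); the group sizes and metadata widths are illustrative choices, not figures from the TurboQuant writeup.

```python
def effective_bits(payload_bits, group_size, metadata_bits_per_group):
    """Average bits actually stored per value, including shared side information."""
    return payload_bits + metadata_bits_per_group / group_size

# 3-bit payload, groups of 32 values, fp16 scale + fp16 zero point per group.
print(effective_bits(3, 32, 32))     # 4.0
# Same payload, larger groups and only an fp16 scale.
print(effective_bits(3, 128, 16))    # 3.125
# Small groups make the overhead dominate.
print(effective_bits(3, 16, 32))     # 5.0
```

At 8 or 16 payload bits an extra bit of metadata per value is a rounding error; at 3 bits it is a third of the format, which is the point the post is making.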
The geometric part is what makes this elegant.
For attention and vector search, the goal is not perfect reconstruction of each coordinate. It is preserving the downstream computation, especially inner products.
That is a subtle but important shift.
If you randomly rotate a vector before quantizing it, you often spread its energy more evenly across coordinates. In geometric terms, the data becomes less spiky in the chosen basis and more isotropic. Quantization gets easier because no single coordinate carries too much structure.
Intuitively:
a bad coordinate system makes the cloud look stretched and fragile
a better one makes it look rounder and easier to compress
That is why the PolarQuant angle is interesting.
Cartesian quantization asks: which box did this point land in?
A polar-style view asks: how far from the origin is it, and in what direction is it pointing?
For similarity-heavy workloads, that second question is often closer to the computation that actually matters.
Then comes the part I especially like:
don’t spend every extra bit on making the first approximation uniformly better.
Spend most bits on a strong coarse representation, then use a tiny residual budget to remove systematic error.
This generalizes far beyond AI. In any compressed system, the representation is never just the data. It is the data plus everything required to interpret it.
Once the payload gets small enough, the interpretation overhead becomes the main event.
That is the real lesson I took from TurboQuant:
extreme compression only becomes real when you compress the hidden structure around the values, not just the values themselves.
…展开
查看 C2PA 信息
无上一项内容
查看 C2PA 信息
无下一项内容
Uday Kamath, Ph.D.
Building Industry-First AI in Regulated Industries | 8x Author AI Books(LLMs, RL, XAI) | Keynote Speaker |
8,363 followers
2 months ago
Google Research's latest paper at ICLR 2026 (TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate) tackles one of AI's most expensive infrastructure problems.
At 128K context, the KV cache alone costs 33 GB of GPU memory per user.
TurboQuant compresses it to 3 bits per value: 6× less memory, 8× faster attention on H100, zero accuracy loss, no retraining.
The trick: one random rotation makes every key vector follow the same predictable distribution. No per-vector statistics. No overhead. Just math.
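A small sketch of what a random rotation does to a vector, assuming NumPy. The dense QR-based rotation and the helper name `random_rotation` are assumptions for illustration; the construction used in the paper is not reproduced here:

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Fix column signs so the result does not depend on the QR sign convention.
    return q * np.sign(np.diag(r))

dim = 128
R = random_rotation(dim)

# A "spiky" vector: all of its energy sits in one coordinate.
spiky = np.zeros(dim)
spiky[0] = 1.0
other = np.random.default_rng(1).standard_normal(dim)

rotated = R @ spiky

# Rotation preserves norms and inner products...
norm_after = float(np.linalg.norm(rotated))
dot_before = float(spiky @ other)
dot_after = float(rotated @ (R @ other))

# ...but no single coordinate dominates any more.
peak_before = float(np.max(np.abs(spiky)))    # 1.0
peak_after = float(np.max(np.abs(rotated)))
```

Because every rotated vector has its energy spread in roughly the same way, one fixed quantization grid can serve all of them without per-vector scales.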
I break down the paper, the math, and my own implementation on Google Colab.
https://lnkd.in/etuwmnAE
#LLM #MachineLearning #AIInfrastructure
TurboQuant: Google Just Solved the KV Cache Bottleneck
substack.com