- As a full-stack AI company, Character.AI designs its model architecture with inference efficiency in mind
- They natively train their models in int8 precision, eliminating the risk of training/serving mismatch while also significantly improving training efficiency (see the sketch after this list)
- They serve more than 20,000 inference queries per second
- This is roughly 20% of the request volume served by Google Search (105,000 queries/s)
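
A minimal sketch of what an int8 matmul looks like (an illustration under assumptions, not Character.AI's actual training kernels; true int8-native training also needs quantized backward passes and custom GPU kernels, which are omitted here): weights and activations are symmetrically quantized per tensor, multiplied with int32 accumulation, and rescaled back to float.

```python
# Simplified int8 matmul sketch (assumed setup, not Character.AI's implementation).
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization to int8; returns values and scale."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Quantize both operands, multiply with int32 accumulation, dequantize."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)  # int32 accumulation
    return acc.astype(np.float32) * (sa * sb)         # rescale to float

# The int8 path stays close to the float32 reference on random inputs.
x = np.random.randn(4, 64).astype(np.float32)
w = np.random.randn(64, 32).astype(np.float32)
print(np.abs(int8_matmul(x, w) - x @ w).max())
```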
Memory-efficient Architecture Design
- The key bottleneck of LLM inference throughput is the size of the cache of attention keys and values (KV)
- It determines the maximum batch size that can fit on a GPU
- It also dominates the I/O cost on attention layers
- They use the following techniques to reduce KV cache size by more than 20X without regressing quality (a back-of-the-envelope sketch of the cache-size math follows this list)
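
A back-of-the-envelope sketch of the cache-size math (hypothetical model shapes and memory budget, not Character.AI's actual configuration): the cache stores one key and one value vector per layer, per KV head, per token, so shrinking the number of KV heads and the bytes per element directly raises the batch size that fits in a fixed memory budget.

```python
# Hypothetical KV-cache sizing sketch; all shapes and budgets are assumptions.
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # 2x accounts for storing both keys and values.
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

GPU_MEM_FOR_CACHE = 40e9  # bytes left for the cache after weights (assumed)

# Multi-head attention baseline: every query head keeps its own KV head, fp16 cache.
mha = kv_cache_bytes(1, 4096, 32, 32, 128, 2)
# Multi-query attention with an int8 cache: one shared KV head, 1 byte per element.
mqa = kv_cache_bytes(1, 4096, 32, 1, 128, 1)

print(f"per-sequence cache: MHA {mha/1e6:.0f} MB vs MQA+int8 {mqa/1e6:.0f} MB")
print(f"max batch in budget: MHA {int(GPU_MEM_FOR_CACHE // mha)} "
      f"vs MQA+int8 {int(GPU_MEM_FOR_CACHE // mqa)}")
```

With these assumed shapes the per-sequence cache drops from roughly 2 GB to about 34 MB, which is what lets far more concurrent sequences fit in the same memory budget.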
Multi-Query Attention
