CodeSage is a family of source code embedding models built on a Transformer encoder architecture that supports a wide range of source code understanding tasks. It is available in three sizes: 130M (CodeSage-Small), 356M (CodeSage-Base), and 1.3B (CodeSage-Large).
CodeSage is trained on The Stack dataset in two stages. In stage one, we perform masked language modeling (MLM) with a mix of standard masking and identifier deobfuscation. In stage two, we apply contrastive learning over constructed (text, code) pairs. Please refer to CodeSage-V1 for details.
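For readers unfamiliar with the stage-two objective, the snippet below is a minimal sketch of a standard in-batch contrastive (InfoNCE) loss over (summary, code) embedding pairs. The temperature value, pooling, and hard-negative construction used in the actual recipe are described in the CodeSage-V1 paper and are not reproduced here.

```python
# Minimal sketch of an in-batch contrastive (InfoNCE) objective over (summary, code)
# embedding pairs. The temperature below is illustrative, not the value used in training.
import torch
import torch.nn.functional as F

def info_nce_loss(text_emb: torch.Tensor, code_emb: torch.Tensor, temperature: float = 0.05):
    """text_emb, code_emb: [batch, dim]; row i of each tensor forms a positive pair."""
    text_emb = F.normalize(text_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = text_emb @ code_emb.T / temperature          # [batch, batch] cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: text-to-code and code-to-text retrieval directions
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```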
Flexible Embedding Dimensions. CodeSage-V2 supports flexible embedding sizes, thanks to Matryoshka Representation Learning.
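In practice, Matryoshka-trained embeddings can be truncated to a prefix of the full vector and re-normalized, trading a small amount of accuracy for lower storage and faster search. The sketch below illustrates this; the target dimension of 256 is an assumption for illustration, so check the checkpoint documentation for the dimensions each model actually supports.

```python
# Minimal sketch of using a truncated ("Matryoshka") embedding.
# The target dimension (256) is illustrative only.
import torch
import torch.nn.functional as F

def truncate_embedding(full_emb: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Keep the first `dim` components and re-normalize for cosine-similarity search."""
    return F.normalize(full_emb[..., :dim], dim=-1)
```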
Improved Retrieval Performance. CodeSage-V1 relies primarily on hand-crafted heuristics to filter the (summary, code) pairs constructed from GitHub data, such as extracting the first sentence of a docstring as the summary and discarding pairs with excessively short summaries or code snippets (Husain et al., 2019; Zhang et al., 2024). For the V2 models, we enhance semantic search performance by improving the quality of the contrastive learning data through consistency filtering.
Starting from the pretrained checkpoints of our V1 model family (trained with both masked language modeling (MLM) and deobfuscation; see Section 3.1 of the CodeSage-V1 paper), we applied contrastive learning with the filtered data. Unlike V1 training, we extracted the initial set of (summary, code) pairs, i.e., docstring summaries and the corresponding function/class bodies, from The Stack V2 instead of The Stack.
We employed the same simple rule-based filtering used for training the V1 models (a hypothetical sketch is given below) and then applied consistency filtering to further refine the data. While using The Stack V2 yielded minor performance gains on downstream tasks, the majority of the improvement came from consistency filtering.
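As a rough illustration, the rule-based step amounts to a few heuristics of the kind described above (first docstring sentence as the summary, minimum-length thresholds). The helper name and thresholds below are assumptions for illustration, not the exact rules used in training.

```python
# Hypothetical sketch of rule-based (summary, code) pair construction and filtering.
# Thresholds are illustrative; the actual heuristics follow the CodeSage-V1 recipe.
def make_pair(docstring: str, code_body: str,
              min_summary_tokens: int = 5, min_code_tokens: int = 10):
    summary = docstring.strip().split(".")[0].strip()     # first sentence of the docstring
    if len(summary.split()) < min_summary_tokens:         # drop excessively short summaries
        return None
    if len(code_body.split()) < min_code_tokens:          # drop excessively short code snippets
        return None
    return summary, code_body
```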
To perform consistency filtering, we first train CodeSage-Base for one epoch on the full contrastive learning dataset obtained via rule-based filtering. We then refine the data by removing any positive pair whose similarity score does not rank within the top three among the anchor's similarity scores against 100,000 sampled examples. This process removes roughly 40% of the contrastive learning data while yielding over a 10% absolute improvement in Code2Code search and over a 3% absolute improvement in NL2Code search, compared to baselines trained without consistency filtering. These gains are mainly attributed to the improved quality of the contrastive learning data, which also allows us to use a larger learning rate to elicit better performance.
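The filtering criterion itself is simple to state in code. Below is a minimal sketch under the assumption that cosine similarity between pooled embeddings is used; the function name and tensor shapes are illustrative.

```python
# Minimal sketch of consistency filtering: keep a (summary, code) pair only if, under a
# model trained for one epoch on the unfiltered data, the paired code ranks within the
# top 3 among the anchor summary's similarities to a pool of 100,000 sampled examples.
import torch
import torch.nn.functional as F

def consistency_filter(summary_emb, code_emb, pool_emb, top_k: int = 3):
    """summary_emb, code_emb: [N, dim] embeddings of candidate pairs;
    pool_emb: [100_000, dim] embeddings of sampled examples."""
    s = F.normalize(summary_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    p = F.normalize(pool_emb, dim=-1)
    pos_sim = (s * c).sum(dim=-1, keepdim=True)            # [N, 1] anchor-positive similarity
    pool_sim = s @ p.T                                      # [N, 100_000] anchor-pool similarities
    # Rank of the positive: number of pool items scoring at least as high, plus one.
    rank = (pool_sim >= pos_sim).sum(dim=-1) + 1            # [N]
    return rank <= top_k                                    # boolean keep-mask over pairs
```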
We benchmark CodeSage-V2 against a range of open-source and closed-source models on code-related retrieval tasks.

The models can be used directly with the Hugging Face transformers library:
```python
from transformers import AutoModel, AutoTokenizer

checkpoint = "codesage/codesage-small-v2"  # "codesage/codesage-base-v2", "codesage/codesage-large-v2"
device = "cuda"  # "cpu" for CPU usage

# trust_remote_code is required to load the custom CodeSage architecture;
# add_eos_token=True appends the end-of-sequence token that CodeSage expects on each input.
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)

inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
embedding = model(inputs)[0]
```
Alternatively, the models can be loaded through the SentenceTransformers interface:

```python
from sentence_transformers import SentenceTransformer

model_name = "codesage/codesage-small-v2"  # "codesage/codesage-base-v2", "codesage/codesage-large-v2"
model = SentenceTransformer(model_name, trust_remote_code=True)
```
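As an end-to-end illustration of the NL2Code retrieval setting discussed above, the snippet below embeds a natural-language query and a few candidate code snippets, then ranks the candidates by cosine similarity. The query and candidates are made up for the example.

```python
# Rank candidate code snippets against a natural-language query (illustrative example).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("codesage/codesage-small-v2", trust_remote_code=True)

query = "check whether a number is prime"   # hypothetical query
candidates = [
    "def is_prime(n):\n    return n > 1 and all(n % i for i in range(2, int(n**0.5) + 1))",
    "def reverse_string(s):\n    return s[::-1]",
]

query_emb = model.encode(query, convert_to_tensor=True)
cand_emb = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(query_emb, cand_emb)  # [1, len(candidates)] cosine similarities
print(candidates[int(scores.argmax())])     # prints the best-matching snippet
```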
If you use CodeSage in your work, please cite:

```bibtex
@inproceedings{zhang2024code,
  title={Code Representation Learning at Scale},
  author={Dejiao Zhang and Wasi Uddin Ahmad and Ming Tan and Hantian Ding and Ramesh Nallapati and Dan Roth and Xiaofei Ma and Bing Xiang},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=vfzRRjumpX}
}
```