CodeSage is a family of source code embedding models with a Transformer encoder architecture that supports a wide range of source code understanding tasks. It is available in three sizes: 130M (CodeSage-Small), 356M (CodeSage-Base), and 1.3B (CodeSage-Large).
CodeSage is trained on the Stack dataset in two stages. In stage 1, we perform masked language modeling (MLM) with a mix of standard masking and identifier deobfuscation (DOBF). In stage 2, we apply contrastive learning on constructed text-code pairs.
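For intuition, here is a minimal sketch of a stage-2-style contrastive objective with in-batch negatives. It is illustrative only: the pooling, temperature value, and any hard-negative construction used in actual training are not shown, and the hyperparameters below are assumptions.

import torch
import torch.nn.functional as F

def info_nce_loss(text_emb, code_emb, temperature=0.05):
    # Normalize, score every text against every code in the batch, and treat
    # the matching (diagonal) pair as the positive; all other pairs act as negatives.
    # The temperature value is illustrative.
    text_emb = F.normalize(text_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = text_emb @ code_emb.T / temperature      # [batch, batch] similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)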
Our largest model, CodeSage-Large, outperforms OpenAI text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large by 41%, 144%, and 34% (relative), respectively, on code-to-code search tasks. On text-to-code search tasks, CodeSage-Large outperforms text-embedding-ada-002 and text-embedding-3-small, and is on par with text-embedding-3-large.
We benchmark CodeSage against public encoder models for code, i.e., CodeBERT, GraphCodeBERT, StarEncoder, and UniXcoder, as well as the OpenAI embedding models text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large. Below we show the evaluation results on code-to-code and text-to-code search tasks.
On the 80-10-10 corruption convention. Given an input sequence, the conventional masking strategy for text first randomly samples a subset of its tokens; 80% of the sampled tokens are replaced with the special [MASK] token, 10% are left unchanged, and the remaining 10% are replaced with random tokens from the vocabulary. We find this 80-10-10 strategy suboptimal for code: it is more effective to simply replace all sampled tokens with [MASK].
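The two corruption schemes can be sketched as follows. This is a minimal sketch, not the exact training configuration: the 15% sampling rate and the helper name are illustrative assumptions.

import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mask_rate=0.15, full_mask=True):
    # Sample positions to corrupt; the 15% rate is illustrative.
    sampled = torch.rand(input_ids.shape) < mask_rate
    labels = input_ids.clone()
    labels[~sampled] = -100  # compute the MLM loss only on sampled positions
    corrupted = input_ids.clone()
    if full_mask:
        # Strategy we find more effective for code: replace every sampled token with [MASK].
        corrupted[sampled] = mask_token_id
    else:
        # Conventional 80-10-10: 80% -> [MASK], 10% -> random token, 10% -> unchanged.
        probs = torch.rand(input_ids.shape)
        corrupted[sampled & (probs < 0.8)] = mask_token_id
        random_ids = torch.randint(vocab_size, input_ids.shape)
        replace = sampled & (probs >= 0.8) & (probs < 0.9)
        corrupted[replace] = random_ids[replace]
    return corrupted, labels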
Random Masking & DOBF Complement Each Other. DOBF promotes the model to better understand the structure of code as well as yields better shared representations between NL and PL. Simultaneously, random masking promotes the model to learn beyond identifiers, e.g., only 30% of the PL tokens in python are associated with identifiers. We explored two alternatives to leverage DOBF (D) and random masking (R) to complement each other. (1) Sequential (S): training the model with random masking first, then DOBF. (2) Parallel (P): randomly picking either DOBF or random masking for a training example.
from transformers import AutoModel, AutoTokenizer

checkpoint = "codesage/codesage-small"  # or "codesage/codesage-base", "codesage/codesage-large"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

# trust_remote_code=True loads the model's custom code from the Hub.
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)

# Tokenize a code snippet and take the last hidden states as per-token embeddings.
inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
embedding = model(inputs)[0]
print(f'Dimension of the embedding: {embedding[0].size()}')
# Dimension of the embedding: torch.Size([13, 1024])
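# Illustrative follow-up (not part of the snippet above): mean-pool the token embeddings
# and compare two snippets with cosine similarity as a simple code-to-code search check.
# The mean-pooling choice here is an assumption, not necessarily the official pooling.
import torch.nn.functional as F

def embed(code_str):
    ids = tokenizer.encode(code_str, return_tensors="pt").to(device)
    return model(ids)[0].mean(dim=1)  # [1, hidden_size]

query = embed("def add(a, b): return a + b")
candidate = embed("def sum_two(x, y): return x + y")
print(F.cosine_similarity(query, candidate).item())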
@inproceedings{
  zhang2024codesage,
  title={CodeSage: Code Representation Learning At Scale},
  author={Dejiao Zhang* and Wasi Ahmad* and Ming Tan and Hantian Ding and Ramesh Nallapati and Dan Roth and Xiaofei Ma and Bing Xiang},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=vfzRRjumpX}
}