CMU Researchers Open Source ‘PolyCoder’: A Machine Learning-Based Code Generator with 2.7 Billion Parameters


Language models (LMs) are commonly used in natural language processing to assign probabilities to sequences of tokens, and they have recently demonstrated outstanding performance in modeling source code written in programming languages. These models are particularly effective at tasks such as code completion and generating code from natural language descriptions. Today's large state-of-the-art language models for code have shown tremendous progress in supporting AI-assisted programming. One of the largest of these models, Codex, has been deployed as an IDE-integrated development assistant that automatically writes code based on the user's context, in the real-world production tool GitHub Copilot.

Despite the enormous success of large code language models, the most powerful ones are not publicly available. This effectively restricts research in this field to well-resourced companies and prevents the use of these models outside of them. Codex, for example, offers non-free access to model outputs through black-box API calls, but not to the model's weights or training data. Researchers are therefore unable to fine-tune and adapt the model to domains and tasks other than code completion. The lack of access to the model's internals also prevents researchers from studying other important aspects of these models, such as interpretability, model distillation for more efficient deployment, and the incorporation of additional components like retrieval.

GPT-Neo, GPT-J, and GPT-NeoX are three publicly available pre-trained language models that range in size from medium to large. Although they were trained on a wide variety of text, including news articles, internet forums, and a modest number of software repositories from GitHub, these language models can produce source code of decent quality. There are also a few open-source language models trained purely on source code; CodeParrot, for example, was trained on 180 GB of Python code.


The influence of various modeling and training design decisions remains unclear because of the variety of model sizes and training strategies used across these models and the lack of comparisons between them. The actual datasets on which Codex and other proprietary models were trained, for example, are unknown. Some public models were trained on a combination of natural language and code in several programming languages, while others (e.g., CodeParrot) were trained only on code in a single programming language. Since different programming languages share comparable keywords and properties, multilingual models may allow for superior generalization, as suggested by the effectiveness of multilingual models for both natural language and code. This would indicate that multilingual LMs can generalize across languages, outperform monolingual models, and model low-resource programming languages efficiently, although this has not yet been demonstrated empirically.

Researchers at Carnegie Mellon University recently published a paper comparing existing code models – Codex, GPT-J, GPT-Neo, GPT-NeoX, and CodeParrot – across programming languages. By comparing and contrasting these models, they aim to shed more light on the landscape of code-modeling design decisions and to fill a major gap: no large open-source language model had been trained solely on code from multiple programming languages. Under the name “PolyCoder”, the team releases three such models, ranging from 160M to 2.7B parameters.

First, the team compares PolyCoder, the open-source models, and Codex in terms of their training and evaluation settings. Second, using the HumanEval benchmark, the team studies how models of different sizes and numbers of training steps scale, and how different sampling temperatures affect generation quality. Finally, because HumanEval only evaluates natural-language-to-Python synthesis, they curate an unseen evaluation dataset in each of the 12 languages to evaluate the perplexity of the different models.
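To illustrate what "varying the temperature" means during generation (this is a generic sketch of temperature-scaled softmax sampling, not the paper's actual evaluation code; the function name and logits are made up for the example):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Softmax-sample a token index after scaling logits by 1/temperature.

    A lower temperature sharpens the distribution (more deterministic,
    conservative completions); a higher temperature flattens it (more
    diverse, riskier completions).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling over the resulting distribution.
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r <= cumulative:
            return i
    return len(probs) - 1

# Hypothetical logits over a 3-token vocabulary: at low temperature the
# highest-scoring token is chosen almost deterministically.
print(sample_with_temperature([1.0, 4.0, 2.0], temperature=0.1))
```

Benchmarks such as HumanEval typically sweep this parameter, since low temperatures tend to maximize single-sample correctness while higher temperatures help when many samples per problem are drawn.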

The researchers found that, despite its apparent specialization in Python, Codex performs admirably in other programming languages, outperforming GPT-J and GPT-NeoX, which were trained on the Pile. Nevertheless, the PolyCoder model achieves lower perplexity in the C programming language than all of these models, including Codex.
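For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood the model assigns to each token of held-out code; lower is better. A minimal sketch of the computation (the helper function and the toy log-probabilities are illustrative, not the paper's code):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    assert token_logprobs, "need at least one token"
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Toy example: a model that assigns probability 0.5 to each of four
# tokens has a perplexity of exactly 2.
logprobs = [math.log(0.5)] * 4
print(perplexity(logprobs))  # → 2.0
```

A model with perplexity 2 is, on average, as uncertain as a fair coin flip per token; the perplexity gaps reported between PolyCoder and Codex are on this per-token scale.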

In the C programming language, PolyCoder outperforms Codex and all other models. Comparing open-source models only, PolyCoder outperforms the similarly sized GPT-Neo 2.7B in C, JavaScript, Rust, Scala, and TypeScript. In the 11 languages other than C, all open-source models, including PolyCoder, are considerably worse (higher perplexity) than Codex. The researchers note that PolyCoder was trained on an unbalanced mix of languages, with C++ and C being closely related and the two most prevalent in the overall training corpus. Because of its larger total volume (due to longer files), C can be regarded as PolyCoder's “preferred” language. PolyCoder does not outperform Codex in C++, possibly because of the complexity of the C++ language combined with Codex's significantly larger context window (4,096 tokens versus 2,048 for PolyCoder), or because Codex was likely trained on more C++ data.


In this work, the researchers conduct a comprehensive evaluation of large language models for code. They find that larger models and more training time benefit performance overall. They also suggest that GPT-Neo's superior performance in some languages indicates that training on both natural-language text and code can help code modeling. To facilitate future research in the field, they have released PolyCoder, a large open-source language model for code trained solely on code in 12 distinct programming languages. PolyCoder achieves lower perplexity in the C programming language than all other models, including Codex.



Sherry J. Basler