GitHub: Automating the Tagging of Repositories Under Industry Codes

As a Data Consultant, I worked with my team to pull public repository data through GitHub's REST API, clean and merge it, and explore machine learning methods for classifying unseen repositories under NAICS codes. I focused on prompting OpenAI's GPT-3.5 to classify repositories and found that it tends to over-assign them to the "Information" and "Professional, Scientific, and Technical Services" categories and occasionally refuses to give an output at all. Other tools my group members explored include word2vec, KeyBERT, TF-IDF, linear regression, PCA, KNN, UMAP, and neural networks. I also wrote a script to extract images from repositories, but we did not find images particularly helpful for classification.
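
To illustrate the prompting step, here is a minimal sketch of how a classification prompt might be built and how a model reply might be parsed, including the refusal case described above. It is not the project's actual code: the function names `build_prompt` and `parse_sector`, the prompt wording, and the choice to list only a subset of two-digit NAICS sectors are all assumptions made for illustration. The sketch makes no API call; the reply string stands in for whatever the model returns.

```python
import re
from typing import Optional

# A subset of two-digit NAICS sectors, listed for illustration only.
NAICS_SECTORS = {
    "11": "Agriculture, Forestry, Fishing and Hunting",
    "23": "Construction",
    "51": "Information",
    "52": "Finance and Insurance",
    "54": "Professional, Scientific, and Technical Services",
    "61": "Educational Services",
    "62": "Health Care and Social Assistance",
}


def build_prompt(name: str, description: str) -> str:
    """Build a classification prompt (hypothetical wording)."""
    options = "\n".join(f"{code}: {title}" for code, title in NAICS_SECTORS.items())
    return (
        "Classify the GitHub repository below under exactly one NAICS sector.\n"
        "Reply with the two-digit code only.\n\n"
        f"Sectors:\n{options}\n\n"
        f"Repository: {name}\n"
        f"Description: {description}\n"
    )


def parse_sector(reply: str) -> Optional[str]:
    """Return the first known sector code in the reply, or None.

    None covers both refusals and replies that name no listed code.
    """
    for match in re.finditer(r"\b(\d{2})\b", reply):
        if match.group(1) in NAICS_SECTORS:
            return match.group(1)
    return None
```

Treating a refusal as `None` rather than forcing a label keeps those cases countable, which is one way to measure how often the model declines to answer and how heavily its answers skew toward sectors 51 and 54.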