Job Description Summary
AI tools such as Large Language Models (LLMs) can be applied to mathematical text to help create a Knowledge Graph (KG) for mathematics. Such a KG would be useful for many tasks, such as searching the literature (for both researchers and students) and finding semantic concepts and results. It should thus help users avoid reinventing the wheel, while also helping to uncover new relationships between mathematical subjects and to conjecture new connections. There is plenty of mathematical text on the web, but far too little of it has been processed and annotated. Tools for processing text abound nowadays, but mathematical corpora are still very rare. The existing resources are mostly datasets of problems and solutions (e.g. MATH, GSM8K). We want instead parsed mathematical text annotated with linguistic entities, as a step towards the semantics of the mathematics. This project aims to build a corpus of undergraduate mathematics that should help with the creation of the proposed knowledge graph.
How
We intend to create a corpus of undergraduate mathematics drawn from the open-source textbooks approved and recommended by the American Institute of Mathematics (AIM).
Responsibilities
After collecting the books, the students should create a GitHub repository and then use spaCy (or a similar off-the-shelf tool) to process the corpus and produce statistics, as was done for the nLab and for an updated version of the nLab. A script for computing the statistics already exists. We want corpus statistics for one or several books.
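As a rough illustration of the kind of statistics meant here, the following is a minimal sketch and not the existing statistics script. The function names `tokenize` and `corpus_stats`, the choice of statistics (token count, type count, type-token ratio, most frequent words), and the sample text are all assumptions made for illustration. It uses only spaCy's blank English tokenizer, which needs no trained model, and falls back to a plain regular expression if spaCy is not installed.

```python
import re
from collections import Counter

try:
    import spacy
    # Tokenizer-only English pipeline; no trained model download is needed.
    _nlp = spacy.blank("en")
except ImportError:
    # Assumption for this sketch: fall back to a regex so it runs without spaCy.
    _nlp = None


def tokenize(text):
    """Return the lowercased alphabetic tokens of `text`."""
    if _nlp is not None:
        return [tok.text.lower() for tok in _nlp(text) if tok.is_alpha]
    # Fallback: maximal runs of letters (no digits, no underscores).
    return [w.lower() for w in re.findall(r"[^\W\d_]+", text)]


def corpus_stats(text, top_n=5):
    """Compute simple corpus statistics (hypothetical selection)."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "type_token_ratio": len(counts) / len(tokens) if tokens else 0.0,
        "most_common": counts.most_common(top_n),
    }


if __name__ == "__main__":
    sample = ("A group is a set with an operation. "
              "Every group has an identity element.")
    for name, value in corpus_stats(sample, top_n=3).items():
        print(name, value)
```

For a real textbook, the text would first have to be extracted from its source format, and book-length input would probably need to be processed chapter by chapter, since spaCy limits the length of a single input text by default.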
Required Qualifications
Some experience in data science, Python, and LaTeX
Enthusiastic about helping to build tools that can improve the teaching of undergraduate mathematics