Senior AI Research Scientist training large language and vision models at production scale, with research roots in scientific machine learning (Harvard).
One question runs under most of my work: how to represent data on the boundary between the continuous and the discrete. It has surfaced as FFT frequency bins, language tokens, molecular-dynamics trajectories modeled with mixture-density heads, OCR pixels fused with text, and now SMILES strings for chemistry. The question stays the same each time; only the substrate changes.
Selected work
- GutenOCR: open-weights vision-language model family (3B and 7B) for grounded document OCR, with open training code and the 1.5M-page PubMed-OCR dataset. Apache-2.0. (Built at Roots.)
- Page Stream Segmentation with LLMs: COLING 2025, industry track.
- Deconstructing Recurrence, Attention, and Gating: architecture transferability for forecasting chaotic dynamical systems (Harvard).
- academic-tools-mcp: an MCP server giving agents identifier-routed tools across seven academic providers.
Now
Foundation-model methods for the physical sciences. My active research direction is chemical language models: tokenization, pretraining, and scaling for chemistry. The pinned repositories below span scientific computing, generative modeling, and research tooling.





