I worked as a software engineer intern at Annarabic, a Moroccan AI startup developing speech-recognition systems for spoken Arabic dialects and low-resource African languages. Annarabic trains its models from scratch using native-speaker data, addressing critical gaps left by mainstream ASR systems trained primarily on Modern Standard Arabic.
Software Engineering & Data Infrastructure
My core responsibility was building scalable infrastructure to support low-resource NLP model training, focusing on Swahili and dialectal Arabic.
- I designed and optimized a Python web scraper using Selenium and MongoDB to collect and structure large-scale YouTube metadata and transcripts. Relative to the initial iteration, I reduced processing latency by ~4.5× and scaled the system to handle 10,000+ hours of video data.
- I also conducted bottleneck analysis across scraping, parsing, and storage layers, and scoped targeted optimizations to reduce system overhead and improve reliability.
- To plan for future work, I wrote a product specification document outlining architectural improvements, including the use of cloud services for distributed training and multiprocessing.
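
As an illustration of the per-stage timing that this kind of bottleneck analysis relies on, here is a minimal Python sketch. It is not the actual Annarabic code: `profile_pipeline` and the `fake_*` stand-in functions are hypothetical placeholders for the real scraping, parsing, and storage steps, and the sleep only simulates a slow page load.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


def profile_pipeline(items, scrape, parse, store):
    """Run each item through scrape -> parse -> store and
    return the total wall-clock seconds spent in each stage."""
    totals = defaultdict(float)

    @contextmanager
    def timed(stage):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate elapsed time even if the stage raises.
            totals[stage] += time.perf_counter() - start

    for item in items:
        with timed("scrape"):
            raw = scrape(item)
        with timed("parse"):
            record = parse(raw)
        with timed("store"):
            store(record)
    return dict(totals)


# Hypothetical stand-ins for the real pipeline stages.
stored = []


def fake_scrape(video_id):
    time.sleep(0.02)  # simulate a slow browser round trip
    return {"id": video_id, "transcript": "  hello  "}


def fake_parse(raw):
    return {"id": raw["id"], "transcript": raw["transcript"].strip()}


def fake_store(record):
    stored.append(record)  # a real pipeline would write to a database


totals = profile_pipeline(
    ["video-1", "video-2", "video-3"], fake_scrape, fake_parse, fake_store
)
slowest = max(totals, key=totals.get)
print(f"slowest stage: {slowest}")
for stage, seconds in totals.items():
    print(f"{stage}: {seconds:.4f}s")
```

Ranking the stages by accumulated time shows where optimization effort is likely to pay off first, before any decision about parallelism or cloud resources is made.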
Throughout this process, I was challenged to balance speed, cost, and quality under the constraints of a small startup, particularly the need to avoid paid APIs.
Research, Product, & Partnerships
In addition to engineering, I worked closely with Annarabic’s founders on product development and external partnerships.
- I showcased Annarabic’s “Transcribing WhatsApp Arabic Voice Messages” system at the Columbia Data Science Institute Undergraduate Research Fair 2024, demonstrating real-world applications of speech recognition in disaster response.
- I supported partnership discussions with Columbia’s DESDR research group to expand Annarabic’s operations into three additional countries (pending USAID support).
Takeaways
This internship gave me hands-on experience in a very early-stage startup environment, from building cost-efficient data pipelines to translating technical capabilities into real impact. Working on mission-driven AI at a small company gave me ownership over technical decisions and a direct view of their downstream societal effects, particularly in advancing language equity and accessibility in underrepresented regions.