By Nour Sibai, Marie-Assine Ghantous & Charbel Hajj Moussa
The Youth4Governance Summer Internship Program at Siren has been eye-opening and transformative. We began with Coursera courses taught by the renowned Andrew Ng, which were instrumental in shaping our understanding of many critical areas of AI. The knowledge we gained was invaluable and is reflected in our ambitious final projects, “Document Layout Analysis for Image Recognition” (Nour Sibai and Marie-Assine Ghantous) and “Arabic Speech Recognition with AI” (Charbel Hajj Moussa).
A Pivotal Meeting and Workshop on AI Transformation in Lebanon
One of the highlights during the internship was a meeting at the Office of the Minister of State for Administrative Reform (OMSAR). In this meeting, we had the opportunity to illustrate how AI could reshape operations at Lebanon’s Vehicle Registration Office. The atmosphere was charged with energy, and the event was further enriched by the presence of Minister of State Najla Riachi. This gathering was a significant milestone in Lebanon’s campaign toward a technologically advanced and efficient future.
We also had the opportunity to collaborate with the top-notch AI team at Siren Analytics. Together, we presented an “Introduction to AI” workshop at OMSAR. The event was again attended by Minister of State Riachi, whose dedication to digital transformation serves as a true inspiration. Our workshop aimed to offer a comprehensive understanding of Artificial Intelligence, diving deep into its real-world applications and discussing its potential impact on public administration and governance.
Document Layout Analysis for Image Recognition
This project, mentored by Siren’s AI Team Lead, Fady Baly, focused on image recognition for the digitization of Lebanese government documents. It responds to an urgent need for digital transformation in Lebanon and other Arabic-speaking nations. The goal was to create a state-of-the-art Document Layout Analysis (DLA) system that can interpret text across diverse document layouts in both English and Arabic.
By identifying and analyzing document layouts, our solution aims to streamline administrative tasks, minimize errors, and boost operational efficiency. The potential for integrating this DLA model into existing optical character recognition (OCR) systems is a testament to the project’s versatility and far-reaching impact.
The project used the YOLO-NAS architecture, an object detection model that locates the regions of a document layout and draws bounding boxes around them, combined with LayoutXLM, a model that understands documents in multiple languages. This approach aimed to move past the traditional limitations of document analysis and change how document layouts are interpreted.
Our strategy had two parts: we fine-tuned YOLO-NAS for layout detection on labeled data, particularly Arabic documents, and we combined its detections with LayoutXLM.
The seamless integration of YOLO-NAS and LayoutXLM was our core achievement. This synthesis decoded the complex interplay between various elements within the document, linking bounding boxes with content garnered through OCR. The result was a comprehensive document layout analysis tool that moves beyond mere technical capability to achieve genuine document comprehension.
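The linking step described above, attaching OCR content to detected bounding boxes, can be sketched roughly as follows. This is a minimal illustration and not the project's actual pipeline: the `Region` class, the `assign_words` helper, the 0.5 overlap threshold, and the sample coordinates are all hypothetical, and the detector and OCR outputs are stubbed with hand-written values rather than produced by YOLO-NAS or a real OCR engine.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; an assumed convention for this sketch.
BBox = Tuple[float, float, float, float]


@dataclass
class Region:
    """A layout region as a detector might report it (hypothetical structure)."""
    label: str
    bbox: BBox
    words: List[str] = field(default_factory=list)


def overlap_fraction(word_box: BBox, region_box: BBox) -> float:
    """Return the fraction of the word box's area that lies inside the region box."""
    wx0, wy0, wx1, wy1 = word_box
    rx0, ry0, rx1, ry1 = region_box
    inter_w = max(0.0, min(wx1, rx1) - max(wx0, rx0))
    inter_h = max(0.0, min(wy1, ry1) - max(wy0, ry0))
    word_area = (wx1 - wx0) * (wy1 - wy0)
    return (inter_w * inter_h) / word_area if word_area > 0 else 0.0


def assign_words(
    regions: List[Region],
    ocr_words: List[Tuple[str, BBox]],
    min_overlap: float = 0.5,
) -> List[str]:
    """Attach each OCR word to the region covering most of it.

    Words that no region covers by at least `min_overlap` are returned
    so the caller can inspect them.
    """
    unassigned: List[str] = []
    for text, box in ocr_words:
        best = max(regions, key=lambda r: overlap_fraction(box, r.bbox), default=None)
        if best is not None and overlap_fraction(box, best.bbox) >= min_overlap:
            best.words.append(text)
        else:
            unassigned.append(text)
    return unassigned


# Stand-in detector output: two layout regions on one page.
regions = [
    Region("title", (0, 0, 200, 40)),
    Region("paragraph", (0, 50, 200, 200)),
]
# Stand-in OCR output: words with their own boxes, mixing Arabic and English.
ocr_words = [
    ("وزارة", (10, 5, 60, 30)),
    ("Registration", (70, 5, 180, 30)),
    ("Owner", (10, 60, 60, 80)),
    ("stamp", (300, 300, 340, 320)),
]
leftover = assign_words(regions, ocr_words)
```

Matching on the share of each word's area, rather than requiring a word to sit fully inside a region, tolerates small misalignments between detector and OCR boxes. Words that fall outside every region are returned instead of being silently dropped.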
Arabic Speech Recognition with AI
Our team’s mission for this project was to improve Arabic speech recognition using Whisper, OpenAI’s pre-trained speech recognition model with strong multilingual transcription abilities.
Whisper’s journey began with pre-training on a vast amount of multilingual audio paired with transcripts. To tailor it for Arabic speech recognition, we took on a challenging task: fine-tuning. This process involved loading the pre-trained Whisper model and further training it on datasets of Arabic speech audio paired with text transcripts.
Fine-tuning Whisper was a meticulous process, with the goal of optimizing its ability to transcribe Arabic speech accurately based on the paired audio and transcript examples. Techniques like data augmentation were employed to increase the diversity of the training data and support robust performance across various accents and speaking styles.
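As an illustration of what audio data augmentation can look like, here is a minimal sketch in plain Python. The two transforms shown, additive noise and speed perturbation, are common choices assumed for this example; they are not confirmed as the techniques used in the project. The function names, the noise level, and the synthetic test tone are likewise hypothetical.

```python
import math
import random
from typing import List, Optional


def add_noise(samples: List[float], noise_level: float = 0.005,
              seed: Optional[int] = None) -> List[float]:
    """Add Gaussian noise to a waveform and clip the result to [-1.0, 1.0]."""
    rng = random.Random(seed)
    return [max(-1.0, min(1.0, s + rng.gauss(0.0, noise_level))) for s in samples]


def change_speed(samples: List[float], rate: float) -> List[float]:
    """Resample by linear interpolation.

    A rate above 1 shortens the clip (faster speech); a rate below 1
    lengthens it (slower speech). Pitch shifts along with speed.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    n_out = max(1, int(len(samples) / rate))
    out = []
    for i in range(n_out):
        pos = i * rate
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# One second of a 440 Hz tone at 16 kHz stands in for a speech clip.
wave = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_noise(wave, seed=0)
fast = change_speed(wave, 1.1)
slow = change_speed(wave, 0.9)
```

Each augmented clip keeps the original transcript as its label, so a single recording yields several training examples that differ in tempo or background noise.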
Whisper’s versatility, including its ability to handle conversational speech in many languages, made it the ideal candidate for this work. Through fine-tuning, we honed its skills to excel specifically in Arabic speech recognition while preserving its broader language abilities.
The results were astounding. Whisper, now specialized for Arabic speech recognition, accurately transcribed new Arabic speech input into text, bridging the gap between spoken words and written text. Our project had far-reaching implications, from aiding accessibility for Arabic speakers to advancing transcription services across the Arab world.
In retrospect, our internships have been pivotal milestones in our journeys. They combined immersive education with a team-oriented work setting, hands-on tasks, and networking opportunities.
The nurturing environment at Siren empowered us to venture beyond our comfort zones and face challenges head-on. We’re profoundly thankful for the insight and guidance that have laid a robust groundwork for our future pursuits. Moving forward, we are eager to apply what we have learned and to nurture the relationships we’ve built during this internship as we set our sights on making meaningful contributions to the tech industry in Lebanon and beyond.