Project Portfolio & Progress Log
This section documents every practical step in building the Somali language corpus and dictionary.
From early experiments to long-term milestones, each entry represents a building block of the Galool project.
Current Active Projects
1. Somali Text Corpus (Phase 1)
- Collecting public Somali text
- Manual cleaning tests
- Designing file structure
- Preparing for tokenization and frequency analysis
Status: In progress
2. Somali Dictionary Framework
- Choosing initial word list
- Drafting definition format
- Designing data model (JSON + human-readable format)
- Creating example entries
Status: Planning / Testing
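As a starting point for the data model, here is a minimal sketch of one entry that can be written both as JSON and as a human-readable line. The field names (headword, pos, definition_en, examples) are assumptions for illustration, not a final schema, and the sample sentence is a placeholder.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Entry:
    # Hypothetical fields; the real definition format is still being drafted.
    headword: str
    pos: str
    definition_en: str
    examples: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # ensure_ascii=False keeps any non-ASCII characters readable in the file.
        return json.dumps(asdict(self), ensure_ascii=False)

    def to_text(self) -> str:
        # Human-readable rendering of the same record.
        lines = [f"{self.headword} ({self.pos}): {self.definition_en}"]
        lines += [f"  e.g. {example}" for example in self.examples]
        return "\n".join(lines)


entry = Entry("biyo", "noun", "water", ["<sample sentence goes here>"])
print(entry.to_json())
print(entry.to_text())
```

Keeping one record type with two renderings means the JSON file and the readable version cannot drift apart.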
3. Digitizing Old Somali Materials
- Reviewing scanned Somali government textbooks
- Preparing OCR test samples
- Deciding which books to rewrite manually
- Testing correction workflows
Status: Initial experiments
4. Web Infrastructure & Documentation
- Setting up Galool.net
- Creating subdomains for internal tools
- Building documentation style
- Structuring long-term archives
Status: Ongoing
Mini-Projects
These are small experiments used to learn skills and test ideas.
Mini-Project: OCR Accuracy Test
Testing Tesseract and other OCR engines on Somali diacritics.
Goal: Evaluate accuracy and decide when manual rewriting is better.
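One common way to put a number on OCR accuracy is the character error rate: the edit distance between the OCR output and a hand-corrected reference, divided by the reference length. The sketch below is an assumption about how this test could be scored, not a record of what was actually run, and the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the hand-checked reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# One wrong character out of ten.
print(char_error_rate("Soomaaliya", "Soomaa1iya"))
```

A page whose error rate stays high after trying different settings is a candidate for manual rewriting; where to set that threshold is a judgment call.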
Mini-Project: Text Cleaning Pipeline
Building a simple Python script to remove noise: leftover HTML, punctuation errors, and duplicate lines.
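A minimal sketch of such a script, using only the standard library. The specific rules (regex tag stripping, collapsing repeated punctuation, dropping exact duplicate lines) are assumptions about what counts as noise and would need tuning on real data.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
SPACE_BEFORE_PUNCT_RE = re.compile(r"\s+([,.;:!?])")
REPEATED_PUNCT_RE = re.compile(r"([,.;:!?])\1+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_line(line: str) -> str:
    text = TAG_RE.sub(" ", line)                   # crude tag removal, not a full HTML parser
    text = html.unescape(text)                     # &amp; -> &, &quot; -> ", ...
    text = SPACE_BEFORE_PUNCT_RE.sub(r"\1", text)  # "word ," -> "word,"
    text = REPEATED_PUNCT_RE.sub(r"\1", text)      # "!!" -> "!" (also shortens "...")
    return WHITESPACE_RE.sub(" ", text).strip()


def clean_lines(lines):
    """Clean each line, then drop empty lines and exact duplicates (first one wins)."""
    seen = set()
    cleaned = []
    for line in lines:
        text = clean_line(line)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


print(clean_lines(["<p>Af  Soomaali !!</p>", "Af Soomaali!", ""]))
```

Dropping duplicates is only safe for scraped web text. In poetry, repeated lines are meaningful and should be kept.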
Mini-Project: Small Demo Corpus
Creating a 50k-100k word mini-corpus to practice tokenization, sorting, and frequency lists.
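The tokenization and frequency steps could look something like the sketch below. It assumes words are runs of letters, optionally joined by an internal apostrophe (as in go'aan), and that case can be ignored. Word-final apostrophes and typographic apostrophes are not handled here and would need a normalization pass first.

```python
import re
from collections import Counter

# Letters only (no digits or underscores), with optional internal apostrophes.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text.lower())


def frequency_list(text: str) -> list[tuple[str, int]]:
    """Tokens sorted by descending count, then alphabetically."""
    counts = Counter(tokenize(text))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(frequency_list("Go'aan iyo go'aan, 2025!"))
```

Sorting ties alphabetically keeps the output stable between runs, which makes it easier to compare frequency lists as the corpus grows.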
Mini-Project: Community Feedback Form
Creating a page where Somali speakers can suggest words, corrections, or sample sentences.
Long-Term Vision
Full Somali Corpus
Millions of words of cleaned and verified Somali text.
Full Somali-English Dictionary
Modern, comprehensive, open to everyone.
Tools & APIs
Word lookup API, POS tagger (future), search engine, teaching resources.
Education & Preservation
Digitized archives, reading materials, and content that future generations can depend on.
Progress Timeline
I will update this weekly.
Month / Year
- What was completed
- What was learned
- What challenges appeared
- What's next
Galool Somali Corpus: Progress Report
Night of Dec 07, 2025
Project Goal
Build a high-quality Somali text corpus to support dictionary development, language research, and NLP applications.
Tonight's Achievements
- OCR Tools Setup
- Installed Tesseract-OCR 5.5.0 (Windows)
- Installed OCRmyPDF 16.12.0 via Python
- Installed Poppler 25.12.0 for PDF → text conversion (pdftotext)
- Test OCR Workflow
- Extracted pages 6-10 from the first Somali literature book using Chrome's Print → Save as PDF
- Ran OCRmyPDF with -l eng for Somali Latin script
- Result: Searchable PDF preserves original formatting, layout, and background
- Mixed Script Handling
- Tested Arabic headings with -l eng+ara
- Verified that headings were recognized correctly alongside Somali text
- Extract Editable Text
- Converted OCR PDF to UTF-8 text using pdftotext
- Text is fully editable, forming the base corpus for proofreading and cleaning
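The two command-line steps above can be wrapped in a small Python helper so the same workflow can be repeated for each book. This is a sketch under assumptions: the function names, output file naming, and folder layout are illustrative, and it expects ocrmypdf and pdftotext to be on the PATH. The only options used are -l for OCRmyPDF and -enc for pdftotext.

```python
import subprocess
from pathlib import Path


def ocr_command(src: Path, dst: Path, languages: str = "eng") -> list[str]:
    # languages uses Tesseract's syntax, e.g. "eng" or "eng+ara" for mixed scripts.
    return ["ocrmypdf", "-l", languages, str(src), str(dst)]


def text_command(pdf: Path, txt: Path) -> list[str]:
    # -enc UTF-8 makes the output encoding explicit.
    return ["pdftotext", "-enc", "UTF-8", str(pdf), str(txt)]


def digitize(src: Path, out_dir: Path, languages: str = "eng") -> Path:
    """Run OCR on src, then extract the text layer; returns the .txt path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    ocr_pdf = out_dir / f"{src.stem}.ocr.pdf"
    txt = out_dir / f"{src.stem}.txt"
    subprocess.run(ocr_command(src, ocr_pdf, languages), check=True)
    subprocess.run(text_command(ocr_pdf, txt), check=True)
    return txt


# Example call (hypothetical file names):
# digitize(Path("book1_pages_6-10.pdf"), Path("corpus/book1"), "eng+ara")
```

With check=True, a failed OCR run raises an error instead of silently producing an empty text file.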
Next Steps
- Apply the workflow to the full 70+ page book and remaining literature books
- Begin manual proofreading to correct OCR errors
- Organize files into a structured corpus folder
- Plan for future OCR improvements, including Somali-specific Tesseract training
Outcome
A fully functional digitization pipeline is now established:
Original PDFs → Searchable OCR PDFs → Editable UTF-8 text
This is the foundation of the Galool Somali Corpus, paving the way for future dictionary projects, NLP research, and community contributions.
Date: 2025-12-07
Work Summary:
Today, we continued developing the Somali literature corpus project. Key accomplishments:
- Text Cleaning and Organization
- We successfully processed the first book (Book1) containing Somali poems (Gabay, Geeraar), short stories (Sheeko), and wisdom sayings (Curis).
- Cleaned the OCR text by removing noise such as page numbers, leftover symbols, and extraneous numbers, while keeping all meaningful text intact.
- Data Structuring
- Segmented the book into meaningful sections: poems, stories, and wisdom sayings.
- Added metadata such as title and author for each segment.
- Automated Corpus Generation
- Converted the cleaned and segmented text into a structured JSONL format suitable for corpus building and future NLP tasks.
- Prepared the file for easy search, filtering, and analysis of Somali literary works.
- Next Steps
- Fix minor author field inconsistencies.
- Optionally trim each segment to short previews for quick referencing.
- Continue processing remaining books in the same automated workflow.
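The noise removal described under Text Cleaning and Organization could be approximated with a line filter like the one below. This is an illustrative sketch, not the script that was actually used: it assumes page numbers appear on their own line (optionally wrapped in hyphens) and that lines made only of symbols are OCR debris.

```python
import re

# A line that is only a short number, optionally wrapped in hyphens: "12", "- 7 -".
PAGE_NUMBER_RE = re.compile(r"^-?\s*\d{1,4}\s*-?$")
# A line with no letters or digits at all: "~~|", "* * *".
SYMBOL_ONLY_RE = re.compile(r"^[\W_]+$")


def drop_ocr_noise(lines):
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            kept.append("")  # keep blank lines: they separate stanzas and paragraphs
            continue
        if PAGE_NUMBER_RE.match(stripped) or SYMBOL_ONLY_RE.match(stripped):
            continue
        kept.append(stripped)
    return kept


print(drop_ocr_noise(["Gabay", "12", "- 7 -", "~~|", "", "Sheeko 1"]))
```

A filter like this will also drop meaningful standalone numbers, such as numbered stanzas, so the output still needs manual proofreading.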
Outcome:
The first book is now fully cleaned, segmented, and structured. This forms the foundation of a high-quality Somali literature corpus for research, NLP, and cultural preservation.
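For reference, a JSONL corpus file holds one JSON object per line, which makes it easy to append, filter, and stream. The sketch below shows one way to write and read such a file. The field names (book, genre, title, author, text) are assumptions based on the metadata described above, and the placeholder values are not real corpus data.

```python
import json
from pathlib import Path


def write_jsonl(segments, path):
    """Write one JSON object per line, UTF-8, without escaping non-ASCII."""
    with Path(path).open("w", encoding="utf-8", newline="\n") as handle:
        for segment in segments:
            handle.write(json.dumps(segment, ensure_ascii=False) + "\n")


def read_jsonl(path):
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Placeholder records with hypothetical field names.
segments = [
    {"book": "Book1", "genre": "gabay", "title": "<title>", "author": "<author>", "text": "<poem text>"},
    {"book": "Book1", "genre": "sheeko", "title": "<title>", "author": "<author>", "text": "<story text>"},
]

# Example usage (hypothetical file name):
# write_jsonl(segments, "book1.jsonl")
# poems = [s for s in read_jsonl("book1.jsonl") if s["genre"] == "gabay"]
```

Normalizing the author field before writing would address the inconsistencies noted in the next steps.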