TAPS – Taya Arabic Processing Suite

TAPS (Taya Arabic Processing Suite) is the core of Taya’s digital content platform. It covers a wide range of tasks, including:

  • Arabic morphology
  • Language detection
  • Named entity extraction
  • Stemming
  • Stop word removal
  • Normalization
  • Keyboard layout detection
  • Various data banks
  • Dictionaries
  • And much more… 

 

TAPS is a multi-purpose Arabic processing toolkit that supports Arabic language analysis through advanced tools and techniques, and it is designed to adapt to any search engine. TAPS has 14 key features:

 

1. Tokenization & Segmentation

Tokenization of raw text is a standard pre-processing procedure for many NLP tasks.

For English, tokenization usually involves splitting off punctuation and separating some affixes, such as the possessive ’s. Arabic, however, requires somewhat broader token pre-processing, which is usually called “segmentation”.

While tokenization separates words from running text, segmentation goes a step further and divides each word into clusters of consecutive morphemes. One of these clusters typically corresponds to the word stem and usually includes its inflectional morphemes.
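
The difference between the two steps can be illustrated with a minimal sketch. It is not the TAPS implementation: the prefix list and the stem lexicon are hypothetical and far too small for real use, and the regular expression assumes text without diacritics. A prefix is split off only when the remainder can be resolved to a known stem, which avoids cutting the first letter from words that merely begin with a prefix-like character.

```python
import re

# Hypothetical, deliberately tiny inventory of Arabic prefixes (proclitics).
PREFIXES = ["و", "ف", "ب", "ل", "ك", "ال"]

# A token is either a run of word characters or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split running text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def _split(word, lexicon):
    """Return prefix segments plus a known stem, or None if no parse exists."""
    if word in lexicon:
        return [word]
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            tail = _split(word[len(prefix):], lexicon)
            if tail is not None:
                return [prefix + "+"] + tail
    return None


def segment(word, lexicon):
    """Segment a word into prefixes and stem; leave it whole if unparseable."""
    return _split(word, lexicon) or [word]


if __name__ == "__main__":
    stems = {"كتاب"}  # assumed stem lexicon containing only "book"
    for token in tokenize("ذهب الولد، وبالكتاب"):
        print(token, segment(token, stems))
```

A production segmenter would also handle suffixes and rank competing analyses, which this sketch does not attempt.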

 

2. Arabic Normalizer

Orthographic normalization is a step that Arabic NLP researchers apply to reduce “noise” in the data. It is especially common when preparing parallel text for machine translation, documents for information retrieval or text for language modelling. Normalization can include Tatweel removal (removing the Tatweel elongation symbol), diacritic removal and letter normalization (converting variant forms of a letter to a single form).
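
The three operations named above can be sketched in a few lines. This is an illustration rather than the TAPS normalizer: in particular, the choice of which letter variants to fold together (alef forms, alef maksura, teh marbuta) is an assumption, and different applications make different choices.

```python
import re

TATWEEL = "\u0640"  # Arabic Tatweel (kashida) elongation character

# Fathatan (U+064B) through sukun (U+0652): the common short-vowel marks.
DIACRITICS_RE = re.compile("[\u064B-\u0652]")

# Assumed letter-folding table; adjust to the needs of the application.
LETTER_MAP = str.maketrans({
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0649": "\u064A",  # alef maksura          -> yeh
    "\u0629": "\u0647",  # teh marbuta           -> heh
})


def normalize(text):
    """Remove Tatweel and diacritics, then fold letter variants."""
    text = text.replace(TATWEEL, "")
    text = DIACRITICS_RE.sub("", text)
    return text.translate(LETTER_MAP)


if __name__ == "__main__":
    # A name written with hamza, two Tatweels and a fatha.
    print(normalize("\u0623\u062D\u0640\u0640\u0645\u064E\u062F"))
```

Folding variants improves recall in search but discards distinctions, so it is usually applied to the index and the query alike rather than to text shown to users.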

 

3. Morphology Analyzer

Morphology is the study of how words are structured, that is, how they are built up from smaller units called morphemes. Every language has its own morphology, which can differ greatly from that of other languages.

The merit of a morphological analyzer is twofold. First, from a theoretical (linguistic) point of view, it is a useful tool for linguistic modelling and for testing particular analyses. Second, it supports a variety of applications such as information retrieval, search engines and machine translation.

TAPS extracts the stem of any Arabic word entered into a search engine, taking into account that short vowels are usually omitted in written Arabic. It narrows a word’s multiple possible meanings down to a single, accurate interpretation.

 

4. Transliteration

Transliteration converts text phonetically from one writing system into another; it is a different process from translation. Our transliteration module converts Romanized Arabic words, which are common in informal communication, into their Arabic-script equivalents.

 

5. Named Entity Recognition

Named entity recognition (NER) identifies proper names in a text and classifies them into pre-defined categories of interest, such as names of people, names of organisations (companies, government organisations and committees), names of locations (cities and countries) and miscellaneous expressions (dates, times, numbers, percentages, monetary expressions and measurement expressions). TAPS’ NER module detects named entities of types such as Location, Organization and People.

 

6. Keyboard Layout Detection / Correction

This feature recognizes a keyboard-mapping mismatch between a computer and a given remote user, based on a defined dictionary, and in doing so improves the quality of spelling suggestions. It can therefore automatically detect Arabic words that were typed without switching the input language, and convert them into Arabic letters.
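
A minimal sketch of the idea follows, assuming the common Arabic (101) key arrangement on a QWERTY keyboard; the mapping should be checked against the actual target layout, and the one-word dictionary is hypothetical. A Latin-typed token is remapped key by key, and the remapped form is accepted only if it appears in the Arabic dictionary.

```python
# Assumed key-to-letter mapping for the common Arabic (101) layout.
QWERTY_TO_ARABIC = {
    "q": "ض", "w": "ص", "e": "ث", "r": "ق", "t": "ف", "y": "غ",
    "u": "ع", "i": "ه", "o": "خ", "p": "ح", "[": "ج", "]": "د",
    "a": "ش", "s": "س", "d": "ي", "f": "ب", "g": "ل", "h": "ا",
    "j": "ت", "k": "ن", "l": "م", ";": "ك", "'": "ط",
    "z": "ئ", "x": "ء", "c": "ؤ", "v": "ر", "b": "لا", "n": "ى",
    "m": "ة", ",": "و", ".": "ز", "/": "ظ", "`": "ذ",
}


def remap(text):
    """Replace each typed key with the Arabic letter on the same key."""
    return "".join(QWERTY_TO_ARABIC.get(ch, ch) for ch in text)


def fix_layout(token, arabic_dictionary):
    """Return the Arabic form if remapping yields a dictionary word."""
    candidate = remap(token)
    return candidate if candidate in arabic_dictionary else token


if __name__ == "__main__":
    words = {"السلام"}  # hypothetical one-word dictionary
    print(fix_layout("hgsghl", words))  # typed with the wrong layout
    print(fix_layout("hello", words))   # genuine Latin input, left as is
```

The dictionary check is what separates a layout mismatch from ordinary Latin-script input: remapping alone would convert every English word into Arabic letters.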

 

7. Translation of English words into Arabic

Machine translation (MT) is the process of using computer software to translate text from one natural language to another. In this module, translation is based on dictionary entries, which means that words are translated as they appear in a dictionary. The module translates English words into Arabic using the most reliable and trustworthy dictionaries.

 

8. Spell Correction

Spell Correction is a module that identifies misspelled Arabic and English words, alerts the user, and offers a set of suggested corrections for each misspelled word.
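
One common way to generate such suggestions is to rank dictionary words by edit (Levenshtein) distance from the misspelled word. The sketch below shows that general technique; it is not a description of how the TAPS module works, and the word list and thresholds are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word, dictionary, max_distance=2, limit=5):
    """Return up to `limit` dictionary words within `max_distance` edits."""
    if word in dictionary:
        return []  # spelled correctly, nothing to suggest
    scored = [(edit_distance(word, w), w) for w in dictionary]
    close = sorted((d, w) for d, w in scored if d <= max_distance)
    return [w for _, w in close[:limit]]


if __name__ == "__main__":
    vocabulary = {"search", "march", "perch", "banana"}
    print(suggest("serch", vocabulary))
```

Because the distance is computed over characters, the same functions work unchanged for Arabic and English strings. Scanning the whole dictionary is linear in its size, so a large vocabulary would need an index.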

 

9. Information Extraction

Our Information Extraction library is a toolkit that detects and extracts metadata and structured text content from various documents. It extracts:

  • Text and metadata from PDF files, Microsoft Word documents, plain-text formats, and audio and video formats
  • Article text from HTML pages without headers, footers or ads
  • All metadata and open graph metadata
  • The best image to represent an article

Other benefits include automatic character set detection and conversion (e.g. Windows-1256, UTF-8) that does not rely on page metadata.

 

10. Text Summarizer

Text Summarizer is a module that automatically shortens a text document to produce a summary that retains the most important points of the original. It is built on TAPS and uses AI algorithms and natural language processing to pick the phrases that best represent the input text. The tool was originally designed to summarize news stories and articles.

Currently, Text Summarizer supports texts written in Arabic, English and French, and will soon cover more languages.
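
Selecting the most representative phrases is the core of extractive summarization. As a rough sketch of that family of techniques (not the algorithm used by Text Summarizer), the example below scores each sentence by the average document frequency of its words and keeps the top-scoring sentences in their original order. The sentence splitter and the optional stop-word set are simplifying assumptions.

```python
import re
from collections import Counter

# Split after sentence-ending punctuation, including the Arabic question mark.
SENTENCE_RE = re.compile(r"(?<=[.!?؟])\s+")


def _words(sentence, stop_words):
    """Lower-cased word tokens of a sentence, minus stop words."""
    return [w for w in re.findall(r"\w+", sentence.lower())
            if w not in stop_words]


def summarize(text, max_sentences=2, stop_words=frozenset()):
    """Return the highest-scoring sentences, in document order."""
    sentences = [s.strip() for s in SENTENCE_RE.split(text) if s.strip()]
    freq = Counter(w for s in sentences for w in _words(s, stop_words))

    def score(sentence):
        tokens = _words(sentence, stop_words)
        if not tokens:
            return 0.0
        return sum(freq[w] for w in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original order
    return [sentences[i] for i in keep]


if __name__ == "__main__":
    print(summarize("Cats sleep. Cats eat fish. Dogs bark.", max_sentences=1))
```

Averaging over sentence length keeps long sentences from winning simply because they contain more words.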

 

11. Language Identification

Language identification determines the language of a given piece of content. The module detects up to 40 languages with high accuracy. It can also detect more than one language within a single document and determine which one is used most.
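
To illustrate how several languages in one document can be detected and ranked, here is a toy sketch based on counting frequent function words. The three word lists are hypothetical and tiny, and this is a stand-in for illustration only: production identifiers typically rely on character n-gram statistics, and nothing here reflects how TAPS itself identifies languages.

```python
import re
from collections import Counter

# Hypothetical mini-profiles: a few very frequent words per language.
PROFILES = {
    "ar": {"في", "من", "على", "إلى", "أن", "هذا"},
    "en": {"the", "and", "of", "to", "is", "in"},
    "fr": {"le", "la", "les", "et", "est", "des"},
}


def language_shares(text):
    """Map each detected language to its share of the matched words."""
    counts = Counter()
    for token in re.findall(r"\w+", text.lower()):
        for lang, words in PROFILES.items():
            if token in words:
                counts[lang] += 1
    total = sum(counts.values())
    if not total:
        return {}
    return {lang: n / total for lang, n in counts.most_common()}


def dominant_language(text):
    """Return the language with the largest share, or None if no match."""
    shares = language_shares(text)
    return max(shares, key=shares.get) if shares else None


if __name__ == "__main__":
    mixed = "the book is on the table et le chat"
    print(language_shares(mixed))
    print(dominant_language("هذا الكتاب في البيت"))
```

Returning shares rather than a single label is what makes it possible to report both the set of languages present and the one used most.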

 

12. Predefined Topical Grouping

This module can be applied to any document to classify it into a set of predefined categories (e.g. sports, family, lifestyle). The grouping can be customised for a variety of topics, based on customer need, in Arabic as well as many other languages.

 

13. Dynamic Document Clustering

Dynamic Document Clustering is an unsupervised clustering module that gathers similar documents into separate groups and gives each group an automatically generated label that represents the documents in it.

 

14. Recommendation System

Recommendation System is a tool that seeks to predict the rating or preference that a user would give to an item (such as music, books or movies) or a social element (such as people or groups) they have not yet experienced. Such systems use a model built from the characteristics of an item (content-based approaches) or from the user’s social environment (collaborative filtering approaches). This module uses content-based approaches, which attempt to understand the Arabic text, and recommends items (movies, books, news, etc.) that it expects the user to like.
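
A content-based approach can be sketched as follows: build a profile from the text of the items a user liked, then rank unseen items by cosine similarity to that profile. This is a generic bag-of-words illustration under simplifying assumptions; the catalog, the item identifiers and the plain word counts are hypothetical and do not describe the TAPS module.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def recommend(liked, catalog, top_n=3):
    """Rank unseen catalog items by similarity to the liked items' text."""
    profile = Counter()
    for item_id in liked:
        profile.update(vectorize(catalog[item_id]))
    scored = [(cosine(profile, vectorize(text)), item_id)
              for item_id, text in catalog.items() if item_id not in liked]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for score, item_id in scored[:top_n] if score > 0]


if __name__ == "__main__":
    catalog = {  # hypothetical item descriptions
        "a": "football match report",
        "b": "football league results",
        "c": "cooking pasta recipe",
        "d": "match highlights football",
    }
    print(recommend(["a"], catalog))
```

For Arabic text, applying normalization and stemming before vectorizing would let different surface forms of the same word count as a match.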


Seeking to close the gap between the Arab world and the Middle East region on the one hand and the West on the other, Taya IT’s research and development team built a powerful set of Natural Language Processing technologies, which in turn led to a whole set of new tools and subsequent technologies. These technologies form the foundation on which we build products and services for both businesses and consumers.

 

 

 

 

About Taya IT

Founded in 2006 by Wafik Shamma, Taya IT started life as a research and development lab.
