+98 21 8609-3065 h.veisi[at]ut.ac.ir
Deep Learning for Persian Natural Language Processing

With the growth of unstructured text data on the Internet, mainly the result of human interaction on Web 2.0 and social networks, finding a way to automatically process this data and extract knowledge from it has become indispensable. Despite its unstructured format, this data contains valuable knowledge that can be extracted using knowledge discovery and machine learning techniques. There has been great progress in natural language processing (NLP) tasks such as sentiment analysis, opinion mining, topic identification, machine translation, named entity recognition, part-of-speech tagging, parsing, information extraction, question answering, and paraphrase detection. In most NLP tasks, we first develop an algorithm and then convert the data into a form that can be fed into it. This step is called feature engineering, and it is very time consuming. In text data, words are usually taken as the features. This approach has two shortcomings: first, word order may be lost; second, the feature vectors are sparse, which affects training time. The aim of this project is to find a way to extract features automatically from Persian text data. We chose deep learning to deal with this problem. A neural network with more than one hidden layer is called a deep network. In this method, each word is described by a numerical vector, called a distributed representation or word vector. This representation carries semantic and syntactic information about the word. Sentences are concatenations of words, so if words can be described by such vectors, sentences can be too. The ways of combining word vectors range from simple mathematical operators, such as vector addition or multiplication, to recurrent neural networks and recursive...
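As a rough illustration of the two representations discussed above, the following sketch compares a bag-of-words count vector with a sentence vector built by adding word vectors. The vocabulary, the English toy tokens, the 3-dimensional vectors, and the function names are all made up for this example; real word vectors would be learned from a Persian corpus.

```python
# Toy illustration: bag-of-words counts vs. adding dense word vectors.
# Vocabulary and vector values are invented for this example only.

def bag_of_words(tokens, vocabulary):
    """Return a count vector with one slot per vocabulary word."""
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = [0] * len(vocabulary)
    for token in tokens:
        if token in index:
            counts[index[token]] += 1
    return counts

def add_vectors(tokens, word_vectors):
    """Compose a sentence vector by element-wise addition of word vectors."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # skip out-of-vocabulary words
        total = [t + v for t, v in zip(total, vec)]
    return total

VOCABULARY = ["dog", "bites", "man", "cat", "sees", "bird"]
WORD_VECTORS = {
    "dog": [0.5, 0.25, 0.0],
    "bites": [0.0, 0.5, 0.25],
    "man": [0.25, 0.0, 0.5],
}

s1 = ["dog", "bites", "man"]
s2 = ["man", "bites", "dog"]

# One slot per vocabulary word: mostly zeros once the vocabulary grows.
bow1 = bag_of_words(s1, VOCABULARY)
bow2 = bag_of_words(s2, VOCABULARY)

# Dense vectors keep a fixed, small dimension regardless of vocabulary size.
sum1 = add_vectors(s1, WORD_VECTORS)
sum2 = add_vectors(s2, WORD_VECTORS)

print(bow1 == bow2)  # both sentences get the same count vector
print(sum1 == sum2)  # addition is order-insensitive as well
```

Both sentences receive identical vectors under either scheme, because counting and addition ignore word order. Order-sensitive composition is one reason to consider models that read the words in sequence.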
Persian Medical Question Answering System

In this project, a Persian Question Answering (QA) system is created to ease access to information resources for doctors, health providers and users. To this end, a set of Persian documents related to drugs and diseases was collected. Because processing structured documents improves the performance of the QA system, all documents were converted into semi-structured documents. The developed system consists of three main units: question processing, document retrieval, and answer extraction. The question processing unit, as the most important module, consists of four components that sequentially extract keywords/queries. These components use a dictionary of drug/disease names and keywords/queries. This process is shown in the following figure. If a component fails to extract keywords from the question, another component, chosen based on the condition of the question, performs the extraction instead. The first part of the QA system is the question processing module. Its main components are the Question Classifier, N-gram Tokenizer, Patterns Matching and Advanced Tokenizer. In this architecture, the question asked by the user is normalized, and then the drug or disease name related to the question is extracted using the Name Entity (NE) Dictionary. If this name is extracted from the question, the question is sent to the Question Classifier component, which extracts the phrases that indicate the meaning of the question. Finally, using the concepts in the dictionary, the keywords are extracted and the phrases are mapped to the dictionary keywords. On the other hand, if the Question Classifier fails to extract any keywords, the question would be sent...
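The general idea of a fallback chain, where a later component takes over when an earlier one extracts nothing, can be sketched as follows. This is only an illustration under stated assumptions: the dictionaries, the English toy questions, the function names, the matching rules, and the order of the two extractors are hypothetical and are not taken from the described system, whose components and routing conditions are more elaborate.

```python
import re

# Hypothetical toy dictionaries; the real system uses Persian
# drug/disease names and keywords.
NE_DICTIONARY = {"aspirin", "diabetes"}
KEYWORD_DICTIONARY = {
    "side effects": "side_effects",
    "dosage": "dosage",
    "symptoms": "symptoms",
}

def normalize(question):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in question)
    return " ".join(cleaned.lower().split())

def find_entity(question):
    """Return the first dictionary name found in the question, or None."""
    for token in question.split():
        if token in NE_DICTIONARY:
            return token
    return None

def pattern_extractor(question):
    """First step (assumed): match 'what is/are the <phrase> of ...'."""
    match = re.match(r"what (?:is|are) the (.+?) of ", question)
    if match and match.group(1) in KEYWORD_DICTIONARY:
        return [KEYWORD_DICTIONARY[match.group(1)]]
    return []

def ngram_extractor(question, max_n=2):
    """Fallback step (assumed): look up every 1..max_n word n-gram."""
    tokens = question.split()
    found = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in KEYWORD_DICTIONARY:
                found.append(KEYWORD_DICTIONARY[gram])
    return found

def process_question(question, extractors=(pattern_extractor, ngram_extractor)):
    """Normalize, look up the entity, then try each extractor in turn."""
    normalized = normalize(question)
    entity = find_entity(normalized)
    for extractor in extractors:
        keywords = extractor(normalized)
        if keywords:  # stop at the first component that succeeds
            return {"entity": entity, "keywords": keywords,
                    "source": extractor.__name__}
    return {"entity": entity, "keywords": [], "source": None}

print(process_question("What are the side effects of Aspirin?"))
print(process_question("Aspirin dosage for children?"))
```

The first question is handled by the pattern step, while the second does not fit the pattern and falls through to the n-gram lookup. Keeping each step as a plain function that returns a possibly empty list makes the chain easy to reorder or extend.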
Text Mining

Most of the valuable data around us is in an unstructured format. Discovering useful knowledge from text, which is a kind of unstructured data, is an important task. Text mining (or text analytics) refers to the process of extracting information from text using machine learning algorithms. Research on text mining in DSPLab covers text categorization, text clustering, concept/entity extraction, sentiment analysis, document summarization and document similarity, focusing on the Persian language. ...