NLP Python

These are projects that I completed primarily during one of my internships.

RAKE

RAKE stands for Rapid Automatic Keyword Extraction, a corpus-free keyword extraction algorithm. Outside of deep learning, one of the more popular keyword extraction methods is Term Frequency-Inverse Document Frequency (TF-IDF), which compares how often a word appears in a particular document against how often it appears across the entire corpus. A word that is unusually frequent in one document relative to the corpus is a likely keyword.
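To make the TF-IDF comparison concrete, here is a minimal sketch using only the standard library. It assumes documents are already tokenized into lists of words and uses the plain `tf * log(N / df)` weighting; the function name `tf_idf` and the toy documents are made up for illustration, and real libraries typically add smoothing on top of this.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # Term frequency in this document times inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus (hypothetical data, whitespace-tokenized).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(docs)
top_term = max(scores[0], key=scores[0].get)
```

A word such as "the" appears in every document, so its inverse document frequency is `log(1) = 0` and it can never be selected as a keyword, no matter how often it occurs.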

RAKE builds on the difference between a word and a phrase. In layman's terms, it uses the words that candidate phrases share to let the phrases vote on one another, and then picks the phrases that best capture the essence of the article.

This eliminates the need for a corpus, as long as we know the basic delimiter words that separate noun phrases (for example, stop words). The original paper can be found here.

Cryptum169/Rake_For_Chinese
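The scoring idea described above can be sketched in a few lines. This is a simplified illustration and not the code from the repository: it assumes English text split on whitespace (Chinese text would need a word segmenter first), and the `STOPWORDS` set is a tiny hypothetical list. Text is cut into candidate phrases at stop words and punctuation, each word is scored by its degree (the total length of the phrases it appears in) divided by its frequency, and a phrase scores the sum of its words.

```python
import re
from collections import defaultdict

# Tiny hypothetical stop word list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "is", "are", "for", "in", "on", "to", "with"}


def candidate_phrases(text):
    """Split text into phrases, breaking at punctuation and stop words."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text):
    """Return (phrase, score) pairs sorted from highest to lowest score."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how many times each word occurs
    degree = defaultdict(int)  # summed length of the phrases containing the word
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    phrase_score = {p: sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(phrase_score.items(), key=lambda kv: kv[1], reverse=True)


text = ("Rapid automatic keyword extraction is a method for keyword extraction. "
        "The method scores phrases.")
ranked = rake(text)
best_phrase = " ".join(ranked[0][0])
```

Because only the input text and a delimiter list are needed, nothing here depends on corpus-wide statistics, which is what makes the approach corpus free.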

textCNN

textCNN classifies articles by encoding words as vectors. Stacking the vectors of the words in a sentence yields a matrix that can be treated like an image, and CNNs are well suited to classifying such images.

This project is an implementation of textCNN.
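The stacking-and-convolving idea can be illustrated with a small forward pass in plain Python. This is only a sketch under stated assumptions, not the repository's model: the vocabulary, embedding size, filter widths, and random weights are all hypothetical, there is a single filter per width, and the final classification layer and training loop are left out. Each filter spans a few consecutive words across the full embedding width, and max-over-time pooling keeps its strongest response.

```python
import random

random.seed(0)

# Hypothetical toy vocabulary and sizes.
VOCAB = {"<pad>": 0, "good": 1, "bad": 2, "movie": 3, "very": 4}
EMBED_DIM = 4
FILTER_WIDTHS = (2, 3)  # number of consecutive words each filter spans


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Randomly initialised here; these could instead be pretrained word vectors.
embeddings = rand_matrix(len(VOCAB), EMBED_DIM)
filters = {width: rand_matrix(width, EMBED_DIM) for width in FILTER_WIDTHS}


def sentence_matrix(tokens):
    """Stack one embedding row per word: shape (len(tokens), EMBED_DIM)."""
    return [embeddings[VOCAB[token]] for token in tokens]


def conv_max_pool(matrix, kernel):
    """Slide the kernel over word windows, apply ReLU, keep the maximum."""
    width = len(kernel)
    activations = []
    for start in range(len(matrix) - width + 1):
        window = matrix[start:start + width]
        total = sum(x * k
                    for row, kernel_row in zip(window, kernel)
                    for x, k in zip(row, kernel_row))
        activations.append(max(0.0, total))  # ReLU
    # Assumes the sentence is at least as long as the filter is wide;
    # real implementations pad shorter sentences.
    return max(activations)


def features(tokens):
    """One pooled feature per filter; a classifier layer would consume these."""
    matrix = sentence_matrix(tokens)
    return [conv_max_pool(matrix, filters[width]) for width in FILTER_WIDTHS]


tokens = ["very", "good", "movie"]
matrix = sentence_matrix(tokens)
feats = features(tokens)
```

The pooled feature vector has a fixed length regardless of sentence length, which is what lets a standard dense layer sit on top of it for classification.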

Because feeding entire texts directly into the network proved too hard to train, I pretrained the word vectors with word2vec. This significantly reduced the training time and model size of textCNN while accuracy changed by no more than 1%; however, the size and training time of the word2vec model for the texts made up for the difference.

The code can be found here: Cryptum169/textCNN-for-Chinese