Build Custom Keyword Dictionary Framework Using SpaCy

September 30, 2021 by Administrator

Milestone 1: “Framework Setup”


The Result


A Python script for converting text data from PDF files to txt files. All text documents were split into paragraphs using regular expressions. Newly created text files were saved into the Google Drive folder for further processing.


The Challenge


The client approached Mellivora experts to create new code, with the PDF download and text extraction logic split out from the model, and to improve the extracted file information (e.g. title and other metadata) and the extracted text (e.g. paragraph numbering, without dropped spaces, stray special characters, or similar issues).


The Solution


To reach the set goal, we used the “requests” Python package to download PDF files from links stored in the csv file provided by the client. Then, using the “textract” Python package, the text from the PDF files was converted into txt format. Using regular expressions, the data was cleaned of byte symbols and special characters and split into paragraphs (each paragraph starts on a new line). The new txt files were stored in the Google Drive folder.
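A minimal sketch of this pipeline is shown below; the CSV column name, output paths, and cleanup regular expressions are illustrative assumptions rather than the client's actual configuration.

```python
import csv
import re

import requests
import textract  # pip install textract


def download_pdfs(csv_path: str, out_dir: str = ".") -> list:
    """Download every PDF referenced in the client's CSV and return the local paths."""
    paths = []
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            # "pdf_url" is an assumed column name, used here only for illustration
            response = requests.get(row["pdf_url"], timeout=30)
            response.raise_for_status()
            pdf_path = f"{out_dir}/doc_{i}.pdf"
            with open(pdf_path, "wb") as pdf_file:
                pdf_file.write(response.content)
            paths.append(pdf_path)
    return paths


def pdf_to_paragraphs(pdf_path: str, txt_path: str) -> None:
    """Extract text with textract, strip byte/special symbols, and write one paragraph per line."""
    raw = textract.process(pdf_path).decode("utf-8", errors="ignore")
    cleaned = re.sub(r"[^\x20-\x7E\n]", " ", raw)                    # drop non-printable symbols
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", cleaned) if p.strip()]
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write("\n".join(paragraphs))                             # each paragraph starts on a new line
```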


Technology Stack 


The technology stack was based on Python:


  • Data storage: Google Drive (input data, files)
  • PDF file downloads: “requests” Python package
  • Text extraction from PDF: “textract” Python package
  • Data cleaning and splitting: “re” Python package


Milestone 2: “New Base Model”


The Result


Created a Python script for extracting keywords in 5 categories from documents in txt format. A document with the POS and dependency-relation rules for each keyword category was also produced. Extracted keywords were saved into csv files on Google Drive.


The Challenge


The client approached Mellivora experts to create a Python script for a Part-of-Speech tagging model that can identify and extract keywords for 5 categories: “profiles”, “categories”, “goals”, “measures”, “actions”. The grammar-based rules from the previous model should be improved upon for better accuracy.


The Solution


To reach the set goal, we used the “NLTK” Python package for the tokenization and sentence splitting tasks. To create the extraction rules, we used the SpaCy dependency parser, POS tagger and Matcher. Rules were created for 4 keyword categories: “profiles”, “goals”, “measures”, “actions”; the document describing these rules is stored on Google Drive. We also used SpaCy NER to extract keywords that can be accepted as “profiles” and “measures” keywords. For the “categories” keywords we applied Latent Dirichlet Allocation (LDA) topic modeling to each paragraph; LDA is a popular topic-modeling algorithm with an excellent implementation in Python’s Gensim package. The topic modeling results are saved in csv files on Google Drive, one per document.
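The sketch below illustrates how these pieces fit together; the Matcher pattern, the NER labels, and the LDA parameters are invented placeholders, not the actual rules from the rules document.

```python
import nltk
import spacy
from spacy.matcher import Matcher
from gensim import corpora
from gensim.models import LdaModel

nltk.download("punkt", quiet=True)   # tokenizer data for word_tokenize (newer NLTK may also need "punkt_tab")
nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)
# Hypothetical POS pattern for illustration: optional adjectives followed by a noun.
matcher.add("MEASURES", [[{"POS": "ADJ", "OP": "*"}, {"POS": "NOUN"}]])


def extract_keywords(paragraph: str) -> dict:
    """Return rule-based (Matcher) and NER-based keyword candidates for one paragraph."""
    doc = nlp(paragraph)
    rule_hits = [doc[start:end].text for _, start, end in matcher(doc)]
    ner_hits = [ent.text for ent in doc.ents if ent.label_ in {"ORG", "PERSON"}]  # assumed labels
    return {"rule_based": rule_hits, "ner_based": ner_hits}


def paragraph_topics(paragraphs: list, num_topics: int = 5) -> list:
    """Fit a Gensim LDA model over the paragraphs and return the top terms of each paragraph's dominant topic."""
    tokenized = [nltk.word_tokenize(p.lower()) for p in paragraphs]
    dictionary = corpora.Dictionary(tokenized)
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=5)
    dominant = [max(lda[bow], key=lambda t: t[1])[0] for bow in corpus]
    return [[word for word, _ in lda.show_topic(topic_id)] for topic_id in dominant]
```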


Technology Stack 


The technology stack was based on Python:


  • Data storage: Google Drive (input data, CSV and txt files)
  • Tokenization: “NLTK” Python package
  • Sentence splitting: “NLTK” Python package
  • NER for keyword extraction: “SpaCy” Python package
  • Dependency parser for keyword extraction: “SpaCy” Python package
  • POS tagger for keyword extraction: “SpaCy” Python package
  • Rule generation: Matcher class in the “SpaCy” Python package
  • Latent Dirichlet Allocation (LDA) topic modeling for keyword extraction: “Gensim” Python package


Milestone 3: “Base Relationship Graphs”


The Result


Created a Python script for calculating the co-occurrence and average distance score between keywords based on paragraph distance. In this way we created relationship pairs between keywords that appear together. The created pairs were saved in csv files on Google Drive.


The Challenge


The client approached Mellivora experts to create a Python script that builds relationships among the different types of extracted keywords using information from the document (e.g. distance, co-occurrence).


The Solution


To reach the set goal, we used the “collections” Python package and its Counter class to count the paragraph distance between keywords and obtain a co-occurrence score for each pair. Pairs were made only between keywords whose distance is less than 10 paragraphs. For the average distance we kept a list of paragraph-distance scores per pair: the algorithm sums all the scores in that list and divides by its length. The resulting co-occurrence and average distance scores were saved in a “Pandas” dataframe and written to csv files on Google Drive.
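A minimal sketch of this scoring step follows; the input format (each keyword mapped to the paragraph indices where it occurs) is our assumption about how the Milestone 2 output is organized.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

MAX_DISTANCE = 10  # only pair keywords that appear within 10 paragraphs of each other


def score_pairs(keyword_positions: dict) -> pd.DataFrame:
    """Build keyword pairs with a co-occurrence count and an average paragraph distance."""
    cooccurrence = Counter()
    distances = {}
    for (kw_a, pos_a), (kw_b, pos_b) in combinations(keyword_positions.items(), 2):
        for i in pos_a:
            for j in pos_b:
                dist = abs(i - j)
                if dist < MAX_DISTANCE:
                    cooccurrence[(kw_a, kw_b)] += 1
                    distances.setdefault((kw_a, kw_b), []).append(dist)
    rows = [
        {"keyword_1": a, "keyword_2": b,
         "cooccurrence": cooccurrence[(a, b)],
         "avg_distance": sum(d) / len(d)}   # sum of distances divided by the list length
        for (a, b), d in distances.items()
    ]
    return pd.DataFrame(rows)


# Toy usage; in the project the resulting CSV files were written to Google Drive.
pairs = score_pairs({"profit": [0, 4], "growth": [2, 15], "revenue": [3]})
pairs.to_csv("keyword_pairs.csv", index=False)
```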


Technology Stack 


The technology stack was based on Python:


  • Data storage: Google Drive (input data, CSV and txt files)
  • Loading extracted keywords: “NumPy” Python package
  • Co-occurrence calculation: Counter class in the “collections” Python package
  • Dataframe generation: “Pandas” Python package


Milestone 4: “AI/ML Models for Aggregate Relationships”


The Result


Created a Python script for calculating similarity between extracted keywords in order to create keyword pairs. We calculated 3 different similarity metrics and averaged them into a single similarity score. The pairs created from these similarity scores were saved in csv files on Google Drive. We also created a Python script for predicting clusters for each keyword category.


The Challenge


The client approached Mellivora experts to train AI models on the base data from many documents to predict clusters (dimensionality reduction) and similarity/synonyms (dictionary and model).


The Solution


To reach the set goal, we combined 3 different similarity metrics. We used the “SpaCy” Python package and its similarity method to obtain a score based on the syntactic similarity of keywords. We used the FastText and Word2Vec frameworks to obtain vector representations: with FastText we calculated word vectors for the document and applied the cosine similarity metric from the “scipy” Python package to get a semantic similarity score, and with Word2Vec, imported from the “Gensim” Python package, we calculated a semantic similarity score between keywords using its similarity function over the vector representations. For the clustering task we used the KMeans algorithm from the “sklearn” Python package, and for identifying keyword importance we used TF-IDF scores.
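The sketch below shows one way the three metrics can be averaged and the clustering applied; the training corpus, vector sizes, and cluster count are toy placeholders, and Gensim's FastText implementation stands in for the FastText framework here.

```python
import numpy as np
import spacy
from gensim.models import FastText, Word2Vec
from scipy.spatial.distance import cosine
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_md")   # md/lg models ship word vectors for .similarity()

# Toy corpus of tokenized keywords; the real models were trained on the client's documents.
sentences = [["customer", "profile"], ["revenue", "growth"], ["customer", "growth"]]
w2v = Word2Vec(sentences, vector_size=50, min_count=1)
ft = FastText(sentences, vector_size=50, min_count=1)


def average_similarity(kw_a: str, kw_b: str) -> float:
    """Average the spaCy, FastText-cosine, and Word2Vec similarity scores for two keywords."""
    spacy_sim = nlp(kw_a).similarity(nlp(kw_b))
    fasttext_sim = 1 - cosine(ft.wv[kw_a], ft.wv[kw_b])   # cosine similarity via scipy's distance
    w2v_sim = w2v.wv.similarity(kw_a, kw_b)
    return float(np.mean([spacy_sim, fasttext_sim, w2v_sim]))


# Clustering and TF-IDF importance scores for one keyword category (toy data).
keywords = ["customer profile", "revenue growth", "customer growth"]
tfidf = TfidfVectorizer().fit_transform(keywords)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(tfidf.toarray())
print(average_similarity("customer", "growth"), labels)
```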


Technology Stack 


The technology stack was based on Python:


  • Data storage: Google Drive (input data, CSV and txt files)
  • Loading extracted keywords: “NumPy” Python package
  • Syntactic similarity calculation: “SpaCy” Python package
  • Semantic similarity calculation: FastText, SciPy, Gensim (Word2Vec)
  • Clustering: KMeans class in the “sklearn” Python package
  • Importance scores: TfidfVectorizer class in sklearn.feature_extraction.text
  • Dataframe generation: “Pandas” Python package


Summary


The cooperation with the client is ongoing, and we are happy to keep providing value with our NLP expertise! Need a hand with your NLP/Machine Learning endeavours? We are happy to help!