Text extractor tutorial

3/27/2023

In a previous project, I built a pipeline that automatically generates a markdown file that not only includes all the contents of a Jupyter notebook, but also automatically generated tags that a fine-tuned BERT model inferred from the text. Nonetheless, I knew that more could be done.

The BERT fine-tuning approach came with a number of drawbacks. For instance, the model was only trained on the eight most frequently occurring labels. This was in large part due to my naïve design of the model and the unavoidable limitations of multi-label classification: the more labels there are, the worse the model performs. The fact that the dataset had been manually labeled by me, who tagged articles back then without much thought, certainly did not help.

The supervised learning approach I took with fine-tuning also meant that the model could not learn to classify new labels it had not seen before. After all, the classification head of the model was fixed, so unless a new classifier was trained from scratch using new data, the model would never learn to predict new labels. And retraining and fine-tuning the model would be a costly, resource-intensive operation.

While there might be many ways to go about this problem, I've come to two realistic, engineerable solutions: zero-shot classification and keyword extraction as a means of new label suggestion. In today's post, I hope to explore the latter in more detail by introducing an easy way of extracting keywords from a block of text using transformers and contextual embeddings. The method introduced in this post heavily borrows from the methodology introduced in this Medium article by Maarten Grootendorst, the author of the KeyBERT library. I highly recommend that you check out both his post and the library on GitHub. Without further ado, let's jump right in!

Introduction

Before we get down into the engineering details, here's a bird's eye view of what we want to achieve. Given a block of text, we want a function or model that is able to extract important keywords. We might specify as a parameter how many keywords we want to extract from the given text. Finally, once we have those keywords, the idea is that each of them could potentially be used as a tag for a blog post. This way, we can overcome the shortcomings of the supervised learning approach with BERT fine-tuning discussed earlier.

I'm writing this tutorial on Google Colab, so let's go ahead and install the packages that Colab does not ship with by default: spaCy and HuggingFace transformers.
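A minimal setup cell might look like the following, assuming a Colab-like notebook environment where a leading `!` runs a shell command:

```python
# Install the packages this tutorial needs that Colab may not ship with by default.
# The "!" prefix runs the command in the notebook's shell.
!pip install -U spacy transformers
```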
Throughout this tutorial, we will use the following passage about supervised learning as our running example:

> Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a 'reasonable' way (see inductive bias).

Hopefully, we can build a simple keyword extraction pipeline that is able to identify and return salient keywords from this original text. Note that this is not a generative method; in other words, the keyword extractor will never be able to return words that are not present in the provided text. Generating new words that somehow nicely summarize the provided passage would require a generative, potentially auto-regressive model with tested and proven NLU and NLG capabilities. For the purposes of this demonstration, we take the simpler extractive approach.

The first step of keyword extraction is producing a set of plausible keyword candidates. As stated earlier, those candidates come from the provided text itself. The important question, then, is how we can select keywords from the body of text. Recall that n-grams are simply runs of consecutive words in a text; for example, the 2-grams or bi-grams of a passage span all pairs of two consecutive words. Normally, keywords are either single words or two words long. Rarely do we see long keywords: after all, long, complicated keywords are self-defeating, since the very purpose of a keyword is to be memorable, short, and concise. Using scikit-learn's count vectorizer, we can specify the n-gram range parameter, then obtain the entire list of n-grams that fall within the specified range.
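Here is a minimal sketch of that candidate-generation step. The variable names are my own, and I assume scikit-learn is available (Colab ships with it); `ngram_range=(1, 2)` restricts candidates to single words and word pairs, per the discussion above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The example passage quoted above, as a plain Python string.
doc = (
    "Supervised learning is the machine learning task of learning a function that "
    "maps an input to an output based on example input-output pairs. It infers a "
    "function from labeled training data consisting of a set of training examples. "
    "In supervised learning, each example is a pair consisting of an input object "
    "(typically a vector) and a desired output value (also called the supervisory "
    "signal). A supervised learning algorithm analyzes the training data and produces "
    "an inferred function, which can be used for mapping new examples. An optimal "
    "scenario will allow for the algorithm to correctly determine the class labels "
    "for unseen instances. This requires the learning algorithm to generalize from "
    "the training data to unseen situations in a 'reasonable' way (see inductive bias)."
)

# Enumerate every 1- and 2-gram in the passage as a keyword candidate,
# dropping English stop words, since function words rarely make good keywords.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english").fit([doc])
candidates = vectorizer.get_feature_names_out().tolist()

print(len(candidates))  # a few dozen candidate n-grams
print(candidates[:5])   # first few candidates, in alphabetical order
```

Filtering stop words is a judgment call: it keeps obvious non-keywords out of the candidate pool, at the cost of dropping phrases that happen to contain a function word.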
We now have a sensible number of words to work with that could be keywords, so all that is left is finding the best keywords out of the bunch. Let's think a little bit more about what a good keyword really is. Obviously, we aren't going to come up with some academically rigorous definition of what a keyword is.
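Intuitively, though, a good keyword is one whose meaning is close to that of the document as a whole. Below is a sketch of the selection step this post is building toward, following the KeyBERT-style methodology it cites: embed the document and every candidate with a transformer, then rank candidates by cosine similarity to the document embedding. The checkpoint choice, mean pooling, and function names are my own assumptions for illustration, not the post's exact code, and the cell reuses `doc` and `candidates` from the previous one.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any BERT-like encoder works here; this particular checkpoint is an assumption.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()

@torch.no_grad()
def embed(texts):
    """Mean-pool the encoder's last hidden state over non-padding tokens."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state     # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def extract_keywords(doc, candidates, top_k=5):
    """Return the top_k candidates whose embeddings are closest to the document's."""
    doc_emb = embed([doc])        # (1, dim)
    cand_emb = embed(candidates)  # (num_candidates, dim)
    scores = torch.nn.functional.cosine_similarity(cand_emb, doc_emb)
    top = scores.topk(min(top_k, len(candidates))).indices.tolist()
    return [candidates[i] for i in top]

print(extract_keywords(doc, candidates))
# With the example passage, phrases like 'supervised learning' should rank highly.
```

The `top_k` argument plays the role of the knob mentioned in the introduction: how many keywords to extract from a given text.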