Scraping and classifying articles to analyze Corporate Rebranding
Authors: Cheng Ren, Grace Zhao, Abhishek Roy, Khoa Nguyen, Wish He, Naman Patel
I. Introduction
The Empirical Examination Of Corporate Rebranding And Trademarks Project aims to improve our understanding of the trademark rebranding activities of public companies in the U.S. and to answer the following questions:
- What kind of companies are engaged in name rebranding?
- What factors motivated these activities?
With those questions in mind, this semester our team explored multiple databases of 10-K reports and statements as well as news articles, and conducted broad Google searches on rebranding using web scraping techniques. We performed EDA and text analysis on the news reports and articles that might be relevant to rebranding, and built models to further determine their relevance to the subject of interest. This mission posed challenges as well as opportunities for growth for our team; we discuss both in detail below.
II. Data Collection
1. Dataset exploration
We decided to use two databases, Nexis Uni and Business Source Complete (among others we considered, e.g. Factiva), to download our main dataset for future analysis. We first downloaded hundreds of articles to get a sense of what constitutes a related article (flagged as 1).
1.1. Keywords selection
The data collection team experimented with various sets and combinations of keywords to determine the best search criteria. Some keywords used were: “Rebranding”, “Rebrand”, “Trademark”, “Trademarks”, and “Company Rebrand”. Among these, “company rebrand” worked the best, with around 80% to 90% of the returned articles being related (actual rebranding announcements); the unrelated ones (flagged as 0) were mostly about “how to rebrand” or “when to rebrand”.
Moreover, some other common keywords and phrases that we found in the bodies of the articles are as follows: “Firm previously known as”, “formerly known as”, “Changing its name to”, “Under the new name”, “new name”, “Patents”, “Strategic”, “Announce”, “Acquire”, etc.
We settled on the simple criterion of rebrand/rebranding/rebranded/rebrands in titles: title(rebrand*).
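As a rough Python equivalent of this title filter (a sketch; the DataFrame df and the column name 'Article Title' are assumptions), titles can be flagged with a simple regular expression:

import re
import pandas as pd

# Case-insensitive pattern mirroring title(rebrand*): rebrand, rebrands, rebranding, rebranded, ...
REBRAND_PATTERN = re.compile(r"\brebrand\w*", re.IGNORECASE)

def title_matches(title):
    # True if the title contains any word starting with "rebrand"
    return bool(REBRAND_PATTERN.search(str(title)))

print(title_matches("Acme Corp announces rebranding"))  # True
print(title_matches("How to build a brand"))            # False

# Hypothetical usage on a DataFrame of search results:
# df = df[df["Article Title"].apply(title_matches)]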
1.2. Pipelining the downloading process
Pipelining the downloading and logging process in Nexis Uni:
- Search using the selected keywords
- Download the full documents in DOC format as a zip file
- Download the result list for ‘News’
- Open the result list in Excel and set up VBA as below
1.3. Limitations
There are some limitations when downloading from Nexis Uni:
- Full text: can only download 100 articles per batch
- Result list: can only download 250 articles per batch
- Only the first 1000 results of any search can be downloaded
- “Please limit your selection to 100 of just the top 1000 results.”
1.4. Number of articles returned
1.5. Exploration in Business Source Complete
The Business Source Complete database does not allow downloading full articles; instead, we could gather abstracts for all the articles we needed. Searching for the word ‘rebrand’ returned a dataset of 5050 articles and, upon review, the whole dataset was highly correlated with companies’ rebranding activities. As a result, we downloaded everything there. Using regular expressions, we converted messy variables such as ‘Date’ into a near-datetime variable. We then combined it with the 2000 articles we got from Nexis Uni, and used Textract and FuzzyWuzzy to match the articles with their text bodies. Finally, we obtained a combined dataset of roughly 7000 data points containing ‘Article Title’, ‘Article Body’, and ‘Publication Date’.
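As an illustration of the date cleanup (a sketch; the raw strings and column names are assumptions based on the description above), a regular expression can pull the date portion out of a messy field before conversion:

import re
import pandas as pd

def extract_date(raw):
    # Pull a "Month Day, Year" style date out of a messy string, if present
    match = re.search(r"([A-Z][a-z]+ \d{1,2}, \d{4})", str(raw))
    return match.group(1) if match else None

# Hypothetical messy 'Date' values from the abstracts
df = pd.DataFrame({"Date": ["Published: March 3, 2019 (Newswire)", "Jan 5, 2020"]})
df["Publication Date"] = pd.to_datetime(df["Date"].apply(extract_date), errors="coerce")
print(df[["Date", "Publication Date"]])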
2. Exploratory Data Analysis
We took out the first 300 articles to create a Master Table and manually labeled them based on whether the article in question was related to an official trademark revision of a company (1 for related and 0 otherwise). This table would later be used to train and test our models.
Within these 300 articles, we found the percentage of relevance to official rebranding activities to be quite high:
From here, we set out to build an appropriate model to automate this labeling process. The question was: how do we make sure this model works at least as accurately as manual labeling? Using the collected data, we performed various EDA steps to look for trends as well as potential key features derived from the nature of the data. Such features, if they existed, would greatly aid the process of building the classification models.
2.1. Using the length of the subject of interest
The two graphs below present the distributions of the lengths of the Article Titles and Article Bodies, respectively. This analysis provided far more useful information for the titles than for the body text. We concluded that the length of an article’s body was not a good indicator for accurate classification, and that we should allocate our efforts to other features.
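A sketch of how such length distributions can be plotted (the toy DataFrame and column names here are assumptions):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical labeled data with the same columns as the Master Table
labeled_df = pd.DataFrame({
    "Article Title": ["Acme rebrands as Apex", "Why you might rebrand"],
    "Article Body": ["Acme Corp announced a rebrand ...", "Rebranding tips ..."],
})

title_lengths = labeled_df["Article Title"].str.len()
body_lengths = labeled_df["Article Body"].str.len()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(title_lengths, bins=30)
axes[0].set_title("Article Title length (characters)")
axes[1].hist(body_lengths, bins=30)
axes[1].set_title("Article Body length (characters)")
plt.tight_layout()
plt.show()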
2.2. Using keywords
As discussed above, certain keywords, when fed into the database search, yielded different levels of relevance. Thus, we looked into the labeled articles for words whose frequencies differed significantly and might naturally distinguish related articles from non-related ones.
Unsurprisingly, the word “rebrand” had the highest frequency among the strategically selected words, since we used it as the main web scraping keyword. However, formal words such as “announce” or “launch” were better signals, since they are commonly found in official announcements, whereas words such as “like” or “might” are less typical of this type of document. Notably, “logo” was an outlier, as it seemed to appear more often in unrelated articles despite having a neutral nature.
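A sketch of this keyword comparison (the keyword list, column names, and toy data are assumptions):

import pandas as pd

# Hypothetical labeled data: 'Article Body' text and a 0/1 'Label' column
labeled_df = pd.DataFrame({
    "Article Body": [
        "The company announced it will launch under a new name.",
        "You might like these tips on when to rebrand your logo.",
    ],
    "Label": [1, 0],
})

keywords = ["rebrand", "announce", "launch", "like", "might", "logo"]

def keyword_counts(texts):
    # Count how often each keyword appears across a collection of texts
    joined = " ".join(texts).lower()
    return {kw: joined.count(kw) for kw in keywords}

related = keyword_counts(labeled_df.loc[labeled_df["Label"] == 1, "Article Body"])
unrelated = keyword_counts(labeled_df.loc[labeled_df["Label"] == 0, "Article Body"])
print(pd.DataFrame({"related": related, "unrelated": unrelated}))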
III. Data modeling
1. Classification
1.1. Preliminary attempts
Once we had a working dataset of articles, we began the modeling process to create a classification model that takes in an article and outputs a 1 or a 0 depending on whether it is rebranding-related. The first approach involved splitting the dataset into a training and a test set and tokenizing the titles to implement a bag-of-words logistic regression model. However, we quickly realized that this led to extreme overfitting: despite relatively high training accuracy, our validation and test sets both gave less than 60 percent accuracy. We tried changing the model by adding n-grams and even engineering additional features, such as the presence of certain rebranding-related keywords, but still could not break above 75 percent test accuracy. Ultimately, we realized that the titles were not descriptive and distinctive enough to classify unseen articles meaningfully and accurately. From this point, we carried our working models over to the actual body text of the articles and used that as the primary input for article classification.
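A minimal sketch of this first approach, assuming a small labeled DataFrame with 'Article Title' and 'Label' columns (the toy data and names are assumptions; in the project the 300-article Master Table played this role):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical labeled titles
labeled_df = pd.DataFrame({
    "Article Title": [
        "Acme Corp announces rebrand to Apex",
        "Global bank rebrands after merger",
        "Startup launches rebranding under new name",
        "When should you rebrand your business?",
        "Five tips on how to rebrand",
        "Opinion: why rebrands often fail",
    ],
    "Label": [1, 1, 1, 0, 0, 0],
})

X_train, X_test, y_train, y_test = train_test_split(
    labeled_df["Article Title"], labeled_df["Label"],
    test_size=0.33, random_state=42, stratify=labeled_df["Label"],
)

# Bag-of-words on titles (bigrams added via ngram_range) feeding a logistic regression
model = make_pipeline(
    CountVectorizer(lowercase=True, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))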
1.2. Using Body of Text
- Fuzzy matching
Since the titles in the master table are not exactly the same as the file names, fuzzy matching was applied to extract the body of text from the “docx” files and import it into the master table for analysis. The main package here is “fuzzywuzzy”, which returns a similarity score between two strings. If the similarity score was over 87, the title was kept in the master table and the body of text was matched to that title.
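A sketch of the matching step (the file names and titles are made up; the 87 threshold follows the description above, and the scorer choice is an assumption):

from fuzzywuzzy import fuzz, process

# Hypothetical master-table titles and downloaded file names
titles = [
    "Acme Corp announces rebrand to Apex",
    "Global bank rebrands after merger",
]
file_names = [
    "Acme Corp Announces Rebrand To Apex Inc.docx",
    "Global Bank Rebrands After Merger Completion.docx",
    "Unrelated press release.docx",
]

SIMILARITY_THRESHOLD = 87

for title in titles:
    best_match, score = process.extractOne(title, file_names, scorer=fuzz.token_set_ratio)
    if score > SIMILARITY_THRESHOLD:
        # The matched .docx would then be read (e.g. with textract)
        # and its text placed in the master table next to this title
        print(f"Matched '{title}' -> '{best_match}' (score={score})")
    else:
        print(f"No reliable match for '{title}' (best score={score})")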
- Cleaning
By virtue of being textual data, each article’s text body had to be cleaned significantly in order to increase the efficacy of our classifier. The cleaning was done using a custom tokenizing function built on spaCy. This function tokenized the strings in each article’s body, converted all characters to lower case, removed whitespace, punctuation, stop words, and digits, and lemmatized the remaining words.
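A minimal sketch of such a cleaning function, assuming the small English spaCy model en_core_web_sm is installed:

import spacy

# The parser and NER are not needed for tokenizing and lemmatizing, so disable them for speed
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def clean_tokens(text):
    # Tokenize, lowercase, lemmatize, and drop stop words, punctuation, whitespace, and digits
    doc = nlp(text)
    return [
        token.lemma_.lower()
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space or token.is_digit)
    ]

print(clean_tokens("Acme Corp announced on March 3 that it is rebranding as Apex!"))
# e.g. ['acme', 'corp', 'announce', 'march', 'rebrand', 'apex']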
- Feature Selection
After the cleaning stage, the custom tokenizer was plugged into Scikit-learn’s CountVectorizer. Using this, a DTM (Document-Term Matrix) was built where each row was one article, each column was a unique token, and each cell held the number of occurrences, or frequency, of that token in a particular article. The DTM looked like the figure below.
This matrix, combined with the manually prepared labels for each article (1 indicating the article is rebranding-related and 0 indicating it is not), was fit to a Random Forest classifier with n_estimators = 250, i.e. 250 trees. Each feature was then ranked by importance using the Gini impurity index. The following table shows the top 12 rankings.
These top 12 features were then used as inputs to the different classifier models, since fitting all 18,485 columns would almost certainly cause overfitting; hence only the most important features were kept.
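A sketch of the DTM construction and importance ranking (the toy documents are assumptions; in the project the custom spaCy tokenizer above would be passed to CountVectorizer via its tokenizer argument):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned bodies and labels; the real data had ~300 labeled articles
bodies = [
    "acme corp announce rebrand apex new name",
    "bank announce launch new brand merger",
    "tip rebrand logo business like",
    "opinion rebrand fail logo might",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()                  # tokenizer=clean_tokens in the full pipeline
dtm = vectorizer.fit_transform(bodies)          # rows = articles, columns = unique tokens

# 250-tree forest; feature_importances_ is based on mean decrease in Gini impurity
rf = RandomForestClassifier(n_estimators=250, random_state=42)
rf.fit(dtm, labels)

importances = pd.Series(rf.feature_importances_, index=vectorizer.get_feature_names_out())
print(importances.sort_values(ascending=False).head(12))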
1.3. Random Forest
Ensemble methods like Random Forest perform well on binary classification problems, hence the impetus to use one here. A standard Random Forest model was used in both the base and tuned setups. Hyperparameter tuning was also done using RandomizedSearchCV, but the results were not markedly different, perhaps because the sample size was not very large.
1.4. Standard Model
The base model had 100 trees (n_estimators = 100), with all other parameters left at their defaults. The metrics of the model are as follows:
As seen in the figure above, the test accuracy achieved (91.84%) was significantly higher than the test accuracy using just the titles of the articles. Using the body text gave us a larger, richer DTM of words from which we could choose better, more important features for the classification problem. However, the confusion matrix below shows that the false positive rate is much larger than the false negative rate. Our dataset contains far more 1’s than 0’s, and this is a byproduct of that class imbalance.
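A sketch of how such a base model can be trained and evaluated (X_top12, the split parameters, and the random toy data are assumptions, so the printed metrics here are meaningless; rf1 matches the estimator name used in the tuning code below):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix: rows = articles, columns = the 12 selected DTM features
rng = np.random.default_rng(0)
X_top12 = rng.integers(0, 5, size=(300, 12))
y = rng.integers(0, 2, size=300)

X_train, X_test, y_train, y_test = train_test_split(X_top12, y, test_size=0.2, random_state=42)

# Base model: 100 trees, all other parameters left at their defaults
rf1 = RandomForestClassifier(n_estimators=100, random_state=42)
rf1.fit(X_train, y_train)

y_pred = rf1.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))       # rows = true 0/1, columns = predicted 0/1
print(classification_report(y_test, y_pred))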
1.5. Hyperparameter Tuning
Hyperparameter tuning was done through RandomizedSearchCV on the following grid of parameters:
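As a sketch, a grid along the following lines can be used as param_distributions (the specific value ranges here are assumptions, chosen only so that they include the best parameters reported below):

import numpy as np

random_grid = {
    "n_estimators": [int(x) for x in np.linspace(200, 2000, 10)],
    "max_features": ["sqrt", "log2"],
    "max_depth": [int(x) for x in np.linspace(10, 110, 11)] + [None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}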
The RandomizedSearchCV itself was configured as follows:
from sklearn.model_selection import RandomizedSearchCV
rf_random = RandomizedSearchCV(estimator = rf1, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model (rf1, X_train, y_train as in the base-model sketch above)
rf_random.fit(X_train, y_train)
It was found that the best parameters were the following:
- ‘n_estimators’: 2000
- ‘min_samples_split’: 5
- ‘min_samples_leaf’: 1
- ‘max_features’: ‘sqrt’
- ‘max_depth’: 100
- ‘bootstrap’: True
It is interesting to note that n_estimators increased to 2000 from the 100 we used in the base model. Let’s see how this fine-tuned model performed on the test set:
As seen in the figure, the performance of the model remained unchanged, despite it taking more time to train because of the increase in the number of trees from 100 to 2000.
1.6. XGBoost
As an alternative, an XGBoost (Extreme Gradient Boosting) model was also used to examine whether there were any differences in performance. The base model was used with n_estimators = 100 and all other default parameters. The test accuracy achieved was 87.55%, with 43 out of 49 test articles classified correctly, as opposed to 45 out of 49 with the Random Forest model. However, the comparison could be made more concrete with a larger dataset and less class imbalance. The following figure shows the full performance report.
Unsurprisingly, there are 6 false positives and 0 false negatives as seen in the confusion matrix below. As explained above, this is a byproduct of the class imbalance in our dataset.
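A sketch of the XGBoost baseline, reusing the same kind of hypothetical 12-feature matrix as in the Random Forest sketch (so, again, the printed metrics are meaningless; xgboost is assumed to be installed):

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X_top12 = rng.integers(0, 5, size=(300, 12))
y = rng.integers(0, 2, size=300)

X_train, X_test, y_train, y_test = train_test_split(X_top12, y, test_size=0.2, random_state=42)

# Base XGBoost model: 100 boosting rounds, other parameters left at their defaults
xgb = XGBClassifier(n_estimators=100, random_state=42)
xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))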
2. Next steps
This classification analysis has a lot of potential and room for improvement. The following is a list of possible next steps that could be taken in the future to enhance the ability of classifiers to predict which articles are rebranding related:
- Imbalanced dataset — the dataset used is small and has a much larger proportion of 1’s than 0’s. A larger dataset with a more equal proportion of 1’s and 0’s would help make the classifier more robust. A large imbalanced dataset could also be used by changing the class weights in the models (see the sketch after this list). GANs (Generative Adversarial Networks) could also potentially be implemented to create larger datasets.
- Using PCA — Instead of using plain feature selection, different tools could be used for feature extraction. A deeper study of the textual data could help generate new features. PCA could also be used to condense the large DTMs into smaller dimensions that could be fed into ML models.
- Better cleaning — The custom tokenizer could also be improved and coupled with other methods for cleaning the textual data in ways that remove things like newline characters, tabs, etc.
- Using different models — Only ensemble methods like XGBoost and Random Forest were used here. Simple multi-layer perceptrons (MLPs), support vector machines (SVMs), and perhaps even logistic regression models could be examined, although it is likely that the ensemble methods and MLPs would perform better for this kind of classification problem.
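As a sketch of the class-weight idea from the first bullet (the imbalanced toy data here is an assumption), scikit-learn’s class_weight option re-weights each class inversely to its frequency so the model pays more attention to the rarer 0 class:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data (far more 1's than 0's, echoing the project's labels)
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(300, 12))
y = (rng.random(300) < 0.8).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# class_weight="balanced" up-weights the minority class during tree construction
rf_weighted = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
rf_weighted.fit(X_train, y_train)
print(confusion_matrix(y_test, rf_weighted.predict(X_test)))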