Policybot: Turning Policy Data into Insights (Part 1)
This series of articles originates from the author's Honors Thesis, “A Machine Learning Approach to Predict China’s Industrial Policy Movements,” which deploys machine learning, NLP, predictive data analytics, and statistics to explain the behavior of governments, gain policy-intelligence insights, and predict the outcomes of future government public policy behavior. It is an in-depth analysis of China's Ministry of Industry and Information Technology (MIIT) from 2002 to 2018 and a complement to the articles written on LinkedIn. All analyses are based on MIIT online policy data. Conclusions are drawn from the author's own judgments and logical probabilistic assumptions.
In the last article, we learned to use NLP and machine-learning methods to build a Word2Vec model that recognizes the vocabulary and word associations of the MIIT. A visual example of word vectors is shown below:
With these tools, we can "artificially" define what China's Internet Plus, Made in China 2025, and One Belt One Road mean to a Machine. We do this by training on publicly available MIIT policy articles collected online. In this article, we will talk about using word vectors to build topic models with an unsupervised learning method (LDA). In other words, in the last article we told the Machine what to look for ("Made in China 2025," "One Belt One Road," etc.); in this article, we let the Machine tell us what the most important topics are. In addition, we will discover which variables are the most relevant and important in determining whether a policy document gets published online. Lastly, we will dive into these topics by exploring their composition and analyzing how each topic interacts with the others to understand the publication behavior of the MIIT.
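As a quick recap of the previous article, here is a minimal sketch (not the author's exact pipeline) of how such a Word2Vec model might be trained on MIIT policy texts, assuming gensim 4.x and jieba for Chinese word segmentation; the folder path and variable names are hypothetical:

```python
import glob
import jieba
from gensim.models import Word2Vec

# Hypothetical corpus location: one scraped MIIT policy document per text file.
raw_docs = [open(path, encoding="utf-8").read()
            for path in glob.glob("miit_policies/*.txt")]

# Chinese text has no whitespace word boundaries, so segment with jieba first.
# (Multi-word slogans such as 中国制造2025 would need a custom jieba dictionary entry.)
tokenized_docs = [list(jieba.cut(doc)) for doc in raw_docs]

w2v = Word2Vec(
    sentences=tokenized_docs,
    vector_size=100,   # dimensionality of each word vector
    window=5,
    min_count=5,
    workers=4,
)

# Ask the model which terms it associates with 互联网 ("internet").
print(w2v.wv.most_similar("互联网", topn=10))
```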
This time we let the Machine tell us what is important, not the other way around. Reading and learning from these 9,000 policy documents took the Machine less than 3 minutes (the training process), but it would probably take a human 3 weeks! That is the power of Machine Learning!
EXECUTIVE SUMMARY
- When more than 9,000 policy-related ministry posts (the input) from 2004 to 2018 were fed through the machine learning model, the Machine discovered five major topical themes:
- Industrial and business information products.
- Public opinion concerning industrial policy. (Very Important)
- Internet product marketization. (Important)
- Automotive production and equipment.
- Setting national business technology standards. (Most Important)
- The most insightful finding: the data suggests that whether a policy gets published online by the MIIT depends most strongly on topics related to "National Business Technology Standard Setting."
- The Gradient Boosting predictive model is about 90% accurate in classifying whether or not a policy document is likely to be published, using the 5 topics as the main predictors.
- Lifetime cycle of a policy: when there is a high volume of output from the Document Release (文件公示) section of the website, we can also expect a high volume of output from the Policy Regulations Publications (文件发布) section, followed by a high volume of Policy Explanation (政策解读) publications.
- The relationship between the Setting National Business Standards topic and the Internet Product Marketization topic tends to be negative: when Internet Product Marketization document output is abnormally high, Setting National Business Standards document output is low.
- The relationship between the Public Opinion topic and the Internet Product Marketization topic also tends to be negative: when there is a large spike in the output volume of opinion-and-feedback documents, Internet Product Marketization document output is low.
- For 2017, the data leads us to reasonably conclude that there may be a causal relationship between how the MIIT asks for public opinion feedback and the marketization of internet products, which subsequently influences the establishment of business technology standard-setting policies. It is not clear which variable comes first or last; for all we know, this could be a constant feedback-test-standardization loop in the policy generation pipeline.
TECHNICAL SUMMARY
Policy Document, defined: the term can refer to a policy proposal, a feedback proposal, or even an unofficial policy that has not yet been formally ratified. I will use "policy document" to describe policy-related materials in general.
TOPICAL MODELING is an unsupervised learning methodology that extracts hidden topics from large volumes of text. Specifically, a Latent Dirichlet Allocation (LDA) text-mining algorithm was deployed for this modeling process. Why is topical modeling important to this project? Once topics are extracted from the policy texts, we can segment the documents into groups with similar topical features. These groups can then be given a topical identity, or an overall genre, which can subsequently be assigned to each policy to create "new features." For example, Wheel and Engine should be grouped into a Car topic. The accuracy of a machine learning model depends largely on feature engineering (the selection of variables) and on parameter tuning for those variables. We need topical modeling to generate a topical distribution for every policy document. After grouping the policy documents into buckets, we can train a supervised binary classification model to predict whether or not a document is likely to be published as a policy, based on the contents of the post.
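As a concrete illustration of this step, here is a minimal sketch of extracting 5 latent topics with gensim's LDA implementation, reusing the hypothetical `tokenized_docs` from the sketch above:

```python
from gensim import corpora
from gensim.models import LdaModel

# Map each token to an integer id and drop very rare / very common tokens.
dictionary = corpora.Dictionary(tokenized_docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)

# Bag-of-words representation of every policy document.
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=5,       # the five topical themes discussed in this article
    passes=10,
    random_state=42,
)

# Print the highest-weighted terms in each topic for human interpretation.
for topic_id, terms in lda.print_topics(num_topics=5, num_words=10):
    print(topic_id, terms)
```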
The Machine can segment the policy documents into different topics based on their content and word vectors. The LDA algorithm successfully clustered the 9,000 articles into 5 topics based on the policy contents. To interpret a topic, one typically examines a ranked list of the terms (or word vectors) most associated with that topic, using anywhere from three to thirty terms. From this list of terms, it is up to the analyst or reader to interpret. Again, the Machine can only output word frequencies and bucket words together by the probability of their word-vector distributions, so every analyst may interpret the topics differently depending on their perspective and professional background. In the next machine learning model, we build a supervised learning model from the topics extracted by the unsupervised LDA algorithm.
The topical distribution of each policy document is the key to our feature engineering. Each document's content is spread across all 5 topics generated by the Machine; in other words, each document is made up of a mixture of topics that gives it a distinct identity. Just like human personality traits, we can have different characteristics that make us who we are: for example, someone can be 80% extroverted and 20% introverted, or vice versa. The same logic applies to these topical traits, distributed into 5 buckets, that make each policy document "unique." In the figure above, one can see that the topical distribution (or genetic make-up) of each document is represented across 5 topics, numbered 0 to 4 by the Machine.
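A minimal sketch of this feature-engineering step, continuing from the LDA sketch above: each document's topic mixture becomes a row of five proportions.

```python
import pandas as pd

def topic_vector(bow, model, num_topics=5):
    """Return the full topic-probability vector for one bag-of-words document."""
    dist = dict(model.get_document_topics(bow, minimum_probability=0.0))
    return [dist.get(k, 0.0) for k in range(num_topics)]

# One row per policy document, one column per topic (each row sums to ~1).
topic_features = pd.DataFrame(
    [topic_vector(bow, lda) for bow in corpus],
    columns=[f"topic_{k}" for k in range(5)],
)
print(topic_features.head())
```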
Feature importance analysis can give insight into how the MIIT makes its policy publication decisions; a sketch of such an analysis follows.
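This is a minimal sketch of how such a feature-importance ranking could be produced with scikit-learn, assuming a dataframe `df` that joins the topic proportions above with calendar features and a binary `published` label (all column names hypothetical, not the author's exact setup):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_cols = [f"topic_{k}" for k in range(5)] + ["year", "month", "qtr"]
X, y = df[feature_cols], df["published"]   # df is a hypothetical merged dataset

rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X, y)

# Higher values mean the feature contributed more to the publication decision.
importances = pd.Series(rf.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False))
```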
Note: the Machine indexes topics from 0 to 4, so we can equate "Topic 0" with Topic 1 and "Topic 4" with Topic 5. After merging the topical data with our original data set, we are ready to build a supervised learning model. The figure above displays the order of variable importance from a Random Forest binary classification model that predicts whether a policy document gets published. The most important variable is the "Setting national business tech standards" topic; the least important is QTR (quarter of the year). QTR most likely covaries with year and month, providing redundant information the model already has, which is why it ranked last. You can also see that the topical groups (1 to 5) do not touch each other, with no intersecting unions among the topics (see the diagram below), meaning the LDA algorithm separated the topics quite well. This is important because clean separation of topics allows the supervised learning model to learn better and generalize more accurately on future test data. The following diagrams describe the 5 key topics and display their keywords on the right-hand side:
Topic 1: likely a topic concerning "industrial business information products," because its most frequent keywords include industrial, business information, and small and medium-sized enterprises.
Topic 2: "Publicizing opinion, feedback, and industrial news." The top-ranked keywords include opinion, demonstration, news, publishing, and feedback. From the feature importance analysis, we know this topic is a crucial variable for predicting policy publication. We can hypothesize that policy documents appearing in the "public opinion and feedback" section are more likely to be published. The findings show that the MIIT has actually moved toward a more market-driven feedback approach in recent years by implementing online feedback requests. This democratic, market-driven feedback mechanism is quite surprising, given that China's authoritarian government is not usually seen as representing the interests of the people. This finding is very interesting and worth raising in conversations.
Topic 3: "Online Product Marketization" (e-commerce), or opening up internet-related products. The most frequent keywords (translated) include public, department, entrepreneurship, online, and internet. This is the third most important feature for predicting policy publication.
Topic 4: "Automotive production and equipment firms." The most frequent keywords include manufacturing, automobile, cars, roads, highway, enterprise, products, and internet. Below is a chart that groups policy documents containing the word "Automobile" into frequency ranges from 2016 to 2019.
The data indicates a large jump in the number of such policies from 2017 to 2018 (60% growth), depicting an increasing intensity of word frequencies focused on the automotive industry. Documents containing the word "Automobile" 1-5 times (blue) and 5-10 times (red) stayed relatively constant year to year; however, there is an obvious increase in the number of policies that contain the keyword 20-40 times (green) from 2017 to 2018. We can conclude that more policy documents focused intensely on "Automobile" in 2018 than in 2017; a sketch of how such counts could be tabulated follows.
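For reference, a minimal sketch of how these frequency-range counts could be tabulated with pandas, assuming a dataframe `docs` with a `year` column and the raw document `text` (hypothetical names):

```python
import pandas as pd

# Count how many times 汽车 ("automobile") appears in each document.
docs["auto_mentions"] = docs["text"].str.count("汽车")

# Bucket the counts into the frequency ranges shown in the chart.
bins = [1, 5, 10, 20, 40]
docs["mention_range"] = pd.cut(docs["auto_mentions"], bins=bins)

# Number of documents per year in each frequency range.
print(docs.groupby(["year", "mention_range"]).size().unstack(fill_value=0))
```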
Topic 5: "Setting national business technological standards." The most frequent keywords include publicize, technology, national standards, industrial, conditions, rules, and industrial standards. This is the most important factor in predicting policy publication. It essentially supports the argument for industrial policy: technological and business standards dictate the market, and the government is more likely to publish a policy that reforms standards or sets new rules for the market.
The machine learning model suggests that the MIIT's policy publication behavior is heavily dependent on topics regarding setting national technological standards and online product standardization, while seeking opinion and feedback from market stakeholders. It is not clear which variable comes first or last; for all we know, this could be a constant feedback-test-standardization loop in the policy generation pipeline.
For the hardcore data scientists: a Gradient Boosting algorithm was deployed, and here are the results: the best AUC is 94% on the training set and 89% on the 4-fold cross-validation set. The Gradient Boosting predictive model is about 90% accurate in classifying whether or not policy documents are likely to be published, using the 5 topics as the main predictors.
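For those who want to reproduce this kind of evaluation, here is a minimal sketch using scikit-learn and the hypothetical `X`, `y` from the feature-importance sketch above; the hyperparameters shown are illustrative, not the author's:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

gb = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, random_state=42)

# 4-fold cross-validated AUC, mirroring the evaluation reported above.
cv_auc = cross_val_score(gb, X, y, cv=4, scoring="roc_auc")
print("Mean 4-fold CV AUC:", cv_auc.mean())

# AUC on the training set itself (expected to be higher than the CV score).
gb.fit(X, y)
print("Training AUC:", roc_auc_score(y, gb.predict_proba(X)[:, 1]))
```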
TURNING DATA INTO INSIGHTS:
To interact with the data visualizations, click here.
How many policy documents were published in 2017 by the MIIT?
July-August 2017 displays a record-high 356 policy publications. One large factor contributing to this spike was the "New Generation Artificial Intelligence Development Plan" (新一代人工智能发展规划), published in July 2017 by the State Council, which aims to drastically change how China engages with AI.
2017 was an interesting year for the MIIT:
Here is what we already know: the MIIT published 1,194 policy documents in total (including public policies, explanations, and feedback comments) over a 224-day publication period. That's something new being published on roughly 2 out of every 3 days of the year! Of these 1,194 policy documents, about 560 contain the "Public Opinion & Feedback" topic. Now let's explore why that is and what all these documents are about!
- For the Internet Product Marketization (green) topic: it appears to be the dominant topic among MIIT publication documents from July 21st to October 11th. The hottest day for this topic was July 29th, when it appeared in 81% of all released documents. That was a big day for internet product regulations, according to the data. This also aligns with our earlier finding of a large spike in publication volume in July 2017, a record-high 356 policy publications in total. As mentioned before, this spike can be associated with the "New Generation Artificial Intelligence Development Plan" (新一代人工智能发展规划).
Thus we can interpret the MIIT's policy publication behavior as follows:
In 2017, the data leads us to reasonably conclude that there may be a causal relationship between how the MIIT asks for public opinion feedback and the marketization of internet products, which then has some influence on the establishment of business technology standard-setting policies. It is not clear which variable comes first or last; for all we know, this could be a constant feedback-test-standardization loop. See my in-depth data analysis here.
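For readers who want to check the daily topic shares behind this interpretation, here is a minimal sketch assuming a dataframe `docs2017` with a publication `date` column and the five topic-proportion columns (hypothetical names); it treats the 81% figure as the share of that day's documents whose dominant topic is Internet Product Marketization, which is one plausible reading:

```python
import pandas as pd

topic_cols = [f"topic_{k}" for k in range(5)]

# Label each document with its single strongest topic.
docs2017["dominant_topic"] = docs2017[topic_cols].idxmax(axis=1)

# Share of each day's documents dominated by the marketization topic
# ("topic_2" is a hypothetical column name for that topic).
daily_share = (
    (docs2017["dominant_topic"] == "topic_2")
    .groupby(docs2017["date"])
    .mean()
)
print(daily_share.sort_values(ascending=False).head())
```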
SUMMARY: In this article, we have gone over how to use topical modeling to create new features for each policy document, eliminating the text data altogether while still retaining the information necessary for predictive analytics. In addition, we analyzed and dissected every topic to understand its contents and the role it plays in the lifetime cycle of each policy document. We then turned data into insights by assessing the relationships among the topic groups.
To download the raw data, click here.
About the Author: Jianyin Roachell. Business Data Intelligence Developer; M.S. candidate in China Economy and Language; B.S. in Business Data Analytics & Statistics; SAS Predictive Analytics certified; Machine Learning certified by Stanford University via Coursera.
Inquiries & Feedback: I welcome inquiries, critiques, comments and feedback from China policy experts, data scientists, China watchers, and business consultants. Please send me a message!
Data: If you have any inquiries concerning the data, its validity or integrity, or the methodologies behind the model selection, please contact or email me.