Policybot: How does a Machine define "Made In China 2025"? (Part 0)
About this article: This series originates from the author’s Honors Thesis, “A Machine Learning Approach to Predict China’s Industrial Policy Movements”, which applies machine learning, NLP, predictive data analytics, and statistics to explain the behavior of governments. The purpose is to gain policy insight and intelligence, and to predict the outcomes of certain future public-policy behaviors.
ABSTRACT: Why is understanding and predicting China’s industrial policy important? Government industrial policies that support huge investment and infrastructure initiatives like “Made in China 2025” and “One Belt One Road” have been controversial factors contributing to the exacerbation of the US-China trade war. Many economies, governments, and businesses closely watch China’s every economic policy so they can plan appropriate responses in international relations and trade agreements with China. Government policy in China is set by ‘steering committees’, in both party and state, at all levels of government. Policy is formed by the party, set into administrative regulation by the state bureaucracy, and finally shaped into legislation for ratification by the National People’s Congress. The current literature (Chan & Zhong, 2018) has used machine learning to predict policy change from China’s state media outlet, the People’s Daily (人民日报), but no study has used the raw articles published on ministry websites to directly describe the behavior of the Chinese government and predict the tightening or expansion of China’s industrial policy in certain strategic industries. Metadata features such as website address, section code, date, article title, and ministry name can be used to predict China’s industrial policy movements.
TAGS: Ministry of Industry and Information Technology (MIIT), Artificial Intelligence (AI), Machine Learning, Natural Language Processing (NLP), Data Analytics, Public Policy, Policy Analytics
DATA: 9,000 records of publicly available data concerning industrial policy on MIIT ministry websites were scraped by the author and saved in a PostgreSQL database. This database acts as a China policy archive for storing articles concerning industrial policy from the Ministry of Commerce (MOFCOM) and the National Development and Reform Commission (NDRC). Other ministries’ websites can be included in the future.
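To make the archive concrete, here is a minimal sketch of what such a policy-article table could look like. The original project uses PostgreSQL; the stdlib `sqlite3` module is used here only so the example is self-contained, and the column names are illustrative guesses based on the metadata features the article lists (URL, section code, date, title, ministry) — not the author’s actual schema.

```python
import sqlite3

# Hypothetical schema for the China policy archive; real project uses PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE policy_articles (
        id           INTEGER PRIMARY KEY,
        url          TEXT NOT NULL,   -- website address of the article
        section_code TEXT,            -- site section the article appeared in
        pub_date     TEXT,            -- publication date (ISO string)
        title        TEXT,
        ministry     TEXT,            -- e.g. MIIT, MOFCOM, NDRC
        body         TEXT             -- raw article text for NLP processing
    )
""")

# Insert one illustrative (invented) record.
conn.execute(
    "INSERT INTO policy_articles (url, section_code, pub_date, title, ministry, body) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("http://www.miit.gov.cn/example", "n1146290", "2019-01-15",
     "制造业创新中心建设指南", "MIIT", "…article text…"),
)

# Simple descriptive query: article counts per ministry.
rows = conn.execute(
    "SELECT ministry, COUNT(*) FROM policy_articles GROUP BY ministry"
).fetchall()
print(rows)  # [('MIIT', 1)]
```

A schema like this is what lets later steps (data cleaning, modeling, descriptive analytics) run as plain SQL queries over one pipeline.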
EXECUTIVE SUMMARY:
- When the machine learning model is asked to define “Made in China 2025”, its output can be interpreted as “A national innovation movement, or a strategic deployment of industrial policy, that aims to strengthen system integration between production (upstream) and service (downstream) along the value chain through long-term, value-adding manufacturing processes, especially targeted toward the textile and agricultural industries.”
- When the machine learning model is asked to interpret “One Belt One Road”, its output can be interpreted as “A line or main channel that acts as a vehicle for multilateral cooperation among all countries to build free-trade zones and economic zones for large, small, and medium enterprises.”
- When the machine learning model is asked to interpret China’s “Internet Plus” initiative, its output can be interpreted as “A state-owned enterprise project for the purposes of national defense, control of satellite networks, and preemptively countering cyber-attacks. Ultimately, it promotes industrial control and communication safety in public places.”
TECHNICAL SUMMARY
WORD-VECTOR: a text-mining technique under natural language processing that transforms unstructured text into vectors of real-valued numbers, where each dimension captures an aspect of a word’s meaning and semantically similar words have similar vectors. This means that words such as wheel and engine should have word vectors similar to that of car (because of the similarity of their meanings and their co-occurrence patterns), whereas the word banana should be quite distant.
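The wheel/engine/banana intuition can be made concrete with cosine similarity, the standard measure of how close two word vectors point. The three-dimensional vectors below are invented purely for illustration; real Word2Vec vectors typically have 100–300 learned dimensions.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors: a·b / (|a||b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Toy 3-dimensional "word vectors" — invented for illustration only.
vectors = {
    "car":    [0.9, 0.8, 0.1],
    "engine": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

print(cosine(vectors["car"], vectors["engine"]))  # ≈ 0.99 → semantically close
print(cosine(vectors["car"], vectors["banana"]))  # ≈ 0.30 → semantically distant
```

A score near 1 means the vectors point in nearly the same direction (similar meanings); a score near 0 means they are unrelated.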
TRAINING THE WORD2VEC MODEL: The methodology is Word2Vec, used here through a Python implementation. Its most_similar() function uses cosine similarity to measure the textual similarity between word-vector A and word-vector B.
Most importantly, the key takeaway is that after we feed the Word2Vec model ALL the policy-related articles (input data scraped from the ministry websites), the machine outputs word associations similar to those of MIIT itself, and thus can predict or generalize how the ministry behaves, within some degree of error.
The figure above illustrates that the cosine-similarity function is the basis of this NLP machine-learning model: it recognizes the similarities between words and phrases, which lets us determine the word associations and semantics of policy phrases. The most_similar() function gives us the associations of an input word. When we type in the inputs (labeled in green text), we are asking the Word2Vec model for the closest word associations to the input. The output lists the most similar words, ranked by their cosine-similarity score. Let’s test this concept on a few examples using word-vector cosine similarities.
Okay, let’s look at Example 1: how does the machine interpret China’s “Internet Plus” (互联网+)?
INPUT: 产品 + 管理 + 网络 (Product + Management + Network) = ? The top 10 associated words, ranked by cosine similarity:
OUTPUT: Ranking translated (国有资产-state-owned assets; 所属-affiliated/subordinate; 参考-reference; 国防科技-national defense technology; 烟草专卖-tobacco monopoly; 卫星网络-satellite network; 电子邮件-email; 经费-funding; 事业单位-public institution; 复-again, once more).
Using our trained model, the machine associates “product + management + network” with state-owned assets, national defense technology, satellite networks, funding, and email (communication networks) — the keywords closest by cosine similarity to the input (Internet Plus). Compare this with Wikipedia’s definition of China’s Internet Plus: it “refers to the application of the internet and other information technology in conventional industries, including mobile Internet, cloud computing, big data, and the Internet of Things.” Note that some outputs (e.g., 复 “again”) are generic or function words and carry less meaning. What does that mean? The machine can only give us signals and probabilities based on statistics; it is up to the human to interpret the outcome.
Example 2: How does the Machine define “Made in China 2025” (中国制造2025)?
INPUT: 中国 + 制造 + 2025 (Made in China 2025) = ?
OUTPUT: Ranking translated (家电业-home appliance industry; 战略-strategy; 服务型-service-oriented; 系统集成-system integration; 强国-strong nation; 生产型-production-oriented; 改革方案-reform plan; 战略部署-strategic deployment; 农机化-agricultural mechanization; 制笔-pen manufacturing).
Surprisingly accurate to the human eye: the model’s associations for Made in China 2025 can be interpreted as “a national strategy that focuses on system integration between the service and production sides of the value chain, strategically deployed toward sectors such as home appliances and agricultural mechanization.” Did the model tell you something you didn’t know?
Example 3: Let’s see what the machine associates with China’s Eurasian Silk Road: One Belt One Road (OBOR).
INPUT: 一带 + 一路 + 投资 (One Belt + One Road + Investment) = ?
OUTPUT: 各国-all countries; 沿线-along the route; 美国-United States; 倡议-initiative; 自贸区-free-trade zone; 多边合作-multilateral cooperation; 双边-bilateral; 多边-multilateral; 经济区-economic zone; 开放型-open-economy model. Can the human analyst confirm that these word associations correctly match the actual definition of OBOR — “a development strategy adopted by the Chinese government involving infrastructure development and investments in 152 countries and international organizations”, according to Wikipedia?
CONCLUSION:
In this article, we have gone over how to use NLP and machine-learning methods to build a Word2Vec model that recognizes the vocabulary and word associations of the ministry. Through these tools, we can “artificially” define what China’s Internet Plus, Made in China 2025, and One Belt One Road mean to a machine. We do this by training on a body of MIIT’s publicly available online policy articles. The China policy database I have created uses an organized schema and design that allows SQL queries over the data pipeline for data cleaning, machine-learning modeling, descriptive analytics, and predictive analytics. As the results demonstrate, the machine’s output can be surprisingly accurate. Its outputs are not complete sentences, but they give us the most important and relevant word vectors.
One can ask: can the machine distinguish changes of behavior — policy changes, movements, and tightening? Yes, but only to a degree of accuracy. Right now, the definition of policy tightening is associative: policy tightening is measured by the frequency of words similar in meaning to “policy tightening” or its synonyms. For example, I have written a program that tells the machine to search for keywords such as “monitoring, supervising, management, checking, auditing, failing, approving, forbidden, mandatory spot-checks.” This is the most intuitive way to define “policy tightening,” though it is less powerful than Latent Semantic Analysis and there are limits to the accuracy of the prediction. Nonetheless, once we define what policy tightening looks like to the machine, cosine-similarity scores can be compared week-to-week, month-to-month, and year-to-year to see whether published government content has changed. The results can be transformed into an index that tracks frequency over time. Moreover, we can dive into the qualitative differences in word associations to see what actually changed between time frames. Furthermore, we can apply this AI tool to ALL the ministries, including the Ministries of Defense, Education, Finance, and Commerce. Once we have converted all ministerial text into digital information that a machine can learn from, we can start data-mining and bringing new insights to policy intelligence. For example, we can combine data from multiple ministries and multiple sources at once to conduct data analytics, and then answer challenging questions like the following:
- As MOFCOM’s trade and foreign-investment policy tightens, what are the impacts on MIIT’s policies in the One Belt One Road initiative?
- What is the relationship between policy expansion in MIIT’s manufacturing standardization and MOFCOM’s stimulus packages for the agricultural industry?
- How does NDRC’s loosening of green-energy-efficiency programs influence the tightening or expansion of MIIT’s EV automotive industrial policy?
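The keyword-frequency approach to “policy tightening” described above can be sketched in a few lines. The tightening keyword list mirrors the terms the article names (monitoring, supervising, auditing, forbidden, spot-checks, etc.); the normalization choice and the sample data are illustrative assumptions, not the author’s exact program.

```python
from collections import Counter

# Proxy keywords for "policy tightening" (supervision, inspection, audit,
# prohibition, mandatory measures) — adapted from the article's keyword list.
TIGHTENING_TERMS = {"监督", "监管", "检查", "审计", "禁止", "强制", "抽查"}

def tightening_index(tokenized_articles):
    """Fraction of tokens that are tightening-related terms.

    `tokenized_articles` is a list of token lists (one per article).
    Normalizing by total token count is one illustrative choice; it makes
    the index comparable across periods with different publication volumes.
    """
    counts = Counter(tok for art in tokenized_articles for tok in art)
    total = sum(counts.values())
    hits = sum(counts[t] for t in TIGHTENING_TERMS)
    return hits / total if total else 0.0

# Compare two invented monthly batches of pre-segmented articles.
january = [["加强", "监管", "检查", "企业"], ["禁止", "出口", "产品"]]
february = [["支持", "发展", "创新", "企业"]]

print(tightening_index(january))   # 3/7 ≈ 0.429
print(tightening_index(february))  # 0.0
```

Computed week-over-week or month-over-month, this index becomes the time series of “tightening” signals the paragraph describes, which a human analyst can then inspect qualitatively.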
In the next article, we will use the word vectors from this article to build topic models. This is an unsupervised learning method called LDA, in which the machine tells the human which topics are most likely discussed in the policy-related documents. Finally, we will build a supervised learning model on top of the topics produced by the LDA algorithm. With this supervised model, we can determine the most relevant and important variables in whether a policy gets published online.
Credits: Thanks to Yvonne, CEO of Politech, for inspiring me to take on this challenging but interesting project. This article was edited by Eduardo Baptista.
Feedback: I welcome critiques, comments and feedback from China policy experts, data scientists, China watchers, and business consultants. Please send me a message!
Data: If you have any inquiries concerning the data, its validity or integrity, or the methodology of model selection, please contact or email me.
About the Author: Jianyin Roachell. Business Data Intelligence Developer; M.S. Candidate, China Language and Economy; B.S., Business Data Analytics & Statistics; SAS Predictive Analytics certified; Machine Learning certified by Stanford University via Coursera.
References:
[0] Chan, Julian TszKin, and Weifeng Zhong. 2018. “Reading China: Predicting Policy Change with Machine Learning.”
[1] Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv preprint arXiv:1301.3781.
[2] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Empirical Methods in Natural Language Processing (EMNLP).
[3] Peng, Nanyun, and Mark Dredze. 2015. “Named Entity Recognition for Chinese Social Media with Jointly Trained Embeddings.” In Empirical Methods in Natural Language Processing (EMNLP).
[4] Chen, Xinxiong, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huanbo Luan. 2015. “Joint Learning of Character and Word Embeddings.” In International Joint Conference on Artificial Intelligence (IJCAI).
[5] Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. “Enriching Word Vectors with Subword Information.” arXiv preprint arXiv:1607.04606.
[6] Pei, Wenzhe, Tao Ge, and Baobao Chang. 2014. “Max-Margin Tensor Neural Network for Chinese Word Segmentation.” In Annual Meeting of the Association for Computational Linguistics (ACL).
[7] Chen, Xinchi, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015. “Long Short-Term Memory Neural Networks for Chinese Word Segmentation.” In Empirical Methods in Natural Language Processing (EMNLP).