Zhihu Dataset Exploratory Data Analysis

Jan 17, 2016


Zhihu is a Chinese question-and-answer website with over 17 million users as of May 2015. It is often described as the Chinese version of Quora, and it has achieved great success recently: according to Alexa, Zhihu ranked 114th globally and Quora 145th as of Jan 19, 2017.

Brain Way has developed a tool that collects about 600k user profiles through the Zhihu API. That dataset is used in this project.

Among 545,825 users, 199,377 are female, 280,108 are male and 66,340 left the field unfilled. Of the users who reported a gender, 58 percent are male. Attracting more female users is a potential source of growth for the company.
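The shares above can be checked with a few lines of Python. This is a minimal sketch that uses only the three counts quoted in the text; the variable names are illustrative and not taken from the project's code. It computes the male share two ways, since the 58 percent figure holds only when the unfilled profiles are left out.

```python
# Counts quoted in the text above.
counts = {"female": 199_377, "male": 280_108, "unfilled": 66_340}

total = sum(counts.values())                  # all profiles
reported = counts["female"] + counts["male"]  # profiles with a gender filled in

male_share_reported = counts["male"] / reported
male_share_all = counts["male"] / total
unfilled_share = counts["unfilled"] / total

print(f"total profiles: {total:,}")
print(f"male share (gender reported): {male_share_reported:.1%}")
print(f"male share (all profiles): {male_share_all:.1%}")
print(f"unfilled share: {unfilled_share:.1%}")
```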

The figure below shows the geographical distribution of users. Beijing and Shanghai are the top two cities by user population. Shenzhen should be number three, but its users are split across two tags: Shenzhen Shi and Shenzhen. To make the distribution clearer, the data is visualized on an interactive map.


Geographical Distribution of Users
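One way to deal with a city that appears under two tags is to map every variant to a canonical name before counting. The sketch below is an assumption about how this could be done, not the method used in this project: the alias table, the function names and the sample tags are all hypothetical.

```python
from collections import Counter

# Hypothetical alias table: maps variant spellings to one canonical city name.
CITY_ALIASES = {
    "shenzhen shi": "Shenzhen",
    "shenzhen": "Shenzhen",
}

def normalize_city(tag):
    """Return a canonical city name for a raw location tag."""
    cleaned = " ".join(tag.split())  # trim and collapse whitespace
    return CITY_ALIASES.get(cleaned.lower(), cleaned)

def count_cities(tags):
    """Count users per canonical city, skipping empty tags."""
    return Counter(normalize_city(t) for t in tags if t and t.strip())

# Made-up sample tags, for illustration only (not from the dataset).
sample = ["Beijing", "Shenzhen Shi", "Shanghai", "Shenzhen", "Beijing", " shenzhen "]
print(count_cities(sample).most_common(3))
```

Keeping the aliases in a plain dictionary makes it easy to add further variants as they turn up in the data.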

A Python package named geocoder (version 1.19.0) is used to retrieve latitudes and longitudes from Google and Bing. Mapbox GL JS is used to generate the interactive maps. The heatmap below shows that Zhihu users are spread widely around the world, with the majority in China.


Heatmap of User Distribution
Link to Interactive Map
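The geocoding step feeds the map with point data, and Mapbox GL JS can read such points from a GeoJSON source. The sketch below shows one possible shape of that pipeline; it is not the project's actual code. The function name, the property names and the sample counts are hypothetical, and a stand-in lookup with approximate coordinates replaces the real geocoder so the example runs offline. With the geocoder package, the lookup could plausibly be something like `lambda name: geocoder.google(name).latlng`, which is an assumption here and may need an API key.

```python
import json

def geocode_cities(city_counts, lookup):
    """Geocode each distinct city once and build a GeoJSON FeatureCollection.

    `lookup` is any callable that takes a place name and returns
    (latitude, longitude), or an empty value when the place is not found.
    """
    features = []
    for city, users in city_counts.items():
        latlng = lookup(city)
        if not latlng:
            continue  # place could not be resolved; skip it
        lat, lng = latlng
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lng, lat]},
            "properties": {"city": city, "users": users},
        })
    return {"type": "FeatureCollection", "features": features}

# Stand-in lookup with approximate coordinates (hypothetical, offline only).
FAKE_COORDS = {"Beijing": (39.90, 116.40), "Shanghai": (31.23, 121.47)}

# Made-up user counts, for illustration only (not from the dataset).
collection = geocode_cities({"Beijing": 3, "Shanghai": 2, "Nowhere": 1}, FAKE_COORDS.get)
print(json.dumps(collection, indent=2))
```

Geocoding each distinct city once, rather than each user, keeps the number of requests to the geocoding service small.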

A closer look at the users in China shows that the majority are located in the Beijing, Shanghai and Shenzhen areas (red spots). These three areas are already large markets, while the inland areas hold great potential.


Cluster Map of User Distribution
Link to Interactive Map

The top 10 industries of users are shown below. IT industries account for the majority, while other industries such as finance and high tech are also on the list. It appears that people who work with computers tend to spend more time on Zhihu. People in engineering fields either cannot use Zhihu very often or are simply not interested.


Industrial Distribution of Users

Users can post questions, answer questions and publish articles on Zhihu. An answer or article can gain "thanks" and "voteups" and can also attract followers. The pairplot below shows the relationships between these variables. There is a strong linear relationship between thanked count and follower count. However, the number of answers or articles is not correlated with the number of thanks or followers. It appears that high-quality answers or articles attract more followers, and users who have more followers tend to gain more thanks.


Pairplot of Interaction Activities
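The linear relationships read off the pairplot can be quantified with the Pearson correlation coefficient. The following is a minimal sketch in plain Python; the variable names and the per-user counts are made up for illustration and do not come from the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up per-user counts, for illustration only (not from the dataset).
followers = [10, 200, 50, 1000, 400]
thanked = [12, 180, 60, 950, 420]
answers = [300, 5, 40, 20, 150]

print("followers vs thanked:", round(pearson(followers, thanked), 3))
print("followers vs answers:", round(pearson(followers, answers), 3))
```

A value close to 1 indicates a strong positive linear relationship, while a value near 0 indicates no linear relationship.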

A Radviz plot is generated for the 10 most popular users (those with the most followers). The majority of them answer plenty of questions and gain plenty of voteups and favorites. However, Kaifu Lee has gained a large number of followers without that much effort. The reason may be that he is so famous in China and worldwide. ZhouYuan, on the other hand, is not that famous but has also gained a lot of followers. It turns out that he is one of the founders of Zhihu.


Information of Top 10 Users

Zhihu is a place where people talk about trending events. The figure below shows when the best answers about the 2016 US election were posted. The numbers closely track the election timeline. When several major candidates announced their presidential campaigns in March and April 2015, Zhihu users started to talk about this topic. There are far fewer answers until July 2016, when the Democratic and Republican National Conventions took place. The number peaks in Nov 2016, around election day. It indicates that Chinese users care a great deal about what is happening in the United States.


Answers for 2016 US Election
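A timeline like this can be built by grouping answers by the month in which they were created. The sketch below assumes each answer carries an ISO-format creation date; the function name and the sample dates are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from datetime import datetime

def answers_per_month(timestamps):
    """Count answers per calendar month from ISO-format date strings."""
    months = Counter(
        datetime.fromisoformat(ts).strftime("%Y-%m") for ts in timestamps
    )
    # Sort by month so the result reads as a timeline.
    return dict(sorted(months.items()))

# Made-up creation dates, for illustration only (not from the dataset).
sample_dates = ["2016-11-08", "2015-04-13", "2016-07-21", "2016-11-09", "2016-11-09"]
monthly = answers_per_month(sample_dates)
print(monthly)
```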


Attention to the Rio Olympics is not as continuous as attention to the election, as shown below. However, there is heated discussion in August, during the Olympics. It is interesting to notice that some answers appeared as early as March 2013. It turns out that in March 2013, a stadium in Brazil due to host athletics at the Rio Olympics was closed indefinitely because of structural problems (BBC news).


Answers for 2016 Rio Olympics


Further work: There are many more interesting things to do with the Zhihu data, including applying NLP techniques to analyze how trending topics change over time. It would also be interesting to take a deeper look at the most popular answers and find patterns in them.