Modeling Dynamics of Wikipedia: An Empirical Analysis Using a Vector Error Correction Model

In this paper, we construct a system dynamic model of Wikipedia based on co-evolution theory and investigate the interrelationships among topic popularity, group size, collaborative conflict, coordination mechanism, and information quality using the vector error correction model (VECM). This study provides a useful framework for analyzing the dynamics of Wikipedia and presents a formal exposition of the VECM methodology in information systems research.


Introduction
Wikipedia has become one of the most striking emblems of mass collaboration. Its unprecedented success challenges traditional theories of public goods and collective action, and has inspired scholars from various fields to study it [1,2]. Existing research highlights many factors that are crucial to the success of Wikipedia, including topic popularity, group size, collaborative conflict, coordination mechanism, and information quality [3]. However, most of these studies examine the relationships among factors from a static perspective, without considering the dynamic evolution of Wikipedia. Although some scholars have explored the statistical properties of many aspects of Wikipedia and used visual analysis tools to reveal dynamic relations among factors [4][5][6][7], rigorous empirical studies are lacking. So far, the dynamic mechanism of Wikipedia is not completely understood.
To complement previous studies, this study constructs the PSCCQ model, a system model consisting of five micro-level factors (i.e., topic popularity, group size, collaborative conflict, coordination mechanism, and information quality), with which we explore the fundamental dynamic mechanism behind Wikipedia.
Our study makes two key contributions. First, we build the PSCCQ conceptual model from a systemic perspective and provide empirical evidence supporting the validity and usefulness of this model. Second, the VECM methodology provides a statistically rigorous approach for analyzing the dynamics of temporal relationships among variables without imposing strong theoretical restrictions. In summary, this paper promotes a systematic view of the dynamics of Wikipedia by using the VECM approach in concert with the PSCCQ model.

Research model
The process of collaborative knowledge building in Wikipedia demonstrates complex self-cleaning, self-regulating, and self-developing dynamics of the mass of participants that are akin to a kind of evolution [8]. Co-evolution theory provides an effective theoretical lens for analyzing the dynamics of Wikipedia. Based on this theory, a few scholars have attempted to analyze the dynamic interactions among factors, modules, and subsystems of the Wikipedia system. For example, Cress and Kimmerle built a theoretical model describing the co-evolution between a wiki's social system and individuals' cognitive systems [9]. Kimmerle et al. used social network analysis to visualize co-evolutionary processes of individual knowledge learning and collective knowledge building [7].
Based on co-evolution theory, we build the PSCCQ model as a theoretical framework to analyze the dynamic interactions among topic popularity, group size, collaborative conflict, coordination mechanism, and information quality, and thereby reveal the dynamics of Wikipedia (cf. Fig. 1).

Research context
We select a specific Wikipedia article, global warming, as the research object for several reasons. First, global warming is a featured article, which means that it is identified as one of the best articles in Wikipedia. Second, it is one of the most frequently viewed articles in Wikipedia: it has around 50 million page views, ranking 152nd in English Wikipedia article traffic. Third, global warming has always been one of the most intensely controversial topics in Wikipedia. In summary, this case represents a typical, widely viewed, and intensely controversial article, so it is well suited to examining the dynamics of Wikipedia in depth.

Data collection
Following Ransbotham, Kane, and Lurie [10], we use the relative search frequency in Google to reflect the topic popularity of global warming. Using Google Trends, we determine how often Google users search for keywords from the article title each month. The number of unique editors who contributed to an article is widely used to measure group size in previous studies [11], so we adopt it to measure group size during each monthly observation period. Rollbacks have frequently been used to model conflicts and identify edit wars in Wikipedia [12]. Therefore, we measure collaborative conflict as the monthly number of rollbacks in the article. The most important coordination mechanism in Wikipedia is communication [13]. We operationalize coordination mechanism as the monthly accumulated number of discussions recorded in the article talk page during the study period. The number of edits provides a good indicator of a "high level of quality" for Wikipedia articles [3]. We

Model specification and estimation
Specifying a VECM involves three basic steps: (1) unit root tests, (2) lag length selection, and (3) a co-integration test. Non-stationary data generally lead to spurious regression because their mean and variance are not constant [14]. Therefore, we first test the stationarity of the variables. The results indicate that all variables are stationary in first differences, implying that all variables are integrated of order one. Before the co-integration test, we need to select an optimal lag length to ensure that the model is not misspecified [15]. The results show that the optimal lag length is 4. Finally, the Johansen co-integration test results show that there are co-integration relationships among the variables.
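The first step, checking that a series is non-stationary in levels but stationary in first differences, can be illustrated with a minimal sketch. It assumes a plain Dickey-Fuller regression with a constant and no lagged differences, applied to a simulated random walk of 142 points; the function name, the simulated data, and the simplified test form are assumptions for illustration, not the procedure or data used in the study. The statistic is compared with Dickey-Fuller critical values (roughly -2.9 at the 5% level for this sample size), not with the usual t table.

```python
import numpy as np

def dickey_fuller_t(y):
    """t-statistic on rho in  dy_t = c + rho * y_{t-1} + e_t  (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    # Regressors: a constant and the lagged level of the series.
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return float(beta[1] / np.sqrt(cov[1, 1]))

# Simulated I(1) series (a random walk); NOT the Wikipedia data.
rng = np.random.default_rng(0)
levels = np.cumsum(rng.normal(size=142))

t_levels = dickey_fuller_t(levels)          # unit root typically not rejected
t_diffs = dickey_fuller_t(np.diff(levels))  # strongly negative: differences are stationary
print(f"levels: {t_levels:.2f}, first differences: {t_diffs:.2f}")
```

A series that fails to reject the unit root in levels but rejects it in first differences is treated as integrated of order one, which is the precondition for the co-integration test.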
The estimated VECM regression coefficients are typically less informative than an analysis of the relationships among variables, because of the complicated dynamics inherent in a VECM [15,16]. Therefore, we report the general estimation results in Table 1 and then provide a detailed analysis using Granger causality tests, impulse response functions, and forecast error variance decomposition in the next section.
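For orientation, a VECM can be written in the standard textbook form below. The notation is assumed here for illustration and is not taken from the paper: y_t stacks the five log-transformed series, p denotes the lag order of the underlying VAR in levels, and r is the number of co-integrating relations. How the reported lag length of 4 maps onto p is not stated in the text.

```latex
\Delta y_t = \alpha \beta^{\top} y_{t-1}
           + \sum_{i=1}^{p-1} \Gamma_i \, \Delta y_{t-i}
           + \mu + \varepsilon_t ,
\qquad
y_t = \bigl(\mathrm{LNQUAL}_t,\ \mathrm{LNPOPU}_t,\ \mathrm{LNSIZE}_t,\ \mathrm{LNCONF}_t,\ \mathrm{LNCOOR}_t\bigr)^{\top}
```

Here \beta (5 x r) holds the long-run co-integrating vectors, \alpha (5 x r) holds the adjustment speeds toward those long-run relations, and the \Gamma_i capture short-run dynamics. Because every equation contains lags of every variable, individual coefficients are hard to read in isolation, which is why the analysis relies on the derived tools in the following sections.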

Granger causality tests
Granger causality tests help to determine whether the lagged values of one variable help to predict values of another variable [17].
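The idea can be sketched as an F-test that compares a regression of y on its own lags with one that also includes lags of x. This is a minimal bivariate illustration on simulated data. The tests reported in the paper are based on the estimated VECM, which this sketch does not reproduce; the function names and the data-generating process are assumptions.

```python
import numpy as np

def lagmat(series, p):
    """Columns series[t-1], ..., series[t-p] for t = p .. n-1 (hypothetical helper)."""
    n = len(series)
    return np.column_stack([series[p - k : n - k] for k in range(1, p + 1)])

def granger_f(y, x, p):
    """F-statistic for H0: the p lags of x do not help predict y."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    target = y[p:]
    const = np.ones((len(target), 1))
    X_restricted = np.hstack([const, lagmat(y, p)])
    X_unrestricted = np.hstack([X_restricted, lagmat(x, p)])

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)
        resid = target - X @ beta
        return resid @ resid

    rss_r, rss_u = rss(X_restricted), rss(X_unrestricted)
    dof = len(target) - X_unrestricted.shape[1]
    return float(((rss_r - rss_u) / p) / (rss_u / dof))

# Simulated example in which x drives y but not the reverse; NOT the Wikipedia data.
rng = np.random.default_rng(1)
n = 142
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()

f_x_to_y = granger_f(y, x, p=2)  # large: lags of x improve the forecast of y
f_y_to_x = granger_f(x, y, p=2)  # small: lags of y do not help predict x
print(f"x -> y: F = {f_x_to_y:.1f};  y -> x: F = {f_y_to_x:.1f}")
```

A feedback relationship, such as the one reported between LNCONF and LNCOOR, corresponds to rejecting the null hypothesis in both directions.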

Impulse response functions
Impulse response functions (IRFs) plot the response of current and future values of the endogenous variables to a one-unit increase in the current value of a random disturbance term [18], which provide a more intuitive description of the dynamics of temporal relationships among variables.
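As a rough illustration of how such responses are computed, the sketch below derives orthogonalized IRFs for a two-variable VAR(1), using a Cholesky factor of the error covariance so that each shock is one standard deviation in size. The coefficient matrix, the covariance matrix, and the function name are made-up assumptions. For a VECM, the model would first be rewritten as a VAR in levels, and responses in a co-integrated system need not die out as they do in this stationary example.

```python
import numpy as np

def orthogonalized_irf(A, sigma, horizons):
    """Responses of a VAR(1)  y_t = A y_{t-1} + e_t  to one-s.d. orthogonal shocks.

    Returns an array irf with irf[h][i, j] = response of variable i,
    h periods after a shock to (orthogonalized) variable j.
    """
    P = np.linalg.cholesky(sigma)  # lower-triangular factor, sigma = P @ P.T
    k = A.shape[0]
    irf = np.empty((horizons + 1, k, k))
    power = np.eye(k)              # holds A**h
    for h in range(horizons + 1):
        irf[h] = power @ P
        power = A @ power
    return irf

# Hypothetical coefficients for illustration only.
A = np.array([[0.5, 0.1],
              [0.2, 0.3]])
sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])

irf = orthogonalized_irf(A, sigma, horizons=12)
print(irf[:4, 1, 0])  # response of variable 2 to a shock in variable 1, periods 0-3
```

With the Cholesky ordering, the variable listed first does not respond on impact to shocks in later variables, so the ordering of variables is itself a modeling choice.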

Forecast error variance decomposition
Forecast error variance decomposition (FEVD) analysis measures the relative importance of each shock: it reports the share of the forecast error variance of a variable that is attributable to shocks in each variable of the system at a specified time horizon [18].
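The decomposition follows directly from the orthogonalized impulse responses: the contribution of shock j to the forecast error variance of variable i is the sum of squared responses up to the chosen horizon, divided by the total over all shocks. The sketch below shows this for a two-variable VAR(1) with made-up coefficients; the function name and all numbers are assumptions and do not reproduce the estimates in the paper.

```python
import numpy as np

def fevd(A, sigma, horizon):
    """Forecast error variance shares for a VAR(1)  y_t = A y_{t-1} + e_t.

    Returns shares with shares[i, j] = fraction of the `horizon`-step-ahead
    forecast error variance of variable i due to orthogonalized shock j.
    """
    P = np.linalg.cholesky(sigma)
    k = A.shape[0]
    contrib = np.zeros((k, k))
    power = np.eye(k)              # holds A**h
    for _ in range(horizon):
        theta = power @ P          # orthogonalized responses at this step
        contrib += theta ** 2      # accumulate squared responses
        power = A @ power
    return contrib / contrib.sum(axis=1, keepdims=True)

# Hypothetical coefficients for illustration only.
A = np.array([[0.5, 0.1],
              [0.2, 0.3]])
sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])

shares_12 = fevd(A, sigma, horizon=12)  # 12 periods ahead, i.e. one year of monthly data
shares_1 = fevd(A, sigma, horizon=1)
print(np.round(shares_12, 3))
```

Each row sums to one, so the entries can be read as percentages of the kind reported for the five Wikipedia variables.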

Discussion
We have systematically investigated the dynamic interrelationships among topic popularity, group size, collaborative conflict, coordination mechanism, and information quality in Wikipedia through a detailed empirical analysis. Moreover, our study demonstrates that the PSCCQ framework and the VECM methodology can be applied to analyze the dynamics of Wikipedia both formally and practically. Our first finding highlights the critical importance of the coordination mechanism in effectively harnessing the "wisdom of the crowd" in an online collaborative environment. Our second finding shows that involving too many contributors in a particular project may be detrimental to group performance. Wikipedia managers should therefore not necessarily pursue a more-is-better strategy regarding the number of contributors.
This paper also has some limitations. First, a potential limitation relates to the sample data, which come from a single article. A potential extension of the research is to apply panel vector autoregression (PVAR) to further explore the dynamics of Wikipedia with a larger sample. Second, we overlook the network characteristics of the Wikipedia community. Future research should combine network dynamics with knowledge dynamics to give a fuller picture of the dynamics of Wikipedia.

[Table column headers: D(LNQUAL), D(LNPOPU), D(LNSIZE), D(LNCONF), D(LNCOOR)]
quantify information quality as the number of edits in the given month. Using monthly counts of each variable, we obtain five time series. With data from February 2004 to November 2015, we have a total of 142 monthly observations for each series, as shown in Fig 2.
Fig 4 provides twenty possible impulse response functions for the estimated VECM.

Figure 4. Impulse responses (Impulse Response).
The results of the IRF analysis largely corroborate the results of the Granger causality tests. Specifically, the feedback relationship between LNCONF and LNCOOR identified previously holds: it appears in the significantly negative response of LNCONF at forecast horizons 3 and 8-18 in Fig 4c and in the significantly positive response of LNCOOR at forecast horizons 2-11 in Fig 4d. Similarly, we can confirm the feedback relationship between LNCOOR and LNQUAL. The unidirectional relationships among the variables identified previously also hold. The response of LNPOPU to a one standard deviation shock in LNCOOR reaches a maximum value of 0.107 in the 3rd period, as shown in Fig 4a. Turning to the response of LNSIZE to LNCOOR, we find that LNSIZE initially reaches its maximum value of about 0.2, then gradually decreases, and stabilizes by the 10th period (cf. Fig 4b). The significantly positive response of LNSIZE to LNCONF persists over the whole period (cf. Fig 4c). Fig 4e shows the responses of LNQUAL to a shock in each variable. The response of LNQUAL to LNSIZE shows a discernible decline, from 0.48 to 0.19. In parallel, the effect of LNCONF on LNQUAL is about -0.1 and persists over the entire forecast period.

Fig 5 provides a graphical representation of the FEVD, where each graph depicts the proportions of forecast error variance, up to 12 periods (one year) ahead, accounted for by shocks in each variable.

As expected, the results of the FEVD further corroborate the results of the Granger causality tests and IRFs. Specifically, approximately 20% of the error variance of LNPOPU (and likewise of LNSIZE) is accounted for by a shock in LNCOOR (cf. Fig 5a and Fig 5b). LNSIZE accounts for the large majority of the LNCONF error variance, reaching 67.73% at the end of the forecast period, while the explanatory power of LNCOOR is relatively low, at only about 13% (cf. Fig 5c). Moreover, Fig 5d shows that LNCONF and LNQUAL together account for approximately 43% of the error variance in LNCOOR, about 29% and 14% respectively. With respect to the FEVD of LNQUAL, the results indicate that LNSIZE has the greatest impact on LNQUAL, followed by LNCOOR, LNCONF, and LNPOPU (cf. Fig 5e).

Table 2. Granger causality tests based on the VECM.