Modeling Digital Advertising Data with Measurement Error: Poisson Time Series and Poisson Kalman Filter Approach

Jeongwoo Park*

* Swiss Institute of Artificial Intelligence, Chaltenbodenstrasse 26, 8834 Schindellegi, Schwyz, Switzerland

Abstract

This study examines the impact of measurement error, an inherent problem in digital advertising data, on predictive modeling. To do this, we simulated measurement error in digital advertising data and applied a GLM (Generalized Linear Model) based model and a Kalman Filter based model, both of which can partially mitigate the measurement error problem. The results show that measurement error can trigger a regularization effect, improving or degrading predictive accuracy depending on the data. However, we confirmed that reasonable levels of measurement error did not significantly impact our proposed models. In addition, we noted that the two models behave differently depending on the data size, so we applied an ensemble-based stacking technique that combines the advantages of both. For this step, we designed the objective function to assign different weights depending on the precision of the data. We confirmed that the final model delivers better results than the individual models.

1. Introduction
1.1 Background

Digital advertising has exploded in popularity and has become a mainstream part of the global advertising market, offering new areas unreachable by traditional media such as TV and newspapers. In particular, as the offline market shrank during the COVID-19 pandemic, the digital advertising market gained more attention. Domestic digital marketing spend grew from KRW 4.8 trillion in 2017 to KRW 6.5 trillion in 2019 and KRW 8.0 trillion in 2022, a growth of about 67\% in five years, and accounted for 51\% of total advertising expenditure as of 2022\cite{KOBACO}.

The rise of digital advertising has been driven by the proliferation of smartphones. With the convenience of accessing the web anytime and anywhere, which is superior to PCs and tablets, new internet-based media have emerged. Notably, app-based platform services that provide customized services based on user convenience have rapidly emerged and significantly contributed to the growth of digital advertising.

Advertisers prefer digital advertising due to its immediacy and measurability. Traditional media such as TV, radio, and offline advertising make it challenging to elicit immediate reactions from consumers. At best, post-ad surveys can gauge brand recognition and the inclination to purchase the advertised products when the need arises. In digital advertising, by contrast, a call-to-action button leading to a purchase page can elicit a quick consumer response before brand recall and purchase intent fade.

In addition, with traditional advertising media it is difficult to accurately measure how many people were exposed to an ad and how many conversions it produced. Because of the lag effect mentioned above, inferring ad performance retrospectively from subsequent business performance is limited, since the data are rife with noise. It is therefore hard to distinguish whether an incremental change in business performance was caused by advertising or by other exogenous variables. In digital advertising, on the other hand, third-party ad tracking services store user information on the web/app to track which ad a user responded to and their subsequent behavior. The benefits of immediacy and measurability help advertisers quickly and accurately determine the effectiveness of a particular ad and make decisions.

However, with the advent of measurability came the issue of measurement error in the data. There are many sources of measurement error in digital ad data, such as a user responding to an ad multiple times in a short period, or ad fraud, the manipulation of ad responses for malicious financial gain. As a result, ad data providers keep updating their ad reports for up to a week in order to provide corrected data to advertisers.

1.2 Objectives

In this study, we aim to apply models that can make reasonable predictions from data with inherent measurement error. The analysis has two main objectives. First, we verify the impact of measurement error on the prediction model, running simulations for various cases since the impact may vary with the size of the measurement error and the length of the data period. Second, we present several models that account for the characteristics of the data and propose a final model that predicts robustly based on them.

2. Key Concepts and Methods

Endogeneity and Measurement Error

A regressor is endogenous if it is correlated with the error term of the regression model. Let $E(\epsilon_{i} | x_{i}) = \eta$. Then the OLS estimator $b$ is biased, since

$\DeclareMathOperator*{\plim}{plim}$

\begin{align}
E(b | X) = \beta + (X'X)^{-1}X'\eta \neq \beta
\end{align}

So the Gauss-Markov Theorem no longer holds. Also, the estimator is inconsistent since

\begin{align}
\plim b = \beta + \plim (\frac{X'X}{n})^{-1} \plim (\frac{X'\epsilon}{n}) \neq \beta
\end{align}

Endogeneity can be induced by major factors such as omitted variable bias, measurement error, and simultaneity. In this study, we focus on the problem of measurement error in the data.

Measurement error refers to the problem in which observed data differ from the true value for some reason. It is divided into systematic error and random error. Systematic error arises when the measured value differs from the true value in a specific pattern; for example, an incorrectly zeroed scale always reads higher than the true value. Random error means that the measurement deviates from the true value due to random factors.

While systematic error can be corrected by preprocessing the data to handle its specific pattern, random error requires that the random factor be modeled. In theory, various assumptions can be made about the random factor; it is most common to assume that the errors follow a Normal distribution.

We now derive the regression coefficient in the classical measurement error model with normally distributed random errors. Consider the following linear regression:

\begin{align}
y = \beta x + \epsilon
\end{align}

And we define $\tilde{x}$ with measurement error as follows.

\begin{align}
\tilde{x} = x + u
\end{align}

Substitute (4) into (3):

\begin{align}
y = \beta (\tilde{x} - u) + \epsilon = \beta \tilde{x} + (\epsilon - \beta u)
\end{align}

Hence,

\begin{gather}
b = (X'X)^{-1}X'y \\
\plim b = (\frac{\sigma_{x}^{2}}{\sigma_{x}^{2} + \sigma_{u}^{2}})\beta
\end{gather}

When measurement error occurs as above, the larger its magnitude, the greater the regression dilution problem, in which the estimated coefficient is pulled toward zero. In the extreme case where the explanatory variable carries little information and the measurement error dominates, the model treats the regressor as pure noise and the coefficient approaches zero. This problem occurs not only in simple linear regression but also in multiple linear regression.

In addition to the additive case, where the measurement error is added to the original variable, we can also consider a multiplicative case where the error is multiplied. In the multiplicative case, the regression dilution problem occurs as follows.

\begin{gather}
\tilde{x} = xw = x + u \\
u = x(w - 1)
\end{gather}

Similarly, substituting (9) into (3) yields a result of the same form as (7), where the variance of the measurement error $u$, assuming $w$ is independent of $x$ with $E(w) = 1$, is derived as follows.

\begin{align}
\sigma_{u}^{2} = E[x(w - 1)x(w - 1)] = E(w^{2}x^{2} - 2wx^{2} + x^{2}) = \sigma_{w}^{2}(\sigma_{x}^{2} + \mu_{x}^{2})
\end{align}

Therefore, under measurement error the sign of the regression coefficient does not change, but its magnitude is attenuated, making it difficult to quantitatively measure the effect of a given variable.
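As a quick numerical illustration of the attenuation factor in (7), the following R sketch simulates an additive error with $\sigma_{x}^{2} = \sigma_{u}^{2} = 1$, so the slope estimated on the contaminated regressor should be roughly half the true value; the numbers are illustrative assumptions, not results from the paper.

# Attenuation check for eq. (7): true slope 2, expected estimate about 2 * 1/(1+1) = 1
set.seed(1)
n <- 1e5
x <- rnorm(n)                       # true regressor
u <- rnorm(n)                       # additive measurement error
y <- 2 * x + rnorm(n)               # y depends only on the true x
coef(lm(y ~ I(x + u)))["I(x + u)"]  # attenuated slope, close to 1 instead of 2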

However, consider the endogeneity problem from the perspective of prediction, where the only goal is to forecast the dependent variable accurately, rather than from the explanatory perspective, where we try to explain phenomena through data and the size and sign of coefficients matter. Even though the regression coefficient estimate is inconsistent in the explanatory sense, there is research showing that, in terms of prediction error, which is what matters in the predictive context, endogeneity is not a significant issue\cite{Greenshtein}.

Given these results and recent advances in computational science, under which countless non-linear models have been proposed, one might conclude that endogeneity is not a significant problem from the predictive perspective. However, measurement error in the covariates attenuates the regression coefficients, so the model underfits the actual data. We discuss the influence of this underfitting later.

Heteroskedasticity

Heteroskedasticity means that the error variance is not constant across observations in OLS (Ordinary Least Squares). If the residuals are heteroskedastic, it follows from the Gauss-Markov theorem that the OLS estimator is inefficient from an analytical point of view. It is also known that, from a predictive perspective, heteroskedastic residuals in nonlinear models can lead to inaccurate predictions during extrapolation.

In digital advertising data, measurement error can induce heteroskedasticity on top of the endogeneity problem discussed above. As mentioned in the introduction, the further back in time an observation is, the smaller its measurement error, because advertising data providers keep updating the data. This dependence of the error size on the recency of the data can therefore induce heteroskedasticity in the model.

Poisson Time Series

The Poisson Time Series model is based on Poisson Regression, a GLM (Generalized Linear Model) with a log link, augmented with autoregressive and moving average terms. The key difference from vanilla Poisson Regression and from ARIMA-type models is that the time series parameters are specified to reflect data that follow a conditional Poisson distribution.

Starting from the GLM log link $\log(\mu) = X\beta$, the equation with the additional autocorrelation parameters is as follows.

\begin{align}
\log(\lambda_{i}) = \beta_{0} + \sum_{j=1}^{p}\beta_{j}\log(Y_{i-j} + 1) + \sum_{l=1}^{q}\alpha_{l}\log(\lambda_{i-l}) + \eta'X
\end{align}

where $\beta_{0}$ is the intercept, $\beta_{j}$ are the autoregressive parameters, $\alpha_{l}$ are the moving average parameters, and $\eta$ is the covariate parameter vector. Estimation proceeds as follows. Consider the log-likelihood

\begin{align}
l(\theta) = \sum_{i=1}^{n}\log p_{i}(y_{i} | \theta) = \sum_{i=1}^{n}(y_{i}\log(\lambda_{i}(\theta)) - \lambda_{i}(\theta))
\end{align}

and the Score function is derived as follows

\begin{align}
S(\theta) = \frac{\partial l(\theta)}{\partial \theta} = \sum_{i=1}^{n}(\frac{y_{i}}{\lambda_{i}(\theta)} - 1)\frac{\partial \lambda_{i}(\theta)}{\partial \theta}
\end{align}

Iterating on the score function with the mean-variance relationship assumed in the GLM, the information matrix is derived as follows; for Poisson Regression, the mean and variance are assumed equal.

\begin{align}
I(\theta) = \sum_{i=1}^{n} Var(\frac{\partial l(\theta)}{\partial \theta}) = \sum_{i=1}^{n}(\frac{1}{\lambda_{i}(\theta)})(\frac{\partial \lambda_{i}(\theta)}{\partial \theta})(\frac{\partial \lambda_{i}(\theta)}{\partial \theta})'
\end{align}

The parameters are estimated by maximizing the quasi-likelihood through non-linear optimization with a quasi-Newton algorithm. MLE requires assuming the full shape of the distribution, which makes it powerful but sometimes difficult to use; the quasi-likelihood approach instead assumes only the mean-variance relationship of a particular distribution. It is known that the Quasi-MLE obtained this way is also CUAN (Consistent and Uniformly Asymptotically Normal), like the MLE, provided the mean-variance relationship is well specified. However, it is a less efficient estimator than the MLE when the MLE can be computed.
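To make the estimation concrete, here is a minimal R sketch of the recursion in (11) and the quasi log-likelihood in (12), maximized with a quasi-Newton routine via optim(); it assumes 7 autoregressive lags, a single feedback term at lag 7, and one contemporaneous covariate for brevity, whereas the paper uses 8 spend lags. The vectors count and spend are placeholders.

# Negative quasi log-likelihood of the Poisson Time Series model, eq. (11)-(12)
neg_quasi_loglik <- function(par, y, x, p = 7) {
  beta0 <- par[1]
  beta  <- par[2:(p + 1)]        # AR coefficients on log(Y[t-j] + 1)
  alpha <- par[p + 2]            # feedback coefficient on log(lambda[t-7])
  eta   <- par[p + 3]            # covariate coefficient
  n <- length(y)
  log_lambda <- rep(log(mean(y) + 1), n)   # simple initialization
  for (t in (p + 1):n) {
    ar <- sum(beta * log(y[(t - 1):(t - p)] + 1))
    ma <- alpha * log_lambda[t - 7]
    log_lambda[t] <- beta0 + ar + ma + eta * x[t]
  }
  lambda <- exp(log_lambda[(p + 1):n])
  -sum(y[(p + 1):n] * log(lambda) - lambda)  # negative of eq. (12)
}
# fit <- optim(par = rep(0.01, 10), fn = neg_quasi_loglik,
#              y = count, x = spend, method = "BFGS")

In practice, the tscount package cited in [11] implements this class of count time series models; the sketch above is only meant to show the structure of the recursion and objective.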

One advantage of the GLM-based Poisson Time Series model in this study is that the GLM does not assume homoskedastic residuals and focuses only on the mean-variance relationship. This allows us, to some extent, to bypass the residual heteroskedasticity that can occur when the size of the measurement error varies with the observation period.

Poisson Kalman Filter

The Kalman Filter is a state space model that combines a state equation and an observation equation to describe the movement of the data. When the observations are accurate, the weight on the observation equation increases; when they are inaccurate, the values derived from the state equation correct them. This feature allows the movement of the data to be estimated even when observations are inaccurate, as with measurement error, or missing.

Let us consider the Linear Kalman Filter, a representative Kalman Filter model. Assuming a covariate $U$, the state equation representing the movement of the data is given by

\begin{align}
x_{t} = \Phi x_{t-1} + \Upsilon u_{t} + w_{t}
\end{align}

where $w_{t}$ is an independent and identically distributed Normal error with $E(w_{t}) = 0$ and $Var(w_{t}) = Q$.

The Kalman Filter updates its predictions using the observation equation, which is

\begin{align}
y_{t} = A_{t}x_{t} + \Gamma u_{t} + v_{t}
\end{align}

where $v_{t}$ is an independent and identically distributed error that, like $w_{t}$, follows a Normal distribution, with $E(v_{t}) = 0$ and $Var(v_{t}) = R$.

Let $x_{0} = \mu_{0}$ be the initial state and $P_{0} = \Sigma_{0}$ its variance. The following expressions are iterated recursively:

\begin{gather}
x_{t} = \Phi x_{t-1} + \Upsilon u_{t}\\
P_{t} = \Phi P_{t-1} \Phi ' + Q
\end{gather}

with

\begin{gather}
x := x_{t} + K_{t}(y_{t} - A_{t}x_{t} - \Gamma u_{t})\\
P := [I -K_{t}A_{t}]P_{t}
\end{gather}

where

\begin{align}
K_{t} = P_{t}A_{t}'[A_{t}P_{t}A_{t}' + R]^{-1}
\end{align}

The update process in (19) and (20) draws on Bayesian ideas: the state equation can be viewed as a prior known in advance, and the observation equation as a likelihood. The Linear Kalman Filter is known to attain the minimum MSE (Mean Squared Error) among linear filters when the model is correctly specified (the process and measurement covariances are known), even if the errors are not Gaussian.
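To make the recursion concrete, the following is a minimal univariate R sketch of (17)-(21), assuming a scalar state, no covariate term, and known (hypothetical) Q and R; it illustrates the filtering idea rather than the implementation used in the paper.

# Univariate linear Kalman Filter, eq. (17)-(21)
kalman_filter <- function(y, Phi, Q, A, R, x0, P0) {
  n <- length(y); x <- numeric(n); P <- numeric(n)
  x_prev <- x0; P_prev <- P0
  for (t in 1:n) {
    # Prediction step, eq. (17)-(18)
    x_pred <- Phi * x_prev
    P_pred <- Phi * P_prev * Phi + Q
    # Kalman gain, eq. (21): a large R (noisy observation) shrinks K toward 0
    K <- P_pred * A / (A * P_pred * A + R)
    # Update step, eq. (19)-(20)
    x[t] <- x_pred + K * (y[t] - A * x_pred)
    P[t] <- (1 - K * A) * P_pred
    x_prev <- x[t]; P_prev <- P[t]
  }
  list(filtered = x, variance = P)
}

Note how a large observation variance $R$ pushes the gain toward zero, so the filtered state leans on the state equation; this is exactly the mechanism discussed below in connection with measurement error.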

The Poisson Kalman Filter is a type of Extended Kalman Filter. The state equation can be designed in various ways; in this study it is kept Gaussian, as in the Linear Kalman Filter. Instead, similar to the GLM idea, we introduce a log link in the observation equation, which can be expressed as

\begin{gather}
E(y_{i} | \theta_{i}) = Var(y_{i} | \theta_{i}) = \exp(\theta_{i}) \\
\theta_{i} = \log(\lambda_{i})
\end{gather}

The quantity $K_{t}$ derived in (21) is the Kalman Gain. It determines the weight given to the values derived from the observation equation in (19) and lies between 0 and 1. Looking at (21), the way $K_{t}$ is derived has the same structure as the way $\beta$ is shrunk in (7). Whereas in (7) the magnitude of $\sigma_{u}^{2}$ determines the degree of attenuation, in (21) the weight is determined by $R$, the covariance matrix of $v_{t}$ in the observation equation. Thus, even if the data contain measurement error, the weight on the state equation increases with the magnitude of that error, which means the Kalman Filter inherently mitigates the measurement error problem.

Ensemble Methods

Ensemble methods combine multiple heterogeneous models to build a combined model that is better than the individual models. There are various ways to combine models, such as bagging, boosting, and stacking. In this study, we use stacking, which combines models through weights.

Stacking applies a weighted average to the predictions of heterogeneous models to produce the final prediction. It can be understood as an optimization problem that minimizes an objective function under constraints, and the objective function can be designed flexibly according to the purpose of the model and the Data Generating Process (DGP).

3. Data Description
3.1 Introduction

The raw data used in this study are the results of digital advertising run over a specific period in 2022. The independent variable is marketing spend and the dependent variable is marketing conversions. Since conversions are count data taking small values such as 1 or 2 with low probability of occurrence, modeling based on the Poisson distribution is a natural choice.

Figure 1: Daily Marketing Conversion
Figure 2: Daily Marketing Spend
3.2 Data Preprocessing and Assumptions

From the overall marketing performance, the raw data were filtered to keep only the performance generated by marketing channels that use marketing spend. Performance obtained through marketing spend is generally referred to as "Paid Performance", while performance gained without spend is classified as "Organic Performance". Organic and paid performance may be correlated, depending on factors such as the size of the service, brand recognition, and other exogenous factors. Moreover, each marketing channel has a different influence and channels can affect each other, which suggests a hierarchical or multivariate model. In this study, however, a univariate model was applied.

To examine the impact of measurement error, observed values were created by multiplying the actual marketing spend (the true value) by a measurement error factor. The error is specified multiplicatively because its size is proportional to the marketing spend. Considering that the more recent an observation is, the less accurate it is, the measurement error was set to grow exponentially toward the most recent value. As mentioned in the introduction, since media running ads usually update data for up to a week, measurement error was applied only to the most recent 7 data points. The observed values are generated as follows.

\begin{gather}
e_{i} = \max(0, 1 + \eta_{i})\\
\eta_{i} \sim N(0, a(1+r)^{-\min(0, n-(i+7))})\\
spend^{*}_{i} = e_{i} * spend_{i}
\end{gather}

where $e_{i}$ represents the measurement error factor at time $i$. Since ad spend cannot be negative, $e_{i}$ is truncated below at zero. The error is randomly determined by two parameters, $a$ and $r$, where $a$ is a scaling parameter and $r$ governs the size of the error; the specification also reflects the fact that the measurement error decays exponentially as the data age.

As mentioned earlier, the measurement error is multiplicative, which can cause the residual variance to grow non-linearly. The magnitude of the measurement error was set within $[0.5, 1]$, which stays within the domain, and the simulation was run by the Monte Carlo method ($n = 1{,}000$).
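To make the simulation design concrete, here is a minimal R sketch of the error mechanism in (24)-(26); spend is a placeholder vector of true daily spend, and the parameter values in the commented call are illustrative assumptions rather than the exact settings used in the study.

# Observed spend with multiplicative measurement error, eq. (24)-(26)
simulate_observed_spend <- function(spend, a, r) {
  n <- length(spend)
  i <- seq_len(n)
  v <- a * (1 + r)^(-pmin(0, n - (i + 7)))             # variance of eta_i, eq. (25); grows near the sample end
  e <- pmax(0, 1 + rnorm(n, mean = 0, sd = sqrt(v)))   # eq. (24), truncated at zero
  e * spend                                            # eq. (26), observed spend
}
# obs <- replicate(1000, simulate_observed_spend(spend, a = 0.5, r = 1))  # Monte Carlo draws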

4. Data Modeling

Based on the aforementioned data, we define the independent and dependent variables for modeling. The dependent variable $count_{i}$ is the marketing conversion at time $i$, and the independent variables are the marketing spend over the window $[i-7, i]$. The dependent variable is assumed to follow the conditional Poisson distribution below.

\begin{align}
count_{i} \mid spend_{i} \sim \mathrm{Pois}(\lambda_{i})
\end{align}

The lag variables up to 7 days reflect the lag effect: users influenced by an ad in the past may convert some time later rather than on the same day. The optimal window may vary by type of marketing action and industry, but we use 7-day performance as a general default.

First, we apply a Distributed Lag Poisson Regression to the true values, without measurement error and without autocorrelation effects. The equation and results are as follows.

\begin{align}
\log(\lambda_{t}) = \beta_{0} + \sum_{i=1}^{8}\beta_{i}Spend_{(t-i+1)}
\end{align}

Table 1: Summary of Distributed Lag Poisson Regression

The results show that including lag variables up to lag 7 is significant for model fit. To test for autocorrelation of the residuals, we derived the ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function), using Pearson residuals to account for the fit of the Poisson Regression model.

Figure 3: ACF Plot of Distributed Lag Poisson Regression
Figure 4: PACF Plot of Distributed Lag Poisson Regression

The plots show autocorrelation in the residuals, so time series parameters need to be added to the model. The model with autoregressive and moving average parameters under a conditional Poisson distribution is as follows.

\begin{align}
\log(\lambda_{t}) = \beta_{0} + \sum_{k=1}^{7}\beta_{k}\log(Y_{t-k} + 1) + \alpha_{7}\log(\lambda_{t-7}) + \sum_{i=1}^{8}\eta_{i}Spend_{(t-i+1)}
\end{align}

where $\eta_{i}$ are the coefficients on the marketing spend covariates, $\beta_{0}$ is the intercept, $\beta_{k}$ are the coefficients on the log-transformed lagged dependent variable, and $\alpha_{7}$ is the coefficient on the conditional mean lagged by 7 periods, which captures weekly seasonality. The $\beta_{k}$ terms allow the model to include effects other than the marketing spend covariates, and the $\alpha_{7}$ term accounts for the day-of-week effect, since the data are daily.

The results show that the lagged terms at lag 7, $\alpha_{7}$ and $\beta_{k}$, are significant. The quasi log-likelihood is $-874.725$, a clear improvement over the previous model, and the AICc and BIC, which penalize model complexity, also favor the Poisson Time Series.

Table 2: Summary of Poisson Time Series Model

As shown below, the ACF and PACF of the Pearson residuals indicate that the autocorrelation is largely eliminated. The results so far therefore show that the Poisson Time Series is better than the Distributed Lag Poisson Regression.

Figure 5: ACF Plot of Poisson Time Series
Figure 6: PACF Plot of Poisson Time Series

Next, we simulate measurement error in the independent variable, marketing spend, and examine how it affects the proposed models.

5. Results

In this study, we evaluated the models on several criteria to understand the impact of measurement error and to determine which of the proposed models is superior. First, "Prediction Accuracy" measures how well a model predicts future values, regardless of its in-sample fit. Forecasts were made one step ahead and measured by the Mean Absolute Error (MAE).

Since the data have a time series structure, it is difficult to perform K-fold cross-validation or LOOCV (Leave-One-Out Cross-Validation) by splitting the data arbitrarily. Therefore, the MAE was derived by fitting the model on the first $d$ data points, predicting one step ahead, and then rolling forward, recursively repeating the same procedure with one more data point each time. The MAE for the Poisson Time Series is as follows.
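The following R sketch shows the rolling-origin evaluation just described; fit_fun and forecast_fun are placeholders for the model-specific fitting and one-step forecasting routines (for example the Poisson Time Series or the Poisson Kalman Filter), and x is assumed to be a matrix of covariates.

# Rolling one-step-ahead MAE: fit on the first t points, forecast point t+1, extend by one
rolling_mae <- function(y, x, d, fit_fun, forecast_fun) {
  errors <- numeric(0)
  for (t in d:(length(y) - 1)) {
    fit    <- fit_fun(y[1:t], x[1:t, , drop = FALSE])       # fit on the first t points
    yhat   <- forecast_fun(fit, x[t + 1, , drop = FALSE])   # one-step-ahead forecast
    errors <- c(errors, abs(y[t + 1] - yhat))
  }
  mean(errors)                                              # MAE over all rolling origins
}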

Table 3: Mean Absolute Error (# of simulations = 1,000)

As the magnitude of the measurement error increases, prediction accuracy decreases. However, at low levels of measurement error we actually see a lower MAE on average than when evaluating on the error-free data. This implies that, rather than merely inserting bias into the model, the measurement error reduced the variance, which is beneficial from an MAE perspective. The decomposition of MSE into bias and variance is as follows.

\begin{align}
MSE = Bias^{2} + Var
\end{align}

If $Var$ decreases by more than $Bias^{2}$ increases, the model can be understood as having moved away from overfitting; the same logic applies to MAE, which is simply a different metric. Therefore, for a reasonable measurement error size, the attenuation of the regression coefficient on the independent variable can be understood as a kind of regularization effect.

However, for measurement errors above a certain size, the MAE is on average higher than with the actual data. If the measurement error is large, it is therefore necessary either to keep updating the model with the continuously revised data, or to reduce the size of the measurement error using ideas from repeated measures ANOVA (Analysis of Variance).

In some cases, one may decide that it is better to impose additional regularization from the MAE perspective. In that case it is natural to use something like Ridge Regression, since the measurement error attenuates the coefficients in the same way Ridge Regression does.

Regarding data size, the influence of measurement error decreases as the number of data points increases. This is because the measurement error is present only in the last 7 data points regardless of the total sample size, so its share of the total data gradually shrinks. Therefore, the impact of measurement error is not significant in modeling situations with more than a certain number of data points.

However, in digital advertising, campaigns may be terminated within a short period if marketing performance is poor. If a hypothesis test must be performed on such short-term data, the significance level needs to be adjusted to account for the effect of measurement error.

The 2SLS (Two-Stage Least Squares) model included in the table is proposed later to check the efficiency of the coefficients. Note that 2SLS has a high MAE due to initial uncertainty, but as the data size increases its MAE decreases rapidly compared to the original model.

Next, we need to determine the nature of the residuals in order to make more accurate and robust predictions. Therefore, we performed autocorrelation and heteroskedasticity tests on the residuals.

The following are the results of the autocorrelation test on the Pearson residuals. In this study, the Breusch-Godfrey test used for regression models was performed at lag 7. The Ljung-Box test is more commonly used, but it belongs to the Wald test class, which has high power under a strong exogeneity (mean independence) assumption between the residuals and the independent variables\cite{Hayashi}. That assumption is not appropriate for this study, which must test under measurement error and with few data points. The Breusch-Godfrey test, by contrast, belongs to the Score test class and assumes a more relaxed form of exogeneity (same-row uncorrelatedness), which makes it more robust than the Ljung-Box test here.

Table 4: p-value of Breusch-Godfrey Test for lag 7 (# of simulations = 1,000)

The test shows that the measurement error does not significantly affect the autocorrelation of the residuals.

Next are the results of the heteroskedasticity test. Although GLM-type models do not assume homoskedastic residuals, we still need to check the mean-variance relationship assumed in the modeling. To check this indirectly, we computed Pearson residuals and then performed a Breusch-Pagan test for heteroskedasticity.

Table 5: p-value of Breusch-Pagan Test (# of simulations = 1,000)

We can see that the measurement error does not significantly affect the assumed mean-variance relationship of the model. Consider the parameter estimation process in a GLM: the information matrix in (14) weights each observation by the inverse of its variance, and in Poisson Regression the variance equals the mean, so the weighting is by the inverse mean. Because this uses a weight matrix in the same spirit as GLS (Generalized Least Squares), it inherently suppresses heterogeneity to some extent by giving lower weight to uncertain data.

On the other hand, the Breusch-Pagan test yields low p-values at some data points, so the null hypothesis would be rejected at a significance level somewhat above 0.05. This is because there is a regime shift in the independent variable before and after $n = 47$, as shown in Fig. 1.

To test this, we performed a Quasi-Likelihood Ratio Test (df = 9) between the full model, which allows the pattern to change before and after the regime shift, and the reduced model, which does not. The results are shown below.

Table 6: Quasi-LRT for Structural Break (Changepoint = 47)

The test statistic exceeds the rejection bound and is significant at the 0.05 level. It can be concluded that the interruption of ad delivery after the changepoint, or the lower marketing spend compared to before, may have affected the assumed mean-variance relationship. We do not address this in our study, but it would be possible to model the regime shift retrospectively or to use a Negative Binomial regression model to account for it.

Next, we test the efficiency of the estimates. Although this study does not focus on the endogeneity of the coefficients, we use a 2SLS model as the specification for the efficiency test. The proposed instrumental variable is ad impressions. An instrumental variable should have two characteristics. First, it should be "Relevant", meaning that the correlation between the instrument and the original variable is high: the variance of a coefficient estimated with an instrument is larger than that of the model estimated with the original variable, and the higher the correlation, the smaller this gap (a highly relevant instrument). Since the ad publisher's billing policy is cost per impression, the correlation between ad spend and impressions is very high.

On the other hand, "Validity" is the most important property of an instrumental variable: it should be uncorrelated with the errors so that endogeneity is removed. In the digital advertising market, when a user is exposed to a display ad, the price of the ad is determined by two things: the number of "Impressions" and the "Strength of Competition" among real-time ad auction bidders. Since the effect of impressions has been removed from the residuals, it is unlikely that the remaining factor, the strength of competition among auction bidders, is correlated with the user's exposure to the ad. Furthermore, the orthogonality test below shows that the null hypothesis of no correlation cannot be rejected.

Table 7: p-value of Test for Orthogonality

Therefore, it makes sense to use "Impressions" as an instrumental variable in place of marketing spend. The proposed 2SLS equations are as follows.

\begin{gather}
\hat{Spend}_{t} = \gamma_{0} + \gamma_{1}Imp_{t}\\
\log(\lambda_{t}) = \beta_{0} + \sum_{k=1}^{7}\beta_{k}\log(Y_{t-k} + 1) + \alpha_{7}\log(\lambda_{t-7}) + \sum_{i=1}^{8}\eta_{i}\hat{Spend}_{t-i+1}
\end{gather}

Note that even if the instrumental variable, the number of impressions, itself contains random measurement error, this is known not to affect the validity of the model.
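A minimal R sketch of the two-stage procedure in (31)-(32) is shown below; count, spend, and imp are placeholder vectors, and fit_count_model() stands in for the Poisson Time Series fit used elsewhere in the paper.

# Two-stage procedure, eq. (31)-(32): project spend on impressions, then use the fitted values
two_stage_spend <- function(count, spend, imp) {
  stage1    <- lm(spend ~ imp)   # first stage: linear projection of spend on impressions
  spend_hat <- fitted(stage1)    # fitted spend replaces the observed spend in the count model
  # second stage (placeholder): fit_count_model(count, xreg = embed(spend_hat, 8))
  spend_hat
}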

We performed the Levene test and the Durbin-Wu-Hausman test to examine the equality of residual variances. Below is the result of the Levene test.

Table 8: p-value of Levene Test (m = 0) (# of simulations = 1,000)

The measurement error does not significantly affect the variance of the residuals. Furthermore, the 2SLS model also shows no significant difference in residual variance at the 0.05 significance level, which means the instrumental variable is highly correlated with the original variable.

The Durbin-Wu-Hausman test checks whether the estimated coefficients differ between the proposed model and the original model. If the null hypothesis is rejected, the measurement error has a significant effect and the variance of the residuals will be affected. The results of the test between the original model and the model with measurement error are shown in the table below; the presence of measurement error does not affect the efficiency of the model except in a few cases.

Table 9: p-value of Durbin-Wu-Hausman Test (m = 0) (# of simulations = 1,000)

In addition, we check whether the coefficients differ between the proposed 2SLS and the original model. If the null hypothesis is rejected, it can be understood that omitted variables other than measurement error are at work, which can affect the variance of the residuals. The results are shown below.

Table 10: p-value of Durbin-Wu-Hausman Test (2SLS)

When the data size is small, the model is not well specified and 2SLS is more robust than the original model, but above a certain data size there is no significant difference between the two. In conclusion, the tests above show that the proposed Poisson Time Series is not significantly affected by measurement error or unobserved variables. As mentioned earlier, this is because the AR and MA parameters and the weight-matrix-based parameter estimation of the GLM class inherently suppress part of these effects.

In addition to the GLM-based Poisson Time Series, we also proposed a state space model, the Poisson Kalman Filter. In the Poisson Kalman Filter, inaccuracy in the observation equation due to measurement error is inherently corrected by the state equation, which makes the model robust to the measurement error problem.

The table below shows the benchmark results for the Poisson Time Series and the Poisson Kalman Filter. The log-likelihood is always higher for the Poisson Time Series, while the Poisson Kalman Filter achieves a lower MAE. This can be understood as the Poisson Time Series being more complex and overfitted compared to the Poisson Kalman Filter.

However, after $n = 40$ the Poisson Time Series shows a rapid improvement in prediction accuracy, whereas the Poisson Kalman Filter shows no significant improvement beyond a certain data point. This suggests that the specification of the Poisson Time Series becomes appropriate once enough data are available.

We also compared the computational speed of the two models, using the "furrr" library in R 4.3.1 and running each model 1,000 times to obtain the simulated values. The Poisson Time Series is about 1 second slower on average, but we do not believe this has a significant business impact unless very large-scale simulation is required.

Table 11: Benchmark

The table below shows the residual test results for the Poisson Time Series and the Poisson Kalman Filter, and reveals the heterogeneity between the two models. For the Poisson Kalman Filter, the p-values are initially high, indicating little evidence of autocorrelation or heteroskedasticity, but they decrease above a certain data size, meaning the Poisson Kalman Filter is no longer properly specified as the data size increases.

Table 12: p-value of Robustness Test

Finally, the PIT (Probability Integral Transform) allows us to empirically verify that the mean-variance relationship is modeled properly. If the modeling is adequate, the histogram after the PIT should be close to a Uniform distribution; the farther it is from Uniform, the less the model reflects the DGP of the original data. In the plots below, the Poisson Time Series does not deviate much from the Uniform distribution, while the Poisson Kalman Filter deviates substantially.

Figure 7: PIT of Poisson Time Series
Figure 8: PIT of Poisson Kalman Filter
6. Ensemble Methods

So far, we have covered the Poisson Time Series and the Poisson Kalman Filter. When the data size is small the Poisson Kalman Filter is preferable, but above a certain data size the Poisson Time Series is. To reflect the heterogeneity of the two models, we derive the final model through model averaging. The optimization objective is shown below.

\begin{gather}
p_{t+1} = \operatorname*{arg\,min}_{p}\sum_{i=1}^{t}w_{i}\left|y_{i} - \left(p \hat{y}_{i}^{(GLM)} + (1 - p) \hat{y}_{i}^{(KF)}\right)\right|\\
\text{s.t.} \hspace{0.2cm} 0 \leq p \leq 1, \hspace{1cm} w_{i} > 0 \;\; \forall i
\end{gather}

The objective function minimizes the MAE, and different data points are weighted differently via $w_{i}$. Each $w_{i}$ is the precision (the reciprocal of the variance) at that time point, normalized by the total precision, reflecting that more recent data are estimated more accurately and therefore have lower variance, and that a better model has lower variance. The final weighted prediction is computed as follows.

\begin{align}
\hat{y}_{t+1} = p_{t+1}\hat{y}_{t+1}^{(GLM)} + (1 - p_{t+1})\hat{y}_{t+1}^{(KF)}
\end{align}
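As an illustration of the stacking step in (33)-(35), the following R sketch finds a single mixing weight $p$ in $[0, 1]$ by minimizing the precision-weighted MAE with optimize(); y, yhat_glm, yhat_kf, and the weights w are placeholder vectors, not the paper's data.

# Stacking weight, eq. (33)-(34), and combined forecast, eq. (35)
stack_weight <- function(y, yhat_glm, yhat_kf, w) {
  objective <- function(p) sum(w * abs(y - (p * yhat_glm + (1 - p) * yhat_kf)))
  optimize(objective, interval = c(0, 1))$minimum   # constrained one-dimensional search
}
# p_next    <- stack_weight(y[1:t], pred_glm[1:t], pred_kf[1:t], w = precision[1:t])
# yhat_next <- p_next * pred_glm[t + 1] + (1 - p_next) * pred_kf[t + 1]  # eq. (35)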

The graph below shows the weight of the Poisson Time Series at each data point, as derived from the stacking method. The weights are close to zero until $n = 42$ and then increase significantly. In the middle, where the data become more volatile, such as at the regime shift (blue vertical line), the weights partially decrease.

Figure 9: Weight of Poisson Time Series

The table below compares the final stacking model with the Poisson Time Series and the Poisson Kalman Filter. The stacking model has the lowest MAE at all time points, since it absorbs the advantages of both models: the Poisson Kalman Filter's advantage when the data size is small and the Poisson Time Series' advantage above a certain data size. The robustness tests also show that the p-values of the stacking model lie between those of the two individual models.

Table 13: Benchmark
Table 14: p-value of Robustness Test
7. Conclusion

We have shown the impact of measurement error on count data in the digital advertising domain. Even when the main purpose is not an analytical model but simply a model that predicts well, it is important to check for measurement error in predictive modeling, since measurement error can cause the model to underfit and, depending on its characteristics, make the residuals heteroskedastic.

To this end, we introduced the GLM-based Poisson Time Series and the Poisson Kalman Filter, a type of Extended Kalman Filter, both of which can partially address the measurement error problem. We applied these models to data simulated from real data and obtained prediction accuracy results and statistical tests.

In terms of prediction accuracy, we found that measurement error attenuates the magnitude of the coefficients, producing a kind of regularization effect. For the data used in this study, small measurement errors improved prediction accuracy, while large measurement errors worsened it compared to the original data. We also found that the impact of measurement error was relatively large when the data size was small, but diminished as the data size increased. This is due to the nature of digital advertising data, where only recent data are subject to measurement error.

The residual tests show no significant difference with and without measurement error. The proposed models can therefore partially avoid the measurement error problem, which is advantageous for digital advertising data.

We also note that the two models behave differently depending on the data size. When the data size is small and the impact of measurement error is relatively large, the Poisson Kalman Filter, which additionally uses the state equation, outperforms the over-specified Poisson Time Series. As the data size increases, the Poisson Time Series gradually becomes superior in terms of specification accuracy. Finally, based on this heterogeneity, we proposed a stacking ensemble that combines the advantages of both. In the prediction accuracy and residual tests, the final model combined the advantages of the two models and outperformed each single model.

On the other hand, while we assumed that the data follow a conditional Poisson distribution, some data points may be overdispersed due to volatility, as evidenced by the structural breaks found in the retrospective analysis. If the data are overdispersed relative to the model, it may be more appropriate to assume a Negative Binomial distribution. Also, since the data are a daily time series, further research on increasing the frequency to hourly data could be considered. Finally, although we assumed a univariate model, in real-world digital advertising a user may be influenced by multiple advertising media simultaneously, so there may be correlation between media. Future studies could therefore consider a multivariate model such as SUR (Seemingly Unrelated Regression), which models correlation between residuals, or a GLMM (Generalized Linear Mixed Model), which models the hierarchical structure of the data.

References

[1] Agresti, A. (2012). Categorical Data Analysis 3rd ed. Wiley.

[2] Biewen, E., Nolte, S. and Rosemann, M. (2008). Multiplicative Measurement Error and the Simulation Extrapolation Method. IAW Discussion Papers 39.

[3] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

[4] Czado, C., Gneiting, T. and Held, L. (2009). Predictive Model Assessment for Count Data. Biometrics 65, 1254-1261.

[5] Greene, W. H. (2020). Econometric Analysis 8th ed. Pearson.

[6] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10(6), 971-988.

[7] Hayashi, F. (2000). Econometrics. Princeton University Press.

[8] Helske, J. (2016). Exponential Family State Space Models in R. arXiv preprint arXiv:1612.01907v2.

[9] Hyndman, R. J. and Athanasopoulos, G. (2021). Forecasting: Principles and Practice 3rd ed. OTexts. OTexts.com/fpp3.

[10] KOBACO. (2022). Broadcast Advertising Survey Report, 165-168.

[11] Liboschik, T., Fokianos, K. and Fried, R. (2017). An R Package for Analysis of Count Time Series Following Generalized Linear Models. Journal of Statistical Software 82(5), 1-51.

[12] Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Springer.

[13] Montgomery, D. C., Peck, E. A. and Vining, G. G. (2021). Introduction to Linear Regression Analysis 6th ed. Wiley.

[14] Shmueli, G. (2010). To Explain or to Predict?. Statistical Science 25(3), 289-310.

[15] Shumway, R. H. and Stoffer, D. S. (2016). Time Series Analysis and Its Applications with R Examples 4th ed. Springer.

Price Premium Discovery In Real Estate Auction Market: Decomposition Of The Korea Auction Sale Rate

Bohyun Yoo*

* Swiss Institute of Artificial Intelligence, Chaltenbodenstrasse 26, 8834 Schindellegi, Schwyz, Switzerland

Abstract

This study discovers and analyzes price premium (discount/surcharge) factors in the real estate auction market. Unlike existing bottom-up studies based on individual auction cases, a top-down time-series analysis is conducted, assuming that the price premium factor varies over time. To overcome limitations such as the difference between the court appraisal time* and the auctioned time, and the difficulty of using external data on court appraisal and price premium factors, the Fourier transform is utilized to extract the court appraisal and price premium factors in reverse. The extracted components are verified to determine whether they can serve as the respective factors. The price premium factor is found to move similarly to past differences in the auction sale rate, and, as it captures the discount/surcharge in the auction market relative to the general market, it is named the "momentum factor". Furthermore, by leveraging the momentum factor, the price premium can be differentiated by region, and the extent of the price premium applied can be distinguished over various time periods compared to the general market. Given its clustering tendency, the momentum factor can be a significant indicator for auction market participants to detect market changes.

1. Introduction

The housing auction market in Korea is one part of the real estate market, and many stakeholders such as mortgage banks, arbitrage investors, and non-performing loan operators are deeply involved in it. There is a general perception that the auction market trades at a surcharge or discount relative to the general market. If the auction market were efficient and fair, prices would not differ from general market prices; however, most housing auction cases arise from default, so they are known to involve legal issues, which act as a discount factor. The bottom-up analysis of individual auction cases mainly used in previous studies on discounts and surcharges is, however, limited in time and space, cannot consider time-varying effects, and its results are limited by and dependent on the data held by the researcher.

To overcome these limitations, the analysis should be carried out from a market perspective, but the auction sale rate time series is unreliable as an indicator because the court appraisal price used as its denominator is determined in the past rather than at the time of the successful bid. It is difficult to specify the court appraisal time as a variable in the model, because how far in the past it lies varies from case to case, and even when the time is known the court appraisal price cannot be accurately estimated. Individual cases could be investigated bottom-up to restate them relative to the general market price at the auctioned time, but that is a vast task and would again be a study limited in time and space.

The target of this paper is the apartment auction market. To overcome the limitations of the auction sale rate, it is decomposed into three components in a top-down manner using the Fourier transform, and the validity of each decomposed component is verified. The price premium effect in the auction market is then estimated, its causes are analyzed, and an attempt is made to identify the periods in which the price premium effect operates. In addition, a time-varying beta obtained through the Kalman filter is used to support the price premium effect, and the analysis of how the price premium effect differs across regional markets is also performed.

2. Literature review

Shilling et al. (1990) analyzed apartment auctions in 1985 in Baton Rouge, Louisiana, USA, and found an auction discount of -24%; Forgey et al. (1994) analyzed houses in the United States from 1991 to 1993 and found that they traded at a -23% discount. Spring (1996) analyzed foreclosures in Texas from 1991 to 1993 and found a 4-6% discount, and Clauretie and Daneshvary (2009), analyzing housing auctions from 2004 to 2007, found a foreclosure discount of about 7.5% after accounting for endogeneity and autocorrelation.

Campbell et al. (2011) analyzed about 1.8 million housing transactions in Massachusetts and found that the discount rates for foreclosures and deaths differed. Zhou et al. (2015) found that foreclosures in 16 US cities were discounted by 14.7% on average, and Arslan, Guler and Taskin (2015) found that a 1% increase in the risk-free interest rate led to a 27% drop in house prices and a 3% increase in the foreclosure rate. Jin (2010) compared the general sale prices and auction prices of apartments in Dobong-gu, Seoul and Suji-gu, Yongin-si, Korea, and found that auction prices are more discounted than general transaction prices. Lee (2012) noted that the real estate market is not efficient and that the discount/surcharge phenomenon in the apartment auction market is one of its anomalies.

Lee (2009) and Oh (2021) pointed out the limitations that arise when the court appraisal price and the auctioned price refer to different times, and estimated the auction sale rate by correcting the court appraisal price to the auctioned time.

However, previous studies mainly rely on bottom-up analysis of variables from individual auction cases, with the accompanying limitations of space and time. In addition, studies from outside Korea are hard to compare to the Korean environment because those countries adopt an open bidding system.

3. Materials and method
3.1. Decomposition of auction sale rate

The auction sale rate is defined as

\begin{equation} \label{eq:auction-sale-rate}
Auction\ Sale\ Rate\ _t=\frac{\sum_{i}\ Auctioned\ Price_{it}}{\sum_{i}\ Appraisal\ Price_{it-n}}\
\end{equation}

\begin{equation} \label{eq:auction-price}
Auctioned\ Price_t=\ Market\ Price_t\pm\ Price\ Premium_t\ (=discount\ or\ surcharge)
\end{equation}

\begin{equation} \label{eq:auction-sale-rate-price}
Auction\ Sale\ Rate\ _t=\frac{\sum_{i}\ (Market\ Price_t\ \pm\ Premium\ _t)}{\sum_{i}\ Appraisal\ Price_{t-n}}
\end{equation}

\begin{equation} \label{eq:market-price}
\text{If}\ Price\ Premium_t=0\ ,\ \ Market\ Price_t=Auctioned\ Price_t
\end{equation}

where $i$ indexes individual auction cases and $t$ indexes months. If the auctioned price is discounted or surcharged relative to the general market price, the components can be separated as in (2); if there is no discount or surcharge, the relation reduces to (4). To estimate the price premium effect, i.e. the discount or surcharge, the auction sale rate can be written in regression form as in (5), and the explanatory power of each component is assumed to be ordered as in (6).

In the Regression form in terms of effects,

\begin{equation} \label{eq:auction-sale-rate-in-regression}
Auction\ Sale\ Rate_t=\beta_0+\beta_1EoM_t+\beta_2EoA_t+\beta_3EoP_t+\epsilon_t
\end{equation}

\begin{equation} \label{eq:explanatory-power}
\text{Explanatory power of each component:}\quad EoM\ \text{(Effect of Market Price)} > EoA\ \text{(Effect of Appraisal Price)} > EoP\ \text{(Effect of Price Premium)}
\end{equation}

3.2. The data

The empirical analysis in this paper is based on the monthly nationwide Auction Sale Rate and Market Price Index from 2012.03 to 2022.10. The auction sale rate is calculated from the sums of court appraisal prices and auctioned prices nationwide announced by the courts over that period. The Market Price Index is an index of general market apartment prices nationwide provided by the Korea Real Estate Board. The Market Price Index is log-differenced so that both series have the same form, and both series are then standardized to mean 0 and variance 1 so that they are on the same scale.
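A minimal R sketch of this preprocessing, under the assumption that market_price_index and auction_sale_rate are placeholder monthly vectors, is shown below; dropping the first auction sale rate observation to align lengths after differencing is an assumption of the sketch.

# Log-difference the market index, then standardize both series to mean 0, variance 1
preprocess <- function(market_price_index, auction_sale_rate) {
  mkt <- as.numeric(scale(diff(log(market_price_index))))  # log-differencing, then standardization
  asr <- as.numeric(scale(auction_sale_rate[-1]))          # drop first point to align lengths, then standardize
  data.frame(mkt = mkt, asr = asr)
}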

Table 1. Data Description
Figure 1. Auction Sale Rate and Market Price Index
Figure 2. Comparison of Standardized Auction Sale Rate and Market Price Index (Log-differencing)

The skewness and kurtosis reported in Table 1 show that the Auction Sale Rate and the Market Price Index have different peaks and tails from a normal distribution, and the Lev results in Table 1 show behavior different from the leverage effect (Black, 1976) of the stock market: both the auction market and the general sales market have a positive relationship with future volatility. This means that volatility in the real estate market is positively correlated with price.

3.3. Identification of variables
3.3.1. The effect of market price

The auction sale rate can be decomposed into three components in the regression form of (5), and the log-differenced market price index is used as the first variable, the proxy for EoM. As shown in Table 2, EoM has the strongest explanatory power for the auction sale rate.

3.3.2. Component identification

\begin{equation} \label{eq:component-identification}
y_t=\beta_0+\beta_1Mkt_t+\epsilon_t
\end{equation}

where $y_t$ is the auction sale rate at time $t$, $\beta_0$ is the intercept, $\beta_1$ is the parameter of $Mkt$, and $Mkt$ is the log-differenced Market Price Index. As defined in (5), the remaining EoA and EoP components are latent in the residual. To identify them, a Fourier transform is applied to $\epsilon_t$ in (7), and the two highest-amplitude signals are extracted, under the assumption in (6) that they correspond to the court appraisal and price premium effects.

3.3.2.1. Fourier transform

The Fourier transform is a mathematical transformation that decomposes a function into frequency components, representing the output in the frequency domain. In this paper it is used to extract the orthogonal cycles of EoA and EoP defined in (5). In terms of linear transformation, the orthogonal factors present in the signal can be extracted with the forward and inverse Discrete Fourier matrices, as shown in (8) and (9).

\begin{equation} \label{eq:fft}
X=F_{N}x \ \text{and} \ x=\frac{1}{N}F_N^{-1}X\ \text{<Forward and Inverse>}
\end{equation}

\begin{equation} \label{eq:fft-in-matrix}
{\underbrace{\left[\begin{matrix}
X\left[0\right] \\
X\left[1\right] \\
\vdots \\
X\left[N-1\right] \\
\end{matrix}\right]}}_{\text{Signal}}
=
{\underbrace{\left[\begin{matrix}
W_N^{0\cdot0} & W_N^{0\cdot1} & \cdots & W_N^{0\cdot(N-1)} \\
W_N^{1\cdot0} & W_N^{1\cdot1} & \cdots & W_N^{1\cdot(N-1)} \\
\vdots & \vdots & \ddots & \vdots \\
W_N^{(N-1)\cdot0} & W_N^{(N-1)\cdot1} & \cdots & W_N^{(N-1)\cdot(N-1)} \\
\end{matrix}\right]}}_{\text{$F_N$ (Discrete Fourier Matrix)}}
{\underbrace{\left[\begin{matrix}
x\left[0\right] \\
x\left[1\right] \\
\vdots \\
x\left[N-1\right] \\
\end{matrix}\right]}}_{\text{Residual ($\epsilon_t$)}}
\text{, where } W_N^{k\cdot n}=\exp{\left(-j\frac{2\pi k}{N}n\right)}
\end{equation}

\begin{equation} \label{eq:signal-k}
X\left[k\right]=x\left[0\right]W_N^{k\cdot0}+x\left[1\right]W_N^{k\cdot1}+\ldots+x\left[N-1\right]W_N^{k\cdot\left(N-1\right)}, \quad \text{where } X\left[k\right]=signal_k
\end{equation}

where $x=\left(x_0,x_1,\ldots,x_{N-1}\right)^T$ is the vector of residuals $\epsilon$ from (7), $N$ is its length, $X=\left(X_0,X_1,\ldots,X_{N-1}\right)^T$ is the signal vector, and $F_N$ is the Discrete Fourier Matrix. As shown in (9) and (10), cyclic time series data can be decomposed into orthogonal signals by the Discrete Fourier Transform as a linear transformation. In practice, however, the $O(N^2)$ DFT calculation is replaced by the Fast Fourier Transform (Cooley-Tukey algorithm, 1965), which performs the computation in $O(N\log N)$ by splitting the DFT into even- and odd-indexed terms, as shown in (11). Figure 3 shows the two high-amplitude signals extracted by applying the FFT to the residual in (7).

\begin{equation} \label{eq:n-log-n}
\begin{split}
X\left[ k \right] & = \sum_{n=0}^{N-1} x_n \ exp \left( -j \frac{2 \pi k}{N} n \right) \\
& = \sum_{m=0}^{N/2-1}x_{2m}\exp{\left(-j\frac{2\pi k}{N}2m\right)}+\ \sum_{m=0}^{N/2-1}x_{2m+1}\exp{\left(-j\frac{2\pi k}{N}2m+1\right)} \\
& = \sum_{m=0}^{N/2-1}x_{2m}\exp{\left(-j\frac{2\pi k}{N\ /\ 2}\ m\ \right)}+\exp{\left(-j\frac{2\pi k}{N}\ \right)}\sum_{m=0}^{N/2-1}x_{2m+1}\exp{\left(-j\frac{2\pi k}{N/2}m\right)}
\end{split}
\end{equation}

where $x_{2m}=(x_0,x_2,\ldots,x_{N-2})$ is the even-indexed part and $x_{2m+1}=(x_1,x_3,\ldots,x_{N-1})$ is the odd-indexed part.
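As an illustration of this extraction step, the following R sketch applies the FFT to a residual series, keeps the two highest-amplitude positive frequencies, and reconstructs each as a time-domain component via the inverse FFT; resid1 stands for the residual $\epsilon_t$ of (7), and the helper is an assumption of this sketch rather than the authors' code.

# Extract the two highest-amplitude cyclical components from a residual series
extract_top_signals <- function(resid1, n_signals = 2) {
  N   <- length(resid1)
  X   <- fft(resid1)
  amp <- Mod(X)[2:floor(N / 2)]              # skip the DC term, use positive frequencies
  top <- order(amp, decreasing = TRUE)[1:n_signals] + 1
  lapply(top, function(k) {
    mask <- rep(0 + 0i, N)
    mask[k] <- X[k]
    mask[N - k + 2] <- X[N - k + 2]          # keep the conjugate-symmetric partner
    Re(fft(mask, inverse = TRUE) / N)        # inverse FFT back to the time domain
  })
}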

Figure 3-1. Residual transformed to the frequency domain and filtered by amplitude
Figure 3-2. FFT of the residual in (7) and the extracted signals
3.3.2.2. Regression analysis
Table 2. Regression results

\begin{equation} \label{eq:stage-2}
Y_t=\beta_0+\beta_1Mkt_t+\beta_2SIG1_t+\mu_t
\end{equation}

\begin{equation} \label{eq:stage-3}
Y_t=\beta_0+\beta_1Mkt_t+\beta_2SIG1_t+\beta_3\widehat{SIG2}_t+\omega_t
\end{equation}

\begin{equation} \label{eq:signal-2}
\widehat{SIG2}_t=\begin{cases}1 & \text{if } \sigma\left(SIG2_t\right)>0.5 \\ 0 & \text{otherwise}\end{cases} , \quad \sigma(x)=\frac{1}{1+e^{-x}}
\end{equation}

where $SIG1$ is the highest-amplitude signal in the residual $\epsilon_t$ of (7), and $SIG2$ is the highest-amplitude signal in the residual $\mu_t$ of (12).

Table 2 shows the results of using the signals extracted by the FFT in 3.3.2.1 as regression variables. $SIG2$ is a component of EoP, and to distinguish the price premium effect clearly, it is transformed into categorical data (0/1) through the sigmoid function, as shown in (14). The difference columns in Table 2 show that the parameters hardly change across stages, indicating that the two extracted signals are almost orthogonal components and do not induce omitted variable bias (Wooldridge, 2009), while the adjusted R-squared supports the order of explanatory power assumed in (5). Lastly, the residual ACF/PACF plot in Figure 4 indicates that no further patterns remain in the residuals of (13) once the three components are accounted for. This supports the assumption in 3.1 (5) that the auction sale rate is composed of three main components.
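The staged regressions (7), (12), and (13), together with the 0/1 transformation of $SIG2$ in (14), can be sketched as follows with statsmodels. The series names (`y`, `mkt`, `sig1`, `sig2`) are placeholders for the variables defined above, not the authors' data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def staged_regressions(y, mkt, sig1, sig2):
    """Fit the three nested OLS models and binarize SIG2 with a sigmoid, as in (14)."""
    sig2_hat = (1.0 / (1.0 + np.exp(-np.asarray(sig2))) > 0.5).astype(int)

    X1 = sm.add_constant(pd.DataFrame({"Mkt": mkt}))
    X2 = sm.add_constant(pd.DataFrame({"Mkt": mkt, "SIG1": sig1}))
    X3 = sm.add_constant(pd.DataFrame({"Mkt": mkt, "SIG1": sig1, "SIG2_hat": sig2_hat}))

    # Comparing coefficients across the three fits checks that adding a signal barely
    # moves the earlier parameters, i.e. the extracted components are near-orthogonal.
    return [sm.OLS(y, X).fit() for X in (X1, X2, X3)], sig2_hat
```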

Figure 4. ACF/PACF Plot of Residual $\omega_t$ (13)
3.3.3. Proof of the effect of appraisal price

Based on Table 2 and the assumption of (5), $SIG1$ is EoA (the effect of the appraisal price on the auction sale rate). The court appraisal occurs before the auction date (1), and the gap between the two points in time makes it difficult to define the court appraisal effect as a time series variable. Since correcting this time-induced price difference for every auction case is very difficult, the Fourier transform of 3.3.2.1 is used instead. To show that $SIG1$ is EoA, 2,762 individual auction cases that occurred between April 2016 and March 2018 in Seoul and Busan are analyzed empirically (Table 3, Table 4).

Figure 5. The time difference between the court appraisal date and the auction date

The analysis is conducted in two main aspects:

  1. The time interval between the court appraisal date and the auction date (Table 4)
  2. A regression of the court appraisal price on the general market price at the time of appraisal (Table 4)
    \begin{equation} \label{eq:cp}
    CP_t=\alpha_0 dummy_t+\alpha_1 MP_t+\gamma_t
    \end{equation}

where $CP_t$ is the court appraisal price at the time of appraisal (Figure 5), $MP_t$ is the housing price, $\alpha_0$ is the coefficient of the dummy variable, and $\alpha_1$ is the parameter of the housing price.

Table 3. Data description
Table 4. Result of analysis
Figure 6. Residual distribution in (15) & the difference between court appraisal and auction dates (days)

As shown in Table 4, the distribution of the time difference is right-skewed, and the 25% to 75% range is about 7 to 11 months. The price difference has a long-tailed distribution, and the court appraisal price and the housing price at the time of appraisal are very highly correlated and nearly identical in value. Taken together, the two analyses show that the court appraisal price is a lagged version of the housing price. In terms of the components in (5), EoA can therefore be assumed to have a lag relationship with $Mkt$; the results are shown in Table 5.

Table 5. Regression of analysis ($SIG1$ vs $Mkt$)

Table 5 [1] shows the relationship between the lag of $SIG1$ and $Mkt$. The lagged $SIG1$ extracted by the Fourier transform is compared with $Mkt$ because $SIG1$ is a signal indicating the influence of the past on the present rather than the past price itself. The lag orders of the comparison are set from 7 to 11 months, the 25% to 75% range in Table 4. The analysis confirms that the lag of $SIG1$ has a significant relationship with $Mkt$.

Table 5 [2] checks whether the lag of $Mkt$ can replace the court appraisal effect, given that the court appraisal price has a time-lag relationship with $Mkt$ according to Table 4. The analysis shows a significant relationship.

Table 5 [3] examines the relationship between $SIG1$ and the auction sale rate. If the court appraisal effect could be replaced by the lag of $Mkt$ alone, the $SIG1$ variable would not be meaningful, but the results show that Table 5 [3] is superior to Table 5 [2]. The reason, as suggested by Figure 6, is that auction cases without any special depreciation factor can be explained by the lag of $Mkt$, but there remains an unidentified area with a large gap from $Mkt$, such as cases involving legal issues, equity auctions, or time differences outside the 25% to 75% range.

Figure 7. The lag of $Mkt$ can represent only part of the identified area

To sum up the results of Tables 4 and 5, $SIG1$ has a lag relationship with $Mkt$ and is superior to the lag variables of $Mkt$ given the limitation illustrated in Figure 7. Therefore, $SIG1$ can be regarded as EoA, as assumed in (5).

3.3.4. Proof of the effect of premium price

Based on the results of Table 2 and the assumption of (5), $SIG2$ is EoP (the effect of the price premium on the auction sale rate). For the analysis, $SIG2$ is transformed into a categorical value through the sigmoid function to indicate whether the price premium is on or off, as in 3.3.2.2. Two aspects support the claim that $SIG2$ is EoP:

  1. Verify that $\widehat{SIG2}$ can distinguish between discount and surcharge points (Figure 8).
  2. Track which variable $SIG2$ corresponds to, name it, and verify that it makes sense.
3.3.4.1. Distinguishing the price premium effect in the auction sale rate

The $\widehat{SIG2}$ parameter in Table 2 [3] is about 0.49 with a positive sign. Figure 8 is based on the baseline predicted by Table 2 [2], and the auction sale rate points are clearly separated up and down by the 1/0 values of $\widehat{SIG2}$ in Table 2 [3]. The right-hand side of Figure 8 shows two distributions with different means and variances. Therefore, $SIG2$ can be regarded as EoP, as assumed in (5).

Figure 8. Surcharge and discount points that can be distinguished by $\widehat{SIG2}$
3.3.4.2. Momentum factor

In 3.3.4.1, it was confirmed that $SIG2$ is a component that can explain the price premium effect, but this is of little use if it cannot be linked to any observable variable in practice. This section therefore identifies which variables $SIG2$ can be compared with, verifies whether the comparison makes sense, and finally names the component. First, $SIG2$ is likely to be a variable of the auction market itself, because the EoM and EoA components have likely already absorbed most macroeconomic effects; in fact, no significant correlation was found with comparable macroeconomic variables. According to the Lev result in Table 1, the future volatility of the auction market is positively correlated with the auction sale rate, and the EoP component is also positively correlated according to Table 2 [3]. The variable within the auction market itself that can be compared with $SIG2$ is therefore volatility, as in (16) and (17). The verification of this hypothesis is shown in Table 6.

\begin{equation} \label{eq:signal-2-2}
SIG2_t=c_0+c_1{v1}_t+c_2{v2}_t+\eta_t , \quad {v1}_t=y_t-y_{t-1} , \quad {v2}_t=y_{t-1}-y_{t-2}
\end{equation}

\begin{equation} \label{eq:signal-2-3}
SIG2_t=c_0+c_1\left(y_t-y_{t-1}\right)+c_2\left(y_{t-1}-y_{t-2}\right)+\eta_t
\end{equation}

where $c_0$ is the intercept, $y$ is the auction sale rate, and $v$ is volatility, defined as the first difference of the auction sale rate.
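As a sketch of the volatility regression in (16) and (17): assuming `y` holds the monthly auction sale rate and `sig2` the extracted EoP signal, the two lagged differences are aligned with $SIG2_t$ and fitted by OLS.

```python
import numpy as np
import statsmodels.api as sm

def momentum_regression(sig2, y):
    """Regress SIG2_t on (y_t - y_{t-1}) and (y_{t-1} - y_{t-2}), as in (16)-(17)."""
    y = np.asarray(y, dtype=float)
    d = np.diff(y)                                 # d[t-1] = y_t - y_{t-1}
    v1 = d[1:]                                     # y_t - y_{t-1}      for t = 2, ..., T
    v2 = d[:-1]                                    # y_{t-1} - y_{t-2}  for t = 2, ..., T
    dep = np.asarray(sig2, dtype=float)[2:]        # align SIG2_t with the two differences
    X = sm.add_constant(np.column_stack([v1, v2]))
    return sm.OLS(dep, X).fit()
```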

Table 6. Regression result
Figure 9. Comparison between $SIG2$ and $\widehat{C}^T V_t$ (16)
Figure 10. Surcharge and discount points distinguished by $\sigma(\widehat{C}^T V_t)$

In Table 6, the volatility variables are significantly related to $SIG2$, and in Figure 9 the value fitted by the volatility variables in (16) and (17) and $SIG2$ show similar movements. Figure 10 shows that the volatility variables can distinguish the surcharge and discount points well and yield two different distributions, as in Figure 8.

In summary, the volatility of the auction sale rate can be identified as the main factor creating the price premium effect; in particular, volatility generates the price premium effect because the volatility of the auction market is positively correlated with the auction sale rate. Accordingly, this volatility component can be named the momentum of the auction market.

3.3.5. Time varying beta to capture price premium section

In 3.3.4, it was confirmed that $SIG2$, extracted through the Fourier transform, is the price premium effect and that it corresponds to a momentum factor. However, the analysis period of this paper is about 10 years, and it is more reasonable to assume that the parameter between the market and the price premium variable is time-varying rather than a fixed constant; that is, the $\beta$s in (18) are not stable over time. Beyond simply distinguishing the price premium effect, the sensitivity of this beta can be used to capture the periods in which momentum operates in the market. In this paper, a Kalman filter is used to estimate the time-varying parameters.

\begin{equation} \label{eq:betas-not-stable}
y_t=\beta_0+\beta_1Mkt_t+\beta_2SIG1_t+\beta_3\widehat{SIG2}_t+\epsilon_t , \quad \epsilon_t\sim N(0,\sigma^2)
\end{equation}

3.3.5.1. Kalman filter

The Kalman filter is a model for describing dynamics based on measurements and a recursive procedure for computing the estimator of the unobserved component, the state vector, at time $t$.

\begin{equation} \label{eq:state-model}
\xi_t=F_t\xi_{t-1}+q_t , \quad q_t\sim N(0,Q) \quad \text{(State Model)}
\end{equation}

\begin{equation} \label{eq:observation-model}
y_t=H_t\xi_t+r_t , \quad r_t\sim N(0,R) \quad \text{(Observation Model)}
\end{equation}

Table 7. Description

<Predict Step>

Calculate the optimal estimate of $\xi_{t|t-1}$, based on the information available up to time $t-1$,

\begin{equation} \label{eq:xi-hat}
\hat{\xi}_{t|t-1}=F_t\hat{\xi}_{t-1|t-1}
\end{equation}

\begin{equation} \label{eq:covariance-xi}
P_{t|t-1}=F_tP_{t-1|t-1}F_t^T+Q
\end{equation}

<Update Step>

Calculate the optimal estimate of $\xi_{t|t}$, based on the information available up to time $t$, where $S_t$ is the covariance of the prediction error $r_{t|t-1}=y_t-H_t\hat{\xi}_{t|t-1}$,

\begin{equation} \label{eq:state-matrix}
S_t=H_tP_{t|t-1}H_t^T+R
\end{equation}

\begin{equation} \label{eq:kalman-gain}
K_t=P_{t|t-1}H_t^TS_t^{-1}
\end{equation}

\begin{equation} \label{eq:covariance-at-time-t}
P_{t|t}=\left(I-K_tH_t\right)P_{t|t-1}
\end{equation}

\begin{equation} \label{eq:xi-at-time-t}
\hat{\xi}_{t|t}=\hat{\xi}_{t|t-1}+K_t\,r_{t|t-1}
\end{equation}

A random walk for the coefficients is assumed: Q and R are initialized near 0 (a diffuse prior) and F is the identity matrix diag(1,1,1,1). The Kalman gain $K_t$ determines the weight given to new information using the error between the prediction and the observation.
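A minimal sketch of the recursion in (19)-(25) for time-varying regression betas, assuming a random-walk state ($F=I$), small $Q$ and $R$, and rows of the regressor matrix equal to $H_t=(1, Mkt_t, SIG1_t, \widehat{SIG2}_t)$. The noise scales and the large initial covariance are illustrative choices, not the paper's settings.

```python
import numpy as np

def kalman_time_varying_beta(y, H, q=1e-5, r=1e-3, p0=1e2):
    """Recursively estimate time-varying regression coefficients.

    y : (T,)   observations (auction sale rate)
    H : (T, k) regressors, row t = (1, Mkt_t, SIG1_t, SIG2_hat_t)
    """
    T, k = H.shape
    F = np.eye(k)                  # random-walk transition for the betas
    Q = q * np.eye(k)              # small state noise -> slowly drifting betas
    R = r                          # observation noise variance
    xi = np.zeros(k)               # initial state (betas)
    P = p0 * np.eye(k)             # large initial covariance (weak prior)
    betas = np.zeros((T, k))

    for t in range(T):
        # predict
        xi_pred = F @ xi
        P_pred = F @ P @ F.T + Q
        # update
        h = H[t]
        S = h @ P_pred @ h + R                  # innovation variance (scalar)
        K = P_pred @ h / S                      # Kalman gain
        innov = y[t] - h @ xi_pred              # prediction error r_{t|t-1}
        xi = xi_pred + K * innov
        P = P_pred - np.outer(K, h) @ P_pred    # (I - K H_t) P_{t|t-1}
        betas[t] = xi
    return betas
```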

Table 8. Comparison of OLS and the Kalman filter
Figure 11. Beta (OLS) vs Beta (Kalman Filter) & Beta ($Mkt$) vs Beta ($\widehat{SIG2}$)
Figure 12. The Sensitivity points of EoP to the Auction Market

Table 8 shows that the time-varying betas from the Kalman filter perform better than OLS with fixed parameters. Figure 11 compares the evolution of the $\widehat{SIG2}$ parameter with that of the $Mkt$ parameter over the same period. In Figure 12, periods in which the $\widehat{SIG2}$ parameter exceeds the upper confidence bound of the OLS estimate are set to 1 and plotted. The area in Figure 11 where the beta of $\widehat{SIG2}$ exceeds the beta of $Mkt$ coincides with the area marked 1 in Figure 12, indicating periods in which the price premium effect of the auction market is more sensitive than the market price effect. These can be interpreted as momentum intervals, in which the price premium effect is particularly sensitive.

3.3.5.2. Experiment

It is necessary to confirm whether the logic constructed so far also holds for regional auction markets rather than only the nationwide market. Furthermore, when the model is estimated by region, the characteristics of each region can be examined. The target areas of the empirical analysis are Seoul and Gyeonggi, where the auction market is most active.

Table 9. Result of Seoul and Gyeong-gi
Figure 13. (Seoul) $Mkt$ vs Auction Sale rate in seoul (left) Distinguished auction sale rate by EoP (Right)
Figure 14. (Seoul) Beta (OLS) vs Beta (Kalman Filter) & Beta ($Mkt$) vs Beta ($\widehat{SIG2}$)
Figure 15. (Seoul) The Sensitivity points of EoP to the Seoul Auction Market
Figure 16. (Gyeong-gi) $Mkt$ vs Auction Sale rate (Left) Distinguished auction sale rate by EoP (Right)
Figure 17. (Gyeong-gi) Beta (OLS) vs Beta (Kalman Filter) & Beta ($Mkt$) vs Beta ($\widehat{SIG2}$)
Figure 18. (Gyeong-gi) The Sensitivity points of EoP to the Gyeonggi Auction Market

Table 9 and Figures 13 to 18 present the results for Seoul and Gyeonggi Province. The beta of $SIG2$ in Table 9 [2] shows that Seoul is more sensitive than Gyeonggi in terms of the price premium, and Figures 13-15 illustrate these results well. In particular, Seoul's beta of EoP has far exceeded $Mkt$'s beta since early 2020, supporting the general perception that overheated sentiment has been forming in the Seoul apartment auction market. By contrast, the effect of EoP is relatively low in Gyeonggi. In addition, these results make it possible to tell whether the outlier points in each region's auction sale rate are due to EoP.

4. Conclusion

Previous auction market studies using a bottom-up approach mainly analyzed the variables affecting the auction sale rate, or were limited in space and time to the data the authors had. In this paper, a time series analysis was carried out from the market perspective, and a top-down approach using the Fourier transform was attempted to address the problem that the court appraisal price cannot reflect the general market price at the time of the auction; the price premium effect could then be identified through the proof of each component.

In addition, it was found that the price premium effect in the auction market is produced by a momentum effect, and the time-varying beta (Kalman filter) supports this logic by showing that the price premium effect can be distinguished by region. Analyzing the vast number of individual auction cases is practically impossible, so this paper is encouraging in that it provides participants in the auction market with indicators that can be viewed from a market perspective.

However, the momentum factor requires careful interpretation. Its sensitive activity signifies not merely market rises or falls but shifts in the price relationship between the auction market and the general market. Intuitively, when the real estate market heats up, high demand narrows the gap between general market prices and auction prices.

Therefore, the role of the momentum factor can be interpreted as representing the 'popularity' of the auction market compared to the general market. To elaborate further, it can serve as an indicator to judge whether the market is overheating or cooling down in comparison to the general market.

An additional insight of this study is as follows: apart from the market price and the court appraisal effect, Korea's apartment auction market contains only the momentum factor. Macro factors such as government regulations and interest rates are already reflected in the market price, so the third variable of the auction market is the momentum factor alone, which can be very important information for participants in the auction market.

This paper could be made more rigorous if the following limitations were resolved. Monthly auction sale rate data may not be sufficient to fully support the analysis, so a wider analysis period would strengthen it. In addition, the analysis would be further supported if more data on the unidentified area could be obtained in the process of proving the court appraisal component.

References

[1] Arslan, Y., Guler, B., & Taskin, T. (2015). "Joint dynamics of house prices and foreclosures," Journal of Money, Credit and Banking, 47(1), 133-169.

[2] Clauretie, T.M., & Daneshvary, N. (2009). "Estimating the house foreclosure discount corrected for spatial price interdependence and endogeneity of marketing time," Real Estate Economics, 37(1), 43-67.

[3] Campbell, J.Y., Giglio, S., & Pathak, P. (2011). "Forced sales and house prices," American Economic Review, 101(5), 2108-2131.

[4] Forgey, F.A., Rutherford, R.C., & VanBuskirk, M.L. (1994). "Effect of foreclosure status on residential selling price," Journal of Real Estate Research, 9(3), 313-318.

[5] Jin (2010). "Is the Selling Price Discounted at the Real Estate Auction Market?" Housing Studies Review, 18(3), 93-117.

[6] Lee (2009). "True Auction Price Ratio for Condominium: The Case of Gangnam Area, Seoul, Korea." Housing Studies Review, 17(4), 233-258.

[7] Lee (2012). "Anomalies in Real Estate Markets: A Survey." Housing Studies Review, 20(3), 5-40.

[8] Mergner, S. (2009). Applications of State Space Models in Finance (pp. 17-40). Universitätsverlag Göttingen.

[9] Oh (2021). "A study on influencing factors for auction successful bid price rate of apartments in Seoul area." Journal of the Korea Real Estate Management Review, 23, 99-119.

[10] Shilling, J.D., Benjamin, J.D., & Sirmans, C.F. (1990). "Estimating net realizable value for distressed real estate," Journal of Real Estate Research, 5(1), 129-140.

[11] Springer, T.M. (1996). "Single-family housing transactions: seller motivations, price, and marketing time," Journal of Real Estate Finance and Economics, 13(3), 237-254.

[12] Wooldridge, J. M. (2015). Introductory Econometrics: A Modern Approach (pp. 83-91). Cengage Learning.

[13] Zhou, H., Yuan, Y., Lako, C., Sklarz, M., & McKinney, C. (2015). "Foreclosure discount: definition and dynamic patterns," Real Estate Economics, 43(3), 683-718.

[14] Zhou, Y., Cao, W., Liu, L., Agaian, S., & Chen, C. P. (2015). "Fast Fourier transform using matrix decomposition." Information Sciences, 291, 172-183.

Interpretable Topic Analysis

Mincheol Kim*

* Swiss Institute of Artificial Intelligence, Chaltenbodenstrasse 26, 8834 Schindellegi, Schwyz, Switzerland

Abstract

User-generated data, often characterized by its brevity, informality, and noise, poses a significant challenge for conventional natural language processing techniques, including topic modeling. User-generated data encompasses informal chat conversations, Twitter posts laden with abbreviations and hashtags, and an excessive use of profanity and colloquialisms. Moreover, it often contains "noise" in the form of URLs, emojis, and other forms of pseudo-text that hinder traditional natural language processing techniques.

This study sets out to find a principled approach to objectively identifying and presenting improved topics in short, messy texts. Topics, the thematic underpinnings of textual content, are often "hidden" within the vast sea of user-generated data and remain "undiscovered" by statistical methods, such as topic modeling.

We explore innovative methods, building upon existing work, to unveil latent topics in user-generated content. The techniques under examination include Latent Dirichlet Allocation (LDA), Reconstructed LDA (RO-LDA), Gaussian Mixture Models (GMM) for distributed word representations, and Neural Probabilistic Topic Modeling (NPTM).

Our findings suggest that NPTM exhibits a notable capability to extract coherent topics from short and noisy textual data, surpassing the performance of LDA and RO-LDA. Conversely, GMM struggled to yield meaningful results. It is important to note that the results for NPTM are less conclusive due to its extended computational runtime, limiting the sample size for rigorous statistical testing.

This study addresses the task of objectively extracting meaningful topics from such data through a comparative analysis of novel approaches.

Also, this research contributes to the ongoing efforts to enhance topic modeling methodologies for challenging user-generated content, shedding light on promising directions for future investigations.
This study presents a comprehensive methodology employing Graphical Neural Topic Models (GNTM) for textual data analysis. "Group information" here refers to topic proportions (theta). We applied a Non-Linear Factor Analysis (FA) approach to extract this intricate structure from text data, similar to traditional FA methods for numerical data.

Our research showcases GNTM's effectiveness in uncovering hidden patterns within large text corpora, with attention to noise mitigation and computational efficiency. Optimizing topic numbers via AIC and agglomerative clustering reveals insights within reduced topic sub-networks.
Future research aims to bolster GNTM's noise handling and explore cross-domain applications, advancing textual data analysis.

1. Introduction

Over the past few years, the volume of news information on the Internet has seen exponential growth. With news consumption diversifying across various platforms beyond traditional media, topic modeling has emerged as a vital methodology for analyzing this ever-expanding pool of textual data. This introduction provides an overview of the field and the seminal work of foundations.

1.1 Seminal work: topic modeling research

One of the pioneering works in news data analysis using topic modeling is Latent Dirichlet Allocation (LDA), which revolutionized the extraction and analysis of topics from textual data.

The need for effective topic modeling in the context of the rapidly growing user-generated data landscape has been emphasized. The challenges posed by short, informal, and noisy text data, including news articles, are highlighted.

There are numerous advantages of employing topic modeling techniques for news data analysis, including:

  • Topic derivation for understanding frequent news coverage.
  • Trend analysis for tracking news trends over time.
  • Identifying correlations between news topics.
  • Automated information extraction and categorization.
  • Deriving valuable insights for decision-making.

Recent advancements in the fusion of neural networks with traditional topic modeling techniques have propelled the field forward. Papers such as "Neural Topic Modeling with Continuous Neighbors" have introduced innovative approaches that warrant exploration. By harnessing deep learning and neural networks, these approaches aim to enhance the accuracy and interpretability of topic modeling.

Despite the growing importance of topic modeling, existing topic modeling methods do not sufficiently consider the context between words, which can lead to difficult interpretation or inaccurate results. This limits the usability of topic modeling. The continuous expansion of text documents, especially news data, underscores the urgency of exploring its potential across various fields. Public institutions and enterprises are actively seeking innovative services based on their data.

To address the limitations of traditional topic modeling methods, this paper proposes the Graphical Neural Topic Model (GNTM). GNTM integrates graph-based neural networks to account for word dependencies and context, leading to more interpretable and accurate topics.

1.2 Research objectives

This study aims to achieve the following objectives:

  • Present a novel methodology for topic extraction from textual data using GNTM.
  • Explore the potential applications of GNTM in information retrieval, text summarization, and document classification.
  • Propose a topic clustering technique based on GNTM for grouping related documents.

In short, the primary objectives are to present GNTM's capabilities, explore its applications in information retrieval, text summarization, document classification, and propose a topic clustering technique.

The subsequent sections of this thesis delve deeper into the methodology of GNTM, experimental results, and the potential applications in various domains. By the conclusion of this research, these contributions are expected to provide valuable insights into the efficient management and interpretation of voluminous document data in an ever-evolving information landscape.

2. Problem definition
2.1 Existing industry-specific keywords analysis

South Korea boasts one of the world's leading economies, yet its reliance on foreign demand surpasses that of domestic demand, rendering it intricately interconnected with global economic conditions[3]. This structural dependency implies that even a minor downturn in foreign economies could trigger a recession within Korea if the demand for imports from developed nations declines. In response, public organizations have been established to facilitate Korean company exports worldwide.

However, the efficacy of these services remains questionable, with South Korea's exports showing a persistent downward trajectory and a trade deficit anticipated for 2022. The central issue lies in the inefficient handling of global textual data, impeding interpretation and practical application.

Figure 1a*. Country-specific keywords
Figure 1b*. Industry-specific keywords: *Data service provided by public organization

Han, G.J(2022) scrutinized the additional features and services available to paid members through the utilization of big data and AI capabilities based on domestic logistics data[5]: Trade and Investment Big Data (KOTRA), Korea Trade Statistics Information Portal (KTSI), GoBiz Korea (SME Venture Corporation), and K-STAT (Korea Trade Association).

Regrettably, these services predominantly offer basic frequency counts, falling short of delivering valuable insights. Furthermore, they are confined to providing internal and external statistics, rendering their output less practical. While BERT and GPT have emerged as potential solutions, these models excel in generating coherent sentences rather than identifying representative topics based on company and market data and quantifying the distribution of these topics.

2.2 Proposed model for textual data handling

To address the challenge of processing extensive textual data, we introduce a model with distinct characteristics:

  1. Extraction of information from data collected within defined timeframes.
  2. A model structure producing interpretable outcomes with traceable computational pathways.
  3. Recommendations based on the extracted information.

Previous research mainly relied on basic statistics to understand text data. However, these methods have limitations, such as difficulty in determining important topics and handling large text sets, making it hard for businesses to make decisions.

Our research introduces a method for the precise extraction and interpretation of textual data meaning via a natural language processing model. Beyond topic extraction, the model will uncover interrelationships between topics, enhance text data handling efficiency, and furnish detailed topic-related insights. This innovative approach promises to more accurately capture the essence of textual data, empowering companies to formulate superior strategies and make informed decisions.

2.3 Scope and contribution

This study concentrates on the extraction and clustering of topics from textual data derived from numerous companies' news data sources.

However, its scope is confined to outlining the methodology for collecting news data from individual firms, extracting topic proportions, and clustering based on these proportions. We explicitly state the study's limitations concerning the specific topics under investigation to bolster the research's credibility. For instance, we may refrain from delving deeply into a particular topic and clarify the constraints on the generalizability of our findings.

The proposed methodology in this study holds the potential to facilitate the effective handling and utilization of this vast text data reservoir. Furthermore, if this methodology is applied to Korean exporters, it could play a pivotal role in transforming existing export support services and mitigating the recent trade deficit.

3. Literature review
3.1 Non-graph-based method
3.1.1 Latent Dirichlet Allocation (LDA)

LDA, a classic topic modeling technique, discovers hidden topics within a corpus by assigning words to topics probabilistically[2]. It uncovers hidden 'topics' within a corpus by probabilistically assigning words in documents to these topics. Each document is viewed as a mixture of topics, and each topic is characterized by a distribution of words and topic probabilities.

\[p(d|\alpha,\beta^v_{z_n}) = \int{p(\theta_d|\alpha)} \prod_{n} \sum_{z_n} p(w_{d,n}|z_n,\beta^v_{z_n})p(z_n|\theta_d)d\theta_d \]

where \(\beta\) is the \(k\times V\) topic-word matrix and \(p(w_{d,n}|z_n,\beta^v_{z_n})\) is the probability of word \(w_{d,n}\) occurring when its topic is \(z_n\).

However, LDA has a limitation known as the "independence" problem. It treats words as independent and doesn't consider their order or relationships within documents. This simplification can hinder LDA's ability to capture contextual dependencies between words. To address this, models like Word2Vec and GloVe have been developed, taking word order and dependencies into account to provide more nuanced representations of textual data.
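For reference, a standard LDA fit on a tiny corpus can be obtained with scikit-learn as below. This is a generic illustration of the model just described, not the experimental setup of this paper; the three example sentences are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stocks rallied as tech earnings beat expectations",
    "the central bank raised interest rates again",
    "cloud revenue and ai products drove quarterly growth",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)               # document-term counts (bag of words)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)                     # per-document topic proportions
beta = lda.components_                           # topic-word weights (k x V)

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(beta):
    top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {top_terms}")
```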

3.1.2 Latent Semantic Analysis (LSA)

LSA is a method to uncover the underlying semantic structure in textual data. It achieves this by assessing the semantic similarity between words using document-word matrices[4]. LSA's fundamental concept involves recognizing semantic connections among words based on their distribution within a document. To accomplish this, LSA relies on linear algebra techniques, particularly Singular Value Decomposition (SVD), to condense the document-word matrix into a lower-dimensional representation. This process allows semantically related words or documents to be situated in proximity within this reduced space.

\[X=U\Sigma V^T\]

\[Sim(Q,X)=R=Q^T X\]

where \(X\) is \(t \times d\) matrix, a collection of d documents in a space of t dictionary terms. \(Q\) is \(t \times q\) matrix, a collection of q documents in a space of t dictionary terms.

\(U\) is term eigenvectors and \(V\) is document eigenvectors.

LSA, an early form of topic modeling, excels at identifying semantic similarities among words. Nonetheless, it has its limitations, particularly in its inability to fully capture contextual information and word relationships.
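A compact LSA sketch using truncated SVD on a TF-IDF matrix; the documents below are placeholders, and scikit-learn's document-term orientation is the transpose of the \(t \times d\) matrix \(X\) above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "semiconductor supply chains are recovering",
    "chip makers expand fabrication capacity",
    "streaming services compete for subscribers",
]

# scikit-learn builds a d x t document-term matrix (the transpose of X in the text)
X = TfidfVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = svd.fit_transform(X)                  # documents in the reduced latent space

# semantically related documents end up close together in the reduced space
print(cosine_similarity(doc_vecs))
```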

3.1.3 Neural Topic Model (NTM)

Traditional topic modeling has limitations, including sensitivity to initialization and challenges related to unigram topic distribution. The Neural Topic Model (NTM) bridges topic modeling and deep learning, aiming to enhance word and document representations to overcome these issues.

At its core, NTM seamlessly combines word and document representations by embedding topic modeling within a neural network framework. While preserving the probabilistic nature of topic modeling, NTMs represent words and documents as vectors, leveraging them as inputs for neural networks. This involves mapping words and documents into a shared latent space, accomplished through separate neural networks for word and document vectors, ultimately leading to the computation of the topic distribution.

The computational process of NTM includes training using back-propagation and inferring topic distribution through Bayesian methods and Gibbs sampling.

\[p(w|d) = \sum^K_{i=1} p(w|t_i)p(t_i|d)\]

where \(t_i\) is a latent topic and \(K\) is the pre-defined topic number. Let \(\pi(w) = [p(w|t_1), \dots , p(w|t_K)]\) and \(\theta(d) = [p(t_1|d), \dots, p(t_K|d)]\), where \(\pi\) is shared among the corpus and \(\theta\) is document-specific.

Then above equation can be represented as the vector form:

\[p(w|d) = \pi(w) \times \theta^T(d) \]
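The vector form reduces to a matrix-vector product; a toy numpy check with randomly drawn distributions (arbitrary sizes, not a trained NTM) is:

```python
import numpy as np

K, V = 3, 5                                          # topics, vocabulary size
rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(V), size=K).T             # pi[w, k] = p(w | t_k); columns sum to 1
theta = rng.dirichlet(np.ones(K))                    # theta[k] = p(t_k | d) for one document

p_w_given_d = pi @ theta                             # p(w|d) = sum_k p(w|t_k) p(t_k|d)
assert np.isclose(p_w_given_d.sum(), 1.0)            # a proper distribution over the vocabulary
```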

3.2 Graph-based methods
3.2.1 Global random topic field

To capture word dependencies within a document, the graph structure incorporates topic assignment relationships among words to enhance accuracy[9].

GloVe-derived word vectors are mapped to Euclidean space, while the document's internal graph structure, identified as the Word Graph, operates in a non-Euclidean domain. This enables the Word Graph to uncover concealed relationships that traditional Euclidean numerical data representation cannot reveal.

Calculating the "structure representing word relationships" involves employing a Global Random Field (GRF) that encodes the graph structure in the document using topic weights of words and the topic connections in the graph's edges. The GRF formula is as follows:

\[ p(G) = f_G (g) = \frac{1}{|E|} \phi(z_W) \sum_{(w', w'') \in E} \phi(z_{w'}, z_{w''}) \]

The above-described Global Topic-Word Random Field (GTRF) shares similarities with the GRF. In the GTRF, the topic distribution (z) becomes a distribution conditional on \(\theta\). Learning and inference in this model closely resemble the EM algorithm. The outcome, denoted as \(p_{GTRF}(z|\theta)\), represents the probability of the graph structure considering whether neighboring words (w' and w'') are assigned to the same topic or different topics. This is expressed as:

\[ p_{GTRF}(z|\theta) = \frac{1}{|E|} Multi(z_W|\theta) \times \sum_{(w', w'') \in E} (\sigma_{z_{w'} = z_{w''}}\lambda_1 + \sigma_{z_{w'} \neq z_{w''}}\lambda_2) \]

where \(\sigma_{x}\) is an indicator function that returns 1 if the condition \(x\) is true and 0 otherwise.

3.2.2 GraphBTM

While LDA encounters challenges related to data sparsity, particularly when modeling short texts, the Biterm Topic Model (BTM) faces limitations in its expressiveness, especially when dealing with documents containing diverse topics[13]. Additionally, BTM relies on biterms in conjunction with the co-occurrence features of words, which restricts its suitability for modeling longer texts.

To address these limitations, the Graph-Based Biterm Topic Model (GraphBTM) was developed. GraphBTM introduces a graphical representation of biterms and employs Graph Convolutional Networks (GCN) to extract transitive features, effectively overcoming the shortcomings associated with traditional models like LDA and BTM.

GraphBTM's computational approach relies on Amortized Variational Inference. This method involves sampling a mini-corpus to create training instances, which are subsequently used to construct graphs and apply GCN. The inference network then estimates the topic distribution, which is vital for training the model. Notably, this approach has demonstrated the capability to achieve higher topic consistency scores compared to traditional Auto-Encoding Variational Bayes (AEVB)-based inference methods.

3.2.3 Graphical Neural Topic Model (GNTM)

LDA, in its conventional form, makes an assumption of independence. It posits that each document is generated as a blend of topics, with each topic representing a distribution over the words within the document. However, this assumption of conditional independence, also known as exchangeability, overlooks the intricate relationships and context that exist among words in a document.

The Neural Variational Inference (NVI) algorithm presents a departure from this independence assumption. NVI is a powerful technique for estimating the posterior distribution of latent topics in text data. It leverages a neural network structure, employing a reparameterization trick to accurately estimate the genuine posterior distribution for a wide array of distributions.

\[\alpha\ (\text{prior}) \rightarrow z\ (\text{topic from } \theta) \rightarrow G_d\ (\text{structure}) \rightarrow V\ (\text{word set}) \]

\[p(G^0_d|Z_d;M) = \prod_{(n,n') \in E^0_d} m_{z_{d,n},z_{d,n'}} \prod_{(n,n') \notin E^0_d} (1-m_{z_{d,n},z_{d,n'}})\]

\[p(G_d, \theta_d, Z_d;\alpha) = p(V_d|Z_d,G^0_d)p(G^0_d|Z_d)\prod^{N_d}_{n=1} p(z_{d,n}|\theta_d)p(\theta_d|\alpha) \]

Unlike the Variational Autoencoder (VAE), which is primarily employed for denoising and data restoration and can be likened to an 'encoder + decoder' architecture, NVI serves a broader purpose and can handle a more extensive range of distributions. It's based on the mean-field assumption and employs the Laplace approximation method, replacing challenging distributions like the Dirichlet distribution with the computationally efficient logistic normal distribution[8].

Based on the mean-field assumption:

\[q(\theta_d,Z_d|G_d) = q(\theta_d|G_d;\mu_d, \delta_d) \prod^{N_d}_{n=1} q(z_{d,n}|G_d,w_{d,n};\varphi_{d,n})\]

\[L_d = E_{q(Z_d|G_d)} [\log p(G^0_d|Z_d;M) + \log p(V_d|Z_d, G^0_d;\beta)] - KL[q(\theta_d|G_d)||p(\theta_d)] - E_{q(\theta_d|G_d)}\sum^{N_d}_{n=1} KL[q(z_{d,n}|G_d, w_{d,n})||p(z_{d,n}|\theta_d)]
\]

This substitution simplifies parameter estimation, making it more tractable and readily differentiable. In the context of the Global Neural Topic Model (GNTM), the logistic normal distribution facilitates the approximation of correlations between latent variables, allowing for the utilization of dependencies between topics. Additionally, the Evidence Lower Bound (ELBO) in NVI is differentiable in closed-form, enhancing its applicability.

The concept of topic proportion is represented by the equation:

\[\theta_d = \text{softmax}(N(\mu_d, \delta_d^2))\]

\[f_X(x;\mu,\sigma) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(\operatorname{logit}(x)-\mu)^2}{2\sigma^2}}\frac{1}{x(1-x)}\]

This equation encapsulates the distribution of topics within a document, reflecting the proportions of different topics in that document.
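A small sketch of drawing topic proportions via this logistic-normal construction (a softmax applied to a Gaussian draw); the mean and covariance below are arbitrary placeholders rather than fitted GNTM parameters.

```python
import numpy as np

def sample_topic_proportions(mu, cov, size=1, seed=None):
    """Draw theta = softmax(eta) with eta ~ N(mu, cov): a logistic normal on the simplex."""
    rng = np.random.default_rng(seed)
    eta = rng.multivariate_normal(mu, cov, size=size)
    e = np.exp(eta - eta.max(axis=1, keepdims=True))   # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

K = 4
mu = np.zeros(K)
cov = 0.5 * np.eye(K) + 0.1 * np.ones((K, K))          # off-diagonal terms allow correlated topics
theta = sample_topic_proportions(mu, cov, size=3, seed=0)
print(theta.sum(axis=1))                                # each row sums to 1
```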

Figure 2. Transformation of logit-normal distribution after conversion
3.3 Visualization techniques
3.3.1 Fast unfolding of communities in large networks

This algorithm aids in detecting communities within topic-words networks, facilitating interpretation and understanding of topic structures.

3.3.2 Uniform Manifold Approximation and Projection (UMAP)

UMAP is a nonlinear dimensionality reduction technique that preserves the underlying structure and patterns of high-dimensional data while efficiently visualizing it in lower dimensions. It outperforms traditional methods like t-SNE in preserving data structure.

3.3.3 Agglomerative Hierarchical Clustering

Hierarchical clustering is an algorithm that clusters data points, combining them based on their proximity until a single cluster remains. It provides a dynamic and adaptive way to maintain cluster structures, even when new data is added.

Additionally, several evaluation metrics, including the Silhouette score, Calinski-Harabasz index, and Davies-Bouldin index, assist in selecting the optimal number of clusters for improved data understanding and analysis.
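A sketch of choosing the number of clusters over document topic proportions using the three metrics above; the matrix `theta` is a random stand-in for the document-topic proportions produced by the model.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(10), size=200)           # stand-in document-topic proportions

for k in range(2, 8):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(theta)
    print(
        k,
        round(silhouette_score(theta, labels), 3),         # higher is better
        round(calinski_harabasz_score(theta, labels), 1),  # higher is better
        round(davies_bouldin_score(theta, labels), 3),     # lower is better
    )
```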

4. Method
4.1 Graphical Neural Topic Model(GNTM) as Factor analysis

GNTM can be viewed from a factor analysis perspective, as it employs concepts similar to factor analysis to unveil intricate interrelationships in data and extract topics. GNTM can extract \(\theta\), which signifies the proportion of topics in each document, for summarizing and interpreting document content. In this case, \(\theta\) follows a logistic normal distribution, enabling the probabilistic modeling of topic proportions.

The \(\theta\) can be represented as follows[1][7]:

\[ \tilde{\theta} \sim \text{LN}(\mu, \sigma^2) \]

For \(0 < x_i < 1\) and \(\sum_{i=1}^K x_i = 1\):

\[ y = [\log(\frac{x_1}{x_K}), ..., \log(\frac{x_{K-1}}{x_K})]^T \]

Probability Density Function (PDF) for \(X\):

\[ f_X(x; \mu, \Sigma) = \frac{1}{|2 \pi \Sigma|^{\frac{1}{2}}} \frac{1}{\prod^K_{i=1} x_i (1-x_i)} e^{-\frac{1}{2} \{ \log (\frac{x}{1-x}) - \mu \}^T \Sigma^{-1} \{ \log(\frac{x}{1-x}) - \mu \}} \]

where the log and division in the argument are element-wise. This is due to the diagonal Jacobian matrix of the transformation with elements \(\frac{1}{{x_i}{(1-x_i)}}\)

GNTM shares similarities with factor analysis, which dissects complex data into factors associated with each topic to unveil the data's structure. In factor analysis, the aim is to explain observed data using latent factors. Similarly, GNTM treats topics in each document as latent variables, and these topics contribute to shaping the word distribution in the document. Consequently, GNTM decomposes documents into combinations of words and topics, offering an interpretable method for understanding document similarities and differences.

4.2 Akaike Information Criteria (AIC)

The Akaike Information Criterion (AIC) is a crucial statistical technique for model selection and comparison, evaluating the balance between a model's goodness of fit and its complexity. AIC aids in selecting the most appropriate model from a set of models.

In the context of this thesis, AIC is employed to assess the fit of a Graphical Network Topic Model (GNTM) and determine the optimal model. Since GNTMs involve parameters related to the number of topics in topic modeling, selecting the appropriate number of topics is a significant consideration. AIC assesses various GNTM models based on the choice of the number of topics and assists in identifying the most suitable number of topics.

AIC can be represented by the following formula:

\[ AIC = -2 \cdot \text{log-likelihood} + 2 \cdot \text{number of parameters} \]

Where:

  • The \(\text{log-likelihood}\) is a measure of the goodness of fit of the model to explain the data.
  • Number of parameters indicates the count of parameters in the model.

AIC weighs the tradeoff between a model's log-likelihood and the number of parameters, which reflects the model's complexity. Lower AIC values indicate better data fit while favoring simpler models. Therefore, the model with the lowest AIC is considered the best. AIC plays a pivotal role in enhancing the quality of topic modeling in GNTM by assisting in managing model complexity when choosing the number of topics.

For our current model, in which the topic proportions follow a logistic normal distribution, we use the GNTM log-likelihood

\[ l(\theta|D) = \sum_{d=1}^D \left[ -\frac{1}{2}\log (|2\pi \Sigma|) - \sum_{k=1}^K \log(\theta_{d,k} (1 - \theta_{d,k})) - \frac{1}{2} \left(\log\left(\frac{\theta_d}{1-\theta_d}\right) - \mu\right)^T \Sigma^{-1} \left(\log\left(\frac{\theta_d}{1-\theta_d}\right) - \mu\right)\right] \]

so that the criterion becomes:

\[ AIC = -2 \cdot l(\theta|D) + 2 \cdot \text{number of topics} \]

This encapsulates the essence of GNTM and AIC in evaluating and selecting models.
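In code form, the criterion can be computed directly from the log-likelihood expression above. The helper below follows the formula as written in this section, with a single mean vector for simplicity; it is an illustrative sketch, not the implementation used in the experiments.

```python
import numpy as np

def logistic_normal_loglik(theta, mu, cov):
    """Log-likelihood of document topic proportions under the logit-normal density above.

    theta : (D, K) topic proportions with entries strictly in (0, 1)
    mu    : (K,)   mean of the underlying Gaussian
    cov   : (K, K) covariance of the underlying Gaussian
    """
    logit = np.log(theta / (1.0 - theta))
    diff = logit - mu
    cov_inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(2.0 * np.pi * cov)           # log |2*pi*Sigma|
    quad = np.einsum("dk,kj,dj->d", diff, cov_inv, diff)       # Mahalanobis term per document
    jac = np.sum(np.log(theta * (1.0 - theta)), axis=1)        # Jacobian correction term
    return float(np.sum(-0.5 * logdet - jac - 0.5 * quad))

def aic(loglik, n_topics):
    # AIC = -2 * log-likelihood + 2 * number of parameters (here, the topic count)
    return -2.0 * loglik + 2.0 * n_topics
```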

5. Result
5.1 Model setup
5.1.1 Data

The data consists of news related to the top 200 companies by market capitalization on the NASDAQ stock exchange. These news articles were collected by crawling Newsdata.io in August. Analyzing this data can provide insights into the trends and information about companies that occurred in August. Having a specific timeframe like August helps in interpreting the analysis results clearly.

To clarify the research objectives, companies with fewer than 10 articles collected were excluded from the analysis. Additionally, a maximum of 100 articles per company was considered. As a result, a total of 13,896 documents were collected, and after excluding irrelevant documents, 13,816 were used for the analysis. The data format is consistent with the "20 News Groups" dataset, and data preprocessing methods similar to those in Shen(2021)[10] were applied. This includes steps like removing stopwords, abbreviations, punctuation, tokenization, and vectorization. You can find examples of the data in the Appendix.

5.1.2 Parameters

"In our experiments, as the dataset contained a large number of words and edges, it was necessary to reduce the number of parameters for training while minimizing noise and capturing important information. To achieve this, we set the threshold for the number of words and edges to 140 and 40, respectively, which is consistent with the configuration used in the BNC dataset, a similar dataset. The experiments were conducted in an RTX3060 GPU environment using the CUDA 11.8 framework, with a batch size of 25. To determine the optimal number of topics, we calculated and compared AIC values for different numbers of topics. Based on the comparison of AIC values, we selected 20 as the final number of topics."

5.2 Evaluation
5.2.1 AIC
Figure 3. Changes in AIC values depending on the number of topics

AIC is used in topic modeling as a tool to select the optimal number of topics. However, AIC is a relative number and may vary for different data or models. Therefore, when using AIC to determine the optimal number of topics, it is important to consider how this metric applies to your data and model.

In our study, we calculated the AIC for a given dataset and model architecture and used it to select the optimal number of topics. This approach served as an important metric for finding the best number of topics for our data. The AIC was used to evaluate the goodness of fit of our model, allowing us to compare the performance of the model for different numbers of topics.

Additionally, AIC allows us to evaluate the performance of our model in comparison to AICs obtained from other models or other datasets. This allows us to determine the relative superiority of our model and highlights that we can perform optimized hyperparameter tuning for our own data and model, rather than comparing to other models. This approach is one of the key strengths of our work, contributing to a greater emphasis on the effective utilization and interpretation of topic models.

5.2.2 Topic interpretation
5.2.3 Classification
Figure 4a*. 10 Topics graph
Figure 4b*. 30 Topics graph: *The result of Agglomerative Clustering

In our study, we leveraged Agglomerative Clustering and UMAP to classify and visualize news data. In our experiments, we found that news is generally better classified when the number of topics is 10. These results suggest that the model is able to group and interpret the given data more effectively.

However, when the number of topics is increased, broader topics tend to be categorized into more detailed topics. This results in news content being broken down into relatively more detailed topics, but the main themes may not be more apparent.

Figure 5a*. UMAP graph with 10 topics
Figure 5b*. UMAP graph with 20 topics
Figure 5c*. UMAP graph with 30 topics: *The result of Agglomerative Clustering

Also, as the number of topics increases, the difference in the proportion of topics that represent the nature of the news increases. This indicates a hierarchy between major and minor topics, which can be useful when you want to fine-tune your investigation of different aspects of the news. This diversity provides important information for detailed topic analysis in context.

Therefore, when choosing the number of topics, we need to consider the balance between major and minor topics. By choosing the right number of topics, the model can best understand and interpret the given data, and we can tailor the results of the topic analysis to reflect the key features of the news content.

6. Discussion
6.1 Limitation

Even though this paper has contributed to addressing various challenges related to textual data analysis, it is essential to acknowledge some inherent limitations in the proposed methodology:

  1. Noise Edges Issue
    The modeling approach used in this paper introduces a challenge related to noise edges in the data, which can be expected when dealing with extensive corpora or numerous documents from various sources.
    To effectively mitigate this noise issue, it is crucial to implement regularization techniques tailored to the specific objectives and nature of the data. Approaches such as the one proposed by Zhu et al. (2023)[12] enhanced the model's performance by more efficiently discovering hidden topic distributions within documents.
  2. Textual Data Versatility
    While this paper focuses on extracting and utilizing the topic latent space from text data, it is worth noting that textual data analysis can have diverse applications across various fields.
    In addition to hierarchical clustering, there is potential to explore alternative recommendation models, such as Matrix Factorization methods like NGCF (Neural Graph Collaborative Filtering)[11] and LightGCN (Light Graph Convolutional Network)[6], which utilize techniques like Graph Neural Networks (GNN) for enhancing recommendation performance.

Acknowledging these limitations is essential for a comprehensive understanding of the proposed methodology's scope and areas for potential future research and improvement.

6.2 Future work

While this study has made significant strides in addressing key challenges in the analysis of textual data and extracting valuable insights through topic modeling, there remain several avenues for future research and improvement:

  1. Enhanced Noise Handling
    The modeling used has shown promise but is not immune to noise edge issues often encountered in extensive datasets. In this study, we used a dataset comprising approximately 9,000 news articles from 194 countries, totaling around 5 million words. To mitigate these noise edge issues effectively, future work can focus on developing advanced noise reduction techniques or data preprocessing methods tailored to specific domains, further enhancing the quality of extracted topics and insights.
  2. Cross-Domain Application
    While the study showcased its effectiveness in the context of news articles, extending this approach to other domains presents an exciting opportunity. Adapting the model to different domains may require domain-specific preprocessing and feature engineering, as well as considering transfer learning approaches. Models based on Graph Neural Networks (GNN) and Matrix Factorization, such as Neural Graph Collaborative Filtering (NGCF) and LightGCN, can be employed to enhance recommendation systems and knowledge discovery in diverse fields. This cross-domain versatility can unlock new possibilities for leveraging textual data to extract meaningful insights and improve decision-making processes across various industries and research domains.

7. Conclusion

In the context under discussion, the term "group information" pertains to the topic proportions represented by theta. From my perspective, I have undertaken an endeavor that can be characterized as Non-Linear Factor Analysis (FA) applied to textual data, analogous to traditional FA methods employed with numerical data. This undertaking proved intricate due to the inherent non-triviality in its extraction, thus warranting the classification as Non-Linear FA. (Indeed, there exists inter-topic covariance.)

Hitherto, the process has encompassed the extraction of information from textual data, a task which may appear formidable for utilization. This encompasses the structural attributes of words and topics, the proportions of topics, as well as insights into the prior distribution governing topic proportions. These constituent elements have facilitated the quantitative characterization of information within each group.

A central challenge encountered in the realm of conventional Principal Component Analysis (PCA) and FA techniques lies in the absence of definitive answers, given our inherent limitations. Consequently, the interpretation of the extracted factors poses formidable challenges and lacks assuredness. However, the GNTM methodology applied to this paper, in tandem with textual data, furnishes a network of words for each factor, thereby affording a means for expeditious interpretation.

If certain words assume preeminence within Topic 1, they afford a basis for interpretation, which aligns with the intention of GNTM. In effect, this model facilitates the observation of pivotal terms within each topic (factor) and aids in the explication of their conceptual representations.

This research has presented a comprehensive methodology for the analysis of textual data using Graphical Neural Topic Models (GNTM). The paper discussed how GNTM leverages the advantages of both topic modeling and graph-based techniques to uncover hidden patterns and structures within large text corpora. The experiments conducted demonstrated the effectiveness of GNTM in extracting meaningful topics and providing valuable insights from a dataset comprising news articles.

In conclusion, this research contributes to advancing the field of textual data analysis by providing a powerful framework for extracting interpretable topics and insights. The combination of GNTM and future enhancements is expected to continue facilitating knowledge discovery and decision-making processes across various domains.

Nevertheless, a pertinent concern arises about the inordinate amount of noise pervading newspaper data, and indeed most raw data. Traditional methodologies employ noise mitigation techniques such as Non-Negative Matrix Factorization (NMF) and the execution of numerous epochs for the extraction of salient tokens. In the context of this research, as aforementioned, the absence of temporal constraints allowed for the execution of as many epochs as deemed necessary.

However, computational efficiency was bolstered through the reduction in the number of topics, while retaining the primary objectives from a clustering perspective by finding the optimal number of topics via AIC and agglomerative clustering. This revealed that, when the number of topics is reduced, words associated with the original topics appear within sub-networks of the reduced topics.

Future research can further enhance the capabilities of GNTM by improving noise handling techniques and exploring cross-domain applications.

References

[1] Atchison, J., and Shen, S. M. Logistic-normal distributions: Some properties and uses.
Biometrika 67, 2 (1980), 261–272.

[2] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent dirichlet allocation. Journal of machine
Learning research 3, Jan (2003), 993–1022.

[3] Choi, M. J., and Kim, K. K. Import demand in developed economies. In Economic Analysis
(Quarterly) (2019), vol. 25, Economic Research Institute, Bank of Korea, pp. 34–65.

[4] Evangelopoulos, N. E. Latent semantic analysis. Wiley Interdisciplinary Reviews: Cognitive
Science 4, 6 (2013), 683–692.

[5] Han, K. J. Analysis and implications of overseas market provision system based on domestic
logistics big data. KISDI AI Outlook 2022, 8 (2022), 17–30.

[6] He, X., Deng, K., Wang, X., Li, Y., Zhang, Y., and Wang, M. Lightgcn: Simplifying and
powering graph convolution network for recommendation. In Proceedings of the 43rd International
ACM SIGIR conference on research and development in Information Retrieval (2020), pp. 639–
648.

[7] Hinde, J. Logistic Normal Distribution. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011,
pp. 754–755.

[8] Kingma, D. P., and Welling, M. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114 (2013).

[9] Li, Z., Wen, S., Li, J., Zhang, P., and Tang, J. On modelling non-linear topical dependencies.
In Proceedings of the 31st International Conference on Machine Learning (Bejing, China,
22–24 Jun 2014), E. P. Xing and T. Jebara, Eds., vol. 32 of Proceedings of Machine Learning
Research, PMLR, pp. 458–466.

[10] Shen, D., Qin, C., Wang, C., Dong, Z., Zhu, H., and Xiong, H. Topic modeling revisited:
A document graph-based neural network perspective. Advances in neural information processing
systems 34 (2021), 14681–14693.

[11] Wang, X., He, X., Wang, M., Feng, F., and Chua, T.-S. Neural graph collaborative
filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and
Development in Information Retrieval (jul 2019), ACM.

[12] Zhu, B., Cai, Y., and Ren, H. Graph neural topic model with commonsense knowledge.
Information Processing Management 60, 2 (2023), 103215.

[13] Zhu, Q., Feng, Z., and Li, X. Graphbtm: Graph enhanced autoencoded variational inference
for biterm topic model. In Proceedings of the 2018 conference on empirical methods in natural
language processing (2018), pp. 4663–4672.

Appendix

News Data Example
Google courts businesses with ramped up cloud AI Synopsis The internet giant unveiled new AI-powered features for data searches, online collaboration, language translation, images and more at its first annual Cloud Next conference held in-person since 2019. AP Google on Tuesday said it was weaving artificial intelligence (AI) deeper into its cloud offerings as it vies for the business of firms keen to capitalize on the technology. The internet giant unveiled new AI-powered features for data searches, online collaboration, language translation, images and more at its first annual Cloud Next conference held in-person since 2019. Elevate Your Tech Process with High-Value Skill Courses Offering College Course Website Indian School of Business ISB Product Management Visit Indian School of Business ISB Digital Marketing and Analytics Visit Indian School of Business ISB Digital Transformation Visit Indian School of Business ISB Applied Business Analytics Visit The gathering kicked off a day after OpenAI unveiled a business version of ChatGPT as tech companies seek to keep up with Microsoft , which has been ahead in powering its products with AI. "I am incredibly excited to bring so many of our customers and partners together to showcase the amazing innovations we have been working on," Google Cloud chief executive Thomas Kurian said in a blog post. Most companies seeking to adopt AI must turn to the cloud giants -- including Microsoft, AWS and Google -- for the heavy duty computing needs. Those companies in turn partner up with AI developers -- as is the case of a major tie-up between Microsoft and ChatGPT creator OpenAI -- or have developed their own models, as is the case for Google.

MDSA, 2023 1st seminar


The first seminar of the Data Science Management Association was held at Forest Hall on May 12, 2023 / Photo = Data Science Management Association

The Data Science Management Association successfully held the ‘Data Science Management Association 2023 1st Seminar’ on the 12th at Yeoksam Forest Hall under the theme of ‘Corporate Management Activities of AI Algorithms’.

The seminar proceeded in the order of topic presentations, Q&A, and a general discussion. President Ho-yong Choi opened with the first presentation, followed in turn by academic members Jeong-hoon Song, Hye-young Park, Bo-hyun Yoo, Min-cheol Kim, and Jeong-woo Park, and finally by Gyeong-hwan Lee, CEO of Pabii.

First, President Ho-yong Choi gave a presentation on ‘Deep Learning as Solution Methods in Finance’ and introduced how machine learning and deep learning techniques can be used to find solutions to partial differential equations related to the cash asset dividends of big tech companies.

Under the theme of ‘Monthly electricity/gas usage forecast for each building’, academic member Jeong-hoon Song pointed out the problems with existing electricity/gas usage forecasts and introduced a model that predicts monthly energy usage more accurately using statistical techniques based on the off-diagonal components of the second moment matrix.

Under the topic ‘Is bubble in housing auction market really bubble?’, academic member Hye-young Park defined the difference between the first- and second-place bids in the auction market as a ‘bubble index’ and explained the process of verifying, through statistical testing, whether bubbles exist in the real estate sales and auction markets.

Academic member Bo-hyun Yoo introduced a paper on the topic of ‘Discount/surcharge and momentum in the real estate auction market’, in which the factors that make up the winning bid rate in the real estate auction market were extracted using the Fourier transform and the results were statistically verified.

Under the theme of ‘Interpretable Topic Analysis,’ academic member Mincheol Kim discussed a true ‘big data’ service that can be of practical help in matching between overseas buyers and domestic companies.

Under the theme of ‘Advertising time series modeling under measurement error’, academic member Jeong-woo Park introduced an advertising performance prediction model that statistically verifies and corrects the impact of measurement error contained in digital advertising user data.

Lastly, under the topic of ‘Use and Limitations of ChatGPT’, Professor Keith Lee discussed the interpretation of the mathematical models behind the recently controversial ChatGPT, along with application cases and expected uses.

In the general discussion that followed, SIAI (Swiss Institute of Artificial Intelligence) students and MDSA academic members had a heated discussion about the direction of innovation and development in the Korean data science industry.


MDSA Korean AI/DS news journal publication as of April 2023


The Managerial Data Science Association (MDSA) has been operating an online magazine since April 1, 2023.

SIAI Professor Kyung-hwan Lee, one of the founders of the society, donated the Internet media company registered with Seoul City Hall to MDSA in October 2020, and MDSA took over its operation as of April 1, 2023.

Subsequently, MDSA was incorporated under the Global Institute of Artificial Intelligence (GIAI), and the name of the journal was confirmed as GIAI R&D Korea, referring to GIAI’s Korean research institute. GIAI is a group of global researchers and already runs its own research institute in Europe, which operates more specialized academic paper sharing and expert contributions than GIAI R&D Korea under the name GIAI R&D. GIAI R&D Korea will also share Korean translations of some of the content operated by GIAI R&D.

To ensure the independence of the journal’s editorial opinion, ownership has been transferred to an independent corporation under MDSA, but the election of the editor-in-chief and the verification of AI/data science knowledge are supervised by the MDSA board of directors. SIAI Professor Gyeong-Hwan Lee, who is in charge of MDSA’s audit, said that he referenced the structure of Newstapa, a Korean outlet with a reputation for investigative reporting that operates an independent professional journal under the supervision of a non-profit corporation.


MDSA 2023 Brunch seminar


On Mar 18, the Managerial Data Science Association (MDSA) held a small seminar to commemorate the establishment of the corporation.

Next, we plan to hold a second small seminar in April and then confirm the presenters for the society’s official seminar in May.

In the discussion that day, it was decided that the society’s May seminar would be held on May 12.


MDSA Official foundation


The Managerial Data Science Association (MDSA), chaired by KAIST technology management professor Ho-yong Choi, announced on February 9 that it had received permission to establish an incorporated association from Seoul City Hall. Subsequently, the corporation was established on March 9th.

KAIST technology management professor Choi Ho-yong, president of the society, was appointed as a director, along with Kookmin University College of Economics professor Kim Jae-jun and Korea University technology management professor Kim Jong-myeon. In addition, Professor Kyunghwan Lee of the Swiss Institute of Artificial Intelligence (SIAI), director of the Global Institute of Artificial Intelligence (GIAI) research institute, will serve as auditor.

With the private contribution of SIAI Professor Gyeong-hwan Lee, MDSA will operate a specialized journal under the academic society from April 1st. The first seminar after the establishment of the corporation is scheduled to be held in May.


English as a Second Language (ESL) Support for the Class of 2023

SIAI has provided a non-credit English as a Second Language (ESL) support course for the class of 2023.

  • Sep 1, 2022 ~ Feb 28, 2023
  • Teacher: Jeniifer Mi (University of Nottingham, CELTA certified)


Data Science Management Association (MDSA) 2023 inaugural general meeting


The Data Science Management Association (MDSA) announced on the 7th that it held its 2023 inaugural general meeting.

The society, which has been in the process of obtaining approval as a non-profit corporation since April of last year, will conclude its 2022 activities and, in line with the timing of corporate approval this year, will proceed with operating a professional journal, holding seminars, and running AI/data science education activities under the society.


Data Science Management Society 2022 2nd Establishment General Meeting


The Managerial Data Science Association (MDSA) held its founding general meeting on Saturday, August 27, 2022.

KAIST technology management professor Choi Ho-yong, president of the society, was appointed as a director, along with Kookmin University College of Economics professor Kim Jae-jun and Korea University technology management professor Kim Jong-myeon. In addition, Professor Kyunghwan Lee of the Swiss Institute of Artificial Intelligence (SIAI), director of the Global Institute of Artificial Intelligence (GIAI) research institute, will serve as auditor.

The four professors announced that they established MDSA to support the application of data science to corporate management. In particular, the main purposes of the society are to operate a specialized journal, hold academic seminars, and provide basic education to improve the state of AI in Korea, which currently centers on simple computer programming.

The society initially held a founding general meeting in April, but announced that it held a second founding general meeting in accordance with the requirements of Seoul City Hall, the licensing agency. The society is awaiting approval for establishment as a non-profit corporation from the Ministry of Trade, Industry and Energy through Seoul City Hall.
