Jeongwoo Park (MSc Data Science, 2023)

Ⅰ. The Digital Advertising Market is Facing the Issue of Measurement Error

Digital advertising has been growing explosively year after year. During the global pandemic in particular, as the offline market contracted sharply, consumers' shift to online platforms made digital advertising the mainstream of the global advertising market.

The core of digital advertising is undoubtedly the smartphone. With the ability to access the web anytime and anywhere via smartphones, internet-based media have emerged in the advertising market. Particularly, app-based platform services that offer customized user experiences have surged rapidly, significantly contributing to the growth of digital advertising. This market has been driven by the convenience smartphones offer compared to traditional devices like PCs and tablets.

However, the digital advertising industry is currently grappling with the issue of "Measurement Error". This problem causes significant disruptions in accurately measuring and predicting advertising performance.

The rapidly growing digital advertising market

The key difference between digital advertising and traditional advertising is the ability to track results. In traditional advertising, companies could only estimate performance along the lines of "we advertised on a platform seen by thousands of people per day," using it as a rough proxy for brand awareness. As a result, even when advertising agencies tried to analyze performance, various kinds of noise made outcomes hard to assess accurately, often leaving clients dissatisfied.

With the advent of the web, advertising entered a new phase. When users access websites, their information is stored in cookies, allowing advertisers to instantly track which ads users viewed, which products they looked at, and what they purchased. As a result, companies can now easily verify how effective their ads are. Furthermore, they can compare multiple ads and quickly determine the direction of future campaigns.

The advent of smartphones has accelerated this paradigm shift. Unlike in the past when multiple people shared a single PC or tablet, we are now in the era of "one person, one smartphone", allowing behavior patterns on specific devices to be attributed to individual users. In fact, according to a 2022 Gallup Korea survey, the smartphone penetration rate among Korean adults was 97%. In recent years, many companies have introduced hyper-personalized targeting services to the public, signaling a major shift in the digital advertising market.

Issue in Digital Advertising: Measurement Error

However, everything has its pros and cons, and digital advertising is no exception. Industry professionals point out that the effectiveness of digital ads is hindered by factors such as user fatigue and privacy concerns. From my own experience in the advertising industry, the issue that stands out the most is "measurement error".

Measurement error refers to data being distorted due to specific factors, resulting in outcomes that differ from the true values. Common issues in the industry include users being exposed to the same ad multiple times in a short period, leading to insignificant responses, or fraudulent activities where malicious actors create fake ad interactions to gain financial benefits. Additionally, technical problems such as server instability can cause user data to be double counted, omitted, or delayed. For various reasons, the data becomes "contaminated", preventing advertisers from accurately assessing ad performance.

Of course, the media companies that deliver the ads are not idle either. They continuously update their advertising reports, correcting inaccurate data on ad spend, impressions, clicks, and other performance metrics. During this process, the reported ad performance can keep changing for up to a week.

The problem is that for those of us on the demand side, for whom accurate measurement of ad performance is crucial, measurement error leads to an endogeneity issue in performance analysis, significantly reducing the reliability of the analysis. Simply put, because the reports keep being revised due to measurement errors, it becomes difficult to analyze ad performance accurately.

Even in parts of the advertising industry where the focus is not on measuring past performance but on predicting future values, measurement error remains a significant issue. This is because measurement error increases the variance of the residuals, reducing the model's goodness of fit. Additionally, when the magnitude of the measurement error changes from day to day because of the data update cycle, as with digital advertising data, nonlinear models, which do not impose linearity, are more likely to show poor predictive performance when extrapolating.

Unfortunately, due to the immediacy characteristic of digital advertising, advertisers cannot afford to wait up to a week for the data to be fully updated. If advertisers judge that an ad's performance is poor, they may immediately reduce its exposure or even stop the campaign altogether. Additionally, for short-term ads, such as promotions, waiting up to a week is not an option.

The situation is no different for companies claiming to use "AI" to automatically manage ads. Advertising automation is akin to solving a reinforcement learning problem, where the goal is to maximize overall ad performance within a specific period using a limited budget. When measurement error occurs in the data, it can disrupt the initial budget allocation. Ultimately, it is quite likely to result in optimization failure.

Research Objective: Analysis of the Impact of Measurement Error and Proposal of a Reasonable Prediction Model

If, based on everything we've discussed so far, you're thinking, "The issue of measurement error could be important in digital advertising," then I've succeeded. Unfortunately, the advertising industry is not paying enough attention to measurement error. This is largely because measurement errors are not immediately visible.

This article focuses on two key aspects. First, we analyzed how measurement error affects advertising data, depending on the size of the error and the amount of data. Second, we proposed a prediction model that reasonably reflects the characteristics of the data.

II. Measurement Error from a Predictive Perspective

In this chapter, we will examine how measurement error affects actual ad performance.

Measurement Error: Systematic Error and Random Error

Let's delve a bit deeper into measurement error. Measurement error can be divided into two types: systematic error and random error. Systematic error has a certain directionality; for example, values are consistently measured higher than the true value. This is sometimes referred to as the error having a "drift". On the other hand, random error refers to when the measured values are determined randomly around the true value.

So, what kind of distribution do the measured values follow? For instance, if we denote the size of the drift as $\alpha$ and the true value as $\mu$, the measured value, represented as the random variable X, can be statistically modeled as following a normal distribution, $N(\mu + \alpha, \sigma^{2})$. In other words, the measured value is shifted by $\alpha$ from the true value (systematic error) while also having variability of $\sigma^2$ (random error).

Systematic errors can be resolved through data preprocessing and scaling, so they are not a significant issue from an analyst's perspective. Specifically, removing the directional bias by $\alpha$ from the measurements is usually sufficient. On the other hand, random errors significantly influence the magnitude of measurement errors and can cause problems. To resolve this, a more statistically sophisticated approach is required from the analyst's perspective.

Let's take a closer look at the issues that arise when data contains random errors. In regression models, when measurement error is present in the independent variables, a phenomenon known as "Regression Dilution" occurs: the absolute value of the estimated regression coefficients shrinks towards zero. To see why, imagine including an independent variable riddled with measurement error in the regression equation. Because this variable fluctuates randomly around its true value, its estimated coefficient is naturally pulled toward zero. This issue is not limited to basic linear regression; it occurs in linear and nonlinear models alike.
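
To make regression dilution concrete, here is a small simulation (hypothetical numbers, unrelated to the study's data) showing the estimated slope shrinking toward zero as noise is added to the regressor:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x_true = rng.normal(0, 1, n)              # true regressor
y = 2.0 * x_true + rng.normal(0, 1, n)    # true slope = 2

for noise_sd in [0.0, 0.5, 1.0, 2.0]:
    x_obs = x_true + rng.normal(0, noise_sd, n)   # regressor measured with random error
    slope = np.polyfit(x_obs, y, 1)[0]            # OLS slope estimate
    print(f"noise sd={noise_sd}: estimated slope = {slope:.2f}")

# The estimate shrinks toward 0 as noise_sd grows (attenuation bias), roughly by
# the reliability ratio var(x_true) / (var(x_true) + noise_sd**2).
```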

The Data Environment in Digital Advertising

So far, we have discussed measurement errors. Now, let's examine the environment in which digital advertising data is received for modeling purposes. In Chapter 1, we mentioned that media companies continuously update performance data such as impressions, clicks, and ad spend to address measurement errors. Given that the data is updated up to a week later, when the data is first received, it is likely that a significant amount of measurement error is present. However, as the data gets updated the next day, it becomes more accurate, and by the following day, it becomes even more precise. Through this process, the measurement error in the data tends to decrease exponentially.

Since the magnitude of measurement errors changes with each update, this can lead to issues of heteroskedasticity in addition to model fit. When heteroskedasticity occurs, the estimates become inefficient from an analytic perspective. Furthermore, from a predictive perspective, it presents challenges for extrapolation, as predicting new values based on existing data tends to result in poor performance.

Additionally, as ad spend increases, the magnitude of measurement errors grows. For example, when spending 1 dollar on advertising, the measurement error might only be a few cents, but with an ad spend of 1 million dollars, the error could be tens of thousands of dollars. In this context, it makes sense to use a multiplicative model, where a random percentage change is applied based on the ad spend. Of course, it is well-known that regression dilution can occur in multiplicative models, just as it does in additive models.

Model and Variable Selection

We have defined the dependent variable as the "number of events" that occur after users respond to an ad, based on their actions on the web or app. Events such as participation, sign-ups, and purchases are countable, occurring as 0, 1, 2, and so on, which means a model that best captures the characteristics of count data is needed.

For the independent variables, we will use only "ad spend" and the "lag of ad spend," as these are factors that the advertiser can control. Metrics like impressions and clicks are variables that can only be observed after the ads have been served, meaning they cannot be controlled in advance, and are therefore excluded from a business perspective. Impressions are highly correlated with ad spend, meaning these two variables contain similar amounts of information. This will play an important role later in the modeling process.

Meanwhile, to understand the effect of measurement errors, we need to deliberately "contaminate" the data by introducing measurement errors into the ad spend. The magnitude of these errors was set within the typical range observed in the industry, and simulations were conducted across various scenarios.
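
As a sketch of how ad spend might be "contaminated" with multiplicative error for such a simulation (the error magnitude below is illustrative, not the industry range actually used in the study):

```python
import numpy as np

rng = np.random.default_rng(1)

def contaminate_spend(spend, sigma):
    """Apply multiplicative measurement error: each observation is scaled by a
    random factor, so the absolute error grows with the level of ad spend."""
    factors = rng.lognormal(mean=0.0, sigma=sigma, size=len(spend))
    return spend * factors

spend = np.array([1.0, 10.0, 100.0, 1_000_000.0])
print(contaminate_spend(spend, sigma=0.05))   # errors scale with the spend level
```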

The proposed models are a Poisson regression-based time series model and a Poisson Kalman filter. We chose models based on the Poisson distribution to reflect the characteristics of count data.

The reason for using Poisson regression is that it helps to avoid the issue of heteroskedasticity in the residuals. Due to the nature of Poisson regression and other generalized linear models (GLMs), the focus is on the relationship between the mean and variance through the link function. This allows us to mitigate the heteroskedasticity problem mentioned earlier to some extent.

Furthermore, using the Poisson Kalman filter allows us to partially avoid the measurement error issue. This model accounts for the Poisson distribution in the observation equation while also compensating for the inaccuracies (including measurement errors) in the observation equation through the state equation. This characteristic enables the model to inherently address the inaccuracies in the observed data.

The Effect of Measurement Error

First, we will assess the effect of measurement error using the Poisson time series model.

[ \log(\lambda_{t}) = \beta_{0} + \sum_{k=1}^{7}\beta_{k}\log(Y_{t-k} + 1) + \alpha_{7}\log(\lambda_{t-7}) + \sum_{i=1}^{8}\eta_{i} Spend_{(t-i+1)} ]

Here, Spend represents the ad spend from the current time point up to 7 time points prior, and $\beta$ captures the lagged effects embedded in the residuals, beyond the effect of ad spend. Additionally, $\alpha$ accounts for the day-of-week effects.
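
For illustration, a rough sketch of fitting this kind of lagged log-linear Poisson model with statsmodels is shown below. It is a simplification, not the exact specification above: the feedback term $\log(\lambda_{t-7})$ is omitted, and the series are synthetic placeholders.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
spend = rng.gamma(shape=2.0, scale=50.0, size=n)        # placeholder ad-spend series
y = rng.poisson(np.exp(0.5 + 0.002 * spend))            # placeholder event counts

df = pd.DataFrame({"y": y, "spend": spend})
for k in range(1, 8):                                   # lagged log(Y + 1) terms
    df[f"log_y_lag{k}"] = np.log(df["y"].shift(k) + 1)
for i in range(8):                                      # spend at lags 0..7
    df[f"spend_lag{i}"] = df["spend"].shift(i)
df = df.dropna()

X = sm.add_constant(df.drop(columns=["y", "spend"]))
model = sm.GLM(df["y"], X, family=sm.families.Poisson()).fit()
print(model.summary().tables[1])                        # coefficient table
```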

Figure 1. Analysis Table for the Poisson Time Series Model

Although it may be too lengthy to include, we confirmed that this model reasonably reflects the data when considering model fit and complexity.

What we are really interested in is the measurement error. How did the measurement error affect the model’s predictions? To explore this, we first need to understand time series cross-validation.

Typically, K-fold or LOO (Leave-One-Out) methods are used when performing cross-validation on data. However, for time series data, where the order of the data is crucial, excluding certain portions of the data is not reasonable. Therefore, the following method is applied instead.

  • Fit the model using the first $d$ data points and predict future values.
  • Add one more data point, fit the model with ($d+1$) data points, and predict future values.
  • Repeat this process.

This can be illustrated as follows.

Figure 2. Time Series Cross-Validation / Source: Hyndman, R.J., Athanasopoulos, G. (2021) *Forecasting: Principles and Practice*, 3rd edition

Using this cross-validation method, we calculated the 1-step ahead forecast accuracy, with the evaluation metric set as MAE (Mean Absolute Error), taking the Poisson distribution into account.
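
A generic sketch of this expanding-window procedure; the fit and predict callables are placeholders for whichever model is being evaluated:

```python
import numpy as np

def expanding_window_mae(y, X, fit_fn, predict_fn, d0=30):
    """One-step-ahead expanding-window cross-validation.
    fit_fn(X_train, y_train) -> fitted model
    predict_fn(model, x_next) -> one-step-ahead forecast"""
    errors = []
    for t in range(d0, len(y)):
        model = fit_fn(X[:t], y[:t])                       # fit on the first t observations
        y_hat = np.ravel(predict_fn(model, X[t:t + 1]))[0]  # forecast observation t
        errors.append(abs(y[t] - y_hat))
    return float(np.mean(errors))                          # MAE of the 1-step-ahead forecasts
```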

Figure 3. Time Series Cross-Validation Results Based on the Magnitude of Measurement Error and Sample Size

An interesting result was found: in the table above, for low levels of measurement error (0.5 ~ 0.7), the model with measurement error recorded a lower MAE than the model without it. Wouldn’t we expect the model without measurement error to perform better, according to conventional wisdom?

This phenomenon occurred due to the regularization effect introduced by the measurement error. In other words, the measurement error caused attenuation bias in the regression coefficients, which helped mitigate the issue of high variance to some extent. In this case, the measurement error effectively played the role of the regularization parameter, $\lambda$, that we typically focus on in regularization.

Figure 4. Properly Fitted Model (Left) / Overfitted Model (Right)
Figure 5. $\lambda=0$ (Left) / $\lambda=\infty$ (Right)

Let's look at Figure 5. If the variance of the measurement error increases infinitely, the variable becomes useless, as shown in the right-hand diagram. In this case, the model would be fitted only to the sample mean of the dependent variable, with an R-squared value of 0. However, we also know that a model with no regularization at all, as depicted in the left-hand diagram, is not ideal either. Ultimately, finding the right balance is crucial, and it’s important to “listen to the data” to achieve this.

Let’s return to the model results. While low levels of measurement error clearly provide an advantage from the perspective of MAE, higher levels of measurement error result in a higher MAE compared to the original data. Additionally, since measurement errors only occur in recent data, as the amount of data increases, the proportion of error-free data compared to data with measurement error grows, reducing the overall effect of the measurement error.

What does it mean that MAE gradually improves as the data size increases? Initially, the model had high variance due to its complexity, but as more data becomes available, the model begins to better explain the data.

In summary, a small amount of measurement error can be beneficial from the perspective of MAE, which means that measurement error isn’t inherently bad. However, since we can't predetermine the magnitude of measurement error in the independent variables, it can be challenging to decide whether a model that resolves the measurement error issue is better or if it's preferable to leave the error unresolved.

To determine whether stronger regularization would be beneficial, one approach is to add a constraint term with $\lambda$ to the model for testing. Since the measurement error has acted similarly to ridge regression, it is appropriate to test using L2 regularization in this case as well.
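
One way to run such a test is scikit-learn's PoissonRegressor, whose alpha parameter is an L2 (ridge-style) penalty. The data below is synthetic and only illustrates the loop over penalty strengths:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                      # placeholder lagged-spend features
y = rng.poisson(np.exp(0.3 * X[:, 0] + 1.0))       # placeholder count outcome
X_train, X_valid = X[:150], X[150:]
y_train, y_valid = y[:150], y[150:]

for alpha in [0.0, 0.01, 0.1, 1.0, 10.0]:          # alpha is the L2 penalty strength
    model = PoissonRegressor(alpha=alpha, max_iter=1000).fit(X_train, y_train)
    mae = mean_absolute_error(y_valid, model.predict(X_valid))
    print(f"alpha={alpha}: validation MAE = {mae:.3f}")
```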

If weaker regularization is needed, what would be the best approach? In this case, one option would be to reduce measurement error by incorporating the latest data updates from the media companies. Alternatively, data preprocessing techniques, such as applying ideas from repeated measures ANOVA, could be used to minimize the magnitude of the measurement error.

III. Measurement Error from an Analytic Perspective

In Chapter 2, we explained that from a predictive perspective, an appropriate level of measurement error can act as regularization and be beneficial. At first glance, this might make measurement error seem like a trivial issue. But is that really the case?

In this chapter, we will explore how measurement error impacts the prediction of advertising performance from an analytic perspective.

Endogeneity: Disrupting Performance Measurement

In Chapter 1, we briefly discussed ad automation. Since a customer's ad budget is limited, the key to maximizing performance with a limited budget lies in solving the optimization problem of how much budget to allocate to each medium and ad. This decision ultimately determines the success of an automated ad management business.

There are countless media platforms and partners that play similar roles. It's rare for someone to purchase a product after encountering just one ad on a single platform. For example, consider buying a pair of pants. You might first discover a particular brand on Instagram, then search for that brand on Naver or Google before visiting a shopping site. Naturally, Instagram, Naver, and Google all contributed to the purchase. But how much did each platform contribute? To quantify this, the advertising industry employs various methodologies. One of the most prominent techniques is Marketing Media Mix Modeling.

As mentioned earlier, many models are used in the advertising industry, but the fundamental idea remains the same: distributing performance based on the influence of coefficients in regression analysis. However, the issue of "endogeneity" often arises, preventing accurate calculation of these coefficients. Endogeneity occurs when there is a non-zero correlation between the explanatory variables and the error term in a linear model, making the estimated regression coefficients unreliable. Accurately measuring the size of these coefficients is crucial for determining each platform's contribution and for properly building performance optimization algorithms. Therefore, addressing the issue of endogeneity is essential.

Solution to the Endogeneity Problem: 2SLS

In econometrics, a common solution to the endogeneity problem is the use of 2SLS (Two-Stage Least Squares). 2SLS is a methodology that addresses endogeneity by using instrumental variables (IV) that are highly correlated with the endogenous variables but uncorrelated with the model’s error term.

Figure 6. Instrumental Variable Represented by a Venn Diagram

Let's take a look at the example in Figure 6. We are using independent variable X to explain the dependent variable Y, but there is endogeneity in the red section of X, which negatively affects the estimation. To address this, we can use an appropriate instrumental variable Z, which is uncorrelated with the residuals of Y after removing X's influence (green), ensuring validity, and correlated with the original variable X, ensuring relevance. By performing the regression analysis only on the intersection of Z and X (yellow + purple), we can explain Y while solving the endogeneity problem in X. The key idea behind instrumental variables is to sacrifice some model fit in order to remove the problematic (red) section.
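
To make the mechanics concrete, here is a minimal two-stage sketch on simulated data (hypothetical numbers, not the study's variables). Doing the stages by hand clarifies the idea, though packaged 2SLS routines should be used in practice for correct standard errors:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)                          # instrument (e.g., impressions)
u = rng.normal(size=n)                          # unobserved confounder
x = 0.8 * z + 0.5 * u + rng.normal(size=n)      # endogenous regressor (e.g., ad spend)
y = 1.5 * x + u + rng.normal(size=n)            # outcome (e.g., events)

# Stage 1: regress the endogenous variable on the instrument
stage1 = sm.OLS(x, sm.add_constant(z)).fit()
x_hat = stage1.fittedvalues

# Stage 2: regress the outcome on the fitted values from stage 1
stage2 = sm.OLS(y, sm.add_constant(x_hat)).fit()
print(stage2.params)   # slope is close to 1.5; plain OLS on the raw x would be biased
# Note: manual two-stage estimation gives correct point estimates, but the reported
# standard errors need the usual 2SLS correction.
```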

Returning to the main point, in our model, there is not only the issue of measurement error in the variables, but also the potential for endogeneity due to omitted variable bias (OVB), as we are using only "ad spend" and lag variables as explanatory variables. Since the goal of this study is to understand the effect of measurement error on advertising performance, we will use a 2SLS test with appropriate IV to examine whether the measurement error in our model is actually causing endogeneity from an analytic perspective.

IV for Ad Spend: Impressions

As we discussed earlier, instrumental variables can help resolve endogeneity. However, verifying whether an instrumental variable is appropriate is not always straightforward. While it may not be perfect, for this model, based on industry domain knowledge, we have selected "impressions" as the most suitable instrumental variable.

First, let’s examine whether impressions satisfy the relevance condition. In display advertising, such as banners and videos, a CPM (cost per thousand impressions) model is commonly used, where advertisers are charged based on the number of impressions. Since advertisers are billed just for showing the ad, there is naturally a very high correlation between ad spend and impressions. In fact, a simple correlation analysis shows a correlation coefficient of over 0.9. This indicates that impressions and ad spend have very similar explanatory power, thus satisfying the relevance condition.

The most difficult aspect to prove for an instrumental variable is its validity. Validity means that the instrumental variable must be uncorrelated with the residuals, that is, the factors in the dependent variable (advertising performance) that remain after removing the effect of ad spend. In our model, what factors might be included in the residuals? From a domain perspective, possible factors include the presence of promotions or brand awareness. Unlike search ads, where users actively search for products or brands, in display ads users are passively exposed to ads that advertisers pay media platforms to show. Therefore, the number of impressions, which reflects this paid exposure, is likely uncorrelated with factors such as brand awareness or the presence of promotions that influence the residuals.

If you're still uncertain about whether the validity condition is satisfied, you can perform a correlation test between the instrumental variable and the residuals. As shown in the results of Figure 7, we cannot reject the null hypothesis of no correlation at the significance level of 0.05.

Figure 7. Validity Test Table
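
A minimal version of this correlation check, with placeholder arrays standing in for the actual instrument and second-stage residual series:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
impressions = rng.normal(size=300)     # placeholder instrument series
residuals = rng.normal(size=300)       # placeholder second-stage residuals

r, p_value = pearsonr(impressions, residuals)
print(f"correlation = {r:.3f}, p-value = {p_value:.3f}")
# If p_value > 0.05, the null of zero correlation is not rejected, which is
# consistent with (though not proof of) instrument validity.
```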

Of course, the instrumental variable, impressions, also contains measurement error. However, it is known that while measurement error in the instrumental variable can reduce the correlation with the original variable, it does not affect its validity.

Method for Detecting Endogeneity: Durbin-Wu-Hausman Test

Now, based on the instrumental variable (impressions) we identified, let's examine whether measurement error affects the endogeneity of the coefficients. After performing the Durbin-Wu-Hausman test, we can see that in some intervals the null hypothesis of no endogeneity is rejected. This indicates that measurement error is indeed inducing endogeneity in the estimated coefficients.
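
A minimal sketch of the regression-based (control-function) form of this test on simulated data; the variables are placeholders for ad spend, impressions, and events:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)                          # instrument (impressions)
u = rng.normal(size=n)                          # unobserved component
x = 0.8 * z + 0.5 * u + rng.normal(size=n)      # possibly endogenous regressor (spend)
y = 1.5 * x + u + rng.normal(size=n)            # outcome (events)

# 1) regress the suspect regressor on the instrument and keep the residuals
v_hat = sm.OLS(x, sm.add_constant(z)).fit().resid
# 2) add those residuals to the structural regression
aug = sm.OLS(y, sm.add_constant(np.column_stack([x, v_hat]))).fit()
print(aug.pvalues)   # a significant coefficient on v_hat (last p-value) indicates endogeneity
```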

Figure 8. Durbin-Wu-Hausman Test

Depending on the patterns in the newly acquired data revealed through this test, even seemingly robust models can change. Therefore, we can conclude that modeling with consideration for measurement error is a safer approach.

IV. Poisson Kalman Filter and Ensemble

Up until now, we have explored measurement error from both predictive and analytic perspectives. This time, we will look into the Poisson Kalman Filter, which corrects for measurement error, and introduce an "ensemble" model that combines the Poisson Kalman Filter with the Poisson time series model.

Poisson Kalman Filter, Measurement Error, Bayesian, and Regularization

The Kalman filter is a model that finds a compromise between the information from variables that the researcher already knows (State Equation) and the actual observed values (Observation Equation). From a Bayesian perspective, this is similar to combining the researcher's prior knowledge (Prior) with the likelihood obtained from the data.

Figure 9. Kalman Filter Estimation Process / Source: YouTube

The regularization and measurement error discussed in Chapter 2 can also be interpreted from a Bayesian perspective. This is because the core idea of regularization aligns with how strongly we hold the prior belief in Bayesian modeling that $\beta=0$. In Chapter 2, we saw that (random) measurement error effectively drives the coefficients toward zero, which ties together the intuition behind Kalman filters, Bayesian inference, regularization, and measurement error. Therefore, using a Kalman filter essentially means incorporating measurement error through the state equation, and this can further be understood as including regularization in the model.

Then, how should we construct the observation equation? Since our dependent variable is count data, it would be reasonable to use the log-link function from the GLM framework to model it effectively.
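
As a rough illustration of the idea (not the study's exact specification, which also feeds ad spend into the equations), here is a minimal local-level Poisson filter in extended/linearised Kalman form: the state is the log-intensity following a random walk, and the observation is a Poisson count through a log link.

```python
import numpy as np

def poisson_kalman_filter(y, q=0.05, x0=0.0, p0=1.0):
    """Minimal local-level Poisson Kalman filter (extended / linearised form).
    State:       x_t follows a random walk with variance q (log-intensity).
    Observation: y_t ~ Poisson(exp(x_t)), handled via first-order linearisation."""
    x, p = x0, p0
    fitted = np.empty(len(y))
    for t, y_t in enumerate(y):
        x_pred, p_pred = x, p + q        # predict: random-walk state, variance inflates by q
        lam = np.exp(x_pred)             # predicted Poisson intensity
        h = lam                          # d(lambda)/d(x) under the log link
        s = h * p_pred * h + lam         # innovation variance (Poisson variance = mean)
        k = p_pred * h / s               # Kalman gain
        x = x_pred + k * (y_t - lam)     # update the state with the observed count
        p = (1.0 - k * h) * p_pred
        fitted[t] = np.exp(x)
    return fitted

# Example: counts = np.random.default_rng(0).poisson(5, size=100)
#          print(poisson_kalman_filter(counts)[:5])
```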

Poisson Time Series vs. Poisson Kalman Filter

Figure 10. Comparison: Poisson Time Series Model and Poisson Kalman Filter

Let’s compare the performance of the Poisson time series model and the Poisson Kalman filter. First, looking at the log likelihood, we can see that the Poisson time series model consistently has a higher value across all intervals. However, when we examine the MAE, the Poisson Kalman filter shows superior performance. This suggests that the Poisson time series model is overfitted compared to the Poisson Kalman filter. In terms of computation time, the Poisson Kalman filter is also faster. However, since both models take less than 2 seconds to compute, this is not a significant factor when considering their application in real-world services.

If you look closely at Figure 10, you can spot an interesting detail: the decrease in MAE as the data volume increases is significantly larger for the Poisson time series model compared to the Poisson Kalman filter. The reason for this is as follows.

The Poisson Kalman filter initially reflected the state equation well, leading to a significant advantage in prediction accuracy (MAE) early on. However, as more data was added, it seems that the observation equation failed to effectively incorporate the new data, resulting in a slower improvement in MAE. On the other hand, the Poisson time series model suffered from poor prediction accuracy early on due to overfitting, but as more data came in, it was able to reasonably incorporate the data, leading to a substantial improvement in MAE.

Similar results were found in the model robustness tests. Specifically, in tests for residual autocorrelation, mean-variance relationships, and normality, the Poisson Kalman filter performed better when there was a smaller amount of data early on. However, after the mid-point, the Poisson time series model outperformed it.

Ensemble: Combining the Poisson Time Series and the Poisson Kalman Filter

Based on the discussion so far, we have combined the distinct advantages of both models to build a single ensemble model.

To simultaneously account for bias and variance, we set the constraint for the stacked model, which minimizes MAE, as follows.

[ p_{t+1} = \underset{p}{\operatorname{argmin}} \sum_{i=1}^{t} w_{i} \left| y_{i} - \left( p\,\hat{y}_{i}^{(GLM)} + (1-p)\,\hat{y}_{i}^{(KF)} \right) \right| ]

[ \text{s.t.}\;\; 0 \leq p \leq 1, \quad w_{i} > 0 \;\; \forall i ]

As we observed earlier, the Poisson Kalman filter had a lower MAE across all intervals, so without considering the momentum of MAE improvement, the stacked model would output $p=0$ across all intervals, meaning it would rely 100% on the Poisson Kalman filter. However, since the MAE of the Poisson time series model improves significantly in the later stages with a larger data set, we introduced the weights $w_{i}$ in front of the absolute error terms to account for this.

How should the weights be assigned? First, as the amount of data increases, both models will become progressively more reliable, resulting in reduced variance. Additionally, the model that performs better will typically have lower variance. Therefore, by assigning weights inversely proportional to the variance, we can effectively reflect the models' increasing accuracy over time.
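
A minimal sketch of this stacking step with scipy, assuming the weights have already been computed (for instance, inversely proportional to each point's forecast-error variance, as described above):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def stack_weight(y, yhat_glm, yhat_kf, weights):
    """Find p in [0, 1] minimising the weighted MAE of the blended forecast."""
    def weighted_mae(p):
        blend = p * yhat_glm + (1 - p) * yhat_kf
        return np.sum(weights * np.abs(y - blend))
    return minimize_scalar(weighted_mae, bounds=(0.0, 1.0), method="bounded").x

# Usage (placeholder names): p_next = stack_weight(y_hist, glm_preds, kf_preds, w)
#                            y_next = p_next * glm_next + (1 - p_next) * kf_next
```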

The predictions from the final model, which incorporates the weights, are as follows.

[ \hat{y}_{t+1} = p_{t+1}\,\hat{y}_{t+1}^{(GLM)} + (1 - p_{t+1})\,\hat{y}_{t+1}^{(KF)} ]

Figure 11. Variation of p (Poisson Time Series Weight) with Sample size

When analyzing the data using the ensemble model, we observed that in the early stages, p (the weight of the Poisson time series model in the ensemble) remained close to 0, but then jumped near 1 in the mid stages. Additionally, in certain intervals during the later stages where the data patterns change, we can see that the Poisson Kalman filter, which leverages the advantages of the state equation, is also utilized.

Figure 12. Comparison of MAE Between Models

Let’s take a look at the MAE of the ensemble model in Figure 12. By reasonably combining the two models that exhibit heterogeneity, we can see that the MAE is lower across all intervals compared to the individual models. Additionally, in the robustness tests, we confirmed that the advantages of the ensemble were maximized, making it more robust than the individual models.

Conclusion

While the fields of applied statistics, econometrics, machine learning, and data science may have different areas of focus and unique strengths, they ultimately converge on the common question: "How can we rationally quantify real-world problems?" Additionally, a deep understanding of the domain to which the problem belongs is essential in this process.

This study focuses on the measurement error issues commonly encountered in the digital advertising domain and how these issues can impact both predictive and analytic modeling. To address this, we presented two models, the Poisson time series model and the Poisson Kalman filter, tailored to the domain environment (advertising industry) and the data generating process (DGP). Considering the strong heterogeneity between the two models, we ultimately proposed an ensemble model.

With the universalization of smartphones, the digital advertising market is set to grow even more rapidly in the future. I hope that as you read this paper, you take the time to savor the knowledge rather than hurriedly trying to absorb the text. It would be wonderful if you could expand your understanding of how statistics applies to the fields of data science and artificial intelligence.

Mincheol Kim (MSc Data Science, 2023)

Having completed over half of my graduate courses and approaching graduation, I wanted to write a thesis in a field that heavily utilized machine learning and deep learning, rather than relying on traditional statistical analysis methods. This felt more aligned with the purpose of my graduate education in data science and artificial intelligence, making the experience more meaningful.

Data Accessibility and Deep Learning Applicability

Like many others, I struggled to obtain data. This led me to choose a field where data was accessible, but conventional methods failed to uncover meaningful insights. In graduate school, we weren't restricted to specific topics. Instead, we learned a variety of data analysis methods based on mathematical and statistical principles. This flexibility allowed me to explore different areas of interest.

I eventually chose topic modeling with deep learning as the subject of my thesis. I selected topic modeling because deep learning methodologies for this field have developed well, moving beyond statistical logic to generative models with layered structures of factor analysis, which trace underlying structures based on data probability. Moreover, there were many excellent researchers in Korea working in this field, making it easier to access good educational resources.

Korea's high dependency on exports

During a conversation with my mentoring professor, he suggested, "Instead of focusing on the analysis you're interested in, why not tackle an AI problem that society needs?" Taking his advice to heart, I started exploring NLP problems where I could make a meaningful contribution, using the IMRaD approach. This search led me to a research paper that immediately caught my attention.

The paper highlighted that while Korea has a globally respected economic structure, it remains heavily dependent on foreign markets rather than its domestic one. This dependency makes Korea's economy vulnerable to downturns if demand from advanced countries decreases. Furthermore, recent trade tensions with China have significantly affected Korea's trade sector, emphasizing the need for export diversification. Despite various public institutions—such as KOTRA, the Korea International Trade Association, and the Small and Medium Business Corporation—offering services to promote this diversification, the paper questioned the overall effectiveness of these efforts. This issue resonated with me, sparking ideas on how AI could potentially offer solutions.

Big Data for Korean Exports

The paper's biggest criticism is that the so-called 'big data services' provided by public institutions don't actually help Korean companies with buyer matching. Large companies that already export have established connections and the resources to keep doing so without much trouble. However, small businesses and individual sellers, who lack those resources, are left to find new markets on their own. In this context, public institutions aren't providing the necessary information or effectively helping with buyer matching, which is critical for improving the country's economy and competitiveness.

While the institutions mentioned do have the experience, expertise, and supply chains to assist many Korean companies with exporting, the paper highlights that their services aren't truly utilizing big data or AI. So, what would a genuine big data and AI service look like—one that leverages the strengths of these institutions while truly benefiting exporters and sellers? The idea I developed is a service that offers 'interpretable' models, calculations, and analysis results based on data, providing practical support for decision-making.

LDA does not capture 'context'

When I think about AI, the first professor who comes to mind is Andrew Ng. I’m not sure why, but at some point, I started noticing more and more people around me mentioning his lectures, interviews, or research papers. Given that his work dates back to the early 2000s, I might have come across it later than many others. Interestingly, the topic modeling in my thesis traces back to Professor Ng’s seminal paper on Latent Dirichlet Allocation (LDA).

In LDA, the proportion (distribution) of topics is assumed to follow a Dirichlet distribution (a multivariate extension of the beta distribution), and based on this assumption, the words in each document are temporarily assigned to topics. Then, the prior parameters are re-estimated from the words in each topic, and the change is measured using KL-divergence. The topic assignments are repeatedly adjusted until the changes narrow and the model converges. Since LDA is typically fitted with Gibbs sampling, it differs significantly from Neural Variational Inference (NVI), which I will explain later.

Simply put, the goal of LDA is to identify hidden topics within a collection of documents. The researcher sets the number of topics, $k$, and the LDA model learns to extract those topics from the data.
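
For concreteness, a minimal LDA sketch with gensim; the toy documents and topic count are placeholders, not the corpus used in this study:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["bank", "financial", "market", "economic"],
    ["export", "buyer", "trade", "shipping"],
    ["bank", "loan", "market", "interest"],
]                                              # toy tokenised documents

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)
print(lda.print_topics())                      # top words per latent topic
```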

The biggest flaw in LDA is that it assumes the order and relationship of words are conditionally independent. In other words, when using LDA, the next word or phrase can be assigned to a completely different topic, regardless of the conversation's context. For example, you could be discussing topics B and C, but the model might suddenly shift to topic A without considering the flow of the discussion. This is a major limitation of LDA — it doesn’t account for context.

Figure 1. In context, "new public facilities" should clearly be grouped under the same topic, but we can see that "new" has been analyzed as part of a separate topic

LSA uses all the information

On the other hand, Latent Semantic Analysis (LSA) addresses many of LDA's weaknesses. It extracts the relationships between words, words and documents, and documents and documents using Singular Value Decomposition (SVD).

The principle is simple. A set of documents is represented as an $n \times m$ word-document matrix of (weighted) occurrence counts. By decomposing this matrix with SVD, we obtain singular vectors representing words and documents, and the corresponding singular values indicate how much weight each latent dimension carries in the vector space.

LSA calculation method using SVD

Although LSA is a basic model that relies on simple SVD calculations, it stands out for utilizing all the statistical information from the entire corpus. However, it has its limitations. As mentioned earlier, LSA uses a frequency-based TF-IDF matrix, and since the matrix elements only represent the 'occurrence count' of a word, it struggles with accurately inferring word meanings.
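
A minimal LSA sketch using a TF-IDF matrix and truncated SVD from scikit-learn (toy documents only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "banks raise interest rates",
    "exporters search for new buyers",
    "interest rates affect loan demand",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)                  # documents x terms TF-IDF matrix

svd = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = svd.fit_transform(X)                # low-rank document representations
print(doc_vecs)
print(svd.explained_variance_ratio_)           # weight carried by each latent dimension
```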

GloVe: Contextual Word Embeddings

A model designed to capture the word order, or 'contextual dependency,' between words is Global Vectors for Word Representation (GloVe). Word2Vec focuses on local context but does not fully leverage the global statistics of the corpus, while LSA does the opposite; GloVe was developed to retain the strengths of both approaches. In essence, GloVe reflects the statistical information of the entire corpus and uses dense representations that allow for efficient calculation of word similarities. The goal of GloVe is to find embedding vectors that minimize the objective function $J$, shown below.

Objective function

To bypass the complex derivation of GloVe's objective function $J$ and focus on the core idea, GloVe aims to make the dot product of two embedded word vectors correspond to (the logarithm of) the co-occurrence probability of those words in the entire corpus. This is done by applying least squares estimation to the objective function, with a weighting function $f(X_{ij})$ added to prevent overfitting, a common issue in language models.
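
For reference, the standard form of the GloVe objective summarized here (Pennington et al., 2014) is:

[ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_{i}^{\top}\tilde{w}_{j} + b_{i} + \tilde{b}_{j} - \log X_{ij} \right)^{2} ]

where $w_i$ and $\tilde{w}_j$ are the word and context embedding vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $X_{ij}$ is the co-occurrence count, and $f$ is the weighting function mentioned above.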

The GloVe model requires training on a large, diverse corpus to effectively capture language rules and patterns. Once trained, it can generate generalizable word representations that can be applied across different tasks and domains.

The reason for highlighting the evolution of topic modeling techniques from LDA to GloVe is that GloVe's embedding vectors, which capture comprehensive document information, serve as the input for the core algorithm of this paper, Graph Neural Topic Model (GNTM).

Word Graph: Uncovering Word Relationships

The techniques we've explored so far (LDA, LSA, GloVe) were all developed to overcome the limitations of assuming word independence. In other words, they represent the step-by-step evolution of word embedding methods to better capture the 'context' between words. Now, taking a slightly different approach, let's explore Word Graphs to examine the relationships between words.

Both GloVe and Word Graph aim to understand the relationships between words. However, while GloVe maps word embeddings into Euclidean space, Word Graph defines the structure of word relationships within a document in non-Euclidean space. This allows Word Graph to reveal 'hidden' connections that traditional numerical methods in Euclidean space might overlook.

Figure 2. The relationships between words can be calculated and visualized using Word Graph

So, how can we calculate the 'structure that represents relationships between nearby words'? To do this, we use a method called Global Random Field (GRF). GRF models the graph structure within a document by using the topic weights of words and the information from the topics associated with the edges that connect the words in the graph. This process helps capture the relationships between words and topics, as illustrated below.

GRF. $p(G)$: graph generation probability, $w$: word, $w', w''$: neighboring words, $z_{w'}, z_{w''}$: the topics to which the neighboring words belong

The key point here is that the sum in the last term of the edge does not equal 1. If $w’$ corresponds to topic 1, the sum of all possible cases for $w’’$ would indeed be 1. However, since $w’$ can also be assigned to other topics, we need to account for this. As a result, the total number of edges $|E|$, which acts as a normalizing factor, is included in the denominator.

The GTRF model proposed by Li et al. (2014) is quite similar to GRF. The key difference is that the distribution of topic $z$ is now conditional on $\theta$, with the EM algorithm still being used for both learning and inference. The resulting $p_{GTRF}(z|\theta)$ represents the probability of two topics being related. In other words, it calculates whether neighboring words $w'$ and $w''$ are assigned to the same or different topics, helping to determine the probability of the graph structure.

We have reviewed the foundational keyword embedding techniques used in this study and introduced GTRF, a model that captures 'the relationships between words within topics' through graph-based representation.

Figure 3. An example of higher-order GNN. As the order increases, it expresses relationships between data more deeply

Graph Neural Topic Model: Key Differences

Building on the previous discussion, let's delve into the core of this study: the Graph Neural Topic Model (GNTM). GNTM utilizes a higher-order Graph Neural Network (GNN). As illustrated in the diagram, GNTM increases the order of connections, enabling a more comprehensive understanding and embedding of complex word relationships.

Additionally, GNTM significantly reduces computational costs by utilizing Neural Variational Inference (NVI), rather than standard Variational Inference (VI), to streamline the learning process. Unlike LDA, GNTM introduces an extra step to incorporate 'graph structure' into its calculations, further enhancing efficiency.

Figure 4. Calculation diagram for LDA (top) and GNTM (bottom). In the diagram, $G$ represents the structure of topics and words, while $V$ denotes the word set.

Let’s explore the mechanism of GNTM (GTRF). The diagram above compares the calculations of GTRF and LDA side by side. As previously mentioned, GTRF learns how the structure of $z$ evolves based on the conditional distribution once $\theta$ is determined.

This may seem complex, so let’s step back and look at the bigger picture. If we assume topics are evenly distributed throughout the document, each topic will have its own proportion. We’ll refer to the parameter representing this proportion as $\alpha$.

Here, $\alpha$ (similar to the LDA approach) is a parameter that defines the shape of the Dirichlet Distribution, an extended version of the Beta distribution. The shape of the distribution changes according to $\alpha$, as shown below.

Figure 5. Changes in the Beta distribution when there are 3 $\alpha$ values (corresponding to 3 topics)

After setting the topic proportions using the parameter $\alpha$, a variable $\theta_d$ is derived to represent these proportions. While this defines the ratio, the outcomes remain flexible. Additionally, once the topic $z$ is established, the structure $G$ and word set $V$ are determined accordingly.
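
To make this sampling step concrete, here is a tiny numpy sketch of drawing topic proportions from a Dirichlet prior and assigning words to topics at random; the numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

k = 3                                    # number of topics
alpha = np.ones(k)                       # symmetric prior: no preference among topics
theta_d = rng.dirichlet(alpha)           # topic proportions for one document
print(theta_d, theta_d.sum())            # proportions sum to 1

# Larger alpha values concentrate mass near uniform proportions;
# alpha < 1 pushes documents toward being dominated by a few topics.
z = rng.choice(k, size=20, p=theta_d)    # topic assignments for 20 words of the document
print(z)
```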

First Difference: Incorporating Graphs into LDA

Up to this point, we've discussed the process of quantifying the news information at hand. Now, it's time to consider how to calculate it accurately and efficiently.

The first step is straightforward: set all the parameters of the Dirichlet distribution ($\alpha$) to 1, creating a uniform distribution across $n$ dimensions. This approach assumes equal proportions for all topics, as we don't have any prior information.

Next, assuming a uniform topic distribution, the topic proportions are randomly sampled. Based on these proportions, words in the news articles are then randomly assigned to their corresponding topic $z$. The intermediate graph structure $G$, however, doesn't need to be predefined, as it will be 'learned' during the process to reveal hidden relationships. The initial formula can thus be summarized as follows.

As we saw with GTRF, the probability of a graph structure forming depends on the given condition, in this case, the topic. This is represented by multiplying all instances of $p(1-p)$ from the binomial distribution's variance. In other words, topics are randomly assigned to words, and based on those assignment ratios, we can calculate $m$. With this value, the probability of structures forming between topics is then quantified using the variance from the binomial distribution.

Second Difference: NVI

The final aspect to examine is NVI. NVI estimates the posterior distribution of latent topics in text data. It uses a Neural Network structure to parametrize the algorithm, allowing accurate estimation of the true posterior across various distributions. In the process, it often uses the reparameterization trick from VI to simplify distributions. Using neural networks means NVI can be applied to more diverse distributions than VAE (Variational AutoEncoder), which learns data in lower dimensions. This is supported by the Universal Approximation Theorem, which states that, in theory, any continuous function can be approximated arbitrarily well by a neural network.

To explain reparameterization further, it replaces the existing probability distribution with another, representing it through learnable parameters. This allows backpropagation and effective gradient computation. This technique is often used in VAE during the sampling of latent variables.

As mentioned earlier, both VI and NVI use the reparameterization trick. However, NVI's advantage is that it can estimate diverse distributions through neural networks. While traditional Dirichlet distribution-based VI uses only one piece of information, NVI can use two, mean and covariance, through the Logistic Normal Distribution. Additionally, like GTRF, which estimates topic structure, NVI reflects the process of inferring relationships between topics in the model.
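
A minimal PyTorch sketch of the reparameterization trick for a logistic-normal topic distribution; the names and dimensions are placeholders, not the GNTM implementation:

```python
import torch

def sample_topic_proportions(mu, log_sigma):
    """Reparameterised draw from a logistic-normal distribution over topics.
    Sampling noise is separated from the learnable parameters (mu, log_sigma),
    so gradients can flow through the draw during backpropagation."""
    eps = torch.randn_like(mu)                 # noise, independent of the parameters
    z = mu + torch.exp(log_sigma) * eps        # Gaussian draw, differentiable in mu/sigma
    return torch.softmax(z, dim=-1)            # map onto the simplex (topic proportions)

mu = torch.zeros(10, requires_grad=True)       # 10 topics, placeholder parameters
log_sigma = torch.zeros(10, requires_grad=True)
theta = sample_topic_proportions(mu, log_sigma)
loss = (theta ** 2).sum()                      # placeholder loss
loss.backward()                                # gradients reach mu and log_sigma
print(mu.grad is not None, log_sigma.grad is not None)
```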

So far, we have designed a topic model that maximizes computational efficiency while capturing word context using advanced techniques. However, as my research progressed, one question became my primary concern: "How can this model effectively assist in decision-making?"

The first key decision is naturally, "How many topics should be extracted using the GNTM model?" This is akin to the question often posed in PCA: "How many variables should be extracted?" Both decisions are critical for optimizing the model's usefulness.

Determining the Number of Topics

From a Computational Efficiency Standpoint

Let's first determine the number of topics with a focus on computational efficiency and minimizing costs. During my theoretical studies in school, I didn't fully grasp the significance of computational efficiency because the models we worked with were relatively "light" and could complete calculations within a few minutes.

However, this study deals with a vast dataset—around 4.5 to 5 million words—which makes the model significantly "heavier." While we've integrated various methods like LDA, graph structures, and NVI to reduce computational load and improve accuracy, failing to limit the number of topics appropriately would cause computational costs to skyrocket.

To address this, I compared the computational efficiency when the number of topics was set to 10 and 20. I used TC (Topic Coherence) to evaluate the semantic consistency of words classified into the same topic, and TD (Topic Diversity) to assess the variety of content across topics.
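
As a sketch of how TC and TD can be computed, using gensim's CoherenceModel and a simple uniqueness ratio; the toy corpus and topics below are placeholders:

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

texts = [
    ["bank", "financial", "market", "economic"],
    ["export", "buyer", "trade", "shipping"],
    ["bank", "loan", "market", "interest"],
]                                              # toy tokenised corpus
topics = [
    ["bank", "financial", "market"],
    ["export", "buyer", "trade"],
]                                              # top words per topic (e.g., from GNTM)

dictionary = Dictionary(texts)

# Topic Coherence: semantic consistency of the top words within each topic
tc = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary,
                    coherence="c_v").get_coherence()

# Topic Diversity: share of unique words across all topics' top-k words
flat = [w for topic in topics for w in topic]
td = len(set(flat)) / len(flat)
print(tc, td)
```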

The results showed that when the number of topics was 10, the calculation speed improved by about 1 hour (for Epoch=100) compared to 20 topics, and the TD and TC scores did not drop significantly. Personally, I believe more accurate validation should be done by increasing the Epoch to 500, but since this experiment was conducted on a CPU, not a GPU, increasing the number of Epochs would take too much time, making realistic validation difficult.

There may be suggestions to raise the Epoch further, but since this model uses Adam (Adaptive Moment Estimation) as its optimizer, it is expected to converge quickly to the optimal range even with a lower Epoch count, without significant changes.

From a Clustering Standpoint

In the previous discussion, we determined that 10 topics are optimal from a computational efficiency standpoint. Now, let's consider how many topics would lead to the best "seller-buyer matching"—or how many industries should be identified for overseas buyers to effectively find domestic sellers with relevant offerings—from a clustering perspective.

If the number of topics becomes too large, Topic Coherence (TC) decreases, making it harder to extract meaningful insights. This is similar to using adjusted $R^2$ in linear regression to avoid adding irrelevant variables, or to the caution needed in PCA when selecting variables after the explained variance stops increasing significantly.

As dimensions increase, the "curse of dimensionality" emerges in Euclidean space. To minimize redundant variables, we utilized clustering metrics such as the Silhouette Index, Calinski-Harabasz Index, and Davies-Bouldin Index, all based on cosine similarity and correlation.

Figure 6. We can confirm that the optimal clustering occurs with 9 clusters using the UMAP algorithm

As shown in the figure above, the clustering results indicate that the optimal grouping occurs with 9 topics. This was achieved using agglomerative hierarchical clustering, which begins by grouping small units and progressively merges them until all the data is clustered.
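
A sketch of this model-selection step, scoring agglomerative clusterings of the document-topic proportions with a cosine-based silhouette; the data is a placeholder, and the study also consulted the Calinski-Harabasz and Davies-Bouldin indices:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(10), size=300)        # placeholder document-topic proportions

scores = {}
for k in range(2, 16):
    labels = AgglomerativeClustering(
        n_clusters=k, metric="cosine", linkage="average"   # requires scikit-learn >= 1.2
    ).fit_predict(theta)
    scores[k] = silhouette_score(theta, labels, metric="cosine")

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])                       # cluster count with the best silhouette
```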

Strengths of GNTM

The key strength of this method lies in its interpretability. By using a dendrogram, we can visually trace how countries are grouped within each cluster. The degree of matching can be calculated using cosine similarity, and the topics that form these matches, along with the specific content of the topics, can be easily interpreted through a word network.

The high level of interpretability also enhances the model's potential for direct integration with KOTRA's existing services. By leveraging the topics extracted from KOTRA's existing export-import data, this model can strengthen the capabilities of an AI-based buyer matching service using KOTRA's global document data. Additionally, the model's structure, calculation process, and results are highly transparent, making decision-making, post-analysis, and tracking calculations significantly more efficient. This interpretability not only increases transparency but also maximizes the practical application of the AI system.

Figure 7. Applying UMAP allows us to identify nonlinear relationships within the data

Additionally, the model can be applied to non-English documents. While there may be some loss of information compared to English, as mentioned earlier, GloVe captures similar words with similar vectors across languages and reflects contextual relationships. As a result, applying this methodology to other languages should not present significant challenges.

Furthermore, the model has the ability to uncover hidden nonlinear relationships within the data through UMAP (Uniform Manifold Approximation and Projection) clustering, which goes beyond the limitations of traditional linear analysis. This makes it promising for future applications, not only in hierarchical clustering but also in general clustering and recommendation algorithms.
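
For reference, a minimal umap-learn sketch of projecting the document-topic proportions into two dimensions; the parameters are illustrative defaults, not the study's settings:

```python
import numpy as np
import umap   # umap-learn package

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(10), size=300)        # placeholder document-topic proportions

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine", random_state=0)
embedding = reducer.fit_transform(theta)            # 2-D nonlinear embedding
print(embedding.shape)                              # (300, 2), ready for plotting or clustering
```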

Summary

In summary, this paper presents a Natural Language Processing (NLP) study that performs nonlinear factor analysis to uncover the topic proportions $\theta$, accounting for the covariance between topics.

In other words, while factor analysis (FA) is typically applied to numerical data, this research extends nonlinear factor analysis to the NLP field, extracting the structure of words and topics, topic proportions, and the prior distribution governing these proportions, effectively quantifying information for each group.

One of the biggest challenges in PCA and FA-related studies is interpreting and defining the extracted factors. However, the 'GNTM' model in this paper overcomes the limitation of 'difficulty in defining factors' by presenting word networks for each topic.

Now, for each topic (factor), we can identify important words and interpret what the topic is. For example, if the words "bank," "financial," "business," "market," and "economic" dominate in Topic 1, it can be defined as 'Investment.'

The paper also optimizes the number of topics in terms of TC (Topic Coherence) and TD (Topic Diversity) to best fit the purpose of buyer-seller matching. The results for the optimized number of topics were visually confirmed using UMAP and word networks.

Lastly, to address the curse of dimensionality, clustering was performed using metrics based on cosine similarity and correlation.

What about the noise issue?

Text data often contains noise, such as special characters, punctuation, spaces, and unnecessary tags unrelated to the actual data. This model minimizes noise by using NVI, which extracts only important tokens, and by significantly increasing the number of epochs.

However, increasing the number of epochs exponentially raises computational costs. In real-world applications with time constraints, additional methods to quickly and efficiently reduce noise will be necessary.

Applicability

The most attractive feature that distinguishes GNTM from other NLP methods is its interpretability. While traditional deep learning is often called a 'black box,' making it hard for humans to understand the computation process, this model uses graph-based calculations to intuitively understand the factors that determine topics.

Additionally, GNTM is easy to apply. The Graph Neural Network Model, which forms the basis of GNTM, is available in a package format for public use, allowing potential users to easily utilize it as a service.

Furthermore, this study offers a lightweight model that can be applied by companies handling English text data, enabling them to quantify and utilize the data flexibly according to their objectives and needs.

Through UMAP (Uniform Manifold Approximation and Projection), the presence of nonlinear relationships in the data was visually confirmed, allowing for the potential application of additional nonlinear methods like LightGCN.

Moreover, since this paper assigned detailed topic proportions to each document, there is further research potential on how to utilize these topic proportions.

Future Research Direction

Similar to prompt engineering, which aims to get high-quality results from AI at a low cost, the future research direction for this paper will focus on 'how to train the model accurately and quickly while excluding as much noise as possible.' This includes applying regularization to prevent overfitting from noise and improving computational efficiency even further.

Bohyun Yoo (MBA AI/BigData, 2023)

The real estate market is showing unusual signs. As global tightening begins, experts worry that the bubble in the domestic real estate market, which benefited from the post-COVID-19 liquidity, may burst. They warn we should prepare for a potential impact on the real economy.

Since late last year, major central banks, including the U.S. Federal Reserve, have been raising interest rates to combat inflation. This has caused housing prices to decline, reducing household net worth and increasing losses for real estate developers, which could potentially trigger a recession.

Global Liquidity and the Surge in Housing Prices

Meanwhile, some investors are attempting to exploit the 'bubble' in the real estate market for profit. They expect prices to fall soon and aim for capital gains by buying low. Others seek arbitrage opportunities, assuming prices haven’t yet aligned with fair value. For these investors, it is crucial to assess whether current property prices are discounted or overpriced compared to intrinsic value.

Similarly, for financial institutions heavily involved in mortgage lending, analyzing the real estate market is key to the success of their loan business. This study examines why identifying the 'bubble' in the real estate market, especially in auctions, is important and how it can be explored mathematically and statistically.

Importance of Real Estate Auction Market

Various stakeholders participate in Korea's real estate auction market, each with distinct objectives. Homebuyers, investors seeking profit opportunities, and financial institutions managing mortgages are all active players. The apartment auction market, in particular, is highly competitive, with prices often closely aligned with those in the regular sales market.

Financial institutions are closely connected to the auction market. In Korea, when a borrower defaults on a property loan, the property is handled through court auctions or public sales overseen by the state. Financial institutions recover the loan amount by selling the collateral through these auctions in the event of a default.

Therefore, one of the key factors for financial institutions in determining their lending limits is how much principal they can recover in the auction market in the event of a default, especially for fintechs (P2P lending) and secondary lenders such as savings banks and capital companies, which are not subject to loan-to-value (LTV) restrictions.

Since most financial institutions hold a significant portion of their assets in mortgage loans, lending the maximum amount within a safe limit is ideal for maximizing revenue. Thus, when financial institutions review mortgage loan limits, trends in the auction market serve as a critical decision-making indicator.

To See Beyond Prices in the Market

It's easy to assume that the winning bid for an apartment auction in a certain area of Seoul, at a particular time, would either come at a discount or a premium compared to the general market price. And, with a bit of rights analysis, setting a cautious upper limit wouldn't be all that hard. But, in reality, it's a bit more complex than just making those assumptions.

Furthermore, if we want to examine the market movement from a broader perspective rather than focusing on individual auction cases, we need to change our approach. For example, it's easy to track Samsung’s stock price trends in the stock market, even down to minute-by-minute data over the past year. However, in real estate, auctions for a specific apartment, like Unit 301 of Building 103 in a particular complex, don’t happen every month. Even expanding the scope to the whole complex yields similar results. Therefore, it's no longer feasible to analyze the market purely based on prices. Real estate analysis must shift from a [time-price] perspective, as in stocks, to a [time-location] perspective.

Errors in the Auction Winning Bid Rate Indicator

Just as the general sales market has a time-series index like the apartment sales index, the auction market has the winning bid rate indicator. This is a monthly indicator published by local courts, showing the ratio of auction-winning bids to court-appraised values in a given area. For example, if the court appraises a property at 1 billion won and the winning bid is 900 million won, the winning bid rate would be 90%.

Since court appraisals are generally considered market prices, the winning bid rate represents the ratio of the auction price to the market price. When calculated for all auctions in an area, it gives the average auction price compared to the market value for that month.

However, this indicator has significant flaws. The court appraisal is set when the auction begins, but the winning bid reflects the price at the time of the auction. Given that auctions typically take 7 to 11 months, this time gap can lead to errors if market prices drop or rise sharply. For instance, news reports during recent price surges claimed that the winning bid rate in Seoul exceeded 120%, which seems hard to believe—how could auction prices be 1.2 times higher than market prices? This is actually incorrect information.

If market prices rise sharply during the 7 to 11 months it takes to complete an auction, bidders place their bids based on current market prices, while the court appraisal remains fixed at the start. As a result, the appraised value becomes relatively lower compared to the current market price, creating the illusion of a 120% winning bid rate. Interpreting this rate at face value can lead to poor real estate decisions or significant errors.

Limitations of Previous Studies

This has prompted previous auction market studies to try addressing the shortcomings of the winning bid rate indicator. For instance, some researchers adjusted the court-appraised value—the denominator—by factoring in the sales index at the time of the auction, aiming to estimate a more accurate "true winning bid rate."

However, experts agree this is not a perfect solution. To estimate the true winning bid rate for Seoul, all auctions during that period would need to have their court appraised values corrected. Each auction has different appraised values and closing dates, and researchers would have to manually correct hundreds or thousands of auctions. Expanding the region would make this task even more challenging, and even if corrected, the values would only be approximations, not guarantees of accuracy.

If researchers selectively sample auctions for convenience, it could introduce sampling bias. This is similar to trying to find the average height of Korean men by only sampling from a tall group.

It highlights the need for time-series indicators from a market perspective when making business decisions, rather than focusing solely on price data. The winning bid rate, commonly used in auctions, is prone to errors. Although methods like adjusting the court-appraised value have been suggested, they are difficult to apply in real-world scenarios.

These are the same problems I encountered as a practitioner. When time-series analysis was needed for decision-making, the persistent issues with the winning bid rate made it hard to use effectively.

Winning Price vs. Winning Bid Rate

There is an important distinction to make here. Analyzing the auction "winning price" and analyzing the "winning bid rate" have different meanings and purposes. As mentioned earlier, analyzing the winning price of a single auction case poses no issue.

For example, focusing on apartments, bidders base their bids on the market price at the time of bidding. If the gap between the bidding and final winning is about 1 to 2 months, considering that real estate prices don’t fluctuate dramatically like stocks within a month, the winning price should not significantly deviate from the market price a couple of months earlier. Factors like distance to schools, floor level, and brand, which are known to affect market prices, are likely already reflected in the market price, meaning they won't heavily impact the auction winning price.

Most prior studies on real estate auctions, particularly for apartments, have concentrated on how accurately they can predict the "winning price" and identifying the key factors that influence it. However, in practice, as previously discussed, price prediction is not the primary concern.

Even a simple linear regression analysis reveals that the R-squared between winning prices and KB market prices from 1-2 months earlier exceeds 95%, indicating a strong linear relationship. There is no evidence of a non-linear connection. If future trend forecasting is needed, the focus should shift toward a time-series analysis.

Discounts/Premiums Changing Over Time

After a lengthy introduction, let's get to the main point—I want to analyze the auction market. The problem is, the data contains significant errors, and trying to correct them individually has its limitations, especially within the industry. We need a different approach. So, what alternative methods can we use? And what insights can this new analysis reveal?

This is the core topic and background of the study. I used statistics as a tool to solve a seemingly insurmountable business problem encountered in practice.

What I aimed to find in the market was the difference between the sales market and the auction market. This 'difference' can be expressed as the discount or premium of the auction market compared to the sales market. Additionally, a time-series analysis is essential because the discount/premium factors will change over time depending on the economic or market conditions.

Factors of Discount/Premium in the Auction Market

Existing studies on the factors of discount/premium in housing auctions are quite varied. Nonetheless, as mentioned earlier, both international and domestic research mainly focus on price analysis rather than market analysis, making it difficult to grasp the broader trends of real estate. Typically, they gather auction cases over several years, remove legal issues, compare with market prices, and conclude there was a discount/premium, attributing it to specific factors.

Moreover, overseas studies often allow private auctions and use bidding systems, making direct application to Korea difficult. In domestic studies, the few that exist lack market-based analysis.

The Challenge: 'Data Availability'

If we assume that there is a discount/premium factor in the auction market compared to the sales market, the auction sale rate can be restructured as follows:

$$ \text{Auction Sale Rate}_{t} = \frac{\sum \text{Market Price}_{t} \pm \text{Premium}_{t}}{\sum \text{Appraisal Price}_{t-n}} $$

Now, interpreting the three elements of the auction sale rate as influential factors and transforming it into a linear regression model, it would look like this:

$$ \text{Auction Sale Rate}_{t} = \beta_{0} + \beta_{1}\,\text{EoM}_{t} + \beta_{2}\,\text{EoA}_{t} + \beta_{3}\,\text{EoP}_{t} $$

  • EoM: Effect of Market Price (influence of the general sales market)
  • EoA: Effect of Appraisal Price (influence of court appraised values)
  • EoP: Effect of Price Premium (influence of discount/premium)

To complete this regression model, data for all three variables is needed. The effect of market prices can be substituted with the sales index provided by the Korea Real Estate Board. The sales index should be transformed using a log difference to match the format of the auction sale rate.
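A minimal sketch of that transformation, assuming a monthly index series (the values below are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly apartment sales index (Korea Real Estate Board style)
sales_index = pd.Series([100.0, 100.8, 101.5, 101.2, 102.4],
                        index=pd.period_range("2021-01", periods=5, freq="M"))

# Log difference: approximately the month-over-month rate of change,
# which puts the index on the same return-like scale as the auction sale rate
eom = np.log(sales_index).diff().dropna()
print(eom)
```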

A major challenge lies in obtaining data for the other two variables. First, acquiring court-appraised price data for research purposes is nearly impossible. The focus here isn't on the historical 'appraised price' itself, but rather on how much it influenced each analysis period (typically monthly). This means we need data adjusted to the auction's closing time. However, without digitizing all auction cases nationwide over the past 10 years, this task is virtually unachievable. The unobservable variables are intertwined, resembling a noisy background.

Factor Separation and Extraction

How can we isolate a specific male voice from a noisy mix of sounds? This is where the Fourier Transform comes in. It converts an input signal from the time domain to the frequency domain, separating each individual frequency. By zeroing out the other frequency components and then applying an inverse Fourier Transform, we can recover the target voice alone, effectively filtering out the noise.

In the same way, if we view the auction sale rate as a noisy input signal, we can separate its contributing factors independently. First, by removing the effect of market prices from the auction sale rate using a regression model, we can assume that the residual term contains hidden influences from court appraisals and discount/premium factors. Among the remaining elements in the residuals, we can assume the two strongest factors are the court appraised price and discount/premium. Fourier Transform can then be used to extract these two independent signals.
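The sketch below illustrates this two-step idea on synthetic monthly data (the series, cycle lengths, and noise levels are all made up, and this is not the paper's actual code): regress out the market effect, keep only the two strongest non-zero frequencies of the residual, and invert them back to the time domain.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical monthly inputs: auction sale rate and log-diff market index
rng = np.random.default_rng(0)
n = 120
eom = rng.normal(0, 0.01, n)
auction_rate = (0.9 + 2.0 * eom
                + 0.05 * np.sin(2 * np.pi * np.arange(n) / 12)
                + 0.03 * np.sin(2 * np.pi * np.arange(n) / 6)
                + rng.normal(0, 0.005, n))

# Step 1: remove the market-price effect and keep the residual
resid = sm.OLS(auction_rate, sm.add_constant(eom)).fit().resid

# Step 2: keep only the two strongest non-zero frequencies of the residual
spec = np.fft.rfft(resid)
power = np.abs(spec)
power[0] = 0.0                      # ignore the constant (zero-frequency) term
top2 = np.argsort(power)[-2:]       # indices of the two dominant cycles

components = []
for k in top2:
    mask = np.zeros_like(spec)
    mask[k] = spec[k]
    components.append(np.fft.irfft(mask, n=n))  # one extracted signal per cycle
appraisal_comp, premium_comp = components       # interpretation follows the text

# Verification step: regress the auction sale rate on all three assumed factors
X = sm.add_constant(np.column_stack([eom, appraisal_comp, premium_comp]))
print(sm.OLS(auction_rate, X).fit().rsquared_adj)
```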

Table 1. Changes in Regression Coefficients and $R^2$ When Adding Variables

This assumption can be statistically verified. As shown in the table, when regressing the auction sale rate using the three initially assumed variables—two components extracted by Fourier Transform and the market price data—the adjusted R-squared is about 94%. In other words, the auction market can be explained by these three factors (market price, court appraisal, and discount/premium). Additionally, the ACF/PACF plot of the residuals after Fourier extraction (see figure below) shows no significant remaining patterns.

Figure 1. ACF/PACF Plot (No Significant Patterns in Residuals)

Through the Fourier Transform, I was able to resolve both the limitations of the auction sale rate as a time-series data and the issue of relying on external data. I successfully extracted the two remaining factors (court appraisal and discount/premium) from the residuals after removing the effect of market prices.

However, I must caution that using the Fourier Transform on general asset market data, like stocks or bonds, is risky. This method is only applicable to data with consistent cycles. Unlike price or sales indices, the auction sale rate exhibits cyclic movements between roughly 80% and 120%, driven by economic and market conditions, which is what allows the transform to be applied without introducing major distortion.

Court Appraisal Extraction

The two factors extracted through the Fourier Transform are currently only assumptions, believed to represent court appraised value and the discount/premium factor. Therefore, it is necessary to accurately verify if these factors are indeed related to court appraised values and discount/premium factors. First, I analyzed two aspects using around 2,600 auction cases:

  • The average time gap between the court appraisal date and the auction closing date.
  • The relationship between court appraisal prices and KB market prices at the appraisal date.

The time gap between appraisal and auction ranged from 7 to 11 months (within the 25% to 75% range), and the relationship between the court appraisal price and KB market price showed a Beta coefficient of 1.03, indicating almost no difference. Based on these two results, I reached the following conclusions:

  • There is a lag relationship between court appraisal prices and market prices (lag = time gap).
  • The lag variable of market prices can substitute for court appraisal prices.

Regression analysis showed that the lag variable of the sales index and court appraisal had about 54% explanatory power. This confirmed that the court appraisal component extracted via Fourier Transform could function as an actual court appraisal. Additionally, when comparing how well the lag variable and court appraisal component explained the auction sale rate, the appraisal component (50%) outperformed the lag variable (20%).

Discount/Premium Extraction

Next, I tested the discount/premium component, the core of this study, from two angles. First, whether the component extracted by the Fourier Transform can function as a discount/premium factor, and second, what the true identity of this component is.

For verification, I applied a sigmoid function to the discount/premium component to produce an on/off effect (0/1).
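A minimal sketch of that on/off transformation (the steepness and threshold are hypothetical tuning choices, not values from the paper):

```python
import numpy as np

def on_off(component, scale=50.0, threshold=0.5):
    """Squash a zero-centered component with a sigmoid and binarize it (0/1)."""
    sig = 1.0 / (1.0 + np.exp(-scale * component))
    return (sig > threshold).astype(int)

# Hypothetical usage: flag months where the discount/premium component is 'active'
# active_months = on_off(premium_comp)
```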

Figure 2. Auction Sale Rate: Winning Bid Rate, SIG2 = Discount/Premium Component (by Fourier)

I attempted to compare this component with various data available from sources like the National Statistical Office, but I couldn't find any data showing similar patterns. The reason for this was simpler than expected.

Figure 3. Month-over-Month and Two-Month Differences in Winning Bid Rate (v1, v2) vs Discount/Premium Component (SIG2)

The auction market is dependent on the sales market. Most macroeconomic variables we know likely influence housing prices, which have already been removed from the regression model. Therefore, the remaining factors are likely unique to the auction market, independent of sales prices. The variable that shows a similar pattern to the discount/premium component is the month-over-month and two-month differences in the winning bid rate, as shown in the figure.

The Nature of the Discount/Premium

To summarize the analysis so far: after excluding the effects of market prices and court appraisals, the remaining factor in the auction sale rate is the discount/premium factor. This factor exhibits a similar pattern to the month-over-month fluctuations (volatility).

In other words, if past volatility explains what the 'sales price' and 'court appraisal' couldn't, it suggests that the auction market has a discount/premium factor driven by volatility (the difference in past winning bid rates). As I will explain later, I have named this component the 'momentum factor,' believing it to explain trends.

Cluster Characteristics of the Momentum Factor

As we delve deeper into this analysis, it's essential to recognize that auction market dynamics are not static but evolve over time, necessitating a more adaptive model to track these changes effectively.

Unlike Ordinary Least Squares (OLS) regression, which assumes a fixed beta coefficient, the Kalman Filter's state-space model allows the beta coefficient to change over time. By tracking this time-varying coefficient, we can observe how the influence of different variables fluctuates over various periods. To analyze the 'momentum factor' in greater detail, I applied the Kalman Filter to assess whether the beta coefficient indeed varies over time.
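A minimal numpy sketch of such a time-varying-coefficient regression, where each beta follows a random walk (the noise variances `q` and `r` are hypothetical tuning choices, not values from the paper):

```python
import numpy as np

def kalman_tv_beta(y, X, q=1e-5, r=1e-3):
    """Kalman filter for y_t = x_t' beta_t + e_t, with beta_t = beta_{t-1} + w_t.

    q: random-walk (state) variance, r: observation variance.
    Returns the filtered beta_t path, one row per observation.
    """
    n, k = X.shape
    beta = np.zeros(k)          # state estimate
    P = np.eye(k)               # state covariance
    Q = np.eye(k) * q
    betas = np.zeros((n, k))
    for t in range(n):
        x = X[t]
        P = P + Q               # predict: random-walk state
        err = y[t] - x @ beta   # one-step-ahead prediction error
        S = x @ P @ x + r       # prediction error variance
        K = P @ x / S           # Kalman gain
        beta = beta + K * err   # update the coefficients
        P = P - np.outer(K, x) @ P
        betas[t] = beta
    return betas

# Hypothetical usage: columns = [constant, market effect, momentum factor]
# betas = kalman_tv_beta(auction_rate, np.column_stack([np.ones(n), eom, premium_comp]))
```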

Consequently, as shown in the figure below, we can observe that the regression coefficient of the momentum factor exceeds that of the sales price regression coefficient in certain intervals. Upon examining these intervals, it becomes clear that the momentum factor exhibits a type of clustering effect.

Figure 4. Market Price vs Discount/Premium
Figure 5. Time-Varying Coefficient (Market Price vs Discount/Premium) (Top), Sensitivity Exceeding Plot (Bottom)

The True Meaning of the Momentum Factor

We need to think more deeply about the "intervals where the momentum factor's sensitivity exceeds market prices." The momentum factor explains the discount/premium. Therefore, when the discount/premium factor significantly impacts the auction sale rate, it suggests that the usual "average relationship" between the sales market and the auction market has been disrupted.

What does it mean when the "average relationship is disrupted"? For example, if the sales and auction markets typically maintain a gap of 10, this disrupted relationship means the gap has shrunk to 5 or expanded to 15. Such situations typically occur during overheated or excessively cooled markets, or just before such conditions arise. When everyone is rushing to buy homes, this can naturally lead to an "overheating" that breaks the usual relationship, which can be interpreted as increased "popularity" in the auction market.

However, one important thing to note is that when the sales market falls, the auction market typically falls too. This is because the market price has the largest influence on changes in the auction sale rate. In other words, even if the momentum factor is inactive, the auction market can rise or fall in response to the sales market. Therefore, the "activation of the momentum factor" doesn't necessarily indicate price increases or decreases.

The discount/premium factor is ultimately defined as the effect of market prices + '@'. The sensitivity analysis of the discount/premium factor indicates that '@' represents "excessive movement beyond the average." I named this the "momentum factor" because I believe it can detect changes in market sentiment or trends. As seen in Figure 5, the momentum factor tends to signal market trend changes before and after its cluster periods.

I cautiously suggest that when the momentum factor shows excessive movement, it could signal a "bubble" or "cooling" sign. Further exploration of this idea is beyond the scope of this discussion, but it certainly warrants future research.

Focus on Logic Over Technique

The reason I wrote this paper wasn't because I majored in real estate or specialized in the field. Most of my work in recent years involved changing systems to enable data-driven decision-making, one of which was loan screening, and another was related to real estate.

I understand that definitions of data science vary from person to person. However, for me, data science was the perfect tool for solving business problems. That’s why I chose a topic that was considered insurmountable in practice and applied the knowledge I learned in school.

The aspect I want to highlight in this paper is not the technical side but the logical one. The techniques used—regression analysis, Fourier transformation, and the Kalman filter—are not particularly advanced for graduate-level science and engineering. There was also an incentive to avoid using non-linear pattern matching techniques like ML/DL, which are unsuitable for financial data requiring clear interpretations. For me, it was more important to choose the method suited to the problem, and nothing more. The key was how to logically solve and approach this issue.

I believe that in solving business problems, logic should come first, and technology is just a tool. This is my ideal approach, and I wanted to keep the paper's concept simple, yet logically solid.

The Gap Between Business and Research

When I started researching for this paper, I remember thinking, "What problem should I try to solve?" My obsession with problem-solving came from the belief that there is a gap between the worlds of business and research, a bias I developed through experience.

As a practitioner, I think that in most fields, decisions are still largely based on subjective judgment rather than data. Furthermore, I know that many industries face challenges in successfully adopting data analysis systems, and I personally experienced this. While each field has its own circumstances, I believe one key reason for this gap is the disconnect between research and business.

From the industry perspective, I often felt that many research results focused on the study itself, neglecting "real-world applicability." On the other hand, from an academic perspective, I found that business often relied too heavily on subjective decisions, ignoring the complexities of the real world.

Bridging the Gap

Thus, the real intent of this paper was to bridge the gap between business and research, however small that contribution may be. I wanted to be a "conceptualizer" who actively uses data analysis to solve business problems. In this sense, I believe this paper sits somewhere between research and business. Throughout the writing process, I fought hard against the temptation to get lost in academic curiosity, focusing instead on practical applicability.

The quality and results of the paper will be judged by reviewers or proven in real-world industries, not by me. However, I anticipate that my future work will also be positioned between these two worlds. Connecting these two domains is an incredibly fascinating challenge. To view the article in Korean, please click here.


Hyeyoung Park (MBA AI/BigData, 2023)


Ⅰ. What if we could detect a real estate bubble?

In 2021, the real estate market entered a recession as the bubble burst. The government is hastily preparing policies to stimulate the market, but they don't seem to be working as planned. However, if we could detect the real estate bubble in advance, wouldn't it be possible to prevent the market from entering a recession?

The impact of the real estate bubble

The impact of the real estate bubble was immense. Apartment prices recorded the largest decline and Seoul's overall housing sale prices experienced the biggest drop since the subprime mortgage crisis.

Along with the decline in real estate prices, real estate transactions have also decreased as financial authorities sharply raised the base rate to reduce liquidity. In Seoul, homeowners are even offering to cover maintenance fees, moving costs, and luxury bags to promote leases and sales, but unlike before, transactions are not happening as actively. As seen in Figure 1, the number of apartment transactions in Seoul from January to September 2022 fell below 10,000, a decrease of 73.7% compared to the previous year.

Many experts are concerned that the real estate recession could directly lead to a shock in the economy. To prevent this, they argue that the current real estate policies must be swiftly overhauled with new taxation policies suited to the era of high interest rates. They think the comprehensive real estate tax and capital gains tax are too burdensome, and coupled with high loan interest rates, real estate transactions are not occurring actively. Therefore, they suggest that easing real estate taxes and removing regulations are necessary to boost transaction volumes.

On the other hand, hastily changing real estate policies could be risky. The massive liquidity and low-interest rate environment caused by the global pandemic led to speculative behavior among people in their 20s and 30s, trapping many young adults in debt. Considering this reality, indiscriminately easing or removing real estate regulations could be dangerous. Therefore, the government should maintain appropriate levels of regulation to curb speculation and ensure that housing opportunities are available for real homebuyers.

Governments have alternated between these two approaches to find policies that would minimize the impact of housing bubbles. However, they have been unable to find an ideal solution that satisfies everyone. Excessive regulation risks violating the basic market principle that prices are determined by supply and demand, while too much leniency can lead to market disruptions such as speculation and over-leveraging. It's extremely difficult to strike a balance in policy-making that satisfies everyone's needs.

What if we could identify a real estate bubble in advance?

Many people fail to recognize a bubble in the real estate market until prices crash. This is because accurately measuring the intrinsic value of real estate is difficult. As a result, most market participants mistake rising property prices for an increase in intrinsic value and get swept up by the decisions of others, leading to a bubble. When the bubble bursts, the inflated asset prices drop, which can lead to increased household debt, large-scale bad debts for financial institutions, and, in severe cases, an economic recession.

However, what if we could detect a real estate bubble in advance? It could help resolve many of the concerns mentioned above. The government would be able to identify overheating in the real estate market early and take action before the bubble negatively impacts the real economy. Moreover, given the high level of global interconnectedness today, a bubble in one country can have significant effects on the global economy, making the prediction of bubbles increasingly important.

In this article, I will explain the steps taken to identify factors related to a real estate bubble and statistically verify them. Specifically, we will examine whether the "winner’s curse", often cited as a cause of real estate overheating, truly corresponds to a bubble by using regression analysis and statistical testing.

II. The History of Bubbles and the Reasons They Recur

A bubble refers to a phenomenon where the price of a specific asset significantly exceeds its intrinsic value due to excessive demand. This typically occurs when the economy becomes overheated.

When a bubble bursts, it leads to massive losses for investors and delivers a significant blow to financial institutions whose main business is providing mortgage loans. This could result in systemic risk across the financial market.

The History of Bubbles

Bubbles have historically repeated themselves multiple times in the global financial market. Examples include the Dutch Tulip Bubble of the 1630s, the South Sea Bubble in Britain during the 1720s, the Japanese real estate and stock market bubble of the 1980s, the Dot-com Bubble of the 1990s, and the U.S. housing bubble of the 2000s.

Let’s first take a closer look at the Japanese real estate bubble. In the early 1980s, as the yen surged and Japan's trade situation worsened, the Japanese government implemented monetary policies to stimulate the economy. With increased liquidity in the market, speculation was fueled, leading to a bubble between 1985 and 1989, during which the value of Japanese stocks and urban land tripled. At the peak of the real estate bubble in 1989, the value of the Imperial Palace grounds in Tokyo exceeded the total real estate value of the state of California. Ultimately, the bubble burst in 1991, leading to Japan's prolonged economic stagnation, known as the "Lost Decade."

Next, let's look at the U.S. subprime mortgage crisis. After the dot-com bubble burst, many investors, learning from the experience, shifted money into real estate, which was considered a relatively safe asset. As a result, U.S. housing prices nearly doubled between 1996 and 2006. Additionally, as interest rates dropped, people rushed to buy homes using mortgage loans. However, to control the skyrocketing housing prices, the U.S. government sharply raised interest rates, leading to a wave of defaults by subprime borrowers who were unable to repay their loans. This turned mortgage-backed securities into worthless assets and pushed banks and other financial institutions to the brink of collapse.

The U.S. subprime mortgage crisis had a significant impact on South Korea as well. Major hedge funds and investment banks, including prominent U.S. financial institutions like Bear Stearns and Lehman Brothers, faced bankruptcy. Learning from this, foreign investors began favoring safer assets, causing significant volatility in the Korean foreign exchange market. As the interest rate spread between the U.S. and South Korea widened, the carry trade became widespread, leading to negative effects on both the domestic financial market and the real economy in Korea.

The Reasons Bubbles Recur

The main method of detecting a real estate bubble is by examining the ratio between the money supply and the market capitalization of apartments. Generally, when the money supply increases, the value of money declines, leading to a rise in apartment prices. However, if the gap between the growth rates of the money supply and apartment market capitalization becomes unusually large, it may indicate a bubble. This suggests that apartment prices are rising independently of the available money supply. In November 2021, when the bubble was at its peak, the ratio of apartment market capitalization to the money supply soared to 147%. This shows that apartments were highly overvalued compared to their intrinsic value.

So why do bubbles continue to recur in cycles? Despite experts consistently presenting objective indicators, such as the one mentioned above, and warning about signs of a real estate bubble, why do people persist in risky behaviors like over-leveraging ("all-in" borrowing) and speculative investments ("betting with borrowed money")?

Robert Shiller, the 2013 Nobel Prize winner in economics, argued in his renowned book "Irrational Exuberance" that most market participants do not fully understand the true nature of the market. He further stated that people often don’t even care about why the market might be undervalued or overvalued. In such an environment, people’s investment decisions are heavily influenced by easily accessible information. In other words, instead of conducting deep quantitative and qualitative analysis, most investors are drawn to shallow, hearsay-like information, leading them to make decisions that are closer to gambling.

The core principle of a bubble can be summarized in one phrase: the "herd effect". In these situations, the independence of individuals breaks down, leading to irrational decisions made collectively.

III. Characteristics of the Real Estate Auction Market in Korea

In this section, let's explore why we should examine the auction market as a tool for predicting bubbles in the real estate sales market.

Recently, the real estate sales market has suffered a transaction freeze. The auction market typically comes into the spotlight when the real estate market enters a downturn. Due to the bleak outlook, competition for successful bids significantly decreases, and the bid price ratio — the ratio of the final bid to the appraised value — also drops noticeably. This, in turn, increases the incentive for investors.

In fact, the auction market appears to be reviving. According to a report published by the court auction specialist firm, Gigi Auction, the number of apartment auctions nationwide in October 2022 was 1,472, marking an upward trend after recording 1,330 cases in June of the same year. Additionally, the nationwide apartment bid price ratio in October 2022 was 83.6%, only a 0.5 percentage point increase from 83.1% in September of the same year, which was the lowest level since 2019. Considering that appraised values are like market prices, this suggests that recently, prices in the auction market are lower than in the sales market.

Savvy investors are turning their attention to the real estate auction market to exploit niche opportunities. If they can properly analyze the real estate market and identify undervalued areas, they will be able to fully enjoy excess returns in the auction market.

Bubbles can also occur in the auction market

A question naturally arises: Can bubbles occur in the auction market just like in the sales market? To answer this, we need to understand the real estate auction market in Korea.

In Korea, auctions are widely perceived as a way to buy real estate at a lower price. However, one peculiar aspect of the Korean auction market is the frequent occurrence of the so-called "Winner's Curse," where the winning bidder ends up overpaying due to overestimation of the property's value or intense competition. This phenomenon is often attributed to Korea's unique real estate auction system, which sets it apart from those in other developed countries.

Korea’s real estate auction system employs a sealed-bid auction and a first-price auction format. In a sealed-bid auction, bidders cannot see the prices offered by others, ensuring independence between participants. In the first-price auction, the highest bidder wins and pays the amount they submitted. Bidders aim to submit a price lower than the market value but higher than their competitors, so the bid prices generally don’t vary greatly. Unless the bidder is a stakeholder, such as a tenant or creditor, it is rare for someone to submit an overwhelmingly higher bid than others.

However, if current real estate prices do not accurately reflect intrinsic value, or if expectations of future price increases take hold in the market, the situation changes. Market participants, anticipating excess returns, will flood into the auction market, leading to intense competition. As a result, the gap between the winning bid and the second-highest bid will widen significantly. Moreover, as bidders are driven by herd effect and inflate bid prices, this phenomenon closely mirrors the bubble seen in the real estate sales market.

In addition, the real estate auction market is known to precede the sales market. As we observed earlier, when real estate prices begin to rise, properties listed in the sales market often move to the auction market at lower prices, activating the auction market. Therefore, if we can detect a bubble in the auction market through data analysis, it could also serve as an indicator to identify a bubble in the real estate sales market.

Ⅳ. Bubble index: Price Differences between the 1st and 2nd Place Bids in Auction Market

Literature review

Previous studies on predicting real estate market prices or auction winning bids have employed the Hedonic Price Model and Time Series Model. The Hedonic Model is a regression model based on the assumption that the price of a good is the sum of the quantities of its inherent characteristics. In prior research, the focus was on increasing accuracy by adding as many variables as possible that could represent the characteristics of real estate. However, adding a large number of variables poses a risk. Including unnecessary variables without sufficient validity can lead to multicollinearity, which increases the variance of the estimates and results in unreliable outcomes. This is also the reason why research using the Hedonic Price Model has not been conducted since the mid-2010s.

Data and variable selection

In this study, we aim to address the issues found in previous research by introducing the price differences between the 1st and 2nd place bids in the auction market as an indicator of a bubble and statistically verifying this index.

In this study, the data consists of the quarterly transaction volumes from 2014 to 2022 for the Gangnam and Nowon districts. These areas were deemed most suitable for the study, as Gangnam and Nowon are the regions in Korea with the most active transaction and bidding activities.

Additionally, while this study uses the Hedonic Price Model as a foundation, it introduces some modifications. The dependent variable is the corrected winning bid rate (y), while the independent variables include the number of unsuccessful bids (FB_NUM), the number of bidders (BD_Num), the bubble index (Index_5), and the M2 money supply (M2), distinguishing the variable selection from previous studies.

To elaborate on the dependent variable, the corrected winning bid rate, the original winning bid rate is calculated by using the court appraised value (typically the KB market price) as the denominator and the winning bid as the numerator. However, there is a time gap of about 7 to 11 months between the appraisal and the winning bid. Therefore, this study adjusts the court appraised value, which is the denominator in the traditional winning bid rate, to reflect the market price at the time of the winning bid using the following formula.

Figure 3. $S_p$: KB market price at the time of the winning bid, $S_{p-t}$: KB market price at the time of appraisal.
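The exact correction is given in the figure above; as a rough sketch, assuming the appraised value is simply rescaled by the KB price ratio $S_p / S_{p-t}$, the corrected rate could be computed as follows (this functional form is my assumption, not necessarily the paper's exact formula):

```python
def corrected_bid_rate(winning_bid, appraisal, kb_at_bid, kb_at_appraisal):
    """Corrected winning bid rate, assuming the appraisal is scaled up or down
    to the KB market price level at the time of the winning bid (S_p / S_{p-t})."""
    adjusted_appraisal = appraisal * (kb_at_bid / kb_at_appraisal)
    return winning_bid / adjusted_appraisal

# e.g. a 900M won bid on a 1,000M appraisal, with KB prices up 10% since appraisal
print(corrected_bid_rate(900, 1000, 1100, 1000))  # ~0.818 instead of the raw 0.90
```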

To further explain the independent variables in this regression model: first, the number of unsuccessful bids (FB_NUM) serves as a control variable that captures auction risk factors. A high number of unsuccessful bids indicates that the auction price is set higher than the market price, leading participants to forgo the auction. Consequently, this suggests a higher likelihood that the next auction will also fail. Second, the number of bidders (BD_Num) refers to the number of people who participated in the auction. As market overheating (a bubble) occurs in the auction market, the number of bidders tends to increase, making it a potential indicator of a bubble. Third, the price difference between the 1st and 2nd place bids has been explained several times already, so it is omitted here. Lastly, the M2 money supply (M2) is included in the model to reflect the general tendency for an increase in the money supply to coincide with sharp rises in real estate prices.

Exclusion of the Intrinsic Value of Real Estate and the Necessity of the Chow Test

There are numerous factors that determine real estate prices, such as school districts, job opportunities, apartment floor levels, proximity to roads, building age, apartment structure, and transportation convenience. These elements that influence intrinsic value are extensive and complex. Therefore, in the regression model for the corrected winning bid rate, I applied a logarithmic transformation to eliminate the intrinsic value, allowing the focus to remain solely on the auction characteristics, which is the main purpose of this study. The detailed process is as follows.

Figure 4. $V_i$: Market price (including intrinsic value), $X_{ik}$: Intrinsic characteristics of real estate, $Z_{im}$: Auction characteristics of real estate, $A_i$: Appraised price (including market price), $B_i$: Winning bid (intrinsic characteristics + auction characteristics of real estate)

This paper also assumes that a structural break occurs when a bubble forms, and conducts a Chow test to verify this. The assumption is that during periods of market overheating, such as a bubble, the market will behave differently due to irrational investment sentiment compared to other periods. The Chow test is a statistical test used to determine whether there has been a "structural shock or change" by comparing the regression coefficients of two linear regression models before and after a specific point in time-series data.
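A minimal sketch of a Chow test at a known candidate break point, computing the classical F statistic from pooled and split OLS fits (the variable names and break index in the usage comment are hypothetical):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def chow_test(y, X, break_idx):
    """Chow test F statistic for a structural break at position break_idx."""
    X = sm.add_constant(X)
    k = X.shape[1]
    ssr_pooled = sm.OLS(y, X).fit().ssr
    ssr_1 = sm.OLS(y[:break_idx], X[:break_idx]).fit().ssr
    ssr_2 = sm.OLS(y[break_idx:], X[break_idx:]).fit().ssr
    n = len(y)
    f_stat = ((ssr_pooled - (ssr_1 + ssr_2)) / k) / ((ssr_1 + ssr_2) / (n - 2 * k))
    p_value = stats.f.sf(f_stat, k, n - 2 * k)
    return f_stat, p_value

# Hypothetical usage:
# f, p = chow_test(log_bid_rate, np.column_stack([fb_num, bd_num, index_5, m2]), 226)
```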

Regression Analysis and Chow Test

First of all, I conducted a regression analysis over the entire period. The dependent variable is the corrected winning bid rate (y), and the independent variables are the number of unsuccessful bids (FB_NUM), the number of bidders (BD_Num), the price difference between the first and second bidders (Index_5), and the M2 money supply (M2).

Figure 5. Regression Analysis Results for the Entire Period

As shown in Figure 5, the adjusted R-squared is 0.774, and all independent variables are statistically significant. One notable point is that in the regression model for the entire period, the coefficient for Index_5 (the price difference between the first and second bidders) is 0.039, indicating that it has a relatively small impact on the winning bid rate compared to the other independent variables.

Next, let's identify the structural break point through the Chow test. As shown in Figure 6, it can be statistically confirmed that a structural break occurred at point 226 (Q2 2016).

Figure 6. Structural Break at Points 226 and 321 (left) / Distribution Differences of the Dependent Variable (Log of Winning Bid Rate) Before and After the Structural Break (right)
Figure 7. Chow Test Statistics at the Structural Break Points

Let's divide the data into two regression models based on the structural break at point 226 and examine whether there is a significant change in Index_5 (the bubble index) before and after the structural break.

Figure 8. Regression Analysis Results Before the Structural Break(above), Regression Analysis Results After the Structural Break(below)

As seen in Figure 8, Index_5 was not statistically significant at the 0.05 level before the structural break, but after the break, Index_5 (t-stat = 2.613, p-value = 0.01) became statistically significant. This indicates that the influence of the bubble index (Index_5) increased significantly at the point where the actual bubble occurred (the structural break point).

Figure 9. Changes in the Regression Coefficient of Index_5 from 2014 to 2022

Figure 9 also shows that Index_5 began to fluctuate significantly around the structural break point at 226. After point 226, there was a notable increase in Index_5, reflecting the overheated real estate market at that time, as liquidity in the low-interest-rate environment flooded into the Gangnam reconstruction market and new apartment developments in Q2 2016.

For this reason, I argue that the price difference between the first and second bidders can be used as a bubble indicator. By utilizing this metric, we can prevent the bubble from inflating further and take action before the bubble bursts unexpectedly, leading the real estate market into a downturn.


Analyzing the U.S.-China trade conflict using Comparative Advantage Theory and the Cobb-Douglas Production Function


Recently, the U.S.-China conflict has intensified. Beginning with trade restrictions on China under the Trump administration in 2018, the Biden administration, which took office in 2021, has continued to take bold steps aimed at domestic manufacturing recovery and tightening control over China. These efforts include bipartisan legislation such as the Infrastructure Investment and Jobs Act and the CHIPS and Science Act. As a result, the global industrial landscape is undergoing significant restructuring.

At one time, U.S. manufacturing was unrivaled, but it gradually declined under the pressure of "Triffin's Dilemma," a result of the dollar's status as the global reserve currency and increasing global competition. However, a recent wave of U.S.-China "decoupling" has shifted the comparative advantage of capital and labor between the two countries. As a result, U.S. manufacturing is showing signs of revival, with a sharp increase in employment. In contrast, China's once-explosive manufacturing growth is shrinking due to the aggressive trade sanctions imposed by the U.S. Meanwhile, America's IT and financial sectors are facing a wave of layoffs, bringing cold news to industry workers.

This column reinterprets the newly evolving industrial structures of the United States and China by applying the Heckscher-Ohlin theorem, an international trade theory based on comparative advantage in economics, the Cobb-Douglas production function, a mathematical tool of microeconomics, and regression analysis from statistics.

Heckscher-Ohlin Theorem and Cobb-Douglas Production Function

Before examining the U.S.-China conflict, it is essential to introduce the theoretical tools that will aid in its interpretation. First, the Heckscher-Ohlin theorem is a theory in international economics which states that trade occurs between countries due to differences in their factor endowments and the varying factor intensities required to produce different goods. The theorem builds upon David Ricardo's theory of comparative advantage but introduces a slightly different perspective. While Ricardo’s theory suggests that comparative advantage arises from differences in technological capabilities, the Heckscher-Ohlin theorem asserts that comparative advantage and trade stem from differences in factor intensities, such as labor (L) and capital (K), which are used in the production process. For example, China has an abundance of labor, while the U.S. is rich in capital and technology. As a result, the two countries trade labor-intensive goods for capital-intensive goods.

The Heckscher-Ohlin theorem can be expressed in the form of a Cobb-Douglas production function, which is commonly used in economics. This function illustrates the relationship between factor intensities (L for labor, K for capital) and output (Y) and is widely used to analyze how the inputs of labor and capital contribute to factor productivity in a country or a specific industry. It helps to determine how much productivity is generated when labor and capital are employed in production.

Estimating the "Elasticities" of the Cobb-Douglas Production Function through Regression Analysis

Let's formulate the Cobb-Douglas production function for a country, where the total output is denoted by $Y_i$, labor input by $L_i$, capital input by $K_i$, and the remainder of the output that cannot be explained by labor and capital (the residual) is represented by $\exp(u)$, as follows:

$$ Y_i = \exp(\beta_0) \cdot L_i^{\beta_L} \cdot K_i^{\beta_K} \cdot \exp(u) \quad \cdots ~(1) $$

In the Cobb-Douglas production function, "elasticities" or "factor productivities" are represented by $\beta_ L$ and $\beta_K$, and these can be easily estimated through a slight transformation of the equation. First, by taking the logarithm of both sides of equation 1, it transforms into equation 2.

$$ \log{Y_i} = \beta_0 + \beta_L \log{L_i} + \beta_K \log{K_i} + u \quad \cdots ~(2) $$

Equation 2 takes a familiar form. It is a linear regression equation where $\log{Y}$ is the dependent variable and $\log{L}$ and $\log{K}$ are the independent variables. By taking the partial derivatives of $\log{Y}$ with respect to $\log{L}$ and $\log{K}$, we obtain $\beta_L$ and $\beta_K$. In microeconomics, these regression coefficients are the output elasticities of the factors of production, denoted as $e_{LY}$ and $e_{KY}$.

$$
\beta_L = \cfrac{\partial \log{Y}}{\partial \log{L}} = \cfrac{dY}{Y} \cdot \cfrac{L}{dL} = \cfrac{dY/Y}{dL/L} = e_{LY}
$$

$$
\beta_K = \cfrac{\partial \log{Y}}{\partial \log{K}} = \cfrac{dY}{Y} \cdot \cfrac{K}{dK} = \cfrac{dY/Y}{dK/K} = e_{KY}
$$

By using OLS (Ordinary Least Squares) estimation in equation 2, we can get $\beta_ L$ and $\beta_K$, allowing us to quantitatively analyze how elastically the output of a given country increases when additional labor and capital are input.
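A minimal sketch of that estimation, using made-up annual series for output, labor, and capital (the numbers carry no empirical meaning):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical country-level series: output (Y), labor (L), capital stock (K)
df = pd.DataFrame({
    "Y": [100, 104, 109, 115, 122],
    "L": [50, 51, 52, 53, 54],
    "K": [200, 210, 222, 235, 250],
}, dtype=float)

# Log-log OLS: the coefficients on log(L) and log(K) are the elasticities
X = sm.add_constant(np.log(df[["L", "K"]]))
model = sm.OLS(np.log(df["Y"]), X).fit()
print(model.params)  # const ~ beta_0, L ~ beta_L, K ~ beta_K
```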

Trade Arising from Differences in the Regression Coefficients

In summary, by estimating the $\beta_ L$ and $\beta_K$ of the Cobb-Douglas production function, we can quantitatively assess the labor and capital productivity of each country. Based on this Cobb-Douglas production function, the reasons for trade between countries can be explained using the Heckscher-Ohlin theorem.

For example, let’s assume that South Korea is relatively capital-abundant, while Chile is labor-abundant. In this case, smartphones are considered capital-intensive goods, requiring more capital than labor, while wine is labor-intensive, requiring more labor than capital. Thus, in the production of smartphones, $\beta_ K$ would be greater than $\beta_ L$, whereas in the production of wine, $\beta_ L$ would be greater than $\beta_ K$. According to the Heckscher-Ohlin theorem, due to differences in factor intensities ($\beta_ L$, $\beta_K$), South Korea would have a comparative advantage in producing capital-intensive smartphones at a lower cost, while Chile would have a comparative advantage in producing labor-intensive wine at a lower cost, leading to trade.

At this point, we have all the necessary tools for analysis. However, before diving in, let's first take a look at the recent shifts in the industrial landscape of the U.S. and China due to their ongoing conflict.

Increase in U.S. Manufacturing Employment Rate

The U.S. export-based industries, particularly manufacturing, consistently recorded trade deficits and gradually declined under the petrodollar system that began in the mid-to-late 1970s. This decline accelerated further during the 1980s under Ronald Reagan's administration, as the consecutive oil shocks triggered domestic economic recessions, leading to a steep downfall of labor-intensive manufacturing industries. To counter this, the government implemented tax cuts, expanded public spending, and made massive defense expenditures to curb the recession. Subsequently, the U.S. experienced nearly 40 years of low interest rates and low inflation, and until the onset of the COVID-19 pandemic, the manufacturing employment rate had been steadily recovering.

However, in the wake of COVID-19, the economy contracted once again, leading to the overnight shutdown of major automobile assembly plants and dealerships. Automobile manufacturing came to a sudden halt, and in the food industry, numerous reports emerged of workers contracting and dying from the virus, prompting factories to shut down and causing a massive loss of jobs. In response, the U.S. government implemented a large-scale quantitative easing of $4.5 trillion, which helped boost employment rates, and the manufacturing sector began to recover.

Amidst these developments, the U.S., under the leadership of President Biden, aggressively pushed forward the Bipartisan Infrastructure Law (BIL) and the CHIPS and Science Act. These initiatives aimed to strengthen the domestic economy by overhauling bridges, roads, and rural areas across the country. As a result, not only did the U.S. curb the growth of China's semiconductor industry, a key sector of the Fourth Industrial Revolution, but it is also now moving to reshape the global semiconductor landscape with the U.S. at its center.

Buoyed by this political momentum, the manufacturing sector has continued its upward trend. According to the Financial Times, as of June, U.S. manufacturing employment has increased by nearly 800,000 since President Biden took office, with around 13 million people now employed in the sector, the highest level since the 2008 global financial crisis. As shown in Figure 1, employment rates rose sharply during key periods such as the pandemic-era quantitative easing and the passage of the Bipartisan Infrastructure Law and CHIPS and Science Act.

Figure 1. Trend in U.S. Manufacturing Employment / Source: Bureau of Labor Statistics

The Grim U.S. Financial and Tech Sectors

In stark contrast to the manufacturing sector, which is gearing up for significant growth, major U.S. tech companies have been undergoing large-scale layoffs since the latter half of last year. Leading the wave of employment cuts are global IT giants such as Amazon and Meta, with many other U.S. tech firms following suit. According to Layoffs.fyi, a site that tracks layoffs in the U.S. tech industry, 1,058 IT companies laid off a total of 164,709 employees last year alone. Notably, in November, Amazon laid off 10,000 employees, while Meta cut 11,000 jobs.

The layoffs at major tech companies have continued into this year. Following last year's cuts, Amazon has laid off an additional 17,000 employees so far in 2023. While Apple announced plans to reduce its operating budget to avoid restructuring, some reports suggest that the company began cutting staff in April, starting with its retail team at the U.S. headquarters in California.

Meanwhile, the U.S. financial sector has not been spared from the wave of layoffs. In May, CNBC reported that major Wall Street investment banks such as Morgan Stanley, Bank of America, and Citigroup carried out significant job cuts. Morgan Stanley laid off around 3,000 employees by the end of June, amounting to 5% of its total workforce based in New York. Additionally, Citigroup and Bank of America also announced the dismissal of hundreds of employees in May, bringing grim news to the financial industry.

China's Declining Manufacturing Sector Amid the U.S.-China Conflict

China, which is in direct hegemonic competition with the U.S., is experiencing significant losses in its manufacturing sector. This is due to the intensifying "decoupling" between the two countries, driven by the Biden administration’s domestic-focused initiatives such as the Bipartisan Infrastructure Law and the CHIPS and Science Act, which aim to strengthen U.S. technological sanctions on China. Additionally, ongoing tariff disputes between the two nations have further escalated the situation.

China's manufacturing growth has significantly slowed due to geopolitical tensions with the U.S. As shown in Figure 2, China's secondary industry (related to manufacturing) began to decline starting in 2021, when the U.S. started implementing parts of the Bipartisan Infrastructure Law as part of its efforts to strengthen domestic competitiveness and achieve self-sufficiency.

In the same context, data released by China’s General Administration of Customs on July 13 showed that China's export value in June was $285.3 billion, a 12.4% decrease compared to the same month last year. This figure falls short of both the previous month's -7.5% and the market expectation of -9.5%.

An Explanation of the U.S.-China Conflict Based on the Heckscher-Ohlin Theorem

Through the discussion so far, we can see that between 2021 and 2022, when national efforts toward "self-sufficiency" and "decoupling" between the U.S. and China began to intensify, the two countries started to take divergent paths in manufacturing. Additionally, we've observed that U.S. sectors such as Wall Street and Big Tech are facing a wave of layoffs, leading to a general downturn in the financial and IT industries.

Here, we can raise a few questions. Why have the industrial structures of the U.S. and China evolved in opposite directions? For instance, why have the Bipartisan Infrastructure Law and the CHIPS and Science Act, aimed at reducing U.S. dependence on China, revitalized U.S. manufacturing while leading to the decline of China's manufacturing sector? Could U.S. and Chinese manufacturing not have risen together when the U.S. decided to focus on strengthening its domestic market? And why, seemingly unrelated, have the U.S. financial and IT sectors faced a downturn?

Let's view the U.S.-China conflict and the resulting changes in industrial structure through the lens of the Heckscher-Ohlin theorem. From 1970 to 2022, the U.S. saw its export competitiveness in manufacturing decline, partly due to the dollar's status as the global reserve currency, and instead focused on capital-intensive industries like finance and IT. It is also important to acknowledge that the flow of global capital into the U.S. as the reserve-currency issuer played a significant role in this shift. During this period, the U.S. held a comparative advantage in "technology" or "capital," whereas China, leveraging its cheap labor, focused on labor-intensive industries built on its primary sector, giving it a comparative advantage in "labor." By concentrating on what each did best, the U.S. and China grew their economies by trading technology-intensive and labor-intensive goods with one another.
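To make the factor-abundance logic concrete, here is a minimal sketch in Python. The endowment figures are invented purely for illustration (they are not real data); the point is only that the Heckscher-Ohlin prediction follows mechanically from comparing capital-labor ratios.

```python
# Toy illustration of the Heckscher-Ohlin logic above.
# The endowment numbers are made up for exposition; they are not real data.
endowments = {
    "US":    {"capital": 100.0, "labor": 50.0},
    "China": {"capital": 60.0,  "labor": 120.0},
}

# Relative factor abundance is read off each country's capital-labor ratio.
ratios = {c: e["capital"] / e["labor"] for c, e in endowments.items()}

capital_abundant = max(ratios, key=ratios.get)
labor_abundant = min(ratios, key=ratios.get)

print(f"{capital_abundant} is relatively capital-abundant (K/L = {ratios[capital_abundant]:.2f}); "
      "Heckscher-Ohlin predicts it exports capital-intensive goods (e.g., finance, IT services).")
print(f"{labor_abundant} is relatively labor-abundant (K/L = {ratios[labor_abundant]:.2f}); "
      "Heckscher-Ohlin predicts it exports labor-intensive manufactured goods.")
```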

However, with the recent passage of laws such as the Bipartisan Infrastructure Law and the CHIPS and Science Act, the U.S.-China technological competition has intensified, leading to a resurgence in U.S. manufacturing. Additionally, as labor costs have declined in the U.S., the comparative advantage between the U.S. and China has started to shift. With the relative price of labor compared to capital decreasing in the U.S., resources have gradually been reallocated toward manufacturing. As a result, capital-intensive industries like finance and IT have seen a reduction in labor input. In contrast, China’s labor-intensive manufacturing sector has been shrinking due to aggressive U.S. sanctions, causing the relative price of labor to rise compared to capital.

Changes in the U.S.-China Industrial Structure Due to Differences in the Regression Coefficients

Let's revisit the U.S.-China conflict through the lens of factor productivity (elasticity) in the Cobb-Douglas production function. The total output of country $i$ is denoted by $Y_i$, labor input by $L_i$, capital input by $K_i$, and the remainder of output that cannot be explained by labor and capital (the residual) by $\exp(u_i)$. The Cobb-Douglas function can be expressed as follows. Here, assume there are only two countries, where $i$ represents either the U.S. ($U$) or China ($C$).

$$
Y_i = \exp(\beta_0^i) \cdot L_i^{\beta_L^i} \cdot K_i^{\beta_K^i} \cdot \exp(u_i)
$$
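Taking logs turns this into a linear regression, $\ln Y_i = \beta_0^i + \beta_L^i \ln L_i + \beta_K^i \ln K_i + u_i$, so the elasticities can in principle be estimated by OLS on a panel of, say, sector-year observations for a given country. The sketch below is only a mechanical illustration on synthetic data generated from known betas, not an estimate on actual national accounts.

```python
# Minimal sketch: estimating Cobb-Douglas elasticities by OLS after log-linearization.
# The data are synthetic (generated from known betas) purely to show the mechanics;
# a real estimate would use national-accounts or sector-level panel data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
L = rng.uniform(50, 500, n)        # labor input (synthetic)
K = rng.uniform(100, 1000, n)      # capital input (synthetic)
beta_0, beta_L, beta_K = 1.0, 0.6, 0.4
u = rng.normal(0.0, 0.1, n)        # residual (TFP shocks, measurement noise, ...)
Y = np.exp(beta_0) * L**beta_L * K**beta_K * np.exp(u)

# ln Y = beta_0 + beta_L * ln L + beta_K * ln K + u
X = sm.add_constant(np.column_stack([np.log(L), np.log(K)]))
fit = sm.OLS(np.log(Y), X).fit()
print(fit.params)   # recovered estimates of [beta_0, beta_L, beta_K]
```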

Before the petrodollar system and the resulting large trade deficits took hold, the U.S. of the pre-1980s likely had higher values for both $\beta_L^U$ (labor elasticity) and $\beta_K^U$ (capital elasticity) than China. The reason is that the U.S., with its robust manufacturing sector centered on secondary industries, would have had higher labor productivity than China, which in the early stages of its economic growth was still focused on the primary sector and light industries. Additionally, the U.S. had significantly raised productivity through capital investments in mechanization and automation, suggesting that its capital productivity was also higher than that of China, which at the time relied more heavily on manual labor.

On the other hand, from the 1980s until just before the COVID-19 outbreak, the U.S. saw its manufacturing sector decline under the petrodollar system, while China experienced explosive growth due to continuous market reforms and its entry into the WTO. As a result, in the manufacturing sector, both $\beta_L^U$ (labor elasticity) and $\beta_K^U$ (capital elasticity) in the U.S. likely became lower than those of China during this period.

However, between 2021 and 2022, as the U.S. took aggressive measures to counter China, its manufacturing sector began to revive. As a result, in manufacturing, the U.S.'s $\beta_L^U$ (labor elasticity) and $\beta_K^U$ (capital elasticity) are now catching up with those of China.

Applying the Heckscher-Ohlin theorem, it appears that due to the recent U.S.-China conflict, the U.S. has not only gained an absolute advantage in both $\beta_L^U$ (labor elasticity) and $\beta_K^U$ (capital elasticity) over China, but there are also signs that $\beta_L^U$ may be surpassing $\beta_K^U$. This indicates that labor is shifting toward the manufacturing sector in the U.S., while in China, trade sanctions are causing $\beta_L^C$ (labor elasticity) to decline, making $\beta_K^C$ (capital elasticity) relatively larger.
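One hedged way to put this claim to data would be to let the elasticities differ before and after 2021 via interaction terms, then test whether the post-2021 labor elasticity exceeds the capital elasticity. The sketch below runs on synthetic data (the `post2021` dummy and all inputs are invented), so it illustrates only the testing mechanics, not an actual empirical result.

```python
# Sketch: do the elasticities shift after 2021? Synthetic data, illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "lnL": rng.normal(5.0, 0.5, n),        # log labor input (synthetic)
    "lnK": rng.normal(6.0, 0.5, n),        # log capital input (synthetic)
    "post2021": rng.integers(0, 2, n),     # 1 = observation from 2021 onward
})
# Generate output with a labor elasticity that rises after 2021 (by construction).
beta_L = 0.45 + 0.15 * df["post2021"]
beta_K = 0.50 - 0.05 * df["post2021"]
df["lnY"] = 1.0 + beta_L * df["lnL"] + beta_K * df["lnK"] + rng.normal(0.0, 0.1, n)

# Explicit interaction columns keep the parameter names simple for the test below.
df["lnL_post"] = df["lnL"] * df["post2021"]
df["lnK_post"] = df["lnK"] * df["post2021"]

fit = smf.ols("lnY ~ lnL + lnK + post2021 + lnL_post + lnK_post", data=df).fit()
print(fit.params)
# Wald/t test: is the post-2021 labor elasticity larger than the capital elasticity?
print(fit.t_test("lnL + lnL_post - lnK - lnK_post = 0"))
```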


GSB Yearbook

The Gordon School of Business (GSB) is the educational branch of the Swiss Institute of Artificial Intelligence (SIAI). While SIAI now serves as a hub for global conferences and high-level research, GSB continues the Institute’s original educational mission: training the next generation of thinkers capable of applying deep statistical reasoning to real-world complexity. This Yearbook offers a rare glimpse into that mission in action—not through official reports or faculty summaries, but through the voices of the students themselves.

A Window Into Transformation

From the outset, GSB’s intent in producing this Yearbook was simple yet ambitious: to let students tell the story of how their thinking has evolved over the course of their studies. We hoped to capture a vibrant mosaic of personal growth—moments of struggle, breakthroughs, and even the emotional highs and lows that inevitably accompany rigorous academic training.

In reality, many students responded with something more restrained. Rather than dramatic narratives, they chose quiet reflection, focusing on the substance of their dissertations and the precision of their methods. To some, this may seem understated. Yet to those familiar with the transformation our students undergo, these reserved voices speak volumes. Their very restraint signals a maturity born of rigorous training: an ability to think deeply, write clearly, and ground ideas in evidence rather than flourish.

Beyond Code and Models

One common misconception about our program is that it produces graduates armed solely with lines of code and algorithms. This Yearbook aims to challenge that view. The public often equates “artificial intelligence” with mystical black-box technologies. What our students discover, and what these pages reveal, is that AI at its core is computational statistics—tools and frameworks that unlock patterns in data and transform how we understand the world.

By the end of the program, students move beyond chasing models for their own sake. They learn to ask deeper questions: What is the cause behind this pattern? How does this insight translate into real-world impact? How can rigorous theory coexist with the messy realities of human systems? The diversity of topics in this Yearbook—spanning finance, health, policy, and beyond—reflects not only the breadth of AI’s applications but also the versatility of the thinkers GSB aims to cultivate.

The School’s Philosophy

GSB is founded on the motto Rerum cognoscere causas—“to seek the causes of things.” This principle runs through every lecture, seminar, and case study. Our teaching begins with theory: statistical frameworks, mathematical reasoning, and computational rigor. But theory alone is never enough. Students are challenged to apply their knowledge to real-world data, confront ambiguous problems, and produce outcomes that are both scientifically robust and practically relevant.

This philosophy shapes a particular kind of graduate. Our alumni are not limited to any single industry or role; they are versatile minds who can move seamlessly between disciplines, bridging technical insight and strategic thinking. In an era where data touches every domain, this breadth is not optional—it is essential.

Reading the Yearbook

As you read this Yearbook, you will notice that the tone is often modest. These are not glossy marketing pieces. They are personal reflections on work that was, for many students, the most demanding intellectual challenge of their lives. Some entries may feel technical; others may feel restrained. But beneath the surface lies something remarkable: evidence of how much their thought processes have changed.

A student who once saw AI as code now sees it as a lens for understanding reality. A student who once sought quick answers now wrestles with root causes. A student who once worked in isolation now frames their work in conversation with society. These subtle shifts mark the true success of GSB’s educational model.

A Quiet Invitation

In sharing this Yearbook, we do not simply showcase dissertations or celebrate milestones. We invite you—whether you are an academic, an industry professional, or simply curious—to witness the growth of minds in motion. This is what a year of intense study at GSB produces: not just technical skill, but a way of thinking that endures long after the program ends.

We hope you read these pages not only for their content, but for what they represent: a generation of students who have learned, in their own quiet ways, to seek the causes of things.

