This article is based on ideas originally published by VoxEU – Centre for Economic Policy Research (CEPR) and has been independently rewritten and extended by The Economy editorial team. While inspired by the original analysis, the content presented here reflects a broader interpretation and additional commentary. The views expressed do not necessarily represent those of VoxEU or CEPR.
Keith Lee is a Professor of AI and Data Science at the Gordon School of Business, part of the Swiss Institute of Artificial Intelligence (SIAI), where he leads research and teaching on AI-driven finance and data science. He is also a Senior Research Fellow with the GIAI Council, advising on the institute’s global research and financial strategy, including initiatives in Asia and the Middle East.
Sleep. It's something we all need but often take for granted. As people start to realize just how important it is for our health and well-being, the question of how we can detect and understand our sleep states becomes more critical. This paper takes a closer look at that question, breaking it down into five key sections that will guide us toward better solutions and deeper understanding.
The paper starts by looking at accelerometer data for sleep tracking. This method is popular because it’s non-intrusive and works well with wearable devices for continuous monitoring. It explains how Euclidean Norm Minus One (ENMO, standardized acceleration vector magnitude)-based metrics can be a simple alternative to complex medical exams. Next, it reviews current research, highlighting the strengths and weaknesses of different methods. It also points out gaps in the accuracy and reliability of existing models.
Building on the insights gained from the review, the paper then addresses specific challenges, such as sleep signal variability and irregular sleep intervals. It outlines data preprocessing techniques designed to manage these issues, thereby improving the robustness of sleep state detection. To achieve this, a novel likelihood ratio comparison methodology is introduced, which aims to increase generalizability, ensuring effectiveness across diverse populations. Lastly, the paper concludes by acknowledging the limitations of the current study and proposing future research directions, such as incorporating additional physiological signals and developing more advanced machine learning algorithms.
Sleep Tracking Based on Accelerometer Data
According to the National Health Insurance Service, 1,098,819 patients visited hospitals for sleep disorders in 2022, a 28.5% increase from 855,025 in 2018. As the number of sleep disorder patients rises, interest in high-quality sleep is also growing. However, since the causes and characteristics of sleep disorders vary among patients, there is a burden of needing different treatment methods and diagnostic tests.
Patients suspected of having sleep disorders usually undergo detailed diagnosis through polysomnography. This test involves various methods, including video recording, sleep electroencephalogram (EEG, using C4-A1 and C3-A2 leads), bilateral eye movement tracking, submental EMG, and bilateral anterior tibialis EMG to record leg movements during sleep.
Polysomnography has its limitations. Patients must visit specialized facilities, and it's only a one-time session. As a result, there's increasing demand for tools that offer more convenient and continuous sleep monitoring.
Measuring Movement Using Accelerometer Data
Recently, health management through wearable devices has become increasingly common, enabling real-time data collection. Wrist-worn watches can monitor activity levels, and for sleep measurement, both an accelerometer sensor and a photoplethysmography (PPG) sensor are typically used.
The accelerometer sensor tracks body movements, while the PPG sensor uses light to measure blood flow in the wrist tissue, which helps measure heart rate. Although using data from both sensors could improve the accuracy of sleep measurement, this study only uses accelerometer data due to limitations on data usage. The reasons for this decision will be explained further on.
The accelerometer data consists of three axes, as shown in the figure below [1].
The $x$-axis represents changes in the direction of movement horizontal to the ground, the $y$-axis shows changes in the lateral direction of movement (e.g., how much the arms swing to the sides), and the $z$-axis indicates changes in the vertical direction of movement (peaking when the legs cross over during walking). It is important to understand that what each axis captures depends on the sensor's reference axes; if these reference axes change, movement is mainly reflected in whichever axis shows the biggest change in values.
The graph below shows an example of 3-axis data [4]. This graph shows how the measurements change when walking with the arms swinging compared to walking with the arms held still. The changes in the $x$, $y$, and $z$ axes represent changes in the mean values, and it can be seen that the signal shown in green, when the arms are fixed, has the most significant variation. Therefore, accelerometer data can vary for the same action if the sensor's position or reference axes change.
Making Useful Variables Through Transformation
To address the problem of axis readings changing when the sensor's orientation shifts, it is important to convert the data into straightforward yet informative variables. Many studies use summary metrics (or summary measures), which combine the $x$, $y$, and $z$ axis values into a single value, thereby reducing the impact of changes in sensor orientation.
Examples of summary metrics include Euclidean Norm Minus One (ENMO, standardized acceleration vector magnitude), Vector Magnitude Count (VMC), Activity Index (AI), and Z-angle (wrist angle). Let’s take a closer look at ENMO and Z-angle, as they relate to the signal data from wearable devices discussed earlier.
As shown in the accelerometer diagram in Figure 1, when interpreting the dynamic acceleration of sensor data, it is important to account for the effect of gravity (g). The ENMO variable therefore takes the Euclidean norm of the three axis values and subtracts gravitational acceleration, leaving only the dynamic component. This can be expressed mathematically as follows.
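The formula itself does not appear in the text; based on the standard definition used in the accelerometry literature (e.g., [2]), ENMO can be written as

$$\mathrm{ENMO} = \max\!\left(\sqrt{a_x^2 + a_y^2 + a_z^2} - 1g,\ 0\right),$$

where $a_x$, $a_y$, $a_z$ are the three axis readings in units of $g$; negative values are truncated to zero so that only dynamic acceleration above gravity remains.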
The Z-angle is a summary metric for the wrist angle, which can be considered as the angle of the arm relative to the body's vertical axis. It can be expressed using the following formula.
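The formula is likewise missing from the text; the commonly used definition (typically computed on rolling medians of the raw axis values) is

$$Z\text{-angle} = \arctan\!\left(\frac{a_z}{\sqrt{a_x^2 + a_y^2}}\right)\cdot\frac{180}{\pi},$$

which expresses, in degrees, how far the sensor's $z$-axis is tilted away from the horizontal plane.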
To gain an intuitive understanding, let's look at the actual ENMO values measured in real-life situations. Figure 3 below is a table summarizing the ENMO measurements during daily activities [2].
When standing, the ENMO was measured at an average of 1.03g, while it increased to 10.3g during everyday walking. This clearly demonstrates that the ENMO value is lower with minimal movement and rises as activity levels increase. In other words, since humans do not always move at a constant speed like robots, activity levels can be measured using acceleration.
While it may appear that raw $x$, $y$, and $z$ axis data offers more information due to its detail, this study seeks to demonstrate that condensing this information into a single summary metric doesn't significantly impact our ability to accurately estimate sleep states.
Additionally, a basic model revealed that excluding Z-angle data does not result in significant information loss. When we used a tree model to evaluate the explanatory power of variables with statistical metrics from both Z-angle and ENMO, the ENMO variables were found to be much more important. In fact, all of the top 10 most important variables were related to ENMO. Since the importance of Z-angle variables was significantly lower, this study will focus on using ENMO as the primary basis for addressing the problem.
Review of Previous Studies
Existing Methodologies Focused on Optimization
Earlier, we explored several summary measurement variables, such as ENMO, VMC, AI, and Z-angle. More recently, research has been focused on identifying new summary measurements, like MAD (Mean Absolute Deviation), using axis data collected from accelerometer sensors. This kind of variable transformation requires advanced domain knowledge, and the process of validating these reduced variables is complex.
In previous studies, various summary measurements were investigated, and temporal statistics—such as overlapping or non-overlapping deviations, averages, minimums, and maximums at one-minute intervals—were used for classification or detection through machine learning models, heuristic models, or regression models. Figure 4 below summarizes the key methodologies from previous research [5].
Additionally, the evaluation metrics used in sleep research are as follows [2]:
Sensitivity (actigraphy = sleep when PSG = sleep)
Specificity (actigraphy = wake when PSG = wake)
Accuracy: total proportion correct
WASO (Wake After Sleep Onset): the total amount of time spent awake during the sleep period after sleep onset
SE (Sleep Efficiency): the proportion of sleep within the periods labeled by polysomnography
TST (Total Sleep Time): calculated as the sum of sleep epochs per night
Limitations of Optimization and Increased Sensitivity to Changes in Data Patterns
Machine learning models like Random Forests and neural networks, such as CNNs (Convolutional Neural Networks) and LSTMs (Long Short-Term Memory networks), are considered "high complexity" due to their focus on achieving high accuracy. This often results in having a large number of parameters, which increases the risk of overfitting. When overfitting occurs, the model might learn the noise in the training data instead of the actual patterns, especially if there isn't enough data.
As a result, the model's performance can decline when applied to a new dataset. In practical research, therefore, these high-complexity models sometimes struggle to accurately detect the exact moments of falling asleep or waking up: by focusing too much on optimization, they overlook the importance of generalization.
Are simpler models, like regression models, free from optimization issues? Although regression models are generally less sensitive to noise, they rely on the assumption that the data follow a normal distribution. If this assumption is not met, the standard errors of the regression coefficients can be large relative to the coefficients themselves. This inflates the p-values, reducing the significance of the estimates and making the model's results less reliable.
Since sleep data often does not follow a normal distribution, additional optimization is needed for regression models like the Cole-Kripke[3] and Oakley[6] models. While these simpler models may be less accurate with target data compared to machine learning or neural network models, their low complexity and optimized adjustments make them useful as baseline models in research, often used alongside polysomnography.
When users have only recently started using wearable devices, there is often a need to classify sleep states with limited data. Early data may lack representativeness, making it challenging to rely on data-intensive models from the machine learning or deep learning fields. Traditional regression models that require extensive optimization are also not sustainable in these cases. This challenge becomes even more significant when analyzing data from multiple users rather than just one. Therefore, this study aims to introduce data transformation and model transformation methodologies that can improve generalization performance.
Characteristics and Collection Methods of ENMO Data
Before diving into the detailed data preprocessing steps and methodologies that aim to overcome the limitations of previous research, let’s take a closer look at the characteristics of ENMO data.
ENMO signals are collected at 5-second intervals and can be analyzed in combination with sleep state labels assigned through sleep diaries. The criteria for labeling sleep states in the sleep diary are as follows:
Sleep is assumed if the sleep state persists for at least 30 minutes.
The longest sleep period during the night is recorded as the sleep state. However, there is no rule limiting the number of sleep episodes that can occur within a given period. For example, if an individual sleeps from 1:00 to 6:00 and again from 17:30 to 23:30 on the same day, both sleep periods are valid and counted. This approach naturally accommodates different sleep patterns, such as early morning and evening sleep, which can be influenced by work schedules.
To help with understanding, let's take a look at the sample data in the graph below. This data was collected over approximately 30 days from one individual, specifically looking at their Z-angle and ENMO signals. Sleep periods are marked as 0, active periods as 1, and -1 indicates cases where the label values in the sleep diary are missing due to device or recording errors.
As expected, we can observe a noticeable periodicity as the sleep periods (0, in red) and active periods (1, in green) alternate. While not everyone exhibits the same sleep pattern, the overarching cycle of sleep and wakefulness remains consistent. Therefore, this study focuses on generalized data transformations to better distinguish between sleep and wake cycles. In the following section, we introduce modeling methods that prioritize generalization.
The Z-angle data also showed a cyclic pattern. However, as mentioned earlier, ENMO data is significantly more important than Z-angle data and results in less information loss. Therefore, in the following methodologies, only the ENMO variable was used.
Considering Variability of Sleep Signals and Irregularity of Sleep Intervals
Interestingly, even during sleep, there are small fluctuations. This occurs because sleep consists of different stages, as many people know. These stages are usually classified based on the criteria shown in the diagram below.
In the previous study by Van Hees, sleep stages were classified based on the same diagram. The concept of sleep stages suggests that body movements vary depending on the stage, which can cause subtle fluctuations in sleep signals. As shown in the ENMO data in Figure 5, the $y$ values during sleep periods (indicated by red bars) are not uniform.
It naturally occurred to me that if the signals during sleep periods could be processed into more consistent signals, detecting sleep states would become easier. The goal of stabilization is not to eliminate sleep signals entirely but to preserve their characteristics while maintaining relatively stable values compared to the variance in the raw data.
Building on this idea, generalization should remain achievable even when the amount of tossing and turning varies between individuals and across users. While it might be tempting to skip over these complex preprocessing steps, doing so would be unwise. Previous studies have often overemphasized optimization, which can lead to problems like overfitting; to prevent this, regularization techniques (which can be framed as constrained optimization via Lagrange multipliers) are commonly used.
In this study, we aimed to develop a methodology with superior generalization performance by processing the data based on insights gained from a more detailed analysis of the data characteristics and modifying the model accordingly. I hope this discussion helps convey the importance of having a solid rationale in the data preprocessing stage to build a reliable model.
Stabilizing Sleep Signals Through Data Transformation
The initial approach to stabilizing sleep signals focused on removing outliers with standard filtering techniques. An approach based on the Fast Fourier Transform (FFT) was employed, specifically the Power Spectral Density (PSD). The PSD is useful for analyzing how the squared FFT magnitude ($|\mathrm{FFT}|^2$) is distributed across frequency bands.
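As a rough illustration only, the sketch below shows how a PSD could be computed for the 5-second ENMO series with SciPy's Welch estimator; the 0.2 Hz sampling frequency follows from the 5-second interval mentioned later in the text, and the study's exact outlier-filtering rule is not reproduced here.

```python
import numpy as np
from scipy import signal

FS = 1.0 / 5.0  # ENMO is sampled every 5 seconds -> 0.2 Hz

def enmo_psd(enmo: np.ndarray):
    """Estimate the power spectral density (squared FFT magnitude per Hz) of an ENMO series."""
    freqs, psd = signal.welch(enmo, fs=FS, nperseg=min(len(enmo), 4096))
    return freqs, psd

# Example: find the frequency carrying the most power in one synthetic day of data
rng = np.random.default_rng(0)
enmo = np.abs(rng.normal(0.02, 0.01, size=17_280))  # 17,280 five-second epochs = 24 hours
freqs, psd = enmo_psd(enmo)
print(f"dominant frequency: {freqs[np.argmax(psd)]:.5f} Hz")
```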
However, after applying PSD to the ENMO signal data, we found that 99.8% of the entire dataset remained, failing to achieve the intended stabilization of sleep-specific signals. As shown in Figure 6, the variability of the processed ENMO signals (indicated by the red bars) during sleep periods (the area between the red and green lines, in that order) remained evident.
To see if smoothing the data would resolve the issue, a Kalman filter was applied. Despite incorporating covariance from previous data, it failed to stabilize the sleep signals. As shown in Figure 7, much of the variability in the processed ENMO signals during sleep periods persisted. Additionally, the Kalman filter performed poorly in detecting sleep states and had higher computational costs compared to PSD, mainly due to the use of covariance from prior data.
Finding Periodicity in Irregular Intervals
There was an aspect that was overlooked during the initial data transformation process. We missed one of the most important characteristics of the given data: even for a single user, the times of falling asleep and waking up are not consistent. Therefore, this time we used the Lomb-Scargle periodogram, a method designed to detect periodic signals in observations with uneven spacing.
Figure 8 below visualizes the data after applying the Lomb-Scargle periodogram. Although the signals in the sleep periods appear almost uniform due to the long duration of the entire dataset, zooming in on specific intervals reveals that while the variability has been reduced, the characteristics of the signals have been preserved as much as possible.
Beyond applying this to a single ID value, we also examined the results for all IDs without missing data. The dominant frequency showed a linear pattern with power. Therefore, for frequencies observed within the regression line, we determined that filtering using the typical values within the linear range would not significantly reduce the accuracy of the predictions.
Signal processing methods like the Lomb-Scargle periodogram, which align with the principles of FFT, need sufficient data to detect periodic patterns. In the case of ENMO data, at least a full day must pass to observe a complete cycle of sleeping and waking.
Thus, if the sample period for each ID was less than 3–5 days or there were many device omissions, filtering was done using the dominant frequency data from the training data. When there was at least about 5 days of data available, filtering was applied individually based on each ID.
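A minimal sketch of this step, assuming SciPy's `lombscargle` implementation; the candidate period range (6-48 hours) is an illustrative choice rather than the study's exact setting.

```python
import numpy as np
from scipy.signal import lombscargle

def dominant_frequency(t_seconds: np.ndarray, enmo: np.ndarray,
                       periods_h: np.ndarray = np.linspace(6, 48, 500)) -> float:
    """Dominant frequency (in cycles per hour) of an unevenly sampled ENMO series.

    t_seconds : timestamps in seconds; gaps (e.g., device not worn) are allowed
    enmo      : ENMO values observed at those timestamps
    """
    # Angular frequencies corresponding to candidate periods of 6-48 hours
    omega = 2.0 * np.pi / (periods_h * 3600.0)
    # Lomb-Scargle handles uneven spacing directly; precenter removes the mean first
    power = lombscargle(t_seconds, enmo, omega, precenter=True)
    best_period_h = periods_h[np.argmax(power)]
    return 1.0 / best_period_h

# Per the text: IDs with roughly 5+ days of data get their own dominant frequency,
# while shorter or gap-heavy IDs fall back to the dominant frequency of the training data.
```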
Likelihood Ratio Comparison
In the previous section, we explored how data can be transformed as a method of generalization. In this section, we will examine how model transformation can improve generalization performance.
Sleep and Awake Period Distributions
When examining the ENMO signal data after applying the Lomb-Scargle periodogram, neither the awake nor the sleep period data exhibited a uniform distribution.
As shown in Figure 10, the distribution shapes of the two periods also differ. Notably, there is a distinct difference in the shape of the distribution peaks: the joint distribution peak during sleep periods forms a smooth curve, whereas the peak during awake periods appears more angular.
Interestingly, upon closer inspection, despite the different distribution shapes, the peak values for each ID are clustered around 0 on the x-axis, whether it’s during sleep or awake periods.
Similarly, in both Figure 11a (entire dataset) and Figure 11b (9% of the entire dataset), the peak values of the sleep and awake distributions did not change significantly. Additionally, the peak values in Figure 12, which shows 800 randomly sampled observations from Figure 11b, also showed minimal variation.
Using Sleep and Awake Period Distributions
We examined whether the peak values of the distribution functions were different or the same. This was part of an effort to apply a likelihood ratio (LR) comparison method by utilizing the distribution information of sleep and awake periods to generalize the sleep state detection method. If the distributions are known, approaching the problem using Maximum Likelihood Estimation (MLE) is the most appropriate method. Similarly, we aimed to model based on the Likelihood Ratio (LR) by using the information from the sleep and awake distributions.
Sleep and awake distributions may not follow commonly known probability density functions (e.g., Gaussian, Poisson, etc.) and are often irregular. As an alternative, we used distributions derived from kernel density estimation (KDE). Kernel density estimation involves creating a kernel function centered on each observed data point, summing these, and then dividing by the total number of data points. Typically, the optimal kernel function is the Epanechnikov kernel, but for computational convenience, the Gaussian kernel is frequently used. In this study, we also used the Gaussian kernel.
First, let's explain how the LR method was applied using equations: $LR = \frac{L_{1}(D)}{L_{0}(D)}$.
The likelihood ratio can be calculated for each data input point, where $L_ {0} (D)$ represents the likelihood of the data under the null hypothesis, indicating a higher probability of being a sleep signal. Conversely, $L_{1} (D)$ represents the likelihood under the alternative hypothesis, indicating a higher probability of being an awake signal. If the LR is greater than a threshold, it suggests that the data is more likely under the alternative hypothesis (awake signal).
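A minimal sketch of this KDE-plus-likelihood-ratio step, assuming SciPy's Gaussian KDE; the threshold of 1.0 is an illustrative default rather than the study's tuned value.

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_state_kdes(enmo_sleep: np.ndarray, enmo_wake: np.ndarray):
    """Fit Gaussian-kernel density estimates for the sleep and awake ENMO distributions."""
    return gaussian_kde(enmo_sleep), gaussian_kde(enmo_wake)

def predict_awake(enmo_new: np.ndarray, kde_sleep, kde_wake,
                  threshold: float = 1.0, eps: float = 1e-12) -> np.ndarray:
    """Label an epoch as awake when LR = L1(D) / L0(D) exceeds the threshold.

    L0: likelihood under the sleep distribution (null hypothesis)
    L1: likelihood under the awake distribution (alternative hypothesis)
    """
    l0 = kde_sleep(enmo_new) + eps  # eps guards against division by zero
    l1 = kde_wake(enmo_new) + eps
    return (l1 / l0) > threshold    # True -> predicted awake, False -> predicted asleep
```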
Figure 13 visualizes the results of sleep state detection using the above generalization methodology for a single ID. In the graph below, you can see that the lowermost graph finds as many points as possible between the first activity signal point (moment of waking up) and the last point (moment of falling asleep) after data transformation.
From a computational efficiency standpoint, the likelihood ratio (LR) method is also advantageous. When measuring computation time, it was observed that data transformation occurs simultaneously with data input, allowing the LR results to be produced quickly. Processing 39,059 data points all at once took about 7 seconds. For one day's worth of data for 10 users (17,280 data points), it took around 1 minute and 40 seconds in total.
As expected, this method, which depends on the estimated distributions, cannot make predictions for periods when the device was not worn and no data were recorded. However, visual checks using Figure 13 showed that the method works well in detecting signals without label values, as long as the device was actually being worn.
To assess the robustness of the model, this study proposes using the time difference between predicted values and label values as a performance metric. Since the likelihood ratio method introduced above focuses on generalization, we determined that traditional evaluation metrics from existing sleep research, which are geared toward optimization, would not be applicable.
New Evaluation Metric for Assessing Model Robustness
When comparing the time difference between the model's predicted values and the label values, the model tends to predict the moment of falling asleep earlier than the actual label values and the moment of waking up later than the actual labels. To understand the cause of this, we applied the LR method to the raw, unprocessed ENMO signals to detect sleep states.
As shown in Figure 14a, even when using unprocessed ENMO signals, the tendency for predictions to be early or late remained the same. This suggests that these tendencies are inherent to the collected ENMO signals themselves. It is expected that if additional data, such as pulse rate or other complementary information, is used in the future, the time difference (time diff) could be reduced.
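For concreteness, here is a hedged sketch of how the proposed time-difference metric could be computed for a single night; the 0/1 label convention and the one-night-per-array assumption are illustrative, not taken from the study.

```python
import numpy as np

def onset_wake_time_diff(times, pred_labels, true_labels):
    """Minutes by which predicted sleep onset and wake-up differ from the labels (one night).

    times       : numpy datetime64 array, one entry per 5-second epoch
    pred_labels : predicted 0 (sleep) / 1 (awake) labels per epoch
    true_labels : diary labels with the same convention
    """
    def onset_and_wake(labels):
        sleep_idx = np.flatnonzero(np.asarray(labels) == 0)
        return times[sleep_idx[0]], times[sleep_idx[-1]]  # first and last sleep epoch

    pred_onset, pred_wake = onset_and_wake(pred_labels)
    true_onset, true_wake = onset_and_wake(true_labels)
    minute = np.timedelta64(1, "m")
    return (pred_onset - true_onset) / minute, (pred_wake - true_wake) / minute
```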
Limitations and Future Research Plans
In the previous section, we briefly discussed assessing the performance of the likelihood ratio comparison method by using the time difference between predicted values and label values. To provide a more objective evaluation, we will now also examine the results of applying this method to IDs that were not included in the training data (test set).
The numbers shown above are the average of the standard errors of time differences (within individual standard error, SE) calculated for each ID. Figure 15a shows the results of applying the likelihood ratio comparison model to 10 randomly selected IDs included in the training data, while Figure 15b shows the results using 3 randomly selected IDs that were not included in the training data.
Verifying the Robustness of the Likelihood Ratio Comparison
The results showed that the processed ENMO signals did not exhibit significant performance differences between the training and test data sets, reaffirming the robustness of the generalization-focused methodology. Although the processed ENMO signals displayed understandable levels of variability, the unprocessed original ENMO signals showed a significant increase in average standard error.
Additionally, Figures 15a and 15b highlight the contribution of data transformation to performance improvement. The original ENMO signals, without any preprocessing, had a higher average standard error compared to the processed ENMO signals. This difference was more pronounced in the test set (Figure 15b), where the average standard error for the processed ENMO signal data was reduced by over 20 minutes for both wakeup and sleep onset times. This underscores the importance of investing effort into data preprocessing to enhance generalization performance.
The training set's performance was validated using random samples from 10 different IDs, while the test set included three additional randomly selected IDs not used in the training distribution, further rigorously testing the model's generalization performance. For reference, the total number of nights analyzed across the 13 IDs was approximately 110 days, suggesting that a sufficiently long period was utilized for comparing average standard errors.
Key Research Findings
In summary, this study focused on generalization rather than optimization. Efforts were made to enhance generalization performance, starting from the data transformation stage. Using raw data statistics based on accelerometer data can increase dimensionality and make the data more susceptible to outliers and noise, highlighting the need for data preprocessing. Additionally, since sleep patterns are not consistent, it was necessary to stabilize the data by applying the Lomb-Scargle periodogram, which can detect periodicity in unevenly spaced data.
From a modeling perspective, rather than improving the fit for each individual data point, as is common in traditional machine learning or deep learning models, this study utilized information-rich distributional data. A distribution carries more information than a single summary statistic such as the variance, leading to a structure that is inherently more efficient from a modeling standpoint. As a result, even users who have only recently started wearing wearable devices can benefit from early detection (though at least one hour of data is needed), improving the practical utility of the device.
Furthermore, the LR method offers the advantage of high computational efficiency. Compared to complex models like machine learning or deep learning, as well as traditional models using rolling statistics, the computational efficiency of the LR method is significantly higher. In the same vein, the LR method is easier to maintain. With its lower model complexity and sequential execution of data preprocessing and LR model inference stages, subsequent modifications to the model structure are also straightforward.
Future Research
Currently, only ENMO signal data is used, but incorporating more supplementary variables (e.g., heart rate) is expected to make sleep state detection more refined. Enhancing performance may also be possible by implementing more detailed updates during data preprocessing for each individual ID. In this study, the period for allowing the use of past distribution data and determining the dominant frequency was chosen through basic experiments, but future studies could consider more precise adjustments.
The heterogeneity that exists between individuals should also be considered. Future studies could achieve higher accuracy by analyzing different groups (e.g., those with above-average activity levels vs. those with minimal activity) rather than simply adjusting the current threshold values. Expanding the study population could also contribute more to public healthcare research by reflecting demographic characteristics among individuals, which would be valuable for both business and sleep research perspectives.
Continually expanding the variety of data has great potential for advancing sleep research. For example, the Healthy Brain Network, which provided the data used in this study, aims to explore the relationship between sleep states and children's psychological conditions. This highlights the increasing importance and interest in using sleep state measurements as a supplementary tool for understanding human psychology and social behavior.
Meaningful Inference Amid Uncertainty
Understanding complex issues depends on how the available information is utilized. The data used in this study are signal data, and most signal measurements inherently contain noise, which introduces uncertainty. Moreover, understanding sleep states requires domain expertise, and direct measurement is often difficult. Despite these difficulties, this research made significant efforts to predict human sleep states through indirect measurements or partial observations, aligning with recent advances in wearable devices.
In conclusion, optimization and generalization are naturally in a trade-off relationship. While this paper focused on generalization, the emphasis on optimization should be adjusted dynamically based on how much precision is required from a business perspective. Just as the phrase "one size fits all" is contradictory and almost impossible to achieve perfectly, it is important to recognize that choices must be made depending on the data and the specific context.
[2] Kishan Bakrania, Thomas Yates, Alex V. Rowlands, Dale W. Esliger, Sarah Bunnewell, James Sanders, Melanie Davies, Kamlesh Khunti, and Charlotte L. Edwardson. Intensity thresholds on raw acceleration data: Euclidean norm minus one (ENMO) and mean amplitude deviation (MAD) approaches. PLoS ONE, 11(10):e0164045, 2016.
[3] Roger J. Cole, Daniel F. Kripke, William Gruen, Daniel J. Mullaney, and J. Christian Gillin. Automatic sleep/wake identification from wrist activity. Sleep, 15(5):461–469, 1992. doi:10.1093/sleep/15.5.461. URL https://doi.org/10.1093/sleep/15.5.461.
[4] Marta Karas, Jiawei Bai, Marcin Strączkiewicz, Jaroslaw Harezlak, Nancy W. Glynn, Tamara Harris, Vadim Zipunnikov, Ciprian Crainiceanu, and Jacek K. Urbanek. Accelerometry data in health research: challenges and opportunities. bioRxiv, 2018. doi:10.1101/276154. URL https://www.biorxiv.org/content/early/2018/03/05/276154.
[5] Miguel Marino, Yi Li, Michael N. Rueschman, J. W. Winkelman, J. M. Ellenbogen, J. M. Solet, Hilary Dulin, Lisa F. Berkman, and Orfeu M. Buxton. Measuring sleep: accuracy, sensitivity, and specificity of wrist actigraphy compared to polysomnography. Sleep, 36(11):1747–1755, 2013. doi:10.5665/sleep.3142. URL https://doi.org/10.5665/sleep.3142.
[6] Nigel R. Oakley. Validation with polysomnography of the Sleepwatch sleep/wake scoring algorithm used by the Actiwatch activity monitoring system. Technical report, Mini Mitter Co., 1997.
[7] Matthew R. Patterson, Adonay A. S. Nunes, Dawid Gerstel, Rakesh Pilkar, Tyler Guthrie, Ali Neishabouri, and Christine C. Guo. 40 years of actigraphy in sleep medicine and current state of the art algorithms. NPJ Digital Medicine, 6(1):51, 2023.
I am in my early 40s, work at an office near Magoknaru Station, and live near Haengsin Station in Goyang City. I used to commute by company shuttle, but recently I've taken up cycling as a hobby and now commute by bike. The biggest reason I got into cycling was the positive image I had of Seoul's public bicycle program, Ddareungi.
What Sparked My Interest
One day, I stepped off the shuttle, rubbing my sleepy eyes, and was surprised to see hundreds of green bikes clustered together. I hadn't noticed them before, probably because I’m usually too tired as an office worker, not paying much attention to my surroundings once I get to work. Or maybe it’s just because I’m so groggy in the mornings that the bikes slipped past me. Either way, the sight took me by surprise.
Most Crowded Areas
I often wondered where the many Seoul bikes at the Magok intersection came from. This also sparked my interest in Ddareungi and made me think about researching public bike programs as a topic for my thesis.
As I continued to develop my thoughts, I suddenly wondered, "Is there really a place that uses bicycles more than Magok?" A quick internet search provided the answer. According to the "2022 Traffic Usage Statistics Report" published by the Seoul Metropolitan Government, the district with the highest use of public bicycles (Ddareungi) in Seoul was Gangseo-gu, with 16,871 cases.
Furthermore, according to data released on the Seoul Open Data Platform, the top seven public bicycle rental stations in Gangseo-gu are as follows: ▲ Magoknaru Station Exit 2 with 88,001 cases ▲ Balsan Station near Exits 1 and 9 with 63,166 cases ▲ Behind Magoknaru Station Exit 5 with 59,095 cases ▲ Gayang Station Exit 8 with 56,627 cases ▲ Magok Station Intersection with 56,117 cases ▲ Magoknaru Station Exit 3 with 52,167 cases ▲ Behind Balsan Station Exit 6 with 48,145 cases, etc. I was quite surprised to learn this. The place with the highest use of Ddareungi in Seoul was right here, the Magok Business District, where I commute to work.
During my daily commute, I began to notice more people using bicycles than I had originally thought. Bikes are increasingly viewed as a way to address environmental concerns while also promoting fitness for office workers. Inspired by this trend, I considered commuting by bike myself, like many others in Seoul. However, since I live in a different district, I faced the dilemma of choosing between Goyang City's Fifteen program or Seoul's Ddareungi. During my research, however, I discovered that Goyang City's Fifteen program had been discontinued due to financial losses.
Reasons for Deficits in Public Bicycle Programs
So, I looked into the deficit sizes of other public bicycle programs and found that "Nubija" in Changwon had a deficit of 4.5 billion KRW, "Tashu" in Daejeon had 3.6 billion KRW, and "Tarangke" in Gwangju had a deficit of 1 billion KRW. This showed that most regional public bicycle programs are struggling with deficits. Even Seoul's public bicycle program, "Ddareungi", which I thought was doing well, has a deficit of over 10.3 billion KRW. This made me wonder why public bicycle programs are always in deficit.
At the same time, although Ddareungi is a beloved mode of transportation for the ten million citizens of Seoul, I started to worry whether this program could be sustained in the long run. After looking into the issue, I discovered that the biggest contributor to the deficits in public bicycle programs is the high cost of redistributing the bikes across the city.
For Goyang City, it was estimated that out of a total maintenance budget of 1.778 billion KRW, around 375 million KRW is spent on on-site distribution, and 150 million KRW is used for vehicle operation costs related to redistribution. This means approximately 30% of the total budget goes towards redistribution, making it the largest single expenditure. A similar trend is observed in Changwon City, where redistribution costs also account for a significant portion of the budget. Although this information is not directly about Ddareungi, it suggests that about 30% of the total operating costs of public bicycle programs are likely spent on bicycle redistribution.
This led me to believe that cutting bicycle redistribution costs could be the key to resolving the chronic deficits in public bicycle rental programs. It also made me consider that optimizing redistribution by analyzing Ddareungi users' usage patterns could help reduce these expenses. To achieve this, I needed to analyze the factors influencing rental volume and create a model to predict expected demand, which would help prevent shortages and minimize unnecessary redistribution efforts.
Optimizing Redistribution Through Demand Forecasting
The Ddareungi bike rental data includes bike ID, return time, and station information. To visualize rental volumes by station, additional location data (latitude and longitude) from the Seoul Open Data Plaza was used. Synoptic weather data from the Seoul Meteorological Station was also integrated with the rental records to analyze the impact of weather on bike usage. A detailed analysis of usage patterns was conducted on a four-year dataset (2019-2023) from the Ddareungi station at Exit 5 of Magoknaru Station.
General Usage Patterns
The result showed that bike usage drops with stronger winds and rain but peaks at moderate temperatures (15-17°C). The highest usage occurs during weekday morning and evening commutes. Usage patterns are concentrated in business districts such as Magok, G-Valley, and Yeouido, where most users are in their 20s and 30s. These areas experience imbalances in rentals and returns, especially during commutes.
The general usage patterns were analyzed to forecast bicycle demand and supply. Using the STL (Seasonal and Trend decomposition using Loess) method, rental and return volumes were first decomposed to reveal seasonality, trends, and cycles. The residuals from this decomposition were then applied to a SARIMAX model, incorporating weather and time variables to explain the usage patterns. The model successfully forecasted demand, achieving an R² of 0.73 for returns and 0.65 for rentals.
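A minimal sketch of the STL-plus-SARIMAX pipeline described above, using statsmodels; the (1, 0, 1) order, the hourly aggregation, and the column layout are assumptions rather than the study's exact specification.

```python
import pandas as pd
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.statespace.sarimax import SARIMAX

def fit_rental_model(rentals: pd.Series, weather: pd.DataFrame):
    """Decompose hourly rental counts with STL, then model the remainder with SARIMAX.

    rentals : hourly rental volume for one station (DatetimeIndex)
    weather : exogenous regressors on the same index (e.g., temperature, wind, rain)
    """
    # 1) Remove trend and daily seasonality (period=24 for hourly data)
    stl_result = STL(rentals, period=24).fit()
    remainder = stl_result.resid

    # 2) Explain what is left with weather/time regressors via SARIMAX
    sarimax = SARIMAX(remainder, exog=weather, order=(1, 0, 1)).fit(disp=False)
    return stl_result, sarimax

# A forecast would recombine the STL trend and seasonal components with the
# SARIMAX forecast of the remainder (which requires forecast weather as exog).
```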
Optimization Based on the Rental-Return Index Range
To optimize bike redistribution, the "Rental-Return Index" was introduced to measure the difference between expected rentals and returns at each station.
\[ \text{1-Day Index} = \frac{\text{Estimated Rental Volume}}{\text{Estimated Return Volume}} \]
As shown in the equation above, when a station has the right balance, with neither a surplus nor a shortage of bikes, the Index equals 1. An Index greater than 1 indicates a shortage, while an Index below 1 signifies a surplus. By categorizing stations into surplus or deficit, redistribution efforts can be directed toward stations with shortages (Index greater than 1), improving customer satisfaction.
In addition, this approach is particularly useful because the number of redistribution targets can be quantified based on the available budget for Seoul's bike system. Stations with the highest Index values are prioritized first, and the top stations for redistribution are selected according to the allocated budget, ensuring cost-effective and efficient redistribution efforts.
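A small sketch of how stations could be ranked with this index, assuming pandas Series of expected daily rentals and returns indexed by station ID; the budget parameter is illustrative.

```python
import pandas as pd

def stations_to_restock(est_rentals: pd.Series, est_returns: pd.Series,
                        budget_slots: int) -> pd.Index:
    """Rank stations by the 1-day rental-return index and keep the worst shortages."""
    index = est_rentals / est_returns.clip(lower=1)            # avoid division by zero
    shortages = index[index > 1].sort_values(ascending=False)  # index > 1 means an expected shortage
    return shortages.head(budget_slots).index                  # as many stations as the budget allows
```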
To further optimize bike redistribution, clustering can be applied to group business and residential areas based on rental and return distributions within districts, aiming for a rental-return Index of 1. This method would minimize the distance bikes need to be moved during redistribution, as workers would be assigned to specific teams responsible for managing these clustered regions. In other words, by focusing on areas where the Index is balanced, this approach ensures more efficient redistribution while reducing overall transportation efforts.
Clustering Idea for Implementing Spatial-Temporal Balance
Common Clustering Method
Initially, a K-Means clustering approach was tested to identify areas where the difference between bike rentals and returns was close to zero. By adjusting the number of clusters to match Seoul’s 25 districts, the analysis of June 2023 data showed that clusters with more districts had net volume averages closer to zero, indicating a better balance between rentals and returns. In contrast, smaller clusters with fewer districts exhibited greater imbalance.
Further testing with other clustering methods, such as the Gaussian Mixture Model (GMM), produced results similar to those of K-Means. However, neither method fully captured the underlying bike movement patterns, as these clustering models were unable to account for the dynamic mobility data within the bike-sharing system. This suggested that the algorithms might not be well-suited to the structure of Ddareungi's data, highlighting the need for alternative modeling approaches.
Since Ddareungi’s data reflects bike movements between stations, it is logical to treat these movements as links within a graph, with rental and return stations acting as nodes. By applying a community detection method, clusters can be identified based on the most frequent bike movements. This graph-based approach, which focuses on actual bike movement patterns, could lead to more efficient bike redistribution and yield improved clustering results.
Network Detection Method
The approach involves treating the movement of bikes between rental and return stations as links between nodes, thereby creating a graph. By identifying clusters with the highest number of links, it's possible to detect community divisions where bikes tend to circulate internally. This can significantly enhance the efficiency of bike redistribution across the network.
This is where network community detection comes into play. Community detection is a method that divides a graph into groups with dense internal connections. Applied to Ddareungi data, it helps track rental-return patterns by clustering areas where rentals and returns are balanced. By identifying these clusters, we can detect regions that maintain spatial balance, with more compact clusters reflecting higher modularity.
Modularity measures how densely connected the links are within a community compared to the connections between different communities. It ranges from -1 to 1, with values between 0.3 and 0.7 indicating the existence of meaningful clusters. Higher modularity signifies stronger internal connections, leading to more effective clustering.
To optimize modularity, the Louvain algorithm was tested. This algorithm works in two phases: In Phase 1, nodes are assigned to communities in a way that maximizes modularity. In Phase 2, the network is simplified by merging the links between communities, further refining the structure and improving cluster detection.
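A minimal sketch of this community-detection step, assuming NetworkX 2.8 or later (which ships `louvain_communities`); the trip-aggregation format is an assumption for illustration.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

def detect_station_communities(trips):
    """Cluster stations by actual bike movements using the Louvain algorithm.

    trips : iterable of (rental_station, return_station, n_trips) tuples
    """
    G = nx.Graph()
    for src, dst, n in trips:
        if G.has_edge(src, dst):
            G[src][dst]["weight"] += n
        else:
            G.add_edge(src, dst, weight=n)

    # The two Louvain phases (local modularity-maximizing moves, then community merging)
    # run internally inside louvain_communities.
    communities = louvain_communities(G, weight="weight", seed=42)
    q = modularity(G, communities, weight="weight")
    return communities, q
```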
When applied to Ddareungi data, the Louvain algorithm significantly outperformed K-Means clustering, which relies on Euclidean coordinates. The average net deviation, where 0 is ideal, dropped sharply from 21.19 with K-Means to 9.23 using Louvain, indicating a more accurate clustering of stations. Unlike K-Means, which ignores key geographical features like the Han River, the Louvain algorithm took Seoul's geography into account, resulting in more precise and meaningful clusters.
The following map comparison highlights this difference, showing how Louvain provides clearer cluster differentiation across the Han River, whereas K-Means fails to capture these geographic distinctions.
Understanding the Cycle
I likened Ddareungi bike movement to the flow of water. Just as the total amount of water on Earth remains constant, the total number of Ddareungi bikes stays fixed. This analogy helps conceptualize the system as spatially and temporally closed, where clustering can maintain balance.
Temporal imbalances can be managed by tracking the flow of bikes throughout the day. For instance, business districts experience high demand in the morning but accumulate excess bikes by evening, while residential areas face the opposite situation. Redistribution efforts can be minimized by transferring surplus bikes from business districts to residential areas overnight, before the morning commute begins. After the morning rush, bikes concentrate in business districts but are naturally redistributed as users ride them back to residential areas during the evening commute.
Although there is some uncertainty in the evening, as it's unclear whether users will choose bikes for their return journey, any surplus can still be addressed overnight as part of the regular redistribution cycle. This ensures that before the next morning commute, any leftover bikes in business districts are moved to residential areas as mentioned above. When viewed over a full day, these fluctuations tend to balance out, reducing the need for excessive intervention.
To manage these imbalances more effectively, a rental-return index was used to prioritize stations for redistribution, ultimately reducing operational costs. Additionally, network community detection, particularly through the Louvain algorithm, provided more accurate clustering than previous methods. This approach better reflected Seoul's geography, especially by distinguishing clusters across the Han River, greatly improving redistribution strategies.
By viewing Ddareungi as a system striving for both spatial and temporal balance, shortages can be managed more efficiently. This approach not only optimizes the Ddareungi system but also offers valuable insights for enhancing the management of other shared resource systems.
It's difficult to maintain blood stock at safe levels
South Korea has recorded its lowest birth rate in history. In 2023, the country's total fertility rate was 0.72, raising concerns about various future issues. Among them, the potential blood supply shortage due to low birth rates has come into focus. According to the Korean Red Cross, by 2028, the demand for whole blood donations is expected to exceed supply. Moreover, this gap is anticipated to widen further.
Blood shortages have long been a recurring problem. Especially during the winter season, the lack of blood donors causes hospital staff to worry about whether they can ensure a smooth supply of blood to patients. Despite these concerns, the blood shortage problem continues to worsen.
The Korean Red Cross considers a blood stock of more than five days to be at a "safe level", while a stock of less than five days is regarded as a "shortage". However, past data shows that the number of days the blood stock remains at a safe level has been decreasing.
Why is it difficult to maintain blood stock at a safe level? The reason is that both the supply and the usage of blood are hard to control. Blood is used in medical procedures like surgeries, and reducing its usage would cause significant backlash. On the other hand, blood can only be supplied through donations, meaning supply is limited. Therefore, despite the efforts of the Korean Red Cross, it remains challenging to keep blood stock at a safe level.
Literature Review
This study aims to understand the dynamics of blood supply and usage to help address the issue of blood shortages. Additionally, the study measures the effects of "blood donation promotional activities", one of the key factors in increasing blood supply, and proposes efficient solutions.
Before delving into the analysis, let's review how previous studies have approached blood supply and usage. Blood has the characteristics of a public good, so it's heavily influenced by laws, and blood donation and management systems vary significantly between countries. Therefore, it was deemed difficult to apply research findings from other countries domestically, which is why I focused on reviewing domestic studies.
Yang Ji-hye (2013), Lee Tae-min (2013), Yang Jun-seok (2019), and Shin Ui-young (2021) focused on qualitative analysis by identifying motivations for blood donation participation through surveys. Kim Shin (2015) used multiple linear regression analysis to predict the number of donations by individual donors. However, personal information of donors was used as explanatory variables, and time series factors were not considered, making it difficult to understand the dynamics of blood supply and usage. Kim Eun-hee (2023) studied the impact of the COVID-19 pandemic on the number of donations, but her research had limitations, as it did not account for exogenous variables or types of blood donations. Unfortunately, previous studies did not focus on the dynamics of blood supply and usage, leaving little content to reference for this analysis.
Analysis of Blood Supply Dynamics
Selection of Analysis Subjects
From this section, I will introduce the analysis process. Rather than diving straight into the analysis, I will first clearly define the subjects of analysis. The Korean Red Cross publishes annual blood donation statistics, providing the number of donors categorized by group (age, gender, donation method, etc.). This study utilized that data for the analysis.
There are various types of blood donations. Depending on the method, donations are classified into whole blood, plasma, and platelets & multiple components. First, looking at plasma, approximately 68% of it is used as a raw material for pharmaceutical production, and it has a long shelf life of one year, making imports feasible. Therefore, in the case of plasma shortages, the issue can be resolved through imports, and as such, it is not our primary concern.
Next, platelet & multiple component donation has stricter criteria. Women who have experienced pregnancy are not eligible to donate, and it requires better vascular conditions compared to other types of donations. As a result, the gender ratio of donors is skewed at 20:1, raising concerns about sample bias and making it difficult to derive accurate estimates during analysis. Moreover, unlike whole blood, platelet & multiple component donations are primarily used for specific diseases. For these reasons, this study focuses solely on whole blood donations as the subject of analysis.
After selecting whole blood donations as the subject of analysis, one concern arose: whether to differentiate the data based on the amount of blood collected. The data I received is categorized by 320ml and 400ml amounts. Should I divide the data based on these amounts, just as we divide groups by gender? I decided that it would not be appropriate to make this distinction. Dividing the data by amount would distort the data structure because the amount is not a choice made by the donor but is determined by the donor's age and weight. Since donors cannot choose the amount, the 320ml and 400ml data come from the same distribution, and dividing them would arbitrarily split this distribution. Therefore, in this analysis, I integrated the data categorized by amount of blood collected and defined it as the "number of donors" for the analysis.
The Day-of-the-Week Effect
Now that the analysis target has been clearly defined as the number of whole blood donors, let's begin the analysis. Since the number of donors is time series data, it's important to check whether it shows any seasonality. First of all, it is expected that the number of donors will vary depending on the weekly seasonality, specifically the day of the week and holidays. Let's examine the data to confirm this.
As seen in Figure 3, the number of blood donors is higher on weekdays and relatively lower on holidays. Let's incorporate this information into the model. If the differences between groups in the data are overlooked and not included in the model, omitted variable bias (OVB) may occur, leading to inaccurate results. Therefore, it is important to identify variables that could cause group differences and incorporate them in the model.
It is natural to think that if we are dividing the data by groups, we should also split the data by gender. However, there is no need to group the blood donor data by gender. This is because the purpose of the analysis is to understand the dynamics of the blood supply from the perspective of the entire population. If the goal were to analyze individual donation frequencies, gender would be an important variable. However, since we are examining data for the whole population, there is no need to separate by gender. Additionally, when the number of male and female donors is normalized for mean and variance, they show very similar patterns. For these reasons, we analyzed the data without dividing it by gender.
Next, let's examine how the distribution changes as we divide the blood donor data into groups. Our goal is for the data to follow a normal distribution, since a normal distribution suggests that no systematic unexplained factors remain in the data.
First, let's look at the distribution of the number of blood donors without dividing it into any groups. The distribution shows a bimodal pattern, which indicates that there are still many unexplained factors in the data. Now, let's add the day-of-the-week effect that we discovered earlier to the model and see how the distribution changes. As seen in Figure 5, the distribution of weekday data after removing the day-of-the-week effect is no longer bimodal and has shifted to resemble a bell shape.
The distribution of the data after removing the day-of-the-week effect takes on a bell shape, but the long tail extending to the left is still concerning. We suspected this was due to a concentration of blood donations occurring on days when most donor centers are closed, and we incorporated this into the model. When we plotted the distribution using only data from non-holiday days, like how we removed the day-of-the-week effect, the tail disappeared.
Annual Seasonality
So far, we have identified day of the week and holidays as factors that influence the number of blood donors. Let's express this in a regression equation and check the residuals. If the residuals do not follow a normal distribution, it means there are still unexplained factors affecting the number of blood donors. The regression equation for the number of blood donors based on day of the week and holidays is shown below.
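The equation itself is not reproduced here; a hedged reconstruction consistent with the surrounding description (dummy coding assumed) is

$$ y_t = \beta_0 + \sum_{d=1}^{6}\beta_d\,\mathrm{DOW}_{d,t} + \gamma\,\mathrm{Holiday}_t + \varepsilon_t. $$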
This equation means that the response variable represents the number of whole blood donors, combining both 320ml and 400ml blood donations. The explanatory variables are the day of the week and holidays, which have been included in the equation in the form of dummy variables.
The residuals after removing the day-of-the-week and holiday effects no longer show the unusual patterns from the original data, such as the bimodal shape or long tail. However, when looking at the right side of the mean, there is an unusual pattern that wasn't detected in the distribution of blood donors. This suggests that there are still factors not explained by the day-of-the-week and holiday variables. What could those factors be?
There are two types of seasonality: weekly seasonality, such as day-of-the-week effects, and annual seasonality, like spring, summer, fall, and winter. Since we've already accounted for weekly seasonality, let's now consider annual seasonality. As mentioned earlier, we know that the number of blood donors tends to decrease in winter, so we can expect that annual seasonality exists. Let's examine the data to confirm this.
Looking at Figure 8, we can see that the distribution of blood donors varies by month. Therefore, it is reasonable to conclude that annual seasonality exists in the number of blood donors, and we should incorporate this into the model. It is suspected that annual seasonality may be contributing to the unusual patterns in the residuals.
How can we incorporate annual seasonality into the model? The simplest method would be to include all days of the year using 365 dummy variables. However, this approach is inefficient as it uses too many variables. When there are too many variables, the model's variance increases, and multicollinearity issues may arise. This is especially concerning because the number of blood donors does not fluctuate dramatically on a daily basis, so multicollinearity is likely. So, how can we capture similar information without using 365 dummy variables?
Let's focus on the word “cycle”. When we think of cycles, sine and cosine functions come to mind. How about using sine and cosine functions to capture annual seasonality? This approach is called Harmonic Regression.
Figure 9 illustrates that annual seasonality is captured using appropriate sine and cosine functions. By using a method suited to the characteristics of the cycle, we were able to capture seasonality with a small number of variables. Of course, using temperature to capture annual seasonality is another option. This method has the advantage of being more intuitive and easier to control variables. However, there is annual seasonality in the blood donor data that cannot be fully explained by temperature alone, which is why harmonic regression was used to model the seasonality.
As a result of incorporating annual seasonality into the model, the unusual patterns in the residuals were eliminated. The regression equation with annual seasonality included is shown below.
Do temperature and weather affect the number of blood donors? Upon investigating the data, we found that 70% of donors visit blood donation centers in person. This leads to a strong suspicion that temperature and precipitation, which influence outdoor activities, could have a significant impact on the number of blood donors.
Since weather conditions vary significantly by region, we conducted the analysis separately for each region. We examined the significance of temperature and precipitation variables for individual regions. The results showed that precipitation negatively impacted the number of blood donors in all regions, while temperature did not have a significant effect. This is because the information provided by temperature was already captured when we incorporated annual seasonality into the model. The regression equation, including precipitation, is shown below.
Dynamics of Blood Supply and Usage During the COVID-19 Period
In this section, we will examine how blood stock responds when a significant external shock occurs. Specifically, we will analyze the dynamics of blood stock during the COVID-19 period, which was the most significant recent shock.
It is likely that maintaining blood stock above a certain level was challenging during the COVID-19 period. This is because population movement significantly decreased due to various quarantine measures and fears of infection. Moreover, as shown in Figure 12, the number of individuals ineligible for blood donation increased starting in 2020. This was due to the introduction of new health criteria during the COVID-19 period, which restricted blood donations for a certain period after recovering from COVID-19 or receiving a vaccine. For these reasons, we expect that blood stock levels decreased significantly during the pandemic. Let’s examine the data to see if our hypothesis is correct.
As seen in Figure 13, interestingly, blood stock levels were maintained above a certain level during the COVID-19 period. The blood stock never dropped below two days' supply. How was the Korean Red Cross able to maintain blood stock above a certain level despite the external shock of the pandemic?
After controlling for the factors considered earlier and conducting a regression analysis, it was found that blood usage decreased by 4.25% during the COVID-19 pandemic. This reduction can be attributed to two factors: the intentional decrease in blood usage to maintain stock levels, and the natural decline due to the shortage of medical personnel and hospital wards during the pandemic.
A regression analysis on blood supply using the same variables showed a 5.3% decrease in supply. The reason blood stock levels were maintained during the COVID-19 period is that both usage and supply decreased at similar rates. However, considering the broader societal impact of the pandemic, the 5.3% decrease is relatively minimal.
Finding the "Blood Shortage" Variable
A regional regression analysis of blood donor numbers showed that, in certain areas, the number of donors actually increased. Since COVID-19 was not confined to specific regions, this runs against common sense. It is therefore suspected that some factor during the pandemic boosted blood supply in those areas, and that the relatively modest 5.3% decrease is likely partly offset by this increase factor.
We anticipated that an increase factor might come into play during periods of blood shortage. Thus, we created a proxy variable called "Blood Shortage". Days when blood stock dropped below a certain level, along with a defined period thereafter, were classified as "shortage periods". This reflects the impact of specific measures taken by the Korean Red Cross during these periods.
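A proxy of this kind could be constructed roughly as follows; the threshold of three days' supply, the 14-day follow-up window, the file name, and the column names are all illustrative assumptions rather than the study's actual definitions.

```python
import pandas as pd

THRESHOLD_DAYS = 3   # illustrative stock threshold (days' supply)
WINDOW = 14          # illustrative follow-up window after a shortage day

stock = pd.read_csv("daily_stock.csv", parse_dates=["date"])
below = (stock["stock_days"] < THRESHOLD_DAYS).astype(int)

# A day is flagged as a "shortage period" if stock fell below the threshold
# on that day or on any of the previous WINDOW days.
stock["shortage"] = below.rolling(WINDOW, min_periods=1).max().astype(int)
```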
An analysis of the effect of the "blood shortage" on the number of blood donors showed that, in most regions, it had a positive effect on donor numbers. This supports the earlier hypothesis that some factor was increasing blood supply. Similarly, when examining the effect of the shortage condition on blood usage, we observed a decrease in usage during those periods. This indicates that the manual for blood supply shortages, which is triggered when blood stock levels fall below a certain threshold, worked effectively.
However, the increase factor associated with the "blood shortage" is likely effective only when the decline in blood donors can be anticipated in advance, since the Korean Red Cross needs to foresee a drop in donor numbers in order to respond with promotional efforts. Let's verify this by looking at the data.
Looking at the model's residuals, we can see that donor numbers fell during the early COVID-19 outbreak in Daegu/Gyeongbuk and during the Omicron wave, both of which were unexpected events. In other, more predictable periods, donor numbers did not keep declining, suggesting that the increase factor operated effectively. Blood stock levels were maintained during those times because the manual for blood supply shortages was activated and the public became more aware of the shortage, leading to more proactive donations that helped boost supply.
Measuring the Effect of Promotions
The Effect of the Additional Giveaway Promotion
During the COVID-19 period, the Korean Red Cross employed various methods to prevent a decline in the number of blood donors, including promotions, SMS donation appeals, and public service advertisements. Which of these methods was the most effective? If the effect can be accurately measured, the Korean Red Cross will be able to respond more efficiently to future blood shortages.
It would be ideal to measure the effect of all methods, but most were difficult to analyze due to a lack of data or one-time events. Fortunately, promotions were deemed suitable for quantitative analysis, so we focused on measuring their impact. Let’s examine how much promotions increased the number of blood donors.
The giveaway promotion was conducted in the same way across all regions for an extended period, so there should be no major issues in measuring its effect. To assess its impact, we created a dummy variable for "promotion days" while controlling for the variables we previously identified. The results showed that the response to the promotion varied by gender. Men responded strongly to the promotion while women did not show a significant response. However, does simply adding a dummy variable truly capture the pure increase driven by the promotion?
Using a simple dummy variable to capture the effect of the promotion period results in a mixture of both the "promotion effect" and the "trend during the promotion period". For example, the number of blood donors in May and December differs. May sees more donors due to favorable weather, while December sees fewer. Therefore, simply adding a dummy variable makes it difficult to isolate the pure effect of the promotion, as the existing higher donor numbers in May may get mixed with the increase from the promotion itself. We need to consider how to separate these effects to accurately measure the promotion's impact.
As shown in Figure 18, the giveaway promotion was conducted on a quarterly basis. Since days within the same quarter share similar seasonality, donor numbers are unlikely to shift much within a quarter for seasonal reasons. To remove trends, the entire timeline was therefore divided into quarters, and the promotion's pure impact was measured within each quarter, as sketched below.
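A minimal version of this within-quarter comparison, continuing the hypothetical dataframe from earlier and assuming a 0/1 promotion column, might look like this.

```python
# Compare promotion and non-promotion days within the same quarter so that
# the seasonal level shared by a quarter cancels out.
df["quarter"] = df["date"].dt.to_period("Q")
by_quarter = df.groupby(["quarter", "promotion"])["donors"].mean().unstack()

# Average within-quarter lift: promotion days minus non-promotion days.
promotion_effect = (by_quarter[1] - by_quarter[0]).mean()
```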
After removing the trends, there is no significant difference in the promotion response between the male and female groups. Although there is some variance due to unexplained social factors, the average response is similar, leading to more accurate results compared to using a simple dummy variable.
The Effect of Special Promotions
In addition to the giveaway promotion, the Korean Red Cross conducted various special promotions, including gift cards, souvenirs, travel vouchers, and sports event tickets. To accurately measure the effect of these special promotions, it is essential to remove the trends, just as with the giveaway promotion. In other words, we need to identify periods where there would be no differences except for the promotion. In this analysis, we examined the difference in the number of blood donors two weeks before and after the promotion period, as well as during the promotion period itself.
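One simple way to implement this pre/post comparison, with illustrative dates and the same hypothetical dataframe as before, is sketched here.

```python
import pandas as pd

# Illustrative promotion window; in practice these dates come from the
# Korean Red Cross's promotion schedule.
promo_start, promo_end = pd.Timestamp("2021-05-10"), pd.Timestamp("2021-05-23")

during = df[(df["date"] >= promo_start) & (df["date"] <= promo_end)]["donors"].mean()
before = df[(df["date"] >= promo_start - pd.Timedelta(days=14)) &
            (df["date"] < promo_start)]["donors"].mean()
after = df[(df["date"] > promo_end) &
           (df["date"] <= promo_end + pd.Timedelta(days=14))]["donors"].mean()

# Relative increase during the promotion versus the surrounding two-week windows.
lift = during / ((before + after) / 2) - 1
```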
The special promotions increased the number of blood donors in many regions, and among them, offering sports viewing tickets was particularly effective. It is therefore suggested that sports viewing tickets be used to boost donor numbers during anticipated periods of blood shortage.
An Episode from Data Collection
I will conclude by sharing an episode from the data collection process. The data used in this research was collected through various channels. For data related to blood services statistics, I was able to obtain well-organized information through the Statistics Korea API. However, other data sources were not as easily accessible, which was somewhat disappointing. While blood stock, usage, and supply data are available through other APIs, they provide only monthly figures, which lack the resolution needed for detailed analysis.
Fortunately, since the Korean Red Cross is a government organization, we were able to request daily data on blood stock, usage, and supply, as well as data on the giveaway promotion, through a "Public Information Request". Government departments and public institutions often provide access to such data, excluding sensitive personal information. I encourage other researchers to make active use of information disclosure requests to obtain high-quality data, especially in South Korea, where administrative data are well digitized and researchers can readily access the materials they need.
Administrative Divisions, Residential Districts, and Tax Systems
Regions where administrative divisions overlap face economic, social, and political issues that other regions do not. As the local governments involved compete to attract resources to satisfy residents' needs, they often run into unexpected outcomes and inefficiencies.
Recent "New Town" projects in Korea, designed mainly to disperse population from the downtown areas of large cities and reduce unnecessary administrative burdens, are often located where different administrative districts are intertwined, leading to many inconveniences. A prime example is Wirye New Town (henceforth, Wirye) and its Light Rail Transit (LRT) project connecting Wirye with Sinsa, which already has two subway lines. With the national government, Seoul Metropolitan City, Songpa-Gu, Gyeonggi Province, Hanam City, and Seongnam City all involved with different interests, the project, originally scheduled to break ground ten years ago, has yet to start construction.
A common solution raised in the media is the integration of administrative divisions. However, simply merging the districts will not solve the problem. Would the Gyeonggi, Hanam, or Seongnam governments agree to annexing Wirye to Songpa-Gu of Seoul at the cost of losing their tax base, or vice versa? Even if the local governments concerned somehow agreed on one form of integration or annexation, running school districts and public facilities such as fire stations, community centers, and libraries would remain a major challenge, since securing and managing the resources to operate them would bring additional administrative and financial burdens.
In fact, integrating or annexing administrative districts in Korea is rather rare. There have been several attempts, but their number is small, and examples such as the merger of Changwon, Masan, and Jinhae into the unified Changwon City, or of Cheongju City and Cheongwon County into the unified Cheongju City, involve entire local government units being merged (Source: Yonhap News, 2024). Integrating Wirye is a completely different matter, and annexation in such circumstances is unprecedented.
Low Fiscal Autonomy of Local Governments
One of the main reasons these problems persist in Korea is that local governments are not financially independent. Although the law allows local governments to adjust the standard tax rate through a flexible tax rate system, it is rarely used in practice, and most local governments apply the rate set by the central government in consideration of the 'equity of the tax system' (Jeong, 2021). In other words, local governments exercise their legal authority to adjust tax rates only when doing so brings no particular benefit or disadvantage. Furthermore, they have little, if any, authority to determine the tax base, the other facet of tax revenue. The national government distributes subsidies to fill the gap between fiscal needs and revenue, but that decision is ultimately made by the national government. Local governments thus lack the power to determine their own revenues, which makes it difficult for them to formulate long-term policies.
The United States presents a contrasting case. While the federal government holds the highest authority, each state has discretion to determine the types of taxes and tax rates. For example, Texas, where Shin-Soo Choo played, has no individual income tax, while California has one of the highest, if not the highest, individual income tax rates. Delaware, Montana, and Oregon have no sales tax. Furthermore, local governments in New York State, such as cities and towns, assess real estate and impose different tax rates. New York City prohibits right turns on red, whereas elsewhere a right turn on red is generally legal as long as one yields to oncoming traffic first. Of course, this does not mean that state and local governments are completely fiscally independent; the U.S. also has federal and state systems of tax collection and distribution through grants and subsidies, but local governments enjoy higher fiscal autonomy than in Korea.
The Need for Research on Tax Competition
Let's return to the case of Wirye. If Songpa District, Seongnam City, and Hanam City had not just waited for support from the national government but had ways to secure their own resources, they could have solved the problem of the delayed construction of the LRT project. The reality that the construction has been delayed for 16 years, despite residents paying additional contributions, clearly shows how important the fiscal autonomy of local governments is (Source: Chosun Biz, 2024).
However, if local governments' taxing powers were expanded in Korea, tax competition between them would be inevitable. When local governments can adjust tax rates autonomously, they will compete to attract resources with minimal resistance from citizens. In other words, granting more taxing power to local governments creates new challenges precisely because it grants more autonomy. Ultimately, the interactions between local governments, especially tax competition, play an important role, and the impact of such competition on regional economies cannot be neglected.
Research Question
This study aims to examine the tax competition that arises in regions where tax jurisdictions are not completely independent from one another. Particularly in situations like Wirye, where multiple administrative districts overlap, the study will use a game-theoretic approach to model how local governments determine and interact with their tax policies, and to analyze the characteristics and results of the competition.
Specifically, this study aims to address the following key questions:
What strategic choices do local governments make in overlapping tax jurisdictions?
How is tax competition in this environment different from traditional models?
What are the characteristics of the equilibrium state resulting from this competition?
What are the impacts of this competition on the welfare of local residents and the provision of public services?
Through this analysis, the study aims to contribute to the effective formulation of fiscal policies in regions with complex administrative structures. It also expects to provide theoretical insights into the problems that arise in cases like Wirye, which can help inform policy decisions in similar situations in the future. The next chapter will go into further detail on tax competition and explain the specific models and assumptions used in this research.
Literature on Tax Competition
Tax competition refers to the competition between local governments to determine tax rates in order to attract businesses and residents. This can have a significant impact on the fiscal situation of local governments and the provision of public services.
The origin of the theory of tax competition can be traced back to Tiebout's (1956) "Voting with Feet" model. Tiebout argued that residents move to areas that provide the combination of taxes and public services that best matches their preferences. Later, Oates (1972) argued that an efficiently decentralized fiscal system can provide public goods, establishing this as the decentralization theorem.
However, Wilson (1986) and Zodrow and Mieszkowski (1986) pointed out that tax competition between local governments can lead to the under-provision of public goods. They argued that as capital mobility increases, local governments tend to lower tax rates, which can ultimately lead to a decline in the quality of public services.
Tax competition takes on an even more complex form in regions with overlapping tax jurisdictions. Keen and Kotsogiannis (2002) analyzed a situation in which vertical tax competition (between different levels of government, such as central and local) and horizontal tax competition (between governments at the same level) occur simultaneously. They showed that in such an overlapping structure, excessive taxation can occur.
In the case of Wirye, the overlapping jurisdictions of multiple local governments make the dynamics of tax competition even more complex. Each government tries to maximize its own tax revenue, but at the same time, they must also consider the competition with other governments. This can lead to results that differ from traditional tax competition models.
In summary, the negative aspects of tax competition are:
Decrease in tax revenue: Long-term shortage of tax revenue due to tax rate reductions
Deterioration of public service quality: Reduction of services due to budget shortages
Regional imbalances: Uneven distribution of public services due to fiscal disparities
Excessive fiscal expenditures: Financial burden from various incentives to attract businesses
However, tax competition is not always negative. Brennan and Buchanan (1980) argued that tax competition can play a positive role in restraining the excessive expansion of government.
Research on tax competition in regions with overlapping tax jurisdictions is still limited. This study aims to analyze the dynamics of tax competition in such situations using a game-theoretic approach. This can contribute to the formulation of effective fiscal policies in regions with complex administrative structures.
Model
The Need for a Toy Model
It is extremely difficult to create a model that considers all the detailed aspects of the complex administrative system described earlier. In such cases, an effective approach is to create a toy model that removes complex details and represents only the core elements of the actual system. For example, to understand the principles of how a car moves, looking at the entire engine would be complicated, as it involves various elements like the fuel used, fuel injection method, cylinder arrangement, etc. However, through a toy model like a bumper car, one can, with relative ease, learn the basic principle that pressing the accelerator makes the car move forward, and pressing the brake makes it stop.
Similarly, in this study, we plan to use a toy model that removes complex elements in order to first understand the core mechanisms of tax rate competition. After understanding the core mechanisms through this study, the goal is to gradually add more complex elements to create a model that is closer to the actual system.
Assumptions
In game theory, a "game" refers to a situation where multiple players choose their own strategies and interact with each other. Each player tries to choose one or more optimal strategies to achieve their own goals, considering other players’ strategies.
In this study, we constructed a game by adding one overlapping region to the toy models of Itaya et al. (2008) and Ogawa and Wang (2016). Based on the Solow model, the Nobel Prize-winning framework for explaining long-term economic growth, those two studies mainly investigated capital tax rate competition between two regional governments within a country.
Specifically, there is a hypothetical country divided into three regions: two independent regions with asymmetric production technologies and capital endowments, S and L, and a third region, O, which overlaps with the other two. Each of the three regions has independent authority to impose a capital tax at rate $\tau_i$ for region $i$. The non-overlapping parts of S and L are defined as SS (Sub-S) and SL (Sub-L), respectively, and the overlapping regions between S and O, and between L and O, are denoted OS and OL, respectively (refer to Figure 2). S and L are higher-level jurisdictions that provide a generic public good $G$, while O is a special-purpose jurisdiction that provides a specific public good $H$ tied to O.
To isolate the "effect of the capital tax rate" arising from the existence of overlapping regions, the populations of regions S and L are assumed to be equal. All residents of the country have the same preferences and inelastically supply one unit of labor to firms in their own region. This is a strong assumption, since residents never move and keep working for their current firm under any circumstances, but it was necessary to simplify the game. Furthermore, firms employing residents in each region are assumed to produce homogeneous consumer goods.
As mentioned above, S and L are assumed to have different capital factors and production technologies. Expressing this in a formula, the average capital endowment per person for the entire country is $\bar{k}$, and the average capital endowment per person for regions S and L are as follows:
Even though the capital endowment may differ, capital can move freely. In other words, when a resident of region S invests capital in L, the cost is no different than investing capital in S.
To briefly introduce a few more variables, the amount of capital demanded in region $i$ is denoted $K_i$, the amount of labor supplied is $L_i$, and the labor and capital productivity coefficients are defined as $A_i$ and $B_i > 2K_i$, respectively. Although regions S and L differ in capital production technology, there is no difference in labor production technology, so $A_L = A_S$ while $B_L \neq B_S$. Region O does not occupy new territory in the hypothetical country but overlaps the S and L regions in equal proportions of area and population. Therefore, $A_O = A_L = A_S$, and $B_O$ is the weighted average of $B_L$ and $B_S$, with the weights being the proportions of capital invested in the OL and OS regions.
Market Equilibrium
Utilizing the variables introduced above, the production function under constant returns to scale (CRS) for region $i$ used in this study can be expressed as follows.
Based on this production function, it is assumed that firms maximize their profits, and market equilibrium is assumed to occur when the total capital endowment and capital investment demand are equal. Based on this, we can calculate the capital demand and interest rates at market equilibrium:
Here, $K_{SS}=\alpha_S K_S$, $K_{SL}=\alpha_L K_L$ while $0<\alpha_S, \alpha_L < 1$. Furthermore, we define $\theta$ as equal to $B_L-B_S$.
Residents maximize their post-tax income by investing capital, earning income at the market equilibrium return on capital $r^{\ast}$, and consuming it all. Each tax jurisdiction is assumed to provide public goods through taxes and to select the optimal capital tax rate $\tau_i^{\ast}$ that maximizes its social welfare function, represented as the sum of individual consumption and the provision of public goods.
To briefly explain, a Nash equilibrium refers to a state in which every player in the game has made their best possible choice and has no incentive to change their strategy. In other words, in a Nash equilibrium no player can improve their outcome by unilaterally changing their strategy, so all players maintain their current strategies.
The tax rate at market equilibrium calculated in the previous section can be expressed as the optimal response function for each region. This is because, given the strategies of other regions, each region aims to maximize the social welfare function with its optimal strategy. In other words, this function shows which tax rate is most advantageous for region $i$ when considering the tax rates of other regions. Therefore, the Nash equilibrium tax rate can be derived based on the market equilibrium tax rate, and it can be expressed in the following formula.
Furthermore, we can derive the following lemma and proposition.
Lemma 1. The sign of $\Phi\equiv\varepsilon-\frac{\theta}{4}$ determines the net capital position of S and L: L is the net capital exporter when the sign is positive, and vice versa.
Proposition 1. The sign of $\Gamma\equiv\frac{3\left(\alpha_L+\alpha_S\right)-4}{\left(2-\left(\alpha_L+\alpha_S\right)\right)\left(\alpha_L+\alpha_S\right)}$ determines the effective tax rate of O, and $\alpha_L + \alpha_S$ must be greater than $4/3$ for O to provide a positive amount of the special public good $H$.
Additionally, the capital demand and interest rates at the Nash equilibrium are as follows.
Since the Nash equilibrium is nonlinear, the results can be visualized through simulations. By adjusting $\alpha_S$, $\alpha_L$, $\varepsilon$, and $\theta$ in the simulation, one can see, as stated in Lemma 1, that when $\Phi$ is greater than 0, capital demand in region L decreases and capital demand in region S increases, so that L exports capital and S imports it (see Figure 3).
Based on the Nash equilibrium results, the utility that the representative residents of each region derive from public goods can be summarized as follows:
\[
u_p(G_i, H_i) =
\begin{cases}
\frac{K_S^N \tau_S}{l} & \text{for } i = SS \\
\frac{K_L^N \tau_L}{l} & \text{for } i = SL \\
\frac{K_S^N \tau_S}{l} + \frac{3 K_O^{\ast} \tau_O}{2l} & \text{for } i = OS \\
\frac{K_L^N \tau_L}{l} + \frac{3 K_O^{\ast} \tau_O}{2l} & \text{for } i = OL
\end{cases}
\]
This utility function is visualized in Figure 4.
To summarize, region O is not directly affected by the tax rates of S and L, but it is directly influenced by the ratio of allocated resources, i.e., the sum of $\alpha_L$ and $\alpha_S$. Compared with the scenario in which no such jurisdiction exists, S and L experience changes in their tax rates of $\Gamma\left(1-\alpha_S\right)\bar{k}$ and $\Gamma\left(1-\alpha_L\right)\bar{k}$, respectively, due to the existence of the special-purpose jurisdiction O. Additionally, the net capital position is determined independently of the special-purpose jurisdiction.
Conclusion
The study examined how tax competition unfolds in regions with overlapping tax jurisdictions, leveraging ideas from game theory. Specifically, a simplified toy model was constructed to understand the impact of tax competition in complex administrative structures, and a Nash equilibrium was derived from it. This showed that, in the presence of overlapping tax jurisdictions, the patterns and outcomes of tax competition differ in certain ways from those predicted by the models of Itaya et al. (2008) and Ogawa and Wang (2016).
However, this study has several limitations. First, the use of a toy model simplifies the complex nuances of real-world scenarios, meaning it does not fully reflect the various factors that could occur in practice. For instance, factors such as population mobility, governmental policy responses, various tax bases, and income disparities among residents were excluded from the model, which limits the generalizability of the results. Second, the economic variables assumed in the model, such as differences in capital endowments, production technologies, and resident preferences for public goods, may differ from reality, necessitating a cautious approach when applying these findings to real-world situations.
Despite these limitations, this study provides an important theoretical foundation for understanding the dynamics of tax competition in regions with overlapping administrative structures. Specifically, it suggests the need for a policy approach that considers the interaction between capital mobility and public goods provision, rather than merely focusing on a “race to the bottom” in tax rates. Future research should expand the model used in this study to include additional variables such as population mobility and governmental policy responses. Moreover, it will be essential to examine how tax competition evolves in repeated games. Finally, testing the model under various economic and social conditions will be crucial to improving its reliability for practical policy applications. By doing so, we can more accurately assess the real-world impacts of tax competition in complex administrative structures and contribute to designing effective policies.
References
Brennan, G., & Buchanan, J. M. (1980). The power to tax: Analytical foundations of a fiscal constitution. Cambridge University Press.
Itaya, J.-i., Okamura, M., & Yamaguchi, C. (2008). Are regional asymmetries detrimental to tax coordination in a repeated game setting? Journal of Public Economics, 92(12), 2403-2411.
Jeong, J. (2021). A Study on the Improvement of the Flexible Tax Rate System. Korea Institute of Local Finance.
Keen, M., & Kotsogiannis, C. (2002). Does federalism lead to excessively high taxes? American Economic Review, 92(1), 363-370.
Oates, W. E. (1972). Fiscal federalism. Harcourt Brace Jovanovich.
Ogawa, H., & Wang, W. (2016). Asymmetric tax competition and fiscal equalization in a repeated game setting. International Review of Economics & Finance, 41, 1-10.
Tiebout, C. M. (1956). A pure theory of local expenditures. Journal of Political Economy, 64(5), 416-424.
Wilson, J. D. (1986). A theory of interregional tax competition. Journal of Urban Economics, 19(3), 296-315.
Zodrow, G. R., & Mieszkowski, P. (1986). Pigou, Tiebout, property taxation, and the underprovision of local public goods. Journal of Urban Economics, 19(3), 356-370.
With economic uncertainty both at home and abroad, a surge in energy-related raw material prices this winter has been widely 'expected.' Experts are urging accurate forecasts of winter energy consumption and strategies to save energy. However, the industry questions the methods previously used to estimate energy consumption, claiming that they do not reflect reality.
How can energy consumption be predicted accurately? What other impacts could come from accurately predicting energy usage? This article aims to explain a statistical method based on the joint probability distribution model to predict energy consumption more realistically in simple terms for the public.
Global Raw Materials Supply Crisis
According to the Chicago Mercantile Exchange (CME) on August 11, the spot price of European LNG surged to \$62.5 per MMBtu on August 2. This is 6 to 7 times higher than last year’s \$8-10 for the same period and close to the record high of \$63 set in March of this year.
Experts believe the sharp rise in European LNG prices is due to Russia’s 'tightening' of natural gas supplies. Amid the ongoing Russia-Ukraine war, the West, including the U.S., has refused to pay for raw materials in rubles, pressuring Russia. In response, Russia has significantly reduced its natural gas supply.
In fact, Russia has completely halted natural gas supplies to Bulgaria, Poland, the Netherlands, Finland, Latvia, and Denmark, all of which refused to pay in rubles. At the end of last month, it also drastically cut the supply through the Nord Stream 1 pipeline to Germany, its biggest customer, to 20%. As Europe struggles with gas shortages, it has been pulling in all available global LNG, driving up Northeast Asia's LNG spot prices to \$50 as of July 27.
Adding to the LNG supply crisis, a June explosion at the U.S.'s largest LNG export facility, Freeport LNG, which exports 15 million tons of LNG annually, has limited operations until the end of the year. Meanwhile, Australia, the world’s largest LNG exporter, is considering restricting natural gas exports under the pretext of stabilizing raw material prices. As a result, the industry expects a 'dark period' in raw material supply in the second half of this year.
The problem is that these geopolitical issues are severely impacting South Korea’s raw material supply as well. Even in the low-demand summer and fall seasons, LNG spot prices are nearing record highs, and the industry consensus is that winter, with its heating demands, will see prices rise to unimaginable levels.
'Predicted' Energy Crisis
Experts warn that in this 'predicted' energy crisis, the winter LNG spot price may far exceed the record high of \$63 per MMBtu from March, possibly surpassing \$100. They emphasize that South Korea must accurately forecast winter energy consumption now and begin conserving resources through energy-saving measures.
So, how is energy consumption in South Korea estimated, and how accurate are these estimates? To understand this, we first need to examine how electricity and gas are consumed in the country.
Energy consumption, including electricity and gas, occurs not only in households but also in non-residential buildings such as office and commercial facilities. The energy use in non-residential buildings varies greatly depending on the building's purpose. According to the "Energy Usage by Purpose [kWh/y]" data from the New and Renewable Energy Center of the Korea Energy Agency, there are significant differences in energy consumption depending on the building’s use.
Additionally, using the 'average energy consumption per unit area' from the table below, we can estimate the total annual energy consumption of a specific building. This is calculated by multiplying the annual average usage figure for the building’s purpose by its total floor area. For example, the estimated annual energy consumption for an office building with a floor area of 1,000 square meters would be 371,660 kWh.
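As a minimal illustration of this calculation (the per-unit-area figure of 371.66 kWh/m²·y is implied by the article's office example; the dictionary and variable names are placeholders):

```python
# Annual energy estimate = average use per unit area x total floor area.
avg_use_per_m2 = {"office": 371.66}   # kWh per m2 per year, illustrative entry only
floor_area_m2 = 1000                  # office building from the example above

annual_kwh = avg_use_per_m2["office"] * floor_area_m2
print(annual_kwh)  # 371660.0 kWh per year
```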
These energy consumption estimates for individual buildings can be widely applied. As we’ve seen, with energy raw material prices expected to hit record highs, these estimates can help ensure that 'expensive' energy is not wasted and is efficiently distributed.
Additionally, the Korea Energy Agency, which provided the above data, actively uses these statistics to calculate the required amount of renewable energy for public buildings. For example, when a public building is scheduled for new construction or expansion and plans to generate a certain amount of renewable energy, the energy consumption estimates are compared to the expected energy output. This comparison helps determine whether the building is producing enough renewable energy.
Moreover, these energy usage estimates are not limited to individual buildings but can be extended to areas or regions. For instance, if a large-scale building or new city district is planned within a specific urban area, regional energy demand will naturally increase as the buildings are constructed.
However, a limitation of this data is that the estimates rely on a simple one-variable regression, with energy consumption as the dependent variable and floor area as the independent variable. In reality, building energy consumption is influenced by various factors such as heating and cooling systems, building materials and structure, and insulation quality. Thus, explaining energy use based solely on 'floor area' reduces accuracy.
Therefore, government agencies and public corporations responsible for energy management must strive to estimate the increased energy demand from new buildings as accurately as possible. This is crucial for efficient decision-making related to energy supply, production, and infrastructure investment. A precise model to estimate energy consumption for individual buildings is clearly necessary for this purpose.
Existing Energy Estimation Studies Based on Regression Analysis
Ideally, accurately estimating a building’s energy consumption would involve analyzing all detailed characteristics, such as heating and cooling systems, building materials and structure, insulation, occupancy, and schedules. This type of estimation model is known as a Physical Model.
However, predicting energy usage using a physical model is not practical. Most construction companies do not disclose all information, especially for new buildings. While collecting this data directly from the builders may be possible for a single building, doing so for an entire district or region would result in astronomical costs.
Therefore, from a research perspective, it is best to use a statistical model that estimates energy consumption from a few simple building attributes. In other words, the approach is to build a regression model in which the dependent variable is a building's energy consumption and the independent variables are attributes such as floor area, purpose, number of floors, age, and materials.
Regression analysis is a well-known statistical method for identifying correlations between observed independent variables and a dependent variable. Researchers can use regression analysis to statistically test how much a change in an independent variable influences the dependent variable and, further, predict the dependent variable's value from the independent variables. To ensure a reasonable analysis, researchers must also consider mathematical and statistical assumptions, such as whether their model violates the Gauss-Markov assumptions. Details on these considerations will be discussed in the later part of this research.
To conduct regression model research with monthly energy consumption of individual buildings as the dependent variable, data is required. In South Korea, monthly energy consumption records for non-residential buildings are made available through the Building Data Open System. Information about building attributes, which serves as independent variables, is recorded in the title section and is also provided by the Building Data Open System. This allows anyone to combine monthly energy consumption data with title section data to carry out such research.
Returning to the main point, due to the practical 'cost' issue and the ease of data collection for regression model research, previous studies estimating energy consumption of individual buildings have primarily used regression-based statistical models. A notable domestic study is “Development of Standard Models for Building Energy in Seoul’s Residential/Commercial Sector” (Kim Min-kyung et al., 2014). This research derived a model by performing linear regression on monthly electricity usage with various independent variables and monthly dummy variables (which convert existing variables into 0s and 1s based on certain criteria). Similarly, in a prominent overseas study on heating energy estimates, a model was derived by regressing 'per unit area' monthly heating energy consumption during the heating season against building and climate-related independent variables.
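A stripped-down version of such a regression might look like the sketch below; the dataframe `buildings` and its column names are hypothetical stand-ins for fields from the Building Data Open System, not the variables used in the cited studies.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical dataset: one row per building per month, with monthly electricity
# use and basic attributes from the building register (title section).
buildings = pd.read_csv("building_energy.csv")

# Monthly electricity use regressed on building attributes plus month dummies,
# mirroring the regression-based setup described above.
model = smf.ols(
    "electricity_kwh ~ floor_area + n_floors + building_age + C(purpose) + C(month)",
    data=buildings,
).fit()
print(model.params)
```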
'Monthly' Energy Usage Trends
One common feature of the studies reviewed earlier is that the dependent variable in the regression models is not 'annual' energy consumption, but 'monthly' energy consumption. This is to reflect the seasonal trends in energy usage. For example, electricity usage is higher in the summer due to air conditioning, and gas consumption is higher in the winter due to heating. It’s no surprise that electricity usage peaks in July and August, while gas consumption is highest from December to February. In fact, most buildings exhibit similar 'seasonal trends' in energy consumption, as shown in Figure 3.
Therefore, when planning energy supply and maintenance for energy production facilities, it is crucial to accurately predict monthly energy demand by considering seasonal fluctuations. This ensures that sufficient energy is available during high-consumption periods to prevent blackouts, and that energy reserves are minimized during low-consumption periods, allowing for efficient use of government budgets. However, previous studies' energy consumption estimates have not been widely adopted in the industry due to their lack of accuracy and failure to reflect reality. This is because traditional regression models did not incorporate a 'joint' probability distribution model based on the second moment for monthly energy usage.
Hidden Factors Among Variables
Consider two hypothetical office buildings with nearly identical attributes but differing actual energy usage. Both buildings are categorized as office buildings, with similar floor area, number of floors, age, and building materials. However, in one building, employees frequently work overtime and on weekends, using air conditioning extensively, resulting in high electricity consumption. In contrast, the other building emphasizes energy saving, with employees leaving on time daily, leading to much lower energy use.
In this case, even though the explanatory variable values for the two buildings are very similar, their actual electricity usage would differ significantly. The first building would use more electricity compared to the average office building of similar size and materials, while the second would use less. This means that the energy consumption of two buildings with identical attributes like floor area, number of floors, and materials would vary due to the hidden variable of "whether employees leave on time." Since it's practically impossible to collect data on the work hours of all employees in a building, including this variable in existing models is not feasible.
Of course, regression analysis accounts for such variability through the error term. The energy consumption of average buildings is calculated by setting the error term to zero, while buildings that consume more than average will have a positive error term, and those that consume less will have a negative one.
Correlation Among Dependent Variables
In proper research, not only are the coefficient estimates for each explanatory variable provided in a regression model, but so is the estimate of error variance. Using this error variance estimate, the expected energy usage for each month can be obtained as both a point estimate and a confidence interval. In a normal regression model, this confidence interval would cover most of the variability in energy usage mentioned earlier. However, mathematically, one more factor needs to be considered: the 'correlation among energy usage in different months.'
For example, if the electricity consumption in August of a building that frequently has overtime and uses a lot of air conditioning is significantly higher than other similar-sized buildings, it is likely that this building will also consume more electricity in other months, from January to December, compared to other similar buildings. Similarly, if a building that focuses on energy saving has low electricity usage in August, it will likely consume less electricity in other months as well.
This is mathematically referred to as a 'positive correlation.' Previous regression-based studies did not account for this positive correlation. For instance, if we assume that monthly electricity usage follows a probability distribution with the average usage predicted by the existing regression model, and we draw samples of monthly electricity usage for a specific building, it's possible that the sample value for July might be much higher than average, while the sample value for August could be much lower than average.
Common sense tells us that a building that used significantly more electricity than similar buildings in July is unlikely to use much less electricity than other buildings in August. In other words, if the regression model captures all relevant information, the samples of electricity usage for July and August for the same building should be positively correlated—they should both be high or both be low. However, if there is no second moment value (i.e., 'covariance') between the error terms for the two months, such unrealistic samples may occur.
Covariance Among Error Terms
Let’s examine this more mathematically. When viewing the monthly electricity usage (January, February, …, December) for a building over a year as a 12-dimensional vector random variable, previous studies have estimated the first moment vector and the diagonal components of the second moment matrix (the variance of error terms for each month). The first moment vector is obtained by inputting the explanatory variable values into the regression equation and setting the error term to zero. The diagonal components of the second moment matrix correspond to the estimated variances of the error terms for each month. However, previous studies did not estimate the off-diagonal components of the second moment matrix—i.e., the 'covariance' between the error terms of different regression equations—leading to difficulties in accurately modeling real-world scenarios.
If, in addition to the first moment vector, the second moment matrix is fully estimated including its covariances, a multivariate normal distribution for the 12-dimensional random variable can be defined mathematically. In practical terms, this allows us to sample a building's monthly electricity usage while accounting for the correlation between usage in different months, so that a building that uses significantly more electricity than similar buildings in July is also expected to use more in August.
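The sketch below illustrates the idea under simplified assumptions: twelve monthly regression models (here a hypothetical list `models`), residuals pooled into a covariance matrix, and a prediction `new_building` for one building. None of the names reflect the study's actual implementation.

```python
import numpy as np

# Stack the 12 monthly residual series into columns and estimate the full
# 12 x 12 covariance matrix, off-diagonal covariances included.
resid = np.column_stack([models[m].resid for m in range(12)])
Sigma = np.cov(resid, rowvar=False)

# First-moment vector: the regression predictions for one building,
# month by month (error terms set to zero).
mu = np.array([models[m].predict(new_building)[0] for m in range(12)])

# Correlated draws of a full year's monthly usage for that building.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, Sigma, size=1000)
```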
These accurately generated samples (monthly energy estimates) can greatly help urban energy-related research by allowing statistical analysis of uncertainties. Additionally, if some monthly energy usage data are missing, the second moment matrix can be used to estimate (impute) the missing values, thereby significantly improving the quality of the data.
However, for the multivariate normal distribution to be defined in this context, the research data must be nearly symmetrical around the mean, and the tails of the distribution must not be excessively thick or thin. In our case, the 2021 building data (energy usage, floor area, building purpose, etc.) used in this discussion are generally in line with these assumptions.
Sample Extraction Using Multivariate Normal Distribution
By defining the multivariate normal distribution based on the second moment matrix, it is possible to extract samples of monthly energy usage (January, February, …, December) for an entire year. This approach differs from previous studies because it accounts for the correlation between residuals in the regression model, thus incorporating 'seasonal trends' when generating samples. In simple terms, a building that used significantly more electricity than similar buildings in July can now also be estimated to use more electricity in August.
Example Reflecting Covariance
To validate this claim, let's examine the energy usage data samples drawn from the multivariate normal distribution in the figure below.
The figure shows that the seasonal energy usage trends in the samples are very similar to the actual data. For example, electricity consumption rises significantly during the summer months (July-August) when air conditioning is heavily used, while gas consumption increases during the winter months (December-February) when heating is in high demand. This confirms that our statistical model accurately reflects reality.
Example Without Covariance
Now, let’s see what happens when we extract samples without considering the covariance between energy usage in different months, as in previous studies. This is equivalent to setting the off-diagonal elements of the covariance matrix to zero in the multivariate normal distribution used for the sample extraction.
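For contrast, this corresponds to keeping only the diagonal of the covariance matrix estimated in the sketch above, which can be mimicked as follows.

```python
# Zero out the off-diagonal covariances: months are then sampled independently.
Sigma_diag = np.diag(np.diag(Sigma))
samples_indep = rng.multivariate_normal(mu, Sigma_diag, size=1000)
# In these draws, a building far below average in July can easily come out
# far above average in August, which rarely happens in the real data.
```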
If a building exhibits significantly lower energy usage in July, it would be reasonable to expect that it consistently uses less energy, meaning its August usage should also be below average.
However, in this case, the model failed to incorporate the covariance information, leading to unrealistic results. As illustrated in the first figure, a building that consumed much less electricity than similar buildings in July unexpectedly uses much more electricity in August compared to others, which defies typical expectations.
Missing Data Estimation Using Multivariate Normal Distribution
In addition to sample extraction, another application is missing data estimation (imputation). For example, the Ministry of Land, Infrastructure, and Transport data sometimes has missing monthly energy usage for certain buildings, or some recorded values may be abnormal. If correct usage data exists for the other months, can we estimate the missing usage based on the recorded values?
If energy usage is recorded for the first and last months of a three-month period, but missing for the second month, we might compromise by using the middle value. But what if two consecutive months are missing? Or if the last month's usage is missing, so the middle value cannot be defined using the following month's data? What should be done then?
Using the multivariate normal distribution derived in this study, missing values can be reasonably estimated in any case. As shown in the formula, when some elements of a random vector following a multivariate normal distribution are fixed, the remaining elements follow a reduced-dimensional conditional multivariate normal distribution based on the fixed values. This allows us to estimate the missing values using the conditional mean of this distribution.
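The formula referred to here is the standard conditional distribution of a multivariate normal: partitioning the monthly usage vector into a missing part $X_1$ and an observed part $X_2$,

\[
\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)
\quad\Longrightarrow\quad
X_1 \mid X_2 = x_2 \sim N\!\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right),
\]

and the conditional mean $\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$ is what is used to fill in the missing months.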
The graph above shows missing values filled in using the conditional mean of the multivariate normal distribution. The blue solid line represents the actual monthly energy usage of a building, while the orange circles represent the conditional mean for February, July, and October, assuming these months' usage is missing. The green squares represent the conditional mean for October to December, assuming the usage from January to September is given, which can be viewed as future usage predictions. The conditional mean does not deviate much from the actual values, indicating that using the conditional mean to estimate missing values is reasonable.
All in all, accurate energy consumption forecasting requires a statistical approach that goes beyond simple regression models, taking into account the correlations between various variables and complex factors. By using a multivariate normal distribution model, it is possible to make more realistic predictions by considering the correlations between monthly energy consumption, which can improve the efficiency of energy supply planning. This approach can also be useful for addressing statistical errors overlooked in previous studies and for imputing missing data. Ultimately, more accurate energy consumption forecasting will serve as crucial foundational data for preparing for winter energy crises, while also contributing to improving energy efficiency and preventing resource waste.
I. The Digital Advertising Market is Facing the Issue of Measurement Error
Digital advertising has been growing explosively every year. Especially during the global pandemic, as the offline market significantly contracted, the shift in consumer focus to online platforms made digital advertising the mainstream in the global advertising market.
The core of digital advertising is undoubtedly the smartphone. With the ability to access the web anytime and anywhere via smartphones, internet-based media have emerged in the advertising market. Particularly, app-based platform services that offer customized user experiences have surged rapidly, significantly contributing to the growth of digital advertising. This market has been driven by the convenience smartphones offer compared to traditional devices like PCs and tablets.
However, the digital advertising industry is currently grappling with the issue of "Measurement Error". This problem causes significant disruptions in accurately measuring and predicting advertising performance.
The Rapidly Growing Digital Advertising Market
The key difference between digital advertising and traditional advertising is the ability to track results. With traditional advertising, companies could only gauge brand awareness with estimates such as "we advertised on a platform seen by thousands of people per day." As a result, even when advertising agencies tried to analyze performance, they often left clients dissatisfied, because various types of noise made outcomes difficult to assess accurately.
With the advent of the web, advertising entered a new phase. User information is stored in cookies when people visit websites, allowing advertisers to track instantly which ads users viewed, which products they looked at, and what they purchased. As a result, companies can now easily verify how effective their ads are. Furthermore, they can compare multiple ads and quickly decide the direction of future campaigns.
The advent of smartphones has accelerated this paradigm shift. Unlike the past, when multiple people shared a single PC or tablet, we are now in the era of "one person, one smartphone", so behavior patterns on a specific device can be attributed to an individual user. According to a 2022 Gallup Korea survey, the smartphone penetration rate among Korean adults was 97%. In recent years, many companies have introduced hyper-personalized targeting services, signaling a major shift in the digital advertising market.
Issue in Digital Advertising: Measurement Error
However, everything has its pros and cons, and digital advertising is no exception. Industry professionals point out that the effectiveness of digital ads is hindered by factors such as user fatigue and privacy concerns. From my own experience in the advertising industry, the issue that stands out the most is "measurement error".
Measurement error refers to data being distorted due to specific factors, resulting in outcomes that differ from the true values. Common issues in the industry include users being exposed to the same ad multiple times in a short period, leading to insignificant responses, or fraudulent activities where malicious actors create fake ad interactions to gain financial benefits. Additionally, technical problems such as server instability can cause user data to be double counted, omitted, or delayed. For various reasons, the data becomes "contaminated", preventing advertisers from accurately assessing ad performance.
Of course, the media companies that deliver the ads are not idle either. They continuously update their advertising reports, correcting inaccurate figures for ad spend, impressions, clicks, and other performance metrics. During this process, the reported ad performance keeps changing for up to a week.
The problem is that for "demanders" like me, for whom accurate measurement of ad performance is crucial, measurement error leads to an endogeneity issue in performance analysis, significantly reducing the reliability of the analysis. Simply put, because the reports keep being revised due to measurement errors, it becomes difficult to accurately analyze ad performance.
Even where the industry's focus is not on measuring past performance but on predicting future values, measurement error remains a significant issue. It inflates the variance of the residuals, reducing the model's goodness of fit. Moreover, when the magnitude of the measurement error changes from day to day with the frequency of data updates, as in digital advertising data, non-linear models are especially likely to show poor predictive performance when extrapolating.
Unfortunately, due to the immediacy characteristic of digital advertising, advertisers cannot afford to wait up to a week for the data to be fully updated. If advertisers judge that an ad's performance is poor, they may immediately reduce its exposure or even stop the campaign altogether. Additionally, for short-term ads, such as promotions, waiting up to a week is not an option.
The situation is no different for companies claiming to use "AI" to automatically manage ads. Advertising automation is akin to solving a reinforcement learning problem, where the goal is to maximize overall ad performance within a specific period using a limited budget. When measurement error occurs in the data, it can disrupt the initial budget allocation. Ultimately, it is quite likely to result in optimization failure.
Research Objective: Analysis of the Impact of Measurement Error and Proposal of a Reasonable Prediction Model
If, based on everything we've discussed so far, you're thinking, "The issue of measurement error could be important in digital advertising," then I've succeeded. Unfortunately, the advertising industry is not paying enough attention to measurement error. This is largely because measurement errors are not immediately visible.
This article focuses on two key aspects. First, we analyzed the impact of measurement error on advertising data based on the size of the measurement error and the data. Second, we proposed a reasonable prediction model that considers the characteristics of the data.
II. Measurement Error from a Predictive Perspective
In this chapter, we will examine how measurement error affects actual ad performance.
Measurement Error: Systematic Error and Random Error
Let's delve a bit deeper into measurement error. Measurement error can be divided into two types: systematic error and random error. Systematic error has a certain directionality; for example, values are consistently measured higher than the true value. This is sometimes referred to as the error having a "drift". On the other hand, random error refers to when the measured values are determined randomly around the true value.
So, what kind of distribution do the measured values follow? For instance, if we denote the size of the drift as $\alpha$ and the true value as $\mu$, the measured value, represented as the random variable X, can be statistically modeled as following a normal distribution, $N(\mu + \alpha, \sigma^{2})$. In other words, the measured value is shifted by $\alpha$ from the true value (systematic error) while also having variability of $\sigma^2$ (random error).
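As a quick illustration, the sketch below (with hypothetical values for $\mu$, $\alpha$, and $\sigma$) simulates measurements from this distribution and shows the drift appearing as bias in the mean and the random error appearing as spread.

```python
import numpy as np

rng = np.random.default_rng(42)

mu = 100.0     # hypothetical true value
alpha = 5.0    # drift of the systematic error
sigma = 2.0    # spread of the random error

# Measured values X ~ N(mu + alpha, sigma^2):
# shifted by alpha (systematic error) and scattered with variance sigma^2 (random error).
x = rng.normal(loc=mu + alpha, scale=sigma, size=10_000)

print(round(x.mean(), 2))  # ~105: the drift shows up as bias in the mean
print(round(x.std(), 2))   # ~2:   the random error shows up as spread
```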
Systematic errors can be resolved through data preprocessing and scaling, so they are not a significant issue from an analyst's perspective. Specifically, removing the directional bias by $\alpha$ from the measurements is usually sufficient. On the other hand, random errors significantly influence the magnitude of measurement errors and can cause problems. To resolve this, a more statistically sophisticated approach is required from the analyst's perspective.
Let's take a closer look at the issues that occur when data contains random errors. In regression models, when measurement errors are included in the independent variables, a phenomenon known as "Regression Dilution" occurs, where the absolute value of the estimated regression coefficients shrinks towards zero. To understand this better, imagine including an independent variable filled with measurement errors in the regression equation. Since this variable fluctuates randomly due to the random component, the effect of the regression coefficient will naturally appear as zero. This issue is not limited to basic linear regression models but occurs in all linear and nonlinear models.
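A small simulation makes this attenuation visible; the numbers below are arbitrary and only meant to sketch the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

x_true = rng.normal(0.0, 1.0, n)              # error-free regressor
y = 2.0 * x_true + rng.normal(0.0, 1.0, n)    # true slope = 2

for err_sd in [0.0, 0.5, 1.0, 2.0]:
    x_obs = x_true + rng.normal(0.0, err_sd, n)                      # add random measurement error
    slope = np.cov(x_obs, y, ddof=1)[0, 1] / np.var(x_obs, ddof=1)   # OLS slope estimate
    # Expected attenuation factor: var(x_true) / (var(x_true) + err_sd^2)
    print(f"error sd = {err_sd:.1f}  estimated slope = {slope:.2f}")
```

As the error standard deviation grows, the estimated slope shrinks from 2 toward 0, exactly the regression dilution described above.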
The Data Environment in Digital Advertising
So far, we have discussed measurement errors. Now, let's examine the environment in which digital advertising data is received for modeling purposes. In Chapter 1, we mentioned that media companies continuously update performance data such as impressions, clicks, and ad spend to address measurement errors. Given that the data is updated up to a week later, when the data is first received, it is likely that a significant amount of measurement error is present. However, as the data gets updated the next day, it becomes more accurate, and by the following day, it becomes even more precise. Through this process, the measurement error in the data tends to decrease exponentially.
Since the magnitude of the measurement error changes with each update, heteroskedasticity arises on top of the model-fit problems. When heteroskedasticity occurs, the estimates become inefficient from an analytic perspective. From a predictive perspective, it also makes extrapolation difficult, as predicting new values from the existing data tends to perform poorly.
Additionally, as ad spend increases, the magnitude of measurement errors grows. For example, when spending 1 dollar on advertising, the measurement error might only be a few cents, but with an ad spend of 1 million dollars, the error could be tens of thousands of dollars. In this context, it makes sense to use a multiplicative model, where a random percentage change is applied based on the ad spend. Of course, it is well-known that regression dilution can occur in multiplicative models, just as it does in additive models.
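The sketch below, using made-up spend levels and a made-up error scale, shows how a multiplicative (log-normal) error keeps the percentage error roughly constant while the absolute error grows with spend.

```python
import numpy as np

rng = np.random.default_rng(1)

spend_true = np.array([1.0, 100.0, 10_000.0, 1_000_000.0])   # hypothetical ad spend levels
# Multiplicative error: observed spend = true spend * exp(noise),
# so the percentage error is similar at every level but the absolute error scales with spend.
spend_obs = spend_true * np.exp(rng.normal(0.0, 0.05, size=spend_true.shape))

print(np.abs(spend_obs - spend_true).round(2))   # cents at $1, tens of thousands at $1M
```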
Model and Variable Selection
We have defined the dependent variable as the "number of events" that occur after users respond to an ad, based on their actions on the web or app. Events such as participation, sign-ups, and purchases are countable, occurring as 0, 1, 2, and so on, which means a model that best captures the characteristics of count data is needed.
For the independent variables, we will use only "ad spend" and the "lag of ad spend," as these are factors that the advertiser can control. Metrics like impressions and clicks are variables that can only be observed after the ads have been served, meaning they cannot be controlled in advance, and are therefore excluded from a business perspective. Impressions are highly correlated with ad spend, meaning these two variables contain similar amounts of information. This will play an important role later in the modeling process.
Meanwhile, to understand the effect of measurement errors, we need to deliberately "contaminate" the data by introducing measurement errors into the ad spend. The magnitude of these errors was set within the typical range observed in the industry, and simulations were conducted across various scenarios.
The proposed models are a Poisson regression-based time series model and a Poisson Kalman filter. We chose models based on the Poisson distribution to reflect the characteristics of count data.
The reason for using Poisson regression is that it helps to avoid the issue of heteroskedasticity in the residuals. Poisson regression and other generalized linear models (GLMs) model the mean through a link function and tie the variance to the mean through the variance function. This allows us to mitigate the heteroskedasticity problem mentioned earlier to some extent.
Furthermore, using the Poisson Kalman filter allows us to partially avoid the measurement error issue. This model assumes a Poisson distribution in the observation equation while compensating for inaccuracies in the observations (including measurement errors) through the state equation. This characteristic enables the model to inherently address the inaccuracies in the observed data.
The Effect of Measurement Error
First, we will assess the effect of measurement error using the Poisson time series model.
Here, Spend represents the ad spend from the current time point up to 7 time points prior, and $\beta$ captures the lagged effects embedded in the residuals, beyond the effect of ad spend. Additionally, $\alpha$ accounts for the day-of-week effects.
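As a rough sketch of this kind of specification (not the exact model used in the study), a Poisson GLM with current and lagged spend plus day-of-week dummies could be fit as follows; `df` is a hypothetical daily DataFrame with a DatetimeIndex and `events` (counts) and `spend` columns.

```python
import pandas as pd
import statsmodels.api as sm

def fit_poisson_spend_model(df: pd.DataFrame, max_lag: int = 7):
    """Sketch: Poisson regression of daily events on spend, its lags, and day-of-week."""
    X = pd.DataFrame(index=df.index)
    for lag in range(max_lag + 1):                    # spend at t, t-1, ..., t-7
        X[f"spend_lag{lag}"] = df["spend"].shift(lag)
    dow = pd.get_dummies(pd.Series(df.index.dayofweek, index=df.index),
                         prefix="dow", drop_first=True).astype(float)
    X = pd.concat([X, dow], axis=1).dropna()          # day-of-week effects (the alpha term)
    y = df.loc[X.index, "events"]
    return sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()
```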
Although it may be too lengthy to include, we confirmed that this model reasonably reflects the data when considering model fit and complexity.
What we are really interested in is the measurement error. How did the measurement error affect the model’s predictions? To explore this, we first need to understand time series cross-validation.
Typically, K-fold or LOO (Leave-One-Out) methods are used when performing cross-validation on data. However, for time series data, where the order of the data is crucial, excluding certain portions of the data is not reasonable. Therefore, the following method is applied instead.
Fit the model using the first $d$ data points and predict future values.
Add one more data point, fit the model with ($d+1$) data points, and predict future values.
Repeat this process.
This can be illustrated as follows.
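A minimal sketch of this expanding-window (rolling-origin) procedure, written generically around hypothetical `fit_fn` and `predict_fn` callables, looks like this.

```python
import numpy as np

def expanding_window_mae(y, X, fit_fn, predict_fn, d: int = 30):
    """1-step-ahead MAE with an expanding training window.

    fit_fn(X_train, y_train) returns a fitted model;
    predict_fn(model, x_next) returns the forecast for the next point.
    """
    errors = []
    for t in range(d - 1, len(y) - 1):
        model = fit_fn(X[: t + 1], y[: t + 1])       # fit on the first t + 1 points
        forecast = predict_fn(model, X[t + 1])       # predict the very next point
        errors.append(abs(y[t + 1] - forecast))
    return float(np.mean(errors))
```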
Using this cross-validation method, we calculated the 1-step ahead forecast accuracy, with the evaluation metric set as MAE (Mean Absolute Error), taking the Poisson distribution into account.
An interesting result was found: in the table above, for low levels of measurement error (0.5 ~ 0.7), the model with measurement error recorded a lower MAE than the model without it. Wouldn’t we expect the model without measurement error to perform better, according to conventional wisdom?
This phenomenon occurred due to the regularization effect introduced by the measurement error. In other words, the measurement error caused attenuation bias in the regression coefficients, which helped mitigate the issue of high variance to some extent. In this case, the measurement error effectively played the role of the regularization parameter, $\lambda$, that we typically focus on in regularization.
Let's look at Figure 5. If the variance of the measurement error increases infinitely, the variable becomes useless, as shown in the right-hand diagram. In this case, the model would be fitted only to the sample mean of the dependent variable, with an R-squared value of 0. However, we also know that a model with no regularization at all, as depicted in the left-hand diagram, is not ideal either. Ultimately, finding the right balance is crucial, and it’s important to “listen to the data” to achieve this.
Let’s return to the model results. While low levels of measurement error clearly provide an advantage from the perspective of MAE, higher levels of measurement error result in a higher MAE compared to the original data. Additionally, since measurement errors only occur in recent data, as the amount of data increases, the proportion of error-free data compared to data with measurement error grows, reducing the overall effect of the measurement error.
What does it mean that MAE gradually improves as the data size increases? Initially, the model had high variance due to its complexity, but as more data becomes available, the model begins to better explain the data.
In summary, a small amount of measurement error can be beneficial from the perspective of MAE, which means that measurement error isn’t inherently bad. However, since we can't predetermine the magnitude of measurement error in the independent variables, it can be challenging to decide whether a model that resolves the measurement error issue is better or if it's preferable to leave the error unresolved.
To determine whether stronger regularization would be beneficial, one approach is to add a constraint term with $\lambda$ to the model for testing. Since the measurement error has acted similarly to ridge regression, it is appropriate to test using L2 regularization in this case as well.
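One way to run this check is to grid-search the L2 strength of a penalized Poisson regression under time-ordered cross-validation; the sketch below uses scikit-learn's `PoissonRegressor`, whose `alpha` parameter plays the role of $\lambda$ (the feature matrix `X` and count target `y` are assumed to exist).

```python
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

def select_l2_strength(X, y):
    """Pick the ridge-style penalty that minimizes MAE under time-ordered CV."""
    grid = {"alpha": [0.0, 0.01, 0.1, 1.0, 10.0]}     # candidate lambda values
    cv = TimeSeriesSplit(n_splits=5)                  # respect the temporal order
    search = GridSearchCV(PoissonRegressor(max_iter=1000), grid,
                          scoring="neg_mean_absolute_error", cv=cv)
    return search.fit(X, y).best_params_["alpha"]
```

If the selected alpha is clearly above zero, stronger explicit regularization helps; if it collapses to zero, the implicit regularization already supplied by the measurement error is sufficient.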
If weaker regularization is needed, what would be the best approach? In this case, one option would be to reduce measurement error by incorporating the latest data updates from the media companies. Alternatively, data preprocessing techniques, such as applying ideas from repeated measures ANOVA, could be used to minimize the magnitude of the measurement error.
III. Measurement Error from an Analytic Perspective
In Chapter 2, we explained that from a predictive perspective, an appropriate level of measurement error can act as regularization and be beneficial. At first glance, this might make measurement error seem like a trivial issue. But is that really the case?
In this chapter, we will explore how measurement error impacts the prediction of advertising performance from an analytic perspective.
Endogeneity: Disrupting Performance Measurement
In Chapter 1, we briefly discussed ad automation. Since a customer's ad budget is limited, the key to maximizing performance with a limited budget lies in solving the optimization problem of how much budget to allocate to each medium and ad. This decision ultimately determines the success of an automated ad management business.
There are countless media platforms and partners that play similar roles. It's rare for someone to purchase a product after encountering just one ad on a single platform. For example, consider buying a pair of pants. You might first discover a particular brand on Instagram, then search for that brand on Naver or Google before visiting a shopping site. Naturally, Instagram, Naver, and Google all contributed to the purchase. But how much did each platform contribute? To quantify this, the advertising industry employs various methodologies. One of the most prominent techniques is Marketing Media Mix Modeling.
As mentioned earlier, many models are used in the advertising industry, but the fundamental idea remains the same: distributing performance based on the influence of coefficients in regression analysis. However, the issue of "endogeneity" often arises, preventing accurate calculation of these coefficients. Endogeneity occurs when there is a non-zero correlation between the explanatory variables and the error term in a linear model, making the estimated regression coefficients unreliable. Accurately measuring the size of these coefficients is crucial for determining each platform's contribution and for properly building performance optimization algorithms. Therefore, addressing the issue of endogeneity is essential.
Solution to the Endogeneity Problem: 2SLS
In econometrics, a common solution to the endogeneity problem is the use of 2SLS (Two-Stage Least Squares). 2SLS is a methodology that addresses endogeneity by using instrumental variables (IV) that are highly correlated with the endogenous variables but uncorrelated with the model’s error term.
Let's take a look at the example in Figure 6. We are using independent variable X to explain the dependent variable Y, but there is endogeneity in the red section of X, which negatively affects the estimation. To address this, we can use an appropriate instrumental variable Z, which is uncorrelated with the residuals of Y after removing X's influence (green), ensuring validity, and correlated with the original variable X, ensuring relevance. By performing the regression analysis only on the intersection of Z and X (yellow + purple), we can explain Y while solving the endogeneity problem in X. The key idea behind instrumental variables is to sacrifice some model fit in order to remove the problematic (red) section.
Returning to the main point, in our model, there is not only the issue of measurement error in the variables, but also the potential for endogeneity due to omitted variable bias (OVB), as we are using only "ad spend" and lag variables as explanatory variables. Since the goal of this study is to understand the effect of measurement error on advertising performance, we will use a 2SLS test with appropriate IV to examine whether the measurement error in our model is actually causing endogeneity from an analytic perspective.
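For reference, a minimal two-stage sketch with hypothetical arrays (`y` for ad performance, `x` for spend, `z` for the instrument) is shown below; the point estimate matches 2SLS, but a dedicated IV routine should be used for correct standard errors.

```python
import statsmodels.api as sm

def two_stage_least_squares(y, x, z):
    """Manual 2SLS: instrument z for the endogenous regressor x."""
    # Stage 1: project the endogenous regressor onto the instrument.
    stage1 = sm.OLS(x, sm.add_constant(z)).fit()
    x_hat = stage1.fittedvalues
    # Stage 2: regress the outcome on the stage-1 fitted values.
    stage2 = sm.OLS(y, sm.add_constant(x_hat)).fit()
    # Note: the coefficients are the 2SLS estimates, but these second-stage
    # standard errors are not valid 2SLS standard errors.
    return stage2.params
```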
IV for Ad Spend: Impressions
As we discussed earlier, instrumental variables can help resolve endogeneity. However, verifying whether an instrumental variable is appropriate is not always straightforward. While it may not be perfect, for this model, based on industry domain knowledge, we have selected "impressions" as the most suitable instrumental variable.
First, let’s examine whether impressions satisfy the relevance condition. In display advertising, such as banners and videos, a CPM (cost per thousand impressions) model is commonly used, where advertisers are charged based on the number of impressions. Since advertisers are billed just for showing the ad, there is naturally a very high correlation between ad spend and impressions. In fact, a simple correlation analysis shows a correlation coefficient of over 0.9. This indicates that impressions and ad spend have very similar explanatory power, thus satisfying the relevance condition.
The most difficult aspect to prove for an instrumental variable is its validity. Validity means that the instrumental variable must be uncorrelated with the residuals, that is, the factors in the dependent variable (advertising performance) that remain after removing the effect of ad spend. In our model, what factors might be included in the residuals? From a domain perspective, possible factors include the presence of promotions or brand awareness. Unlike search ads, where users actively search for products or brands, in display ads users are passively exposed to ads, as advertisers pay to have them shown by the media platforms. Therefore, the number of impressions, which reflects forced exposure to ads, is likely uncorrelated with factors such as brand awareness or the presence of promotions, which influence the residuals.
If you're still uncertain about whether the validity condition is satisfied, you can perform a correlation test between the instrumental variable and the residuals. As shown in the results of Figure 7, we cannot reject the null hypothesis of no correlation at the significance level of 0.05.
Of course, the instrumental variable, impressions, also contains measurement error. However, it is known that while measurement error in the instrumental variable can reduce the correlation with the original variable, it does not affect its validity.
Method for Detecting Endogeneity: Durbin-Wu-Hausman Test
Now, based on the instrumental variable (impressions) we identified, let's examine whether measurement error affects the endogeneity of the coefficients. After performing the Durbin-Wu-Hausman test, we can see that in some intervals the null hypothesis of no endogeneity is rejected. This indicates that measurement error is indeed inducing endogeneity and distorting the coefficients.
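The regression-based form of this test can be sketched as follows: regress spend on the instrument, add the first-stage residuals to the structural equation, and check whether their coefficient is significant (a significant coefficient rejects exogeneity). Variable names are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

def wu_hausman_test(y, x, z):
    """Durbin-Wu-Hausman (control-function) endogeneity check."""
    stage1 = sm.OLS(x, sm.add_constant(z)).fit()
    v_hat = stage1.resid                                    # first-stage residuals
    X_aug = sm.add_constant(np.column_stack([x, v_hat]))    # structural equation + residuals
    aug = sm.OLS(y, X_aug).fit()
    # A small p-value on v_hat means the null of no endogeneity is rejected.
    return aug.tvalues[-1], aug.pvalues[-1]
```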
Depending on the patterns in the newly acquired data revealed through this test, even seemingly robust models can change. Therefore, we can conclude that modeling with consideration for measurement error is a safer approach.
IV. Poisson Kalman Filter and Ensemble
Up until now, we have explored measurement error from both predictive and analytic perspectives. This time, we will look into the Poisson Kalman Filter, which corrects for measurement error, and introduce an "ensemble" model that combines the Poisson Kalman Filter with the Poisson time series model.
Poisson Kalman Filter, Measurement Error, Bayesian, and Regularization
The Kalman filter is a model that finds a compromise between the information from variables that the researcher already knows (State Equation) and the actual observed values (Observation Equation). From a Bayesian perspective, this is similar to combining the researcher's prior knowledge (Prior) with the likelihood obtained from the data.
The regularization and measurement error introduced in Chapter 2 can also be interpreted from a Bayesian perspective. This is because the core idea of regularization aligns with how strongly we hold the prior belief in Bayesian modeling that $\beta=0$. In Chapter 2, the (random) measurement error effectively drove the coefficients toward zero, which ties together the intuition behind Kalman filters, Bayesian inference, regularization, and measurement error. Therefore, using a Kalman filter essentially means incorporating measurement error through the state equation, and this can further be understood as building regularization into the model.
Then, how should we construct the observation equation? Since our dependent variable is count data, it would be reasonable to use the log-link function from the GLM framework to model it effectively.
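A minimal univariate sketch of this idea, not the exact model used in the study, is a random-walk state driving a Poisson observation through a log link, filtered with an extended-Kalman-style update.

```python
import numpy as np

def poisson_kalman_filter(y, q=0.01, x0=0.0, p0=1.0):
    """Sketch: state x_t = x_{t-1} + w_t, observation y_t ~ Poisson(exp(x_t))."""
    x, p = x0, p0
    filtered_mean = []
    for obs in y:
        # Predict step: random-walk state equation with process noise q.
        x_pred, p_pred = x, p + q
        # Linearize the log-link observation around the predicted state.
        lam = np.exp(x_pred)            # predicted Poisson mean
        h = lam                         # derivative of exp(x) at x_pred
        s = h * p_pred * h + lam        # innovation variance (Poisson variance = mean)
        k = p_pred * h / s              # Kalman gain
        # Update step: blend the prediction with the observed count.
        x = x_pred + k * (obs - lam)
        p = (1.0 - k * h) * p_pred
        filtered_mean.append(np.exp(x))
    return np.array(filtered_mean)
```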
Poisson Time Series vs. Poisson Kalman Filter
Let’s compare the performance of the Poisson time series model and the Poisson Kalman filter. First, looking at the log likelihood, we can see that the Poisson time series model consistently has a higher value across all intervals. However, when we examine the MAE, the Poisson Kalman filter shows superior performance. This suggests that the Poisson time series model is overfitted compared to the Poisson Kalman filter. In terms of computation time, the Poisson Kalman filter is also faster. However, since both models take less than 2 seconds to compute, this is not a significant factor when considering their application in real-world services.
If you look closely at Figure 10, you can spot an interesting detail: the decrease in MAE as the data volume increases is significantly larger for the Poisson time series model compared to the Poisson Kalman filter. The reason for this is as follows.
The Poisson Kalman filter initially reflected the state equation well, leading to a significant advantage in prediction accuracy (MAE) early on. However, as more data was added, it seems that the observation equation failed to effectively incorporate the new data, resulting in a slower improvement in MAE. On the other hand, the Poisson time series model suffered from poor prediction accuracy early on due to overfitting, but as more data came in, it was able to reasonably incorporate the data, leading to a substantial improvement in MAE.
Similar results were found in the model robustness tests. Specifically, in tests for residual autocorrelation, mean-variance relationships, and normality, the Poisson Kalman filter performed better when there was a smaller amount of data early on. However, after the mid-point, the Poisson time series model outperformed it.
Ensemble: Combining the Poisson Time Series and the Poisson Kalman Filter
Based on the discussion so far, we have combined the distinct advantages of both models to build a single ensemble model.
To simultaneously account for bias and variance, we set the constraint for the stacked model, which minimizes MAE, as follows.
As we observed earlier, the Poisson Kalman filter had a lower MAE across all intervals, so without considering the momentum of MAE improvement, the stacked model would output $p=0$ across all intervals, meaning it would rely 100% on the Poisson Kalman filter. However, since the MAE of the Poisson time series model improves significantly in the later stages with a larger data set, we introduced a weight $W$ in front of the absolute constraint to account for this.
How should the weights be assigned? First, as the amount of data increases, both models will become progressively more reliable, resulting in reduced variance. Additionally, the model that performs better will typically have lower variance. Therefore, by assigning weights inversely proportional to the variance, we can effectively reflect the models' increasing accuracy over time.
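A stripped-down version of this stacking step, which omits the weight $W$ and simply grid-searches the mixing proportion on a recent window, might look like the following; `pred_ts` and `pred_kf` are hypothetical arrays of 1-step-ahead forecasts from the two models.

```python
import numpy as np

def blend_weight(y, pred_ts, pred_kf, window: int = 14):
    """Choose p in [0, 1] for the blend p * TS + (1 - p) * KF on a recent window."""
    y_w, ts_w, kf_w = y[-window:], pred_ts[-window:], pred_kf[-window:]
    grid = np.linspace(0.0, 1.0, 101)
    maes = [np.mean(np.abs(y_w - (p * ts_w + (1.0 - p) * kf_w))) for p in grid]
    return float(grid[int(np.argmin(maes))])
```

In the full model described above, the weight $W$ and the inverse-variance weighting shrink $p$ toward whichever model has recently been more stable; the grid search here is only the skeleton of that idea.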
The predictions from the final model, which incorporates the weights, are as follows.
When analyzing the data using the ensemble model, we observed that in the early stages, p (the weight of the Poisson time series model in the ensemble) remained close to 0, but then jumped near 1 in the mid stages. Additionally, in certain intervals during the later stages where the data patterns change, we can see that the Poisson Kalman filter, which leverages the advantages of the state equation, is also utilized.
Let’s take a look at the MAE of the ensemble model in Figure 12. By reasonably combining the two models that exhibit heterogeneity, we can see that the MAE is lower across all intervals compared to the individual models. Additionally, in the robustness tests, we confirmed that the advantages of the ensemble were maximized, making it more robust than the individual models.
Conclusion
While the fields of applied statistics, econometrics, machine learning, and data science may have different areas of focus and unique strengths, they ultimately converge on the common question: "How can we rationally quantify real-world problems?" Additionally, a deep understanding of the domain to which the problem belongs is essential in this process.
This study focuses on the measurement error issues commonly encountered in the digital advertising domain and how these issues can impact both predictive and analytic modeling. To address this, we presented two models, the Poisson time series model and the Poisson Kalman filter, tailored to the domain environment (advertising industry) and the data generating process (DGP). Considering the strong heterogeneity between the two models, we ultimately proposed an ensemble model.
As smartphones become ubiquitous, the digital advertising market is set to grow even more rapidly in the future. I hope that as you read this paper, you take the time to savor the knowledge rather than hurriedly trying to absorb the text. It would be wonderful if you could expand your understanding of how statistics applies to the fields of data science and artificial intelligence.
Having completed over half of my graduate courses and approaching graduation, I wanted to write a thesis in a field that heavily utilized machine learning and deep learning, rather than relying on traditional statistical analysis methods. This felt more aligned with the purpose of my graduate education in data science and artificial intelligence, making the experience more meaningful.
Data Accessibility and Deep Learning Applicability
Like many others, I struggled to obtain data. This led me to choose a field where data was accessible, but conventional methods failed to uncover meaningful insights. In graduate school, we weren't restricted to specific topics. Instead, we learned a variety of data analysis methods based on mathematical and statistical principles. This flexibility allowed me to explore different areas of interest.
I eventually chose topic modeling with deep learning as the subject of my thesis. I selected topic modeling because deep learning methodologies in this field have matured well, moving beyond purely statistical formulations toward generative models that stack layers of factor-analysis-like structure and trace latent structure from the probability of the data. Moreover, there were many excellent researchers in Korea working in this field, making it easier to access good educational resources.
Korea's High Dependency on Exports
During a conversation with my mentoring professor, he suggested, "Instead of focusing on the analysis you're interested in, why not tackle an AI problem that society needs?" Taking his advice to heart, I started exploring NLP problems where I could make a meaningful contribution, using the IMRaD approach. This search led me to a research paper that immediately caught my attention.
The paper highlighted that while Korea has a globally respected economic structure, it remains heavily dependent on foreign markets rather than its domestic one. This dependency makes Korea's economy vulnerable to downturns if demand from advanced countries decreases. Furthermore, recent trade tensions with China have significantly affected Korea's trade sector, emphasizing the need for export diversification. Despite various public institutions—such as KOTRA, the Korea International Trade Association, and the Small and Medium Business Corporation—offering services to promote this diversification, the paper questioned the overall effectiveness of these efforts. This issue resonated with me, sparking ideas on how AI could potentially offer solutions.
Big Data for Korean Exports
The paper's biggest criticism is that the so-called 'big data services' provided by public institutions don't actually help Korean companies with buyer matching. Large companies that already export have established connections and the resources to keep doing so without much trouble. However, small businesses and individual sellers, who lack those resources, are left to find new markets on their own. In this context, public institutions aren't providing the necessary information or effectively helping with buyer matching, which is critical for improving the country's economy and competitiveness.
While the institutions mentioned do have the experience, expertise, and supply chains to assist many Korean companies with exporting, the paper highlights that their services aren't truly utilizing big data or AI. So, what would a genuine big data and AI service look like—one that leverages the strengths of these institutions while truly benefiting exporters and sellers? The idea I developed is a service that offers 'interpretable' models, calculations, and analysis results based on data, providing practical support for decision-making.
LDA Does Not Capture 'Context'
When I think about AI, the first professor who comes to mind is Andrew Ng. I’m not sure why, but at some point, I started noticing more and more people around me mentioning his lectures, interviews, or research papers. Given that his work dates back to the early 2000s, I might have come across it later than many others. Interestingly, the topic modeling in my thesis traces back to Professor Ng’s seminal paper on Latent Dirichlet Allocation (LDA).
In LDA, the proportion (distribution) of topics is assumed to follow a Dirichlet distribution (a multivariate generalization of the beta distribution), and based on this assumption, the words in each document are temporarily assigned to topics. Then, prior parameters are calculated from the words in each topic, and the changes are measured using KL-divergence. The topic assignments are repeatedly adjusted until the changes narrow and the model converges. Since LDA is typically estimated with Gibbs sampling, it differs significantly from Neural Variational Inference (NVI), which I will explain later.
Simply put, the goal of LDA is to identify hidden topics within a collection of documents. The researcher sets the number of topics, $k$, and the LDA model learns to extract those topics from the data.
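As a toy illustration (not the thesis pipeline), fitting LDA with $k=2$ topics on a few made-up sentences looks like this with scikit-learn.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["exports to china fell sharply", "the bank raised interest rates",
        "semiconductor exports recovered", "financial markets and rate hikes"]
counts = CountVectorizer().fit_transform(docs)              # bag-of-words matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts).round(2))                       # per-document topic proportions
```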
The biggest flaw in LDA is that it assumes the order and relationship of words are conditionally independent. In other words, when using LDA, the next word or phrase can be assigned to a completely different topic, regardless of the conversation's context. For example, you could be discussing topics B and C, but the model might suddenly shift to topic A without considering the flow of the discussion. This is a major limitation of LDA — it doesn’t account for context.
LSA Uses All the Information
On the other hand, Latent Semantic Analysis (LSA) addresses many of LDA's weaknesses. It extracts the relationships between words, words and documents, and documents and documents using Singular Value Decomposition (SVD).
The principle is simple. For a set of documents, word occurrences and co-occurrences are tabulated in an $n \times m$ term-document matrix. By decomposing this matrix with SVD, we obtain singular vectors representing words and documents, and the corresponding singular values indicate how much weight each latent dimension carries in the vector space.
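A minimal sketch of this decomposition on the same toy documents, using TF-IDF and truncated SVD, is shown below.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["exports to china fell sharply", "the bank raised interest rates",
        "semiconductor exports recovered", "financial markets and rate hikes"]
tfidf = TfidfVectorizer().fit_transform(docs)     # n_docs x n_terms matrix
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = lsa.fit_transform(tfidf)               # document vectors in the latent space
print(lsa.singular_values_.round(3))              # weight of each latent dimension
```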
Although LSA is a basic model that relies on simple SVD calculations, it stands out for utilizing all the statistical information from the entire corpus. However, it has its limitations. LSA operates on a frequency-based matrix such as TF-IDF, and since the matrix elements only capture the occurrence counts (or weighted counts) of words, it struggles to infer word meanings accurately.
GloVe: Contextual Word Embeddings
A model designed to capture word order, or 'contextual dependency,' between words is Global Vectors for Word Representation (GloVe). Word2Vec focuses on local context but doesn't fully leverage the global statistics of the corpus; GloVe was developed to address this limitation while retaining the strengths of both approaches (LSA's use of global corpus statistics and Word2Vec's use of local context). In essence, GloVe reflects the statistical information of the entire corpus and uses dense representations that allow for efficient calculation of word similarities. The goal of GloVe is to find embedding vectors that minimize the objective function $J$, shown below.
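For reference, the standard published form of the GloVe objective (Pennington et al., 2014), with word vectors $w_i$, context vectors $\tilde{w}_j$, biases $b_i, \tilde{b}_j$, and co-occurrence counts $X_{ij}$, is:

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^{2}$$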
To bypass the complex derivation of GloVe's objective function $J$ and focus on the core idea, GloVe can be summarized as aiming to make the dot product of two embedded word vectors correspond to the logarithm of those words' co-occurrence count in the entire corpus. This is done by applying least squares estimation to the objective function, with a weighting function $f(X_{ij})$ added so that very rare and very frequent co-occurrences do not dominate the fit.
The GloVe model requires training on a large, diverse corpus to effectively capture language rules and patterns. Once trained, it can generate generalizable word representations that can be applied across different tasks and domains.
The reason for highlighting the evolution of topic modeling techniques from LDA to GloVe is that GloVe's embedding vectors, which capture comprehensive document information, serve as the input for the core algorithm of this paper, Graph Neural Topic Model (GNTM).
Word Graph: Uncovering Word Relationships
The techniques we've explored so far (LDA, LSA, GloVe) were all developed to overcome the limitations of assuming word independence. In other words, they represent the step-by-step evolution of word embedding methods to better capture the 'context' between words. Now, taking a slightly different approach, let's explore Word Graphs to examine the relationships between words.
Both GloVe and Word Graph aim to understand the relationships between words. However, while GloVe maps word embeddings into Euclidean space, Word Graph defines the structure of word relationships within a document in non-Euclidean space. This allows Word Graph to reveal 'hidden' connections that traditional numerical methods in Euclidean space might overlook.
So, how can we calculate the 'structure that represents relationships between nearby words'? To do this, we use a method called Global Random Field (GRF). GRF models the graph structure within a document by using the topic weights of words and the information from the topics associated with the edges that connect the words in the graph. This process helps capture the relationships between words and topics, as illustrated below.
The key point here is that the sum in the last term of the edge does not equal 1. If $w’$ corresponds to topic 1, the sum of all possible cases for $w’’$ would indeed be 1. However, since $w’$ can also be assigned to other topics, we need to account for this. As a result, the total number of edges $|E|$, which acts as a normalizing factor, is included in the denominator.
The GTRF model proposed by Li et al. (2014) is quite similar to GRF. The key difference is that the distribution of topic $z$ is now conditional on $\theta$, with the EM algorithm still being used for both learning and inference. The resulting $p_{GTRF}(z|\theta)$ represents the probability of two topics being related. In other words, it calculates whether neighboring words $w'$ and $w''$ are assigned to the same or different topics, helping to determine the probability of the graph structure.
We have reviewed the foundational keyword embedding techniques used in this study and introduced GTRF, a model that captures 'the relationships between words within topics' through graph-based representation.
Graph Neural Topic Model: Key Differences
Building on the previous discussion, let's delve into the core of this study: the Graph Neural Topic Model (GNTM). GNTM utilizes a higher-order Graph Neural Network (GNN). As illustrated in the diagram, GNTM increases the order of connections, enabling a more comprehensive understanding and embedding of complex word relationships.
Additionally, GNTM significantly reduces computational costs by utilizing Neural Variational Inference (NVI), rather than standard Variational Inference (VI), to streamline the learning process. Unlike LDA, GNTM introduces an extra step to incorporate 'graph structure' into its calculations, further enhancing efficiency.
Let’s explore the mechanism of GNTM (GTRF). The diagram above compares the calculations of GTRF and LDA side by side. As previously mentioned, GTRF learns how the structure of $z$ evolves based on the conditional distribution once $\theta$ is determined.
This may seem complex, so let’s step back and look at the bigger picture. If we assume topics are evenly distributed throughout the document, each topic will have its own proportion. We’ll refer to the parameter representing this proportion as $\alpha$.
Here, $\alpha$ (similar to the LDA approach) is a parameter that defines the shape of the Dirichlet Distribution, an extended version of the Beta distribution. The shape of the distribution changes according to $\alpha$, as shown below.
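A quick numerical sketch of how $\alpha$ shapes the sampled topic proportions (here for three topics and arbitrary $\alpha$ values):

```python
import numpy as np

rng = np.random.default_rng(0)
# Small alpha -> sparse, peaked proportions; alpha = 1 -> uniform over the simplex;
# large alpha -> proportions concentrated near (1/3, 1/3, 1/3).
for alpha in [0.1, 1.0, 10.0]:
    print(alpha, rng.dirichlet([alpha] * 3, size=3).round(2))
```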
After setting the topic proportions using the parameter $\alpha$, a variable $\theta_d$ is derived to represent these proportions. While this defines the ratio, the outcomes remain flexible. Additionally, once the topic $z$ is established, the structure $G$ and word set $V$ are determined accordingly.
First Difference: Incorporating Graphs into LDA
Up to this point, we've discussed the process of quantifying the news information at hand. Now, it's time to consider how to calculate it accurately and efficiently.
The first step is straightforward: set all the parameters of the Dirichlet distribution ($\alpha$) to 1, creating a uniform distribution across $n$ dimensions. This approach assumes equal proportions for all topics, as we don't have any prior information.
Next, assuming a uniform topic distribution, the topic proportions are randomly sampled. Based on these proportions, words in the news articles are then randomly assigned to their corresponding topic $z$. The intermediate graph structure $G$, however, doesn't need to be predefined, as it will be 'learned' during the process to reveal hidden relationships. The initial formula can thus be summarized as follows.
As we saw with GTRF, the probability of a graph structure forming depends on the given condition, in this case, the topic. This is represented by multiplying all instances of $p(1-p)$ from the binomial distribution's variance. In other words, topics are randomly assigned to words, and based on those assignment ratios, we can calculate $m$. With this value, the probability of structures forming between topics is then quantified using the variance from the binomial distribution.
Second Difference: NVI
The final aspect to examine is NVI. NVI estimates the posterior distribution of latent topics in text data. It uses a neural network to parametrize the approximate posterior, allowing accurate estimation of the true posterior across various distributions. In the process, it often uses the reparameterization trick from VI to keep the distributions tractable. Using neural networks means NVI can be applied to more diverse distributions than a standard VAE (Variational AutoEncoder), which learns a lower-dimensional representation of the data. This is supported by the Universal Approximation Theorem, which states that, in theory, a neural network can approximate any continuous function arbitrarily well.
To explain reparameterization further: instead of sampling directly from the original distribution, the sample is expressed as a deterministic function of learnable parameters and auxiliary noise. This allows backpropagation and effective gradient computation. This technique is often used in VAEs when sampling latent variables.
As mentioned earlier, both VI and NVI use the reparameterization trick. However, NVI's advantage is that it can estimate diverse distributions through neural networks. While traditional Dirichlet distribution-based VI uses only one piece of information, NVI can use two, mean and covariance, through the Logistic Normal Distribution. Additionally, like GTRF, which estimates topic structure, NVI reflects the process of inferring relationships between topics in the model.
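A minimal sketch of the reparameterized logistic-normal draw (with a hypothetical number of topics and a toy loss) shows how gradients reach the mean and covariance parameters.

```python
import torch

k = 10                                             # hypothetical number of topics
mu = torch.zeros(k, requires_grad=True)            # learnable mean
L = torch.eye(k, requires_grad=True)               # learnable scale (Cholesky-like factor)

eps = torch.randn(k)                               # noise drawn outside the computation graph
theta = torch.softmax(mu + L @ eps, dim=0)         # logistic-normal sample on the simplex

loss = -torch.log(theta[0])                        # toy objective for illustration
loss.backward()                                    # gradients flow back to mu and L
print(mu.grad is not None, L.grad is not None)     # True True
```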
So far, we have designed a topic model that maximizes computational efficiency while capturing word context using advanced techniques. However, as my research progressed, one question became my primary concern: "How can this model effectively assist in decision-making?"
The first key decision is naturally, "How many topics should be extracted using the GNTM model?" This is akin to the question often posed in PCA: "How many variables should be extracted?" Both decisions are critical for optimizing the model's usefulness.
Determining the Number of Topics
From a Computational Efficiency Standpoint
Let's first determine the number of topics with a focus on computational efficiency and minimizing costs. During my theoretical studies in school, I didn't fully grasp the significance of computational efficiency because the models we worked with were relatively "light" and could complete calculations within a few minutes.
However, this study deals with a vast dataset—around 4.5 to 5 million words—which makes the model significantly "heavier." While we've integrated various methods like LDA, graph structures, and NVI to reduce computational load and improve accuracy, failing to limit the number of topics appropriately would cause computational costs to skyrocket.
To address this, I compared the computational efficiency when the number of topics was set to 10 and 20. I used TC (Topic Coherence) to evaluate the semantic consistency of words classified into the same topic, and TD (Topic Diversity) to assess the variety of content across topics.
The results showed that when the number of topics was 10, the calculation speed improved by about 1 hour (for Epoch=100) compared to 20 topics, and the TD and TC scores did not drop significantly. Personally, I believe more accurate validation should be done by increasing the Epoch to 500, but since this experiment was conducted on a CPU, not a GPU, increasing the number of Epochs would take too much time, making realistic validation difficult.
There may be suggestions to raise the Epoch count further, but since this model uses Adam (Adaptive Moment Estimation) as its optimizer, it is expected to converge quickly to the optimal range even with a lower Epoch count, without significant changes.
From a Clustering Standpoint
In the previous discussion, we determined that 10 topics are optimal from a computational efficiency standpoint. Now, let's consider how many topics would lead to the best "seller-buyer matching"—or how many industries should be identified for overseas buyers to effectively find domestic sellers with relevant offerings—from a clustering perspective.
If the number of topics becomes too large, Topic Coherence (TC) decreases, making it harder to extract meaningful insights. This is similar to using adjusted $R^2$ in linear regression to avoid adding irrelevant variables, or to the caution needed in PCA when selecting variables after the explained variance stops increasing significantly.
As dimensions increase, the 'curse of dimensionality' emerges in Euclidean space. To minimize redundant variables, we utilized clustering metrics such as the Silhouette Index, Calinski-Harabasz Index, and Davies-Bouldin Index, all based on cosine similarity and correlation.
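These scores can be computed with scikit-learn as sketched below; note that only the silhouette score accepts a cosine metric directly, so the other two are computed on L2-normalized vectors as an approximation (inputs `X` and `labels` are assumed to exist).

```python
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)
from sklearn.preprocessing import normalize

def clustering_scores(X, labels):
    """Cluster-quality indices used to compare different numbers of topics."""
    X_unit = normalize(X)                        # unit vectors: Euclidean ~ cosine geometry
    return {
        "silhouette": silhouette_score(X, labels, metric="cosine"),
        "calinski_harabasz": calinski_harabasz_score(X_unit, labels),
        "davies_bouldin": davies_bouldin_score(X_unit, labels),
    }
```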
As shown in the figure above, the clustering results indicate that the optimal grouping occurs with 9 topics. This was achieved using agglomerative hierarchical clustering, which begins by grouping small units and progressively merges them until all the data is clustered.
Strengths of GNTM
The key strength of this method lies in its interpretability. By using a dendrogram, we can visually trace how countries are grouped within each cluster. The degree of matching can be calculated using cosine similarity, and the topics that form these matches, along with the specific content of the topics, can be easily interpreted through a word network.
The high level of interpretability enhances the model's potential for direct integration with KOTRA's topic service too. By leveraging the topics extracted from KOTRA's existing export-import data, this model can strengthen the capabilities of an AI-based buyer matching service using KOTRA's global document data. Additionally, the model's structure, calculation process, and results are highly transparent, making decision-making, post-analysis, and tracking calculations significantly more efficient. This interpretability not only increases transparency but also maximizes the practical application of the AI system.
Additionally, the model can be applied to non-English documents. While there may be some loss of information compared to English, as mentioned earlier, GloVe captures similar words with similar vectors across languages and reflects contextual relationships. As a result, applying this methodology to other languages should not present significant challenges.
Furthermore, the model has the ability to uncover hidden nonlinear relationships within the data through UMAP (Uniform Manifold Approximation and Projection) clustering, which goes beyond the limitations of traditional linear analysis. This makes it promising for future applications, not only in hierarchical clustering but also in general clustering and recommendation algorithms.
Summary
In summary, this paper presents a Natural Language Processing (NLP) study that performs nonlinear factor analysis to uncover the topic proportions $\theta$, accounting for the covariance between topics.
In other words, while factor analysis (FA) is typically applied to numerical data, this research extends nonlinear factor analysis to the NLP field, extracting the structure of words and topics, topic proportions, and the prior distribution governing these proportions, effectively quantifying information for each group.
One of the biggest challenges in PCA and FA-related studies is interpreting and defining the extracted factors. However, the 'GNTM' model in this paper overcomes the limitation of 'difficulty in defining factors' by presenting word networks for each topic.
Now, for each topic (factor), we can identify important words and interpret what the topic is. For example, if the words "bank," "financial," "business," "market," and "economic" dominate in Topic 1, it can be defined as 'Investment.'
The paper also optimizes the number of topics in terms of TC (Topic Coherence) and TD (Topic Diversity) to best fit the purpose of buyer-seller matching. The results for the optimized number of topics were visually confirmed using UMAP and word networks.
Lastly, to address the curse of dimensionality, clustering was performed using metrics based on cosine similarity and correlation.
What about the noise issue?
Text data often contains noise, such as special characters, punctuation, spaces, and unnecessary tags unrelated to the actual data. This model minimizes noise by using NVI, which extracts only important tokens, and by significantly increasing the number of epochs.
However, increasing the number of epochs exponentially raises computational costs. In real-world applications with time constraints, additional methods to quickly and efficiently reduce noise will be necessary.
Applicability
The most attractive feature that distinguishes GNTM from other NLP methods is its interpretability. While traditional deep learning is often called a 'black box,' making it hard for humans to understand the computation process, this model uses graph-based calculations to intuitively understand the factors that determine topics.
Additionally, GNTM is easy to apply. The Graph Neural Network Model, which forms the basis of GNTM, is available in a package format for public use, allowing potential users to easily utilize it as a service.
Furthermore, this study offers a lightweight model that can be applied by companies handling English text data, enabling them to quantify and utilize the data flexibly according to their objectives and needs.
Through UMAP (Uniform Manifold Approximation and Projection), the presence of nonlinear relationships in the data was visually confirmed, allowing for the potential application of additional nonlinear methods like LightGCN.
Moreover, since this paper assigned detailed topic proportions to each document, there is further research potential on how to utilize these topic proportions.
Future Research Direction
Similar to prompt engineering, which aims to get high-quality results from AI at a low cost, the future research direction for this paper will focus on 'how to train the model accurately and quickly while excluding as much noise as possible.' This includes applying regularization to prevent overfitting from noise and improving computational efficiency even further.
The real estate market is showing unusual signs. As global tightening begins, experts worry that the bubble in the domestic real estate market, which benefited from the post-COVID-19 liquidity, may burst. They warn we should prepare for a potential impact on the real economy.
Since late last year, major central banks, including the U.S. Federal Reserve, have been raising interest rates to combat inflation. This has caused housing prices to decline, reducing household net worth and increasing losses for real estate developers, which could potentially trigger a recession.
Global Liquidity and the Surge in Housing Prices
Meanwhile, some investors are attempting to exploit the 'bubble' in the real estate market for profit. They expect prices to fall soon and aim for capital gains by buying low. Others seek arbitrage opportunities, assuming prices haven’t yet aligned with fair value. For these investors, it is crucial to assess whether current property prices are discounted or overpriced compared to intrinsic value.
Similarly, for financial institutions heavily involved in mortgage lending, analyzing the real estate market is key to the success of their loan business. This study examines why identifying the 'bubble' in the real estate, especially in auctions, is important and how it can be explored mathematically and statistically.
Importance of Real Estate Auction Market
Various stakeholders participate in Korea's real estate auction market, each with distinct objectives. Homebuyers, investors seeking profit opportunities, and financial institutions managing mortgages are all active players. The apartment auction market, in particular, is highly competitive, with prices often closely aligned with those in the regular sales market.
Financial institutions are closely connected to the auction market. In Korea, when a borrower defaults on a property loan, the property is handled through court auctions or public sales overseen by the state. Financial institutions recover the loan amount by selling the collateral through these auctions in the event of a default.
Therefore, one of the key factors for financial institutions in determining their lending limits is how much principal they can recover in the auction market in the event of a default. This is especially true for fintechs (P2P lenders) and secondary lenders such as savings banks and capital firms, which are not subject to loan-to-value (LTV) restrictions.
Since most financial institutions hold a significant portion of their assets in mortgage loans, lending the maximum amount within a safe limit is ideal for maximizing revenue. Thus, when financial institutions review mortgage loan limits, trends in the auction market serve as a critical decision-making indicator.
To See Beyond Prices in the Market
It's easy to assume that the winning bid for an apartment auction in a certain area of Seoul, at a particular time, would either come at a discount or a premium compared to the general market price. And, with a bit of rights analysis, setting a cautious upper limit wouldn't be all that hard. But, in reality, it's a bit more complex than just making those assumptions.
Furthermore, if we want to examine the market movement from a broader perspective rather than focusing on individual auction cases, we need to change our approach. For example, it's easy to track Samsung’s stock price trends in the stock market, even down to minute-by-minute data over the past year. However, in real estate, auctions for a specific apartment, like Unit 301 of Building 103 in a particular complex, don’t happen every month. Even expanding the scope to the whole complex yields similar results. Therefore, it's no longer feasible to analyze the market purely based on prices. Real estate analysis must shift from a [time-price] perspective, as in stocks, to a [time-location] perspective.
Errors in the Auction Winning Bid Rate Indicator
Just as the general sales market has a time-series index like the apartment sales index, the auction market has the winning bid rate indicator. This is a monthly indicator published by local courts, showing the ratio of auction-winning bids to court-appraised values in a given area. For example, if the court appraises a property at 1 billion won and the winning bid is 900 million won, the winning bid rate would be 90%.
Since court appraisals are generally considered market prices, the winning bid rate represents the ratio of the auction price to the market price. When calculated for all auctions in an area, it gives the average auction price compared to the market value for that month.
However, this indicator has significant flaws. The court appraisal is set when the auction begins, but the winning bid reflects the price at the time of the auction. Given that auctions typically take 7 to 11 months, this time gap can lead to errors if market prices drop or rise sharply. For instance, news reports during recent price surges claimed that the winning bid rate in Seoul exceeded 120%, which seems hard to believe: how could auction prices be 1.2 times the market price? This figure is, in fact, misleading.
If market prices rise sharply during the 7 to 11 months it takes to complete an auction, bidders place their bids based on current market prices, while the court appraisal remains fixed at the start. As a result, the appraised value becomes relatively lower compared to the current market price, creating the illusion of a 120% winning bid rate. Interpreting this rate at face value can lead to poor real estate decisions or significant errors.
Limitations of Previous Studies
This has prompted previous auction market studies to try addressing the shortcomings of the winning bid rate indicator. For instance, some researchers adjusted the court-appraised value—the denominator—by factoring in the sales index at the time of the auction, aiming to estimate a more accurate "true winning bid rate."
However, experts agree this is not a perfect solution. To estimate the true winning bid rate for Seoul, all auctions during that period would need to have their court appraised values corrected. Each auction has different appraised values and closing dates, and researchers would have to manually correct hundreds or thousands of auctions. Expanding the region would make this task even more challenging, and even if corrected, the values would only be approximations, not guarantees of accuracy.
If researchers selectively sample auctions for convenience, it could introduce sampling bias. This is similar to trying to find the average height of Korean men by only sampling from a tall group.
It highlights the need for time-series indicators from a market perspective when making business decisions, rather than focusing solely on price data. The winning bid rate, commonly used in auctions, is prone to errors. Although methods like adjusting the court-appraised value have been suggested, they are difficult to apply in real-world scenarios.
These are the same problems I encountered as a practitioner. When time-series analysis was needed for decision-making, the persistent issues with the winning bid rate made it hard to use effectively.
Winning Price vs. Winning Bid Rate
There is an important distinction to make here. Analyzing the auction "winning price" and analyzing the "winning bid rate" have different meanings and purposes. As mentioned earlier, analyzing the winning price of a single auction case poses no issue.
For example, focusing on apartments, bidders base their bids on the market price at the time of bidding. If the gap between the bidding and final winning is about 1 to 2 months, considering that real estate prices don’t fluctuate dramatically like stocks within a month, the winning price should not significantly deviate from the market price a couple of months earlier. Factors like distance to schools, floor level, and brand, which are known to affect market prices, are likely already reflected in the market price, meaning they won't heavily impact the auction winning price.
Most prior studies on real estate auctions, particularly for apartments, have concentrated on how accurately they can predict the "winning price" and identifying the key factors that influence it. However, in practice, as previously discussed, price prediction is not the primary concern.
Even a simple linear regression analysis reveals that the R-squared between winning prices and KB market prices from 1-2 months earlier exceeds 95%, indicating a strong linear relationship. There is no evidence of a non-linear connection. If future trend forecasting is needed, the focus should shift toward a time-series analysis.
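As a minimal sketch of this kind of check (the data file and column names here are hypothetical, not the paper's), a lagged regression of winning prices on earlier KB market prices could look like the following:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical case-level data: 'win_price' is the auction winning price,
# 'kb_price_lag' is the KB market price observed one to two months before the sale.
cases = pd.read_csv("auction_cases.csv")

X = sm.add_constant(cases["kb_price_lag"])
ols = sm.OLS(cases["win_price"], X).fit()

print(ols.rsquared)  # in the paper's data this exceeded 0.95
print(ols.params)    # a slope near 1 means bids simply track the market price
```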
Discounts/Premiums Changing Over Time
After a lengthy introduction, let's get to the main point—I want to analyze the auction market. The problem is, the data contains significant errors, and trying to correct them individually has its limitations, especially within the industry. We need a different approach. So, what alternative methods can we use? And what insights can this new analysis reveal?
This is the core topic and background of the study. I used statistics as a tool to solve a seemingly insurmountable business problem encountered in practice.
What I aimed to find in the market was the difference between the sales market and the auction market. This 'difference' can be expressed as the discount or premium of the auction market compared to the sales market. Additionally, a time-series analysis is essential because the discount/premium factors will change over time depending on the economic or market conditions.
Factors of Discount/Premium in the Auction Market
Existing studies on the factors of discount/premium in housing auctions are quite varied. Nonetheless, as mentioned earlier, both international and domestic research mainly focus on price analysis rather than market analysis, making it difficult to grasp the broader trends of real estate. Typically, they gather auction cases over several years, exclude cases with legal complications, compare winning prices with market prices, conclude that a discount or premium existed, and attribute it to specific factors.
Moreover, overseas studies often deal with markets that permit private auctions and use different bidding systems, making their findings difficult to apply directly to Korea. Among domestic studies, the few that exist lack a market-level analysis.
The Challenge: 'Data Availability'
If we assume that there is a discount/premium factor in the auction market compared to the sales market, the auction sale rate can be restructured as follows:
Now, interpreting the three elements of the auction sale rate as influential factors and transforming it into a linear regression model, it would look like this:
EoM: Effect of Market Price (influence of the general sales market)
EoA: Effect of Appraisal Price (influence of court appraised values)
EoP: Effect of Price Premium (influence of discount/premium)
To complete this regression model, data for all three variables is needed. The effect of market prices can be substituted with the sales index provided by the Korea Real Estate Board. The sales index should be transformed using a log difference to match the format of the auction sale rate.
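As a minimal sketch of this step, assuming the model takes an additive form with the log-differenced sales index standing in for EoM (the file name, column names, and functional form are my own illustration, not the paper's notation):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series (names are mine, not the paper's):
#   sale_rate   - auction sale rate (winning bid / court appraisal), monthly average
#   sales_index - Korea Real Estate Board sales index
data = pd.read_csv("auction_monthly.csv", index_col="month")

# Log-difference the sales index so it is on the same rate-of-change footing
# as the auction sale rate; this series plays the role of EoM.
data["eom"] = np.log(data["sales_index"]).diff()

# Target model (EoA and EoP are not yet observable at this point):
#   sale_rate_t = b0 + b1 * EoM_t + b2 * EoA_t + b3 * EoP_t + e_t
```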
A major challenge lies in obtaining data for the other two variables. First, acquiring court-appraised price data at scale for research purposes is nearly impossible. The focus here is not on the historical appraised price itself, but on how much it influenced each analysis period (typically a month), which means the appraisal values would have to be re-aligned to each auction's closing time. Without digitizing every auction case nationwide over the past 10 years, that task is virtually unachievable. The discount/premium factor, for its part, is not directly observable at all. The two unobservable variables are therefore intertwined inside the auction sale rate, much like voices buried in background noise.
Factor Separation and Extraction
How can we isolate a specific voice from a noisy mix of sounds? This is where the Fourier Transform comes in. It converts a signal from the time domain to the frequency domain, separating it into individual frequency components. We can then keep the components that belong to the voice we want, set the rest to zero, and apply the inverse Fourier Transform to recover that voice alone, effectively filtering out the noise.
In the same way, if we view the auction sale rate as a noisy input signal, we can separate its contributing factors independently. First, by removing the effect of market prices from the auction sale rate using a regression model, we can assume that the residual term contains hidden influences from court appraisals and discount/premium factors. Among the remaining elements in the residuals, we can assume the two strongest factors are the court appraised price and discount/premium. Fourier Transform can then be used to extract these two independent signals.
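Continuing with the hypothetical series from the sketch above (sale_rate and eom), a minimal version of this two-step extraction with numpy might look like the following; which of the two dominant cycles corresponds to the appraisal and which to the premium is itself an assumption to be verified.

```python
import numpy as np
import statsmodels.api as sm

# Step 1: remove the market-price effect and keep the residual.
sample = data[["sale_rate", "eom"]].dropna()
resid = sm.OLS(sample["sale_rate"], sm.add_constant(sample["eom"])).fit().resid.values

# Step 2: move the residual to the frequency domain and keep the two
# strongest cycles, assumed to carry the appraisal and premium effects.
spectrum = np.fft.rfft(resid)
power = np.abs(spectrum)
power[0] = 0                      # ignore the zero-frequency (mean) term
top2 = np.argsort(power)[-2:]     # indices of the two dominant frequencies

def extract(idx):
    """Inverse-transform a single frequency component back to the time domain."""
    mask = np.zeros_like(spectrum)
    mask[idx] = spectrum[idx]
    return np.fft.irfft(mask, n=len(resid))

appraisal_component = extract(top2[1])  # assumed: court-appraisal effect
premium_component = extract(top2[0])    # assumed: discount/premium effect
```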
This assumption can be statistically verified. As shown in the table, when regressing the auction sale rate using the three initially assumed variables—two components extracted by Fourier Transform and the market price data—the adjusted R-squared is about 94%. In other words, the auction market can be explained by these three factors (market price, court appraisal, and discount/premium). Additionally, the ACF/PACF plot of the residuals after Fourier extraction (see figure below) shows no significant remaining patterns.
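A minimal sketch of this verification, reusing the series built above; the figures in the comments are the values reported in the paper, not outputs of this code:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Regress the auction sale rate on the three assumed factors.
X = sm.add_constant(np.column_stack([
    sample["eom"].values, appraisal_component, premium_component
]))
check = sm.OLS(sample["sale_rate"].values, X).fit()
print(check.rsquared_adj)  # about 0.94 on the paper's data

# If the three factors capture the structure, the residuals should look like noise.
plot_acf(check.resid, lags=24)
plot_pacf(check.resid, lags=24)
```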
Through the Fourier Transform, I was able to resolve both the limitations of the auction sale rate as a time-series data and the issue of relying on external data. I successfully extracted the two remaining factors (court appraisal and discount/premium) from the residuals after removing the effect of market prices.
However, I must caution that applying the Fourier Transform to general asset market data, such as stocks or bonds, is risky. The method only makes sense for data with reasonably consistent cycles. Unlike price or sales indices, the auction sale rate oscillates within a roughly 80-120% band in response to economic and market conditions, and it is this cyclical behavior that makes a frequency-domain decomposition workable.
Court Appraisal Extraction
The two factors extracted through the Fourier Transform are currently only assumptions, believed to represent court appraised value and the discount/premium factor. Therefore, it is necessary to accurately verify if these factors are indeed related to court appraised values and discount/premium factors. First, I analyzed two aspects using around 2,600 auction cases:
The average time gap between the court appraisal date and the auction closing date.
The relationship between court appraisal prices and KB market prices at the appraisal date.
The time gap between appraisal and auction ranged from 7 to 11 months (the interquartile range), and the relationship between the court appraisal price and the KB market price at the appraisal date showed a beta coefficient of 1.03, indicating that the two are nearly identical on average. Based on these two results, I reached the following conclusions:
There is a lag relationship between court appraisal prices and market prices (lag = time gap).
The lag variable of market prices can substitute for court appraisal prices.
A regression between the lagged sales index and the extracted court appraisal component showed about 54% explanatory power. This confirmed that the component extracted via the Fourier Transform could function as an actual court appraisal. Additionally, when comparing how well the lag variable and the appraisal component each explain the auction sale rate, the appraisal component (about 50%) clearly outperformed the lag variable (about 20%).
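As a minimal sketch of these two checks, with the lag fixed at nine months purely for illustration (the actual gap varies case by case), and reusing the hypothetical series from above:

```python
import statsmodels.api as sm

# Lagged sales index as a stand-in for the court appraisal.
sales_lag = data["eom"].shift(9).reindex(sample.index)

# (1) How closely is the lagged index related to the extracted appraisal component?
r2_link = sm.OLS(appraisal_component, sm.add_constant(sales_lag),
                 missing="drop").fit().rsquared            # ~0.54 reported in the paper

# (2) Which explains the auction sale rate better?
r2_component = sm.OLS(sample["sale_rate"],
                      sm.add_constant(appraisal_component)).fit().rsquared  # ~0.50
r2_lag = sm.OLS(sample["sale_rate"], sm.add_constant(sales_lag),
                missing="drop").fit().rsquared             # ~0.20

print(r2_link, r2_component, r2_lag)
```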
Discount/Premium Extraction
Next, I tested the discount/premium component, the core of this study, from two angles. First, whether the component extracted by the Fourier Transform can function as a discount/premium factor, and second, what the true identity of this component is.
For verification, I applied a sigmoid function to the discount/premium component to produce an on/off effect (0/1).
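A minimal sketch of that transformation follows; the scaling constant is arbitrary and would need tuning in practice.

```python
import numpy as np

def on_off(component, k=50.0):
    """Squash the premium component through a sigmoid so that strongly
    positive values map toward 1 and strongly negative values toward 0."""
    return 1.0 / (1.0 + np.exp(-k * component))

premium_signal = on_off(premium_component)            # values between 0 and 1
premium_active = (premium_signal > 0.5).astype(int)   # hard on/off flag
```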
I attempted to compare this component with various data available from sources like the National Statistical Office, but I couldn't find any data showing similar patterns. The reason for this was simpler than expected.
The auction market is dependent on the sales market. Most macroeconomic variables we know of influence housing prices, and their effect has already been removed through the regression model. The remaining factors are therefore likely unique to the auction market and independent of sales prices. The variables that do show a similar pattern to the discount/premium component are the one-month and two-month differences in the winning bid rate, as shown in the figure.
The Nature of the Discount/Premium
To summarize the analysis so far: after excluding the effects of market prices and court appraisals, the remaining factor in the auction sale rate is the discount/premium factor, and this factor moves in a pattern similar to past month-over-month fluctuations in the winning bid rate, that is, its volatility.
In other words, if past volatility explains what the 'sales price' and 'court appraisal' couldn't, it suggests that the auction market has a discount/premium factor driven by volatility (the difference in past winning bid rates). As I will explain later, I have named this component the 'momentum factor,' believing it to explain trends.
Cluster Characteristics of the Momentum Factor
As we delve deeper into this analysis, it's essential to recognize that auction market dynamics are not static but evolve over time, necessitating a more adaptive model to track these changes effectively.
Unlike Ordinary Least Squares (OLS) regression, which assumes a fixed beta coefficient, the Kalman Filter's state-space model allows the beta coefficient to change over time. By tracking this time-varying coefficient, we can observe how the influence of different variables fluctuates over various periods. To analyze the 'momentum factor' in greater detail, I applied the Kalman Filter to assess whether the beta coefficient indeed varies over time.
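A minimal sketch of this idea follows, implementing a Kalman filter for a regression with random-walk coefficients; the noise variances are set by hand rather than estimated, and the input arrays are the hypothetical aligned series used in the earlier sketches.

```python
import numpy as np

def time_varying_betas(y, X, q=1e-4, r=1e-2):
    """Kalman filter for y_t = x_t' beta_t + eps_t, where the state follows a
    random walk beta_t = beta_{t-1} + eta_t. q and r are the state and
    observation noise variances (fixed here; in practice estimated by MLE)."""
    n, k = X.shape
    beta = np.zeros(k)          # state estimate
    P = np.eye(k)               # state covariance
    Q = q * np.eye(k)           # state noise covariance
    betas = np.zeros((n, k))
    for t in range(n):
        P = P + Q               # predict: random-walk state, mean carries over
        x = X[t]
        S = x @ P @ x + r       # innovation variance
        K = P @ x / S           # Kalman gain
        beta = beta + K * (y[t] - x @ beta)
        P = P - np.outer(K, x @ P)
        betas[t] = beta
    return betas

# Hypothetical design matrix: market-price effect and momentum (premium) factor.
X = np.column_stack([sample["eom"].values, premium_component])
betas = time_varying_betas(sample["sale_rate"].values, X)
# Intervals where betas[:, 1] exceeds betas[:, 0] correspond to the clustering
# episodes discussed below.
```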
Consequently, as shown in the figure below, we can observe that the regression coefficient of the momentum factor exceeds that of the sales price regression coefficient in certain intervals. Upon examining these intervals, it becomes clear that the momentum factor exhibits a type of clustering effect.
The True Meaning of the Momentum Factor
We need to think more deeply about the "intervals where the momentum factor's sensitivity exceeds market prices." The momentum factor explains the discount/premium. Therefore, when the discount/premium factor significantly impacts the auction sale rate, it suggests that the usual "average relationship" between the sales market and the auction market has been disrupted.
What does it mean when the "average relationship is disrupted"? For example, if the sales and auction markets typically maintain a gap of 10, this disrupted relationship means the gap has shrunk to 5 or expanded to 15. Such situations typically occur during overheated or excessively cooled markets, or just before such conditions arise. When everyone is rushing to buy homes, this can naturally lead to an "overheating" that breaks the usual relationship, which can be interpreted as increased "popularity" in the auction market.
However, one important thing to note is that when the sales market falls, the auction market typically falls too. This is because the market price has the largest influence on changes in the auction sale rate. In other words, even if the momentum factor is inactive, the auction market can rise or fall in response to the sales market. Therefore, the "activation of the momentum factor" doesn't necessarily indicate price increases or decreases.
The auction sale rate is ultimately the effect of market prices plus something extra, and the discount/premium factor is that extra term. The sensitivity analysis indicates that this extra term represents excessive movement beyond the average relationship. I named it the "momentum factor" because I believe it can detect changes in market sentiment or trends. As seen in Figure 5, the momentum factor tends to signal market trend changes before and after its cluster periods.
I cautiously suggest that when the momentum factor shows excessive movement, it could signal a "bubble" or a "cooling" phase. Exploring this idea further is beyond the scope of this paper, but it certainly warrants future research.
Focus on Logic Over Technique
The reason I wrote this paper wasn't because I majored in real estate or specialized in the field. Most of my work in recent years involved changing systems to enable data-driven decision-making, one of which was loan screening, and another was related to real estate.
I understand that definitions of data science vary from person to person. However, for me, data science was the perfect tool for solving business problems. That’s why I chose a topic that was considered insurmountable in practice and applied the knowledge I learned in school.
The aspect I want to highlight in this paper is not the technical side but the logical one. The techniques used (regression analysis, the Fourier Transform, and the Kalman filter) are not particularly advanced by graduate-level science and engineering standards. I also deliberately avoided non-linear pattern-matching techniques such as ML/DL, which are ill-suited to financial data that demands clear interpretation. For me, what mattered was choosing the method suited to the problem, nothing more; the key was how to approach and solve the issue logically.
I believe that in solving business problems, logic should come first, and technology is just a tool. This is my ideal approach, and I wanted to keep the paper's concept simple, yet logically solid.
The Gap Between Business and Research
When I started researching for this paper, I remember thinking, "What problem should I try to solve?" My obsession with problem-solving came from the belief that there is a gap between the worlds of business and research, a bias I developed through experience.
As a practitioner, I think that in most fields, decisions are still largely based on subjective judgment rather than data. Furthermore, I know that many industries face challenges in successfully adopting data analysis systems, and I personally experienced this. While each field has its own circumstances, I believe one key reason for this gap is the disconnect between research and business.
From the industry perspective, I often felt that many research results focused on the study itself, neglecting "real-world applicability." On the other hand, from an academic perspective, I found that business often relied too heavily on subjective decisions, ignoring the complexities of the real world.
Bridging the Gap
Thus, the real intent of this paper was to bridge the gap between business and research, however small that contribution may be. I wanted to be a "conceptualizer" who actively uses data analysis to solve business problems. In this sense, I believe this paper sits somewhere between research and business. Throughout the writing process, I fought hard against the temptation to get lost in academic curiosity, focusing instead on practical applicability.
The quality and results of the paper will be judged by reviewers or proven in real-world industries, not by me. However, I anticipate that my future work will also be positioned between these two worlds. Connecting these two domains is an incredibly fascinating challenge.