GSB Lecture Notes

Yeonsook Kwak (MBA AI/BigData, 2024)


Sleep. It's something we all need but often take for granted. As people start to realize just how important it is for our health and well-being, the question of how we can detect and understand our sleep states becomes more critical. This paper takes a closer look at that question, breaking it down into five key sections that will guide us toward better solutions and deeper understanding.

The paper starts by looking at accelerometer data for sleep tracking. This method is popular because it’s non-intrusive and works well with wearable devices for continuous monitoring. It explains how Euclidean Norm Minus One (ENMO, standardized acceleration vector magnitude)-based metrics can be a simple alternative to complex medical exams. Next, it reviews current research, highlighting the strengths and weaknesses of different methods. It also points out gaps in the accuracy and reliability of existing models.

Building on the insights gained from the review, the paper then addresses specific challenges, such as sleep signal variability and irregular sleep intervals. It outlines data preprocessing techniques designed to manage these issues, thereby improving the robustness of sleep state detection. To achieve this, a novel likelihood ratio comparison methodology is introduced, which aims to increase generalizability, ensuring effectiveness across diverse populations. Lastly, the paper concludes by acknowledging the limitations of the current study and proposing future research directions, such as incorporating additional physiological signals and developing more advanced machine learning algorithms.

Sleep Tracking Based on Accelerometer Data

According to the National Health Insurance Service, 1,098,819 patients visited hospitals for sleep disorders in 2022, a 28.5% increase from 855,025 in 2018. As the number of sleep disorder patients rises, interest in high-quality sleep is also growing. However, because the causes and characteristics of sleep disorders vary from patient to patient, different diagnostic tests and treatment methods are required, which adds to the burden on both patients and clinicians.

Patients suspected of having sleep disorders usually undergo detailed diagnosis through polysomnography. This test involves various methods, including video recording, sleep electroencephalogram (EEG, using C4-A1 and C3-A2 leads), bilateral eye movement tracking, submental EMG, and bilateral anterior tibialis EMG to record leg movements during sleep.

Polysomnography has its limitations. Patients must visit specialized facilities, and it's only a one-time session. As a result, there's increasing demand for tools that offer more convenient and continuous sleep monitoring.

Measuring Movement Using Accelerometer Data

Recently, health management through wearable devices has become increasingly common, enabling real-time data collection. Wrist-worn watches can monitor activity levels, and for sleep measurement, both an accelerometer sensor and a photoplethysmography (PPG) sensor are typically used.

The accelerometer sensor tracks body movements, while the PPG sensor uses light to measure blood flow in the wrist tissue, which helps measure heart rate. Although using data from both sensors could improve the accuracy of sleep measurement, this study only uses accelerometer data due to limitations on data usage. The reasons for this decision will be explained further on.

The accelerometer data consists of three axes, as shown in the figure below [1].

Figure 1. Accelerometer Sensor Diagram [1]

The $x$-axis captures changes in the direction of motion parallel to the ground, the $y$-axis captures lateral movement (e.g., how much the arms swing out to the sides), and the $z$-axis captures vertical movement (peaking when the legs cross over during walking). It is important to understand that what each axis measures depends on the sensor's reference axes; if these reference axes change, movement is usually assessed along whichever axis shows the largest change in values.

The graph below shows an example of 3-axis data [4], comparing measurements taken while walking with the arms swinging and while walking with the arms held still. The shifts in the $x$, $y$, and $z$ axes reflect changes in their mean values, and the signal shown in green, recorded with the arms fixed, shows the largest variation. In other words, accelerometer data can differ for the same action whenever the sensor's position or reference axes change.

Figure 2. Example of Accelerometer $x$, $y$, $z$ Axis Data

Making Useful Variables Through Transformation

To solve the problem of the axes shifting when the sensor's orientation changes, it is important to convert the data into straightforward yet informative variables. Many studies have used summary metrics (or summary measurements), which combine the $x$, $y$, and $z$ axis values into a single value, thereby reducing the impact of changes in sensor orientation.

Examples of summary metrics include Euclidean Norm Minus One (ENMO, standardized acceleration vector magnitude), Vector Magnitude Count (VMC), Activity Index (AI), and Z-angle (wrist angle). Let’s take a closer look at ENMO and Z-angle, as they relate to the signal data from wearable devices discussed earlier.

As shown in the accelerometer diagram in Figure 1, the acceleration measured by the sensor includes the constant pull of gravity (g). The ENMO variable therefore combines the three axis values into a single magnitude and subtracts the gravitational acceleration of 1 g. This can be expressed mathematically as follows.

$ENMO = \max\left(0,\ \sqrt{x^{2}+y^{2}+z^{2}} - 1\right)$

The Z-angle is a summary metric for the wrist angle, which can be considered as the angle of the arm relative to the body's vertical axis. It can be expressed using the following formula.

$Z\text{-angle} = \tan^{-1}\left(\frac{a_{z}}{\sqrt{a_{x}^{2} + a_{y}^{2}}}\right) \cdot \frac{180}{\pi}$
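To make the two summary metrics concrete, here is a minimal sketch of how they can be computed from raw 3-axis samples with NumPy. The function names, the 50 Hz toy samples, and the assumption that gravity lies mostly along the $z$-axis are illustrative choices, not details of the actual pipeline.

```python
import numpy as np

def enmo(x, y, z):
    """Euclidean Norm Minus One: acceleration magnitude minus 1 g, floored at 0."""
    return np.maximum(0.0, np.sqrt(x**2 + y**2 + z**2) - 1.0)

def z_angle(x, y, z):
    """Z-angle in degrees: arctangent of the z-axis value against the x-y plane magnitude."""
    return np.degrees(np.arctan2(z, np.sqrt(x**2 + y**2)))

# Toy example: one second of 50 Hz samples with the wrist roughly still,
# gravity assumed to lie mostly along the z-axis in this illustration.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.02, 50)
y = rng.normal(0.0, 0.02, 50)
z = rng.normal(1.0, 0.02, 50)

print(enmo(x, y, z).mean())     # near 0 g: almost no movement beyond gravity
print(z_angle(x, y, z).mean())  # near 90 degrees: z-axis pointing straight up
```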

To gain an intuitive understanding, let's look at the actual ENMO values measured in real-life situations. Figure 3 below is a table summarizing the ENMO measurements during daily activities [2].

Figure 3. ENMO Measurements for Different Activities [2]

When standing, the ENMO was measured at an average of 1.03g, while it increased to 10.3g during everyday walking. This clearly demonstrates that the ENMO value is lower with minimal movement and rises as activity levels increase. In other words, since humans do not always move at a constant speed like robots, activity levels can be measured using acceleration.

While it may appear that raw $x$, $y$, and $z$ axis data offers more information due to its detail, this study seeks to demonstrate that condensing this information into a single summary metric doesn't significantly impact our ability to accurately estimate sleep states.

Additionally, a basic model revealed that excluding Z-angle data does not result in significant information loss. When we used a tree model to evaluate the explanatory power of variables with statistical metrics from both Z-angle and ENMO, the ENMO variables were found to be much more important. In fact, all of the top 10 most important variables were related to ENMO. Since the importance of Z-angle variables was significantly lower, this study will focus on using ENMO as the primary basis for addressing the problem.
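For readers who want to reproduce this kind of check, below is a minimal sketch of a feature-importance comparison, with a scikit-learn random forest standing in for "a tree model". The synthetic features, their names, and the injected effect sizes are assumptions made purely for illustration; they are not the study's actual feature set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000

# Synthetic per-epoch summary statistics standing in for ENMO- and Z-angle-derived features.
is_sleep = rng.integers(0, 2, n)
df = pd.DataFrame({
    "enmo_mean":   rng.gamma(2.0, 0.02, n) * (1.5 - is_sleep),  # clearly lower while asleep
    "enmo_std":    rng.gamma(2.0, 0.01, n) * (1.5 - is_sleep),
    "zangle_mean": rng.normal(-10.0 * is_sleep, 25.0, n),       # only weakly related to sleep
    "zangle_std":  rng.gamma(2.0, 2.0, n),
    "is_sleep":    is_sleep,
})

X, y = df.drop(columns="is_sleep"), df["is_sleep"]
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)  # in the study, the top-ranked features were all ENMO-derived
```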

Review of Previous Studies

Existing Methodologies Focused on Optimization

Earlier, we explored several summary measurement variables, such as ENMO, VMC, AI, and Z-angle. More recently, research has been focused on identifying new summary measurements, like MAD (Mean Absolute Deviation), using axis data collected from accelerometer sensors. This kind of variable transformation requires advanced domain knowledge, and the process of validating these reduced variables is complex.

In previous studies, various summary measurements were investigated, and temporal statistics—such as overlapping or non-overlapping deviations, averages, minimums, and maximums at one-minute intervals—were used for classification or detection through machine learning models, heuristic models, or regression models. Figure 4 below summarizes the key methodologies from previous research [5].

Figure 4. Comparison of Existing Research Algorithms (from "40 Years of Actigraphy in Sleep Medicine and Current State-of-the-Art Algorithms") [7]
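As an illustration of the kind of one-minute temporal statistics mentioned above, the sketch below derives both non-overlapping and overlapping (rolling) per-minute summaries from a synthetic 5-second ENMO series with pandas. The sampling interval matches the ENMO data described later; everything else is an assumption for demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Two hours of synthetic 5-second ENMO samples (12 samples per minute)
idx = pd.date_range("2024-01-01 22:00", periods=2 * 60 * 12, freq="5s")
enmo = pd.Series(rng.gamma(2.0, 0.01, len(idx)), index=idx, name="enmo")

# Non-overlapping one-minute statistics
per_minute = enmo.resample("1min").agg(["mean", "std", "min", "max"])

# Overlapping one-minute statistics, recomputed at every 5-second step
rolling = enmo.rolling("60s").agg(["mean", "std"])

print(per_minute.head())
print(rolling.tail())
```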

Additionally, the evaluation metrics used in sleep research are as follows [2]; a small sketch of computing them from epoch-level labels appears after the list:

  • Sensitivity: the proportion of PSG sleep epochs that actigraphy also scores as sleep
  • Specificity: the proportion of PSG wake epochs that actigraphy also scores as wake
  • Accuracy: the total proportion of epochs scored correctly
  • Wakefulness After Sleep Onset (WASO): the total amount of wake time after sleep onset within the sleep period
  • Sleep Efficiency (SE): the proportion of sleep within the period labeled by polysomnography
  • Total Sleep Time (TST): calculated as the sum of sleep epochs per night
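Here is the small sketch mentioned above: a minimal way of computing these metrics from epoch-level sleep/wake labels. The 30-second epoch length, the toy labels, and the simplified WASO definition (all wake epochs after the first sleep epoch) are illustrative assumptions, not details taken from the cited studies.

```python
import numpy as np

def sleep_metrics(pred_sleep, psg_sleep, epoch_minutes=0.5):
    """Epoch-level comparison of actigraphy predictions against PSG labels (True = sleep)."""
    pred, psg = np.asarray(pred_sleep), np.asarray(psg_sleep)
    sensitivity = (pred & psg).sum() / psg.sum()        # PSG sleep also scored as sleep
    specificity = (~pred & ~psg).sum() / (~psg).sum()   # PSG wake also scored as wake
    accuracy = (pred == psg).mean()

    onset = np.argmax(psg)                               # index of the first PSG sleep epoch
    waso = (~psg[onset:]).sum() * epoch_minutes          # minutes of wake after sleep onset (simplified)
    tst = psg.sum() * epoch_minutes                      # total sleep time in minutes
    se = psg.mean()                                      # sleep efficiency over the whole record
    return dict(sensitivity=sensitivity, specificity=specificity,
                accuracy=accuracy, WASO=waso, TST=tst, SE=se)

# Toy example: an 8-hour record of 30-second epochs, predictions wrong 10% of the time
rng = np.random.default_rng(2)
psg = rng.random(960) < 0.85
pred = psg ^ (rng.random(960) < 0.10)
print(sleep_metrics(pred, psg))
```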

Limitations of Optimization and Increased Sensitivity to Changes in Data Patterns

Machine learning models like Random Forests and neural networks, such as CNNs (Convolutional Neural Networks) and LSTMs (Long Short-Term Memory networks), are considered "high complexity" because of their focus on achieving high accuracy. This often means a large number of parameters, which increases the risk of overfitting. When overfitting occurs, the model learns the noise in the training data instead of the underlying patterns, especially when data are scarce.

As a result, the model's performance can decline when applied to new datasets. In practical research, therefore, these high-complexity models sometimes struggle to pinpoint the exact moments of falling asleep or waking up: by focusing too heavily on optimization, they overlook the importance of generalization.

Are simpler models, like regression models, free from optimization issues? Although regression models are generally less sensitive to noise, they rely on the assumption that the data follow a normal distribution. If this assumption is not met, the standard errors of the estimated coefficients can be large relative to the coefficients themselves. This inflates the p-values, reducing the statistical significance of the estimates and making the model's results less reliable.

Since sleep data often do not follow a normal distribution, additional optimization is needed for regression models like the Cole-Kripke [3] and Oakley [6] models. While these simpler models may be less accurate on the target data than machine learning or neural network models, their low complexity and optimized adjustments make them useful as baseline models in research, often used alongside polysomnography.

When users have only recently started using wearable devices, there is often a need to classify sleep states with limited data. Early data may lack representativeness, making it challenging to rely on data-intensive models from the machine learning or deep learning fields. Traditional regression models that require extensive optimization are also not sustainable in these cases. This challenge becomes even more significant when analyzing data from multiple users rather than just one. Therefore, this study aims to introduce data transformation and model transformation methodologies that can improve generalization performance.

Characteristics and Collection Methods of ENMO Data

Before diving into the detailed data preprocessing steps and methodologies that aim to overcome the limitations of previous research, let’s take a closer look at the characteristics of ENMO data.

ENMO signals are collected at 5-second intervals and can be analyzed in combination with sleep state labels assigned through sleep diaries. The criteria for labeling sleep states in the sleep diary are as follows:

  • Sleep is assumed if the sleep state persists for at least 30 minutes.
  • The longest sleep period during the night is recorded as the sleep state. However, there is no rule limiting the number of sleep episodes that can occur within a given period.
    For example, if an individual sleeps from 1:00 to 6:00 and again from 17:30 to 23:30 on the same day, both sleep periods are valid and counted. This approach naturally accommodates different sleep patterns, such as early morning and evening sleep, which can be influenced by work schedules.

To help with understanding, let's take a look at the sample data in the graph below. This data was collected over approximately 30 days from one individual, specifically looking at their Z-angle and ENMO signals. Sleep periods are marked as 0, active periods as 1, and -1 indicates cases where the label values in the sleep diary are missing due to device or recording errors.

Figure 5. Final Data Graph Used: 0 (Sleep Period), 1 (Active Period). ENMO Data (Top), Z-angle Data (Bottom).

As expected, we can observe a noticeable periodicity as the sleep periods (0, in red) and active periods (1, in green) alternate. While not everyone exhibits the same sleep pattern, the overarching cycle of sleep and wakefulness remains consistent. Therefore, this study will focus on using generalized data transformations to better distinguish between sleep and wake cycles. In the following section, we will introduce modeling methods that prioritize generalization.

The Z-angle data also showed a cyclic pattern. However, as mentioned earlier, ENMO data is significantly more important than Z-angle data and results in less information loss. Therefore, in the following methodologies, only the ENMO variable was used.

Considering Variability of Sleep Signals and Irregularity of Sleep Intervals

Interestingly, even during sleep, there are small fluctuations. This occurs because sleep consists of different stages, as many people know. These stages are usually classified based on the criteria shown in the diagram below.

Diagram of Sleep Stage Detection

In the previous study by Van Hees, sleep stages were classified based on the same diagram. The concept of sleep stages suggests that body movements vary depending on the stage, which can cause subtle fluctuations in sleep signals. As shown in the ENMO data in Figure 5, the $y$ values during sleep periods (indicated by red bars) are not uniform.

It naturally occurred to me that if the signals during sleep periods could be processed into more consistent signals, detecting sleep states would become easier. The goal of stabilization is not to eliminate sleep signals entirely but to preserve their characteristics while maintaining relatively stable values compared to the variance in the raw data.

Building on this idea, we can conclude that generalization is achievable even when the amount of tossing and turning varies from person to person and across the whole user base. While it might be tempting to skip over these complex processes, doing so would be unwise. Previous studies have often overemphasized optimization, which can lead to problems like overfitting. To prevent this, regularization techniques, such as those based on the Lagrange multiplier method, are commonly used.

In this study, we aimed to develop a methodology with superior generalization performance by processing the data based on insights gained from a more detailed analysis of the data characteristics and modifying the model accordingly. I hope this discussion helps convey the importance of having a solid rationale in the data preprocessing stage to build a reliable model.

Stabilizing Sleep Signals Through Data Transformation

The initial approach to stabilizing sleep signals focused on removing outliers by applying standard filtering techniques. A method closely related to the Fast Fourier Transform (FFT) was employed, namely the Power Spectral Density (PSD), which describes how the signal's power (the squared magnitude of its FFT) is distributed across frequency bands.

However, after applying PSD to the ENMO signal data, we found that 99.8% of the entire dataset remained, so the intended stabilization of sleep-specific signals was not achieved. As shown in Figure 6, the variability of the processed ENMO signals (shown in red) during sleep periods (the interval between the red vertical line, marking sleep onset, and the green vertical line, marking wake-up) remained evident.

Figure 6. Graph Showing the Results of Applying PSD to Sample Data from One User (Single ID Value): Blue bars represent the raw ENMO signal, the red line over the blue bars represents the processed ENMO signal, the red vertical line indicates the moment of falling asleep, and the green vertical line indicates the moment of waking up.

To see if smoothing the data would resolve the issue, a Kalman filter was applied. Despite incorporating covariance from previous data, it failed to stabilize the sleep signals. As shown in Figure 7, much of the variability in the processed ENMO signals during sleep periods persisted. Additionally, the Kalman filter performed poorly in detecting sleep states and had higher computational costs compared to PSD, mainly due to the use of covariance from prior data.

Figure 7: Graph Showing the Results of Applying a Kalman Filter to Sample Data from One User (Single ID Value): Blue bars represent the raw ENMO signal, red bars represent the processed ENMO signal, the red vertical line indicates the moment of falling asleep, and the green vertical line indicates the moment of waking up.

Finding Periodicity in Irregular Intervals

There was an aspect that was overlooked during the initial data transformation process. We missed one of the most important characteristics of the given data: even for a single user, the times of falling asleep and waking up are not consistent. Therefore, this time we used the Lomb-Scargle periodogram, a method designed to detect periodic signals in observations with uneven spacing.
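As a rough illustration of how such a periodogram can be computed, the sketch below estimates the dominant period of an unevenly sampled, ENMO-like series with SciPy's `lombscargle`. The synthetic signal, the dropout rate, and the candidate frequency grid are assumptions, not the settings actually used in the study.

```python
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(3)

# Three days of 5-second samples with ~30% random dropouts (uneven spacing)
t = np.arange(0, 3 * 86400, 5.0)
t = t[rng.random(t.size) > 0.3]
daily = 2 * np.pi / 86400                       # one cycle per day, in rad/s
enmo = 0.05 * (1 + np.sin(daily * t)) + rng.gamma(2.0, 0.005, t.size)

# Candidate angular frequencies covering periods from 6 hours to 2 days
periods = np.linspace(6 * 3600, 2 * 86400, 500)
freqs = 2 * np.pi / periods
power = lombscargle(t, enmo - enmo.mean(), freqs)

dominant_period_h = periods[np.argmax(power)] / 3600
print(f"dominant period: {dominant_period_h:.1f} hours")  # close to 24 h for this toy signal
```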

Figure 8 below visualizes the data after applying the Lomb-Scargle periodogram. Although the signals in the sleep periods appear almost uniform due to the long duration of the entire dataset, zooming in on specific intervals reveals that while the variability has been reduced, the characteristics of the signals have been preserved as much as possible.

Figure 8: Graph Showing the Results of Applying the Lomb-Scargle Periodogram to Sample Data from One User (Single ID Value): Blue bars represent the raw ENMO signal, red bars represent the processed ENMO signal, the red vertical line indicates the moment of falling asleep, and the green vertical line indicates the moment of waking up.

Beyond applying this to a single ID value, we also examined the results for all IDs without missing data. The dominant frequencies showed a roughly linear relationship with their power. Therefore, for frequencies falling near this regression line, we judged that filtering with the typical values in the linear range would not significantly reduce prediction accuracy.

Figure 9: Graph Showing the Results for Dominant Frequency Regarding ID 35ea. Left: Checked with the entire dataset. Right: Checked with 80% sampling from the left.

Signal processing methods like the Lomb-Scargle periodogram, which align with the principles of FFT, need sufficient data to detect periodic patterns. In the case of ENMO data, at least a full day must pass to observe a complete cycle of sleeping and waking.

Thus, if the sample period for each ID was less than 3–5 days or there were many device omissions, filtering was done using the dominant frequency data from the training data. When there was at least about 5 days of data available, filtering was applied individually based on each ID.

Likelihood Ratio Comparison

In the previous section, we explored how data can be transformed as a method of generalization. In this section, we will examine how model transformation can improve generalization performance.

Sleep and Awake Period Distributions

When examining the ENMO signal data after applying the Lomb-Scargle periodogram, neither the awake nor the sleep period data exhibited a uniform distribution.

As shown in Figure 10, the distribution shapes of the two periods also differ. Notably, there is a distinct difference in the shape of the distribution peaks: the joint distribution peak during sleep periods forms a smooth curve, whereas the peak during awake periods appears more angular.

Figure 10: Distribution of Preprocessed ENMO Signals by Sleep and Awake Periods for Each ID: Sleep Distribution (left), Awake Distribution (right)

Interestingly, upon closer inspection, despite the different distribution shapes, the peak values for each ID are clustered around 0 on the x-axis, whether it’s during sleep or awake periods.

Figure 11a: Distribution of the Entire Dataset
Figure 11b: 80% of Data Without Missing Values

Similarly, in both Figure 11a (the entire dataset) and Figure 11b (80% of the data without missing values), the peak values of the sleep and awake distributions did not change significantly. Additionally, the peak values in Figure 12, which shows 800 randomly sampled observations from Figure 11b, also showed minimal variation.

Figure 12: Training Data: Approximately 800 randomly sampled observations from the Figure 11b dataset.

Using Sleep and Awake Period Distributions

We examined whether the peak values of the distribution functions differed because we wanted to apply a likelihood ratio (LR) comparison, using the distribution information of the sleep and awake periods to generalize sleep state detection. When the distributions are known, Maximum Likelihood Estimation (MLE) is the natural way to approach the problem; in the same spirit, we built our model on the likelihood ratio, using the information in the sleep and awake distributions.

Sleep and awake distributions may not follow commonly known probability density functions (e.g., Gaussian, Poisson, etc.) and are often irregular. As an alternative, we used distributions derived from kernel density estimation (KDE). Kernel density estimation involves creating a kernel function centered on each observed data point, summing these, and then dividing by the total number of data points. Typically, the optimal kernel function is the Epanechnikov kernel, but for computational convenience, the Gaussian kernel is frequently used. In this study, we also used the Gaussian kernel.

First, let's explain how the LR method was applied using equations: $LR = \frac{L_{1}(D)}{L_{0}(D)}$.

The likelihood ratio can be calculated for each incoming data point, where $L_{0}(D)$ is the likelihood of the data under the null hypothesis that the observation is a sleep signal, and $L_{1}(D)$ is the likelihood under the alternative hypothesis that it is an awake signal. If the LR exceeds a threshold, the data are more likely under the alternative hypothesis, that is, more likely to be an awake signal.
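Putting the two previous ideas together, the sketch below fits Gaussian-kernel density estimates for the sleep and awake distributions with SciPy and classifies new ENMO values by thresholding the likelihood ratio. The synthetic samples, the threshold of 1, and the small epsilon guard against division by zero are illustrative assumptions rather than the study's exact configuration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(4)

# Synthetic preprocessed ENMO values standing in for the two labeled periods
sleep_samples = rng.gamma(1.5, 0.005, 5000)   # low, tightly clustered values
awake_samples = rng.gamma(2.0, 0.030, 5000)   # higher, more spread-out values

# Gaussian-kernel density estimates of the two class-conditional distributions
L0 = gaussian_kde(sleep_samples)   # likelihood under the null hypothesis (sleep)
L1 = gaussian_kde(awake_samples)   # likelihood under the alternative hypothesis (awake)

def classify_awake(enmo_values, threshold=1.0, eps=1e-12):
    """Return True where the likelihood ratio L1/L0 exceeds the threshold."""
    enmo_values = np.asarray(enmo_values)
    lr = L1(enmo_values) / (L0(enmo_values) + eps)
    return lr > threshold

new_signal = rng.gamma(2.0, 0.02, 10)          # unseen 5-second ENMO samples
print(classify_awake(new_signal))
```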

Figure 13 visualizes the results of sleep state detection using this generalization methodology for a single ID. In the lowermost panel, the method identifies as many awake points as possible between the first activity signal (the moment of waking up) and the last one (the moment of falling asleep) after the data transformation.

Figure 13: Original ENMO Signal (Top), Likelihood Ratio (Middle), Awake Periods Exceeding the Threshold Marked with Red Dots (Bottom)

From a computational efficiency standpoint, the likelihood ratio (LR) method is also advantageous. When measuring computation time, we observed that the data transformation runs as the data are fed in, so the LR results are produced quickly. Processing 39,059 data points all at once took about 7 seconds, and one day's worth of data (17,280 points per user) for 10 users took around 1 minute and 40 seconds in total.

As expected, this distribution-based method does not produce predictions for periods in which the device data are missing. However, visual checks using Figure 13 showed that the method detects signals well even where label values are missing, as long as the device itself was recording.

To assess the robustness of the model, this study proposes using the time difference between predicted values and label values as a performance metric. Since the likelihood ratio method introduced above focuses on generalization, we determined that traditional evaluation metrics from existing sleep research, which are geared toward optimization, would not be applicable.
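As a minimal sketch of this metric, the code below computes per-night time differences between predicted and labeled onset/wake moments, and their within-individual standard error, for one hypothetical ID. The data-frame layout and the example timestamps are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical per-night results for one ID: labeled vs. predicted event times
nights = pd.DataFrame({
    "onset_label": pd.to_datetime(["2024-01-01 23:10", "2024-01-02 23:40", "2024-01-03 22:55"]),
    "onset_pred":  pd.to_datetime(["2024-01-01 22:58", "2024-01-02 23:21", "2024-01-03 22:47"]),
    "wake_label":  pd.to_datetime(["2024-01-02 06:50", "2024-01-03 07:05", "2024-01-04 06:40"]),
    "wake_pred":   pd.to_datetime(["2024-01-02 07:04", "2024-01-03 07:19", "2024-01-04 06:58"]),
})

# Time differences in minutes (negative onset diff = predicted earlier, as observed in the study)
onset_diff = (nights["onset_pred"] - nights["onset_label"]).dt.total_seconds() / 60
wake_diff = (nights["wake_pred"] - nights["wake_label"]).dt.total_seconds() / 60

# Within-individual standard error of the time differences
se_onset = onset_diff.std(ddof=1) / np.sqrt(len(onset_diff))
se_wake = wake_diff.std(ddof=1) / np.sqrt(len(wake_diff))
print(onset_diff.tolist(), wake_diff.tolist())
print(round(se_onset, 2), round(se_wake, 2))
```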

New Evaluation Metric for Assessing Model Robustness

When comparing the time difference between the model's predicted values and the label values, the model tends to predict the moment of falling asleep earlier than the actual label values and the moment of waking up later than the actual labels. To understand the cause of this, we applied the LR method to the raw, unprocessed ENMO signals to detect sleep states.

As shown in Figure 14a, even when using unprocessed ENMO signals, the tendency for predictions to be early or late remained the same. This suggests that these tendencies are inherent to the collected ENMO signals themselves. It is expected that if additional data, such as pulse rate or other complementary information, is used in the future, the time difference (time diff) could be reduced.

Figure 14a: Time Difference Results (Using Original ENMO Data for the First 3 Days)
Figure 14b: Time Difference Results (Using Processed ENMO Data for the First 3 Days)

Limitations and Future Research Plans

In the previous section, we briefly discussed assessing the performance of the likelihood ratio comparison method by using the time difference between predicted values and label values. To provide a more objective evaluation, we will now also examine the results of applying this method to IDs that were not included in the training data (test set).

Figure 15a: Training Set - Results of Applying IDs Included in the Training Data
Figure 15b: Test Set - Results of Applying IDs Not Included in the Training Data

The numbers shown above are the average of the standard errors of time differences (within individual standard error, SE) calculated for each ID. Figure 15a shows the results of applying the likelihood ratio comparison model to 10 randomly selected IDs included in the training data, while Figure 15b shows the results using 3 randomly selected IDs that were not included in the training data.

Verifying the Robustness of the Likelihood Ratio Comparison

The results showed that the processed ENMO signals did not exhibit significant performance differences between the training and test data sets, reaffirming the robustness of the generalization-focused methodology. Although the processed ENMO signals displayed understandable levels of variability, the unprocessed original ENMO signals showed a significant increase in average standard error.

Additionally, Figures 15a and 15b highlight the contribution of data transformation to performance improvement. The original ENMO signals, without any preprocessing, had a higher average standard error compared to the processed ENMO signals. This difference was more pronounced in the test set (Figure 15b), where the average standard error for the processed ENMO signal data was reduced by over 20 minutes for both wakeup and sleep onset times. This underscores the importance of investing effort into data preprocessing to enhance generalization performance.

The training set's performance was validated using random samples from 10 different IDs, while the test set included three additional randomly selected IDs not used in training, further testing the model's generalization performance. For reference, the total number of nights analyzed across the 13 IDs was approximately 110, suggesting that a sufficiently long period was used for comparing average standard errors.

Key Research Findings

In summary, this study focused on generalization rather than optimization. Efforts were made to enhance generalization performance, starting from the data transformation stage. Using raw data statistics based on accelerometer data can increase dimensionality and make the data more susceptible to outliers and noise, highlighting the need for data preprocessing. Additionally, since sleep patterns are not consistent, it was necessary to stabilize the data by applying the Lomb-Scargle periodogram, which can detect periodicity in unevenly spaced data.

From a modeling perspective, rather than enhancing the fit for each individual data point as is common in traditional machine learning or deep learning models, this study utilized distribution data rich in information. Distributions contain more information than variance, leading to a structure that is inherently more efficient from a modeling standpoint. As a result, even users who have only recently started wearing wearable devices can benefit from early detection (though at least one hour of data is needed), improving the practical utility of the device.

Furthermore, the LR method offers the advantage of high computational efficiency. Compared to complex models like machine learning or deep learning, as well as traditional models using rolling statistics, the computational efficiency of the LR method is significantly higher. In the same vein, the LR method is easier to maintain. With its lower model complexity and sequential execution of data preprocessing and LR model inference stages, subsequent modifications to the model structure are also straightforward.

Future Research

Currently, only ENMO signal data is used, but incorporating more supplementary variables (e.g., heart rate) is expected to make sleep state detection more refined. Enhancing performance may also be possible by implementing more detailed updates during data preprocessing for each individual ID. In this study, the period for allowing the use of past distribution data and determining the dominant frequency was chosen through basic experiments, but future studies could consider more precise adjustments.

The heterogeneity that exists between individuals should also be considered. Future studies could achieve higher accuracy by analyzing different groups (e.g., those with above-average activity levels vs. those with minimal activity) rather than simply adjusting the current threshold values. Expanding the study population could also contribute more to public healthcare research by reflecting demographic characteristics among individuals, which would be valuable for both business and sleep research perspectives.

Continually expanding the variety of data has great potential for advancing sleep research. For example, the Healthy Brain Network, which provided the data used in this study, aims to explore the relationship between sleep states and children's psychological conditions. This highlights the increasing importance and interest in using sleep state measurements as a supplementary tool for understanding human psychology and social behavior.

Meaningful Inference Amid Uncertainty

Understanding complex issues depends on how the available information is utilized. The data used in this study are signal data, and most signal measurements inherently contain noise, which introduces uncertainty. Moreover, understanding sleep states requires domain expertise, and direct measurement is often difficult. Despite these difficulties, this research made significant efforts to predict human sleep states through indirect measurements or partial observations, aligning with recent advances in wearable devices.

In conclusion, optimization and generalization are naturally in a trade-off relationship. While this paper focused on generalization, the emphasis on optimization should be adjusted dynamically based on how much precision is required from a business perspective. Just as the phrase "one size fits all" is contradictory and almost impossible to achieve perfectly, it is important to recognize that choices must be made depending on the data and the specific context.


References

[1] https://insights.globalspec.com/images/assets/263/1263/Accelerometers-04-fullsize.jpg

[2] Kishan Bakrania, Thomas Yates, Alex V. Rowlands, Dale W. Esliger, Sarah Bunnewell, James Sanders, Melanie Davies, Kamlesh Khunti, and Charlotte L. Edwardson. Intensity thresholds on raw acceleration data: Euclidean Norm Minus One (ENMO) and Mean Amplitude Deviation (MAD) approaches. PLoS ONE, 11(10):e0164045, 2016.

[3] Roger J. Cole, Daniel F. Kripke, William Gruen, Daniel J. Mullaney, and J. Christian Gillin. Automatic sleep/wake identification from wrist activity. Sleep, 15(5):461–469, 1992. doi:10.1093/sleep/15.5.461.

[4] Marta Karas, Jiawei Bai, Marcin Strączkiewicz, Jaroslaw Harezlak, Nancy W. Glynn, Tamara Harris, Vadim Zipunnikov, Ciprian Crainiceanu, and Jacek K. Urbanek. Accelerometry data in health research: challenges and opportunities. bioRxiv, 2018. doi:10.1101/276154.

[5] Miguel Marino, Yi Li, Michael N. Rueschman, J. W. Winkelman, J. M. Ellenbogen, J. M. Solet, Hilary Dulin, Lisa F. Berkman, and Orfeu M. Buxton. Measuring sleep: accuracy, sensitivity, and specificity of wrist actigraphy compared to polysomnography. Sleep, 36(11):1747–1755, 2013. doi:10.5665/sleep.3142.

[6] Nigel R. Oakley. Validation with polysomnography of the Sleepwatch sleep/wake scoring algorithm used by the Actiwatch activity monitoring system. Technical report, Mini Mitter Co., 1997.

[7] Matthew R. Patterson, Adonay A. S. Nunes, Dawid Gerstel, Rakesh Pilkar, Tyler Guthrie, Ali Neishabouri, and Christine C. Guo. 40 years of actigraphy in sleep medicine and current state of the art algorithms. NPJ Digital Medicine, 6(1):51, 2023.


Sungsu Han (MBA AI/BigData, 2024)

I am in my early 40s; I work at an office near Magoknaru Station and live near Haengsin Station in Goyang City. I used to commute by company shuttle, but recently I have taken up cycling as a hobby and now commute by bike. The biggest reason I got into cycling was the positive image I had of Seoul's public bicycle program, Ddareungi.

What Sparked My Interest

One day, I stepped off the shuttle, rubbing my sleepy eyes, and was surprised to see hundreds of green bikes clustered together. I hadn't noticed them before, probably because I’m usually too tired as an office worker, not paying much attention to my surroundings once I get to work. Or maybe it’s just because I’m so groggy in the mornings that the bikes slipped past me. Either way, the sight took me by surprise.

Ddareungi docking station near the author's office in the Magok area / Credit: https://steemit.com/hive-183959/@nasoe/58ha14

Most crowded areas

I often wondered where the many Seoul bikes at the Magok intersection came from. This also sparked my interest in Ddareungi and made me think about researching public bike programs as a topic for my thesis.

As I continued to develop my thoughts, I suddenly wondered, "Is there really a place that uses bicycles more than Magok?" A quick internet search provided the answer. According to the "2022 Traffic Usage Statistics Report" published by the Seoul Metropolitan Government, the district with the highest use of public bicycles (Ddareungi) in Seoul was Gangseo-gu, with 16,871 cases.

Furthermore, according to data released on the Seoul Open Data Platform, the top seven public bicycle rental stations in Gangseo-gu are as follows: ▲ Magoknaru Station Exit 2 with 88,001 cases ▲ Balsan Station near Exits 1 and 9 with 63,166 cases ▲ Behind Magoknaru Station Exit 5 with 59,095 cases ▲ Gayang Station Exit 8 with 56,627 cases ▲ Magok Station Intersection with 56,117 cases ▲ Magoknaru Station Exit 3 with 52,167 cases ▲ Behind Balsan Station Exit 6 with 48,145 cases, etc. I was quite surprised to learn this. The place with the highest use of Ddareungi in Seoul was right here, the Magok Business District, where I commute to work.

During my daily commute, I began to notice more people using bicycles than I had originally thought. Bikes are increasingly viewed as a way to address environmental concerns while also promoting fitness for office workers. Inspired by this trend, I considered commuting by bike myself, like many others in Seoul. However, since I live in a different district, I faced the dilemma of choosing between Goyang City's Fifteen program or Seoul's Ddareungi. During my research, however, I discovered that Goyang City's Fifteen program had been discontinued due to financial losses.

Reasons for Deficits in Public Bicycle Programs

So, I looked into the deficit sizes of other public bicycle programs and found that "Nubija" in Changwon had a deficit of 4.5 billion KRW, "Tashu" in Daejeon had 3.6 billion KRW, and "Tarangke" in Gwangju had a deficit of 1 billion KRW. This showed that most regional public bicycle programs are struggling with deficits. Even Seoul's public bicycle program, "Ddareungi", which I thought was doing well, has a deficit of over 10.3 billion KRW. This made me wonder why public bicycle programs are always in deficit.

At the same time, although Ddareungi is a beloved mode of transportation for the ten million citizens of Seoul, I started to worry whether this program could be sustained in the long run. After looking into the issue, I discovered that the biggest contributor to the deficits in public bicycle programs is the high cost of redistributing the bikes across the city.

For Goyang City, it was estimated that out of a total maintenance budget of 1.778 billion KRW, around 375 million KRW is spent on on-site distribution, and 150 million KRW is used for vehicle operation costs related to redistribution. This means approximately 30% of the total budget goes towards redistribution, making it the largest single expenditure. A similar trend is observed in Changwon City, where redistribution costs also account for a significant portion of the budget. Although this information is not directly about Ddareungi, it suggests that about 30% of the total operating costs of public bicycle programs are likely spent on bicycle redistribution.

This led me to believe that cutting bicycle redistribution costs could be the key to resolving the chronic deficits in public bicycle rental programs. It also made me consider that optimizing redistribution by analyzing Ddareungi users' usage patterns could help reduce these expenses. To achieve this, I needed to analyze the factors influencing rental volume and create a model to predict expected demand, which would help prevent shortages and minimize unnecessary redistribution efforts.

Optimizing Redistribution Through Demand Forecasting

The Ddareungi bike rental data includes bike ID, return time, and station information. To visualize rental volumes by station, additional location data (latitude and longitude) from the Seoul Open Data Plaza was used. Synoptic weather data from the Seoul Meteorological Station was also integrated with the rental records to analyze the impact of weather on bike usage. A detailed analysis of usage patterns was conducted on a four-year dataset (2019-2023) from the Ddareungi station at Exit 5 of Magoknaru Station.

General Usage Patterns

The result showed that bike usage drops with stronger winds and rain but peaks at moderate temperatures (15-17°C). The highest usage occurs during weekday morning and evening commutes. Usage patterns are concentrated in business districts such as Magok, G-Valley, and Yeouido, where most users are in their 20s and 30s. These areas experience imbalances in rentals and returns, especially during commutes.

The general usage patterns were analyzed to forecast bicycle demand and supply. Using the STL (Seasonal and Trend decomposition using Loess) method, rental and return volumes were first decomposed to reveal seasonality, trends, and cycles. The residuals from this decomposition were then applied to a SARIMAX model, incorporating weather and time variables to explain the usage patterns. The model successfully forecasted demand, achieving an R² of 0.73 for returns and 0.65 for rentals.
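A rough sketch of this two-step pipeline with statsmodels is shown below: STL strips the 24-hour seasonality and trend, and a SARIMAX model with weather regressors is fitted to the residuals. The hourly frequency, the SARIMAX order, and the synthetic weather effects are illustrative assumptions, not the exact specification used in the analysis.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(5)

# Synthetic hourly rentals at one station: daily cycle + weather effects + noise
idx = pd.date_range("2023-06-01", periods=24 * 60, freq="h")
hour = idx.hour.to_numpy()
temp = 20 + 5 * np.sin(2 * np.pi * (hour - 14) / 24) + rng.normal(0, 1, len(idx))
rain = (rng.random(len(idx)) < 0.05).astype(float)
rentals = pd.Series(10 + 8 * np.sin(2 * np.pi * (hour - 8) / 24)
                    + 0.3 * temp - 6 * rain + rng.normal(0, 1.5, len(idx)), index=idx)

# Step 1: decompose seasonality and trend with STL (24-hour seasonal period)
resid = STL(rentals, period=24).fit().resid

# Step 2: model the residuals with SARIMAX, using weather as exogenous regressors
exog = pd.DataFrame({"temp": temp, "rain": rain}, index=idx)
model = SARIMAX(resid, exog=exog, order=(1, 0, 1)).fit(disp=False)
print(model.params)
```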

Optimization Based on the Rental-Return Index Range

To optimize bike redistribution, the "Rental-Return Index" was introduced to measure the difference between expected rentals and returns at each station.

$1\text{-Day Index} = \dfrac{\text{Estimated Rental Volume}}{\text{Estimated Return Volume}}$

As shown in the equation above, when a station has the right balance, with neither a surplus nor a shortage of bikes, the Index equals 1. An Index greater than 1 indicates a shortage, while an Index below 1 signifies a surplus. By categorizing stations into surplus or deficit, redistribution efforts can be directed toward stations with shortages (Index greater than 1), improving customer satisfaction.

In addition, this approach is particularly useful because the number of redistribution targets can be quantified based on the available budget for Seoul's bike system. Stations with the highest Index values are prioritized first, and the top stations for redistribution are selected according to the allocated budget, ensuring cost-effective and efficient redistribution efforts.
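The sketch below shows how such a priority list could be assembled: compute the 1-Day Index from forecast rentals and returns, keep the stations expected to run short, and take as many of the highest-Index stations as the budget allows. The station names and numbers here are made up for illustration.

```python
import pandas as pd

# Hypothetical one-day forecasts per station
forecasts = pd.DataFrame({
    "station": ["Magoknaru Exit 2", "Balsan Exit 1/9", "Gayang Exit 8", "Magok Intersection"],
    "est_rentals": [420, 310, 150, 260],
    "est_returns": [300, 330, 180, 140],
})

# 1-Day Index = estimated rentals / estimated returns (> 1 means an expected shortage)
forecasts["index"] = forecasts["est_rentals"] / forecasts["est_returns"]

budget_stations = 2  # how many stations the redistribution budget can cover
priority = (forecasts[forecasts["index"] > 1.0]
            .sort_values("index", ascending=False)
            .head(budget_stations))
print(priority[["station", "index"]])
```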

To further optimize bike redistribution, clustering can be applied to group business and residential areas based on rental and return distributions within districts, aiming for a rental-return Index of 1. This method would minimize the distance bikes need to be moved during redistribution, as workers would be assigned to specific teams responsible for managing these clustered regions. In other words, by focusing on areas where the Index is balanced, this approach ensures more efficient redistribution while reducing overall transportation efforts.

Clustering Idea for Implementing Spatial-Temporal Balance

Common Clustering Method

Initially, a K-Means clustering approach was tested to identify areas where the difference between bike rentals and returns was close to zero. By adjusting the number of clusters to match Seoul’s 25 districts, the analysis of June 2023 data showed that clusters with more districts had net volume averages closer to zero, indicating a better balance between rentals and returns. In contrast, smaller clusters with fewer districts exhibited greater imbalance.
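For reference, a minimal version of this kind of check might look like the sketch below: cluster synthetic station coordinates with K-Means and measure how far each cluster's mean net volume (rentals minus returns) is from zero. The coordinates, net volumes, and the choice of 25 clusters are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)

# Synthetic stations: coordinates plus monthly net volume (rentals minus returns)
stations = pd.DataFrame({
    "lat": rng.uniform(37.45, 37.65, 300),
    "lon": rng.uniform(126.80, 127.10, 300),
    "net_volume": rng.normal(0, 40, 300),
})

# Cluster on location so each cluster roughly matches a district-sized area
km = KMeans(n_clusters=25, n_init=10, random_state=0).fit(stations[["lat", "lon"]])
stations["cluster"] = km.labels_

# How close is each cluster's average net volume to zero (perfect balance)?
balance = stations.groupby("cluster")["net_volume"].mean().abs().sort_values()
print(balance.describe())
```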

Further testing with other clustering methods, such as the Gaussian Mixture Model (GMM), produced results similar to those of K-Means. However, neither method fully captured the underlying bike movement patterns, as these clustering models were unable to account for the dynamic mobility data within the bike-sharing system. This suggested that the algorithms might not be well-suited to the structure of Ddareungi's data, highlighting the need for alternative modeling approaches.

Since Ddareungi’s data reflects bike movements between stations, it is logical to treat these movements as links within a graph, with rental and return stations acting as nodes. By applying a community detection method, clusters can be identified based on the most frequent bike movements. This graph-based approach, which focuses on actual bike movement patterns, could lead to more efficient bike redistribution and yield improved clustering results.

Network Detection Method

The approach involves treating the movement of bikes between rental and return stations as links between nodes, thereby creating a graph. By identifying clusters with the highest number of links, it's possible to detect community divisions where bikes tend to circulate internally. This can significantly enhance the efficiency of bike redistribution across the network.

This is where network community detection comes into play. Community detection is a method that divides a graph into groups with dense internal connections. Applied to Ddareungi data, it helps track rental-return patterns by clustering areas where rentals and returns are balanced. By identifying these clusters, we can detect regions that maintain spatial balance, with more compact clusters reflecting higher modularity.

Modularity measures how densely connected the links are within a community compared to the connections between different communities. It ranges from -1 to 1, with values between 0.3 and 0.7 indicating the existence of meaningful clusters. Higher modularity signifies stronger internal connections, leading to more effective clustering.

Modularity

To optimize modularity, the Louvain algorithm was tested. This algorithm works in two phases: In Phase 1, nodes are assigned to communities in a way that maximizes modularity. In Phase 2, the network is simplified by merging the links between communities, further refining the structure and improving cluster detection.
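A small sketch of this graph-based approach is given below, assuming NetworkX 2.8 or later (which ships a Louvain implementation) and a hypothetical table of trip counts between stations; real Ddareungi data would contain far more stations and trips.

```python
import networkx as nx

# Hypothetical trip counts: (rental station, return station, number of trips)
trips = [
    ("Magoknaru 2", "Balsan 1", 120), ("Balsan 1", "Magoknaru 2", 95),
    ("Magoknaru 2", "Gayang 8", 40),  ("Gayang 8", "Magoknaru 2", 35),
    ("Yeouido A", "Yeouido B", 150),  ("Yeouido B", "Yeouido A", 140),
    ("Yeouido A", "Magoknaru 2", 10),
]

# Build an undirected weighted graph: nodes are stations, edge weights are trip counts
G = nx.Graph()
for u, v, w in trips:
    if G.has_edge(u, v):
        G[u][v]["weight"] += w
    else:
        G.add_edge(u, v, weight=w)

# Louvain community detection and the modularity of the resulting partition
communities = nx.community.louvain_communities(G, weight="weight", seed=0)
Q = nx.community.modularity(G, communities, weight="weight")
print(communities, round(Q, 3))
```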

When applied to Ddareungi data, the Louvain algorithm significantly outperformed K-Means clustering, which relies on Euclidean coordinates. The average net deviation, where 0 is ideal, dropped sharply from 21.19 with K-Means to 9.23 using Louvain, indicating a more accurate clustering of stations. Unlike K-Means, which ignores key geographical features like the Han River, the Louvain algorithm took Seoul's geography into account, resulting in more precise and meaningful clusters.

The following map comparison highlights this difference, showing how Louvain provides clearer cluster differentiation across the Han River, whereas K-Means fails to capture these geographic distinctions.

Network

Understanding the Cycle

I likened Ddareungi bike movement to the flow of water. Just as the total amount of water on Earth remains constant, the total number of Ddareungi bikes stays fixed. This analogy helps conceptualize the system as spatially and temporally closed, where clustering can maintain balance.

Cycle

Temporal imbalances can be managed by tracking the flow of bikes throughout the day. For instance, business districts experience high demand in the morning but accumulate excess bikes by evening, while residential areas face the opposite situation. Redistribution efforts can be minimized by transferring surplus bikes from business districts to residential areas overnight, before the morning commute begins. After the morning rush, bikes concentrate in business districts but are naturally redistributed as users ride them back to residential areas during the evening commute.

Although there is some uncertainty in the evening, as it's unclear whether users will choose bikes for their return journey, any surplus can still be addressed overnight as part of the regular redistribution cycle. This ensures that before the next morning commute, any leftover bikes in business districts are moved to residential areas as mentioned above. When viewed over a full day, these fluctuations tend to balance out, reducing the need for excessive intervention.

To manage these imbalances more effectively, a rental-return index was used to prioritize stations for redistribution, ultimately reducing operational costs. Additionally, network community detection, particularly through the Louvain algorithm, provided more accurate clustering than previous methods. This approach better reflected Seoul's geography, especially by distinguishing clusters across the Han River, greatly improving redistribution strategies.

By viewing Ddareungi as a system striving for both spatial and temporal balance, shortages can be managed more efficiently. This approach not only optimizes the Ddareungi system but also offers valuable insights for enhancing the management of other shared resource systems.



Donggyu Kim (MBA AI/BigData, 2024)

It's difficult to maintain blood stock at safe levels

South Korea has recorded its lowest birth rate in history. In 2023, the country's total fertility rate was 0.72, raising concerns about various future issues. Among them, the potential blood supply shortage due to low birth rates has come into focus. According to the Korean Red Cross, by 2028, the demand for whole blood donations is expected to exceed supply. Moreover, this gap is anticipated to widen further.

Blood shortages have long been a recurring problem. Especially during the winter season, the lack of blood donors causes hospital staff to worry about whether they can ensure a smooth supply of blood to patients. Despite these concerns, the blood shortage problem continues to worsen.

The Korean Red Cross considers a blood stock of more than five days to be at a "safe level", while a stock of less than five days is regarded as a "shortage". However, past data shows that the number of days the blood stock remains at a safe level has been decreasing.

Figure 1: Annual Blood Stock Ratio / Credit: Korean Red Cross

Why is it difficult to maintain blood stock at a safe level? The reason is that both the supply and the usage of blood are hard to control. Blood is used in medical procedures like surgeries, and reducing its usage would cause significant backlash. On the other hand, blood can only be supplied through donations, meaning supply is limited. Therefore, despite the efforts of the Korean Red Cross, it remains challenging to keep blood stock at a safe level.

Literature Review

This study aims to understand the dynamics of blood supply and usage to help address the issue of blood shortages. Additionally, the study measures the effects of "blood donation promotional activities", one of the key factors in increasing blood supply, and proposes efficient solutions.

Before delving into the analysis, let's review how previous studies have approached blood supply and usage. Blood has the characteristics of a public good, so it's heavily influenced by laws, and blood donation and management systems vary significantly between countries. Therefore, it was deemed difficult to apply research findings from other countries domestically, which is why I focused on reviewing domestic studies.

Yang Ji-hye (2013), Lee Tae-min (2013), Yang Jun-seok (2019), and Shin Ui-young (2021) focused on qualitative analysis, identifying motivations for blood donation participation through surveys. Kim Shin (2015) used multiple linear regression analysis to predict the number of donations by individual donors; however, donors' personal information was used as the explanatory variables and time series factors were not considered, making it difficult to understand the dynamics of blood supply and usage. Kim Eun-hee (2023) studied the impact of the COVID-19 pandemic on the number of donations, but her research did not account for exogenous variables or types of blood donations. Unfortunately, previous studies did not focus on the dynamics of blood supply and usage, leaving little to reference for this analysis.

Analysis of Blood Supply Dynamics

Selection of Analysis Subjects

From this section, I will introduce the analysis process. Rather than diving straight into the analysis, I will first clearly define the subjects of analysis. The Korean Red Cross publishes annual blood donation statistics, providing the number of donors categorized by group (age, gender, donation method, etc.). This study utilized that data for the analysis.

There are various types of blood donations. Depending on the method, donations are classified into whole blood, plasma, and platelets & multiple components. First, looking at plasma, approximately 68% of it is used as a raw material for pharmaceutical production, and it has a long shelf life of one year, making imports feasible. Therefore, in the case of plasma shortages, the issue can be resolved through imports, and as such, it is not our primary concern.

Next, platelet & multiple component donation has stricter criteria. Women who have experienced pregnancy are not eligible to donate, and it requires better vascular conditions compared to other types of donations. As a result, the gender ratio of donors is skewed at 20:1, raising concerns about sample bias and making it difficult to derive accurate estimates during analysis. Moreover, unlike whole blood, platelet & multiple component donations are primarily used for specific diseases. For these reasons, this study focuses solely on whole blood donations as the subject of analysis.

After selecting whole blood donations as the subject of analysis, one concern arose: whether to differentiate the data based on the amount of blood collected. The data I received is categorized by 320ml and 400ml amounts. Should I divide the data based on these amounts, just as we divide groups by gender? I decided that it would not be appropriate to make this distinction. Dividing the data by amount would distort the data structure because the amount is not a choice made by the donor but is determined by the donor's age and weight. Since donors cannot choose the amount, the 320ml and 400ml data come from the same distribution, and dividing them would arbitrarily split this distribution. Therefore, in this analysis, I integrated the data categorized by amount of blood collected and defined it as the "number of donors" for the analysis.

Figure 2: Distribution of donors by amount (left), distribution of all donors (right)

The Day-of-the-Week Effect

Now that the analysis target has been clearly defined as the number of whole blood donors, let's begin the analysis. Since the number of donors is time series data, it's important to check whether it shows any seasonality. First of all, it is expected that the number of donors will vary depending on the weekly seasonality, specifically the day of the week and holidays. Let's examine the data to confirm this.

Figure 3: Distribution of the Number of Blood Donors by Day of Week (Left), Distribution of the Number of Blood Donors on Holidays and Non-Holidays (Right)

As seen in Figure 3, the number of blood donors is higher on weekdays and relatively lower on holidays. Let's incorporate this information into the model. If the differences between groups in the data are overlooked and not included in the model, omitted variable bias (OVB) may occur, leading to inaccurate results. Therefore, it is important to identify variables that could cause group differences and incorporate them in the model.

It is natural to think that if we are dividing the data by groups, we should also split the data by gender. However, there is no need to group the blood donor data by gender. This is because the purpose of the analysis is to understand the dynamics of the blood supply from the perspective of the entire population. If the goal were to analyze individual donation frequencies, gender would be an important variable. However, since we are examining data for the whole population, there is no need to separate by gender. Additionally, when the number of male and female donors is normalized for mean and variance, they show very similar patterns. For these reasons, we analyzed the data without dividing it by gender.

Figure 4: Distribution of the Number of Blood Donors by Day of Week and Gender (Left), Distribution of Blood Donors on Holidays and Gender (Right)

Next, let's examine how the distribution changes as we divide the blood donor data into groups. Our goal is for the data to follow a normal distribution, since an approximately normal distribution suggests that no major systematic factors remain unexplained in the data.

First, let's look at the distribution of the number of blood donors without dividing it into any groups. The distribution shows a bimodal pattern, which indicates that there are still many unexplained factors in the data. Now, let's add the day-of-the-week effect that we discovered earlier to the model and see how the distribution changes. As seen in Figure 5, the distribution of weekday data after removing the day-of-the-week effect is no longer bimodal and has shifted to resemble a bell shape.

Figure 5: Distribution of Blood Donors without Grouping (Left), Distribution of Blood Donors on Weekdays (Right)

The distribution of the data after removing the day-of-the-week effect takes on a bell shape, but the long tail extending to the left is still concerning. We suspected this tail came from days when most donation centers are closed and donor counts drop sharply, and we incorporated this into the model. When we plotted the distribution using only non-holiday data, just as we did when removing the day-of-the-week effect, the tail disappeared.

Figure 6: Distribution of Blood Donors on Weekdays (Left), Distribution of Blood Donors on Weekdays and Non-Holidays (Right)

Annual Seasonality

So far, we have identified day of the week and holidays as factors that influence the number of blood donors. Let's express this in a regression equation and check the residuals. If the residuals do not follow a normal distribution, it means there are still unexplained factors affecting the number of blood donors. The regression equation for the number of blood donors based on day of the week and holidays is shown below.

[ \left(bd_{320ml} \cup bd_{400ml}\right) \sim d_{dow}, d_{holiday} ]

This equation means that the response variable represents the number of whole blood donors, combining both 320ml and 400ml blood donations. The explanatory variables are the day of the week and holidays, which have been included in the equation in the form of dummy variables.
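For readers who want to reproduce this step, the sketch below shows one way to fit such a dummy-variable regression in Python. The data are synthetic placeholders, and the column names (`date`, `donors`, `dow`, `holiday`) are assumptions rather than the actual Red Cross field names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic daily data standing in for the Red Cross whole-blood donor counts.
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-01", "2019-12-31", freq="D")
dow = dates.day_name()
weekday_level = pd.Series(dow).map(
    {"Monday": 800, "Tuesday": 820, "Wednesday": 810, "Thursday": 815,
     "Friday": 830, "Saturday": 500, "Sunday": 450}).to_numpy()
holiday = rng.random(len(dates)) < 0.03            # placeholder holiday indicator
donors = weekday_level - 200 * holiday + rng.normal(0, 40, len(dates))

df = pd.DataFrame({"date": dates, "donors": donors,
                   "dow": dow, "holiday": holiday.astype(int)})

# C(dow) expands into day-of-week dummies; holiday enters as a 0/1 dummy,
# mirroring the regression equation above.
model = smf.ols("donors ~ C(dow) + holiday", data=df).fit()
print(model.params)

# Residuals with these effects removed should lose the bimodal shape.
residuals = model.resid
```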

Figure 7: Residual Distribution Before Removing Annual Seasonality

The residuals after removing the day-of-the-week and holiday effects no longer show the unusual patterns from the original data, such as the bimodal shape or long tail. However, when looking at the right side of the mean, there is an unusual pattern that wasn't detected in the distribution of blood donors. This suggests that there are still factors not explained by the day-of-the-week and holiday variables. What could those factors be?

There are two types of seasonality: weekly seasonality, such as day-of-the-week effects, and annual seasonality, like spring, summer, fall, and winter. Since we've already accounted for weekly seasonality, let's now consider annual seasonality. As mentioned earlier, we know that the number of blood donors tends to decrease in winter, so we can expect that annual seasonality exists. Let's examine the data to confirm this.

Figure 8: Distribution of Blood Donors by Day of Week (Left), Distribution of Blood Donors by Month (Right)

Looking at Figure 8, we can see that the distribution of blood donors varies by month. Therefore, it is reasonable to conclude that annual seasonality exists in the number of blood donors, and we should incorporate this into the model. It is suspected that annual seasonality may be contributing to the unusual patterns in the residuals.

How can we incorporate annual seasonality into the model? The simplest method would be to include all days of the year using 365 dummy variables. However, this approach is inefficient as it uses too many variables. When there are too many variables, the model's variance increases, and multicollinearity issues may arise. This is especially concerning because the number of blood donors does not fluctuate dramatically on a daily basis, so multicollinearity is likely. So, how can we capture similar information without using 365 dummy variables?

Let's focus on the word “cycle”. When we think of cycles, sine and cosine functions come to mind. How about using sine and cosine functions to capture annual seasonality? This approach is called Harmonic Regression.

Figure 9: Capturing Annual Seasonality Using Harmonic Regression

Figure 9 illustrates that annual seasonality is captured using appropriate sine and cosine functions. By using a method suited to the characteristics of the cycle, we were able to capture seasonality with a small number of variables. Of course, using temperature to capture annual seasonality is another option. This method has the advantage of being more intuitive and easier to control variables. However, there is annual seasonality in the blood donor data that cannot be fully explained by temperature alone, which is why harmonic regression was used to model the seasonality.
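A minimal sketch of harmonic regression follows, using a synthetic daily series in place of the actual donor counts: a small number of annual-period sine and cosine terms (Fourier terms) stand in for 365 daily dummies.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic daily series standing in for the donor counts; in practice the
# Fourier terms would be added alongside the day-of-week and holiday dummies.
rng = np.random.default_rng(1)
dates = pd.date_range("2018-01-01", "2019-12-31", freq="D")
doy = dates.dayofyear.to_numpy()
y = 800 + 60 * np.sin(2 * np.pi * doy / 365.25) + rng.normal(0, 30, len(dates))

# K sine/cosine pairs with an annual period replace 365 daily dummies.
K = 2
fourier = [np.sin(2 * np.pi * k * doy / 365.25) for k in range(1, K + 1)] + \
          [np.cos(2 * np.pi * k * doy / 365.25) for k in range(1, K + 1)]
X = sm.add_constant(np.column_stack(fourier))

fit = sm.OLS(y, X).fit()
seasonal = fit.fittedvalues   # smooth annual cycle captured with 2K + 1 parameters
print(fit.params)
```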

Figure 10: Residual Distribution After Removing Annual Seasonality

As a result of incorporating annual seasonality into the model, the unusual patterns in the residuals were eliminated. The regression equation with annual seasonality included is shown below.

[ \left(bd_{320ml} \cup bd_{400ml}\right) \sim d_{dow}, d_{holiday}, sin_i, cos_i ]

Weather Effect

Do temperature and weather affect the number of blood donors? Upon investigating the data, we found that 70% of donors visit blood donation centers in person. This leads to a strong suspicion that temperature and precipitation, which influence outdoor activities, could have a significant impact on the number of blood donors.

Figure 11: The Effect of Precipitation on the Number of Blood Donors

Since weather conditions vary significantly by region, we conducted the analysis separately for each region. We examined the significance of temperature and precipitation variables for individual regions. The results showed that precipitation negatively impacted the number of blood donors in all regions, while temperature did not have a significant effect. This is because the information provided by temperature was already captured when we incorporated annual seasonality into the model. The regression equation, including precipitation, is shown below.

[ \left(bd_{320ml} \cup bd_{400ml} |region \right) \sim d_{dow}, d_{holiday}, sin_i, cos_i, rain_i ]
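The region-by-region check described above could look roughly like the sketch below. The panel here is synthetic, the region names and column names (`rain_mm`, `dow`) are placeholders, and the holiday and harmonic terms are omitted for brevity.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic (region, date) panel; regions and columns are placeholders.
rng = np.random.default_rng(2)
dates = pd.date_range("2019-01-01", periods=365, freq="D")
panel = pd.DataFrame({
    "region": np.repeat(["Seoul", "Busan"], len(dates)),
    "dow": np.tile(dates.day_name(), 2),
    "rain_mm": rng.gamma(1.0, 5.0, 2 * len(dates)),
})
panel["donors"] = rng.normal(500, 40, len(panel)) - 2.0 * panel["rain_mm"]

# One regression per region; inspect the precipitation coefficient and p-value.
for region, g in panel.groupby("region"):
    fit = smf.ols("donors ~ C(dow) + rain_mm", data=g).fit()
    print(region, round(fit.params["rain_mm"], 2), round(fit.pvalues["rain_mm"], 3))
```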

Dynamics of Blood Supply and Usage During the COVID-19 Period

In this section, we will examine how blood stock responds when a significant external shock occurs. Specifically, we will analyze the dynamics of blood stock during the COVID-19 period, which was the most significant recent shock.

It is likely that maintaining blood stock above a certain level was challenging during the COVID-19 period. This is because population movement significantly decreased due to various quarantine measures and fears of infection. Moreover, as shown in Figure 12, the number of individuals ineligible for blood donation increased starting in 2020. This was due to the introduction of new health criteria during the COVID-19 period, which restricted blood donations for a certain period after recovering from COVID-19 or receiving a vaccine. For these reasons, we expect that blood stock levels decreased significantly during the pandemic. Let’s examine the data to see if our hypothesis is correct.

Figure 12: Increase in the Ineligibility Rate for Blood Donation Since the COVID-19 Pandemic

As seen in Figure 13, interestingly, blood stock levels were maintained above a certain level during the COVID-19 period. The blood stock never dropped below two days' supply. How was the Korean Red Cross able to maintain blood stock above a certain level despite the external shock of the pandemic?

Figure 13: Blood Stock Levels Maintained Above a Certain Level Despite COVID-19

After controlling for the factors considered earlier and conducting a regression analysis, it was found that blood usage decreased by 4.25% during the COVID-19 pandemic. This reduction can be attributed to two factors: the intentional decrease in blood usage to maintain stock levels, and the natural decline due to the shortage of medical personnel and hospital wards during the pandemic.

A regression analysis on blood supply using the same variables showed a 5.3% decrease in supply. The reason blood stock levels were maintained during the COVID-19 period is that both usage and supply decreased at similar rates. However, considering the broader societal impact of the pandemic, the 5.3% decrease is relatively minimal.

Finding the "Blood Shortage" Variable

A regression analysis of blood donor numbers by region showed that, in certain areas, the number of donors actually increased. Since COVID-19 did not occur only in specific regions, this contradicts common sense. Therefore, it is suspected that some factor during the pandemic contributed to an increase in blood supply in those areas, and the 5.3% decrease in supply would likely have been larger without this offsetting factor.

Figure 14: Increase in Blood Donor Numbers in Certain Regions Despite COVID-19

We anticipated that an increase factor might come into play during periods of blood shortage. Thus, we created a proxy variable called "Blood Shortage". Days when blood stock dropped below a certain level, along with a defined period thereafter, were classified as "shortage periods". This reflects the impact of specific measures taken by the Korean Red Cross during these periods.

Figure 15: Example of a Blood Shortage Period
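A sketch of how such a shortage flag could be constructed is shown below, assuming the daily stock is expressed in days of supply; the threshold and the length of the follow-up window are illustrative assumptions, not the values used by the Korean Red Cross.

```python
import pandas as pd

# Threshold (in days of supply) and follow-up window are illustrative assumptions.
THRESHOLD_DAYS = 3.0
WINDOW = 14   # days after a below-threshold day that still count as "shortage"

stock = pd.Series(
    [5.1, 4.2, 3.4, 2.8, 2.6, 3.1, 3.6, 4.0, 4.5, 5.0],
    index=pd.date_range("2020-02-20", periods=10, freq="D"),
    name="stock_days",
)

below = (stock < THRESHOLD_DAYS).astype(int)
# A day belongs to a shortage period if the stock was below the threshold on
# that day or on any of the previous WINDOW days.
shortage = below.rolling(window=WINDOW + 1, min_periods=1).max().astype(int)
print(shortage)
```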

An analysis of the effect of the "blood shortage" on the number of blood donors showed that, in most regions, it had a positive effect on donor numbers. This supports the earlier hypothesis that some factor was increasing blood supply. Similarly, when examining the effect of the shortage condition on blood usage, we observed a decrease in usage during those periods. This indicates that the manual for blood supply shortages, which is triggered when blood stock levels fall below a certain threshold, worked effectively.

However, the increase factor associated with the "blood shortage" is likely effective only when the decrease in blood donors can be anticipated in advance. This is because the Korean Red Cross needs to predict a decline in donor numbers in order to respond with promotional efforts. Let's verify this by looking at the data.

Looking at the model's residuals, we can see that the number of blood donors decreased during the early COVID-19 outbreak in Daegu/Gyeongbuk and during the Omicron wave, both of which were unexpected events. In other, more predictable periods, donor numbers did not continue to decline, suggesting that the increase factor operated effectively. Blood stock levels were maintained during those periods because the manual for blood supply shortages was activated and the public became more aware of the shortage, leading to more proactive blood donations that helped increase supply.

Figure 16: The Increase Factor Did Not Operate During Unexpected Shocks

Measuring the Effect of Promotions

The Effect of the Additional Giveaway Promotion

During the COVID-19 period, the Korean Red Cross employed various methods to prevent a decline in the number of blood donors, including promotions, SMS donation appeals, and public service advertisements. Which of these methods was the most effective? If the effect can be accurately measured, the Korean Red Cross will be able to respond more efficiently to future blood shortages.

It would be ideal to measure the effect of all methods, but most were difficult to analyze due to a lack of data or one-time events. Fortunately, promotions were deemed suitable for quantitative analysis, so we focused on measuring their impact. Let’s examine how much promotions increased the number of blood donors.

The giveaway promotion was conducted in the same way across all regions for an extended period, so there should be no major issues in measuring its effect. To assess its impact, we created a dummy variable for "promotion days" while controlling for the variables we previously identified. The results showed that the response to the promotion varied by gender. Men responded strongly to the promotion while women did not show a significant response. However, does simply adding a dummy variable truly capture the pure increase driven by the promotion?

Figure 17: Effect of the Giveaway Promotion by Gender Using a Simple Dummy Variable

Using a simple dummy variable to capture the effect of the promotion period results in a mixture of both the "promotion effect" and the "trend during the promotion period". For example, the number of blood donors in May and December differs. May sees more donors due to favorable weather, while December sees fewer. Therefore, simply adding a dummy variable makes it difficult to isolate the pure effect of the promotion, as the existing higher donor numbers in May may get mixed with the increase from the promotion itself. We need to consider how to separate these effects to accurately measure the promotion's impact.

As shown in Figure 18, the giveaway promotion was conducted on a quarterly basis. Days within the same quarter share similar seasonality, so comparing promotion and non-promotion days within a quarter largely removes the underlying trend. To do this, the entire timeline was divided into quarters and the promotion's pure impact was measured within each quarter.

Figure 18: Promotion days, promotion periods (shaded), and non-promotion periods (gray)

After removing the trends, there is no significant difference in the promotion response between the male and female groups. Although there is some variance due to unexplained social factors, the average response is similar, leading to more accurate results compared to using a simple dummy variable.
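The within-quarter comparison described above can be sketched as follows, with synthetic data and hypothetical column names (`promo`, `male_donors`, `female_donors`) standing in for the actual records.

```python
import numpy as np
import pandas as pd

# Synthetic daily data; `promo` marks placeholder promotion days.
rng = np.random.default_rng(4)
dates = pd.date_range("2021-01-01", "2021-12-31", freq="D")
df = pd.DataFrame({
    "quarter": dates.quarter,
    "promo": (dates.day <= 7).astype(int),
    "male_donors": rng.normal(500, 30, len(dates)),
    "female_donors": rng.normal(350, 25, len(dates)),
})

def promo_gap(g, col):
    # Mean difference between promotion and non-promotion days within one quarter,
    # so seasonal level differences across quarters drop out.
    return g.loc[g["promo"] == 1, col].mean() - g.loc[g["promo"] == 0, col].mean()

effects = df.groupby("quarter").apply(
    lambda g: pd.Series({"male": promo_gap(g, "male_donors"),
                         "female": promo_gap(g, "female_donors")})
)
print(effects.mean())   # average within-quarter promotion effect by gender
```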

The Effect of Special Promotions

In addition to the giveaway promotion, the Korean Red Cross conducted various special promotions, including gift cards, souvenirs, travel vouchers, and sports event tickets. To accurately measure the effect of these special promotions, it is essential to remove the trends, just as with the giveaway promotion. In other words, we need to identify periods where there would be no differences except for the promotion. In this analysis, we examined the difference in the number of blood donors two weeks before and after the promotion period, as well as during the promotion period itself.

Figure 19: Example of Special Promotion

The increase in the number of blood donors attributable to special promotions was positive in many regions. Among these, offering sports viewing tickets was particularly effective. Therefore, sports viewing tickets could be used to effectively increase the number of blood donors during anticipated periods of blood shortage.

Figure 20: The sports viewing ticket promotion ranks among the top

An Episode from Data Collection

I will conclude the analysis by sharing an episode from the data collection process. The data used in this research was collected through various channels. For blood services statistics, I was able to obtain well-organized information through the Statistics Korea API. However, other data sources were not as easily accessible, which was somewhat disappointing. While blood stock, usage, and supply data are available through other APIs, they are provided only at monthly resolution, which is too coarse for detailed analysis.

Fortunately, since the Korean Red Cross is a government organization, we were able to request daily data on blood stock, usage, and supply, as well as data on the giveaway promotion, through a "Public Information Request". Government departments and public institutions often provide access to such data, excluding sensitive personal information. I encourage other researchers to actively use information disclosure requests to obtain high-quality data. In South Korea especially, where administrative data are well digitized, researchers can often obtain exactly the materials they need for their studies this way.

Hyoung Keun Kwon (MSc Data Science, 2024)
Administrative Divisions, Residential Districts, and Tax Systems

Regions where administrative divisions overlap face different economic, social, and political issues than those without such overlaps. While local governments in these regions compete to attract resources to meet residents' needs, they often run into unexpected outcomes and inefficiencies.

Figure 1. Wirye New Town with 3 administrative divisions (Source: Kukmin Ilbo, 2019)

Recent “New Town” projects in Korea, mainly designed to disperse population from the downtowns of large cities and reduce unnecessary administrative burdens, are often located where different administrative districts are intertwined, leading to many inconveniences. A prime example is Wirye New Town (henceforth, Wirye) and its Light Rail Transit (LRT) project connecting Wirye and Sinsa, which already has two subway lines. With the national government, Seoul Metropolitan City, Songpa-Gu, Gyeonggi Province, Hanam City, and Seongnam City involved with different interests, the project has yet to begin construction, even though it was originally scheduled to start ten years ago.

A common solution to such issues brought up in the media is the integration of administrative divisions. However, simply integrating the administrative districts will not solve the problem. Will the Gyeonggi, Hanam, or Seongnam governments agree to annexing Wirye to Songpa-Gu of Seoul, with the consequence of losing part of their tax base, or vice versa? Even if the pertinent local governments somehow agree on one form of integration or annexation, running school districts, as well as public facilities like fire stations, community centers, and libraries, will be a major challenge, as securing and managing the resources to operate them will bring additional administrative and financial burdens.

In fact, integrating or annexing administrative districts in Korea is rather rare. There have been several attempts to integrate or annex local governments, but the sheer number is small, and examples like the integration of Changwon, Masan, and Jinhae into the unified Changwon City, and the integration of Cheongju City and Cheongwon County into the unified Cheongju City, are cases where entire local government units are merged or annexed (Source: Yonhap News, 2024). Integrating Wirye is a completely different story and annexation in such circumstances is unprecedented.

Low Fiscal Autonomy of Local Governments

One of the main reasons these problems persist in Korea is that local governments cannot be financially independent. While the law allows local governments to adjust the standard tax rate through a flexible tax rate system, it is rarely used in practice, and most local governments apply the tax rate set by the central government, citing the 'equity of the tax system' (Jeong, 2021). In other words, governments exercise their legal authority to adjust tax rates only when doing so brings no particular benefit or disadvantage. Furthermore, they have little, if any, authority to determine the tax base, the other facet of tax revenue. The national government distributes subsidies to fill the gap between fiscal needs and revenue, but that decision is ultimately made by the national government. Local governments do not have the power to determine their own revenues, so they can hardly formulate long-term policies.

The United States presents a contrasting case. While the federal government holds the highest authority, each state has discretion over the types of taxes and tax rates it imposes. For example, Texas, where Shin-Soo Choo played, has no individual income tax, while California has one of the highest individual income tax rates. Delaware, Montana, and Oregon have no sales tax. Furthermore, local governments (cities, towns, and so on) in New York State assess real estate and impose different tax rates. New York City prohibits right turns on red, whereas in most other U.S. jurisdictions it is legal to turn right on red after yielding to oncoming traffic. Of course, this does not mean that state and local governments are completely fiscally independent. The U.S. also has federal and state systems of tax collection and redistribution through grants and subsidies, but local governments enjoy far higher fiscal autonomy than in Korea.

The Need for Research on Tax Competition

Let's return to the case of Wirye. If Songpa District, Seongnam City, and Hanam City had not just waited for support from the national government but had ways to secure their own resources, they could have solved the problem of the delayed construction of the LRT project. The reality that the construction has been delayed for 16 years, despite residents paying additional contributions, clearly shows how important the fiscal autonomy of local governments is (Source: Chosun Biz, 2024).

However, if local governments' taxing powers were to increase in Korea, tax competition between them would be inevitable. When local governments can autonomously adjust tax rates, they will compete to attract resources with minimal resistance from citizens. In other words, granting local governments more taxing power brings new challenges along with the added autonomy. Ultimately, interactions between local governments, especially tax competition, play an important role, and the impact of such competition on regional economies cannot be neglected.

Research Question

This study aims to examine the tax competition that arises in regions where tax jurisdictions are not completely independent from one another. Particularly in situations like Wirye, where multiple administrative districts overlap, the study will use a game-theoretic approach to model how local governments determine and interact with their tax policies, and to analyze the characteristics and results of the competition.

Specifically, this study aims to address the following key questions:

  • What strategic choices do local governments make in overlapping tax jurisdictions?
  • How is tax competition in this environment different from traditional models?
  • What are the characteristics of the equilibrium state resulting from this competition?
  • What are the impacts of this competition on the welfare of local residents and the provision of public services?

Through this analysis, the study aims to contribute to the effective formulation of fiscal policies in regions with complex administrative structures. It also expects to provide theoretical insights into the problems that arise in cases like Wirye, which can help inform policy decisions in similar situations in the future. The next chapter will go into further detail on tax competition and explain the specific models and assumptions used in this research.

Literature on Tax Competition

Tax competition refers to the competition between local governments to determine tax rates in order to attract businesses and residents. This can have a significant impact on the fiscal situation of local governments and the provision of public services.

The origin of the theory of tax competition can be traced back to Tiebout's (1956) "Voting with Feet" model. Tiebout argued that residents move to areas that provide the combination of taxes and public services that best matches their preferences. Later, Oates (1972) argued that an efficiently decentralized fiscal system can provide public goods, establishing this as the decentralization theorem.

However, Wilson (1986) and Zodrow and Mieszkowski (1986) pointed out that tax competition between local governments can lead to the under-provision of public goods. They argued that as capital mobility increases, local governments tend to lower tax rates, which can ultimately lead to a decline in the quality of public services.

Tax competition takes on an even more complex form in regions with overlapping tax jurisdictions. Keen and Kotsogiannis (2002) analyzed a situation in which vertical tax competition (between different levels of government, such as central and local governments) and horizontal tax competition (between governments at the same level, such as neighboring local governments) occur simultaneously. They showed that such an overlapping structure can lead to excessive taxation.

In the case of Wirye, the overlapping jurisdictions of multiple local governments make the dynamics of tax competition even more complex. Each government tries to maximize its own tax revenue, but at the same time, they must also consider the competition with other governments. This can lead to results that differ from traditional tax competition models.

In summary, the negative aspects of tax competition are:

  • Decrease in tax revenue: Long-term shortage of tax revenue due to tax rate reductions
  • Deterioration of public service quality: Reduction of services due to budget shortages
  • Regional imbalances: Uneven distribution of public services due to fiscal disparities
  • Excessive fiscal expenditures: Financial burden from various incentives to attract businesses

However, tax competition is not always negative. Brennan and Buchanan (1980) argued that tax competition can play a positive role in restraining the excessive expansion of government.

Research on tax competition in regions with overlapping tax jurisdictions is still limited. This study aims to analyze the dynamics of tax competition in such situations using a game-theoretic approach. This can contribute to the formulation of effective fiscal policies in regions with complex administrative structures.

Model

The Need for a Toy Model

It is extremely difficult to create a model that considers all the detailed aspects of the complex administrative system described earlier. In such cases, an effective approach is to create a toy model that removes complex details and represents only the core elements of the actual system. For example, to understand the principles of how a car moves, looking at the entire engine would be complicated, as it involves various elements like the fuel used, fuel injection method, cylinder arrangement, etc. However, through a toy model like a bumper car, one can, with relative ease, learn the basic principle that pressing the accelerator makes the car move forward, and pressing the brake makes it stop.

Similarly, in this study, we plan to use a toy model that removes complex elements in order to first understand the core mechanisms of tax rate competition. After understanding the core mechanisms through this study, the goal is to gradually add more complex elements to create a model that is closer to the actual system.

Assumptions

In game theory, a "game" refers to a situation where multiple players choose their own strategies and interact with each other. Each player tries to choose one or more optimal strategies to achieve their own goals, considering other players’ strategies.

In this study, we constructed a game by adding one overlapping region to the toy models of Itaya et al. (2008) and Ogawa and Wang (2016). Building on the Solow model, the canonical model of long-run economic growth for which Robert Solow received the Nobel Prize, both papers mainly investigate capital tax rate competition between two regional governments within a country.

Specifically, there is a hypothetical country divided into three regions: two independent regions with asymmetric production technologies and capital endowments, S and L, and a third region, O, which overlaps with the other two. The three regions have independent authority to impose capital taxes, with tax rate $\tau_i$ for region $i$. The non-overlapping parts of S and L are defined as SS (Sub-S) and SL (Sub-L), respectively, and the overlapping regions between S and O, and between L and O, are denoted OS and OL, respectively (refer to Figure 2). S and L are higher-level jurisdictions that provide a generic public good $G$, while O is a special-purpose jurisdiction that provides a specific public good $H$ tied to O.

Figure 2. An example of tax jurisdictions S, L, and O

To intuitively observe only the "effect of the capital tax rate" arising from the existence of overlapping regions, it is assumed that the populations of regions S and L are the same. All residents in the country have the same preferences and inelastically supply one unit of labor to firms in their region. This is a strong assumption because, under any circumstances, residents do not move and continue to work for their current employer. However, it was necessary to simplify the game. Furthermore, it is assumed that firms employing residents in each region produce homogeneous consumer goods.

As mentioned above, S and L are assumed to have different capital factors and production technologies. Expressing this in a formula, the average capital endowment per person for the entire country is $\bar{k}$, and the average capital endowment per person for regions S and L are as follows:

[ \overline{k_S}\equiv\bar{k}-\varepsilon, \qquad \overline{k_L}\equiv\bar{k}+\varepsilon, \qquad \text{where } \varepsilon\in\left(0,\ \bar{k}\right] \text{ and } \bar{k}\equiv\frac{\overline{k_S}+\overline{k_L}}{2} ]

Even though the capital endowment may differ, capital can move freely. In other words, when a resident of region S invests capital in L, the cost is no different than investing capital in S.

To briefly introduce a few more necessary variables, the amount of capital demanded in region $i$ is denoted $K_i$, the amount of labor supplied is $L_i$, and the labor and capital productivity coefficients are $A_i$ and $B_i > 2K_i$, respectively. Although regions S and L differ in capital production technology, there is no difference in labor production technology, so $A_L = A_S$ while $B_L \neq B_S$. Region O does not occupy new territory in the hypothetical country but overlaps the S and L regions in equal proportions in terms of area and population. Therefore, $A_O = A_L = A_S$, and $B_O$ is the weighted average of $B_L$ and $B_S$, with the weights being the proportions of capital invested in the OL and OS regions.

Market Equilibrium

Utilizing the variables introduced above, the production function under constant returns to scale (CRS) for region $i$ used in this study can be expressed as follows.

[ F_i\left(L_i,\ K_i\right)=A_iL_i+B_iK_i-\frac{K_i^2}{L_i} ]

Based on this production function, it is assumed that firms maximize their profits, and market equilibrium is assumed to occur when the total capital endowment and capital investment demand are equal. Based on this, we can calculate the capital demand and interest rates at market equilibrium:

[\begin{align*}
r^* &= \frac{1}{2}\left(\left(B_S+B_L\right)-\left(\tau_S+\tau_L+\left(2-\alpha_S-\alpha_L\right)\tau_O\right)\right)-2\bar{k} \\
K_S^* &= l\left(\bar{k}+\frac{1}{4}\left(\left(\tau_L-\tau_S-\left(\alpha_L-\alpha_S\right)\tau_O\right)-\left(B_L-B_S\right)\right)\right) \\
K_L^* &= l\left(\bar{k}+\frac{1}{4}\left(\left(\tau_S-\tau_L+\left(\alpha_L-\alpha_S\right)\tau_O\right)+\left(B_L-B_S\right)\right)\right) \\
K_{SS}^* &= \frac{2l}{3}\left(\bar{k}+\frac{1}{4}\left(\left(\tau_L-\tau_S+\left(2-\alpha_L-\alpha_S\right)\tau_O\right)-\left(B_L-B_S\right)\right)\right) \\
K_{SL}^* &= \frac{2l}{3}\left(\bar{k}+\frac{1}{4}\left(\left(\tau_S-\tau_L+\left(2-\alpha_L-\alpha_S\right)\tau_O\right)+\left(B_L-B_S\right)\right)\right) \\
K_O^* &= \frac{2l}{3}\left(\bar{k}-\frac{1}{2}\left(2-\alpha_L-\alpha_S\right)\tau_O\right)
\end{align*}]

Here, $K_{SS}=\alpha_S K_S$, $K_{SL}=\alpha_L K_L$ while $0<\alpha_S, \alpha_L < 1$. Furthermore, we define $\theta$ as equal to $B_L-B_S$.
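The closed-form expressions above can be transcribed directly into a small function; this is only a numerical convenience for checking comparative statics, and the parameter values in the example call are arbitrary illustrations.

```python
def market_equilibrium(tau_S, tau_L, tau_O, B_S, B_L, alpha_S, alpha_L, k_bar, l):
    """Capital demands and the interest rate at market equilibrium, transcribing
    the closed-form expressions above (symbols follow the text)."""
    gap = (2 - alpha_S - alpha_L) * tau_O
    theta = B_L - B_S
    r = 0.5 * ((B_S + B_L) - (tau_S + tau_L + gap)) - 2 * k_bar
    K_S = l * (k_bar + 0.25 * ((tau_L - tau_S - (alpha_L - alpha_S) * tau_O) - theta))
    K_L = l * (k_bar + 0.25 * ((tau_S - tau_L + (alpha_L - alpha_S) * tau_O) + theta))
    K_SS = (2 * l / 3) * (k_bar + 0.25 * ((tau_L - tau_S + gap) - theta))
    K_SL = (2 * l / 3) * (k_bar + 0.25 * ((tau_S - tau_L + gap) + theta))
    K_O = (2 * l / 3) * (k_bar - 0.5 * gap)
    return {"r": r, "K_S": K_S, "K_L": K_L, "K_SS": K_SS, "K_SL": K_SL, "K_O": K_O}

# Example call with arbitrary illustrative parameter values.
print(market_equilibrium(tau_S=0.10, tau_L=0.20, tau_O=0.05,
                         B_S=8.0, B_L=9.0, alpha_S=0.7, alpha_L=0.8,
                         k_bar=1.0, l=1.0))
```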

Residents maximize their post-tax income by investing capital, earning income from the market equilibrium return on capital $r^{\ast}$, and consuming it all. It is assumed that each tax jurisdiction provides public goods through taxes, and selects an optimal capital tax rate $\tau_i^{\ast}$ to maximize the social welfare function, which is represented as the sum of individual consumption and the provision of public goods.

[ u\left( C_i, G_i, H_i \right) \equiv C_i + G_i + H_i =
\begin{cases}
l\left( w_i^* + r^* \bar{k}_i \right) + K_i^* \tau_i + (1 - \alpha_i) K_i^* \tau_O, & \text{for } i \in \{S, L\} \\
\dfrac{2l}{3}\left( w_O^* + r^* \bar{k} \right) + K_O^* \tau_O + (1 - \alpha_S) K_S^* \tau_S + (1 - \alpha_L) K_L^* \tau_L, & \text{for } i = O
\end{cases}]

Then, tax rates at the market equilibrium are:

[\begin{align*}
\tau_S^\ast &= \frac{4\varepsilon}{3}-\frac{\theta}{3}+\frac{\tau_L}{3}-\frac{2-3\alpha_S+\alpha_L}{3}\tau_O \\
\tau_L^\ast &= -\frac{4\varepsilon}{3}+\frac{\theta}{3}+\frac{\tau_S}{3}-\frac{2-3\alpha_L+\alpha_S}{3}\tau_O \\
\tau_O^\ast &= \frac{3\left(\alpha_L+\alpha_S\right)-4}{\left(2-\left(\alpha_L+\alpha_S\right)\right)\left(\alpha_L+\alpha_S\right)}\bar{k}=\Gamma\bar{k}
\end{align*}]

Nash Equilibrium and Simulations

Nash Equilibrium

First, to briefly explain the Nash equilibrium, it refers to a state in which every player in a game has made their best possible choice and has no incentive to change their strategy any further. In other words, in a Nash equilibrium, no player can improve their outcome by changing their strategy, so all players continue to maintain their current strategies.

The tax rate at market equilibrium calculated in the previous section can be expressed as the optimal response function for each region. This is because, given the strategies of other regions, each region aims to maximize the social welfare function with its optimal strategy. In other words, this function shows which tax rate is most advantageous for region $i$ when considering the tax rates of other regions. Therefore, the Nash equilibrium tax rate can be derived based on the market equilibrium tax rate, and it can be expressed in the following formula.

[\begin{align*}
\tau_S^N &= \varepsilon-\frac{\theta}{4}-\Gamma\left(1-\alpha_S\right)\bar{k} \\
\tau_L^N &= -\left(\varepsilon+\frac{\theta}{4}\right)-\Gamma\left(1-\alpha_L\right)\bar{k} \\
\tau_O^N &= \Gamma\bar{k}
\end{align*}]

Furthermore, we can derive the following lemma and proposition.

Lemma 1. The sign of $\Phi\equiv\varepsilon-\frac{\theta}{4}$ determines the net capital position of S and L: L is the net capital exporter when $\Phi$ is positive, and the net capital importer when it is negative.

Proposition 1. The sign of $\Gamma\equiv\frac{3\left(\alpha_L+\alpha_S\right)-4}{\left(2-\left(\alpha_L+\alpha_S\right)\right)\left(\alpha_L+\alpha_S\right)}$ determines the effective tax rate of O, and $\alpha_L + \alpha_S$ must be greater than $4/3$ for O to provide a positive amount of the special public good $H$.

Additionally, the capital demand and interest rates at the Nash equilibrium are as follows.

[\begin{align*}
r^N &= \frac{1}{2}\left(B_S+B_L\right)-2\bar{k} \\
K_S^N &= l\left(\bar{k}_S+\frac{1}{2}\left(\varepsilon-\frac{\theta}{4}\right)\right) \\
K_L^N &= l\left(\bar{k}_L-\frac{1}{2}\left(\varepsilon-\frac{\theta}{4}\right)\right) \\
K_{SS}^N &= \frac{2l}{3}\left(\bar{k}_S+\frac{1}{2}\bar{k}\Gamma\left(1-\alpha_S\right)-\frac{1}{2}\left(\varepsilon+\frac{\theta}{4}\right)\right) \\
K_{SL}^N &= \frac{2l}{3}\left(\bar{k}_L+\frac{1}{2}\bar{k}\Gamma\left(1-\alpha_L\right)+\frac{1}{2}\left(\varepsilon+\frac{\theta}{4}\right)\right) \\
K_O^N &= \frac{l}{3}\cdot\frac{4-\alpha_L-\alpha_S}{\alpha_L+\alpha_S}\bar{k}
\end{align*}]

Simulations and Results

Since the Nash equilibrium is nonlinear, the results can be visualized through simulations. By adjusting $\alpha_S$, $\alpha_L$, $\varepsilon$, and $\theta$ in the simulation, one can see that, as stated in Lemma 1, when $\Phi$ is greater than 0 the capital demand in region L decreases and the capital demand in region S increases, leading L to export capital and S to import it (see Figure 3).

Figure 3. Net capital position of S and L based on changes in $\Phi$
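A minimal sketch of such a simulation follows, transcribing the Nash-equilibrium closed forms above; the parameter values are illustrative, and the sweep varies $\theta$ with $\varepsilon$ fixed so that $\Phi$ moves from positive to negative.

```python
import numpy as np

def nash_equilibrium(eps, theta, alpha_S, alpha_L, k_bar=1.0, l=1.0):
    """Nash tax rates and regional capital demands, transcribing the closed forms above."""
    a = alpha_S + alpha_L
    Gamma = (3 * a - 4) / ((2 - a) * a)
    Phi = eps - theta / 4
    tau_S = eps - theta / 4 - Gamma * (1 - alpha_S) * k_bar
    tau_L = -(eps + theta / 4) - Gamma * (1 - alpha_L) * k_bar
    tau_O = Gamma * k_bar
    K_S = l * ((k_bar - eps) + 0.5 * Phi)   # k_bar_S = k_bar - eps
    K_L = l * ((k_bar + eps) - 0.5 * Phi)   # k_bar_L = k_bar + eps
    return {"Phi": Phi, "tau_S": tau_S, "tau_L": tau_L, "tau_O": tau_O,
            "K_S": K_S, "K_L": K_L}

# Sweep theta so that Phi moves from positive to negative: when Phi > 0,
# L's capital demand falls below its endowment (L exports capital) and S imports.
for theta in np.linspace(0.0, 0.8, 5):
    out = nash_equilibrium(eps=0.1, theta=theta, alpha_S=0.8, alpha_L=0.8)
    print(round(out["Phi"], 3), round(out["K_S"], 3), round(out["K_L"], 3))
```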

Based on the Nash equilibrium results, the utility that the representative residents of each region derive from public goods can be summarized as follows:

[ u_p(G_i, H_i) =
\begin{cases}
\frac{K_S^N \tau_S}{l} & \text{for } i = SS \\
\frac{K_L^N \tau_L}{l} & \text{for } i = SL \\
\frac{K_S^N \tau_S}{l} + \frac{3 K_O^* \tau_O}{2l} & \text{for } i = OS \\
\frac{K_L^N \tau_L}{l} + \frac{3 K_O^* \tau_O}{2l} & \text{for } i = OL
\end{cases}]

This utility function is visualized in Figure 4.

Figure 4. Representative residents’ utility from public goods based on changes in $\Phi$

To summarize, region O is not directly affected by the tax rates of S and L, but it is directly influenced by the ratio of allocated resources, i.e., the sum of $\alpha_L$ and $\alpha_S$. Compared with the scenario in which no such jurisdiction exists, the tax rates of S and L change by $\Gamma\left(1-\alpha_S\right)\bar{k}$ and $\Gamma\left(1-\alpha_L\right)\bar{k}$, respectively, due to the existence of the special-purpose jurisdiction O. Additionally, the net capital position is determined independently of the special-purpose jurisdiction.

Conclusion

This study examined how tax competition unfolds in regions with overlapping tax jurisdictions, using ideas from game theory. Specifically, a simplified toy model was constructed to understand the impact of tax competition in complex administrative structures, and a Nash equilibrium was derived from it. Through this, it was identified that in the presence of overlapping tax jurisdictions, the patterns and outcomes of tax competition differ in certain ways from those predicted by the models of Itaya et al. (2008) and Ogawa and Wang (2016).

However, this study has several limitations. First, the use of a toy model simplifies the complex nuances of real-world scenarios, meaning it does not fully reflect the various factors that could occur in practice. For instance, factors such as population mobility, governmental policy responses, various tax bases, and income disparities among residents were excluded from the model, which limits the generalizability of the results. Second, the economic variables assumed in the model, such as differences in capital endowments, production technologies, and resident preferences for public goods, may differ from reality, necessitating a cautious approach when applying these findings to real-world situations.

Despite these limitations, this study provides an important theoretical foundation for understanding the dynamics of tax competition in regions with overlapping administrative structures. Specifically, it suggests the need for a policy approach that considers the interaction between capital mobility and public goods provision, rather than merely focusing on a “race to the bottom” in tax rates. Future research should expand the model used in this study to include additional variables such as population mobility and governmental policy responses. Moreover, it will be essential to examine how tax competition evolves in repeated games. Finally, testing the model under various economic and social conditions will be crucial to improving its reliability for practical policy applications. By doing so, we can more accurately assess the real-world impacts of tax competition in complex administrative structures and contribute to designing effective policies.

References

Brennan, G., & Buchanan, J. M. (1980). The power to tax: Analytical foundations of a fiscal constitution. Cambridge University Press.

Keen, M., & Kotsogiannis, C. (2002). Does federalism lead to excessively high taxes? American Economic Review, 92(1), 363-370.

Itaya, J.-i., Okamura, M., & Yamaguchi, C. (2008). Are regional asymmetries detrimental to tax coordination in a repeated game setting? Journal of Public Economics, 92(12), 2403–2411.

Jeong, J. (2021). A Study on the Improvement of the Flexible Tax Rate System. Korea Institute of Local Finance.

Oates, W. E. (1972). Fiscal federalism. Harcourt Brace Jovanovich.

Ogawa, H., & Wang, W. (2016). Asymmetric tax competition and fiscal equalization in a repeated game setting. International Review of Economics & Finance, 41, 1–10.

Tiebout, C. M. (1956). A pure theory of local expenditures. Journal of Political Economy, 64(5), 416-424.

Wilson, J. D. (1986). A theory of interregional tax competition. Journal of Urban Economics, 19(3), 296-315.

Zodrow, G. R., & Mieszkowski, P. (1986). Pigou, Tiebout, property taxation, and the underprovision of local public goods. Journal of Urban Economics, 19(3), 356-370.

Jeonghun Song (MSc Data Science, 2023)
With economic uncertainty both domestically and globally, the surge in energy-related raw material prices this winter was 'expected.' Experts are urging the need to accurately predict winter energy consumption and come up with strategies to save energy. However, the industry is questioning the methods previously used to estimate energy consumption, claiming that these methods do not reflect reality.

How can energy consumption be predicted accurately? What other impacts could come from accurately predicting energy usage? This article aims to explain a statistical method based on the joint probability distribution model to predict energy consumption more realistically in simple terms for the public.

Figure 1. According to the Chicago Mercantile Exchange (CME), on August 11, the spot price of LNG on the Dutch TTF surged to $62.5 per MMBtu. / Source: CME

Global Raw Materials Supply Crisis

According to the Chicago Mercantile Exchange (CME) on August 11, the spot price of European LNG surged to \$62.5 per MMBtu on August 2. This is 6 to 7 times higher than last year’s \$8-10 for the same period and close to the record high of \$63 set in March of this year.

Experts believe the sharp rise in European LNG prices is due to Russia’s 'tightening' of natural gas supplies. Amid the ongoing Russia-Ukraine war, the West, including the U.S., has refused to pay for raw materials in rubles, pressuring Russia. In response, Russia has significantly reduced its natural gas supply.

In fact, Russia has completely halted natural gas supplies to Bulgaria, Poland, the Netherlands, Finland, Latvia, and Denmark, all of which refused to pay in rubles. At the end of last month, it also drastically cut the supply through the Nord Stream 1 pipeline to Germany, its biggest customer, to 20% of capacity. As Europe struggles with gas shortages, it has been pulling in all available global LNG cargoes, driving Northeast Asia's LNG spot price up to \$50 per MMBtu as of July 27.

Adding to the LNG supply crisis, a June explosion at the U.S.'s largest LNG export facility, Freeport LNG, which exports 15 million tons of LNG annually, has limited operations until the end of the year. Meanwhile, Australia, the world’s largest LNG exporter, is considering restricting natural gas exports under the pretext of stabilizing raw material prices. As a result, the industry expects a 'dark period' in raw material supply in the second half of this year.

The problem is that these geopolitical issues are severely impacting South Korea’s raw material supply as well. Even in the low-demand summer and fall seasons, LNG spot prices are nearing record highs, and the industry consensus is that winter, with its heating demands, will see prices rise to unimaginable levels.

'Predicted' Energy Crisis

Experts warn that in this 'predicted' energy crisis, the winter LNG spot price may far exceed the record high of \$63 per MMBtu from March, possibly surpassing \$100. They emphasize that South Korea must accurately forecast winter energy consumption now and begin conserving resources through energy-saving measures.

So, how is energy consumption in South Korea estimated, and how accurate are these estimates? To understand this, we first need to examine how electricity and gas are consumed in the country.

Energy consumption, including electricity and gas, occurs not only in households but also in non-residential buildings such as office and commercial facilities. The energy use in non-residential buildings varies greatly depending on the building's purpose. According to the "Energy Usage by Purpose [kWh/y]" data from the New and Renewable Energy Center of the Korea Energy Agency, there are significant differences in energy consumption depending on the building’s use.

Additionally, using the 'average energy consumption per unit area' from the table below, we can estimate the total annual energy consumption of a specific building. This is calculated by multiplying the annual average usage figure for the building’s purpose by its total floor area. For example, the estimated annual energy consumption for an office building with a floor area of 1,000 square meters would be 371,660 kWh.

Average annual energy consumption per unit area by building purpose (kWh/m²·yr):

[\begin{array}{c|c|c|c|c|c|c}
\hline
\textbf{Office} & \textbf{Sales/Business} & \textbf{Medical} & \textbf{Education/Research} & \textbf{Elder Care} & \textbf{Accommodation} & \textbf{Religious} \\
\hline
371.66 & 408.45 & 643.52 & 231.33 & 175.58 & 526.55 & 257.49 \\
\hline
\end{array}]

Widely Used Energy Consumption Estimates

These energy consumption estimates for individual buildings can be widely applied. As we’ve seen, with energy raw material prices expected to hit record highs, these estimates can help ensure that 'expensive' energy is not wasted and is efficiently distributed.

Additionally, the Korea Energy Agency, which provided the above data, actively uses these statistics to calculate the required amount of renewable energy for public buildings. For example, when a public building is scheduled for new construction or expansion and plans to generate a certain amount of renewable energy, the energy consumption estimates are compared to the expected energy output. This comparison helps determine whether the building is producing enough renewable energy.

Moreover, these energy usage estimates are not limited to individual buildings but can be extended to areas or regions. For instance, if a large-scale building or new city district is planned within a specific urban area, regional energy demand will naturally increase as the buildings are constructed.

However, a limitation of this data is that the estimates rely on a simple one-variable regression, with energy consumption as the dependent variable and floor area as the independent variable. In reality, building energy consumption is influenced by various factors such as heating and cooling systems, building materials and structure, and insulation quality. Thus, explaining energy use based solely on 'floor area' reduces accuracy.

Therefore, government agencies and public corporations responsible for energy management must strive to estimate the increased energy demand from new buildings as accurately as possible. This is crucial for efficient decision-making related to energy supply, production, and infrastructure investment. A precise model to estimate energy consumption for individual buildings is clearly necessary for this purpose.

Existing Energy Estimation Studies Based on Regression Analysis

Ideally, accurately estimating a building’s energy consumption would involve analyzing all detailed characteristics, such as heating and cooling systems, building materials and structure, insulation, occupancy, and schedules. This type of estimation model is known as a Physical Model.

However, predicting energy usage using a physical model is not practical. Most construction companies do not disclose all information, especially for new buildings. While collecting this data directly from the builders may be possible for a single building, doing so for an entire district or region would result in astronomical costs.

Therefore, from a research perspective, it’s best to use a statistical model that estimates energy consumption based on a few simple building attributes. In other words, creating a regression model where the dependent variable is a building's energy consumption, and the independent variables are attributes such as floor area, purpose, number of floors, age, and materials.

Regression analysis is a well-known statistical method for identifying correlations between observed independent variables and a dependent variable. Researchers can use regression analysis to statistically test how much a change in an independent variable influences the dependent variable and, further, predict the dependent variable's value from the independent variables. To ensure a reasonable analysis, researchers must also consider mathematical and statistical assumptions, such as whether their model violates the Gauss-Markov assumptions. Details on these considerations will be discussed in the later part of this research.

To conduct regression model research with monthly energy consumption of individual buildings as the dependent variable, data is required. In South Korea, monthly energy consumption records for non-residential buildings are made available through the Building Data Open System. Information about building attributes, which serves as independent variables, is recorded in the title section and is also provided by the Building Data Open System. This allows anyone to combine monthly energy consumption data with title section data to carry out such research.

Returning to the main point, due to the practical 'cost' issue and the ease of data collection for regression model research, previous studies estimating the energy consumption of individual buildings have primarily used regression-based statistical models. A notable domestic study is “Development of Standard Models for Building Energy in Seoul’s Residential/Commercial Sector” (Kim Min-kyung et al., 2014). This research derived a model by performing linear regression of monthly electricity usage on various independent variables and monthly dummy variables (indicator variables that take the value 1 for a given month and 0 otherwise). Similarly, in a prominent overseas study on heating energy estimates, a model was derived by regressing 'per unit area' monthly heating energy consumption during the heating season against building and climate-related independent variables.
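A rough sketch of that regression design is shown below, using synthetic data and hypothetical column names in place of the Building Data Open System fields.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic building panel with hypothetical columns standing in for the
# Building Data Open System fields (floor area, age, monthly energy use).
rng = np.random.default_rng(5)
n_buildings, months = 200, 12
df = pd.DataFrame({
    "building_id": np.repeat(np.arange(n_buildings), months),
    "month": np.tile(np.arange(1, months + 1), n_buildings),
    "floor_area": np.repeat(rng.uniform(500, 5000, n_buildings), months),
    "age": np.repeat(rng.integers(1, 40, n_buildings), months),
})
seasonal = 1 + 0.3 * np.cos(2 * np.pi * (df["month"] - 8) / 12)   # summer peak
df["kwh"] = 0.3 * df["floor_area"] * seasonal + rng.normal(0, 100, len(df))

# Monthly dummies via C(month) capture the seasonal trend alongside the
# building attributes.
fit = smf.ols("kwh ~ floor_area + age + C(month)", data=df).fit()
print(fit.params.head())
```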

'Monthly' Energy Usage Trends

One common feature of the studies reviewed earlier is that the dependent variable in the regression models is not 'annual' energy consumption, but 'monthly' energy consumption. This is to reflect the seasonal trends in energy usage. For example, electricity usage is higher in the summer due to air conditioning, and gas consumption is higher in the winter due to heating. It’s no surprise that electricity usage peaks in July and August, while gas consumption is highest from December to February. In fact, most buildings exhibit similar 'seasonal trends' in energy consumption, as shown in Figure 3.

Figure 3. Typical seasonal trends in electricity and gas usage in buildings.

Therefore, when planning energy supply and maintenance for energy production facilities, it is crucial to accurately predict monthly energy demand by considering seasonal fluctuations. This ensures that sufficient energy is available during high-consumption periods to prevent blackouts, and that energy reserves are minimized during low-consumption periods, allowing for efficient use of government budgets. However, previous studies' energy consumption estimates have not been widely adopted in the industry due to their lack of accuracy and failure to reflect reality. This is because traditional regression models did not incorporate a 'joint' probability distribution model based on the second moment for monthly energy usage.

Hidden Factors Among Variables

Consider two hypothetical office buildings with nearly identical attributes but differing actual energy usage. Both buildings are categorized as office buildings, with similar floor area, number of floors, age, and building materials. However, in one building, employees frequently work overtime and on weekends, using air conditioning extensively, resulting in high electricity consumption. In contrast, the other building emphasizes energy saving, with employees leaving on time daily, leading to much lower energy use.

In this case, even though the explanatory variable values for the two buildings are very similar, their actual electricity usage would differ significantly. The first building would use more electricity compared to the average office building of similar size and materials, while the second would use less. This means that the energy consumption of two buildings with identical attributes like floor area, number of floors, and materials would vary due to the hidden variable of "whether employees leave on time." Since it's practically impossible to collect data on the work hours of all employees in a building, including this variable in existing models is not feasible.

Of course, regression analysis accounts for such variability through the error term. The energy consumption of average buildings is calculated by setting the error term to zero, while buildings that consume more than average will have a positive error term, and those that consume less will have a negative one.

Correlation Among Dependent Variables

In proper research, not only are the coefficient estimates for each explanatory variable provided in a regression model, but so is the estimate of error variance. Using this error variance estimate, the expected energy usage for each month can be obtained as both a point estimate and a confidence interval. In a normal regression model, this confidence interval would cover most of the variability in energy usage mentioned earlier. However, mathematically, one more factor needs to be considered: the 'correlation among energy usage in different months.'

For example, if the electricity consumption in August of a building that frequently has overtime and uses a lot of air conditioning is significantly higher than other similar-sized buildings, it is likely that this building will also consume more electricity in other months, from January to December, compared to other similar buildings. Similarly, if a building that focuses on energy saving has low electricity usage in August, it will likely consume less electricity in other months as well.

This is mathematically referred to as a 'positive correlation.' Previous regression-based studies did not account for this positive correlation. For instance, if we assume that monthly electricity usage follows a probability distribution with the average usage predicted by the existing regression model, and we draw samples of monthly electricity usage for a specific building, it's possible that the sample value for July might be much higher than average, while the sample value for August could be much lower than average.

Common sense tells us that a building that used significantly more electricity than similar buildings in July is unlikely to use much less electricity than other buildings in August. In other words, if the regression model captures all relevant information, the samples of electricity usage for July and August for the same building should be positively correlated—they should both be high or both be low. However, if there is no second moment value (i.e., 'covariance') between the error terms for the two months, such unrealistic samples may occur.

Covariance Among Error Terms

Let’s examine this more mathematically. When viewing the monthly electricity usage (January, February, …, December) for a building over a year as a 12-dimensional vector random variable, previous studies have estimated the first moment vector and the diagonal components of the second moment matrix (the variance of error terms for each month). The first moment vector is obtained by inputting the explanatory variable values into the regression equation and setting the error term to zero. The diagonal components of the second moment matrix correspond to the estimated variances of the error terms for each month. However, previous studies did not estimate the off-diagonal components of the second moment matrix—i.e., the 'covariance' between the error terms of different regression equations—leading to difficulties in accurately modeling real-world scenarios.
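A minimal sketch of this step in Python is shown below. The residuals array is a hypothetical placeholder standing in for the per-month regression residuals of many buildings, not the actual dataset:

import numpy as np

# Hypothetical stand-in: an (n_buildings x 12) array of regression residuals,
# one column per month, taken from twelve per-month regression fits.
rng = np.random.default_rng(0)
residuals = rng.normal(size=(500, 12))

mu_hat = residuals.mean(axis=0)                    # first moment of the errors (close to zero by construction)
var_diag = np.diag(residuals.var(axis=0, ddof=1))  # what previous studies estimate: the diagonal only
sigma_full = np.cov(residuals, rowvar=False)       # full 12x12 second moment matrix, covariances included

print(sigma_full[6, 7])    # estimated covariance between the July and August error terms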

If, in addition to the first moment vector, the second moment matrix including the covariances is fully estimated, the multivariate normal distribution of the 12-dimensional random vector can be specified mathematically. In practical terms, this makes it possible to sample monthly electricity usage for a specific building while accounting for the 'correlation between energy usage in different months.' A building that uses significantly more electricity than similar buildings in July would then also be expected to use more electricity in August.

These accurately generated samples (monthly energy estimates) can greatly help urban energy-related research by allowing statistical analysis of uncertainties. Additionally, if some monthly energy usage data are missing, the second moment matrix can be used to estimate (impute) the missing values, thereby significantly improving the quality of the data.

For the multivariate normal distribution to be an appropriate model here, however, the data must be roughly symmetric around the mean, and the tails of the distribution must not be excessively thick or thin. The 2021 building data used in this discussion (energy usage, floor area, building purpose, and so on) are generally in line with these assumptions.

Sample Extraction Using Multivariate Normal Distribution

By defining the multivariate normal distribution based on the second moment matrix, it is possible to extract samples of monthly energy usage (January, February, …, December) for an entire year. This approach differs from previous studies because it accounts for the correlation between residuals in the regression model, thus incorporating 'seasonal trends' when generating samples. In simple terms, a building that used significantly more electricity than similar buildings in July can now also be estimated to use more electricity in August.
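A minimal sketch of such sampling is given below. The mean vector and covariance matrix are illustrative placeholders (the covariance matrix is built with a positive cross-month correlation of 0.6), not values estimated from the 2021 data:

import numpy as np

rng = np.random.default_rng(42)

# Illustrative placeholders: regression point estimates for one building (kWh)
# and an error covariance matrix with positive correlation between months.
mu_pred = np.array([320, 300, 280, 260, 270, 310, 420, 450, 340, 280, 290, 330], dtype=float)
sigma_full = 400 * (0.6 * np.ones((12, 12)) + 0.4 * np.eye(12))

# 1,000 simulated years of monthly usage, with cross-month correlation intact.
samples = rng.multivariate_normal(mu_pred, sigma_full, size=1000)

# Because the off-diagonal covariances are positive, a simulated year that is
# high in July (index 6) tends to be high in August (index 7) as well.
print(np.corrcoef(samples[:, 6], samples[:, 7])[0, 1])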

Example Reflecting Covariance

To validate this claim, let's examine the energy usage data samples drawn from the multivariate normal distribution in the figure below.

Figure 4. Samples of monthly energy usage vectors based on given building attributes.

The figure shows that the seasonal energy usage trends in the samples closely match the actual data. For example, electricity consumption rises sharply during the summer months (July-August), when air conditioning is heavily used, while gas consumption increases during the winter months (December-February), when heating demand is high. This suggests that the statistical model reflects reality well.

Example Without Covariance

Now, let’s see what happens when we extract samples without considering the covariance between energy usage in different months, as in previous studies. This is equivalent to setting the off-diagonal elements of the covariance matrix to zero in the multivariate normal distribution used for the sample extraction.
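Under the same illustrative placeholders as the sketch above, zeroing the off-diagonal elements reproduces the behaviour of the earlier studies:

import numpy as np

rng = np.random.default_rng(42)
mu_pred = np.array([320, 300, 280, 260, 270, 310, 420, 450, 340, 280, 290, 330], dtype=float)
sigma_full = 400 * (0.6 * np.ones((12, 12)) + 0.4 * np.eye(12))

sigma_indep = np.diag(np.diag(sigma_full))    # off-diagonal covariances set to zero, as in previous studies
samples_indep = rng.multivariate_normal(mu_pred, sigma_indep, size=1000)

# July and August are now drawn independently, so a year far below average in
# July can easily come out far above average in August.
print(np.corrcoef(samples_indep[:, 6], samples_indep[:, 7])[0, 1])   # close to 0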

Figure 5. Unrealistic samples generated when covariance is not considered.

If a building exhibits significantly lower energy usage in July, it would be reasonable to expect that it consistently uses less energy, meaning its August usage should also be below average.

However, in this case the model fails to incorporate the covariance information, and the results become unrealistic. As the figure illustrates, a building that consumed much less electricity than similar buildings in July is drawn as using much more electricity than the others in August, which defies typical expectations.

Missing Data Estimation Using Multivariate Normal Distribution

In addition to sample extraction, another application is missing data estimation (imputation). For example, the Ministry of Land, Infrastructure, and Transport data sometimes has missing monthly energy usage for certain buildings, or some recorded values may be abnormal. If correct usage data exists for the other months, can we estimate the missing usage based on the recorded values?

If usage is recorded for the first and last months of a three-month stretch but missing for the month in between, we might compromise by interpolating between the two recorded values. But what if two consecutive months are missing? Or what if the final month's usage is missing, so there is no following value to interpolate with? What should be done then?

\begin{equation} \label{eq:conditional-mvn}
\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \sim \mathrm{MVN}\!\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},\; \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)
\;\Rightarrow\;
P\left(z_1 \,\middle|\, z_2 = a\right) = \mathrm{MVN}\!\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}\left(a - \mu_2\right),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right)
\end{equation}

Using the multivariate normal distribution derived in this study, missing values can be estimated sensibly in any of these cases. As the formula shows, when some elements of a random vector following a multivariate normal distribution are fixed, the remaining elements follow a lower-dimensional conditional multivariate normal distribution given the fixed values. The missing values can therefore be estimated by the conditional mean of this distribution.
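A minimal sketch of this conditional-mean imputation is shown below. The mean vector, covariance matrix, and observed values are illustrative placeholders, with February, July, and October (indices 1, 6, and 9) treated as missing, mirroring the example in Figure 6:

import numpy as np

def conditional_mean(mu, sigma, obs_idx, obs_val):
    """Conditional mean of the missing components of a multivariate normal
    vector, given the observed components (the formula above)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    miss_idx = np.setdiff1d(np.arange(len(mu)), obs_idx)
    s12 = sigma[np.ix_(miss_idx, obs_idx)]   # Sigma_12
    s22 = sigma[np.ix_(obs_idx, obs_idx)]    # Sigma_22
    cond = mu[miss_idx] + s12 @ np.linalg.solve(s22, obs_val - mu[obs_idx])
    return miss_idx, cond

# Illustrative placeholders, not the actual 2021 estimates.
mu_pred = np.array([320, 300, 280, 260, 270, 310, 420, 450, 340, 280, 290, 330], dtype=float)
sigma_full = 400 * (0.6 * np.ones((12, 12)) + 0.4 * np.eye(12))

obs_idx = np.array([0, 2, 3, 4, 5, 7, 8, 10, 11])                          # months with recorded usage
obs_val = np.array([335, 290, 255, 268, 325, 470, 350, 295, 340], dtype=float)

miss_idx, imputed = conditional_mean(mu_pred, sigma_full, obs_idx, obs_val)
print(miss_idx, imputed)   # imputed February, July, and October usage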

Figure 6. Example of Missing Data Imputation Using Conditional Multivariate Normal Distribution

The graph above shows missing values filled in using the conditional mean of the multivariate normal distribution. The blue solid line represents the actual monthly energy usage of a building, while the orange circles represent the conditional mean for February, July, and October, assuming these months' usage is missing. The green squares represent the conditional mean for October to December, assuming the usage from January to September is given, which can be viewed as future usage predictions. The conditional mean does not deviate much from the actual values, indicating that using the conditional mean to estimate missing values is reasonable.

All in all, accurate energy consumption forecasting requires a statistical approach that goes beyond simple regression models, taking into account the correlations between various variables and complex factors. By using a multivariate normal distribution model, it is possible to make more realistic predictions by considering the correlations between monthly energy consumption, which can improve the efficiency of energy supply planning. This approach can also be useful for addressing statistical errors overlooked in previous studies and for imputing missing data. Ultimately, more accurate energy consumption forecasting will serve as crucial foundational data for preparing for winter energy crises, while also contributing to improving energy efficiency and preventing resource waste.

To view the article in Korean, please click here.
