Proposition of a hybrid price index formula for the Consumer Price Index measurement

Research background: The Consumer Price Index (CPI) is a basic, commonly accepted and widely used measure of inflation. The index is a proxy for changes in the costs of household consumption and it assumes constant consumer utility. In practice, most statistical agencies use the Laspeyres price index to measure the CPI. The Laspeyres index does not take into account movements in the structure of consumption, which may be consumers' response to price changes during a given time interval. As a consequence, the Laspeyres index can suffer from commodity substitution bias. The Fisher index is perceived as the best proxy for the Cost of Living Index (COLI), but it needs data on consumption from both the base and the research period. There is therefore a practical need to look for a proxy of the Fisher price index which does not use current expenditure shares as weights.
Purpose of the article: The general purpose of the article is to present a hybrid price index, the idea of which is based on the Young and Lowe indices. The particular aim of the paper is to discuss the usefulness of its special case with weights based on correlations between prices and quantities.
Methods: A theoretical background for the hybrid price index (and its geometric version) is constructed with the Lowe and Young price indices used as a starting point. In the empirical study, scanner data on milk, sugar, coffee and rice are utilized to show that the hybrid index can be a good proxy for the Fisher index, although it does not use the expenditures from the research period.
Findings & Value added: The empirical and theoretical considerations confirm the hybrid nature of the proposed index, i.e. in a special case it forms the convex combination of the Young and Lowe indices. This study points out the usefulness of the proposed price index in the CPI measurement, especially when the target index is the Fisher formula. The proposed general hybrid price index formula is a new one in price index theory. The proposed system of weights, which is based on the correlations between prices and quantities, is a novel idea in the price index methodology.
Equilibrium. Quarterly Journal of Economics and Economic Policy, 15(4), 697–716.


Introduction
The Consumer Price Index (CPI) is a basic, commonly accepted and widely used measure of inflation. The index is a proxy for changes in the costs of household consumption under the assumption of constant consumer utility, i.e. it approximates the Cost of Living Index (COLI). The CPI is used for indexing nominal values in the economy, which is important in price decision making by enterprises. The CPI is also important in monetary policy if the central bank uses the direct inflation targeting strategy. This strategy has been in force in Poland since 1999, and the CPI has been the reference indicator from the beginning.
In practice, in the case of the so-called "traditional data collection," statistical agencies use the Laspeyres (1871) price index to calculate the CPI (see Clements and Izan (1987) or White (1999)). The Laspeyres index does not take into account movements in the structure of consumption, which may be consumers' response to price changes during a given time interval. As a consequence, the Laspeyres index can suffer from the commodity substitution bias. To be more precise, this kind of CPI bias results from changes in the relative prices of individual goods included in the CPI basket. The substitution effect is that consumers respond to price changes by exchanging those goods or services that are relatively more expensive for relatively cheaper ones (Hałka & Leszczyńska, 2011). Although the substitution bias is not the only CPI bias, it is the best recognized in the literature and can be crucial from the point of view of any financial decision which is based on the inflation rate. For example, the Boskin Commission (1996), when analyzing data from the USA for the period 1995-1996, determined the level of total CPI bias at 1.1 percentage points, split into: (i) 0.5 p.p. for the substitution bias (0.4 p.p.) and the outlet bias (0.1 p.p.); and (ii) the remaining 0.6 p.p. for the bias resulting from changes in the quality of goods together with the new goods bias. Today, sales markets are much more dynamic and technological changes are more rapid than 20 years ago, hence it can be expected that the CPI bias is not negligibly small. Please note that the CPI bias is always a cost for the economy. Apart from the fact that many central banks use the direct inflation targeting strategy, in many countries (including Poland) the CPI is used for the indexation of pensions and for indexing financial contracts, including interbank ones.
Therefore, the CPI estimation should be as accurate as possible and the reduction of the CPI measurement bias by even a per mille has financial significance.
Most economists perceive superlative indices (such as the Fisher, Walsh or Törnqvist indices) as the best proxies for the COLI (Von der Lippe, 2007; CPI Manual, 2004). The difference between the Laspeyres index and any superlative index serves as an approximation of the CPI substitution bias. The Fisher index is treated as the best proxy for the COLI but, similarly to other superlative indices, it needs data on consumption from both the base and the research period. Obtaining consumption data for the current period is problematic from the practical point of view, because the household budget survey always provides this information with a certain delay (usually a year or longer). This is due, among other things, to the fact that this type of survey is very expensive. However, there are some ways to approximate the Fisher price index and to reduce the CPI substitution bias by using indices which require information about expenditures only from the base period, e.g. the Lloyd-Moulton price index, the AG Mean index, or the Lowe and Young indices (CPI Manual, 2004).
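The way the difference between the Laspeyres index and a superlative index approximates the substitution bias can be illustrated with a small numerical sketch. The Python snippet below relies on the standard definitions of the Laspeyres, Paasche and Fisher indices; the two-product data set and all function and variable names are hypothetical and are not taken from the study.

```python
from math import sqrt

def laspeyres(p_s, p_t, q_s):
    """Laspeyres price index: base-period quantities serve as weights."""
    return sum(p * q for p, q in zip(p_t, q_s)) / sum(p * q for p, q in zip(p_s, q_s))

def paasche(p_s, p_t, q_t):
    """Paasche price index: current-period quantities serve as weights."""
    return sum(p * q for p, q in zip(p_t, q_t)) / sum(p * q for p, q in zip(p_s, q_t))

def fisher(p_s, p_t, q_s, q_t):
    """Fisher price index: geometric mean of the Laspeyres and Paasche indices."""
    return sqrt(laspeyres(p_s, p_t, q_s) * paasche(p_s, p_t, q_t))

# Hypothetical toy data: the price of the first good doubles and
# consumers partly switch to the second good.
p_s, p_t = [1.0, 1.0], [2.0, 1.0]    # base and current prices
q_s, q_t = [10.0, 10.0], [5.0, 15.0]  # base and current quantities

p_la = laspeyres(p_s, p_t, q_s)       # 1.5
p_f = fisher(p_s, p_t, q_s, q_t)      # sqrt(1.5 * 1.25)
substitution_bias = p_la - p_f        # positive in this example
print(round(p_la, 4), round(p_f, 4), round(substitution_bias, 4))
```

In this toy case consumers move towards the good whose price did not rise, so the Laspeyres index exceeds the Fisher index and the approximated bias is positive.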
The paper proposes a general hybrid price index, the idea of which is based on the Young and Lowe indices. The aim of the paper is to discuss its special case with weights based on correlations between prices and quantities. The proposed system of weights is a new idea in the price index methodology, which is the added value of the paper. To confirm the usefulness of the proposed index, a real scanner data set obtained from one supermarket chain in Poland is used, since scanner data contain full information about sold products, including current expenditures. Monthly aggregated scanner data on milk, sugar, coffee, and rice are used to show that a hybrid index can be a good proxy for the Fisher index, although it does not use the expenditures from the research period. The empirical and theoretical considerations confirm the hybrid nature of the proposed index, i.e. in a special case, it forms the convex combination of the Young and Lowe indices. The study points out the usefulness of the proposed price index in the CPI measurement.
The paper is organized as follows: after the literature review concerning approaches to seeking the best price index in the CPI measurement, the next section presents the methodology of the research, including a description of the construction of the proposed hybrid index formula, the main hypothesis for the index and a summary of scanner data processing. The results section compares the Fisher index with the newly suggested price index formula, which is a special case of the general hybrid formula. Finally, the obtained results are compared to those known from the literature, and the paper concludes with some recommendations for the reduction of the CPI substitution bias.

Literature review
The choice of the optimal price index formula in the CPI measurement is not easy and it depends on the data aggregation level and the data source. For instance, in the so-called "traditional" data collection (data collected by the interviewer in the field), if no information about the consumption level is available, one must choose among the so-called elementary indices, e.g. the Dutot (1738), Carli (1804) or Jevons (1865) price indices. Having data on consumption, it is possible to use weighted index formulas; in practice, the Laspeyres-type (1871) index is used, although the Fisher (1922) index seems to be a much better choice (von der Lippe, 2007). On the other hand, obtaining data from supermarkets, i.e. scanner data from electronic terminals located at points of sale, provides full information on products even at the elementary level, and the list of potential indices which can be used in this case is much longer. For instance, multilateral indices have recently gained popularity here, e.g. the GEKS (Gini, 1931; Eltetö & Köves, 1964; Szulc, 1983) or the Geary-Khamis (Geary, 1958; Khamis, 1970) index.
Although the history of the CPI has spanned 100 years, there are still open problems in the index methodology, and in general there are several main approaches to selecting the best price index formula. In the stochastic approach, the price index is treated as an unknown parameter in the regression model describing price changes (Selvanathan & Prasada Rao, 1992; Diewert, 2005). In the economic approach, correlations between prices and quantities are taken into consideration (Von der Lippe, 2007; CPI Manual, 2004), and we expect the price index formula to be as close to the true cost of living index as possible. The final report of the Boskin Commission begins with a recommendation that "the Bureau of Labor Statistics (BLS) should establish a cost of living index (COLI) as its objective in measuring consumer prices" (see Boskin et al., 1996, p. 2).
Further discussion on the COLI theory can be found in the papers of: Diewert (1993), Jorgenson and Slesnick (1983), and Pollak (1989). Finally, in the axiomatic approach, a well-constructed price index should satisfy a group of "tests" or axioms (Balk, 1995). Systems of minimum requirements of price indices were provided by Martini (1992), Eichhorn and Voeller (1976) and Olt (1996). Any new price index proposition should fulfill axioms from the system of minimal requirements (Białek, 2014a).
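The elementary indices named earlier in this section use prices only. A minimal Python sketch of the Dutot, Carli and Jevons formulas in their standard textbook forms is given below; the prices and the function names are hypothetical and serve only as an illustration.

```python
from math import exp, log

def dutot(p_s, p_t):
    """Dutot index: ratio of the arithmetic mean prices in both periods."""
    return sum(p_t) / sum(p_s)

def carli(p_s, p_t):
    """Carli index: arithmetic mean of the price relatives."""
    relatives = [pt / ps for ps, pt in zip(p_s, p_t)]
    return sum(relatives) / len(relatives)

def jevons(p_s, p_t):
    """Jevons index: geometric mean of the price relatives."""
    logs = [log(pt / ps) for ps, pt in zip(p_s, p_t)]
    return exp(sum(logs) / len(logs))

# Hypothetical prices of two products in the base and current period
p_s, p_t = [2.0, 4.0], [4.0, 4.0]
print(dutot(p_s, p_t), carli(p_s, p_t), jevons(p_s, p_t))
```

Since the geometric mean never exceeds the arithmetic mean, the Jevons index is never larger than the Carli index for the same price relatives.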
As mentioned above, superlative indices (Diewert, 1976) may be considered the best proxies for the COLI (Von der Lippe, 2007; CPI Manual, 2004). In an ideal case, the "traditional" CPI should be measured by using the Fisher index, but the problem is that the index needs data on consumption from the current period, which is beyond the reach of any NSI (National Statistical Institute). One possible solution to this problem is to use a price index method which is able to approximate the Fisher index despite the lack of weights from the current period. In the literature, one can encounter several interesting ideas for a Fisher index proxy, e.g. within the Constant Elasticity of Substitution (CES) framework it is possible to approximate the Fisher price index provided that the elasticity of substitution parameter can be estimated. In particular, the Lloyd-Moulton price index (Lloyd, 1975; Moulton, 1996; Shapiro & Wilcox, 1997) and the AG mean indices (Lent & Dorfman, 2009) seem to be good proxies for the Fisher index. Nevertheless, although there are many ways of estimating the elasticity parameter (Feenstra & Reinsdorf, 2007; Biggeri & Ferrari, 2010; Greenlees, 2011; Armknecht & Silver, 2012), the problem with this parameter is its instability over time (Białek, 2017a).
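The dependence of the CES-based approach on the elasticity of substitution can be seen in a short sketch of the Lloyd-Moulton index in its standard form, which needs only base-period expenditure shares and a value of the elasticity (called `sigma` here). The data, the function name and the chosen values of `sigma` are hypothetical.

```python
from math import exp, log

def lloyd_moulton(p_s, p_t, q_s, sigma):
    """Lloyd-Moulton index with base-period expenditure shares as weights.

    sigma = 0 gives the Laspeyres index; sigma = 1 is handled as the
    limiting case, i.e. the geometric Laspeyres index.
    """
    values = [p * q for p, q in zip(p_s, q_s)]
    total = sum(values)
    shares = [v / total for v in values]
    relatives = [pt / ps for ps, pt in zip(p_s, p_t)]
    if abs(sigma - 1.0) < 1e-12:
        return exp(sum(w * log(r) for w, r in zip(shares, relatives)))
    power = 1.0 - sigma
    return sum(w * r ** power for w, r in zip(shares, relatives)) ** (1.0 / power)

# Hypothetical data: two products with equal base-period expenditures
p_s, p_t, q_s = [1.0, 1.0], [2.0, 1.0], [10.0, 10.0]
for sigma in (0.0, 1.0, 2.0):
    print(sigma, round(lloyd_moulton(p_s, p_t, q_s, sigma), 4))
```

The index value falls as `sigma` grows, which is why an unstable estimate of the elasticity translates directly into an unstable approximation of the Fisher index.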
Another group of indices which can be applied to approximate the Fisher price index consists of the Young and Lowe indices (Armknecht & Silver, 2012; CPI Manual, 2004), which use information about quantities from an arbitrary fixed prior period. It has been shown empirically that the quality of their approximations is good enough (Białek, 2017b; Juszczak, 2020). However, it would be very interesting to provide a general, hybrid price index formula which: a) fulfills the most important requirements from the axiomatic approach; b) may serve as a good proxy for the Fisher index; c) includes the Lowe and Young indices as its special cases; and d) uses additional information about correlations between prices and quantities of goods and services. Some other general price index propositions can be found in Białek (2012, 2015, 2020b), and research on the impact of correlations between prices and quantities on price index results can also be encountered in the literature (Białek, 2019). Nevertheless, to the author's best knowledge, there is a lack of papers which consider all the above-mentioned postulates a)-d). In the author's opinion, the general hybrid price index formula proposed in the paper is a new one in price index theory and practice, and may be considered a good proxy for the Fisher price index.

Research methodology
The weighted bilateral (direct) price index formula is a function of a set of prices and quantities observed in the two compared periods: the base period $s$ and the current period $t$. The Fisher index is the geometric mean of the Laspeyres and Paasche indices and thus, similarly to the Paasche formula, it uses quantities from the current period. The CPI is measured as the weighted arithmetic mean of price relatives where the weights are the expenditure shares in the base period. In practice, there are crucial differences between the expenditure shares from the survey period ($\tau$) and the expenditure shares from the current period ($t$), but please note that the compilation of household expenditure data requires time. NSIs therefore use survey weights from the prior period $\tau$ to calculate the CPI, where $\tau < s < t$. As a consequence, the Young price index can be proposed (CPI Manual, 2004):

$$P_{Y} = \sum_{i=1}^{N} w_i^{\tau} \frac{p_i^{t}}{p_i^{s}}, \qquad w_i^{\tau} = \frac{p_i^{\tau} q_i^{\tau}}{\sum_{k=1}^{N} p_k^{\tau} q_k^{\tau}},$$

where $p_i^{a}$ and $q_i^{a}$ denote the price and the quantity of the $i$-th of $N$ products in period $a$. In the paper, the geometric Young index formula is also considered, which is as follows:

$$P_{GY} = \prod_{i=1}^{N} \left( \frac{p_i^{t}}{p_i^{s}} \right)^{w_i^{\tau}}.$$

It is easy to show that the hybrid formulas (12) and (13) satisfy Martini's system of axioms (the proof is available upon request), according to which a properly constructed price index should meet the following postulates: identity, commensurability and linear homogeneity. In short, this means that (a) with identical prices in the base period and the current period, the index should be equal to one; (b) the index should not depend on the monetary unit in which the prices are expressed; (c) the index value should depend linearly on identical changes of all prices in the current period.
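The Lowe and Young indices that serve as the starting point, together with the three postulates listed above, can be sketched numerically. The snippet below uses the standard definitions from the CPI Manual (2004) with quantities and expenditure shares taken from the prior period; it does not reproduce the hybrid formulas (12) and (13), and all names and data are hypothetical.

```python
from math import exp, log

def expenditure_shares(prices, quantities):
    """Expenditure shares computed from prices and quantities of one period."""
    values = [p * q for p, q in zip(prices, quantities)]
    total = sum(values)
    return [v / total for v in values]

def lowe(p_s, p_t, q_tau):
    """Lowe index: fixed basket of quantities from the prior period tau."""
    return sum(p * q for p, q in zip(p_t, q_tau)) / sum(p * q for p, q in zip(p_s, q_tau))

def young(p_s, p_t, p_tau, q_tau):
    """Young index: price relatives weighted by prior-period expenditure shares."""
    w = expenditure_shares(p_tau, q_tau)
    return sum(wi * pt / ps for wi, ps, pt in zip(w, p_s, p_t))

def geometric_young(p_s, p_t, p_tau, q_tau):
    """Geometric Young index: weighted geometric mean of price relatives."""
    w = expenditure_shares(p_tau, q_tau)
    return exp(sum(wi * log(pt / ps) for wi, ps, pt in zip(w, p_s, p_t)))

# Hypothetical data for two products: prior (tau), base (s), current (t)
p_tau, q_tau = [1.0, 2.0], [10.0, 5.0]
p_s, p_t = [2.0, 2.0], [3.0, 2.0]

print(lowe(p_s, p_t, q_tau), young(p_s, p_t, p_tau, q_tau),
      geometric_young(p_s, p_t, p_tau, q_tau))

# Numerical checks of the three postulates for the Young index
identity = young(p_s, p_s, p_tau, q_tau)                        # equals 1
homogeneity = young(p_s, [2 * p for p in p_t], p_tau, q_tau)    # twice the index
k = 10.0  # change of the measurement unit of the first product
commensurability = young([p_s[0] * k, p_s[1]], [p_t[0] * k, p_t[1]],
                         [p_tau[0] * k, p_tau[1]], [q_tau[0] / k, q_tau[1]])
```

A numerical check of this kind is no substitute for a proof, but it helps to catch implementation errors before an index is applied to real data.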
Please note that special cases of the hybrid formula include the Lowe and Young indices. However, the author's previous results show that the choice between the Young and Lowe indices may depend on the correlation between prices and quantities of sold products (Białek, 2017a, 2017b). Let $\rho_{a,b}$ denote the Pearson correlation coefficient between prices from period $a$ and quantities from period $b$. Then three correlation coefficients can be considered: $\rho_{\tau,\tau}$, $\rho_{s,\tau}$ and $\rho_{t,\tau}$, and it is assumed that at least one of these coefficients is nonzero. Thus it seems reasonable to fix the components of the $\gamma$-vector in proportion to the levels of correlation between prices and quantities (formula (14)). The resulting system of weights is denoted by $\gamma_0$, and the empirical part of the paper considers hybrid indices (12) and (13) based on it. On the one hand, the formulas need some additional calculations (correlation coefficients), which may be considered their weakness; on the other hand, the completed empirical study shows that this additional information about correlations allows for a better approximation of the ideal Fisher price index.
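The three correlation coefficients can be computed directly from the price and quantity vectors. The sketch below calculates the Pearson coefficients between prices from the prior, base and current periods and quantities from the prior period. The final normalization step, which divides the absolute values of the coefficients by their sum, is only an assumed illustration of weights "in proportion to levels of correlation"; it should not be read as the exact formula (14) of the paper. All names and data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

def correlation_weights(p_tau, p_s, p_t, q_tau):
    """Return the three correlations and ASSUMED weights proportional to |rho|."""
    rhos = [pearson(p_tau, q_tau), pearson(p_s, q_tau), pearson(p_t, q_tau)]
    total = sum(abs(r) for r in rhos)
    if total == 0:
        # mirrors the requirement that at least one coefficient is nonzero
        raise ValueError("at least one correlation coefficient must be nonzero")
    return rhos, [abs(r) / total for r in rhos]

# Hypothetical data for three products
p_tau = [1.0, 2.0, 3.0]
p_s = [1.0, 2.5, 3.0]
p_t = [1.5, 2.0, 3.5]
q_tau = [30.0, 20.0, 10.0]

rhos, weights = correlation_weights(p_tau, p_s, p_t, q_tau)
print([round(r, 3) for r in rhos], [round(w, 3) for w in weights])
```

Any other rule that maps the three coefficients to non-negative weights summing to one could be substituted in the last step of `correlation_weights`.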
The main hypothesis states that formulas $P_{H}^{\gamma_0}$ and $P_{GH}^{\gamma_0}$ are reasonable and better proxies for the Fisher index than the commonly used Laspeyres index. To verify this hypothesis, scanner data on milk, sugar, coffee, and rice obtained from one of the retail chains in Poland were used. The monthly data covered the period from December 2018 to December 2019. Before index calculations were carried out, the data sets had been carefully prepared, i.e. the data set with over 500,000 records had undergone all necessary processing stages (product classification, product matching and data filtering). One of the main problems encountered at this stage of data processing was the standardization of product descriptions. The purpose of the standardization process was to obtain a set of unique (identical) product names that could be used to match products over time. The original data sets from the retailer were uncleaned, contained duplicates and errors, and used non-homogeneous names. As a result, labels such as "Milk" and "milk" (letter case) or "1L drink" and "1LDrink" would be treated as distinct products. To be more precise: after the manual classification of raw data, performed by the Statistical Office in Opole, there were over 400 different product names for which COICOP groups could not be easily matched. In order to classify products into COICOP 5 and local (Polish) COICOP 6 groups, text mining methods based on regular expressions were used to (1) extract measurement units (e.g. weight, volume) from product labels; (2) remove special signs and some typos; and (3) separate key words and characteristic phrases. The LASSO regression, proposed by Santosa and Symes (1986) and Tibshirani (1996), was used for further product classification. After the classification process, products were matched over time by using the reclin R package separately for each outlet. The matching process took into consideration EAN product codes, retailer codes, and additional product labels.
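The label standardization problem described above can be illustrated with a few regular expressions. The sketch below is written in Python for illustration only and is not the procedure used in the study (which relied on R tools); the patterns, the list of units and the function names are assumptions.

```python
import re

# A number with a unit glued to a following capitalized word, e.g. "1LDrink"
GLUED_UNIT = re.compile(r"(\d+(?:[.,]\d+)?\s*(?:ml|kg|l|g|ML|KG|L|G))(?=[A-Z][a-z])")
# A number followed by one of a few assumed measurement units
UNIT = re.compile(r"(\d+(?:[.,]\d+)?)\s*(ml|kg|l|g)\b")

def standardize(label):
    """Return a lower-case label with unified spacing and no special signs."""
    text = GLUED_UNIT.sub(r"\1 ", label.strip())  # "1LDrink" -> "1L Drink"
    text = text.lower()
    text = re.sub(r"[^\w.,% ]+", " ", text)        # drop special signs
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace

def extract_unit(label):
    """Return (amount, unit) found in the label, or None if there is none."""
    match = UNIT.search(standardize(label))
    if match is None:
        return None
    return float(match.group(1).replace(",", ".")), match.group(2)

print(standardize("1LDrink"), "|", standardize("1L drink"))
print(extract_unit("Milk 3.2% 500ml"))
```

After standardization, labels that differ only in letter case or spacing collapse to the same string and can be matched over time.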
After the product matching, two data filters were used to exclude extreme price changes and products with relatively low sales (PriceIndices R package). To be more precise: in the case of the extreme price filter, those products whose monthly price increase was above 200% or whose monthly price decrease was above 75% were eliminated. In the case of the low-sales filter, products whose relative share in monthly sales was less than 1% were eliminated. As a consequence, the sample size was reduced by over 30% when using these data filters. Finally, the Fisher price index and the proposed hybrid indices were calculated for the considered groups of products, with the prior period set to December 2017, for the variants with and without filters.
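The two filters can be sketched as simple predicates on monthly product records. The snippet below is a Python illustration and does not call the PriceIndices package; it assumes that an increase above 200% corresponds to a price ratio above 3, that a decrease above 75% corresponds to a ratio below 0.25, and that sales shares are computed after the price filter. The record layout, the names and the data are hypothetical.

```python
def extreme_price_filter(data, max_ratio=3.0, min_ratio=0.25):
    """Keep products whose monthly price ratio lies within [min_ratio, max_ratio]."""
    return {name: rec for name, rec in data.items()
            if min_ratio <= rec[1] / rec[0] <= max_ratio}

def low_sales_filter(data, min_share=0.01):
    """Keep products whose share in total monthly sales is at least min_share."""
    total = sum(rec[2] for rec in data.values())
    return {name: rec for name, rec in data.items() if rec[2] / total >= min_share}

# Hypothetical records: product -> (previous price, current price, current sales)
products = {
    "A": (1.00, 1.10, 500.0),  # ordinary price change, kept
    "B": (1.00, 3.50, 300.0),  # increase of 250%, removed
    "C": (2.00, 0.40, 200.0),  # decrease of 80%, removed
    "D": (1.50, 1.50, 5.0),    # sales share below 1%, removed
}

filtered = low_sales_filter(extreme_price_filter(products))
print(sorted(filtered))
```

The thresholds are exposed as parameters so that the sensitivity of the index values to the filter settings can be checked.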

Results
Tab. 1 and Tab. 2 include values of the discussed parameters (together with correlations between prices and quantities) and the Laspeyres, Fisher, hybrid and geometric hybrid price indices calculated for the data sets on milk, yoghurt, coffee, and rice. The calculations were done for the cases without and with data filters. A more detailed comparison of the indices, determined for each month of the annual time window, is shown in Fig. 1 and Fig. 2. It can be observed that, in the case of data on milk and yoghurt, the hybrid formulas approximate the Fisher index clearly better than the Laspeyres index does. The comparison of yearly indices together with the calculated correlations and weights can be found in Tab. 1 and Tab. 2. It can basically be concluded that the hybrid indices and the Laspeyres index perform almost equally well at the end of the analyzed year. Let us add, however, that the largest differences between the Laspeyres index and the hybrid indices arose for the coffee data, while in the case of rice the corresponding differences are negligible and become even smaller after data filtering. Detailed results, presenting a simple statistic (Mean Absolute Error, MAE) that compares the quality of approximations of the Fisher index, are presented in Tab. 3. Please note that the use of the geometric hybrid index always leads to a reduction of the substitution effect, i.e. it is always a better proxy for the Fisher index than the Laspeyres formula, and its use is almost always more effective than the use of the basic hybrid price index. The basic hybrid index also works very well, i.e. it reduces the substitution bias in almost every case (the rice case being an exception). Finally, please note that the above conclusions do not really depend on whether or not data filters were used. Interestingly, the use of filters had a clear impact on the values of the correlation coefficients (compare Tab. 1 and Tab. 2), but the proportion between them was kept, so the hybrid index itself changed only slightly (see Fig. 1 and Fig. 2).
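A comparison of this kind reduces to measuring the average distance between a monthly index series and the Fisher series. The sketch below computes the mean absolute error for two candidate series; the numbers are made up for illustration and are not the values reported in Tab. 3, and the function name is an assumption.

```python
def mean_absolute_error(series, reference):
    """Average absolute difference between an index series and a reference series."""
    if len(series) != len(reference):
        raise ValueError("both series must cover the same months")
    return sum(abs(a - b) for a, b in zip(series, reference)) / len(series)

# Hypothetical monthly index values (not the results of the study)
fisher_series = [1.000, 1.010, 1.020]
laspeyres_series = [1.000, 1.020, 1.040]
hybrid_series = [1.000, 1.012, 1.018]

mae_laspeyres = mean_absolute_error(laspeyres_series, fisher_series)
mae_hybrid = mean_absolute_error(hybrid_series, fisher_series)
print(round(mae_laspeyres, 5), round(mae_hybrid, 5))
```

A series that oscillates above and below the reference, as the hypothetical hybrid series does here, can still have a small mean absolute error, because the measure ignores the sign of the deviation.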

Discussion
First of all, it should be noted that, due to the negative correlation between prices and quantities determined on the basis of the examined data sets, the Laspeyres index overestimated the actual inflation (if the Fisher index is the reference point), which results from the so-called Bortkiewicz inequalities (Von der Lippe, 2007). The hybrid index (both basic and geometric) is not subject to this rule, i.e. in the study its values happened to be both above and below the Fisher index value, though in both cases they approximated it quite accurately. Such upward and downward oscillation of the hybrid indices throughout the year may have a beneficial effect if one wished to use chain versions of these indices, which has not been done in the paper. Chain versions of hybrid indices would not accumulate the distance from the Fisher index, while in the case of the chain Laspeyres index, its value would deviate from the Fisher index with each month. Another note concerns the update of weights in the CPI basket, the frequency of which varies from country to country. In Poland, the weight update is annual, so that the substitution bias (measured as the difference between the Laspeyres and Fisher indices) is negligible (Hałka & Leszczyńska, 2011; Białek, 2014). In other countries, the frequency of updating CPI basket weights, and thus also the level of substitution bias, is far from ideal. The potential scale of this type of bias in measuring inflation is discussed, for example, in Crawford (1998). Nevertheless, given the high costs of the household budget survey which provides information on the level of consumption, perhaps an annual weight update is not necessary if the hybrid indices are used.
The premise for this supposition is the fact that, in the study, the hybrid indices proved to be approximations of the Fisher index that were not inferior, and usually superior, to the Laspeyres index, and yet they used year-lagged consumption information (compared to the base period). The advantage of the hybrid indices is also that they use additional information about the level of correlation between prices and quantities. Earlier studies indicate that the accuracy of approximation of the Fisher index by indices using "lagged weights" (Young, Lowe) may depend on the level of the said correlation (Białek, 2017a, 2017b). Let us add, however, that the very measurement of correlations required by the hybrid formulas may cause interpretation difficulties. We measure here, among other things, correlations between prices and quantities from different periods (e.g. $\rho_{t,\tau}$ is the correlation coefficient determined for prices from the current period but quantities from the prior period).
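The Bortkiewicz relation invoked at the beginning of this discussion can be verified numerically. In its standard form, the relative gap between the Paasche and Laspeyres price indices equals the covariance of price relatives and quantity relatives, weighted by base-period expenditure shares, divided by the product of the Laspeyres price and quantity indices. The sketch below checks this identity on hypothetical data; all names are assumptions.

```python
def bortkiewicz(p_s, p_t, q_s, q_t):
    """Return the Laspeyres index, the Paasche index and the Bortkiewicz term."""
    values = [p * q for p, q in zip(p_s, q_s)]
    total = sum(values)
    w = [v / total for v in values]                      # base expenditure shares
    x = [pt / ps for ps, pt in zip(p_s, p_t)]            # price relatives
    y = [qt / qs for qs, qt in zip(q_s, q_t)]            # quantity relatives
    p_la = sum(wi * xi for wi, xi in zip(w, x))          # Laspeyres price index
    q_la = sum(wi * yi for wi, yi in zip(w, y))          # Laspeyres quantity index
    cov = sum(wi * (xi - p_la) * (yi - q_la) for wi, xi, yi in zip(w, x, y))
    p_pa = (sum(p * q for p, q in zip(p_t, q_t))
            / sum(p * q for p, q in zip(p_s, q_t)))      # Paasche price index
    return p_la, p_pa, cov / (p_la * q_la)

# Hypothetical data: the good that became dearer is bought less
p_s, p_t = [1.0, 1.0], [2.0, 1.0]
q_s, q_t = [10.0, 10.0], [5.0, 15.0]

p_la, p_pa, term = bortkiewicz(p_s, p_t, q_s, q_t)
print(round(p_pa / p_la - 1, 6), round(term, 6))  # both sides of the identity
```

With a negative covariance the Paasche index falls below the Laspeyres index, so the Fisher index, as their geometric mean, lies below the Laspeyres index as well.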

Conclusions
To sum up, the proposed hybrid indices (or more precisely: the hybrid index in its basic (12) and geometric (13) version) seem to be useful in the practice of statistical offices in the framework of the so-called traditional data collection. In a situation where, due to the cost of the household budget survey, the CPI basket update is annual or even less frequent, such indices can successfully compete with the Laspeyres index. In the completed study, these indices most often turned out to be a better approximation of the Fisher index. The presented research confirms that the geometric hybrid price index in particular is a very promising method leading to the reduction of the CPI substitution bias. These indices are also interesting from the theoretical point of view because they use additional knowledge about the correlation between prices and quantities of CPI basket components.
The only limitation here is the requirement that at least one of the three correlation coefficients used is nonzero, which in practice almost always occurs. Let us add, however, that the proposal of a system of weights based on correlations is only one of the possibilities and here the problem of optimal selection of these weights can still be considered as open.
Future research could also focus on taking into account all time moments from the entire interval τ. The author of this paper also plans to check how distant from the base period the prior period can be while still allowing an effective approximation of the Fisher index.
Source: own calculations in the PriceIndices package.