Introduction
For the past few decades, interest in reducing the gender pay gap (GPG), that is, in diminishing the wage imbalance between men and women, has increased in labor markets (Bishu and Alkadry, 2016: 65; Blau and Kahn, 2017: 789). However, despite these efforts, gender parity in pay has proved hard to achieve, and the gap persists worldwide to a greater or lesser extent (Goldin, 2014: 1091; Blau and Kahn, 2017: 789). According to the World Economic Forum (2020), at current rates of change, the global gender gap will take more than 250 years to close.
The Global Gender Gap Index (2020) shows that there has been no great progress towards closing the gap, and not one of the 153 countries covered has yet achieved gender parity in salaries. The International Labour Organization (ILO) (2018), using average hourly wages from data for 73 countries, estimates that the global GPG stands at around 16%, though with a factor-weighted approach the global estimate rises to 19%. Similar results emerge from the Global Gender Gap Report (World Economic Forum, 2020). According to data from the Organisation for Economic Co-operation and Development (OECD), the differential between men’s and women’s median income is about 13.5% and improving, while for non-OECD countries it is around 15% and worsening. However, these figures vary greatly by country. For example, in 2021, within the European Union, where the salary gap lies around 16%, the highest GPGs were recorded in Latvia (22.3%) and Estonia (21.7%), and the lowest in Luxembourg (1.3%) and Romania (3.0%) (European Commission, 2021). GPGs are notable even in countries with equal pay legislation. In the United States, the gap has remained at around 17 to 20% for at least fifteen years despite the Equal Pay Act of 1963 (Fontenot et al., 2018; U.S. Bureau of Labor Statistics, 2020). In the UK, with its Equal Pay Act of 1970, and in France, which legislated in 1972, the gaps are nearly 17% and 12%, respectively, and in Australia the gap remains around 12% (OECD, 2021).
Predictably, the numbers do not improve when moving to Latin American and Caribbean countries. In 2018, this region fell in the middle of the overall global GPG ranking, behind Western Europe, North America, Eastern Europe, and Central Asia (International Labour Organization, 2018). On average, women in the region work 25 hours more per month than the average man (United Nations, 2020), and half of them work for no pay or profit at all. Argentina, Brazil, Chile, Colombia, Mexico, Peru, and Venezuela prohibit gender pay discrimination, and most countries embrace the ILO’s notion of “work of equal value.” Nevertheless, in Argentina women earn on average 29% less than men (D’Alessandro et al., 2020). This difference is observed in all occupational categories, and the gap widens in hierarchical positions.
Breaking down the gap
Accurately measuring the GPG is relevant to assessing how far we are from equality, since it is a symbol of women’s position in the workforce in comparison to men. However, the unadjusted (raw) GPG is a complex indicator. Although it provides an overall picture of the difference between men’s and women’s salaries, it can mask the fact that this difference is attributable not only to direct discrimination through ‘unequal pay for equal work’, but also to many other factors that are equally or more powerful determinants of male-female earnings differentials. These include social and historical factors, such as the concentration of one gender in certain activities (‘segregation’) and unequal ease of access to higher-paid hierarchical positions (‘glass ceiling’) (Karamessini and Ioakimoglou, 2007: 31). Therefore, being able to decompose the GPG and determine the contribution of its different components is important for designing appropriate policies to reduce it. In this regard, some efforts have been made, typically through regression models based on Mincer-type wage equations, Oaxaca-Blinder decompositions or further developments of this method, or the Wellington approach to estimating the GPG over time, among other proposals (Karamessini and Ioakimoglou, 2007: 31; Chernozhukov et al., 2013: 2205; Blau and Kahn, 2017: 789; Töpfer, 2017; Amado et al., 2018: 357; European Commission. Statistical Office of the European Union, 2018).
Traditionally, economists’ approaches to understanding the GPG have rested on two sets of economic theories: the human capital model and models of labor market discrimination (Grybaitė, 2006: 85; Ospino et al., 2010: 237).
The human capital model is usually used to analyze the so-called explained part of the GPG, the one that can be attributed to differences in qualifications. The basic idea is that every person has some form of human capital, i.e., the abilities and skills acquired through education, training and working experience. According to this model, if women have less human capital than men, they should rightfully receive lower wages (Mincer and Polachek, 1974: S76; Polachek, 1981: 60; Grybaitė, 2006: 85). On average, this assumption holds, and there are numerous reasons for it. For instance, women tend to have less labor market experience because they work part-time and intermittently due to the traditional division of labor by gender, maternity, and the hours dedicated to housework and childcare. Moreover, all of these factors result in fewer incentives to invest in education and training (Becker, 1985: S33; Grybaitė, 2006: 85). It is important to point out, as several studies have already done, that human capital explanations are based on broad assumptions and do not take into account the fact that women and men cannot be studied as individuals independent of material and social frameworks, since all decisions are made in a normative context with pre-established ideas about what women and men ought to do (Grybaitė, 2006: 85).
Although the human capital model plays an important role in explaining the GPG, it does not account for the total gap. The remaining portion of the GPG, that is, the unexplained part of the gap, is generally presumed to be due to labor market discrimination and refers to differences in salaries between workers who have the same abilities, experience and training. Therefore, it can be defined as direct discrimination, since it accounts for ‘unequal pay for equal work’ (Grybaitė, 2006: 85). Several models attempt to explain this portion of the gap, such as the statistical discrimination model proposed by Edmund Phelps (1972: 659), other statistical models (Becker, 1971; Bergmann, 1974: 103; Aigner and Cain, 1977: 175; Lundberg and Startz, 1983: 340), and the overcrowding model developed by Barbara Bergmann (1974: 103). However, while these models help in understanding some of the reasons behind the unexplained part of the GPG, none of them manages to fully encompass it or to propose ways of resolving the inequity.
Both contributions to the GPG have varied over time. For instance, in the USA, by 2010, conventional human capital variables (education and labor-market experience), which had accounted for an important part of the GPG in earlier decades, had decreased in importance, probably due to the reversal of the gender difference in education and the substantial reduction in the experience gap (Blau and Kahn, 2017: 789). However, the persistence of an unexplained GPG suggests that labor-market discrimination continues to contribute. In 2014, the average adjusted GPG for the EU was 11.5%, ranging from 2.5% in Belgium to 24.2% in Lithuania. Notably, in many EU countries, the adjusted GPG is higher than the unadjusted figure. This means that, given their on average better labour market characteristics, women in those countries would be expected to earn more than men (European Commission, 2018).
Since the unexplained part reflects differences in salaries of subjects with supposedly identical characteristics aside from gender, it could be claimed to reflect direct discrimination (Goldin, 2014: 1091).
Estimating the unexplained (adjusted) GPG, that is, the pay penalty of being female, is not an easy task, since it is necessary to control for all relevant factors that are simultaneously correlated with salaries and gender (Fortin et al., 2011), such as experience, educational level, abilities, and position (Goldin, 2006: 1; Mandel and Semyonov, 2014: 1597; Blau and Kahn, 2017: 789; Töpfer and Brieland, 2022).
Machine Learning, a novel approach
As Qin and Chiang (2019: 465) point out, over the last 20 years there has been a revolution in statistical science, given the possibility of ‘extracting important patterns and trends, and understanding “what the data says”’, the so-called “learning from data”, thanks to big data analysis and machine learning (ML).
Machine learning (ML) is a branch of artificial intelligence (AI) in which algorithms (that is, sequences of statistical processing steps) are trained to find patterns in massive amounts of data in order to make decisions, inferences, and predictions. The resulting trained algorithm is the ML model.
There are four basic steps for building a ML model:
(1) Select and prepare a training data set that is representative of the problem in question. In some cases, the training data is labeled, that is, ‘tagged’ to call out the features and classifications the model will need to identify. Other data is unlabeled. Data is usually divided into subsets for training and for cross-validation (of which the “leave-one-out” method is a special case).
(2) Choose an architecture, that is, an algorithm to run on the training data set. The type of algorithm depends on the type (labeled or unlabeled) and the amount of data in the training data set and on the type of problem to be solved. Common types of ML algorithms for use with labeled data include linear and logistic regression and decision trees, while algorithms for use with unlabeled data include clustering and association algorithms.
(3) Train the algorithm to create the model.
(4) Use and improve the model. The final step is to use the model with new data and, in the best case, to improve its accuracy and effectiveness over time.
The whole process is an iterative one.
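The four steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the data are synthetic, and a one-variable linear regression stands in for the architecture (it is not the model used in this study). The names `fit_line` and `predict` are hypothetical.

```python
import random
import statistics

# Step 1: select and prepare a labeled data set (synthetic here), then split it.
random.seed(0)
data = [(years, 50_000 + 3_000 * years + random.gauss(0, 2_000)) for years in range(30)]
random.shuffle(data)
train, test = data[:24], data[24:]

# Step 2: choose an architecture; a one-variable linear regression stands in here.
def fit_line(rows):
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in rows) / sum((x - mean_x) ** 2 for x in xs)
    return slope, mean_y - slope * mean_x

# Step 3: train the algorithm to create the model.
slope, intercept = fit_line(train)

def predict(years):
    return intercept + slope * years

# Step 4: use the model on data it has not seen and measure its error.
mean_abs_error = statistics.fmean([abs(predict(x) - y) for x, y in test])
```

If the error on unseen data is too large, one would return to steps 1 to 3 with more data, different features, or a different algorithm, which is what makes the process iterative.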
In this regard, ML methods could offer a new approach to estimating the adjusted GPG (Karimian et al., 2019; Bonaccolto-Töpfer and Briel, 2022). In the literature, the control variables used to estimate the adjusted GPG are typically chosen based on economic reasoning. However, there is a limited understanding of the functional form, which includes identifying relevant interactions and polynomials (Bonaccolto-Töpfer and Briel, 2022). Additionally, certain character skills may have a nonlinear impact on wages. Estimating the adjusted GPG becomes particularly complicated when the data are large and heterogeneous, since numerous factors may contribute to pay differences between genders, and their relevance may vary depending on the wage level being considered. ML methods provide a more systematic approach that avoids an arbitrary selection of variables (Bonaccolto-Töpfer and Briel, 2022).
Being able to perform a GPG decomposition that controls simultaneously for several factors, even in heterogeneous samples, could help clarify the effect each variable has on the gap and, more broadly, how each of its components varies over time.
Undoubtedly, the most effective approach for assessing the level of discrimination would be to compare an individual's salary with what they would earn in the exact same conditions if they were of the opposite gender (Alatrista-Salas et al., 2017). However, this method is unrealistic, since it is impossible to observe the same person as the opposite gender. Nevertheless, by using ML, we can simulate this scenario to some extent, and that is what we attempted to do in this research.
In this article, we show how an ML model, based on specific information (gender, age, years of experience, number of employees at their company, position, etc.) provided by a population of 5,742 IT-related workers through an anonymous online survey conducted by a community called Sysarmy, can infer a person’s salary with a certain degree of precision. Based on this model, we propose a decomposition approach for the GPG that estimates the unexplained GPG, taking the sample size disparity into account, and identifies the determinants of male-female earnings differentials in the explained part of the GPG.
Methodology
This investigation consisted of four phases: (I) characterizing the sample, (II) creating the salary predictor, (III) adjusting the GPG, and (IV) analyzing the explained part of the GPG. Below, we describe the methodology used in each phase; the results are presented in the following section.
Phase I. Characterization of the sample
Our original idea was to generate a salary predictor for IT-related workers. To do that, an ML model was developed from real salary data provided by an open and anonymous survey of IT-related workers. The answers came from a self-selected population and therefore did not represent the entire IT population of Argentina, but they allowed us to analyze trends.
Period. The data were collected in Argentina between December 2019 and January 2020.
Features. Workers were asked about their gender identity, years of experience, workplace, number of employees in their companies, and level and area of studies, among other factors. The full datasets and features are publicly accessible (Sysarmy, 2020). In Table 1, we list the features that were considered for this study, along with a brief description or the available options for each of them.
Income data. In Argentina (a country with high rates of informal labor), salaries are typically paid monthly, and the monthly value is the one commonly used for economic estimators (for example, poverty and indigence indices are estimated based on monthly family income, and subsidies for electricity and gas are also granted based on these incomes). Therefore, we considered for the analysis only those participants who had a full-time job (40 hours per week) paid monthly, and we used their gross monthly income. This, of course, implies a selection bias, but it avoids the error of including individuals who work as freelancers and standardizes the number of hours worked regardless of gender.
The salary values mentioned in the article are in Argentine pesos. Between December 2019 and January 2020, the period during which the information was collected, one US dollar was equivalent to 63 Argentine pesos on average. The inflation indexes for these two months were 3.7% and 2.3%, respectively. This could introduce some variability into the analysis, because we did not have the date of each record and so could not normalize the values for any salary adjustments.
Data preparation
First, we tried to detect and eliminate anomalies, that is, extreme values that did not make sense for salaries, number of employees, age, etc., which may have been introduced intentionally or through typing errors or misunderstanding of the questions. For instance, someone answered that their company had over a billion employees, clearly an outlier. Many other workers entered a salary of $1, perhaps unemployed people who still wanted to participate in the survey. One person indicated that he had been employed by the same company for 2000 years; he probably misinterpreted the question and understood that he was being asked since what year he had been working for his current company.
From a total of 5,982 responses, we finally considered 5,766 in the present analysis, that is, 96% of the total. We were, then, faced with two types of data: numerical (e.g., years of experience and salary) and categorical, that is non-numerical (e.g., gender and province), which had to be transformed into numerical variables.
Numerical data. Despite the fact that it was possible to operate directly with these values, it is important to take into account that they do not necessarily follow a linear progression.
Regarding experience, for example, we might think that going from no experience to 1 year does not have the same effect on salary as going from 10 to 11 years. That is to say, the more years of experience a person has, the less impact an additional year will have on the salary. A good way to capture this behavior is by using a logarithmic function. We could also apply a logarithm to the salary values themselves, because earning $1,000 more does not mean the same to someone who earns $10,000 as to a person who receives $200,000. It is important to note that monotonic transformations of the inputs do not affect the result in a tree model, like the one we used.
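The two points made above can be checked with a small sketch: a logarithm compresses later years of experience, and a tree split on a feature is unchanged by a monotonic transformation of that feature. The salary figures are made up for illustration, `best_split` is a hypothetical helper implementing a single-split tree (a decision stump), and `log1p` (the logarithm of 1 + x) is an assumption chosen here so that zero years of experience is defined.

```python
import math

def best_split(xs, ys):
    """Return the (left, right) index sets of the single split on xs that
    minimizes the summed squared error of ys around each side's mean."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])

    def sse(idx):
        mean = sum(ys[i] for i in idx) / len(idx)
        return sum((ys[i] - mean) ** 2 for i in idx)

    k = min(range(1, len(order)), key=lambda j: sse(order[:j]) + sse(order[j:]))
    return set(order[:k]), set(order[k:])

years = [0, 1, 2, 5, 10, 11, 20]
salary = [40, 55, 62, 75, 84, 85, 92]  # made-up values, in thousands of pesos

log_years = [math.log1p(y) for y in years]

# The first year counts for more than the eleventh on the log scale.
first_year_step = log_years[1] - log_years[0]
eleventh_year_step = log_years[5] - log_years[4]

# The stump partitions the people identically with raw or log-transformed years.
raw_partition = best_split(years, salary)
log_partition = best_split(log_years, salary)
```

Because the transformation preserves the ordering of the feature values, every candidate split groups the same people on both scales, so the tree learns the same partition. This argument applies to the input features only.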
Categorical data. To transform non-numerical into numerical data, we created a matrix assigning binary values to each of the categorical characteristics.
Argentina is made up of 24 jurisdictions: 23 provinces and the Autonomous City of Buenos Aires (CABA). Since the cost of living in each jurisdiction is quite uneven, salaries tend to be too. Although people from different provinces responded to the survey, some of the districts had very few answers and that made generalization difficult. So, we decided to group them into larger blocks based on the average salary for each province and cultural similarities that we expected to cause their indicators to behave similarly. Based on this analysis, we divided the provinces as follows:
Northwest: Catamarca, Jujuy, La Rioja, Salta, Santiago del Estero, Tucumán.
Northeast: Chaco, Corrientes, Entre Ríos, Formosa, Misiones.
Cuyo: Mendoza, San Juan, San Luis.
Pampean Plain: La Pampa, Santa Fe, Córdoba, Buenos Aires.
Patagonia: Chubut, Neuquén, Río Negro, Santa Cruz, Tierra del Fuego.
AMBA: CABA and part of Buenos Aires.
Therefore, for geographical data, we used one column for each region and assigned each person a binary value for each column. It is important to note that, when grouping, information is lost on provinces that had public technology development policies, such as Tierra del Fuego and San Luis.
This approach had the advantage of accounting for features that are not mutually exclusive, such as programming languages: multiple options could be selected in the question and that information could be reflected in the matrix. Consequently, a man from CABA, who uses Java and JavaScript in his work, will be represented by the values shown in Table 2.
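A minimal sketch of this encoding is shown here. The column names are hypothetical and cover only a handful of options; they are not the actual columns of Table 2 or of the Sysarmy dataset.

```python
# Hypothetical subset of the binary columns; the real matrix has many more.
COLUMNS = [
    "gender_male", "gender_female",
    "region_AMBA", "region_Cuyo",
    "lang_Java", "lang_JavaScript", "lang_Python",
]

def encode(person):
    """Turn categorical answers into a row of 0/1 values, one per column."""
    active = {f"gender_{person['gender']}", f"region_{person['region']}"}
    # Programming languages are not mutually exclusive: several columns can be 1.
    active |= {f"lang_{lang}" for lang in person["languages"]}
    return [1 if column in active else 0 for column in COLUMNS]

row = encode({"gender": "male", "region": "AMBA", "languages": ["Java", "JavaScript"]})
# row is [1, 0, 1, 0, 1, 1, 0]
```

Gender and region each activate exactly one column, while the language columns can hold several ones at once, which is what lets the matrix represent multiple-choice answers.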
Sample characterization
In order to adequately describe the original sample, we chose certain features of the Sysarmy form that we considered most relevant (some of which would later be used to interpret the explained part of the GPG, see Phase IV), and we analyzed the wage distribution according to these features: age, career, level of education attained, institution where the person studied, specific occupation within the company, number of employees in the company, number of dependents (workers they coordinate), years of experience, and geographical distribution.
Phase II. Creating the salary predictor
Then, we moved towards the construction of the model itself: from the actual data collected, we wanted to build a model that would be able to estimate wages from new inputs. The construction of a model requires two steps: (1) training and (2) prediction. In the training stage, the model received data with labels that represent the ‘correct’ result, in this case, the salary. In the prediction step, the ‘correct’ result was unknown and the model had to predict it. It would be very easy to make a model that memorizes the ‘correct’ answers and gives them as outputs whenever it is asked, but despite being able to perfectly predict these values, such a model would have little predictive capacity in the face of unknown data that do not exactly match those from which it learned.
So, what we needed was to train the model with one set of known data and evaluate it on another set of data, also known, but that the model had never ‘seen’. This would allow us to know how good the model was. Since we had relatively little data, we chose a technique called k-fold cross-validation (a generalization of the leave-one-out method), which consists of dividing the data into several groups and training the model many times. Each time, a different group is excluded from the training and used for the prediction. In this way, if we cross-validate with five groups, we train five different models (models A1 to A5), each with four fifths of the data, and we ask each model to infer the data, that is, predict the salary, of the remaining group. Thus, once the process is finished, we have a prediction for each value, made by the model that did not ‘see’ that data point when training, and we can estimate how close the predictions were to the ‘correct’ result through the coefficient of determination (R2). For example, if one person earns $100 and the model estimates $110, and another earns $200 and the model predicts $180, then R2 = 1 − (10² + 20²)/(50² + 50²) = 0.9, that is, one minus the ratio of the squared prediction errors to the squared deviations of the actual salaries from their mean ($150). The best possible coefficient is 1, which means that all the predictions were correct.
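The procedure can be sketched as follows. This is an illustration rather than the study's code: `k_fold_predictions` assigns rows to groups by position (a simplification of whatever grouping was actually used), and `fit_mean` is a placeholder model that always predicts the mean training salary.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def k_fold_predictions(rows, targets, fit, k=5):
    """Predict every row with a model trained on the other k-1 groups."""
    predictions = [None] * len(rows)
    for fold in range(k):
        held_out = [i for i in range(len(rows)) if i % k == fold]
        training = [i for i in range(len(rows)) if i % k != fold]
        model = fit([rows[i] for i in training], [targets[i] for i in training])
        for i in held_out:
            predictions[i] = model(rows[i])
    return predictions

def fit_mean(train_rows, train_targets):
    """Placeholder 'model': always predicts the mean of the training targets."""
    mean = sum(train_targets) / len(train_targets)
    return lambda row: mean

# The two-person example from the text: salaries 100 and 200, predictions 110 and 180.
example_r2 = r2_score([100, 200], [110, 180])

rows = list(range(10))
targets = [2 * x for x in rows]
out_of_fold = k_fold_predictions(rows, targets, fit_mean, k=5)
baseline_r2 = r2_score(targets, out_of_fold)
```

A model that only predicts the mean scores at or below zero on held-out data, which gives a useful reference point when reading the R2 values of a real predictor.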
We used XGBoost, a model based on decision trees that tends to perform well in ML tasks, and obtained R2 for models A1-A5. The code is publicly accessible on GitHub (Waisbrot, 2022).
Then, it was time to obtain the salary predictor (model A). We created it by training on the whole dataset and using the architecture that had been validated in the previous step (with the five groups A1-A5) (Waisbrot, 2020).
Phase III. Adjusting the GPG
When sharing the salary predictor with the public through social networks to test how well it worked, some users noticed that changing gender from "male" to "female" but keeping all other variables exactly the same often led to a decrease in salary. We decided to explore this situation.
From here onwards, we worked only with the data of those individuals who had identified themselves as male or female (5,742 responses) and deliberately excluded the category "others" because, unfortunately, there were very few data (24 responses).
First, we calculated the unadjusted GPG from the actual salaries reported by the IT workers in the survey by comparing the median salaries of men and women. Then, we tried to estimate whether there were differences due exclusively to gender, that is, the adjusted GPG.
To do that, we constructed the same structure as for model A and trained it with the same data but reversing the gender for each entry, yielding five models (B1-B5) and ten expected outcomes: each person had one outcome from a model that accounts for their gender and one outcome from the model that includes their gender reversed. Thus, we learned how much the model predicted a person should be paid, keeping all the variables the same (education, experience, etc.) except gender.
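One way to read this procedure is sketched here: each person is scored twice by a trained predictor, once as reported and once with only the gender field reversed. The predictor `toy_predict` and its fixed penalty are invented purely so that the example runs; they are not the study's model or its results.

```python
def flip_gender(person):
    """Return a copy of the record with only the gender field reversed."""
    flipped = dict(person)
    flipped["gender"] = "female" if person["gender"] == "male" else "male"
    return flipped

def counterfactual_pairs(people, predict):
    """For each person: (prediction as reported, prediction with gender reversed)."""
    return [(predict(p), predict(flip_gender(p))) for p in people]

def toy_predict(person):
    # Hypothetical stand-in for a trained model, with a made-up gender penalty.
    salary = 50_000 + 3_000 * person["experience"]
    return salary - 4_000 if person["gender"] == "female" else salary

people = [
    {"gender": "female", "experience": 5},
    {"gender": "male", "experience": 5},
]
pairs = counterfactual_pairs(people, toy_predict)
# pairs is [(61000, 65000), (65000, 61000)]
```

Because everything except gender is held fixed within each pair, any difference between the two numbers is what the model attributes to gender alone.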
We also considered the difference in the sample size: we had six times more responses from men than from women. Accordingly, we created another model that compensated for this skew (model C).
The code for the construction of each model is publicly accessible on GitHub (Waisbrot, 2020).
Phase IV. Analyzing the explained part of the GPG
To analyze the explained part of the GPG, we used model A, as it considered the whole dataset, to establish which of the features the prediction model considered relevant, that is, which were the most important salary predictors. With this information, we looked at how salary varied by gender for each of these features in the original data set (characterization of the sample in Phase I). It should be noted that it is not enough just to look at the distribution of the data. We must also consider the number of responses in each group. If there were a large pay gap in a group with few people, its impact on the total gap would not be significant. Therefore, both variables were analyzed in each case through different methods of plotting numeric data.
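The point about group size can be illustrated with a short sketch that reports, for each value of a feature, both the relative median gap and the share of respondents in that group. The records and the `level` feature are invented for illustration, and `group_gaps` is a hypothetical helper, not part of the published code.

```python
from statistics import median

def group_gaps(records, feature):
    """Map each value of `feature` to (relative median gap, share of all records)."""
    groups = {}
    for record in records:
        groups.setdefault(record[feature], []).append(record)
    result = {}
    for value, rows in groups.items():
        men = [r["salary"] for r in rows if r["gender"] == "male"]
        women = [r["salary"] for r in rows if r["gender"] == "female"]
        if not men or not women:
            continue  # no gap can be computed without both genders
        gap = (median(men) - median(women)) / median(men)
        result[value] = (gap, len(rows) / len(records))
    return result

records = [
    {"gender": "male", "salary": 100, "level": "senior"},
    {"gender": "female", "salary": 80, "level": "senior"},
    {"gender": "male", "salary": 60, "level": "junior"},
    {"gender": "male", "salary": 60, "level": "junior"},
    {"gender": "female", "salary": 60, "level": "junior"},
    {"gender": "female", "salary": 60, "level": "junior"},
]
gaps = group_gaps(records, "level")
```

In this toy data the senior group shows a 20% gap but holds only a third of the respondents, so reading the gap without the share would overstate its weight in the total.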
Results and Discussion
From the data collected in the survey, we obtained 820 responses from women, with a median salary of $62,050, and 4,922 from men, with a median salary of $77,000. Therefore, an unadjusted GPG of approximately 20% was obtained in favor of men. In other words, women were paid about $0.80 for each $1 paid to men.
Using XGBoost, we obtained an average R2 of 0.5175 for models A1-A5 (in other words, our models could explain more than half of the variance in salary), and an average R2 of 0.6244 for model B.
Model B predicted a median salary of $74,243 for women and of $80,492 for men, that is, an adjusted GPG of 7.76%. The remaining 12.3 percentage points of the 20% unadjusted gap can be attributed to factors other than gender (explained part).
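The arithmetic behind these figures can be reproduced from the reported medians. The sketch assumes that each gap is expressed relative to the male median and that the explained part is the simple difference between the unadjusted and adjusted gaps; `relative_gap` is a hypothetical helper name.

```python
def relative_gap(male_median, female_median):
    """Gap as a fraction of the male median."""
    return (male_median - female_median) / male_median

# Medians reported in the text (Argentine pesos).
unadjusted = relative_gap(77_000, 62_050)  # actual salaries
adjusted = relative_gap(80_492, 74_243)    # model B predictions
explained = unadjusted - adjusted

print(f"unadjusted: {unadjusted:.1%}, adjusted: {adjusted:.2%}, explained: {explained:.1%}")
```

Before rounding, the unadjusted gap is about 19.4%, which the text rounds to 20%. The 12.3-point explained part follows from the rounded figures (20 minus 7.7), while the unrounded difference is closer to 11.7 points.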
The salary distributions for males (orange) and females (blue) were plotted as violin plots, both for the actual data collected and for the results obtained with model B. Violin plots make it possible to visualize both summary statistics and the probability density of the data at different values: for each group, the medians, the peak of each distribution (in both graphics, the orange distribution for male salaries peaks at higher values), and the areas can be compared (the bigger the difference between the orange and blue surfaces, the bigger the gap) (Figure 1).
If instead of calculating the difference in medians, we calculate the average of the point-to-point difference between the predicted wage for each individual considering their gender as female or male and normalizing it by the male wage, then, the difference turns out to be 6.92%.
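The difference between the two summary statistics can be seen in a small sketch. The predicted wages are made-up numbers, and both function names are hypothetical; the point is only that a gap between medians and an average of per-person gaps need not coincide.

```python
from statistics import fmean, median

def median_gap(as_male, as_female):
    """Gap between the medians of the two sets of predictions."""
    return (median(as_male) - median(as_female)) / median(as_male)

def mean_pointwise_gap(as_male, as_female):
    """Average, over individuals, of the gap normalized by that person's male wage."""
    return fmean([(m - f) / m for m, f in zip(as_male, as_female)])

# Made-up predictions for three people, each scored as male and as female.
as_male = [100, 200, 400]
as_female = [90, 190, 360]

gap_from_medians = median_gap(as_male, as_female)             # about 0.05
gap_point_to_point = mean_pointwise_gap(as_male, as_female)   # about 0.083
```

The median-based figure depends only on the person in the middle of each distribution, whereas the point-to-point figure uses every individual, so the two can differ even when computed from the same predictions.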
It is interesting to note that model B seemed to have learned that the greater a person's experience, the greater the difference between men and women (Figure 2). This could be interpreted as the well-known combination of ‘sticky floors’ and ‘glass ceiling’ (Ciminelli et al., 2021).
Accounting for the sample size disparity
There were approximately six times more data from men than from women (4,922 vs. 820 responses, respectively). This disparity brought a problem with it: because women contributed far fewer records to the training, the model was penalized less for making a mistake with women than with men, which made it less fair to women. To compensate for this asymmetry, we built model C.
The architecture of model C was similar to that of model B (gender reversal), but in order to train it, we "cloned" each woman five times so that each of them was worth six. By giving more "weight" to the data collected from women, the model became fairer, penalizing errors equally for men and women. Thus, the adjusted GPG decreased from 6.92% to 5.77%. Why do we say that the model is fairer? Because if we use it to estimate the salary a woman should receive in a company, it will show that, instead of paying her 6.92% less, we should pay her “only” 5.77% less than men. There is still a gap due to gender alone, but now it is smaller.
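The cloning step can be sketched with the response counts reported above. `clone_women` is a hypothetical helper, and the records carry only a gender field because the aim is just to show the resulting balance.

```python
def clone_women(records, extra_copies=5):
    """Repeat each woman's record so that it appears 1 + extra_copies times."""
    balanced = []
    for record in records:
        copies = 1 + extra_copies if record["gender"] == "female" else 1
        balanced.extend([record] * copies)
    return balanced

# Counts reported in the text: 820 women and 4,922 men.
records = [{"gender": "female"}] * 820 + [{"gender": "male"}] * 4922
balanced = clone_women(records)

n_women = sum(r["gender"] == "female" for r in balanced)
n_men = len(balanced) - n_women
# n_women is 4920 and n_men is 4922
```

After cloning, the two groups contribute almost the same number of rows to the training error. Many libraries offer per-row weights as an alternative, and giving each woman a weight of six would be expected to have a similar effect without duplicating data.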
Salary predictors: the explained part of the GPG
To estimate salaries, prediction model A takes as its main characteristics years of experience, degree, number of employees in the company, profession, age, level of education attained, college attended, whether the degree was finished, and number of people the person supervises (Figure 3).
Since these characteristics seem to be the most important contributors to a person's salary, we decided to analyze them for the original sample data to better understand the explained part of the GPG.
Features that significantly contribute to the GPG
Experience. The main predictor of salary was the experience of the person: the more experienced, the higher the pay. However, as experience increases, the gap between men and women widens, and there are fewer women with 10 or more years of experience (Figure 4). Given this, the contribution of experience to the explained GPG is significant.
Degree. In Figure 5, we see that the gap favors men in almost all careers. The most feminized are graphic design and bachelor's degree in administration, where the gap is less significant.
Level of education attained. For each level of study, we see that the income distribution of men is equal to or higher than that of women as shown in Figure 6. The only exception is in the "Secondary in progress" category. However, again, the limited number of data (there are only three women in that category) makes it difficult to draw a conclusion in this regard and, moreover, its influence on the total gap is very low. A relevant detail is that even though, proportionally, more women have completed university or higher education, the distribution of salaries at all levels favors men.
Age. According to the distribution of salaries by age, the GPG is small in all groups except the last one (older than 39), where the highest proportion of members is found (Figure 7). The number of employed women decreases significantly above the age of 30. This result reflects known trends linked to childbearing and shows the robustness of the sample, despite its being self-selected.
Features that do not significantly contribute to the GPG
Number of employees in the company. Larger companies, with 2,000 employees or more, are the largest contributors to the wage gap but there is no preference in terms of company size by gender.
Number of dependents. As for the number of dependents, the vast majority of people have no dependents and the differences in other groups do not seem significant. Therefore, we could think that the contribution to the unadjusted GPG is not significant.
Profession. In terms of profession, most respondents were developers, and in this group the gender pay gap is not evident. However, the well-known "segregation" emerges: QA and UX are the most feminized occupations, in contrast with SysAdmin and DevOps.
Conclusions
In this paper, we proposed a decomposition approach based on a machine learning model to find out the value of the adjusted and unadjusted GPG among a population of 5,742 Argentinean IT-related workers.
From our analysis, based on the current data, there is a GPG of 20%, of which 7.7 percentage points can be explained exclusively by direct discrimination (adjusted GPG), while 12.3 percentage points can be attributed to other factors, such as total years of experience, degree, level of education attained and age.
We also found evidence of glass ceiling, sticky floor and segregation phenomena and inferred that the influence of age on GPG has a direct correlate with motherhood.
Our proposal has certain limitations, of course: it is possible that the model does not sufficiently fit the data, that there are variables that were not taken into account (because there was no control over the features that were incorporated in the Sysarmy form) and, above all, that there are selection biases given that the sample was not chosen by statistical methods but was self-selected. This self-selection could be the cause of the observed differences. Moreover, one problem of using this model is the fact that the wage distribution is skewed. This, in the future, could be improved by changing the architecture.
In any case, our results are consistent with those in the literature (Blau and Kahn, 2017: 789; European Commission, 2018) and complement results obtained by other research groups (Töpfer and Brieland, 2022). Unlike classic models, ML models make it possible to work with heterogeneous samples and to handle a large number of interactions at once, thus providing new insights for GPG analysis. They are a helpful tool for a pressing problem that must be tackled from all possible approaches with one main objective: to design appropriate policies for reducing and, eventually, closing the gap.