--- title: "Querying large amounts of data" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{large_data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, message = FALSE, warning=FALSE} library(comtradr) library(dplyr) library(ggplot2) ``` ### The limits When you are on the free API tier, you will encounter two limits when wanting to calculate on rather large datasets: 1) you can only make 500 calls per day, 2) of which each can take up to 100.000 rows. This information might be updated in the Comtrade API, check the most up to date values [here](https://uncomtrade.org/docs/subscriptions/). So when trying to fetch the maximum amount of data in the minimum amount of days, you want to minimize the amount of requests you need by making them as comprehensive as possible, without ever exceeding the 100k row limit. ### The example - Imports of the EU Let's say, we want to know which is the top exporter of the EU in each product class of the Harmonic System. This could be useful, when thinking about dependencies between countries, e.g. when evaluating the usefulness or impact of Regulation such as the [EUDR Regulation](https://environment.ec.europa.eu/topics/forests/deforestation/regulation-deforestation-free-products_en). Quoting from the [ITC](https://tradebriefs.intracen.org/2023/11/spotlight): "Under the regulation that entered into force on 29 June 2023, any operator or trader placing [specific] commodities on the EU market or exporting them from the EU, has to be able to prove that the goods do not originate from deforested land (cutoff date 31 December 2020) or contribute to forest degradation". One question that arises would be, what are the impacts of this regulation on other countries and which of these is particularly important to the EU? Let's replicate some of their numbers, e.g. the relevance to the EU indicator from the map of the ITC spotlight cited above. *What is the share of one country in the EU's imports of a product?* Remember, to get Comtrade data, we need the Commodity Codes, the respective iso3 codes for the countries and a time frame. #### 1. What are the commodity codes? First we need to find out, which HS codes, or which goods are affected. We look at Annex 1 of the [Regulation](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32023R1115&qid=1687867231461). This gives us more or less the following list of commodities for the product class 'Wood'. ```{r} wood <- c( "4401", "4402", "4403", "4404", "4405", "4406", "4407", "4408", "4409", "4410", "4411", "4412", "4413", "4414", "4415", "4416", "4417", "4418", "4419", "4420", "4421", "940330", "940340", "940350", "940360", "940391", "940610", "4701", "4702", "4703", "4704", "4705", "470610", "470691", "470692", "470693", "48", "49", "9401" ) wood_df <- data.frame(cmd_code = wood, product_cat = "wood") ``` #### 2. Which are the countries? First we need a list of the EU27 countries as iso3 codes. I use the `giscoR` package to get this information quickly. ```{r, echo = FALSE} v_data_6 <- system.file("extdata", "vignette_data_6.rda", package = "comtradr") if (!file.exists(v_data_6)) { stop("internal vignette data set '~/extdata/vignette_data_6.rda' not found", call. = FALSE) } load(v_data_6) ``` ```{r, eval = F} eu_countries <- giscoR::gisco_countrycode |> filter(eu == T) |> pull(ISO3_CODE) ``` Secondly, we need a vector for all other countries, but we can get this easily by just specifying `all_countries` in the parameter to the partner argument. ### Getting the data To calculate the relevance of a product category to the EU, we need to calculate the following: $\text{Import Relevance to EU in %} = \frac{\sum_{c \in \text{EU27}} I_{c,p}}{\sum_{c \in \text{EU27}} W_{c,p}}$ Where: - $I_{c,p}$: Imports of EU country c in product category $p$ from a specific non-EU country. - $W_{c,p}$: Total imports of EU country $c$ in product category $p$ from the world. In plain English, we need to know how much of a given product comes from one country and which share of the total imports it has. Let's get $I_{c,p}$ first. #### EU imports from all countries If you were to execute this query, you will get a return object that contains exactly 100k rows and a warning, that if you did not intend to get exactly 100k rows, you have most likely hit the limit. This seems natural in this case, as we are trying to get data on about 39 commodity classes for 27 EU countries as reporters and about 190 partners in 4 years. This equals to potentially 800k rows. ```{r, eval = F} data_eu_imports <- ct_get_data( commodity_code = wood, reporter = eu_countries, partner = "all_countries", flow_direction = "import", start_date = 2018, end_date = 2022 ) ``` Let's break this down into a simple loop. We could iterate over years, but that would only get us down to about 200k rows for each year. Since we have 250 calls per day, let's just iterate over each eu country and get the data separately. ```{r, echo = FALSE} v_data_4 <- system.file("extdata", "vignette_data_4.rda", package = "comtradr") if (!file.exists(v_data_4)) { stop("internal vignette data set '~/extdata/vignette_data_4.rda' not found", call. = FALSE) } load(v_data_4) ``` ```{r, echo = FALSE} v_data_5 <- system.file("extdata", "vignette_data_5.rda", package = "comtradr") if (!file.exists(v_data_5)) { stop("internal vignette data set '~/extdata/vignette_data_5.rda' not found", call. = FALSE) } load(v_data_5) ``` ```{r, eval = F} ## initiate a new instance of an empty tibble() data_eu_imports <- data.frame() for(reporter in eu_countries){ ## for a simple status, print the country we are at ## you can get a lot fancier with the library `progress` for progress bars print(reporter) ## assign the result into a temporary object temp <- ct_get_data( commodity_code = wood, reporter = reporter, partner = "all_countries", flow_direction = "import", start_date = 2018, end_date = 2022 ) ## bind the subset to the complete data data_eu_imports <- rbind(data_eu_imports, temp) ## note that I did not include any sleep() command here to make the requests ## wait for a specified amount of time, the package keeps track of that for ## you automatically and backs off when needed } ``` ```{r} nrow(data_eu_imports) ``` Congratulations, you have now queried 170k rows of data, exceeding the limit of one call to the API! Getting the data for imports from the World, $W_{c,p}$, should be easy now! ```{r, eval = F} data_eu_imports_world <- ct_get_data( commodity_code = wood, reporter = eu_countries, partner = "World", flow_direction = "import", start_date = 2018, end_date = 2022 ) ``` ```{r} nrow(data_eu_imports_world) ``` #### Data cleaning ```{r, echo = F} ## we merge the product category as a variable and size down our data data_eu_imports_clean <- data_eu_imports |> left_join(wood_df, by = "cmd_code") |> select( reporter_iso, reporter_desc, flow_desc, partner_iso, partner_desc, cmd_code, cmd_desc, product_cat, primary_value, ref_year ) |> ## we now aggregate the imports by the product category, year and partners group_by(partner_iso, partner_desc, flow_desc, product_cat, ref_year) |> summarise(eu_import_product_cat = sum(primary_value)) |> ungroup() ``` ```{r} data_eu_imports_world_clean <- data_eu_imports_world |> left_join(wood_df, by = "cmd_code") |> select( reporter_iso, reporter_desc, flow_desc, partner_iso, partner_desc, cmd_code, cmd_desc, product_cat, primary_value, ref_year ) |> ## we now aggregate the imports by the product category and year ## over all reporters, since we are interested ## in the imports of the whole EU group_by(product_cat, ref_year) |> summarise(eu_import_product_cat_world = sum(primary_value)) |> ungroup() ``` ```{r} #### relevance to EU relevance <- data_eu_imports_clean |> left_join(data_eu_imports_world_clean) |> ## join the two datasets mutate(relevance_eu = eu_import_product_cat/ eu_import_product_cat_world*100) |> ## calculate the ratio between world imports and imports from one partner |> select( -flow_desc) |> ungroup() ``` Let's see who has the biggest share in the EU import market for wood (excluding EU countries). ```{r} top_10 <- relevance |> filter(!partner_iso %in% eu_countries) |> group_by(ref_year) |> slice_max(order_by = relevance_eu, n = 10) |> select(partner_desc, relevance_eu, ref_year) |> arrange(desc(ref_year)) head(top_10, 10) ``` Let's also do a little sanity check. When summing up all the shares over one year, we should get to 100 %. This seems to hold. ```{r} relevance |> ungroup() |> group_by(ref_year) |> summarise(sum = sum(relevance_eu)) ``` Maybe the last year was an outlier? We can calculate the mean relevante a country had in the past 5 years and get the most important countries again. ```{r} average_share <- relevance |> filter(!partner_iso %in% eu_countries) |> group_by(partner_iso, partner_desc) |> summarise(mean_relevance_eu = mean(relevance_eu)) |> ungroup() |> slice_max(order_by = mean_relevance_eu, n = 10) ``` ```{r,fig.width=10} ggplot(average_share) + geom_col(aes(reorder(partner_desc,mean_relevance_eu), mean_relevance_eu), fill = 'brown') + coord_flip() + theme_minimal() + labs(title = 'Share of EU imports', subtitle = 'as average over last 4 years') + xlab('average relevance in %')+ ylab('') ``` ### Caveat on trade "dependencies" However, several important caveats are to be made. I have not addressed these, because this vignette is about how to query larger data, not about trade dependencies per se. Hence a real-world example, but without all the complications that you would have to address for a valid analysis. 1) Dependency on a certain country in terms of how much of a product others import from them, does not equal dependency in terms of this product, as a country might have a substantial production of such items domestically. E.g. if (hypothetically) Argentina would import 90% of its wine from Chile that would not imply that Argentina depends on wine from Chile, if Argentina themselves produce a substantive amount of wine. 2) We included in our calculation of the total imports of the EU, imports from other EU countries, a.k.a. intra-EU imports. That would mean we probably underestimate how much of total EU imports really come from partner countries outside the EU. 3) Some of the HS codes for wood are not completely to be included per the Regulation, but I have spared us the intricacies of which ones and how to exclude them here. 4) Imports data is already better than exports data, because countries have an incentive to get good data to charge tariffs, however the best data for the EU comes from Eurostat, not Comtrade. But since this is about the Comtrade data, it will have to do.