Querying large amounts of data

The example - Imports of the EU

Let’s say, we want to know which is the top exporter of the EU in each product class of the Harmonic System. This could be useful, when thinking about dependencies between countries, e.g. when evaluating the usefulness or impact of Regulation such as the EUDR Regulation. Quoting from the ITC: “Under the regulation that entered into force on 29 June 2023, any operator or trader placing [specific] commodities on the EU market or exporting them from the EU, has to be able to prove that the goods do not originate from deforested land (cutoff date 31 December 2020) or contribute to forest degradation”. One question that arises would be, what are the impacts of this regulation on other countries and which of these is particularly important to the EU?

Let’s replicate some of their numbers, e.g. the relevance to the EU indicator from the map of the ITC spotlight cited above.

What is the share of one country in the EU’s imports of a product?

Remember, to get Comtrade data, we need the Commodity Codes, the respective iso3 codes for the countries and a time frame.

1. What are the commodity codes?

First we need to find out, which HS codes, or which goods are affected. We look at Annex 1 of the Regulation. This gives us more or less the following list of commodities for the product class ‘Wood’.

wood <-
  c(
    "4401",
    "4402",
    "4403",
    "4404",
    "4405",
    "4406",
    "4407",
    "4408",
    "4409",
    "4410",
    "4411",
    "4412",
    "4413",
    "4414",
    "4415",
    "4416",
    "4417",
    "4418",
    "4419",
    "4420",
    "4421",
    "940330",
    "940340",
    "940350",
    "940360",
    "940391",
    "940610",
    "4701",
    "4702",
    "4703",
    "4704",
    "4705",
    "470610", 
    "470691", 
    "470692", 
    "470693",
    "48",
    "49",
    "9401"
  )
wood_df <- data.frame(cmd_code = wood, product_cat = "wood")

2. Which are the countries?

First we need a list of the EU27 countries as iso3 codes. I use the giscoR package to get this information quickly.

eu_countries <- giscoR::gisco_countrycode |> 
  filter(eu == T) |> 
  pull(ISO3_CODE)

Secondly, we need a vector for all other countries, but we can get this easily by just specifying all_countries in the parameter to the partner argument.

Getting the data

To calculate the relevance of a product category to the EU, we need to calculate the following:

\(\text{Import Relevance to EU in %} = \frac{\sum_{c \in \text{EU27}} I_{c,p}}{\sum_{c \in \text{EU27}} W_{c,p}}\)

Where:

\(I_{c,p}\): Imports of EU country c in product category \(p\) from a specific non-EU country.
\(W_{c,p}\): Total imports of EU country \(c\) in product category \(p\) from the world.

In plain English, we need to know how much of a given product comes from one country and which share of the total imports it has.

Let’s get \(I_{c,p}\) first.

EU imports from all countries

If you were to execute this query, you will get a return object that contains exactly 100k rows and a warning, that if you did not intend to get exactly 100k rows, you have most likely hit the limit.

This seems natural in this case, as we are trying to get data on about 39 commodity classes for 27 EU countries as reporters and about 190 partners in 4 years. This equals to potentially 800k rows.

data_eu_imports <- ct_get_data(
    commodity_code = wood,
    reporter = eu_countries,
    partner = "all_countries",
    flow_direction = "import",
    start_date = 2018,
    end_date = 2022
  )

Let’s break this down into a simple loop. We could iterate over years, but that would only get us down to about 200k rows for each year. Since we have 250 calls per day, let’s just iterate over each eu country and get the data separately.

## initiate a new instance of an empty tibble()
data_eu_imports <- data.frame()

for(reporter in eu_countries){
  ## for a simple status, print the country we are at 
  ## you can get a lot fancier with the library `progress` for progress bars
  print(reporter)
  
  ## assign the result into a temporary object
  temp <- ct_get_data(
    commodity_code = wood,
    reporter = reporter,
    partner = "all_countries",
    flow_direction = "import",
    start_date = 2018,
    end_date = 2022
  ) 
  
  ## bind the subset to the complete data
  data_eu_imports <- rbind(data_eu_imports, temp)
  
  ## note that I did not include any sleep() command here to make the requests
  ## wait for a specified amount of time, the package keeps track of that for 
  ## you automatically and backs off when needed
}

nrow(data_eu_imports)
#> [1] 173358

Congratulations, you have now queried 170k rows of data, exceeding the limit of one call to the API!

Getting the data for imports from the World, \(W_{c,p}\), should be easy now!

data_eu_imports_world <- ct_get_data(
    commodity_code = wood,
    reporter = eu_countries,
    partner = "World",
    flow_direction = "import",
    start_date = 2018,
    end_date = 2022
  )

nrow(data_eu_imports_world)
#> [1] 5004

Data cleaning

#> `summarise()` has grouped output by 'partner_iso', 'partner_desc', 'flow_desc',
#> 'product_cat'. You can override using the `.groups` argument.

data_eu_imports_world_clean <- data_eu_imports_world |>
  left_join(wood_df, by = "cmd_code") |>
  select(
    reporter_iso,
    reporter_desc,
    flow_desc,
    partner_iso,
    partner_desc,
    cmd_code,
    cmd_desc,
    product_cat,
    primary_value,
    ref_year
  ) |>
    ## we now aggregate the imports by the product category and year 
    ## over all reporters, since we are interested 
    ## in the imports of the whole EU
  group_by(product_cat, ref_year) |>
  summarise(eu_import_product_cat_world = sum(primary_value)) |>
  ungroup()
#> `summarise()` has grouped output by 'product_cat'. You can override using the
#> `.groups` argument.

#### relevance to EU
relevance <- data_eu_imports_clean |> 
  left_join(data_eu_imports_world_clean) |> 
  ## join the two datasets
  mutate(relevance_eu = eu_import_product_cat/
           eu_import_product_cat_world*100) |> 
  ## calculate the ratio between world imports and imports from one partner |> 
  select( -flow_desc) |> ungroup()  
#> Joining with `by = join_by(product_cat, ref_year)`

Let’s see who has the biggest share in the EU import market for wood (excluding EU countries).

top_10 <- relevance |> 
  filter(!partner_iso %in% eu_countries) |> 
  group_by(ref_year) |> 
  slice_max(order_by = relevance_eu, n = 10) |> 
  select(partner_desc, relevance_eu, ref_year) |> 
  arrange(desc(ref_year))

head(top_10, 10)
#> # A tibble: 10 × 3
#> # Groups:   ref_year [1]
#>    partner_desc       relevance_eu ref_year
#>    <chr>                     <dbl>    <int>
#>  1 China                     7.77      2022
#>  2 USA                       2.03      2022
#>  3 Brazil                    1.96      2022
#>  4 United Kingdom            1.95      2022
#>  5 Türkiye                   1.19      2022
#>  6 Ukraine                   1.07      2022
#>  7 Russian Federation        0.971     2022
#>  8 Switzerland               0.876     2022
#>  9 Norway                    0.799     2022
#> 10 Indonesia                 0.656     2022

Let’s also do a little sanity check. When summing up all the shares over one year, we should get to 100 %. This seems to hold.

relevance |> ungroup() |> 
  group_by(ref_year) |> 
  summarise(sum = sum(relevance_eu))
#> # A tibble: 5 × 2
#>   ref_year   sum
#>      <int> <dbl>
#> 1     2018  100.
#> 2     2019  100.
#> 3     2020  100.
#> 4     2021  100.
#> 5     2022  100.

Maybe the last year was an outlier? We can calculate the mean relevante a country had in the past 5 years and get the most important countries again.

average_share <- relevance |> 
  filter(!partner_iso %in% eu_countries) |> 
  group_by(partner_iso, partner_desc) |>
  summarise(mean_relevance_eu = mean(relevance_eu)) |> 
  ungroup() |> 
  slice_max(order_by = mean_relevance_eu, n = 10)
#> `summarise()` has grouped output by 'partner_iso'. You can override using the
#> `.groups` argument.

ggplot(average_share) +
  geom_col(aes(reorder(partner_desc,mean_relevance_eu), mean_relevance_eu), 
           fill = 'brown') +
  coord_flip() + 
  theme_minimal() +
  labs(title = 'Share of EU imports', 
       subtitle = 'as average over last 4 years') +
  xlab('average relevance in %')+
  ylab('')