Scraping financial statements of Slovak financial entities

It is always interesting to go back to your older projects. You can spot how your coding style is evolving and how you, as a programmer, are improving. Recently, I had to go through the code of one of my first projects in R and boy, was it a mess. It was supposed to download the financial statements of all the businesses in Slovakia for a certain year. It worked, barely. But trying to understand the code was a pain. No functions, random variable names, no documentation… There was even a public API, but for some reason (and I am sure that at the time it seemed like a “great” one), I decided not to use it and scraped the web pages directly instead.

So I decided this was a great opportunity to see how I would do things now, to finally do this project some justice, and to banish the abomination that I once called code.

Public API

First of all, we have to take a look at the public API's endpoints and data model. Its full documentation can be found here. From the data model, we are interested in accounting entities and financial statements. The documentation also shows the expected output of each endpoint, which we will use when preparing our database tables.

Project structure

This time, things should be done right, which is why I have decided to create an R package. This first step gives the project an initial structure. It also gives the developer greater control over which packages are used, which functions are exported, and a lot more. All these great features come at almost no cost, so there is no reason not to create an R package. If you haven't created an R package before, Hadley has a great tutorial on how to do things right.
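For reference, scaffolding the package is a one-liner with usethis (the package name and path below are just placeholders):

# Create the package skeleton; "finstat" is a placeholder name
usethis::create_package("~/projects/finstat")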

In the previous project, the data were saved as .csv files which were then manually concatenated. Now we want to store the downloaded data properly. Because of the relational nature of the data, we will use a PostgreSQL database.

Preparing database

After installing PostgreSQL, we need to create the tables where we will store our data. All the tables can be created with the following script:

-- Links financial statements to their financial reports
CREATE TABLE financial_report_for_financial_statement (
    id serial PRIMARY KEY,
    id_financial_statement VARCHAR (10),
    id_financial_report VARCHAR (10)
);

-- Title page of a report: entity details and reporting periods
CREATE TABLE financial_report_title (
    id serial PRIMARY KEY,
    id_financial_report VARCHAR (10),
    nazovUctovnejJednotky VARCHAR (500),
    ico VARCHAR (8),
    dic VARCHAR (10),
    adresa VARCHAR (500),
    skNace VARCHAR (100),
    typZavierky VARCHAR (30),
    obdobieOd DATE,
    obdobieDo DATE,
    predchadzajuceObdobieOd DATE,
    predchadzajuceObdobieDo DATE,
    datumSchvalenia DATE,
    datumZostavenia DATE,
    oznacenieObchodnehoRegistra VARCHAR (100),
    type VARCHAR (3)
);

-- Report metadata: data source, accessibility, and template used
CREATE TABLE financial_report_base (
    id serial PRIMARY KEY,
    id_financial_report VARCHAR (10),
    pristupnostDat VARCHAR (20),
    datumPoslednejUpravy DATE,
    zdrojDat VARCHAR(20),
    kodDanovehoUradu VARCHAR(20),
    idSablony VARCHAR(20)
);

-- Dynamically generate the wide numeric tables (columns V1..Vn),
-- one set for the "muj" template and one for the "pod" template
DO language 'plpgsql'
$$
DECLARE mujAkt text := 'CREATE TABLE financial_report_assets_muj(id serial PRIMARY KEY, id_financial_report varchar(10),'
	|| string_agg('V' || i::text || ' integer', ',') || ');'
	FROM generate_series(1,46) As i;
DECLARE mujPas text := 'CREATE TABLE financial_report_lae_muj(id serial PRIMARY KEY, id_financial_report varchar(10),'
	|| string_agg('V' || i::text || ' integer', ',') || ');'
	FROM generate_series(1,44) As i;
DECLARE mujZS text := 'CREATE TABLE financial_report_is_muj(id serial PRIMARY KEY, id_financial_report varchar(10),'
	|| string_agg('V' || i::text || ' integer', ',') || ');'
	FROM generate_series(1,76) As i;

DECLARE podAkt text := 'CREATE TABLE financial_report_assets_pod(id serial PRIMARY KEY, id_financial_report varchar(10),'
	|| string_agg('V' || i::text || ' integer', ',') || ');'
	FROM generate_series(1,312) As i;
DECLARE podPas text := 'CREATE TABLE financial_report_lae_pod(id serial PRIMARY KEY, id_financial_report varchar(10),'
	|| string_agg('V' || i::text || ' integer', ',') || ');'
	FROM generate_series(1,134) As i;
DECLARE podZS text := 'CREATE TABLE financial_report_is_pod(id serial PRIMARY KEY, id_financial_report varchar(10),'
	|| string_agg('V' || i::text || ' integer', ',') || ');'
	FROM generate_series(1,122) As i;


BEGIN
	EXECUTE mujAkt;
	EXECUTE mujPas;
	EXECUTE mujZS;

	EXECUTE podAkt;
	EXECUTE podPas;
	EXECUTE podZS;
END;
$$ ;
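Once the script has run, a quick sanity check from R confirms the tables exist (the connection settings match the ones used later in this post):

con <- DBI::dbConnect(
  RPostgres::Postgres(),
  dbname = "postgres", host = "127.0.0.1", port = 5432,
  user = "postgres", password = "admin"
)
DBI::dbListTables(con)
# Should list financial_report_base, financial_report_title, ...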

After running the script, it's time to prepare the code that will fill the tables with data.

Communication with API

First, we create functions that will communicate with the API. The natural way to communicate with an API would be through the usual HTTP request methods. However, that does not work here: if you try httr::GET, it will fail. Instead, we can use the readLines function from base R, which returns the raw content of the page, and then transform it into a nice object using rjson::fromJSON.

We can wrap this in a nice little function, as we will use it a lot.

getObjectFromURL <- function(url) {
  # Read the raw page content and parse the JSON payload
  readLines(url, warn = FALSE, encoding = "UTF-8") %>%
    rjson::fromJSON()
}
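As a quick illustration, pointing it at the report-detail endpoint used later in this post returns a parsed list (the id here is made up):

report <- getObjectFromURL("http://www.registeruz.sk/cruz-public/api/uctovny-vykaz?id=1")
str(report, max.level = 1)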

To create URLs with query parameters, we could use paste for each endpoint with all its parameters; however, that would be error-prone and would get messy quickly. Instead, we can use another small utility function that prepares the URL.

createUrl <- function(endpoint, ..., baseUrl = "http://www.registeruz.sk/cruz-public/api") {
  # Turn the ... arguments into "name=value" query parameters
  params <- list(...)
  params <- paste(names(params), params, sep = "=")
  if (length(params) == 0) {
    paste0(baseUrl, endpoint)
  } else {
    paste0(baseUrl, endpoint, "?", paste0(unlist(params), collapse = "&"))
  }
}
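For example:

createUrl("/uctovne-vykazy", "zmenene-od" = "2020-01-01", "max-zaznamov" = 1000)
# [1] "http://www.registeruz.sk/cruz-public/api/uctovne-vykazy?zmenene-od=2020-01-01&max-zaznamov=1000"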

The last function we will use for downloading data is a little bit hacky. While batch downloading, I noticed that querying some URLs throws an error, but querying the same URL again returns the data. So this function tries to query a URL a few times and returns NULL only if every attempt fails.

tryUntilSuccess <- function(url, numberOfTries = 20, fun = getObjectFromURL) {
  if (numberOfTries == 0) {
    # All attempts exhausted; warn and give up with NULL
    warning("Unable to read from: ", url)
    NULL
  } else {
    # On error, recurse with one fewer attempt remaining
    res <- tryCatch(
      fun(url),
      error = function(x) {tryUntilSuccess(url, numberOfTries - 1, fun)}
    )
    res
  }
}
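For example, a URL that keeps failing (say, a non-existent path) ends up as NULL after the retries are exhausted:

res <- tryUntilSuccess("http://www.registeruz.sk/does-not-exist", numberOfTries = 3)
# Warning message: Unable to read from: http://www.registeruz.sk/does-not-exist
is.null(res)  # TRUE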

After putting all of this together, we can create nice small functions for communicating with the endpoints.

getChangedFinancialReports <- function(from = "2019-01-01", maxNumber = 10000, afterId = 1) {
  createUrl("/uctovne-vykazy", "zmenene-od" = from, "max-zaznamov" = maxNumber, "pokracovat-za-id" = afterId) %>%
    tryUntilSuccess()
}

getFinancialReportDetails <- function(id) {
  createUrl("/uctovny-vykaz", id = id) %>%
    tryUntilSuccess()
}
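The final snippet of this post also uses getAllChangedFinancialReports, whose definition is not shown here. A minimal sketch of such a paging wrapper could look like the following, assuming the list endpoint returns the report IDs in an id field and signals further pages with an existujeDalsieId flag (both field names are assumptions based on the API documentation):

getAllChangedFinancialReports <- function(from = "2019-01-01") {
  ids <- c()
  afterId <- 1
  repeat {
    res <- getChangedFinancialReports(from = from, afterId = afterId)
    if (is.null(res) || length(res$id) == 0) break
    ids <- c(ids, unlist(res$id))
    # Stop when the API reports no further pages (assumed field name)
    if (!isTRUE(res$existujeDalsieId)) break
    afterId <- max(unlist(res$id))
  }
  data.frame(result.id = ids)
}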

Database communication

Next, we need to prepare the functions that will save the downloaded data to the database. We pretty much only need a function that manages the database connection and one that appends data to the tables. The database connection is defined as a global variable here, as I wanted something similar to a singleton pattern. I am not fully satisfied with this solution, as it lacks encapsulation, so I will probably look into better approaches in the future.

DB_CONNECTION <- NULL


getDBConnection <- function() {
  # Connection settings; in a real deployment these should come from
  # a config file or environment variables rather than being hardcoded
  db <- "postgres"
  host_db <- "127.0.0.1"
  db_port <- "5432"
  db_user <- "postgres"
  db_password <- "admin"
  if (is.null(DB_CONNECTION)) {
    message("Opening new DB connection")
    DB_CONNECTION <<-
      DBI::dbConnect(
        RPostgres::Postgres(),
        dbname = db,
        host = host_db,
        port = db_port,
        user = db_user,
        password = db_password
      )
  }

  DB_CONNECTION
}


appendIfMissing <- function(x, tableName, con) {
  # Fetch the current table contents and append only rows not present yet
  res <- glue::glue_sql("SELECT * FROM {`tableName`}", .con = con) %>%
    RPostgres::dbGetQuery(con, .)
  # Postgres folds unquoted identifiers to lower case
  names(x) <- tolower(names(x))
  # Compare everything as character to avoid type-mismatch false negatives
  toAppend <- dplyr::anti_join(toCharDF(x), toCharDF(res[names(x)]), by = names(x))

  if (nrow(toAppend) > 0) {
    nrOfAppendedRows <- RPostgres::dbAppendTable(con, tableName, toAppend, row.names = NULL)
    usethis::ui_info(paste("Appended missing values to table:", nrOfAppendedRows))
  }
}
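As a quick illustration (with a hypothetical row), appending the same data twice only writes it once:

con <- getDBConnection()
x <- data.frame(
  id_financial_statement = "123",
  id_financial_report = "456",
  stringsAsFactors = FALSE
)
appendIfMissing(x, "financial_report_for_financial_statement", con)
appendIfMissing(x, "financial_report_for_financial_statement", con)  # no-op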

Besides these functions, which make our communication with the DB easier, we also need to prepare the data to be appended to the database. We are going to split each financial report into several parts: the base part, which specifies where the data comes from and which template was used; the title page of the report, which contains information about the accounting entity; and finally the assets, the liabilities and equity, and the income statement of the entity.

For these things, we will use the following functions:

getBaseFinReport <- function(finReport) {
  # Keep the report's metadata fields and attach the report id
  finReport[!names(finReport) %in% c("prilohy", "obsah", "idUctovnejZavierky", "id")] %>%
    as.data.frame(stringsAsFactors = FALSE) %>%
    cbind(id_financial_report = finReport$id)
}

getTitleFinReport <- function(finReport) {
  .normalizeDate <- function(x) {
    if (!is.null(x) && !is.na(x)) addDayToDate(x) else x
  }

  title <- finReport$obsah$titulnaStrana
  title$adresa <- paste(title$adresa, collapse = " ")
  # Derive the bookkeeping type from the width of the assets table:
  # 47 columns means the "muj" template, 313 columns the "pod" template
  if (ncol(getAssetsFinReport(finReport)) == 47) {
    title$type <- "MUJ"
  } else if (ncol(getAssetsFinReport(finReport)) == 313) {
    title$type <- "POD"
  }

  title <- title %>% as.data.frame(stringsAsFactors = FALSE)
  title$obdobieOd <- .normalizeDate(title$obdobieOd)
  title$obdobieDo <- .normalizeDate(title$obdobieDo)
  title$predchadzajuceObdobieDo <- .normalizeDate(title$predchadzajuceObdobieDo)
  title$predchadzajuceObdobieOd <- .normalizeDate(title$predchadzajuceObdobieOd)

  cbind(id_financial_report = finReport$id, title)
}

getNumericReportData <- function(finReport, position) {
  # Each report contains a list of tables; flatten the selected one
  # into a single-row data.frame of numeric values
  finReport$obsah$tabulky[[position]]$data %>%
    as.numeric() %>%
    t() %>%
    as.data.frame() %>%
    cbind(id_financial_report = finReport$id)
}

getAssetsFinReport <- function(finReport) {
  getNumericReportData(finReport, 1)
}

getLAEFinReport <- function(finReport) {
  getNumericReportData(finReport, 2)
}

getISFinReport <- function(finReport) {
  getNumericReportData(finReport, 3)
}

# Periods come without a day component; append one so Postgres accepts them as DATE
addDayToDate <- function(x, day = "01") {
  paste(x, day, sep = "-")
}

# Coerce all columns to character so anti_join compares values consistently
toCharDF <- function(x) {
  x[] <- lapply(x, as.character)
  x
}
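For example, for a downloaded report (the id is made up), the width of the assets part reveals which template was used:

report <- getFinancialReportDetails(123456)  # hypothetical id
ncol(getAssetsFinReport(report))  # 47 for "MUJ" reports, 313 for "POD" reports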

You might have noticed that getTitleFinReport also adds a type column to the data.frame. Financial entities can do their bookkeeping in two different styles: double-entry bookkeeping and simple bookkeeping. As we want to download both styles, we need to differentiate between them and save each to its respective database tables. Since there is no field in the response that tells us which type of bookkeeping is used, we derive it from the number of fields in the assets part of the JSON.

And this is pretty much it. We only need to combine these functions, and we are ready to save a financial report into the database with the following function.

appendFinancialReport <- function(finReport) {
  # Route the numeric tables to the _muj or _pod variant based on report type
  .appendType <-
    function(x, finReportTitle) {
      if (finReportTitle$type == "MUJ")
        paste0(x, "_muj")
      else
        paste0(x, "_pod")
    }
  message("Process report: ", finReport$id)

  con <- getDBConnection()
  finReportBase <- getBaseFinReport(finReport)
  finReportTitle <- getTitleFinReport(finReport)
  if (!is.null(finReportTitle$type)) {
    finReportsForFinStatement <- data.frame(
      id_financial_report = finReport$id,
      id_financial_statement = finReport$idUctovnejZavierky,
      stringsAsFactors = FALSE
    )

    message("Add report with id: ", finReport$id)
    tryCatch(
      {
        # One transaction per report: store it completely or not at all
        RPostgres::dbBegin(con)
        appendIfMissing(finReportBase, "financial_report_base", con)
        appendIfMissing(finReportTitle, "financial_report_title", con)
        appendIfMissing(finReportsForFinStatement, "financial_report_for_financial_statement", con)
        appendIfMissing(getAssetsFinReport(finReport), .appendType("financial_report_assets", finReportTitle), con)
        appendIfMissing(getLAEFinReport(finReport), .appendType("financial_report_lae", finReportTitle), con)
        appendIfMissing(getISFinReport(finReport), .appendType("financial_report_is", finReportTitle), con)
        RPostgres::dbCommit(con)
      },
      error = function(e) {RPostgres::dbRollback(con)}
    )
  }
}

As you can see, the database writes are wrapped in tryCatch. We do this to keep the transaction atomic: either all the data is committed to the database, or nothing happens at all. So in case of an error, dbRollback is called and the transaction is reverted.

Putting it all together

Now it is time to combine everything we have prepared and start scraping the data. Thanks to our modular design, this should be pretty easy. All we have to do now is get the IDs of all financial reports that have changed since a certain date, then download each of them and save it into the database.

getAllChangedFinancialReports("2020-05-01") %>%
  .$result.id %>%
  lapply(function(x) {
    getFinancialReportDetails(x) %>%
      appendFinancialReport()
  })

And this is it; all that remains is to wait until all the financial reports are downloaded. Afterwards, we can do further analysis of the data, but that might come in a future blog post :). The complete code for this package can be found in this GitHub repository.