An Algorithm for Matching Heterogeneous Financial Databases: A Case Study for COMPUSTAT/CRSP and I/B/E/S Databases

Rigorous and proper linking of financial databases is a necessary step to test trading strategies incorporating multimodal sources of information. This paper proposes a machine learning solution to match companies in heterogeneous financial databases. Our method, named Financial Attribute Selection Distance (FASD), has two stages, each of them corresponding to one of the two interrelated tasks commonly involved in heterogeneous database matching problems: schema matching and entity matching. FASD's schema matching procedure is based on the Kullback-Leibler divergence of string and numeric attributes. FASD's entity matching solution relies on learning a company distance flexible enough to deal with the numeric and string attribute links found by the schema matching algorithm and incorporate different string matching approaches such as edit-based and token-based metrics. The parameters of the distance are optimized using the F-score as cost function. FASD is able to match the joint COMPUSTAT/CRSP and Institutional Brokers' Estimate System (I/B/E/S) databases with a F-score over 0.94 using only a hundred of manually labeled company links.


Introduction
The increasing volume of data during last years arises new challenges for machine learning and databases communities focused on the efficient integration, interoperability and comparative of data originated from different sources and stored in heterogeneous databases. Financial data analysis is the domain where these problems are especially challenging given the evident relevance of the data, the ubiquitous presence of sources of information, and the inherent difficulty of dealing with noisy, spurious or missing data. The existence of different agencies or institutions independently reporting their financial data together with the use of their own concepts, vocabulary and company identifiers make the manual integration of these databases time consuming and dependent on expert knowledge. Therefore, the implementation of machine learning algorithms capable of automatically matching financial databases is almost mandatory. In this paper we will match the joint Compustat/CRSP database and the Institutional Brokers' Estimate System (I/B/E/S) History database. The joint Compustat/CRSP database contains information about the fundamentals of a company, while the I/B/E/S History database provides detailed and summary analysts' estimates of up to 26 forecast measures (earnings per share, revenue/sales, net income, among others) as well as information about the fundamental data reported by the companies. Each company in these databases is defined by a set of time-stamped records formed by a set of string and numeric attributes such as company name and fundamental data. However, the direct identification of joint Compustat/CRSP and I/B/E/S variables according to their definitions is not straightforward and more sophisticated approaches are required. In addition, each database has its own primary identifiers associated with each company or security, which makes databases merging more complicated. Therefore, two interrelated tasks need to be solved [24,33]: a schema matching (attribute matching) problem responsible for finding correspondences between attributes, and an entity matching (object matching or record linkage) problem oriented to discovering links between companies based on the relationships found by the schema matching procedure. However, the entity matching problem associated with financial databases is different from entity matching problems commonly addressed in the literature. The entity matching problem is usually defined as the problem of determining whether two entities in different databases refer to the same object. In other words, entity matching algorithms try to discover one-to-one relationships between records since they assume that an object is represented by a single record. However, in financial databases the objects are companies with multiple records in the database. Typically, each of these records in the database has a time stamp that indicates the period of time to which the data is referred.
The main goal of this work is to provide a general machine learning framework to link companies in heterogeneous financial databases. The databases are linked by maximizing the F-score to aid in the construction of predictive modeling and other applications [12,18,29]. The FASD system consists of two algorithms, each of them responsible for solving the attribute matching and company matching problems. The goal of our schema matching algorithm is to provide an automatic procedure to obtain one-to-one links between attributes in Compustat/CRSP and I/B/E/S databases without using any kind of prior knowledge to reduce as much as possible the human supervision. FASD uses the Kullback-Leibler divergence (KL) as similarity measure between fields in different databases. This is an unsupervised algorithm in which we only consider the distribution of the values of the attributes without using neither temporal nor entity matching information. Additionally, this approach takes advantage of the large amount of data available in financial databases and relies on the assumption that matching attributes should have similar probability distributions given that we expect a high overlapping between both databases. Among differ-ent approaches oriented to solve the attribute matching problem [2,10,30,28,7], Kullback-Leibler divergence was previously used to find links between continuous numeric and discrete-valued attributes [15,14] and mutual information has also been suggested to find attribute links [17,32]. However, FASD's schema matching solution is the first attempt to use Kullback-Leibler divergence to link string and numeric attributes in heterogeneous databases. To solve the company matching problem, FASD finds the optimal parameters of a generalized company matching measure capable of (i) taking into account the temporal information inherent to financial data, (ii) determining which of the links obtained by the schema matching procedure are really useful, (iii) choosing the best distance for each string attribute among three well-known string matching measures, (iv) combining string matching and token-based distances, (v) considering word order and possible bad ordering of the sequence of words, and (vi) establishing the optimal threshold that determines whether two companies are the same according to our company-similarity metric. Supervised learning of the generalized distance is especially suitable for reducing human effort and expert knowledge since our similarity function admits many different configurations that do not have to be manually tuned. Unlike our schema matching approach, the company matching algorithm is supervised, so it needs as inputs some pairs of linked and/or unlinked companies. Though there exist numerous approaches in the literature to solve the entity matching problem [20,21,9,6,13,4], they do not provide the FASD's flexibility to handle heterogeneous data types and temporal data, and they generally need to be manually tuned and configured [20,21,19].
The rest of the paper is organized as follows. Sections 2 and 3 describe our schema and entity matching algorithms. Section 4 presents the Compustat/CRSP and I/B/E/S data used in our experiments, it describes the experimental setup, and it shows the results obtained by the FASD algorithm. Finally, Section 5 states the conclusions derived from this work.

Schema Matching
Given two databases, our schema matching algorithm gives as an output a ranking of the most likely links between pairs of attributes according to their Kullback-Leibler divergence. Financial databases can be formed by both numerical and string-type fields, but only matches between fields of the same type are considered. Therefore, according to the standard taxonomy to classify schema matching techniques [2,28], FASD's matching strategy can be categorized as a instance-matching approach since it determines the similarity of attributes according to a statistical measure, but it also can be considered as a constraintbased approach as only matchings between attributes of the same type (e.g. string or numeric) are analyzed. The instance-matching approach is especially suitable for these financial databases in order to take advantage of the large amount of information available. We adopt the discrete Kullback-Leibler divergence even for continuous numerical features because of its easy implementa-tion. For string-type attributes the distribution of each attribute is obtained by considering the distribution of the words appearing in either of the two of the attributes to compare, while for numeric-type attributes the continuous probability distribution is approximated by a histogram with N evenly spaced bins, where N is a meta-parameter of the schema matching algorithm. In all cases, only non-missing data were used to obtain the histograms. Specifically, given two attributes A 1 i and A 2 j from two different databases, and their discrete probability , respectively, the Kullback-Leibler divergence can be defined as follows The KL divergence is a asymmetric measure, and we use its symmetric version expressed as follows Once the Kullback-Leibler divergences are computed for all the possible pairs of attributes of the same type (string/numeric), we apply a filter in order to avoid spurious links. In particular, we define two thresholds for the entropy of the attributes ( ent ) and the percentage of missing values ( mv ), respectively. We are not interested in matching attributes with low entropies or high percentage of missing values since they do not contain enough information to reliably estimate probability distributions. The output of FASD's first stage is a ranking of filtered attribute links sorted according to their KL divergence; the lower KL divergence, the better. FASD's schema matching algorithm is fully-automatic and does not require any kind of human effort except for selecting the final number of links. The user can keep as many links as she/he wants from the ranking of attribute links, and the entity matching algorithm described in the following section will determine which of these links are indeed useful for company matching purposes.

Company Matching
The main advantage of FASD's company matching approach is its versatility. Companies are matched as a function of a self-tuning company distance whose optimal parameters are adjusted by optimizing a cost function (metaheuristic). The cost function that we use to guide the optimization process is the F-score (or F-measure). The larger F-score, the better; thus, the FASD's objective is to maximize the F-score. The F-score is defined as the weighted harmonic mean of precision and recall measures commonly used in Information Retrieval [1]. In our problem, precision is the fraction of correspondences identified by our algorithm that are indeed correct, while recall measures the fraction of correct correspondences identified by our algorithm over all the possible correspondences. That is, where TP (True Positives) is the number of links correctly identified, FP (False Positives) is the number of links incorrectly identified, and FN (False Negatives)is the number of positive links not identified. F-score ranges between 0 and 1 where 0 and 1 correspond to total failure and perfect retrieval scenarios, respectively. Parameters maximizing the F-score are determined by a Genetic Algorithm [11] for convenience, but any other metaheuristic optimization techniques such as simulated annealing [8] may be applied. In the following subsections, we will build FASD's company distance, which can be defined by four levels in a bottom-up architecture: -Level 1 (bottom) is responsible of computing the distance between two records with the same time stamp and only considering one attribute or field. -Level 2 combines distances from Level 1 by assigning different weights to databases' fields. -Level 3 averages distances from Level 2 over the set of intersecting time stamps. -Level 4 finds the optimal threshold to determine whether or not two companies are linked.

Level 1: String Distance and Numeric Distance Functions
We implemented a general string distance that is the convex combination of a edit distance metric and a token-based (or statistical) metric. On the one hand, edit distance metrics (also known as string matching methods) quantify differences between strings as a function of the number of character insertions and deletions needed to transform one string into the other one. Examples of these metrics include the Levenshtein [22], Jaro [16], Jaro-Winkler [31], and Monge-Elkan [25,26] approaches. The success of these methods relies on the assumption that same concepts are likely to be modeled with similar names. This assumption is useful in financial matching problems as it is expected that, for example, the same company has similar names in both databases. On the other hand, statistical metrics (also known as token-based distances) are based on an underlying statistical model that somehow measure the importance of a word in a document. Jaccard similarity, cosine distance (or TF-IDF), and Soft TF-IDF are some examples of token-based metrics [3,5]. The integration of statistical distances is also beneficial since there are several terms commonly present in company names that are not so discriminative for matching purposes; for example, terms such as "corporation" and "incorporated". In addition, our string distance also takes into account the order of words in the string by assuming that first words are more relevant, and it also integrates the Monge-Elkan concept to consider possible mismatches due to bad ordering of the sequence of words.
In what follows, we assume that databases are only formed by attributes selected according to the schema matching algorithm, and these attributes (or fields) are arranged so that the first field in both databases corresponds to the first attribute link and so on. Let F m i (t) and F m j (t) two strings corresponding to the m-th field of the i-th and j-th records in both databases respectively. We assume that both records have the same time stamp t. F m i (t) and F m j (t) can be split on both set of words {w k i } Ni k=1 and {w k j } Nj k=1 , respectively. Let N = max(N i , N j ), we complete the shortest string with empty words up to have N words. In order to accelerate the learning algorithm, we implemented the Monge-Elkan concept by applying the permutation operation P l with the restriction that we only permute the first three words of each string. Then, the string distance between F m i (t) and F m j (t) can be written as follows The term S m refers to the corpus formed by all the words in all the entries of the m-th field in both databases. Parameter α m is a positive value that regulates the convex combination of the edit distance (µ edit ) and the token-based distance (µ token ), while parameter γ m is also a positive real number that adjusts the weight (importance) of each word assuming that first words are more relevant. Our edit distance allows the optimization algorithm to choose the best among three well-known edit distances namely, the Levenshtein [22], Jaro [16], and Jaro-Winkler [31] metrics, and it can be expressed as follows where d m type is a parameter that takes three discrete values to represent the three edit-distance metrics (Levenshtein, Jaro, and Jaro-Winkler). Nevertheless, it is straightforward to incorporate any other matching string distances. On the other hand, our token-based distance accounts for the context statistics by using the Inverse Document Frequency score (IDF), which determines how common is a word across all the records. We normalize the IDF score to have it in the interval [0, 1], and then make both terms in the convex combination in (3) similar in magnitude. Our token-based distance can be expressed as follows IDF(w, S) = log #records in the corpus S #records in S containing the word w + 1 .
Finally, the numeric distance between two numeric fields F m i (t) and F m j (t) with the same time stamp t is given by the exponential of the relative error between both fields. The exponential penalization guarantees that only those companies whose fundamental data are very similar are considered as possible matchings. The numeric distance is defined as follows .
Note that the numeric distance can be further optimized by weighting the exponential term by a factor. However, this would require another metaparameter to search and the current solution gives good results as shown below.

Level 2: Record Distance
Once we have defined the string and numerical distances for each individual field, we combine these magnitudes to define the distance between two records with the same time stamp t by adjusting a parameter η m ∈ [0, 1] for each field. These parameters represent the weight (or importance) of each field in the global score. Therefore, our record distance between two entries F i (t) and F j (t) with the same time stamp t and M fields is given by the weighted average over all the fields, and it can be written as follows ηm(is string(m)) · µstr(F m i (t), F m j (t)) + (4) where is string(m) is a predicate that takes the value 1 when the m-th field in both databases is string-type and 0, otherwise. It is worth noting that both the string and numeric distances take values in the interval [0, 1], and, thus, all the terms in ( 5) have the same range of values.

Level 3: Company Distance
We define the distance between two companies as a function of the distance between their records. Given that one company is defined by a set of records with an associated time stamp, the company distance is defined as the average record distance (Equation (5)) between those pairs of records sharing the same time stamp. Let F i = {F i (t)} t∈Ti and F j = {F j (t)} t∈Tj be the records corresponding to a company in the first and second databases, respectively. The FASD's company distance is given by the following expression where {T i ∩ T j } is the set of time stamps with records in both databases and | · | denotes the cardinality of this set.

Level 4: Matching Rule
According to the company distance defined by (6), we classify two companies as a match when their company distance is below a threshold θ to be automatically tuned; that is, two companies F i and F j are identified as the same company if µ comp (F i , F j ) < θ. After putting all the levels together, we end up with five parameters to be optimized. For the sake of clarity, we summarize them in Table 1.

Datasets and Experimental Evaluation
The aim of this section is to describe the data used and present the results obtained in the evaluation of FASD's schema and company matching algorithms proposed in Sections 4.1 and 4.2, respectively. The C++ code for the schema and company matching algorithms are available upon request. The PGA-PACK library [23] was used to implement the optimization. Though there exist some frameworks to automatically constructing entity matching strategies [19], the results shown below cannot be compared to other methods because of the lack of a generic framework to compare heterogeneous databases with the characteristics of the financial domain: time-stamps at different scales, text, numbers, and nonintersecting fields at different degrees.

Results on Schema Matching
To identify the relationship between the Compustat/CRSP and I/B/E/S attributes 1 , all the available quarterly data from 2008 to 2012 were used. Information prior to 2008 was not included since some I/B/E/S variables were not available until 2008. All quarterly data features (332 attributes) and all year-to-date features (232 attributes) as well as 'Company Name' (CONAME) and 'Company Legal Name' (COLEGNAME) fields from Compustat/CRSP database formed inputs to the algorithm. All the non-industry specific measures available in the I/B/E/S Summary History were included together with the 'Long Company Name' field extracted from Company Identification files. The best four disjoint numeric-type links corresponding to four different attributes in Compustat/CRSP database and four different attributes in I/B/E/S database together with the best string-type matching are shown in Table 3. We set the number of bins N equals to 100, the entropy threshold ent equals to 10 −2 , and the missing value threshold equals to mv = 50%. Before invoking FASD and in order to homogenize company name, we applied a dictionary based expansion to extend the most common acronyms; for instance, "inc" was replaced by "incorporated" and "ltd" was substituted by "limited". For example, Fig. 4.1 shows the histogram of the attributes 'COLEGNAME' and 'CONAME' 1 Actuals Data from I/B/E/S Summary History was employed. According to I/B/E/S Summary History User Guide -April 2013 version 3, "Actuals Data provides financial data relating to previous fiscal periods as reported by the company. Actuals for fiscal periods and interim periods are obtained from news services and company filings and adjusted by Thomson Reuters market specialists to be comparable to the estimates made by analysts".
in Compustat/CRSP and I/B/E/S databases, respectively. Common terms such as "incorporated" or "corporation" have probabilities of occurrence of 0.17 and 0.16 in Compustat/CRSP and I/B/E/S databases, respectively, which diminishes its discriminative power and motivates the use of token-based distances.

Results
The best four numeric-type links and the best string-type link presented in Table 3 were used as input to FASD's company matching algorithm described in Section 3. Taking into account these heterogeneous attributes is fundamental to obtain a good performance, as it will be shown further down. Though there are many consistent cases in which company names and fundamental data are very similar, there also exist other scenarios in which either the company names are very similar but there are discrepancies in the numerical attributes, or company names are significantly different but fundamental data are almost identical. We filtered the data shown in Table 2 by dismissing those records containing a missing value in any of the five attributes chosen. The data used is shown in Table 4. Since FASD's company matching algorithm is a supervised algorithm, we need to provide it with a subset of positive and negative company links. We can obtain a partial list of company matchings by using the historic CUSIP number 2 available in both databases by following the procedure provided by Wharton Table 3. Kullback-Leibler divergence of the best four disjoint numeric-type and the best string-type links between Compustat/CRSP and I/B/E/S databases. Research Data Services (WRDS) [27]. Historical CUSIP numbers allow recovering information about the companies as the first six digits uniquely identify the company while the last two digits represent the company's stock issue. Moreover, CUSIP numbers are never reassigned to other companies or stocks. We linked companies using the most recent CUSIP in order to have a list of company matchings that allow us to evaluate FASD's effectiveness. However, CUSIP numbers are not always available or completely correct, but they are very useful for calibrating the parameters of FASD. We want to evaluate the performance of FASD's company matching algorithm in terms of its effectiveness to identify company links and the human labeling cost. Therefore, our results will present the F-score of FASD's company matching approach when different number of company links labeled as linked or unlinked are used in training. At this point, it is worth noting that there may exist a few N-to-N links between companies, and we do not expect a perfect matching since there are some companies present in Compustat/CRSP that are not in I/B/E/S, and vice versa. It is known that the effectiveness of a supervised entity matching methods highly depends on the size and quality of the training data that ideally should reproduce the real operating scenario [19,21]. We applied three different strategies to select n training pairs of positive and negative company links to be used by FASD: -Random selection. As a baseline error, we randomly choose n/2 positive pairs from the list of known links, and we randomly choose n/2 unlinked pairs. Please note that this strategy does not represent a feasible scenario in practice since linked and unlinked companies are not known beforehand. However, a pure random selection would be entirely dominated by negative links. For example, in our data (  (6) and using uninformative parameters: γ m = 0, α m = 0.5, d m type = 3 (Jaro-Winkler), and η m = 1/M . A value for the threshold θ is not needed to compute the distance.
-Heuristic selection. As an upper bound, we consider that we have at most L links between companies with L the maximum number of companies per database. In our case, L = max{5, 695 , 4, 447} = 5, 695. We randomly select n candidate pairs to be manually labeled among the 2 · L pairs with the lowest company distance with uninformative parameters.
The performance of the algorithm is evaluated over a test set formed by all possible company links obtained by the cross product between companies. Though the complete list of positive links is not available, we used as ground truth the positive links obtained by the CUSIP number procedure described above and the positive links found by manually labeling the first 2 · L pairs with the lowest uninformative company distance for three different sets of attributes: (i) the best four numeric-type links and the best string-type link, (ii) the best four numeric links, and (iii) the best string link. The total number of positive links with data in both databases is 3, 932. Table 5 shows the F-score of FASD's company matching algorithm as a function of the number of training samples n and the training sampling strategy when the best four numeric-type links and the best string-type link are considered. Figure 2 compares the FASD's F-score in test for heterogeneous attributes with the FASD's F-score obtained when only either numeric or string attributes are taken into account in the computation of the company distance. Table 5 shows that FASD's company matching algorithm is able to effectively discover correspondences between companies. It yields a F-score of 0.94 when trained only with 100 positive and negative company links chosen according to our heuristic strategy and using the numeric and string attribute links found by FASD's schema matching method. In fact, the performance of the algorithm keeps almost constant for more than 100 training pairs, which means that increasing the labeling effort does not provide any advantage in terms of effectiveness. The other two scenarios considering only either string or numeric Table 5. F-score of FASD's company matching algorithm using different training set selection strategies. The best four links for numeric attributes and the best link for string attributes found by FASD's schema matching algorithm were considered. Results for the random and heuristic training selection show the mean and standard deviation for 10 different runs of the algorithm.

Random
Deterministic Heuristic  (Fig. 2) are not competitive enough even when they are provided with a large number of training samples. This result reveals the importance of taking into account heterogeneous information given the complex casuistry of company matching problems. Finally, the heuristic strategy for training set selection yielded the best results regardless of the set of attributes used. This is not a surprising result given that our heuristic approach focuses on those samples that are more difficult to differentiate and close to the decision boundary of the matching rule. On the other hand, the poor performance of the deterministic and random strategies is not surprising either. While the training set in the deterministic approach is likely to be dominated by positive examples, the negative training samples in the random strategy are unlikely to be challenging cases. In short, these training sets do not properly capture the complexity of the underlying problem.

Conclusions
The development of trading strategies using multimodal sources of information require a painful process of curating the data to avoid any sort of forward looking bias. We also need to have the data aligned in the proper format such that trading algorithms can be developed and trained. The data management problem eventually overtakes and hampers the model development process. Thus, this paper deals with a challenging problem for machine learning, databases, and fi- nance communities consisting of discovering correspondences between companies in heterogeneous financial databases. The company matching problem is mainly characterized by the lack of a common company's identifier across databases originated at different times without a common standard that evolved in tandem with computer technological breakthroughs. Finance databases are treacherous and the curation of the data is time consuming. There is difficulty of finding correspondences between heterogeneous attributes, which requires the application of a schema matching algorithm. Moreover, there exist several records per company, which leads to reformulate traditional matching approaches commonly oriented to finding one-to-one correspondences. The challenge of having a common framework to make comparisons is hard because every database matching problem leads to a different subproblem. The finance databases shown here are characteristic of a generic situation not previously addressed in the literature in which 1/ time scales of the reported data are different, 2/ attributes may be conceptually similar but they are linearly or nonlinearly scaled and shifted, 3/ there are strings and numbers, 4/ unique key identifiers in each database are different, 5/ companies with different names are actually the same in both databases, 6/ companies with similar identifiers are actually different.
To solve the company matching problem, we propose a two-stage algorithm that requires no expert knowledge and low human labeling effort. The first step of the algorithm provides a solution to the schema matching problem based on the computation of the Kullback-Leibler divergence between attributes. The al-gorithm is completely unsupervised and admits string and numeric attributes. The second stage is responsible for the company matching task, which is carried out in a supervised manner. FASD's company matching algorithm defines a auto-configurable company distance as a function of a set of parameters, and it lets an optimization algorithm to discover the optimal configuration according to the attribute links found in the first stage. Our company distance is capable of dealing with heterogeneous types of data (numeric and string) and entities (companies) defined by a "time series" of records with a time stamp associated. FASD's company distance is also able to automatically find a good combination of edit-based and token-based metrics, choose among different edit-based metrics, assign different weights (importance) to each attribute, and establish the optimal threshold that defines the matching rule. Experimental results on matching companies from Compustat/CRSP and I/B/E/S databases by considering different sets of attributes and training set selection strategies show that FASD is able to successfully discover correspondences between companies. FASD yields a F-score of 0.94 when a hundred positive and negative company links are intelligently chosen for training, and numeric and string attributes are considered.