\$\begingroup\$

The task is: I get companies' information from a large CSV file and then I need to compare this information with each company in the database to know if the company is a new one or if it is already in the database.

The problem is: the information from the CSV file may contain errors (wrong zip code, typo in the company name, etc.), so I need to know whether the received company and the database company are probably the same. In other words, I need to measure the similarity between the companies' information.

To achieve this, I came up with the following algorithm:

  1. Set a similarity score to 0.
  2. Compare the companies' names. If the names are at least 90% similar, add one point to the score.
  3. Compare the companies' addresses. If the addresses are at least 90% similar, add one point to the score.
  4. Compare the companies' zip codes. If the zip codes are at least 90% similar, add one point to the score.
  5. If the score is 3, the companies are the same. Stop.
  6. If the score is less than 3, run an additional test on the name and address:
    1. Split the companies' names into words and check whether the first name has fewer than two words that differ from the second. If so, compare the names again: if they are at least 75% similar, add one point to the score.
    2. Split the companies' addresses into words and check whether the first address has fewer than two words that differ from the second. If so, compare the addresses again: if they are at least 75% similar, add one point to the score.
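
The steps above can be sketched as a standalone function, independent of the class shown further down. This is only a minimal sketch that follows the list literally: the helper names `isSimilar`, `differingWords` and `similarityScore` are hypothetical, and the record shape (arrays with `name`, `street` and `zipCode` keys) is an assumption made for illustration.

```php
<?php
// Hypothetical helper: true when similar_text() rates the two strings
// at or above the given percentage.
function isSimilar(string $a, string $b, float $threshold): bool
{
    $percent = 0.0;
    similar_text($a, $b, $percent);
    return $percent >= $threshold;
}

// Hypothetical helper: number of words in $a that do not occur in $b
// (case-insensitive, split on single spaces).
function differingWords(string $a, string $b): int
{
    return count(array_diff(
        explode(' ', strtolower($a)),
        explode(' ', strtolower($b))
    ));
}

// Assumed record shape: ['name' => ..., 'street' => ..., 'zipCode' => ...].
function similarityScore(array $db, array $csv): int
{
    $score = 0;

    // Steps 2-4: one point per field that is at least 90% similar.
    foreach (['name', 'street', 'zipCode'] as $field) {
        if (isSimilar($db[$field], $csv[$field], 90)) {
            $score++;
        }
    }

    // Step 5: all three fields matched, nothing more to do.
    if ($score === 3) {
        return $score;
    }

    // Step 6: relaxed 75% test on name and street, only when fewer than
    // two words of the first value are missing from the second.
    foreach (['name', 'street'] as $field) {
        if (differingWords($db[$field], $csv[$field]) < 2
            && isSimilar($db[$field], $csv[$field], 75)) {
            $score++;
        }
    }

    return $score;
}

// Usage with one row of made-up data in the assumed shape.
$db  = ['name' => 'Factory of metal',   'street' => 'Street One', 'zipCode' => '0000-000'];
$csv = ['name' => 'Factory of plastic', 'street' => 'Street One', 'zipCode' => '0000-000'];
echo similarityScore($db, $csv), PHP_EOL;
```

Note that, read literally, step 6 can award a second point to a field that already earned one in steps 2-4, so the score can end up above 3 and a mismatching zip code can be outvoted by the name and the street.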

Here is a table of some example companies using it:

| Company from database                           | Company received from CSV                             | Result                      | Note                                             |
|-------------------------------------------------|-------------------------------------------------------|-----------------------------|--------------------------------------------------|
| Company of John - 125, 5th Avenue - 99999-9999, | Company - 126, 5th Avenue - 999999-9999               | same company (3 points)     | same zip, similar names and similar street names |
| Hotel 1 - 2358, Generic Street - ON M2N2W9      | Hotel with good beds,- 256, Rue de La Vie - ON M2N2W6 | different company (1 point) | almost the same zip                              |
| Factory of metal - 635, Street One - 0000-000   | Factory of plastic - 635, Street One - 0000-000       | same company (3 points)     | similar names, same street name and same zip     |

Following is my PHP class that implements this algorithm:

<?php
namespace Utils\Merge;

use Model\Company;
use Model\AddressCompany;

/**
 * Class Compare
 * Class to receive a Company Object and company's information from database. 
 * Compares this information and decides whether the two companies are the same
 * @author James Miranda <jameswpm [at] gmail dot com>
 * @access public
 */
class Compare
{
    /**
     * @var Company $company
     * Object Company retrieved from database
     */
    private $company;

    /**
     * @var AddressCompany $adrCompany
     * Object from table AddressCompany that represents an Address for the $company
     */
    private $adrCompany;

    /**
     * @var Array $companyAudit
     * Array with company's information received from user
     */
    private $companyAudit;

    /**
     * @var Int $equals
     * Attribute that defines a scale for similarity between two companies. The larger the scale, the more similar the companies are
     */
    public $equals;

    /**
     * Method __construct()
     * Initializes the objects to be used in comparison
     * @param Array $compAudit
     * @param Company $compObj
     */
    public function __construct($compAudit,Company $compObj)
    {
        $this->company = $compObj;
        $this->adrCompany = new AddressCompany($compObj->getId());//loads an address object from the database by company ID
        $this->companyAudit = $compAudit;
        $this->equals = 0;
    }

    /**
     * Method compare()
     * Compares the two objects that were set up in the constructor
     */
    public function compare()
    {
        //Firstly, check the similarity between the following attributes: Name, street and zip code 
        $this->compareName();
        $this->compareStreet();
        $this->compareZip();
        //After these operations, if $equals has reached 3, the three attributes (name, street and zip) are each at least 90% similar, so the companies are probably the same
        if ($this->equals < 3) {
            //when the similarity did not reach 90% in all three attributes, additional tests are needed before concluding that the companies are different
            $this->additionalTest();
        }
    }   

    /**
     * Method compareName()
     * Compare the names of two companies
     * @param int $similarity Percentage expected. By default, seeks a high similarity of 90%
     * @see http://php.net/manual/en/function.similar-text.php
     */
    private function compareName($similarity = 90)
    {
        $percent = 0;
        similar_text($this->company->getName(),$this->companyAudit['name'], $percent);
        if ($percent >= $similarity) {
            //The similarity reached the requested threshold (90% by default). So, + 1 point
            $this->equals += 1;
        }
        unset($percent);
    }

    /**
     * Method compareStreet()
     * Compare the street names of two companies
     * @param int $similarity Percentage expected. By default, seeks a high similarity of 90%
     * @see http://php.net/manual/en/function.similar-text.php
     */
    private function compareStreet($similarity = 90)
    {
        $percent = 0;
        similar_text($this->adrCompany->getStreet(),$this->companyAudit['street'], $percent);
        if ( $percent >= $similarity) {
            //The similarity reached the requested threshold (90% by default). So, + 1 point
            $this->equals += 1;
        }
        unset($percent);
    }

    /**
     * Method compareZip()
     * Compare the zip codes of two companies
     * @param int $similarity Percentage expected. By default, seeks a high similarity of 90%
     * @see http://php.net/manual/en/function.similar-text.php
     */
    private function compareZip($similarity = 90)
    {
        $percent = 0;
        similar_text($this->adrCompany->getZip(),$this->companyAudit['zipCode'], $percent);
        if ($percent >= $similarity) {
            //The similarity reached the requested threshold (90% by default). So, + 1 point
            $this->equals += 1;
        }
        unset($percent);
    }

    /**
     * Method additionalTest()
     * Called when more tests are needed to ensure the difference between two companies
     * @see http://php.net/manual/en/function.array-diff.php
     */
    private function additionalTest()
    {
        //Firstly, transforms the companies' data in arrays to compare the words
        $name1 = explode (' ', strtolower($this->company->getName()));
        $str1 = explode (' ', strtolower($this->adrCompany->getStreet()));
        $name2 = explode (' ', strtolower($this->companyAudit['name']));
        $str2 = explode (' ', strtolower($this->companyAudit['street']));
        //Compares the two word arrays in both directions and counts how many words of one are missing from the other
        //The array_diff function returns the elements of the first array that are not in the second. The larger the resulting array, the more the two differ
        if(count(array_diff($name1,$name2)) <= 2 ||count(array_diff($name2,$name1)) <= 2) {
            //when at most two words of one name are missing from the other,
            //the companies may have the same name, so compare again, with a lower threshold, to check whether they are different
            $this->compareName(75);
        }
        if(count(array_diff($str1,$str2)) <= 2 ||count(array_diff($str2,$str1)) <= 2) {
            //when at most two words of one street name are missing from the other,
            //the companies may have the same street name, so compare again, with a lower threshold, to check whether they are different
            $this->compareStreet(75);
        }
        //After these tests, if $equals < 3, it is highly likely that the companies are not the same
    }   
 }
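
To illustrate what the word comparison in `additionalTest()` measures, here is a minimal standalone sketch. The helper name `wordDiffCounts` is hypothetical; it only repeats the `explode`/`strtolower`/`array_diff` steps from the class and returns the count for each direction.

```php
<?php
// Hypothetical helper mirroring the word split in additionalTest():
// returns [words of $a missing from $b, words of $b missing from $a].
function wordDiffCounts(string $a, string $b): array
{
    $wordsA = explode(' ', strtolower($a));
    $wordsB = explode(' ', strtolower($b));

    return [
        count(array_diff($wordsA, $wordsB)),
        count(array_diff($wordsB, $wordsA)),
    ];
}

// Names taken from the example table.
var_dump(wordDiffCounts('Company of John', 'Company'));
var_dump(wordDiffCounts('Factory of metal', 'Factory of plastic'));
```

Because `array_diff()` is one-directional, the two counts can differ. With the `||` in the condition, it is enough for either count to be at most 2, so a short name compared against a much longer one still passes this check.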

The use of this class is in the code below:

//$auditCompany from CSV and $company from Database
$comp = new Compare($auditCompany,$company);
$comp->compare();
if ($comp->equals >= 3) {
    //the companies are equal
    $this->updateCompany($auditCompany);
}

So far the code works fine, with an acceptable failure rate.

What would I like reviewed?

1) I know there are some excellent algorithms for comparing text, such as Euclidean distance or cosine similarity, but I don't know whether they are a good fit for this problem, because I'm not comparing two texts; I'm comparing two "objects" made of pieces of text. Are these algorithms recommended for this kind of problem?

2) Is the levenshtein() function better than the similar_text() function in this case?

3) Improve the execution time. With a CSV file of 16000 companies, the script runs for almost 15 hours (!!!).

4) Improve the reliability. How can I keep the number of false positives to a minimum?
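
For question 2, one way to look at the two functions side by side is to put both on the same 0-100 scale. The sketch below is only an illustration: the helper names `similarTextPercent` and `levenshteinPercent` are hypothetical, and dividing the Levenshtein distance by the length of the longer string is one possible normalization that I am assuming here, not something PHP provides.

```php
<?php
// Hypothetical helper: similar_text() percentage as a return value.
function similarTextPercent(string $a, string $b): float
{
    $percent = 0.0;
    similar_text($a, $b, $percent);
    return $percent;
}

// Hypothetical helper: Levenshtein distance rescaled to 0-100, where 100
// means identical. Dividing by the longer length is an assumed normalization.
function levenshteinPercent(string $a, string $b): float
{
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 100.0; // two empty strings are treated as identical
    }
    return (1 - levenshtein($a, $b) / $maxLen) * 100;
}

// Name pairs taken from the example table above.
$pairs = [
    ['Company of John', 'Company'],
    ['Hotel 1', 'Hotel with good beds'],
    ['Factory of metal', 'Factory of plastic'],
];

foreach ($pairs as [$dbName, $csvName]) {
    printf(
        "%-18s | %-22s | similar_text %5.1f%% | levenshtein %5.1f%%\n",
        $dbName,
        $csvName,
        similarTextPercent($dbName, $csvName),
        levenshteinPercent($dbName, $csvName)
    );
}
```

Running both measures over real pairs from the database and the CSV file would show where they disagree, which seems more informative than choosing one of them in the abstract.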

Any other comment or tip is very welcome.

\$\endgroup\$
