Stack Overflow: Removing duplicates based on one column, and keeping the row that has value in different column, and if there isn't any, keep lowest ID row
[0] [1] Coalission
[2017-03-17 15:52:04]
[ mysql sql ]
[ https://stackoverflow.com/questions/42861927/removing-duplicates-based-on-one-column-and-keeping-the-row-that-has-value-in-d ]

Using MySQL 5.7 on Google Cloud, I'm trying to deduplicate MySQL data based on an "EmailAddress" column, but some of the rows have a value in the "FullName" column and some don't. I want to keep the ones that have a value in the FullName column, but if none of the rows with that EmailAddress has a FullName value, then just keep the duplicate with the lowest ID number (first column, the primary key).

I've finally broken it down into two separate queries, one to first remove the rows with no value in the FullName column IF there's another duplicate row that does have a value in the FullName column:

DELETE
FROM customer_info
WHERE id IN
(
 SELECT * 
 FROM
    (   
     SELECT c1.id
     FROM customer_info c1
            INNER JOIN customer_info c2 on c1.EmailAddress=c2.EmailAddress and c1.id!=c2.id
     WHERE 
            (trim(c1.FullName)='' or c1.FullName is NULL)
            and c2.FullName is not NULL
            and length(trim(c2.FullName))!=0
    ) t
)

and another query to remove the higher-ID rows among the remaining duplicates, i.e. those where no FullName value was found:

DELETE
FROM customer_info
WHERE id IN
(
 SELECT * 
 FROM
    (   
     SELECT c1.id
     FROM customer_info c1
            INNER JOIN customer_info c2 on c1.EmailAddress=c2.EmailAddress and c1.id>c2.id
    ) t
)

This "works", but not really. It worked once when I left it running overnight on a smaller segment of the data; when I woke up there was an error, but when I looked at the data it was complete.

Am I missing something in my query that's making it highly inefficient, or is it just par for the course for this type of query, and there's no optimization possible in my code that would make a tangible improvement? I've maxed out a Google Cloud SQL instance to their db-n1-highmem-32 size, with 32 GB of memory and 1000 GB of storage space, and it still chokes up and spits out a 2013 error after running for an hour. I need to do this for a total of a little over 3 million rows.

For example, this:

id | FullName      | EmailAddress            |
----------------------------------------------
1  | John Doe      | john.doe@email.com      |
2  | null          | janedoe@box.com         |
3  | null          | billybob@bobby.com      |
4  | null          | john.doe@email.com      |
5  | John Lennon   | jlennon@yoohoo.com      |
6  | null          | james.smith@coolmail.com|
7  | null          | billybob@bobby.com      |
8  | Jane Doe      | janedoe@box.com         |

would result in this:

id | FullName      | EmailAddress            |
----------------------------------------------
1  | John Doe      | john.doe@email.com      |
3  | null          | billybob@bobby.com      |
5  | John Lennon   | jlennon@yoohoo.com      |
6  | null          | james.smith@coolmail.com|
8  | Jane Doe      | janedoe@box.com         |
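As a quick logic check, the question's two DELETE queries can be run against the sample data above in an in-memory SQLite database via Python (a sketch: SQLite stands in for MySQL here, and the derived-table `t` wrapper is kept as written even though it is only needed to work around MySQL's error 1093, not by SQLite):

```python
import sqlite3

# In-memory copy of the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_info (id INTEGER PRIMARY KEY, FullName TEXT, EmailAddress TEXT);
INSERT INTO customer_info VALUES
  (1, 'John Doe',    'john.doe@email.com'),
  (2, NULL,          'janedoe@box.com'),
  (3, NULL,          'billybob@bobby.com'),
  (4, NULL,          'john.doe@email.com'),
  (5, 'John Lennon', 'jlennon@yoohoo.com'),
  (6, NULL,          'james.smith@coolmail.com'),
  (7, NULL,          'billybob@bobby.com'),
  (8, 'Jane Doe',    'janedoe@box.com');

-- Pass 1: drop blank-FullName rows that have a duplicate with a FullName.
DELETE FROM customer_info
WHERE id IN (
  SELECT * FROM (
    SELECT c1.id
    FROM customer_info c1
    JOIN customer_info c2
      ON c1.EmailAddress = c2.EmailAddress AND c1.id != c2.id
    WHERE (trim(c1.FullName) = '' OR c1.FullName IS NULL)
      AND c2.FullName IS NOT NULL
      AND length(trim(c2.FullName)) != 0
  ) t
);

-- Pass 2: among the remaining duplicates, keep only the lowest id.
DELETE FROM customer_info
WHERE id IN (
  SELECT * FROM (
    SELECT c1.id
    FROM customer_info c1
    JOIN customer_info c2
      ON c1.EmailAddress = c2.EmailAddress AND c1.id > c2.id
  ) t
);
""")

survivors = [row[0] for row in conn.execute("SELECT id FROM customer_info ORDER BY id")]
print(survivors)  # [1, 3, 5, 6, 8] -- matches the expected result above
```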
Can you provide an example with sample data ? - Shushil Bohara
@Suraz I've added the example to the original post. - Coalission
[0] [2017-03-17 16:01:21] SqlZim

Using exists() might be simpler in this situation:

delete
from customer_info c
where (trim(c.FullName)='' or c.FullName is null)
  and exists (
     select 1
     from customer_info i
     where i.EmailAddress = c.EmailAddress
       and trim(i.FullName) > ''
  );

delete
from customer_info c
where exists (
     select 1
     from customer_info i
     where i.EmailAddress = c.EmailAddress
       and i.id < c.id
  );
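The same exists() logic can be sanity-checked in SQLite through Python (a sketch, with assumptions: the column is taken to be EmailAddress as in the question's table, the outer row is qualified by table name instead of an alias for portability, and the idx_email index is a hypothetical addition, since each correlated exists() probes the table by email):

```python
import sqlite3

# In-memory copy of the question's sample data, plus an index on
# EmailAddress (an assumption -- the correlated EXISTS lookups below
# search by email, so an index on that column is what they would use).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_info (id INTEGER PRIMARY KEY, FullName TEXT, EmailAddress TEXT);
CREATE INDEX idx_email ON customer_info (EmailAddress);
INSERT INTO customer_info VALUES
  (1, 'John Doe',    'john.doe@email.com'),
  (2, NULL,          'janedoe@box.com'),
  (3, NULL,          'billybob@bobby.com'),
  (4, NULL,          'john.doe@email.com'),
  (5, 'John Lennon', 'jlennon@yoohoo.com'),
  (6, NULL,          'james.smith@coolmail.com'),
  (7, NULL,          'billybob@bobby.com'),
  (8, 'Jane Doe',    'janedoe@box.com');

-- Delete blank-FullName rows when a same-email row with a FullName exists.
DELETE FROM customer_info
WHERE (trim(FullName) = '' OR FullName IS NULL)
  AND EXISTS (SELECT 1 FROM customer_info i
              WHERE i.EmailAddress = customer_info.EmailAddress
                AND trim(i.FullName) > '');

-- Delete any row that still has a same-email duplicate with a lower id.
DELETE FROM customer_info
WHERE EXISTS (SELECT 1 FROM customer_info i
              WHERE i.EmailAddress = customer_info.EmailAddress
                AND i.id < customer_info.id);
""")

survivors = [row[0] for row in conn.execute("SELECT id FROM customer_info ORDER BY id")]
print(survivors)  # [1, 3, 5, 6, 8] -- same result as the two-pass join version
```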
