Rob Farley

Rob Rob Farley has been consulting in IT since completing a Computer Science degree with first class honours in 1997. Before moving to Adelaide, he worked in consultancies in Melbourne and London. He runs the development department in one of Australia's leading IT firms, as well as doing database application consultancy and training. He heads up the Adelaide SQL Server User Group, and holds several Microsoft certifications.

Rob has been involved with Microsoft technologies for most of his career, but has also done significant work with Oracle and Unix systems. His preferred database is SQL Server and his preferred language is C#. Recently he has been involved with Microsoft Learning in the US, creating and reviewing new content for the next generation of Microsoft exams.

Over the years, Rob's clients have included BP Oil, OneLink Transit, Accenture, Avanade, Australian Electorial Commission, the Chartered Institute of Personnel and Development, the Royal Borough of Kingston, Help The Aged, Unisys, Department of Treasury and Finance (Vic), National Mutual, the Bible Society and others.

Did you mean to come here? My blog is now at http://msmvps.com/blogs/robfarley



11 September 2006

Using row_number() to remove duplicates

For those of you who are coming to Wagga for the SQL Code Camp (less than four weeks away), you'll hear this tip then. But for the rest of you, here's one of the great uses for row_number() in SQL2005.

Suppose you have a table of data, which includes some duplicates. It happened recently for a guy who posted to the SQLDownUnder mailing list. You want to get rid of duplicates, and keep only the first row for each subset of fields. For example, in the mailing list question, the example was that he wanted to filter to a single field, which I will call idfield, which contained duplicates. It's worth noting here that there is no primary key on the table. If there were, life would be much easier.

One option (without row_number()) is to use min() or max() for each field. Something like this:

select idfield, min(field1), max(field2)
from dbo.sometable
group by idfield

But some types, such as guids, cannot be used with min() and max(). So for that, let's try:

select t1.idfield, (select top 1 t2.field1 from dbo.sometable t2 where t2.idfield = t1.idfield), max(t1.field2)
from dbo.sometable t1
group by t1.idfield

This will do the trick, but it makes life tough for checking that you get fields from the same row. After all, max(field1) and max(field2) might be from different rows. So you can use top with order by to get the right row.

select t1.idfield, (select top 1 t2.field1 from dbo.sometable t2 where t2.idfield = t1.idfield order by field1, field2, field3, field4), (select top 1 t2.field2 from dbo.sometable t2 where t2.idfield = t1.idfield order by field1, field2, field3, field4)
from dbo.sometable t1
group by t1.idfield

And you can see, we're getting pretty ugly. So let's try with row_number(). If we partition over the field that we want to become unique, then our row_number() field starts again from 1 each time it changes. This is ideal, because if we filter to rows which have row_number() as 1, we get exactly what we want!

select * from
(select *, row_number() over (partition by idfield order by idfield, field1, field2) as rn
from dbo.sometable) t
where t.rn = 1

Lovely. Of course, you wouldn't use "select *", but you get the drift. :)