Using row_number() to remove duplicates

Rob Farley

Rob Farley has been consulting in IT since completing a Computer Science degree with first class honours in 1997. Before moving to Adelaide, he worked in consultancies in Melbourne and London. He runs the development department in one of Australia's leading IT firms, as well as doing database application consultancy and training. He heads up the Adelaide SQL Server User Group, and holds several Microsoft certifications.

Rob has been involved with Microsoft technologies for most of his career, but has also done significant work with Oracle and Unix systems. His preferred database is SQL Server and his preferred language is C#. Recently he has been involved with Microsoft Learning in the US, creating and reviewing new content for the next generation of Microsoft exams.

Over the years, Rob's clients have included BP Oil, OneLink Transit, Accenture, Avanade, Australian Electoral Commission, the Chartered Institute of Personnel and Development, the Royal Borough of Kingston, Help The Aged, Unisys, Department of Treasury and Finance (Vic), National Mutual, the Bible Society and others.

11 September 2006


For those of you who are coming to Wagga for the SQL Code Camp (less than four weeks away), you'll hear this tip then. But for the rest of you, here's one of the great uses for row_number() in SQL2005.

Suppose you have a table of data which includes some duplicates. It happened recently for a guy who posted to the SQLDownUnder mailing list. You want to get rid of the duplicates and keep only the first row for each distinct value of some subset of fields. In the mailing list question, that subset was a single field, which I will call idfield, and it contained duplicates. It's worth noting here that there is no primary key on the table. If there were, life would be much easier.

One option (without row_number()) is to use min() or max() for each field. Something like this:

select idfield, min(field1), max(field2)
from dbo.sometable
group by idfield
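To see what that gives you, here's a minimal sketch with made-up data. The table variable @sometable and its rows are purely my own illustration (they are not from the mailing list question), and the insert uses union all so that it should also run on SQL2005.

```sql
-- Hypothetical sample data: a table variable standing in for dbo.sometable.
declare @sometable table (idfield int, field1 varchar(10), field2 int);

insert into @sometable (idfield, field1, field2)
select 1, 'apple', 5 union all
select 1, 'banana', 10 union all
select 2, 'cherry', 7;

-- One row per idfield, with each column aggregated independently.
select idfield, min(field1) as field1, max(field2) as field2
from @sometable
group by idfield;

-- idfield 1 comes back as ('apple', 10): 'apple' is from the first row
-- and 10 is from the second, so no such row exists in the table.
```

Each idfield now appears only once, but the values for idfield 1 are stitched together from two different rows, which may or may not matter for your data.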

But some types, such as guids, cannot be used with min() and max(). So for that, let's try:

select t1.idfield, (select top 1 t2.field1 from dbo.sometable t2 where t2.idfield = t1.idfield), max(t1.field2)
from dbo.sometable t1
group by t1.idfield

This will do the trick, but top 1 without an order by picks an arbitrary row, and it makes life tough for checking that you get fields from the same row. After all, max(field1) and max(field2) might be from different rows. So you can use top with order by to get the right row.

select t1.idfield,
  (select top 1 t2.field1 from dbo.sometable t2 where t2.idfield = t1.idfield order by t2.field1, t2.field2, t2.field3, t2.field4),
  (select top 1 t2.field2 from dbo.sometable t2 where t2.idfield = t1.idfield order by t2.field1, t2.field2, t2.field3, t2.field4)
from dbo.sometable t1
group by t1.idfield

As you can see, we're getting pretty ugly. So let's try with row_number(). If we partition over the field that we want to become unique, then our row_number() field starts again from 1 each time the value of that field changes. This is ideal, because if we filter to rows which have row_number() as 1, we get exactly one row for each idfield, which is what we want!

select * from
(select *, row_number() over (partition by idfield order by field1, field2) as rn
from dbo.sometable) t
where t.rn = 1

Lovely. Of course, you wouldn't use "select *", but you get the drift. :)
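Here's a rough sketch of the same idea with the columns spelled out, again on made-up data in a hypothetical table variable @sometable. The second half is my own extension rather than part of the mailing list answer: assuming you actually want the extra rows gone from the table, you can delete through a common table expression that carries the row number.

```sql
-- Hypothetical sample data: a table variable standing in for dbo.sometable.
declare @sometable table (idfield int, field1 varchar(10), field2 int);

insert into @sometable (idfield, field1, field2)
select 1, 'apple', 5 union all
select 1, 'banana', 10 union all
select 2, 'cherry', 7;

-- Keep the first row per idfield, naming the columns instead of using select *.
select t.idfield, t.field1, t.field2
from
(select idfield, field1, field2,
        row_number() over (partition by idfield order by field1, field2) as rn
 from @sometable) t
where t.rn = 1;
-- Returns the rows (1, 'apple', 5) and (2, 'cherry', 7).

-- To physically remove the duplicates, delete every row numbered above 1.
with numbered as
(select row_number() over (partition by idfield order by field1, field2) as rn
 from @sometable)
delete from numbered
where rn > 1;

-- Only the two rows returned by the first query remain.
select idfield, field1, field2 from @sometable;
```

If two rows tie on every column in the order by, which of them gets rn = 1 is arbitrary. That's fine when they really are exact duplicates, but if they differ in other columns you'd want to add those columns to the order by.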

Posted by Rob Farley at 7:30 pm