Finding Duplicate and Missing (null) Values in Oracle Table
Finding Duplicate and Missing (null) Values in Oracle Table
I am looking for errors in a table and want to report both duplicates and missing values. I am unsure of the best way to do this and am looking for advice on a better way to accomplish this. This is in Oracle 12c.
This appears to achieve the desired result:
SELECT a.id,
a.mainfield,
a.location,
b.counter
FROM maintable a
INNER JOIN (
SELECT mainfield,
Count(*) counter
FROM maintable
GROUP BY mainfield
HAVING Count(mainfield) > 1 OR mainfield IS NULL
) b ON a.mainfield = b.mainfield OR
( a.mainfield IS NULL AND b.mainfield IS NULL )
ORDER BY a.mainfield;
This works and gives me the ID, the potentially null MAINFIELD, the location and a count of either the duplicate MAINFIELD values or the null MAINFIELD values.
Is there something simpler or potentially more efficient that I could be using? I have to admit that my SQL skills are quite rusty.
Sample data may or may not help, but the ID is the primary key, is a number and not nullable. The other fields are both NVARCHAR2 and nullable. None of those are indexed. Here is what the output could look like. Some records are outright errors. Some are obvious typos. Some appear to be test data.
ID MAINFIELD LOCATION COUNTER
------- --------- --------------------------------- -------
16626 206000650 9A OLIVER ST CENTRAL STATION 2
18805 206000650 3 SWIFT CT CENTRAL STATION 2
22409 940000170 2 MARKET ST NEWARK DE 2
22003 940000170 1 MARKET ST NEWARK NJ 2
29533 970000030 95 MILL RD ANDOVER 2
20256 970000030 12 RAILROAD AVE 2
29018 978900050 44 BROAD STREET 2
28432 978900050 WASHINGTON ST AND HAMILTON AVE 2
21831 980700050 BROADWAY NEWTOWN 2
24147 980700050 MAIN STREET LEVITTOWN 2
26418 3
26738 TEST DATA 3
26755 3
The last three rows have a null MAINFIELD and there are three such records (two of which have the location null also).
After adding some insight into the data above, I realized I might consider using NVL to eliminate part of the conditions, like this (assuming the value I chose would not be a valid value in the mainfield):
SELECT a.id,
a.mainfield,
a.location,
b.counter
FROM maintable a
INNER JOIN (
SELECT mainfield,
Count(*) counter
FROM maintable
GROUP BY mainfield
HAVING Count(mainfield) > 1 OR mainfield IS NULL
) b ON NVL(a.mainfield,'***NULL***') = NVL(b.mainfield.'***NULL***')
ORDER BY a.mainfield;
This executes a bit quicker and seems to produce the desired result. I have been trying other alternatives without success, so this may be the best alternative.
One alternative that I discarded which might be appropriate for a slightly different scenario (but was the worst performer for me) is this one:
SELECT id,
mainfield,
location,
COUNT (id) OVER (PARTITION BY mainfield) counter
FROM maintable a
WHERE mainfield IS NULL
OR EXISTS(SELECT 1 from maintable b
WHERE mainfield = a.mainfield AND ROWID <> a.ROWID)
ORDER BY a.mainfield;
I just really liked the way this was put together and was hopeful that it would be somewhat efficient. We are not talking that it runs for days, but I am trying to re-learn in Oracle what might have once been a skill back when I was coding with SQL/DS.
If any of the above gives anyone an idea of a better alternative, I am all ears. (For example, is there a way to reference the counter [the COUNT (id) over PARTITION BY mainfield] in the WHERE clause?)
Thanks again.
Thanks. I added some "sample" data which is similar to what I am dealing with. The actual data is not much different, but there are a couple of hundred thousand record where the MAINFIELD is unique and not null. I did not think it was necessary to show those, I hope this helps.
– FocusWiz
Jun 27 at 14:29
i think you might get a better respons on dba.stackexchange.com as this question is more tuning/optimization rather than you cannot solve your problem
– EoinS
Jun 29 at 18:16
Thank you @EoinS. I did not know such an animal exists.
– FocusWiz
Jun 29 at 18:27
1 Answer
1
This seems to be a good compromise between readability and reliability and efficiency which was offered by Balazs Papp on the dba.stackexchange.com board:
https://dba.stackexchange.com/a/210998/154392
SELECT * FROM (
SELECT id,
mainfield,
location,
COUNT (id) OVER (PARTITION BY mainfield) counter
FROM maintable a
) where counter > 1 or mainfield IS NULL
ORDER BY mainfield;
This is a simplification of the last alternative of the original post. It does not appear to be more inefficient than my original alternative (as far as I can tell), but to me it is more readable.
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Sample data and desired results would really help.
– Gordon Linoff
Jun 27 at 14:01