Problem with left join

noetsi

Fortran must die
#1
When I run the following code I get a table, but one missing customers, because they are not in one of the tables.

proc SQL;
Create table work.firstep
as select distinct t1.*
, "VR" || substr(cats ("0000000", put(CUSProfile_AN,7.)),length (strip (cats ("0000000", put(CUSProfile_AN,7.))))-6,7) as VRID
from sasuser.test1 t1
join rimsrpt.rptcasesummary t2
on t1.'SSN Without Dashes'n = t2.customerid
where t2.extractdate ='31aug2017'd;
quit;

So I try left join to test where the extra cases are. And I get no table at all. Any suggestions what could cause this (it makes no sense to me at all).

This is the left join I run.

proc SQL;
Create table work.firstep
as select distinct t1.*
, "VR" || substr(cats ("0000000", put(CUSProfile_AN,7.)),length (strip (cats ("0000000", put(CUSProfile_AN,7.))))-6,7) as VRID
from sasuser.test1 t1
left join rimsrpt.rptcasesummary t2
on t1.'SSN Without Dashes'n = t2.customerid
where t2.extractdate ='31aug2017'd;
quit;
 

Dason

Ambassador to the humans
#2
A left join with a where condition on the 'right' table doesn't make much sense to me.

I'm not sure what you mean by your original statement though. Is the customer missing from the 'left' table or from the 'right' table?
 

noetsi

Fortran must die
#3
The left table is a group of individuals sent to us from another unit to find their id in our customer data base (which is in the right table, We create a series of extracts for our customer data that are mirrors of past data plus what is added since last month. If you don't run the where statement you get massive duplication. There is no extract data in the left table at all which does not need one).

The reason the join is necessary, the left join, is that if you don't do this you lose the individuals who are not customers, and they do not want this. They want to find out any person in the original table who is not a customer (none should be, although it happens).

The following code eliminates the problem, but does not explain why it would exist in the first place. Essentially you join the two tables above (losing the non-customers) then you join this table back to the original one. It makes zero sense that you have to do this - I don't know if this is a SAS issue or is generic to SQL. I suspect the fact that you are importing the left table plays some role in the problem.

This is the code that works (minus the lib name statement).

PROC IMPORT
DATAFILE="S:\CIU\sastestfolder\vrid.xlsx"
OUT=sasuser.test1
DBMS=XLSX
REPLACE;
SHEET="Sheet1";
GETNAMES=YES;
DATAROW=2;
RUN;

.....(lib name statement goes here)

proc SQL;
Create table work.firstep
as select distinct t1.*
, "VR" || substr(cats ("0000000", put(CUSProfile_AN,7.)),length (strip (cats ("0000000", put(CUSProfile_AN,7.))))-6,7) as VRID
from sasuser.test1 t1
join rimsrpt.rptcasesummary t2
on t1.'SSN Without Dashes'n = t2.customerid
where t2.extractdate ='31aug2017'd;
quit;

proc sql;

Create table work.firststep2 As
Select distinct t1.*
, t2.vrid
from sasuser.test1 t1
left join work.firstep t2
On t1.'SSN Without Dashes'n = t2.'SSN Without Dashes'n
;
quit;
 

Dason

Ambassador to the humans
#4
What happens if you use your original left join code but move the where clause into the 'on' for the join. So do your: left join blah.thing = bleh.thang AND t2.extractdate = 'stuff'd;

My original statement saying that the where with a condition on the 'right' table doesn't make sense when doing a left join was to point out that anything you 'gain' by changing the join to a left join will definitely be filtered out by your where statement anyways. So if there wasn't a match in the right table then we know extractdate will be NULL in the result so the where statement will then filter out that row anyways since it is NULL and thus not equal to that value you were filtering on.

Moving that condition to the on clause will only bring the info from the right table when that particular extractdate condition is met but it will retain the rows that don't give a match on the 'right' table because you're doing the left join.

So to keep this in mind the "on" conditions are done to evaluate which rows to join. The "where" clauses will filter *after* the join is done.

I don't have your data so can't tell if this is the reason or why it would result in an empty result in your original case but at the very least hopefully this sheds some light on what I was trying to get at earlier.
 

noetsi

Fortran must die
#5
The lesson being that outer joins with filters are of doubtful value.

What I don't understand is why it eliminated all rows in the table rather than the two.

I ran your suggestion Dason and it worked.
 
Last edited: