Wednesday, 26 February 2014

Duplicate Table

Has this ever happened to you?  You are doing left joins to tables and somehow your most recent table has more records than it should.  What happened?  You were careful to ensure your ID field is unique, but for some unexplained reason you have too many records.  The best way to solve the problem is to identify if your table has duplicates and what those duplicates are.  I like to create a sub-table that lists all of the fields of every record that is part of a duplicate.  The code below will quickly create a table called "dups" that contains every record that is part of a duplicate and sort it by the field in question.

Create a table of duplicate records


%Let TableName = TableName;
%Let FieldName = FieldName;

proc sql;
      create table dups as
      select *
      from &TableName
      where ((&FieldName in(select &FieldName from &TableName as tmp
      group by &FieldName having Count(*)>1)))
      order by &FieldName
      ;
Quit;


This query uses the in statement to query against itself and pulls all fields from all records where your field name occurs more than once.  The limitation of this query is that it assumes you are dealing with one field that is supposed to be unique.

Macro Names

Use the %Let function to identify your table name and your field name so you don't have to type the table and field names multiple times


TableName is the table you think has duplicate records
FieldName is the field that is supposed to be unique

Usage

I keep this code in a macro abbreviation (see previous blog) under the shortcut text “dups”.  This way it is always very easy to do a quick duplicate check.  Use this code to identify the records where duplicate occur.  Usually once you see the value of all of the fields that are part of a duplicate is becomes obvious why the duplicates occur.  I like to use this code to confirm there are no duplicates in my tables.  Run the code and check the log screen to ensure the number of records for the dups is 0.  It takes less than a minute and it is a good check.



I would love to hear how you check for duplicates in your tables.

Tuesday, 25 February 2014

Use Macro Abbreviations for Shortcuts

One of my biggest issues with SAS code is remembering the syntax of the code.  For example, when you reference a field in proc freq the syntax is "table" but in proc sort the field is referenced by the syntax "by".  With macro abbreviations you won't have to remember these inconsistencies of SAS code.  I mentioned the macro abbreviations in my previous blog, but I want to explore how I use them for a lot of little code that I use everyday.


 Using the Macro Abbreviations


  1. In the SAS editor go to Tools -> Add Abbreviation
  2. Under Abbreviation enter a string of text you want to use for the shortcut.  This is case sensitive so I suggest you type the shortcut in the case you usually code in.
  3. Under Text to insert for abbreviation, type in the text you want to appear after you type your shortcut.
I use this tool for all sorts of code.  When I type the text "freq" I get the text



Proc Freq Data_____;
      Table ______;
run;


The underline is very handy because they are a visual reminder what information you need  to add and it is easy to highlight with a double click.

I use SAS sql as my main query language and it is great to start every query by typing "sql" hit tab and I get the following text.


proc sql;
      create table _____ as
            select *
      from ____
      where 
      _____
      ;
quit;


It is great for code you don't use very often and can't remember the syntax.  I have a tough time remembering the syntax for delete so I created a macro abbreviation for the text "delete" and the following text appears.


proc sql;
      delete
      from _____
      where Field_Name in (Select field_name from OtherTable)
      ;
quit;

This code deletes from a table where the value of one field equals the same field from another table.  It is very handy when you create a control table and want to ensure that customers in the control are not in the campaign table.

The great thing about macro abbreviations is there is no limit to the volume of code that can go into the "Text to insert for abbreviation" box.  I have put pages of code in that box because I have had situations where certain projects started with exactly the same code.

I would love to hear some of the ways you use the macro abbreviation to make your programming efficient.





Monday, 24 February 2014

Starting a project in SAS

You are about to start a brand new project and you have the basic SAS white screen staring at you.  Early in my career my first move was to write code and get the project started.  Over time I learned that was the wrong thing to do. What you should do is spend time to think about the project, the outcome and document that information at the top of the code.  There are two good reasons to start with proper documentation.

  1. Documenting forces you to think about the whole project and what the final result will be. If you start coding right away you tend to only think about the next step, which is a good way to get off course very quickly.
  2. Well documented SAS code is easier to read and understand when you go back to the code months later because your client has asked you to "do the same thing that was done last quarter".  I typically work on three or four projects during a week.  Over a period of four months that can add up to over 40 projects that I have touch over that time.  When I get a request to re-create a project, it is very difficult and time consuming to open up the code I think I used and read through it to confirm if I am looking at the correct code.
Below is an example of the first thing I type in before starting any project of any size


/********************************************************************************** 
Program: CampaignRetention20140224.sas 
Programmer: Patrick Booth
Creation Date: 2014-02-24 
Purpose: Select all customers who has a contract that is coming to an end and create a campaign list with their contact information
Input: A list of customers who has a contract that will end from March 1, 2014 to March 31, 2014.  This request came from Darla in marketing
Output: Campaign list with all contact information including store number
Program Dependency: Make sure the fall campaign is complete before selecting these customers
History: This is a campaign that came from a pilot that was run in 2013
************************************************************************/
The code gives you a lot of information at a glance which means you don't have to scroll down the hundreds of lines of code to confirm if you are looking at the correct file.
This method can be as detailed as you want and it only takes a few minutes to type in the information.

Below is a summary of each line and why they are important

Program:  This may seem redundant because the name of the program is the name of the file, so why do you need to type it in here?  I found it helps to write down the name of the program because it forces you to think about what a logical name is.
Programmer:  This is useful if you work with a team of programmers and you can immediately see who created the file.  If you work alone and nobody else programs in your company, then don't bother with this line.
Creation Date:  Important to know when you started the project because most of your clients will refer to projects by a name and when you created it.
Purpose:  Try to be as specific as possible, because that will help many months down the road.  Think about this section as a note to yourself in the future and what information you will need when you are trying to understand what this code is about.  I find that I go back and add to this section as the project changes over time.
Input:  Sometimes your client will give you a file to work with.  Name the file or describe that file here.  Other times the request will come from an individual.  Put the requester's name here because that is usually the first indication if you have the correct code.
Output:  This is more about forcing you to think about the outcome of the project.  If you think about the end goal then you tend to be more efficient with your coding.  Take a few minutes to think about this, it will save you coding time.
Program Dependency:  Include any events that has to happen before this program starts.  It is a good way to ensure everything is ready before you start the project
History:  Most projects happen as a result of events, previous projects or a change of policy at the company. This is an opportunity to add context to the project and help the "future you" to understand the code better.

Executing this code

I know what you are thinking.  "I can't be bothered to write all of this text at the beginning of every project.  It will affect my efficiency".  That is a legitimate concern, but I have a trick that will make this process a lot faster and easier.  
SAS has an abbreviation macro that will replace a small amount of code with a large amount of code.  I use the macro replace the text "startup"  with the text

/********************************************************************************** 
Program: Codename.sas 
Programmer: Patrick Booth
Creation Date: YYYY-MM-DD 
Purpose: 
Input: 
Output: 
Program Dependency: 
History: 
************************************************************************/

Here is how you do it:

  • Open a SAS program window
  • Select Program -> Add Abbreviation macro
  • Under Abbreviation type a brief name for your macro.  In my case I used the text "startup"
  • Under "Text to insert" type in the template for the beginning of your program.
  • Hit OK.
Now every time you want your startup template to appear type "startup" hit tab and start filling in your information.  Note:  The abbreviation macros are case sensitive, Startup would not work.

The abbreviation macro is a great tool for storing code that you use most often and I will talk about how to use it in future blogs.

I would love to hear about how you like to start your SAS coding projects.


Tuesday, 11 February 2014

First post of SAS 4 Marketers

The first blog of SAS 4 Marketers:
The goal of this blog is to share SAS ideas and techniques with programmers and analysts who work for marketing departments.
I have worked for financial companies, telecom companies and retailers and my experience has shown that there is no place on the web for exchanging SAS programming techniques that deal with real work problems in the marketing world.  What I mean by "real work problems" is the area of a programmers job between the SAS solution and the publishing of the solution to their client (i.e. the marketing department).
My hope is to share my experience and help others transition from a marketing department request to supplying the marketing department the answers they need.

This first blog will start simply with getting organised.  It has nothing to do with SAS, rather it offers a way to organize your projects, data and code so that you can easily find it at a later date when the marketing department asks you to redo the project.
Below is a sample of folders across two projects.  For every project, I create these folders as places to hold all of the pieces I need to complete each task.


Here a description of every folder and how I use them.

  • 2014: Is the year I created the project.  It is important to have your projects divided into yearly folders because a lot of marketing campaigns are annual and it is important to separate them from year to year.
  • 01 Retention Campaign:  This is the name of the project with the number prefix representing the month you created it.  Make sure your name is descriptive enough that you recognize it from other projects.  I know the last sentence was very obvious, but it is easier said than done.  If a project is an ongoing project and occurs monthly or quarterly, then you want to think about a name that easily changes with each version of the project as well as coinciding with the jargon that is used by your client.  Remember, you will be coming back to this project a few months or even years later when the marketing department sends you a request that sounds like "Can you recreate that list of customers you made for us in January?  It was for retaining our customers."  The prefix is important because people refer to projects by the date they occur and if you have a prefix that identifies the month it was created your life will be easier.  One note on the prefix, use 01, 02, 03, 04 instead of 1, 2, 3, 4 because this will sort your projects in chronological order.  A sorted list of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 can come out as 10, 1, 11, 12, 2, 3, 4 etc. in some versions of windows.
  • Code:  This is where you store your code.  It keeps it in a separate place so you can see all of your versions of code that occurred for the project.  When I name my code, I always give it a descriptive name and follow it with a suffix of the date it was last saved.  This a a good way to  maintain version control of your code.  It also ensures you don't overwrite earlier versions of your code.  The date should be in the format of YYYYMMDD to maintain chronological order when sorted.  So my code tends to be named like this "RetentionCampaign20140211.sas".  Now I know the code was last save on February 11, 2014 and if that is the most recent date, then I know it is the most recent version.
  • Data:  This is a place for all of the data that was inputted for the project.  For example if marketing gives you a list of customers that you need to input into SAS and manipulate, then store the data in this folder.  This is important because it separates the data from the output you are going to create and it keeps a record of exactly what marketing gave you, which may be important later on.
  • Output:  Store everything you send to marketing in output.  It is advisable to date stamp each file with a date suffix described in the Code section.  It is important to only store what was sent out in this folder.  Some projects require you to create many files before you get to your final file and that is why it is important to store only what was sent out, because months later marketing will come back and ask for the file you sent them.  If your Output folder is full of work files then it will be difficult to find exactly which one you sent.
  • Work:  This is where you store all of the files that you created during your project.  Think of it as that messy drawer in your kitchen that has no particular place.  Nobody will see these files so put whatever you want in here.  It is for your eyes only.  Dump everything in here and once you have created the final file you want then move it to the output folder.  Months later, when marketing asked for that file you created for them, you will be thankful you separated your work from your output.
If you keep to this basic organisation you should be able to find what you need months later when the file is required again for analysis.  I would love to hear how you organize your folders for projects and why you do it your way.