If your database has export capabilities, use it. Now!

It should be to no one’s surprise that ETL is a process that should go out the door. If you read my two prior posts, you will see how newer databases that employ export functionality provide far better ways to capture data and send it data warehouses.

The difference in data quality is stark.  Through an ETL process for capturing data and loading it into databases, you have to work through several sources, some of which you may never have the data you need. Sometimes it feels like you are a magician for making data appear.  Then you have the export process which sends all data, that you choose, in as raw as a form as possible for the data gurus to play around with it an mold it into terrific stories.

I’ve been through it all and believe me, there are challenges even when you use the export functionality.  The challenges included finding the latest point for each user to throw in a graph, finding latest values for a specific user field when a user has multiple updates of that property, or even .  Why are those actually difficult? Well for a few reasons, databases like Vertica don’t use keys the same ways we are used to storing data. These c-store databases store values by column, not by row. You won’t have unique rows per user, instead you will see every single update. Once you get past the idea that data is stored by column and you don’t have unique values, a light bulb goes off once you can see all the potential for having more data.

I had abstracted my data quite a bit. This allowed me to create a simple set of functions that would be able to export data to Vertica from VoltDB for nearly any user data created or updated.  I’ll post some Java code so you have an idea what I did.

public class metrics_user_upsert extends VoltProcedure {
	public final SQLStmt checkStmt = new SQLStmt("SELECT * FROM user_metrics WHERE accountID = ? AND metricID = ?;");
	public final SQLStmt insertStmt = new SQLStmt("INSERT INTO user_metrics (accountID, metricID, daily, weekly, monthly, total, lastUpdate) VALUES  (?,?,?,?,?,?,?);");
	public final SQLStmt updateStmt = new SQLStmt("UPDATE user_metrics SET " +
						  " daily = ?," +
						  " weekly = ?," +	
						  " monthly = ?," +
						  " total = ?," +
						  " lastUpdate = ?" +
						  " WHERE accountID = ? AND metricID = ?;");
						  
	public final SQLStmt exportStmt = new SQLStmt("INSERT INTO user_metrics_export (accountID, metricID, value, lastUpdate) VALUES  (?,?,?,?);");
	public final SQLStmt exportLogStmt = new SQLStmt("INSERT INTO user_metrics_logs_export (accountID, metricID, period, value, increase, decrease) VALUES  (?,?,?,?,?,?);");
    
	public VoltTable[] run( long	accountID,
					long 	metricID,
					long 	value,
					String 	setType,
					long	defaultValue,
					String	min_max
		     ) throws VoltAbortException {
		long currentTime = getTransactionTime().getTime()*1000;

		// check if symbol exists
		voltQueueSQL(checkStmt, accountID, metricID);
		VoltTable[] metricResults = voltExecuteSQL();
		
		long[] userMetricValues = metricValue(setType, value, metricResults[0]);
		long[] userMetricWHValues = metricWHValue(setType, value);
			
		// update or insert
		if (metricResults[0].getRowCount() > 0) {
			voltQueueSQL( updateStmt,
				  userMetricValues[0],
				  userMetricValues[1],
				  userMetricValues[2],
				  userMetricValues[3],
				  currentTime,
				  accountID,
				  metricID
				  );
		} else {
			userMetricValues[0] +=  defaultValue;
			userMetricValues[1] +=  defaultValue;
			userMetricValues[2] +=  defaultValue;
			userMetricValues[3] +=  defaultValue;
			
			voltQueueSQL( insertStmt,
				  accountID,
				  metricID,
				  userMetricValues[0],
				  userMetricValues[1],
				  userMetricValues[2],
				  userMetricValues[3],
				  currentTime
				  );
		}
		
		//Write to export
		voltQueueSQL( exportStmt,
				  accountID,
				  metricID,
				  userMetricValues[3],
				  currentTime
				  );
				  
		voltQueueSQL( exportLogStmt,
				  accountID,
				  metricID,
				  currentTime,
				  userMetricWHValues[0],
				  userMetricWHValues[1],
				  userMetricWHValues[2]
				  );

        // Passing true parameter since this is the last voltExecuteSQL for this procedure.
        voltExecuteSQL();

		return 1;	
    }
	
	private long[] metricValue(String setType, long value, VoltTable metric) throws VoltAbortException {
		long[] finalValue = new long[] {0,0,0,0}; //daily, weekly, monthly, total
		if (metric.getRowCount() >= 1) {
			finalValue[0] = metric.fetchRow(0).getLong(2);
			finalValue[1] = metric.fetchRow(0).getLong(3);
			finalValue[2] = metric.fetchRow(0).getLong(4);
			finalValue[3] = metric.fetchRow(0).getLong(5);
		}
		
		if (setType.equals("increase") || setType.equals("increment")){
			finalValue[0] += value;
			finalValue[1] += value;
			finalValue[2] += value;
			finalValue[3] += value;
		} else if (setType.equals("decrease") || setType.equals("decrement")){
			finalValue[0] -= value;
			finalValue[1] -= value;
			finalValue[2] -= value;
			finalValue[3] -= value;
		} else { //Else we just set to whatever they want
			finalValue[0] = value;
			finalValue[1] = value;
			finalValue[2] = value;
			finalValue[3] = value;
		}	
		
		return finalValue;
	}
	
	
	private long[] metricWHValue(String setType, long value) throws VoltAbortException {
		long[] finalValue = new long[] {0,0,0}; //value, increase, decrease
		
		if (setType.equals("increase")){
			finalValue[0] += value;
			finalValue[1] += value;
		} else if (setType.equals("decrease")){
			finalValue[0] -= value;
			finalValue[2] += value;
		} else { //Else we just set to whatever they want
			finalValue[0] = value;
		}		
		
		return finalValue;
	}
}

The above code was used to update user metrics. Metrics were used for any numeric value that would be associated to the user. Whether points, money, how many of an item they own, actions they perform, etc. The Java function captured how the data needs to be modified, whether we were looking to increment, decrement or simply replace the value. Based on that change, the final value was updated in VoltDB. The transaction itself was sent off to Vertica.

VoltDB was responsible for ensuring the data went to Vertica.  VoltDB maintained that connection with Vertica and sent data to it several times per second. When Vertica was having issues, VoltDB was keeping track of the issues to ensure that when things are back to normal the data will be caught up on Vertica. It was a glorious system that saved me many times.

Out of all this, the end result was that anything I had to do now could be done in real time. The data was being used for real time analytics and contextual data pulling. Reports had the ability to be adhoced, which meant that any pie in the sky idea for a question on analytics was more often than not answerable, even when the data was not stored for that specific intent. The flexibility that comes out of storing data on a per transaction basis made my anxiety level go down significantly when someone asked if a certain report was possible.

This ends this week’s segment on my Big Data Series. I will be back with another on Monday, where I will cover pulling of data from a warehouse database to use for analytic purposes.