Sunday, April 24, 2022

Importing Excel data at run-time (15,000 row import workaround)

If you have found yourself working w/ some larger Collections, Excel Tables, or other data sources, you'll start to hit some limits.  For importing data into a static Collection at design-time, that limit is 15,000 rows.  Now you can of course chop up your list into multiple lists to get past this, but that gets frustrating if you're doing it lots and lots of times.

As well, some of the default methods that might "work fine" for smaller lists appear to blow up w/ large datasets.  However, w/ a little bad programming, you can get around this.  Read on.

Note: Microsoft made a recent (November 2023) change that limited input lengths on Text input fields.  

Let's say you have....oh...a 38,000 row Employee table you need to import weekly to keep your data fresh as a part of some payroll process.  Each pay period you need to pull in this data.  Now, in theory, your database team has probably done all of this work for you.  Spun up an exposed data source, wrapped it in an API, made it nice and simple (and free) for you to access through PowerApps.

Or not.

So to get this data in a way you can use it, you could host it in SharePoint.  Set up a process to pull this in, remove old records that are no longer valid, update existing ones, add new ones, etc.  However, SharePoint Online is also fairly slow at this kind of process.  So you might find yourself getting frustrated with exactly how (un-?)fast, (non-?)efficient, and (definitely) error-prone a process that is designed to save you all time might be.

So how do we fix this?

One option is to allow your end-users a way to import this data at run-time.

Primary Use-Case: Payroll processing

Lots of processes are more "batch" in nature and are done by a team of last-mile staff processing/parsing/fixing data before ingestion by an ERP, accounting, or payroll application.  This approach is a good fit for that kind of scenario.

It works for this kind of use-case because this is generally a small team who can be trained to do a task reliably and are generally skilled w/ the core applications associated w/ such a process (i.e. Excel).  As well, these systems generally have a "punt" option of exporting/importing CSV files.  The key is always to get these files properly formatted.

Imagine again our 38,000 record employee listing that is being exported from our ERP system.

We'll probably need to merge more network-friendly information from Active Directory with more ERP-friendly payroll data.  For example, I know everyone's GUID or network ID from AD, but I might not know all of the various jobs they do, certifications they have, etc.  How do I merge this data if my data team hasn't quite yet got it all ready for me in a curated source?

The ERP source probably has an Email -> EmployeeID crosswalk table.  So easy-peasy, right?  Well, that's where the 15,000 row problem starts to get annoying and throw off your processes.

You certainly could, as a Developer, split and import this list every few weeks and accomplish your goals.  However, what if you wanted your end-users to be able to do this w/o you needing to be involved?

Ingesting a CSV at run-time through copy/paste

On a small scale, this problem is easy to solve w/o any real effort.

  1. Create a Text Input (named runtimeCSVImport) and set it to multiline
  2. Run your app and copy a few rows/columns from your Excel file
  3. Create a button to Split your list on Char(10) then Char(9) and drop the results into a Collection as rows/columns

If we copy/paste out of Excel, it drops a Windows line ending (CR/LF, i.e. Char(13) & Char(10)) at the end of each line and a Tab (Char(9)) between each column.  So splitting on Char(10) gives us rows, and Char(9) gives us columns.
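One wrinkle: because the line ending is two characters, splitting on Char(10) alone leaves a trailing Char(13) on each row.  Here's a minimal sketch of one way to scrub that out up front (an optional tweak using the same runtimeCSVImport control; the examples below run without it):

ClearCollect(
    allData_raw,
    Split(
        // strip the carriage returns first so no cell ends up w/ a stray Char(13)
        Substitute(runtimeCSVImport.Text, Char(13), ""),
        Char(10)
    )
)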

This "works" with a little finagling but good coding practices are likely to result in this failing if you have a very large list.

Here's an example, using a simple crosswalk of two columns: EmployeeID, Email.

If I have my end-users copy/paste these rows/columns into that field, then I could use the following code to turn it into a Collection named allData_raw:

ClearCollect(
    allData_raw,
    Split(
        runtimeCSVImport.Text,
        Char(10)
    )
)

This gets us a single-column Collection as a starting point.  Clearly we need to throw away the first line (the header row), but how do we split this into the columns we need?  Well, Excel separates copy/pasted columns using the TAB character (i.e. Char(9)).

NOTE: I generally then include an If() statement to check and see if the end-user included the headers or not and then remove that row via the RemoveIf() command.
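For example, here's a minimal sketch of that check (assuming the pasted header row starts w/ the literal text "employeeID"; adjust the test to match your own headers):

If(
    // did the end-user paste the header row in too?
    StartsWith(First(allData_raw).Result, "employeeID"),
    // if so, drop it before we start splitting out columns
    RemoveIf(allData_raw, StartsWith(Result, "employeeID"))
)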

So if we now run the following code (after our above code) we'll get things closer to what we want:

ClearCollect(
    allData_split,
    AddColumns(
        allData_raw,
        "employeeID",
        First(
            Split(
                ThisRecord.Result,
                Char(9)
            )
        ).Result
    )
)

But let's take a closer look at how this worked.  It ran a single command to parse all rows of our first Collection (allData_raw) and split/merge them into our new Collection (allData_split).  We could have used a ForAll() function to roll through each row and parse/process, but the ClearCollect command allows us to do this all at once and much faster.  Pretty slick.
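For comparison, here's a minimal sketch of what that ForAll() version might look like (it builds the same two-column table, just row by row):

ClearCollect(
    allData_split,
    ForAll(
        allData_raw,
        {
            // keep the original full line alongside the parsed-out first column
            Result: ThisRecord.Result,
            employeeID: First(Split(ThisRecord.Result, Char(9))).Result
        }
    )
);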

So now that you're all cocky, let's finish it up and add the last column via the following extension of our above command:

ClearCollect(
    allData_split,
    AddColumns(
        allData_raw,
        "employeeID",
        First(
            Split(
                ThisRecord.Result,
                Char(9)
            )
        ).Result,
        "email",
        Last(
            Split(
                ThisRecord.Result,
                Char(9)
            )
        ).Result
    )
)



So that's it, right?  Now we can import 38,000 rows and everything's grand?  Nope.  This is where good coding breaks down against the realities of pushing the platform well beyond its expected configuration (2,000 rows).

Breaking the platform

The problem here is the addition of this last column in a single command.  When we did the Result and the employeeID columns, everything was fine.  When we added the third column, email, the platform starts to implode once the record count gets high.  You will find your browser freezes, times out, and essentially never finishes the above code.  It will simply sit there attempting to process that list forever (or at least long enough that I kill the process).

So why does this break when the exact same code works just fine w/ small lists?  Well, I haven't followed it all the way down the rabbit-hole, but I'd imagine that somewhere in the Power Fx language source we'd find that complex nested (inner->outer) expressions don't process the way we'd imagine they might in a more traditional language (e.g. C#, Python, etc.).  If you've worked with the language enough, you realize that sometimes you need to do calculations step by step instead of as a single expression.

This means that the solution is actually REALLY simple: write bad code.  

The Solution: Break up the steps

Ultimately, we just need to split up our code into chunks and let them each process individually.  This code works all the way through no matter how many records (at least as far as I've tested) we throw at it:

ClearCollect(
    allData_raw,
    Split(
        runtimeCSVImport.Text,
        Char(10)
    )
);
ClearCollect(
    allData_split,
    AddColumns(
        allData_raw,
        "employeeID",
        First(
            Split(
                ThisRecord.Result,
                Char(9)
            )
        ).Result
    )
);
ClearCollect(
    allData_final,
    AddColumns(
        allData_split,
        "email",
        Last(
            Split(
                ThisRecord.Result,
                Char(9)
            )
        ).Result
    )
)

To break this down again, we first parse all rows into a Collection, then we split out the first column, then we split out the last column.  Each step we're running a Split() function to take another chunk of the string out and drop it into a new column.

If we needed to handle more than 2 columns, then of course we could use a combination of Last() and FirstN() to split these up into a larger # of columns.
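For example, here's a minimal sketch of a 3-column version, pulling a hypothetical middle column (the name "department" is purely illustrative):

ClearCollect(
    allData_final,
    AddColumns(
        allData_split,
        // Last(FirstN(..., 2)) grabs the 2nd chunk; bump the FirstN count by 1 per extra middle column
        "department",
        Last(
            FirstN(
                Split(ThisRecord.Result, Char(9)),
                2
            )
        ).Result,
        "email",
        Last(
            Split(ThisRecord.Result, Char(9))
        ).Result
    )
)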

Performance

At this point, we've got a 38,000-row list in memory that can be queried using a LookUp() function.

LookUp(
    allData_final,
    email = "nowhere@noplace.nope"
).employeeID

If we were processing tens of thousands of transactions and attempting to merge data coming from our Power Platform tools (e.g. email) with data needed for the ERP (e.g. employeeID), we can now perform these lookups in a mere fraction of the time it would take to grab this from SharePoint, or even from the original Excel document we copy/pasted from.
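As a minimal sketch of that merge, assume a hypothetical transactions Collection that carries an email column; we can stamp the ERP employeeID onto every row entirely in memory:

ClearCollect(
    transactions_withID,
    ForAll(
        // "transactions" and "transactions_withID" are illustrative names only
        transactions As tx,
        Patch(
            tx,
            { employeeID: LookUp(allData_final, email = tx.email).employeeID }
        )
    )
);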

As long as the system running the application has enough memory & CPU horsepower, this should be on the order of 1000x faster.

Alternate Approach

As I've turned this into a more standardized method and tried to let Excel people think like Excel people, I've approached this differently.  Let's say we start w/ a similar design to the one above, except I pre-populate the Collection w/ enough columns to cover my table:

ClearCollect(
    allData_raw,
    AddColumns(
        RenameColumns(
            Split(
                runtimeCSVImport.Text,
                Char(10)
            ),
            "Result",
            "fullLine"
        ),
        "A",
        "",
        "B",
        "",
        //...just repeat until you're done
        "V",
        ""
    )
);

Here we have created a list of items w/ a first column named fullLine that includes the entire line of text pasted into our control.  Then we're adding in a bunch of blank columns named A through V.  Now we populate them all in one fell swoop:

UpdateIf(
    allData_raw,
    true,
    {
        A: First(
            Split(
                fullLine,
                Char(9)
            )
        ).Result,
        B: Index(
            Split(
                fullLine,
                Char(9)
            ),
            2
        ).Result,
//...just adjusting our column name and the Index position up by 1 each time until the last step
        V: Last(
            Split(
                fullLine,
                Char(9)
            )
        ).Result
    }
);

First off, this works very well and executes almost immediately even on tens of thousands of rows and lots of columns.

Also, we can match our columns up identically to Excel columns (A, B, C, etc.).  We can also decide if/how we want to move columns to a new Collection depending upon the order in which they were pasted.  Meaning, as long as the end-user includes the headers, they can even paste the columns in out of order and we can move them around based upon the values in the first row.

Switch(
    First(allData_raw).A,
    "employeeID",
    ...
)
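Here's a minimal sketch of how that might get fleshed out for our two-column crosswalk (the mapping and target Collection below are illustrative assumptions, not a complete solution):

ClearCollect(
    allData_final,
    AddColumns(
        allData_raw,
        // if the first pasted column's header reads "employeeID" the IDs are in column A;
        // if it reads "email" the end-user swapped the columns, so pull them from B instead
        "employeeID",
        Switch(
            First(allData_raw).A,
            "employeeID", ThisRecord.A,
            "email", ThisRecord.B
        ),
        "email",
        Switch(
            First(allData_raw).A,
            "employeeID", ThisRecord.B,
            "email", ThisRecord.A
        )
    )
);
// the header row itself can still be dropped afterwards w/ RemoveIf(), as earlier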

What If We...

It would be REALLY NICE if we could dynamically create column names based upon the text included in the header fields.  However, it doesn't appear this is possible.  For example, this just doesn't work:

ClearCollect(
    allData_test,
    AddColumns(
        allData_raw,
        First(
            Split(
                First(allData_raw).Result,
                Char(9)
            )
        ).Result, //at run-time this evaluates to "employeeID"
        First(
            Split(
                ThisRecord.Result,
                Char(9)
            )
        ).Result
    )
)

Even though the marked snippet evaluates to "employeeID" at run-time, and changing the code to use the literal string "employeeID" works, the code above doesn't.  Why is that?  Because in PowerApps we cannot build column names dynamically like this; they have to be known at design-time.

Maybe PowerApps will be enhanced in the future to allow this, but for now, this is about as far as we can get.

Final Thoughts

This actually is pretty handy to know how to do.  There are any number of scenarios where we need to let end-users copy Excel-formatted data into our applications for processing.  And it is extra-interesting/beneficial that ingesting, sorting, and classifying all of this data is still much faster than asking SharePoint for just a subset of it.

This can be a starting-off point for more ad-hoc creation of records in SharePoint (copy/paste a batch of data into our app to create, modify, or delete specific records).  In fact, Power BI can generate CSV output pretty easily, so this also allows one team to generate reports and export them as CSVs that another team can later import and execute.

This kind of methodology allows for some fairly elaborate division of labor w/o anyone needing to fully understand what each team is doing.

From my perspective, it does allow us to still maintain a highly integrated process from start->finish while also letting each team operate independently.

No matter how many records some lunatic decides to paste in.
