Thursday, November 3, 2022

Stripping HTML tags from SharePoint data at runtime

If you find yourself in a situation where "somebody" imported a bunch of SharePoint data that included what SharePoint decided had some HTML tags in it, then you might wind up with a multiline text field that has whatever value you care about surrounded with some very unique "<div></div>" tags.  Everything might look fine when you're glancing at your original list, but when you query the field you see something like:

<div class="ExternalClassCA06509747B841D08AA4020D3A431C1E">Some Value</div>

And don't think that SharePoint will be nice and use the same tag within that <div> tag.  It will not.  So if you're going to do any manipulation/comparison on that within your PowerApp (like looking for matching values), you're going to need to get rid of that garbage.

While it is of course always better to try and fix the data, if you're in a situation where you don't control it or cannot do it easily (say when you're dealing w/ hundreds of thousands of rows), then it is time to manipulate this within your app.

The Basic Problem

So let's assume we're starting our journey by grabbing a group of employee records from a SharePoint list by Employee ID.  We might start w/ something similar to:

ClearCollect(employeeSubmissions,
    Filter(mySharePointListOfEmployeeSubmissions, 
        EID=12345
    )
);

At this point, we have a Collection of all submissions for this Employee ID.  But now if we need to Filter down further say on an example field named 'Goal Type' where there are <div> tags in the field, you'll note that something like this won't work:

ClearCollect(employeeSubmissions, 
        Filter(employeeSubmissions, 
            'Goal Type'="Annual"
        )
)

The reason this does not work is because the VALUE inside 'Goal Type' is actually something like:

<div class="ExternalClassCA06509747B841D08AA4020D3A431C1E">Annual</div>

It is not an exact match, so it won't work.

NOTE: It *appears* that this "...ExternalClass{some string}" value is actually a GUID that probably relates to the individual row in SharePoint GUID, and is therefore likely always 32 characters long, I'm not willing to bet my life on it.  It probably is...but...YMMV

String Functions to the Rescue

While the bold out there can go lean on some Regex() functions to strip this, I never suggest that to someone who is learning or intending their code to be actually supported.  Trying to debug a Regex() is just problematic at times and sets a higher bar for who can debug this kind of work.  So using the more common string functions is generally what I suggest.  In this scenario, the Mid() function combined w/ Find() will solve for this.

To jump straight to a solution that will work for this situation, let's continue w/ our original code from above and fix this specific string field:

ClearCollect(employeeSubmissions,
    Filter(mySharePointListOfEmployeeSubmissions, 
        EID=12345
    )
);
UpdateIf(employeeSubmissions,
    true,
    {
        'Goal Type': Mid(
            'Goal Type',
            Find(
                ">",
                'Goal Type'
            ) + 1,
            Len('Goal Type')-6 - Find(
                ">",
                'Goal Type'
            
        )
    }
)

To run through what we're doing here, we are using the Mid() function to strip our string down and remove everything up until the end of the first <div ...> tag, then to include everything up until the last 6 characters (which should be the closing </div> (which is 6 characters long).  

Mid() works by taking the starting point and then the # of characters to grab.  The starting point is just the last character of the starting <div> tag.  However, to calculate the # of characters we need, we actually wind up subtracting the starting point from where our final character exists.  So, the code above takes the length of the string, subtracts 6 characters (length of our DIV) then subtracts where the > character was from the original <div> statement.

And while I cannot 100% guarantee that the next code will ALWAYS work, it certainly appears that the text that gets thrown into the <div class=...> key is the GUID for that record (or maybe that field for that record?  anybody know?).  If true, this makes it always a fixed length (32), so that means the following simpler code probably works in all cases:

ClearCollect(employeeSubmissions,
    Filter(mySharePointListOfEmployeeSubmissions, 
        EID=12345
    )
);
UpdateIf(employeeSubmissions,
    true,
    {
        'Goal Type': Mid(
            'Goal Type',
            60,
            Len('Goal Type')-65
        )
    }
)

Now that this UpdateIf() has run, then the following Filter would actually work as expected:

ClearCollect(employeeSubmissions, 
        Filter(employeeSubmissions, 
            'Goal Type'="Annual"
        )
)

This is because 'Goal Type" had all the extraneous HTML garbage removed and now just holds the value "Annual" in it (at least within our local collection, SharePoint is another story).

Fixing Multiple Fields w/ HTML in one Command

Just to close the loop here, if you have a problem w/ one field, then you might have it w/ many.  The approach above can be easily built out to fix this for many fields at once in the same command.  Let's assume I have two fields I need to fix named 'Goal Type' and 'Goal Class'.  The below code would fix both in one fell swoop:

UpdateIf(employeeSubmissions,
    true,
    {
        'Goal Type': Mid(
            'Goal Type',
            60,
            Len('Goal Type')-65
        ),
       'Goal Class': Mid(
            'Goal Class',
            60,
            Len('Goal Class')-65
        )
    }
)

Essentially, you just add the next field below the prior until you're done (w/ a comma separating them).  Just make sure that for your Mid() and Len() functions within these sections you rename it to the correct field in each segment.

Final Thoughts

I know that SharePoint has its reasons for this that likely are due to how SharePoint data can be used in an array of ways to also populate out the pages in SharePoint to reflect certain details.  And there probably is some way to do a calculated column to get a similar result by fixing it at the SharePoint side.

However, if you find yourself in a pinch and just need to pull those HTML tags off the front/back from field values in Lists that you don't necessarily have full control over, then this will get the job done.

No comments:

Post a Comment

Because some d-bag is throwing 'bot posts at my blog I've turned on full Moderation. If your comment doesn't show up immediately then that's why.