Fuzzy Grouping in Excel: How to Merge Similar but Not Identical Data

Have you ever had to manually group similar but not identical data in Excel? Discover how Power Query offers an effective solution through fuzzy grouping, a powerful but little-known feature that automates this tedious process.

Excel SEO
Excel SEO

Working with non-standardized data in Excel can be a frustrating task. We often encounter values that should be considered the same but have minor typing, formatting, or syntax differences. Manually grouping these elements takes precious time and increases the risk of errors. Fortunately, Excel offers an effective but little-known solution: fuzzy grouping in Power Query, an advanced fuzzy matching technique that greatly simplifies this process.

Understanding fuzzy grouping in Excel

Fuzzy grouping is an advanced Power Query feature that allows you to automatically identify and merge similar but not identical items. But before we dive deeper, let’s clarify: what is Power Query? It’s a powerful data preparation tool built into Excel that allows you to transform and clean data efficiently. Unlike standard Excel functions that require exact matches, fuzzy grouping uses similarity algorithms to recognize variations of the same data. This feature is particularly useful when working with:

  • Lists of names with spelling variations (e.g., “John Smith” and “Smith John”)
  • Data entered by different people with different conventions
  • Information from different systems with non-uniform formatting
  • Answers to open-ended questions in surveys or questionnaires

Fuzzy grouping is based on a configurable “similarity score” that determines how similar two strings must be to be considered equivalent. Additionally, you can create custom translation tables to map specific terms you want to treat as identical. In the following paragraphs, I will guide you through all the necessary steps to implement this powerful feature in your spreadsheets, including intelligent data preparation and the grouping operation.

Prepare data for fuzzy grouping

Before using fuzzy grouping, it’s essential to organize your data correctly. Proper preparation will ensure optimal results and simplify the entire process. The first step is to convert your data into an Excel table. This is a requirement to use Power Query and access fuzzy grouping features.

To create a table:

  1. Select any cell within your data
  2. Press Ctrl+T or go to the “Insert” tab and click “Table”
  3. Check the option “My table has headers” if the first row contains column names
  4. Confirm by clicking “OK”

Once the table is created, it’s advisable to give it a meaningful and concise name. This will make it easier to refer to it in formulas and Power Query. To rename the table:

  1. Select any cell in the table
  2. In the “Table Design” tab that appears, change the name in the “Table Name” field in the top-left corner

It’s also important to check data quality before proceeding. Look for blank cells, formatting errors, or special characters that could affect the grouping process. If necessary, perform data cleaning by removing extra spaces, standardizing capitalization, or correcting obvious errors. This data cleaning phase is crucial to ensure the integrity of the dataset and improve the effectiveness of fuzzy grouping.

Finally, identify which columns contain the values you want to group. Fuzzy grouping works best when applied to a single column at a time, so you may need to reorganize your data accordingly. This intelligent data preparation will help you achieve more accurate results in the subsequent steps.

Create a custom translation table

One of the most powerful features of fuzzy grouping is the ability to use a custom translation table. This table, acting as a reference table, allows you to explicitly define which terms should be considered equivalent, regardless of their similarity score. The translation table must have a specific structure with two columns:

  • From: contains the original values you want to map
  • To: contains the values to which you want to convert the original terms

For example, you might want to consider “email”, “e-mail”, and “electronic mail” as the same concept. In this case, the translation table might look like this:

From A
email electronic mail
e-mail electronic mail
mail email

To create this transformation table:

  1. Enter “From” and “To” headers in two adjacent cells
  2. Fill in the rows with the pairs of values to map
  3. Select the entire area and convert it into a table (Ctrl+T)
  4. Assign a meaningful name to the table, for example, “Translation”

The translation table is particularly useful for:

  • Standardizing industry-specific terminologies
  • Unifying abbreviations and full forms
  • Managing synonyms or equivalent terms in different contexts
  • Correcting common typos or formatting errors

The more complete and accurate your translation table is, the better the fuzzy grouping results will be. It is worth investing time in creating a comprehensive translation table, especially if you plan to frequently perform merge operations on similar data.

Import data into Power Query

Once your data and translation table are prepared, it’s time to import everything into Power Query to begin the fuzzy grouping process. Loading data into Power Query is a fundamental step that will allow you to apply advanced transformations before loading the results back into Excel.

To import the main table:

  1. Select any cell within the data table
  2. Go to the “Data” tab in the ribbon
  3. Click on “From Table/Range” in the “Get & Transform Data” group

The Power Query Editor will open with your data. This environment allows you to apply advanced transformations before loading the results back into Excel. Next, you need to import the translation table as well (if you created one). The process is identical:

  1. Return to Excel without closing the Power Query Editor
  2. Select a cell in the translation table
  3. Go to the “Data” tab and click on “From Table/Range”

You will now have two separate queries in the Power Query Editor, visible in the “Queries” pane on the left. It is important that both queries are available in the Power Query environment before proceeding with fuzzy grouping.

Before continuing, it is advisable to verify that the data types are correct in both tables. Power Query automatically assigns data types based on content, but sometimes you may need to correct them:

  1. Select the column to modify
  2. Right-click and choose “Change Type”
  3. Select the appropriate data type (usually “Text” for the data to be grouped)

With the data correctly imported and formatted, you are ready to apply fuzzy grouping. If you need to load multiple files from a folder, Power Query also offers this functionality, which can be useful for more complex projects involving multiple data sources.

Apply basic grouping in Power Query

Before using fuzzy grouping, it is helpful to understand how standard grouping works in Power Query. This will provide us with a foundation to then modify the formula and implement fuzzy grouping. To apply standard grouping:

  1. In the Power Query Editor, select the column that contains the values to be grouped
  2. Go to the “Transform” tab in the ribbon
  3. Click on the “Group By” button

The “Group By” dialog box will open with several options:

  • Group by:select the column to use for grouping
  • New column:enter a name for the column that will contain the grouping results
  • Operation:choose “All rows” to keep all original data

After configuring these settings, click “OK” to apply the standard grouping. Power Query will generate an M formula using the function
Table.Group(). This formula will appear in the formula bar at the top of the editor. The result will be a new table with two columns:

  • The column with the unique values found in the selected field
  • A column containing nested tables with all rows corresponding to each unique value

This standard grouping, however, only works with exact matches. To achieve grouping based on similarity, we need to modify the generated formula and transform it into a fuzzy grouping, thus implementing a more flexible fuzzy matching.

Modify the formula for fuzzy grouping

The crucial step to implement fuzzy grouping is to manually modify the formula generated by standard grouping. This is necessary because Power Query’s user interface does not offer a direct button for fuzzy grouping. After applying standard grouping, observe the formula bar at the top of the editor. You will see a formula similar to this:

= Table.Group(#"Tipo modificato", {"NomeColonna"}, {{"Dati", each _, type table [NomeColonna=nullable text]}})

To convert it to fuzzy grouping, you need to:

  1. ChangeTable.GrouptoTable.FuzzyGroup
  2. Add a fourth parameter that defines the fuzzy grouping options

The modified formula should look like this:


= Table.FuzzyGroup(#"Tipo modificato", {"NomeColonna"}, {{"Dati", each _, type table [NomeColonna=nullable text]}}, [IgnoreCase=true, IgnoreSpace=true, Threshold=0.8, TransformationTable=Traduzione])

The options in the fourth parameter control the fuzzy grouping behavior:

  • IgnoreCase:when set to true, grouping ignores case differences
  • IgnoreSpace:when set to true, spaces are ignored during comparison
  • Threshold:a value between 0 and 1 that determines how similar two strings must be to be grouped (0.8 is a good starting point)
  • TransformationTable:the name of the query containing the translation table

After modifying the formula, press Enter or click the checkmark next to the formula bar to apply the change. Power Query will perform fuzzy grouping according to the specified parameters.

It’s important to note that the Threshold value requires experimentation. A value too high (close to 1) will require near-perfect similarity, while a value too low (close to 0) might group items that shouldn’t be considered similar. This string similarity-based aggregation is the core of fuzzy grouping.

Configure Similarity Options

The success of fuzzy grouping largely depends on correctly configuring the similarity options. These options determine which items will be considered similar and thus grouped together. The Threshold option is particularly important. It represents the minimum similarity score (from 0 to 1) needed for two strings to be considered equivalent:

  • A value of 1.0 requires an exact match (equivalent to standard grouping)
  • A value of 0.0 would group all items together (rarely useful)
  • Values between 0.7 and 0.9 are generally most effective for most applications

The choice of the optimal value depends on the nature of your data:

  • For data with minor spelling variations: try 0.8-0.9
  • For more significant variations in wording: try 0.6-0.8
  • For related concepts expressed differently: try 0.5-0.7

The optionsIgnoreCase e IgnoreSpace are simpler to configure:

  • IgnoreCase=true: useful in most cases, as case differences rarely indicate different meanings
  • IgnoreSpace=true: useful when spaces are inconsistent (e.g., “data base” vs “database”)

It is advisable to start with conservative settings (high threshold) and gradually reduce the value if necessary. After each change, carefully review the results to ensure the grouping is logical and consistent with your expectations.

Remember that you can always go back and modify these settings if the results are not satisfactory. The process of optimizing similarity options is often iterative and requires experimentation. Some similarity algorithms, such as the Jaccard similarity algorithm, can be particularly effective for certain data types, so it’s worth exploring different options.

Expand Grouping Results

After applying fuzzy grouping, you will get a table with two columns: the grouped values column and a column containing nested tables with all the original data. To make these results more usable, you need to expand the nested tables.

To expand the results:

  1. In the column containing the nested tables, click the expand icon (two diverging arrows) in the column header.
  2. In the dialog box that appears, select the columns you want to include in the expanded results.
  3. Choose whether to keep or remove the original column name prefix.
  4. Click “OK” to apply the expansion.

This table expansion process will transform the nested structure into a flat table with all the original data, but now organized according to the applied fuzzy grouping. Each row will show the grouped value along with the corresponding original data.

If the original table contained many columns, you might want to select only the most relevant ones during expansion to keep the results manageable. You can always change this selection later if needed. In some cases, it might also be useful to consider removing unnecessary columns to further simplify the dataset.

Expanding the results is particularly useful when you want to:

  • See all the original values that were grouped together
  • Verify the accuracy of the fuzzy grouping
  • Perform further analysis on the grouped data

Prepare data for visualization or reporting

After expansion, it is advisable to reorder the columns logically to facilitate the interpretation of the results. You can do this by dragging the column headers to the desired position or by using the “Move” option in the column’s context menu. This step is important for creating a well-arranged and ordered set of data that will be easier to analyze and present.

At this stage, you may also want to consider standardizing the values in some columns to ensure consistency in your reports. For example, you might want to standardize the format of date fields or ensure that all names are in a consistent format (e.g., “Last Name, First Name”). These final cleaning operations will help improve the overall quality of your dataset.

Load Results into Excel

Once you have completed the fuzzy grouping and configured the output as desired, it’s time to load the results back into Excel for final analysis or presentation. To load the results:

  1. In the Power Query Editor, navigate to the “Home” tab on the ribbon.
  2. Click the “Close & Load” button to send the data directly to Excel.
  3. Alternatively, click the arrow under “Close & Load” and select “Close & Load To…” for more options.

In the “Import Data” dialog box, you can choose:

  • Table:loads the data as a formatted Excel table (recommended option)
  • PivotTable:creates a PivotTable directly from the grouped data
  • Only Connection:creates only a connection to the data without loading it into a sheet
  • Add these data to the Data Model:useful for more complex analysis or use with Power Pivot

Also select the location where you want to load the data:

  • Existing worksheet:specifies a cell in an existing sheet
  • New worksheet:creates a new sheet for the results

After confirming your choices, Excel will load the grouped data into the specified location. The data will maintain a dynamic link to the Power Query query, meaning you can refresh the results if the source data changes.

To refresh the data in the future:

  1. Select any cell within the result table.
  2. Go to the “Data” tab on the ribbon.
  3. Click “Refresh All” or “Refresh” in the “Queries & Connections” group.

This will re-run the Power Query query, reapplying the fuzzy grouping to any updated data. This automatic refresh capability is particularly helpful when working with data that changes frequently or when you need to merge queries from different sources.

Review and Refine Results

After loading the results into Excel, it’s crucial to verify the accuracy of the fuzzy grouping and make any necessary refinements. Even with the best settings, automatic grouping might not be perfect on the first try. Here are some strategies for reviewing and improving the results:

  1. Examine the created groups: Sort the data by the grouped value and check if all items within each group are truly related. Look for any anomalies or items that seem out of place.
  2. Identify false positives: Different items that have been incorrectly grouped together. These indicate that the similarity threshold might be too low.
  3. Look for false negatives: Similar items that have not been grouped together as expected. These suggest that the similarity threshold might be too high.
  4. Update the mapping table: If you find recurring errors, add new mappings to your mapping table to explicitly correct them.
  5. Adjust similarity settings: Go back to the Power Query Editor and modify the Threshold value or other similarity options to improve the results.

To modify the query and refine the grouping:

  1. Select any cell in the results table.
  2. Go to the “Query” or “Data” tab in the ribbon
  3. Click “Edit” to reopen the Power Query Editor
  4. Edit the fuzzy grouping formula or the translation table
  5. Close and load again to refresh the results

Fuzzy grouping refinement is often an iterative process that requires several attempts to achieve optimal results. Don’t hesitate to experiment with different settings until you find the combination that works best for your specific data. This refinement process will help ensure the integrity of your dataset and the quality of your final results.

Practical use cases for fuzzy grouping

Fuzzy grouping in Excel is a versatile tool with numerous practical applications across various industries. Here are some common use cases where this functionality can make a difference:

  • Customer database cleanup: Identify duplicates with minor variations in names (e.g., “Smith LLC” and “Smith, LLC”)
    • Standardize names of acquired companies or those with different brands
    • Unify customer records from different systems
  • Feedback and survey analysis: Group open-ended responses with similar meanings
    • Identify common themes in customer reviews or comments
    • Categorize suggestions or complaints for prioritization
  • Inventory management: Standardize names of manually entered products
    • Identify similar or equivalent products from different suppliers
    • Consolidate product categories with slightly different nomenclature
  • Financial analysis: Group similar expense items recorded with different names
    • Standardize bank transaction descriptions
    • Consolidate cost categories for more accurate reporting
  • Market research and analysis: Group competitor names with different spelling variations
    • Standardize names of geographic locations or regions
    • Unify industry terms or technical jargon

For each of these use cases, fuzzy grouping offers significant time savings compared to manual categorization, while also reducing the risk of human error. The key to success is to tailor the similarity settings and translation table to the specific needs of your scenario.

Limitations and alternatives to fuzzy grouping

Despite its power, fuzzy grouping in Power Query has some limitations that are important to be aware of:

Main limitations:

  • Works best with relatively short texts; long sentences may yield unpredictable results
  • Requires Power Query, which may not be available in all versions of Excel
  • Performance can degrade with very large datasets (tens of thousands of rows)
  • The similarity algorithm is not fully transparent or customizable
  • Does not handle multilingual comparisons or special characters well

For situations where fuzzy grouping is not suitable, consider these alternatives:

  • Approximate lookup functions: CERCA.VERT combined with functions like SIMILE o DISTANZA.TESTO
  • Complex array formulas to identify approximate matches
  • Third-party add-ins specialized in fuzzy matching
  • External approaches to Excel: Specialized data deduplication software
  • ETL (Extract, Transform, Load) tools with fuzzy matching capabilities
  • Database solutions with fuzzy lookup capabilities
  • Programming languages such as Python or R with libraries for fuzzy matching
  • Hybrid methods: Pre-processing data to standardize common formats
    • Initial grouping based on parts of the text (e.g., first letters)
    • Combining automatic grouping with manual review

If fuzzy grouping in Power Query doesn’t meet your needs, consider if one of these alternatives might be more suitable for your specific case. In many scenarios, a combined approach using various techniques can provide the best results.

Conclusions and best practices

Fuzzy grouping in Power Query is a powerful yet often underestimated tool in Excel’s arsenal. It allows you to automate a process that would otherwise require hours of manual, error-prone work. To achieve the best results with fuzzy grouping, consider these best practices:

  1. Properly prepare your data: clean your data before grouping, removing inconsistent formatting or unnecessary special characters.
  2. Invest in your mapping table: a well-constructed mapping table can significantly improve results, especially for industry-specific terms or common abbreviations.
  3. Iterate and refine: Don’t expect perfect results on the first try. Be prepared to experiment with different similarity settings and refine the process.
  4. Verify results: Always manually check a sample of the results to ensure the grouping is logical and consistent with your expectations.
  5. Document the process: Take note of the settings used and decisions made, especially if you plan to repeat the process in the future.
  6. Consider the context: Adapt the similarity settings to the specific context of your data and the required level of precision.
  7. Keep original data: Always keep a copy of the original ungrouped data for future reference or alternative iterations.

Fuzzy grouping is particularly valuable in an era of increasing data volume and variety. Mastering this technique will allow you to transform messy and inconsistent data into structured and usable information, significantly improving the quality of your analyses and reports in Excel.

Pubblicato in

Se vuoi rimanere aggiornato su Fuzzy Grouping in Excel: How to Merge Similar but Not Identical Data iscriviti alla nostra newsletter settimanale

Be the first to comment

Leave a Reply

Your email address will not be published.


*