Remove duplicate entries from a large CSV file in C# .NET

I created a solution that reads a large CSV file, currently 20-30 MB in size. I tried to remove duplicate rows based on certain column values that the user selects at runtime, using the usual technique for finding duplicate rows, but it is so slow that the program seems not to work at all.

What other method can be used to remove duplicate entries from a CSV file?

Here's the code; I'm definitely doing something wrong:

DataTable dtCSV = ReadCsv(file, columns);
// columns is a List<string> of column names
DataTable dt = RemoveDuplicateRecords(dtCSV, columns);

private DataTable RemoveDuplicateRecords(DataTable dtCSV, List<string> columns)
{
    DataView dv = dtCSV.DefaultView;
    string RowFilter = string.Empty;

    if (dt == null)
        dt = dv.ToTable().Clone();

    foreach (DataRow row in dtCSV.Rows)
    {
        try
        {
            RowFilter = string.Empty;

            // Build a filter such as [A]='x' and [B]='y' for the current row
            foreach (string column in columns)
            {
                string col = column;
                RowFilter += "[" + col + "]" + "='" + row[col].ToString().Replace("'", "''") + "' and ";
            }
            // Drop the trailing "and "
            RowFilter = RowFilter.Substring(0, RowFilter.Length - 4);
            dv.RowFilter = RowFilter;
            DataRow dr = dt.NewRow();
            bool result = RowExists(dt, RowFilter);
            if (!result)
            {
                dr.ItemArray = dv.ToTable().Rows[0].ItemArray;
                dt.Rows.Add(dr);
            }
        }
        catch (Exception ex)
        {
        }
    }
    return dt;
}
5 answers

One way to do this is to go through the table, building a HashSet<string> that contains the combined values of the columns you are interested in. If you try to add a string that is already in the set, then you have a duplicate row. Something like:

HashSet<string> ScannedRecords = new HashSet<string>();

// DataRowCollection is not generic, so the row type must be stated explicitly
foreach (DataRow row in dtCSV.Rows)
{
    // Build a string that contains the combined column values
    StringBuilder sb = new StringBuilder();
    foreach (string col in columns)
    {
        sb.AppendFormat("[{0}={1}]", col, row[col].ToString());
    }

    // Try to add the string to the HashSet.
    // If Add returns false, then there is a prior record with the same values
    if (!ScannedRecords.Add(sb.ToString()))
    {
        // This record is a duplicate.
    }
}

It should be very fast.
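
To drop the duplicates rather than just detect them, the same idea can copy each first occurrence into a new table. The following is a minimal sketch under stated assumptions, not the asker's or the answerer's actual code: the `Deduplicator` class, the `RemoveDuplicates` method name and the `'\u001F'` key separator are all hypothetical names and choices made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Text;

static class Deduplicator
{
    // Hypothetical helper: returns a copy of 'source' that keeps only the first
    // row for each distinct combination of values in 'columns'.
    public static DataTable RemoveDuplicates(DataTable source, List<string> columns)
    {
        DataTable result = source.Clone();   // same schema, no rows
        var seen = new HashSet<string>();
        var key = new StringBuilder();

        foreach (DataRow row in source.Rows)
        {
            key.Clear();
            foreach (string col in columns)
            {
                // The unit separator (assumed not to occur in the data) keeps
                // ("ab", "c") from colliding with ("a", "bc").
                key.Append(row[col]).Append('\u001F');
            }

            // Add returns true only the first time a key is seen.
            if (seen.Add(key.ToString()))
                result.ImportRow(row);
        }
        return result;
    }
}
```

Each row costs one hash lookup, so the whole pass is roughly linear in the number of rows, whereas applying a DataView filter once per row rescans the table every time.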



Have you tried LINQ?


With LINQ to Objects, you could use the Distinct method, which returns only the unique items.

For example:

(from row in inputCSV.rows
 select row).Distinct()

See also this article on using LINQ with log files: http://www.developerfusion.com/article/84468/linq-to-log-files/


If you are working with a DataTable and its DataRows, you can pass a custom equality comparer to Distinct:

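class DataRowEqualityComparer : IEqualityComparer<DataRow>
{
    public bool Equals(DataRow x, DataRow y)
    {
        // Cell-by-cell comparison of the two rows
        return x.ItemArray.SequenceEqual(y.ItemArray);
    }

    public int GetHashCode(DataRow obj)
    {
        // Combine the cell hashes so that equal rows get equal hash codes
        int hash = 17;
        foreach (object cell in obj.ItemArray)
            hash = unchecked(hash * 31 + (cell == null ? 0 : cell.GetHashCode()));
        return hash;
    }
}

// ...

// DataRowCollection is not generic, so cast it before using LINQ (needs System.Linq)
var comparer = new DataRowEqualityComparer();
var filteredRows = dtCSV.Rows.Cast<DataRow>().Distinct(comparer);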
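
Since the question only treats rows as duplicates when they match on the user-selected columns, a comparer could be restricted to those columns. The following is a rough sketch of that variation; `ColumnSubsetComparer` is a hypothetical name and is not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Hypothetical comparer: two rows are equal when they match on 'columns' only.
class ColumnSubsetComparer : IEqualityComparer<DataRow>
{
    private readonly List<string> columns;

    public ColumnSubsetComparer(List<string> columns)
    {
        this.columns = columns;
    }

    public bool Equals(DataRow x, DataRow y)
    {
        // Compare only the selected cells; other columns are ignored.
        return columns.All(c => object.Equals(x[c], y[c]));
    }

    public int GetHashCode(DataRow row)
    {
        // Hash the same cells that Equals compares, so equal rows hash equally.
        int hash = 17;
        foreach (string c in columns)
            hash = unchecked(hash * 31 + row[c].GetHashCode());
        return hash;
    }
}
```

It would be used as `dtCSV.Rows.Cast<DataRow>().Distinct(new ColumnSubsetComparer(columns))`, which yields the first row seen for each combination of the selected column values.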