On Thu, 10 May 2012 13:55:40 -0700 (PDT), Sammy Danso
declaimed the following in
gmane.comp.python.general:

> Thank you all very much for your response and help. I managed to work out the problem eventually. But my code is ridiculously long compared to what you have just offered here. I think your code is elegant and should be much faster.
>
> outputfile = codecs.open(csvfile3, 'wb')
> inputfile1 = codecs.open(csvfile1, 'rb')
> inputfile2 = codecs.open(csvfile2, 'rb')
>

I'll concede that's a new one for me... I know the csv module is not
rated to work with Unicode... The help system doesn't explain what
happens when no encoding is specified for the input.

>
> dictreader2 = csv.DictReader(inputfile2)
> dictreader1 = csv.DictReader(inputfile1)

Do you really need a dictionary? {see comments below} For one thing,
dictionaries are unordered, so the output columns may not be in the
same order as the input...

> cnt = 0
> cntb = 0
> for dictline1 in dictreader1:
>     cnt += 1
>     print cnt

"cnt" doesn't need to be maintained by hand -- you could use
enumerate on the for loop to get a counter value each time
(enumerate counts from 0; enumerate(dictreader1, 1) would match your
1-based count)...

    for (cnt, dictline1) in enumerate(dictreader1):

>     mergedDict = dictline1.copy()

What purpose does this copy serve? Both dictline1 and mergedDict are
overwritten on each pass of the loop.

>     mergedDict['UniqID']= cnt
>     matchedlistA.append(mergedDict)

And here you appear to use "cnt" merely to add a "key" to the
dictionary containing the line number of the original data. Instead of
this you could just use a list of lists (no dictionary reader, just the
normal reader that returns a list of values). After all, your "cnt"
value is just the index into the list (remembering that list indexes
start at 0, not 1).

You've essentially read the entire file into memory, converting it
into a list of dictionaries, with each dictionary containing the same
set of keys, to which you've added a key whose value is the record
number.

>
> for dictline2 in dictreader2:

<snip>

Here you do the same thing with the second file. Identical code
except for the 1/A becoming 2/B... First means of shortening the code
would be to define a function to read one file and return the results.
Then call that function passing each file...

    A = readFile(csvfile1)
    B = readFile(csvfile2)

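For what it's worth, here is a rough sketch of what such a function could look like. readFile is only the name I used above and readRows is a helper I made up for illustration; neither is part of the csv module. I'm assuming the plain reader (lists, not dictionaries) and Python 3 file handling, where csv wants text mode with newline=""; on Python 2 you would open the file with "rb" instead.

```python
import csv
import io


def readRows(fileobj):
    """Return every CSV row from an open file as a list of lists."""
    return list(csv.reader(fileobj))


def readFile(filename):
    """Open filename and return its rows; the list index is the record number."""
    # Python 3 spelling; on Python 2 this would be open(filename, "rb")
    with open(filename, newline="") as fin:
        return readRows(fin)


# Demo on in-memory data so the sketch runs without real files
A = readRows(io.StringIO("a,b\n1,2\n3,4\n"))
B = readRows(io.StringIO("c\n5\n6\n"))
print(A[1], B[1])  # record 1 of each "file", no 'UniqID' key needed
```

With the rows held this way, A[n] and B[n] are already the matching records, so nothing has to be written into the data to keep track of position.
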
> for dictline1 in matchedlistA:
>     for dictline2 in matchedlistB:
>         if dictline1['UniqID'] == dictline2['UniqID']:

Big time waste... Your lists are ALREADY IN RECORD NUMBER ORDER...
Instead, for each record in the outer list, you are "reading" ALL
entries in the second list, trying to match on a value you wrote to the
records just to keep track of the order of the data.

>             entry = dictline1.copy()
>             entry.update(dictline2)
>             matchedlist.append(entry)
>

Even more -- you don't break out of the inner loop when you do find
the match. If each list contains 3 records, you end up processing 9
comparisons!

1, 1 save
1, 2 null
1, 3 null
2, 1 null
2, 2 save
2, 3 null
3, 1 null
3, 2 null
3, 3 save
For 10 records, you do 100 comparisons. AND you are creating a third
list in memory.
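
To make the arithmetic concrete, here is a small sketch that counts the work both ways. The function names and the sample data are invented for illustration, and the i == j test merely stands in for your 'UniqID' comparison; it is not your code. I've written it for Python 3.

```python
def nested_matches(listA, listB):
    """Pair records the nested-loop way, counting every comparison made."""
    comparisons = 0
    matched = []
    for i, recA in enumerate(listA):
        for j, recB in enumerate(listB):
            comparisons += 1
            if i == j:  # stand-in for the 'UniqID' equality test
                matched.append(recA + recB)
    return matched, comparisons


def positional_matches(listA, listB):
    """Pair records by position; zip stops at the end of the shorter list."""
    matched = [recA + recB for recA, recB in zip(listA, listB)]
    return matched, len(matched)


A = [["a1"], ["a2"], ["a3"]]
B = [["b1"], ["b2"], ["b3"]]
slow, slow_count = nested_matches(A, B)
fast, fast_count = positional_matches(A, B)
print(slow_count, fast_count)  # 9 comparisons versus 3 pairings
```

Both functions return the same merged records; only the amount of work differs, and the gap grows with the square of the record count.
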
Since you are matching solely by record position, AND only saving
data for records in common to both files (that is, you ignore any data
if a file is longer than the other) the entire operation condenses to:
-=-=-=-=-=-
import csv
FILEIN_1 = "file_1_name.csv"
FILEIN_2 = "file_2_name.csv"
OUTPUT = "output_file_name.csv"
fin1 = open(FILEIN_1, "rb")
fin2 = open(FILEIN_2, "rb")
outf = open(OUTPUT, "wb")
csvin1 = csv.reader(fin1)
csvin2 = csv.reader(fin2)
csvout = csv.writer(outf)
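while True:
    try:
        line1 = next(csvin1)
        line2 = next(csvin2)
    except StopIteration:
        # one of the files has run out of rows; stop at the shorter file
        break
    # list.extend() works in place and returns None, so join with + instead
    csvout.writerow(line1 + line2)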
outf.close()
fin1.close()
fin2.close()
-=-=-=-=-=-
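
The same idea can be spelled with zip, which pairs the rows and stops at the shorter input by itself. This is only a sketch under a few assumptions: merge_by_position is a name I made up, the in-memory "files" are sample data so it runs without anything on disk, and it is Python 3 (on Python 2 you would want itertools.izip so the rows aren't all read up front).

```python
import csv
import io


def merge_by_position(fin1, fin2, fout):
    """Write each row of fin1 joined with the same-numbered row of fin2."""
    writer = csv.writer(fout)
    count = 0
    for row1, row2 in zip(csv.reader(fin1), csv.reader(fin2)):
        writer.writerow(row1 + row2)
        count += 1
    return count  # number of merged rows written


# Sample data standing in for the two input files and the output file
in1 = io.StringIO("id,name\n1,ann\n2,bob\n")
in2 = io.StringIO("score\n90\n85\n70\n")  # the extra last row is ignored
out = io.StringIO()
merged = merge_by_position(in1, in2, out)
print(merged)
print(out.getvalue())
```

With real files on Python 3, each one would be opened with newline="" (inside a with block) and passed to the same function.
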
BTW: you never showed your output code <G>