Excel and COM using Python
Excel is such a ubiquitous tool these days that it is hard to imagine how people would do various tasks without it. Unfortunately, doing things by hand can be error prone and slow compared to doing things programatically. As such, the subject of interacting with Excel from various programming languages comes up often on stack overflow as well as the associated site code review. Rightly so people take to automating spreadsheets to speed things up and to make them less error prone. The questions are varied but often the questions are related to slow performance. I was on Code Review and looking at a few questions when I noticed that there were two in a row that were basically the same and related to slow Interop. I wrote an answer to one but thought it would be a good a time as any to document the issue a little better and share a few tips that I have used in the past to speed things up substantially.
If you have written applications that interact with Excel using COM or just written VBA macros inside the provided Visual Basic editor you may have noticed that it can be slow due to overhead involved. The basics of avoiding this slowness are very similar weither you use internal VBA macros or code that uses the COM interface. Over the years I have used both although the VBA side much more often than COM. I have used several methods when writing simple VBA macros that have turned spreadsheet macros from horribly slow to sometimes very fast. A few of these things can be applied to COM access. So before we get to COM Interop, some general tips about dealing with Excel programatically are in order. For this we will use VBA inside of Excel to demonstrate a couple issues.
Simple Excel Overhead
There is an overhead to doing changes to a value on a sheet.
- Formulas tend to recalculate when things change.
- In order to show changes to the user, the screen refreshes.
To explore this, lets look at a simple example. Suppose you have a sheet in which you have values in column A and column B. You want to add them up and put the result in column C. Granted this is a very simple example that you could argue someone should not use code for but it is an easy way to demonstrate the concept. So.. lets consider a sheet with values in A, B, C and a formula in D. In our example there are about 30,000 rows..
A | B | C | D |
---|---|---|---|
1 | 1 | 0 | =C1/B1 |
The following code can be used to calculate a value for C
Sub SimpleRanges() Dim ColA As Range Dim ColB As Range Dim ColC As Range Dim Counter As Long Dim start As Long start = Timer For Counter = 2 To ActiveSheet.UsedRange.Rows.Count Set ColA = ActiveSheet.Cells(Counter, 1) Set ColB = ActiveSheet.Cells(Counter, 2) Set ColC = ActiveSheet.Cells(Counter, 3) ColC = ColA + ColB ColA = ColC - ColB ColB = ColA Next MsgBox ("Runtime : " + Str(Timer - start)) End Sub
If your results are anything like mine when running code like this then you will get a result that gives a messagebox and says it takes around 70 seconds.
Avoiding Simple overhead
We can do some simple things to avoid some of the Excel overhead. The simple ways to avoid the first two items above is to turn the updating off near the beginning of your code. This can be done in the example above by adding two simple lines
Application.ScreenUpdating = False Application.Calculation = xlManual
This will disable both of those in this particular workbook. You should remember to turn them back on after you are done doing your calculations
Application.ScreenUpdating = True Application.Calculation = xlAutomatic
The code now looks like this
Sub SimpleRanges2() Dim ColA As Range Dim ColB As Range Dim ColC As Range Dim Counter As Long Dim start As Long start = Timer Application.ScreenUpdating = False Application.Calculation = xlManual For Counter = 2 To ActiveSheet.UsedRange.Rows.Count Set ColA = ActiveSheet.Cells(Counter, 1) Set ColB = ActiveSheet.Cells(Counter, 2) Set ColC = ActiveSheet.Cells(Counter, 3) ColC = ColA + ColB ColA = ColC - ColB ColB = ColA Next Application.ScreenUpdating = True Application.Calculation = xlAutomatic MsgBox ("Runtime : " + Str(Timer - start)) End Sub
Code with these changes is significantly faster. If we run it we would find that in our example it will run in about 3.5 seconds. So just by adding a couple of lines we have code now that is about 20 times faster. This is a pretty serious performance change from such a simple addition to the code that it is a very worthwhile thing to do.
Faster Still
To make this really fast though, we need to limit the amount of time spent working our way through the Excel objects. Right now, we are still accessing each cell on each iteration. In order to address it we will need to minimize the interaction between the code we are running and Excel. To do this we are going to alter the code to only get the values and put them back when we are done.
Sub SimpleArray() Dim ColA As Variant Dim ColB As Variant Dim ColC As Variant Dim Counter As Long Dim EndNum As Long Dim start As Long start = Timer Application.ScreenUpdating = False Application.Calculation = xlManual EndNum = ActiveSheet.UsedRange.Rows.Count ColA = ActiveSheet.Range(Cells(1, 1), Cells(EndNum, 1)).Value ColB = ActiveSheet.Range(Cells(1, 2), Cells(EndNum, 2)).Value ColC = ActiveSheet.Range(Cells(1, 3), Cells(EndNum, 3)).Value For Counter = 1 To EndNum ColC(Counter, 1) = ColA(Counter, 1) + ColB(Counter, 1) ColA(Counter, 1) = ColC(Counter, 1) - ColB(Counter, 1) ColB(Counter, 1) = ColA(Counter, 1) Next ActiveSheet.Range(Cells(1, 1), Cells(EndNum, 1)) = ColA ActiveSheet.Range(Cells(1, 2), Cells(EndNum, 2)) = ColB ActiveSheet.Range(Cells(1, 3), Cells(EndNum, 3)) = ColC Application.ScreenUpdating = True Application.Calculation = xlAutomatic MsgBox ("Runtime : " + Str(Timer - start)) End Sub
In our example the messagebox gives odd numbers sometimes. It runs in far less than a second. It runs so fast that our timing method sometimes does not record it properly.
So, this is all great and fine but what does this have to do with accessing Excel from COM?
Adding COM Overhead
The overhead associated with COM can be much higher than the method we were using. In our example we noted that the screen updating, calculation, and working with each cell individually adds a tremendous amount of time. Taking that knowledge and applying it when we access Excel will make a huge difference in the time required to run our code.
The examples that follow are done in Python 3 using the pywin32 module. They are all running using Python 3.4. We are going to avoid the pitfalls we discovered when using VBA and limit the amount of time we spend going back and forth between the COM/Excel layer. Our simple code might look like this
import time import win32com.client as win32 def simple_ranges(): """ Open Excel using COM """ starttime = time.time() xl = win32.gencache.EnsureDispatch('Excel.Application') wb = xl.Workbooks.Open('c:\programs\python\Excel_COM.xlsm') sh = wb.ActiveSheet end_num = sh.UsedRange.Rows.Count for i in range(1, end_num + 1): sh.Cells(i, 3).Value = sh.Cells(i, 1).Value + sh.Cells(i, 2).Value sh.Cells(i, 1).Value = sh.Cells(i, 3).Value - sh.Cells(i, 2).Value sh.Cells(i, 2).Value = sh.Cells(i, 1).Value wb.Save() xl.Quit() print("Runtime " + "{:.2f}".format(round(time.time() - starttime, 1)))
It takes a very long time... about 450-500 seconds as it goes back and forth through the COM and Excel layers.
We could apply our findings about screen updating and calculation but the screen is not visible by default. If it does anything at all when coming from COM I wouldn't know since I have not done it enough to tell. So lets skip right to limiting the back and forth through layers and just throw the others in for good measure.
It gives us a final piece of code that looks like this
import time import win32com.client as win32 def simple_array(): """ Open Excel using COM """ starttime = time.time() xlCalculationManual = -4135 xlCalculationAutomatic = -4105 xl = win32.gencache.EnsureDispatch('Excel.Application') wb = xl.Workbooks.Open('c:\programs\python\Excel_COM.xlsm') sh = wb.ActiveSheet xl.Calculation = xlCalculationManual xl.ScreenUpdating = False end_num = sh.UsedRange.Rows.Count col_a = sh.Range(xl.Cells(1, 1), xl.Cells(end_num, 1)).Value col_b = sh.Range(xl.Cells(1, 2), xl.Cells(end_num, 2)).Value col_c = sh.Range(xl.Cells(1, 3), xl.Cells(end_num, 3)).Value cells_a = [0] * end_num cells_b = [0] * end_num cells_c = [0] * end_num for i in range(end_num - 1): cells_c[i] = col_a[i][0] + col_b[i][0] cells_a[i] = cells_c[i] - col_b[i][0] cells_b[i] = cells_a[i] sh.Range(xl.Cells(1, 1), xl.Cells(end_num, 1)).Value = cells_a sh.Range(xl.Cells(1, 2), xl.Cells(end_num, 2)).Value = cells_b sh.Range(xl.Cells(1, 3), xl.Cells(end_num, 3)).Value = cells_c xl.ScreenUpdating = True xl.Calculation = xlCalculationAutomatic wb.Save() xl.Quit() print("Runtime " + "{:.2f}".format(round(time.time() - starttime, 1)))
This runs in about 8 seconds.
There are improvements that could be made to the code, I am certainly not saying that this exact code is the best way to go about interacting with Excel however I am showing that it is much faster than interacting with each cell as we do calculations or set values into specific ranges and cells. Changes that could be further made have been left out since this ties as closely as I could think between the different code pieces in the two languages. The main point I was hoping to show is that doing things in batches when transfering over the COM interface in order to reduce the interaction can make a huge difference and I think in that regard the code has served it's purpose.