Improving Kettle data extraction efficiency

Copyright notice: this is the blogger's original article and may not be reproduced without permission. https://blog.csdn.net/xpliruizhi123/article/details/54580850

       Recently I noticed that Kettle extraction has been getting slower and slower, especially for incremental INSERT/UPDATE loads (a transformation pulling a daily increment of 300,000 rows from a 4,000,000-row table took 20 hours, with a read rate of about 5 rows/s!). This showed up after I upgraded my Kettle tooling from 3.2 to 7.0 (it was slower before too, but just about acceptable; now that it has been upgraded there is no going back), and since Kettle itself keeps improving, the cause had to be found in my own setup.

     So far, I have found several reasons why Kettle data extraction can be slow:

 A: The memory given to Spoon at startup is too small. The JVM memory options are configured in the spoon.bat startup file:

       if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS="-Xms8192m" "-Xmx8192m" "-XX:MaxPermSize=4096m"

The defaults are 256M, 512M and 256M. Xms is the heap the JVM allocates initially, Xmx is the maximum heap the JVM may allocate (the heap is where Java code keeps its data and variables), so Xms must be <= Xmx; XX:MaxPermSize is the non-heap memory the JVM sets aside for itself. I changed the values to 8192M, 8192M and 4096M because I am running on a server. They cannot be raised without limit; the machine's total memory has to be considered. The usual online advice is to keep the maximum heap below roughly 3/8 to half of total memory; in short, there is a limit.

B: The key field of the extraction source table is not indexed. In my case, for example, I extract the updated rows incrementally from DB2 by date every day, but that date field is not indexed, which also makes the extraction slow.
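As a minimal sketch (the table and date column are borrowed from the example in section C below, the index name is made up, and the RUNSTATS call assumes the DB2 source mentioned above), adding the missing index could look like this:

       -- Hypothetical index on the incremental-extraction date column of the DB2 source table
       CREATE INDEX IDX_ZCSSDH053_CRDATE
           ON SAPCP1.ZCSSDH053 (REPORT_CREATE_DATE);
       -- Refresh statistics so the optimizer actually considers the new index
       RUNSTATS ON TABLE SAPCP1.ZCSSDH053 AND INDEXES ALL;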

C: The index on the source database's key field is not used by the SQL issued from Spoon. Compare the following two pairs of SQL statements (the first query of each pair feeds the date parameter of the second):

        SQL1: select TO_CHAR(TO_DATE(t_date,'YYYYMMDD')+1,'YYYYMMDD') from delta_table where t_name='~~~'

                  SELECT * FROM SAPCP1.ZCSSDH053 WHERE REPORT_CREATE_DATE >= ?

       SQL2: select t_date from delta_table where t_name='~~~'

                  SELECT * FROM SAPCP1.ZCSSDH053 WHERE REPORT_CREATE_DATE >  ?

      Both pairs perform the same extraction, but SQL1 runs about 20 times faster than SQL2 when the REPORT_CREATE_DATE field is indexed. The reason is that with > the index is not used, while with >= it is.

      So I searched online for the common situations where an index should be used but is not (a couple of them are illustrated in the SQL sketch after this list):
                            Using <> (and sometimes < or > on their own).
                            LIKE with a pattern that starts with '%'.
                            Using a non-leading column of a composite index on its own.
                            The table has not been analyzed (statistics are missing or stale).
                            A data type mismatch causing an explicit or implicit conversion, or a calculation/function applied to the indexed column.
                            Using NOT IN or NOT EXISTS.
                            The CBO estimates that a full table scan is cheaper.
                            With a B-tree index, IS NULL does not use the index while IS NOT NULL does; with a bitmap index both can; for a composite index, look at the leading column.
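To make two of these concrete, here is a hedged sketch against the table from the example above (assuming, as the queries there suggest, that REPORT_CREATE_DATE is stored as a 'YYYYMMDD' character string; the actual behaviour depends on the database and its optimizer):

       -- A function applied to the indexed column usually defeats the index
       SELECT * FROM SAPCP1.ZCSSDH053
        WHERE SUBSTR(REPORT_CREATE_DATE, 1, 6) = '201701';

       -- Rewriting the predicate against the bare column allows an index range scan
       SELECT * FROM SAPCP1.ZCSSDH053
        WHERE REPORT_CREATE_DATE >= '20170101'
          AND REPORT_CREATE_DATE <  '20170201';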

D: There are too many indexes on the destination table. The reason is obvious: every index has to be maintained again when rows are inserted, so too many indexes make the insert slower.
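One common way to act on this (a sketch only, not something the post above prescribes; the index name, column and target table are all hypothetical): if the load window allows it, drop non-essential indexes on the target table before the bulk insert and recreate them afterwards.

       -- Before the load: drop a hypothetical non-essential index on the target table
       DROP INDEX IDX_TGT_REPORT_DATE;
       -- ... the Kettle transformation inserts the data here ...
       -- After the load: recreate it
       CREATE INDEX IDX_TGT_REPORT_DATE
           ON TARGET_SCHEMA.ZCSSDH053 (REPORT_CREATE_DATE);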

E: The COMMIT during insertion happens too often; committing too frequently also hurts efficiency. Committing 300,000 rows in 300 batches is obviously not the same speed as committing them in 30 batches. As long as your memory settings can temporarily hold the rows being inserted, the commit size can be set as high as possible (Kettle caps it at 5000).

F: Insert/Update step versus Table Output. Table Output is certainly much faster than Insert/Update, but in practice I ran into a problem: if one step of the ETL errors out and the rest of the run dies, the data is either not inserted at all or only partially inserted, and the delta key field has not been updated (updating the key field is the last step of the whole process). When I fix the problem and restart from the beginning, Insert/Update has no issue, because duplicated rows are simply updated again; with Table Output, however, the rows written before the failure are already there, and if nothing is done about them the job will certainly fail. The way to solve this is to set up a rollback for failed runs in advance when designing at the job level: the rollback deletes the rows that were already inserted, and so on. Then Table Output can be used without worry.
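As a hedged sketch of what such a rollback could execute (the target schema and table name here are assumed; the delta table and date column reuse the examples above): a SQL step on the job's failure branch can delete everything at or after the delta date, which was never advanced because the key-field update comes last, so the whole transformation can simply be rerun.

       -- Roll back a partially loaded run: remove everything at or after the
       -- not-yet-advanced delta date (target table name is assumed)
       DELETE FROM TARGET_SCHEMA.ZCSSDH053
        WHERE REPORT_CREATE_DATE >= (SELECT t_date
                                       FROM delta_table
                                      WHERE t_name = '~~~');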

Finally, with the optimizations above, loading the 300,000 rows went from 20 hours down to about half an hour. These are the optimization methods I have found so far; if you have other approaches, you are welcome to share!

 
