Elasticsearch is missing some documents when creating an index from another index



I have created an index in Elasticsearch named documents_local and loaded data into it from an Oracle database via Logstash. The index contains the following values:


"FilePath" : "Path of the file",
"FileName" : "filename.pdf",
"Language" : "Language_name"



Next, I want to index the contents of those files as well, so I created a second index in Elasticsearch named document_attachment. Using the Java code below, I fetch FilePath and FileName from the documents_local index, use the file path to read the file from my local drive, and index the file contents using the ingest-attachment plugin's processor.



Please find my Java code below, where I am indexing the files.


private final static String INDEX = "documents_local"; //Documents Table with file Path - Source Index
private final static String ATTACHMENT = "document_attachment"; // Documents with Attachment... -- Destination Index
private final static String TYPE = "doc";


public static void main(String[] args) throws IOException {


RestHighLevelClient restHighLevelClient = null;
Document doc=new Document();

try {
restHighLevelClient = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http"),
new HttpHost("localhost", 9201, "http")));
} catch (Exception e) {
System.out.println(e.getMessage());
}


//Fetching Id, FilePath & FileName from Document Index.
SearchRequest searchRequest = new SearchRequest(INDEX);
searchRequest.types(TYPE);
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
QueryBuilder qb = QueryBuilders.matchAllQuery();
searchSourceBuilder.query(qb);
searchSourceBuilder.size(3000);
searchRequest.source(searchSourceBuilder);
SearchResponse searchResponse = null;
try {
searchResponse = restHighLevelClient.search(searchRequest);
} catch (IOException e) {
System.out.println(e.getLocalizedMessage()); // print the error instead of silently discarding it
}

SearchHit[] searchHits = searchResponse.getHits().getHits();
long totalHits=searchResponse.getHits().totalHits;

int line=1;
String docpath = null;

Map<String, Object> jsonMap ;
for (SearchHit hit : searchHits) {

String encodedfile = null;
File file=null;

Map<String, Object> sourceAsMap = hit.getSourceAsMap();
doc.setId((int) sourceAsMap.get("id"));
doc.setLanguage(sourceAsMap.get("language"));
doc.setFilename(sourceAsMap.get("filename").toString());
doc.setPath(sourceAsMap.get("path").toString());

String filepath=doc.getPath().concat(doc.getFilename());

System.out.println("Line Number--> "+line+++"ID---> "+doc.getId()+"File Path --->"+filepath);

file = new File(filepath);
if(file.exists() && !file.isDirectory()) {
try {
FileInputStream fileInputStreamReader = new FileInputStream(file);
byte[] bytes = new byte[(int) file.length()];
fileInputStreamReader.read(bytes);
encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}

jsonMap = new HashMap<>();
jsonMap.put("id", doc.getId());
jsonMap.put("language", doc.getLanguage());
jsonMap.put("filename", doc.getFilename());
jsonMap.put("path", doc.getPath());
jsonMap.put("fileContent", encodedfile);

String id=Long.toString(doc.getId());

IndexRequest request = new IndexRequest(ATTACHMENT, "doc", id )
.source(jsonMap)
.setPipeline(ATTACHMENT);


try {
IndexResponse response = restHighLevelClient.index(request);
} catch(ElasticsearchException e) {
if (e.status() == RestStatus.CONFLICT) {
}
e.printStackTrace();
}

}

System.out.println("Indexing done...");
}
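The file-reading step above relies on a single `FileInputStream.read(bytes)` call, which is not guaranteed to fill the buffer, and the stream is never closed. Below is a minimal sketch of the same step using `Files.readAllBytes`. The class and method names (`FileEncoder`, `encodeFile`) are hypothetical and not part of the original code; returning `null` for a missing file mirrors what the original loop does.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;

public class FileEncoder {

    // Hypothetical helper: returns the Base64 text of the file,
    // or null when the path is missing or is not a regular file.
    static String encodeFile(Path path) throws IOException {
        if (!Files.isRegularFile(path)) {
            return null;
        }
        // readAllBytes reads the whole file and closes it afterwards.
        byte[] bytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sample", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(encodeFile(tmp));                            // aGVsbG8=
        System.out.println(encodeFile(Paths.get("no-such-file.pdf")));  // null
        Files.delete(tmp);
    }
}
```

In the loop above, `encodedfile = encodeFile(Paths.get(filepath));` could stand in for the `FileInputStream` block, assuming the rest of the code stays as it is.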



Please find my mapping details for the ingest attachment plugin below (I created the mapping first and then ran this Java code).



PUT _ingest/pipeline/document_attachment
{
"description" : "Extract attachment information",
"processors" : [
{
"attachment" : {
"field" : "fileContent"
}
}
]
}



But during this process, I am missing some of the documents.



My source index is documents_local, which has 2910 documents.
I am fetching all the 118 documents, attaching each PDF (after converting it to Base64), and writing the result to another index, document_attachment.




But the document_attachment index has only 118 documents; it should have 2910. Some of the documents are missing. Indexing also takes a very long time.




I am not sure how the documents go missing in the second index (document_attachment). Is there any other way to improve this process?




Can we use threads here? In the future I will have to index more than 100k PDFs in the same way.
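One way the threading idea could look, as a rough sketch only: a fixed-size thread pool Base64-encodes the files in parallel, and the results are collected in order. The names `ParallelEncoder` and `encodeAll` and the pool size are assumptions made for illustration. The indexing call itself is left out, so this shows only the file-reading half of the work.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelEncoder {

    // Hypothetical helper: encodes every file on a pool of worker threads
    // and returns path -> Base64 text in the order the paths were given.
    static Map<Path, String> encodeAll(List<Path> paths, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Path p : paths) {
                Callable<String> task =
                        () -> Base64.getEncoder().encodeToString(Files.readAllBytes(p));
                futures.add(pool.submit(task));
            }
            Map<Path, String> result = new LinkedHashMap<>();
            for (int i = 0; i < paths.size(); i++) {
                // get() blocks until the task finishes and rethrows its failure, if any.
                result.put(paths.get(i), futures.get(i).get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path a = Files.createTempFile("a", ".txt");
        Path b = Files.createTempFile("b", ".txt");
        Files.write(a, "hello".getBytes(StandardCharsets.UTF_8));
        Files.write(b, "world".getBytes(StandardCharsets.UTF_8));
        encodeAll(Arrays.asList(a, b), 4).forEach((p, s) -> System.out.println(p + " -> " + s));
        Files.delete(a);
        Files.delete(b);
    }
}
```

Holding every encoded file in memory at once would not scale to 100k PDFs, so a real version would probably process the paths in batches and index each batch before reading the next.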





What is the Elasticsearch version?
– Andre Piantino
2 days ago





@AndrePiantino I am using Elasticsearch version 6.2.3.
– Karthikeyan
2 days ago





@Val - I have attached my ingest attachment plugin mapping details. I created the mapping for my document_attachment index first, and then I ran this Java code.
– Karthikeyan
2 days ago







And you don't see any stack traces when running your code? Any errors in the ES logs?
– Val
2 days ago






Some of the files are not available in the directory, so I wrote the code so that if a file is not available at the path, fileContent is ignored (kept as null) and the rest of the content is indexed. That was my plan and my approach, but it is not working that way.
– Karthikeyan
2 days ago
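The intent described in this comment (index the metadata even when the file is missing) could be sketched as follows, with the field left out of the source map entirely rather than set to null. `SourceBuilder` and `buildSource` are hypothetical names; the field names are the ones used in the question's code.

```java
import java.util.HashMap;
import java.util.Map;

public class SourceBuilder {

    // Hypothetical helper: builds the document source and adds
    // "fileContent" only when the file was actually read and encoded.
    static Map<String, Object> buildSource(int id, String language, String filename,
                                           String path, String encodedFile) {
        Map<String, Object> jsonMap = new HashMap<>();
        jsonMap.put("id", id);
        jsonMap.put("language", language);
        jsonMap.put("filename", filename);
        jsonMap.put("path", path);
        if (encodedFile != null) {
            jsonMap.put("fileContent", encodedFile);
        }
        return jsonMap;
    }

    public static void main(String[] args) {
        // File found: the map carries the encoded content.
        System.out.println(buildSource(1, "en", "a.pdf", "/docs/", "aGVsbG8=").containsKey("fileContent"));
        // File missing: the key is absent instead of null.
        System.out.println(buildSource(2, "en", "b.pdf", "/docs/", null).containsKey("fileContent"));
    }
}
```

Whether the document_attachment pipeline accepts a document that has no fileContent field depends on how the attachment processor is configured, so that part needs to be verified against the cluster. Sending such documents without the pipeline would be another option.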


